Publication

A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing

datacite.subject.fos: Social Sciences::Economics and Business
datacite.subject.fos: Natural Sciences::Computer and Information Sciences
datacite.subject.sdg: 08:Decent Work and Economic Growth
datacite.subject.sdg: 09:Industry, Innovation and Infrastructure
datacite.subject.sdg: 10:Reduced Inequalities
dc.contributor.author: Ferreira, Sara
dc.contributor.author: Antunes, Mário
dc.contributor.author: Correia, Manuel E.
dc.date.accessioned: 2026-02-18T12:51:55Z
dc.date.available: 2026-02-18T12:51:55Z
dc.date.issued: 2021-08-05
dc.description.abstract (eng): Deepfake and manipulated digital photos and videos are increasingly used in a wide range of cybercrimes. Ransomware, the dissemination of fake news, and digital kidnapping-related crimes are the most recurrent, and in these tampered multimedia content has been the primary vehicle of dissemination. Digital forensic analysis tools are widely used in criminal investigations to automate the identification of digital evidence in seized electronic equipment. The number of files to be processed and the complexity of the crimes under analysis have highlighted the need for efficient digital forensics techniques grounded on state-of-the-art technologies. Machine Learning (ML) researchers have been challenged to apply techniques and methods that improve the automatic detection of manipulated multimedia content. However, such methods have not yet been widely incorporated into digital forensic tools, mostly due to the lack of realistic and well-structured datasets of photos and videos. The diversity and richness of the datasets are crucial to benchmark the ML models and to evaluate their suitability for real-world digital forensics applications. An example is the development of third-party modules for the widely used Autopsy digital forensic application. This paper presents a dataset obtained by extracting a set of simple features from genuine and manipulated photos and videos, which are part of existing state-of-the-art datasets. The resulting dataset is balanced, and each entry comprises a label and a vector of numeric values corresponding to the features extracted through a Discrete Fourier Transform (DFT). The dataset is available in a GitHub repository, and the total numbers of photos and video frames are 40,588 and 12,400, respectively.
The dataset was validated and benchmarked with deep learning Convolutional Neural Networks (CNN) and Support Vector Machines (SVM); however, many other existing methods can be applied. Overall, the results show a better F1-score for CNN than for SVM, both for photo and video processing. CNN achieved an F1-score of 0.9968 and 0.8415 for photos and videos, respectively. For SVM, the results obtained with 5-fold cross-validation are 0.9953 and 0.7955 for photos and videos, respectively. A set of methods written in Python is available to researchers, namely to preprocess and extract the features from the original photo and video files and to build the training and testing sets. Additional methods are also available to convert the original PKL files into CSV and TXT, which gives ML researchers more flexibility to use the dataset with existing ML frameworks and tools.
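The PKL-to-CSV conversion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name pkl_to_csv, the file names, and above all the assumed pickle layout (a list of (label, feature_vector) pairs) are hypothetical, chosen only to match the abstract's description of each entry as a label plus a vector of numeric values. The actual files in the GitHub repository may be organized differently.

```python
import csv
import pickle


def pkl_to_csv(pkl_path, csv_path):
    """Write one CSV row per dataset entry: the label first, then the features.

    Assumption (not confirmed by the source): the pickle file holds a list
    of (label, feature_vector) pairs.
    """
    with open(pkl_path, "rb") as fh:
        entries = pickle.load(fh)

    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        for label, features in entries:
            # Label in the first column, one column per feature value.
            writer.writerow([label, *features])

    return len(entries)


# Hypothetical usage; the file names are placeholders:
# n_rows = pkl_to_csv("photos.pkl", "photos.csv")
```

Because unpickling can execute arbitrary code, a conversion like this should only be run on PKL files from a trusted source.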
dc.description.sponsorship: The authors acknowledge the facilities provided by INESC TEC, Faculty of Sciences, and University of Porto, for the support to this research. This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020.
dc.identifier.citation: Ferreira, S.; Antunes, M.; Correia, M.E. A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing. Data 2021, 6, 87. https://doi.org/10.3390/data6080087.
dc.identifier.doi: 10.3390/data6080087
dc.identifier.eissn: 2306-5729
dc.identifier.uri: http://hdl.handle.net/10400.8/15666
dc.language.iso: eng
dc.peerreviewed: yes
dc.publisher: MDPI
dc.relation: INESC TEC- Institute for Systems and Computer Engineering, Technology and Science
dc.relation.hasversion: https://www.mdpi.com/2306-5729/6/8/87
dc.relation.ispartof: Data
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: digital forensics
dc.subject: machine learning
dc.subject: photos and videos manipulation
dc.subject: Discrete Fourier Transform
dc.subject: tampered multimedia
dc.subject: deepfake
dc.title (eng): A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing
dc.type: journal article
dspace.entity.type: Publication
oaire.awardTitle: INESC TEC- Institute for Systems and Computer Engineering, Technology and Science
oaire.awardURI: info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F50014%2F2020/PT
oaire.citation.endPage: 15
oaire.citation.issue: 8
oaire.citation.startPage: 1
oaire.citation.title: Data
oaire.citation.volume: 6
oaire.fundingStream: 6817 - DCRRNI ID
oaire.version: http://purl.org/coar/version/c_970fb48d4fbd8a85
person.familyName: Antunes
person.givenName: Mário
person.identifier: R-000-NX4
person.identifier.ciencia-id: AF10-7EDD-5153
person.identifier.orcid: 0000-0003-3448-6726
person.identifier.scopus-author-id: 25930820200
project.funder.identifier: http://doi.org/10.13039/501100001871
project.funder.name: Fundação para a Ciência e a Tecnologia
relation.isAuthorOfPublication: e3e87fb0-d1d6-44c3-985d-920a5560f8c1
relation.isAuthorOfPublication.latestForDiscovery: e3e87fb0-d1d6-44c3-985d-920a5560f8c1
relation.isProjectOfPublication: 42e9aded-d47f-4d1e-a69d-f5566c0b595d
relation.isProjectOfPublication.latestForDiscovery: 42e9aded-d47f-4d1e-a69d-f5566c0b595d

Files

Main
Name: A dataset of photos and videos for digital forensics analysis using machine learning processing.pdf
Size: 1.38 MB
Format: Adobe Portable Document Format
License
Name: license.txt
Size: 1.32 KB
Format: Item-specific license agreed upon to submission