An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Carnaz, Gonçalo; Antunes, Mário; Nogueira, Vitor Beires

2021-06-26, Scientific article

datacite.subject.fos	Social Sciences::Economics and Business
datacite.subject.fos	Natural Sciences::Computer and Information Sciences
datacite.subject.sdg	08:Decent Work and Economic Growth
datacite.subject.sdg	09:Industry, Innovation and Infrastructure
datacite.subject.sdg	10:Reduced Inequalities
dc.contributor.author	Carnaz, Gonçalo
dc.contributor.author	Antunes, Mário
dc.contributor.author	Nogueira, Vitor Beires
dc.date.accessioned	2026-05-12T13:46:09Z
dc.date.available	2026-05-12T13:46:09Z
dc.date.issued	2021-06-26
dc.description.abstract	Criminal investigations collect and analyze the facts related to a crime, from which investigators can derive evidence to be used in court. Criminal investigation is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other investigative methods and techniques. These techniques produce both digital and paper documents that must be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents supports the criminal investigation because it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, because no annotated corpus exists for that purpose. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. The corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in documents. Example tasks are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.	eng
dc.description.sponsorship	Funding: The APC was financed by the Polytechnic of Leiria. Acknowledgments: The authors acknowledge the facilities provided by the Polytechnic of Leiria and the University of Évora in support of this research.
dc.identifier.citation	Carnaz, G.; Antunes, M.; Nogueira, V. An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing. Data 2021, 6, 71. https://doi.org/10.3390/data6070071.
dc.identifier.doi	10.3390/data6070071
dc.identifier.eissn	2306-5729
dc.identifier.uri	http://hdl.handle.net/10400.8/16273
dc.language.iso	eng
dc.peerreviewed	yes
dc.publisher	MDPI
dc.relation.hasversion	https://www.mdpi.com/2306-5729/6/7/71
dc.relation.ispartof	Data
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	crime-related documents
dc.subject	cybersecurity
dc.subject	criminal investigation
dc.subject	Portuguese language corpus
dc.subject	natural language processing
dc.subject	5W1H
dc.title	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing	eng
dc.type	journal article
dcterms.references	https://github.com/goncalofcarnaz/Annotated-Corpus-of-Criminal-Related-Portuguese-Documents
dspace.entity.type	Publication
oaire.citation.endPage	11
oaire.citation.issue	7
oaire.citation.startPage	1
oaire.citation.title	Data
oaire.citation.volume	6
oaire.version	http://purl.org/coar/version/c_970fb48d4fbd8a85
person.familyName	Antunes
person.givenName	Mário
person.identifier	R-000-NX4
person.identifier.ciencia-id	AF10-7EDD-5153
person.identifier.orcid	0000-0003-3448-6726
person.identifier.scopus-author-id	25930820200
relation.isAuthorOfPublication	e3e87fb0-d1d6-44c3-985d-920a5560f8c1
relation.isAuthorOfPublication.latestForDiscovery	e3e87fb0-d1d6-44c3-985d-920a5560f8c1
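
The abstract above reports a mean precision of 0.808, a mean recall of 0.722, and a mean F1-score of 0.733 for named-entity classification. The following is a minimal sketch of how such means could be computed when scores are averaged per entity type (macro-averaging). The averaging scheme is an assumption, since this record does not state it, and the entity types and counts in `example` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-entity-type counts (not taken from the paper)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision(c: Counts) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def macro_scores(per_type: dict[str, Counts]) -> tuple[float, float, float]:
    """Average precision, recall, and F1 over entity types (macro-averaging)."""
    if not per_type:
        raise ValueError("at least one entity type is required")
    ps = [precision(c) for c in per_type.values()]
    rs = [recall(c) for c in per_type.values()]
    fs = [f1(p, r) for p, r in zip(ps, rs)]
    n = len(per_type)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n


# Hypothetical counts for two illustrative entity types.
example = {
    "PERSON": Counts(tp=90, fp=10, fn=10),
    "LOCATION": Counts(tp=40, fp=10, fn=40),
}
macro_p, macro_r, macro_f = macro_scores(example)

# The mean of per-type F1 scores generally differs from the F1 of the means.
harmonic_of_means = f1(macro_p, macro_r)

# Harmonic mean of the precision and recall reported in the abstract.
harmonic_of_reported = f1(0.808, 0.722)
```

Under this reading, the harmonic mean of the reported precision and recall is about 0.763, which differs from the reported F1-score of 0.733. That gap is what one would expect if the F1-score was averaged over entity types rather than derived from the mean precision and recall, although the record itself does not confirm this.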

Files

Main

Name:: An annotated corpus of crime-related portuguese documents for nlp and machine learning processing.pdf
Size:: 349.16 KB
Format:: Adobe Portable Document Format

License

Name:: license.txt
Size:: 1.32 KB
Format:: Item-specific license agreed upon to submission
Description:

Collections

CIIC - Articles in Peer-Reviewed Journals
ESTG - Articles in International Journals