| Name | Description | Size | Format |
|---|---|---|---|
| | | 6.54 MB | Adobe PDF |
Abstract(s)
This project focused on anomaly detection through the application of Benford's law, exploring its ability to identify statistical deviations efficiently and accurately. The adopted approach was based on this law, widely recognized for its usefulness in fraud detection, especially in financial data, by analyzing the distribution of first digits.
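For reference, Benford's law predicts that the first significant digit d occurs with probability log10(1 + 1/d). The short sketch below is illustrative only (it is not taken from the dissertation) and simply tabulates these probabilities:

```python
# Minimal sketch: the first-digit probabilities predicted by Benford's law,
# P(d) = log10(1 + 1/d) for d = 1..9.
import math

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
for d, p in benford.items():
    print(f"digit {d}: {p:.3f}")  # e.g. digit 1: 0.301 ... digit 9: 0.046
```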
The scarcity of quality public data made it difficult to evaluate statistical models rigorously. To overcome this limitation, a parametrizable synthetic data generator was developed, capable of reproducing patterns corresponding both to normal events and to realistic manipulations. The resulting application made it possible to simulate diverse conditions and bring the tests closer to real-world situations, facilitating the analysis of the performance and behavior of the statistical methods.
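The dissertation's generator is not reproduced here; the following is a minimal, hypothetical sketch of the idea, assuming log-uniform samples as the Benford-conformant case and a narrowed mantissa range as one possible manipulation:

```python
# Illustrative toy generator, not the dissertation's implementation.
import numpy as np

rng = np.random.default_rng(0)

def benford_like(n, decades=6):
    # Values spread uniformly in log10-space across several orders of
    # magnitude are well known to follow Benford's first-digit law closely.
    return 10 ** rng.uniform(0, decades, size=n)

def manipulated(n, decades=6):
    # A crude, hypothetical manipulation: mantissas squeezed into [4, 10),
    # which over-represents the high first digits (4-9).
    return rng.uniform(4.0, 10.0, size=n) * 10 ** rng.integers(0, decades, size=n)

def first_digit(value):
    # First significant (non-zero) digit of a positive number.
    for ch in repr(float(abs(value))):
        if ch.isdigit() and ch != "0":
            return int(ch)

digits = [first_digit(v) for v in benford_like(10_000)]
```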
With the simulated data obtained, it became possible to evaluate the effectiveness of different statistical methods under conditions closer to reality. In this context, Benford's law played a central role, standing out for its usefulness in detecting anomalies across multiple scenarios. To explore this capability systematically, a statistical model was developed as an alternative to traditional machine learning models, which exhibit high false-positive rates and heavy computational requirements. The proposal rested on applying Benford's law combined with dissimilarity measures, making it possible to quantify the deviation between the observed distributions and the distribution expected under this law.
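As an illustration of how such deviations can be quantified, the sketch below implements standard textbook forms of several of the dissimilarity measures named in this work, comparing an observed first-digit distribution with the Benford expectation; the exact variants, normalizations and thresholds used in the dissertation may differ:

```python
# Standard-form dissimilarity measures between an observed first-digit
# distribution `obs` and the Benford expectation `exp` (length-9 vectors).
import numpy as np

def benford_expected():
    d = np.arange(1, 10)
    return np.log10(1 + 1 / d)

def chi_square(obs, exp, n):
    # Pearson chi-square statistic on counts (n = sample size), 8 degrees of freedom.
    return n * np.sum((obs - exp) ** 2 / exp)

def mad(obs, exp):
    # Mean absolute deviation between the two digit distributions.
    return np.mean(np.abs(obs - exp))

def euclidean(obs, exp):
    return np.sqrt(np.sum((obs - exp) ** 2))

def hellinger(obs, exp):
    return np.sqrt(0.5 * np.sum((np.sqrt(obs) - np.sqrt(exp)) ** 2))

def kl_divergence(obs, exp, eps=1e-12):
    # KL(obs || exp); eps avoids log(0) when a digit never occurs.
    obs = np.clip(obs, eps, None)
    return np.sum(obs * np.log(obs / exp))
```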
Simulations were carried out with the developed generator to create datasets that conform and do not conform to Benford's law, thereby obtaining labeled data. To measure the deviation, the chi-square statistic, the mean absolute deviation, the Kolmogorov-Smirnov test, the Euclidean distance, the Hellinger distance, the Kullback-Leibler divergence and the combination of the tests' p-values through Fisher's method were used. The performance of the different dissimilarity measures was assessed with classification metrics such as precision, recall and F1-score, the same criteria used in machine learning, allowing the model under study to be compared with machine learning models. The analysis was complemented by the confusion matrix and the ROC curve, tools that provide a more detailed evaluation of the model's behavior and enable its performance to be compared with that of machine learning models.
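The following sketch shows one standard way to combine per-test p-values with Fisher's method and to score the resulting classifier with the metrics mentioned above; the variable names and the decision threshold `alpha` are illustrative assumptions, not the dissertation's choices:

```python
# Fisher's method: X = -2 * sum(ln p_i) follows a chi-square distribution with
# 2k degrees of freedom under the joint null hypothesis.
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, roc_auc_score)

def fisher_combined_pvalue(p_values):
    p = np.asarray(p_values, dtype=float)
    statistic = -2 * np.sum(np.log(p))
    return chi2.sf(statistic, df=2 * len(p))  # survival function = 1 - CDF

def evaluate(y_true, combined_p, alpha=0.05):
    # y_true: 1 = manipulated dataset, 0 = Benford-conformant dataset.
    # A low combined p-value is treated as an anomaly flag.
    y_pred = (np.asarray(combined_p) < alpha).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, 1 - np.asarray(combined_p)),
    }
```

SciPy's `scipy.stats.combine_pvalues` with `method='fisher'` provides the same p-value combination as the helper above.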
Keywords
Benford's law; Irregularity detection; Statistical models; Dissimilarity measures; Performance evaluation; First-digit distribution; Synthetic data
