GPT and Interpolation-Based Data Augmentation for Multiclass Intrusion Detection in IIoT

Melicias, Francisco S.; Ribeiro, Tiago F. R.; Rabadão, Carlos; Santos, Leonel; Costa, Rogério Luís de C.

http://hdl.handle.net/10400.8/9867

Use this identifier to reference this record.

Authors

Melicias, Francisco S.

Ribeiro, Tiago F. R.

Rabadão, Carlos

Santos, Leonel

Costa, Rogério Luís de C.

Abstract(s)

The absence of essential security protocols in Industrial Internet of Things (IIoT) networks introduces cybersecurity vulnerabilities and makes these networks potential targets for various types of attack. Although machine learning has been used for intrusion detection in the IIoT, datasets that represent common attacks on IIoT network traffic are scarce and often imbalanced. Data augmentation techniques address these problems by creating artificial data for classes with fewer samples. In this work, we evaluate the use of data augmentation when training intrusion detection models on IIoT traffic data. We compare Generative Pre-trained Transformers (GPT) with variations of the Synthetic Minority Over-sampling TEchnique (SMOTE) and evaluate their capability to enhance intrusion detection performance. We compare the performance of five intrusion detection algorithms trained with augmented datasets against that of models trained with the original non-augmented dataset. To ensure a fair comparison, we evaluate the algorithms' performance in the different scenarios using the same test dataset, which contains no synthetic data. The results show the need for a systematic evaluation before employing data augmentation, as its impact on classification performance depends on the algorithm, the data, and the technique used. While deep neural networks benefit from data augmentation, eXtreme Gradient Boosting (XGBoost), which achieved the best intrusion detection performance among all evaluated classifiers (with an F1-Score over 91%), showed no performance improvement when trained with augmented data. The evaluation of data generated by GPT-based methods shows that such methods (especially GReaT) generate invalid data for both numerical and categorical features in a way that leads to performance degradation in multiclass classification.
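
The interpolation idea behind SMOTE, and the protocol of augmenting only the training data while keeping the test set free of synthetic rows, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote_like_oversample` and `augment_training_split`, the random toy data, and the every-fifth-row holdout are all assumptions made for illustration, and the sketch handles numeric features only (categorical features need a dedicated SMOTE variant).

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbors.

    Hypothetical helper: plain SMOTE-style interpolation for numeric features.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is never its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbors.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place each synthetic point on the segment between base and neighbor.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def augment_training_split(X_train, y_train, seed=0):
    """Oversample every class in the training split up to the majority-class count."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    classes, counts = np.unique(y_train, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X_train], [y_train]
    for label, count in zip(classes, counts):
        if count < target:
            X_new = smote_like_oversample(
                X_train[y_train == label], target - count, seed=seed
            )
            X_parts.append(X_new)
            y_parts.append(np.full(len(X_new), label))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy imbalanced data standing in for IIoT traffic features (assumed, not real data).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(60, 4))
y = np.array([0] * 40 + [1] * 15 + [2] * 5)

# Hold out every fifth row as the test set; it is never augmented.
test_mask = np.zeros(len(y), dtype=bool)
test_mask[::5] = True
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# Synthetic rows are added to the training split only.
X_aug, y_aug = augment_training_split(X_train, y_train)
```

Any classifier would then be fitted once on `(X_train, y_train)` and once on `(X_aug, y_aug)`, with both models scored on the same `(X_test, y_test)`, which mirrors the comparison the abstract describes.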

Description

This work was supported in part by the Fundação para a Ciência e a Tecnologia (FCT), I.P., under Project UIDB/04524/2020; in part by the Scientific Employment Stimulus-Institutional Call under Grant CEECINST/00051/2018; and in part by the Agência Nacional de Inovação (ANI), S.A., under Project POCI-01-0247-FEDER-046083.

Keywords

IIoT; Cybersecurity; Data augmentation; Machine learning

Citation

Melicias, F. S., Ribeiro, T. F. R., Rabadão, C., Santos, L., & Costa, R. L. D. C. (2024). GPT and Interpolation-Based Data Augmentation for Multiclass Intrusion Detection in IIoT. IEEE Access, 12, 17945–17965. https://doi.org/10.1109/ACCESS.2024.3360879