Improving Malicious PDF Classifier with Feature Engineering: A Data-Driven Approach - 2021

Research Area:  Machine Learning

Abstract:

Several approaches and tools have been developed to analyse and detect the presence of malicious content within the PDF; however, the fundamental approach in designing the existing tools and techniques has not been entirely considerate. Existing tools are based on the available datasets and the observations made during the maldoc manual analysis, making them susceptible to various types of attacks such as Mimicry and Parser confusion. We aim to enhance PDF maldoc classification by identifying the most conclusive feature-set required for accurately classifying PDF maldocs. We extract features using two popular PDF analysis tools and derive a set of features backed by data that further complements classification. We subsequently evaluate all features through a wrapper function. The features with the highest importance values are used to construct a classifier that outperforms the baseline models in terms of classification accuracy and efficiency. Our proposed method helps us identify a useful set of tool-independent features that prolong the current tools' lifespan and usability. It provides us with an in-depth understanding of how these chosen features cumulatively impact the classification. In addition, we evaluate our findings using real-world samples from VirusTotal. Using our proposed technique, we managed to decrease the size of the feature-set by more than 60% while increasing the classification accuracy by around 2%.

Keywords:  
Wrapper function
Malicious content
Baseline models
Classification accuracy
PDF classifier

Author(s) Name:  Ahmed Falah, Lei Pan, Shamsul Huda

Journal name:  Future Generation Computer Systems

Conference name:  

Publisher name:  Elsevier

DOI:  10.1016/j.future.2020.09.015

Volume Information:  Volume 115