Email Header Analysis for Digital Forensics Using Machine Learning

Deepa B; Dr. Balamurugan S

doi:10.64388/IREV9I11-1717944

Paper 1717944 · Published · Vol 9 · Issue 11

Subject area: Science, Engineering and Technology · Area of research: Cybersecurity and Digital Forensics

Abstract

Email remains the dominant vector for advanced cyber-attacks, including phishing, spoofing, and Business Email Compromise (BEC). Conventional email security mechanisms rely predominantly on content-based analysis, encompassing Natural Language Processing (NLP), keyword filtering, and signature-based detection, which is increasingly inadequate against modern adversarial techniques such as clean-text phishing, image-based payloads, and AI-generated deceptive messages. This paper presents a novel forensic-aware, machine learning-driven framework that shifts the analytical focus from email body content to Simple Mail Transfer Protocol (SMTP) header metadata. Email headers encode verifiable forensic information: relay paths (Received fields), originating IP addresses, timestamp sequences, and authentication outcomes (SPF, DKIM, DMARC). Together, these form a tamper-resistant record of an email's transmission behavior. Unlike body content, header metadata is structurally constrained and considerably more difficult for attackers to manipulate consistently across multiple relay nodes. The proposed framework introduces a feature engineering pipeline that extracts temporal, network, topological, and authentication-level attributes from SMTP headers. These features are processed by an ensemble of machine learning models, namely Random Forest (RF), Isolation Forest, and XGBoost, enabling both supervised classification of known attack patterns and unsupervised detection of novel anomalies. A key contribution of this research is the integration of forensic traceability with automated detection: the system reconstructs email transmission paths and preserves evidentiary artifacts suitable for digital forensic investigation and legal proceedings.
Experimental evaluations on datasets derived from the SpamAssassin and PhishTank repositories show that the proposed ensemble model achieves an F1-score of approximately 0.969 and an AUC-ROC of 0.983, a 5–8% improvement over content-only baseline models. The false positive rate is reduced from 11.2% to 3.1%, which supports operational reliability in enterprise environments. This research establishes a scalable, intelligent, and forensically defensible paradigm for combating sophisticated email-based cyber-threats.
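
To make the header-level feature categories named in the abstract more concrete, the following Python sketch derives a few temporal, topological, and authentication attributes from a raw message using only the standard library. It is an illustration under stated assumptions, not the authors' pipeline: the function name `extract_header_features`, the feature names, and the sample message are all hypothetical, and the paper's actual feature set is not given on this page.

```python
import re
from datetime import timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def _hop_time(received_field):
    """Return the timestamp after the last ';' of a Received field, or None."""
    _, _, date_part = received_field.rpartition(";")
    try:
        stamp = parsedate_to_datetime(date_part.strip())
    except (TypeError, ValueError):
        return None
    if stamp.tzinfo is None:
        # Treat zone-less timestamps as UTC so all hops stay comparable.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def _domain(header_value):
    """Return the lowercased domain part of an address header."""
    _, addr = parseaddr(header_value or "")
    return addr.rpartition("@")[2].lower()


def extract_header_features(raw_message):
    """Hypothetical feature extractor; names and features are assumptions."""
    msg = message_from_string(raw_message)

    # Each relay prepends its Received field, so the list is newest-first.
    received = msg.get_all("Received", [])
    hop_times = [t for t in (_hop_time(f) for f in reversed(received)) if t]

    # Temporal features: hops whose timestamp precedes the previous hop.
    out_of_order = sum(
        1 for earlier, later in zip(hop_times, hop_times[1:]) if later < earlier
    )
    transit = (
        (hop_times[-1] - hop_times[0]).total_seconds()
        if len(hop_times) >= 2
        else 0.0
    )

    # Authentication features from Authentication-Results, if present.
    auth = {"spf": "missing", "dkim": "missing", "dmarc": "missing"}
    for field in msg.get_all("Authentication-Results", []):
        for mech, result in re.findall(
            r"\b(spf|dkim|dmarc)=(\w+)", field, flags=re.I
        ):
            auth[mech.lower()] = result.lower()

    return {
        "hop_count": len(received),
        "out_of_order_hops": out_of_order,
        "total_transit_seconds": transit,
        "spf_pass": int(auth["spf"] == "pass"),
        "dkim_pass": int(auth["dkim"] == "pass"),
        "dmarc_pass": int(auth["dmarc"] == "pass"),
        "from_return_path_mismatch": int(
            _domain(msg.get("From")) != _domain(msg.get("Return-Path"))
        ),
    }


# Made-up sample message (all hosts and addresses are illustrative).
SAMPLE = (
    "Received: from mx.corp.example by inbox.corp.example; "
    "Mon, 01 Jan 2024 10:00:05 +0000\n"
    "Received: from relay.evil.example by mx.corp.example; "
    "Mon, 01 Jan 2024 10:00:09 +0000\n"
    "Authentication-Results: mx.corp.example; spf=fail "
    "smtp.mailfrom=evil.example; dkim=none; dmarc=fail\n"
    "Return-Path: <bounce@evil.example>\n"
    'From: "CEO" <ceo@corp.example>\n'
    "Subject: Urgent\n"
    "\n"
    "body\n"
)

features = extract_header_features(SAMPLE)
```

A numeric dictionary of this kind could then be fed to a supervised classifier or an anomaly detector. In the sample, the earlier relay reports a later timestamp than the final hop, which is the sort of inconsistency a timestamp-sequence feature is meant to expose.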

Keywords

Anomaly Detection, Digital Forensics, Email Spoofing, Ensemble Learning, Feature Engineering, Graph-Based Topology, SMTP Header Analysis, Phishing Detection

How to cite this paper

Deepa B, Dr. Balamurugan S, "Email Header Analysis for Digital Forensics Using Machine Learning," Iconic Research And Engineering Journals, Volume 9, Issue 11, 2026, pp. 2589-2599. https://doi.org/10.64388/IREV9I11-1717944

Deepa B, Dr. Balamurugan S, "Email Header Analysis for Digital Forensics Using Machine Learning," Iconic Research And Engineering Journals, vol. 9, no. 11, May 2026, doi: 10.64388/IREV9I11-1717944

Deepa B, Dr. Balamurugan S (2026). Email Header Analysis for Digital Forensics Using Machine Learning. Iconic Research And Engineering Journals, 9(11). https://doi.org/10.64388/IREV9I11-1717944

Deepa B, Dr. Balamurugan S. "Email Header Analysis for Digital Forensics Using Machine Learning." Iconic Research And Engineering Journals, vol. 9, no. 11, May 2026. Crossref, https://doi.org/10.64388/IREV9I11-1717944

@article{1717944,
      author = {Deepa B and Balamurugan S},
      title = {Email Header Analysis for Digital Forensics Using Machine Learning},
      journal = {Iconic Research And Engineering Journals},
      year = {2026},
      volume = {9},
      number = {11},
      pages = {2589--2599},
      issn = {2456-8880},
      url = {https://www.irejournals.com/formatedpaper/1717944.pdf},
      abstract = {Email remains the dominant vector for advanced cyber-attacks, including phishing, spoofing, and Business Email Compromise (BEC). Conventional email security mechanisms rely predominantly on content-based analysis, encompassing Natural Language Processing (NLP), keyword filtering, and signature-based detection, which is increasingly inadequate against modern adversarial techniques such as clean-text phishing, image-based payloads, and AI-generated deceptive messages. This paper presents a novel forensic-aware, machine learning-driven framework that shifts the analytical focus from email body content to Simple Mail Transfer Protocol (SMTP) header metadata. Email headers encode verifiable forensic information: relay paths (Received fields), originating IP addresses, timestamp sequences, and authentication outcomes (SPF, DKIM, DMARC). Together, these form a tamper-resistant record of an email's transmission behavior. Unlike body content, header metadata is structurally constrained and considerably more difficult for attackers to manipulate consistently across multiple relay nodes. The proposed framework introduces a feature engineering pipeline that extracts temporal, network, topological, and authentication-level attributes from SMTP headers. These features are processed by an ensemble of machine learning models, namely Random Forest (RF), Isolation Forest, and XGBoost, enabling both supervised classification of known attack patterns and unsupervised detection of novel anomalies. A key contribution of this research is the integration of forensic traceability with automated detection: the system reconstructs email transmission paths and preserves evidentiary artifacts suitable for digital forensic investigation and legal proceedings.
Experimental evaluations on datasets derived from the SpamAssassin and PhishTank repositories show that the proposed ensemble model achieves an F1-score of approximately 0.969 and an AUC-ROC of 0.983, a 5–8% improvement over content-only baseline models. The false positive rate is reduced from 11.2% to 3.1%, which supports operational reliability in enterprise environments. This research establishes a scalable, intelligent, and forensically defensible paradigm for combating sophisticated email-based cyber-threats.},
      keywords = {Anomaly Detection, Digital Forensics, Email Spoofing, Ensemble Learning, Feature Engineering, Graph-Based Topology, SMTP Header Analysis, Phishing Detection},
      month = {May},
      doi = {10.64388/IREV9I11-1717944}
  }