A Multi-Metric Evaluation Perspective on Hallucination Detection in Low-Resource Governance Documents

Pranjal Gahlot; Prof. Rakshitha BS

doi:10.64388/IREV9I11-1717980

Paper 1717980 · Published · Vol 9 · Issue 11

Subject area: Science, Engineering and Technology · Area of research: AI, NLP, LLMs

DOI: https://doi.org/10.64388/IREV9I11-1717980

Abstract

The rapid advancement of Large Language Models (LLMs) has significantly improved natural language processing applications across domains such as governance, healthcare, legal analysis, and public information systems. Despite these advancements, LLMs frequently generate hallucinated outputs: responses that appear plausible but contain incorrect or fabricated information. This issue poses serious risks in governance-related applications, where inaccurate information can influence policy interpretation, administrative decision-making, and public trust. Existing studies have proposed several approaches to address hallucinations, including semantic entropy–based detection, benchmark evaluation frameworks, and adversarial testing methods. However, the literature indicates that current solutions remain fragmented and often focus on isolated aspects such as model performance, dataset construction, or benchmark capability rather than comprehensive reliability assessment. This literature review examines recent research on hallucination detection, multilingual and low-resource natural language processing, and evaluation frameworks for LLM reliability. The reviewed studies highlight key challenges, including the lack of multilingual hallucination evaluation, insufficient harm-oriented risk assessment, and limited adversarial robustness testing in governance contexts. Furthermore, existing benchmarks often measure task accuracy rather than factual reliability or societal impact. Based on this analysis, the review identifies major methodological and contextual gaps and argues for an integrated evaluation framework combining meaning-level hallucination detection, harm-aware risk modeling, and multilingual robustness assessment. Such an approach could improve the reliability and safety of LLM systems deployed in governance and public service environments.

Keywords

Large Language Models, Hallucination Detection, Semantic Entropy, Multilingual NLP, Low-Resource Languages, Governance AI, Adversarial Prompting, Benchmark Evaluation, AI Reliability, Natural Language Processing.

How to cite this paper

Pranjal Gahlot, Prof. Rakshitha BS "A Multi-Metric Evaluation Perspective on Hallucination Detection in Low-Resource Governance Documents" Iconic Research And Engineering Journals Volume 9 Issue 11 2026 Page 2631-2642 https://doi.org/10.64388/IREV9I11-1717980

Pranjal Gahlot, Prof. Rakshitha BS, "A Multi-Metric Evaluation Perspective on Hallucination Detection in Low-Resource Governance Documents," Iconic Research And Engineering Journals, vol. 9, no. 11, May 2026, doi: https://doi.org/10.64388/IREV9I11-1717980

Pranjal Gahlot, Prof. Rakshitha BS (2026). A Multi-Metric Evaluation Perspective on Hallucination Detection in Low-Resource Governance Documents. Iconic Research And Engineering Journals, 9(11). doi: https://doi.org/10.64388/IREV9I11-1717980

Pranjal Gahlot, Prof. Rakshitha BS. "A Multi-Metric Evaluation Perspective on Hallucination Detection in Low-Resource Governance Documents." Iconic Research And Engineering Journals, vol. 9, no. 11, May 2026. Crossref, https://doi.org/10.64388/IREV9I11-1717980

@article{1717980,
      author = {Pranjal Gahlot and Prof. Rakshitha BS},
      title = {A Multi-Metric Evaluation Perspective on Hallucination Detection in Low-Resource Governance Documents},
      journal = {Iconic Research And Engineering Journals},
      year = {2026},
      volume = {9},
      number = {11},
      pages = {2631--2642},
      issn = {2456-8880},
      url = {https://www.irejournals.com/formatedpaper/1717980.pdf},
      abstract = {The rapid advancement of Large Language Models (LLMs) has significantly improved natural language processing applications across domains such as governance, healthcare, legal analysis, and public information systems. Despite these advancements, LLMs frequently generate hallucinated outputs: responses that appear plausible but contain incorrect or fabricated information. This issue poses serious risks in governance-related applications, where inaccurate information can influence policy interpretation, administrative decision-making, and public trust. Existing studies have proposed several approaches to address hallucinations, including semantic entropy–based detection, benchmark evaluation frameworks, and adversarial testing methods. However, the literature indicates that current solutions remain fragmented and often focus on isolated aspects such as model performance, dataset construction, or benchmark capability rather than comprehensive reliability assessment. This literature review examines recent research on hallucination detection, multilingual and low-resource natural language processing, and evaluation frameworks for LLM reliability. The reviewed studies highlight key challenges, including the lack of multilingual hallucination evaluation, insufficient harm-oriented risk assessment, and limited adversarial robustness testing in governance contexts. Furthermore, existing benchmarks often measure task accuracy rather than factual reliability or societal impact. Based on this analysis, the review identifies major methodological and contextual gaps and argues for an integrated evaluation framework combining meaning-level hallucination detection, harm-aware risk modeling, and multilingual robustness assessment. Such an approach could improve the reliability and safety of LLM systems deployed in governance and public service environments.},
      keywords = {Large Language Models, Hallucination Detection, Semantic Entropy, Multilingual NLP, Low-Resource Languages, Governance AI, Adversarial Prompting, Benchmark Evaluation, AI Reliability, Natural Language Processing.},
      month = {May},
      doi = {10.64388/IREV9I11-1717980}
  }