The advent of Large Language Models (LLMs), prominently including systems such as ChatGPT, Gemini, and Claude, has precipitated a paradigm shift in natural language processing. These models exhibit unprecedented proficiency in complex text generation, summarization, and reasoning. However, their integration into high-stakes, real-world applications is critically impeded by their propensity to generate “hallucinations.” In the context of LLMs, hallucinations are defined as outputs that are grammatically fluent and presented with high confidence, yet are factually inaccurate, logically inconsistent, or entirely fabricated. Because these models lack calibrated epistemic uncertainty and ground their outputs in statistical likelihood rather than verified truth, they can seamlessly weave falsehoods into otherwise credible discourse, posing significant risks to user trust and system safety. To address this challenge systematically, this study presents a comprehensive investigation into the detection, categorization, and mitigation of hallucination behaviors across four critical, knowledge-intensive domains: medicine, law, science, and history. We construct a novel, rigorously curated benchmark dataset comprising 200 highly specialized prompts designed to stress-test the factual boundaries and reasoning limits of state-of-the-art models. Furthermore, this research proposes a granular taxonomy of hallucination types, differentiating between intrinsic hallucinations (direct contradictions of established facts) and extrinsic hallucinations (the inclusion of assertions that are unsupported by or irrelevant to the prompt, even when independently verifiable). To quantify this phenomenon accurately, we introduce the Confidence-Weighted Hallucination Score (CHS), a novel evaluation metric that recalibrates traditional accuracy measurements by heavily penalizing models that output false information with high lexical certainty. Extensive experimental evaluations using the CHS framework reveal significant variance in model reliability. Overall hallucination rates across the tested models ranged from 11% to 22%, heavily modulated by the complexity and specialization of the queried domain. Notably, our results demonstrate a stark prevalence of hallucinations in highly technical spheres, such as medical diagnostics and legal case synthesis, where parametric knowledge is dense and external factual verification is computationally difficult. In these edge cases, models frequently masked their knowledge deficits by generating highly plausible but synthetic citations, precedents, and terminology. By systematically mapping the conditions under which these failures occur, the proposed evaluation framework provides critical insights into the limitations of current generative architectures. Ultimately, this research lays the theoretical and empirical groundwork for more robust mitigation strategies, contributing directly to the development of safer, more transparent, and highly reliable LLM systems capable of operating in specialized environments.
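The abstract describes the CHS only qualitatively, as an accuracy measure that penalizes confident falsehoods more heavily than hedged ones. The sketch below is one plausible instantiation under that reading, not the paper's actual formula: the function name, the per-response confidence estimates in [0, 1], and the sample data are all illustrative assumptions.

```python
# Hypothetical sketch of a Confidence-Weighted Hallucination Score (CHS).
# The paper's exact formula is not given in the abstract; this illustrates
# one plausible reading in which each hallucinated response contributes a
# penalty proportional to the model's expressed certainty.

def confidence_weighted_hallucination_score(responses):
    """
    responses: list of (is_hallucinated: bool, confidence: float in [0, 1]),
    where confidence estimates the model's lexical certainty (e.g., derived
    from hedging cues or token probabilities).
    Returns a score in [0, 1]; higher means less reliable.
    """
    if not responses:
        return 0.0
    # Confident hallucinations add large penalties; hedged ones add small.
    penalty = sum(conf for hallucinated, conf in responses if hallucinated)
    return penalty / len(responses)

# Example: two confident hallucinations among four responses.
sample = [(True, 0.9), (False, 0.8), (True, 0.95), (False, 0.6)]
print(confidence_weighted_hallucination_score(sample))  # 0.4625
```

Under this reading, a model that hallucinates in hedged, low-certainty language scores better than one asserting the same falsehood confidently, which matches the abstract's stated penalty on high lexical certainty.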
Large Language Models, Hallucination Detection, Natural Language Processing, AI Reliability, Generative AI Evaluation
IRE Journals:
Subhradip Sarkar, Pinaki Karmakar, Rishit Chowdhury, Ankan Roy, "Hallucination Detection, Categorization, and Mitigation in Large Language Models: A Cross-Domain Evaluation Framework", Iconic Research And Engineering Journals, Volume 9, Issue 10, 2026, Page 2902-2907. https://doi.org/10.64388/IREV9I10-1716821
IEEE:
S. Sarkar, P. Karmakar, R. Chowdhury, and A. Roy, "Hallucination Detection, Categorization, and Mitigation in Large Language Models: A Cross-Domain Evaluation Framework," Iconic Research And Engineering Journals, vol. 9, no. 10, pp. 2902-2907, 2026. doi: 10.64388/IREV9I10-1716821