Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2

Samir Mulla; Mahamadtohid Naikwadi; Prajwal Khandait; Aditya Sutar; Rajesh Kumar; Uma Gurav

doi:10.64388/IREV9I12-1718689

Paper 1718689 · Published · Vol 9 · Issue 12

Subject area: Science, Engineering and Technology · Area of research: Artificial Intelligence, Deep Learning

Abstract

Diabetic Retinopathy (DR) is a vision-threatening complication of diabetes mellitus that progresses silently through five clinically defined severity grades. Timely automated screening is critical to prevent irreversible vision loss, particularly in resource-constrained healthcare settings. This paper presents a systematic comparative study of three state-of-the-art deep learning architectures, namely Vision Transformer (ViT-Base/16), Swin Transformer (swin_base_patch4_window7_224), and InceptionResNetV2, applied to five-class DR grading on the APTOS 2019 fundus image dataset (3,662 images). All models employ transfer learning from ImageNet-pretrained weights. We analyze each architecture in terms of classification accuracy, per-class F1-score, macro-averaged AUC, GradCAM-based explainability, training dynamics, and parameter efficiency. Our ViT-Base/16 model, fine-tuned end-to-end with AdamW, cosine annealing, and label smoothing, achieves the highest validation accuracy, 85.40%, with a macro-averaged F1-score of 0.7247. Swin Transformer achieves 83.20% accuracy, while InceptionResNetV2 achieves 81.40% through two-stage transfer learning. GradCAM visualizations confirm clinically aligned lesion localization across all architectures. This work provides architectural insights for deploying robust DR screening systems in clinical environments.
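The abstract reports both accuracy and a macro-averaged F1-score, and the two can differ considerably on a five-class problem because macro-F1 weights every grade equally regardless of how many images it has. The following is a minimal sketch of that relationship in plain Python, not the authors' evaluation code; the function names and the ten toy labels are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, num_classes=5):
    """Return one F1 score per class, treating each class one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted but wrong
            fn[t] += 1          # class t was missed
    scores = []
    for c in range(num_classes):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return scores


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, num_classes)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted grade equals the true grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels (grades 0-4), NOT data from the paper:
# the majority grade 0 is always right, the rare grades 3 and 4 are missed.
y_true = [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]
y_pred = [0, 0, 0, 0, 0, 0, 1, 2, 2, 2]

print(accuracy(y_true, y_pred))   # 0.8
print(macro_f1(y_true, y_pred))   # 0.5
```

In this toy case the classifier is right on 8 of 10 images, yet the macro-F1 is only 0.5 because two of the five grades are never predicted correctly. A gap in the same direction between the reported accuracy and macro-F1 would be consistent with weaker performance on less frequent grades, though the abstract alone does not establish the cause.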

Keywords

Diabetic Retinopathy, Vision Transformer, Swin Transformer, InceptionResNetV2, Transfer Learning, GradCAM, Fundus Image Classification, Deep Learning, Medical Image Analysis

How to cite this paper

Samir Mulla, Mahamadtohid Naikwadi, Prajwal Khandait, Aditya Sutar, Rajesh Kumar, Uma Gurav "Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2" Iconic Research And Engineering Journals Volume 9 Issue 12 2026 Page 527-536 https://doi.org/10.64388/IREV9I12-1718689

Samir Mulla, Mahamadtohid Naikwadi, Prajwal Khandait, Aditya Sutar, Rajesh Kumar, Uma Gurav "Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2" Iconic Research And Engineering Journals, vol. 9, no. 12, Jun. 2026, doi: https://doi.org/10.64388/IREV9I12-1718689

Samir Mulla, Mahamadtohid Naikwadi, Prajwal Khandait, Aditya Sutar, Rajesh Kumar, Uma Gurav (2026). Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2. Iconic Research And Engineering Journals, 9(12). doi: https://doi.org/10.64388/IREV9I12-1718689

Samir Mulla, Mahamadtohid Naikwadi, Prajwal Khandait, Aditya Sutar, Rajesh Kumar, Uma Gurav "Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2" Iconic Research And Engineering Journals, vol. 9, no. 12, Jun. 2026. Crossref, https://doi.org/10.64388/IREV9I12-1718689

@article{1718689,
      author = {Samir Mulla and Mahamadtohid Naikwadi and Prajwal Khandait and Aditya Sutar and Rajesh Kumar and Uma Gurav},
      title = {Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2},
      journal = {Iconic Research And Engineering Journals},
      year = {2026},
      volume = {9},
      number = {12},
      pages = {527--536},
      issn = {2456-8880},
      url = {https://www.irejournals.com/formatedpaper/1718689.pdf},
      abstract = {Diabetic Retinopathy (DR) is a vision-threatening complication of diabetes mellitus that progresses silently through five clinically defined severity grades. Timely automated screening is critical to prevent irreversible vision loss, particularly in resource-constrained healthcare settings. This paper presents a systematic comparative study of three state-of-the-art deep learning architectures, namely Vision Transformer (ViT-Base/16), Swin Transformer (swin\_base\_patch4\_window7\_224), and InceptionResNetV2, applied to five-class DR grading on the APTOS 2019 fundus image dataset (3,662 images). All models employ transfer learning from ImageNet-pretrained weights. We analyze each architecture in terms of classification accuracy, per-class F1-score, macro-averaged AUC, GradCAM-based explainability, training dynamics, and parameter efficiency. Our ViT-Base/16 model, fine-tuned end-to-end with AdamW, cosine annealing, and label smoothing, achieves the highest validation accuracy, 85.40\%, with a macro-averaged F1-score of 0.7247. Swin Transformer achieves 83.20\% accuracy, while InceptionResNetV2 achieves 81.40\% through two-stage transfer learning. GradCAM visualizations confirm clinically aligned lesion localization across all architectures. This work provides architectural insights for deploying robust DR screening systems in clinical environments.},
      keywords = {Diabetic Retinopathy, Vision Transformer, Swin Transformer, InceptionResNetV2, Transfer Learning, GradCAM, Fundus Image Classification, Deep Learning, Medical Image Analysis},
      month = {June},
      doi = {10.64388/IREV9I12-1718689}
  }