AI-Driven Synesthetic Music Visualizer: Real-Time Cross-Modal Audio-to-Visual Translation Using Machine Learning

Mohammed Yusoof S; Harini C N; Harini S; Saranya R; Dr. Lakshmi Devi

doi:10.64388/IREV9I9-1715778

1715778 · Published · Vol 9 · Issue 9

Subject area: Science, Engineering and Technology · Area of research: Artificial Intelligence

Abstract

This paper presents the design, implementation, and evaluation of an AI-Driven Synesthetic Music Visualizer, a real-time system that computationally emulates the neurological phenomenon of synesthesia by translating audio signals into semantically congruent, dynamic visual art. Audio tracks in MP3 or WAV format are ingested and decomposed into perceptual acoustic features, including Root Mean Square (RMS) energy, spectral centroid, chroma vector, tempo, and spectral rolloff, using Short-Time Fourier Transform (STFT)-based signal processing in the Librosa library. Six machine learning models are trained and benchmarked on an annotated audio-visual mapping corpus: Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), Random Forest, Support Vector Machine (SVM), K-Nearest Neighbours (KNN), and Gradient Boosting. Gradient Boosting achieves the highest classification performance, with an F1-score of 88.8% and an average inference latency of 29 ms, well within the perceptual synchronisation budget. The predicted visual parameters (colour palette, shape morphology, and animation velocity) are passed to a GPU-accelerated OpenGL rendering engine that sustains 62 frames per second on commodity hardware. The complete pipeline is deployed as a browser-accessible Gradio application. The results demonstrate that intelligent cross-modal synthesis is achievable in real time, opening avenues for generative art, live performance, and assistive technology for hearing-impaired users.
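The feature-extraction stage summarised above can be illustrated with a short NumPy sketch. It is not the authors' implementation, which uses Librosa: the frame length (2048), hop (512), Hann window and 85% rolloff threshold are assumptions chosen to resemble common Librosa defaults, the chroma is a simplified nearest-pitch-class fold rather than `librosa.feature.chroma_stft`, and `clip_vector` is a hypothetical clip-level summary of the kind a classifier such as Gradient Boosting could consume. Tempo estimation is omitted; in Librosa it is typically obtained with `librosa.beat.beat_track`.

```python
import numpy as np


def frame_signal(y: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Slice y into overlapping frames; returns shape (n_fft, n_frames)."""
    if len(y) < n_fft:
        raise ValueError("signal shorter than one analysis frame")
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[:, None] + hop * np.arange(n_frames)[None, :]
    return y[idx]


def extract_features(y, sr, n_fft=2048, hop=512, roll_percent=0.85):
    """Per-frame RMS, spectral centroid, spectral rolloff and a simple chroma.

    Illustrative sketch: n_fft, hop and roll_percent are assumed values, and
    the chroma is a nearest-pitch-class fold, not Librosa's chroma_stft.
    """
    eps = 1e-12
    frames = frame_signal(np.asarray(y, dtype=float), n_fft, hop)

    # RMS energy of each (unwindowed) time-domain frame.
    rms = np.sqrt(np.mean(frames ** 2, axis=0))

    # STFT magnitude: Hann-windowed frames -> one-sided spectrum per column.
    S = np.abs(np.fft.rfft(frames * np.hanning(n_fft)[:, None], axis=0))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # bin centre frequencies (Hz)

    # Spectral centroid: magnitude-weighted mean frequency of each frame.
    centroid = (freqs[:, None] * S).sum(axis=0) / (S.sum(axis=0) + eps)

    # Spectral rolloff: lowest bin frequency at which the cumulative
    # spectral magnitude reaches roll_percent of the frame total.
    cumulative = np.cumsum(S, axis=0)
    rolloff = freqs[np.argmax(cumulative >= roll_percent * cumulative[-1], axis=0)]

    # Chroma: map each non-DC bin to its nearest pitch class (0 = C, 9 = A),
    # accumulate spectral energy, then normalise each frame to a maximum of 1.
    midi = 69 + 12 * np.log2(freqs[1:] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    chroma = np.zeros((12, S.shape[1]))
    np.add.at(chroma, pitch_class, S[1:] ** 2)
    chroma /= chroma.max(axis=0, keepdims=True) + eps

    return {"rms": rms, "centroid": centroid, "rolloff": rolloff, "chroma": chroma}


def clip_vector(features):
    """Hypothetical clip-level summary: mean of each feature (15 values)."""
    return np.concatenate([
        [features["rms"].mean(), features["centroid"].mean(), features["rolloff"].mean()],
        features["chroma"].mean(axis=1),
    ])


# Usage: one second of a 440 Hz (A4) sine tone as a sanity check.
sr = 22050
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)
feats = extract_features(y, sr)
print(feats["chroma"].argmax(axis=0)[0])  # 9, i.e. pitch class A
print(clip_vector(feats).shape)  # (15,)
```

For a pure 440 Hz tone the chroma peak falls on pitch class A and the centroid sits close to 440 Hz, which makes a quick check before feeding real MP3 or WAV audio (decoded separately, for example with `librosa.load`).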

Keywords

Synesthesia, Music Visualisation, Deep Learning, LSTM, GAN, Audio Feature Extraction, Real-Time Processing, Generative AI, Cross-Modal Synthesis, Gradient Boosting, Gradio

How to cite this paper

Mohammed Yusoof S, Harini C N, Harini S, Saranya R, Dr. Lakshmi Devi, "AI-Driven Synesthetic Music Visualizer: Real-Time Cross-Modal Audio-to-Visual Translation Using Machine Learning," Iconic Research And Engineering Journals, Volume 9, Issue 9, 2026, pp. 3477–3483, https://doi.org/10.64388/IREV9I9-1715778

Mohammed Yusoof S, Harini C N, Harini S, Saranya R, Dr. Lakshmi Devi, "AI-Driven Synesthetic Music Visualizer: Real-Time Cross-Modal Audio-to-Visual Translation Using Machine Learning," Iconic Research And Engineering Journals, vol. 9, no. 9, pp. 3477–3483, Mar. 2026, doi: 10.64388/IREV9I9-1715778.

Mohammed Yusoof S, Harini C N, Harini S, Saranya R, Dr. Lakshmi Devi (2026). AI-Driven Synesthetic Music Visualizer: Real-Time Cross-Modal Audio-to-Visual Translation Using Machine Learning. Iconic Research And Engineering Journals, 9(9), 3477–3483. https://doi.org/10.64388/IREV9I9-1715778

Mohammed Yusoof S, Harini C N, Harini S, Saranya R, Dr. Lakshmi Devi. "AI-Driven Synesthetic Music Visualizer: Real-Time Cross-Modal Audio-to-Visual Translation Using Machine Learning." Iconic Research And Engineering Journals, vol. 9, no. 9, Mar. 2026, pp. 3477–3483. Crossref, https://doi.org/10.64388/IREV9I9-1715778.

@article{1715778,
      author = {Mohammed Yusoof S and Harini C N and Harini S and Saranya R and Lakshmi Devi},
      title = {{AI}-Driven Synesthetic Music Visualizer: Real-Time Cross-Modal Audio-to-Visual Translation Using Machine Learning},
      journal = {Iconic Research And Engineering Journals},
      year = {2026},
      volume = {9},
      number = {9},
      pages = {3477--3483},
      issn = {2456-8880},
      url = {https://www.irejournals.com/formatedpaper/1715778.pdf},
      abstract = {This paper presents the design, implementation, and evaluation of an AI-Driven Synesthetic Music Visualizer, a real-time system that computationally emulates the neurological phenomenon of synesthesia by translating audio signals into semantically congruent, dynamic visual art. Audio tracks in MP3 or WAV format are ingested and decomposed into perceptual acoustic features, including Root Mean Square (RMS) energy, spectral centroid, chroma vector, tempo, and spectral rolloff, using Short-Time Fourier Transform (STFT)-based signal processing in the Librosa library. Six machine learning models are trained and benchmarked on an annotated audio-visual mapping corpus: Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), Random Forest, Support Vector Machine (SVM), K-Nearest Neighbours (KNN), and Gradient Boosting. Gradient Boosting achieves the highest classification performance, with an F1-score of 88.8\% and an average inference latency of 29 ms, well within the perceptual synchronisation budget. The predicted visual parameters (colour palette, shape morphology, and animation velocity) are passed to a GPU-accelerated OpenGL rendering engine that sustains 62 frames per second on commodity hardware. The complete pipeline is deployed as a browser-accessible Gradio application. The results demonstrate that intelligent cross-modal synthesis is achievable in real time, opening avenues for generative art, live performance, and assistive technology for hearing-impaired users.},
      keywords = {Synesthesia, Music Visualisation, Deep Learning, LSTM, GAN, Audio Feature Extraction, Real-Time Processing, Generative AI, Cross-Modal Synthesis, Gradient Boosting, Gradio},
      month = mar,
      doi = {10.64388/IREV9I9-1715778}
  }