
“Bringing the voices of millions into the AI era”
For Dhyon Technology, Lotus Avio led a specialized regional audio data-collection drive across Bihar and Jharkhand, gathering high-quality, authentic speech in three underrepresented languages: Bhojpuri, Maithili and Magahi. The data was used to train and refine Automatic Speech Recognition (ASR) and voice-AI models. Mainstream models struggle with regional Indian languages for want of clean, diverse, localized data. We recruited native speakers across districts to capture genuine dialects, curated prompts spanning daily conversation, agriculture, local governance and folklore, and captured uncompressed studio-grade audio with strict noise control and precise transcription, delivering hundreds of hours of pristine, model-ready data.
Our campaign strategy
Native-speaker recruitment
Onboarded speakers across districts to capture genuine dialects and accents, untouched by urban normalization.
Contextual scripts
Curated prompts across daily conversation, agriculture, local governance and folklore so models learn real, contextual language.
Studio-grade capture
Uncompressed audio at 16 kHz or higher, strict noise control, and rejection of low-quality samples before delivery.
Three-language matrix
Balanced Bhojpuri, Maithili and Magahi coverage across age, gender and background, with precise transcription alignment.
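As an illustration of the studio-grade capture step, a pre-delivery check on each recording might look something like the sketch below. It is not the pipeline used on this project. It assumes recordings are stored as WAV files, and the 16-bit minimum sample width and the function name rejection_reasons are hypothetical choices made for the example. Only the 16 kHz floor comes from the spec above. Noise checks are left out because no thresholds are stated.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000   # from the capture spec: 16 kHz or higher
MIN_SAMPLE_WIDTH_BYTES = 2    # assumption for this sketch: 16-bit PCM or better


def rejection_reasons(path):
    """Return a list of reasons to reject the file at `path` (empty if it passes)."""
    try:
        # The standard-library wave module reads uncompressed PCM WAV files,
        # so anything it cannot parse is treated as failing the check.
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            frames = wav.getnframes()
    except (wave.Error, EOFError):
        return ["not a readable uncompressed PCM WAV file"]

    reasons = []
    if rate < MIN_SAMPLE_RATE_HZ:
        reasons.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        reasons.append(f"sample width {width * 8} bits is below {MIN_SAMPLE_WIDTH_BYTES * 8} bits")
    if frames == 0:
        reasons.append("file contains no audio frames")
    return reasons
```

A file is kept only when rejection_reasons returns an empty list. Returning the reasons rather than a plain pass/fail makes it easier to tell a contributor what to re-record.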


