“Bringing the voices of millions into the AI era”

For Dhyon Technology, Lotus Avio led a specialized regional audio data-collection drive across Bihar and Jharkhand, gathering high-quality, authentic speech in three underrepresented languages: Bhojpuri, Maithili and Magahi. The data was used to train and refine Automatic Speech Recognition (ASR) and voice-AI models. Mainstream models struggle with regional Indian languages for want of clean, diverse, localized data. We recruited native speakers across districts to capture genuine dialects, curated prompts spanning daily conversation, agriculture, local governance and folklore, and captured uncompressed studio-grade audio with strict noise control and precise transcription, delivering hundreds of hours of pristine, model-ready data.

Our campaign strategy

01 Sourcing

Native-speaker recruitment

Onboarded speakers across districts to capture genuine dialects and accents, untouched by urban normalization.

02 Prompts

Contextual scripts

Curated prompts across daily conversation, agriculture, local governance and folklore so models learn real, contextual language.

03 Quality

Studio-grade capture

Uncompressed audio at a sample rate of 16 kHz or higher, strict noise control, and rejection of low-quality samples before delivery.
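As an illustration of the kind of check this implies, here is a minimal sketch of a pre-delivery gate, assuming recordings are stored as WAV files. The function name `wav_rejection_reasons` is hypothetical and this is not the actual pipeline used on the project. It only verifies that a file is readable uncompressed PCM and meets the 16 kHz threshold; judging background noise would need signal analysis that is not shown here.

```python
import wave

# Threshold taken from the text above; everything else is illustrative.
MIN_SAMPLE_RATE_HZ = 16_000


def wav_rejection_reasons(path, min_rate_hz=MIN_SAMPLE_RATE_HZ):
    """Return a list of reasons to reject the file; an empty list means it passes.

    Hypothetical helper: Python's standard `wave` module only reads
    uncompressed PCM WAV data, so a failure to open the file is treated
    as a rejection.
    """
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable uncompressed PCM WAV file: {exc}"]

    reasons = []
    if rate < min_rate_hz:
        reasons.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    return reasons
```

A batch run could call this on every recording and hold back any file for which the returned list is not empty.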

04Coverage

Three-language matrix

Balanced Bhojpuri, Maithili and Magahi coverage across age, gender and background, with precise transcription alignment.


