Building high-quality Indic datasets, a prerequisite for effective Indic large language models (LLMs), is challenging because of India’s linguistic diversity. Initiatives such as Project Vaani, AI4Bharat, and Bhashini are collecting datasets for Indic languages, but the volume of available data is still relatively small. Collecting data requires digitizing books, collaborating with linguists, running content creation workshops, and partnering with local institutions. Organizations such as Tech Mahindra and Swecha Telangana have sent teams to collect data from various regions and have engaged communities in the effort. Even so, building good datasets for all 22 official languages will take time and will require a unified, collaborative effort. Many initiatives are adopting open-source approaches to promote transparency and inclusivity in advancing linguistic technologies in India.
India’s Push for Indic Language Models Faces Data Challenge