Mumbai Software Company Releases Kan-LLaMA, a 7B Llama-2 Model Fine-Tuned on Kannada


Mumbai-based Software Development Company Tensoic Releases Kannada Llama Model

Mumbai-based software development company Tensoic has launched Kan-LLaMA [ಕನ್-LLama], a 7B Llama-2 language model adapted for Kannada, a low-resource Indic language. The model was continually pre-trained on roughly 600 million Kannada tokens and then fine-tuned on instruction datasets to improve its handling of Kannada text.

In a blog post, Tensoic described how the model was built. The company continually pre-trained Llama-2 on ~600 million Kannada tokens from the CulturaX dataset, which combines multiple de-duplicated multilingual dumps from popular web scrapes such as mC4 and OSCAR. Documents were selected at random from this dataset, yielding a text corpus of ~11GB for the pre-training step.
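The random selection step can be illustrated with a minimal sketch. It assumes the corpus is a list of strings and stops once a token budget is reached; the function names, the whitespace token counter, and the toy corpus are hypothetical stand-ins and not taken from the company's code.

```python
import random


def sample_documents(documents, token_budget, count_tokens, seed=0):
    """Randomly pick documents until roughly `token_budget` tokens are collected."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    order = list(range(len(documents)))
    rng.shuffle(order)                 # visit documents in random order

    selected, total = [], 0
    for idx in order:
        if total >= token_budget:      # stop once the budget is met
            break
        selected.append(documents[idx])
        total += count_tokens(documents[idx])
    return selected, total


def whitespace_token_count(text):
    # Crude stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


# Toy corpus (assumption); a real run would stream documents from the dataset.
toy_corpus = ["ಕನ್ನಡ ಒಂದು ಭಾಷೆ", "ಇದು ಒಂದು ವಾಕ್ಯ", "ನಮಸ್ಕಾರ ಜಗತ್ತು"]
picked, n_tokens = sample_documents(toy_corpus, token_budget=5,
                                    count_tokens=whitespace_token_count)
print(len(picked), n_tokens)
```

The final document can overshoot the budget slightly, which is why the article's figures are best read as approximate.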

To improve the model's handling of Kannada, Tensoic extended the tokeniser, the component that breaks text into smaller units called tokens. The vocabulary of the existing Llama-2 tokeniser was expanded to a total of 48K tokens so that Kannada text is encoded more efficiently.

The company trained a SentencePiece tokeniser with a vocabulary size of 20K on the Kannada text corpus used for pre-training. This new tokeniser was then merged with Llama-2's existing tokeniser, so that Kannada text is split into fewer, more meaningful tokens.
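The merge described above can be sketched as a simple vocabulary union. This is an illustration only: the function name and toy vocabularies are hypothetical, and a real merge would operate on SentencePiece model files rather than plain lists. Llama-2's base tokeniser has a 32K vocabulary, so reaching roughly 48K from a 20K Kannada vocabulary suggests that several thousand pieces overlapped and were not added twice.

```python
def merge_vocabularies(base_pieces, new_pieces):
    """Append pieces from `new_pieces` that the base vocabulary lacks.

    Base pieces keep their positions, so existing token IDs stay unchanged
    and only the newly added pieces need fresh embeddings.
    """
    seen = set(base_pieces)
    merged = list(base_pieces)
    for piece in new_pieces:
        if piece not in seen:          # skip pieces both vocabularies share
            seen.add(piece)
            merged.append(piece)
    return merged


# Toy vocabularies (assumptions), standing in for the 32K and 20K originals.
base_vocab = ["<unk>", "<s>", "</s>", "▁the", "▁a", "ing"]
kannada_vocab = ["<unk>", "<s>", "</s>", "▁ಕನ್ನಡ", "▁ಭಾಷೆ", "▁a"]

merged_vocab = merge_vocabularies(base_vocab, kannada_vocab)
print(len(base_vocab), len(kannada_vocab), len(merged_vocab))
```

Keeping the base pieces first is a deliberate choice in this sketch: the pre-trained embedding rows remain valid, and the model's embedding matrix only has to grow by the number of new pieces.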

For the fine-tuning phase, the model was trained on chat-optimised and translated datasets to improve its conversational abilities. Tensoic has released these curated datasets under the CC BY 4.0 and Apache 2.0 licenses, encouraging contributions from the community.

The fine-tuning process used Axolotl, a tool that lets users configure the fine-tuning of Large Language Models (LLMs) through YAML files. According to the company, the resulting Kannada Llama model generates text well, including in its quantized versions.

With the release of Kan-LLaMA, Tensoic aims to provide a tool for processing Kannada content that is useful to linguists, researchers, developers, and Kannada speakers worldwide. The company says it values linguistic diversity and is committed to building resources that support the preservation and advancement of regional languages.

Models like Kan-LLaMA show promise not only for Kannada but also for other low-resource Indic languages, since the same approach of continued pre-training and fine-tuning can be applied to them.

Tensoic plans to make the models, code, datasets, and paper available under permissive licenses, so that Kannada Llama reaches a wider audience and enables further work in natural language processing.

The arrival of Kan-LLaMA is a notable step for language models tailored to Kannada, with potential applications in Kannada text analysis, conversational AI systems, and content generation.

Tensoic's commitment to linguistic diversity points toward further language models that cover more languages and help bridge communication gaps.

Neha Sharma
Neha Sharma is a tech-savvy author at The Reportify who delves into the ever-evolving world of technology. With her expertise in the latest gadgets, innovations, and tech trends, Neha keeps you informed about all things tech in the Technology category. She can be reached at neha@thereportify.com for any inquiries or further information.
