Singapore Builds AI Model to ‘Represent’ Southeast Asians
SELANGOR: Southeast Asians have struggled to use large language models (LLMs) such as Meta’s Llama 2 and Mistral AI’s models in their native languages, often receiving nonsensical output in English. Tech experts caution that this places them at a disadvantage as generative artificial intelligence (AI) transforms education, work, and governance across the globe.
To address this imbalance, the Singapore government is spearheading an initiative to develop a Southeast Asian LLM called SEA-LION (Southeast Asian Languages in One Network). Trained on data in 11 Southeast Asian languages, including Thai, Vietnamese, and Bahasa Indonesia, this open-source model is intended to be a cost-effective and efficient option for businesses, governments, and academia in the region.
Leslie Teo, senior director for AI products at AI Singapore, emphasized the goal of accessibility for Southeast Asians. He asked whether the region should conform to the machine or whether technology should be made more accessible to its people, regardless of their proficiency in English. Teo clarified that SEA-LION is not meant to compete with existing LLMs, but rather to complement them and provide better representation for Southeast Asians.
Currently, most LLMs, such as OpenAI’s GPT-4 and Meta’s Llama 2, have been developed and trained primarily on English-language data, leaving many languages underrepresented. Governments and tech firms worldwide are therefore working to bridge this gap. India is creating datasets in local languages, while the United Arab Emirates has developed an LLM that powers generative AI tools in Arabic. Similarly, China, Japan, and Vietnam have introduced AI models built on local languages.
Multilingual language models, which are trained on several languages at once, can help establish semantic and grammatical connections between high-resource languages and low-resource ones. These models have a wide range of applications, including translation services, customer-service chatbots, and content moderation on social media platforms, which struggle to identify hate speech in low-resource languages.
SEA-LION stands out because over 13% of its training data comes from Southeast Asian languages, a larger share than in other major LLMs. In addition, more than 9% of its data is sourced from Chinese text and approximately 63% from English. AI Singapore places great importance on training SEA-LION with accurate and verified data, given how much of the information found on the internet is translated or unreliable.
Some worry that region-specific LLMs may perpetuate biased narratives, reinforcing dominant views and potentially disregarding crucial socio-political issues. AI Singapore believes that relying solely on Western LLMs, which carry their own inherent biases, would be equally problematic. These biases may not align with the linguistic and cultural nuances of local language speakers, thus distorting their representation.
The development of SEA-LION aims to mitigate these concerns and strike a balance. By creating a model tailored to Southeast Asians, the hope is to provide a more accurate reflection of their linguistic and cultural nuances, while also preventing a revisionist view of history or the undermining of democratic values.
As more governments contribute data and more businesses test SEA-LION, the model is expected to become more refined and more widely applicable. Indonesian e-commerce giant Tokopedia, for instance, anticipates that a localized model will enhance its customer interactions in Bahasa Indonesia and improve the overall user experience.
Singapore’s ambitious pursuit of a Southeast Asian LLM underscores the importance of inclusive technology on a global scale. Rather than imposing a one-size-fits-all approach, this initiative strives to ensure that AI is accessible to diverse populations, empowering them to harness its benefits without language barriers.
In a rapidly evolving digital landscape, SEA-LION’s introduction marks a significant step, giving Southeast Asians the opportunity to participate equitably in the growing AI-driven economy and to assert their own technological self-reliance.