Bridging the AI Gap and Fostering Equity within ASEAN

On August 15, 2024, I attended the 3rd Language Summit on SEALD in Bangkok, Thailand, organized by AI Singapore, Google, and VISTEC. This continuing series of gatherings of ASEAN NLP researchers and institutions shows sustained progress towards addressing the challenges in AI adoption and promoting equity in the field.

The rapid development of AI technologies, particularly large language models (LLMs), has brought to light several critical issues. Access inequality remains a significant concern, as not all individuals, organizations, or countries have equal access to AI resources and technologies. The high costs of developing and using AI systems put them out of reach for many potential users. Additionally, current AI systems often lack diverse representation, leading to biases in their outputs and applications. These issues call for collaborative efforts to create more inclusive and accessible AI technologies.

AISG Partners

Project SEALD (Southeast Asian Languages in One Network Data) is a prime example of such collaborative efforts. It’s a joint initiative of Google and AI Singapore (AISG) that aims to enhance the capabilities of LLMs, making them more useful across Southeast Asia (SEA). The focus on SEA languages matters, given the region’s extraordinary range of languages and dialects. As a Filipino, I’m acutely aware of the linguistic diversity in our country alone: the Philippines has between 120 and 187 languages, depending on who you ask. That level of diversity is precisely why it is crucial to incorporate local languages into LLMs, so that they can better understand and respond to the unique cultural contexts of the region.

The Challenge

However, developing pre-trained language models is not straightforward, especially for developing nations like the Philippines.

These models require vast amounts of high-quality, diverse data in the target languages, which is particularly challenging for languages with limited digital presence (think Cebuano, Hiligaynon, Ilocano, Kapampangan). Moreover, training LLMs demands substantial computational resources, and the hardware costs are often prohibitive for many research institutions and for countries like the Philippines. Can you imagine spending over $100M for large language model training? That’s the reported estimate for training GPT-4.

ASEAN’s United Front and the Philippines’ Role

In light of these challenges, the development of SEA-LION by AI Singapore was HUGE, HUGE welcome news. Alongside the implementation of Project SEALD, it marks a significant milestone in the region’s AI tech advancement. By pooling leadership, resources, and expertise, these efforts pave the way for more inclusive and culturally attuned language models, which will play a crucial role in shaping the digital future of Southeast Asia.

With open-source pre-trained models that have already undergone extensive training on large and diverse datasets, researchers and developers in SEA can build upon these foundations rather than starting from scratch. This not only reduces the barrier to entry but also allows for localization and customization to fit the specific linguistic and cultural context.

SEA-LION v2 Pre-training Dataset Composition

It was my first time attending the Summit, and I was so inspired to see ASEAN member countries unite to tackle shared challenges in natural language processing (NLP). I was especially encouraged by the progress in foundation model building across ASEAN, which helps ensure AI technologies are both advanced and culturally relevant. While the Philippines still has to catch up, I’m optimistic about accelerated progress and contributions in the near future.

CAIR’s Commitment to Regional AI Advancement

As part of the Center for AI Research (CAIR), a program under the Philippine Department of Trade and Industry, I can share that we are strongly committed to Project SEALD and the SEA-LION LLM initiative spearheaded by AI Singapore. We have seen how LLMs, when deployed as applications, can significantly benefit enterprises. The prospect of an open-source model like SEA-LION is particularly exciting as it promises to make these powerful tools more accessible to our local businesses, especially MSMEs in the country.


CAIR is deeply invested in enhancing the competitiveness of Philippine businesses through AI technologies, including LLMs. By participating in these collaborative efforts, we aim to contribute Filipino language data and expertise, develop practical applications for local businesses, and facilitate knowledge transfer within the Philippine AI and business communities.

Indeed, our involvement in these projects reaffirms our commitment to fostering an AI-enabled, inclusive business environment in the Philippines, driving innovation and opening new opportunities in the global digital economy.
