Machine-Translated Web Content: Widespread Low Quality Raises Concerns

A large share of the content found on the web is poorly machine-translated text, according to researchers at Amazon Web Services (AWS). The team at AWS’s AI lab found that a vast amount of online content has been translated by machines, producing low-quality text across many languages. The finding underscores the importance of data quality and source selection when training large language models (LLMs). The study also found that machine-translated content is particularly prevalent in lower-resource languages, where it makes up a significant portion of all web content in those languages. The researchers further note a selection bias in the type of content that gets translated into many languages, possibly because it is produced to generate ad revenue.

The investigation was prompted by native speakers of low-resource languages who noticed that a considerable portion of web content in their languages appeared to be machine-translated. To gauge the scope of the issue, the team built a large corpus called Multi-Way ccMatrix (MWccMatrix). It comprises 6.4 billion unique sentences in 90 languages, organized into translation tuples: sets of sentences in different languages that are translations of one another.
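To make the idea of a translation tuple concrete, here is a minimal Python sketch. It is not the researchers' pipeline; it only assumes that the input is a list of bilingual sentence pairs and that sentences linked directly or indirectly belong to the same tuple. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_translation_tuples(pairs):
    """Group bilingual sentence pairs into multi-way translation tuples.

    `pairs` is an iterable of ((lang, sentence), (lang, sentence)) items.
    Sentences connected directly or transitively land in the same tuple.
    This is an illustrative assumption, not the method used in the study.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for left, right in pairs:
        union(left, right)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


def multiway_parallelism(translation_tuple):
    """Count the distinct languages represented in one tuple."""
    return len({lang for lang, _ in translation_tuple})


# Hypothetical toy data: one sentence in three languages, another in two.
pairs = [
    (("en", "Hello."), ("fr", "Bonjour.")),
    (("en", "Hello."), ("de", "Hallo.")),
    (("en", "Thank you."), ("es", "Gracias.")),
]
tuples = build_translation_tuples(pairs)
```

With this grouping, a sentence that appears in many languages shows up as a large tuple, which is the kind of multi-way parallel content the study associates with machine translation.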

The study, posted to arXiv, the preprint server hosted by Cornell University, shows that a large amount of web content is translated into many languages, mostly by machine translation. In lower-resource languages, this machine-translated content does not merely dominate the translations; it also makes up a substantial portion of all web content in those languages.

Additionally, the researchers noticed a bias in the kind of content chosen for translation into many languages, which suggests it may be produced to generate ad revenue. The paper concludes that while machine translation has made significant strides in the past decade, it still falls short of human quality. As a result, much of the machine-translated content on the web today is of very low quality by contemporary standards. Training on these low-quality translations can yield LLMs that are less fluent and more prone to hallucinations. Moreover, the selection bias suggests that the underlying content may already be of lower quality, even before accounting for machine translation errors. The researchers emphasize the need to train LLMs on high-quality data, such as books and Wikipedia articles, which are typically upsampled several times.

The findings from this study shed light on the prevalence of machine-translated content on the web and its subpar quality. As internet users increasingly rely on translated content, it becomes crucial to prioritize data quality and consider the source when training language models. The researchers’ work highlights the need for improvements in machine translation technology to ensure higher quality translations and more reliable content for users across the globe.

Neha Sharma
Neha Sharma is a tech-savvy author at The Reportify who delves into the ever-evolving world of technology. With her expertise in the latest gadgets, innovations, and tech trends, Neha keeps you informed about all things tech in the Technology category. She can be reached at neha@thereportify.com for any inquiries or further information.
