Inspiration
Machine translation (MT) tools based on AI technology make it possible for virtually anyone to translate text into many of the world's languages, and thus hold the promise of enabling seamless communication across language barriers. However, they still make many errors, and these errors are hard for users who are not fluent in the languages involved to catch. Lay users often rely on techniques like back translation (i.e., re-translating the machine-generated output from the target language back into the source language) to judge output quality when they do not know the target language. As a result, MT is sometimes used inappropriately, even in high-stakes settings such as hospitals or courtrooms, where errors can have severe consequences. In this project, we investigate methods to help users assess the quality of MT outputs, so they can decide when it is appropriate to rely on MT. From this, we sought to explore the question, "Are there quantitative methods that can be used to automatically evaluate the acceptability of machine translations?"
What it does
Our analysis performs the following evaluations for French and Russian TEDx (low-risk) and COVID-19 (high-risk) translations (from English):
- Collects acceptability judgements and confidence ratings from bilingual and monolingual speakers for each language pair and each condition (high/low risk)
- Evaluates the translated output against the input segment using COMET-Src or length-based heuristics
- Compiles visualizations to aid in the identification of accuracy thresholds for Russian and French
- Delivers a final recommendation on whether or not the translation is accepted, based on the individual metrics above weighted by the accuracy thresholds (a rough sketch follows this list)
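To make the scoring and recommendation steps concrete, here is a minimal sketch of how a reference-free COMET score and a length-ratio heuristic could be combined into an accept/reject call. It assumes the `unbabel-comet` package (2.x API) and the `Unbabel/wmt20-comet-qe-da` source-only checkpoint; the threshold and ratio bounds are illustrative placeholders, not the language-specific thresholds we derived from our data.

```python
# Sketch: score a translation with a source-only (reference-free) COMET model
# and a length-ratio heuristic, then combine both signals into an accept/reject
# recommendation. Assumes `pip install unbabel-comet`; thresholds are illustrative.
from comet import download_model, load_from_checkpoint


def comet_src_scores(pairs):
    """pairs: list of {"src": ..., "mt": ...} dicts (no reference translation needed)."""
    model_path = download_model("Unbabel/wmt20-comet-qe-da")
    model = load_from_checkpoint(model_path)
    return model.predict(pairs, batch_size=8, gpus=0).scores


def length_ratio(src, mt):
    """Token-length ratio of the MT output to the source segment."""
    return len(mt.split()) / max(len(src.split()), 1)


def recommend(src, mt, comet_score, comet_threshold=0.3, ratio_bounds=(0.6, 1.6)):
    """Accept only if both signals clear their (placeholder) thresholds."""
    ratio_ok = ratio_bounds[0] <= length_ratio(src, mt) <= ratio_bounds[1]
    return comet_score >= comet_threshold and ratio_ok


pairs = [{"src": "Wash your hands frequently.",
          "mt": "Lavez-vous les mains fréquemment."}]
scores = comet_src_scores(pairs)
print(recommend(pairs[0]["src"], pairs[0]["mt"], scores[0]))
```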
How we built it
We built the code for this project collaboratively in Google Colab, after performing a literature review of academic articles that address adjacent or related questions.
Challenges we ran into
For some of the evaluations we generated with the metrics mentioned above, there was little differentiation between the scores for the machine translation and the human translation in those datasets. We resolved this challenge by accepting that certain metrics appeared more useful for French, while others offered better predictions for Russian.
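As an illustration, a quick way to check whether a metric separates machine from human translations, given two lists of segment-level scores, might look like the following (a hypothetical helper, not our exact analysis code):

```python
# Sketch: measure how well a metric separates MT scores from human-translation
# scores, expressed as the gap in means divided by a pooled standard deviation.
from statistics import mean, stdev


def separation(mt_scores, ht_scores):
    pooled_sd = (stdev(mt_scores) + stdev(ht_scores)) / 2
    return abs(mean(ht_scores) - mean(mt_scores)) / max(pooled_sd, 1e-9)


# A value near zero means the metric barely distinguishes MT from human output.
print(separation([0.41, 0.39, 0.44], [0.43, 0.40, 0.45]))
```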
Accomplishments that we're proud of
We were able to curate a small but high-quality dataset for the task of Machine Translation Acceptability, covering two language pairs and two conditions (high/low risk). We are proud to have tested three separate methods (comparative token length, COMET-Src score, and WordAlignment score) in our exploration of acceptability metrics, which took a concerted effort from all of our team members! We realized many components of a research project (literature review, data collection, data analysis, and methods benchmarking) in a very short period of time.
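As an example of the word alignment idea, an alignment-coverage score can be computed from source-target alignment pairs (produced, for instance, by an aligner such as SimAlign or fast_align). The snippet below is a hedged sketch rather than our exact implementation:

```python
# Sketch: alignment-coverage score. Given word-alignment pairs between the source
# segment and the MT output, measure the fraction of source tokens that received
# at least one aligned target token. Low coverage suggests omitted content.
def alignment_coverage(src_tokens, align_pairs):
    """align_pairs: iterable of (src_index, tgt_index) tuples from any word aligner."""
    aligned_src = {i for i, _ in align_pairs}
    return len(aligned_src) / max(len(src_tokens), 1)


src = "wash your hands frequently".split()
pairs = [(0, 1), (2, 3), (3, 4)]  # hypothetical aligner output
print(alignment_coverage(src, pairs))  # 0.75: "your" has no aligned token
```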
What we learned
Ultimately, the conclusion of this (very) short study is that the most useful metric for determining acceptability appears to depend on the target language, the evaluation domain, and the application scenario. We also found that back translation can be a highly unreliable signal for monolingual speakers in high-risk scenarios, as they fail to accept good translations that bilingual speakers otherwise find acceptable. Furthermore, back translations can give monolingual speakers a false sense of confidence in making acceptability judgements, which can be harmful in high-risk scenarios.
What's next for Metrics for Reliable AI-based Translation
We have yet to explore why the best acceptability metric seems to vary by language, what the potential failure modes and edge cases are, and how our evaluation methods could be improved. There are also refinements to the monolingual and bilingual evaluation surveys that could reduce the occurrence of disagreements.
Reference Links
The links attached to this Devpost submission point to the project Google Drive folder (containing our code) and to the project presentation that guided our thought process as we explored this research topic.