Inspiration
Machine translation (MT) tools based on AI technology make it possible for virtually anyone to translate text into many of the world's languages, and thus hold the promise of enabling seamless communication across language barriers. However, they still make many errors, and these errors are hard for users who are not fluent in the languages involved to catch. Lay users often rely on techniques like back translation (i.e., re-translating the machine-generated output from the target language back into the source language) to judge output quality when they do not know the target language. As a result, MT is sometimes used inappropriately, even in high-stakes settings such as hospitals or courtrooms, where errors can have severe consequences. In this project, we investigate methods to help users assess the quality of MT outputs, so they can decide when it is appropriate to rely on MT. From this, we sought to explore the question, "Are there quantitative methods that can be used to automatically evaluate the acceptability of machine translations?"
What it does
Our analysis performs the following evaluations for French and Russian TEDx (low-risk) and COVID-19 (high-risk) translations (from English):
- Collects acceptability judgements and confidence ratings from bilingual and monolingual speakers for each language pair and each condition (high/low risk)
- Evaluates the translated output against the input segment using COMET-Src or length-based heuristics
- Compiles visualizations to aid in the identification of accuracy thresholds for Russian and French
- Delivers a final recommendation on whether or not the translation is accepted, based on the individual metrics above weighted by the accuracy thresholds (a rough sketch follows this list)
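To make the scoring and recommendation steps concrete, here is a minimal sketch of how a reference-free COMET score and a length-ratio heuristic could be combined into an accept/reject call. It assumes the `unbabel-comet` package (2.x API) and the `Unbabel/wmt20-comet-qe-da` source-only checkpoint; the threshold and ratio bounds are illustrative placeholders, not the language-specific thresholds we derived from our data.

```python
# Sketch: score a translation with a source-only (reference-free) COMET model
# and a length-ratio heuristic, then combine both signals into an accept/reject
# recommendation. Assumes `pip install unbabel-comet`; thresholds are illustrative.
from comet import download_model, load_from_checkpoint


def comet_src_scores(pairs):
    """pairs: list of {"src": ..., "mt": ...} dicts (no reference translation needed)."""
    model_path = download_model("Unbabel/wmt20-comet-qe-da")
    model = load_from_checkpoint(model_path)
    return model.predict(pairs, batch_size=8, gpus=0).scores


def length_ratio(src, mt):
    """Token-length ratio of the MT output to the source segment."""
    return len(mt.split()) / max(len(src.split()), 1)


def recommend(src, mt, comet_score, comet_threshold=0.3, ratio_bounds=(0.6, 1.6)):
    """Accept only if both signals clear their (placeholder) thresholds."""
    ratio_ok = ratio_bounds[0] <= length_ratio(src, mt) <= ratio_bounds[1]
    return comet_score >= comet_threshold and ratio_ok


pairs = [{"src": "Wash your hands frequently.",
          "mt": "Lavez-vous les mains fréquemment."}]
scores = comet_src_scores(pairs)
print(recommend(pairs[0]["src"], pairs[0]["mt"], scores[0]))
```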
How we built it
We built the code for this project collaboratively in Google Colab, after performing a literature review of academic articles that address adjacent or related questions.
Challenges we ran into
For some of the evaluations we generated with the metrics mentioned above, there was little differentiation between the scores for the machine translation and the human translation in those datasets. We resolved this challenge by accepting that certain metrics appeared more useful for French, while others offered better predictions for Russian.
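As an illustration, a quick way to check whether a metric separates machine from human translations, given two lists of segment-level scores, might look like the following (a hypothetical helper, not our exact analysis code):

```python
# Sketch: measure how well a metric separates MT scores from human-translation
# scores, expressed as the gap in means divided by a pooled standard deviation.
from statistics import mean, stdev


def separation(mt_scores, ht_scores):
    pooled_sd = (stdev(mt_scores) + stdev(ht_scores)) / 2
    return abs(mean(ht_scores) - mean(mt_scores)) / max(pooled_sd, 1e-9)


# A value near zero means the metric barely distinguishes MT from human output.
print(separation([0.41, 0.39, 0.44], [0.43, 0.40, 0.45]))
```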
Accomplishments that we're proud of
We were able to curate a small but high-quality dataset for the task of Machine Translation Acceptability, covering two language pairs and two conditions (high/low risk). We are proud to have tested three separate methods (comparative token length, COMET-Src score, and WordAlignment score) in our exploration of acceptability metrics, which took a concerted effort from all of our team members! We realized many components of a research project (literature review, data collection, data analysis, and methods benchmarking) in a very short period of time.
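As an example of the word alignment idea, an alignment-coverage score can be computed from source-target alignment pairs (produced, for instance, by an aligner such as SimAlign or fast_align). The snippet below is a hedged sketch rather than our exact implementation:

```python
# Sketch: alignment-coverage score. Given word-alignment pairs between the source
# segment and the MT output, measure the fraction of source tokens that received
# at least one aligned target token. Low coverage suggests omitted content.
def alignment_coverage(src_tokens, align_pairs):
    """align_pairs: iterable of (src_index, tgt_index) tuples from any word aligner."""
    aligned_src = {i for i, _ in align_pairs}
    return len(aligned_src) / max(len(src_tokens), 1)


src = "wash your hands frequently".split()
pairs = [(0, 1), (2, 3), (3, 4)]  # hypothetical aligner output
print(alignment_coverage(src, pairs))  # 0.75: "your" has no aligned token
```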
What we learned
Ultimately, the conclusion of this (very) short study is that the most useful metric for determining acceptability appears to depend on the target language, the evaluation domain, and the application scenario. We also found that back translation can be a highly unreliable signal for monolingual speakers in high-risk scenarios, as they fail to accept good translations that bilingual speakers otherwise find acceptable. Furthermore, back translations can give monolingual speakers a false sense of confidence in making acceptability judgements, which can be harmful in high-risk scenarios.
What's next for Metrics for Reliable AI-based Translation
We have yet to explore why the best acceptability metric seems to vary by language, what the potential failure modes and edge cases are, and how our evaluation methods could be improved. There are also refinements to the monolingual and bilingual evaluation surveys that could reduce the occurrence of disagreements.
Reference Links
The links attached to this Devpost submission point to the project Google Drive folder (containing our code) and to the project presentation that guided our thought process as we explored this research topic.