Inspiration
The Hausa language is spoken by millions of people in West Africa, particularly in Nigeria, Niger, and Ghana, including every member of team ArewaDS-LMBF. The Hausa-to-English translations we see on social media are at times inaccurate or stripped of context. We brainstormed ways to help improve this situation despite our limited coding knowledge. Given our skill level and timeline, the most realistic contribution was a way of aligning Hausa-English translations, with a view to improving parallel datasets for machine translation models. Luis’s session on word embeddings helped us settle on this project: three of our five team members attended it, and afterwards we all appreciated how word embeddings work.
Developing a Hausa-English sentence alignment tool can help improve machine translation models by easing the shortage of parallel datasets.
What it does
Given source-language and target-language texts in txt format, it returns the best-matching sentence pairs from the two texts. Our main concern was Hausa, so we settled on Hausa-English alignment. It can also work on other language pairs: it worked excellently on small custom datasets for French-English alignment.
How we built it
We provide a GitHub repo that can be used to test the solution; it requires signing up with Cohere and getting an API key. The repo contains scripts for working with CuPy and NumPy arrays, a script with functions for reading, writing, and manipulating word embeddings, and a script that aligns the source and target sentences using the Cohere multilingual embeddings API. Following the repo's file tree, the command line for running the Cohere alignment is:
python3 scripts/cohere_align.py \
--cohere_api_key '<api_key>' \
-m 'embed-multilingual-v2.0' \
-s src.txt \
-t trg.txt \
-o cohere \
--retrieval 'nn' \
--dot \
--cuda
We also ran a comparison with LASER, using the following command line:
python3 scripts/laser_align.py \
-s src.txt \
-t trg.txt \
-o cohere \
--src_lang ha \
--trg_lang en \
--retrieval 'nn' \
--dot \
--cuda
Here -m is the model name, -s is the source text path, -t is the target text path, and -o is the output directory path; add the --cuda option if you have a GPU. More parameters are given in the alignment script.
In summary, the program takes two input files, one containing source sentences and one containing target sentences, and outputs a file of aligned sentences. We demonstrate our idea with a simple, easily implementable solution that could be supercharged.
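The alignment step summarized above can be illustrated with a minimal sketch. It assumes that the `--retrieval 'nn'` and `--dot` flags mean nearest-neighbor retrieval scored by dot product, which is only a reading of the flag names and not confirmed by the repo. The function name `align_nearest_neighbor` and the toy three-dimensional vectors are hypothetical; in practice the vectors would come from the Cohere multilingual embeddings API, and the scripts use NumPy or CuPy arrays rather than plain lists.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def align_nearest_neighbor(
    src_vecs: List[Vector], trg_vecs: List[Vector]
) -> List[Tuple[int, int, float]]:
    """For each source vector, pick the target vector with the highest dot product.

    Returns (source_index, target_index, score) triples.
    Hypothetical helper, not the function used in the repo.
    """
    if not trg_vecs:
        return []
    alignments = []
    for i, s in enumerate(src_vecs):
        scores = [dot(s, t) for t in trg_vecs]
        best_j = max(range(len(scores)), key=scores.__getitem__)
        alignments.append((i, best_j, scores[best_j]))
    return alignments


# Toy data: made-up vectors standing in for real sentence embeddings.
src_sentences = ["Ina kwana?", "Na gode."]
trg_sentences = ["Thank you.", "Good morning."]
src_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1]]
trg_vecs = [[0.2, 0.8, 0.1], [0.8, 0.2, 0.0]]

pairs = align_nearest_neighbor(src_vecs, trg_vecs)
for i, j, score in pairs:
    # One aligned pair per line: source, target, similarity score.
    print(f"{src_sentences[i]}\t{trg_sentences[j]}\t{score:.2f}")
```

Nearest-neighbor retrieval lets several source sentences map to the same target sentence, so a real pipeline may want a score threshold or a stricter matching rule on noisy data.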
Challenges we ran into
Our teammates are all Machine Learning and AI enthusiasts, and we have been learning a lot about Data Science and Machine Learning for a few months now. As we learned the hard way, though, our knowledge of implementation was lacking. We relied on our mentors in the Arewa Data Science community for pointers on how to implement an easy solution that conveys our idea and enthusiasm.
Accomplishments that we're proud of
- Showing that the solution works and showcasing its potential
- The teamwork we exhibited and our never-give-up attitude despite the hurdles we encountered. We are not experts in any way.
What we learned
- The importance of teamwork, and how it helps you stay motivated when your teammates are as invested in a project as you are
- The alignment worked very well on a small custom Hausa-English dataset and achieved an F1 score of 0.9496 on the Flores dataset. By comparison, LASER performed poorly on both. On the datasets we used, then, the Cohere model performed better.
- The alignment has potential for use in many interesting scenarios
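For readers unfamiliar with the F1 score mentioned above, here is a minimal sketch of how it can be computed for sentence alignment. It assumes the score is taken over sets of (source index, target index) pairs compared against a gold alignment; the write-up does not say exactly how the 0.9496 figure was computed, so this is an illustration only. The function name `alignment_f1` and the toy pairs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def alignment_f1(
    predicted: Iterable[Pair], gold: Iterable[Pair]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted vs. gold alignment pairs."""
    pred_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    correct = len(pred_set & gold_set)  # pairs present in both sets
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: four gold pairs, one of which the aligner gets wrong.
gold = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, 3), (3, 3)]
precision, recall, f1 = alignment_f1(predicted, gold)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

F1 is the harmonic mean of precision and recall, so it is only high when the aligner both finds most of the gold pairs and proposes few wrong ones.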
What's next for Cohere Parallel Language Sentence Alignment
The intent was to showcase the potential application, and we have largely succeeded. Once we improve our machine learning knowledge, the next steps are:
- Improving the accuracy and efficiency of the alignment algorithms
- Evaluating the impact of sentence alignment on machine translation
- Exploring multilingual sentence alignment with more languages
- Finding and using more texts that have both Hausa and English versions
- Exploring other potential use cases, such as:
  - Machine translation services
  - Language learning tools
  - Cross-lingual search engines
  - Multilingual customer service
  - Translation software development
  - Multilingual content creation, etc.
Built With
- coheremultilingualembeddingsapi
- colab
- python