Cohere Parallel Language Sentence Alignment

Inspiration

Hausa is spoken by millions of people in West Africa, particularly in Nigeria, Niger, and Ghana, and by every member of team ArewaDS-LMBF. The Hausa-to-English translations we see on social media are at times inaccurate or lacking context. We brainstormed ways to help improve this situation despite our limited coding knowledge. Given our skill level and timeline, the most realistic contribution was a way of aligning Hausa-English translations, with a view to improving parallel datasets for machine translation models. Luis’s session on word embeddings helped us settle on this project: three of our five team members attended it, and afterwards we all appreciated how word embeddings work.

Developing a Hausa-English sentence alignment tool can help improve machine translation models by easing the shortage of parallel datasets.

What it does

When supplied with source-language and target-language texts in txt format, the tool returns the best possible pairwise translations from the two texts. Our main concern was Hausa, so we settled on Hausa-English alignment. It can also work on other language pairs: it performed excellently on small custom datasets for French-English alignment.

How we built it

We provided a GitHub repo that can be used to test the solution; it requires signing up with Cohere and getting an API key. The repo contains scripts for working with CuPy and NumPy arrays, a script containing functions for reading, writing, and manipulating word embeddings, and a script that aligns the source and target sentences using the Cohere multilingual embeddings API. The command for running the Cohere alignment, which takes the repo's file tree into account, is as follows:

python3 scripts/cohere_align.py \
   --cohere_api_key '<api_key>' \
   -m 'embed-multilingual-v2.0' \
   -s src.txt \
   -t trg.txt \
   -o cohere \
   --retrieval 'nn' \
   --dot \
   --cuda

We also ran a comparison with LASER, using the following command:

python3 scripts/laser_align.py \
  -s src.txt \
  -t trg.txt \
  -o cohere \
  --src_lang ha \
  --trg_lang en \
  --retrieval 'nn' \
  --dot \
  --cuda

where -m is the model name, -s is the source text path, -t is the target text path, and -o is the output directory path. Pass the --cuda option if you have a GPU. More parameters are given in the alignment script.

In summary, the program takes two input files, one containing source sentences and the other target sentences, and outputs a file with aligned sentences. We demonstrate our idea using a simple, easily implementable solution that could be supercharged.
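
The core idea can be sketched in a few lines. This is a minimal illustration, not the code in the repository: it assumes that the `--retrieval 'nn'` and `--dot` options mean nearest-neighbour retrieval scored by dot product, and it uses made-up two-dimensional vectors in place of real sentence embeddings. The names `dot` and `nn_align` are hypothetical, and plain Python lists stand in for the NumPy or CuPy arrays mentioned above.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nn_align(
    src_vecs: List[Sequence[float]],
    trg_vecs: List[Sequence[float]],
) -> List[Tuple[int, int, float]]:
    """For each source vector, pick the target vector with the highest
    dot-product score. Returns (src_index, trg_index, score) triples."""
    if not trg_vecs:
        return []
    pairs = []
    for i, s in enumerate(src_vecs):
        scores = [dot(s, t) for t in trg_vecs]
        # Index of the best-scoring target sentence for this source sentence.
        j = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, j, scores[j]))
    return pairs


# Toy "embeddings", invented for illustration only.
src_vecs = [[1.0, 0.0], [0.0, 1.0]]
trg_vecs = [[0.1, 0.9], [0.8, 0.2]]
alignment = nn_align(src_vecs, trg_vecs)
for src_idx, trg_idx, score in alignment:
    print(src_idx, trg_idx, score)
```

In a real run, each vector would be the embedding of one line from the source or target file, and the index pairs would be used to write the aligned sentences to the output file.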

Challenges we ran into

Our teammates are all machine learning and AI enthusiasts, and we have been learning a great deal about data science and machine learning for a few months now. Our knowledge of implementation, however, was somewhat lacking, as we learned the hard way. We relied on our mentors in the Arewa Data Science community for pointers on how to implement an easy solution that sells our idea and enthusiasm.

Accomplishments that we're proud of

Our ability to actually showcase the potential of the solution and show that it works
The teamwork we exhibited and our never-give-up attitude despite the hurdles we encountered. We are not experts in any way.

What we learned

The importance of teamwork, and how it helps you stay motivated when your teammates are as invested in a project as you are
The alignment worked very well on a small custom Hausa-English dataset and achieved an F1 score of 0.9496 on the Flores dataset. By comparison, LASER performed surprisingly poorly on both. On the datasets we used, then, the Cohere model performed better.
The alignment has potential for use in many interesting scenarios
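
For readers unfamiliar with the metric, an F1 score for sentence alignment can be computed along the following lines. This is a sketch under an assumption: the write-up does not say exactly how the reported score was calculated, so the example uses the common set-based definition over (source index, target index) pairs. The function name `alignment_f1` and the toy pairs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]  # (source sentence index, target sentence index)


def alignment_f1(predicted: Iterable[Pair], gold: Iterable[Pair]) -> float:
    """Harmonic mean of precision and recall over aligned index pairs."""
    pred_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)  # correct pairs / predicted pairs
    recall = true_pos / len(gold_set)     # correct pairs / reference pairs
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 of 3 predicted pairs are correct, out of 4 reference pairs.
gold = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, 3)]
score = alignment_f1(predicted, gold)
print(round(score, 4))
```

Here precision is 2/3 and recall is 2/4, so the score works out to 4/7.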

What's next for Cohere Parallel Language Sentence Alignment

The intent was to showcase the potential application, and we have largely succeeded. After we improve our machine learning knowledge, we plan to look at the following:

Improving the accuracy and efficiency of the alignment algorithms
Evaluating the impact of sentence alignment on machine translation
Exploring multilingual sentence alignment with more languages
Finding and using more texts that have both Hausa and English versions
Exploring using it for other potential use-cases such as:

Machine translation services
Language learning tools
Cross-lingual search engines
Multilingual customer service
Translation software development
Multilingual content creation, etc.

Built With

coheremultilingualembeddingsapi
colab
python

Submitted to

#CohereAIHack: Empower your Creativity with Cohere's Multilingual Embeddings
- Winner Second Place

Created by

I worked on a way to showcase our work using a Colab notebook so that anyone testing our code can really appreciate its utility. I also helped flesh out our project story.

Lukman Aliyu Jibril
Pharmacist interested in AI in Healthcare
I participated in the project development of 'Cohere Parallel Language Sentence Alignment', which aimed to create a parallel corpus of Hausa and English sentences for machine translation. My main role was to read and debug the code written by other team members, using various tools and techniques to identify and fix errors. I also helped to improve the code quality and efficiency by suggesting and implementing better algorithms and data structures. I contributed to the documentation and testing of the code, ensuring that it met the project specifications and standards.

Aminu Hamza Nababa
Techie | Data Science Student | Automotive Engineering Student.
I participated in implementing the solution: I was actively involved in developing the project itself, contributed to the actual code, and worked on the documentation and the project's overall success.

Babangida Sani
Maths lover | code father | studying CS at BUK, Kano
Faridah Yusuf
Nasiru Mohammed