We believe many members of the Yoruba-speaking community are not fluent in English and do not have time to read lengthy content, and are therefore cut off from the news. This inspired us to build an app that collects news articles, summarizes them, and translates them into Yoruba.

To achieve this, we followed the steps below:

  • Web Scraping and Cleaning: We first scraped news articles from https://punchng.com/. The data was then cleaned by removing tags and converting the text to lower case.
  • Text Summarization: The Cohere summarizer was then used to summarize the text. This produced a shorter, more concise version of each article that retained the most important information.
  • Multilingual Embedding: The dataset was then embedded with the Cohere multilingual embedding model. This represented each text as a numeric vector that a model can work with.
  • Tokenization: The dataset was then tokenized with the BERT tokenizer. This broke the text down into individual tokens (words and subword pieces), which made it easier for the machine translation model to process.
  • Machine Translation Model Development: We then developed a machine translation model based on BERT. This model was able to translate the text from English to Yoruba with a high degree of accuracy.
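As an illustration of the cleaning step, here is a minimal sketch using only the Python standard library. It assumes the scraped articles arrive as raw HTML strings. The names `clean_article` and `_TextExtractor` are hypothetical and not taken from our code base, and the actual cleaning code may have differed.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Joining with a space keeps adjacent blocks apart; the trade-off is
        # that a word split by an inline tag would also gain a space.
        return " ".join(self._parts)


def clean_article(raw_html: str) -> str:
    """Remove tags, collapse whitespace, and convert the text to lower case."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = re.sub(r"\s+", " ", parser.text()).strip()
    return text.lower()
```

For example, `clean_article("<p>Lagos <b>Governor</b> speaks</p>")` yields the lower-cased text with the tags removed, ready to be passed to the summarizer.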

Our main challenges were that most Nigerian news sites were not crawlable and that the Cohere API limited the number of calls we could make.

Our next step is to use a text-to-speech API to convert the summarized Yoruba text into speech that our application can read aloud to the user.

Built With

  • cohereapi
  • englishlanguage
  • mtranslate
  • punchonline
  • yorubalanguage