Inspiration
Each of our group members has struggled with learning a foreign language, especially with understanding context and semantics. Speaking was our greatest struggle: natural, native context is overlooked by many language curriculums, or the resources devoted to it are limited and insufficient. For example, the Spanish curriculum at East Chapel Hill High School assumes that after two years of Spanish, students no longer need much speech training or vocabulary expansion. Spanish III and Spanish IV are spent researching the history of foreign countries and doing art history projects about Spanish artists, with only 3-5 minutes per class devoted to a half-hearted attempt to talk to each other about our weekends. The attempts teachers make at cultural immersion are largely ineffective, with research on Latin American history a favorite among the faculty. As East Chapel Hill shows, a teacher can point students to a website in Spanish, but a single example cannot represent an entire culture; that is impossible to do with limited resources. We recognized the internet as a boundless resource for cultural immersion, where native speakers of every language and culture interact. Seeing the opportunity, and that no one was attempting to harness its power, we decided to pioneer a new age of culturally relevant language learning and full immersion into other cultures.
What we learned
Webscraping, translation, and many other aspects of the project were new to us, and we picked up a lot of practical skills. It was also the first time many of us coded collaboratively, and we learned to use git to coordinate our work.
How we built our project
We built the project by breaking the program down into three separate parts, with one person in charge of coding each: webscraping, the machine learning algorithm, and translation with sentence segmentation. After verifying that each section worked on its own, we combined them into one program by importing each section as a Python module.
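A minimal sketch of that modular layout, with each teammate's part behind its own function. The function names and the toy difficulty heuristic here are illustrative stand-ins, not the project's actual code; in the real program each function would live in its own module and be pulled in with `import`.

```python
def scrape(url):
    """Stand-in for the webscraping module: returns raw sentences from a page."""
    # A real implementation would fetch and parse the page (e.g. with
    # Selenium or requests + BeautifulSoup, as listed under "Built With").
    return ["Hola, ¿cómo estás?", "El clima está muy agradable hoy."]

def rate_difficulty(sentence):
    """Stand-in for the ML module: assigns a difficulty level 1-3."""
    # Toy heuristic: longer sentences count as harder.
    return min(3, len(sentence.split()) // 3 + 1)

def translate(sentence):
    """Stand-in for the translation module."""
    return f"[EN] {sentence}"

def pipeline(url):
    """Chain the three parts: scrape, rate, translate each sentence."""
    return [(s, rate_difficulty(s), translate(s)) for s in scrape(url)]
```

Keeping the boundaries this clean is what let each part be developed and tested independently before the final merge.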
Challenges we faced
The development of the webscraper was a big challenge. The Selenium scraper output sentences with large runs of spaces and put hyperlinked words on separate lines. To capture every sentence correctly while ignoring one-word fragments, we had to wrestle with regex; the formatting alone took several hours, with some help from the advisors.
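A hedged sketch of the kind of regex cleanup described above. The exact rules we used differed, but the idea is the same: collapse the stray whitespace and line breaks that hyperlinked words introduce, split into sentences, and drop one-word fragments.

```python
import re

def clean_scraped_text(raw):
    """Normalize raw scraper output into a list of clean sentences.

    Assumptions (not the project's exact rules): runs of spaces/newlines
    are collapsed, sentences end in ., !, or ?, and any "sentence" with
    only one word is discarded as a formatting artifact.
    """
    # Rejoin words the scraper split across lines and squeeze repeated spaces.
    text = re.sub(r"\s+", " ", raw).strip()
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Drop one-word fragments (e.g. a hyperlink word on its own line).
    return [s for s in sentences if len(s.split()) > 1]
```

For example, `clean_scraped_text("Visit\nour   site.   Hello.   This is   fine!")` keeps the two multi-word sentences and discards the lone `"Hello."`.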
To showcase how our app can bring culture and language together, we chose Japanese, making the cultural contrast (and thus the value of the project) more obvious. However, Japanese does not put spaces between words, which made webscraping formatting, vocabulary separation, translation, and complexity detection exceedingly hard. We eventually worked around this by converting the text to romaji, a romanized form of Japanese that can be written with spaces between words.
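A minimal illustration of the underlying problem: whitespace tokenization, which every English-oriented part of the pipeline relies on, silently fails on Japanese. (The actual conversion to spaced romaji would use a dedicated romanization/tokenization library; this snippet only demonstrates why one is needed.)

```python
def whitespace_tokens(sentence):
    """Split on whitespace -- fine for English, useless for Japanese."""
    return sentence.split()

english = "The weather is nice today"
japanese = "今日は天気がいい"  # same meaning, but written without spaces

print(len(whitespace_tokens(english)))   # 5 words
print(len(whitespace_tokens(japanese)))  # 1 "word": the entire sentence
```

Every downstream step (vocabulary separation, per-word translation, complexity scoring) assumed a word list, so until the text was converted to a spaced form, Japanese input looked like a single giant word.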
Training the machine learning model required training data, without which the algorithm struggled to produce any results. Once our webscraping setup was complete, we were able to scrape sentences from the internet and assign a difficulty level to each one.
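A hedged sketch of how sentence/difficulty pairs can train a classifier with scikit-learn (which we used in the project). The sentences, the 1-3 difficulty scale, and the choice of TF-IDF plus logistic regression are illustrative assumptions, not our exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: scraped sentences with hand-assigned difficulty.
sentences = [
    "Hola amigo",
    "Buenos dias",
    "Me gusta el cafe",
    "El gobierno anuncio nuevas politicas economicas",
    "La investigacion cientifica requiere financiamiento sostenible",
    "Los mercados financieros reaccionaron con volatilidad",
]
levels = [1, 1, 1, 3, 3, 3]  # assumed scale: 1 = beginner, 3 = advanced

# Vectorize sentences with TF-IDF, then fit a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, levels)

# Predict a difficulty level for a new sentence.
prediction = model.predict(["Me gusta el cafe con leche"])[0]
```

With real scraped data the labels would come from our difficulty-assignment step rather than by hand, but the fit/predict structure is the same.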
Next steps
We have completed a functioning prototype containing the bulk of the intended functionality. Our next steps are to create a user interface (or enlist more experienced people to build one) and to polish our algorithms to make them more efficient and robust.
Built With
- ajexapi
- beautiful-soup
- csv
- matplotlib
- nltk
- numpy
- os
- pandas
- python
- regex
- requests
- scikit-learn
- selenium