
Kanji and Katakana to Hiragana Converter

Inspiration

As a Japanese language learner myself, I was driven by the challenge of reading Japanese text in hiragana (not in romaji) to continue practicing my reading skills. Constantly resorting to dictionaries or ChatGPT for translations was becoming a hurdle in my learning journey. I yearned for a tool that could streamline my learning process, allowing me to focus on comprehension and fluency. This project was also an opportunity to delve into the world of bash scripts and understand the intricacies of file permissions on my computer, adding a technical edge to my linguistic pursuit.

What I Learned

Building this project was as enlightening as it was challenging. I learned how to write bash scripts on Ubuntu, and the concept of text tokenization introduced me to the world of Natural Language Processing (NLP). I also grappled with the sheer size and computational demands of language models; the dictionary alone occupied over 500 MB of space and took noticeably long to download.

One striking realization was the computational convenience provided by languages that use spaces to separate words, a feature absent in Japanese and Chinese. Without spaces, text must first be segmented into words before it can be processed, which adds a layer of complexity to automatic translation for these languages. Moreover, crafting a logo for the project highlighted the significance of carefully worded prompts in generative AI, where each word steers the result and the creation unfolds with a blend of anticipation and patience.
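To illustrate the word-boundary point above, here is a minimal sketch using only Python's built-in `str.split()`. The sample sentences are my own illustrations, not taken from the project. Splitting on whitespace yields separate words for the English sentence but leaves the Japanese sentence as a single chunk, which is why a dedicated segmenter such as MeCab is needed.

```python
# Illustrative sentences (assumptions for this sketch, not project data).
english = "I am reading a book"
japanese = "私は本を読んでいます"

# Whitespace splitting finds word boundaries only where spaces exist.
english_tokens = english.split()
japanese_tokens = japanese.split()

print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```

The Japanese sentence comes back as one element because it contains no spaces at all.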

How the Project Was Built

The project began with a blend of technology and artistry. The logo was generated on krea.ai from the prompt "Japanese sakura, fantasy style, technological and multiculture, 'pastel' colors, flowers", capturing the essence of the project in a visual emblem. Communities like Reddit were instrumental in finding the main libraries for Japanese text segmentation, offering a wealth of insights and comparisons. Official documentation guided the installation and use of these libraries and pointed to fixes for common pitfalls.

ChatGPT stood as a steadfast ally, aiding in code refinement and concept clarification, ensuring that each line of code was not only functional but also meaningful.

Challenges Faced

The road was fraught with challenges, particularly when dealing with the MeCab and Fugashi libraries. Errors about missing files came up frequently, and while the official documentation offered solutions, they often fell short in practice. The breakthrough came when I realized that the root cause was an incomplete download of the dictionary.

Converting katakana to hiragana demanded meticulous attention. It is an unconventional step: in native materials such as children's books, furigana is written in hiragana and typically annotates kanji, while katakana is left as it is. Special characters like 'ー' needed separate handling, because the conversion relied on the numeric Unicode code points of the characters and 'ー' is a mark shared by both scripts with no separate hiragana form. Furthermore, my initial plan to run the conversion with cron exposed my lack of understanding of the tool; I realized cron was not suited to this task and switched to a .sh script.
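The Unicode-based conversion described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the conversion works by shifting code points, and the function and constant names are hypothetical. In Unicode, the katakana letters ァ (U+30A1) through ヶ (U+30F6) sit exactly 0x60 above their hiragana counterparts, while 'ー' (U+30FC) lies outside that range and is passed through unchanged.

```python
# Hypothetical names for this sketch; the range and offset are Unicode facts.
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # distance between a katakana letter and its hiragana form


def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters down to hiragana; leave everything else as is."""
    converted = []
    for ch in text:
        code_point = ord(ch)
        if KATAKANA_START <= code_point <= KATAKANA_END:
            converted.append(chr(code_point - KANA_OFFSET))
        else:
            # Covers 'ー' (U+30FC), kanji, Latin letters, punctuation, etc.
            converted.append(ch)
    return "".join(converted)


print(katakana_to_hiragana("コーヒー"))  # → こーひー
```

Restricting the shift to an explicit range is the design choice that keeps 'ー' intact; subtracting the offset from every character in the katakana block would turn it into an unrelated symbol.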

Conclusion

The "Kanji and Katakana to Hiragana Converter" is more than just a tool; it's a bridge connecting learners to the rich tapestry of the Japanese language. It stands as a testament to the fusion of technology and linguistics, simplifying the reading process while encouraging the practice of hiragana, ultimately fostering a deeper connection with the language.

Built With

ai
bash
chatgpt
fugashi
generative
github
krea.ai
mecab
natural-language-processing
python
reddit
taking-advantage-of-its-extensive-libraries-and-straightforward-syntax

Updates

Judith Urbina started this project — Jan 31, 2024 07:15 PM EST
