Inspiration
We were inspired by how Pokémon composers try to emulate the musical styles of the real-world countries that each in-game region is based on.
What it does
Our project calculates the similarity between different audio tracks. By embedding tracks from each Pokémon region alongside representative music from each country and then aggregating the results, we can see how each region's music relates to each country's music.
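As a rough illustration of what "similarity" between two tracks can mean once each one is represented as a vector, here is a minimal sketch using cosine similarity. Cosine similarity is an assumption on our part for this example (it is a common choice for comparing embeddings), and the function name and toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings; real embeddings have many more dimensions.
region_vec = [1.0, 2.0, 0.0]
country_vec = [2.0, 4.0, 0.0]    # points the same way as region_vec
unrelated_vec = [0.0, 0.0, 5.0]  # orthogonal to region_vec

similar = cosine_similarity(region_vec, country_vec)
dissimilar = cosine_similarity(region_vec, unrelated_vec)
```

Because cosine similarity ignores vector length, two embeddings that point in the same direction score close to 1 even if their magnitudes differ.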
How we built it
We first downloaded all the music as mp3s, using tools like ytdl and the game OSTs hosted on khinsider. Then we fed these files into the Audio Spectrogram Transformer (AST). This model works by converting each mp3 into a spectrogram and then running a vision transformer on that spectrogram. The AST outputs a matrix of shape (sequence length, hidden dimension). We then mean pooled over the sequence dimension to get a single vector of size hidden dimension for each track. These were our vector embeddings of each soundtrack, which we then averaged by region and by country. The resulting representative vectors were used to generate the visualizations.
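The pooling and averaging steps can be sketched in plain Python as follows. This is only an illustration, not our actual pipeline code: the function names, group labels, and toy numbers are hypothetical, and a real model output would be a much larger tensor.

```python
from collections import defaultdict

def mean_pool(matrix):
    """Average a (sequence_length x hidden_dim) matrix over the sequence axis."""
    if not matrix:
        raise ValueError("matrix must have at least one row")
    hidden_dim = len(matrix[0])
    sums = [0.0] * hidden_dim
    for row in matrix:
        for i, value in enumerate(row):
            sums[i] += value
    return [s / len(matrix) for s in sums]

def average_by_group(track_embeddings):
    """Map each group label to the mean of its track vectors.

    track_embeddings is a list of (group_label, vector) pairs.
    """
    groups = defaultdict(list)
    for label, vector in track_embeddings:
        groups[label].append(vector)
    # A list of vectors is itself a matrix, so mean_pool can be reused here.
    return {label: mean_pool(vectors) for label, vectors in groups.items()}

# Toy model output: sequence length 2, hidden dimension 2.
toy_output = [[1.0, 2.0], [3.0, 4.0]]
track_vector = mean_pool(toy_output)

# Hypothetical labels and per-track embeddings.
tracks = [
    ("region_a", [1.0, 0.0]),
    ("region_a", [3.0, 2.0]),
    ("country_a", [0.0, 4.0]),
]
representatives = average_by_group(tracks)
```

Each entry in `representatives` is one vector per region or country, which is the form that gets compared and plotted.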
Challenges we ran into
We had trouble finding a working embedding model, since many of the models we found weren't maintained and had bugs. Eventually, after a lot of research, we found the AST model. Another challenge was that each example took a long time to run through the model. We fixed this by running inference on a GPU.
Accomplishments that we're proud of
We are very proud of how the website turned out, especially how it looks. We are also proud of how interesting the results are. For example, we can see which tracks are similar or dissimilar to other tracks.
What we learned
It is very important to pick the right models and to make sure that they were trained on in-domain examples.
What's next for Embedding Pokémon OSTs into the Real World
We are thinking about using textual descriptions of each track to generate embeddings. These would likely be more semantically representative, because text-based LLM embeddings are very well developed.