Inspiration

We were inspired by how Pokémon composers try to emulate the musical styles of the real-world countries that each in-game region is based on.

What it does

Our project calculates the similarity between audio tracks. We take tracks from each Pokémon region and representative music from each country, then aggregate the per-track results to show how each region's music relates to each country's music.
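As a rough illustration of comparing two representative vectors, here is a minimal sketch. It assumes cosine similarity as the metric, which this write-up does not actually specify, and the function name and toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for region and country embeddings.
region_vec = [1.0, 2.0, 0.0]
country_vec = [2.0, 4.0, 0.0]   # same direction as region_vec
other_vec = [0.0, 0.0, 3.0]     # orthogonal to region_vec

similar = cosine_similarity(region_vec, country_vec)
dissimilar = cosine_similarity(region_vec, other_vec)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, so a higher score would suggest that a region's music sits closer to a country's music in embedding space.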

How we built it

We first downloaded all the music as MP3s, using tools like ytdl and pulling the game OSTs from khinsider. We then fed each file into the Audio Spectrogram Transformer (AST). This model converts the audio into a spectrogram and then applies a vision transformer to that spectrogram. The AST outputs a matrix of shape (sequence length, hidden dimension). We mean-pooled this matrix over the sequence axis to get a single vector of size hidden dimension for each track. These per-track embeddings were then averaged by region and by country, and the resulting representative vectors were used to generate the visualizations.
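The pooling and averaging steps can be sketched in plain Python as follows. This is an illustration under assumptions, not our actual pipeline code: the function names, group labels, and tiny matrices are hypothetical, and a real AST output would be a much larger tensor.

```python
from collections import defaultdict

def mean_pool(rows):
    """Average a list of equal-length vectors column by column.

    For a model output of shape (sequence length, hidden dimension),
    this returns one vector of size hidden dimension.
    """
    if not rows:
        raise ValueError("cannot pool an empty sequence")
    count = len(rows)
    dim = len(rows[0])
    return [sum(row[j] for row in rows) / count for j in range(dim)]

def average_by_group(labelled_embeddings):
    """Average track embeddings that share a label (a region or a country).

    labelled_embeddings is a list of (label, vector) pairs.
    """
    grouped = defaultdict(list)
    for label, vector in labelled_embeddings:
        grouped[label].append(vector)
    return {label: mean_pool(vectors) for label, vectors in grouped.items()}

# Hypothetical model output for one track: 2 time steps, hidden dimension 2.
track_matrix = [[1.0, 2.0], [3.0, 4.0]]
track_embedding = mean_pool(track_matrix)

# Hypothetical per-track embeddings tagged with placeholder group labels.
representatives = average_by_group([
    ("region_a", [1.0, 1.0]),
    ("region_a", [3.0, 5.0]),
    ("country_a", [2.0, 2.0]),
])
```

Averaging twice, first over time steps and then over tracks, keeps every representative vector the same size, which is what lets regions and countries be compared directly.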

Challenges we ran into

We had trouble finding a usable embedding model, since many of the models we tried were unmaintained and had bugs. After a lot of research, we eventually found the AST model. Another challenge was that each example took a long time to run through the model. We addressed this by running inference on a GPU.

Accomplishments that we're proud of

We are very proud of how the website turned out, especially how it looks. We are also proud of how interesting the results are. For example, we can see which tracks are similar or dissimilar to other tracks.

What we learned

It is very important to pick the right models and to make sure they were trained on in-domain examples.

What's next for Embedding Pokémon OSTs into the Real World

We are thinking about using textual descriptions of each track to generate embeddings. These would likely be more semantically representative, because text-based LLM embeddings are very well developed.
