AI-Powered Dynamic Game Music: A Hackathon Journey

Inspiration

Our inspiration for this project came from a shared love of indie games and dynamic music systems. We've always been fascinated by how music can profoundly shape the player experience, creating moments of tension, joy, sadness, and wonder. However, creating a truly dynamic soundtrack, with music that adapts in real time to the player's actions and the game's state, is a significant challenge, especially for independent developers with limited resources. Many indie games rely on static soundtracks or repetitive loops, which can diminish immersion. We wanted to explore how AI could bridge this gap and make adaptive music more accessible. We were also inspired by the RIT MAGIC Center's work, and we wanted to build a system that makes music generation easy for artists.

What We Learned

This hackathon project was a deep dive into several areas, and we learned a ton in a short amount of time:

  • Hybrid CNN-Transformer Architectures: We had prior experience with CNNs for image classification and Transformers for natural language processing, but this project forced us to combine them in a novel way. We learned about different strategies for connecting these architectures (feature extraction, positional embeddings, pooling methods) and how to implement them using PyTorch and the Hugging Face transformers library.
  • Real-time (Simulated) Processing: While we didn't achieve true real-time integration with a game engine, we learned about the challenges of designing a system for low-latency processing. This involved choosing efficient models, optimizing data loading, and carefully considering the frequency of analysis.
  • Music Theory and Emotion: We had to brush up on our music theory knowledge to create a reasonable mapping between visual moods and musical elements (instruments, tempo, key, etc.). This highlighted the interdisciplinary nature of the project.
  • Hugging Face Transformers Library: We gained practical experience using the transformers library for loading pre-trained models, modifying their configurations, and integrating them into a custom PyTorch model.
  • Data Augmentation Techniques: We learned about various image augmentation techniques (random cropping, flipping, color jittering, etc.) and how to implement them using torchvision.transforms. This was crucial for expanding our small dataset.
  • PyTorch/TensorFlow Ecosystem: We ran all of our experiments in PyTorch and learned a lot more about it along the way. The same approach could also have been built in TensorFlow.

How We Built It

The project was built using Python and several key libraries:

  1. Data Collection and Labeling: We gathered screenshots from various indie games and manually labeled them with a set of 12 aggregate mood categories (e.g., "exploration-immersion", "challenge-flow", "dread-isolation"). This involved a lot of subjective judgment!

  2. Model Architecture (Hybrid CNN-Transformer):

    • CNN (MobileNetV3): A pre-trained MobileNetV3 was used as a feature extractor. We removed the final classification layer and used the output of an earlier layer as a sequence of "visual tokens."
    • Transformer Encoder (Small BERT): A small, pre-trained BERT model (from Hugging Face) was used to process the sequence of visual tokens, capturing relationships between different regions of the image.
    • Classification Head: A simple feed-forward network was added on top of the Transformer output to predict the mood category.
  3. Training:

    • We used torchvision.datasets.ImageFolder to load the labeled images.
    • Data augmentation was applied to artificially increase the dataset size.
    • The model was trained using the AdamW optimizer and CrossEntropyLoss.
  4. Inference and Similarity Check:

    • A predict_mood function was created to process individual images and get the predicted mood.
    • A process_frame function was implemented to simulate real-time processing. It calculates the similarity between the feature vectors of consecutive frames and only updates the suggested music if the similarity is below a threshold.
  5. Mood to Instrument Mapping:

    • We created a dictionary that maps each mood predicted by the model to a set of instruments suited to expressing it.
  6. Simulated Real-time Loop:

    • A list of image paths was used to simulate a stream of screenshots from a game.
    • The process_frame function was called for each image, simulating real-time analysis.
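
Steps 4 to 6 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not our actual code: the FrameProcessor class, the use of cosine similarity, the 0.9 threshold, the instrument lists, and the hard-coded feature vectors are all hypothetical. Only the process_frame name and the mood labels come from the project. In the real system the feature vector and mood would come from the model rather than a lookup table.

```python
import math

# Hypothetical mood-to-instrument table; only the mood names come from the project.
MOOD_TO_INSTRUMENTS = {
    "exploration-immersion": ["soft pads", "harp"],
    "challenge-flow": ["drums", "synth bass"],
    "dread-isolation": ["low strings", "drone"],
}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FrameProcessor:
    """Keeps the current music suggestion until the scene changes enough."""

    def __init__(self, extract, threshold=0.9):
        self.extract = extract        # callable: frame -> (feature_vector, mood)
        self.threshold = threshold    # assumed value; would need tuning
        self.last_features = None
        self.current_mood = None

    def process_frame(self, frame):
        features, mood = self.extract(frame)
        # Update only on the first frame or when consecutive frames differ enough.
        changed = (
            self.last_features is None
            or cosine_similarity(features, self.last_features) < self.threshold
        )
        self.last_features = features
        if changed:
            self.current_mood = mood
        instruments = MOOD_TO_INSTRUMENTS.get(self.current_mood, [])
        return self.current_mood, instruments, changed


# Simulated screenshot stream: made-up paths mapped to made-up model outputs.
FAKE_ANALYSIS = {
    "frame_001.png": ([1.0, 0.0, 0.0], "exploration-immersion"),
    "frame_002.png": ([0.99, 0.05, 0.0], "exploration-immersion"),
    "frame_003.png": ([0.0, 1.0, 0.2], "dread-isolation"),
}

processor = FrameProcessor(extract=FAKE_ANALYSIS.__getitem__, threshold=0.9)
results = [processor.process_frame(path) for path in FAKE_ANALYSIS]
for path, (mood, instruments, changed) in zip(FAKE_ANALYSIS, results):
    print(path, mood, instruments, "updated" if changed else "kept")
```

In this toy stream the second frame is nearly identical to the first, so the suggestion is kept, while the third frame differs sharply and triggers an update. Gating on similarity like this avoids restarting the music on every frame.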

Challenges Faced

  • Limited Data: The biggest challenge was the small size of the labeled dataset. Training deep learning models typically requires thousands or even millions of examples. We had to rely heavily on data augmentation and pre-trained models to mitigate this. A larger, more diverse dataset would significantly improve the model's accuracy and robustness.
  • Real-time Performance: Achieving true real-time performance with a complex model like a hybrid CNN-Transformer is difficult. While we optimized the code as much as we could within the hackathon (using a lightweight CNN, a smaller Transformer, and image downscaling), further optimization (e.g., using a dedicated inference engine like ONNX Runtime) would be needed for a production system.
  • Subjectivity of Mood: Defining and labeling "mood" is inherently subjective. Different people might interpret the same image differently. This makes it challenging to create a truly objective ground truth for training.
  • Game Integration: Hooking the system into a running game so that the music changes seamlessly proved difficult, and we did not achieve true integration with a game engine during the hackathon.
  • Time Constraints: The hackathon timeframe forced us to make compromises and prioritize. We focused on building a working prototype that demonstrated the core concept, rather than a fully polished and optimized system.
  • Finding the Right Model: Initially, we wanted to use a Vision Transformer (ViT). However, we found that the hybrid CNN-Transformer model suited our purposes.
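
For the real-time performance point above, one simple way to check whether a frame analyzer fits a latency budget is to time it over a batch of frames. The sketch below is a hypothetical helper, not something we built during the hackathon: the measure_latency name, the 100 ms budget, and the dummy analyzer are all assumptions.

```python
import time


def measure_latency(analyze, frames, budget_ms=100.0):
    """Time analyze(frame) for each frame and compare against a per-frame budget."""
    timings_ms = []
    for frame in frames:
        start = time.perf_counter()
        analyze(frame)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    worst_ms = max(timings_ms)
    return {
        "frames": len(timings_ms),
        "mean_ms": sum(timings_ms) / len(timings_ms),
        "worst_ms": worst_ms,
        "within_budget": worst_ms <= budget_ms,
    }


def dummy_analyzer(frame):
    """Stand-in for model inference; does a small amount of arithmetic."""
    return sum(i * i for i in range(10_000))


report = measure_latency(dummy_analyzer, frames=range(20), budget_ms=100.0)
print(report)
```

Swapping dummy_analyzer for the real inference call would show how much headroom remains before a heavier optimization such as a dedicated inference engine becomes necessary.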

Despite these challenges, we're proud of what we were able to accomplish in such a short time. The project demonstrates the potential of AI to enhance the emotional impact of games, and we're excited to continue exploring this area in the future.
