Introduction

Our group's passion for video games and game development drives our project. We've noticed that game store pages often don't serve as the best advertisements for games—especially true for indie developers who lack the resources and support that larger studios possess. To assist these indie developers in their marketing efforts, we plan to develop a deep learning model that generates compelling game summaries and matching marketing visualizations. This model will be trained on datasets containing game descriptions, ensuring it captures the essence of what the game is about. Our goal is to enhance marketing materials and provide guidance to developers on focusing their resources effectively to highlight their games' key selling points.

This project primarily employs supervised learning techniques. The text generation component is a structured prediction task, where the model is trained to predict and generate the next token that seamlessly integrates into the ongoing sentence, based on the initial game description and preceding tokens. Subsequently, the model undertakes a classification task to determine the game genre based on the crafted description. Finally, the image generation phase also utilizes supervised learning; here, the model is tasked with creating a visualization that corresponds appropriately to the specified genre. This structured approach ensures that each stage of the workflow is informed by and builds upon the outputs of the preceding tasks, facilitating a cohesive and targeted content generation process.

Dataset

Our datasets, sourced from Kaggle, consist of data scraped from online game stores and repositories such as Steam (the Steam Games Dataset and Steam Reviews). These datasets include game titles, release dates, synopses, and reviews, which will serve as the primary training material for our LSTM model. For the GAN model, screenshots of these games are required; however, these images need to be collected and converted manually. We will download gameplay videos from the respective game pages and use an online tool, specifically ezgif.com's video-to-JPG converter, to extract screenshots from these videos. These steps ensure that our models have access to accurate and relevant training data for generating game content.

After data cleaning, the dataset used for this project consists of both textual and graphical data sourced from online game store pages and repositories. It comprises three main columns: Synopsis, which provides the official description of the game; Genre, categorizing each game into genres such as Adventure, Casual, Strategy, Action, and Simulation; and Screenshot, which includes in-game screenshots of each title. The dataset encompasses a total of 5,000 data points, evenly distributed across the genres with each having 1,000 entries.

Dataset Description:

  • Text: more than 2.7M + 6M rows (each 50–100 words long). The text will need to be cleaned via lemmatization and tokenization.
  • Image: at least 90K images and videos; we will extract several (10 or more) frames from each video to acquire more image data.
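The cleaning steps above can be sketched as follows. This is a minimal, dependency-free illustration: the real pipeline would additionally lemmatize each token (e.g. with NLTK's WordNetLemmatizer), and the sample synopses and function names here are purely illustrative.

```python
import string

def clean_and_tokenize(text):
    """Lowercase, strip punctuation, and split a synopsis into tokens.
    Lemmatization (e.g. via NLTK's WordNetLemmatizer) is omitted here
    to keep the sketch dependency-free."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def build_vocab(token_lists):
    """Map each unique token to an integer id (0 is reserved for padding)."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

# Illustrative synopses standing in for the Kaggle data.
synopses = [
    "Explore a vast, hand-crafted world!",
    "Build, craft, and survive the night.",
]
token_lists = [clean_and_tokenize(s) for s in synopses]
vocab = build_vocab(token_lists)
sequences = [[vocab[t] for t in toks] for toks in token_lists]
```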

Related Works

  • The blog by Marco Del Pra on Medium delves into Generative Adversarial Networks, a cornerstone technology for our project aimed at generating game synopses and screenshots. GANs, through their innovative architecture comprising a generator and discriminator, excel in creating data that mimics a specific distribution, showcasing their prowess in generating both textual and visual content. This aligns with our project's objective to automate the generation of detailed game narratives and corresponding imagery, which can serve as a creative catalyst for game developers. The article not only covers the foundational aspects and advancements within GAN technology, such as Deep Convolutional GANs (DC-GANs) for enhanced image quality, but also addresses critical challenges like mode collapse and training instability. These insights are invaluable to our project, offering a blueprint for overcoming technical hurdles and leveraging GANs’ generative capabilities to fulfill our goal of streamlining content creation in game development.
  • Attentional Generative Adversarial Networks: AttnGAN builds upon vanilla GANs by adding the attention mechanism to allow for text-to-image generations. Moreover, in the paper, the authors proposed a Deep Attentional Multimodal Similarity Model (DAMSM), which is able to compute the similarity between the generated image and the sentence using “both the global sentence level information and the fine-grained word level information.” The DAMSM thus also gives us a fine-grained image-text matching loss metric that helps to train the generator to generate images that best match the input text. Github Repo
  • Deep Fusion Generative Adversarial Networks: DF-GAN is a newer GAN model for text-to-image generation. Although it employs a more complex structure, we will also explore it and evaluate how it performs. Github Repo

Methodology

Our project aims to develop a model capable of generating marketing text and visualizations given the input text. The model integrates both Natural Language Processing (NLP) and Generative Adversarial Networks (GANs) to achieve this goal.

Phase 1: The LSTM model will be trained on both game descriptions and game reviews. We plan to experiment with augmenting our data by concatenating parts of game descriptions with their reviews. At test time, given a short game description, the model will “finish the paragraph” by repeatedly predicting the next token. Then, we can feed in both the generated description and "this game is great because ..." and obtain the part that showcases "why this game is good."
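The “finish the paragraph” behavior can be illustrated with a toy next-token predictor. The bigram counter below is only a stand-in for the LSTM's learned conditional distribution (and the tiny corpus is fabricated), but the greedy generation loop mirrors how the model would extend a prompt at test time:

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Count next-token frequencies: a toy stand-in for the LSTM's
    learned distribution P(next token | context)."""
    counts = defaultdict(Counter)
    for tokens in corpus_tokens:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def finish_paragraph(counts, prompt_tokens, max_new=5):
    """Greedily append the most likely next token, as the trained model
    would when 'finishing' a short game description."""
    out = list(prompt_tokens)
    for _ in range(max_new):
        candidates = counts.get(out[-1])
        if not candidates:
            break
        out.append(candidates.most_common(1)[0][0])
    return out

# Fabricated mini-corpus mixing description and review-style text.
corpus = [
    "this game is great because the world is huge".split(),
    "this game is great because the combat is fast".split(),
]
model = train_bigram(corpus)
continuation = finish_paragraph(model, ["this", "game"], max_new=4)
```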

Phase 2: The GAN model will be trained on game images (frames extracted from videos). It uses our input as well as the language model's output to generate images that visually correspond to the given text.

Both models are trained on the same dataset to ensure consistency and relevance in the generated content; additionally, some weights or layers could be shared so that the text and visual outputs remain coherent.

The design of our model, which integrates LSTM/Transformer and GANs, strategically aligns with the natural progression of game development, from conceptualizing a story to creating corresponding marketing text and visuals. This approach not only mimics real-life game design processes but also takes advantage of the fact that game reviews themselves make helpful marketing material, so the generated marketing text may appeal more strongly to potential buyers. Automating the marketing stage also enhances efficiency and scalability in game development, allowing developers to dedicate more of their time to developing games instead of marketing them.

Challenges may arise; however, there are several alternatives we can fall back on. These include enhancing model training with richer datasets and exploring hybrid approaches such as conditional GANs that utilize additional visual inputs. Making the model more modular and integrating user feedback loops can also help refine and improve the outputs, ensuring the models adapt and evolve based on real-world application and user interaction.

Metrics

For our language model, we will use categorical cross-entropy loss to monitor model performance. In addition, human evaluation of the generated paragraphs is necessary in order to judge how appealing the generated text is. For the classification task, we use accuracy, the most natural metric for measuring the correctness of genre predictions.
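As a concrete illustration, both metrics for the genre classifier can be sketched in a few lines of numpy. The five genres match the dataset description above; the probability values are fabricated for the example:

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Mean negative log-likelihood of the true genre under the
    predicted distribution."""
    probs = np.clip(y_pred_probs, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(probs), axis=1))

def accuracy(y_true_onehot, y_pred_probs):
    """Fraction of games whose predicted genre matches the label."""
    return np.mean(np.argmax(y_pred_probs, axis=1) ==
                   np.argmax(y_true_onehot, axis=1))

# Two games, 5 genres (Adventure, Casual, Strategy, Action, Simulation).
y_true = np.array([[1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0]], dtype=float)
y_pred = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                   [0.2, 0.2, 0.4, 0.1, 0.1]])
cce = categorical_cross_entropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```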

In the context of GANs, the notion of “accuracy” does not really apply. Instead, GAN-specific losses are more suitable, such as binary cross-entropy for both the generator and the discriminator. These losses provide insight into how well the discriminator can distinguish between real and fake data, and how convincingly the generator can produce data that appears real.

For our specific task of text-to-image generation, we plan to implement the image-text matching loss used in AttnGAN, to ensure the generated image is relevant to the input text. Other specialized metrics like Inception Score (IS) and Fréchet Inception Distance (FID) may also be relevant, especially for assessing the visual quality and diversity of generated images. For example, a higher IS implies that generated images are distinct from each other across different classes, while a lower FID implies that the generated images are visually similar to “real images” – in our case, real game images.
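FID can be computed directly from the feature statistics of real and generated images: FID = ||μ₁ − μ₂||² + Tr(C₁ + C₂ − 2·√(C₁C₂)). Below is a minimal numpy sketch of that formula. In practice the statistics come from Inception-v3 activations, and a production implementation would use scipy.linalg.sqrtm; the eigendecomposition shortcut here is an assumption that holds for the typical PSD case:

```python
import numpy as np

def _sqrtm(mat):
    """Matrix square root via eigendecomposition. Assumes the product of
    the covariances is diagonalizable with a non-negative real spectrum
    (true in the typical PSD case); a production implementation would
    use scipy.linalg.sqrtm instead."""
    vals, vecs = np.linalg.eig(mat)
    vals = np.maximum(vals.real, 0.0)
    vecs = vecs.real
    return (vecs * np.sqrt(vals)) @ np.linalg.inv(vecs)

def fid(mu1, cov1, mu2, cov2):
    """Frechet Inception Distance between Gaussians fitted to real and
    generated image features."""
    diff = mu1 - mu2
    covmean = _sqrtm(cov1 @ cov2)
    return float(diff @ diff + np.trace(cov1 + cov2 - 2 * covmean))

# Toy statistics: identical covariances, shifted means.
mu_real, cov_real = np.zeros(3), np.diag([1.0, 2.0, 3.0])
mu_fake = np.ones(3)
score = fid(mu_real, cov_real, mu_fake, cov_real)
```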

To evaluate the performance of our GAN model, we will monitor both the discriminator and generator loss functions. These include binary cross-entropy losses for real and generated data and gradient penalties to ensure stability in training. By examining these loss metrics, we can track whether the discriminator is improving in identifying real vs. fake data, and whether the generator is becoming more adept at fooling the discriminator.
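The discriminator and generator objectives described above can be sketched in numpy as follows. The `d_real`/`d_fake` arrays are hypothetical discriminator outputs standing in for real model scores:

```python
import numpy as np

def bce(labels, probs, eps=1e-12):
    """Binary cross-entropy between targets and the discriminator's
    predicted probabilities."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

def discriminator_loss(d_real, d_fake):
    """Real images should score 1, generated images 0."""
    return bce(np.ones_like(d_real), d_real) + bce(np.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    """The generator improves when its images fool D (scores near 1)."""
    return bce(np.ones_like(d_fake), d_fake)

# Hypothetical discriminator scores for one real and one fake batch.
d_real = np.array([0.9])
d_fake = np.array([0.1])
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```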

Moreover, we will use image-text matching loss to evaluate the GAN model’s performance in generating images that are coherent to input text. Additionally, visual inspections of generated images and user feedback on the realism and coherency of these images will also be crucial for performance assessment.

Goals

  • Base Goals: Achieve a minimum viable product where the NLP model can generate text with key words related to the given prompt, and the GAN can generate non-noisy images.

  • Target Goals: The NLP model is able to generate coherent and creative sentences based on the prompted sentences, and the GAN can produce recognizable images that generally correspond to the descriptions.

  • Stretch Goals: Enhance the quality of the NLP outputs so it is capable of not only describing the game, but also showing “what is great” or a key-selling point of the game; and improve the GAN outputs to high-definition, aesthetically pleasing images that closely mirror the text inputs.

Ethics

  • Why is Deep Learning a good approach to this problem?

Deep learning is highly suitable for generating images from game introductions due to its proficiency in handling high-dimensional data, learning meaningful feature representations, and modeling relationships hidden within game descriptions and reviews. Its scalability and ability to improve with more data make it ideal for adapting to evolving gaming trends, without requiring additional human labor as new games are added. Additionally, deep learning can manage the entire process from text processing to image generation in an end-to-end manner, reducing the need for manual intervention and ensuring cohesive outputs. With today's technology, only deep learning models appear capable of generating text and images from such a wide range of input descriptions.

  • How are you planning to qualify or measure error or success? What implications does your quantification have?

To measure the success and error of our project, we will utilize a combination of quantitative and qualitative metrics. For the NLP model, metrics such as perplexity to assess language model fluency will be used. For the GAN or diffusion model, visual quality and image similarity will be evaluated using previously mentioned IS and FID to measure image realism and diversity. The image-text matching loss from AttnGAN can be used to quantify performance of generated image given input text. Additionally, user feedback through surveys and usability tests will provide qualitative insights into the relevance and appeal of the generated images. This dual approach allows us to refine technical accuracy while also ensuring the outputs meet user expectations and usability standards, thereby directly impacting the project’s practical effectiveness and user satisfaction.
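The perplexity metric mentioned above is just the exponential of the mean per-token negative log-likelihood; a minimal numpy sketch (the probabilities below are illustrative):

```python
import numpy as np

def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the observed tokens.
    Lower is better; a model uniform over V words scores exactly V."""
    nll = -np.log(np.asarray(token_probs, dtype=float))
    return float(np.exp(nll.mean()))

# A model assigning probability 0.25 to each of four observed tokens
# behaves like a uniform model over a 4-word vocabulary.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```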

CheckOff #3 - Reflection

Introduction

Our group's passion for video games and game development drives our project. While browsing game store pages, it came to our attention that game marketing material (synopsis and visualizations) are often not the best advertisements for the game – actual game reviews often do a better job of leading potential buyers to finally decide to buy and play the games.

Challenges

In developing the language model, we initially encountered significant challenges due to the computational demands. Our initial plan was to utilize a dataset of 2,000 game descriptions, with 80% allocated for training. However, processing 1,600 training instances proved to be time-prohibitive. To mitigate this issue, we are considering a reduction in the training dataset to 800 descriptions, which is expected to substantially decrease the training duration while maintaining a manageable computational load. We plan to first find the best-performing model architecture using fewer epochs and a reduced training set, then train that model for more epochs, potentially on the full training set.

For the image generation part, one challenge was ensuring the generated images could be considered suitable for marketing or advertising purposes, rather than mere images of possible game screenshots. Thus, we decided to add CycleGAN to our model for its ability to map screenshots to images that look like advertisement posters. Another challenge is finding an appropriate “marketing posters” dataset; currently, we are using online posters (from social media). A further part of the challenge is that DF-GAN employs a rather complex architecture.
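The property CycleGAN enforces to make this screenshot-to-poster mapping learnable without paired data is cycle consistency: translating a screenshot to poster style and back should recover the input. A minimal numpy sketch of that loss, with trivial lambdas standing in for the two learned generators:

```python
import numpy as np

def cycle_consistency_loss(x_screenshot, g_forward, g_backward):
    """L1 reconstruction penalty at the heart of CycleGAN: mapping a
    screenshot to 'poster style' and back should recover the input.
    g_forward / g_backward stand in for the two learned generators."""
    reconstructed = g_backward(g_forward(x_screenshot))
    return float(np.mean(np.abs(x_screenshot - reconstructed)))

# Toy check: with exactly inverse mappings, the cycle loss is zero.
x = np.random.default_rng(0).random((4, 4, 3))
zero_loss = cycle_consistency_loss(x, lambda im: im * 2.0, lambda im: im / 2.0)
bad_loss = cycle_consistency_loss(x, lambda im: im + 0.5, lambda im: im)
```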

Insights

For LSTM preprocessing specifically, the text is first cleaned to exclude punctuation and transformed into a single string. A tokenizer then generates numerical sequences from the text, creating subsequences from each sentence. After preprocessing, analysis of the input_sequences revealed a maximum length of 719. The data was structured into matrices X and y, with X having a shape of (142951, 718) and y being a one-dimensional array with 142951 entries. After one-hot encoding y, which converts categorical integer labels into a binary matrix, the shape becomes (142951, 13377), a significant expansion suitable for categorical prediction in neural network models.
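The subsequence construction described above can be sketched without Keras. The toy vocabulary and sequences below are illustrative, but `make_training_pairs` mirrors what the tokenizer, pre-padding, and one-hot encoding produce at full scale (where the shapes become (142951, 718) and (142951, 13377)):

```python
import numpy as np

def make_training_pairs(sequences, vocab_size):
    """For each tokenized sentence, emit every prefix as an input and its
    next token as the label (the subsequences described above), with
    inputs pre-padded with zeros to a common length, Keras-style."""
    subseqs = [seq[:i + 1] for seq in sequences for i in range(1, len(seq))]
    max_len = max(len(s) for s in subseqs)
    X = np.zeros((len(subseqs), max_len - 1), dtype=int)
    y = np.zeros(len(subseqs), dtype=int)
    for row, s in enumerate(subseqs):
        X[row, max_len - len(s):] = s[:-1]   # pre-padding
        y[row] = s[-1]
    # One-hot encode the labels, as done before training the LSTM.
    y_onehot = np.eye(vocab_size, dtype=int)[y]
    return X, y_onehot

# Toy example: two tokenized sentences over a 6-id vocabulary (0 = padding).
seqs = [[1, 2, 3, 4], [2, 5, 3]]
X, y_onehot = make_training_pairs(seqs, vocab_size=6)
```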

We have experimented with generating game descriptions using the language model. While the model successfully identifies and integrates keywords relevant to the input sentence, the outputs do not yet form coherent and complete sentences. This indicates a need for further refinement in the model's ability to construct grammatically and contextually sound sentences based on the provided inputs.

For image preprocessing, we currently selected one screenshot per game, resized all screenshots to (256, 256), and scaled pixel values to between 0 and 1. The dataset we are working with is thus a NumPy array of shape (N, 256, 256, 3) for N images with 3 channels (RGB). We currently have a working implementation of CycleGAN; however, the generated images are not yet in the intended style (suitable for marketing purposes).
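This resize-and-rescale step can be sketched in pure numpy. The nearest-neighbour resize below is a stand-in (the actual pipeline would more likely use PIL or OpenCV with bilinear interpolation), and the fake screenshots are generated data for illustration:

```python
import numpy as np

def resize_nearest(img, size=(256, 256)):
    """Nearest-neighbour resize; a numpy-only stand-in for the
    PIL/OpenCV resize a real pipeline would use."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return img[rows][:, cols]

def preprocess_screenshots(images):
    """Resize each screenshot to 256x256 and scale pixels to [0, 1],
    yielding the (N, 256, 256, 3) array used to train the GAN."""
    batch = [resize_nearest(im).astype(np.float32) / 255.0 for im in images]
    return np.stack(batch)

# Two fake 720p screenshots standing in for real game images.
fake = [np.random.default_rng(i).integers(0, 256, size=(720, 1280, 3),
                                          dtype=np.uint8) for i in range(2)]
dataset = preprocess_screenshots(fake)
```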

For the language model, the computational intensity has limited the number of epochs, preventing us from achieving our anticipated performance at this stage. However, there is a noticeable improvement in accuracy as the number of epochs increases. Moving forward, we plan to allocate additional epochs to the training process, which is expected to enhance the model's performance and align it more closely with our objectives.

Currently, we are still fixing some bugs that occur during training of our text-to-image GAN model (DF-GAN), but CycleGAN for style transfer is working. We are also experiencing very long training times for CycleGAN; thus, we may need to acquire a more appropriate dataset specializing in ads/marketing posters for this part of the model to work properly. If training time remains prohibitive, we might also consider the more feasible approach of using a conditional GAN to generate images by genre.

Plan

Our project is following our plan and we just need additional time to continue training the model with more epochs in order to achieve the desired level of performance. This extended training period is essential for refining the model's accuracy and effectiveness.

For the language model, we may simplify the architecture to reduce running time. If simplifying the model does not reduce run time enough, we may limit the data to fewer genres to reduce vocabulary variety.

For the GAN model, we plan to fix bugs to get it fully working. If we cannot, we might switch to Conditional GAN instead, given the time and computational resources available.

Division of Labor

Data collection & cleaning [Shiqi & Yanfeiyun]

  • Text dataset: Game descriptions and reviews, available on Kaggle
  • Game images: Available in datasets, might require taking frames from videos.

Preprocessing [Shiqi, Yanfeiyun & Bingnan]

  • Text: lemmatization, tokenization, etc.
  • Image: scaling, resizing, etc.

Modeling [Shiqi, Yanfeiyun & Bingnan]

  • Language model (RNN / LSTM/ Transformer/RNN Classifier) [Shiqi & Yanfeiyun]
  • Generative visual model (GAN / ViT / Diffusion Model) [Bingnan]

Final Submissions

- Presentation Slides

- Github

- Final Report
