CS2 Insight Hub

Inspiration

We developed CS2 Insight Hub to give the Counter-Strike community an easy way to explore CS2 data in depth. Our goal goes beyond stats: we aim to incorporate insights from casters and analysts, blending real data with expert perspectives for a more comprehensive view of every match.

The problem we aim to tackle is that while HLTV.org holds a lot of information and statistics about pro CS2, finding answers there can be tedious. A GenAI assistant that you can simply ask questions would make fact-based analysis of pro CS much easier to find. Since HLTV does not expose its data to the public, we had to get crafty to give the AI data to base its answers on.

What it does

CS2 Insight Hub provides users with data-driven insights by combining data on official CS2 matches, obtained through open access to the grid.gg database, with transcripts of match highlights and podcasts.

How we built it

All coding was done in PySpark notebooks in a Microsoft Fabric environment.

The project is divided into data ingestion pipelines, lakehouse storage, and a question-answering AI assistant. There are two data pipelines: one ingests unstructured data and the other structured data.

We also provide a sample dataset so that the solution can be tested without access to the grid.gg API (https://grid.gg/), which is the source of the structured data.

Unstructured data comes in the form of YouTube transcripts of pro match highlights and CS2 podcasts. Structured data is queried from the grid.gg API through GraphQL and then transformed according to the medallion architecture using PySpark notebooks.
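As a rough illustration of the GraphQL step, the sketch below builds (but does not send) a standard GraphQL-over-HTTP POST request using only the Python standard library. The endpoint URL, the query and its field names, and the `x-api-key` header name are all placeholders assumed for illustration; the real values come from the grid.gg API documentation.

```python
import json
import urllib.request

# Placeholder endpoint (assumption): replace with the real GraphQL URL.
GRAPHQL_URL = "https://example.invalid/graphql"

# Hypothetical query: the field names are illustrative, not the real schema.
SERIES_QUERY = """
query Series($first: Int!) {
  allSeries(first: $first) {
    edges { node { id } }
  }
}
"""


def build_graphql_request(query, variables, api_key, url=GRAPHQL_URL):
    """Build a POST request whose JSON body carries the query and variables."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumed header name for the API key
        },
        method="POST",
    )


request = build_graphql_request(SERIES_QUERY, {"first": 10}, api_key="YOUR_KEY")
# Sending is left out on purpose; with a real endpoint it would look like:
#   with urllib.request.urlopen(request) as response:
#       payload = json.load(response)
```

The raw JSON response would then land in the bronze layer before being cleaned and reshaped in later notebooks.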

After the data is loaded into a lakehouse, the unstructured data is vectorized into a Chroma vector store to act as the basis for a RAG assistant, and the structured data is loaded into an SQLite database to act as the basis for an SQL agent.
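The SQLite side of this step can be sketched with the standard library alone. The table name, column names, and rows below are hypothetical stand-ins for the transformed match statistics, not the project's actual schema.

```python
import sqlite3

# Hypothetical rows standing in for transformed match statistics.
player_stats = [
    ("match-1", "player_a", 21, 14),
    ("match-1", "player_b", 17, 18),
    ("match-2", "player_a", 25, 11),
]


def load_player_stats(rows, db_path=":memory:"):
    """Create the table if needed, insert the rows, and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS player_stats ("
        "match_id TEXT, player TEXT, kills INTEGER, deaths INTEGER)"
    )
    conn.executemany("INSERT INTO player_stats VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_player_stats(player_stats)

# The kind of query an SQL agent might generate for "who has the most kills?"
top_fraggers = conn.execute(
    "SELECT player, SUM(kills) AS total_kills FROM player_stats "
    "GROUP BY player ORDER BY total_kills DESC"
).fetchall()
```

Keeping the structured data in a single SQLite file means the agent only needs a connection string and the table schema to start answering questions.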

The AI assistant consists of three models: an SQL agent, a RAG assistant, and a summarizing assistant. The SQL agent is a default LangGraph ReAct agent; the other two are basic LangChain chains.
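This write-up does not spell out how the three models are wired together. One plausible arrangement, shown here purely as an assumption, is to send the question to both the SQL agent and the RAG assistant and let the summarizing assistant merge the two answers. The three callables are hypothetical stubs, not the real LangGraph or LangChain objects.

```python
def answer_question(question, sql_agent, rag_assistant, summarizer):
    """Ask both sources, then let the summarizer merge their answers."""
    stats_answer = sql_agent(question)
    commentary_answer = rag_assistant(question)
    prompt = (
        f"Question: {question}\n"
        f"Statistics: {stats_answer}\n"
        f"Commentary: {commentary_answer}\n"
        "Combine both into one concise answer."
    )
    return summarizer(prompt)


# Hypothetical stubs standing in for the real models.
def fake_sql_agent(question):
    return "player_a has 46 kills over 2 matches"


def fake_rag_assistant(question):
    return "casters praised player_a's entry fragging"


def fake_summarizer(prompt):
    # A real summarizer would call an LLM; this stub just echoes the prompt size.
    return f"summary of {len(prompt.splitlines())} prompt lines"


result = answer_question(
    "How did player_a perform?", fake_sql_agent, fake_rag_assistant, fake_summarizer
)
```

Passing the models in as plain callables keeps the orchestration easy to test without API keys.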

To use the assistant, a Gradio UI is launched in the notebook; it is also accessible through a public URL.

Challenges we ran into

One of the initial challenges we faced was limited access to CS2 datasets. Luckily, we gained open access to grid.gg, which offers a comprehensive esports data API that includes CS2 data.

We encountered rate-limiting issues with the Azure OpenAI service while embedding transcripts into Chroma. To resolve this, we processed the transcripts in smaller batches, adding a 60-second pause between each batch.
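The batching workaround can be sketched as follows. The `embed_batch` callable is a hypothetical stand-in for the real embedding call, and the batch size is an assumed value; only the 60-second pause comes from our setup. The sleep function is injectable so the example runs instantly.

```python
import time


def embed_in_batches(texts, embed_batch, batch_size=16, pause_seconds=60.0,
                     sleep=time.sleep):
    """Embed texts batch by batch, pausing between batches (not after the last)."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_batch(batch))
        # Only pause if another batch is still to come.
        if start + batch_size < len(texts):
            sleep(pause_seconds)
    return vectors


def fake_embed(batch):
    # Hypothetical embedding: one-dimensional vector holding the text length.
    return [[float(len(text))] for text in batch]


pauses = []  # records requested pauses instead of actually sleeping
vectors = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"],
    fake_embed,
    batch_size=2,
    pause_seconds=60.0,
    sleep=pauses.append,
)
```

A fixed pause is the simplest option; retrying with backoff on rate-limit errors would be an alternative if throughput mattered more.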

Accomplishments that we're proud of

We are particularly proud of integrating two distinct types of data in one assistant: structured data from CS2 match statistics stored in an SQL database, and unstructured data from transcripts of highlights and podcasts.

What we learned

We learned to use Microsoft Fabric features, including notebooks in pipelines, Lakehouse for scalable data storage, environment configuration, dimensional semantic models, and the use of delta tables.

In addition to Microsoft Fabric, we gained experience with Azure OpenAI and learned to build an SQL agent with LangGraph.