Inspiration
We were learning about the transformer architecture and vector databases and wanted to build an application that uses both. While studying transformers we watched the YouTuber 3Blue1Brown, whose videos visualize the attention mechanism. His visuals use static data, which motivated us to build a better, live version: as the AI runs inference, we access its attention heads and return them to a frontend for visualization. The vector database came in when we realized we could give the model context without dumping a 500-page book into the prompt: we store the document in a vector database, and when the model is queried we fetch the ten chunks most relevant to the query and pass them in as context, using far fewer tokens. It also lets the user explore the embeddings in 3D space for a better understanding of how the vector relations work.
What it does
The product turns a PDF into an interactive 3D map of ideas. It breaks the document into sentences, finds which ones are most similar, and connects them visually so users can explore how ideas relate across the text. It also includes search, AI question answering with selected context, and attention head visualizations to help users see how the model is interpreting the language. Overall, it helps people research documents faster and understand both the content and the AI’s reasoning more clearly.
How we built it
The video shows a simple overview of what the application does; here is a basic overview of how it does it.
Backend: Feel free to visit the backend repo for more info; we have commented a decent amount of the code. The overview:
- The frontend uploads a PDF, and the backend receives it through an API route (FastAPI)
- The backend reads the PDF and splits it into chunks, one per sentence
- The chunks are converted into embeddings and stored in a vector database
- The backend then reduces each embedding to a 3D vector and sends it to the frontend, which processes and displays it
- To build the graph, the vectors (currently 1536-dimensional) are compared pairwise, and if the similarity between two vectors exceeds a set threshold, an edge is created between them. To convert the 1536 dimensions down to 3 for the graph we use Principal Component Analysis (PCA): it finds the directions of highest variance across all the embeddings, and each vector's projection onto the top three components is sent to the frontend to graph
- With that, the basic 3D graph is done; moving on to the attention heads
- The user queries the AI
- We compare the query against the chunks stored in the vector database to retrieve the most relevant context, then pass that context to the LLM
- The LLM starts generating tokens, and we extract the attention weights after each forward pass
- We then pass an attention_grid, which is a (layers × heads × tokens) matrix
- Finally, once generation is done, the response is passed back to the LLM to reformat, since extracting the tokens for the live stream ruins the formatting
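The graph-building steps above can be sketched in a few lines. This is a minimal illustration, not our actual backend: random vectors stand in for real 1536-dimensional embeddings, numpy's SVD stands in for a PCA library, and the threshold value is made up.

```python
import numpy as np

def build_graph(embeddings: np.ndarray, threshold: float):
    """Compare embeddings pairwise by cosine similarity and
    create an edge for every pair above the threshold."""
    # Normalize rows so dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    return [(i, j) for i in range(len(sim))
            for j in range(i + 1, len(sim))
            if sim[i, j] > threshold]

def pca_3d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional vectors onto their top 3
    principal components (directions of highest variance)."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T          # shape: (n_sentences, 3)

rng = np.random.default_rng(0)
vecs = rng.normal(size=(10, 1536))      # stand-ins for sentence embeddings
coords = pca_3d(vecs)                   # what the frontend plots
edges = build_graph(vecs, threshold=0.1)
print(coords.shape)                     # (10, 3)
```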
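The retrieval step works the same way: rank the stored chunk embeddings by cosine similarity to the query embedding and take the best matches (ten, in our case). A sketch with random stand-in vectors, where the query is deliberately placed near one chunk:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 10):
    """Return indices of the k stored chunks most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                         # cosine similarity per chunk
    return np.argsort(sims)[::-1][:k]    # best match first

rng = np.random.default_rng(2)
chunks = rng.normal(size=(50, 1536))     # 50 stored sentence embeddings
query = chunks[7] + 0.01 * rng.normal(size=1536)  # query close to chunk 7
print(top_k_chunks(query, chunks)[0])    # 7 — the nearest chunk wins
```

A real vector database does this same ranking with an approximate index instead of a brute-force matrix product, which is what makes it scale past a few thousand chunks.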
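The attention_grid step can also be sketched with plain numpy, simulating the per-step tensors a forward pass would return (the layer/head counts and function names here are illustrative assumptions, not our model's real dimensions):

```python
import numpy as np

LAYERS, HEADS = 4, 8   # illustrative model dimensions

def step_attentions(seq_len: int, rng) -> np.ndarray:
    """Stand-in for one forward pass: attention from the newest
    token to every token seen so far, per layer and head."""
    raw = rng.random((LAYERS, HEADS, seq_len))
    return raw / raw.sum(axis=-1, keepdims=True)  # rows sum to 1

def build_attention_grid(num_new_tokens: int, prompt_len: int, rng):
    """Collect each step's (layers, heads, tokens) attention for the
    newest token, padding to a fixed width so steps stack cleanly."""
    total = prompt_len + num_new_tokens
    grid = []
    for step in range(num_new_tokens):
        seq_len = prompt_len + step + 1
        att = step_attentions(seq_len, rng)
        padded = np.zeros((LAYERS, HEADS, total))
        padded[..., :seq_len] = att      # future positions stay zero
        grid.append(padded)
    return np.stack(grid)                # (steps, layers, heads, tokens)

rng = np.random.default_rng(1)
grid = build_attention_grid(num_new_tokens=5, prompt_len=3, rng=rng)
print(grid.shape)                        # (5, 4, 8, 8)
```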
Once again, this is a basic overview; if you are interested in understanding the backend better, refer to the GitHub repo and read the comments we have left.
Frontend: On the frontend, we used react-three to render 3D objects within our Next.js project, and d3 to calculate the position of each node in 3D space when placing objects inside our sphere. The backend sends us the nodes and the edges that need to be rendered; the (x, y, z) coordinates come from the server's PCA step, which converts the high-dimensional embeddings into 3D, d3 lays out the graph from those coordinates, and react-three actually renders it. For our basic UI frameworks, we used shadcn and aceternity to create unique UI elements like glow cards, radial hover cards, buttons, and the general UI surrounding our scene. For the heatmap itself, we used WebGL with custom shaders (GLSL, a C-like language) to compute both the vertex positions (i.e. where to draw) and the individual pixel colors from precise float16 values. Pixel intensities can be normalized either against the local maximum within the head itself (i.e. the highest token relation within that head) or against the global maximum of 1.0, which maps the values to pixel values between 0 and 255 as usual.
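The two normalization modes described above can be sketched in a few lines; numpy stands in for the shader math here, and the function name and sample values are illustrative.

```python
import numpy as np

def to_pixels(head: np.ndarray, mode: str = "local") -> np.ndarray:
    """Map one head's attention weights to 0-255 pixel intensities.
    'local' divides by the head's own maximum, so its hottest cell is
    always full intensity; 'global' divides by 1.0, assuming the
    weights are softmax outputs already in [0, 1]."""
    denom = head.max() if mode == "local" else 1.0
    return np.clip(head / denom * 255, 0, 255).astype(np.uint8)

# Small float16 weights, like the tensors the server returns.
head = np.array([[0.05, 0.15], [0.10, 0.70]], dtype=np.float16)
print(to_pixels(head, "local"))    # hottest cell becomes 255
print(to_pixels(head, "global"))   # same cell stays dimmer (~178)
```

Local normalization makes faint heads readable at the cost of comparability; global normalization keeps intensities comparable across heads, which is why we expose both.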
Challenges we ran into
The real question is what challenges we did not run into. This is one of the most complicated projects we have done, so we were constantly hitting brick walls, problems, and headaches. Examples include keeping the tensor shapes expected by the frontend and backend in sync, computing vector similarity, figuring out how much data the backend could send without crashing the browser, and countless UI bugs: nodes lighting up for no reason, inaccurate connections, and nodes escaping the defined sphere even after normalization. Even after we found a way to contain them within the sphere, many of the mappings were still inaccurate, which showed in how a majority of our nodes sat on the edge of the sphere. Additionally, when creating heatmaps we had to preserve as much color information as possible, since the tensors returned by the server were often small float16 values, so we switched to WebGL in order to preserve and render those precise values.
What we learned
We learned about vector databases, cosine similarity, GPU-based 3D rendering in the browser, laying out nodes in 3D space, and a lot more.
What's next for Cognitive Cartographer
Now that we’ve created a good UI base for visualizing the self-attention of a transformer, we plan to incorporate the full feature set for the architecture so users can see everything the model is thinking behind the scenes.