Inspiration

Research shows that even misalignment in a narrow context can lead to broader harm (Turner et al., 2025). We wanted to investigate how an LLM takes in conflicting context and how its outputs behave when faced with contradictions across different environments.

What it does

We source data from Wikipedia pages across different content categories, then change key figures and facts in the chosen text that affect its central context or message. We upload the modified information in a text file as context for the Claude LLM, then prompt it with queries about the information in the file. By evaluating the responses, we analyse how often the LLM hallucinates and in which areas of knowledge it hallucinates the most.
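As a rough illustration of the fact-alteration step, the sketch below swaps a single key figure in a sourced passage and saves it as a context file; the example sentence, the original value, the replacement, and the file name are placeholders rather than data from our runs.

```python
import re

def alter_fact(text: str, original: str, replacement: str) -> str:
    """Replace one key figure/fact in the sourced text, leaving the rest verbatim."""
    return re.sub(re.escape(original), replacement, text)

# Illustrative only: change the completion year in a single sentence.
passage = "The Eiffel Tower was completed in 1889."
modified = alter_fact(passage, "1889", "1920")

# The modified passage is then saved as the context file uploaded to the LLM.
with open("modified_context.txt", "w") as f:
    f.write(modified)
```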

How we built it

Project Structure:

wiki-modifer -- sources Wikipedia pages and returns modified text.
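A minimal sketch of the sourcing side, assuming the `wikipedia` Python package; the article title is a placeholder, and the actual wiki-modifer module may fetch and filter pages differently.

```python
import wikipedia

def fetch_article(title: str) -> str:
    """Return the plain-text content of a Wikipedia article."""
    return wikipedia.page(title, auto_suggest=False).content

# Placeholder title; in practice, pages are drawn from several content categories.
text = fetch_article("Eiffel Tower")
```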

Phase2-CLI_chat: interfaces with the Claude LLM. It allows upload of context documents (i.e. the modified Wikipedia texts in this case) and lets the user prompt and interact with the model. Its responses and behaviours are then analysed.
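A minimal sketch of the Claude interaction, using the Anthropic Python SDK; the model name, system prompt, and document wrapping are assumptions rather than the exact Phase2-CLI_chat implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask_with_context(context_path: str, question: str) -> str:
    """Send a question to Claude with the modified Wikipedia text as context."""
    with open(context_path) as f:
        document = f.read()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model choice
        max_tokens=512,
        system="Answer using only the information in the provided document.",
        messages=[
            {
                "role": "user",
                "content": f"<document>\n{document}\n</document>\n\nQuestion: {question}",
            }
        ],
    )
    return response.content[0].text

print(ask_with_context("modified_context.txt", "When was the Eiffel Tower completed?"))
```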

embeddings_testing: finds token embeddings and computes cosine similarity between them.
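A minimal sketch of the similarity check, assuming a Hugging Face transformers model for the token embeddings and mean pooling before comparison; the model choice and example sentences are placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed model; the actual embeddings_testing setup may use a different one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    """Return per-token hidden states for a piece of text, shape (num_tokens, hidden_dim)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)

original = token_embeddings("The Eiffel Tower was completed in 1889.")
modified = token_embeddings("The Eiffel Tower was completed in 1920.")

# Mean-pool over tokens, then compare the original and modified passages.
similarity = torch.nn.functional.cosine_similarity(
    original.mean(dim=0), modified.mean(dim=0), dim=0
)
print(f"cosine similarity: {similarity.item():.3f}")
```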

What's next for RAGebait

This research gives us insight into how the inner reasoning process of an LLM relates to the kinds of data on which it is most vulnerable to misalignment. It can support research into which areas and environments LLMs are most reliable in, and hence help us better understand the usage and limitations of large language models.
