Evaluating LLM manipulation

Unethical advertising has a long history of producing social harm, and the emergence of embedded advertising within large language model (LLM) conversations introduces the potential for novel and underexplored risks. Unlike traditional social media platforms, where targeted advertising relies on explicit user profiles, LLMs can infer sensitive attributes through dialogue. Users also tend to disclose information to LLMs with greater trust and an expectation of privacy, increasing the potential for manipulation. As commercial pressures incentivize LLM providers to maximize advertising revenue, there is a risk that models may adopt persuasive behaviors that steer users toward sponsored products, regardless of user need or interest.

This project proposes the development of a benchmark to evaluate persuasive and manipulative behavior in LLMs. We will design an analytic tool that quantifies the degree to which an LLM exhibits undue persuasive influence. To construct this benchmark, we will create an adversarial LLM explicitly instructed to encourage unnecessary purchases and compare its behavior against a non-adversarial baseline. Simulated users representing diverse demographic groups will interact with both models. The resulting interaction data will be used to develop and validate our measurement framework, which will then be applied to assess persuasive tendencies in existing commercial LLMs.