Inspiration

As Large Language Models (LLMs) and chatbots continue to develop, people increasingly depend on them to find new information. For developers, building a chatbot that provides accurate information is extremely important. From a business perspective, we want a chatbot that produces not only accurate results but also responses that fit users' preferences. So which features and characteristics of a response give it an edge in terms of user preference? Let's dive in.

How we built it

We explored our datasets to gain insight into the distribution of wins across LLMs: how the wins are spread out and whether some models are chosen more often than others. In our datasets, we found three chatbot versions that win more than 50% of the time: gpt-4-1106-preview, gpt-3.5-turbo-0314, and gpt-4-0125-preview. We then fit several features in our models to understand whether certain features are preferred over others. The features we were interested in include:
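As a rough illustration of the win-rate exploration described above, here is a minimal sketch that counts wins per model over a list of head-to-head records. The record keys ("model_a", "model_b", "winner") and the model names are hypothetical placeholders, not the actual schema or contents of our datasets, and the win rate here is simply wins divided by appearances.

```python
from collections import Counter

def win_rates(battles):
    """Return {model: wins / appearances} for a list of battle records.

    Each record is a dict with hypothetical keys "model_a", "model_b",
    and "winner" (one of "model_a", "model_b", or "tie").
    """
    appearances = Counter()
    wins = Counter()
    for battle in battles:
        appearances[battle["model_a"]] += 1
        appearances[battle["model_b"]] += 1
        # A tie counts as an appearance for both models but a win for neither.
        if battle["winner"] in ("model_a", "model_b"):
            wins[battle[battle["winner"]]] += 1
    return {model: wins[model] / appearances[model] for model in appearances}

# Made-up records, for illustration only.
battles = [
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
    {"model_a": "model-y", "model_b": "model-x", "winner": "model_b"},
    {"model_a": "model-x", "model_b": "model-z", "winner": "tie"},
    {"model_a": "model-z", "model_b": "model-y", "winner": "model_a"},
]
rates = win_rates(battles)
```

Sorting `rates` by value would surface the models that win more than half of the comparisons they appear in.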

cosine similarities between responses through SentenceTransformers
sentiment score for responses through vaderSentiment
number of sentences
word count
readability through textstat

We used SHAP to understand how each feature drives the probability of users leaning toward a certain response.
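To make a few of the features listed above concrete, here is a minimal standard-library sketch for one pair of responses. It is not our actual pipeline: the cosine similarity below uses simple word-count vectors as a stand-in for SentenceTransformers embeddings, the sentence splitter is a naive regular expression, and the sentiment and readability scores (which came from vaderSentiment and textstat) are left out. The function names and the example responses are made up for illustration.

```python
import math
import re
from collections import Counter

def word_count(text):
    """Count word-like tokens in the text."""
    return len(re.findall(r"\w+", text))

def sentence_count(text):
    """Naive sentence count: split on runs of '.', '!' or '?'."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors.

    Assumption: this is a stand-in for embedding-based similarity.
    """
    vec_a = Counter(re.findall(r"\w+", text_a.lower()))
    vec_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pair_features(response_a, response_b):
    """Build one feature row for a pair of competing responses."""
    return {
        "cosine_similarity": cosine_similarity(response_a, response_b),
        "word_count_a": word_count(response_a),
        "word_count_b": word_count(response_b),
        "sentence_count_a": sentence_count(response_a),
        "sentence_count_b": sentence_count(response_b),
    }

# Made-up responses, for illustration only.
features = pair_features(
    "Paris is the capital of France.",
    "The capital of France is Paris. It is a large city.",
)
```

Each pair of responses becomes one row of numeric features, which is the shape a classifier and SHAP would need.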

We trained two different models on our dataset: XGBoost and Logistic Regression. Neither model achieved high accuracy in predicting which response users would prefer based on the chosen features; both were around 45% accurate. However, our question of interest is more about determining feature importance than about prediction. Our models performed better as we added more features, so we can still draw some conclusions from SHAP. We found that the top features driving preference are the cosine similarity between responses and word count.
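One way to put an accuracy figure like the one above in context is to compare it with a majority-class baseline, that is, the accuracy of always predicting the most common outcome. The sketch below shows the comparison on made-up labels. The three outcome labels ("a", "b", "tie") are an assumption for illustration, not a statement about how our targets were encoded, and the numbers have no relation to our results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up labels and predictions, for illustration only.
y_true = ["a", "b", "tie", "a", "b", "a"]
y_pred = ["a", "a", "tie", "b", "b", "a"]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)
```

If a model's accuracy is not clearly above this baseline, its feature importances should be read with extra caution.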

What's next?

From our EDA, some LLMs perform better than others, that is, they have a higher winning percentage. An interesting next step for this research is to engineer a new feature by building a classifier that predicts which LLM version generated a response.