Checkin 3 (May 1) Progress Report:
Introduction:
Broadly speaking, we are trying to map out the evolution of ideas over time, as captured by the embeddings of word embedding models and the prompt completions of language models, each trained on time-sliced data. We believe this is a problem of interdisciplinary relevance that uses fairly simple ideas from machine and deep learning, which makes it an interesting problem to explore for this project.
Challenges:
The hardest part of our project so far has been figuring out how best to compare the embeddings of headlines across time-slices of data. Since we train separate embeddings for each time-slice, we cannot directly compare embeddings from different time-slices: the distances/similarities would not be meaningful, because the two sets of embeddings likely do not represent the same latent semantics. Our solution is to use the orthogonal Procrustes method to find the rotation that best aligns one time-slice's embedding space with another's; because the transformation is orthogonal, it does not distort the cosine similarities within either space. However, as we discussed with our mentor TA, even aligned spaces may not be completely reliable for comparing the treatment/significance of individual words, so we have decided to instead compare the neighbors of a given word across time-slices in order to analyze changes in its significance/connotations over time. We are still settling on the best way to perform this analysis, but we plan to implement a search for a fixed number of nearest neighbors of a word of interest (ranked by cosine similarity, as in the debiasing lab), or perhaps to gather all neighboring words within a certain "radius" of the word of interest; a sketch of both the alignment and the neighbor search is below.
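The sketch below is a minimal illustration of this approach, assuming each time-slice model is represented as a pair (vocab, matrix), where vocab is a list of words and matrix is an (n_words, dim) NumPy array of their embeddings; these names and helper functions are our own illustrative placeholders, not a fixed API.

    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def align(vocab_a, mat_a, vocab_b, mat_b):
        # Use words shared by both time-slices as anchor points.
        shared = [w for w in vocab_a if w in set(vocab_b)]
        idx_a = {w: i for i, w in enumerate(vocab_a)}
        idx_b = {w: i for i, w in enumerate(vocab_b)}
        A = mat_a[[idx_a[w] for w in shared]]
        B = mat_b[[idx_b[w] for w in shared]]
        # R is the orthogonal matrix minimizing ||A R - B||_F; being orthogonal,
        # it rotates slice A's space toward slice B's without distorting cosine
        # similarities within slice A.
        R, _ = orthogonal_procrustes(A, B)
        return mat_a @ R  # slice A's embeddings expressed in slice B's coordinates

    def nearest_neighbors(word, vocab, mat, k=10):
        # Top-k neighbors of `word`, ranked by cosine similarity.
        unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        v = unit[vocab.index(word)]
        sims = unit @ v
        order = np.argsort(-sims)
        return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:k]

    def neighbors_within_radius(word, vocab, mat, min_sim=0.6):
        # Alternative: every word whose cosine similarity to `word` exceeds a
        # threshold (the "radius" variant mentioned above).
        unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        v = unit[vocab.index(word)]
        sims = unit @ v
        return [(vocab[i], float(s)) for i, s in enumerate(sims)
                if s >= min_sim and vocab[i] != word]

Running nearest_neighbors (or the radius variant) on each time-slice for the same word of interest then lets us compare how that word's neighborhood shifts over time, which sidesteps direct cross-slice distance comparisons.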
Insights:
The speed at which an embedding model trains seems to depend on the vocabulary size and embedding dimension used, which is understandable since increasing either quantity directly increases the expressivity of the model. We are now plotting the loss of the embedding model as a function of the embedding dimension on a log scale (embedding dim = 16, 32, 64, 128, 256) to see whether the loss reaches a local minimum or converges; a sketch of this sweep is below. The sweep is being run on the entire dataset, and we will use the results to pick the best hyperparameters for training the model on the time-sliced data.
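A hedged sketch of the sweep described above: train_embedding_model and full_dataset are placeholders for our existing training code, and the final_loss attribute is an assumption for illustration, not a real library API.

    import matplotlib.pyplot as plt

    dims = [16, 32, 64, 128, 256]
    final_losses = []
    for d in dims:
        model = train_embedding_model(full_dataset, embedding_dim=d)  # hypothetical helper
        final_losses.append(model.final_loss)                         # hypothetical attribute

    plt.plot(dims, final_losses, marker="o")
    plt.xscale("log", base=2)  # embedding dims are powers of two
    plt.xlabel("embedding dimension (log scale)")
    plt.ylabel("final training loss")
    plt.show()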
Plan:
We are mostly on track with our project; the next steps are to finalize the neighbor search for our embedding model (and find a set of words that yield interesting, varied results) and to complete our language model so it can predict related words in headlines within a given time-slice. After discussing with our mentor TA, we have decided to attempt to build and train our own language model (though our expectations are modest, given that our dataset is relatively small for language-model training); our reach goal is to also implement in-context learning by fine-tuning an existing large language model, but we are uncertain whether this will yield meaningful results.