Final Submission:
Final writeup/reflection: https://docs.google.com/document/d/1eisvTNrBxbRDqVz6ZJ5wX8Hn95wjbbf1io1r-DunUxw/edit
Presentation slides: https://docs.google.com/presentation/d/1lqh4Amn8RFfPyb7lkSDrqgEf4oWOeBLzQE_e6LPQIqg/edit#slide=id.p
Github repository: https://github.com/zaultavangar/DL-final-project
Check-ins:
Check-in 2:
Title: Summarizes the main idea of your project.
Probing the dynamics of word embeddings and prompt completions using time-sliced training of language models
Who: Names and logins of all your group members.
Roman Hall (rhall10), Zaul Tavangar (ztavanga), Gopal Iyer (giyer4)
Introduction: What problem are you trying to solve and why?
Broadly speaking, we are trying to map out the evolution of ideas as captured by the embeddings and prompt completions of word embedding and language models trained on time-sliced data. We believe this is a problem of interdisciplinary relevance that utilizes fairly simple ideas in machine and deep learning, making it an interesting problem to explore for this project.
If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.
We are simultaneously attempting to (a) apply existing ideas to new data, and (b) explore a novel (to our knowledge) idea on that data. The work that comes closest to (a) is Haoxiang Zhang's master's thesis (https://escholarship.org/uc/item/9tp9g31f). We will, however, be building and training our own embedding model, likely with a much smaller embedding space than the word2vec model used in the thesis. The objective of the thesis is to study the time-evolution of the embeddings of particular words using embedding models trained on time-sliced text data. This helps identify interesting semantic correlations between related ideas and/or discover correlations between ostensibly unrelated ones. We chose this paper for inspiration due to its clarity of presentation and the simple yet powerful ideas it implements. Our complete dataset comprises over a million newspaper headlines collected over 19 years and can be found here: https://www.kaggle.com/datasets/therohk/million-headlines.
If you are doing something new, detail how you arrived at this topic and what motivated you.
Idea (b) is, to our limited knowledge, new; however, given the pace of progress in the development and application of language models, it has realistically been attempted before in some shape or form. Over the course of our project, we will do our best to identify works in the deep learning and applied literature that match our idea in order to ensure fair attribution. The idea is as follows: for a fixed embedding scheme (possibly one trained on the full, "unsliced" dataset on which part (a) is based), we will train a fairly simple language model (for next-word prediction) with predetermined hyperparameters on each slice of the dataset. We will then take a single fixed prompt, for example, "Current public sentiment is that war is ___." For each trained instance of the language model, we will record the token (and its embedding) that has the highest probability of completing the prompt for that slice of the dataset. Given a single such prompt, this analysis would shed some light on the evolution of the journalistic context surrounding a particular idea. Given two or more prompts, it has the potential to identify correlations between pairs of complementary ideas, such as (1) "Current public sentiment is that war is ___," and (2) "Consumers feel that gas prices are ___." While it might be too ambitious to expect conclusive findings given that we plan to use architecturally simple models, we hope that the method itself provides a systematic approach to tackling similar problems.
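A minimal sketch of this per-slice probing loop is below. Note that train_lm, encode, and decode are hypothetical placeholders for our not-yet-finalized model and tokenizer; only the overall loop structure is meant literally.

import numpy as np

PROMPT = "current public sentiment is that war is"

def probe_prompt(slices):
    """slices: dict mapping a period label (e.g. '2003') to its list of headlines."""
    completions = {}
    for period, headlines in slices.items():
        model = train_lm(headlines)             # fixed architecture and hyperparameters per slice
        logits = model.predict(encode(PROMPT))  # distribution over the vocabulary
        top_token = int(np.argmax(logits))      # highest-probability completion
        completions[period] = decode(top_token)
    return completions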
What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Etc.
The first part of our project is an unsupervised learning problem: we feed the model the dataset and train it to learn word embeddings that are unknown to us when we begin the task. The second part, next-word prediction, can be viewed as structured prediction (effectively per-token classification over the vocabulary): training is supervised by the known/actual headlines from each time period, against which we check the model's predictions.
Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?
Read and briefly summarize a paper/article/blog relevant to your topic
Please see our response in the Introduction section. A few additional details: it has been accurately noted (Zhang 2019, link above) that embeddings trained on different time slices may not necessarily be "aligned," i.e., they might not represent the same latent semantics. To correct for this, we plan to apply the recommended orthogonal Procrustes method to find the optimal rotation of each embedding coordinate system that best preserves the cosine similarities between different iterations of the embedding models. The high-dimensional embedding data is then projected with the t-SNE nonlinear dimensionality reduction algorithm to enable visualization and ease of analysis.
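A minimal sketch of this alignment step, assuming E_t and E_ref are (vocab_size x dim) embedding matrices over a shared vocabulary (E_2004 and E_2003 below are hypothetical examples):

from scipy.linalg import orthogonal_procrustes
from sklearn.manifold import TSNE

def align(E_t, E_ref):
    """Rotate slice-t embeddings onto the reference space.

    An orthogonal rotation preserves cosine similarities within E_t.
    """
    R, _ = orthogonal_procrustes(E_t, E_ref)  # minimizes ||E_t @ R - E_ref||_F over orthogonal R
    return E_t @ R

# 2-D projection for visualization (illustrative hyperparameters)
points_2d = TSNE(n_components=2, perplexity=30).fit_transform(align(E_2004, E_2003))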
URLs to any public implementations you find of the paper you’re trying to implement.
TensorFlow word2vec documentation: https://www.tensorflow.org/tutorials/text/word2vec
Data: What data are you using (if any)?
Say something about where your data comes from.
We are using the "A Million News Headlines" dataset prepared by Rohit Kulkarni: https://www.kaggle.com/datasets/therohk/million-headlines. It contains news headlines published over a period of nineteen years, from 2003 through 2021, sourced from the Australian Broadcasting Corporation. The data is a CSV file containing the publication date and the text of each headline.
How big is it? Will you need to do significant preprocessing?
The dataset is contained in a single CSV file that is 22 MB. In terms of entries, the corpus averages around 200 headlines per day over the years it covers (2003-2021), for a total of more than a million headlines. Since it is already in CSV format, we will have very limited preprocessing to do (just reading/parsing the file, e.g. with pandas.read_csv(), and grouping headlines by date).
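A sketch of this minimal preprocessing, assuming the file name and column names used in the Kaggle distribution (publish_date as a yyyymmdd integer, headline_text):

import pandas as pd

df = pd.read_csv("abcnews-date-text.csv")
df["year"] = df["publish_date"] // 10000  # e.g. 20030219 -> 2003

# One slice of headlines per year, for per-slice training
slices = {year: g["headline_text"].tolist() for year, g in df.groupby("year")}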
Methodology: What is the architecture of your model?
Our model consists of two components: the embedding model and the language model. While we have yet to conduct systematic experiments on model architecture, we anticipate that our embedding model will be a heavily pared-down version of word2vec, and our language model a very simple transformer. In particular, we would like to work with small embedding spaces of size ~20-30: each (sub-)dataset on which our models are trained will be fairly small, and a large embedding space would likely lead the model to miss important correlations or identify spurious ones. (This intuition may not be entirely valid, and we are open to discussing it further.) Our goal is also to conduct thorough and systematic analyses of simple models instead of running ambitious but poorly planned experiments on larger ones.
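A sketch of the kind of pared-down skip-gram embedding model we have in mind, following the structure of the TensorFlow word2vec tutorial linked above; VOCAB_SIZE and EMBED_DIM are placeholder values we will tune:

import tensorflow as tf

VOCAB_SIZE, EMBED_DIM = 10_000, 24  # placeholders; note the small embedding space

class SkipGram(tf.keras.Model):
    """Skip-gram with negative sampling: score (target, context) pairs by dot product."""
    def __init__(self):
        super().__init__()
        self.target_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.context_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)

    def call(self, pair):
        target, context = pair                # target: (batch,), context: (batch, num_ns + 1)
        t = self.target_emb(target)           # (batch, dim)
        c = self.context_emb(context)         # (batch, num_ns + 1, dim)
        return tf.einsum("bd,bcd->bc", t, c)  # logits over positive + negative samples

model = SkipGram()
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))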
How are you training the model?
One of us has access to Brown's Oscar supercomputing cluster. This will give us access to a sufficiently large number of cores and a GPU, which should enable us to carry out high-throughput testing.
If you are implementing an existing paper, what do you think will be the hardest part about implementation?
We are not exactly implementing an existing paper, but we anticipate that experimenting with different models to achieve high accuracy targets will likely be the most challenging, time-consuming part of the project.
If you are doing something new, justify your design.
It is fair to say that a simplified version of word2vec with an embedding size of roughly ~20 should be able to capture at least some meaningful correlations in the sub-datasets we plan to use. For the language model, we hope that a simple transformer architecture will reach acceptably low perplexity scores. If it does not, another option is to fine-tune existing (large) language models, such as those available on huggingface.co, on the time-sliced data. One potential issue with this approach is that the relatively small sizes of our sub-datasets might cause little to no noticeable change in the predictive tendencies of the fine-tuned models. This is an issue on which we might require some guidance.
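A heavily hedged sketch of that fallback, using distilgpt2 purely as one example of a small pretrained causal LM; slice_headlines is a placeholder for one time slice's headlines, and a real run would batch, pad, and schedule the optimizer properly:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for headline in slice_headlines:  # placeholder: headlines from one time slice
    batch = tok(headline, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss
    loss.backward()
    opt.step()
    opt.zero_grad()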
Metrics: What constitutes “success?”
Success for this project would be observing meaningful shifts in the word embeddings from the headline data over time: shifts that we can make sense of, or that shed light on global trends, sentiment, and/or the evolution of ideas alongside current events.
What experiments do you plan to run?
We plan to choose specific words of interest (perhaps pairs of words whose relationship we would like to study) and see how their distances in the embedding space (or a suitably dimension-reduced one) change over time; see the sketch below. This will be one way to evaluate/test our word embedding model. More technically, we will employ a distance-based clustering metric for the unsupervised learning of the embeddings. Then, for the predictive language model, we will run experiments to predict headlines containing input words of interest, which will hopefully reflect shifts in public sentiment/bias and global changes in general.
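An illustrative version of the first experiment, assuming hypothetical aligned (year -> vocab x dim matrix) and word_index (word -> row) outputs from the alignment step sketched earlier:

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_trajectory(aligned, word_index, w1, w2):
    """Cosine similarity between two words across the aligned per-slice embeddings."""
    i, j = word_index[w1], word_index[w2]
    return {year: cosine(E[i], E[j]) for year, E in sorted(aligned.items())}

# e.g. pair_trajectory(aligned, word_index, "war", "gas")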
Does the notion of model “accuracy” apply for your project, or is some other metric more appropriate?
A notion of accuracy does apply to our project. For both the unsupervised and supervised components, well-defined quantitative metrics exist: for learning the time-varying embeddings, the (negative) log-likelihood serves as the training objective, while perplexity can be used for the next-word prediction model.
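For reference, perplexity is just the exponentiated mean per-token cross-entropy; a minimal helper, assuming losses is an array of per-token negative log-likelihoods in nats:

import numpy as np

def perplexity(losses):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return float(np.exp(np.mean(losses)))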
If implementing an existing project, what were the authors hoping to find and how did they quantify their results?
For the embedding model idea, we borrow from Zhang (2019), linked above. That thesis explores a number of simple but interesting methods to probe the evolving relationships between pairs of words. As an example, consider Figs. 4.7 and 4.8 in the main text. They consider three different word pairs in which one word, e.g. 'unemployment,' is common. They then plot the cosine similarity between 'unemployment' and the other word in each pair to chart their evolving semantic commonality; for, say, 'GDP' and 'unemployment,' this (to the extent that current language reflects reality) helps concretely quantify the relationship between the two economic concepts. They also tabulate word frequencies and map out their shifts in reduced-dimensional representations to better highlight the evolving semantic properties of their segmented data.
If you are doing something new, explain how you will assess your model’s performance.
We will assess our language model’s performance with perplexity, seeing how well it is able to predict accurate/realistic headlines given prompt words.
What are your base, target, and stretch goals?
Base goals: Have functional architectures for embedding and language models that run smoothly on the preprocessed data.
Target goals: Come up with well-documented, quantitative analyses of the shifts in word embeddings/distances as well as prediction embeddings over time.
Stretch goals: Apply some form of stochastic differential equation analysis to the embedding data. If the complete analysis is systematic and conclusive up to a minimum standard, put our code up in a public repository and upload our documented findings to the arXiv and/or consider submitting them to a (digital humanities?) journal.
Ethics: Choose 2 of the following bullet points to discuss
What broader societal issues are relevant to your chosen problem space?
News bias. The data comes from a single Australian news source, the ABC (Australian Broadcasting Corporation), so there is a standing concern that its headlines may carry cultural, national, or political bias.
Why is Deep Learning a good approach to this problem?
Deep learning is a good approach because it is a powerful way to discover relationships between words or sets of words via embeddings. From an ethical standpoint, our project may even unintentionally surface and quantify some of the bias described in the previous question, which would itself be an interesting result.
Division of labor: Briefly outline who will be responsible for which part(s) of the project.
Dataset preprocessing/manipulation: Zaul; word embedding model: Gopal; predictive language model: Roman.
Check-in 3: Progress Report (May 1st):
Introduction:
Broadly speaking, we are trying to map out the evolution of ideas as captured by the embeddings and prompt completions of word embedding and language models, respectively, trained on time-sliced data. We believe this is a problem of interdisciplinary relevance that utilizes fairly simple ideas in machine and deep learning, making it an interesting problem to explore for this project.
Challenges:
The hardest part of our project so far has been figuring out how best to compare the embeddings of headlines across time slices of the data. Since we train separate embeddings for each time slice, we cannot directly compare embeddings from different slices, as this would not provide any meaningful distance/similarity insight (the corresponding embeddings likely do not represent the same latent semantics). Our solution is to use the orthogonal Procrustes method to find the optimal rotation of each embedding coordinate system that best preserves cosine similarities between different iterations of the embedding models. However, as we discussed with our mentor TA, this still may not be completely reliable for comparing the treatment/significance of different words, so we have decided to compare the neighbors of any given word across the time slices in order to analyze changes in the significance/connotations of words over time. We are still settling on the best way to perform this analysis, but we plan to implement a search for a certain number of nearest neighbors of a word of interest (based on cosine similarity, as in the debiasing lab), or perhaps to gather all neighboring words within a certain "radius" of a word of interest; see the sketch below.
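A sketch of the planned nearest-neighbor search, assuming a (vocab x dim) embedding matrix E plus hypothetical word_index/index_word lookup tables from our pipeline:

import numpy as np

def nearest_neighbors(word, E, word_index, index_word, k=10):
    """Top-k neighbors of `word` by cosine similarity in one time slice's embeddings."""
    v = E[word_index[word]]
    sims = (E @ v) / (np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-9)
    order = np.argsort(-sims)
    return [(index_word[i], float(sims[i])) for i in order[1 : k + 1]]  # skip the word itself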
Insights:
The speed at which an embedding model trains seems to be related to its vocabulary size and embedding size, which is understandable since increasing these quantities directly increases the expressivity of the model. We are now plotting the loss of the embedding model as a function of embedding dimension on a log scale (embedding dim = 16, 32, 64, 128, 256) to see whether the loss has a local minimum or converges; a sketch of this sweep follows. This is being run on the entire dataset, and we will then use the results to pick the best hyperparameters for training the model on the time-sliced data.
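A sketch of that sweep, assuming a hypothetical train_embedding_model(dim) helper that trains on the full dataset and returns the final loss:

import matplotlib.pyplot as plt

dims = [16, 32, 64, 128, 256]
losses = [train_embedding_model(d) for d in dims]  # placeholder training helper

plt.plot(dims, losses, marker="o")
plt.xscale("log", base=2)
plt.xlabel("embedding dimension")
plt.ylabel("final training loss")
plt.show()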
Plan:
We are mostly on track with our project; the next steps are to finalize the neighbor search for our embedding model (and find a set of words that yield interesting/varied results) and to complete our language model in order to predict related words in headlines within a given time-slice. After discussing with our mentor TA, we have decided to definitely attempt to build/train our own language model (although our expectations for this are low given the relatively small size of our dataset for the purposes of training a language model); our reach goal is to also implement in-context learning by fine-tuning an existing large language model, but we are also uncertain whether this will yield meaningful results.