Hangyu Du posted an update — Dec 08, 2021 02:12 AM EST

"DeepWalk" by ShallowRun

Who

Nan Gao : ngao, Yiwen Chen: ychen485, Hangyu Du: hdu17

Introduction

Our project aims to re-implement a SOTA latent space representation learning algorithm for graphs, DeepWalk. It learns the representations of a graph’s vertices by modeling a stream of random walks and capturing the neighborhood similarities. The output of the model is low-dimensional embeddings for each node. We want to apply DeepWalk to some real-world problems, such as account classification for Facebook pages, link prediction-based user/content recommendation, large-scale graph visualization, etc.

Methodology

DeepWalk employs two steps. The first step is to discover the neighborhoods for nodes in the graph. To do so, we generate multiple random walks starting from each node in the graph and store them as sequences of word tokens. The second step is to feed random walk results into a Word2Vec/SkipGram model. Embeddings for each word token are learned to maximize the similarity among words that appear in the same window because words that occur in the same window tend to have similar meanings (key assumption). DeepWalk outputs the low-dimensional embeddings for each node as the outcome, which can be used to perform multiple tasks such as node classification, link prediction, community detections, etc.

Preprocess

We parsed the data from edge lists and node lists of the original dataset into an undirected homogeneous graph where nodes represent Facebook pages, and edges represent mutual likes between nodes. The graph has four categories: politicians, governmental organizations, television shows, and companies. It contains 22,470 nodes and 171,002 edges.

Random walk

A single random walk process begins by choosing a node to start, and randomly selecting a node from its neighbors as the next node to traverse. The process stops when a fixed length of nodes has been visited from the start node. For each node in the graph, we repeat the single random walk process 10 times and store the results. In terms of hyperparameters, we generate 10 random walks for each node in the graph and set the walk length to 10.

Word2Vec

The random walks generated from the previous step are then fed to a Word2Vec (SkipGram) model, where walks are treated like sentences. Nodes are used as equivalents to words, which are fed into the model in a sliding window. The model tries to predict the next node in the walk, and in the process learns a reasonable embedding matrix that assigns closer embedding to similar nodes.

Results

Node classification & Visualization

After obtaining the embeddings for each node in the graph, we performed a multi-class node classification task to test the performance of our model. We trained an MLP classifier with a single hidden layer to classify embeddings into the predefined four categories. 80% of the data are used for training, and 20% of the data are reserved for the test set. We achieved a weighted average of 92%.

We also visualized the embeddings by using t-SNE to reduce the high-dimensional embeddings into two dimensions. Compared to the randomized embeddings initialized in our Word2Vec, we can clearly see that our final embeddings show four clusters of nodes, where each cluster represents a single category. This shows that DeepWalk is capable of encoding unstructured graphs into something useful and tangible for further graph analytics.

Challenges

It took us some time to find a near-optimal combination of hyperparameters and model setup that could demonstrate robust performance while not making training too slow. Specifically, we considered walk length, number of walks per node as hyperparameters for the random-walk part. For the Word2Vec part, we also have window_size, embedding_size to tune. For the classification part, we compared the performance of different classifiers from a group of non-linear classifiers including MLP, SVM, KNN, Decision Tree, Random Forest, etc.

Reflection

How do you feel your project ultimately turned out? How did you do relative to your base/target/stretch goals?

Our base goal is being able to implement random walk and generate word embeddings from SkipGram/Word2Vec. Our target goal is being able to use the word embeddings to do node classification on one dataset e.g. “Facebook Large Page-Page Network”. Our stretch goal is being able to visualize our word-vector results on a scatter plot and see words in the same category being grouped in the same cluster. Overall, we were able to not only achieve our base goal of generating the word embeddings on one dataset but also reach our target and stretch goal of applying the word embeddings on real-word problems like community detection and visualizing our results by showing pages of the same categories grouped together on the plot.

Did your model work out the way you expected it to?

Yes, it managed to extract meaningful information from the original unstructured graph, so that we can perform node classification and obtain satisfactory results.

How did your approach change over time? What kind of pivots did you make, if any? Would you have done differently if you could do your project over again?

Before the second checkoff with our TA, we used an external implementation of Word2Vec in our model. We later switched to using our own Word2Vec to improve the originality of our implementation., while achieving roughly the same performance as using the external implementation.

Since our dataset is relatively small, if we could do our project over again, we would further ensure every observation from the original dataset has the chance of appearing in training and testing set by using k-fold- cross-validation(spitting the data into k groups, training and testing our model k times) to combat overfitting.

What do you think you can further improve on if you had more time?

We noticed that generating random walks takes up a great portion of the processing time and this step is highly parallelizable. If we had more time, we will try to accelerate this process by making use of thread or GPU to make the algorithm more scalable.

We can also improve by further quantifying our results, comparing the macro-F1 and micro-F1 scores(F1 scores are a measure of a model’s accuracy in binary classifications) with those of other baseline methods like what the researchers did in the paper. The baseline methods that were in the papers were SpectralClustering, Modularity, EdgeCluster, wvRN, and Majority.

What are your biggest takeaways from this project/what did you learn?

We learned that every model has its limitations, and it is our job to find the best model for a certain application. For example, DeepWalk only encodes based on the graph structure, and it does not take node features into consideration. For a graph with rich information associated with nodes, DeepWalk cannot accommodate them into the final embeddings, causing a loss of information. Also, DeepWalk does not differentiate between edges, and it does not consider edge weight. Sometimes, it is likely that a certain edge is preferred over other edges, but DeepWalk cannot take that into its random walk procedure. Lastly, DeepWalk does not have the ability to balance exploration vs exploitation, which may make the algorithm more random and less robust.

Secondly, we learned that it is sometimes rare for a model to perform at the level we need for production just in the first instance, and we might need to tweak the hyperparameters multiple times to achieve optimal performance.

We also learned it is important to improve the modularity of the implementation when designing it, so we can work simultaneously later to improve the efficiency of our collaboration.

Log in or sign up for Devpost to join the conversation.