Title + Main Idea
Our project aims to re-implement DeepWalk, an influential latent space representation learning algorithm for graphs. It was among the first models to apply deep learning techniques to network science, and it helped establish that deep learning is a promising direction in the field. We want to dive deep into this pioneering model to see how it is designed, and apply it to real-world problems such as account classification for Facebook pages, link prediction-based user/content recommendation, and large-scale graph visualization.
Who
Nan Gao (ngao), Yiwen Chen (ychen485), Hangyu Du (hdu17)
Introduction
We are implementing an existing paper, DeepWalk. DeepWalk is a two-stage algorithm that learns social representations of a graph's vertices by modeling a stream of random walks. Here, social representation refers to features of the vertices that capture neighborhood similarities and community structure. The model's output is a low-dimensional embedding for each node, which can be used to perform multiple tasks such as node classification, link prediction, and community detection. The problem is an unsupervised learning problem because we do not know the latent representations of nodes beforehand.
Related Work
We are aware of another random walk-based graph learning model, Node2vec. It was introduced in 2016, about two years after DeepWalk. The main difference is that Node2vec takes edge weights into consideration when performing random walks. It also has two bias hyperparameters that control how the graph is explored (exploitation vs. exploration). In contrast, DeepWalk only works on unweighted graphs, yet it still yields strong results in many cases. This motivates us to dive deeper into DeepWalk.
Implementations of DeepWalk: https://github.com/phanein/deepwalk/ https://dsgiitr.com/blogs/deepwalk/
Data
We are using datasets from this GitHub repository. Specifically, we will choose datasets that capture the social relationships or interest groups for the community detection problem. Moreover, we will choose datasets that are relatively small in size to reduce processing time.
E.g.
“Facebook Large Page-Page Network”
It contains data about Facebook page networks of four categories: politicians, governmental organizations, television shows, and companies. The number of nodes in this dataset is 22,470.
“Wikipedia Article Networks”
It contains data from the English Wikipedia, representing the page-page networks on 3 topics: chameleons, crocodiles, and squirrels. The number of nodes in this dataset is 19,109.
Methodology
DeepWalk employs two steps. The first step is to discover the neighborhoods for nodes in the graph. To do so, we generate multiple random walks starting from each node in the graph and store them as sequences of word tokens.
The second step is to feed random walk results into a Word2Vec (SkipGram) model. Embeddings for each word token are learned to maximize the similarity among words that appear in the same window because words that occur in the same window tend to have similar meanings (key assumption).
Lastly, DeepWalk outputs the low-dimensional embeddings for each node as the outcome, which can be used to perform multiple tasks such as node classification, link prediction, and community detection.
Implementation Details
Random walk
A single random walk begins by choosing a start node, then repeatedly selecting a node uniformly at random from the current node's neighbors as the next node to traverse. The walk stops when a fixed number of nodes has been visited from the start node. In terms of hyperparameters, we generate 10 random walks for each node in the graph and set the walk length to 10, storing the results as sequences of tokens.
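The walk procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not our full implementation: the adjacency-dict representation and the function names (`random_walk`, `generate_walks`) are ours, and per-pass node shuffling follows the DeepWalk paper.

```python
import random

def random_walk(graph, start, walk_length=10):
    """One truncated random walk from `start`.
    `graph` is an adjacency dict: node -> list of neighbors."""
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

def generate_walks(graph, num_walks=10, walk_length=10):
    """Repeat the walk `num_walks` times per node, shuffling the
    node order on each pass so walks are not biased by ordering."""
    walks = []
    nodes = list(graph)
    for _ in range(num_walks):
        random.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length))
    return walks

# Toy triangle graph: every node neighbors the other two
g = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
walks = generate_walks(g, num_walks=10, walk_length=10)
```

With our hyperparameters (10 walks of length 10 per node), a graph with n nodes yields 10n token sequences to feed to the next stage.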
Word2Vec
The random walks generated from the previous step are then fed to a Word2Vec (SkipGram) model, where walks are treated like sentences and nodes like words, scanned with a sliding window. Given a center node, the model tries to predict the nodes that appear within its window, and in the process learns an embedding matrix that assigns similar embeddings to nodes that frequently co-occur in walks.
Metrics
We measure the performance of the model on binary/multi-class node classification tasks on our datasets and compare DeepWalk's results against a few other graph representation learning algorithms as baseline methods. The authors of the paper quantified their results by comparing the model's macro-F1 and micro-F1 scores with those of other baseline methods: SpectralClustering, Modularity, EdgeCluster, wvRN, and Majority. The F1 score is the harmonic mean of precision and recall; macro-F1 averages the per-class F1 scores equally, while micro-F1 aggregates true positives, false positives, and false negatives across all classes.

Our base goal is to implement the random walk and the Word2Vec (SkipGram) model correctly and compare F1 scores on one dataset, e.g. "Wikipedia Article Networks". Our target goal is for the model to yield better results than our baseline methods. Our stretch goal is to apply the model to a second dataset, e.g. "Facebook Large Page-Page Network", and also yield better results than our baseline methods.
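The macro/micro distinction is easy to see on a toy multi-class example using scikit-learn (the labels below are made up for illustration):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# macro: unweighted mean of each class's F1 score
macro = f1_score(y_true, y_pred, average="macro")
# micro: F1 computed from global TP/FP/FN counts
micro = f1_score(y_true, y_pred, average="micro")
```

Macro-F1 treats every class equally regardless of size, so it is more sensitive to performance on rare classes; micro-F1 is dominated by the larger classes (and for single-label classification it equals overall accuracy).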
Results
Node classification & Visualization
After obtaining the embeddings for each node in the graph, we performed a multi-class node classification task to test the performance of our model. We trained an MLP classifier with a single hidden layer to classify embeddings into the four predefined categories. 80% of the data are used for training, and 20% of the data are reserved for the test set. We achieved a weighted-average F1 score of 92%.
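The evaluation setup can be sketched with scikit-learn. The randomly generated features below are a stand-in for real learned embeddings (with an artificial class-dependent shift so the toy task is learnable); the hidden-layer size of 128 is illustrative, not the exact configuration we used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-in for DeepWalk embeddings: 200 "nodes", 64 dims, 4 classes
X = rng.normal(size=(200, 64))
y = rng.integers(0, 4, size=200)
X[np.arange(200), y] += 3.0   # make the toy classes separable

# 80/20 train/test split, as in our evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Single-hidden-layer MLP classifier over the embeddings
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Note that the classifier is trained on frozen embeddings: DeepWalk's representation learning is unsupervised, and labels are only used in this downstream step.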
We also visualized the embeddings by using t-SNE to reduce the high-dimensional embeddings into two dimensions. Compared to the randomized embeddings initialized in our Word2Vec, we can clearly see that our final embeddings show four clusters of nodes, where each cluster represents a single category. This shows that DeepWalk is capable of encoding unstructured graphs into something useful and tangible for further graph analytics.
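The dimensionality-reduction step can be sketched with scikit-learn's t-SNE; the random matrix below stands in for real embeddings, and the perplexity value is illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))   # stand-in for 64-d node embeddings

# Project the embeddings down to 2 dimensions for plotting
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
# `coords` can then be scatter-plotted, colored by node category
```

Coloring each 2-D point by its node's category is what makes the four clusters visible in our plots.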
Ethics
DeepWalk can be used to pinpoint manipulative groups inside a social network or a stock market. By understanding the relationships of individuals that focus on similar pages that promote racism, hate speech, and misinformation, we can investigate coordinated manipulation to prevent the spread of falsehood and hatred online.
The major stakeholders in this problem are social media companies that use DeepWalk to discover people with common interests and keep them tightly connected by recommending similar groups to join. Mistakes made by our algorithm can lead to poor recommendations and incorrect community predictions.
Division of Labor
Pre-processing + Random Walk algorithm + visualization:
Hangyu Du
Word2Vec:
Yiwen Chen
Node Classification:
Hangyu Du, Yiwen Chen, Nan Gao
Video Presentation + Poster:
Nan Gao