Title + Main Idea

Our project aims to re-implement DeepWalk, a seminal latent-space representation learning algorithm for graphs. It was the first model to apply deep learning to network science, demonstrating that deep learning is a promising direction in the field. We want to dive deep into this pioneering model to see how it is designed, and apply it to real-world problems such as account classification for Facebook pages, link prediction-based user/content recommendation, and large-scale graph visualization.

Who

Nan Gao (ngao), Yiwen Chen (ychen485), Hangyu Du (hdu17)

Introduction

We are implementing an existing paper, DeepWalk. DeepWalk is a two-stage algorithm that learns social representations of a graph’s vertices by modeling a stream of random walks. Here, a social representation refers to features of a vertex that capture neighborhood similarity and community membership. The model outputs a low-dimensional embedding for each node, which can be used for multiple downstream tasks such as node classification, link prediction, and community detection. The problem is an unsupervised learning problem because we do not know the latent representations of the nodes beforehand.

Related Work

We are aware of another random walk-based graph learning model, Node2vec. It was introduced in 2016, about two years after DeepWalk. The main difference is that Node2vec takes edge weights into consideration when performing random walks. It also has two bias hyperparameters that control how the graph is explored (exploitation vs. exploration). In contrast, DeepWalk only works on unweighted graphs, yet it still yields strong results in many cases. This motivates us to dive deeper into DeepWalk.

Implementations of DeepWalk:
https://github.com/phanein/deepwalk/
https://dsgiitr.com/blogs/deepwalk/

Data

We are using datasets from this GitHub repository. Specifically, we will choose datasets that capture the social relationships or interest groups for the community detection problem. Moreover, we will choose datasets that are relatively small in size to reduce processing time.

For example:

“Facebook Large Page-Page Network”

It contains data about Facebook page networks of four categories: politicians, governmental organizations, television shows, and companies. The number of nodes in this dataset is 22,470.

“Wikipedia Article Networks”

It contains data from the English Wikipedia, representing page-page networks on 3 topics: chameleons, crocodiles, and squirrels. The number of nodes in this dataset is 19,109.

Methodology

DeepWalk employs two steps. The first step is to discover the neighborhoods for nodes in the graph. To do so, we generate multiple random walks starting from each node in the graph and store them as sequences of word tokens.

The second step is to feed random walk results into a Word2Vec (SkipGram) model. Embeddings for each word token are learned to maximize the similarity among words that appear in the same window because words that occur in the same window tend to have similar meanings (key assumption).

Lastly, DeepWalk outputs the low-dimensional embeddings for each node as the outcome, which can be used to perform multiple tasks such as node classification, link prediction, community detection, etc.

Implementation Details

Random walk

A single random walk begins by choosing a start node, then repeatedly selecting a random neighbor of the current node as the next node to traverse. The walk stops once a fixed number of nodes has been visited. In terms of hyperparameters, we generate 10 random walks from each node in the graph and set the walk length to 10.
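A minimal sketch of this procedure (the function names and the adjacency-dict input format are our own, not taken from the reference code):

```python
import random

def random_walk(adj, start, walk_length=10):
    """Generate one truncated random walk from `start`.

    `adj` maps each node to a list of its neighbors.
    """
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

def generate_walks(adj, walks_per_node=10, walk_length=10):
    """Run `walks_per_node` walks from every node, stored as token sequences."""
    walks = []
    for _ in range(walks_per_node):
        for node in adj:
            # stringify nodes so walks can be fed to Word2Vec as "sentences"
            walks.append([str(n) for n in random_walk(adj, node, walk_length)])
    return walks
```

Storing nodes as strings lets the walks be handed to the SkipGram stage unchanged, as if they were sentences of word tokens.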

Word2Vec

The random walks generated in the previous step are then fed to a Word2Vec (SkipGram) model, where each walk is treated as a sentence and each node as a word. Walks are fed into the model through a sliding window; for each center node, the model tries to predict the surrounding context nodes within the window, and in the process it learns an embedding matrix that assigns similar embeddings to similar nodes.
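As a simplified sketch of how the sliding window is used (this only shows how (center, context) training pairs are formed; the actual SkipGram model then learns embeddings from these pairs):

```python
def skipgram_pairs(walk, window_size=2):
    """Turn one walk into (center, context) training pairs.

    For each position, every node within `window_size` steps on either
    side becomes a context target for the center node.
    """
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center node itself
                pairs.append((center, walk[j]))
    return pairs
```

For example, `skipgram_pairs(['a', 'b', 'c'], window_size=1)` yields `[('a','b'), ('b','a'), ('b','c'), ('c','b')]`.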

Metrics

We measure the performance of the model by running binary/multi-class node classification on our datasets and comparing DeepWalk’s results against a few other graph representation learning algorithms as baselines. The authors of the paper quantified their results by comparing the model’s macro-F1 and micro-F1 scores with those of other baseline methods; the F1 score is the harmonic mean of a classifier’s precision and recall. The baselines used in the paper are SpectralClustering, Modularity, EdgeCluster, wvRN, and Majority. Our base goal is to implement the random walk and the Word2Vec (SkipGram) model correctly and compare F1 scores on one dataset, e.g. “Wikipedia Article Networks”. Our target goal is for the model to yield better results than our baseline methods. Our stretch goal is to apply the model to a second dataset, e.g. “Facebook Large Page-Page Network”, and again beat the baselines.
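For reference, macro-F1 averages the per-class F1 scores, while micro-F1 pools all decisions (for single-label multi-class problems it equals accuracy). A small self-contained sketch of the two metrics, not our exact evaluation code:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Compute macro- and micro-averaged F1 for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    per_class = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(labels)
    # micro pools every decision; with one label per sample this
    # reduces to overall accuracy
    micro = sum(tp.values()) / len(y_true)
    return macro, micro
```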

Results

Node classification & Visualization

After obtaining the embeddings for each node in the graph, we performed a multi-class node classification task to test the performance of our model. We trained an MLP classifier with a single hidden layer to classify embeddings into the four predefined categories. 80% of the data was used for training, and 20% was reserved for the test set. We achieved a weighted-average score of 92%.

We also visualized the embeddings by using t-SNE to reduce the high-dimensional embeddings into two dimensions. Compared to the randomized embeddings initialized in our Word2Vec, we can clearly see that our final embeddings show four clusters of nodes, where each cluster represents a single category. This shows that DeepWalk is capable of encoding unstructured graphs into something useful and tangible for further graph analytics.

Ethics

DeepWalk can be used to pinpoint manipulative groups inside a social network or a stock market. By understanding the relationships of individuals that focus on similar pages that promote racism, hate speech, and misinformation, we can investigate coordinated manipulation to prevent the spread of falsehood and hatred online.

The major stakeholders in this problem are social media companies that use DeepWalk to discover people with common interests and keep them tightly connected by recommending similar groups to join. Mistakes made by our algorithm can lead to poor recommendations and incorrect community predictions.

Division of Labor

Preprocessing + Random Walk algorithm + visualization:

Hangyu Du

Word2Vec:

Yiwen Chen

Node Classification:

Hangyu Du, Yiwen Chen, Nan Gao

Video Presentation + Poster:

Nan Gao

Links

Oral Presentation

Updates

"DeepWalk" by ShallowRun

Preprocess

We parsed the data from edge lists and node lists of the original dataset into an undirected homogeneous graph where nodes represent Facebook pages, and edges represent mutual likes between nodes. The graph has four categories: politicians, governmental organizations, television shows, and companies. It contains 22,470 nodes and 171,002 edges.

Challenges

It took us some time to find a near-optimal combination of hyperparameters and model setup that could demonstrate robust performance without making training too slow. Specifically, we treated walk length and the number of walks per node as hyperparameters for the random-walk part. For the Word2Vec part, we also had window_size and embedding_size to tune. For the classification part, we compared the performance of several non-linear classifiers, including MLP, SVM, KNN, Decision Tree, and Random Forest.
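The sweep can be organized as a simple grid; the candidate values below are illustrative, not the exact grid we searched:

```python
from itertools import product

# Candidate values for each hyperparameter (illustrative ranges)
walk_lengths = [5, 10, 20]
walks_per_node = [5, 10]
window_sizes = [2, 5]
embedding_sizes = [64, 128]

# Every combination becomes one training configuration
configs = list(product(walk_lengths, walks_per_node,
                       window_sizes, embedding_sizes))

# Each config would train and score one model, e.g.:
# for wl, wpn, ws, es in configs:
#     score = train_and_evaluate(wl, wpn, ws, es)  # hypothetical helper
```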

Reflection

How do you feel your project ultimately turned out? How did you do relative to your base/target/stretch goals?

Our base goal was to implement random walk and generate word embeddings from SkipGram/Word2Vec. Our target goal was to use the embeddings for node classification on one dataset, e.g. “Facebook Large Page-Page Network”. Our stretch goal was to visualize our embeddings on a scatter plot and see nodes of the same category grouped in the same cluster. Overall, we not only achieved our base goal of generating embeddings on one dataset, but also reached our target and stretch goals by applying the embeddings to real-world problems like community detection and by visualizing our results, with pages of the same category grouped together on the plot.

Did your model work out the way you expected it to?

Yes, it managed to extract meaningful information from the original unstructured graph, so that we can perform node classification and obtain satisfactory results.

How did your approach change over time? What kind of pivots did you make, if any? Would you have done differently if you could do your project over again?

Before the second checkoff with our TA, we used an external implementation of Word2Vec in our model. We later switched to our own Word2Vec to improve the originality of our implementation, while achieving roughly the same performance as the external version.

Since our dataset is relatively small, if we could do our project over again, we would further ensure that every observation from the original dataset has a chance of appearing in the training and test sets by using k-fold cross-validation (splitting the data into k groups, then training and testing our model k times) to combat overfitting.
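A minimal sketch of the k-fold index splitting we have in mind (contiguous folds for simplicity; a real run would shuffle the data first):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal contiguous folds.

    Each fold serves once as the test set while the rest train.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds
```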

What do you think you can further improve on if you had more time?

We noticed that generating random walks takes up a large portion of the processing time, and this step is highly parallelizable. If we had more time, we would try to accelerate this process with multithreading or the GPU to make the algorithm more scalable.

We could also improve by further quantifying our results, comparing macro-F1 and micro-F1 scores (the F1 score is the harmonic mean of precision and recall) with those of other baseline methods, as the researchers did in the paper. The baseline methods in the paper were SpectralClustering, Modularity, EdgeCluster, wvRN, and Majority.

What are your biggest takeaways from this project/what did you learn?

We learned that every model has its limitations, and it is our job to find the best model for a given application. For example, DeepWalk encodes only the graph structure and does not take node features into consideration. For a graph with rich information attached to its nodes, DeepWalk cannot incorporate that information into the final embeddings, causing a loss of information. Also, DeepWalk does not differentiate between edges and does not consider edge weights; even when a certain edge should be preferred over others, DeepWalk cannot take that into account in its random walk procedure. Lastly, DeepWalk has no mechanism to balance exploration vs. exploitation, which may make the algorithm more random and less robust.

Secondly, we learned that a model rarely performs at the level needed for production on the first attempt, and we might need to tweak the hyperparameters multiple times to achieve optimal performance.

We also learned it is important to improve the modularity of the implementation when designing it, so we can work simultaneously later to improve the efficiency of our collaboration.

Github Link: https://github.com/hangyu98/DeepWalk

Reflection:

Challenges:

It took us a lot of effort to understand the structure of the graph dataset we are using. Currently, the step of generating random walks takes up the most time in our model, and we are considering using multithreading to accelerate it in the future. We also need to consider how to split our data into training and test sets in a reasonable way, since the data is not evenly distributed across the categories. This might cause our model to classify certain categories very well but others poorly. We could also consider using another dataset with a more even distribution.
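One simple way to handle the uneven category distribution is a stratified split, sketched below (our own helper, not something we have implemented yet):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split sample indices so that each category keeps the same
    train/test ratio, despite uneven class sizes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # randomize within each category
        cut = int(len(idxs) * test_frac)
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return train, test
```

This guarantees that a rare category still contributes its proportional share to the test set instead of being swallowed by a naive random split.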

Insights: Are there any concrete results you can show at this point?

We can print out the embeddings of every node, which are stored in a dictionary, and we can print out the classification report for each category.

Plan: Are you on track with your project?

We need to figure out the best way to split the data so that our model classifies all categories well. We can also consider plotting a visualization of the embeddings, showing nodes of the same category in the same cluster. We are also considering testing this model on large-scale datasets; currently, the classification performance seems ideal, but it may need further tuning if we move on to larger datasets. We plan to tune Word2Vec first and visualize the result. By looking at how well the nodes are clustered, we will manually evaluate the Word2Vec model’s performance. Then we will test and tune the classifier.
