
Github Link: https://github.com/hangyu98/DeepWalk

Reflection:

Introduction:

We are implementing an existing paper, DeepWalk. DeepWalk learns social representations of a graph’s vertices by modeling a stream of short random walks. Here, a social representation is a set of features of the vertices that captures neighborhood similarity and community membership. The outcome of the model is a low-dimensional embedding for each node, which can be used for multiple downstream tasks such as node classification, link prediction, and community detection. This is an unsupervised learning problem because the latent representations of the nodes are not known beforehand.
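The core of the walk-generation step can be sketched in pure Python. This is a minimal illustration, not our actual implementation: it assumes the graph is an adjacency dictionary (node → list of neighbors), and follows the paper's scheme of shuffling the nodes each pass and starting one truncated walk per node.

```python
import random

def random_walk(graph, start, walk_length, rng=random):
    """Generate one truncated random walk starting from `start`.

    `graph` is assumed to be an adjacency dict: node -> list of neighbors.
    """
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(graph, num_walks, walk_length, seed=0):
    """Each pass shuffles the node order and starts one walk per node,
    as described in the DeepWalk paper."""
    rng = random.Random(seed)
    walks = []
    nodes = list(graph)
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length, rng))
    return walks

# Toy graph: a triangle (0, 1, 2) plus a pendant node 3
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = generate_walks(graph, num_walks=2, walk_length=5)
```

The resulting walks are treated like sentences and fed to a skip-gram model (e.g. gensim's Word2Vec) to produce the node embeddings.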

Challenges:

It took us a lot of effort to understand the structure of the graph dataset we are using. Currently, generating the random walks is the most time-consuming step in our model; we are considering multithreading to accelerate it in the future. We also need to consider how to split our data into training and testing sets in a reasonable way, since the data is not evenly distributed across categories. This might cause our model to classify certain categories very well but others poorly. Alternatively, we could use another dataset with a more even distribution.
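One way to handle the uneven label distribution is a stratified split, which keeps roughly the same proportion of each category on both sides. A minimal pure-Python sketch (the `labels` mapping and `test_ratio` here are illustrative assumptions; a library routine such as scikit-learn's `train_test_split` with `stratify` does the same job):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Split node ids into train/test lists, preserving (roughly) the
    per-category proportions.

    `labels` is assumed to map node id -> category.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for node, cat in labels.items():
        by_category[cat].append(node)

    train, test = [], []
    for cat, nodes in by_category.items():
        rng.shuffle(nodes)
        cut = max(1, int(len(nodes) * test_ratio))  # at least one test node per category
        test.extend(nodes[:cut])
        train.extend(nodes[cut:])
    return train, test

# Toy labels: category "a" is 3x more common than category "b"
labels = {i: ("a" if i < 30 else "b") for i in range(40)}
train, test = stratified_split(labels, test_ratio=0.2)
```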

Insights: Are there any concrete results you can show at this point?

We can print out the embedding of every node, which is stored in a dictionary, and the classification report for each category.
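For illustration, a per-category report can be computed directly from the true and predicted labels. This is a minimal stand-in for a library routine such as scikit-learn's `classification_report`; the label lists below are hypothetical:

```python
from collections import Counter

def per_category_report(y_true, y_pred):
    """Compute precision and recall per category from two parallel
    label lists."""
    true_counts = Counter(y_true)  # actual instances per category
    pred_counts = Counter(y_pred)  # predicted instances per category
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report = {}
    for cat in sorted(set(y_true) | set(y_pred)):
        precision = hits[cat] / pred_counts[cat] if pred_counts[cat] else 0.0
        recall = hits[cat] / true_counts[cat] if true_counts[cat] else 0.0
        report[cat] = {"precision": precision, "recall": recall}
    return report

# Hypothetical predictions for three categories
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]
report = per_category_report(y_true, y_pred)
```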

Plan: Are you on track with your project?

We need to figure out the best way to split the data so that our model classifies all categories well. We are also considering plotting a visualization of the embeddings, with nodes of the same category appearing in the same cluster. We are also considering testing the model on large-scale datasets. Currently, the classification performance looks good, but it may need further tuning if we move on to larger datasets. We plan to tune Word2Vec first and visualize the result; by looking at how well the nodes are clustered, we will manually evaluate the Word2Vec model’s performance. Then we will test and tune the classifier.
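The "how well are the nodes clustered" check we describe can also be quantified: if the embedding is good, nodes in the same category should be more similar to each other than to nodes in other categories. A minimal sketch (toy embeddings and labels are made up for illustration) comparing average intra-category vs. inter-category cosine similarity:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_separation(embeddings, labels):
    """Average cosine similarity within categories vs. across categories.
    A well-clustered embedding scores higher within than across."""
    intra, inter = [], []
    for u, v in combinations(embeddings, 2):
        sim = cosine(embeddings[u], embeddings[v])
        (intra if labels[u] == labels[v] else inter).append(sim)
    mean = lambda xs: sum(xs) / len(xs)
    return mean(intra), mean(inter)

# Toy embeddings: two tight clusters along different axes
embeddings = {
    0: [1.0, 0.1], 1: [0.9, 0.2],  # category "a"
    2: [0.1, 1.0], 3: [0.2, 0.9],  # category "b"
}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
intra, inter = cluster_separation(embeddings, labels)
```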
