Inspiration

Accompanying the rising popularity of social media platforms, the field of social network analysis has taken center stage as a key application area for graph theory and machine learning. Through our prior exposure to graph theory and ML at Rice, our team developed an interest in applying machine learning models to graph networks and decided to take on the Bill.com challenge to recommend businesses to one another.

What it does

Before explaining our model, let us briefly summarize the dataset. It represents a mathematical graph in which each node is a social media webpage and an edge between two nodes indicates that the two pages link to one another. Additional features are the category of every page (broadly split into four categories) and the contents of each page, represented as a binary ("one-hot") encoding over the top 5,000 common and meaningful words.
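As a concrete illustration of these word features, the Jaccard Index used later in our model compares two binary word vectors as the size of their intersection over the size of their union. A minimal sketch with a toy 10-word vocabulary (the real vectors have 5,000 entries):

```python
import numpy as np

def jaccard_index(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard Index of two binary word vectors: |A intersect B| / |A union B|."""
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(intersection / union) if union > 0 else 0.0

# Toy 10-word vocabulary standing in for the dataset's 5,000 words.
page_a = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
page_b = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])
print(jaccard_index(page_a, page_b))  # 3 shared words / 5 distinct words = 0.6
```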

Put simply, our model is a logistic regression that predicts whether an edge exists between two nodes based on the nodes' page types and the Jaccard Index computed from their respective one-hot word encodings.
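A minimal sketch of this setup, assuming two pairwise features (a same-page-type indicator and the Jaccard Index); the training pairs and numbers below are made up for illustration and are not taken from the dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pairwise features: [same_page_type (0/1), jaccard_index].
X = np.array([
    [1, 0.30], [1, 0.25], [1, 0.28], [1, 0.22],  # pairs with an edge
    [0, 0.12], [0, 0.08], [0, 0.05], [1, 0.10],  # pairs without an edge
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = edge exists between the pair

clf = LogisticRegression().fit(X, y)
# P(edge) for a same-type pair with high word overlap.
prob = clf.predict_proba([[1, 0.27]])[0, 1]
```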

How we built it

Prior to deciding on a model, our team conducted exploratory data analysis and made the following observations:

  • For pairs of nodes that share an edge, about 85% fall under the same page type.
  • For pairs of nodes that do not share an edge, about 70% have different page types.
  • The distributions of the Jaccard Index for words across pairs of nodes with and without an edge are similar, with a mean hovering around 0.16 and a standard deviation of 0.12.
  • Examining the connected components of the graph, there is one major component of 20,589 nodes, with the rest being significantly smaller (primarily single-digit sizes).
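The component-size observation above is the kind of result a few lines of networkx can reproduce; the toy graph below stands in for the real 20,000-plus-node network:

```python
import networkx as nx

# Toy undirected graph: one dominant component, one small one, one isolated node.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (4, 5)])
G.add_node(6)  # an isolated page

sizes = sorted((len(c) for c in nx.connected_components(G)), reverse=True)
print(sizes)  # [4, 2, 1]: one major component, the rest much smaller
```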

With these observations in mind, we decided that while the graph structure (degree, common neighbors, etc.) is important, the specific page type of a node may carry the most important information for predicting whether an edge exists between two nodes. At first, we trained a GNN in the hope of incorporating all of the information at hand. However, the interpretability and prediction accuracy of the GNN were relatively low compared to the logistic regression, so we adopted the logistic regression model for its significantly greater interpretability.

Challenges we ran into

The primary challenge of this project was that none of our team members had experience with machine learning in the context of graphs. Thus, representing the available data as numerical metrics compatible with the machine learning models at our disposal took a significant amount of time. Beyond feature extraction, our team was also interested in generating an interpretable predictive model from those features. To do this, we experimented with node degrees, the Jaccard Index, the distribution of words and degrees across page types, and other metrics of set similarity. On the implementation side, none of us had extensive experience with PyTorch or GNNs; despite that, we were able to implement a GNN for performance-comparison purposes.

Accomplishments that we're proud of

Despite our lack of prior experience with machine learning for graphs, we were able to combine our collective knowledge of graph theory, statistical metrics, and machine learning models to engineer features that allowed a classic logistic regression model to achieve a relatively high predictive accuracy of 85.339% on the final test set. We were also able to build a GNN model in a deep learning framework that we had never used before.

What we learned

Choose a model based on your data. A simple, interpretable machine learning model with carefully crafted features can sometimes triumph over complex "black-box" deep learning models in both performance and insight. On the technical side, we also learned more about PyTorch and GNNs, and saw the importance of applying knowledge from multiple fields (e.g., graph theory) as well as domain knowledge (an understanding of social media and websites) to solve data science problems.

What's next for Connecting The Dots

Due to the time constraints of the hackathon and our team's overall lack of experience in graph theory beyond the undergraduate discrete math and algorithms courses at Rice, we were only able to compute features that capture some aspects of the provided graph. Given more time, we hope to extract further features for the model: computing the Katz Index to capture the connectivity between nodes, estimating graph properties (degrees, common neighbors, etc.) for isolated nodes with a Bayesian approach, and leveraging model stacking to boost the performance of our predictions.
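The Katz Index mentioned above has a closed form, S = (I - bA)^-1 - I, which sums b^l-weighted walks of every length l >= 1 between each pair of nodes and converges when b is below the reciprocal of the adjacency matrix's largest eigenvalue. A sketch with a hypothetical 4-node adjacency matrix (not the contest data):

```python
import numpy as np

def katz_index(A: np.ndarray, beta: float = 0.05) -> np.ndarray:
    """Katz Index matrix S = (I - beta*A)^{-1} - I.

    Entry S[i, j] sums beta^l over all walks of length l >= 1 from i to j;
    converges only when beta < 1 / spectral_radius(A).
    """
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

# Toy adjacency matrix: a triangle (0-1-2) with a pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = katz_index(A)
# Higher S[i, j] suggests a stronger candidate edge between i and j.
```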
