Cluster and Co-occurrence Analysis of Star Wars Dataset

Get sorted into your Star Wars fan cluster now!

Inspiration

We wanted something more interesting than “yet another dashboard.” Since the dataset is synthetic and global clustering was weak (Hopkins ≈ 0.992), we reframed the task: show how the fandom connects. Instead of forcing blobs in 2-D, we map which things are loved together and let judges explore segments of fans.

What it does

An interactive co-occurrence network of Star Wars items (heroes, films, planets, ships, soundtracks).
Edges mean two items are liked together in the same review; thickness = Jaccard similarity.
A cluster selector filters the network to just one of our 9 K-Means segments, so you see each segment’s own “mini-galaxy.”
Live controls (min frequency / min co-mentions / min Jaccard / item types) + a search/neighbor explorer.
A small cohesion stat (avg Jaccard among a cluster’s top items) to justify the clustering.

How we built it

For two items A and B:

Jaccard(A, B) = #(reviews with A and B) / #(reviews with A or B)

It asks: of all reviews that mention either, what fraction mention both? That normalizes for popularity, so Luke doesn’t auto-pair with everything just because he’s common.
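As a rough illustration of that formula, here is a minimal sketch that treats each item as the set of review IDs mentioning it. The function name `jaccard` and the review IDs are made up for the example; they are not taken from the project code.

```python
def jaccard(reviews_with_a: set, reviews_with_b: set) -> float:
    """Jaccard similarity between two sets of review IDs."""
    union = reviews_with_a | reviews_with_b
    if not union:
        # Neither item is mentioned anywhere: define the similarity as 0.
        return 0.0
    return len(reviews_with_a & reviews_with_b) / len(union)


# Hypothetical review IDs, for illustration only.
luke = {1, 2, 3, 4, 5, 6, 7, 8}  # very common item
falcon = {1, 2, 9}               # shares 2 of its 3 reviews with Luke
x_wing = {10, 11}                # rare item
yavin = {10, 11, 12}             # rare item that almost always appears with x_wing

popular_pair = jaccard(luke, falcon)  # 2 shared / 9 in the union
rare_pair = jaccard(x_wing, yavin)    # 2 shared / 3 in the union

print(f"Luke / Falcon:  {popular_pair:.3f}")
print(f"X-wing / Yavin: {rare_pair:.3f}")
```

Luke shares two reviews with the Falcon, the same raw count as the rare pair, but the rare pair scores higher because the union is much smaller. That is the popularity normalization in action.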

Challenges we ran into

Data prep (≈36k rows): parse all fav_* columns; normalize oddities (the dataset spells hero as heroe).
Counts: per-item frequency and item–item co-mentions per row.
Edge weights: compute Jaccard for each pair; threshold by support + Jaccard.
Graph: render with pyvis/vis.js (force-directed layout), color by type, size by log-frequency.
Clustering: one-hot encode favorites → K-Means (k=9); save kmeans_model.joblib, encoder.joblib, top_features.joblib.
Cluster-aware view: assign each row a cluster at runtime, then rebuild the co-occurrence network only from that subset.
Validation: as a sanity check, compare the avg Jaccard among a cluster’s top items (computed within that cluster) against the same stat on the full dataset.
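The counting, edge-weighting, cluster-filtering, and validation steps above can be sketched in a few lines of standard-library Python. This is a simplified sketch under stated assumptions, not the project's actual code: the function names `build_edges` and `cohesion`, the threshold defaults, the toy rows, and the hard-coded cluster labels (standing in for K-Means predictions) are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_edges(rows, min_freq=2, min_co=2, min_jaccard=0.1):
    """rows: iterable of item-name sets, one set per review.

    Returns (item frequencies, {(a, b): jaccard}) with a < b alphabetically.
    """
    freq, co = Counter(), Counter()
    for items in rows:
        items = sorted(set(items))          # sorted so each pair has one canonical key
        freq.update(items)                  # per-item frequency
        co.update(combinations(items, 2))   # item-item co-mentions in this row
    edges = {}
    for (a, b), both in co.items():
        # Support thresholds: drop rare items and rarely co-mentioned pairs.
        if freq[a] < min_freq or freq[b] < min_freq or both < min_co:
            continue
        j = both / (freq[a] + freq[b] - both)   # |A and B| / |A or B|
        if j >= min_jaccard:
            edges[(a, b)] = j
    return freq, edges


def cohesion(rows, top_n=3):
    """Average Jaccard over all pairs of the top_n most frequent items."""
    freq, edges = build_edges(rows, min_freq=1, min_co=1, min_jaccard=0.0)
    top = sorted(name for name, _ in freq.most_common(top_n))
    pairs = list(combinations(top, 2))
    if not pairs:
        return 0.0
    # Pairs that never co-occur are absent from edges and count as 0.
    return sum(edges.get(p, 0.0) for p in pairs) / len(pairs)


# Toy data, for illustration only.
rows = [
    {"Luke", "Falcon"},
    {"Luke", "Falcon"},
    {"Luke", "Falcon", "Tatooine"},
    {"Luke", "Tatooine"},
    {"Luke"},
    {"Vader", "Death Star"},
    {"Vader", "Death Star"},
]
labels = [0, 0, 0, 0, 0, 1, 1]  # hypothetical stand-in for K-Means cluster assignments

freq, edges = build_edges(rows)

# Cluster-aware view: rebuild everything from one cluster's rows only.
cluster_rows = [r for r, c in zip(rows, labels) if c == 1]
global_cohesion = cohesion(rows, top_n=2)
cluster_cohesion = cohesion(cluster_rows, top_n=2)

for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
print("global:", round(global_cohesion, 3), "cluster 1:", round(cluster_cohesion, 3))
```

In this toy data the cluster's top items are more tightly linked than the global top items, which is the kind of gap the validation step looks for. A real run would take the labels from the saved K-Means model and pass the surviving edges to the graph renderer.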

Accomplishments that we're proud of

A segment-aware network that’s fast and clean to use.
A tiny, honest metric (avg Jaccard) that backs up the qualitative picture.
A polished Streamlit UX that a judge can understand in seconds.

What we learned

When global structure is weak, pairwise structure (graphs) still sings.
In short demos, explainability beats raw accuracy: Jaccard + networks are judge-friendly.