Inspiration

We wanted something more interesting than “yet another dashboard.” Since the dataset is synthetic and global clustering was weak (Hopkins ≈ 0.992), we reframed the task: show how the fandom connects. Instead of forcing blobs in 2-D, we map which things are loved together and let judges explore segments of fans.

What it does

  • An interactive co-occurrence network of Star Wars items (heroes, films, planets, ships, soundtracks).
  • Edges mean two items are liked together in the same review; thickness = Jaccard similarity.
  • A cluster selector filters the network to just one of our 9 K-Means segments, so you see each segment’s own “mini-galaxy.”
  • Live controls (min frequency / min co-mentions / min Jaccard / item types) + a search/neighbor explorer.
  • A small cohesion stat (avg Jaccard among a cluster’s top items) to justify the clustering.

How we built it

For two items A and B, Jaccard = #(reviews with A & B) / #(reviews with A or B).

It asks: of all reviews that mention either, what fraction mention both? That normalizes popularity, so Luke doesn’t auto-pair with everything just because he’s common.
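
As a small worked sketch of that formula: the review IDs and item sets below are made up for illustration and are not taken from the dataset, and `jaccard` is a hypothetical helper name.

```python
def jaccard(reviews_a: set, reviews_b: set) -> float:
    """Share of reviews mentioning either item that mention both."""
    union = reviews_a | reviews_b
    if not union:
        return 0.0  # neither item appears anywhere
    return len(reviews_a & reviews_b) / len(union)


# Hypothetical review IDs per item (illustrative only).
luke = {1, 2, 3, 4, 5, 6, 7, 8}  # popular: appears in 8 reviews
tatooine = {1, 2, 9}
falcon = {9, 10}

luke_tatooine = jaccard(luke, tatooine)      # 2 shared out of 9 reviews
tatooine_falcon = jaccard(tatooine, falcon)  # 1 shared out of 4 reviews
```

In this toy data Luke shares more raw reviews with Tatooine (2) than Tatooine shares with the Falcon (1), yet the second pair scores higher (1/4 vs 2/9), because Luke's many unrelated reviews inflate the denominator.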

Challenges we ran into

  • Data prep (≈36k rows): parse all fav_* columns; normalize oddities (the dataset spells hero as heroe).
  • Counts: per-item frequency and item–item co-mentions per row.
  • Edge weights: compute Jaccard for each pair; threshold by support + Jaccard.
  • Graph: render with pyvis/vis.js (force-directed layout), color by type, size by log-frequency.
  • Clustering: one-hot encode favorites → K-Means (k=9). Save kmeans_model.joblib, encoder.joblib, top_features.joblib.
  • Cluster-aware view: assign each row a cluster at runtime, then rebuild the co-occurrence network only from that subset.
  • Validation: show avg within-cluster Jaccard for the cluster’s top items vs global as a sanity check.
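
The counting, edge-weight, and validation steps above can be sketched roughly as follows. This is a minimal standard-library illustration, not our actual code: `build_edges`, `cohesion`, the threshold parameter names, and the toy rows are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def build_edges(rows, min_freq=1, min_co=1, min_jaccard=0.0):
    """Return {(a, b): jaccard} for item pairs that pass all thresholds.

    rows: iterable of item lists, one per review (assumed shape).
    """
    freq = Counter()  # reviews mentioning each item
    co = Counter()    # reviews mentioning both items of a pair
    for items in rows:
        unique = sorted(set(items))         # count each item once per row
        freq.update(unique)
        co.update(combinations(unique, 2))  # pair keys are in sorted order

    edges = {}
    for (a, b), both in co.items():
        if freq[a] < min_freq or freq[b] < min_freq or both < min_co:
            continue
        score = both / (freq[a] + freq[b] - both)  # |A and B| / |A or B|
        if score >= min_jaccard:
            edges[(a, b)] = score
    return edges


def cohesion(edges, top_items):
    """Average Jaccard over all pairs of top_items; absent pairs count as 0."""
    pairs = list(combinations(sorted(top_items), 2))
    if not pairs:
        return 0.0
    return sum(edges.get(pair, 0.0) for pair in pairs) / len(pairs)


# Toy rows standing in for parsed fav_* values (illustrative only).
rows = [
    ["Luke", "Tatooine", "X-wing"],
    ["Luke", "Tatooine"],
    ["Luke", "Hoth"],
    ["Leia", "Hoth"],
]
edges = build_edges(rows, min_co=2)  # keeps only pairs co-mentioned twice
score = cohesion(build_edges(rows), ["Luke", "Tatooine", "Hoth"])
```

For a cluster-aware view, the same function would be called on only the rows assigned to the selected cluster. Note that `cohesion` is computed here on unthresholded edges, so that filtered-out pairs are not silently treated as zero.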

Accomplishments that we're proud of

  • A segment-aware network that’s fast and clean to use.
  • A tiny, honest metric (avg Jaccard) that backs up the qualitative picture.
  • A polished Streamlit UX that a judge can understand in seconds.

What we learned

  • When global structure is weak, pairwise structure (graphs) still sings.
  • Explainability beats raw accuracy in short demos: Jaccard + networks are judge-friendly.

What's next for Cluster and Co-occurence Analysis of Star Wars Dataset

  • Community detection (Louvain) to auto-label network themes.
  • A recommender: given a user’s likes, show the next high-Jaccard neighbors.
  • Session logging to see how interactions shift clusters over time.
