Daki Life — A Living Knowledge Graph for Your Mind

Problem Statement

Journaling is one of the most evidence-backed practices for mental clarity and personal growth, yet most people abandon the habit because it is inconvenient to keep up and hard to look back on. It requires carving out dedicated time, staring at a blank page, and knowing what to say. Even then, each entry exists in isolation from what came before, making it difficult to spot patterns, connect ideas, and see how much you’ve actually grown.

Daki Life makes journaling as convenient as possible while connecting all of your past ideas together. It supports two ways to write: journaling during the break after a focus session, and journaling independently, either from a prompt or as free writing. Over time, your notes self-organize into clusters (categories) with subtopics nested inside. Your ideas don’t disappear. They find each other.


What It Does

Daki Life is a focus journal that builds a living semantic knowledge graph from your reflection notes.

  1. Focus — Pomodoro-style timer with structured reflection prompts at the end of each work block.
  2. Graph — An interactive semantic knowledge graph that automatically maps your notes into clustered life themes like Health, Creativity, and Relationships.
  3. Home — At-a-glance stats: sessions logged, top clusters by volume, and time-tracked categories.

Each new note updates the graph in under 3 seconds at a corpus size of roughly 250 notes.


The Algorithm Stack

ClusterNode Object

The core of the backend operates on the ClusterNode object, which represents one node of the semantic journal graph.

| Field | Type | Description |
| --- | --- | --- |
| id | str | Unique UUID for the ClusterNode |
| parent_id | Optional[str] | ClusterNode parent (for traversing backwards) |
| depth | int | Depth level of the ClusterNode |
| note_ids | list[str] | All note IDs within the ClusterNode |
| children | list[ClusterNode] | Children ClusterNodes (for traversing forwards) |
| coordinates_2d | dict[str, dict] | 2D coordinates for each note within the ClusterNode |
| semantic_centroid | Optional[np.ndarray] | Each note's text content is given a 1536-dimensional vector; the semantic centroid is the average of all these vectors |
| label | Optional[str] | All note texts undergo TF-IDF keyword extraction, producing a raw keyword list, which is then passed to a GPT-4o-mini API call to generate a label |
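
As a rough illustration, the fields above could map onto a Python dataclass like the one below. This is a sketch, not the project's actual class: the default values, the `eq=False` setting, and the `compute_centroid` helper are assumptions added for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np


@dataclass(eq=False)  # eq=False: comparing np.ndarray fields with == is ambiguous
class ClusterNode:
    depth: int
    parent_id: Optional[str] = None
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    note_ids: List[str] = field(default_factory=list)
    children: List["ClusterNode"] = field(default_factory=list)
    coordinates_2d: Dict[str, dict] = field(default_factory=dict)
    semantic_centroid: Optional[np.ndarray] = None
    label: Optional[str] = None


def compute_centroid(node: ClusterNode, embeddings: Dict[str, np.ndarray]) -> np.ndarray:
    """Hypothetical helper: average the embedding vectors of the node's notes."""
    vectors = np.stack([embeddings[note_id] for note_id in node.note_ids])
    return vectors.mean(axis=0)


# Tiny 2D vectors stand in for the real 1536-dimensional embeddings.
root = ClusterNode(depth=0, note_ids=["n1", "n2"])
child = ClusterNode(depth=1, parent_id=root.id, note_ids=["n1"])
root.children.append(child)
root.semantic_centroid = compute_centroid(
    root, {"n1": np.array([1.0, 0.0]), "n2": np.array([0.0, 1.0])}
)
```

Keeping both `parent_id` and `children` on the node is what allows traversal in either direction without a separate index.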

The ML pipeline is the product.

Embedding

Notes are embedded at write-time via text-embedding-3-small (1536 dimensions) and stored in Supabase + pgvector. No retrieval-time re-embedding — every note already lives in semantic space.
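
A minimal sketch of the write-time embedding step is shown below. It assumes a `client` object that behaves like the official `openai.OpenAI()` client; the `embed_note` function name and the offline fake client are hypothetical, added so the example runs without network access. Persisting the vector to Supabase is left out.

```python
from types import SimpleNamespace
from typing import List

EMBEDDING_MODEL = "text-embedding-3-small"


def embed_note(client, text: str) -> List[float]:
    """Embed one note at write-time; `client` is expected to act like openai.OpenAI()."""
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=text)
    return list(response.data[0].embedding)


class _FakeEmbeddings:
    """Offline stand-in that mimics the shape of the embeddings response."""

    def create(self, model: str, input: str):
        vector = [0.0] * 1536
        vector[len(input) % 1536] = 1.0
        return SimpleNamespace(data=[SimpleNamespace(embedding=vector)])


fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vector = embed_note(fake_client, "Ran 5k before work, felt clear-headed.")
```

Because the vector is computed once when the note is saved, later graph rebuilds only need to read it back.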

UMAP — Dimensionality Reduction

We run UMAP in two separate passes:

  • 1536D → 8D for density-aware clustering (preserving local neighbourhood structure)
  • 1536D → 2D for graph layout (min_dist=0.1), so semantically similar notes appear physically close

UMAP is preferred over t-SNE because it tends to preserve more of the global structure of the embedding manifold: clusters that are related stay near each other in 2D, not just internally tight.
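
The two passes could be wired up with `umap-learn` roughly as follows. Only the target dimensions and `min_dist=0.1` for the layout pass come from the description above; `min_dist=0.0` for the clustering pass, the cosine metric, and the fixed `random_state` are assumptions, and `two_pass_umap` is a hypothetical name.

```python
# Parameters for the two passes. Values not stated in the text are assumptions.
UMAP_PASSES = {
    "clustering": {"n_components": 8, "min_dist": 0.0, "metric": "cosine"},
    "layout": {"n_components": 2, "min_dist": 0.1, "metric": "cosine"},
}


def two_pass_umap(embeddings, random_state: int = 42):
    """Return (8D array for clustering, 2D array for layout).

    `embeddings` is expected to be an (N, 1536) NumPy array.
    """
    import umap  # from the umap-learn package, imported lazily

    reduced = {}
    for name, params in UMAP_PASSES.items():
        reducer = umap.UMAP(random_state=random_state, **params)
        reduced[name] = reducer.fit_transform(embeddings)
    return reduced["clustering"], reduced["layout"]
```

Both passes start from the original 1536D vectors rather than chaining 8D into 2D, so the layout is not affected by information lost in the clustering projection.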

HDBSCAN — Recursive Hierarchical Clustering

HDBSCAN runs recursively on the 8D output to produce a multi-depth cluster tree:

Root  ("Life")
  ├── Health
  │     ├── Running
  │     └── Sleep
  ├── Creativity
  └── Relationships
        └── Deep Connection

min_cluster_size adapts by depth:

$$k_{\min} = \begin{cases} \max(10,\; \lfloor n/16 \rfloor) & \text{depth} = 0 \\ 3 & \text{depth} > 0 \end{cases}$$

Outlier notes (HDBSCAN label -1) are never force-assigned to the nearest cluster — they surface as standalone nodes at the depth where they first became noise.
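
The recursion and the noise rule can be sketched as below. The clusterer is injected as `cluster_fn` so the example runs without the `hdbscan` package. In practice it might wrap something like `hdbscan.HDBSCAN(min_cluster_size=k_min).fit_predict(points)`. The stopping rule (`max_depth` and the `2 * k_min` check), the dict-based node, and the reading of `n` as the number of notes being split are all assumptions.

```python
from typing import Callable, Dict, List

ClusterFn = Callable[[List[str], int], List[int]]


def min_cluster_size(depth: int, n: int) -> int:
    """k_min from the formula above; n is assumed to be the number of notes being split."""
    return max(10, n // 16) if depth == 0 else 3


def build_tree(note_ids: List[str], cluster_fn: ClusterFn,
               depth: int = 0, max_depth: int = 3) -> dict:
    """Recursively split notes; label -1 marks noise and is never reassigned."""
    node = {"depth": depth, "note_ids": list(note_ids), "children": [], "outliers": []}
    k_min = min_cluster_size(depth, len(note_ids))
    # Assumed stopping rule: stop at max_depth or when two clusters cannot fit.
    if depth >= max_depth or len(note_ids) < 2 * k_min:
        return node

    groups: Dict[int, List[str]] = {}
    outliers: List[str] = []
    for note_id, label in zip(note_ids, cluster_fn(note_ids, k_min)):
        if label == -1:
            outliers.append(note_id)
        else:
            groups.setdefault(label, []).append(note_id)

    if len(groups) < 2:  # no real split found, so this node stays a leaf
        return node

    node["outliers"] = outliers  # standalone nodes at the depth where they became noise
    for label in sorted(groups):
        node["children"].append(build_tree(groups[label], cluster_fn, depth + 1, max_depth))
    return node


def fake_cluster_fn(note_ids: List[str], k_min: int) -> List[int]:
    """Toy stand-in for HDBSCAN: cluster by the first letter of the id, 'x' is noise."""
    return [{"a": 0, "b": 1}.get(note_id[0], -1) for note_id in note_ids]


ids = [f"a{i}" for i in range(20)] + [f"b{i}" for i in range(20)] + ["x0"]
tree = build_tree(ids, fake_cluster_fn)
```

In this toy run the root splits into two children and keeps `x0` as an outlier at depth 0, while each child stays a leaf because the fake clusterer finds only one group inside it.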

C-TF-IDF — Discriminative Label Extraction

Each cluster is treated as a single document. C-TF-IDF surfaces words that are specific to a cluster — high within it, rare across siblings:

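$$\text{score}(\text{term}, \text{cluster}) = \text{tf}(\text{term}, \text{cluster}) \times \text{idf}(\text{term}, \text{all clusters})$$

where

$$\text{tf} = 1 + \ln(\text{count}), \qquad \text{idf} = \ln\left(\frac{1 + m}{1 + \text{df}}\right) + 1$$

Here count is the number of times the term occurs in the cluster, m is the number of clusters being compared, and df is the number of those clusters that contain the term.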

Top-10 n-grams per cluster are passed to GPT-4o-mini for final 1–3 word human-readable labels.
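
The scoring formula can be checked with a small pure-Python sketch. It works on unigrams only for brevity, whereas the pipeline described above uses n-grams. The function names are hypothetical, and `m` is taken to be the number of clusters passed in.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(cluster_tokens: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score every term per cluster, treating each cluster as one document."""
    m = len(cluster_tokens)
    counts = {name: Counter(tokens) for name, tokens in cluster_tokens.items()}
    # df[term] = number of clusters in which the term appears at least once
    df = Counter(term for counter in counts.values() for term in counter)
    scores = {}
    for name, counter in counts.items():
        scores[name] = {
            term: (1 + math.log(count)) * (math.log((1 + m) / (1 + df[term])) + 1)
            for term, count in counter.items()
        }
    return scores


def top_terms(term_scores: Dict[str, float], k: int = 10) -> List[str]:
    """Return the k highest-scoring terms for one cluster."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:k]


scores = c_tf_idf({
    "Health": ["run", "run", "sleep"],
    "Creativity": ["paint", "sleep"],
})
```

In this toy corpus "sleep" appears in both clusters, so its idf collapses to 1 and the cluster-specific words "run" and "paint" rank first.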

Identity Persistence — Jaccard Matching

On every rebuild, new clusters are matched to old DB records via Jaccard similarity on note membership:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad \text{match if } J > 0.5$$

Labels are reused when a cluster's centroid hasn't drifted more than cosine distance 0.05 — keeping your graph visually stable as new notes arrive.
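
A minimal sketch of the matching step and the drift guard follows. The function names are hypothetical, and picking the single best-scoring old cluster for each new one is an assumption: the text only states the `J > 0.5` threshold and the 0.05 cosine-distance limit.

```python
import math
from typing import Dict, Optional, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two note-id sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_clusters(new: Dict[str, Set[str]], old: Dict[str, Set[str]],
                   threshold: float = 0.5) -> Dict[str, Optional[str]]:
    """Map each new cluster id to the best old id with J > threshold, else None."""
    matches: Dict[str, Optional[str]] = {}
    for new_id, new_notes in new.items():
        best_id, best_score = None, threshold
        for old_id, old_notes in old.items():
            score = jaccard(new_notes, old_notes)
            if score > best_score:  # strict inequality, as in the formula above
                best_id, best_score = old_id, score
        matches[new_id] = best_id
    return matches


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def can_reuse_label(old_centroid, new_centroid, max_drift: float = 0.05) -> bool:
    """Reuse the stored label only if the centroid has not drifted past max_drift."""
    return cosine_distance(old_centroid, new_centroid) <= max_drift


old = {"health": {"n1", "n2", "n3"}, "art": {"n7", "n8"}}
new = {"tmp-a": {"n1", "n2", "n3", "n4"}, "tmp-b": {"n8", "n9", "n10"}}
matches = match_clusters(new, old)
```

Here `tmp-a` inherits the identity of `health` (J = 0.75), while `tmp-b` overlaps `art` too little (J = 0.25) and is treated as a new cluster.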

Cosine Similarity Edges

$$\text{sim}(a, b) = \frac{a \cdot b}{|a|\,|b|}$$

Computed efficiently via L2-normalized matrix multiplication:

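import numpy as np
# embeddings: (N, 1536) array; after row-wise L2 normalization, dot products equal cosine similarity
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim_matrix = normed @ normed.T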

Each note connects to its top-3 nearest neighbours within its leaf cluster. Cluster centroids connect to their top-3 sibling clusters to form the graph's backbone edges.
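
The top-3 neighbour rule could be applied to one leaf cluster as sketched below. The `top_k_edges` name is hypothetical, and storing each connection once as an undirected `(i, j)` pair is an assumption about how edges are deduplicated.

```python
import numpy as np


def top_k_edges(embeddings: np.ndarray, k: int = 3):
    """Connect every row to its k most similar rows; return sorted undirected edges."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a note is never its own neighbour
    k = min(k, len(embeddings) - 1)
    edges = set()
    for i in range(len(embeddings)):
        for j in np.argsort(sim[i])[::-1][:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)


# Two tight pairs of 2D vectors stand in for real note embeddings.
notes = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
edges = top_k_edges(notes, k=1)
```

The same function can be reused on cluster centroids to build the sibling backbone edges.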



Cluster Metrics

Four scores are computed per cluster and min-max normalized across the full tree so they're always comparable regardless of corpus size.

Taxonomic Complexity — how much a cluster has structurally fractured into specific sub-thoughts. Instead of counting raw depth, it measures how deeply nested and populated the subtree is:

$$\text{TC} = \sum_{i \,\in\, \text{sub-clusters}} \bigl(\text{depth}_i \times \log(\text{note\_count}_i)\bigr)$$
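
As a worked example, the sum can be computed with a short tree walk. The formula leaves two details open, so this sketch assumes that depth is measured relative to the cluster being scored (its direct children have depth 1) and that the logarithm is natural. The dict-based nodes are a simplification for illustration.

```python
import math


def taxonomic_complexity(cluster: dict) -> float:
    """Sum depth_i * ln(note_count_i) over every sub-cluster below `cluster`."""
    total = 0.0
    stack = [(child, 1) for child in cluster["children"]]
    while stack:
        node, rel_depth = stack.pop()
        if node["note_ids"]:  # ln(0) is undefined, so empty nodes are skipped
            total += rel_depth * math.log(len(node["note_ids"]))
        stack.extend((child, rel_depth + 1) for child in node["children"])
    return total


def make_node(n_notes: int, children=()):
    """Hypothetical helper that builds a node with n_notes placeholder ids."""
    return {"note_ids": [f"n{i}" for i in range(n_notes)], "children": list(children)}


health = make_node(15, [make_node(10, [make_node(3)]), make_node(5)])
tc = taxonomic_complexity(health)
```

For this tree the score is 1·ln(10) + 2·ln(3) + 1·ln(5), so the deeper three-note sub-cluster contributes almost as much as the ten-note one above it.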

Information Density — vocabulary richness weighted by term specificity:

$$ID = \frac{U}{\sqrt{T}} \times \frac{1}{K} \sum_{k=1}^{K} w_k$$

where U = unique token types with non-zero TF-IDF weight, T = total raw word count, and w_1 through w_K are the top K TF-IDF scores (K = 8).
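
A small sketch of this score is given below. It takes precomputed TF-IDF weights as input. Averaging over the available weights when a cluster has fewer than K of them is an assumption, as is returning 0 for an empty cluster.

```python
import math
from typing import Dict, List


def information_density(tokens: List[str], tfidf: Dict[str, float], k: int = 8) -> float:
    """ID = U / sqrt(T) * mean(top-k TF-IDF weights)."""
    total_words = len(tokens)  # T
    if total_words == 0:
        return 0.0
    # U: unique token types whose TF-IDF weight is non-zero
    unique_weighted = sum(1 for term in set(tokens) if tfidf.get(term, 0.0) > 0)
    top = sorted(tfidf.values(), reverse=True)[:k]
    mean_top = sum(top) / len(top) if top else 0.0
    return unique_weighted / math.sqrt(total_words) * mean_top


density = information_density(
    tokens=["a", "b", "a", "c"],
    tfidf={"a": 0.5, "b": 0.2, "c": 0.0},
    k=2,
)
```

With U = 2, T = 4, and top weights 0.5 and 0.2, the example evaluates to (2 / 2) × 0.35 = 0.35.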

Semantic Cohesion — the mean cosine similarity of all note embeddings to the cluster centroid. High scores mean notes are tightly focused around a single idea:

$$\text{Cohesion} = \frac{1}{N} \sum_{i=1}^{N} \frac{v_i \cdot c}{|v_i|\,|c|}$$

Semantic Divergence — how distinct a cluster's thinking is from your overall baseline. Cosine distance between the cluster centroid c and the global centroid c_global (the average of every note in your database):

$$\text{Divergence} = 1 - \frac{c \cdot c_{global}}{|c|\,|c_{global}|}$$

A high divergence score signals niche interests or exploratory ideas far from your average thought.
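
Semantic Cohesion and Semantic Divergence both reduce to cosine computations against a centroid, which can be sketched with NumPy as follows. The function names are hypothetical, and 2D vectors stand in for the 1536-dimensional embeddings.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def semantic_cohesion(vectors: np.ndarray) -> float:
    """Mean cosine similarity of each note vector to the cluster centroid."""
    centroid = vectors.mean(axis=0)
    return float(np.mean([cosine(v, centroid) for v in vectors]))


def semantic_divergence(centroid: np.ndarray, global_centroid: np.ndarray) -> float:
    """Cosine distance between a cluster centroid and the global centroid."""
    return 1.0 - cosine(centroid, global_centroid)


tight = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
spread = np.array([[1.0, 0.0], [0.0, 1.0]])

cohesion_tight = semantic_cohesion(tight)
cohesion_spread = semantic_cohesion(spread)
divergence = semantic_divergence(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Identical notes give a cohesion of 1, two orthogonal notes give about 0.71, and a centroid orthogonal to the global centroid gives a divergence of 1.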

All four scores are normalized to [0, 1]:

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
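
A minimal sketch of the normalization step is shown below. Mapping every value to 0 when all clusters share the same raw score is an assumption, added to avoid a division by zero that the formula does not address.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale raw metric values to [0, 1] across all clusters in the tree."""
    lo, hi = min(values), max(values)
    if hi == lo:  # assumed fallback: identical scores all map to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2.0, 4.0, 6.0])
```

The lowest-scoring cluster always lands at 0 and the highest at 1, so the scale is relative to the current tree rather than absolute.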

Design Decisions

| Problem | Algorithmic Choice | Why |
| --- | --- | --- |
| Flat embeddings → spatial layout | UMAP (two-pass) | Faster than t-SNE; preserves global manifold structure |
| Finding themes without labels | HDBSCAN | Noise-robust; no need to pre-specify $k$ |
| Cluster labels that generalize | C-TF-IDF + GPT-4o-mini | Discriminative, not just frequent |
| Stable UX across rebuilds | Jaccard matching + centroid drift guard | Prevents graph thrashing on incremental updates |
| Graph layout performance | D3 force sim + UMAP 2D coords | Physics sim only for collision; semantic positions pre-computed |

The full rebuild runs in $O(N^2 \cdot D)$ on cosine similarity, with the heavy clustering passes kept bounded by first reducing the dimensions from 1536 to 8.


Tech Stack

  • Mobile — React Native (Expo), TypeScript, Expo Router
  • Graph rendering — react-native-svg, D3 force simulation
  • API — Node.js + Express
  • ML sidecar — Python, FastAPI, umap-learn, hdbscan, scikit-learn
  • LLM — OpenAI gpt-4o-mini (labels), text-embedding-3-small (embeddings)
  • Database — Supabase (Postgres + pgvector + Auth + Realtime)

Roadmap

  • Cross-user graph diffing — surface notes that are semantically close to another user's clusters (opt-in)
  • Temporal drift tracking — visualize how your semantic clusters shift week over week
  • On-device UMAP — eliminate the sidecar round-trip for small corpora using ONNX-exported models

Every note you write makes the graph smarter.
