Flâneur

Above is a map of the links people save on Curius, laid out semantically: nearby dots are pages that mean similar things. You can pan around it, click any dot to get its neighbors, and search it by meaning.

Curius is social bookmarking: you save links, follow people, see what they saved. Your feed only goes as far as the people you follow, and the interesting saves are usually a hop or two past that. So I crawled the public graph to see all of it.

The crawl

Curius profiles are backed by a JSON API. One GET returns everything a person has ever saved:

GET curius.app/api/users/2138/links

{"userSaved": [{
  "id": 224407,
  "link": "https://avc.com/2009/10/business-model-jujutsu",
  "title": "Business Model Jujutsu - AVC",
  "createdBy": 2138,
  "createdDate": "2026-07-01T13:12:39.344Z",
  ...
}]}

Each entry is one fact: person 2138 saved that URL. One GET gets you one person, and their follow-list points at more people, so the crawler just kept going — 6,259 profiles, every URL normalized and deduped into one table. Once it's all in one place there are two questions you can ask it. Who saves the same things? Which pages mean similar things?
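A minimal sketch of that loop, under stated assumptions: `fetch_links` and `fetch_following` are hypothetical callables standing in for the HTTP requests (only the links endpoint appears above), and the normalization rules in `normalize_url` are illustrative, not necessarily the ones the crawler used.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Collapse trivially different spellings of one URL (assumed rules)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string, since it can select a different page.
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def crawl(seed_id, fetch_links, fetch_following):
    """Breadth-first walk over the follow graph, one fetch per person."""
    seen = {seed_id}
    queue = deque([seed_id])
    saves = set()  # (user_id, normalized_url) pairs; the set dedupes them
    while queue:
        user_id = queue.popleft()
        for url in fetch_links(user_id):
            saves.add((user_id, normalize_url(url)))
        for other in fetch_following(user_id):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen, saves
```

The `seen` set is what keeps the walk finite: follow lists point back at each other, so each profile is queued only the first time it is reached.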

The save graph

The people index is a self-join over saves:

SELECT a.person_id AS person_a, b.person_id AS person_b, COUNT(*) AS shared
FROM saves a JOIN saves b
  ON a.link_id = b.link_id AND a.person_id < b.person_id
GROUP BY 1, 2 ORDER BY shared DESC;

-- 1419 × 2138 -> 170 shared links
-- 1528 × 2096 -> 161
-- 1296 × 2138 -> 159

That join produces about 3.1 million overlap edges, and the most-connected person shares saves with 3,311 others. Followers are who you chose; this is whose libraries actually overlap yours, whether you know them or not.
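The same count can be sketched outside SQL by inverting the table: group savers by link, then count every pair within each group. This is an illustration of what the join computes, assuming `saves` arrives as `(person_id, link_id)` rows; it is not the code that produced the numbers above.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_save_counts(saves):
    """Count shared links per pair of people from (person_id, link_id) rows."""
    savers = defaultdict(set)
    for person_id, link_id in saves:
        savers[link_id].add(person_id)
    shared = Counter()
    for people in savers.values():
        # sorted() yields pairs with a < b, mirroring a.person_id < b.person_id
        for a, b in combinations(sorted(people), 2):
            shared[(a, b)] += 1
    return shared
```

`shared.most_common()` then gives the same ordering as `ORDER BY shared DESC`.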

I like this layer because it works before any ML. A link matters because different people saved it. A person is central because their saves overlap everyone's. A recommendation can start here instead of from popularity: walk to the people near you, take what they saved that you haven't seen.

The fetch funnel

The second question — which pages mean similar things — needs the pages themselves, so every link gets fetched and stripped to readable text. Most of the web doesn't cooperate: of 41,446 links, 5,830 sit behind bot walls or paywalls, 1,998 are dead, and 273 timed out. After extraction, 22,358 pages have usable text and 20,494 end up embedded. The rest just don't get a dot.

Chunking

Embedding those 20,494 pages surfaced the first real bug. My first pass ran each page through a small Qwen3 embedding model on Workers AI, feeding it whatever fit in the model's window. For a 900-word blog post that's the whole page. For the things people actually save — 12,000-word essays, papers, whole personal sites — it's the intro, so long pages got placed by how they open instead of what they're about.

The fix, rolling through the corpus now as a second embedding lane: split long pages into passages, embed each passage, and pool the vectors back into one per link, weighted by passage length. The result is still one dot per link, but placed by the whole page. The map cuts over only once the lane finishes and the new vectors beat the old ones on a neighbor-quality check.
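A rough sketch of split-embed-pool, with assumptions flagged: `embed` is a hypothetical stand-in for the model call, passages are cut at a fixed word count (`max_words` is an illustrative value), and "length" is taken to mean word count.

```python
import math


def split_passages(text, max_words=200):
    """Split text into consecutive passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def pool_by_length(passages, embed):
    """Length-weighted sum of passage vectors, rescaled to unit length."""
    if not passages:
        raise ValueError("no passages to pool")
    total = None
    for passage in passages:
        weight = len(passage.split())  # assumption: length means word count
        vec = embed(passage)
        if total is None:
            total = [0.0] * len(vec)
        for i, value in enumerate(vec):
            total[i] += weight * value
    norm = math.sqrt(sum(v * v for v in total)) or 1.0
    return [v / norm for v in total]
```

Because the pooled vector is renormalized, dividing by the total weight first would change nothing; only the relative weights of the passages matter.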

The map

With one vector per link, the map is the easy part: UMAP flattens the vectors to 2D and deck.gl draws them. The clusters that fall out are the ones you'd hope for: Transformer Models, Interface Design, Techno-Optimism, Literature. The names are generated, not curated — KMeans cuts the space at a few zoom levels and a model names each cluster from its most-saved titles.
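The naming step can be sketched without the clustering itself: given a cluster label per link, pick each cluster's most-saved titles and hand those to the naming model. The input shapes and the `per_cluster` cutoff here are assumptions for illustration.

```python
from collections import defaultdict


def titles_for_naming(links, labels, per_cluster=5):
    """Pick each cluster's most-saved titles as input for a naming model.

    links: list of (title, save_count); labels: one cluster id per link,
    e.g. from a k-means run. Both shapes are assumed for this sketch.
    """
    clusters = defaultdict(list)
    for (title, save_count), label in zip(links, labels):
        clusters[label].append((save_count, title))
    return {
        label: [title for _, title in sorted(members, key=lambda m: -m[0])[:per_cluster]]
        for label, members in clusters.items()
    }
```

Running this once per zoom level, each with its own label list, would give one set of names per level.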

Search and neighbors

The whole interactive path is one data structure: the 20,494 link vectors packed into an 84 MB float32 matrix, loaded once, scanned in memory. Clicking a dot compares that dot's vector against every other link's: all of them, no shortlist. An earlier version compared against the 800 most-saved links only, which quietly meant a niche page's true neighbors were never even candidates. Brute force over everything is 2 ms of math, so the shortlist saved nothing.
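For illustration, the exact scan is just a dot product against every row followed by a top-k. The sketch below uses plain Python lists and assumes rows are already unit length, so the dot product equals cosine similarity; a real deployment would run this over a typed array, and the 2 ms figure does not apply to this version.

```python
import heapq


def top_k_neighbors(matrix, seed_index, k=10):
    """Exact nearest neighbors by dot product over every row."""
    seed = matrix[seed_index]
    scored = (
        (sum(a * b for a, b in zip(seed, row)), index)
        for index, row in enumerate(matrix)
        if index != seed_index  # a dot is not its own neighbor
    )
    return [(index, score) for score, index in heapq.nlargest(k, scored)]
```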

The math was never the cost. Rebuilding the inputs was: vectors used to live in SQLite as JSON text, so a cold start parsed ~400 MB of strings to reconstruct 84 MB of floats — six to eight seconds of deserialization for two milliseconds of arithmetic. The fix was to stop storing numbers as text. Each vector is normalized once at write time and stored as raw little-endian float bytes; building the matrix is now a memcpy.
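A small sketch of the write-side encoding, using Python's `struct` module as a stand-in for whatever the real write path uses. The function names are hypothetical. The 1,024-dimension figure in the usage note is inferred from the sizes quoted in this post (84 MB for 20,494 float32 vectors), not stated by it.

```python
import math
import struct


def encode_vector(vec):
    """Normalize to unit length and pack as little-endian float32 bytes."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return struct.pack(f"<{len(vec)}f", *(v / norm for v in vec))


def decode_vector(blob):
    """Unpack little-endian float32 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))
```

At 4 bytes per component, a 1,024-dimension vector is a 4,096-byte BLOB, and 20,494 of them come to 83,943,424 bytes, which matches the 84 MB matrix. Concatenating the BLOBs in row order yields the matrix bytes directly, with no parsing step.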

The same write packs each vector's sign bits (128 bytes per link, 2.6 MB for the whole corpus), and cold starts scan that binary plane first. Hamming distance over all 20k bit signatures picks 1,024 candidates, and only those get rescored in float, read straight from their BLOBs. Measured against the exact scan over 200 seed links, the bit pass keeps 99.95% of the true top-10 neighbors. Cold-start time dropped from ~4.5 s to ~3.6 s; the warm path keeps the exact full scan. Precompute on write, never on read; at twenty thousand vectors you can afford to precompute everything.
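The bits-then-rescore idea can be sketched as follows. Signatures are held as Python ints rather than 128-byte BLOBs, and the tie handling and function names are this sketch's own choices, so treat it as an illustration of the technique rather than the production scan.

```python
def sign_bits(vec):
    """Pack one bit per dimension (1 if the component is >= 0) into an int."""
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two signatures differ."""
    return bin(a ^ b).count("1")


def two_stage_neighbors(vectors, signatures, seed_index, shortlist=1024, k=10):
    """Shortlist by Hamming distance on sign bits, then rescore in float."""
    seed_sig = signatures[seed_index]
    order = sorted(
        (i for i in range(len(vectors)) if i != seed_index),
        key=lambda i: hamming(seed_sig, signatures[i]),
    )
    seed = vectors[seed_index]
    rescored = [
        (sum(a * b for a, b in zip(seed, vectors[i])), i) for i in order[:shortlist]
    ]
    rescored.sort(reverse=True)
    return [i for _, i in rescored[:k]]
```

Only the shortlisted rows ever need their float vectors, which is why a cold start can read the small bit plane first and fetch just the candidate BLOBs afterward.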

Neighbor order is mostly geometry with a social nudge: 70% the vector, 20% "the same people saved both," 10% how many people saved it at all. Meaning ranks; your people break ties.
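As a sketch, the blend is a weighted sum of three terms. The 70/20/10 weights come from the paragraph above; how the two social terms are scaled is not stated, so the log scaling against the candidate set's maximum below is purely an assumption, as are the parameter names.

```python
import math


def blended_score(cosine, shared_savers, total_savers, max_shared, max_savers):
    """70% vector similarity, 20% co-save overlap, 10% popularity.

    Assumption: each social term is log-scaled into [0, 1] against the
    largest value seen among the candidates.
    """
    co_save = math.log1p(shared_savers) / math.log1p(max_shared) if max_shared else 0.0
    popularity = math.log1p(total_savers) / math.log1p(max_savers) if max_savers else 0.0
    return 0.7 * cosine + 0.2 * co_save + 0.1 * popularity
```

With every term bounded by 1, two candidates at the same cosine can differ by at most 0.3, so the social signal reorders close calls without overriding a clear semantic gap.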

While you type, a lexical index answers in about a millisecond; when the query settles the semantic pass lands in roughly 300 ms — nearly all of it the embedding call, none of it the scan. Query vectors are cached, so repeating a search skips the model entirely.
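Query-vector caching can be sketched as a dictionary in front of the model call. `embed` is a hypothetical callable, and normalizing case and whitespace before lookup is an assumption of this sketch.

```python
class QueryEmbedder:
    """Cache query vectors so a repeated search never calls the model."""

    def __init__(self, embed):
        self._embed = embed  # hypothetical callable: str -> list of floats
        self._cache = {}
        self.model_calls = 0

    def vector_for(self, query):
        # Assumption: case and repeated whitespace do not change the query.
        key = " ".join(query.lower().split())
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._embed(key)
        return self._cache[key]
```

Since the embedding call accounts for nearly all of the roughly 300 ms, a cache hit leaves only the scan.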

search "digital gardens" ->

{"count": 20, "items": [{
  "title": "Of Digital Streams, Campfires and Gardens",
  "url": "https://tomcritchlow.com/2018/10/10/of-gardens-and-wikis",
  "score": 0.791,
  "x": 0.154, "y": 0.112
}, ...]}

Every result comes back with its map coordinates, so lighting up the matches is just recoloring those dots — the map and the search share one coordinate space.

What's next

Two things, one per index.

The friends-of-friends feed. Everything it needs already exists: take my nearest fifty people from the save graph, collect what they saved weighted by overlap, drop what I've already seen, rank the rest. Right now Flâneur is the map you walk by hand — click a dot, get its neighbors, search by meaning, see who saved what. The feed is the same two indices pointed at one person.
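A minimal sketch of that feed, assuming the co-save graph is available as an `overlap` mapping for one person and that "already seen" can be approximated by "already saved". The scoring (sum of the overlap counts of the neighbors who saved a link) is one plausible reading of "weighted by overlap", not a settled design.

```python
from collections import defaultdict


def friends_of_friends_feed(me, overlap, saves_by_person, nearest=50, limit=20):
    """Rank unseen links by the summed overlap of the neighbors who saved them.

    overlap: {person_id: shared_link_count} for `me`.
    saves_by_person: {person_id: set of link ids}.
    """
    seen = saves_by_person.get(me, set())
    neighbors = sorted(overlap.items(), key=lambda kv: kv[1], reverse=True)[:nearest]
    scores = defaultdict(int)
    for person, shared in neighbors:
        for link in saves_by_person.get(person, set()) - seen:
            scores[link] += shared
    # Highest score first; ties broken by link id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return ranked[:limit]
```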

And the passage-level index. The chunk lane is embedding every passage of every long page right now, and the bits-then-rescore scan is what makes indexing them all affordable — search gets to read every chunk of every essay instead of one pooled vector per link. The cold path has known fat left too: about 1.5 s of isolate boot and schema setup, and a bit plane that's assembled from 20k SQLite rows when it could be one 2.6 MB read.

profiles -> saves/follows
saves -> co-save graph
links -> text -> chunks -> pooled vectors
vectors -> UMAP map
vectors -> f32 bytes + sign bits -> bits scan + float rescore -> search + neighbors