Backscroll

Backscroll reads my Twitter timeline and compiles a feed of what I actually want to read. It scrapes about 2,600 tweets a day from my home timeline plus my likes and bookmarks, packs the lot into a ~250k-token prompt, and has a model rank what's worth surfacing. Instead of opening X and bouncing for an hour, I read a ranked digest in ten or fifteen minutes.

Why I made it

The X algorithm is optimizing for X, not for me. Most days it shows me the same three accounts in different outfits and ten ragebait threads I didn't ask for. My likes and bookmarks already encode a real ranking function I trust — it just isn't the one driving the feed I see. Backscroll is the smallest thing that puts my own ranking function back in charge: read my timeline directly, use my likes as a taste signal, compile a feed I would have built by hand if I had the time.

How long it took

About seven weeks. First commit on the scraper was 2026-04-04. The first usable ranker came online in mid-April. The four frontends, the Code Mode harness, the rubric editor, the chat tab, the media mirror, the search corpus, the authors table, the public read path, and the Telegram delivery bot landed in successive arcs through mid-May. The system is three Cloudflare Workers in one repo: one private worker owns the scraper and the compile pipeline, two small public workers serve the read traffic, and all three share a single Durable Object via script_name.

What it actually saves

Hard to put a clean number on it, but the shape of my reading time changed. Before, I would open X, scroll until I felt bad, close it, repeat a few times a day, and still miss the things I actually wanted to see. After, I open Backscroll once or twice a day, read 30–80 ranked items, and I'm done. The slow compile happens in the background on a cron, so the feed feels instant when I open it even when the model behind it is taking a minute to think. The phone version is a Telegram bot for when I'm not at a computer — same ranked items, push delivery, inline thumbs to steer the rubric.

How it works

A single Durable Object owns the scraper, the compiler, and the serving layer, so every part of the system reads from one consistent piece of state. The compile pipeline has five steps: scrape, rank, evaluate, publish, iterate.

The scrape step is a headless browser that signs into my account, opens the home timeline, and scrolls until either the target tweet count is hit or the run times out. The Settings panel logs each run with up to four phase timestamps (browser launched → surface loaded → tweets visible → scrolling done), so when a scrape goes wrong I can tell whether it broke at login, at DOM mount, or partway through the scroll. A typical run lands 70–80 new tweets in about 70 seconds and writes them straight into the DO's SQLite, where the ranker picks them up on the next compile.
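The phase log lends itself to a small diagnosis helper. The sketch below is not the Settings panel's code: the identifier names, the millisecond offsets, and the mapping from "first missing phase" to a failure label are all assumptions, chosen to match the three failure points named above.

```typescript
// Phase names mirror the four timestamps in the text; the identifiers and
// the failure labels are assumptions made for this sketch.
type Phase = "browserLaunched" | "surfaceLoaded" | "tweetsVisible" | "scrollingDone";

const PHASE_ORDER: Phase[] = ["browserLaunched", "surfaceLoaded", "tweetsVisible", "scrollingDone"];

// Milliseconds since run start for each phase that was actually reached.
type PhaseLog = Partial<Record<Phase, number>>;

// The first phase without a timestamp tells us where the run stopped.
function diagnoseScrape(log: PhaseLog): string {
  if (log.browserLaunched === undefined) return "browser never launched";
  if (log.surfaceLoaded === undefined) return "broke at login";
  if (log.tweetsVisible === undefined) return "broke at DOM mount";
  if (log.scrollingDone === undefined) return "broke partway through the scroll";
  return "ok";
}

// Seconds spent between consecutive phases that were both reached.
function phaseGaps(log: PhaseLog): number[] {
  const gaps: number[] = [];
  for (let i = 1; i < PHASE_ORDER.length; i++) {
    const prev = log[PHASE_ORDER[i - 1]];
    const curr = log[PHASE_ORDER[i]];
    if (prev !== undefined && curr !== undefined) gaps.push((curr - prev) / 1000);
  }
  return gaps;
}
```

A run that logged only the first two phases would be reported as a DOM-mount failure, and a complete run yields one gap per transition, which makes a slow login easy to tell apart from a slow scroll.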

Step two is where the interesting choices live. Instead of having the model chain a dozen single tool calls, the model writes one bounded JavaScript program that runs inside an isolated Cloudflare Dynamic Worker with the network blocked, and that program calls typed feed.* tools as if they were a local SDK. Intermediate state lives in JS variables rather than getting smeared across the model's context window. The worker boundary is the only place where errors and untrusted code matter. The entire ranking experiment happens in one model generation rather than a dozen round-trips.

01 Prompt

Make me a feed about RL environments. Prefer primary sources, skip repeat threads, and publish only if diagnostics pass.

02 Code Mode REPL (dynamic worker)

const pool = await feed.candidates({
  mode: "for_you",
  limit: 600,
  unseenOnly: true,
});
const ids = rank(pool.candidates).slice(0, 80);
await feed.evaluate({ ids });
return feed.publish({ ids });

03 Feed session (3 of 80)

Andrej Karpathy @karpathy · 2h
RL environments are the new datasets. The bottleneck shifted from tokens to tasks with dense, well-defined reward.
Nat Friedman @natfriedman · 5h
Procgen still the cleanest test of generalization. Fixed levels just measure memorization at this point.
Rob Knight @ada_rob · 9h
Minimal RL gym wrapper — 200 LOC, no Mujoco. Plugs into a Worker for headless rollouts.
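To make "typed feed.* tools as if they were a local SDK" concrete, here is a hedged sketch of what such a surface could look like. Only the three call names and the argument fields in the REPL snippet come from the post; the return shapes, the "curated" mode string, the score-based rank() stand-in, and the in-memory stub are assumptions for illustration. Unlike the snippet, this version checks the evaluation result before publishing, which is one way to honor the "publish only if diagnostics pass" directive.

```typescript
// Hypothetical types: the real feed.* signatures are not shown in the post.
interface Candidate { id: string; text: string; score: number }

interface FeedTools {
  candidates(q: { mode: "for_you" | "curated"; limit: number; unseenOnly: boolean }): Promise<{ candidates: Candidate[] }>;
  evaluate(q: { ids: string[] }): Promise<{ ok: boolean }>;
  publish(q: { ids: string[] }): Promise<{ published: number }>;
}

// Stand-in for the rank() helper the snippet calls: sort by a precomputed score.
function rank(pool: Candidate[]): string[] {
  return pool.slice().sort((a, b) => b.score - a.score).map((c) => c.id);
}

// The bounded program: intermediate state stays in local variables, and
// nothing is published unless evaluation passes.
async function program(feed: FeedTools, size: number): Promise<{ published: number }> {
  const pool = await feed.candidates({ mode: "for_you", limit: 600, unseenOnly: true });
  const ids = rank(pool.candidates).slice(0, size);
  const report = await feed.evaluate({ ids });
  if (!report.ok) return { published: 0 };
  return feed.publish({ ids });
}

// In-memory stub so the program can run without a Durable Object behind it.
function makeStubFeed(pool: Candidate[]): FeedTools & { calls: string[] } {
  const calls: string[] = [];
  return {
    calls,
    async candidates(q) { calls.push("candidates"); return { candidates: pool.slice(0, q.limit) }; },
    async evaluate(q) { calls.push("evaluate"); return { ok: q.ids.length > 0 }; },
    async publish(q) { calls.push("publish"); return { published: q.ids.length }; },
  };
}
```

Because the program only sees the FeedTools interface, the same code can run against a stub in tests and against the real tools inside the sandboxed worker.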

Two feed modes share this pipeline. For You is the deterministic, hot-path one — no LLM at request time, ranking weights live in a strategy doc, candidate pool is six SQL queries with fresh / backlog / resurface / liked_author / explore / bookmark lanes, diversity penalty is token-overlap MMR. Curated is the slow, prompt-driven one — you can say "make me a feed about RL environments" and the program inspects content directly. If a published feed is wrong, the iterate step forks a child session that excludes everything I already saw and re-ranks against the same rubric, so feedback feeds back instead of getting lost.
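For readers unfamiliar with the diversity penalty mentioned above, this is a minimal sketch of maximal marginal relevance (MMR) using token overlap as the similarity measure. It is an illustration of the general technique, not the code in feed-ranker.ts: the Jaccard overlap, the tokenizer, the lambda trade-off parameter, and all names here are assumptions.

```typescript
interface Scored { id: string; text: string; relevance: number }

// Lowercased word tokens; a deliberately simple tokenizer for the sketch.
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9@#]+/).filter((t) => t.length > 0));
}

// Jaccard overlap: shared tokens divided by tokens in either set.
function overlap(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 || b.size === 0) return 0;
  let shared = 0;
  a.forEach((t) => { if (b.has(t)) shared++; });
  return shared / (a.size + b.size - shared);
}

// Greedy MMR: each pick maximizes lambda * relevance minus (1 - lambda) times
// the highest overlap with anything already picked.
function mmrSelect(pool: Scored[], k: number, lambda: number): Scored[] {
  const remaining = pool.map((item) => ({ item, toks: tokens(item.text) }));
  const picked: { item: Scored; toks: Set<string> }[] = [];
  while (picked.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      let maxSim = 0;
      for (const p of picked) maxSim = Math.max(maxSim, overlap(remaining[i].toks, p.toks));
      const s = lambda * remaining[i].item.relevance - (1 - lambda) * maxSim;
      if (s > bestScore) { bestScore = s; bestIdx = i; }
    }
    picked.push(remaining.splice(bestIdx, 1)[0]);
  }
  return picked.map((p) => p.item);
}
```

With lambda at 1 the function degenerates to a plain relevance sort; lowering it pushes near-duplicate threads down even when their raw scores are high.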

The two files that hold this together are feed-compiler.ts (the session lifecycle: feedStats, feedStartRun, feedCandidates, feedEval, feedPublish) and feed-ranker.ts (the deterministic scoring half: buildCandidatePool, compileForYouFeed, evaluateFeed, diagnoseBadFeed). Rubric and strategy live in separate markdown files. Rubric is what to value; strategy is how to rank. Editing one without the other is the most common way to get a worse feed.

The Settings panel is the operational view — one place to see the job ledger, the rolling corpus window, and whether anything is currently broken. Each job (scrape, following, discover, bookmark, likes) reports its last success timestamp and can be re-run manually. The corpus tree is the materialized 12-hour window the ranker actually reads from: 2,929 tweets right now under a 10,000-row cap, totaling about 274k tokens — which is what the "~250k-token prompt" earlier means in practice. The Rebuild button forces a fresh window from the scraped pool; Reload picks the latest snapshot without re-windowing.
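The windowing described above can be sketched as a filter, a cap, and a token sum. This is an assumption-laden illustration rather than the Rebuild button's actual code: the field names, the choice to window on tweet creation time, and the newest-first ordering under the cap are all hypothetical.

```typescript
// Hypothetical row shape; the real SQLite schema is not shown in the post.
interface StoredTweet { id: string; createdAt: number; tokenCount: number }

const MS_PER_HOUR = 3600000;

// Keep tweets from the last `windowHours`, newest first, capped at `rowCap`
// rows, and report the token total the ranker would have to read.
function buildWindow(
  tweets: StoredTweet[],
  nowMs: number,
  windowHours: number = 12,
  rowCap: number = 10000,
): { rows: StoredTweet[]; totalTokens: number } {
  const cutoff = nowMs - windowHours * MS_PER_HOUR;
  const rows = tweets
    .filter((t) => t.createdAt >= cutoff && t.createdAt <= nowMs)
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, rowCap);
  const totalTokens = rows.reduce((sum, t) => sum + t.tokenCount, 0);
  return { rows, totalTokens };
}
```

As a sanity check on the figures quoted above, 274k tokens over 2,929 tweets works out to roughly 94 tokens per tweet, so the window would only approach the 10,000-row cap at around 940k tokens.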

Frontends

Four surfaces, all reading from the same underlying feed sessions. The compiler doesn't know about any of them.

The native Backscroll UI is the primary one. The header reads backscroll · 116,145; the number is today's token budget for the compile run, which is the only honest "size" metric for a feed assembled by an LLM. The tab bar runs For you / All / Bookmarks / Likes / Archive / Chat. Each ranked card has a "why" button that opens the score breakdown. The composer at the bottom isn't for tweeting; it asks "What feed do you want?" and feeds the input straight into Code Mode as a directive.

The Twitter-web frontend renders the same sessions in an X-shaped layout for when I want the familiar muscle memory. The composer bar at the top doubles as a feed-directive input — "more from @kohjingyu", "less politics" — so I can steer the feed mid-session without leaving the surface.

The Hacker News-style frontend (backlist.sdan.io) renders the same sessions as a ranked link aggregator, which I find better for dense reading sessions. It has its own SEO surface: per-day archive pages at /d/{day} and per-author pages at /by/{handle} are server-rendered and indexable, while per-tweet pages are deliberately not built so I don't compete with x.com/i/status/{id} on raw tweet text.
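The two indexable URL shapes can be sketched as a tiny route matcher. The /d/{day} and /by/{handle} patterns come from the text; the YYYY-MM-DD day format, the handle character set, and the function and type names are assumptions. Per-tweet paths intentionally fall through to "none", matching the decision not to build those pages.

```typescript
type Route =
  | { kind: "day"; day: string }
  | { kind: "author"; handle: string }
  | { kind: "none" };

// Match only the two server-rendered page types; everything else is "none".
function matchRoute(pathname: string): Route {
  // Assumed day format: YYYY-MM-DD.
  const day = /^\/d\/(\d{4}-\d{2}-\d{2})\/?$/.exec(pathname);
  if (day !== null) return { kind: "day", day: day[1] };
  // Assumed handle shape: letters, digits, underscore, up to 15 characters.
  const by = /^\/by\/([A-Za-z0-9_]{1,15})\/?$/.exec(pathname);
  if (by !== null) return { kind: "author", handle: by[1] };
  return { kind: "none" };
}
```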

The Telegram bot is push delivery. Curated thread summaries arrive in a private channel with "More like this" and "Less of this" buttons inline, and that feedback flows into the rubric the same way clicks on the web version do.
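Inline feedback buttons of this kind map onto the Telegram Bot API's inline keyboard, where each button carries a short callback_data string that comes back to the bot when tapped. The sketch below builds a sendMessage payload and parses the callback; the "more:" and "less:" encoding, the item id format, and the function names are assumptions, not the bot's actual scheme.

```typescript
interface InlineButton { text: string; callback_data: string }

// Subset of the sendMessage parameters needed for a message with buttons.
interface DigestMessage {
  chat_id: string;
  text: string;
  reply_markup: { inline_keyboard: InlineButton[][] };
}

function buildDigestMessage(chatId: string, itemId: string, summary: string): DigestMessage {
  const more = `more:${itemId}`;
  const less = `less:${itemId}`;
  // Telegram caps callback_data at 64 bytes; ids are assumed to be ASCII here,
  // so string length equals byte length.
  if (more.length > 64 || less.length > 64) throw new Error("callback_data too long");
  return {
    chat_id: chatId,
    text: summary,
    reply_markup: {
      inline_keyboard: [[
        { text: "More like this", callback_data: more },
        { text: "Less of this", callback_data: less },
      ]],
    },
  };
}

type Feedback = { signal: "more" | "less"; itemId: string };

// Decode a tapped button back into a rubric signal; null for unknown data.
function parseFeedback(callbackData: string): Feedback | null {
  const m = /^(more|less):(.+)$/.exec(callbackData);
  if (m === null) return null;
  return { signal: m[1] === "more" ? "more" : "less", itemId: m[2] };
}
```

Keeping the signal and the item id in callback_data means the bot needs no per-message state to route a tap into the same feedback path the web clicks use.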

What it's not

It's not a recommender. There's only one user; collaborative filtering is wrong-shaped. It's not a search engine — when I want to query, the Chat tab uses the model with REASONING tools (searchTweets, searchTweetCorpus, queryDB) that return structured payloads, and a small allowlist of PRESENTATION tools (showTweets, compileFeed) that render cards inline. The agent never has to "summarize the cards it just showed."
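One way to read the REASONING / PRESENTATION split is as a routing rule on tool results: reasoning tools hand their payload back to the model, while presentation tools send their payload to the UI and return only a short acknowledgement. The sketch below illustrates that reading. The five tool names come from the text; the routing function, the acknowledgement shape, and the outcome type are assumptions.

```typescript
type ToolKind = "reasoning" | "presentation";

// Allowlist: any tool not named here is rejected outright.
const TOOL_KINDS = new Map<string, ToolKind>([
  ["searchTweets", "reasoning"],
  ["searchTweetCorpus", "reasoning"],
  ["queryDB", "reasoning"],
  ["showTweets", "presentation"],
  ["compileFeed", "presentation"],
]);

interface ToolOutcome { toModel: unknown; toUi: unknown }

// Decide who sees a tool's result: the model, or the UI plus a short ack.
function routeToolResult(name: string, result: unknown): ToolOutcome {
  const kind = TOOL_KINDS.get(name);
  if (kind === undefined) throw new Error(`tool not allowlisted: ${name}`);
  if (kind === "reasoning") return { toModel: result, toUi: null };
  // Presentation: cards render inline, so the model gets no content to restate.
  return { toModel: { rendered: true }, toUi: result };
}
```

Under this reading the model never receives the rendered cards as text, which is why it has nothing to summarize after showing them.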

The hosted demo at backscroll.sdan.io serves the latest cached session, and the public workers handle UA-split SSR so crawlers see real HTML while humans get the SPA. They sit in front of a private worker that holds the heavy bindings (browser automation, object storage, search), none of which belong on a Googlebot-facing path. The demo is a read-only window into one person's compiled feed: you can see the architecture and the ranking, but the rubric and the likes signal are mine.
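UA-split rendering comes down to one decision per request. The following is a minimal sketch of that decision, assuming a simple substring pattern on the User-Agent header; the pattern, the function name, and the "ssr" / "spa" labels are hypothetical, and a production list of crawlers would likely be longer and more careful about false positives.

```typescript
// Assumed crawler markers; real bot detection usually needs a fuller list.
const CRAWLER_PATTERN = /bot|crawler|spider|slurp|facebookexternalhit/i;

// Crawlers get server-rendered HTML, everyone else gets the SPA shell.
function chooseRenderer(userAgent: string | null): "ssr" | "spa" {
  if (userAgent !== null && CRAWLER_PATTERN.test(userAgent)) return "ssr";
  return "spa";
}
```

For example, a Googlebot header such as "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" would take the SSR branch, while a desktop Chrome header or a missing header would fall through to the SPA.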

What I don't have right yet

The ranker is weight-tuned, not learned. I rebalanced qualityWeight against authorWeight from 3:1 to closer to 1:1 and widened the bookmark lane share from 8% to 18%, and that fixed the "more of what you already see" failure mode. The next step is online weight learning against dwell + bookmark labels — the harness for it lives in feed-ranker.ts already, but the eval setup isn't there. Once the eval is honest, the rest (DSPy on the rubric prompt, a small learned model behind the weights, or behavior cloning on my scroll traces) becomes a question of which knob moves the metric, not vibes.
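To show what those two knobs do numerically, here is a small sketch. It reads the ratio as qualityWeight:authorWeight and assumes a normalized linear blend and a rounded lane quota; neither the formula nor the function names are taken from feed-ranker.ts, and the 80-item feed size is borrowed from the Code Mode example.

```typescript
interface Weights { quality: number; author: number }

// Assumed scoring rule: weighted average of two signals, each in [0, 1].
function blend(quality: number, author: number, w: Weights): number {
  return (w.quality * quality + w.author * author) / (w.quality + w.author);
}

// Number of feed slots a lane gets for a given share of the feed.
function laneSlots(feedSize: number, share: number): number {
  return Math.round(feedSize * share);
}

// Two hypothetical items: one scores high on quality, the other on author.
const highQuality = { quality: 0.9, author: 0.2 };
const highAuthor = { quality: 0.6, author: 0.9 };

const before: Weights = { quality: 3, author: 1 };
const after: Weights = { quality: 1, author: 1 };

// At 3:1 the high-quality item ranks first; at 1:1 the order flips.
const flipsOrder =
  blend(highQuality.quality, highQuality.author, before) > blend(highAuthor.quality, highAuthor.author, before) &&
  blend(highQuality.quality, highQuality.author, after) < blend(highAuthor.quality, highAuthor.author, after);

// In an 80-item feed, an 8% lane gets 6 slots and an 18% lane gets 14.
const bookmarkSlotsBefore = laneSlots(80, 0.08);
const bookmarkSlotsAfter = laneSlots(80, 0.18);
```

The point of the example is only that small ratio changes can reorder items near the boundary, which is why hand-tuning without an eval is hard to trust.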