News Tracker — Plan
A continuously-updating, LLM-processed news dashboard for every company we're tracking. Two views: recent (timeline) and by company. Dedup at the cluster level, summarize per cluster, score importance so Ravi reads 10 items a day instead of 1,000.
Drafted 2026-05-19. Lives at /research/investments/news-tracker/. This is a plan, not the app yet.
1. Goal & Non-Goals
Goal: For every company on Ravi's interest list, surface everything material that's been said about it — by the company, by regulators, by reporters, by other investors — within 24 hours, deduped, summarized, and ranked. Ravi should be able to spend ≤10 minutes/day and know what changed.
- Yes: SEC filings, press releases, earnings transcripts, mainstream financial press, analyst notes (where free), 13F changes, Form 4 insider activity, key Twitter/blog signals.
- Yes: AI-written cluster summary + "what changed" + "why it matters for our thesis."
- No (initially): Price-only alerts (Stock Watch handles that). Social-noise scraping (StockTwits / Reddit firehose). Twitter/X firehose (paid).
- No: Trading signals. We're tracking information, not auto-acting on it.
2. Sources — Where News Comes From
Pillar 1 (authoritative, free) is the backbone. Pillar 2 (press, free RSS) is the bulk. Pillar 3 (paid APIs) is the polish. Pillar 4 (LLM-only) is for things only Claude can do.
Pillar 1 — Authoritative filings & company-direct
| Source | Access | Why it matters |
| SEC EDGAR (8-K, 10-Q, 10-K, S-1, 13D/G, 13F, Form 4) | FREE JSON + RSS. We already use it in awb/. | Truth source. 8-K = "a material event occurred within the last 4 business days." Form 4 = insider buys/sells. 13D = activist stakes. |
| Company IR / press release pages | FREE Custom scraper per company OR PR Newswire/BusinessWire/GlobeNewswire RSS by ticker. | Earliest signal — companies post here before press picks it up. |
| Earnings transcripts (Seeking Alpha, Motley Fool, company webcasts) | FREE partial; PAID for full real-time. | Management's own words. We already manually grab these — automate. |
| FERC / state PSC filings, EIA-411/860 | FREE | Critical for power/data-center names (BE, CEG, VST, AES) and BTC→AI hosting names where grid connection is the binding constraint. |
| Patent filings (USPTO Patent Center, which replaced PAIR) | FREE | Leading indicator for semis (NVDA, MU) and biotech. Low priority for v1. |
Pillar 2 — Press & aggregators (RSS-first)
| Source | Access | Notes |
| Google News RSS per ticker (news.google.com/rss/search?q=$TICKER) | FREE | Best blanket coverage. Already aggregates Reuters/Bloomberg/CNBC/WSJ headlines. We follow the link to fetch full text where allowed. |
| Yahoo Finance news (per-ticker) | FREE via yfinance.Ticker(t).news or scrape | Curated; lower volume than Google News. |
| PR Newswire / BusinessWire / GlobeNewswire / AccessWire | FREE RSS by ticker | Raw company-issued releases. Same content as IR page, easier to subscribe to. |
| Reuters / CNBC / FT / Bloomberg topic RSS | FREE headlines; paywall for full text | Headlines are often enough. For paywalled deep stories, log title + outlet, mark "READ ON SOURCE." |
| Seeking Alpha | FREE news; PAID premium analysis | News tab is fine without subscription via RSS. |
| Substack / blog feeds (Doomberg, Stratechery, Diffusion, Semianalysis, Asianometry, etc.) | FREE RSS | Mention-based: if author mentions a tracked ticker, surface it. |
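The per-ticker Google News feed in the table above can be built by URL-encoding the search query. This is a minimal sketch: the function name `google_news_feed_url` and the optional quoted company name (meant to cut false matches on short tickers such as BE or HUT) are assumptions, not part of the plan.

```python
from typing import Optional
from urllib.parse import quote_plus

# URL pattern taken from the Pillar 2 table.
GOOGLE_NEWS_RSS = "https://news.google.com/rss/search?q={query}"


def google_news_feed_url(ticker: str, company_name: Optional[str] = None) -> str:
    """Build the per-ticker Google News RSS search URL.

    The `OR "<company name>"` clause is an assumption for disambiguating
    short tickers; drop it if it pulls in too much noise.
    """
    query = ticker if company_name is None else f'{ticker} OR "{company_name}"'
    return GOOGLE_NEWS_RSS.format(query=quote_plus(query))


print(google_news_feed_url("CORZ"))
print(google_news_feed_url("BE", "Bloom Energy"))
```

Each resulting URL would become one row in source_feeds, polled on the 15-minute cron.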
Pillar 3 — Paid or rate-limited APIs (selective)
| Source | Cost | Decision |
| Finnhub news API (already have key) | HAVE KEY | USE. Per-ticker news endpoint, cheap, fast. Good baseline. |
| Polygon.io news | Starter $29/mo | Skip v1. Revisit if Finnhub coverage is thin. |
| Alpha Vantage NEWS_SENTIMENT | Free tier, low limits | Skip — we do our own sentiment via Claude. |
| Benzinga / NewsAPI / NewsCatcher | $30–500/mo | Skip v1. |
| Twitter/X API | $200/mo basic | Skip v1. Curate ~30 must-follow accounts via Nitter RSS instead. |
| Whale Wisdom / 13F-HR alerts | $0–50/mo | Skip — we parse 13Fs directly from EDGAR. |
Pillar 4 — LLM-native sources (no traditional feed)
| Source | Mechanism |
| Claude WebSearch daily sweep per high-priority ticker | "What material news about $TICKER appeared in the last 24h that wasn't in earlier feeds?" Catches paywalled stories, podcasts, foreign press. |
| Podcast transcripts (Invest Like the Best, Acquired, BG2, All-In) | Whisper-transcribe via a weekly worker; extract ticker mentions. |
| YouTube CC from a watchlist of analyst channels | Same idea — caption fetch + LLM scan for tracked-ticker mentions. |
3. Pipeline (end-to-end)
┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐
│ FETCH       │───▶│ NORMALIZE    │───▶│ DEDUP &      │───▶│ ENRICH      │───▶│ RANK &       │
│ (per-source │    │ to canonical │    │ CLUSTER      │    │ (LLM        │    │ STORE        │
│ workers)    │    │ item schema  │    │ (embedding + │    │ summary,    │    │ in D1        │
│             │    │              │    │ URL hash)    │    │ impact,     │    │              │
│             │    │              │    │              │    │ novelty)    │    │              │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘    └──────────────┘
       │                                                                              │
       └──── Cron every 15min for Pillar 1+2, every 4h for Pillar 4 ──────────────────┘
3.1 Canonical item schema
{
"id": "sha256(url|title|published_at)",
"source": "edgar | google-news | finnhub | pr-newswire | substack | websearch | ...",
"source_type": "filing | press | news | analysis | social | transcript",
"tickers": ["CORZ", "CRWV"], // many-to-many; LLM may add tickers the source didn't tag
"title": "...",
"url": "...",
"published_at": "ISO8601",
"fetched_at": "ISO8601",
"raw_text": "...", // full body when fetchable
"raw_html_hash": "sha256",
"embedding": [float; 1024], // for clustering
"cluster_id": "uuid", // populated by dedup step
"is_primary_in_cluster": bool, // the "best" copy of a story
"llm": {
"summary_120w": "...",
"what_changed": "...", // vs prior knowledge of this company
"impact": "high | med | low",
"thesis_relevance": "...", // free-text: how it touches our investment thesis
"novelty": 0.0–1.0, // 1.0 = brand new; 0.0 = restating something we have
"tags": ["earnings","capex","customer-win","regulatory","management-change","..."]
}
}
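The `id` field above is described as a SHA-256 over `url|title|published_at`. A minimal sketch of that derivation follows; the function name `make_item_id` is an assumption, and the sample URL and title are illustrative only.

```python
import hashlib


def make_item_id(url: str, title: str, published_at: str) -> str:
    """Return the hex SHA-256 of 'url|title|published_at' (the schema's id)."""
    key = "|".join([url, title, published_at])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Illustrative values, not real feed data.
item_id = make_item_id(
    "https://example.com/press/merger-update",
    "Merger agreement terminated",
    "2026-05-19T13:05:00Z",
)
print(len(item_id))  # 64 hex characters
```

Because the raw URL is part of the key, the same story reached through two tracking URLs gets two ids. That is acceptable here: collapsing those copies is the job of the dedup step, not of the id.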
3.2 Dedup & clustering
- L1 — URL canonicalization. Strip tracking params, follow redirects, lowercase host. Identical URL = same item.
- L2 — Content hash. SHA256 of normalized title + first 500 chars of body. Catches syndicated copies.
- L3 — Embedding clusters. Voyage-3 or text-embedding-3-small. Cosine similarity ≥ 0.86 with another item for the same ticker in the last 72h ⇒ same cluster. The most authoritative source wins as primary (EDGAR > company PR > Reuters > aggregator).
- L4 — LLM tie-breaker. For borderline pairs (0.78 ≤ cosine < 0.86), Claude Haiku decides "same story or different angle?"
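The first two layers above (URL canonicalization and content hash) need no model calls. The sketch below shows one way they might look. The tracking-parameter list, the trailing-slash handling, and the whitespace/case normalization are assumptions; redirect-following is omitted because it needs a network call.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend as new ones show up in feeds.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid", "mc_cid", "mc_eid"}


def canonicalize_url(url: str) -> str:
    """L1: lowercase scheme and host, drop tracking params and the fragment."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",  # assumption: trailing slash is noise
        urlencode(kept),
        "",  # fragment dropped
    ))


def content_hash(title: str, body: str) -> str:
    """L2: SHA-256 of the normalized title plus the first 500 chars of body."""
    def normalize(text: str) -> str:
        # Assumption: collapse whitespace and lowercase before hashing.
        return re.sub(r"\s+", " ", text).strip().lower()

    payload = normalize(title) + "\n" + normalize(body)[:500]
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(canonicalize_url("https://WWW.Example.com/news/story/?utm_source=x&id=7#top"))
```

Two items with the same canonical URL or the same content hash can be merged before any embedding is computed, which keeps the embedding bill limited to items that survive these layers.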
Why this matters. CORZ-CoreWeave merger termination generated 40+ stories in one day. Cluster = 1 entry with "primary = company 8-K, 39 secondary mentions, 4 distinct analyst takes." Ravi reads 1 thing, not 40.
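The embedding thresholds and the authority ordering from the dedup layers can be expressed as two small decisions: where a similarity score routes an item, and which copy becomes primary. This is a sketch under assumptions: the source labels in `AUTHORITY` and the return strings are hypothetical names, and the embeddings themselves would come from whichever provider is chosen.

```python
import math
from typing import Sequence

SAME_CLUSTER = 0.86  # L3 threshold from the plan
BORDERLINE = 0.78    # lower bound of the L4 tie-breaker band

# Lower rank = more authoritative (EDGAR > company PR > Reuters > aggregator).
# The label strings are hypothetical.
AUTHORITY = {"edgar": 0, "company-pr": 1, "reuters": 2, "aggregator": 3}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_decision(similarity: float) -> str:
    """Route an item: join the cluster, ask the LLM, or start a new cluster."""
    if similarity >= SAME_CLUSTER:
        return "same_cluster"
    if similarity >= BORDERLINE:
        return "ask_llm"
    return "new_cluster"


def pick_primary(sources: Sequence[str]) -> str:
    """Pick the most authoritative source; unknown labels rank last."""
    return min(sources, key=lambda s: AUTHORITY.get(s, len(AUTHORITY)))


print(cluster_decision(0.81))                               # ask_llm
print(pick_primary(["aggregator", "reuters", "edgar"]))     # edgar
```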
3.3 LLM enrichment (per cluster, not per item)
- Run once per cluster after dedup. Use the primary item's full text + titles of secondaries.
- Pull last 5 prior cluster summaries for this ticker as context → enables "what changed" and proper novelty scoring.
- Pull the ticker's analysis HTML from reports/research/stocks/$T/ if it exists → enables thesis-relevance commentary.
- Model: Sonnet 4.6 for routine clusters; Opus 4.7 for clusters tagged high-impact or for "interested" companies.
- Prompt-cache the analysis HTML + last-5 summaries so we only pay full input price on the first cluster per ticker per day.
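The routing and context rules above can be sketched as two helpers. Everything named here is an assumption: the model labels are the plan's shorthand rather than API model identifiers, and since impact is itself an LLM output, the sketch assumes it is only known after a first pass (so a high-impact cluster would be re-run on Opus).

```python
from typing import List, Optional

# Shorthand labels from the plan, not API model identifiers.
ROUTINE_MODEL = "sonnet-4.6"
PRIORITY_MODEL = "opus-4.7"


def choose_model(interest_level: str, impact: Optional[str] = None) -> str:
    """Opus for Interested companies or high-impact clusters, else Sonnet.

    `impact` is None before the first enrichment pass (assumption).
    """
    if interest_level == "interested" or impact == "high":
        return PRIORITY_MODEL
    return ROUTINE_MODEL


def build_context(analysis_html: Optional[str], prior_summaries: List[str]) -> str:
    """Assemble the stable prompt prefix: analysis HTML + last 5 summaries.

    Keeping this prefix identical across clusters for one ticker is what
    makes it a candidate for prompt caching.
    """
    parts = []
    if analysis_html:
        parts.append("ANALYSIS:\n" + analysis_html)
    recent = prior_summaries[-5:]  # oldest-to-newest order assumed
    parts.append("PRIOR SUMMARIES:\n" + "\n".join(f"- {s}" for s in recent))
    return "\n\n".join(parts)


print(choose_model("tracked", "med"))  # sonnet-4.6
```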
3.4 Importance score
Final ranking blends:
- LLM-assigned impact (high/med/low → 3/2/1)
- Source authority (EDGAR 8-K material event = 3, press = 1)
- Novelty (1.0 = new, 0.0 = rehash)
- Interest multiplier (×3 if "interested", ×1.5 if "tracked", ×1 if "universe-only")
- Recency decay (half-life 72h for routine, 7d for filings)
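The plan lists the inputs to the ranking but not how they are combined. The sketch below assumes one possible blend: add impact and authority points, then multiply by novelty, the interest multiplier, and an exponential decay with the stated half-lives. The multiplicative form and the function name are assumptions, chosen so that a pure rehash (novelty 0.0) drops to the bottom.

```python
IMPACT_POINTS = {"high": 3, "med": 2, "low": 1}
INTEREST_MULTIPLIER = {"interested": 3.0, "tracked": 1.5, "universe": 1.0}

ROUTINE_HALF_LIFE_H = 72      # 72h for routine items
FILING_HALF_LIFE_H = 7 * 24   # 7d for filings


def importance_score(
    impact: str,
    authority: int,        # e.g. 3 for an EDGAR 8-K, 1 for press
    novelty: float,        # 0.0 (rehash) to 1.0 (brand new)
    interest_level: str,
    age_hours: float,
    is_filing: bool = False,
) -> float:
    """One assumed way to blend the five ranking inputs into a single score."""
    half_life = FILING_HALF_LIFE_H if is_filing else ROUTINE_HALF_LIFE_H
    decay = 0.5 ** (age_hours / half_life)
    base = IMPACT_POINTS[impact] + authority
    return base * novelty * INTEREST_MULTIPLIER[interest_level] * decay


# Fresh high-impact 8-K for an Interested company: (3 + 3) * 1.0 * 3.0 * 1.0
print(importance_score("high", 3, 1.0, "interested", 0, is_filing=True))  # 18.0
```

The result is what would be materialized into importance_score on the cluster row; because of the decay term it has to be recomputed periodically, not only at insert time.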
4. Storage (Cloudflare D1)
Schema:
CREATE TABLE news_items (
id TEXT PRIMARY KEY,
source TEXT, source_type TEXT,
url TEXT, title TEXT,
published_at TEXT, fetched_at TEXT,
raw_text TEXT, raw_html_hash TEXT,
cluster_id TEXT, is_primary INTEGER,
embedding BLOB, -- 4KB per row
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_news_published ON news_items(published_at DESC);
CREATE INDEX idx_news_cluster ON news_items(cluster_id);
CREATE TABLE news_item_tickers (
item_id TEXT, ticker TEXT,
PRIMARY KEY (item_id, ticker)
);
CREATE INDEX idx_nit_ticker ON news_item_tickers(ticker);
CREATE TABLE news_clusters (
id TEXT PRIMARY KEY,
primary_ticker TEXT,
first_seen TEXT, last_seen TEXT,
story_kind TEXT, -- earnings, m&a, regulatory, mgmt, ...
summary_120w TEXT,
what_changed TEXT,
thesis_relevance TEXT,
impact TEXT, -- high/med/low
novelty REAL,
importance_score REAL, -- materialized for sorting
tags_json TEXT,
llm_model TEXT, llm_cost_usd REAL,
updated_at TEXT
);
CREATE INDEX idx_cluster_importance ON news_clusters(importance_score DESC, last_seen DESC);
CREATE TABLE company_interest (
ticker TEXT PRIMARY KEY,
level TEXT, -- 'interested' | 'tracked' | 'universe'
note TEXT,
updated_at TEXT
);
CREATE TABLE source_feeds (
id INTEGER PRIMARY KEY,
kind TEXT, url TEXT, ticker TEXT NULL,
last_polled TEXT, last_status TEXT,
poll_interval_sec INTEGER
);
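D1 row count estimate: 200 tracked tickers × ~30 items/day × 90 days retention = ~540K rows. The row count is modest, but with a 4 KB embedding on every row the table's size should be checked against D1's storage limits before assuming the free tier suffices. Older raw items roll off to R2 (cold storage); cluster summaries are kept forever.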
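The retention estimate can be worked through with the embedding size from the schema (1024 floats, which is 4 KB per row if stored as float32; the float32 storage format is an assumption). Raw text is not included, so the real footprint would be larger.

```python
TICKERS = 200
ITEMS_PER_TICKER_PER_DAY = 30
RETENTION_DAYS = 90
EMBEDDING_DIMS = 1024
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as float32

rows = TICKERS * ITEMS_PER_TICKER_PER_DAY * RETENTION_DAYS
embedding_bytes = rows * EMBEDDING_DIMS * BYTES_PER_FLOAT32

print(rows)                             # 540000
print(round(embedding_bytes / 1e9, 2))  # 2.21 (GB of embeddings alone)
```

If that turns out to be too much for the chosen D1 plan, one option is to keep embeddings only for the 72h clustering window and drop the column's contents for older rows.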
5. Frontend (the dashboard)
View A — Recent (timeline)
- Default sort: importance_score (desc), then recency.
- Filters: time window (24h/7d/30d), interest level (Interested / Tracked / All), impact, story kind.
- Each row: ticker chip, headline, 2-sentence summary, "what changed", source pills (EDGAR / Reuters / +12 others), expand to read primary text.
- "Mark as read" + per-row "👁 follow this story" to get updates if the cluster grows.
View B — By Company
- Sidebar: list of all tracked tickers, sorted by interest level then unread-cluster count.
- Main: timeline for the selected ticker. Group by week. Show count of clusters & total items.
- Top of page: pinned "company snapshot" (price, 1Y move, link to our existing analysis, link to Stock Watch).
- Compare mode: pick 2-3 tickers, see their news side-by-side (useful for sector comparison).
Plus a tiny Daily Digest page: top 10 clusters from the last 24h across Interested companies, in one printable view. The thing Ravi actually opens with his morning coffee.
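The Daily Digest reduces to one query over news_clusters joined to company_interest. The sketch below runs it against an in-memory SQLite database (D1 is SQLite-based) with a trimmed copy of the two tables. The sample rows are made up for illustration, and the query assumes timestamps are stored as UTC ISO-8601 strings in one consistent format so that string comparison orders them correctly.

```python
import sqlite3

DIGEST_SQL = """
SELECT c.id, c.primary_ticker, c.summary_120w, c.importance_score
FROM news_clusters AS c
JOIN company_interest AS ci ON ci.ticker = c.primary_ticker
WHERE ci.level = 'interested'
  AND c.last_seen >= :since
ORDER BY c.importance_score DESC, c.last_seen DESC
LIMIT 10
"""


def daily_digest(conn: sqlite3.Connection, since_iso: str) -> list:
    """Top 10 clusters since `since_iso` for Interested companies."""
    return conn.execute(DIGEST_SQL, {"since": since_iso}).fetchall()


# Trimmed tables and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE news_clusters (
  id TEXT PRIMARY KEY, primary_ticker TEXT, last_seen TEXT,
  summary_120w TEXT, importance_score REAL
);
CREATE TABLE company_interest (ticker TEXT PRIMARY KEY, level TEXT);
""")
conn.executemany(
    "INSERT INTO company_interest VALUES (?, ?)",
    [("CORZ", "interested"), ("MU", "tracked")],
)
conn.executemany(
    "INSERT INTO news_clusters VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", "CORZ", "2026-05-19T08:00:00Z", "Recent, interested", 18.0),
        ("c2", "CORZ", "2026-05-17T08:00:00Z", "Too old for the digest", 20.0),
        ("c3", "MU", "2026-05-19T09:00:00Z", "Tracked only", 9.0),
    ],
)

digest = daily_digest(conn, "2026-05-18T12:00:00Z")
print([row[0] for row in digest])  # ['c1']
```

This version matches on primary_ticker only. A cluster that merely mentions an Interested ticker would need a join through news_item_tickers.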
6. Marking "Interested" companies
- 3 levels: Interested (actively investigating, full Opus treatment, daily digest, ×3 score) · Tracked (in our universe, normal processing, ×1.5) · Universe (passive, headline-only, ×1).
- Single button on each company row. Same source-of-truth (company_interest table) as Stock Watch / AI Stock Universe so flags are consistent across the site.
- Seed the initial Interested list from the new Stock Watch priority subgroups: CRWV, NBIS, CORZ, IREN, BTBT, GLXY, APLD, WULF, CIFR, HUT, plus META.
7. Infrastructure
| Layer | Choice | Why |
| Hosting | Cloudflare Pages + Worker | Same as rest of work repo. Zero new infra. |
| Cron | Cloudflare Cron Triggers | Fetch workers run on schedule. 15min for Pillar 1+2, 4h for Pillar 4. |
| DB | Cloudflare D1 (work database, new tables) | Already wired. |
| Cold storage | Cloudflare R2 for raw HTML/PDF originals | Cheap, append-only. |
| Queues | Cloudflare Queues | Fetch → enrich is decoupled. Survives spikes (earnings days). |
| Embeddings | Voyage-3 (preferred) or OpenAI text-embedding-3-small | Cheap. ~$0.10 per 1M tokens. ~$18/mo at the volume estimated in section 8. |
| LLM | Sonnet 4.6 default, Opus 4.7 for Interested + high-impact | Per investments/CLAUDE.md "always Opus for analysis" rule, scoped to high-priority clusters. |
| Auth | Single-user, password / token gate on write endpoints (mark-interested, mark-read) | Read-only public is fine. |
8. Cost model (rough)
- Embeddings: ~6K items/day × ~1K tokens avg → 6M tokens/day → ~$0.60/day = $18/mo
- LLM enrichment (Sonnet): ~200 unique clusters/day × ~3K input + 400 output (cached at 0.1×) → ~$1.50/day = $45/mo
- LLM enrichment (Opus for Interested + high-impact, ~20/day): ~$1/day = $30/mo
- Daily WebSearch sweep (Pillar 4) on ~10 Interested tickers: ~$0.40/day = $12/mo
- Cloudflare D1 + R2 + Workers: ~$5/mo
- Total: ~$110/mo at steady state. Acceptable. If we're wrong by 3x, still under $350/mo.
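The total can be re-derived from the daily figures above, using a 30-day month (an assumption) and the flat Cloudflare line item:

```python
DAYS_PER_MONTH = 30  # assumption used to convert daily to monthly

daily_usd = {
    "embeddings": 0.60,
    "sonnet_enrichment": 1.50,
    "opus_enrichment": 1.00,
    "websearch_sweep": 0.40,
}
flat_monthly_usd = {"cloudflare": 5.00}

monthly_usd = {k: round(v * DAYS_PER_MONTH, 2) for k, v in daily_usd.items()}
monthly_usd.update(flat_monthly_usd)

total = round(sum(monthly_usd.values()), 2)
print(monthly_usd)
print(total)       # 110.0
print(3 * total)   # 330.0, the "wrong by 3x" case
```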
9. Build phases
| Phase | Scope | Estimate |
| P0 — MVP read-only | EDGAR + Finnhub + Google News for ~15 Interested tickers. Hash-only dedup. Sonnet summaries. Single "Recent" view. No clustering yet. | 1 weekend |
| P1 — Cluster & rank | Add embeddings + cluster + importance score. Add "By Company" view. Add Interested/Tracked/Universe levels. | 1 weekend |
| P2 — Source breadth | Add PR Newswire/BusinessWire RSS, Seeking Alpha, Substack feeds, Yahoo news. Add Daily Digest page. | 1 weekend |
| P3 — LLM-native sources | Claude WebSearch sweep. Podcast/YouTube transcript scanning for tracked tickers. | 1 weekend |
| P4 — Polish | Mark-as-read, follow-cluster, email digest, compare mode, mobile UX pass. | 1 weekend |
10. Open questions for Ravi
- How many Interested companies do you want at any one time? (Affects Opus cost.) Suggest 10-15.
- Daily digest delivery: email (via Resend), Slack DM to yourself, or just a page you check?
- Retention: do you want all historical items searchable forever, or roll off raw items after 90d and keep only cluster summaries?
- Foreign-language coverage (Nikkei for Japan trading houses, Chinese press for BABA/JD/Tencent)? Adds complexity; worth it?
- Do you want auto-extraction of new tickers mentioned in articles about tracked companies (e.g., "BE signed a deal with new customer X") so we can add X to Universe automatically?
11. Risks & mitigations
- LLM hallucination on summaries → always show source URL prominently, never paraphrase numbers without quoting.
- Source blocks scrapers → fall back to title-only with "READ ON SOURCE" link. Don't try to defeat paywalls.
- Aggregator double-counting → strict dedup pipeline (4 layers above).
- Drift in interest list → quarterly review prompt: "These 5 Interested companies haven't generated material news in 60 days; demote?"
- Cron job fails silently → health endpoint, daily heartbeat to Drive, alert if >6h gap.
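The ">6h gap" check in the last mitigation is a comparison against each feed's last successful poll. A minimal sketch, assuming the health endpoint can read last_polled per feed from source_feeds; the function name and the feed labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_GAP = timedelta(hours=6)  # alert threshold from the plan


def stale_feeds(last_polled: Dict[str, datetime], now: datetime) -> List[str]:
    """Return feeds whose last successful poll is more than 6h old."""
    return sorted(name for name, ts in last_polled.items() if now - ts > MAX_GAP)


# Hypothetical feed names and timestamps.
now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
polls = {
    "edgar-8k": now - timedelta(minutes=15),
    "google-news-CORZ": now - timedelta(hours=7),
}
print(stale_feeds(polls, now))  # ['google-news-CORZ']
```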
What this is NOT. Not a trading platform. Not real-time (15-min lag is fine). Not a substitute for reading 10-Ks. It's a "did anything important happen?" filter so the deep reading time is spent on the right things.
Next step after Ravi's review: build P0. Folder: /Users/ravf/projects/work/research/investments/news-tracker/. Worker code will live under /_worker.js with new /api/news/* routes.