News Tracker — Plan

A continuously-updating, LLM-processed news dashboard for every company we're tracking. Two views: recent (timeline) and by company. Dedup at the cluster level, summarize per cluster, score importance so Ravi reads 10 items a day instead of 1,000.

Drafted 2026-05-19. Lives at /research/investments/news-tracker/. This is a plan, not the app yet.

1. Goal & Non-Goals

Goal: For every company on Ravi's interest list, surface everything material that's been said about it — by the company, by regulators, by reporters, by other investors — within 24 hours, deduped, summarized, and ranked. Ravi should be able to spend ≤10 minutes/day and know what changed.

Yes: SEC filings, press releases, earnings transcripts, mainstream financial press, analyst notes (where free), 13F changes, Form 4 insider activity, key Twitter/blog signals.
Yes: AI-written cluster summary + "what changed" + "why it matters for our thesis."
No (initially): Price-only alerts (Stock Watch handles that). Social-noise scraping (StockTwits / Reddit firehose). Twitter/X firehose (paid).
No: Trading signals. We're tracking information, not auto-acting on it.

2. Sources — Where News Comes From

Pillar 1 (authoritative, free) is the backbone. Pillar 2 (press, free RSS) is the bulk. Pillar 3 (paid APIs) is the polish. Pillar 4 (LLM-only) is for things only Claude can do.

Pillar 1 — Authoritative filings & company-direct

Source	Access	Why it matters
SEC EDGAR (8-K, 10-Q, 10-K, S-1, 13D/G, 13F, Form 4)	FREE JSON + RSS. We already use it in `awb/`.	Truth source. 8-K = "something happened in last 4 days." Form 4 = insider buys/sells. 13D = activist stakes.
Company IR / press release pages	FREE Custom scraper per company OR PR Newswire/BusinessWire/GlobeNewswire RSS by ticker.	Earliest signal — companies post here before press picks it up.
Earnings transcripts (Seeking Alpha, Motley Fool, company webcasts)	FREE partial; PAID for full real-time.	Management's own words. We already manually grab these — automate.
FERC / state PSC filings, EIA-411/860	FREE	Critical for power/data-center names (BE, CEG, VST, AES) and BTC→AI hosting names where grid connection is the binding constraint.
Patent filings (USPTO PAIR)	FREE	Leading indicator for semis (NVDA, MU) and biotech. Low-priority v1.

Pillar 2 — Press & aggregators (RSS-first)

Source	Access	Notes
Google News RSS per ticker (`news.google.com/rss/search?q=$TICKER`)	FREE	Best blanket coverage. Already aggregates Reuters/Bloomberg/CNBC/WSJ headlines. We follow link to fetch full text where allowed.
Yahoo Finance news (per-ticker)	FREE via `yfinance.Ticker(t).news` or scrape	Curated; lower volume than Google News.
PR Newswire / BusinessWire / GlobeNewswire / AccessWire	FREE RSS by ticker	Raw company-issued releases. Same content as IR page, easier to subscribe to.
Reuters / CNBC / FT / Bloomberg topic RSS	FREE headlines; paywall for full text	Headlines often enough. For paywalled deep stories, log title + outlet, mark "READ ON SOURCE."
Seeking Alpha	FREE news; PAID premium analysis	News tab is fine without subscription via RSS.
Substack / blog feeds (Doomberg, Stratechery, Diffusion, Semianalysis, Asianometry, etc.)	FREE RSS	Mention-based: if author mentions a tracked ticker, surface it.

Pillar 3 — Paid or rate-limited APIs (selective)

Source	Cost	Decision
Finnhub news API (already have key)	HAVE KEY	USE. Per-ticker news endpoint, cheap, fast. Good baseline.
Polygon.io news	Starter $29/mo	Skip v1. Revisit if Finnhub coverage is thin.
Alpha Vantage NEWS_SENTIMENT	Free tier, low limits	Skip — we do our own sentiment via Claude.
Benzinga / NewsAPI / NewsCatcher	$30–500/mo	Skip v1.
Twitter/X API	$200/mo basic	Skip v1. Curate ~30 must-follow accounts via Nitter RSS instead.
Whale Wisdom / 13F-HR alerts	$0–50/mo	Skip — we parse 13Fs directly from EDGAR.

Pillar 4 — LLM-native sources (no traditional feed)

Source	Mechanism
Claude WebSearch daily sweep per high-priority ticker	"What material news about $TICKER in last 24h that wasn't in earlier feeds?" — catches paywalled stories, podcasts, foreign press.
Podcast transcripts (Invest Like the Best, Acquired, BG2, All-In)	Whisper-transcribe via a weekly worker; extract ticker mentions.
YouTube CC from a watchlist of analyst channels	Same idea — caption fetch + LLM scan for tracked-ticker mentions.

3. Pipeline (end-to-end)

┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐
│  FETCH      │───▶│  NORMALIZE   │───▶│  DEDUP &     │───▶│  ENRICH     │───▶│  RANK &      │
│ (per-source │    │  to canonical │    │  CLUSTER     │    │ (LLM        │    │  STORE       │
│  workers)   │    │   item schema │    │ (embedding +  │    │  summary,   │    │  in D1       │
│             │    │               │    │  URL hash)    │    │  impact,    │    │              │
│             │    │               │    │               │    │  novelty)   │    │              │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘    └──────────────┘
       │                                                                                │
       └──── Cron every 15min for Pillar 1+2, every 4h for Pillar 4 ────────────────────┘

3.1 Canonical item schema

{
  "id": "sha256(url|title|published_at)",
  "source": "edgar | google-news | finnhub | pr-newswire | substack | websearch | ...",
  "source_type": "filing | press | news | analysis | social | transcript",
  "tickers": ["CORZ", "CRWV"],         // many-to-many; LLM may add tickers the source didn't tag
  "title": "...",
  "url": "...",
  "published_at": "ISO8601",
  "fetched_at": "ISO8601",
  "raw_text": "...",                    // full body when fetchable
  "raw_html_hash": "sha256",
  "embedding": [float; 1024],           // for clustering
  "cluster_id": "uuid",                 // populated by dedup step
  "is_primary_in_cluster": bool,        // the "best" copy of a story
  "llm": {
    "summary_120w": "...",
    "what_changed": "...",              // vs prior knowledge of this company
    "impact": "high | med | low",
    "thesis_relevance": "...",          // free-text: how it touches our investment thesis
    "novelty": 0.0–1.0,                 // 1.0 = brand new; 0.0 = restating something we have
    "tags": ["earnings","capex","customer-win","regulatory","management-change","..."]
  }
}

3.2 Dedup & clustering

L1 — URL canonicalization. Strip tracking params, follow redirects, lowercase host. Identical URL = same item.
L2 — Content hash. SHA256 of normalized title + first 500 chars of body. Catches syndicated copies.
L3 — Embedding clusters. Voyage-3 or text-embedding-3-small. Cosine ≥0.86 with another item from same ticker in last 72h ⇒ same cluster. The most authoritative source wins as primary (EDGAR > company PR > Reuters > aggregator).
L4 — LLM tie-breaker. For borderline clusters (cosine 0.78–0.86), Claude Haiku decides "same story or different angle?"

Why this matters. CORZ-CoreWeave merger termination generated 40+ stories in one day. Cluster = 1 entry with "primary = company 8-K, 39 secondary mentions, 4 distinct analyst takes." Ravi reads 1 thing, not 40.

3.3 LLM enrichment (per cluster, not per item)

Run once per cluster after dedup. Use the primary item's full text + titles of secondaries.
Pull last 5 prior cluster summaries for this ticker as context → enables "what changed" and proper novelty scoring.
Pull the ticker's analysis HTML from reports/research/stocks/$T/ if it exists → enables thesis-relevance commentary.
Model: Sonnet 4.6 for routine clusters; Opus 4.7 for clusters tagged high-impact or for "interested" companies.
Prompt-cache the analysis HTML + last-5 summaries so we only pay full input price on the first cluster per ticker per day.

3.4 Importance score

Final ranking blends:

LLM-assigned impact (high/med/low → 3/2/1)
Source authority (EDGAR 8-K material event = 3, press = 1)
Novelty (1.0 = new, 0.0 = rehash)
Interest multiplier (×3 if "interested", ×1.5 if "tracked", ×1 if "universe-only")
Recency decay (half-life 72h for routine, 7d for filings)

4. Storage (Cloudflare D1)

Schema (click to expand)

CREATE TABLE news_items (
  id TEXT PRIMARY KEY,
  source TEXT, source_type TEXT,
  url TEXT, title TEXT,
  published_at TEXT, fetched_at TEXT,
  raw_text TEXT, raw_html_hash TEXT,
  cluster_id TEXT, is_primary INTEGER,
  embedding BLOB,                       -- 4KB per row
  created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_news_published ON news_items(published_at DESC);
CREATE INDEX idx_news_cluster ON news_items(cluster_id);

CREATE TABLE news_item_tickers (
  item_id TEXT, ticker TEXT,
  PRIMARY KEY (item_id, ticker)
);
CREATE INDEX idx_nit_ticker ON news_item_tickers(ticker);

CREATE TABLE news_clusters (
  id TEXT PRIMARY KEY,
  primary_ticker TEXT,
  first_seen TEXT, last_seen TEXT,
  story_kind TEXT,                      -- earnings, m&a, regulatory, mgmt, ...
  summary_120w TEXT,
  what_changed TEXT,
  thesis_relevance TEXT,
  impact TEXT,                          -- high/med/low
  novelty REAL,
  importance_score REAL,                -- materialized for sorting
  tags_json TEXT,
  llm_model TEXT, llm_cost_usd REAL,
  updated_at TEXT
);
CREATE INDEX idx_cluster_importance ON news_clusters(importance_score DESC, last_seen DESC);

CREATE TABLE company_interest (
  ticker TEXT PRIMARY KEY,
  level TEXT,                           -- 'interested' | 'tracked' | 'universe'
  note TEXT,
  updated_at TEXT
);

CREATE TABLE source_feeds (
  id INTEGER PRIMARY KEY,
  kind TEXT, url TEXT, ticker TEXT NULL,
  last_polled TEXT, last_status TEXT,
  poll_interval_sec INTEGER
);

D1 row count estimate: 200 tracked tickers × ~30 items/day × 90 days retention = ~540K rows. Comfortably in D1's free tier. Older items roll off to R2 (cold storage) but keep cluster summaries forever.

5. Frontend (the dashboard)

View A — Recent (timeline)

Default sort: importance_score (desc), then recency.
Filters: time window (24h/7d/30d), interest level (Interested / Tracked / All), impact, story kind.
Each row: ticker chip, headline, 2-sentence summary, "what changed", source pills (EDGAR / Reuters / +12 others), expand to read primary text.
"Mark as read" + per-row "👁 follow this story" to get updates if the cluster grows.

View B — By Company

Sidebar: list of all tracked tickers, sorted by interest level then unread-cluster count.
Main: timeline for the selected ticker. Group by week. Show count of clusters & total items.
Top of page: pinned "company snapshot" (price, 1Y move, link to our existing analysis, link to Stock Watch).
Compare mode: pick 2-3 tickers, see their news side-by-side (useful for sector comparison).

Plus a tiny Daily Digest page: top 10 clusters from the last 24h across Interested companies, in one printable view. The thing Ravi actually opens with his morning coffee.

6. Marking "Interested" companies

3 levels: Interested (actively investigating, full Opus treatment, daily digest, ×3 score) · Tracked (in our universe, normal processing, ×1.5) · Universe (passive, headline-only, ×1).
Single button on each company row. Same source-of-truth (company_interest table) as Stock Watch / AI Stock Universe so flags are consistent across the site.
Seed the initial Interested list from the new Stock Watch priority subgroups: CRWV, NBIS, CORZ, IREN, BTBT, GLXY, APLD, WULF, CIFR, HUT, plus META.

7. Infrastructure

Layer	Choice	Why
Hosting	Cloudflare Pages + Worker	Same as rest of `work` repo. Zero new infra.
Cron	Cloudflare Cron Triggers	Fetch workers run on schedule. 15min for Pillar 1+2, 4h for Pillar 4.
DB	Cloudflare D1 (`work` database, new tables)	Already wired.
Cold storage	Cloudflare R2 for raw HTML/PDF originals	Cheap, append-only.
Queues	Cloudflare Queues	Fetch → enrich is decoupled. Survives spikes (earnings days).
Embeddings	Voyage-3 (preferred) or OpenAI `text-embedding-3-small`	Cheap. ~$0.10 per 1M tokens. ~$0.30/mo at our volume.
LLM	Sonnet 4.6 default, Opus 4.7 for Interested + high-impact	Per `investments/CLAUDE.md` "always Opus for analysis" rule, scoped to high-priority clusters.
Auth	Single-user, password / token gate on write endpoints (mark-interested, mark-read)	Read-only public is fine.

8. Cost model (rough)

Embeddings: ~6K items/day × 1KB avg → 6M tokens/day → ~$0.60/day = $18/mo
LLM enrichment (Sonnet): ~200 unique clusters/day × ~3K input + 400 output (cached at 0.1×) → ~$1.50/day = $45/mo
LLM enrichment (Opus for Interested + high-impact, ~20/day): ~$1/day = $30/mo
Daily WebSearch sweep (Pillar 4) on ~10 Interested tickers: ~$0.40/day = $12/mo
Cloudflare D1 + R2 + Workers: ~$5/mo
Total: ~$110/mo at steady state. Acceptable. If we're wrong by 3x, still under $350/mo.

9. Build phases

Phase	Scope	Estimate
P0 — MVP read-only	EDGAR + Finnhub + Google News for ~15 Interested tickers. Hash-only dedup. Sonnet summaries. Single "Recent" view. No clustering yet.	1 weekend
P1 — Cluster & rank	Add embeddings + cluster + importance score. Add "By Company" view. Add Interested/Tracked/Universe levels.	1 weekend
P2 — Source breadth	Add PR Newswire/BusinessWire RSS, Seeking Alpha, Substack feeds, Yahoo news. Add Daily Digest page.	1 weekend
P3 — LLM-native sources	Claude WebSearch sweep. Podcast/YouTube transcript scanning for tracked tickers.	1 weekend
P4 — Polish	Mark-as-read, follow-cluster, email digest, compare mode, mobile UX pass.	1 weekend

10. Open questions for Ravi

How many Interested companies do you want at any one time? (Affects Opus cost.) Suggest 10-15.
Daily digest delivery: email (via Resend), Slack DM to yourself, or just a page you check?
Retention: do you want all historical items searchable forever, or roll off raw items after 90d and keep only cluster summaries?
Foreign-language coverage (Nikkei for Japan trading houses, Chinese press for BABA/JD/Tencent)? Adds complexity; worth it?
Do you want auto-extraction of new tickers mentioned in articles about tracked companies (e.g., "BE signed a deal with new customer X") so we can add X to Universe automatically?

11. Risks & mitigations

LLM hallucination on summaries → always show source URL prominently, never paraphrase numbers without quoting.
Source blocks scrapers → fall back to title-only with "READ ON SOURCE" link. Don't try to defeat paywalls.
Aggregator double-counting → strict dedup pipeline (4 layers above).
Drift in interest list → quarterly review prompt: "These 5 Interested companies haven't generated material news in 60 days; demote?"
Cron job failure silently → health endpoint, daily heartbeat to Drive, alert if >6h gap.

What this is NOT. Not a trading platform. Not real-time (15-min lag is fine). Not a substitute for reading 10-Ks. It's a "did anything important happen?" filter so the deep reading time is spent on the right things.

Next step after Ravi's review: build P0. Folder: /Users/ravf/projects/work/research/investments/news-tracker/. Worker code will live under /_worker.js with new /api/news/* routes.