News Tracker — Plan

A continuously updating, LLM-processed news dashboard for every company we're tracking. Two views: recent (timeline) and by company. Stories are deduplicated into clusters, summarized per cluster, and scored for importance so Ravi reads 10 items a day instead of 1,000.

Drafted 2026-05-19. Lives at /research/investments/news-tracker/. This is a plan, not the app yet.

1. Goal & Non-Goals

Goal: For every company on Ravi's interest list, surface everything material that's been said about it — by the company, by regulators, by reporters, by other investors — within 24 hours, deduped, summarized, and ranked. Ravi should be able to spend ≤10 minutes/day and know what changed.

2. Sources — Where News Comes From

Pillar 1 (authoritative, free) is the backbone. Pillar 2 (press, free RSS) is the bulk. Pillar 3 (paid APIs) is the polish. Pillar 4 (LLM-only) is for things only Claude can do.

Pillar 1 — Authoritative filings & company-direct

Source | Access | Why it matters
SEC EDGAR (8-K, 10-Q, 10-K, S-1, 13D/G, 13F, Form 4) | FREE. JSON + RSS. We already use it in awb/. | Truth source. 8-K = "something happened in last 4 days." Form 4 = insider buys/sells. 13D = activist stakes.
Company IR / press release pages | FREE. Custom scraper per company OR PR Newswire/BusinessWire/GlobeNewswire RSS by ticker. | Earliest signal: companies post here before press picks it up.
Earnings transcripts (Seeking Alpha, Motley Fool, company webcasts) | FREE partial; PAID for full real-time. | Management's own words. We already grab these manually; automate.
FERC / state PSC filings, EIA-411/860 | FREE | Critical for power/data-center names (BE, CEG, VST, AES) and BTC→AI hosting names where grid connection is the binding constraint.
Patent filings (USPTO PAIR) | FREE | Leading indicator for semis (NVDA, MU) and biotech. Low-priority v1.
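
To make the EDGAR row concrete, here is a minimal sketch of narrowing one company's recent filings to the form types we care about. It assumes the input is the parsed JSON from EDGAR's per-company submissions endpoint, where filings.recent holds parallel arrays such as form, filingDate and accessionNumber. The function name recent_filings and the sample payload are hypothetical, and the network fetch itself is left out.

```python
from typing import Iterator


def recent_filings(submissions: dict, wanted_forms: set[str]) -> Iterator[dict]:
    """Yield one dict per recent filing whose form type is in wanted_forms.

    Assumes the EDGAR submissions layout: filings.recent contains parallel
    arrays, one entry per filing, in the same order across arrays.
    """
    recent = submissions.get("filings", {}).get("recent", {})
    forms = recent.get("form", [])
    dates = recent.get("filingDate", [])
    accessions = recent.get("accessionNumber", [])
    for form, filed, accession in zip(forms, dates, accessions):
        if form in wanted_forms:
            yield {"form": form, "filed": filed, "accession": accession}


# Hypothetical payload, trimmed to the fields used above.
sample = {
    "filings": {
        "recent": {
            "form": ["8-K", "10-Q", "4"],
            "filingDate": ["2026-05-18", "2026-05-10", "2026-05-09"],
            "accessionNumber": [
                "0000000000-26-000003",
                "0000000000-26-000002",
                "0000000000-26-000001",
            ],
        }
    }
}

hits = list(recent_filings(sample, {"8-K", "4"}))
```

The exact form-type strings EDGAR uses for 13D/G and 13F should be confirmed against real responses before the filter set is finalized.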

Pillar 2 — Press & aggregators (RSS-first)

Source | Access | Notes
Google News RSS per ticker (news.google.com/rss/search?q=$TICKER) | FREE | Best blanket coverage. Already aggregates Reuters/Bloomberg/CNBC/WSJ headlines. We follow link to fetch full text where allowed.
Yahoo Finance news (per-ticker) | FREE via yfinance.Ticker(t).news or scrape | Curated; lower volume than Google News.
PR Newswire / BusinessWire / GlobeNewswire / AccessWire | FREE RSS by ticker | Raw company-issued releases. Same content as IR page, easier to subscribe to.
Reuters / CNBC / FT / Bloomberg topic RSS | FREE headlines; paywall for full text | Headlines often enough. For paywalled deep stories, log title + outlet, mark "READ ON SOURCE."
Seeking Alpha | FREE news; PAID premium analysis | News tab is fine without subscription via RSS.
Substack / blog feeds (Doomberg, Stratechery, Diffusion, Semianalysis, Asianometry, etc.) | FREE RSS | Mention-based: if author mentions a tracked ticker, surface it.
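
Since most of Pillar 2 is RSS, a fetch worker mostly builds a feed URL and pulls title, link and date out of each item. The sketch below uses only the standard library and parses an inline sample instead of hitting the network. The helper names and the sample feed are hypothetical; the URL pattern is the one listed in the Google News row.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus


def google_news_rss_url(ticker: str) -> str:
    """Build the per-ticker Google News RSS search URL."""
    return "https://news.google.com/rss/search?q=" + quote_plus(ticker)


def parse_rss_items(xml_text: str) -> list[dict]:
    """Extract title, url and published_at from every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": item.findtext("title", default=""),
                "url": item.findtext("link", default=""),
                # Left as the raw RFC 822 string; normalization happens later.
                "published_at": item.findtext("pubDate", default=""),
            }
        )
    return items


# Hypothetical one-item feed for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>Example headline</title><link>https://example.com/a</link>
<pubDate>Mon, 18 May 2026 13:00:00 GMT</pubDate></item>
</channel></rss>"""

parsed = parse_rss_items(SAMPLE_RSS)
```

The same parser should work for the wire-service and Substack feeds as long as they are plain RSS 2.0; Atom feeds would need a separate branch.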

Pillar 3 — Paid or rate-limited APIs (selective)

Source | Cost | Decision
Finnhub news API (already have key) | HAVE KEY | USE. Per-ticker news endpoint, cheap, fast. Good baseline.
Polygon.io news | Starter $29/mo | Skip v1. Revisit if Finnhub coverage is thin.
Alpha Vantage NEWS_SENTIMENT | Free tier, low limits | Skip: we do our own sentiment via Claude.
Benzinga / NewsAPI / NewsCatcher | $30–500/mo | Skip v1.
Twitter/X API | $200/mo basic | Skip v1. Curate ~30 must-follow accounts via Nitter RSS instead.
Whale Wisdom / 13F-HR alerts | $0–50/mo | Skip: we parse 13Fs directly from EDGAR.

Pillar 4 — LLM-native sources (no traditional feed)

Source | Mechanism
Claude WebSearch daily sweep per high-priority ticker | "What material news about $TICKER in last 24h that wasn't in earlier feeds?" Catches paywalled stories, podcasts, foreign press.
Podcast transcripts (Invest Like the Best, Acquired, BG2, All-In) | Whisper-transcribe via a weekly worker; extract ticker mentions.
YouTube CC from a watchlist of analyst channels | Same idea: caption fetch + LLM scan for tracked-ticker mentions.
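
Before spending LLM tokens on a transcript, a cheap pre-filter can check whether any tracked ticker appears at all. The sketch below is one possible pre-filter, not the planned extraction step: it matches standalone uppercase tokens, with or without a leading $, against a watchlist. The function name and sample text are hypothetical.

```python
import re


def find_ticker_mentions(text: str, watchlist: set[str]) -> set[str]:
    """Return watchlist tickers that appear as standalone uppercase tokens.

    Matches both "$NVDA" and "NVDA". Lowercase or mixed-case words such as
    "Be" are ignored, so short tickers like BE only match when written in caps.
    """
    candidates = set(re.findall(r"\$?\b([A-Z]{1,5})\b", text))
    return candidates & watchlist


mentions = find_ticker_mentions(
    "We think $NVDA and MU benefit most. Be careful with the rest.",
    {"NVDA", "MU", "BE"},
)
```

Short tickers that double as ordinary words (BE, AES) will still produce false positives in all-caps text, and company names without a ticker are missed entirely, so the LLM scan remains the real extractor.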

3. Pipeline (end-to-end)

┌───────────────┐    ┌───────────────┐    ┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│  FETCH        │───▶│  NORMALIZE    │───▶│  DEDUP &      │───▶│  ENRICH       │───▶│  RANK &       │
│  (per-source  │    │  to canonical │    │  CLUSTER      │    │  (LLM summary,│    │  STORE        │
│  workers)     │    │  item schema  │    │  (embedding + │    │  impact,      │    │  in D1        │
│               │    │               │    │  URL hash)    │    │  novelty)     │    │               │
└───────────────┘    └───────────────┘    └───────────────┘    └───────────────┘    └───────────────┘
        │                                                                                   │
        └──── Cron every 15min for Pillar 1+2, every 4h for Pillar 4 ───────────────────────┘

3.1 Canonical item schema

{
  "id": "sha256(url|title|published_at)",
  "source": "edgar | google-news | finnhub | pr-newswire | substack | websearch | ...",
  "source_type": "filing | press | news | analysis | social | transcript",
  "tickers": ["CORZ", "CRWV"],         // many-to-many; LLM may add tickers the source didn't tag
  "title": "...",
  "url": "...",
  "published_at": "ISO8601",
  "fetched_at": "ISO8601",
  "raw_text": "...",                    // full body when fetchable
  "raw_html_hash": "sha256",
  "embedding": [float; 1024],           // for clustering
  "cluster_id": "uuid",                 // populated by dedup step
  "is_primary_in_cluster": bool,        // the "best" copy of a story
  "llm": {
    "summary_120w": "...",
    "what_changed": "...",              // vs prior knowledge of this company
    "impact": "high | med | low",
    "thesis_relevance": "...",          // free-text: how it touches our investment thesis
    "novelty": 0.0–1.0,                 // 1.0 = brand new; 0.0 = restating something we have
    "tags": ["earnings","capex","customer-win","regulatory","management-change","..."]
  }
}
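
As a rough illustration of the id field, the sketch below reads sha256(url|title|published_at) as a SHA-256 over the three values joined by a literal pipe; that reading, and the helper names item_id, to_iso8601 and make_item, are assumptions rather than settled design.

```python
import hashlib
from datetime import datetime, timezone


def item_id(url: str, title: str, published_at: str) -> str:
    """Deterministic item id: SHA-256 hex digest of url|title|published_at."""
    key = "|".join([url.strip(), title.strip(), published_at.strip()])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def to_iso8601(dt: datetime) -> str:
    """Render a timezone-aware datetime as a UTC ISO 8601 string."""
    return dt.astimezone(timezone.utc).isoformat()


def make_item(source: str, source_type: str, url: str, title: str,
              published: datetime, tickers: list[str]) -> dict:
    """Build the non-LLM part of a canonical item; later steps fill the rest."""
    published_at = to_iso8601(published)
    return {
        "id": item_id(url, title, published_at),
        "source": source,
        "source_type": source_type,
        "tickers": sorted(set(tickers)),
        "title": title.strip(),
        "url": url.strip(),
        "published_at": published_at,
        "fetched_at": to_iso8601(datetime.now(timezone.utc)),
    }


example = make_item(
    "google-news", "news", "https://example.com/a", "Example headline",
    datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc), ["CORZ", "CRWV"],
)
```

Because published_at is part of the hash, a source that revises its timestamp produces a new id for the same story; the clustering step is what catches that case.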

3.2 Dedup & clustering

Why this matters. The CORZ-CoreWeave merger termination generated 40+ stories in one day. Clustering collapses them into 1 entry: "primary = company 8-K, 39 secondary mentions, 4 distinct analyst takes." Ravi reads 1 thing, not 40.
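
One way the "embedding + URL hash" step could work is a greedy pass: an exact URL match reuses the existing cluster, otherwise the item joins the most similar cluster if cosine similarity clears a threshold, else it starts a new one. The sketch below is a simplified stand-in, not the planned algorithm. The 0.85 threshold, the use of the first-seen item as the cluster representative, and the toy 3-dimensional embeddings are all assumptions.

```python
import hashlib
import math
import uuid


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_clusters(items: list[dict], threshold: float = 0.85) -> list[dict]:
    """Set item["cluster_id"] in place using URL hash first, then embeddings."""
    cluster_by_url: dict[str, str] = {}            # url hash -> cluster id
    representatives: list[tuple[str, list[float]]] = []  # (cluster id, embedding)
    for item in items:
        url_hash = hashlib.sha256(item["url"].encode("utf-8")).hexdigest()
        cluster_id = cluster_by_url.get(url_hash)
        if cluster_id is None:
            best_id, best_sim = None, 0.0
            for rep_id, rep_embedding in representatives:
                sim = cosine(rep_embedding, item["embedding"])
                if sim > best_sim:
                    best_id, best_sim = rep_id, sim
            if best_id is not None and best_sim >= threshold:
                cluster_id = best_id
            else:
                cluster_id = str(uuid.uuid4())
                representatives.append((cluster_id, item["embedding"]))
            cluster_by_url[url_hash] = cluster_id
        item["cluster_id"] = cluster_id
    return items


stories = assign_clusters([
    {"url": "https://example.com/8k", "embedding": [1.0, 0.0, 0.0]},
    {"url": "https://example.com/recap", "embedding": [0.99, 0.1, 0.0]},
    {"url": "https://example.com/other", "embedding": [0.0, 1.0, 0.0]},
    {"url": "https://example.com/8k", "embedding": [0.0, 0.0, 1.0]},
])
```

Choosing which copy is primary (for example, preferring the company 8-K over press recaps) is a separate step not shown here.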

3.3 LLM enrichment (per cluster, not per item)

3.4 Importance score

Final ranking blends:

4. Storage (Cloudflare D1)

Schema:
CREATE TABLE news_items (
  id TEXT PRIMARY KEY,
  source TEXT, source_type TEXT,
  url TEXT, title TEXT,
  published_at TEXT, fetched_at TEXT,
  raw_text TEXT, raw_html_hash TEXT,
  cluster_id TEXT, is_primary INTEGER,
  embedding BLOB,                       -- 4KB per row
  created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_news_published ON news_items(published_at DESC);
CREATE INDEX idx_news_cluster ON news_items(cluster_id);

CREATE TABLE news_item_tickers (
  item_id TEXT, ticker TEXT,
  PRIMARY KEY (item_id, ticker)
);
CREATE INDEX idx_nit_ticker ON news_item_tickers(ticker);

CREATE TABLE news_clusters (
  id TEXT PRIMARY KEY,
  primary_ticker TEXT,
  first_seen TEXT, last_seen TEXT,
  story_kind TEXT,                      -- earnings, m&a, regulatory, mgmt, ...
  summary_120w TEXT,
  what_changed TEXT,
  thesis_relevance TEXT,
  impact TEXT,                          -- high/med/low
  novelty REAL,
  importance_score REAL,                -- materialized for sorting
  tags_json TEXT,
  llm_model TEXT, llm_cost_usd REAL,
  updated_at TEXT
);
CREATE INDEX idx_cluster_importance ON news_clusters(importance_score DESC, last_seen DESC);

CREATE TABLE company_interest (
  ticker TEXT PRIMARY KEY,
  level TEXT,                           -- 'interested' | 'tracked' | 'universe'
  note TEXT,
  updated_at TEXT
);

CREATE TABLE source_feeds (
  id INTEGER PRIMARY KEY,
  kind TEXT, url TEXT, ticker TEXT NULL,
  last_polled TEXT, last_status TEXT,
  poll_interval_sec INTEGER
);
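
The last_polled and poll_interval_sec columns in source_feeds are enough for a cron-triggered worker to decide which feeds to fetch on each tick. A minimal sketch of that check follows; the function name is_due is hypothetical, and it assumes last_polled is stored as an ISO 8601 string, treated as UTC when no offset is present.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_due(last_polled: Optional[str], poll_interval_sec: int, now: datetime) -> bool:
    """Return True if a feed has never been polled or its interval has elapsed."""
    if last_polled is None:
        return True
    last = datetime.fromisoformat(last_polled)
    if last.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        last = last.replace(tzinfo=timezone.utc)
    return now - last >= timedelta(seconds=poll_interval_sec)


now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
due_pillar_1 = is_due("2026-05-19T11:40:00+00:00", 900, now)     # 20 min ago, 15 min interval
due_pillar_4 = is_due("2026-05-19T11:40:00+00:00", 14400, now)   # 20 min ago, 4 h interval
```

With this check a single 15-minute cron can serve every pillar, since the 4-hour feeds simply report not due on most ticks.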

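D1 row count estimate: 200 tracked tickers × ~30 items/day × 90 days retention = ~540K rows. The row count is modest, but storage is the tighter constraint: at 4KB per embedding, embeddings alone come to roughly 2.2 GB before raw_text is counted, so the fit against D1's free-tier size limits needs to be checked before we rely on it. Older raw items roll off to R2 (cold storage); cluster summaries are kept forever.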
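
The estimate can be reproduced in a few lines. The inputs come from the figures above and the "4KB per row" comment in the schema; the only assumption is that 4KB means 1024 four-byte floats.

```python
tickers = 200
items_per_ticker_per_day = 30
retention_days = 90
embedding_bytes_per_row = 1024 * 4  # assumed: 1024 floats x 4 bytes each

rows = tickers * items_per_ticker_per_day * retention_days
embedding_gb = rows * embedding_bytes_per_row / 1e9

print(rows, round(embedding_gb, 2))  # prints: 540000 2.21
```

If that footprint turns out to be too large, the obvious levers are a shorter retention window for embeddings or storing them outside D1.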

5. Frontend (the dashboard)

View A — Recent (timeline)

  • Default sort: importance_score (desc), then recency.
  • Filters: time window (24h/7d/30d), interest level (Interested / Tracked / All), impact, story kind.
  • Each row: ticker chip, headline, 2-sentence summary, "what changed", source pills (EDGAR / Reuters / +12 others), expand to read primary text.
  • "Mark as read" + per-row "👁 follow this story" to get updates if the cluster grows.
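
The default sort and two of the filters map onto a single query against news_clusters. The sketch below runs it with Python's sqlite3 on a cut-down in-memory copy of the schema, which is a reasonable stand-in because D1 is SQLite-based; in the Worker the same SQL would go through the D1 binding instead. Joining on primary_ticker only (ignoring secondary tickers in news_item_tickers) and comparing timestamps as strings are simplifying assumptions, and the sample rows are made up.

```python
import sqlite3

RECENT_VIEW_SQL = """
SELECT c.id, c.primary_ticker, c.importance_score, c.last_seen
FROM news_clusters AS c
JOIN company_interest AS ci ON ci.ticker = c.primary_ticker
WHERE c.last_seen >= :since
  AND ci.level = :level
ORDER BY c.importance_score DESC, c.last_seen DESC
LIMIT :limit
"""


def recent_clusters(conn: sqlite3.Connection, since: str,
                    level: str = "interested", limit: int = 50) -> list[tuple]:
    """Clusters seen since `since` for one interest level, most important first."""
    params = {"since": since, "level": level, "limit": limit}
    return conn.execute(RECENT_VIEW_SQL, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE news_clusters (
  id TEXT PRIMARY KEY, primary_ticker TEXT,
  last_seen TEXT, importance_score REAL
);
CREATE TABLE company_interest (ticker TEXT PRIMARY KEY, level TEXT);
INSERT INTO company_interest VALUES ('CORZ', 'interested'), ('MU', 'tracked');
INSERT INTO news_clusters VALUES
  ('c1', 'CORZ', '2026-05-19T08:00:00Z', 0.90),
  ('c2', 'CORZ', '2026-05-19T09:00:00Z', 0.40),
  ('c3', 'MU',   '2026-05-19T09:30:00Z', 0.95),
  ('c4', 'CORZ', '2026-05-10T09:00:00Z', 0.99);
""")

rows = recent_clusters(conn, since="2026-05-18T00:00:00Z")
```

String comparison on last_seen only works if every timestamp is written in the same ISO 8601 form and timezone, which the normalize step would need to guarantee.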

View B — By Company

  • Sidebar: list of all tracked tickers, sorted by interest level then unread-cluster count.
  • Main: timeline for the selected ticker. Group by week. Show count of clusters & total items.
  • Top of page: pinned "company snapshot" (price, 1Y move, link to our existing analysis, link to Stock Watch).
  • Compare mode: pick 2-3 tickers, see their news side-by-side (useful for sector comparison).

Plus a tiny Daily Digest page: top 10 clusters from the last 24h across Interested companies, in one printable view. The thing Ravi actually opens with his morning coffee.

6. Marking "Interested" companies

7. Infrastructure

Layer | Choice | Why
Hosting | Cloudflare Pages + Worker | Same as rest of work repo. Zero new infra.
Cron | Cloudflare Cron Triggers | Fetch workers run on schedule. 15min for Pillar 1+2, 4h for Pillar 4.
DB | Cloudflare D1 (work database, new tables) | Already wired.
Cold storage | Cloudflare R2 for raw HTML/PDF originals | Cheap, append-only.
Queues | Cloudflare Queues | Fetch → enrich is decoupled. Survives spikes (earnings days).
Embeddings | Voyage-3 (preferred) or OpenAI text-embedding-3-small | Cheap. ~$0.10 per 1M tokens. ~$0.30/mo at our volume.
LLM | Sonnet 4.6 default, Opus 4.7 for Interested + high-impact | Per investments/CLAUDE.md "always Opus for analysis" rule, scoped to high-priority clusters.
Auth | Single-user, password / token gate on write endpoints (mark-interested, mark-read) | Read-only public is fine.
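
As a sanity check on the embeddings row, the quoted price and monthly cost imply a token budget, which can be compared with the item volume used in the D1 estimate (200 tickers × ~30 items/day). The 30-day month is an assumption, and the result is only as good as those rough inputs.

```python
price_per_million_tokens = 0.10   # USD, from the table
monthly_cost = 0.30               # USD, from the table
items_per_month = 200 * 30 * 30   # tickers x items/day x assumed 30-day month

tokens_per_month = monthly_cost / price_per_million_tokens * 1_000_000
tokens_per_item = tokens_per_month / items_per_month

print(round(tokens_per_month), round(tokens_per_item, 1))
```

That works out to about 3M tokens a month, or roughly 17 tokens per item, which is headline-sized. Embedding full article bodies at that volume would cost proportionally more, so the figure is worth revisiting once we decide what text gets embedded.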

8. Cost model (rough)

9. Build phases

Phase | Scope | Estimate
P0: MVP read-only | EDGAR + Finnhub + Google News for ~15 Interested tickers. Hash-only dedup. Sonnet summaries. Single "Recent" view. No clustering yet. | 1 weekend
P1: Cluster & rank | Add embeddings + cluster + importance score. Add "By Company" view. Add Interested/Tracked/Universe levels. | 1 weekend
P2: Source breadth | Add PR Newswire/BusinessWire RSS, Seeking Alpha, Substack feeds, Yahoo news. Add Daily Digest page. | 1 weekend
P3: LLM-native sources | Claude WebSearch sweep. Podcast/YouTube transcript scanning for tracked tickers. | 1 weekend
P4: Polish | Mark-as-read, follow-cluster, email digest, compare mode, mobile UX pass. | 1 weekend

10. Open questions for Ravi

  1. How many Interested companies do you want at any one time? (Affects Opus cost.) Suggest 10-15.
  2. Daily digest delivery: email (via Resend), Slack DM to yourself, or just a page you check?
  3. Retention: do you want all historical items searchable forever, or roll off raw items after 90d and keep only cluster summaries?
  4. Foreign-language coverage (Nikkei for Japan trading houses, Chinese press for BABA/JD/Tencent)? Adds complexity; worth it?
  5. Do you want auto-extraction of new tickers mentioned in articles about tracked companies (e.g., "BE signed a deal with new customer X") so we can add X to Universe automatically?

11. Risks & mitigations

What this is NOT. Not a trading platform. Not real-time (15-min lag is fine). Not a substitute for reading 10-Ks. It's a "did anything important happen?" filter so the deep reading time is spent on the right things.

Next step after Ravi's review: build P0. Folder: /Users/ravf/projects/work/research/investments/news-tracker/. Worker code will live under /_worker.js with new /api/news/* routes.