Show HN: Epstein's emails reconstructed in a message-style UI (OCR and LLMs)
Posted by toon-noot 2 days ago
This project reconstructs the Epstein email records from the recent U.S. House Oversight Committee releases using only public-domain documents (23,124 image files + 2,800 OCR text files).
Most email pages contain only one real message, buried under layers of repeated headers/footers. I wanted to rebuild the conversations without all the surrounding noise.
I used an OCR + vision-LLM pipeline to extract individual messages from the email screenshots, normalize senders/recipients, rebuild timestamps, detect duplicates, and map threads. The output is a structured SQLite database that is loaded and queried entirely client-side via SQL.js (SQLite compiled to WebAssembly).
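To illustrate the normalize / dedupe / thread steps, here is a minimal Python sketch. It is not the code from the repository: the table layout, the column names, the helpers (normalize_sender, thread_key, content_hash, build_db) and the idea of grouping threads by a cleaned-up subject line are all assumptions made for this example. It assumes each extracted message arrives as a dict with sender, sent_at, subject, body and source_image keys.

```python
import hashlib
import re
import sqlite3


def normalize_sender(raw: str) -> str:
    """Reduce 'Display Name <addr>' or a bare address to a lowercase address."""
    match = re.search(r"<([^>]+)>", raw)
    addr = match.group(1) if match else raw
    return addr.strip().lower()


def thread_key(subject: str) -> str:
    """Strip leading Re:/Fw:/Fwd: prefixes so replies share one key (assumed heuristic)."""
    s = re.sub(r"^(?:(?:re|fwd?)\s*:\s*)+", "", subject.strip(), flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", s).lower()


def content_hash(sender: str, sent_at: str, body: str) -> str:
    """Hash sender, timestamp and whitespace-normalized body to spot duplicates."""
    clean_body = re.sub(r"\s+", " ", body).strip().lower()
    canonical = "|".join([sender, sent_at, clean_body])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_db(messages, path=":memory:"):
    """Insert messages into a hypothetical schema, skipping duplicates by hash."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               sender TEXT NOT NULL,
               sent_at TEXT NOT NULL,
               thread TEXT NOT NULL,
               body TEXT NOT NULL,
               source_image TEXT NOT NULL,
               hash TEXT NOT NULL UNIQUE
           )"""
    )
    for m in messages:
        sender = normalize_sender(m["sender"])
        conn.execute(
            # The UNIQUE constraint on hash makes repeated copies a no-op.
            "INSERT OR IGNORE INTO messages "
            "(sender, sent_at, thread, body, source_image, hash) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (
                sender,
                m["sent_at"],
                thread_key(m["subject"]),
                m["body"],
                m["source_image"],
                content_hash(sender, m["sent_at"], m["body"]),
            ),
        )
    conn.commit()
    return conn
```

Keeping source_image on every row is what would let a UI link each message back to its original page. In this sketch the first copy of a duplicate wins, so only one source image is retained per message; a real pipeline might keep a separate table of all pages on which a message appears.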
The repository includes the full extraction pipeline, data cleaning scripts, schema, limitations, and implementation notes. The interface is a lightweight PWA that displays the reconstructed messages in a phone-style UI, with links back to every original source image for verification.
Live demo: https://epsteinsphone.org
All source data is from the official public releases; no leaks or private material.
Happy to answer questions about the pipeline, LLM extraction, threading logic, or the PWA implementation.
Comments
Comment by pfd1986 2 days ago
Neat data visualization solution!
Comment by toon-noot 2 days ago
Comment by marstall 2 days ago
Comment by dizhn 2 days ago
Comment by toon-noot 2 days ago
Comment by pea 2 days ago
Comment by palmotea 2 days ago
Comment by lights0123 2 days ago