Show HN: Kage – Shadow any website to a single binary for offline viewing
Posted by tamnd 2 days ago
Comments
Comment by simonw 2 days ago
Turns out it's using another project by the same author: https://github.com/tamnd/ascii-gif
The script used for the demo is at https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63... and has a comment showing how to run it:
ascii-gif render docs/demo/kage.tape -o docs/static/demo.gif
Looks like it's an opinionated wrapper around https://github.com/charmbracelet/vhs
Comment by vqtska 2 days ago
Comment by embedding-shape 2 days ago
Comment by Noumenon72 2 days ago
Comment by vqtska 2 days ago
Comment by LocoPadre 2 days ago
Comment by jubilanti 2 days ago
Comment by embedding-shape 2 days ago
Comment by alterom 2 days ago
Comment by tamnd 2 days ago
Comment by stavros 2 days ago
Comment by stellamariesays 1 day ago
Comment by wolttam 2 days ago
Cool!
It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.
Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
Comment by tamnd 2 days ago
Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
Comment by d3Xt3r 2 days ago
Basically I'm looking for something like the old-school .chm files on Windows, where you could pack a bunch of HTML documents into a single archive and open it without needing to embed a full browser engine.
This would have the advantage of keeping the file sizes really small. And you don't have to worry about the browser engine becoming outdated and potentially turning into an attack vector.
Comment by Bad_CRC 1 day ago
Comment by samat 1 day ago
For the younger generation https://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help
Comment by d3Xt3r 1 day ago
Comment by mgiampapa 2 days ago
Comment by tamnd 2 days ago
Comment by mcdonje 2 days ago
Epub would also be a great target.
Comment by smeej 2 days ago
Comment by gwern 2 days ago
So something like SingleFileZ https://github.com/gildas-lormeau/SingleFileZ or Gwtar https://gwern.net/gwtar ?
Comment by everforward 1 day ago
In a green field world, I have a personal requirement that technical documentation systems are capable of bulk exporting to a human-readable format on disk. I’m pretty flexible on what that is, though. Markdown is preferred, but I’m also fine with static, dependency-free HTML and I could accept PDFs if the rest of it is super nice.
It’s an integral part of DR, and most places want their docs on-premise, so DR effectively requires offline documentation. Everywhere I’ve worked either a) writes documentation in something that works offline (eg git repo with tarballs somewhere), or b) has invested a bunch of time in trying to scrape their own wiki into something legible during DR.
I guess it’s a long-winded way of saying “that’s using a tool to fix a self-inflicted problem that shouldn’t exist”.
Comment by ninalanyon 2 days ago
If the result is static why does it need a server? Isn't it possible to make it so that it can simply be opened by the browser? Like:
$ firefox $HOME/data/kage/paulgraham.com
Then the result would be usable on machines without kage installed.
Comment by tamnd 2 days ago
Actually, Kage has two parts: a crawler that crawls pages and converts them to clean HTML by capturing the DOM after rendering in Chrome/Chromium, and a pack/serve component that packages the result as either a ZIM file for Kiwix or an executable file.
Comment by doctoboggan 2 days ago
Comment by dmazzoni 2 days ago
Comment by pixelatedindex 2 days ago
Comment by embedding-shape 2 days ago
Comment by rzzzt 2 days ago
Related WHATWG discussion: https://github.com/whatwg/html/issues/3099
Comment by embedding-shape 2 days ago
Comment by rzzzt 1 day ago
Comment by embedding-shape 1 day ago
I was thinking "of course it works, how else would people get started creating websites otherwise?" then I remembered what the most common approaches in the frontend ecosystem are nowadays.
Back in the days of yore, every tutorial/book started with "First we create an index.html file which you open in your browser ...", even a JavaScript resource would start with this of course :)
Comment by rzzzt 15 hours ago
The protection mechanism was introduced so that malicious saved pages can't just grab things from your Downloads folder and send them to an attacker's server. But the method turned out to be a bit more refined than I had imagined: you can display an image but can't grab the pixels, run a script but not inspect its source code, fetch() will be unavailable, etc.
Comment by dncornholio 2 days ago
Comment by embedding-shape 2 days ago
Comment by recursive 2 days ago
Comment by danielheath 2 days ago
Comment by recursive 2 days ago
Comment by danielheath 1 day ago
Comment by recursive 1 day ago
To see it work, click "Download self contained .html" from the menu.
Here's the source file that handles this part: https://github.com/tomtheisen/mutraction/blob/master/mutract...
The idea is to use <script type="inline-module" name="foo">...</script> to define modules. That's something I just made up. For each such script, provision a blob URL. The main blocker is usually the same origin policy. Crucially, these blob URLs count as the same origin. So then you need to rewrite the imports from the named modules to the blob URLs. I used some regex rather than a proper parser, but it was more than good enough for me.
It seems quite doable to make some proper bundling tools around this concept.
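A minimal sketch of the import-rewriting step described above, under stated assumptions: `rewriteImports` and `provisionInlineModules` are hypothetical names, not functions from the linked repository; the regex only covers static `import`/`export ... from` statements with quoted specifiers; and each inline module is assumed to appear in the document before any module that imports it.

```javascript
// Hypothetical helper: replace module names in static imports with provisioned URLs.
// `urlsByName` is a Map from module name to the URL that should be imported instead.
function rewriteImports(source, urlsByName) {
  // Matches `from "x"`, `from 'x'`, and side-effect imports such as `import "x"`.
  const pattern = /(\bfrom\s*|\bimport\s*)(["'])([^"']+)\2/g
  return source.replace(pattern, (match, keyword, quote, specifier) => {
    const url = urlsByName.get(specifier)
    // Leave specifiers that are not named inline modules untouched.
    return url === undefined ? match : `${keyword}${quote}${url}${quote}`
  })
}

// Browser-only, hypothetical: provision one blob URL per inline module, in document order.
function provisionInlineModules(doc) {
  const urlsByName = new Map()
  for (const script of doc.querySelectorAll('script[type="inline-module"]')) {
    const code = rewriteImports(script.textContent, urlsByName)
    const blob = new Blob([code], { type: "text/javascript" })
    urlsByName.set(script.getAttribute("name"), URL.createObjectURL(blob))
  }
  return urlsByName
}
```

In a browser, the entry module could then be started with a dynamic `import()` of its blob URL. A regex will misfire on import-like text inside strings or comments, which is where a proper parser would earn its keep.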
Comment by afavour 2 days ago
Comment by embedding-shape 2 days ago
Comment by xlii 2 days ago
I won't comment on the project (though the idea seems interesting), but this in the README is a tell for me ;)
Comment by maxloh 2 days ago
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
They also offer a CLI powered by Puppeteer. [1]
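To make the base64 packing concrete, here is a rough sketch of the general technique, not SingleFile's actual code. `toDataUri` and `inlineAssets` are hypothetical names, and the regex only handles quoted `src`/`href` attributes whose value exactly matches a key in the asset map.

```javascript
// Encode raw bytes as a data: URI so the asset can live inside the HTML file.
function toDataUri(bytes, mimeType) {
  let binary = ""
  for (const byte of bytes) binary += String.fromCharCode(byte)
  return `data:${mimeType};base64,${btoa(binary)}`
}

// Hypothetical helper: swap asset references for data: URIs.
// `assets` is a Map from the URL as written in the HTML to { bytes, mimeType }.
function inlineAssets(html, assets) {
  const pattern = /\b(src|href)=(["'])([^"']+)\2/g
  return html.replace(pattern, (match, attr, quote, url) => {
    const asset = assets.get(url)
    // References without a captured asset (for example page links) stay as they are.
    return asset ? `${attr}=${quote}${toDataUri(asset.bytes, asset.mimeType)}${quote}` : match
  })
}
```

Base64 grows each asset by roughly a third, which is the price of having a single file that is easy to transfer.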
Comment by tamnd 2 days ago
What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
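Mirroring a whole site rather than one page mostly comes down to deciding which discovered links to follow. The sketch below is one plausible way to do that and is not taken from Kage's source: `inScopeLinks` is a hypothetical name, and the `scopePrefix` parameter is only loosely modeled on the `--scope-prefix` flag shown elsewhere in this thread, whose exact semantics are an assumption here.

```javascript
// Hypothetical helper: resolve a page's links and keep only new, in-scope ones.
// `seen` is a Set of absolute URLs already queued; it is updated in place.
function inScopeLinks(pageUrl, hrefs, scopePrefix, seen) {
  const base = new URL(pageUrl)
  const result = []
  for (const href of hrefs) {
    let url
    try {
      url = new URL(href, base) // resolves relative links against the page
    } catch {
      continue // skip malformed links
    }
    url.hash = "" // fragments point into the same page, so drop them
    if (url.origin !== base.origin) continue // stay on the same site
    if (!url.pathname.startsWith(scopePrefix)) continue // stay under the prefix
    if (seen.has(url.href)) continue // already queued or crawled
    seen.add(url.href)
    result.push(url.href)
  }
  return result
}
```

A crawler would call this for every rendered page and push the returned URLs onto its queue until the queue is empty or a page/depth limit is hit.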
Comment by maxloh 2 days ago
I think the misunderstanding stems from the browser's "Save As" reference in the description. It is misleading. You use "Save As" to save a single page, not an entire website.
Also, the description lacks a clear explanation of the project's purpose. It would be helpful to include a sentence explaining that the program downloads an entire website, not just a single page.
Comment by nikisweeting 2 days ago
I highly recommend reading the singlefile source or https://archiveweb.page/ to see how they handle closed shadow DOMs, cross-origin iframes, websockets, media urls, deduping large assets, etc.
Comment by sillysaurusx 2 days ago
Not the same thing, but I made a clone of pg’s website which can be used for exactly that: https://github.com/shawwn/pg
If you want to read all essays, just clone the repo and open any of the .html files. Or any of the .page files which generated them.
Comment by sdevonoes 2 days ago
Comment by sermah 2 days ago
Comment by ivangelion 2 days ago
Comment by wamatt 2 days ago
That said, Kage looks promising if OP can combine SingleFile reproduction quality with the HTTrack spidering approach. SPAs are kinda tricky to archive, and I do wonder how well Kage would handle them.
Comment by initramfs 2 days ago
For some reason it displays better in IE, but I don't recall seeing this option in Chrome or Firefox recently.
Comment by tamnd 2 days ago
Comment by maxloh 2 days ago
That way, the page is self-contained as it is, but requires no bundled binary code to serve the site. It is actually safer security-wise.
The vendored script can be as simple as this:
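const site = {
  "path-1": "<!DOCTYPE html><html> ... </html>",
  "path-2": "<!DOCTYPE html><html> ... </html>",
  // More paths
}
function attachListeners() {
  for (const [path, html] of Object.entries(site)) {
    // Quote the attribute value so paths containing "/" or "." form a valid selector
    for (const link of document.querySelectorAll(`a[href="${path}"]`)) {
      link.onclick = (event) => {
        event.preventDefault() // stay in this file instead of following the link
        // Rewrite the whole document; assigning documentElement.outerHTML throws
        document.open()
        document.write(html)
        document.close()
        attachListeners()
      }
    }
  }
}
document.addEventListener("DOMContentLoaded", attachListeners)
Comment by HelloUsername 2 days ago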
Comment by nmstoker 2 days ago
Comment by dmazzoni 2 days ago
Let's say you have a site that fetches content from a database. If you Save As, then at best you'll get a local copy of an HTML page with JS that loads the content from the same remote database. It might not work (since the local copy has a different origin), or if it does, it requires you to be online, which defeats half of the purpose.
What this project, and SingleFile, both do is save a snapshot of what the rendered page actually looks like at that moment in time. The scripts are stripped out so it runs locally and has no external dependencies.
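As a rough illustration of the "scripts are stripped out" step, and not the code either tool uses, a serialized snapshot can be passed through something like the hypothetical `stripScripts` below. It is deliberately crude: it removes `<script>` elements only and does not touch inline event handlers or `javascript:` URLs.

```javascript
// Hypothetical helper: drop <script> elements from an already-rendered HTML snapshot.
function stripScripts(html) {
  // Non-greedy match from each opening <script ...> to its closing </script>.
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
}
```

Working on the live DOM (removing the elements before serializing) is more robust than a regex over text; the string version is shown only because it is easy to follow.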
Comment by arikrahman 2 days ago
Comment by telesilla 2 days ago
Comment by throwaway219450 2 days ago
https://wiki.openzim.org/wiki/Build_your_ZIM_file
EDIT: https://get.kiwix.org/en/solutions/applications/kiwix-reader...
Comment by tamnd 2 days ago
The executable file is mostly for people who don't have Kiwix installed yet, or just want to run the archive directly.
Comment by telesilla 2 days ago
Comment by tamnd 2 days ago
Comment by nikisweeting 2 days ago
Comment by latexr 2 days ago
pandoc --from html --to epub --output /PATH/TO/FILE.epub https://example.com
Comment by arikrahman 2 days ago
Comment by aynite 2 hours ago
Comment by gregwebs 2 days ago
Comment by dimiprasakis 2 days ago
In any case, cool stuff :)
Comment by nikisweeting 2 days ago
Comment by tamnd 2 days ago
https://github.com/tamnd/kage/blob/main/Dockerfile
Btw, let me think about how to enable this only when running inside Docker.
Comment by nikisweeting 2 days ago
Comment by tamnd 2 days ago
Thanks for the nice trick.
Comment by dimiprasakis 1 day ago
But a compromise still lands on the host's kernel; Docker doesn't provide kernel isolation (well, it does on macOS because it runs in Docker machine, but that's a side effect).
I wonder if a better solution would be to play with seccomp or Linux capabilities so that Chrome is sandboxed even in Docker. Not sure how this would work tbh.
Answering here to get ideas. I saw your fix on Git and the request for feedback (will try to review and give it some thought once I find some time).
Comment by nikisweeting 1 day ago
Comment by rahimnathwani 2 days ago
Comment by tamnd 2 days ago
Comment by ralferoo 1 day ago
Comment by coffeecoders 2 days ago
It's one of the reasons I've become a bigger fan of RSS over time. A feed from 10-ish years ago is often more usable today than a carefully preserved (application) website.
Comment by tamnd 2 days ago
Comment by couscouspie 1 day ago
Comment by b0dhimind 13 hours ago
Comment by shinryuu 2 days ago
Compared to that is there anything kage does better?
Comment by soulofmischief 2 days ago
https://github.com/jart/cosmopolitan
https://justine.lol/cosmopolitan/index.html
(Certificates just expired for justine's website, just ignore the warning.)
Comment by tamnd 2 days ago
I did something like that a very long time ago (Of course, I have forgotten)
Comment by jokethrowaway 2 days ago
I'd rather have platform specific minimal binaries than a single binary with hacks.
Installing packages is a solved problem
Comment by soulofmischief 2 days ago
It's fine if you don't personally find it useful for your workflow, but I think it's mad cool, especially since you can zip together multiple binaries into one, along with data.
Comment by jokethrowaway 2 days ago
I would recommend an add-on or new feature to detect and remove cookie banners / annoying popups that open on load (eg. sign up to my mailing list).
Listing a few examples from fastText could help you.
You might also have the opposite problem though: some websites have content in the base html (so it's searchable by Google and they get views) and remove it on load (so you have to pay).
Capturing the initial html and comparing it to the final version could give you some hints and allow you to repair the removed content.
Best of luck with the project!
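The initial-versus-final comparison suggested above could be sketched roughly as follows. This is an assumption-laden toy: `textBlocks` and `removedBlocks` are hypothetical names, only `<p>` elements are compared, and matching is by exact text after whitespace normalization.

```javascript
// Hypothetical helper: extract the plain text of each <p> element.
function textBlocks(html) {
  const blocks = []
  const pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi
  for (const match of html.matchAll(pattern)) {
    // Drop nested tags and collapse whitespace so formatting changes don't matter.
    const text = match[1].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim()
    if (text) blocks.push(text)
  }
  return blocks
}

// Paragraphs present in the initial HTML but missing after the page finished loading.
function removedBlocks(initialHtml, finalHtml) {
  const kept = new Set(textBlocks(finalHtml))
  return textBlocks(initialHtml).filter((text) => !kept.has(text))
}
```

A non-empty result hints that the page removed content on load, and the returned paragraphs are candidates for restoring into the archived copy.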
Comment by kadhirvelm 2 days ago
But will look into this now, see if we can swap some stuff out. We’ve really liked the idea of an offline mirror, makes a lot of collaboration use cases simpler
Comment by lolpython 2 days ago
Comment by sanqui 2 days ago
Comment by tamnd 2 days ago
By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli
Comment by sanqui 2 days ago
Comment by tamnd 2 days ago
For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!
Comment by sanqui 2 days ago
Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There are also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!
Comment by Prime_Axiom 2 days ago
Comment by threecheese 2 days ago
Comment by tamnd 2 days ago
Comment by Dhavidh 2 days ago
Comment by sails 2 days ago
Comment by Igor_Wiwi 2 days ago
Comment by amatecha 2 days ago
Comment by cynicalsecurity 2 days ago
Comment by carsonye 2 days ago
Comment by tamnd 1 day ago
Comment by tamnd 1 day ago
Comment by jyscao 2 days ago
Comment by chinnyys 2 days ago
Is the code also AI slop?
Comment by Sathwickp 1 day ago
Comment by godot 2 days ago
For an entire website of many pages, though, I can see this being useful.
Comment by c7b 2 days ago
Comment by tamnd 1 day ago
For video downloading, I suggest wrapping around yt-dlp. It's an awesome tool.
Comment by calrizien 2 days ago
Comment by tamnd 2 days ago
I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages, I do not remember exactly. The output ended up being very large.
By the way, I forgot to add zstd compression support to my ZIM reader/writer. I will implement that in the next version.
Comment by tamnd 1 day ago
```bash
bin/kage clone https://developer.apple.com/documentation/ \
  --scope-prefix /documentation/ \
  --out /Users/apple/data/apple-docs \
  --chrome "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
  --max-pages 0 --max-depth 0 \
  --workers 3 --browser-pages 3 --asset-workers 6 \
  --render-timeout 60s --settle 2s --timeout 30s \
  2>&1 | tee -a /Users/apple/apple-docs.log
```
Adjust it to your needs :)
I smoke-tested it, and all the content and CSS work, but I stripped all the JS, so the sidebar won't work.
If you run into any problems, feel free to create new issues in the repo. It helps me prioritize and know what should be fixed.
Comment by Departed7405 1 day ago
Comment by nitotm 2 days ago
Comment by smusamashah 2 days ago
Comment by daviding 2 days ago
Comment by rickylin 2 days ago
Comment by italiancheese 2 days ago
Have you even read the first line of the readme of the project you're commenting on?
Comment by KellyCriterion 2 days ago
Comment by endorphine 2 days ago
Comment by G_o_D 2 days ago
Comment by chfritz 2 days ago
Comment by sneak 2 days ago
Comment by snowflaxxx 1 day ago
Comment by kjmh 1 day ago
Comment by ekianjo 1 day ago
Comment by Onavo 2 days ago
Comment by delduca 2 days ago
Comment by grahamstanes17 2 days ago
Comment by aa-jv 1 day ago
So I don't quite get what the point of kage is. What does it do that print-to-PDF won't already do? The resulting PDFs contain all the content, and also include the original URL and creation date, etc. How is kage an improvement?
Comment by netdevphoenix 1 day ago
Comment by k4rnaj1k 2 days ago
Comment by eventinbox 2 days ago