Internet Archive's Storage
Posted by zdw 4 days ago
Comments
Comment by dr_dshiv 15 hours ago
That’s impressive. Wikipedia spends $185m per year and the Seattle public library spends $102m. Maybe not comparable exactly, but $30m per year seems inexpensive for the memory of the world…
Comment by AdamN 14 hours ago
Comment by sshine 13 hours ago
The combined value of The Internet Archive -- whether we think just the infrastructure, just the value of the data, or the actual utility value to mankind -- vastly outperforms an individual contributor's at almost every well-paying internet startup. At the simple cost of not getting to pocket that value.
I wish I believed in something this much.
Comment by fragmede 7 hours ago
Comment by sshine 2 hours ago
Comment by toomuchtodo 3 hours ago
Over 1,000 Arizona teachers resigning plays a part in shortage - https://news.ycombinator.com/item?id=46728151 - January 2026
Comment by miki123211 10 hours ago
AWS is priced as if your alternative was doing everything in house, with Silicon Valley salaries. If your goal isn't "go to market quickly and make sure our idea works, no matter the cost", it may not be the right fit for you. If you're a solo developer, non-profit, or another organization with excess volunteer time and little money, you can very often do what AWS does for a fraction of the cost.
Comment by storystarling 8 hours ago
Comment by exe34 13 hours ago
Comment by votepaunchy 12 hours ago
Comment by exe34 9 hours ago
we were told the profit motive and competition would make them efficient.
Comment by komali2 8 hours ago
They believe their own propaganda unfortunately.
Comment by exe34 6 hours ago
Comment by abanana 4 hours ago
A separate issue worth mentioning is that the water companies (as opposed to trains, gas, electricity, Royal Mail, etc) don't fall under this because they were privatised as regional monopolies. The government didn't even (pretend to) attempt to create competition.
Comment by exe34 1 hour ago
Comment by fragmede 7 hours ago
Comment by delusional 11 hours ago
Are you under the impression that the problem African nations are facing is that they're holding hands and singing too much? Are the Africans just lazy?
Comment by zozbot234 12 hours ago
Comment by swores 11 hours ago
To me it seems a perfectly natural effect of nearly everyone using it as a website which holds lots of information, and very few people comparatively have any experience with the community side, so people assume that what they see is what Wikipedia is.
Not many people are spending time reading reports on organisation costs breakdowns for Wikipedia, so the only way they'd know is if someone like you actively tells them. I personally also assumed server costs were the vast majority, with legal costs a probable distant second - but your comment has inspired me to actually go and look for a breakdown of their spending, so thanks.
Edit: FY24-25, "infrastructure" was just 49.2% of their budget - from https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_...
Comment by miki123211 10 hours ago
I suspect that 95+% of visits to Wikipedia don't actually require them to run any PHP code, but are instead just served from some cache, as each Wikipedia user viewing a given article (if they're not logged in) sees basically the same thing.
This is in contrast to e.g. a social network, which needs to calculate timelines per user. Even if there's no machine learning and your algorithm is "most recent posts first", there's still plenty of computation involved. Mastodon is a good example here.
Comment by fragmede 7 hours ago
Comment by wpietri 9 hours ago
Comment by zozbot234 9 hours ago
But they want that information to be at least kept up to date and hopefully to improve over time, right? That's what the community is for. It's not a free lunch.
Comment by swores 1 hour ago
Edit: I wasn't going to say anything, but then noticed you're the same person I was replying to before, so I will since it's more than once - in both your comments you seem to feel that you need to defend Wikipedia but in both cases there was nobody attacking them :)
I appreciate that internet comments can often contain lots of hostility, but I encourage you to remember that it's not a default state, and that often comments are just good faith opinions without an angry subtext. In both cases you could have just written as if adding some interesting information, rather than as if you're countering an anti-Wikipedia campaign. (And I'm not trying to attack or criticise now either, sorry if it comes off that way - just constructive feedback!)
Comment by B1FIDO 7 hours ago
So Wikipedia is not merely a "cloud app with cloud storage" but it is a first-class cloud-based platform: the English project is merely the largest and best-known, but there are hundreds, hundreds of other projects hosted on WMF's cloud services. And the developers and the bot operators who run in the backend are hardly detectable by the end-users or even the everyday editors, but they are also the backbone of WMF services, and they are supported by WMF admins and developers, to run their applications that support editors and wiki admins in their duties.
Comment by vern001 13 hours ago
If I didn’t have a job or responsibilities and was told that I was allowed to just be curious and have fun, I would spend a tremendous amount of time just reading, listening, watching, playing, etc. on IA.
Visiting IA is the closest feeling I can get to visiting the library when I was young. The library used to be the only place where you could just read swaths of magazines, newspapers, and books, and also check out music- for free.
Also, I love random stuff. IA has digitized tape recordings that used to play in K-Mart. While Wikipedia spends time culling history that people have submitted, IA keeps it. They understand the duty they have when you donate part of human history to them, instead of some person that didn’t care about some part of history just deleting it.
IA is not just its storage and the Wayback machine, even though those things are incredible and a massive part of its value to humanity. It’s someone that just cares.
At the end of the day, big companies just need to make profit. Do big companies care about your digitized 8-track collection you have in cloud storage? One day maybe they will take it away from you to avoid a lawsuit or to get you to rent music from them.
And your local NAS and backups? Do you think your niche archive will survive a space heater safety mechanism failure, a pipe bursting, when your house is collateral damage in a war, or your accidental death? I understand wanting to keep your own copies of things just-in-case, but if you want those things to survive, why not also host them at IA if others generally would find joy or knowledge from them?
Comment by fragmede 7 hours ago
Comment by dpedu 10 hours ago
https://web.archive.org/web/20090219172931/https://blogs.msd...
Comment by delusional 11 hours ago
It's not fair to compare an institution with a website.
Comment by entangledqubit 4 hours ago
Physical libraries also tend to be the de facto life help desk for a lot of people out there.
Comment by mc32 11 hours ago
Comment by buildbot 8 hours ago
Comment by komali2 8 hours ago
Comment by bakugo 11 hours ago
Only a small fraction of that is spent on actually hosting the website. The rest goes into the pockets of the owners and their friends.
You can do a lot with very little if your primary goal isn't to enrich yourself.
Comment by Atreiden 10 hours ago
Being a 501(c)(3), they're required to disclose their expenditures, among other things. CN gives them a perfect score, and the expense ratio section puts their program spend at 77.4% of the budget https://www.charitynavigator.org/ein/200049703#overall-ratin...
Worth mentioning that Wikipedia gets an order of magnitude more traffic than the Internet Archive.
Comment by dpedu 10 hours ago
Scroll down to the "Statement of activities (audited)" section:
https://wikimediafoundation.org/annualreports/2023-2024-annu...
Comment by rrr_oh_man 9 hours ago
…across 650 employees, which is $166K on average.
Comment by kingstnap 4 hours ago
https://wikimediafoundation.org/who-we-are/financial-reports...
If you look at the audited financial report of last year.
$3,474,785 was spent on hosting, which makes sense: it's basically a static site.
This is out of expenses of $190,938,007.
That's about 1.8%. This is not new. It's been the case for years. Wikipedia has never had very high hosting costs. The money has always been going into their grants or whatever else.
Despite the nonsense about AI overloading their servers even if it doubled the load it would barely affect the budget.
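The arithmetic behind those figures can be checked directly. This is a minimal sketch using only the two dollar amounts quoted above; the variable names are mine, and the "doubling" case simply assumes hosting spend scales linearly with load, which is an assumption rather than anything stated in the financial report.

```python
# Figures quoted in the comment above (audited financial report).
hosting_usd = 3_474_785
total_expenses_usd = 190_938_007

# Share of total expenses that went to hosting.
hosting_share = hosting_usd / total_expenses_usd
print(f"hosting share of expenses: {hosting_share:.1%}")

# Assumption: if load doubled, hosting spend would double too.
# The extra cost as a fraction of the current budget is then equal
# to the current hosting share.
extra_if_doubled = hosting_usd / total_expenses_usd
print(f"budget increase if hosting doubled: {extra_if_doubled:.1%}")
```

Under that linear assumption, doubling the hosting bill moves total expenses by under two percent.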
Comment by bakugo 9 hours ago
With an order of magnitude less data to host, though. The entirety of Wikipedia is less than 1PB [1], while the entirety of IA is 175+ PB [2].
Traffic is relatively cheap, especially for a very cache-friendly website like Wikipedia.
[1] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
Comment by arjie 17 hours ago
* power budget dominates everything: I have access to a lot of rack hardware from old connections, but I don't want to put the army of old stuff in my cabinet because it will blow my power budget for not that much performance in comparison to my 9755. What disks does the IA use? Any specific variety or like Backblaze a large variety?
* magnetic is bloody slow: I'm not the Internet Archive so I'm just going to have a couple of machines with a few hundred TiB. I'm planning on making them all a big zfs so I can deduplicate but it seems like if I get a single disk failure I'm doomed to a massive rebuild
I'm sure I can work it out with a modern LLM, but maybe someone here has experience with actually running massive storage and the use-case where tomorrow's data is almost the same as today's - as is the case with the Internet Archive where tomorrow's copy of wiki.roshangeorge.dev will look, even at the block level, like yesterday's copy.
The last time I built with multi-petabyte datasets we were still using Hadoop on HDFS, haha!
Comment by adrian_b 12 hours ago
This is especially true when you take into account that, regardless of whether you use HDDs or tapes, you should duplicate them and preferably not keep the copies in the same place.
The difference in cost between tapes and HDDs becomes significantly greater when you take into account that data stored on HDDs must be copied to new HDDs after a few years, due to the short lifetime of HDDs. The time after which you may need to move data to new tapes is not determined by the lifetime of tapes (guaranteed to be at least 30 years) but by the obsolescence of the tape drives for a given standard, and it should be after at least 10 to 15 years.
If you keep on a SSD/HDD a database of the content of the tapes, containing the metadata of the stored files and their location on tapes, the access time to archived data is composed of whatever time you need for taking the tape from a cabinet and inserting it into the drive, plus a seeking time of around 1 minute, on average.
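A catalog like that can be very small. The following is a rough sketch of one way to do it with SQLite from Python's standard library; the table layout, the function names, and the tape labels in the usage lines are all hypothetical, not taken from any particular archiving tool.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Open (or create) the on-disk catalog of what lives on which tape."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS files (
               path          TEXT    NOT NULL,  -- original file path
               size_bytes    INTEGER NOT NULL,
               sha256        TEXT    NOT NULL,  -- checksum for later verification
               tape_label    TEXT    NOT NULL,  -- label printed on the cartridge
               tape_position INTEGER NOT NULL   -- index of the file on that tape
           )"""
    )
    db.execute("CREATE INDEX IF NOT EXISTS files_by_path ON files(path)")
    return db

def record_file(db, path, size_bytes, sha256, tape_label, tape_position):
    """Note that one copy of a file was written to a given tape."""
    db.execute(
        "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
        (path, size_bytes, sha256, tape_label, tape_position),
    )
    db.commit()

def locate(db, path):
    """Return every (tape_label, tape_position) holding a copy of the file."""
    rows = db.execute(
        "SELECT tape_label, tape_position FROM files "
        "WHERE path = ? ORDER BY tape_label",
        (path,),
    )
    return rows.fetchall()

# Usage: the same file written to two tapes kept in different places.
catalog = open_catalog()
record_file(catalog, "photos/2019.tar", 1_000_000, "abc123", "A00001", 7)
record_file(catalog, "photos/2019.tar", 1_000_000, "abc123", "B00001", 7)
copies = locate(catalog, "photos/2019.tar")
print(copies)
```

Looking a file up tells you which cartridge to pull and where to seek on it, so the only slow steps left are the physical ones.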
Once the archived data is reached, the sequential transfer speed of tapes is greater than that of HDDs.
LTO-9 cartridges have a significantly lower volume and weight than 24-TB HDDs (for storing the same amount of data), which simplifies storage and transport.
Comment by Datagenerator 15 hours ago
Comment by xyzzy123 15 hours ago
Yeah, resilvers will take 24h if your pool is getting full but with RAIDZ2 it's not that scary.
I'm running TrueNAS scale. I used to just use Ubuntu (more flexible!) but over many years I had some bad upgrades where kernel & zfs stopped being friends. My rack is pretty nearby so for me, a big 4U case with 120mm front fans was high priority, it has a good noise profile if you replace with Noctuas, you get a constant "whoosh" rather than a whine etc.
Running 8+2 with 24TB drives. I used to run with 20 slots full of old ex-cloud SAS drives but it's more heat / noise / power intensive. Also, you lose flexibility if you don't have free slots. So eventually ponied up for 24TB disks. It hurt my wallet but greatly reduced noise and power.
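For a sense of scale, here is a back-of-the-envelope sketch of what an 8+2 layout of 24 TB drives gives you. It reads "8+2" as eight data drives plus two parity drives in one RAIDZ2 vdev (my reading, not spelled out above), ignores ZFS metadata and padding overhead, and the helper names are made up for illustration.

```python
TB = 10**12   # drive vendors quote decimal terabytes
TIB = 2**40   # ZFS tools usually report binary tebibytes

def raidz_usable_bytes(data_drives, drive_bytes):
    """Capacity left after parity, before any filesystem overhead."""
    return data_drives * drive_bytes

def implied_throughput_mb_s(drive_bytes, hours):
    """Average rate needed to rewrite one whole drive in the given time."""
    return drive_bytes / (hours * 3600) / 10**6

usable = raidz_usable_bytes(data_drives=8, drive_bytes=24 * TB)
print(f"usable before overhead: {usable / TB:.0f} TB ({usable / TIB:.1f} TiB)")

# A 24 h resilver of a nearly full 24 TB drive implies this sustained rate.
rate = implied_throughput_mb_s(24 * TB, hours=24)
print(f"sustained rate for a 24 h resilver: {rate:.0f} MB/s")
```

The second number is why a resilver of a nearly full drive takes about a day: it is close to what a large spinning disk can sustain sequentially.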
Case: RM43-320-RS 4U
CPU: Intel Xeon E3-1231 v3 @ 3.40GHz (4C/8T, 22nm, 80W TDP)
RAM: 32GB DDR3 ECC
Motherboard: Supermicro X10SL7-F (microATX, LGA1150 socket)
- Onboard: Dual Intel I210 1GbE (unused)
- Onboard: LSI SAS2308 8-port SAS2 controller (6Gbps, IT mode)
- Onboard: Intel C220 chipset 6-port SATA controller
Storage Controllers:
- LSI SAS2308 (onboard) → Intel RES2SV240 backplane (SFF-8087 cables)
- Intel C220 SATA (onboard) → boot SSD
Backplane:
- Intel RES2SV240 24-bay 2U/3U SAS2 Expander
- 20× 3.5" hot-swap bays (10 populated, 10 empty)
- Connects via Mini SAS HD SFF-8643 to Mini SAS SFF-8087 Cable, 0.8M x 5
Boot/Cache:
- Intel 120GB SSD SSDSC2CW120A3 (boot drive, SATA)
- Intel Optane 280GB SSDPED1D280GA (ZFS SLOG device, NVMe)
Network:
- Intel 82599ES dual-port 10GbE SFP+ NIC (PCIe x8 add-in card)
It's a super old box but it does fine and will max 10GbE for sequential and do 10k write iops / 1k random read iops without problems. Not great, not terrible. You don't really need the SLOG unless you plan to run VMs or databases off it.
I personally try to run with no more than 10 slots out of 20 used. This gives a bit of flexibility for expanding, auxiliary pools, etc etc. Often you find you need twice as much storage as you're planning on directly using. For upgrades, snapshots, transfers, ad-hoc stuff etc.
Re: dedup, I would personally look to dedup at the application layer rather than in the filesystem if I possibly could? If you are running custom archiving software then it's something you'd want to handle in the scope of that. Depends on the data obviously, but it's going to be more predictable, and you understand your data the best. I don't have zfs de-dup turned on but for a 200TiB pool with 128k blocks, the zfs DDT will want like 500GiB ram. Which is NOT cheap in 2026.
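The 500GiB figure can be reproduced with a short estimate. This sketch assumes roughly 320 bytes of RAM per dedup-table entry, a commonly quoted rule of thumb rather than an exact number, and it assumes the worst case where the pool is full and every block is unique. The function name is mine.

```python
KIB = 2**10
GIB = 2**30
TIB = 2**40

def ddt_ram_bytes(pool_bytes, block_bytes, bytes_per_entry=320):
    """Worst-case dedup table size: one entry per unique block.

    bytes_per_entry=320 is an assumed rule-of-thumb value, not a
    figure read from any specific ZFS release.
    """
    n_blocks = pool_bytes // block_bytes
    return n_blocks * bytes_per_entry

estimate = ddt_ram_bytes(pool_bytes=200 * TIB, block_bytes=128 * KIB)
print(f"estimated DDT size: {estimate / GIB:.0f} GiB")
```

Smaller record sizes make this worse in direct proportion: halving the block size doubles the number of entries.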
I also run a 7-node ceph cluster "for funsies". I love the flexibility of it... but I don't think ceph truly makes sense until you have multiple racks or you have hard 24/7 requirements.
Comment by genewitch 15 hours ago
for the first two, depending on throughput desired, you can do with spinning rust. you pick your exposure, single platter or not, speed or not, and interface. And no fancy raid hardware needed.
I've had decent luck with 3+1 warm and 4+1 archival. if you don't need quick seeks but want streaming data to be nice, make sure your largest file fits on a single drive, and do two parity disks for archive, a single for warm. md + lvm; ext4 fs, too. my very biased opinion based on tried everything and am out of ideas, and i am tired, and that stuff just works. I am not quick to the point but you need to split your storage up. use 18+ SMR disks, shingled magnetic recording hard drives, for larger stuff that you don't need to transfer very fast. 4k video for consumption on a 4k television fits here. Use faster, more reliable disks for data used a lot, &c
Hot or fast seeks & transfers is different, but i didn't get the idea that's what you were after. Hadoop ought be used for hot data, imo. People may argue that zfs or xfs or jfs or ffs is better than ext4, but are they gunna jump in and fix it for free when something goes wrong for whatever reason?
sorry, this is confusing. Unsure how to fix that. i have files on this style system that have been in continuous readable condition since the mid 1990s. There's been some bumps as i tried every [sic] other system and method.
TL;dr to scale my 1/10th size up, i personally would just get a bigger box to put the disks in, and add an additional /volumeN/ mountpoint for each additional array i added. it goes without saying that under that directory i would CIFS/NFS share subdirectories that fit that array's specifications. again, i am just tired of all of this, i'm also all socialed out so, apologies.
Comment by mrexroad 15 hours ago
Are there any other data centers harvesting waste heat for benefit?
Comment by cloud-oak 14 hours ago
https://www.twobirds.com/en/insights/2024/germany/rechenzent...
Comment by londons_explore 3 hours ago
Comment by Dr4kn 3 hours ago
If you can get paid on your waste heat why wouldn't you like that?
Comment by fragmede 3 hours ago
Comment by miduil 15 hours ago
Also data centers need physical space, and often - you need heating where there is not a lot of space (cities), and for "district heating" you need higher temperatures usually.
Comment by stanac 15 hours ago
https://www.euroheat.org/dhc/knowledge-hub/datacentre-suppli...
Comment by bilegeek 15 hours ago
Comment by stingraycharles 13 hours ago
I do vaguely remember that the economics of it all were not great, but it’s definitely been a thing for quite a while already.
Comment by arcade79 13 hours ago
Has any of the big ones released articles on their storage systems in the last 5-10 years?
Comment by smueller1234 8 hours ago
https://cloud.google.com/blog/products/storage-data-transfer...
https://cloud.google.com/blog/products/storage-data-transfer...
Facebook's published content on Tectonic is quite good and I think it's well more recent than 2010-14.
(Current Google employee, just pointing to public content, hope that's helpful.)
Comment by arcade79 7 hours ago
Comment by theMMaI 10 hours ago
Comment by 1vuio0pswjnm7 2 hours ago
https://en.wikipedia.org/wiki/Wayback_Machine
https://blog.archive.org/2025/09/02/looking-back-on-preservi...
https://archive.org/web/petabox.php
https://en.wikipedia.org/wiki/PetaBox
https://github.com/internetarchive/dweb-archive
https://en.wikipedia.org/wiki/Internet_Archive
https://www.eweek.com/storage/making-web-memories-with-the-p...
https://internetarchive.archiveteam.org/index.php/PetaBox
https://blog.archive.org/2010/07/27/the-fourth-generation-pe...
https://hackaday.com/2025/11/18/internet-archive-hits-one-tr...
https://www.computerworld.com/article/1562759/the-internet-a...
https://www.datacenterknowledge.com/business/internet-archiv...
https://www.rootsimple.com/2023/08/inside-the-internet-archi...
https://richmondsunsetnews.com/2017/03/11/internet-archive-p...
https://en.wikipedia.org/wiki/Heritrix
https://support.archive-it.org/hc/en-us/articles/11500108118...
https://digitalcommons.odu.edu/cgi/viewcontent.cgi?article=1...
https://iipc.github.io/warc-specifications/specifications/wa...
https://usehall.com/agents/heritrix-bot
https://library.imaging.org/admin/apis/public/api/ist/websit...
https://blog.archive.org/2025/03/
https://archive.org/details/alexacrawls
https://en.wikipedia.org/wiki/Alexa_Internet
https://projects.propublica.org/nonprofits/organizations/943...
https://werd.io/update-on-the-20242025-end-of-term-web-archi...
https://www.historyascode.com/tools-data/archive-it/
https://digitization.archive.org/pricing/
https://www.sfgate.com/tech/article/bay-area-warehouse-inter...
https://vault-webservices.zendesk.com/hc/en-us/articles/2289...
https://en.wikipedia.org/wiki/Hachette_v._Internet_Archive
https://copyrightalliance.org/copyright-cases/hachette-book-...
https://law.justia.com/cases/federal/appellate-courts/ca2/23...
https://www.library.upenn.edu/news/hachette-v-internet-archi...
https://www.lutzker.com/ip_bit_pieces/internet-archives-open...
https://blog.archive.org/2023/08/17/what-the-hachette-v-inte...
https://www.musicbusinessworldwide.com/labels-settle-copyrig...
https://consequence.net/2025/09/internet-archive-labels-sett...
https://blog.archive.org/2025/09/15/an-update-on-the-great-7...
https://giga.law/daily-news/2025/9/15/music-publishers-inter...
https://www.webpronews.com/internet-archive-settles-copyrigh...
https://blog.archive.org/2025/07/
https://blog.archive.org/2018/07/21/decentralized-web-faq/
https://blog.archive.org/2016/06/23/decentalized-web-server-...
https://blog.archive.org/2025/02/06/update-on-the-2024-2025-...
https://www.reddit.com/r/DataHoarder/comments/1ijkdjl/progre...
Comment by tylerchilds 19 hours ago
Comment by tylerchilds 19 hours ago
Comment by ranger_danger 19 hours ago
Comment by metadat 17 hours ago
Comment by jonas21 17 hours ago
https://hackernoon.com/the-long-now-of-the-web-inside-the-in...
Comment by reaperducer 17 hours ago
I wouldn't be surprised if it's AI.
It's time to come up with a term for blog posts that are just AI-augmented re-hashes of other people's writing.
Maybe blogslop.
Comment by tolerance 16 hours ago
I’m under the impression that this style of writing is what people wish they got when they asked AI to summarize a lengthy web page. It’s criticism and commentary. I can’t see how you missed out on the passages that add to and even correct or argue against statements made in the Hackernoon article.
In a way I can’t tell how one can believe that “re-hashing [an article], interspersed with [the blogger’s] own comments” isn’t a common blogging practice. If not then the internet made a mistake by allowing the likes of John Gruber to earn a living this way.
And trust that I enjoy a good knee-jerk “slop” charge myself. To me this doesn’t qualify a bit.
Comment by dexdal 17 hours ago
Comment by schainks 17 hours ago
Comment by badlibrarian 17 hours ago
Comment by dang 16 hours ago
"Don't be snarky."
Comment by badlibrarian 15 hours ago
Comment by dang 5 hours ago
Comment by krackers 16 hours ago
Comment by chimeracoder 17 hours ago
So it sounds like they have data in other locations as well, hopefully.
Comment by electroly 16 hours ago
[1] https://en.wikipedia.org/wiki/Internet_Archive#Operations
Comment by badlibrarian 15 hours ago
Comment by tolerance 15 hours ago
Comment by badlibrarian 15 hours ago
Comment by JavohirXR 12 hours ago