ISO PDF spec is getting Brotli – ~20 % smaller documents with no quality loss

Posted by whizzx 2 days ago

Comments

Comment by ericpauley 2 days ago

Some real cognitive dissonance in this article…

“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.

All this for brotli… on a read-many format like pdf zstd’s decompression speed is a much better fit.

Comment by xxs 2 days ago

yup, zstd is better. Overall use zstd for pretty much anything that can benefit from a general purpose compression. It's a beyond excellent library, tool, and an algorithm (set of).

Brotli w/o a custom dictionary is a weird choice to begin with.

Comment by adzm 2 days ago

Brotli makes a bit of sense considering this is a static asset; it compresses somewhat more than zstd. This is why brotli is pretty ubiquitous for precompressed static assets on the Web.

That said, I personally prefer zstd as well, it's been a great general use lib.

Comment by dist-epoch 2 days ago

You need to crank up zstd compression level.

zstd is Pareto better than brotli - compresses better and faster

Comment by atiedebee 2 days ago

I thought the same, so I ran brotli and zstd on some PDFs I had laying around.

  brotli 1.0.7 args: -q 11 -w 24
  zstd v1.5.0  args: --ultra -22 --long=31 
                 | Original | zstd    | brotli
  RandomBook.pdf | 15M      | 4.6M    | 4.5M
  Invoice.pdf    | 19.3K    | 16.3K   | 16.1K

I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.

Brotli seemed to have a very slight edge over zstd, even on the larger pdf, which I did not expect.

Comment by mort96 2 days ago

EDIT: Something weird is going on here. When compressing zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces result competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158

I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044

Results by compression type across 55 PDFs:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------|------|-----|------|--------|
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+

Comment by mort96 2 days ago

Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them which reports the size on disk, which is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.

Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):

    +---------+---------+--------+--------+--------+
    |  none   |  zstd   |   xz   |  gzip  | brotli |
    +---------|---------|--------|--------|--------|
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+

These numbers are much more impressive. Still, Brotli has a slight edge.

Comment by tracker1 1 day ago

Worth considering the compress/decompress overhead, which is also lower in brotli than zstd from my understanding.

Also, worth testing zopfli since it's decompression is gzip compatible.

Comment by atiedebee 1 day ago

Ran the tests again with some more files, this time decompressing the pdf in advance. I picked some widely available PDFs to make the experiment reproducable.

  file            | raw         | zstd       (%)      | brotli     (%)     |
  gawk.pdf        | 8.068.092   | 1.437.529  (17.8%)  | 1.376.106  (17.1%) |
  shannon.pdf     | 335.009     | 68.739     (20.5%)  | 65.978     (19.6%) |
  attention.pdf   | 24.742.418  | 367.367    (1.4%)   | 362.578    (1.4%)  |
  learnopengl.pdf | 253.041.425 | 37.756.229 (14.9%)  | 35.223.532 (13.9%) |

For learnopengl.pdf I also tested the decompression performance, since it is such a large file, and got the following (less surprising) results using 'perf stat -r 5':

  zstd:   0.4532 +- 0.0216 seconds time elapsed  ( +-  4.77% )
  brotli: 0.7641 +- 0.0242 seconds time elapsed  ( +-  3.17% )

The conclusion seems to be consistent with what brotli's authors have said: brotli achieves slightly better compression, at the cost of a little over half the decompression speed.

Comment by mrspuratic 2 days ago

> I couldn't quickly find a way to decompress them

    pdftk in.pdf output out.pdf decompress

Comment by Thoreandan 2 days ago

Does your source .pdf material have FlateDecode'd chunks or did you fully uncompress it?

Comment by atiedebee 1 day ago

I wasn't sure. I just went in with the (probably faulty) assumption that if it compresses to less than 90% of the original size that it had enough "non-randomness" to compare compression performance.

Comment by order-matters 2 days ago

Whats the assumption we can potentially target as reason for the counter-intuitive result?

that data in pdf files are noisy and zstd should perform better on noisy files?

Comment by jeffbee 2 days ago

What's counter-intuitive about this outcome?

Comment by order-matters 2 days ago

maybe that was too strongly worded but there was an expectation for zstd to outperform. So the fact it didnt means the result was unexpected. i generally find it helpful to understand why something performs better than expected.

Comment by mort96 2 days ago

Isn't zstd primarily designed to provide decent compression ratios at amazing speeds? The reason it's exciting is mainly that you can add compression to places where it didn't necessarily make sense before because it's almost free in terms of CPU and memory consumption. I don't think it has ever had a stated goal of beating compression ratio focused algorithms like brotli on compression ratio.

Comment by sgerenser 2 days ago

I actually thought zstd was supposed to be better than Brotli in most cases, but a bit of searching reveals you're right... Brotli, especially at the highest compression levels (10/11), often exceeds zstd at the highest compression levels (20-22). Both are very slow at those levels, although perfectly suitable for "compress once, decompress many" applications which the PDF spec is obviously one of them.

Comment by jeffbee 2 days ago

Are you sure? Admittedly I only have 1 PDF in my homedir, but no combination of flags to zstd gets it to match the size of brotli's output on that particular file. Even zstd --long --ultra -22.

Comment by xxs 1 day ago

on max compression (11 vs zstd's 22) of text brotli will be around 3-4% denser... and a lot slower. Decompression wise zstd is over 2x faster.

The pdfs you have are already compressed with deflate (zip).

Comment by DetroitThrow 2 days ago

I love zstd but this isn't necessarily true.

Comment by dchest 2 days ago

Not with small files.

Comment by Dylan16807 2 days ago

If that's about using predefined dictionaries, zstd can use them too.

If brotli has a different advantage on small source files, you have my curiosity.

If you're talking about max compression, zstd likely loses out there, the answer seems to vary based on the tests I look at, but it seems to be better across a very wide range.

Comment by dchest 6 hours ago

No, it's literally just compressing small files without training zstd dict or plugging external dictionaries (not counting the built-in one that brotli has). Especially for English text, brotli at the same speed as zstd gives better results for small data (in kilobyte to a few of megabyte range).

Comment by itsdesmond 2 days ago

> Pareto

I don’t think you’re using that correctly.

Comment by wizzwizz4 2 days ago

It's correct use of Pareto, short for Pareto frontier, if the claim being made is "for every needed compression ratio, zstd is faster; and for every needed time budget, zstd is faster". (Whether this claim is true is another matter.)

Comment by stonogo 2 days ago

brotli is ubiquitous because Google recommends it. While Deflate definitely sucks and is old, Google ships brotli in Chrome, and since Chrome is the de facto default platform nowadays, I'd imagine it was chosen because it was the lowest-effort lift.

Nevertheless, I expect this to be JBIG2 all over again: almost nobody will use this because we've got decades of devices and software in the wild that can't, and 20% filesize savings is pointless if your destination can't read the damn thing.

Comment by deepsun 2 days ago

Brotli compresses my files way better, but it's doing it way slower. Anyway, universal statement "zstd is better" is not valid.

Comment by xxs 1 day ago

On max compression "--ultra -22", zstd is likely to be 2-4% less dense (larger) on text alike input. While taking over 2x times times to compress. Decompression is also much faster, usually over 2x.

I have not tried using a dictionary for zstd.

Comment by greenavocado 2 days ago

This bizzare move has all the hallmarks of embrace-extend-extinguish rather than technical excellence

Comment by mmooss 2 days ago

Note the language: "You're not creating broken files—you're creating files that are ahead of their time."

Imagine a sales meeting where someone pitched that to you. They have to be joking, right?

I have no objection to adding Brotli, but I hope they take the compatability more seriously. You may need readers to deploy it for a long time - ten years? - before you deploy it in PDF creation tools.

Comment by nxobject 2 days ago

(sarcasm warning...)

You're absolutely right! It's not just an inaccurate slogan—it's a patronizing use of artificial intelligence. What you're describing is not just true, it's precise.

Comment by mmooss 1 day ago

I don't understand your point ...

Comment by eventualcomp 1 day ago

The commenter is making a joke about the style of delivery of the sentence you quoted, because the style is [1]characteristic of AI generated writing.

[1]https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

Comment by spider-mario 1 day ago

> on a read-many format like pdf zstd’s decompression speed is a much better fit.

brotli decompression is already plenty fast. For PDFs, zstd’s advantage in decompression speed is academic.

Comment by deepsun 2 days ago

Well, except for speed, compression algorithms need to be compared in terms of compression, you know.

Here's discussion by brotli's and zstd's staff:

https://news.ycombinator.com/item?id=19678985

Comment by bhouston 2 days ago

Are they using a custom dictionary with Brotli designed for PDFs? I am not sure if it would help or not, but it seems like one of those cases it may help?

Something like this:

https://developer.chrome.com/blog/shared-dictionary-compress...

In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.

Comment by whizzx 2 days ago

The pdf association is still running experiments on whether or not to support custom dictionaries based on real life workloads gains.

So it might land in the spec once it has proven if offers enough value

Comment by Proclus 2 days ago

It seems they're using the standard dictionary, which is utterly bizzare.

The standard Brotli dictionary bakes in a ton of assumptions about what the Web looked like in 2015, including not just which HTML tags were particularly common but also such things as which swear words were trendy.

It doesn't seem reasonable to think that PDFs have symbol probabilities remotely similar to the web corpus Google used to come up with that dictionary.

On top of that, it seems utterly daft to be baking that into a format which is expected to fit archival use cases and thus impose that 2015 dictionary on PDF readers for a century to come.

I too would strongly prefer that they use zstd.

Comment by bhouston 2 days ago

BTW I've looked into custom dictionaries before for similar use cases and I suspect it would only offer like a 1% improvement or so for PDFs -- still good, but not a massive difference maker. The issue is that PDFs, like web pages, are incredibly repetitive in terms of their tags/structure. As such the custom dictionary only helps if the doc is really small, otherwise because of the repetitive nature, the self-inferred dictionary will resemble the custom dictionary after just a few blocks of PDF content.

The sole exception is if they are restarting the brotli stream for each page, and they are not sharing a dictionary, custom or inferred across the whole doc. Then the dictionary will have to be re-inferred on each page, and then a shared custom dictionary would make more sense.

Comment by bobpaw 2 days ago

How can iText claim that adding Brotli is not a backward incompatible change (in the "Why keep encoding separate" table)? In the first section the author states that any new feature must work seamlessly with existing readers. New documents created that include this compression would be unintelligible to any reader that only supports Deflate.

Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.

Comment by whizzx 2 days ago

It's prototypish work to support it before it land's in the official specification. But it will indeed take some adoption time.

Because I'm doing the work to patch in support across different viewers to help adoption grow. And once the big opensource ones ship it pdfjs, poppler, pdfium, adoption can quickly rise.

Comment by croes 2 days ago

There are old devices where the viewer can’t be patched. That’s killing one of the main features of PDF

Comment by nialse 2 days ago

Who is responsible for the terrible decision? In the pro vs con analysis, saving 20% size occasionally vs updating ALL pdf libraries/apps/viewers ever built SHOULD be a no-brainer.

Comment by superkuh 2 days ago

This is nice, but PDF jumped the shark already. It's no longer a document format that always looks the same everywhere. The inclusion of "Dynamic XFA (XML Form Architecture) PDF" in the spec made it so PDF is an unreliable format. The aformentioned is a PDF without content that pulls down all it's content from the web. It even still, ostensibly, supports Flash (swf) animations. In practice these "PDF"s are just empty white pages with an error message like,

>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."

Comment by kayodelycaon 2 days ago

Fortunately, XFA is deprecated. I haven’t seen one of those for a very long time.

Comment by superkuh 2 days ago

Maybe in spec, but the damage is done and persists.

The (USA) Wisconsin Dept. of Natural Resources has nearly all their regulation PDFs as these XFA non-pdfs that I cannot read. So I cannot know the regulations. My emails about this topic (to multiple addresses over many years a dozen times) have gone unanswered.

If Acrobat supports it it doesn't matter what the spec says. Until Adobe drops XFA from Acrobat and forces these extremely silly people to stop, PDF is no longer PDF.

Comment by ndriscoll 2 days ago

What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd compressed. HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems like it's better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with bro vs zstd; let the person who has to live with the tradeoff decide it, not the original file author).

Comment by wongarsu 2 days ago

Few people enable file system compression, and even if they do it's usually with fast algorithms like lz4 or zstd -1. When authoring a document you have very different tradeoffs and can afford the cost of high compression levels of zstd or brotli.

Comment by Someone 2 days ago

- inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

- when jumping from page to page, you won’t have to decompress the entire file

Comment by wizzwizz4 2 days ago

> inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

> when jumping from page to page, you won’t have to decompress the entire file

This is already a thing with any compression format that supports quasi-random access, which is most of them. The answers to https://stackoverflow.com/q/429987/5223757 discuss a wide variety of tools for producing (and seeking into) such files, which can be read normally by tools not familiar with the conventions in use.

Comment by Someone 1 day ago

> Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

Far from the same amount:

- existing tools that split PDFs into pages will remain working

- if defensively programmed, existing PDF readers will be able to render PDFs containing JPEG XL images, except for the images themselves.

Comment by eru 2 days ago

Well, if sanity had prevailed, we would have likely stuck to .ps.gz (or you favourite compression format), instead of ending up with PDF.

Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.

Comment by dunham 2 days ago

Don't you end up with PDF if you start with PS and restrict it to a subset? And maybe normalize the structure of the file a little. The structure is nice when you want to take the content and draw a bit more on the page. Or when subsetting/combining files.

I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.

I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.

Comment by eru 1 day ago

> Don't you end up with PDF if you start with PS and restrict it to a subset?

PDF is also a binary format.

Comment by mikkupikku 2 days ago

I thought PDFs can contain arbitrary PS.

Comment by lmz 2 days ago

Compression filters are in PostScript.

Comment by ksec 2 days ago

Why not zstd?

Comment by PunchyHamster 2 days ago

incompetence

Comment by whizzx 2 days ago

You can read about it here https://pdfa.org/brotli-compression-coming-to-pdf/

Comment by jeffbee 2 days ago

That mentions zstd in a weird incomplete sentence, but never compares it.

Comment by F3nd0 2 days ago

They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?

Comment by eviks 2 days ago

Hey, they did all the work and more, trust them!!!

> Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.

Comment by LoganDark 2 days ago

I love when I perform all the due diligence tasks. You just can't counter that. Yes but, they did all the due diligence tasks. They considered all the factors. Every one. Think you have one they didn't consider? Nope.

Comment by jsnell 2 days ago

But they didn't write "all". They wrote "other", which absolutely does not imply full coverage.

Maybe read things a bit more carefully before going all out on the snide comments?

Comment by wizzwizz4 2 days ago

In fact, they wrote "reviewing […] other due diligence tasks", which doesn't imply any coverage! This close, literal reading is an appropriate – nay, the only appropriate – way to draw conclusions about the degree of responsibility exhibited by the custodians of a living standard. By corollary, any criticism of this form could be rebuffed by appeal to a sufficiently-carefully-written press release.

Comment by LoganDark 1 day ago

It implies potential coverage of anything one could bring up. It creates a similar impression in my mind, because it becomes easy to claim you already considered something.

Comment by 2 days ago

Comment by HackerThemAll 2 days ago

I think this was the main reason (from the linked article) LOL:

"Brotli is a compression algorithm developed by Google."

They have no idea about Zstandard nor ANS/FSE comparing it with LZ77.

Sheer incompetence.

Comment by cortesoft 2 days ago

I can’t imagine the people actually doing the technical work don’t know about Zstandard.

Comment by mort96 2 days ago

I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.

I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------|------|-----|------|--------|
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+

Here's a table with all the files:

    +------+------+------+------+--------+
    | raw  | zstd | xz   | gzip | brotli |
    +------+------+------+------+--------+
    | 12K  | 12K  | 12K  | 12K  | 12K    |
    | 20K  | 20K  | 20K  | 20K  | 20K    | x5
    | 24K  | 20K  | 20K  | 20K  | 20K    | x5
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 32K  | 20K  | 20K  | 20K  | 20K    | x3
    | 32K  | 24K  | 24K  | 24K  | 24K    |
    | 40K  | 32K  | 32K  | 32K  | 32K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 48K  | 36K  | 36K  | 36K  | 36K    |
    | 48K  | 48K  | 48K  | 48K  | 48K    |
    | 76K  | 128K | 72K  | 72K  | 72K    |
    | 84K  | 140K | 84K  | 80K  | 80K    | x7
    | 88K  | 136K | 76K  | 76K  | 76K    |
    | 124K | 152K | 88K  | 92K  | 92K    |
    | 124K | 152K | 92K  | 96K  | 92K    |
    | 140K | 160K | 100K | 100K | 100K   |
    | 152K | 188K | 128K | 128K | 132K   |
    | 188K | 192K | 184K | 184K | 184K   |
    | 264K | 256K | 240K | 244K | 240K   |
    | 320K | 256K | 228K | 232K | 228K   |
    | 440K | 448K | 408K | 408K | 408K   |
    | 448K | 448K | 432K | 432K | 432K   |
    | 516K | 384K | 376K | 384K | 376K   |
    | 992K | 320K | 260K | 296K | 280K   |
    | 1.0M | 2.0M | 1.0M | 1.0M | 1.0M   |
    | 1.1M | 192K | 192K | 228K | 200K   |
    | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.2M | 1.1M | 1.0M | 1.0M | 1.0M   |
    | 1.3M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.7M | 2.0M | 1.7M | 1.7M | 1.7M   |
    | 1.9M | 960K | 896K | 952K | 916K   |
    | 2.9M | 2.0M | 1.3M | 1.4M | 1.4M   |
    | 3.2M | 4.0M | 3.1M | 3.1M | 3.0M   |
    | 3.7M | 4.0M | 3.5M | 3.5M | 3.5M   |
    | 6.4M | 4.0M | 4.1M | 3.7M | 3.5M   |
    | 6.4M | 6.0M | 6.1M | 5.8M | 5.7M   |
    | 9.7M | 10M  | 10M  | 9.5M | 9.4M   |
    +------+------+------+------+--------+

Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.

Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.

I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have laying around in my downloads folder :p

Comment by mort96 2 days ago

Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):

    +---------+---------+--------+--------+--------+
    |  none   |  zstd   |   xz   |  gzip  | brotli |
    +---------|---------|--------|--------|--------|
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+

These numbers are much more impressive. Still, Brotli has a slight edge.

Comment by terrelln 2 days ago

> | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |

Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?

If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.

Comment by mort96 2 days ago

Okay now this is weird.

I can reproduce it just fine ... but only when compressing all PDFs simultaneously.

To utilize all cores, I ran:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait

(and similar for the other formats).

I ran this again and it produced the same 2M file from the source 1.1M file. However when I run without paralellization:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done

That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).

What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?

Comment by terrelln 2 days ago

Yeah, `--adaptive` will enable adaptive compression, but it isn't enabled by default, so shouldn't apply here. But even with `--adaptive`, after compressing each block of 128KB of data, zstd checks that the output size is < 128KB. If it isn't, it emits an uncompressed block that is 128KB + 3B.

So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.

I will try to reproduce, but I suspect that there is something unrelated to zstd going on.

What version of zstd are you using?

Comment by mort96 2 days ago

'zstd --version' reports: "** Zstandard CLI (64-bit) v1.5.7, by Yann Collet **". This is zstd installed through Homebrew on macOS 26 on an M1 Pro laptop. Also of interest, I was able to reproduce this with a random binary I had in /bin: https://floss.social/@mort/115940378643840495

I was completely unable to reproduce it on my Linux desktop though: https://floss.social/@mort/115940627269799738

Comment by terrelln 2 days ago

I've figured out the issue. Use `wc -c` instead of `du`.

I can repro on my Mac with these steps with either `zstd` or `gzip`:

    $ rm -f ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    1.2M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    2.0M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    
    $ rm -f ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    1.2M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    2.1M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz

When a file is overwritten, the on-disk size is bigger. I don't know why. But you must have ran zstd's benchmark twice, and every other compressor's benchmark once.

I'm a zstd developer, so I have a vested interest in accurate benchmarks, and finding & fixing issues :)

Comment by mort96 2 days ago

Interesting!

It doesn't seem to be only about overwriting, I can be in a directory without any .zst files and run the command to compress 55 files in parallel and it's still 45M according to 'du -h'. But you're right, 'wc -c' shows 38809999 bytes regardless of whether 'du -h' shows 45M after a parallel compression or 38M after a sequential compression.

My mental model of 'du' was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative which has the interface of 'du' but with byte-accurate file sizes...

Comment by terrelln 2 days ago

Yeah, it isn't quite that simple. E.g. `/bin/ksh` reports 1.4MB, but it is actually 2.4MB. Initially, I thought it was because the file was sparse, but there are only 493KB of zeros. So something else is going on. Perhaps some filesystem-level blocks are deduped from other files? Or APFS has transparent compression? I'm not sure.

It does still seem odd that APFS is reporting a significantly larger disk-size for these files. I'm not sure why that would ever be the case, unless there is something like deferred cleanup work.

Comment by mort96 2 days ago

Ross Burton on Mastodon suggests that it might be deduplication; when writing sequentially, later files can re-use blocks from earlier files, while that isn't the case as much when writing sequentially. That seems plausible enough to me.

Comment by mort96 2 days ago

I've concluded that this can't be the reason. It'd only result in an error where the size reported by 'du' is smaller than the apparent size (aka number of bytes reported by 'wc -c') of the file. What we see here is that the size reported by 'du' is almost twice as large as the number of bytes. That can't be the result of dedpulication.

I'll chalk it up to "some APFS weirdness".

Comment by Zekio 2 days ago

doesn't zstd cap out at compression level 19?

Comment by mort96 2 days ago

From the man page:

    --ultra: unlocks high compression levels 20+ (maximum 22), using a lot more memory.

Regardless, this reproduces with random other files and with '-9' as the compression level. I made a mastodon post about it here: https://floss.social/@mort/115940378643840495

Comment by gcr 2 days ago

If you're worried about double-compression of image data, you can uncompress all images by using qpdf:

    qpdf --stream-data=uncompress in.pdf out.pdf

The resulting file should compress better with zstd.

Comment by noname120 2 days ago

Why not use a more widespread compression algorithm (e.g. gzip) considering that Brotli barely performs better at all? Sounds like a pain for portability

Comment by mort96 2 days ago

I'm not sold on the idea of adding compression to PDF at all, I'm not convinced that the space savings are worth breaking compatibility with older readers. Especially when you consider that you can just compress it in transit with e.g HTTP's 'Content-Encoding' without any special PDF reader support. (You can even use 'Content-Encoding: br' for brotli!)

If you do wanna change PDF backwards-incompatibly, I don't think there's a significant advantage to choosing gzip to be honest, both brotli and zstd are pretty widely available these days and should be fairly easy to vendor. But yeah, it's a slight advantage I guess. Though I would expect that there are other PDF data sets where brotli has a larger advantage compared to gzip.

But what I really don't get is all the calls to use zstd instead of brotli and treating the choise to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)

Comment by ksec 1 day ago

>But what I really don't get is all the calls to use zstd instead of brotli and treating the choise to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)

I may dislike Google. But my support of JPEG XL and Zstd has nothing to do with competition tech being Google at all. I simply think JPEG XL and Zstd are better technology.

Comment by noname120 2 days ago

Could you add compression and decompression speeds to your table?

Comment by mort96 2 days ago

I just did some interactive shell loops and globs to compress everything and output CSV which I processed into an ASCII table, so I don't exactly have a pipeline I can modify and re-run the tests with compression speeds added ... but I can run some more interactive shell-glob-and-loop-based analysis to give you decompression speeds:

    ~/tmp/pdfbench $ hyperfine --warmup 2 \
    'for x in zst/*; do zstd -d >/dev/null <"$x"; done' \
    'for x in gz/*; do gzip -d >/dev/null <"$x"; done' \
    'for x in xz/*; do xz -d >/dev/null <"$x"; done' \
    'for x in br/*; do brotli -d >/dev/null <"$x"; done'
    Benchmark 1: for x in zst/*; do zstd -d >/dev/null <"$x"; done
      Time (mean ± σ):     164.6 ms ±   1.3 ms    [User: 83.6 ms, System: 72.4 ms]
      Range (min … max):   162.0 ms … 166.9 ms    17 runs
    
    Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
      Time (mean ± σ):     143.0 ms ±   1.0 ms    [User: 87.6 ms, System: 43.6 ms]
      Range (min … max):   141.4 ms … 145.6 ms    20 runs
    
    Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
      Time (mean ± σ):     981.7 ms ±   1.6 ms    [User: 891.5 ms, System: 93.0 ms]
      Range (min … max):   978.7 ms … 984.3 ms    10 runs
    
    Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
      Time (mean ± σ):     254.5 ms ±   2.5 ms    [User: 172.9 ms, System: 67.4 ms]
      Range (min … max):   252.3 ms … 260.5 ms    11 runs
    
    Summary
      for x in gz/*; do gzip -d >/dev/null <"$x"; done ran
        1.15 ± 0.01 times faster than for x in zst/*; do zstd -d >/dev/null <"$x"; done
        1.78 ± 0.02 times faster than for x in br/*; do brotli -d >/dev/null <"$x"; done
        6.87 ± 0.05 times faster than for x in xz/*; do xz -d >/dev/null <"$x"; done

As expected, xz is super slow. Gzip is fastest, zstd being somewhat slower, brotli slower again but still much faster than xz.

    +-------+-------+--------+-------+
    | gzip  | zstd  | brotli | xz    |
    +-------+-------+--------+-------+
    | 143ms | 165ms | 255ms  | 982ms |
    +-------+-------+--------+-------+

I honestly expected zstd to win here.

Comment by noname120 2 days ago

Thanks a lot. Interestingly Brotli’s author mentioned here that zstd is 2× faster at decompressing, which roughly matches your numbers:

https://news.ycombinator.com/item?id=46035817

I’m also really surprised that gzip performs better here. Is there some kind of hardware acceleration or the like?

Comment by terrelln 2 days ago

Zstd should not be slower than gzip to decompress here. Given that it has inflated the files to be bigger than the uncompressed data, it has to do more work to decompress. This seems like a bug, or somehow measuring the wrong thing, and not the expected behavior.

Comment by mort96 2 days ago

It seems like zstd is somehow compressing really badly when many zstd processes are run in parallel, but works as expected when run sequentially: https://news.ycombinator.com/item?id=46723158

Regardless, this does not make a significant difference. I ran hyperfine again against a 37M folder of .pdf.zst files, and the results are virtually identical for zstd and gzip:

    +-------+-------+--------+-------+
    | gzip  | zstd  | brotli | xz    |
    +-------+-------+--------+-------+
    | 142ms | 165ms | 269ms  | 994ms |
    +-------+-------+--------+-------+

Raw hyperfine output:

    ~/tmp/pdfbench $ du -h zst2 gz xz br
     37M    zst2
     38M    gz
     38M    xz
     37M    br
    
    ~/tmp/pdfbench $ hyperfine ...
    Benchmark 1: for x in zst2/*; do zstd -d >/dev/null <"$x"; done
      Time (mean ± σ):     164.5 ms ±   2.3 ms    [User: 83.5 ms, System: 72.3 ms]
      Range (min … max):   162.3 ms … 172.3 ms    17 runs
    
    Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
      Time (mean ± σ):     142.2 ms ±   0.9 ms    [User: 87.4 ms, System: 43.1 ms]
      Range (min … max):   140.8 ms … 143.9 ms    20 runs
    
    Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
      Time (mean ± σ):     993.9 ms ±   9.2 ms    [User: 896.7 ms, System: 99.1 ms]
      Range (min … max):   981.4 ms … 1007.2 ms    10 runs
    
    Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
      Time (mean ± σ):     269.1 ms ±   8.8 ms    [User: 176.6 ms, System: 75.8 ms]
      Range (min … max):   261.8 ms … 287.6 ms    10 runs

Comment by terrelln 2 days ago

Ah I understand. In this benchmark, Zstd's decompression time is 284 MB/s, and Gzip's is 330 MB/s. This benchmark is likely dominated by file IO for the faster decompressors.

On the incompressible files, I'd expect decompression of any algorithm to approach the speed of `memcpy()`. And would generally expect zstd's decompression speed to be faster. For example, on a x86 core running at 2GHz, Zstd is decompressing a file at 660 MB/s, and on my M1 at 1276 MB/s.

You could measure locally either using a specialized tool like lzbench [0], or for zstd by just running `zstd -b22 --ultra /path/to/file`, which will print the compression ratio, compression speed, and decompression speed.

[0] https://github.com/inikep/lzbench

Comment by 2 days ago

Comment by gcr 2 days ago

If we're making breaking changes to PDFs, I'd love if the committee added a modern image format like JPEG-XL. In my experience, most disk usage of PDFs comes from images, not streams.

I keep a bunch of comics in PDF but JPEG-XL is by far the best way to enjoy them in terms of disk space.

Comment by Bolwin 2 days ago

Odd you should say that, as that's exactly what they've been discussing

Comment by gcr 2 days ago

No it's not. This article is about proposing Brotli as another possible '/Filter' for stream objects, like content streams (page drawing commands). Images are streams too, but unless you mean compressing raw pixel bytes in Brotli, there's no mention of a JPEG-XL or WEBP filter.

Comment by NoahZuniga 2 days ago

well, not mentioned in this specific article. But JPEG-XL support is something they're working on [1].

[1]: https://pdfa.org/wp-content/uploads/2025/10/PDFDays2025-Brea...

Comment by gcr 1 day ago

Oh cool!! TIL

Comment by 2 days ago

Comment by nbevans 2 days ago

This is a really really bad idea. Don't break backwards compat. for 20% of gains. Internet connection speeds and storage capacities only go up. In a few years time, 20% of gains will seem crazy to have broken back-compat for.

Comment by whinvik 2 days ago

I am often frustrated by PDF issues such as how complicated it is to create one.

But reading the article I realized PDFs have become ubiquitous because of its insistence on backwards compatibility. Maybe for some things it's good to move this slow.

Comment by jhealy 2 days ago

The article is wrong, the PDF spec has introduced breaking changes plenty of times. It’s done slowly and conservatively though, particularly now that the format is an ISO spec.

The PDF format is versioned, and in the past new versions have introduced things like new types of encryption. It’s quite probable that a v1.7 compliant PDF won’t open on a reader app written when v1.3 was the latest standard.

Comment by cess11 2 days ago

'Your PDF:s will open slower because we decided that the CDN providers are more important than you'.

If size was important to users then it wouldn't be so common that systems providers crap out huge PDF files consisting mainly of layout junk 'sophistication' with rounded borders and whatnot.

The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.

Comment by noname120 2 days ago

Ridiculous statement. CDN providers can already use filesystem compression and standard HTTP Accept-Encoding compression for transfers (which includes brotli by the way). This ISO provides virtually no benefit to them

Comment by cess11 2 days ago

This reasoning comes from TFA.

Comment by h4x0rr 2 days ago

Wouldn't lzma2 be better here since a pdf is more read heavy?

Comment by F3nd0 2 days ago

Going by one of Brotli’s authors’ comment [1] on another post, it probably wouldn’t.

[1] https://news.ycombinator.com/item?id=46035817

Comment by avalys 2 days ago

This article is AI slop.

Comment by jeffbee 2 days ago

Yep.

Comment by delfinom 2 days ago

tl;dr Commerical entity is paying to have the ISO altered to "legalize" their SDK they are pushing which is incompatible with standard PDF readers.

ISO is pay to play so :shrug:

Comment by whizzx 2 days ago

No this feature is coming straight from the PDF association itself and we just added experimental support before it's officially in the spec to help testing between different sdk processors.

So your comment is a falsehood

Comment by lmz 2 days ago

It's not even clear that they were the ones suggesting inclusion. They're just saying their library now supports the new thing.

https://pdfa.org/brotli-compression-coming-to-pdf/

> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.

> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.

Comment by adrian_b 2 days ago

Yes, I do not see any source of financial gain that could motivate them for this, because both MuPDF and Ghostscript are free.

MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.

It is my default PDF and EPUB reader, except that in very rare cases I encounter PDF files which MuPDF cannot understand, when I use other PDF readers (e.g. Okular).

Comment by bhouston 2 days ago

I'm no fan of Adobe, but it is not that hard to add brotli support given that it is open. Probably can be added by AI without much difficulty - it is a simple feature. I think compared to the ton of other complex features PDF has, this is an easy one.

Comment by vgtftf 2 days ago

[flagged]