Floating-Point Printing and Parsing Can Be Simple and Fast

Posted by chmaynard 5 days ago

Comments

Comment by vitaut 4 days ago

The shortest double-to-string algorithm is basically Schubfach or, rather, its variation Tejú Jaguá, with digit output from Dragonbox. Schubfach is a beautiful algorithm: I implemented and wrote about it in https://vitaut.net/posts/2025/smallest-dtoa/. However, in terms of performance you can do much better nowadays. For example, https://github.com/vitaut/zmij does 1 instead of 2-3 costly 128x64-bit multiplications in the common case and has much more efficient digit output.
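To give a rough sense of why the 128x64-bit multiplications are the costly step: on a 64-bit machine each one breaks down into two 64x64->128-bit multiplies plus a carry. The Python sketch below models that with masked integers. The function names are hypothetical and are not taken from zmij, Schubfach, or Dragonbox.

```python
MASK64 = (1 << 64) - 1

def umul128(a: int, b: int):
    """64x64 -> 128-bit product, returned as (high, low) 64-bit halves."""
    p = a * b
    return p >> 64, p & MASK64

def umul192_upper128(x: int, pow10_hi: int, pow10_lo: int):
    """Upper 128 bits of the 192-bit product x * (pow10_hi:pow10_lo).

    x is a 64-bit significand; pow10_hi:pow10_lo is a 128-bit table entry.
    """
    hi_hi, hi_lo = umul128(x, pow10_hi)  # first 64x64 multiply
    lo_hi, _ = umul128(x, pow10_lo)      # second 64x64 multiply (low half discarded)
    s = hi_lo + lo_hi                    # may carry into the top word
    return (hi_hi + (s >> 64)) & MASK64, s & MASK64
```

Saving one such call per conversion therefore saves two hardware multiplies, which is consistent with the "1 instead of 2-3" remark above.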

Comment by jhallenworld 1 day ago

I have been using Walter Bright's libc code from Zortech-C for microcontrollers, where I care about code size more than anything else:

https://github.com/nklabs/libnklabs/blob/main/src/nkprintf_f...
https://github.com/nklabs/libnklabs/blob/main/src/nkstrtod.c
https://github.com/nklabs/libnklabs/blob/main/src/nkdectab.c

nkprintf_fp.c+nkdectab.c: 2494 bytes

schubfach.cc: 10K bytes. The code is small, but there is a giant table of numbers. Also, this is just dtoa, not a full printf formatter.

OTOH, the old code is not round-trip accurate.
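For reference, "round-trip accurate" here means that parsing the printed string gives back exactly the same double. A minimal Python check, using the standard `%` formatting and `float()` as stand-ins for printf and strtod (the helper name `round_trips` is made up for this sketch):

```python
def round_trips(x: float, fmt: str) -> bool:
    """True if formatting x with fmt and parsing the result gives x back."""
    return float(fmt % x) == x

x = 0.1 + 0.2  # the double nearest to 0.30000000000000004

fifteen_ok = round_trips(x, "%.15g")    # 15 significant digits lose information here
seventeen_ok = round_trips(x, "%.17g")  # 17 significant digits always suffice for a double
shortest_ok = float(repr(x)) == x       # repr() emits the shortest round-tripping string
```

Printing 17 digits always round-trips but is usually longer than necessary; shortest-output algorithms such as Schubfach aim for the fewest digits that still round-trip.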

Russ Cox should make a C version of his code.

Comment by nigeltao 1 day ago

> Russ Cox should make a C version of his code.

https://github.com/rsc/fpfmt/blob/main/bench/uscalec/ftoa.c

Comment by vitaut 23 hours ago

Note that it has the same table of powers of 10: https://github.com/rsc/fpfmt/blob/main/bench/uscalec/pow10.h

Comment by vitaut 23 hours ago

It is possible to compress the table using the technique from Dragonbox (https://github.com/fmtlib/fmt/blob/8b8fccdad40decf68687ec038...) at the cost of some perf. It's on my TODO list for zmij.
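A rough sketch of the general idea behind this kind of table compression, not the actual fmt or Dragonbox code: store a 128-bit entry only for every 27th power of 10 and rebuild the ones in between by multiplying with a small power of 5. The stride, the exponent range, and all names below are assumptions made for illustration, and only non-negative exponents are handled.

```python
STRIDE = 27    # assumption: 5**27 < 2**64, so the correction factor fits in one word
MAX_EXP = 324  # assumption: largest decimal exponent covered by this toy table

def pow10_sig(k: int) -> int:
    """Top 128 bits of 10**k for k >= 0, truncated (the 'full table' value)."""
    n = 10 ** k
    bits = n.bit_length()
    return n << (128 - bits) if bits <= 128 else n >> (bits - 128)

# Compressed table: one entry per STRIDE exponents instead of one per exponent.
COMPRESSED = [pow10_sig(k) for k in range(0, MAX_EXP + 1, STRIDE)]

def pow10_sig_compressed(k: int) -> int:
    """Rebuild the 128-bit significand of 10**k from the nearest stored entry."""
    base = COMPRESSED[k // STRIDE]
    offset = k % STRIDE
    # 10**k = 10**k0 * 5**offset * 2**offset; the power of two only shifts
    # the binary exponent, so the significand needs just the factor 5**offset.
    prod = base * 5 ** offset
    return prod >> (prod.bit_length() - 128)
```

In this sketch the table shrinks from 325 entries to 13, at the price of one extra multiplication per lookup. Because the stored entry is already truncated, the rebuilt value can come out up to two units low in the last place, so a real implementation would have to correct for or tolerate that error.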

Comment by magicalhippo 1 day ago

What about reasonably fast but smallest code, for running on a microcontroller? Anything significantly better in terms of compiled size (including lookups)?

Comment by vitaut 23 hours ago

If you compress the table (see my earlier comment) and use plain Schubfach, then you can get really small binary size and decent perf. IIRC, Dragonbox with the compressed table was ~30% slower, which is a reasonable price to pay and still faster than most algorithms, including Ryu.

Comment by adgjlsfhk1 23 hours ago

When 30% is only ~3-6 ns it definitely seems worthwhile.

Comment by vitaut 10 hours ago

Note that ~3-6 ns is on modern desktop CPUs, where an extra few kB matter less. On microcontrollers it will be larger in absolute terms, but I would expect the relative difference to also be moderate.

Comment by andrepd 1 day ago

I implemented Tejú Jaguá in Rust, based on the original C impl https://github.com/andrepd/teju-jagua-rs. Comparing to Zmij, I do wonder how much of the speedup comes from the core part of the algorithm (f*2^e -> f*10^e) vs. the printing part of the problem (f*10^e -> decimal string)! Benchmarks on my crate show a comparable amount of time spent on each of those parts.
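The split described above, between converting a binary float into a decimal significand and exponent and then printing that decimal, can be made concrete with a small Python sketch. Here `repr()` stands in for the real conversion algorithm (Schubfach, Tejú Jaguá, and so on), and all function names are hypothetical. It assumes a finite, positive input.

```python
import math
from decimal import Decimal

def binary_parts(x: float):
    """Return integers (f, e) with x == f * 2**e exactly."""
    num, den = x.as_integer_ratio()  # den is always a power of two
    return num, -(den.bit_length() - 1)

def shortest_decimal(x: float):
    """Stage 1 stand-in: (f10, e10) with the shortest round-tripping f10 * 10**e10.

    repr() does the real work; a dtoa algorithm would replace this function.
    """
    _, digits, e10 = Decimal(repr(x)).normalize().as_tuple()
    return int("".join(map(str, digits))), e10

def print_decimal(f10: int, e10: int) -> str:
    """Stage 2: turn (f10, e10) into text, here always in scientific notation."""
    s = str(f10)
    return f"{s[0]}.{s[1:] or '0'}e{e10 + len(s) - 1}"

f, e = binary_parts(0.1)                     # binary view of the input
text = print_decimal(*shortest_decimal(0.1)) # decimal view, then text
```

Timing the two stages separately, as the crate benchmarks mentioned above do, is what shows how the cost is divided between them.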

Comment by vitaut 23 hours ago

I don't have exact numbers but from measuring perf changes per commit it seemed that most improvements came from "printing" (e.g. switching to BCD and SIMD, branchless exponent output) and microoptimizations rather than algorithmic improvements.
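As an illustration of the kind of "printing" trick meant here, the sketch below converts an 8-digit number to one-digit-per-byte BCD using SWAR (SIMD within a register) arithmetic: the divisions by 100 and 10 are done on several lanes at once with multiply-and-shift. This is a well-known general technique written out in Python for clarity. It is not claimed to be what zmij does, and the function names are made up.

```python
def to_bcd8(n: int) -> int:
    """Unpacked BCD of n (0 <= n < 10**8): one digit per byte, first digit in byte 0."""
    # Two 32-bit lanes: upper four digits and lower four digits.
    a = (n // 10000) | ((n % 10000) << 32)
    # Per-lane division by 100 via multiply-shift (exact for lane values < 10000).
    q = ((a * 10486) >> 20) & 0x0000007F0000007F
    # Four 16-bit lanes, each holding a two-digit value.
    b = q | ((a - 100 * q) << 16)
    # Per-lane division by 10 via multiply-shift (exact for lane values < 100).
    q = ((b * 103) >> 10) & 0x000F000F000F000F
    # Eight byte lanes, each holding one decimal digit.
    return q | ((b - 10 * q) << 8)

def digits8(n: int) -> str:
    """ASCII text of n, zero-padded to 8 digits, produced without a per-digit loop."""
    ascii_bytes = to_bcd8(n) | 0x3030303030303030  # add '0' to every byte at once
    return ascii_bytes.to_bytes(8, "little").decode("ascii")
```

In C the same steps fit in a single 64-bit register, so eight digits come out of a handful of multiplies and shifts rather than eight divide-by-10 iterations.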

Comment by WiSaGaN 22 hours ago

Rust's `serde_json` recently switched to using a new library for float-to-string conversion: https://github.com/dtolnay/zmij.

Comment by vitaut 10 hours ago

I was impressed by how fast the Rust folks adopted this! Kudos to David Tolnay and others.