Losing 1½ Million Lines of Go

Posted by moks 5 days ago

Comments

Comment by mroche 15 hours ago

> Unfortunately, Go’s library doesn’t get updated every time Unicode does. As of now, January 2026, it’s still stuck at Unicode 15.0.0, which dates to September 2023; the latest version is 17.0.0, last September. Which means there are plenty of Unicode characters Go doesn’t know about, and I didn’t want Quamina to settle for that.

I have to say I am surprised about that. Does anyone have any context or guesses as to why this is the case?

EDIT: Go's unicode was actually updated to v17 yesterday:

https://github.com/golang/go/commit/dd39dfb534d2badf1bb2d72d...

Comment by fsmv 14 hours ago

There was a short thread about this on Mastodon involving Rob Pike the other day: https://hachyderm.io/@robpike/115896334649905170

Comment by matt3210 15 hours ago

Based on the commit message, which uses "CL" (Google lingo for a Change List on their internal system), I bet this was already available in the internal version and was only ported to the GitHub version after someone pointed it out.

Comment by neild 15 hours ago

Much more prosaic (if slightly embarrassing), I'm afraid: The update was non-trivial (this CL is simple, but there are some accompanying ones in x/text which are not) and it didn't hit the top of the priority list for anyone who understands x/text.

Go is pretty much entirely developed in public; there are some Google-internal customizations but none of them are particularly exciting and almost all changes start in the open source repo and are imported from there.

Comment by LukeShu 14 hours ago

"CL"/"Change List" is the lingo for the Gerrit code review tool, which is how all contributions to Go happen. Creating a GitHub PR simply triggers a bot to create a Gerrit CL, which is where all discussion about the "PR" happens and where the "accept" button gets clicked.

Comment by 8n4vidtmkvmk 11 hours ago

Is Gerrit the same as Critique?

Comment by tonfa 10 hours ago

It's a descendant of Critique's predecessor (Mondrian):

https://www.gerritcodereview.com/about.html

Comment by watchful_moose 15 hours ago

Hard to get promoted at Google doing that

Comment by Someone 12 hours ago

> Sure, these automata are “wide”, with lots of branches, but they’re also shallow, since they run on UTF-8 encoded characters whose maximum length is four and average length is much less

I would consider splitting this task into two:

- extracting the next Unicode code point

- determining whether it’s in the code class

For the second, instead of using an automaton, one could use a perfect hash (https://en.wikipedia.org/wiki/Perfect_hash_function). That could make that part branch-free.

Is that a good idea?
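To make the question concrete, here is a minimal sketch of the second step as a perfect-hash membership test. Everything in it is an assumption for illustration: the `perfectSet` type, the `mix` hash, and the brute-force seed search in `build` are hypothetical and not taken from Quamina or the Go standard library. It also assumes the keys are distinct and the set is small; a class with tens of thousands of members would need a smarter construction.

```go
package main

import "fmt"

// perfectSet is a hypothetical collision-free hash table for a fixed set of
// code points. Each member occupies its own slot; -1 marks an empty slot.
type perfectSet struct {
	seed  uint32
	mask  uint32
	slots []rune
}

// mix is an illustrative integer hash; any well-mixing function would do.
func mix(r rune, seed uint32) uint32 {
	h := (uint32(r) ^ seed) * 0x9E3779B1
	return h ^ (h >> 15)
}

// build searches for a seed that places every key in a distinct slot,
// doubling the table when no seed works. Keys must be distinct.
func build(keys []rune) perfectSet {
	size := uint32(1)
	for int(size) < 2*len(keys) {
		size <<= 1
	}
	for ; ; size <<= 1 {
		for seed := uint32(1); seed < 1<<16; seed++ {
			slots := make([]rune, size)
			for i := range slots {
				slots[i] = -1
			}
			ok := true
			for _, k := range keys {
				i := mix(k, seed) & (size - 1)
				if slots[i] != -1 {
					ok = false
					break
				}
				slots[i] = k
			}
			if ok {
				return perfectSet{seed: seed, mask: size - 1, slots: slots}
			}
		}
	}
}

// contains does one hash, one index, and one comparison, with no loop.
// Callers are expected to pass valid (non-negative) code points.
func (p perfectSet) contains(r rune) bool {
	return p.slots[mix(r, p.seed)&p.mask] == r
}

func main() {
	p := build([]rune{'a', 'é', '中', '😀', '0'})
	fmt.Println(p.contains('é'), p.contains('x'))
}
```

The lookup itself has no data-dependent loop, but the table still has to store the keys so that non-members can be rejected, which is where the memory cost shows up for large classes.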

Comment by norir 8 hours ago

A precomputed lookup table would be about 1MB covering all of the code points. The lookup code would first compute the code point (and could also do validation) and then directly look up the class in the table. The lookup table would not need to be embedded directly in Go code and could just be stored in a binary file. But I'd imagine it could also be put in an array literal in its own file, one that would never be opened by an IDE, if the program needs to be distributed as a single binary.
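A rough sketch of that table approach, under some loud assumptions: the class constants, `buildTable`, and `classOf` are names made up for this example, and the table is filled from the standard `unicode` package only to keep the sketch self-contained. A real table would presumably be generated offline from the Unicode data files, since filling it from `unicode` inherits whatever Unicode version Go ships.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Hypothetical class codes, one byte per code point.
const (
	classOther uint8 = iota
	classLetter
	classDigit
)

// classTable has one entry for every code point from 0 to U+10FFFF,
// which is 0x110000 bytes, a little over 1 MB.
var classTable [unicode.MaxRune + 1]uint8

// buildTable fills the table at startup. This stands in for loading a
// precomputed binary file or a generated array literal.
func buildTable() {
	for r := rune(0); r <= unicode.MaxRune; r++ {
		switch {
		case unicode.IsLetter(r):
			classTable[r] = classLetter
		case unicode.IsDigit(r):
			classTable[r] = classDigit
		}
	}
}

// classOf decodes the first code point of s, rejects invalid UTF-8,
// and returns the class together with the number of bytes consumed.
func classOf(s string) (uint8, int, bool) {
	r, size := utf8.DecodeRuneInString(s)
	if r == utf8.RuneError && size <= 1 {
		return classOther, size, false
	}
	return classTable[r], size, true
}

func main() {
	buildTable()
	class, size, ok := classOf("é!")
	fmt.Println(class == classLetter, size, ok)
}
```

After decoding, the class check is a single array index, at the cost of keeping the whole table resident in memory.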