ULID: Universally Unique Lexicographically Sortable Identifier

Posted by der_gopher 8 days ago

Comments

Comment by rdtsc 12 hours ago

> It is worth noting that the newest proposed standard for unique identifiers, UUID v7, aims to address the sortability and database performance issues of older UUID versions by adopting a similar time-ordered structure to ULID.

Yeah, I would go with UUID v7 at this point given that it's part of the UUID RFC https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-vers...

Comment by codys 8 hours ago

Ya, UUID v7 has been standard for a few years now. Perhaps the author is not familiar with the terminology of RFCs and so is misinterpreting the terminology used by IETF: "Request for Comments" and "Proposed Standard" can sound like they're not complete to folks not familiar with the IETF's process. Even then though, I would think they'd notice all the software that has UUID v7 support.

Comment by sedatk 12 hours ago

Whenever ULID comes up, I need to remind that it has a sequential ID generation mode in its spec which is prone to conflicts on multi-threads, processes or hosts which kills the purpose of a "universal" identifier. If you need a sequential ID, just use an integer, preferably one that's autoincremented by the database.

It's best to stick to UUIDv7 because of such quirks of ULID.

Comment by cpburns2009 11 hours ago

> I need to remind that it has a sequential ID generation mode in its spec which is prone to conflicts on multi-threads, processes or hosts which kills the purpose of a "universal" identifier.

Can you expand on how this can actually cause a problem? My understanding is different processes and hosts should never conflict because of the 80 bits of random data. The only way I can conceive of a conflict is multiple threads using the same non-thread-safe generator during the same millisecond.

Comment by sedatk 11 hours ago

You're right, not hosts or processes in that case. I forgot about random part as it's been a while since I looked at it. However, a single instance of a ULID generator must support this mode, which means that on multi-threaded architectures, it must lock the sequence as it still uses a single random value. That again, kills the purpose of a client-side, lock-free generation of universal identifiers as you said.

Comment by 0x457 10 hours ago

You only need to lock sequence if you care about IDs being ordered within a millisecond. That generally only matters when you create a batch of IDs at once, in that case you don't need to lock anything: generate ULID, keep incrementing sequence in that batch either by doing on the same thread, or by moving it from thread to thread. Kinda like creating an iterator and zip'ing it with iterator of thing you need IDs for.

I've switched to using UUIDv7 tho. It made sense to use ULID before v7, but now ULID only has one thing going on - smaller string representation. That doesn't matter if your storage can store UUIDs natively (i.e. as 128 bit integer)

If your goal is to have global order intact, then neither ULID nor UUIDv7 is going to work for you.

Comment by sedatk 9 hours ago

> You only need to lock sequence if you care about IDs being ordered within a millisecond

Yes, and that's when sequences are only used. I guess that's to avoid hogging the CPU or emptying the OS entropy pool during high loads.

However, that "optimization" is a failure mode if you're not aware how ULID internals work. It's easy to shoot yourself in the foot by blindly trusting ULID will always generate a unique ID across threads without blocking your thread. That's a sneaky footgun.

> That generally only matters when you create a batch of IDs at once

No, any web service instance can receive requests at arbitrary times, and sometimes in the same millisecond zone. The probability is proportional to the number of concurrent users and requests.

> If your goal is to have global order intact, then neither ULID nor UUIDv7 is going to work for you.

Agreed.

Comment by jasonwatkinspdx 9 hours ago

> or emptying the OS entropy pool during high loads.

Just a heads up that's not really a thing. If the CSPRNG is initialized correctly you're done. There's nothing being depleted. I know for ages the linux docs said different, they were just wrong and a maintainer was keeping a weird little fiefdom over it.

Comment by sedatk 8 hours ago

Thanks for the heads up, then it’s one less reason for ULID to adopt this weird behavior.

Comment by vbezhenar 10 hours ago

I hope that's not literally incrementing a sequence. Because it would lead to trivial neighbor ID guessing attacks.

I've implemented this thing, though not called it ULID. I've dedicated some bits for timestamp, some bits for counter within millisecond and rest for randomness. So they always ordered and always unpredictable.

Another approach is to keep latest generated UUID and if new UUID requested within the same timestamp - generate random part until it's greater than previous one. I think that's pretty good approach as well.

Comment by sedatk 9 hours ago

> I hope that's not literally incrementing a sequence

It's literally incrementing it by one:

https://github.com/ulid/javascript/blob/11c2067821ee19e4dc78...

Comment by vbezhenar 4 hours ago

Well, that makes little sense for me, you can just use numeric identifier instead. Bulk inserts which generate identifiers in bulk are commonly used.

But that's easy to fix, so just implementation quirk for this particular library, the idea is sound.

Comment by sedatk 2 hours ago

> But that's easy to fix, so just implementation quirk for this particular library, the idea is sound.

It's in ULID spec.

Comment by jasonwatkinspdx 9 hours ago

> I hope that's not literally incrementing a sequence. Because it would lead to trivial neighbor ID guessing attacks.

It is and it does.

Also the ULID spec suggests you use a CSPRNG, but doesn't mandate that or provide specific advice on appropriate algorithms. So in practice people may reach for whatever hash function is convenient in their project, which may just be FNV or similar with considerably weaker randomness too.

Comment by cpburns2009 10 hours ago

If you really need lock-free generation, you can use an alternate generator that uses new random bits for every submillisecond id. That's what the `ulid-py` library for Python does by default instead of incrementing the random bits.

Comment by sedatk 9 hours ago

Yes, the problem is that this mode is supported and required per the spec. So, a developer must know the pros/cons of this mode. It requires them to correctly assess the consequences. It's quite easy to shoot themsleves in the foot especially when a solid alternative like UUIDv7 exists.

Comment by unscaled 8 hours ago

The monotonic behavior is not the default, but I would also be happier if it was removed from the spec or at least marked with all the appropriate warning signs on all the libraries implementing it.

But I don't think UUIDv7 solves the issue by "having less quirks". Just like you'd have to be careful to use the non-monotonic version of ULID, you'd have to be careful to use the right version of UUID. You also have to hope that all of your UUID consumers (which would almost invariably try to parse or validate the UUID, even if they do nothing with it) support UUIDv7 or don't throw on an unknown version.

Comment by sedatk 7 hours ago

UUIDv7 is the closest to ULID as both are timestamp based, and UUIDv7 has fewer quirks than ULID, no question about it.

I agree that picking UUID variant requires caution, but when someone has already picked ULID, UUIDv7 is easily a superior alternative.

Comment by skeledrew 7 hours ago

Actually dived into this a bit just a couple days ago. It's very near impossibly for there to be a conflict since the timestamp resolves at the microsecond level, and if it's among threads, then there's a global state that, if somehow it should be hit 2+ times in the same microsecond, ensures detection and the random portion is incremented.

Comment by marifjeren 10 hours ago

> If you need a sequential ID, just use an integer

Are monotonic/sequential ULIDs as easily enumerated as integers? It's the ease of enumerability that keeps a lot of folks away from using sequential integers as IDs

Comment by sedatk 9 hours ago

You mean someone who wants to attack your system might be discouraged by Base32 encoding?

Comment by marifjeren 9 hours ago

Sorry, I'm not familiar with the ULID spec. You seem to be, hence my asking. Are you saying monotonic/sequential ULIDs are just (or just as easily enumerated as) Base32-encoded integers?

Oh and yeah, I guess I do think lots of script / AI kiddies would be discouraged by, or fail to see an opportunity when presented with, something that does not look like the numbers they saw in school.

Comment by N_Lens 9 hours ago

ULID's initial segment is timestamp generated, with a random suffix at the end. This kind of collision you're concerned about is not an issue at all, across multi-threads, processes or hosts.

Comment by sedatk 9 hours ago

Not if the same ULID generator instance is used across threads.

Comment by N_Lens 9 hours ago

That depends on if the specific implementation of the generator instance is thread safe. Highly implausible to use the same generator instance between different threads/processes/hosts because there's no benefit at all and only additional downsides.

Comment by sedatk 8 hours ago

To implement a thread-safe sequential increment, you need locking. When you use locking, then it becomes a “non-universal” ID generator with arbitrary performance impact.

Either it’s collision-prone or locking. Both are problematic in their own way.

It’s footguns all over while UUIDv7 simply exists.

Comment by N_Lens 2 hours ago

There is practically no need to have a thread-safe ULID generator that would be shared across threads/processes/hosts - a non-scenario that I cannot envision occurring in practice.

Comment by listenallyall 11 hours ago

Under what circumstances is it prone to conflicts? On separate threads/hosts/processes, id's created within the same millisecond would be differentiated by the 80 bits of randomness (more than UUID v7).

Comment by jasonwatkinspdx 10 hours ago

No, ULID has a "monotonic" feature, where if it detects the same millisecond timestamp in back to back calls, it just increments the 80 bit "random" portion. This means it has convoying behavior. If two machines are generating ids independently and happen to choose initial random positions near each other, the probability of collision is much higher than the basic birthday bound.

I think this "sort of monotonic but not really" is the worst of both to be honest. It tempts you to treat it like an invariant when it isn't.

If you want monotonicity with independent generation, just use a composite key that's a lamport clock and a random nonce. Or if you want to be even more snazzy use Hybrid Logical Clocks or similar.

Comment by Dylan16807 9 hours ago

> If two machines are generating ids independently and happen to choose initial random positions near each other, the probability of collision is much higher than the basic birthday bound.

But the chance of the initial random positions being near each other is very very low.

If you pick a billion random numbers in an 80 bit space, the chance you have a collision is one in a million. (2^80 / (2^30)^2)

If you pick a thousand random starting points and generate a million sequential numbers each, the chance your starting points are sufficiently close to each other to cause an overlap is one in a trillion. ((2^80 / 2^20) / (2^10)^2)

In that one in a trillion case, you'll likely end up with half a million collisions, which might matter to you. But if you care about 0 collisions versus 1+ collisions, pick the monotonic version.

Comment by jasonwatkinspdx 9 hours ago

Right, but the point is there's no reason to accept this limitation. Likewise why hardcode millisecond scale timestamps in a world where billions of inserts per second are practical on a single server?

Or if what you want is monotonic distributed timestamps, again, HLC is how you do that properly.

So you're embracing this weird limitations for no real benefit.

And as you can see in the rest of this comment thread, a lot of people simply do not even know this behavior and are assuming the 80 bit portion is always random. Which is my whole point about having a not really an invariant invariant just being a bad way to go fundamentally.

Edit: just to reply to the below since I can't do so directly, I understand the arithmetic here. What I'm saying is there's zero reason to choose this weird scheme vs something that's just the full birthday bound and you never think about it again.

As another comment points out: just consider neighbor guessing attacks. This 80 bit random but not random field is a foot gun to anyone that just assumes they can treat it as truly random.

Comment by Dylan16807 8 hours ago

> why hardcode millisecond scale timestamps in a world where billions of inserts per second are practical on a single server?

> Or if what you want is monotonic distributed timestamps, again, HLC is how you do that properly.

Why not just 64 bit timestamps stapled to a random number? You can be collision proof and monotonic without doing anything fancy.

> this weird scheme vs something that's just the full birthday bound and you never think about it again

But the weird scheme gives you better odds than the birthday bound.

Comment by listenallyall 9 hours ago

It's not a "limitation", he's saying there is much, much, much less chance of having any collisions with ULIDs - one in a million vs one in a trillion

Comment by listenallyall 9 hours ago

I dont think this holds up. Define "near each other" in an 80-bit random space. Further, the likelihood of a potential conflict is offset by the fact that you have far fewer "initial random positions" instead of every single element defining its own random position. And the extra random bits (over UUID7) reduce conflict possibilities by orders of magnitude.

I concede I'm no mathematician and I could be wrong here, but your analysis feels similar to assuming 10-11-12-13-14-15 is less likely to be a winning lottery ticket because the odds against consecutive numbers are so massive.

Comment by jasonwatkinspdx 9 hours ago

No, the calculation is straightforward and I'm not making the fallacious assumption you say there at the end about a magical lottery ticket number.

My basic point is the probability of collision is lower than the birthday bound, there's no need for this, and as comments in this thread make clear people are not understanding this limitation even exists with the specification.

Comment by listenallyall 8 hours ago

> the calculation is straightforward

Ok then, make it easy - your requirement is to independently pick 4 numbers from the range 0 to 9, without resulting in any duplicates. Which is more likely to be successful:

- pick 4 random digits independently

- pick a random digit, which will be appended by the next digit as pick #2 (i.e. if you pick 5, then 6 will automatically be your second digit, if you pick 9, 0 will be your second digit). Then pick once more on the same terms.

The math here is easy: scenario 1 you have 0.9 x 0.8 x 0.7 = 0.504 likelihood of success. Scenario 2 it's simply 0.7.

Comment by sedatk 11 hours ago

See my sibling comment.

Comment by nighthawk454 13 hours ago

Mentioned in the article's comments:

> Why not use UUID7?

> "ULID is much older than UUID v7 though and looks nicer"

For those unfamiliar, UUIDv7 has pretty much the same properties – sortable, has timestamp, etc.

ULID: 01ARZ3NDEKTSV4RRFFQ69G5FAV

UUIDv7: 019b04ff-09e3-7abe-907f-d67ef9384f4f

Comment by wood_spirit 12 hours ago

UUID 7 is so much easier than the ULID in the article manipulate. Pretty much every language and database has the string manipulation and from_hex functions to extract the timestamps without any special support function. Whereas a format that is too clever is way more complicated to work with.

Comment by nvader 12 hours ago

UUIDv7 looks better in the eye of this beholder.

Comment by ChymeraXYZ 12 hours ago

I know it may sound stupid but in my latest project I chose ULIDs because I can easily select them as one word, instead of various implementations of browsers, terminals, DB guis, etc each have their own opinion how to select and copy the whole UUID. So from that point of view ULIDs "look" better for me as they are more ergonomic when I actually have to deal with them manually.

Comment by unscaled 8 hours ago

I don't think it's stupid and this is one of the reason I prefer ULIDs or something like it. These IDs are very important for diagnostics, and making them easily selectable is a good goal in my book.

Comment by andy_ppp 12 hours ago

It’s also quite common to base62 the UUID value so in this case “31prI2bsccbXJB7cvbtV9”

Comment by elias1233 10 hours ago

I have always been a bit hesitant to use UUIDs with timestamps as it can be a security issue if the IDs are public. For example getting the age of a user account just from the id. I will say, however, that I have not heard of any major incidents stemming from this.

Comment by ivan_gammel 11 minutes ago

I don’t think it’s generally a good idea to store important domain information in synthetic keys. UUIDs should always be treated as opaque keys, but they may have structure that will help them fulfilling their primary function. Timestamp in UUID v7 may be close to moment of record creation, but it shouldn’t be the contract in your system that it is the creation timestamp.

Comment by verandaguy 7 hours ago

The classic solution to this is to have an internal ID (UUIDv7 if you want to use UUID, nice for indexing in newer databases) and an external ID (UUIDv4 or similar) which doesn't leak information to the outside world (but which otherwise doesn't offer any benefits at the storage level).

Comment by wafflestomp 5 hours ago

BTW, this doesn’t work well in S3 due to the timestamp being to the left of the randomness.

Comment by sblom 13 hours ago

I love the aesthetics. The cryptographic strength tradeoffs (against UUIDv7) seem rough for a lot of applications, though.

Comment by jalk 12 hours ago

Not sure what you mean by cryptographic strength - they are both Unique ID generators, not meant for anything related to cryptography.

UUIDv7 has 62 bits of random data, ULID uses 80 bits, so if anything ULID is "stronger" (meaning less chances of generating the same id within the same millisecond)

Comment by 0x457 10 hours ago

UUIDv7 has 74 bits of randomness, not 62, you forgot rand_a portion, so the difference is just 6 bits and only matters within the same millisecond.

Comment by 11 hours ago

Comment by swyx 7 hours ago

i keep a list of UUID info: https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...

for those also learning

Comment by catlover76 8 days ago

[dead]

Comment by pklausler 11 hours ago

Universally distinct would seem more correct than unique, at least mathematically speaking, unless there's only ever going to be just one of them.

Comment by N_Lens 9 hours ago

Interesting article noting how ULIDs solve database index fragmentation caused when using UUIDv4. However, for extremely high-volume writes, ULIDs create "hot spots" at the current timestamp index location, potentially causing contention. The article notes that UUID v7 (newly standardized) adopts the same time-ordered approach, validating ULID's design.