The experience of rendering Arabic typography and its technical debt
Posted by bookofjoe 4 days ago
Comments
Comment by samat 3 days ago
> I have watched senior engineers, fluent in both Arabic and English, give up on writing a long email in Outlook on a Wednesday afternoon because the cursor would not behave, and switch to Arabic-only or English-only because the cognitive cost of fighting the editor exceeded the cost of monolingual phrasing. Actually I remember very well suffering this while using Facebook for the first time in my life, and I could not register; I was such a slow typist that when I reached the moment where the cursor did this weird thing, I would just stare at it and never progress.
> This is the ordinary experience of writing mixed Arabic-English text in 2026, in every major editor, email client, and chat application I know of. The pettier cousins are everywhere too, and I collect them: a range like 10–20 silently reading as twenty-to-ten, because digits are weak and the dash is neutral; a trailing exclamation mark teleporting to the far end of the line; a password, toggled visible, displaying in an order that does not match what was typed. None of these are anyone's bug, exactly.
My own Cyrillic struggles are nothing in comparison.
Comment by dhosek 3 days ago
In college [2], when I wanted to quote some texts from Exodus in Hebrew in a paper that I wrote, I ended up avoiding the issue by hand-reversing the letter order and manually breaking lines. 8 bits is insufficient to cover all the possible combinations of letters and vowel markings so the font didn’t include any vowel markings and only did dageshim for בּ and פּ if I recall correctly.
⸻
1. As an aside, it would have been really nice if Unicode provided a R-L mirrored Latin alphabet to make it easier for monolingual developers to grasp the complexities surrounding mixed directional typesetting. I suppose it could still be added, although Unicode tends towards conservatism on adding additional characters.
2. This was 1990, well before Unicode in the era of a hundred or so 8-bit character encodings, most of which were not implemented widely. I also had to type the text using the arbitrary ASCII-Hebrew mapping of the font I was using which, among other things, led me to discover that letter frequency in Hebrew is much more uniform than it is in English.
Comment by teddyh 3 days ago
Comment by kstrauser 3 days ago
Comment by gus_massa 3 days ago
https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf ¹
There is no space between pdf and ¹, so the HN server assumes incorrectly that the ¹ is part of the link.
Comment by mschuster91 3 days ago
Comment by qingcharles 3 days ago
I already had a good understanding of CJK scripts, and you'll come across RtL there, with things like tategaki which is both vertical and RtL at the same time (and can include quotes in other languages such as English and Arabic). Here's some lyrics I made in that format for reference:
https://codepen.io/kingcharlesone/pen/GgRXLoM
What peculiarities does Cyrillic text have? I've never learned to convert Cyrillic to Latin.
Comment by raphlinus 3 days ago
Cyrillic for Russian is reasonably straightforward, but it's also used for many other languages. The variation in style is particularly notable for Bulgarian[1]. A sophisticated font might have a "locl" (localized forms) feature with locale-specific adjustments, but this is not universal yet; for example, the issue to add it to Open Sans is still open[2]. To see the differences, try [3] and use the Language dropdown to select Bulgarian.
[1]: https://en.wikipedia.org/wiki/Bulgarian_alphabet
[2]: https://github.com/googlefonts/opensans/issues/114
[3]: https://localfonts.eu/freefonts/traditional-cyrillic-free-fo...
Comment by qingcharles 2 days ago
I'm building out a multi-lingual wiki and just about to start adding Cyrillic support, and this really helps me understand that there is more research I need to do to support other Cyrillic languages.
Comment by raphlinus 2 days ago
Best of luck and feel free to reach out if you have more specific questions.
Comment by qingcharles 2 days ago
https://github.com/googlefonts/literata
I'm already strict with marking the html pages with the correct lang and culture tags etc, but I'll be double-checking everything now. Thank you again for educating me :)
Comment by cyberrock 3 days ago
https://example.com/[Arabic or Hebrew start of sentence]
wrapped_url_part
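A rough way to flag URLs like this before displaying them, as a sketch only: it uses Python's standard `unicodedata` module to look for right-to-left letters mixed with left-to-right ones, plus explicit bidi control characters. The function name `bidi_risk` and the reason strings are made up for illustration, and this is a heuristic, not a complete defense.

```python
import unicodedata

# Bidi classes of strong right-to-left letters (Hebrew is "R", Arabic is "AL").
RTL_CLASSES = {"R", "AL"}

# Explicit directional controls that can reorder the displayed text.
BIDI_CONTROLS = set(
    "\u200E\u200F\u061C"                # LRM, RLM, ALM
    "\u202A\u202B\u202C\u202D\u202E"    # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"          # LRI, RLI, FSI, PDI
)


def bidi_risk(url: str) -> list[str]:
    """Return human-readable reasons why a URL may display in a misleading order."""
    reasons = []
    if any(ch in BIDI_CONTROLS for ch in url):
        reasons.append("bidi control character")
    classes = {unicodedata.bidirectional(ch) for ch in url}
    if classes & RTL_CLASSES and "L" in classes:
        reasons.append("mixed LTR and RTL letters")
    return reasons
```

A UI could show such URLs in escaped form (percent-encoded or punycode) instead of rendering them, but that is a product decision outside this sketch.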
It seems like there are probably some phishing attacks based on this.
Comment by khoirul 3 days ago
Comment by evmar 4 days ago
Think of variable-width characters and kerning and ligatures and hyphenation and justification. Imagine computers had been won by a CJK language, which has none of these problems. You could imagine a similar article about how exotic and difficult English layout is.
Comment by retrac 3 days ago
When carved in stone the lines are much straighter. When written with brush or pen they became semi-cursive. When printing was introduced, they became grid-like and regular.
What westerners who are passingly familiar would think of as the standard Chinese typeface - the strict square grid with straight-line characters - arises in part from printing technology. Easy to carve that into wood blocks, and easy to line up the slots into a grid.
Latin was similarly morphed to fit into the realities of printing in the 1500s. And is still being morphed. Notice how numbers 123... are in-line and at the same height as the letters. That's a very modern convention, typewriter and computer influence on our orthography. Traditionally digits were more likely to appear as subscript, off-centre.
Comment by nagaiaida 3 days ago
(aha i have found the answer to my own question: miniaturization for fractions in phototypesetting)
Comment by dhosek 3 days ago
Comment by adgjlsfhk1 3 days ago
Comment by dhosek 3 days ago
Comment by somat 3 days ago
I am not familiar with the history of Arabic typography, but I sort of assume there was an archaic block form and the current joined form is the result of many centuries of encoding handwriting practice, advanced enough that falling back to a block form is impossible, with the side effect of making simple mechanical text formatting also impossible.
As for Chinese-derived characters, we are currently able to jam them awkwardly into our alphabet-optimized structures (one code per character), but I wonder if a Chinese-native encoding would look different. Would it make sense to try to represent the sub-characters present in each Chinese character in the encoding? I suspect not. Chinese works, but it also does not appear amenable to simple mechanical assistance.
Comment by mook 3 days ago
As a reference, I don't believe any of the pre-Unicode CJK&c encodings attempted that.
Comment by slibhb 3 days ago
Hebrew is a closely related Semitic language that simply adopted a block and cursive form. It has also been greatly simplified and is friendlier towards loanwords, which has made it far easier to learn.
Comment by hackpelican 3 days ago
Weird to say Arabic hasn’t innovated or evolved considering the wild variety of dialects spoken in the modern world.
Conflating the language with the script is also bizarre. In terms of adapting Arabic to technology, look into romanized Arabic which was used before Unicode was common.
Comment by slibhb 3 days ago
> Weird to say Arabic hasn’t innovated or evolved considering the wild variety of dialects spoken in the modern world.
I didn't say Arabic has not innovated or evolved; only that it "has lagged behind other languages in terms of innovation". My belief is that that is due to linguistic conservatism, and linked to Islamism (or, at minimum, the centrality of Islam in Arab culture). Also related to this is the existence of Fusha, its place in Arab culture, and its branding as "modern standard Arabic".
I didn't conflate anything. While a script and a language are not the same, it's not a coincidence that Arabic is often written today in a script that is very close to Quranic script. And -- to really kick the hornet's nest -- it's also not a coincidence that there have been so few outstanding Arab writers (in Arabic) in the past 100 years. One novelist and a couple poets.
Comment by mschuster91 3 days ago
Now, reading that point one might ask the question if writing has been properly funded, or if the priority of cultural funding in the Arab world has been lower than, say, the funding of architecture and other forms of art. And on top of that, I'd also have a serious look at the market size, especially when compared with English-language writing.
Comment by ibn-ashraf 3 days ago
Firstly, the Qur'an wasn't written by the Prophet, he would dictate it and it would be written by his scribes.
Secondly, it's hard to argue that Islam has had a negative effect on Arabic or caused it to lag behind. In fact, it's easy to argue for the opposite. It's a historical fact that the Arabic language developed and proliferated rapidly due to the rise and spread of Islam. This is when its script and grammar were standardized, and when more and more works started being composed. And shortly thereafter the Islamic Golden Age began.
I don't have any issue with Hebrew, and maybe it is easier to learn. But this is because it was a dead language which was revived, resulting in a simplified language. Almost every other major language on Earth will have the same amount of "innovation" as Arabic. In fact, Arabic has many colloquial dialects which are used in day to day conversations, and these do consist of a simplified version with many loanwords. So I really don't know what you mean by a lack of innovation.
Comment by simonask 3 days ago
But if you compare it with basically any other major language, it’s clearly much, much more conservative. If you are a native English speaker, understanding English from 1,000 years ago is like learning a completely different language. If you are a native speaker of Italian, you cannot understand a text in Latin without significant training. This is true for all European languages other than Icelandic.
Chinese is pretty similar, even though the written language is slightly more stable.
So in comparison, Arabic is incredibly conservative.
Comment by decimalenough 3 days ago
https://en.wikipedia.org/wiki/Varieties_of_Arabic
A rough equivalent in both time and space is how the Vatican continues to use Latin, but the rest of the Roman Empire has splintered into Italian, French, Spanish, Romanian, etc.
Comment by slibhb 3 days ago
They speak it on TV and it's written in newspapers. They learn it in schools. Educated Arabs code switch into Fusha all the time. Islamist leaders (e.g. Nasrallah) speak Fusha in their broadcast speeches.
It's also pretty hard for foreigners to learn an ammiyya (outside of immersion). "Studying Arabic" almost always means Fusha.
I agree with you that "the actual Arabics are the 20-odd spoken languages". In a healthier culture, Fusha wouldn't exist or would have the same cultural place as Latin in the Western world.
Comment by dhosek 3 days ago
Comment by nwhnwh 3 days ago
Comment by aaa_aaa 3 days ago
Comment by khaled 3 days ago
Comment by wodenokoto 3 days ago
Comment by qingcharles 3 days ago
https://codepen.io/kingcharlesone/pen/GgRXLoM
Japanese magazines usually mix three different script types on a majority of the pages like this:
(In another quirk some Japanese mags open right-bound, others open left-bound)
Comment by jrdres 2 days ago
I also read that a few Chinese texts only make sense in vertical order: one had a pun where the characters read one way as separated characters, but as stacked was also a single character pun for something like a "crumbly cookie".
Comment by yorwba 4 days ago
Comment by mackeye 4 days ago
Comment by jansan 4 days ago
If you want a solution for this it has to happen in the rendering step, not the shaping (which is HarfBuzz's main task). The shaper has no information about the available space, but when rendering you could stretch individual glyphs to the desired width, similar to adjusting the width of whitespace in Latin, but more complex, because you actually have to modify the glyphs with a scale transform. I am not an expert on Arabic script by any means, but this should be possible IMO. It would at least be an interesting experiment. Of course the JSTF table would be the right way to do it, but there seems to be a lot of confusion around it. Maybe in the age of LLMs we can give it another shot.
Comment by amluto 4 days ago
As a practical matter, there’s an input length n and there is some upper bound B on a credible line length as measured in code points, so there are only at most n*B credible proposed lines to evaluate, which also limits the useful look back on the table to B positions, so I think the time complexity could be reduced to O(n*B^2) without making the results worse on reasonable inputs, and this is probably quite tolerable.
[0] Straightforward once you’ve implemented the whole Arabic rendering stack, anyway. I am certainly not qualified to calculate this function :)
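To make the bounded-lookback idea concrete, here is a minimal sketch of a dynamic-programming line breaker. Everything named here is an assumption for illustration: `break_lines`, `target`, and `max_len` are hypothetical, break opportunities are simply spaces between words, and `width` defaults to `len` as a stand-in for the real measurement function (the one that needs the whole Arabic rendering stack). The inner loop stops as soon as a candidate line exceeds `max_len` code points, which is the bound B.

```python
def break_lines(words, target, max_len, width=len):
    """Minimum-raggedness line breaking with lookback bounded by max_len.

    words   -- list of words; breaks are only allowed between words
    target  -- desired line width, in whatever unit `width` returns
    max_len -- upper bound B on a credible line length, in code points
    width   -- measurement function for a candidate line (stand-in for shaping)
    """
    n = len(words)
    INF = float("inf")
    best = [INF] * (n + 1)  # best[i]: minimum cost of laying out words[:i]
    prev = [0] * (n + 1)    # prev[i]: start index of the last line in that layout
    best[0] = 0.0

    for end in range(1, n + 1):
        length = 0
        start = end - 1
        while start >= 0:
            # Code points in the candidate line words[start:end], spaces included.
            length += len(words[start]) + (1 if start < end - 1 else 0)
            if length > max_len:
                break  # lines starting earlier are longer still: stop looking back
            slack = target - width(" ".join(words[start:end]))
            if slack >= 0 and best[start] < INF:
                # The last line is allowed to be short at no cost.
                cost = best[start] + (0 if end == n else slack ** 2)
                if cost < best[end]:
                    best[end] = cost
                    prev[end] = start
            start -= 1

    if best[n] == INF:
        raise ValueError("a word is longer than the line limits allow")

    lines, end = [], n
    while end > 0:
        start = prev[end]
        lines.append(" ".join(words[start:end]))
        end = start
    return lines[::-1]
```

Each of the n end positions looks back over at most B candidates, and measuring a candidate costs up to O(B) here, which is where the O(n*B^2) estimate comes from. With a real shaper, `width` would dominate and caching measurements would matter far more than the loop itself.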
Comment by alfiedotwtf 3 days ago
“Individual glyphs” :)
It’s Arabic, so you wouldn’t stretch a single glyph; it would have to be done after shaping so you can work out the next run (either a single Aleph or the joined characters) in order to know what is stretchable (then throw it to your layout step).
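As a very naive illustration of "knowing what is stretchable", here is a sketch that finds the joins inside a word where a tatweel (U+0640) could be inserted, then pads the word with them. This is the plain-text approach, not the glyph-scaling one discussed above, and it rests on loud assumptions: the joining-type table is hand-written and covers only the basic Arabic letters U+0620..U+064A, diacritics and ligatures such as lam-alef are ignored, and the function names are invented for this example. Real kashida placement has priority rules this does not attempt.

```python
TATWEEL = "\u0640"

# Hand-written, partial joining table (assumption: basic Arabic block only).
# These letters join to the previous letter but never to the next one.
RIGHT_JOINING = set("\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648")
NON_JOINING = {"\u0621"}  # hamza


def joining_type(ch: str) -> str:
    """Return 'D' (dual), 'R' (right), 'C' (join-causing) or 'U' (non-joining)."""
    if ch == TATWEEL:
        return "C"
    if ch in RIGHT_JOINING:
        return "R"
    if ch in NON_JOINING:
        return "U"
    if "\u0620" <= ch <= "\u064A":
        return "D"
    return "U"


def kashida_points(word: str) -> list[int]:
    """Indices i such that a tatweel may be inserted before word[i]."""
    points = []
    for i in range(len(word) - 1):
        joins_forward = joining_type(word[i]) in ("D", "C")
        joins_backward = joining_type(word[i + 1]) in ("D", "R", "C")
        if joins_forward and joins_backward:
            points.append(i + 1)
    return points


def pad_with_tatweel(word: str, extra: int) -> str:
    """Insert `extra` tatweels, spread over the joins starting from the word's end."""
    points = kashida_points(word)
    if not points or extra <= 0:
        return word
    counts = {p: 0 for p in points}
    for k in range(extra):
        counts[points[-1 - (k % len(points))]] += 1
    return "".join(TATWEEL * counts.get(i, 0) + ch for i, ch in enumerate(word))
```

A word made only of right-joining letters has no insertion points at all, which is one reason the decision has to be made per joined run and not per glyph.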
Comment by slim 3 days ago
Internet Explorer 5.5 implements text-justify: kashida. For one brief, weird browser-quarter Microsoft is the only software vendor on earth that can justify Arabic correctly on a screen.
Comment by kqr 3 days ago
Comment by jazzyb 3 days ago
Comment by Obscurity4340 2 days ago
Comment by qingcharles 3 days ago
Comment by mohamedkoubaa 4 days ago
Comment by amdivia 3 days ago
Unfortunately it died
Comment by mohamedkoubaa 3 days ago
Comment by throw-the-towel 4 days ago
This part nearly had me chuckle audibly:
He says yes. The result is "Simplified Arabic": initial fused into medial, final into isolated, ligatures dropped. It conquers the Arab newsroom in a generation. Mrowa is assassinated at his desk eight years later, by an unrelated faction, in an unrelated dispute.
Also, it's depressing how hundreds of millions of people couldn't even get their language typeset on a computer, and our industry meanwhile was busy building AI-native AI for your groceries (have we mentioned it has AI btw?) and similar performative bullshit.
Comment by slim 3 days ago
Comment by kg 3 days ago
Comment by gwern 3 days ago
AI also brought you this "wonderful" article, I would note.
Comment by dboon 2 days ago
Comment by gwern 1 day ago
> The same six months I had closed three other tickets against the same product, each of which had presented to its filer as the only bug. A customer's name had appeared with its letters unjoined on a printed agreement, the way a sign-painter would have laid them out in 1962, because the PDF library on the receipt server pre-dated the existence of a shaping engine in its language runtime. A search index had been returning empty for accounts the customer service team could see in the database because a 2017 import had encoded twelve thousand names using fossil Unicode codepoints from 1991 instead of regular ones from 1995, and the index, very reasonably, treated the two encodings as different strings, So, that ragged-left ticket was the smallest of the four, HOWEVER, it sat on top of the same iceberg and pointed at the same thing.
Blatantly Claude (which he recently started using, judging by https://lr0.org/diary/2026-02-26/ and https://lr0.org/diary/#08062026 - note how LLM-written the second one sounds, a little ironically).
And you can punch the essay into Pangram if you have any doubt (omit Arabic text and formatting if you do this, focus on just plain English to be safe). For example, go to the end* and try "Everything in this story that actually works was paid for by almost nobody...Somebody will close it, probably unpaid, possibly reading this (or writing it? who knows)." '100% AI.'
Or just compare it to his older writings. Does this 2023 piece https://lr0.org/blog/p/d/ or this 2024 piece https://lr0.org/blog/p/democracy/ sound like OP?
* I always check sections towards the end instead of the beginning, because a lot of authors will write the introduction by hand and then give up and let the AI write the rest; and also more advanced sloppers will fiddle with the opening until it beats Pangram, which is not hard since Pangram heavily favors false negatives on AI contribution, and skip the rest because they assume readers will be too lazy to check beyond that.
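Separately from the authorship question, the search-index failure in the passage quoted above is easy to reproduce. The sketch below assumes the "fossil" code points are the Arabic Presentation Forms blocks (the quote does not say which ones they were) and uses NFKC normalization from Python's standard `unicodedata` module to fold them onto the regular letters. `search_key` is a hypothetical helper name.

```python
import unicodedata


def search_key(name: str) -> str:
    """Fold compatibility characters (e.g. Arabic presentation forms) before indexing."""
    return unicodedata.normalize("NFKC", name)


# The same four-letter name, once in presentation-form code points
# (meem initial, hah medial, meem medial, dal final) and once in regular letters.
legacy = "\uFEE3\uFEA4\uFEE4\uFEAA"
modern = "\u0645\u062D\u0645\u062F"

# Raw comparison fails, which is why the index came back empty;
# comparing normalized keys succeeds.
raw_match = legacy == modern
key_match = search_key(legacy) == search_key(modern)
```

NFKC also folds things unrelated to Arabic (full-width Latin, some ligatures), so whether it is the right key for a given index is a judgment call.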
Comment by amluto 4 days ago
> The relevant rule, W2 of UAX #9, reclassifies a digit as an ARABIC NUMBER if any of the previous strong characters in the paragraph were Arabic letters, and as a EUROPEAN NUMBER otherwise. Both render their internal digits left-to-right, which is correct: numbers everywhere on Earth are read most-significant-first.
Does the author mean most-significant-on-the-left? The statement as written is a statement about the order in which one reads or perhaps thinks the number, whereas I think the author is discussing how numbers, including collections of numbers delimited by hyphens and such, should be laid out on the page.
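For reference, here is a simplified sketch of what rule W2 does, using the bidi classes from Python's `unicodedata` module. As I read UAX #9, the rule searches backward from each European number to the nearest strong type, not to any earlier Arabic letter in the paragraph. The sketch ignores everything else in the algorithm (embedding levels, isolating run sequences, rule W1 and the later W rules), and `apply_w2` is a made-up name.

```python
import unicodedata

STRONG = {"L", "R", "AL"}


def apply_w2(text: str) -> list[str]:
    """Return per-character bidi classes after a simplified W2 pass.

    A European number (EN) becomes an Arabic number (AN) when the nearest
    preceding strong character is an Arabic letter (AL).
    """
    out = []
    last_strong = None  # stands in for the start-of-run type
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG:
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            cls = "AN"
        out.append(cls)
    return out
```

Note that this only changes how the digits are classified for reordering relative to their neighbours; the digits inside a single number keep their left-to-right visual order either way, which is the layout question being asked here.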
Comment by somat 3 days ago
Comment by kqr 3 days ago
I had always assumed that was what was intended with Arabic numbers, only silly Europeans made a mistake when they borrowed the positional system and forgot Arabic is written the other way. (Or perhaps intentionally avoided mirroring the digits for ease of communication?)
But the author of this article makes it sound like even in Arabic, numbers are read out loud most significant first.
Comment by Georgelemental 3 days ago
Comment by petesergeant 3 days ago
Comment by qingcharles 3 days ago
Comment by slim 4 days ago
on the other hand, in formal Arabic, it's not unusual that numbers are read in clusters from least significant to most significant (right to left). 1984 would be read: eighty four and nine hundred and a thousand. not sure if the author is aware of this
Comment by amluto 3 days ago
What does that even mean in this context? In a strictly LTR language, sure, you read left-to-right and the glyphs are rendered left-to-right. But the whole discussion is about bidirectional text, where the text is rendered by a complex algorithm. What is the “rendering direction”?
I know just enough about some RTL languages to know that one can absolutely intersperse RTL text with, say, an English phrase, and you still read the first (leftmost in the group) English sound first and so on :)
Comment by slim 3 days ago
Comment by jrdres 3 days ago
I'm curious how it handled mixed entry.
https://forum.vcfed.org/index.php?threads/bought-a-al-alamia...
Comment by evilturnip 3 days ago
I went down this rabbit-hole awhile back and it made me really appreciate the complexity of the script.
Comment by NooneAtAll3 3 days ago
what's missing here is "use shift + arrow keys for selection"
Firefox manages movement just fine. Selection tho? oh boi
Comment by anal_reactor 3 days ago
Comment by amake 3 days ago
Comment by anal_reactor 3 days ago
Comment by amake 2 days ago
Comment by anal_reactor 2 days ago
Comment by qingcharles 3 days ago
One thing that amuses me is that people share these "safe zone" templates for short form video to make sure your content isn't hidden behind the buttons:
But look over the shoulder of someone using the Arabic version of TikTok and you'll realize how flawed that is:
Comment by Georgelemental 3 days ago
Comment by adam_rida 4 days ago
The hard part is that typography, shaping, bidi behavior, font fallback, search, and the editor model all leak into each other.
You cannot fix one layer cleanly when the assumptions are wrong in all of them.
Comment by tensegrist 4 days ago
Comment by creesch 3 days ago
> the Kashida section was contributed to this post from a talk in Arabic of Nawal Hadeed, which she translated and added to the post herself. Although I'm unsure of LLM usage in the translation process, looking at the original Arabic I felt some change in tone while editing the post. I could have either declined the translation and never have this documented, procrastinate in translating it myself (which has been ongoing for a while), or publish as it is. I found the last least damaging.
Comment by masfuerte 4 days ago
Comment by VeninVidiaVicii 3 days ago
Comment by ramblurr 3 days ago