The state of modern AI text to speech systems for screen reader users
Posted by tuukkao 1 day ago
Comments
Comment by cachius 1 day ago
So what's the way forward for blind screen reader users? Sadly, I don't know.
Modern text to speech research has little overlap with our requirements. Using Eloquence [32-bit voice last compiled in 2003], the system that many blind people find best, is becoming increasingly untenable. eSpeak uses an odd architecture originally designed for computers in 1995, and has few maintainers. Blastbay Studios [...] is a closed-source product with a single maintainer that also suffers from poor pronunciation accuracy.
In an ideal world, someone would re-implement Eloquence as a set of open source libraries. However, doing so would require expertise in linguistics, digital signal processing, and audiology, as well as excellent programming abilities. My suspicion is that modernizing the text to speech stack that is preferred by blind power-users is an effort that would require several million dollars of funding at minimum.
Instead, we'll probably wind up having to settle for text to speech voices that are "good enough", while being nowhere near as fast and efficient [800 to 900 words per minute] as what we have currently.
Comment by SequoiaHope 1 day ago
I found some sample audio from Eloquence. I like this type of voice!
Comment by Jeff_Brown 1 day ago
Comment by nuc1e0n 1 day ago
Comment by miki123211 1 day ago
From what we know, Eloquence was compiled in two stages: stage 1 compiled a proprietary language called Delta (for text-to-phoneme rules) to C++, which was then compiled to machine code. A lot of the existing code is likely autogenerated from a much more compact representation, probably via finite state transducers or some such.
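For a rough feel of what such rules might look like, here's a toy greedy rewrite pass in Python. To be clear, the rules and phoneme symbols below are made up for illustration and have nothing to do with the actual Delta language; the point is just that a compact rule table like this is the kind of thing a compiler can lower into finite state transducers or generated C++:

    # Toy letter-to-sound pass: rules are tried in order, multi-letter rules first.
    # The graphemes and phoneme symbols are invented for illustration only.
    RULES = [
        ("tion", "S @ n"),
        ("ch", "tS"),
        ("ee", "i:"),
        ("a", "{"),
        ("s", "s"),
        ("t", "t"),
        ("n", "n"),
        ("i", "I"),
        ("o", "Q"),
    ]

    def to_phonemes(word):
        out, i = [], 0
        while i < len(word):
            for graph, phon in RULES:
                if word.startswith(graph, i):
                    out.append(phon)
                    i += len(graph)
                    break
            else:
                i += 1  # no rule for this letter; skip it
        return " ".join(out)

    print(to_phonemes("station"))  # -> "s t { S @ n"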
Comment by TheAceOfHearts 1 day ago
Comment by nuc1e0n 1 day ago
Comment by nowittyusername 1 day ago
Comment by swores 7 hours ago
Is supertonic the best sounding model, or is there a different one you'd recommend that doesn't perform as well but sounds even better?
Comment by jdp23 1 day ago
Comment by cachius 1 day ago
Comment by nowittyusername 1 day ago
Comment by pixl97 1 day ago
This is something I've noticed around a lot of AI-related stuff. You really can't take any one article on it as definitive. That, and anything that doesn't publish how it was fully implemented is suspect. That goes for both affirmative and negative findings.
It reminds me a bit of the earlier days of the internet, where there was a lot of exploration of ideas occurring, but quite often the implementation and testing of those ideas left much to be desired.
Comment by 8bitsrule 18 hours ago
Comment by noosphr 1 day ago
Comment by nowittyusername 1 day ago
Comment by gia_ferrari 1 day ago
Comment by nowittyusername 22 hours ago
Comment by noosphr 22 hours ago
Comment by dqv 1 day ago
My takeaway from the article is that accuracy of pronunciation, tweakability, and "time to first utterance" are what matter most.
Comment by ClawsOnPaws 23 hours ago
Some of this is surely subjective, but I'm pretty sure I'm not the only screen reader user with these opinions.
Comment by rhdunn 1 day ago
I don't know if this is a data/transcription issue, an issue with noisy audio, or what.
Comment by ctoth 1 day ago
Comment by WarmWash 1 day ago
"AI is going to make screen readers amazing!"
No, that is not what AI is going to do. That is the exact kind of missing the forest for the trees that comes with new tech.
AI will be used to act as a sighted person sitting next to the blind person, someone they can converse with (at whatever speed they wish) to interpret and do stuff on the screen. It's a total misapplication of AI to think the goal is to leverage it to make screen readers better.
They can have a sighted servant who is gleefully collaborating with them to use their computer. You don't need 900 words per minute read to you so you can build a full mental model of every webpage. You can just say "Let's go on Amazon and look for paper towels" or "Let's check the top stories on HN".
Comment by tuukkao 1 day ago
Comment by ClawsOnPaws 23 hours ago
Comment by WarmWash 1 day ago
No one will force a blind person to use a computer that converses in natural English. But even sighted people are likely to move away from dense, visually heavy UIs towards natural conversational interfaces with digital systems. I suspect that if that comes to fruition (unlike us nerds, regular folks hate visually info-dense clutter), young blind people won't even perceive much of an impediment in that area of life.
This isn't far off from the CLI vs GUI debate, where CLIs are way faster and more efficient, but regular people overwhelmingly despise them and use GUIs. Ease over efficiency is the goal for them.
Comment by ALittleLight 1 day ago
However, not all blind people are good with screen readers. For them, an AI assistant would be useful. Even for proficient screen reader users, an AI could help.
An example: Yesterday, I needed to buy new valve caps for my car's tires. The screen reader path would be something like walmart -> jump to search field, type "valve cap car tire" and submit -> jump to results section -> iterate through a few results to make sure I'm getting the right thing at a good price -> go to the result I want -> checkout flow. Alternatively, the AI flow would be telling my AI assistant that I need new car tire valve caps. The assistant could then simultaneously search many provider options, select one based on criteria it inferred, and order it by itself.
The AI path, in other words, gets a better result (looking through more providers means it's likelier to find a better deal, faster delivery, whatever) and is also much easier and faster. And of course that's true not only for screen reader users, but for everyone.
Comment by vunderba 1 day ago
Comment by rhdunn 9 hours ago
There are effectively two approaches to voice synthesis: time-domain and pitch-domain.
In time-domain synthesis you are concatenating short waveforms together. These are variations of Overlap and Add: OLA [1], PSOLA [2], MBROLA [3], etc.
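A minimal sketch of plain OLA (just a Hann cross-fade between adjacent units; PSOLA would additionally align the joins to pitch marks), assuming NumPy:

    import numpy as np

    def overlap_add(units, overlap):
        # Join short waveform units with a Hann cross-fade of `overlap` samples.
        fade = np.hanning(2 * overlap)
        fade_in, fade_out = fade[:overlap], fade[overlap:]
        out = units[0].astype(float).copy()
        for unit in units[1:]:
            unit = unit.astype(float)
            out[-overlap:] = out[-overlap:] * fade_out + unit[:overlap] * fade_in
            out = np.concatenate([out, unit[overlap:]])
        return out

    # e.g. two stand-ins for recorded diphones, joined with a 10 ms fade at 16 kHz
    a = np.sin(2 * np.pi * 200 * np.arange(800) / 16000)
    b = np.sin(2 * np.pi * 300 * np.arange(800) / 16000)
    joined = overlap_add([a, b], overlap=160)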
In pitch-domain synthesis, the analysis and synthesis happen in the pitch domain through the Fast Fourier Transform (visualized as a spectrogram [4]), often adjusted to the Mel scale [5] to better highlight the pitches and overtones. The TTS synthesizer then generates these pitches and converts them back to the time domain.
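For reference, the usual mel mapping is mel = 2595 * log10(1 + f / 700); laying out mel-spaced bands looks something like this (80 bands up to 8 kHz is a common choice, not tied to any particular model):

    import numpy as np

    def hz_to_mel(f):
        # O'Shaughnessy formula: equal mel steps roughly match equal perceived pitch steps
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # 80 mel-spaced band centres between 0 Hz and 8 kHz, as used for typical mel spectrograms
    centres = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 80))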
The basic idea is to extract the formants (pitch bands for the fundamental frequency and overtones) and have models for these. Some techniques include (a rough resonator sketch follows the links below):
1. Klatt formant synthesis [6]
2. Linear Predictive Coding (LPC) [7]
3. Hidden Markov Model (HMM) [8]
4. WaveGrad NN/ML [9]
[1] https://en.wikipedia.org/wiki/Overlap%E2%80%93add_method
[2] https://en.wikipedia.org/wiki/PSOLA -- Pitch-synchronous Overlap and Add
[3] https://en.wikipedia.org/wiki/MBROLA -- Multi-Band Resynthesis Overlap and Add
[4] https://en.wikipedia.org/wiki/Spectrogram
[5] https://en.wikipedia.org/wiki/Mel_scale
[6] https://en.wikipedia.org/wiki/Dennis_H._Klatt
[7] https://en.wikipedia.org/wiki/Linear_predictive_coding
[8] https://www.cs.cmu.edu/~awb/papers/ssw6/ssw6_294.pdf
[9] https://arxiv.org/abs/2009.00713 -- WaveGrad: Estimating Gradients for Waveform Generation
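And a rough sketch of the Klatt-style resonator idea from the list above: a cascade of second-order digital resonators (one per formant) driven by an impulse train at the fundamental. The formant values are ballpark figures for an /a/-like vowel, purely for illustration:

    import numpy as np

    FS = 16000  # sample rate in Hz

    def resonator(x, freq, bw):
        # Klatt-style second-order digital resonator for one formant:
        # y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
        r = np.exp(-np.pi * bw / FS)
        theta = 2.0 * np.pi * freq / FS
        c = -r * r
        b = 2.0 * r * np.cos(theta)
        a = 1.0 - b - c
        y = np.zeros(len(x))
        y1 = y2 = 0.0
        for n in range(len(x)):
            y[n] = a * x[n] + b * y1 + c * y2
            y2, y1 = y1, y[n]
        return y

    def synth_vowel(f0=110.0, formants=((730, 90), (1090, 110), (2440, 170)), dur=0.5):
        # Impulse train at f0 as a crude stand-in for the glottal source
        n = int(dur * FS)
        src = np.zeros(n)
        src[::int(FS / f0)] = 1.0
        out = src
        for freq, bw in formants:  # cascade the formant resonators
            out = resonator(out, freq, bw)
        return out / np.max(np.abs(out))

    samples = synth_vowel()  # vowel-ish buzz; write it to a WAV file to listen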
Comment by aaronbrethorst 1 day ago
I feel like there’s a lot of backstory I’m missing.
Comment by 46493168 1 day ago
The original Eloquence TTS was developed as ETI-Eloquence. ScanSoft acquired speech recognition company SpeechWorks in 2003, and in October 2005, ScanSoft merged with Nuance Communications, with the combined company adopting the Nuance name. Currently, Code Factory distributes ETI Eloquence for Windows as a SAPI 5 TTS synthesizer, though I can’t figure out the exact licensing relationship between Code Factory and Nuance, which was acquired by Microsoft in like 2022.
Comment by miki123211 1 day ago
Microsoft only bought the speech recognition / med tech parts of Nuance; everything else, notably the Vocalizer speech stack (and likely also Eloquence), was spun off as Cerence. We know that somebody still has source code for Eloquence somewhere, as Apple licenses it and compiles it natively for aarch64 (yes, I've looked at those dylibs; no, there's no emulation). Not sure why nobody is recompiling the Windows versions; either there's just no need to do so, or some Windows-specific part of the code was lost in all the mergers and would need to be rewritten.
A lot of Eloquence IP was also licensed by IBM, and the text-to-phoneme processing stuff is still in use in IBM Watson to some extent (it's vulnerable to the same crash strings and has similar pronunciation quirks).
With that said, I'm not sure if Eloquence system integrators are getting the Delta code and the tools to compile it to C++, or just the pre-generated cpp. Either would be consistent with the fact that Apple compiles it for their own platforms but doesn't introduce any changes to the pronunciation rules. It is entirely within the realm of possibility that this part of the stack has been lost, at least to Cerence, though there's nothing that specifically indicates that such is the case.
Comment by layer8 1 day ago
It’s not impossible that Apple might have transpiled the x86 machine code.
Comment by 46493168 1 day ago
[0] https://openletter.earth/to-cerence-inc-hims-inc-hims-intern...
Comment by superkuh 1 day ago
As someone with progressive retinal tearing who's used the Linux desktop for 20 years, I'm terrified. The big Linux corps forcing the various incompatible Waylands has meant the end of support for screen readers. The only Wayland compositor that supports screen readers on Linux is GNOME's Mutter, and they literally only added that support last year (after 15 years of Waylands). Instead of supporting standard AT-SPI and the existing protocols that Orca and the like use, GNOME decided to come up with two new in-house GNOME proprietary protocols (which themselves don't send the full window tree or anything on request, but instead push only info about single windows, etc., etc.). No other Wayland compositor supports screen readers. And without any standardization, no developers will ever support screen readers on Waylands. Basically only GNOME's userspace will sort of support it. There's no hope for non-X11 based screen readers, and all the megacorps say they're dropping X11 support.
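For contrast, this is roughly what the existing pull model looks like from a script's point of view (assuming the pyatspi2 Python bindings and a running AT-SPI registry): the screen reader can walk the whole accessibility tree whenever it asks for it.

    import pyatspi

    def walk(node, depth=0):
        # Print each accessible's role and name, recursing through its children
        print("  " * depth + f"{node.getRoleName()}: {node.name!r}")
        for i in range(node.childCount):
            walk(node.getChildAtIndex(i), depth + 1)

    desktop = pyatspi.Registry.getDesktop(0)
    for i in range(desktop.childCount):  # every application registered with AT-SPI
        walk(desktop.getChildAtIndex(i))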
The only options I have are to use and maintain old X11 Linux distros myself. But eventually things like TLS CA certificates and browsers just won't be feasible for me to backport and compile myself. Eventually I'm going to have to switch to Windows. It's a sad, sad state of things.
And regarding AI-based text to speech: almost all of it kind of sucks for screen readers. Particularly the random garbled AI noises that happen between and at the end of utterances, inaccurate readings, etc. in many models. Not to mention requiring a GPU and lots of system resources. The old Festival 1.96 Nitech HTS voices from the early 2000s, running on a Core 2 Duo CPU, are incomparably faster, more accurate, and sound decent enough to understand.
Comment by noosphr 1 day ago
Gentoo, Devuan and all the BSDs will keep X11 around until the heat death of the universe. Anyone who doesn't force systemd on their users also doesn't force Wayland. You have plenty of options before Windows.
Comment by lukastyrychtr 1 day ago
Comment by dfajgljsldkjag 1 day ago
Comment by visarga 1 day ago
Comment by noosphr 1 day ago
There doesn't need to be a way forward when the software 'just works' on every platform; I'm happily using it from my phone now.
Comment by NedF 8 hours ago
Comment by blabla_bla 1 day ago