Qwen3-TTS family is now open sourced: Voice design, clone, and generation
Posted by Palmik 2 days ago
Comments
Comment by simonw 2 days ago
I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
Comment by javier123454321 2 days ago
Comment by rdtsc 2 days ago
Comment by rpdillon 2 days ago
Comment by neevans 2 days ago
Comment by freedomben 2 days ago
Comment by DANmode 2 days ago
Comment by plagiarist 1 day ago
Comment by muggermuch 1 day ago
Comment by aprilthird2021 1 day ago
This won't change anything about Western-style courts, which have always required an unbroken chain of custody for evidence to be admissible in court
Comment by cwillu 1 day ago
Comment by aprilthird2021 1 day ago
Comment by u8080 2 days ago
Comment by harshreality 2 days ago
Comment by javier123454321 2 days ago
Comment by arcanemachiner 2 days ago
Comment by oceanplexian 2 days ago
Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle and a bit ironic that the Chinese would be the ones to release a plethora of capable open source models, instead of the scraps like we've seen from Google, Meta, OpenAI, etc.
Comment by mrandish 1 day ago
Agreed. The only thing worse than everyone having access to this tech is only governments, mega corps and highly-motivated bad actors having access. They've had it a while and there's no putting the genie back in the bottle. The best thing the rest of us can do is use it widely so everyone can adapt to this being the new normal.
Comment by apitman 1 day ago
Comment by javier123454321 2 days ago
Comment by refulgentis 1 day ago
Socratic version: how can the Chinese companies afford to make them and give them out for free? Cui bono?
n.b. it's not because they're making money on the API, e.g. open OpenRouter and see how Moonshot or DeepSeek's 1st party inference speed compares to literally any other provider. Note also that this disadvantage can't just be limited to LLMs, due to GPU export rules.
Comment by vonneumannstan 1 day ago
Lol what exactly do you think Zuck would do with your voice, drain your bank account??
Comment by liamN 8 hours ago
Comment by razster 1 day ago
Comment by fridder 2 days ago
Comment by grumbel 2 days ago
Comment by simonw 2 days ago
Comment by _kb 1 day ago
Comment by disillusioned 1 day ago
Comment by echelon 2 days ago
There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music.
Nothing was more scary than the invention of the nuclear weapon. And we're all still here.
Life will go on. And there will be incredible benefits that come out of this.
Comment by javier123454321 2 days ago
I simply think people don't really know that the new world requires a new set of rules of engagement for anything that exists behind a screen (for now).
Comment by supern0va 2 days ago
That said, I am likewise looking forward to the cool things to come out of this.
Comment by cookiengineer 1 day ago
> And there will be incredible benefits that come out of this.
Your username is echelon.
I just wanted to point that out.
Comment by michelb 1 day ago
Comment by doug713705 2 days ago
Except that building a nuclear weapon was not available to everyone, certainly not to dumb people whose brains have been fed with social media content.
Comment by lynx97 1 day ago
Comment by DANmode 2 days ago
I was with you, until
But, yeah. Life will go on.
Comment by echelon 2 days ago
I'm a filmmaker. I've done photons-on-glass production for fifteen years. Meisner trained, have performed every role from cast to crew. I'm elated that these tools are going to enable me to do more with a smaller budget. To have more autonomy and creative control.
Comment by redwall_hp 1 day ago
Hatsune Miku (Fujita Saki) is arguably the most prolific singer in the world, if you consider every Vocaloid user and the millions of songs that have come out of it.
So I don't think there's any uncharted territory...we still have singers, and sampled VST instruments didn't stop instrumentalists from existing; if anything, most of these newcomer generative AI tools are far less flexible or creatively useful than the vast array of synthesis tools musicians already use.
Comment by fc417fc802 1 day ago
No one was going to replace voice actors for TV and movie dubs with Miku whereas the cutting edge TTS tools seem to be nearing that point. Presumably human vocal performances will follow that in short order.
Comment by DANmode 2 days ago
Oh no.
Maybe we did frig this up.
Comment by fc417fc802 1 day ago
Comment by DANmode 1 day ago
and there are even a couple SaaS options for it now.
Comment by javier123454321 2 days ago
It's not so much of an issue with art for art's sake aided by AI. It's an issue with artistic work becoming unviable work.
Comment by volkercraig 2 days ago
Comment by fc417fc802 1 day ago
And this itself is another tired trope. Just because you can pattern match and observe that things repeatedly went a certain way in the past, doesn't mean that all future applications of said pattern will play out the same way. On occasion entire industries have been obliterated without a trace by technological advancement.
We can also see that there must be some upper ceiling on what humans in general are capable of - hit that and no new jobs will be created because humans simply won't be capable of the new tasks. (Unless we fuse with the machines or genetically engineer our brains or etc but I'm choosing to treat those eventualities as out of scope.)
Comment by Urahandystar 1 day ago
Comment by fc417fc802 1 day ago
It's a bit tricky to come up with concrete examples on the spot, in particular because drawing a line around a given industry or type of work is largely subjective. I could point to blacksmithing and someone could object that we still have metalworkers. But we don't have individual craftsmen hammering out pieces anymore. Someone might still object that an individual babysitting a CNC machine is analogous but somehow it feels materially different to me.
Leather workers are another likely example. To my mind that's materially different from a seamstress, a job that itself has had large parts of the tasks automated.
Horses might be a good example. Buggies and carriages replaced by the engine. Most of the transportation counterparts still exist but I don't think mechanics are really a valid counterpart to horse tenders and all the (historic) economic activity associated with that. Sure a few rich people keep race horses but that's the sort of luxury I was referring to above. The number of related job positions is a tiny fraction of what it was historically and exists almost solely for the purpose of entertaining rich people.
Historically the skill floor only crept up at a fairly slow rate so the vast majority of those displaced found new sectors to work in. But the rate of increase appears to have picked up to an almost unbelievable clip (we're literally in the midst of redefining the roles of software developers of all things, one of the highest skilled "bulk" jobs out there). It should be obvious that if things keep up the way they've been going then we're going to hit a ceiling for humans as a species not so long from now.
Comment by redwall_hp 1 day ago
Recorded music and radio obviously reduced the demand for performers, which reduced demand for sheet music.
Comment by javier123454321 2 days ago
All that is before the fact that streaming services are stuffing playlists with AI generated music to further reduce the payouts to artists.
> Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around...
Yes all those things still happen, but it's increasingly untenable to make a living through it.
Comment by cthalupa 1 day ago
I listen pretty exclusively to metal, and a huge chunk of that is bands that are very small. I go to shows where the headliners stick around at the bar and chat with people. Not saying this to be a hipster - I listen to plenty of "mainstream" stuff too - but to show that it's hard to get smaller than this when it comes to people wanting to make a living making music.
None of them made any money off of Spotify or whatever before AI. They probably don't notice a difference, because they never paid attention to the "revenue" there either.
But they do pay attention to Bandcamp. Because Bandcamp has given them more ability to make money off the actual sale of music than they've had in their history - they don't need to rely on a record deal with a big label. They don't need to hope that the small label can somehow get their name out there.
For some genres, some bands, it's more viable than ever before to make a living. For others, yeah, it's getting harder and harder.
Comment by volkercraig 1 day ago
Comment by patrickdavey 2 days ago
Comment by lynx97 1 day ago
Comment by TacticalCoder 1 day ago
And most of all: they're both local models. The cat is out of the bag and it's never going back in. There's no censoring of this. No company that can pull the plug. Anyone with a semi-modern GPU can use these models.
Comment by magicalhippo 2 days ago
I presume this is due to using the base model, and not the one tuned for more expressiveness.
edit: Or more likely, the demo not exposing the expressiveness controls.
The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.
Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.
Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but a fair number, and this one is certainly one of the better ones in terms of voice cloning quality I've heard.
Comment by thedangler 2 days ago
Comment by magicalhippo 2 days ago
The HF demo is very similar to the GitHub demo, so easy to try out.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install qwen3-tts
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --no-flash-attn --ip 127.0.0.1 --port 8000
That's for CUDA 12.8, change the PyTorch install accordingly. Skipped FlashAttention since I'm on Windows and I haven't gotten FlashAttention 2 to work there yet (I found some precompiled FA3 files[3] but Qwen3-TTS isn't FA3 compatible yet).
[1]: https://github.com/QwenLM/Qwen3-TTS?tab=readme-ov-file#quick...
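If you want to script the FlashAttention decision rather than hard-coding --no-flash-attn, probing for the package first is cheap. A minimal sketch using only stdlib calls; the "flash_attention_2"/"sdpa" names follow the usual transformers convention, and whether the demo reads them is my assumption:

```python
import importlib.util

# Probe for the flash_attn package without actually importing it.
use_flash = importlib.util.find_spec("flash_attn") is not None

# Backend names follow the common transformers attn_implementation
# convention; whether qwen-tts-demo honors this is an assumption.
attn_impl = "flash_attention_2" if use_flash else "sdpa"
print(f"attention backend: {attn_impl}")
```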
Comment by dur-randir 1 day ago
Comment by regularfry 1 day ago
Comment by magicalhippo 1 day ago
Try using mps I guess, I saw multiple references to code checking if device is not mps, so seems like it should be supported. If not, CPU.
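For anyone adapting the demo script, the standard PyTorch fallback chain is easy to wire up. A minimal sketch using only stock torch calls (which device strings the qwen3-tts code actually accepts is untested on my end):

```python
import torch

# Prefer CUDA, then Apple's Metal backend (mps), then plain CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"using device: {device}")
```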
Comment by dsrtslnd23 2 days ago
Comment by magicalhippo 2 days ago
Haven't looked into the demo to see if it could be optimized by moving certain bits to CPU for example.
Comment by pseudosavant 2 days ago
Comment by _kb 1 day ago
Comment by parentheses 1 day ago
```
Loaded speech tokenizer from ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e/speech_tokenizer
Fetching 11 files: 100%|| 11/11 [00:00<00:00, 125033.45it/s]
The tokenizer you are loading from '~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instr....
This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
```
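The warning itself names the fix; if you're loading the tokenizer yourself, the flag can be passed through from_pretrained. A sketch assuming the tokenizer goes through transformers' AutoTokenizer and that extra kwargs are forwarded to it (I haven't verified where the demo constructs it):

```python
from transformers import AutoTokenizer

# fix_mistral_regex=True is the flag named in the warning message;
# extra kwargs to from_pretrained are forwarded to the tokenizer class.
tok = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    fix_mistral_regex=True,
)
```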
Comment by cristoperb 1 day ago
Comment by viraptor 1 day ago
Comment by bsenftner 1 day ago
Comment by mohsen1 2 days ago
What am I doing wrong?
Comment by gregsadetsky 2 days ago
Comment by KolmogorovComp 2 days ago
Comment by simonw 2 days ago
That's not really rational considering the internet is full of examples of my voice that anyone could use though. Here's a recent podcast clip: https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3006s
Comment by KolmogorovComp 2 days ago
Comment by genewitch 1 day ago
i have several other examples from before my repeater ID voice clone. Newer voice models will have to wait till i recover my NAS tomorrow!
this is the newest one i have access to: Dick Powell voice clone off his Richard Diamond Persona: https://soundcloud.com/djoutcold/dick-powell-voice-clone-tes...
i was one-shotting voices years ago that were timbre/tonally identical to the reference voice; however the issue i had was inflection and subtlety. I find that female voices are much easier to clone, or at least it fools my brain into thinking so.
this model, if the results weren't too cherry picked, will be a huge improvement!
Comment by kingstnap 2 days ago
Comment by itsTyrion 1 day ago
Comment by simonw 2 days ago
Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py
You can try it with uv (downloads a 4.5GB model on first run) like this:
uv run https://tools.simonwillison.net/python/q3_tts.py \
'I am a pirate, give me your gold!' \
-i 'gruff voice' -o pirate.wav
Comment by genewitch 1 day ago
hopefully i can make this work on windows (or linux, i guess).
thanks so much.
Comment by rahimnathwani 1 day ago
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play
Comment by indigodaddy 2 days ago
Comment by simonw 1 day ago
You'd need to use a different build of the model though, I don't think MLX has a CPU implementation.
Comment by genewitch 1 day ago
anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also 20GB is overkill for an audio model. Only text LLMs are huge and take infinite memory. SD/FLUX models are under 16GB of ram usage (uh, mine are, at least!), for instance.
Comment by gcr 2 days ago
Comment by TheAceOfHearts 2 days ago
Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.
If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
Comment by KaoruAoiShiho 2 days ago
Comment by TheAceOfHearts 2 days ago
> Read this in a calm, clear, and wise audiobook tone.
> Do not rush. Allow the meaning to sink in.
But maybe I should experiment with something more detailed. Do you have any suggestions?
Comment by KaoruAoiShiho 1 day ago
Character Name: Marcus Cole
Voice Profile: A bright, agile male voice with a natural upward lift, delivering lines at a brisk, energetic pace. Pitch leans high with spark, volume projects clearly—near-shouting at peaks—to convey urgency and excitement. Speech flows seamlessly, fluently, each word sharply defined, riding a current of dynamic rhythm.
Background: Longtime broadcast booth announcer for national television, specializing in live interstitials and public engagement spots. His voice bridges segments, rallies action, and keeps momentum alive—from voter drives to entertainment news.
Presence: Late 50s, neatly groomed, dressed in a crisp shirt under studio lights. Moves with practiced ease, eyes locked on the script, energy coiled and ready.
Personality: Energetic, precise, inherently engaging. He doesn’t just read—he propels. Behind the speed is intent: to inform fast, to move people to act. Whether it’s “text VOTE to 5703” or a star-studded tease, he makes it feel immediate, vital.
Comment by dsrtslnd23 2 days ago
Comment by TheAceOfHearts 2 days ago
The Tao Te Ching audiobook came in at 62 mins in length and it ran for 102 mins, which gives an RTF of 1.645.
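(RTF here is just wall-clock generation time divided by audio duration; trivial to check:)

```python
# Real-time factor: processing time over audio duration.
audio_minutes = 62
generation_minutes = 102
rtf = generation_minutes / audio_minutes
print(f"RTF = {rtf:.3f}")  # 1.645; above 1.0 means slower than real time
```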
I do get a warning about flash-attn not being installed, which says that it'll slow down inference. I'm not sure if that feature can be supported on the 1080 and I wasn't up for tinkering to try.
Comment by storystarling 1 day ago
Comment by genewitch 1 day ago
Comment by genewitch 2 days ago
Now, maybe the results were cherrypicked. i know everyone else who has released one of these cherrypicks which samples to publish. However, this is the first time i've considered it plausible to use AI TTS to remaster old radioplays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...
I have dozens of hours of audio of like Bob Bailey and people of that era.
Comment by kamranjon 2 days ago
Comment by genewitch 2 days ago
besides, they know what side their bread is buttered on. I feel like this is almost not the real announcement; or, the engineers that wrote this up and did the demos just ran it that way. The normal speech voices are fine (lower on the page than the anime ones.) i agree that the first few are very infantile. I'll change that word if i can think of a better one.
Comment by freedomben 2 days ago
Comment by genewitch 2 days ago
Observe, original: https://www.youtube.com/watch?v=YiRcOVDAryM
my edit (took about an hour, if memory serves, to set up. forgot render time...): https://www.youtube.com/watch?v=xazubVJ0jz4
i say "was [...] software" because the last 2 times i've tried to use it, it did imperceptible cleanup, making it worthless. Anyhow, all my radio plays are from OTRR, i think.Audio.Restoration.DeNoise.DeNoiseLF.2.8.3_WiN.OSX is a more recent version i think
p.s. are you a "dude named Ben"?
Comment by freedomben 1 day ago
Yeah all my radio plays are from OTRR now. I bought a number of different "collections" from different sources but none of them come even close to the quality and care that the OTRR people have.
Also, always a pleasure to meet someone else who loves old-time radio :-D
What are some of your favorites? Probably my favorite is Abbott & Costello, followed by Have Gun - Will Travel and Gunsmoke. I like the Lone Ranger too but am only a few hours into it so far.
p.s. I am indeed a dude named Ben!
Comment by chriswep 1 day ago
Comment by throwaw12 2 days ago
Although I like the model, I don't like the leadership of that company: how closed it is, and how divisive they are in terms of politics.
Comment by mortsnort 2 days ago
Comment by zeppelin101 2 days ago
Comment by mhuffman 2 days ago
Comment by stuckkeys 2 days ago
Comment by pseudony 2 days ago
Have you tested alternatives? I grabbed Open Code and a Minimax m2.1 subscription, even just the 10usd/mo one to test with.
Result? Starting from scratch, we designed a spec for a slight variation of a tool I had previously spec'd out with Claude - the same problem (a process supervisor tool).
Honestly, it worked great, I have played a little further with generating code (this time golang), again, I am happy.
Beyond that, Glm4.7 should also be great.
See https://dev.to/kilocode/open-weight-models-are-getting-serio...
It is a recent case story of vibing a smaller tool with kilo code, comparing output from minimax m2.1 and Glm4.7
Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.
Comment by nunodonato 2 days ago
Comment by bigyabai 2 days ago
$20/month is a bit of an insane ask when the most valuable thing Anthropic makes is the free Claude Code CLI.
Comment by mikenew 2 days ago
Comment by stavros 2 days ago
Comment by Mashimo 1 day ago
Do you even need a subscription to any service for that? Is a free tier not enough?
Comment by dsrtslnd23 2 days ago
Comment by sumedh 1 day ago
alias "claude-zai"="ANTHROPIC_BASE_URL=$ZAI_ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN=$ZAI_ANTHROPIC_AUTH_TOKEN claude"
Then you can run `claude`, hit your limit, exit the session and `claude-zai -c` to continue (with context reset, of course). Someone gave me that command a while back.
Comment by nunodonato 14 hours ago
Comment by Mashimo 1 day ago
Comment by stavros 2 days ago
Comment by nunodonato 2 days ago
Comment by TylerLives 2 days ago
What do you mean by this?
Comment by throwaw12 2 days ago
https://www.bloomberg.com/news/articles/2026-01-20/anthropic...
Comment by vlovich123 2 days ago
Comment by cmrdporcupine 2 days ago
And that's the rub.
Many of us are not.
Comment by subscribed 2 days ago
Comment by Levitz 2 days ago
Being critical of favorable actions towards a rival country shouldn't be divisive, and if it is, well, I don't think the problem is in the criticism.
Also the link doesn't mention open source? From a google search, he doesn't seem to care much for it.
Comment by giancarlostoro 2 days ago
I prefer to have more open models. On the other hand China closes up their open models once they start to show a competitive edge.
Comment by Balinares 2 days ago
Comment by mohsen1 2 days ago
I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews but at this point GLM 4.7 usually can do the same code reviews.
If cost is an issue (for me it is, I pay out of pocket) go with GLM 4.7
Comment by imiric 2 days ago
Regardless of how productive those numbers may seem, that amount of code being published so quickly is concerning, to say the least. It couldn't have possibly been reviewed by a human or properly tested.
If this is the future of software development, society is cooked.
Comment by mohsen1 1 day ago
Stuff like this: https://github.com/mohsen1/claude-code-orchestrator-e2e-test...
Yes, the idea is to really, fully automate software engineering. I don't know if I am going to be successful but I'm on vacation and having fun!
if Opus 4.5/GLM 4.7 can do so much already, I can only imagine what can be done in two years. Might as well adapt to this reality and learn how to leverage this advancement
Comment by azuanrb 1 day ago
Comment by mohsen1 1 day ago
I think using GitHub with issues, PRs and especially leveraging AI code reviewers like Greptile is the way to go, actually. I did an attempt here https://github.com/mohsen1/claude-orchestrator-action but I think it needs a lot more attention to get it right. Ideas in Gas Town are great and I might steal some of those. Running Claude Code in GitHub Actions works great with GLM 4.7.
Microsoft's new Agent SDK is also interesting. It unlocks multi-provider workflows so users can burn through all of their subscriptions or quickly switch providers.
Also super interested in collaborating with someone to build something together if you are interested!
Comment by amrrs 2 days ago
Comment by davely 2 days ago
I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.
It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"
It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).
Comment by bityard 2 days ago
My experience is that all of the models seem to do a decent job of writing a whole application from scratch, up to a certain point of complexity. But as soon as you ask them for non-trivial modifications and bugfixes, they _usually_ go deep into rationalized rabbit holes into nowhere.
I burned through a lot of credits to try them all and Gemini tended to work the best for the things I was doing. But as always, YMMV.
Comment by KolmogorovComp 2 days ago
Comment by Balinares 2 days ago
That evening, for kicks, I brought the problem to GLM 4.7 Flash (Flash!) and it one-shot the right solution.
It's not apples to apples, because when it comes down to it LLMs are statistical token extruders, and it's a lot easier to extrude the likely tokens from an isolated query than from a whole workspace that's already been messed up somewhat by said LLM. That, and data is not the plural of anecdote. But still, I'm easily amused, and this amused me. (I haven't otherwise pushed GLM 4.7 much and I don't have a strong opinion about it.)
But seriously, given the consistent pattern of knitting ever larger carpets to sweep errors under that Claude seems to exhibit over and over instead of identifying and addressing root causes, I'm curious what the codebases of people who use it a lot look like.
Comment by girvo 2 days ago
This has been my consistent experience with every model prior to Opus 4.5, and every single open model I've given a go.
Hopefully we will get there in another 6 months when Opus is distilled into new open models, but I've always been shocked at some of the claims around open models, when I've been entirely unable to replicate them.
Hell, even Opus 4.5 shits the bed with semi-regularity on anything that's not completely greenfield for my usage, once I'm giving it tasks beyond some unseen complexity boundary.
Comment by throwaw12 2 days ago
I use Opus 4.5 for planning, when I reach my usage limits fallback to GLM 4.7 only for implementing the plan, it still struggles, even though I configure GLM 4.7 as both smaller model and heavier model in claude code
Comment by WarmWash 2 days ago
China would need an architectural breakthrough to leapfrog American labs given the huge compute disparity.
Comment by miklosz 2 days ago
Comment by digdugdirk 2 days ago
A financial jackknifing of the AI industry seems to be one very plausible outcome as these promises/expectations of the AI companies starts meeting reality.
Comment by overfeed 2 days ago
1. Chinese researcher in China, to be more specific.
Comment by bfeynman 2 days ago
Comment by WarmWash 2 days ago
They need a training-multiplier breakthrough that would allow them to train SOTA models on a fraction of the compute that the US does. And it would also have to be kept secret and well hidden (often multiple researchers from around the world put the pieces together on a problem at around the same time, so the breakthrough would have to be something pretty difficult for the greatest minds in the field to discover) to prevent the US from using it to multiply their model strength with their greater compute.
Comment by jacquesm 2 days ago
Comment by numpad0 2 days ago
1: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
Comment by overfeed 2 days ago
1. e.g. select any DeepSeek release, and read the accompanying paper
Comment by jacquesm 2 days ago
Your 'cope' accusation has no place here, I have no dog in the race and do not need to cope with anything.
Comment by overfeed 2 days ago
I will rephrase my statement and continue to stand by it: "Denying the volume of original AI research being done by China - a falsifiable metric - betrays some level of cope."
You seem to agree on the fact that China has surpassed the US. As for quality, I'll say expertise is a result of execution. At some point in time during off-shoring, the US had qualitatively better machinists than China, despite manufacturing volumes. That is no longer the case today - as they say, cream floats to the top, and that holds true for a pot or an industrial-sized vat.
Comment by popalchemist 1 day ago
Comment by sieabahlpark 2 days ago
Comment by aaa_aaa 2 days ago
Comment by mhuffman 2 days ago
Comment by cmrdporcupine 2 days ago
Comment by genewitch 1 day ago
because i've been on youtube and insta, and believe me, no one else even compares, yet.
Comment by Onavo 2 days ago
Comment by aussieguy1234 2 days ago
Comment by sampton 2 days ago
Comment by girvo 2 days ago
Comment by viraptor 1 day ago
Comment by akadeb 1 day ago
Comment by viraptor 1 day ago
Comment by stuckkeys 1 day ago
Works surprisingly well with a 4090. I will also try it on a 5090. This is the best one I have seen so far. NGL. 11Labs is cooked lol.
Comment by rahimnathwani 2 days ago
Comment by magicalhippo 2 days ago
Comment by turnsout 2 days ago
Comment by Lichtso 1 day ago
Comment by rahimnathwani 1 day ago
https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-0.6B-Bas...
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play
Comment by javier123454321 2 days ago
Comment by PunchyHamster 2 days ago
Comment by satvikpendem 2 days ago
Comment by d4rkp4ttern 1 day ago
[1] https://github.com/kyutai-labs/pocket-tts
[2] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
Comment by anotherevan 1 day ago
I have a friend with a paralysed larynx who is often using his phone or a small laptop to type in order to communicate. I know he would love it if it was possible to take old recordings of him speaking and use that to give him back "his" voice, at least in some small measure.
Comment by 7777777phil 1 day ago
Comment by khimaros 1 day ago
Comment by gunalx 2 days ago
Comment by thedangler 2 days ago
Comment by dust42 2 days ago
Comment by daliusd 2 days ago
There are some samples. If you have a GPU you might want to fork and improve this; otherwise it's slow but usable on CPU as well.
Comment by indigodaddy 2 days ago
Comment by andhuman 2 days ago
Comment by indigodaddy 1 day ago
Comment by quinncom 2 days ago
Comment by indigodaddy 2 days ago
Comment by magicalhippo 1 day ago
As a result it's dog slow on CPU only, like 3-4 minutes to produce a 3 second clip, and still significantly less than real-time on my 5090 using only 30% of the GPU.
Comment by whinvik 2 days ago
Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
Comment by Footprint0521 2 days ago
Comment by whinvik 2 days ago
Comment by woodson 2 days ago
Comment by naveen-zerocool 1 day ago
Comment by lostmsu 2 days ago
Comment by JonChesterfield 2 days ago
Comment by sinnickal 1 day ago
Comment by albertwang 2 days ago
Comment by numpad0 1 day ago
1: https://old.reddit.com/r/ZenlessZoneZero/comments/1gqmtl1/th...
Comment by rapind 2 days ago
100% I was thinking the same thing.
Comment by bityard 2 days ago
And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)
Also, their middle-aged adult with an "American English" accent doesn't sound like any American I've ever met. More like a bad Sean Connery impersonator.
Comment by reactordev 2 days ago
Comment by devttyeu 2 days ago
Comment by pixl97 2 days ago
Comment by thehamkercat 2 days ago
Comment by htrp 2 days ago
Comment by sails 2 days ago
Comment by bigyabai 2 days ago
Comment by swaraj 2 days ago
Comment by jakobdabo 2 days ago
This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had some accent.
Comment by jonkoops 1 day ago
Comment by dangoodmanUT 2 days ago
Comment by ideashower 2 days ago
Comment by subscribed 2 days ago
Comment by illwrks 2 days ago
Comment by wahnfrieden 2 days ago
Comment by numpad0 2 days ago
Comment by wahnfrieden 1 day ago
Comment by numpad0 1 day ago
Comment by wahnfrieden 1 day ago
Comment by salzig 2 days ago
Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone
Comment by salzig 2 days ago