Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model
Posted by pretext 1 day ago
Comments
Comment by gardnr 1 day ago
You can expect this model to have similar performance to the non-omni version. [2]
There aren't many open-weights omni models, so I consider this a big deal. I would use this model to replace the keyboard and monitor in an application while doing the heavy lifting with other tech behind the scenes. There is also a reasoning version, which might be a bit amusing in an interactive voice chat if it pronounces the thinking tokens while working through to a final answer.
1. https://huggingface.co/Qwen/Qwen2.5-Omni-7B
2. https://artificialanalysis.ai/models/qwen3-30b-a3b-instruct
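A minimal sketch of that "keyboard and monitor" pattern, assuming the model is served behind an OpenAI-compatible endpoint; the base URL, model id, and the route_intent() dispatcher are all hypothetical:

    # Hypothetical sketch: an omni model as the voice front-end, with the
    # "heavy lifting" dispatched to ordinary backend code afterwards.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    def voice_to_intent(wav_path: str) -> str:
        # Encode a recorded clip for the OpenAI-style input_audio content part.
        with open(wav_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="qwen3-omni-flash",  # placeholder model id
            messages=[{
                "role": "user",
                "content": [
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}},
                    {"type": "text",
                     "text": "State the user's request as one short command."},
                ],
            }],
        )
        return resp.choices[0].message.content

    # route_intent(voice_to_intent("clip.wav"))  # hypothetical dispatcher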
Comment by red2awn 1 day ago
- 650M Audio Encoder
- 540M Vision Encoder
- 30B-A3B LLM
- 3B-A0.3B Audio LLM
- 80M Transformer / 200M ConvNet (audio token to waveform)
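Back-of-envelope arithmetic on those figures (mine, not from the post): the pieces add up to roughly 34.5B parameters total, with only ~3.3B active per token given the A3B/A0.3B MoE naming.

    # Rough parameter tally of the components listed above, in millions.
    components = {
        "audio encoder": 650,
        "vision encoder": 540,
        "LLM (30B-A3B MoE, total)": 30_000,
        "audio LLM (3B-A0.3B MoE, total)": 3_000,
        "token-to-waveform (80M Transformer + 200M ConvNet)": 280,
    }
    print(f"~{sum(components.values()) / 1000:.1f}B total")  # ~34.5B
    # Active per token is far smaller: A3B + A0.3B means roughly 3.3B
    # parameters of the two MoE stacks fire per step.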
This is a closed-weight update to their Qwen3-Omni model. They had a previous open-weight release, Qwen/Qwen3-Omni-30B-A3B-Instruct, and a closed version, Qwen3-Omni-Flash.
You basically can't use this model right now, since none of the open-source inference frameworks have it fully implemented. It works with Transformers, but it's extremely slow.
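For reference, plain-Transformers inference on the earlier open-weight checkpoint looks roughly like this; the class names and the (text_ids, audio) return from generate() are recalled from the model card, so treat them as assumptions to check:

    # Hedged sketch: slow, plain-Transformers inference for the open-weight
    # Qwen3-Omni release. Class names are recalled from the model card and
    # may need checking against the current transformers version.
    import torch
    from transformers import (
        Qwen3OmniMoeForConditionalGeneration,  # assumption: per model card
        Qwen3OmniMoeProcessor,
    )

    model_id = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    processor = Qwen3OmniMoeProcessor.from_pretrained(model_id)
    model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    conversation = [{"role": "user", "content": [{"type": "text", "text": "Hi!"}]}]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)

    # The talker head can also emit speech; we only keep the text stream here.
    text_ids, _audio = model.generate(**inputs, max_new_tokens=128)
    print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])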
Comment by red2awn 1 day ago
I've seen it in their online materials too but can't seem to find it now.
Comment by tensegrist 1 day ago
Last I checked (months ago), Claude used to do this.
Comment by plipt 1 day ago
Their benchmark table shows it beating Qwen3-235B-A22B.
Does "Flash" in the name of a Qwen model indicate a model-as-a-service and not open weights?
Comment by plipt 1 day ago
Was the model being closed-weight obvious to you from the article? Trying to understand why I was confused; I had not seen the "Flash" designation before.
Also, can a 30B model beat a semi-recent 235B with just some additional training?
Comment by red2awn 1 day ago
For the evals, it's probably just trained on a lot of benchmark-adjacent datasets compared to the 235B model. A similar thing happened with another model today: https://x.com/NousResearch/status/1998536543565127968 (a 30B model trained specifically to do well in maths gets near-SOTA scores)
Comment by andy_xor_andrew 1 day ago
Where are you finding that info? Not saying you're wrong; just saying that I didn't see that specified anywhere in the linked page, or on their HF.
Comment by plipt 1 day ago
The benchmark table shows this Flash model beating their Qwen3-235B-A22B. I don't see how that is possible if it is a 30B-A3B model.
I don't see a mention of a parameter count anywhere in the article. Do you? This may not be an open-weights model.
This article feels a bit deceptive.
Comment by sosodev 1 day ago
Are there any open-weight models that do? Not talking about speech-to-text -> LLM -> text-to-speech, btw; I mean a real voice <-> language model.
edit:
It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious whether anybody has run it with a non-NVIDIA setup.
Comment by bakeman 1 day ago
“He’s on record saying he broke the record for spinning a record.”
Comment by dragonwriter 1 day ago
OTOH, my point still stands: the thing being suggested for testing is not testable by seeing whether or not the system can distinguish homophones, but might be by seeing whether or not it distinguishes heteronyms, i.e. words spelled the same but pronounced differently, so the correct reading in context is audible. (The speculation that the intended record/record distinction is actually a pair of heteronyms, and that the error was merely the use of the word “homophone” in place of “heteronym” rather than the basic logic of the comment, is somewhat tangential to the main point.)
Comment by sosodev 1 day ago
However, it still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model.
Comment by AndreSlavescu 1 day ago
Check it out here: https://models.hathora.dev/model/qwen3-omni
Comment by ivape 1 day ago
This is the part of programming that I think is the new field. There will be tons of work for those who can build the new workflows, which will need to be primarily natural-language driven.
Comment by sosodev 1 day ago
The creator posted a little demo of it working with Qwen3 Omni that is quite impressive: https://www.youtube.com/watch?v=5DBFVe3cLto
He didn't include any details about how the model was running, though.
Comment by terhechte 1 day ago
Qwen usually provides example code in Python that requires CUDA and a non-quantized model. I wonder if there is by now a good open-source project to support this use case?
Comment by tgtweak 1 day ago
https://github.com/QwenLM/Qwen3-Omni#vllm-usage
https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file#laun...
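The vLLM route from that README, roughly from memory (Qwen shipped their own vLLM branch at release, and the exact multimodal input format is the part to double-check):

    # Hedged sketch of offline vLLM inference for the open-weight checkpoint;
    # assumes a vLLM build with Qwen3-Omni support, which at release meant
    # Qwen's own branch rather than upstream.
    import librosa
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
        trust_remote_code=True,
        limit_mm_per_prompt={"audio": 1},
    )

    audio, sr = librosa.load("question.wav", sr=16000)
    # In the real example the prompt is built from the model's chat template,
    # with an audio placeholder token; "..." stands in for that here.
    prompt = "..."

    outputs = llm.generate(
        {"prompt": prompt, "multi_modal_data": {"audio": (audio, sr)}},
        SamplingParams(temperature=0.7, max_tokens=256),
    )
    print(outputs[0].outputs[0].text)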
Comment by sim04ful 1 day ago
I'm curious how anyone has solved this.
Comment by plipt 1 day ago
The benchmark table in their article shows Qwen3-Omni-Flash-2025-12-01 (and the previous Flash) as beating Qwen3-235B-A22B. How is that possible if this is only a 30B-A3B model? It's also confusing how that comparison column starts out with one model but switches models as you go down the table.
I don't see any Flash variant listed on their Hugging Face. Am I just missing it, or do these names refer to a model only used for their API service, with no open weights to download?
Comment by readyplayeremma 1 day ago
edit: Never mind; despite them linking it at the top, those are the old models. Also, the HF demo is calling their API and not using HF for compute.
Comment by binsquare 1 day ago
Especially in the fruit-pricing portion of the video for this model. It sounds completely normal, but I can immediately tell it is AI. Maybe it's the intonation or the overly stable rate of speech?
Comment by Lapel2742 1 day ago
On the video itself: interesting, but "ideal" was pronounced wrong in German. For a promotional video, they should have checked that with native speakers. On the other hand, it's at least honest.
Comment by sosodev 1 day ago
I think ChatGPT has the most lifelike speech with their voice models. They seem to have invested heavily in that area while other labs focused elsewhere.
Comment by esafak 1 day ago
Maybe that's a good thing?
Comment by iFire 1 day ago
Weird; as someone who doesn't have a database of the web, I wouldn't be able to calculate either result.
Comment by littlestymaar 1 day ago
It would be better for most API usage, though: for a business, doing just a fraction of the job with 100% accuracy is often much preferable to claiming to do 100% when 20% of it is garbage.
Comment by kaoD 1 day ago
And that's how I know you're not an LLM!
Comment by littlestymaar 1 day ago
I don't think a model should know the answer, but it must be able to know that it doesn't know if you want to use it reliably.
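One cheap proxy for "knowing that it doesn't know" is gating answers on the model's own token confidence; a sketch against any OpenAI-compatible endpoint, with a made-up model id and threshold:

    # Hypothetical sketch: abstain when mean token confidence is low. This is
    # only a crude calibration proxy; the threshold and model id are made up.
    import math
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    def answer_or_abstain(question: str, threshold: float = 0.5) -> str:
        resp = client.chat.completions.create(
            model="qwen3-omni-flash",  # placeholder
            messages=[{"role": "user", "content": question}],
            logprobs=True,
            max_tokens=64,
        )
        toks = resp.choices[0].logprobs.content
        # Geometric-mean token probability as a rough confidence score.
        conf = math.exp(sum(t.logprob for t in toks) / len(toks))
        return resp.choices[0].message.content if conf >= threshold else "I don't know."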
Comment by parineum 1 day ago
OP provided a web link with the answer; aren't these models supposed to be trained on all of that data?
Comment by esafak 1 day ago
The model has a certain capacity -- quite limited in this case -- so there is an opportunity cost in learning one thing over another. That's why it is important to train on quality data; things you can build on top of.
Comment by BoorishBears 1 day ago
Not their fault frontier labs are letting their speech-to-speech offerings languish.
Comment by vessenes 1 day ago
No idea how to check if this is actually deployed on qwen.com right now.
Comment by zamadatix 1 day ago
Assuming you mean qwen.ai, when you run a query it should take you to chat.qwen.ai with the list of models in the top left. None of the options appear to be the -Omni variant (at least when anonymously accessing it).
Comment by vessenes 19 hours ago
It would be convincing if it said “I’m qwen-2025-12-whatever”. I agree it’s not dispositive if it refuses, or claims to be Llama 3, say. Generally, most models I talk to do not hallucinate future versions of themselves; in fact, it can be quite difficult to get them to use recent model designations, as they will often silently autocorrect to older models.