Cohere's First Model for Developers

Posted by hmokiguess 5 days ago

Comments

Comment by amunozo 1 day ago

Are these models trained from scratch, or do they necessarily need distillation from bigger models to be competitive? Usually a small model like this belongs to a family that also has a bigger model. If it was trained from scratch, does anybody know how the economics of training this 30B-A3B model compare with training models the size of DeepSeek V4 Pro or Flash (1.6T and 200-something B total parameters, with fewer activated)?

Comment by namr2000 16 hours ago

You don't have to train from scratch but you can. Distillation ends up being somewhere in the ballpark of 1000x faster to train [1]. It also comes with the huge advantage of not needing to create RLHF datasets, since you can just copy the behavior of the teacher model. This saves an enormous amount of labeling money at the cost of making the model behave similarly to the teacher. If you are training from scratch, you can look at LLM scaling laws to figure out roughly the compute budget you need to optimally train a model [2].

Based on [2], a 30B model needs something like 2e+23 FLOPs to train from scratch, whereas a 1.6T model needs something like 1e+27 FLOPs. So DeepSeek V4 Pro was roughly 5000x more expensive to train than this model. I'm not totally sure how MoE affects scaling laws, so these numbers might be different in reality, but it gives you a good ballpark estimate of the difference in training scale.

[1] https://arxiv.org/abs/2505.12781
[2] https://arxiv.org/abs/2203.15556
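
For anyone who wants to redo the arithmetic, here is a minimal sketch of the rule of thumb commonly derived from [2]. It assumes training compute C ≈ 6·N·D FLOPs and a compute-optimal budget of about 20 tokens per parameter; both constants are approximations, and the function names are made up for illustration. With these constants the results land within a small factor of the figures quoted above rather than matching them exactly. Using total parameter count for an MoE model likely overstates the compute, since the FLOPs spent per token scale with the activated parameters.

```python
# Rule-of-thumb training compute (assumed constants, not exact values):
#   C ≈ 6 * N * D FLOPs, with D ≈ 20 tokens per parameter.
TOKENS_PER_PARAM = 20        # assumed compute-optimal ratio
FLOPS_PER_PARAM_TOKEN = 6    # forward + backward pass approximation


def optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float) -> float:
    """Approximate training FLOPs for a dense model with n_params parameters."""
    return FLOPS_PER_PARAM_TOKEN * n_params * optimal_tokens(n_params)


small = training_flops(30e9)     # 30B parameters
large = training_flops(1.6e12)   # 1.6T parameters
print(f"30B:  {small:.2e} FLOPs")          # 1.08e+23
print(f"1.6T: {large:.2e} FLOPs")          # 3.07e+26
print(f"ratio: {large / small:.0f}x")      # 2844x
```

Because D is proportional to N here, the cost ratio is just the square of the parameter ratio, which is why the gap between the two models is in the thousands.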

Comment by amunozo 5 hours ago

Thank you for taking the time, this is a very useful and complete answer.

Comment by matt_daemon 1 day ago

> Hardware (minimum): 1× H100 @ FP8

Cool to see this but seems like it would be pretty expensive to run

Comment by anon373839 1 day ago

This is a 30B parameter model with 3B active. It should run performantly on a Mac with > 48GB RAM at 8bit precision.
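
A rough sketch of the memory arithmetic behind that claim, assuming weights dominate the footprint. It counts weights only; KV cache, activations, and runtime overhead come on top, and real 4-bit quantization formats often use a bit more than 4 bits per weight. The function name is made up for illustration.

```python
BITS_PER_BYTE = 8


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / BITS_PER_BYTE / 1e9


# All 30B parameters must be resident, even though only ~3B are used per token.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(30e9, bits):.0f} GB")
# 16-bit: 60 GB
# 8-bit: 30 GB
# 4-bit: 15 GB
```

The point is that an MoE model's memory requirement follows its total parameter count, while its per-token compute follows the activated count.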

Comment by ltononro 1 day ago

Well, that is like 3 USD/hour if you run it on a rented GPU.

Comment by yencabulator 17 hours ago

4-bit quantized 30B-A3B MoE models can run at something like 21 tokens/sec on a several year old AMD CPU.

Comment by montroser 20 hours ago

Well, this is certainly not benchmaxxed, I'll give it that. And props for being honest about how far behind Qwen 3.6 MoE this model is.

But yeah, it's not the best look to have to stretch and say it's "competitive" with other models in its weight class, when it offers not much else that's useful or novel.

Comment by moojacob 1 day ago

I was a fan of Cohere's general-purpose LLM. Command A, I think? Before they came out with their reasoning model.

More competition is better.

Comment by SubiculumCode 1 day ago

I always forget the VRAM requirements on these MoE things

Comment by sipjca 1 day ago

fwiw, because of the relatively few activated params, offloading to system RAM is quite feasible. You can see an endless number of people doing this on r/localllama with Qwen 3.6 35B-A3B.
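
A back-of-the-envelope sketch of why few activated parameters make system-RAM offloading workable. It assumes token generation is memory-bandwidth bound and that each generated token reads every activated weight once; it ignores the KV cache, attention cost, and any GPU help. The 40 GB/s bandwidth is a hypothetical figure for dual-channel desktop DDR4, and the function name is made up for illustration.

```python
def decode_tokens_per_sec_upper_bound(
    active_params: float,
    bits_per_weight: float,
    mem_bandwidth_gb_s: float,
) -> float:
    """Upper bound on tokens/sec if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token


RAM_BANDWIDTH_GB_S = 40  # assumed dual-channel DDR4; check your own hardware

# MoE with ~3B activated parameters at 4-bit vs. a dense 30B model at 4-bit.
moe = decode_tokens_per_sec_upper_bound(3e9, 4, RAM_BANDWIDTH_GB_S)
dense = decode_tokens_per_sec_upper_bound(30e9, 4, RAM_BANDWIDTH_GB_S)
print(f"MoE (3B active): {moe:.1f} tok/s")    # 26.7 tok/s
print(f"Dense 30B:       {dense:.1f} tok/s")  # 2.7 tok/s
```

Under these assumptions the MoE ceiling is about ten times that of a dense model of the same total size, which is in the same ballpark as the CPU speeds people report in this thread.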

Comment by bitwize 1 day ago

I ran Gemma4 26B A4B on an 8yo PC with a fucking GTX and it did rather well.

Comment by doodlesdev 1 day ago

Well, that's pretty impressive. Care to share your setup to do that? How much DDR3/DDR4 do you have, too?

Comment by bitwize 14 hours ago

I... downloaded a 4-bit quantized GGUF of the model, used llama.cpp to run it, and pointed OpenCode at that. My machine is an 8-core Gen1 Ryzen 7, 32 GiB of DDR4, (I think) 4 GiB of VRAM on the graphics.

Comment by tonyrice 1 day ago

I'm excited to see more OSS models

Comment by AbuAssar 22 hours ago

strange, I already submitted the same url 6 days ago:

https://news.ycombinator.com/item?id=48475095

Comment by mkl 22 hours ago

What's strange? Yours got no comments, so another attempt seems okay. It's pretty random what gets to the front page when.

Comment by zuzululu 1 day ago

Wasn't aware that Cohere was still around but this release doesn't exactly instill confidence.

Comment by greyb 1 day ago

>Wasn't aware that Cohere was still around but this release doesn't exactly instill confidence.

It's being kept alive because the Canadian government is desperate to have a local frontier lab and is willing to inject funding and force its adoption in government services. But leadership at Cohere is known to be weak in Canadian tech circles, and they are pivoting to an enterprise-first market around production RAG rather than anything close to frontier work.

I'm glad they're doing open weight releases but they're not viable in the long-run. It is embarrassing sharing similar spaces with them, but I'll try this release out in OpenCode and re-think afterwards.

Comment by daijj 18 hours ago

Mulling over applying there to work. Hearing a bunch of mixed reviews where some people also complain about leadership but the day to day seems to be quite good. Any reason big US investors haven't put any money into it? (besides the fact that it's Canadian?)

Comment by zuzululu 16 hours ago

why would anybody put money in Cohere when they can put it in an American AI company with a larger payoff?

Comment by suddenlybananas 1 day ago

It's embarrassing? Awfully harsh!

Comment by moralestapia 23 hours ago

It really is. I’m very familiar with that as well.

It’s truly embarrassing how much hand-holding those guys have received from angels, investors, the government, etc. To the point where the same investors they’re going to pitch to are preparing their slides, telling them what to say during the presentation, and then approving them for even more funding afterward, lol.

That government part is corruption and illegal, by the way.

Actual usage on many of their APIs/models is painfully low, like in ... hundreds of DAUs. I don't blame them for this, but this is a "company" that should have died 2 years ago.

Comment by chartpath 21 hours ago

Sure, and OpenAI, Anthropic, Grokkk are totally self-made and profitable.

Comment by osti 20 hours ago

Well, the tech at those companies is at least on a whole different level than whatever Cohere is doing.

Comment by zuzululu 18 hours ago

this is just one of the many weird things I hear out of Canada that always surprise me. I've compiled a list:

- the wife of a professor I knew in Canada apparently makes 400k/year at some Aboriginal art gallery that gets like two visits a year. They kicked small businesses out of that building so they could have 6000 sq ft for an art gallery that sits empty with the weirdest "art" that nobody has heard of.

- a Canadian coworker said that around 2020 there were like 3 developers who charged the Canadian government $70 million for some Flutter app that had ONE screen to check in and out of places due to quarantine, and it didn't even work.

- ten million here and there to raise diversity and LGBTQ awareness in African countries that don't even have running water or electricity, and other eyebrow-raising spending of tax money

- a founder raised money for a SaaS but was shut down by the Canadian government after not being issued a license. The exact same SaaS was then funded, and the person running it had political connections.

A country with 8 times smaller population and 15 times smaller economy than USA somehow has 7 times more tax employees. It's unclear what the end game is for Canada.

Comment by moralestapia 9 hours ago

>[...] charged the Canadian government $70 million for some flutter app [...]

I met one of these guys. Some parts of Canada are massively corrupt. If you're in the inner circle, the amount of things you get for free is unimaginable. If you're not, then you get the privilege of a 50%+ tax bill.

Comment by zuzululu 6 hours ago

also I find the salary is atrociously low in Canada for the exact same role

Comment by dismalaf 17 hours ago

> It's unclear what the end game is for Canada

The Liberals can never lose an election because the number of people who rely on them for handouts exceeds the number who work in private enterprise and don't get handouts.

It's the Argentinian Peronist strategy...

Comment by N_Lens 1 day ago

It's easy to be critical.

Comment by kadoban 1 day ago

Really? Why not? From the benchmarks at least, it's a pretty decent small model.

Comment by redwood 1 day ago

Aren't they focused on embeddings and strong there?

Comment by cyanydeez 3 days ago

looks like it's just qwen 3.6 coder.

Comment by lumost 1 day ago

It's worse at code compared to qwen 3.6 coder.

Comment by stymaar 1 day ago

How can it be worse than something that doesn't exist?

Comment by amunozo 1 day ago

Sometimes not existing is better than existing, for unnecessary or harmful things. I know that is not what you mean, but I found it relevant in an age in which making new stuff is so fast and easy due to LLMs. The main enshittification will come, imo, not from bad things but from unnecessary things that nobody asked for.

Comment by SubiculumCode 1 day ago

Do you mean it's based on qwen 3.6 coder?

Comment by daemonologist 1 day ago

There is no "coder" version of Qwen 3.6; I think they just mean it's a coding-focused model of similar size and performance (to Qwen 3.6 35B-A3B).

Regular Qwen 3.6 benchmarks slightly better and has much wider software support though, so this is probably of interest only to organizations which disallow models trained in China.

Comment by kadoban 1 day ago

I mean, Qwen 3.6 kicks ass. I don't know who these people are, but if their first outing is "not quite as good as Qwen 3.6", that's not a bad start by any means.

30B vs 35B isn't nothing either.

If it ends up just being some tweaks to someone else's weights, then meh.

Comment by mtone 1 day ago

It was trained from scratch by Cohere. They're the only Canadian AI lab - I'm glad they're releasing open weights and I wish them luck catching up!