SubQ 1.1 Small

Posted by EDM115 18 hours ago


Comments

Comment by cmogni1 18 hours ago

I don’t understand why this lab is allergic to providing details on what they actually made, especially when Chinese labs are more than willing to share architectural specs/code/kernels (e.g. NSA/FSA, RAMBa, HISA, DSA LightningIndexer, etc.). I don’t doubt that they’ve done something here, but the lack of details makes me default to not trusting this, particularly when this is the second time that they’ve released a “technical report” that just waxes poetic about the concept.

Comment by yorwba 15 hours ago

They don't need to provide any details at all. They just need to give people access to their model and charge them for it. That they don't do that and instead pay for external evaluations indicates that they believe people would be unimpressed if they could access the model directly. The only purpose of this press release seems to be making investors give them more money.

Comment by giancarlostoro 13 hours ago

Could also be they don't have enough funding to sustain that many users, or even the infrastructure lined up.

Comment by yorwba 12 hours ago

If their model really is so much more efficient, they should be able to run it cheaply even on rented infrastructure. If somehow they got so many users that they can't even rent the infrastructure quickly enough and they have to make people wait in a queue, that would be even better advertising.

Comment by famouswaffles 17 hours ago

Business-wise, it would make sense to hold off on details till they're at least ready to serve. Look at what happened with OpenAI and reasoning models. Everyone struggled with getting RL to work with LLMs for a good while. OpenAI figured it out, and a few months later everyone had their prototypes out in short order. Don't forget who these labs employ. They're some of the brightest people around. Sub-q aren't really in a position for that lol. If they'd shared details at the first announcement, for instance, the big labs might have had something out by now while they're still pulling resources to scale, and then what?

Comment by cmogni1 15 hours ago

I don't think it makes sense from a business perspective to hold off on details as a new lab. OpenAI will not implement new architectural changes unless they've tested the changes themselves internally. Even if someone claims some great innovation, they'd need to do scaling experiments to somewhere between the size of GPT-4 to GPT-5 before they'd decide it is worth it to implement themselves. Plenty of mechanisms that seem to work at one scale do not translate to the next.

Because the cost to OpenAI to make an architectural shift is far greater than the cost to a new lab to try something different, providing details is usually a net benefit for recruiting, building trust, getting acquired, etc. The lack of details is a poor business decision because it makes them seem untrustworthy.

I'm not advocating that they should open source their model, but there is already so much noise in the space and many bad papers that being cagey is a poor strategy for winning over talent, developers, etc.

Comment by famouswaffles 14 hours ago

>OpenAI will not implement new architectural changes unless they've tested the changes themselves internally.

OpenAI validating it can still happen faster than they can get the compute to serve the models themselves[1]. It doesn't make a lot of sense to give out details if they want to be a serious contender or even as some have said, be acquired.

Yeah, there's noise, but if they have the real deal then it doesn't matter. The only thing they need to do is let people pay to use the models.

[1] I'm assuming this is the primary cause of the delay. That may not be the case of course.

Comment by kristjansson 12 hours ago

>A full breakdown of the mechanism and how it compares to FlashAttention, DeepSeek sparse attention, and recurrent architectures is in the Technical Report.

Oh, they did publish details, let's read the technical report!

> The mechanism by which SSA meets these requirements is outside the scope of this report

TFGs...

Comment by jmward01 16 hours ago

Well, I know this is possible because I have built things that work just like it is promising to do. The two key technologies needed are:

- guided window attn. Predict where to attend to but in a fixed window. If you do this to just the token/vocab you can keep effectively unlimited context and perfect recall. (yes, I can do that. There is a trick to teaching it how to predict position. This also immediately opens other crazy things like NN memory)

- efficient fixed state size models. So not a recurrent mechanism because that breaks training, parallelizable like transformers, but fixed sized state instead of unbounded attn. Pick a reasonable amount of state and it is amazingly good since it doesn't need to keep separating wheat from chaff in context (yes, it is possible to build this, I have. It works. This also opens up real streamed models. I have a true infinite context streamed model I toy with locally that I am getting to be audio/text in and audio/text out in real time.)

Put those together and you have O(1) token gen, infinite context and perfect recall. It is a whole new world of models. You can interact with a model until you have it at the state you want and then save its state and use that as if it were your system prompt. Batches pack perfectly so inference is massively more efficient. Training is massively more efficient. Transformer and unlimited attn models are a dead end. But how do you make money on this as an independent researcher? If I release the Two Weird Tricks this is all based on I get zip and the big players get even more tech for free. If I keep it all secret I get Zip and eventually the tricks will be figured out. (Yes a little frustration here) If anyone wants the model architecture of the future make me an offer :)

Comment by regularfry 16 hours ago

It's not quite true to say that if you release it you get nothing. If it's worthwhile and picked up by the open-weights labs, you get much bigger and better models implementing it than you would have had access to or been able to train otherwise, quicker than if they had to figure it out de novo.

Comment by jmward01 16 hours ago

Yeah. I am about to the point of just releasing it all. I love the tech. It does amazing things. But I want to move to the next big things I can see doing with it and building the custom ops to get it to work efficiently is a pain. I am positive others would run with it and make it all way better which would free me up to do more.

Comment by EDM115 2 hours ago

well if you ever release it, make sure to make a post so we can check it out !

Comment by in-silico 12 hours ago

Neither of these strikes me as particularly groundbreaking.

The first idea (as I understand it, retrieving token ids rather than hidden states) is going to really struggle to do useful compositional reasoning and contextual recall.

The second idea has been done a million times, with Linear Attention being maybe the first modern example. Hyena, state-space models, DeltaNet, and LaCT also lie in different regions of the performance-parallelizability spectrum of fixed-size models.
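
For readers unfamiliar with the fixed-size-state idea, here is a minimal sketch of causal linear attention in its streaming (recurrent) form, using the elu(x) + 1 feature map from the linear attention literature. It is only an illustration of that published technique, not of jmward01's undisclosed method or of SSA. The function names and the pure-Python list representation are assumptions made for readability.

```python
import math

def phi(x):
    # Positive feature map elu(x) + 1: x + 1 for x > 0, exp(x) otherwise.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention_stream(queries, keys, values):
    """Causal linear attention over a non-empty sequence.

    The state (S, z) has size d_k * d_v + d_k no matter how many
    tokens have been consumed, unlike a KV cache that grows per token.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) outer v
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        norm = sum(fq[i] * z[i] for i in range(d_k))
        outputs.append(
            [sum(fq[i] * S[i][j] for i in range(d_k)) / norm for j in range(d_v)]
        )
    return outputs
```

Each output is a weighted average of the values seen so far, so per-token cost and memory stay constant. The trade-off is that everything the model remembers has to fit in S and z.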

Comment by jmward01 14 hours ago

As a follow-up, I can see there is not a lot of belief, which is why it is also hard to find a company to partner with on this. So, how -do- you make money on something like this as an independent researcher? Maybe I release trick one, show how guided window attn (and nn memory and probably a lot of robotics) can be trained? Thoughts? I can do that pretty quickly. By itself that is a pretty great tech (combined with fixed windows of full attn it is pretty amazing). The second trick, I think, is a bit more powerful although both are general purpose. If I do this, do you think people will believe trick two (and all the real time multi-modal streaming stuff)?

Comment by yorwba 12 hours ago

Demonstrate results. If you can produce results that are somehow better than what already exists, it doesn't matter much what the actual trick is. If the way your results are better is difficult to explain without significant technical background knowledge, you might be limited to only a small pool of angel investors at first, but you only need to convince one to get funding for a better demo and intros to VCs with deeper pockets.

Comment by jmward01 5 hours ago

Yeah. That is the plan I think I have settled on. I'll release something interesting here shortly but the full architecture, including all the multimodal input/output streaming, is something I am considering my options on. I may even try to get to the 1-2b moderately well trained model stage and host it to show how transformative cached states are compared to cached tokens.

Comment by bratao 16 hours ago

I'm super curious about those "Two Weird Tricks". I'd like you to release more. It reminds me of MiniMax Sparse Attention https://arxiv.org/html/2606.13392v1

Comment by jmward01 14 hours ago

Yeah, looks like fun stuff. You still need to preserve the entire kv cache though right? So even if compute is drastically less, memory keeps growing. The system I described keeps memory constant (well, if you keep the entire token history you technically are gaining one long of data per token generated but I think we can agree that is negligible and could be capped at something high like 1B or so with no meaningful impact). I think I will probably release trick one and see if people then believe trick two even without seeing it.
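
To put rough numbers on the memory point, here is a back-of-the-envelope sketch comparing a growing KV cache with a bare token-id history at one 8-byte long per token. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 values) is a hypothetical configuration chosen for illustration. It does not come from the thread or from any specific model.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    # Factor of 2 covers keys and values; every layer stores both per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def token_history_bytes(tokens, bytes_per_id=8):
    # One 64-bit integer ("one long") per token id.
    return tokens * bytes_per_id

if __name__ == "__main__":
    for n in (128_000, 1_000_000, 12_000_000):
        kv_gib = kv_cache_bytes(n) / 2**30
        ids_mib = token_history_bytes(n) / 2**20
        print(f"{n:>10} tokens: KV cache {kv_gib:8.1f} GiB, id history {ids_mib:6.1f} MiB")
```

Under these assumed dimensions the KV cache costs 128 KiB per token, so it is about four orders of magnitude larger than the id history at any context length.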

Comment by eikenberry 15 hours ago

Isn't the classic way of making money off an invention to patent it... so why not patent those "Two Weird Tricks"?

Comment by giancarlostoro 13 hours ago

Expensive, and if someone figures out a slightly different way to do it you aren't really “covered”; it's not a unique umbrella, plus you would sort of give away the secrets.

Comment by supern0va 17 hours ago

You don't understand why the thing their entire company is valued upon is...not being given away freely? They literally are taking an open source model and then adapting it with this technique. If they disclose it, the frontier labs will immediately copy it and outperform them.

My guess is that they're angling for an acquisition.

Comment by GenerWork 16 hours ago

>My guess is that they're angling for an acquisition.

This is what I've thought was going to happen ever since they publicized their efforts. They probably don't have the money to train large models themselves, might as well get a nice chunk of change by being acquired by someone who already has said large models running.

Comment by giancarlostoro 16 hours ago

They probably don't have the money to run the model at reasonable scale.

Comment by cmogni1 15 hours ago

Ahh cf my comment above. The cost of failure at scale is too high for a major to just take a new architecture/mechanism and implement it, especially because a) most claims papers make aren't rigorously tested and b) plenty of things that work at one scale do not work at the scale on which the labs operate. If they want to get acquired, then they should show that they know what they're doing. Otherwise, it looks sketchy.

Comment by supern0va 15 hours ago

>The cost of failure at scale is too high for a major to just take a new architecture/mechanism and implement it,

Is it, though? This scrappy startup was able to take a large(-ish) open weights model and adapt it. Why can't the frontier labs do the same cost effectively?

>If they want to get acquired, then they should show that they know what they're doing.

I'm sure they would do so under an appropriate NDA as part of negotiations. I'm not sure why you think a full public disclosure is necessary.

Comment by cmogni1 13 hours ago

I don't mean to be shady, but there are plenty of details that they did release that show that they don't know what they're doing.

They make comparisons to FlashAttention-2 when FlashAttention-4 has been out (even if they wanted to stick to Hopper class GPUs for whatever reason there's still FlashAttention-3). The two orders of magnitude claim looks like it's for prefill, not next-token decoding, which is a bit duplicitous. Long context extrapolation experiments typically go well beyond 2x context length. Etc etc etc.

I never said they should have a full public disclosure, but I do think sharing something of substance helps build trust and also get people excited.

Lastly, frontier labs have other incentives than to eke out every dollar and cent. Having the most capable models, not the most cost effective, is of significantly higher priority as OpenAI and Anthropic march towards IPOs. The same is not necessarily true for Google/DeepMind, and one can see from their public releases alone for some of their open weight models that this may be more of a priority for them today.

Comment by giancarlostoro 18 hours ago

This one's interesting, and I think the next frontier for LLMs should really just be, how can we get something like Opus 4.6 to cost drastically less, for the same output? I say 4.6 because from 4.6 onwards it's been pretty darn good, at least for me, always feels like every model upgrade someone hates it, heck even 4.5 was fine.

Comment by robmccoll 18 hours ago

Yes - I want that and dramatically faster. Newer models don't seem to need any more or less guidance and iteration, so let's make the time-to-wrong-answer as short as possible.

Comment by giancarlostoro 17 hours ago

I'm not as crazy about speed as long as it's reasonably as "quick" as Opus. Which is faster than most developers can spit out code. I do get annoyed with Claude Code because it looks like it chooses to be as slow as possible, but maybe that's by design so it's not pounding their backend every millisecond? Would probably be bad.

Local inference is insanely fast on my M4 Pro MBP though, so I can understand where you're coming from, but I don't need it too much faster. I still need time to review, test, review and provide feedback to the model. Fast is okay I guess for true vibe coding.

Comment by robmccoll 17 hours ago

I just don't want to have to have a pipeline going in order to fully occupy my time. I don't want to wait on the model to review the prompt, read the parts of the codebase indicated, do its own research in the codebase and documentation, plan, run agents ... actually write the code and NOW I can start reading it and reviewing it. That means I either need to run a lot of operations in parallel so that I always have something to do and the agent(s) are highly utilized or I'm writing something on my own that keeps getting interrupted. It's the constant context switching that kills me. I want to work on one problem at a time and really focus on it - even if I'm not writing every line myself.

Comment by mritchie712 15 hours ago

I agree on opus 4.5-4.8, but Fable 5 was a noticeable upgrade.

Comment by mstkllah 15 hours ago

Did not feel like an upgrade to me at all, felt way slower at the same quality level as 4.8 to me.

Comment by NetOpWibby 3 hours ago

Man I miss it

Comment by wxw 17 hours ago

> SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length.

> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.

Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
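
A rough illustration of why the gap between quadratic and linear attention widens with context length. This sketch assumes each query attends to a fixed budget of k keys. That budget is hypothetical: the announcement does not say how SSA selects keys, and the sketch makes no attempt to reproduce the quoted 64.5x or 56x figures.

```python
def dense_pairs(n):
    # Every query scores every key (causal masking ignored for simplicity).
    return n * n

def sparse_pairs(n, k):
    # Hypothetical fixed budget: each query scores at most k keys.
    return n * min(k, n)

def reduction(n, k):
    # How many times fewer query-key pairs the sparse variant scores.
    return dense_pairs(n) / sparse_pairs(n, k)

if __name__ == "__main__":
    for n in (128_000, 1_000_000, 12_000_000):
        print(f"n={n:>10}: reduction factor {reduction(n, k=1000):.0f}x")
```

With a fixed budget the reduction factor is n / k, so it grows in proportion to context length. Real speedups also depend on the cost of choosing which keys to attend to, which this sketch leaves out.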

Comment by bthornbury 12 hours ago

we need some better standard long-context benchmarks.

needle in a haystack is not good for this, yes it proves the model can attend to its context, but in its usual form, somewhat trivializes the query-key relationship.

something like long-form Q&A would be more ideal. Like reading a book and answering questions that require synthesizing information derived from either the whole thing or disparate portions of it. Like describing an entire character arc in a 1000 page novel with examples and evidential moments.

Comment by mark_l_watson 6 hours ago

Interesting idea but until I get my grubby little fingers in it, to try it - difficult to have an opinion.

I am hopefully expectant that we will see all sorts of optimizations in the next few years that will enable even more local model use and slash commercial API costs. I get excited by the results when I enjoy one or two short coding sessions a week with Claude Opus but it is even more exciting to get a major task done and see that I only used $0.05 for DeepSeek v4 Flash or perhaps $0.15 for DeepSeek v4 Pro. It was exciting in even a different way when I two shotted a complete TypeScript/Tauri app using gemma-12b-qat with little-coder on a cheap laptop a few days ago.

Comment by samber 17 hours ago

According to Subquadratic, Needle in a Haystack is strong up to 12m tokens, but RULER has not been tested above 128k tokens ??

Comment by satyarohith 17 hours ago

It's been all talk and no action ever since their first announcement.

Comment by kristjansson 12 hours ago

It's easy[1] to promise, it's hard to deliver. I hope the best for them.

[1]: https://magic.dev/blog/ltm-1 (note the date)

Comment by embedding-shape 17 hours ago

> SubQ 1.1 Small scores near-perfect at 1M, 2M, 6M, and 12M tokens. The model was trained predominantly at 1M tokens yet the retrieval held near perfectly at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.

If the results persist from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true.

With back of the napkin math from inside my head, that'd be like 0.5 to 1 million LOC, depending on language/code density, could just fold the entire codebase into one prompt if it's a small one, that'd be neat :)
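
The napkin math can be checked with a small sketch. The only inputs are the 12M-token context and the 0.5 to 1 million LOC range from the comment. The tokens-per-line values that fall out are implied by that range, not measured, and actual density varies by language and tokenizer.

```python
def tokens_per_line(context_tokens, lines_of_code):
    # Average tokens per line implied by fitting this many lines in the context.
    return context_tokens / lines_of_code

def lines_that_fit(context_tokens, tokens_per_loc):
    # Whole lines of code that fit at an assumed average token density.
    return int(context_tokens // tokens_per_loc)

if __name__ == "__main__":
    context = 12_000_000
    for loc in (500_000, 1_000_000):
        print(f"{loc:>9} LOC implies {tokens_per_line(context, loc):.0f} tokens per line")
```

So the estimate corresponds to assuming roughly 12 to 24 tokens per line of code.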

Comment by monster_truck 17 hours ago

It likely falls off very steeply after that. 8 to 1 (which I am assuming based on the 0.13% figure) is a pretty common ratio for sparse matrix stuff.

Comment by EDM115 18 hours ago

Comment by chrsw 17 hours ago

There was, let's say, significant skepticism the last time they announced something. What's changed?

Comment by supern0va 17 hours ago

I have no idea if the evaluator themselves is trustworthy, but it was supposedly independently evaluated by Appen: https://www.appen.com/whitepapers/benchmarking-subquadratics...

Comment by samber 16 hours ago

Comparing compute cost versus FlashAttention-2 is not very honest to me.

FlashAttention-2 hasn't been used for at least 2 years.

This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.

Comment by Depurator 17 hours ago

What kind of hardware would be needed to serve an instance with the full 12m context? And what kind of speeds can one expect at those extremes at 10m+?

Comment by dundunUp 5 hours ago

What is this?

Comment by aesthesia 18 hours ago

Disappointing they don't actually say how their sparse attention mechanism works.

Comment by maz1b 17 hours ago

They've done multiple "evaluations" by third parties, but still, it seems that they aren't being fully transparent. I think the approach is quite interesting and novel, but this feels like deja vu.

I get why they aren't disclosing all the details, but it seems more hype-train-esque to me for this moment. I don't disagree that this could be big.

Comment by ballon_monkey 13 hours ago

It's funny that some people on HN think this whole thing is legit. The company is started by a bunch of nobodies with 0 experience in AI in general, let alone ML/Data.

Edit: Typical HN "I can downvote but I cannot dispute facts"

Comment by phantasmat 13 hours ago

[flagged]