SubQ 1.1 Small
Posted by EDM115 18 hours ago
Comments
Comment by cmogni1 18 hours ago
Comment by yorwba 15 hours ago
Comment by giancarlostoro 13 hours ago
Comment by yorwba 12 hours ago
Comment by famouswaffles 17 hours ago
Comment by cmogni1 15 hours ago
Because the cost to OpenAI to make an architectural shift is far greater than the cost to a new lab to try something different, providing details is usually a net benefit for recruiting, building trust, getting acquired, etc. The lack of details is a poor business decision because it makes them seem untrustworthy.
I'm not advocating that they should open source their model, but there is already so much noise in the space and many bad papers that being cagey is a poor strategy for winning over talent, developers, etc.
Comment by famouswaffles 14 hours ago
OpenAI validating it can still happen faster than they can get the compute to serve the models themselves[1]. It doesn't make a lot of sense to give out details if they want to be a serious contender or even, as some have said, be acquired.
Yeah, there's noise, but if they have the real deal then it doesn't matter. The only thing they need to do is let people pay to use the models.
[1] I'm assuming this is the primary cause of the delay. That may not be the case of course.
Comment by kristjansson 12 hours ago
Oh, they did publish details, let's read the technical report!
> The mechanism by which SSA meets these requirements is outside the scope of this report
TFGs...
Comment by jmward01 16 hours ago
- guided window attn. Predict where to attend to but in a fixed window. If you do this to just the token/vocab you can keep effectively unlimited context and perfect recall. (yes, I can do that. There is a trick to teaching it how to predict position. This also immediately opens other crazy things like NN memory)
- efficient fixed state size models. So not a recurrent mechanism because that breaks training, parallelizable like transformers, but fixed sized state instead of unbounded attn. Pick a reasonable amount of state and it is amazingly good since it doesn't need to keep separating wheat from chaff in context (yes, it is possible to build this, I have. It works. This also opens up real streamed models. I have a true infinite context streamed model I toy with locally that I am getting to be audio/text in and audio/text out in real time.)
Put those together and you have O(1) token gen, infinite context and perfect recall. It is a whole new world of models. You can interact with a model until you have it at the state you want and then save its state and use that as if it were your system prompt. Batches pack perfectly so inference is massively more efficient. Training is massively more efficient. Transformer and unlimited attn models are a dead end. But how do you make money on this as an independent researcher? If I release the Two Weird Tricks this is all based on I get zip and the big players get even more tech for free. If I keep it all secret I get Zip and eventually the tricks will be figured out. (Yes a little frustration here) If anyone wants the model architecture of the future make me an offer :)
Comment by regularfry 16 hours ago
Comment by jmward01 16 hours ago
Comment by EDM115 2 hours ago
Comment by in-silico 12 hours ago
The first idea (as I understand it, retrieving token ids rather than hidden states) is going to really struggle to do useful compositional reasoning and contextual recall.
The second idea has been done a million times, with Linear Attention being maybe the first modern example. Hyena, state-space models, DeltaNet, and LaCT also lie in different regions of the performance-parallelizability spectrum of fixed-size models.
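For anyone who hasn't seen the fixed-size-state family, here is a minimal sketch of causal linear attention in its recurrent form, in plain Python. To be clear, this is not jmward01's architecture (which isn't disclosed) and not any particular paper's exact formulation. The feature map `phi` (elu + 1) and the function names are assumptions picked for illustration.

```python
import math

def phi(x):
    # Positive feature map, elu(x) + 1. An assumed (but common) choice.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Causal linear attention, processed one token at a time.

    The state is a d_k x d_v matrix S plus a d_k vector z, so memory
    stays constant no matter how long the sequence gets.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) outer v
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        # Normalizer is positive because phi is strictly positive.
        norm = sum(fq[i] * z[i] for i in range(d_k))
        out = [sum(fq[i] * S[i][j] for i in range(d_k)) / norm
               for j in range(d_v)]
        outputs.append(out)
    return outputs
```

Each step costs O(d_k * d_v) regardless of position, which is the "fixed state instead of unbounded attn" trade. The price is that everything the model remembers has to be squeezed into S and z.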
Comment by jmward01 14 hours ago
Comment by yorwba 12 hours ago
Comment by jmward01 5 hours ago
Comment by bratao 16 hours ago
Comment by jmward01 14 hours ago
Comment by eikenberry 15 hours ago
Comment by giancarlostoro 13 hours ago
Comment by supern0va 17 hours ago
My guess is that they're angling for an acquisition.
Comment by GenerWork 16 hours ago
This is what I've thought was going to happen ever since they publicized their efforts. They probably don't have the money to train large models themselves, might as well get a nice chunk of change by being acquired by someone who already has said large models running.
Comment by giancarlostoro 16 hours ago
Comment by cmogni1 15 hours ago
Comment by supern0va 15 hours ago
Is it, though? This scrappy startup was able to take a large(-ish) open weights model and adapt it. Why can't the frontier labs do the same cost effectively?
>If they want to get acquired, then they should show that they know what they're doing.
I'm sure they would do so under an appropriate NDA as part of negotiations. I'm not sure why you think a full public disclosure is necessary.
Comment by cmogni1 13 hours ago
They make comparisons to FlashAttention-2 when FlashAttention-4 has been out (even if they wanted to stick to Hopper-class GPUs for whatever reason, there's still FlashAttention-3). The two orders of magnitude claim looks like it's for prefill, not next-token decoding, which is a bit duplicitous. Long context extrapolation experiments typically go well beyond 2x context length. Etc etc etc.
I never said they should have a full public disclosure, but I do think sharing something of substance helps build trust and also get people excited.
Lastly, frontier labs have other incentives than to eke out every dollar and cent. Having the most capable models, not the most cost effective, is of significantly higher priority as OpenAI and Anthropic march towards IPOs. The same is not necessarily true for Google/DeepMind, and one can see from their public releases alone for some of their open weight models that this may be more of a priority for them today.
Comment by giancarlostoro 18 hours ago
Comment by robmccoll 18 hours ago
Comment by giancarlostoro 17 hours ago
Local inference is insanely fast on my M4 Pro MBP though, so I can understand where you're coming from, but I don't need it too much faster. I still need time to review, test, review and provide feedback to the model. Fast is okay I guess for true vibe coding.
Comment by robmccoll 17 hours ago
Comment by mritchie712 15 hours ago
Comment by mstkllah 15 hours ago
Comment by NetOpWibby 3 hours ago
Comment by wxw 17 hours ago
> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
Comment by bthornbury 12 hours ago
Needle in a haystack is not good for this. Yes, it proves the model can attend to its context, but in its usual form it somewhat trivializes the query-key relationship.
Something like long-form Q&A would be more ideal. Like reading a book and answering questions that require synthesizing information derived from either the whole thing or disparate portions of it. Like describing an entire character arc in a 1000 page novel with examples and evidential moments.
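To make the "trivializes the query-key relationship" point concrete, here is a rough sketch of how a typical needle-in-a-haystack prompt gets built. The function name, filler sentences, and needle are all made up for illustration and are not taken from any specific benchmark.

```python
import random

def build_niah_prompt(filler_sentences, needle, question, depth,
                      n_sentences=200, seed=0):
    """Bury one 'needle' sentence in filler text at a relative depth in [0, 1]."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    position = int(depth * len(haystack))
    haystack.insert(position, needle)
    return " ".join(haystack) + "\n\nQuestion: " + question

# Hypothetical example data.
filler = ["The grass is green.", "The sky is blue.", "The sun is yellow."]
needle = "The secret passcode for the vault is 7412."
question = "What is the secret passcode for the vault?"
prompt = build_niah_prompt(filler, needle, question, depth=0.5)
```

Notice the question shares nearly all of its words with the needle, so the model only has to find one near-exact match. Nothing here requires combining facts from different parts of the context, which is what the long-form Q&A style would test.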
Comment by mark_l_watson 6 hours ago
I am hopefully expectant that we will see all sorts of optimizations in the next few years that will enable even more local model use and slash commercial API costs. I get excited by the results when I enjoy one or two short coding sessions a week with Claude Opus but it is even more exciting to get a major task done and see that I only used $0.05 for DeepSeek v4 Flash or perhaps $0.15 for DeepSeek v4 Pro. It was exciting in even a different way when I two shotted a complete TypeScript/Tauri app using gemma-12b-qat with little-coder on a cheap laptop a few days ago.
Comment by samber 17 hours ago
Comment by satyarohith 17 hours ago
Comment by kristjansson 12 hours ago
[1]: https://magic.dev/blog/ltm-1 (note the date)
Comment by embedding-shape 17 hours ago
If the results persist from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true.
With back-of-the-napkin math from inside my head, that'd be something like 0.5 to 1 million LOC, depending on language/code density. You could just fold the entire codebase into one prompt if it's a small one, that'd be neat :)
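The napkin math can be written out as below, assuming the estimate refers to the 12M-token figure. The tokens-per-line densities (12 and 24) are assumptions chosen to bracket the range, not measured values; real numbers depend on the tokenizer and the language.

```python
def estimate_loc(context_tokens, tokens_per_line):
    """Roughly how many lines of code fit in a context window."""
    return context_tokens // tokens_per_line

CONTEXT_TOKENS = 12_000_000  # the 12M figure mentioned above

# Assumed densities: ~12 tokens/line for terse code, ~24 for verbose code.
for tokens_per_line in (12, 24):
    loc = estimate_loc(CONTEXT_TOKENS, tokens_per_line)
    print(f"{tokens_per_line} tokens/line -> about {loc:,} LOC")
```

Under those assumptions the window holds between 500,000 and 1,000,000 lines, which lines up with the 0.5 to 1 million LOC guess.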
Comment by monster_truck 17 hours ago
Comment by EDM115 18 hours ago
Comment by chrsw 17 hours ago
Comment by supern0va 17 hours ago
Comment by samber 16 hours ago
FlashAttention-2 hasn't been used for at least 2 years.
This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.
Comment by Depurator 17 hours ago
Comment by dundunUp 5 hours ago
Comment by aesthesia 18 hours ago
Comment by maz1b 17 hours ago
I get why they aren't disclosing all the details, but it seems more hype-train-esque to me for this moment. I don't disagree that this could be big.
Comment by ballon_monkey 13 hours ago
Edit: Typical HN "I can downvote but I cannot dispute facts"
Comment by phantasmat 13 hours ago