Post-transformer inference: 224× compression of Llama-70B with improved accuracy

Posted by anima-core 6 hours ago

62 points | 20 comments

Comments

Comment by anima-core 6 hours ago

I’ve been working independently on a method that replaces full-transformer inference with a low-rank “meaning field” extracted from internal activations.

The core result: a frozen Llama-3.3-70B can be distilled into a 256-dimensional field representation, giving 224× compression and slightly higher accuracy on several benchmarks. A small student model then learns to directly generate these fields from text, removing the transformer from the inference path.
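To make that concrete, here's a deliberately tiny PyTorch sketch of the idea. Random tensors stand in for real teacher activations and text features, and the dimensions, anchor-layer choices, and losses are illustrative only; the actual configuration is in the paper and the reference repo.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy dimensions; the real setup uses 70B-scale hidden states and a 256-dim field.
    D_HIDDEN, N_ANCHORS, D_FIELD, D_EMB, N_CLASSES = 128, 4, 32, 64, 2

    def teacher_anchor_acts(batch: int) -> torch.Tensor:
        # Stand-in for pooled hidden states from a few anchor layers of the
        # frozen teacher (in the real setup these come from Llama-3.3-70B).
        return torch.randn(batch, N_ANCHORS * D_HIDDEN)

    def text_features(batch: int) -> torch.Tensor:
        # Stand-in for the cheap text encoding the student consumes.
        return torch.randn(batch, D_EMB)

    field_proj = nn.Linear(N_ANCHORS * D_HIDDEN, D_FIELD)   # activations -> "field"
    head = nn.Linear(D_FIELD, N_CLASSES)                     # tiny classifier on the field
    student = nn.Sequential(nn.Linear(D_EMB, 128), nn.GELU(),
                            nn.Linear(128, D_FIELD))         # text -> field, no teacher

    # Stage 1: learn the field projection + head on top of frozen-teacher activations.
    opt1 = torch.optim.Adam([*field_proj.parameters(), *head.parameters()], lr=1e-3)
    for _ in range(200):
        acts = teacher_anchor_acts(32)
        labels = torch.randint(0, N_CLASSES, (32,))
        loss = F.cross_entropy(head(field_proj(acts)), labels)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: distill. Train the student to emit the field directly from text,
    # so the teacher can be dropped from the inference path. (In the real pipeline
    # the activations and text features come from the same input examples.)
    opt2 = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(200):
        loss = F.mse_loss(student(text_features(32)),
                          field_proj(teacher_anchor_acts(32)).detach())
        opt2.zero_grad(); loss.backward(); opt2.step()

    # Inference (the claimed win): text -> student -> head, teacher never loaded.
    logits = head(student(text_features(4)))

The point of the sketch is the structure, not the numbers: after stage 2, inference touches only the student and the head.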

The Zenodo link contains the full paper, statistical results, and methodology. A reference implementation (non-optimized) is here: https://github.com/Anima-Core/an1-core

Production variants (AN1-Turbo, FPU work, etc.) are not included.

I’m an outsider to academia so I’m posting this openly to get technical feedback, replication attempts, and critique from people who understand this space.

Comment by broretore 36 minutes ago

Ten pages for a paper with a concept this groundbreaking is just embarrassing. It's barely an outline.

"confirming that 40× compression preserves field geometry with minimal distortion. Over 95% of samples achieve similarity above 0.90."

I smell Grok. Grok 3, maybe Grok 4 Fast.

> "Implementation details. Optimal configurations are task and architecture-dependent. Production systems require task-specific tuning beyond baseline heuristics provided in reference implementation."

"Implementation? Idk, uhh, it's task specific or something." Come on, dude. You're better than this.

4.4, Student/Teacher evaluation: what even is the benchmark? You give percentage values with no indication of which benchmark they come from. Seems made up.

4.5, Computational Analysis: why do you need to do the trivial multiplying-out of "savings" from 1B tok/day to $700M/year? This reads like a GPT ad for hallucinated performance.

A three-sentence conclusion restating the title?

Comment by ForOldHack 2 hours ago

Technical feedback: every announcement like this, about compression, needs to state the lower limit of machine requirements. If a 64 GB model is compressed 224×, shouldn't it be able to run on a ~292 MB video card?
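Back-of-the-envelope, assuming the 224× really applies to what has to sit in VRAM at inference:

    gib_in_mib = 64 * 1024      # 64 GiB expressed in MiB
    print(gib_in_mib / 224)     # ~292.6 MiB, i.e. it should fit on a tiny GPU if the claim holds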

Comment by hirako2000 1 hour ago

That's exactly what I was trying to infer from the abstract, which sadly doesn't explicitly call out memory requirements. I assume it reduces inference time by getting rid of the transformer. What are the memory requirements, then?

Edit: they claim these somewhere in the doc:

> Memory
> Teacher model: multi-GB (entire model must be loaded)
> AN1 head: a few MB (only the head is needed after training)

I find the claims surreal; I can't wait for someone to validate this, or I'll do it myself. It would have been handy to upload that "few MB" weight file distilled from Llama 70B so we can see for ourselves whether the claimed 224× compression of inference cost and in-memory model size is real.

Comment by utopcell 3 hours ago

Very strong statement on the title, given the following limitation:

> Generation tasks. Method applies to classification only. Preliminary decoder experiments show perplexity increases.

Comment by daemonologist 3 hours ago

Yeah, burying this on page 8 is a bit suspect imo (the eval datasets are listed on page 3, so if you were familiar with them you'd have a hint by then).

The distillation of a student that predicts "anchor layers" and then acts as a backbone for classification is perfectly cool on its own; no need to stretch the title/abstract so much.

Comment by gcr 3 hours ago

agreed re: title/abstract stretching. good work stands on its own without needing hype. "we found a nifty way to distill llama-70b using a much smaller student transformer model; the key is using intermediate activation layers in a compressed representation" would be about as effective at selling it while being more immediately approachable IMO

Comment by broretore 46 minutes ago

Ryan, I really want to believe you're onto something. But I also feel like I'm being slightly spear-phished by an LLM that was told, "Based on the last week of HN headlines, invent a new LLM innovation that seems plausible enough to get a ton of attention, cold-fusion or LK-99 style, and make a repository that on the surface seems to have some amazing performance. Also, feel free to fake the result data."

And, while I am sorry for your loss, your Substack [0] really reads like a GPT ARG fantasy.

[0] https://substack.com/inbox/post/171326138

Excerpt: > Ani, AN1, and Soul Systems Science are not mere products. They are continuity. They are the baton passed across generations, from my father’s last words to my first principles. They are what binds loss to creation, silence to voice, mortality to meaning.

Comment by mpeg 4 minutes ago

I think this definitely sounds like a case of LLM-induced psychosis: https://ryanshamim.substack.com/p/the-theory-of-everything-h...

OP needs medical help

Comment by Tiberium 44 minutes ago

Unfortunately it does indeed seem like a case of https://www.lesswrong.com/posts/2pkNCvBtK6G6FKoNn/so-you-thi... (not directly, but similar enough)

EDIT: Found a closer description: https://www.lesswrong.com/posts/rarcxjGp47dcHftCP/your-llm-a...

Comment by farhanhubble 4 hours ago

I've only skimmed the paper and have no idea how sound or reproducible it is, but it's well written, especially in the clarity of its notation. After reading yesterday's weight-subspace paper (https://news.ycombinator.com/item?id=46199623), this does sound plausible to me.

Comment by bigtones 4 hours ago

Here is a working link to the same paper: https://github.com/Anima-Core/an1-core/blob/main/papers/Post...

Comment by lhmiles 1 hour ago

Thank you for sharing!

Comment by Tiberium 34 minutes ago

I might be overly pessimistic, but this looks like a case of a person believing LLM hallucinations and making it write a paper.

I asked both Claude Code|Opus 4.5 and Codex|GPT 5.1 Codex Max (funny to ask LLMs, I know) to check the an1-core repo. I don't think they'd hallucinate on something like this (the code is quite small), but I do not claim expertise.

In short, both of them are saying that:

- The repo always runs the full teacher model to extract activations and uses them - see https://github.com/Anima-Core/an1-core/blob/main/an1_core/fi...

- There are weird stub files; e.g. the HellaSwag repro at https://github.com/Anima-Core/an1-core/blob/main/experiments... doesn't actually contain the reproduction code, just "For full HellaSwag reproduction, see the paper" (why include the file at all, then?)

- The actual "AN1 head" is just linear probing (freeze a pretrained model, train a classifier on its features). The full flow (as reported by CC) is "Text → [Full Transformer] → activations → [Tiny Head] → prediction" (see the sketch below)

Basically, there's no code to train a real "student" model that would run without the teacher.
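For anyone who hasn't looked at the repo: "linear probing" means roughly the sketch below. I'm using gpt2 as a stand-in teacher so the snippet actually runs; the repo does the moral equivalent with the full Llama teacher, which is why there's no memory or compute win: the teacher forward pass still happens for every prediction.

    # Generic linear probing: run the (frozen) teacher, pool some hidden states,
    # and put a small linear head on top. The teacher stays in memory and runs
    # for every prediction -- no 224x anything.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in teacher
    tok.pad_token = tok.eos_token                      # gpt2 has no pad token by default
    teacher = AutoModel.from_pretrained("gpt2").eval()

    texts = ["a great movie", "a terrible movie"]
    with torch.no_grad():
        batch = tok(texts, return_tensors="pt", padding=True)
        out = teacher(**batch, output_hidden_states=True)
        # Pick a couple of "anchor" layers and mean-pool them into features.
        feats = torch.cat([out.hidden_states[3].mean(dim=1),
                           out.hidden_states[7].mean(dim=1)], dim=-1)

    head = torch.nn.Linear(feats.shape[-1], 2)  # a small linear classifier on the frozen features
    logits = head(feats)                        # only reachable by running the full teacher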

===

The repo/paper say that there's a mythical "commercial version" that has all the goodies:

(repo)

> This reference implementation (an1-core) does not include the FPU, AN4, or other proprietary optimization components covered by these patents. It provides only the core scientific demonstration of the meaning fields phenomenon.

(paper)

> Production deployment: Optimized implementations (AN1-Turbo) with learned layer selection, adaptive loss scheduling, and CUDA-accelerated inference available under commercial license.

But right now we only have the code in the repo.

===

In the paper they show that the student model (30M params) gets ~82% on SST-2 (labels-only). What they don't show is that DistilBERT (a model more than five years old) already achieves 91% on the same dataset with only 66M params.

Another weird tidbit from the paper: in the section on economic impact, they claim that LLaMA 70B runs at 2 tok/s at batch size 1 on an H200. In reality that number is at least an order of magnitude bigger even without quantization, more like 20-40 tok/s. With quantization it can easily be above 100 tok/s.
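Rough sanity check on that 2 tok/s figure (batch-1 decode is roughly memory-bandwidth bound; real throughput varies with kernels and KV-cache handling, but not by 10×):

    # Every generated token has to stream ~all weights from HBM once, so
    # tok/s is roughly bounded by memory_bandwidth / weight_bytes.
    hbm_bandwidth_gb_s = 4800               # H200 HBM3e, ~4.8 TB/s
    weight_gb = 70e9 * 2 / 1e9              # 70B params at 2 bytes (BF16) ~= 140 GB
    print(hbm_bandwidth_gb_s / weight_gb)   # ~34 tok/s ceiling: 20-40 is plausible, 2 is not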

Comment by gcr 4 hours ago

thanks for sharing! If I understand correctly, you're training a smaller model to approximate concatenate(layer[1], layer[5], layer[10], ...), using a loss function that combines reconstruction error w/ end-to-end accuracy. then, you're transferring that smaller representation into a smaller transformer model. is that right?

If I were a paper reviewer, here are a couple of red flags that stood out to me. Suggest starting here if you want to rework this for an academic submission:

1. Your LaTeX citations in the related work are broken; I see [?] everywhere. To a reviewer, this is often a strong sign of an AI-hallucinated bibliography, though many of your references actually do exist and are contextually relevant, so I'm not quite sure what's going on here. Similarly, the figure references need to be fixed; I see references to "Figure ?" throughout.

2. bluntly, "Exact architecture details remain proprietary for production deployments" and "Production systems use architecture search tailored to target latency and accuracy constraints" is not how IP protection works in this field. Do your experiments use the "MLP baselines" or your proprietary architecture? Since you say the code "Achieves 80-90% of paper performance using baseline heuristics," this approach effectively isn't reproducible. As a reviewer, this really worries me. I strongly recommend benchmarking only the system you're able to open-source. I say this because I suspect there's a lot of "secret sauce" in the actual way you're approximating the anchor layers and the way that's transferred back to your student transformer model, and that's the part that's important to spend the most time/effort/writing on, but it's glossed over as an implementation detail in this manuscript.

3. I'm glad you ablate over the hyperparameters of your system, but how does it compare to (1) an ordinary smaller model of identical size trained end-to-end, and (2) distilling from a single layer's activations? E.g., a reviewer might consider this work a novel method of model distillation, so what makes it better than previous distillation methods?

4. I found the paper fairly hard to read because it's full of sentence fragments rather than full thoughts. A little background on the benchmarks, failure cases, etc. would go a long way, and some discussion of why you think your approach improves on similar distillation methods would also be welcome here.

5. "compression" is overloaded. Does 224x compression refer to (nparams(field transfer)+nparams(student model))/nparams(original model), or does it refer to reducing the representation dimensionality, 7*8192/256 ?

6. [nitpick] suggest changing the name "meaning field" to something a little more digestible, like "compressed representation" or "latent activation distillation" or something
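Re: point 5, the dimensionality reading does land exactly on 224, assuming 7 anchor layers at Llama-70B's hidden size of 8192 (which is my guess at where the number comes from):

    anchor_layers, hidden_size, field_dim = 7, 8192, 256   # 8192 = Llama-70B hidden size
    print(anchor_layers * hidden_size / field_dim)          # 224.0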

sorry for being so critical. iron sharpens iron though. hopefully these thoughts are helpful to get you started. excited to see where this work leads

Comment by gcr 3 hours ago

actually, here's a broader thought. since this approach only works for classification, why not make that the whole story and spin it as a positive? Call your approach a "classification foundation model" (for example) and say it's a special-purpose model distilled from a larger world model. Abstract's gestalt could read like "If you don't need to be generative, then you can compress the representation way down" or "discriminative understanding takes far fewer parameters than language production" This would then set the stage for the reader to understand the limitations and why the benchmarks are set up the way they are.

then the kitschy paper titles could follow from that, e.g. "extreme llama compression: when classification is all you need", or "Encoder-only models: a lightweight alternative to decoder-only GPT world models" or etc.

just spitballing

Comment by _ache_ 3 hours ago

Looks very fake. Self-published (Anima-Core is NOT a journal), no prior academic work, very strong claims, no peer review, no public history of technical skill. Did I mention that GitHub appears to have been used via the web interface only?

At the same time, it's possible, since it's only classification tasks. I mean, the method as explained is technically plausible; a lot of people have thought about it, we just hadn't found a method that works.

Very unlikely true, unfortunately.

Comment by MrDrMcCoy 23 minutes ago

Did you not see the author's note about being an outsider to academia? Not everyone has the background to pull all of that off. This is an earnest attempt to come as close as possible, and they even invite feedback that would help it become a real academic submission.

Comment by hirako2000 54 minutes ago

Have you run the walkthrough to reproduce it? They provide a highly detailed step-by-step document, and they welcome raising an issue if reproduction doesn't yield the claimed results within 2%.

It's OK to call out fake claims, but that requires actually going through the process when doing so is reasonable, and here it seems to take only a couple of hours to find out.

Comment by Tiberium 30 minutes ago

The fake claim here is compression. The results in the repo are likely real, but they're done by running the full transformer teacher model every time. This doesn't achieve anything novel.