I have shitty eyesight and old is the only version that’s usable to me because text zooming works far better in the old layout. Are you using a screen reader by any chance?
Academic papers should be neutral when dissing previous work imo.
"[Previous work] is just bad [previous other work]" isn't a professional way to talk about the merits and drawbacks of competing approaches, but boy does it get tiktok views!
Better work stands on its own merits. No need to explicitly shit on the competition in the title/abstract.
I had the misfortune of visiting an Amazon Go store. They charged me for items that I never picked up.
No option to contest the receipt... until the "would you recommend a friend visit Amazon Go" survey popped up. I responded negatively, and then the "why?" question had a "My receipt was incorrect" option.
Suddenly I was able to go through the "contest receipt" workflow.
Which makes tons of sense because iPhone users are higher CLV than Android users. If Google had to choose between major software defects in Android or iOS, they would focus quality on iOS every time.
i basically can't use the ChatGPT app on the subway for these reasons. the moment the websocket connection drops, i have to edit my last message and resubmit it unchanged.
it's like the client, not the server, is responsible for writing to my conversation history or something
it took me a lot of tinkering to get this feeling seamless in my own apps that use the api under the hood. i ended up buffering every token into a redis stream (with a final db save at the end of streaming) and building a mechanism to let clients reconnect to the stream on demand. no websocket necessary.
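roughly the shape of it, in case it's useful. this is a simplified sketch using redis-py streams (hypothetical names like save_message_to_db, no error handling), not my exact code:

    import redis

    r = redis.Redis()
    STREAM_TTL = 3600  # keep a finished stream around for an hour so late clients can replay it

    def stream_tokens(conversation_id, token_iter):
        """Producer: append every generated token to a Redis stream, then persist once at the end."""
        key = f"chat:{conversation_id}:stream"
        tokens = []
        for tok in token_iter:
            r.xadd(key, {"token": tok})
            tokens.append(tok)
        r.xadd(key, {"done": "1"})            # sentinel so readers know generation finished
        r.expire(key, STREAM_TTL)
        save_message_to_db(conversation_id, "".join(tokens))  # hypothetical: your normal persistence path

    def replay_and_follow(conversation_id, last_id="0-0"):
        """Consumer: a (re)connecting client replays what it missed, then blocks for new tokens."""
        key = f"chat:{conversation_id}:stream"
        while True:
            entries = r.xread({key: last_id}, count=100, block=5000)
            if not entries:
                continue
            for _stream, items in entries:
                for entry_id, fields in items:
                    last_id = entry_id
                    if b"done" in fields:
                        return
                    yield fields[b"token"].decode()

the nice part is the stream doubles as the replay buffer: a reconnecting client just passes the last entry id it saw and picks up from there.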
works great for kicking off a request and closing the tab, or navigating away to another page in my app to do something else.
i don't understand why model providers don't build this kind of resilient token streaming into all of their APIs. would be a great feature
exactly. they need Spotify's level of caching for streaming, where it just works even if you're in a subway. Constant availability should be table stakes for them.
agreed re: title/abstract stretching. good work stands on its own without needing hype. "we found a nifty way to distill llama-70b using a much smaller student transformer model; the key is using intermediate activation layers in a compressed representation" would be about as effective at selling it while being more immediately approachable IMO
thanks for sharing! If I understand correctly, you're training a smaller model to approximate concatenate(layer[1], layer[5], layer[10], ...), using a loss function that combines reconstruction error w/ end-to-end accuracy. then, you're transferring that smaller representation into a smaller transformer model. is that right?
If I were a paper reviewer, here are a couple of red flags that stood out to me. I suggest starting here if you want to rework this for an academic submission:
1. your LaTeX citations in the related work are broken; I see [?] everywhere. To a reviewer, this is often a strong sign of an AI-hallucinated bibliography, though many of your references actually do exist and are contextually relevant, so I'm not quite sure what's going on here. Similarly, figure references need to be fixed; I see references to "Figure ?" throughout.
2. bluntly, "Exact architecture details remain proprietary for production deployments" and "Production systems use architecture search tailored to target latency and accuracy constraints" is not how IP protection works in this field. Do your experiments use the "MLP baselines" or your proprietary architecture? Since you say the code "Achieves 80-90% of paper performance using baseline heuristics," this approach effectively isn't reproducible. As a reviewer, this really worries me. I strongly recommend benchmarking only the system you're able to open-source. I say this because I suspect there's a lot of "secret sauce" in the actual way you're approximating the anchor layers and the way that's transferred back to your student transformer model, and that's the part that's important to spend the most time/effort/writing on, but it's glossed over as an implementation detail in this manuscript.
3. I'm glad you ablate over hyperparameters of your system, but how does it compare to (1) an ordinary smaller model of identical size trained end-to-end, and (2) distilling from a single layer's activations? E.g., a reviewer might consider this work to be a novel method of model distillation, so what makes it better than previous distillation methods?
4. I found the paper fairly hard to read because it's full of sentence fragments rather than full thoughts. A little background on the benchmarks, failure cases, etc. would go a long way, and adding some discussion on why you think your approach improves on similar distillation methods would also be welcome here
5. "compression" is overloaded. Does 224x compression refer to (nparams(field transfer)+nparams(student model))/nparams(original model), or does it refer to reducing the representation dimensionality, 7*8192/256 ?
6. [nitpick] suggest changing the name "meaning field" to something a little more digestible, like "compressed representation" or "latent activation distillation" or something
sorry for being so critical. iron sharpens iron though. hopefully these thoughts are helpful to get you started, excited to see where this work leads
actually, here's a broader thought. since this approach only works for classification, why not make that the whole story and spin it as a positive? Call your approach a "classification foundation model" (for example) and say it's a special-purpose model distilled from a larger world model. The abstract's gestalt could read like "If you don't need to be generative, then you can compress the representation way down" or "discriminative understanding takes far fewer parameters than language production." This would then set the stage for the reader to understand the limitations and why the benchmarks are set up the way they are.
then the kitschy paper titles could follow from that, e.g. "extreme llama compression: when classification is all you need", or "Encoder-only models: a lightweight alternative to decoder-only GPT world models", etc.
I appreciate this framing a lot. It is actually close to how I think about the result internally. The paper focuses on the geometric behavior of intermediate representations, and classification is the cleanest setting to study that. Generative decoding is a much harder problem, and the limitations section already makes that distinction explicit.
Recasting the work as a “classification-native distilled model” or “discriminative foundation model” is a good way to signal scope without underselling the contribution. You're right that discriminative understanding requires far fewer parameters than generation, and my experiments reinforce that.
This will help me get better. The goal for the next revision is exactly what you describe: make the setup clearer, emphasize the intended domain, and avoid suggestive wording that implies capabilities the method does not claim. Duly noted. Your suggestions on positioning and title direction are genuinely helpful, and I’ll incorporate some of this thinking when I prepare the academic submission.
Thanks for taking the time to articulate it so clearly. I appreciate your time and your critique.
Look, this is an earnest new author who isn’t from academia. Dogpiles are only useful if they include useful feedback. A professor once extended a similar kindness to me on a particularly rough draft of my own (very early) work, and it was incredibly helpful to have a neutral but frank attitude on what wasn’t working and what was.
Thank you for the thoughtful comments. Really. This is actually the most constructive feedback in the thread so far.
A few clarifications.
1. On the LaTeX citations and figure references
That part is definitely on me. I had never used LaTeX before this project and moved extremely fast. There's a lot of weird mumbo jumbo involved in the formatting and converting it to a PDF; that part isn't interesting to me, and I tried to move past it quickly. I did use AI tools for typesetting help, and I clearly didn't clean up all the placeholder references. Entirely my mistake, not an attempt to fabricate sources. I'll fix the citations and figure links in the next revision so they meet normal academic standards.
2. Architecture transparency and reproducibility
The open-source repo contains every component used for the scientific claim:
extraction of activation fields
rank reduction
probing
training the student model
running inference with the student alone
The proprietary references in the paper refer only to optimization layers (CUDA kernels, scheduler heuristics, etc.) that aren't required for the scientific result. They're not hand-wavy secret parts of the method, just production-grade accelerations I'm still packaging separately for licensing.
The core idea—extract, compress, probe, distill—is fully reproduced in the repo.
3. “Secret sauce” concern
There actually isn’t any.
The paper may read like I’m hinting at hidden architecture, but the method is intentionally simple. The novelty is in how much task-relevant geometry survives after severe rank reduction, not in a complex architecture. The “anchor layers” are just early and mid-layer activations concatenated before compression.
4. Baseline comparisons
Good point on comparing to:
1. a standard small transformer of the same size
2. a distillation from a single layer’s activations
I do have partial results for both, and you’re right that including them would sharpen the contribution. I’ll incorporate them into the revised version.
5. Writing clarity and background
Fair critique. I wrote this at the same time I was building the entire stack, which means the prose lagged behind the experiments. I can expand failure modes, limitations, and benchmark context to make the narrative clearer.
6. On the term “meaning field”
Naming is tricky, and I thought that term captured everything I'm working on pretty effectively. Also, I think it will make more sense when you see everything I'm releasing in the near future. I used it because I felt it captured the intuition behind low-rank activation structure, but I'm not attached to the term. "Compressed activation representation" is probably clearer for a paper audience. I'll adjust based on reviewer expectations.
7. Correct summary of the method
Your restatement is close, but not quite it. The student isn’t trained to reconstruct specific layers, but to match the compressed field extracted from multiple layers. It’s not a smaller transformer trying to imitate concatenated layers, but a model trying to predict a learned low-dimensional latent that carries most of the task-relevant signal.
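In code terms, the training objective looks schematically like this. It's a sketch only: the small encoder, its out_dim attribute, and the fixed alpha weighting between the field-matching term and the task term are stand-ins, not the exact setup:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StudentWithProbe(nn.Module):
        """Small model that predicts the compressed field, plus a linear probe for the task."""
        def __init__(self, encoder, field_dim=256, num_classes=10):
            super().__init__()
            self.encoder = encoder                         # any small text encoder (stand-in)
            self.to_field = nn.Linear(encoder.out_dim, field_dim)
            self.probe = nn.Linear(field_dim, num_classes)

        def forward(self, x):
            z = self.to_field(self.encoder(x))             # predicted field, shape (batch, field_dim)
            return z, self.probe(z)

    def train_step(model, batch, target_fields, labels, optimizer, alpha=0.5):
        """One step: match the teacher-derived compressed field AND keep end-task accuracy."""
        z, logits = model(batch)
        field_loss = F.mse_loss(z, target_fields)          # match the compressed field, not raw layers
        task_loss = F.cross_entropy(logits, labels)        # end-to-end classification term
        loss = alpha * field_loss + (1 - alpha) * task_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The point of the sketch is the target: the student never sees the teacher's layers directly, only the low-dimensional field derived from them.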
All of your points are duly noted, and they will help me to adapt, grow, and mature my work and future releases.
Thank you, sincerely. This is the kind of feedback that actually improves me and the work as well.
The current colour Kindles and Kobos don't use real colour e-ink. It's just a B&W panel with a colour filter layer on top (E Ink Kaleido).
The real colour screens are used on the reMarkable (E Ink Gallery), and they are indeed slow for full-page updates, though reMarkable seems to have done a lot of clever work for local updates while drawing.
See, there are users who like Liquid Glass, just as there are users who like Touch ID. A lot of Apple's best work turned out to be quite polarizing at the time.
iOS 7's design language was almost universally panned, but if it had been "the wrong decision," other phones wouldn't have adopted a similar design language. Material Design appeared just a year later, in 2014. It wasn't bad, it was just arbitrary.
(“I like Liquid Glass! I like Liquid Glass!” I insist as I slowly shrink down to the size of a corn cob)
On the topic of Alan Dye and the home button, though: the swipe-gesture interface they introduced when they removed the home button strikes me as one of the few genuinely successful system-level Apple design innovations in recent years. That, at least, seems to have happened under his leadership. I can't think of much else good to say about what I associate with design under him.
It’s my understanding that Chan Karunamuni was largely responsible for leading the iPhone X home buttonless interface, which, I agree, is fantastic and probably the best bit of UI to come out of Apple in years. Also, the Dynamic Island, which is less impactful, but really good and clever! Anyway, he’s excited about Lemay, so I am too. https://9to5mac.com/2025/12/05/acclaimed-apple-designer-says...