> Contrast with the way a human learns skills - as we gain experience with a ski...

mediaman · 2025-10-16T18:27:46 1760639266

It's a false dichotomy. LLMs are already being trained with RL to have goal directedness.

He is right that non-RL'd LLMs are just mimicry, but the field already moved beyond that.

anomaloustho · 2025-10-16T18:42:34 1760640154

I wrote elsewhere but I’m more interpreting this distinction as “RL in real-time” vs “RL beforehand”.

stevenpetryk · 2025-10-17T02:11:57 1760667117

This is referred to as “online reinforcement learning” and is already something done by, for example Cursor for their tab prediction model.

https://cursor.com/blog/tab-rl

tinodb · 2025-10-19T10:42:47 1760870567

Not sure that’s the same. They just very frequently retrain and “deploy a new model”.

munchler · 2025-10-16T20:00:09 1760644809

I agree with this description, but I'm not sure we really want our AI agents evolving in real time as they gain experience. Having a static model that is thoroughly tested before deployment seems much safer.

mbesto · 2025-10-16T20:20:24 1760646024

> Having a static model that is thoroughly tested before deployment seems much safer.

While that might true, it fundamentally means it's not going to ever replicate human or provide super intelligence.

CryptoBanker · 2025-10-17T03:51:24 1760673084

> While that might true, it fundamentally means it's not going to ever replicate human or provide super intelligence.

Many people would argue that's a good thing

OtherShrezzing · 2025-10-16T19:25:56 1760642756

In the interview transcript, he seems aware that the field is doing RL, and he makes a compelling argument that bootstrapping isn’t as scalable as a purely RL trained AI would be.

isodev · 2025-10-17T03:21:02 1760671262

Let’s not overstate what the technology actually is. LLMs amount to random token generators that try their best to have their outputs “rhyme” with their prompts, instructions, skills, or what humans know as goals and consequences.

adastra22 · 2025-10-17T05:27:15 1760678835

It does a lot more than that.

isodev · 2025-10-17T08:27:03 1760689623

It’s literally a slot machine for random text. With “services around it” to give the randomness some shape and tools.

adastra22 · 2025-10-17T08:58:45 1760691525

It is literally not. 2/3 of the weights are in the multi-layer perceptron which is a dynamic information encoding and retrieval machine. And the attention mechanisms allow for very complex data interrelationships.

At the very end of an extremely long and sophisticated process, the final mapping is softmax transformed and the distribution sampled. That is one operation among hundreds of billions leading up to it.

It’s like saying is a jeopardy player is random word generating machine — they see a question and they generate “what is “ followed by a random word—random because there is some uncertainty in their mind even in the final moment. That is both technically true, but incomplete, and entirely missing the point.

mbesto · 2025-10-16T20:29:58 1760646598

> LLMs are already being trained with RL to have goal directedness.

That might be true, but we're talking about the fundamentals of the concept. His argument is that you're never going to reach AGI/super intelligence on an evolution of the current concepts (mimicry) even through fine tuning and adaptions - it'll like be different (and likely based on some RL technique). At least we have NO history to suggest this will be case (hence his argument for "the bitter lesson").

samrus · 2025-10-16T22:21:27 1760653287

The LLMs dont have RL baked into them. They need that at the token prediction level to be able to do the sort of things humans can do

dingnuts · 2025-10-16T18:40:29 1760640029

Explain something to me that I've long wondered: how does Reinforcement Learning work if you cannot measure your distance from the goal? In other words, how can RL be used for literally anything qualitative?

kmacdough · 2025-10-16T19:25:45 1760642745

This is one of known hardest parts of RL. The short answer is human feedback.

But this is easier said than done. Current models require vastly more learning events than humans, making direct supervision infeasable. One strategy is to train models on human supervisors, so they can bear the bulk of the supervision. This is tricky, but has proven more effective than direct supervision.

But, in my experience, AIs don't specifically struggle with the "qualitative" side of things per-se. In fact, they're great at things like word choice, color theory, etc. Rather, they struggle to understand continuity, consequence and to combine disparate sources of input. They also suck at differentiating fact from fabrication. To speculate wildly, it feels like it's missing the the RL of living in the "real world". In order to eat, sleep and breath, you must operate within the bounds of physics and society and live forever with the consequences of an ever-growing history of choices.

ewoodrich · 2025-10-16T23:43:32 1760658212

Whenever I watch Claude Code or Codex get stuck trying to force a square peg into a round hole and failing over and over it makes me wish that they could feel the creeping sense of uncertainty and dread a human would in that situation after failure after failure.

Which eventually forces you to take a step back and start questioning basic assumptions until (hopefully) you get a spark of realization of the flaws in your original plan, and then recalibrate based on that new understanding and tackle it totally differently.

But instead I watch Claude struggling to find a directory it expects to see and running random npm commands until it comes to the conclusion that, somehow, node_modules was corrupted mysteriously and therefore it needs to wipe everything node related and manually rebuild the project config by vague memory.

Because no big deal, if it’s wrong it’s the human's problem to untangle and Anthropic gets paid either way so why not try?

jon-wood · 2025-10-17T14:58:28 1760713108

> But instead I watch Claude struggling to find a directory it expects to see and running random npm commands until it comes to the conclusion that, somehow, node_modules was corrupted mysteriously and therefore it needs to wipe everything node related and manually rebuild the project config by vague memory.

In fairness I have on many an occasion worked with real life software developers who really should know better deciding the problem lies anywhere but their initial model of how this should work. Quite often that developer has been me, although I like to hope I've learned to be more skeptical when that thought crosses my mind now.

ewoodrich · 2025-10-17T17:20:39 1760721639

Right, but typically making those kind of mistakes creates more work for yourself and with the benefit of experience you get better at recognizing the red flags to avoid getting in that situation again. but it

Which is why I think the parent post had a great observation about human problem solving having evolved in a universe inherently formed by the additive effect of every previous decision you've ever made made in your life.

There's a lot of variance in humans, sure, but inescapable stakes/skin in the game from an instinctual understanding that you can't just revert to a previous checkpoint any time you screw up. That world model of decisions and consequences helps ground abstract problem solving ability with a healthy amount of risk aversion and caution that LLMs lack.

mbesto · 2025-10-16T20:31:30 1760646690

This 100%.

While we might agreed that language is foundational to what it is to be human, it's myopic to think its the only thing. LLMs are based on training sets of language (period).

mediaman · 2025-10-17T23:26:39 1760743599

RL works great on verifiable domains like math, and to some significant extent coding.

Coding is an interesting example because as we change levels of abstraction from the syntax of a specific function to, say, the architecture of a software system, the ability to measure verifiable correctness declines. As a result, RL-tuned LLMs are better at creating syntactically correct functions but struggle as the abstraction layer increases.

In other fields, it is very difficult to verify correctness. What is good art? Here, LLMs and their ilk can still produce good output, but it becomes hard to produce "superhuman" output, because in nonverifiable domains their capability is dependent on mimicry; it is RL that gives the AI the ability to perform at superhuman levels. With RL, rather than merely fitting its parameters to a set of extant data it can follow the scent of a ground truth signal of excellence. No scent, no outperformance.

leptons · 2025-10-16T18:38:38 1760639918

I can't wait to try to convince an LLM/RL/whatever-it-is that what it "thinks" is right is actually wrong.

baxtr · 2025-10-16T18:55:59 1760640959

So it’s on-the-fly adaptive mimicry?

buildbot · 2025-10-16T18:07:43 1760638063

The industry has been doing RL on many kinds of neural networks, including LLMs, for quite some time. Is this person saying we RL on some kind of non neural network design? Why is that more likely to bring AGI than an LLM?.

> More specifically, LLMs don't have goals and consequences of actions, which is the foundation for intelligence.

Citation?

anomaloustho · 2025-10-16T18:25:46 1760639146

Looks like they added the link. But I think it’s doing RL in realtime vs pre-trained as an LLM is.

And I associate that part to AGI being able to do cutting edge research and explore new ideas like humans can. Where, when that seems to “happen” with LLMs it’s been more debatable. (e.g. there was an existing paper that the LLM was able to tap into)

I guess another example would be to get an AGI doing RL in realtime to get really good at a video game with completely different mechanics in the same way a human could. Today, that wouldn’t really happen unless it was able to pre-train on something similar.

ibejoeb · 2025-10-16T20:53:08 1760647988

I don't think any of the commercial models are doing RL at the consumer. The R is just accepting or rejecting the action, right?

jfarina · 2025-10-16T18:12:05 1760638325

Why are you asking them to cite something for that statement? Are you questioning whether it's the foundation for intelligence or whether LLMS understand goals and consequences?

buildbot · 2025-10-16T18:19:04 1760638744

Yes, I'm questioning if that's the foundation of intelligence. Says who?

mbesto · 2025-10-16T20:22:12 1760646132

Richard Sutton. He won a Turing Award. Why ask your question above when you can just watch the YouTube link I posted?

skurilyak · 2025-10-16T18:35:31 1760639731

Besides a "reference manual", Claude Skills is analogous to a "toolkit with an instruction manual" in that it includes both instructions (manuals) and executable functions (tools/code)

hbarka · 2025-10-16T18:16:19 1760638579

For humans, it’s not uncommon to have a clever realization by way of serendipity. How do you skill AI to have serendipity.

Weeenion · 2025-10-17T13:41:07 1760708467

I would love to understand were this notion of LLM becoming AGI ever came from?

ChatGPT broke upen the dam to massive budget on AI/LM and LLM will probably be a puzzle peace to AGI. But otherwise?

I mean it should be clear that we have so much work to do like RL (which now happens btw. on massive scale because you thumb up or down every day), thinking, Model of Experts, toolcalling and super super critical: Architecture.

Compute is a hard upper limit too.

And the math isn't done either. The performance of Context length has advanced, we also saw other approcheas like a diffusion based models.

Whenever you hear the leading experts talking, they mention world models.

We are still in a phase were we have plenty of very obivous ideas people need to try out.

But alone the quality of whispher, llm as an interface and tool calling can solve problems with robotics and stuff, no one was able to solve that easy ever before.

vonneumannstan · 2025-10-16T18:29:05 1760639345

This is an uninformed take. Much of the improvement in performance of LLM based models has been through RLHF and other RL techniques.

mbesto · 2025-10-16T20:26:49 1760646409

> This is an uninformed take.

You may disagree with this take but its not uninformed. Many LLMs use self‑supervised pretraining followed by RL‑based fine‑tuning but that's essentially it - it's fine tuning.

vonneumannstan · 2025-10-17T14:02:09 1760709729

I think you're seriously underestimating the importance of the RL steps on LLM performance.

Also how do you think the most successful RL models have worked? AlphaGo/AlphaZero both use Neural Networks for their policy and value networks which are the central mechanism of those models.