The Way We Think

Thoughts after Karpathy's
Personally, I don't see current LLMs as having a real ability to reason. They are more like students who rack up high scores through ever-better memorization but never grasp the concepts intrinsically; we see this in their heavy reliance on data quality and in generalization that is weaker than what humans achieve. Differentiating genuinely promising students from those who merely memorize is already difficult in the real world, and simple score-based evaluations are not sufficient.
The only way to examine this is via extrapolation: a naive model trained on very limited data is questioned on something it has not yet seen. True reasoning shouldn't be a result of supervision. It feels more like knowledge distilled out of a mapping already learned, and it can happen long after the observation took place. For example, an infant doesn't see the reason for eating with a spoon. They observe the behavior, try it themselves, and later find the rationale through another behavior, say wearing boots on muddy roads. Only then can they draw the conclusion that the goal is not to get dirty.
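To make the extrapolation test above concrete, here is a toy sketch of my own (it has nothing to do with any particular LLM or Karpathy's setup): fit a flexible model on a narrow slice of inputs, then query it far outside that slice. The in-range error looks excellent; the out-of-range error reveals that the fit is closer to memorizing the training region than to understanding the underlying rule.

```python
# Toy extrapolation check (an illustrative sketch, not a claim about any real model):
# fit on a narrow input range, then query far outside it.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# "Very limited training data": samples of sin(x) on [0, pi] only.
x_train = rng.uniform(0.0, np.pi, size=200)
y_train = np.sin(x_train)

# A flexible model (degree-12 polynomial) that can fit the training range almost perfectly.
model = Polynomial.fit(x_train, y_train, deg=12)

# Fresh points inside the training range vs. points the model has never seen.
x_interp = rng.uniform(0.0, np.pi, size=200)
x_extrap = rng.uniform(2 * np.pi, 3 * np.pi, size=200)

mse_interp = np.mean((model(x_interp) - np.sin(x_interp)) ** 2)
mse_extrap = np.mean((model(x_extrap) - np.sin(x_extrap)) ** 2)

print(f"in-range MSE:     {mse_interp:.2e}")  # tiny
print(f"out-of-range MSE: {mse_extrap:.2e}")  # blows up
```

Held-out accuracy alone can't catch this, because the held-out set usually comes from the same range the model has memorized; only the out-of-range questions do.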
As a matter of fact, I see goals as a higher level of knowledge distillation, certainly not something we are all born with. These goals feed back into our brains and thereby reinforce loops of mental satisfaction.
It's reasonable, and for the most part feasible from an engineering standpoint, to train commercial LLMs on the internet. It's not just that language is the only means humans rely on to communicate, or that it captures the deep history of the knowledge we have ever created; the outputs and comparisons are also intuitive, and people place more trust in what's explainable. However, the result of such a complex distillation process is too hard a curriculum for any model to genuinely understand.
I think Karpathy does make a good engineering point that depriving models of memorization is a good way to foster thinking. Fundamentally, though, I think we'll need better representations of reasoning, potentially with more extraction and more autonomous structuring.
Even so, it may be interesting to keep them on the more memorization-heavy side of the work: having seen a pattern before, search for the best route. By using their latents as priors, some traditional pipelines may see interesting improvements in optimization efficiency.
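As a rough illustration of what I mean by "latent as prior" (my own sketch; the `prior` callable is a hypothetical stand-in for whatever remaining-cost estimate a model's latent might provide), here is a classical best-first search whose expansion order is biased by that learned estimate, while the search machinery itself stays untouched.

```python
# Sketch: classical search guided by a learned prior. The `prior` argument is a
# hypothetical hook; nothing here depends on any specific model.
import heapq
import itertools
from typing import Callable, Dict, Hashable, List, Tuple

Node = Hashable
Graph = Dict[Node, List[Tuple[Node, float]]]  # node -> [(neighbor, edge_cost)]

def guided_search(graph: Graph, start: Node, goal: Node,
                  prior: Callable[[Node], float]) -> Tuple[float, int]:
    """Best-first search whose expansion order is set by a learned prior.

    `prior(node)` stands in for a model's estimate of the remaining cost to the
    goal. Returns (cost of the found path, number of node expansions).
    """
    tie = itertools.count()  # tie-breaker so the heap never compares nodes directly
    frontier = [(prior(start), next(tie), 0.0, start)]
    best_cost = {start: 0.0}
    expansions = 0
    while frontier:
        _, _, cost, node = heapq.heappop(frontier)
        expansions += 1
        if node == goal:
            return cost, expansions
        for nxt, edge in graph.get(node, []):
            new_cost = cost + edge
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                # A*-style priority: true cost so far + learned guess of the rest.
                heapq.heappush(frontier, (new_cost + prior(nxt), next(tie),
                                          new_cost, nxt))
    return float("inf"), expansions
```

Whether the optimality guarantee survives depends on the prior being admissible, which a learned one usually isn't; the trade is fewer expansions for a possibly suboptimal route, which is often worthwhile when the exact solver is the bottleneck.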