This isn’t a perfect analogy, but I’m curious what you think of it as it came to mind while reading your piece.
LLMs are to white collar labor as factories are to blue collar craftsmanship.
Both can produce OK but fairly generic products very easily and cheaply, and often that is good enough. But in many cases it's not. A skilled craftsman still makes better furniture than IKEA. And even when we don't need amazing work, there are many cases where you need a result that isn't generic.
Another way the analogy works is how we're flooded with low-quality consumer products, and many people in rich countries now struggle with clutter in their homes.
I think it's quite a good analogy. But notably, when it comes to intellectual output, where it differs from physical goods is that only the artisanal stuff matters. Which is why great science is still produced by small teams, and great art is often still produced by one person. These are markets where there's not a lot of benefit to "mass production" at the top.
I really appreciate this article. I've noticed that the people I respect most on AI have now moved to reluctantly acknowledging that there is a difference between training AI in skills that are automatically checkable, like passing a unit test or proving a mathematical theorem in a formal language, and skills that are not easy to check, like writing good prose or making a high-quality informal academic argument. Our current methods for training LLMs don't continue to scale for the hard-to-check skills.
The idea that we will soon have automated software engineers and researchers, much less children's book authors, appears ridiculous when you consider this. Until we have another paradigm shift in machine learning, this isn't going to change. No matter how much money the labs pour into bespoke reinforcement learning frameworks and agentic scaffolds, they aren't going to be able to endow the models with human-level expertise in these hard-to-verify capabilities.
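To make the checkable/uncheckable split concrete, here's a minimal sketch of what a reward function looks like in each case (purely illustrative, not any lab's actual training code; all names are hypothetical):

```python
def code_reward(solution_src: str, tests_src: str) -> float:
    """Verifiable skill: execute the candidate against its tests.
    The environment itself reports, automatically, whether the
    output was good, so this scales to millions of RL episodes."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate functions
        exec(tests_src, namespace)     # assert-based tests raise on failure
        return 1.0
    except Exception:
        return 0.0

def prose_reward(essay: str) -> float:
    """Hard-to-check skill: no program returns 'this is good writing'.
    Any stand-in judge (another model, a rubric) is itself fallible
    and gets gamed by the optimizer."""
    raise NotImplementedError("no automatic verifier exists")
```

The asymmetry is the whole point: the first function is cheap and trustworthy at scale, and the second one doesn't exist.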
OMG, yes:
> The worst users flood the zone.
As a software engineer, this is so apparent because AI "slop" is so easy to identify: none of it actually works in production. A simple, binary audit that reveals as much as I need, and way more than I wanted to know.
I think this has become the case in many other industries where writing is the core modality. Wait, that might be every industry... what am I talking about? You get a rock, you get a rock, EVERYONE GETS A ROCK.
I can't wait for the midwits to move on to the next iteration of OpenClaw.
Awesome.
Though I would quibble with your summary of “The Giving Tree.”
It is a parable of codependency and narcissism. ;)
I liked the piece and agree with most of it. I'm curious about this part:
>In this conservative view, by 2030, the top 100 math papers of the year won’t look spectacularly different from the top 100 math papers of 2020. The top 1,000 papers?
What do you make of Terence Tao saying that AI is already impacting the way he currently does math research? I wouldn't have thought to pay much mind to AI's impact on math research were it not for someone like Terence Tao making this statement. I imagine he would fall into the top-100 bucket rather than 101-1,000, and it feels interesting to note that there are already gains there.
Furthermore, it does seem noteworthy that in the last six months even high-level software engineers (say, those at big tech companies, AI labs, and trading firms) have been experiencing a boost in productivity from AI tools (anecdotally, from folks I know in such places).
I'm prepared to accept the thesis that math and software engineering are just "different" because 1. they're verifiable domains, which lend themselves more nicely to RL post-training approaches, and 2. they're the specific domains AI labs care most about improving and are spending the most money trying to improve. However, I'm curious if you have a different perspective on this.
> I'm prepared to accept the thesis that math and software engineering are just "different" because 1. they're verifiable domains, which lend themselves more nicely to RL post-training approaches, and 2. they're the specific domains AI labs care most about improving and are spending the most money trying to improve. However, I'm curious if you have a different perspective on this.
It's a great question, and the answer, I think, is that it's no different from writing. They are just finally experiencing it. We writers were simply ahead, by years, and we've seen the effects (on books, on blogs, etc.). Meanwhile, the LLMs can finally understand what a top mathematician like Tao is even talking about. And so he confronts the same thing the rest of us mere mortals confronted more than half a decade ago. He's shook! Things are changing! Etc. And that's all correct... but it's much more about the Long March finally reaching these outposts than about those outposts being more affected by it.
> a prompt is an injection of human intelligence, and a scaffold too is an injection of human intelligence (but the advantage of scaffolds is that you can put a ton of domain-specific knowledge and tips and tricks and guides in the scaffolds...
This makes the writing case quite clear. There is no efficiency gain from building the necessary scaffolding, so nobody does it. If I were to build writing infrastructure, I'd need to make ongoing, unpredictable adjustments. Sufficiently improving the tool is approximately equivalent to doing the work itself.
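For what it's worth, a scaffold reduced to its essence is just human domain knowledge wrapped around a model call, something like the sketch below (hypothetical throughout; `call_llm` is a stand-in for whatever API you use):

```python
def call_llm(prompt: str) -> str:
    """Stand-in for an actual model API call."""
    raise NotImplementedError

# Coding scaffold: stable rules that pay for themselves forever.
CODING_SCAFFOLD = """You are editing a Python codebase.
- Run the test suite after every change.
- Never catch bare exceptions.
- Prefer small, pure functions.
Task: {task}"""

def scaffolded_code_task(task: str) -> str:
    return call_llm(CODING_SCAFFOLD.format(task=task))

def scaffolded_writing_task(task: str) -> str:
    # What stable rules would go here? "Write in my voice"?
    # "Know what this particular reader needs right now"? The
    # guidance changes with every piece; writing it down
    # precisely is, in effect, doing the writing.
    writing_scaffold = "{task}"  # no durable rules to encode
    return call_llm(writing_scaffold.format(task=task))
```

The coding template amortizes across thousands of tasks; the writing template never stabilizes, which is exactly the point.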
Also, I want to point out: a lot of people (normies?) can't tell AI writing from real writing whatsoever.
When polished, efficient writing becomes cheap and ubiquitous, people may begin to value the signals of real authorship: things like idiosyncratic voice, imperfect structure, even mistakes.
I sometimes think messy writing might become a kind of proof of work (to borrow from crypto) for human artistry and thinking, and I'm wondering if I should just keep my mistakes in my blog going forward. Then I think: but wait, the sheer volume of content today means efficiency and messaging will matter more than ever for mass communication. So we may see a split... highly optimized, AI-assisted writing dominating the mass market, while more distinctive, human writing becomes a marker of taste and authenticity. But at the end of the day, our tastes are going to change: Proust and Dostoevsky, versus the evolution to Stephen King in the '90s, versus tomorrow's writers. We will always have to return to the classics of the 1800s when we want something good. And as for kids' books, I'm sure Dr. Seuss still rules the day (despite the woke controversy).
Not convinced it will encounter the same limitations in math that it does in art or literature. There's a large element of subjectivity in assessing quality in those realms; in some sense, what you're really measuring is the model's ability to grasp human sensibilities. With math, answers are correct or not, irrespective of how we feel about them.
Even in math you need some reason to pose a particular problem (notice that all the progress has been on well-posed problems that we know likely have solutions). I don't think it's possible to really grasp how big these spaces are. The space of all possible math and science is enormous, and the only way we currently know how to navigate it is exactly the subjective stuff (judgment, elegance, taste) that matters for evaluating literature too.
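To give a feel for the size of these spaces, here's a back-of-envelope count (just an illustration; the setup is made up): even counting only the shapes of expressions, ignoring which symbols fill them, the space explodes.

```python
from math import comb

def catalan(n: int) -> int:
    """Number of distinct binary expression trees with n internal nodes."""
    return comb(2 * n, n) // (n + 1)

for n in [5, 10, 20, 40]:
    print(n, catalan(n))
# 5 -> 42, 10 -> 16796, 20 -> 6564120420, 40 -> ~2.6e21
```

Multiply by the choice of symbols at every node and brute-force search is hopeless; taste and judgment are how humans prune.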
So the reason LLMs haven't become world-class at writing is that writing is not a verifiable domain, and therefore the model cannot learn from its own experience (RL + synthetic data).
Math, however, is verifiable, and so I'd expect LLMs to be able to surpass even the most elite humans.
This formula (starting with imitation of humans to provide reasonable priors, then adding self-play via RL and synthetic data) is exactly how "move 37" was achieved in the game of Go.
If a system of this sort can be genuinely creative and brilliant within the verifiable domain of Go, there's no clear reason why the same can't happen in math.
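For reference, the recipe has a simple shape. This is only a schematic of the training loop (every function and method here is a placeholder, not working Go code):

```python
from typing import Any, Callable

def train(
    model: Any,
    human_games: list,
    play_game: Callable[[Any, Any], Any],  # plays model vs. model
    score_game: Callable[[Any], float],    # +1/-1: the automatic verifier
    n_iterations: int,
) -> Any:
    # Stage 1: imitation. Human games give the policy a reasonable
    # prior over moves (the analogue of pretraining on human text).
    model.fit(human_games)

    # Stage 2: self-play. This only works because Go is verifiable:
    # score_game is a cheap, perfect judge of who won.
    for _ in range(n_iterations):
        game = play_game(model, model)   # the model plays itself
        outcome = score_game(game)       # automatic ground truth
        model.reinforce(game, outcome)   # reinforce the winner's moves
    return model
```

Stage 2 is exactly what's missing for writing: there is no score_game for prose.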
So... when's the bubble pop?
The act of writing is about the experience of reading. What you want as a reader is the reflection of past readings in what you're reading now, in what's been written, if that makes sense. Subjectivity: so hard to simulate! AI mimicry is the sound without voice, without mouth.
I think extrapolating from writing is fraught. Just because LLMs are made of words doesn't mean they should be good at the particular use of words we call writing to communicate ideas.
I'd also say (I think this is the correct use of the phrase) that your argument proves too much. Or is already disproven? Whatever. What I mean is that AI is already better at coding, relative to professional coders, than it is at writing, relative to professional writers. So there's already evidence that extrapolating from writing doesn't work. I think this is true of some other fields too.
I'm not saying there isn't loads of hype, or that METR is what most people think it is. And I don't know if I'm in the form of AI psychosis that is all "omg so productive" while having few external effects.
> What I mean is that AI is already better at coding, relative to professional coders, than it is at writing, relative to professional writers.
Does this account for the levels-of-abstraction argument? I think if you just count AI-produced lines, then sure. But then why is Anthropic still hiring? Because coding is still happening, just at a higher level of abstraction.
And the results of vibe coding are also subject to a less competitive market than vibe writing, by the way. You can ship buggy, overwritten code and no one cares, whereas for writing, the product's quality matters in every respect.
Hmmm. I'm not going on "produced lines", just the number of people I personally know with decimated (anti-decimated, actually: only 1 in 10 remain) programming teams who have had to totally change the way they work and bill.
Anthropic still hiring doesn't necessarily mean they are hiring coders for higher-abstraction tasks. They might be. But they're also the hottest thing on the planet, and I imagine they're growing and building things out. All that's required for them to be hiring is that they're growing faster than their Claude coding efficiency gains, and it would actually surprise me if that weren't true given the growth rate.
I totally take your point on the difference between coding and writing. Writing you see directly, whereas with coding you see the product, not the code, so it's less obvious. And while my expertise as a writer is... limited... my expertise as a coder is even less, so I genuinely can't tell myself. Though to my point above, I'm hearing about a lot of teams using many fewer coders, and statements like "better than 90% of coders" being scoffed at and answered with "make that 98%".
What I remember of computer programming in the 1950s is that some considered it impossible that a computer would ever be able to play chess at all, much less beat a human. But once we learned how to make a machine that could play itself a million times in a day, that was clearly far from true. What I see missing from your analysis is that we appear to be approaching the day when the LLM will be able to code the next-generation LLM, which will be able to code the next generation, and so on. I agree that not much is happening beyond the straight line of the curve at the moment, but isn't that true of all geometric growth?