The state we are in right now is what Ethan Mollick has called the "jagged frontier": large language models are substantially better than the average human at many tasks and substantially worse at many others, and you cannot tell which without experimentation.
I'm glad you are actually using an AI system to try (if unsuccessfully) a use case. And I'm especially glad to see that you tried multiple prompts. Too much AI scorn and hype comes from single attempts, without either experimentation to improve performance or repeated trials to see whether the result can be achieved consistently. It's very much an in-progress technology; I wouldn't want to make a substantial bet on what it can or cannot do 18 months from now.
The problem with Gen AI is that people keep getting lulled into thinking they are working with a thinking, knowledgeable system. They keep forgetting how Gen AI works. It's only a prediction machine: it's predicting what, based on its algorithms, is the most likely next word. When it gets it right or close to right, we are amazed. When it gets it wrong, we call it a "hallucination". In reality it's always hallucinating; we just like the results a lot of the time. We have to keep in mind that even though it uses words to try and reassure us, it does not truly understand. It's just predicting and repeating based on what has been previously written. So what good prompters do is figure out how to make it predict better, not make it understand.
The problem with peanut gallery critics of Gen AI is that they keep getting lulled into thinking that their heuristic understanding of sequence transduction from 10 years ago means they are working with relevant, meaningful information about the modern day. They keep forgetting they don't know how Gen AI actually works. It's not predicting the "most likely next word." That would be an RNN from 2015. Modern LLMs like the one under discussion here use sophisticated attention mechanisms that allow them to weigh the importance of different parts of the input (non-linearly!) while considering long-range dependencies, coherence and complexity, and context. They're not merely "predicting the next word," but building rich, contextual representations of language, which they then work with to generate meaningful output.
Of course, they don't "understand" in the sense that we do. And in particular, Erik's use case above is a great example of this -- there is a subtly multimodal nature to his request, requiring a very distinctly human understanding of language (syllabic structure, which is related to biology, and the notion of "sounding out," which requires linking human voice and human hearing to our linguistic constructs). LLMs are still notoriously bad in this very specific domain, for very obvious reasons -- they do not have ears, tongues, glottises, etc.!
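Disingenuous word salad.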
Saying that transformers (LLMs) predict the next most likely word skips over some technicalities, but it is generally accurate. The attention mechanism is simply a part of how the next most likely word is chosen. It doesn't change the fact that text is generated sequentially, one token (word) at a time.
Moreover, no, transformers don't have anything that explicitly tracks "coherence and complexity". If someone bolts on additional logic during inference, that in no way changes the fundamental principle of how the model was trained and how it operates.
Text is not "generated" sequentially; it's generated non-linearly, then printed sequentially so we can read it. What do you expect? The words in the sentences to be output randomly until it's suddenly readable through diffusion?
Coherence and complexity are qualitative aspects of language; there is nothing in any algorithm, including our own biological ones, that "explicitly" tracks these things, unless we want to define "complexity" under some sort of lexicographic entropy.
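Exactly.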
Another thing to keep in mind is that it predicts linearly, token by token. So instructions like "double-check your work" are completely meaningless. Once the prediction is done, it's done; there's no rewind or rethink or considering the whole body of output carefully or anything like that. It only goes forward, previous tokens to next token.
Proximity also matters. Though the chatbots all keep a history of the conversation, that history is limited by its token limit (roughly its "memory") and the farther back in the chat history a reference is, the less weight the contents have in the prediction.
If you want it to retry, the best thing to do is copy the thing it screwed up verbatim and prompt it with something like, "rephrase the following text, but make <XYZ> adjustments:".
This is critically untrue, they absolutely do not predict linearly, and have not done so since the transformer architecture was introduced in 2017. I beg critics of this technology to please have even a passing understanding of how it works.
I have written my own LLM client software for GGUF models, and that is literally exactly what it does. I have more than a passing understanding of how it works. Everything that you talked about here and in the other comment below in regard to self-attention, is in service of next-token prediction. The tokens are produced one at a time.
The model can only work forward; the output is already produced by the time it moves on to the next token. It doesn't have any concept of the N+1 token when it's producing N, because N is a spread of weighted tokens selected by a PRNG, and until the token is selected it can't know what the following token should be.
If you're so inclined, you can print the tokens in real time as they generate - which I do, so that I can read the response without waiting for generation to complete, and abort if it isn't going in the direction I want it to.
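In rough, simplified Python, that kind of loop looks something like this (`model` is a hypothetical stand-in for the forward pass; real clients add top-k/top-p and other sampling details):

```python
import numpy as np

def sample_next_token(logits, temperature=None, rng=None):
    # No temperature: deterministic, take the single most likely token.
    if temperature is None or temperature == 0.0:
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()                 # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax -> probability spread
    return int(rng.choice(len(logits), p=probs))   # PRNG draw from that spread

def generate(model, prompt_ids, max_new_tokens=100, temperature=0.8):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                        # logits for the next position only
        token = sample_next_token(np.asarray(logits), temperature)
        ids.append(token)                          # accepted token is locked in
        print(token, end=" ", flush=True)          # stream it out as it's produced
    return ids
```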
I don't know what the fuck fantasy land you're in, but I beg stans of this technology to please have even a passing understanding of how it works.
Okay, then you are talking about the absolutely trivial mechanics of output, not the operation of the interpretive/predictive architecture, my bad. Yes, it does indeed write its output linearly one token at a time, sort of like how we write our words linearly from left to right, though I don't see how this has much relevance.
But with respect to the actual LLM architecture, you are still operating on an outdated and incorrect understanding. First of all, virtually none of them use PRNG, they use softmax probabilities, and they all have iterative refinement and lookahead capabilities that put the lie to your entire second paragraph. There are also beam search and other parallelization schemes that allow for multiple token stream continuations from multiple branch points, which are not collapsed until all are completed (again using softmax probabilities).
I get that I was being snarky, and I don't mean to discourage investigation and hobbyist exploration of this tech, but the "fantasy land" I'm living in is just literal LLM development.
The relevance is in planning and recursive improvement. When I write a sentence, I have the whole sentence in my head before I put it to text. I have a general idea of where I want the whole paragraph to go, and if it's a long narrative or essay, I've in mind an idea of its structure, how I will develop the theme, and where it will conclude.
As I write it, I review the output and sometimes make adjustments to previous output - rewriting words, sentences, or restructuring the whole thing. I've done so on this paragraph already numerous times (and now I come back and have split "this paragraph" into two).
These are all tasks that an LLM is incapable of. Once a token is accepted, it is done and the model moves forward to the next token. You can have other models judging the output and providing feedback as in e.g. speculative decoding, or you can use some other parallelization scheme as you mention (I am not familiar with others in any detail, and I don't have the compute to play with speculative decoding) but this doesn't change the model's operation.
You can look ahead X tokens and select from Y different possibilities generated in parallel, but once a token is produced it's locked in. The model can't subsequently rewind already-generated output. It can only go forward speculatively up to some limit, and select the first token in whatever branch is chosen, and it can only take into account what's within context window while doing it. So okay, you do that, and you infer N1-Nx tokens, then return it. Now that's locked in. Repeat this ten times and you've written a chapter or an essay or whatever, but the LLM at no point can go back to previously generated tokens and revise them. It is still forward-only. All you've accomplished is, effectively, selecting larger forward-blocks of output than a single token.
RE: token selection schemes, I don't know what you're doing, but in my code the sampling step goes to either argmax if there's no temperature parameter, or else softmax to yield a probability spread, which is then selected from by PRNG. Are you just sorting and taking the top token?
(now here I've stopped revising because my wife wants attention, so those last couple paragraphs are very LLM stream of consciousness style)
Sure, I think you're making fair points here! I don't think we are even disagreeing on anything all that fundamental, we just have a slightly different frame of reference, or we're talking about a slightly different control surface.
Of course I agree that at the end of a given inference cycle, that output is "locked in." But to me, this is analogous to a sentence actually being written on paper. Even if your thoughts about it have changed, it is in effect "too late" because you've already written it down.
What I am referring to is the actual process of generating the final token stream—this is something that occurs non-linearly, with plenty of lookahead, backpropagation, parallelization, and so on. Put differently, if what you were saying were true within the control surface that I am talking about, then we would have perfectly explainable AI, perfectly explainable deep learning. This is obviously not the case.
As for the fact that you cannot "rewind" already-generated output, I would say we have plenty of approaches no different from the way we would use an eraser to rewrite a written sentence or give our written sentence to someone else to criticize before updating it. RAG methods would be one way, but even just an interactive prompt interface like ChatGPT in which you simply request an update to the previous output achieves the same effect.
Does this make sense to you as well? Can you agree that our disagreement boils down to a frame of reference issue?
Firstly, I would like to see some references demonstrating that Claude Sonnet uses "iterative refinement and lookahead capabilities".
Secondly, this comment pretends that these terms are standard terms of art in AI. I don't think they really are. There are some research papers from last year that use them to describe a particular technique applied at inference time, but that's about it.
Thirdly, if those terms mean what they seem to mean, both of them reference, essentially, inference-time hacks. In no way do they change how the underlying transformer model is trained and how it actually functions. Which is what the original comment re "prediction machine" was about.
Before going any further you need to define what you even mean by "predicting the next word," and how that differs from whatever you imagine we are doing when we create sentences ourselves. The token streams output by modern LLMs are in no way created in an ordered, linear fashion one token at a time. The fact that what we end up with is an ordered, linear sentence does not in any way evince this.
Even parallelization alone with beam search, having multiple token streams generating simultaneously from various branch points and comparing them against one another in real-time, already defeats the simplistic premise of a "next-word predictor." This is typical motte-and-bailey argumentation, where initially you try to have people believe that their operation is akin to rudimentary RNN sequence transduction, then when someone actually corrects you suddenly it's all about technicalities and nuance and "well-ackchually"-ing your way into still defining it as a token prediction machine.
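It reminds me how people also project humanity onto animals.

I blame Aesop.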
Asking the model to produce words with a subset of characters / sounds is hard for modern LLMs for technical reasons related to sub-word tokenization.
But, this is, of course, firstly a nitpick, and second, it is not something that you, as a user, should have to worry about.
In general, I wholeheartedly agree with your assessment. Now and then I ask the model to fix something, or look for an error in the code I program, but this has never led to a satisfying answer for me. It appears very hard to go from a "good" model to one that is practically reliable in ways that matter to us.
Agreed, there's an odd testing/usage gap that makes me more skeptical of things like benchmark scores.
Regarding tokenization, I had a little section on it but I cut it because, while I've seen a little evidence that some of these sorts of problems *might* be caused by tokenization, AFAIK no large-scale frontier model has ever been trained with some far finer-grained tokenization scheme (or token-free), because that's essentially impossible. I think a few have for numbers specifically, but I'm not sure that actually solved all the issues around arithmetic. So it's a common claim to blame tokenization, but I do think sometimes it's an excuse, because we don't actually know for sure; it just makes sense as a possible explanation. Additionally, there can be trade-offs, e.g., other tokenization schemes might have other effects, just different ones. Sometimes you can do workarounds and create a prompt that the AI can parse in terms of tokens, but then sometimes it still fails. There are also other things Sonnet fails on, like simple logic puzzles involving feathers and bricks (although I've never seen anything as reliable as this). Even if tokenization does contribute in a major way, I'm not sure tokenization alone explains it - I've had other Sonnet failures on other weird tasks I've asked it for, and usually it's just for something that isn't in the training set (like this one). I think asking it questions about phonics is some combination of the two that makes Sonnet absolutely terrible. From a practical perspective, though, IMO it doesn't really matter, in the sense that right now these things are not capable of being deployed for preschooling, and my guess is it would be years before they can get phonics right.
This seems right. I’d expect that at every point in their evolution, LLMs will remain relatively worse at questions requiring lexical unit decomposition than at other tasks.
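Just a note of interest: I don't know for certain, but Llama was supposedly tokenized for arithmetic at an individual level. https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/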
And yet, at least for me, it appears to still get larger numbers wrong (if I'm testing in the right place, I've never used Llama before). It's definitely *better* but it still gets bigger numbers wrong, making me think tokenization might not always be the fundamental problem as much as it merely amplifies the issues that do exist. But I'd need to double-check all that to be sure.
So glad to see that you are finally coming down from the peak of AI hype. There is still zero evidence of "learning" that goes from one domain to another. If "AI" and "machine learning" had been called from the beginning what they actually are, which is newer fancier search algorithms, far less attention would have been (rightfully) paid to them.
And LA Unified gets scammed again! Will those folks never learn? Maybe they actually could be replaced with AI ...
> Regarding tokenization, I had a little section on it but I cut it because, while I've seen a little evidence that some of these sort of problems *might* be caused from tokenization
It is easy to demonstrate that many of these problems are caused by tokenization by forcing a different tokenization. For example, "strawberry" tokenizes to god-knows-what, but you can force retokenization as individual letters by adding spaces, like 's t r a w b e r r y', and then an LLM is much more capable of answering questions like 'reverse the word' or 'how many letter rs are there in it'.
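A quick way to see the difference is to compare how a BPE tokenizer splits the two forms. This sketch uses OpenAI's open-source tiktoken library purely as a stand-in (Claude's tokenizer differs, but the sub-word principle is the same):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a common BPE vocabulary

for text in ["strawberry", "s t r a w b e r r y"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")
# The spaced-out version decomposes into roughly per-letter tokens,
# so letter-level questions become answerable from the tokens themselves.
```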
I would point out that I've been pointing this out for 4 years now (https://gwern.net/gpt-3#bpes), and you should know all about BPEs at this point if you are going to waste time on spelling/phonetics tasks you already know will be sabotaged by BPEs... There is also rigorous evidence from image generation that the tokenization of the LLM part makes a huge difference to their ability to understand & generate text: https://arxiv.org/abs/2212.10562#google
In the case of phonetics, rather than spelling, it's not easy to retokenize, because the knowledge is missing. (Maybe. I didn't get good results out of IPA encoding back in 2020, but LLM capabilities have increased so much that many things now work that didn't in 2020, and IPA might be one of them.)
If you want constrained text generation, like if you were trying to automatically generate Dr Seuss poems, this is a case where sampling constraint approaches can be very useful, to enforce a 'grammar' or 'rules' (I assume you can find a library to analyze sampled words in terms of spelling/phonetics, and then you use that to reject generations which violate the requirements), like https://arxiv.org/abs/2306.03081 https://arxiv.org/abs/2307.09702 https://arxiv.org/abs/2404.09562
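The simplest version is whole-output rejection, sketched here; `ask_llm` and `violates_requirements` are hypothetical placeholders for the model call and for whatever spelling/phonetics analyzer you can find (the linked papers do something smarter, constraining at the token level during sampling):

```python
def constrained_story(prompt, max_tries=20):
    # Generate, check every word against the phonics rules, and retry
    # until a generation passes or the retry budget runs out.
    for _ in range(max_tries):
        story = ask_llm(prompt)                          # hypothetical model call
        bad_words = [w for w in story.split() if violates_requirements(w)]
        if not bad_words:
            return story
    raise RuntimeError("no acceptable story within the retry budget")
```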
> Llama was supposedly tokenized for arithmetic at an individual level.
Tokenization is a big problem but big-endian number encoding is, unrelatedly, also a poor fit for autoregressive calculation (when you sum or multiply or divide on a piece of paper, where do you start? you start with the last and smallest digits, not the first and biggest...); if you fix both, then you can train much better arithmetic: https://arxiv.org/abs/2307.03381
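A toy illustration of the little-endian point: write each number least-significant digit first, the same order you work in when adding on paper, so the model can emit the answer's digits in the order it can actually compute them:

```python
def little_endian(n: int) -> str:
    # 127 -> "7 2 1"
    return " ".join(reversed(str(n)))

print(little_endian(127), "+", little_endian(384), "=", little_endian(511))
# -> 7 2 1 + 4 8 3 = 1 1 5
```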
I don’t think it’s a nitpick; it’s important to understand that this is a task that’s unusually difficult for LLMs because of how they represent words. Whether or not a user “should” have to worry about it, in practice it’s incumbent on technology users to understand where the technology is and is not effective, and use the right tool for the right situation rather than concluding that a tool is bad because it doesn’t do everything you expected.
I agree. Tokenization is important here, as this article narrowly focuses on a particular type of task AI does extremely poorly: analyzing language at a level more granular than a single word. "Presumably, steps after reading get harder, not easier" - for today's AI, I think tutoring actually becomes much easier beyond this level.
Can you try again, but with the inclusion of short stories that you've crafted yourself in the prompt? I remember reading a paper recently about how good it is at 'in-context learning' (learning from given examples), becoming really good at estimating the given task function, especially as the number of examples increases (10, 20, 30, etc.).
That's interesting. When I give it a few example sentences, it doesn't help. Nor does "think step by step" or "double-check your work" etc. I haven't tried giving 30 examples, but one worry with giving 30 examples is that at some point the solutions might trend toward becoming remixes. E.g., if I give it "Bob sat up" and "Cat slips in mud," it might put together "Cat sat up," but it's basically just remixing. I feel this way about a lot of such prompt wizardry, which has overall made me skeptical about very detailed prompts overcoming LLM issues, since it's hard to separate "more context" from "I've given you so much context you can just remix what I've given you and get good results."
Thanks. I have a lot of trouble getting a good grasp on what the AI can be good at and what it'd be bad at. It seems really good at things like object recognition, segmentation, voice cloning, face generation, programming etc., and abysmally bad at things you'd expect it to be good at.
For example, at a recent hackathon I wanted to see if I could record myself doing a task in my browser (recording a video of me doing it while narrating what I was doing) and have the AI repeat it in slightly different contexts, and it turned out to be surprisingly bad at it.
No dispute on the content here, but one area where I've seen incredible real-world help is coding. Anymore, I *always* have ChatGPT open when I'm writing Python. The code often needs minor surgery, but I use it for very complex tasks.
With slightly less-used languages, though, it's worthless (Julia, for instance).
Definitely agree. I will say that in my experience, as someone who did a lot of python coding in graduate school (probably not very well) I found that a large chunk of coding was basically stitching together modular functions that essentially already existed. That could have just been my own lack of expertise, certainly, but at least the way that I would do it I can definitely see how something like Sonnet would help if only because most tasks I was stumped by weren't really original, they were things that had answers online and it was just a matter of finding them and slotting them in. Its modularity made it much more amenable to that strategy, so I've been a bit unsurprised at the success of frontier models for coding.
It is starting to feel like the hype is hitting an inflection point, at least where education is concerned. The news about Ed, the LAUSD's chatbot, didn't receive much coverage when The74 first broke the story, but I know at least two reporters who are digging in now. The big tech companies are trying to sell AI services to school districts and colleges, but at a price that I don't think many schools can afford.
Your essay points to what I suspect will be the real driver of disillusionment with generative AI. It just isn’t that good at real educational tasks. It is fun to play with and makes for great demos, but when teachers try to use it for lesson plans or to create problem sets it kinda sucks. When more teachers and parents try to use it to do actual work, the hype balloon will deflate.
True. And my suspicion too is that a lot of these "AI-driven apps" are basically the same sort of learning apps as always (which have had only partial success) but with "AI" thrown into the marketing. As usual, the frontier models are the best all-around at everything, and so if they can't do it, I highly doubt some specific internally developed AI actually can (and still, say, respond to prompts conversationally).
Note, however, that The Young Lady's Illustrated Primer only worked for Nell because it was backed up by a full-time, loving mother figure, Miranda. The Primer had to have human backup to work at all. Even with human backup, the other two Primers had no such spectacular results, since the human backup was just paid actors going through the motions. Nell's success is not a tale of successful AI, it is a tale of the blazing power of maternal love facilitated by AI.
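Andy Matuschak has some interesting comments as a designer & working on education tools about the siren song of the Primer: https://andymatuschak.org/primer/ https://notes.andymatuschak.org/z9R3ho4NmDFScAohj3J8J3Y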
Good piece from Andy, thanks for the link. I don't entirely agree with his assessment of the Primer, or what it does, or how much flexibility Miranda had in her interactions with Nell. Also, the Primer taught Nell how to fit in with people from a higher social class, which is a major obstacle for many people. And did the Primer bring her to Constable Moore, her IRL mentor, who taught her all-important lessons about toughness, and fighting, and overcoming her traumatic childhood experiences? Also, he leaves out the fact that Nell's "rogue" Primer, that Lord Finkel-McGraw apparently permitted to go on as an experiment, turned her into the literal founder of a kingdom, a massive achievement. Nonetheless, a good article, and The Diamond Age is an inexhaustible classic.
I have been excited by the possibility of using AI to generate drill for my homeschooled children and have yet to be impressed. But the best decodable readers are the Julia Donaldson Songbirds series (produced for Oxford Reading Tree in the UK and available on eBay), which are the only basic readers that are actually witty. She wrote the Gruffalo and is a talented writer. If you need more practice than that (most kids will), the American Language Series of readers by Guyla Nelson are inexpensive and provide 80 pages of short stories for each phonics skill.
The economics wouldn't make sense even if the models were more competent. Have we forgotten that we were promised an education revolution in the 2010s with MOOCs and it never transpired? The problem wasn't tech or data - we did (and do) live at a time when thousands of people are willing to create passable educational resources for next to nothing, since we are all desperate grifters now - but the world largely didn't care. It's not how people want to learn, or not what they want to pay for when a subhuman learning experience can already be had for free with ads.
I've seen a few AI boosters pushing the education angle lately and it smacks of a total failure to find a convincing product use case for the tech. Starved of innovative ideas, all they're asking is: what written materials can we plagiarise without too many quality issues? But whatever the performance of the models, the gig economy will have already filled the space with cheaper, better human labour.
You could probably accomplish most of this with a small enhancement to LLMs to only accept completions that are accepted by a context-free grammar: https://matt-rickard.com/context-free-grammar-parsing-with-llms. In this case even regular expressions will work I think. That way, you can restrict it to avoid generating words with digraphs. However, it’s probably not 100% foolproof, and would probably require some tweaking of the grammar. It’s definitely not as easy as “type prompt, get back good results” - it shows that LLMs are tools, not some incomprehensible intelligence.
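For instance, a minimal sketch of the regular-expression version of that filter; the taught-letter set here is just an illustrative assumption, and the digraph list would need tweaking for a real phonics sequence:

```python
import re

TAUGHT_LETTERS = "satpinmdgocke"                    # example set only
DIGRAPHS = re.compile(r"sh|ch|th|wh|ck|ng|qu")
ALLOWED = re.compile(rf"^[{TAUGHT_LETTERS}]+$")

def decodable(word: str) -> bool:
    # Keep only words built from taught single-letter sounds, with no digraphs.
    w = word.lower()
    return bool(ALLOWED.match(w)) and not DIGRAPHS.search(w)

print([w for w in ["sat", "chip", "dog", "sock", "pin"] if decodable(w)])
# -> ['sat', 'dog', 'pin']
```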
I'm not sure how capable Sonnet is with Python, relative to GPT4o, but I wonder whether asking it to use Python to generate these word sequences would yield better results.
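Edit: I just did this, through you.com's interface. Here are the results: https://you.com/search?q=I+want+you+to+use+Python+to+do+the+following.+I+am+teaching+a+child+to+read.+He+is+learning+via...&cid=c1_25bc2ba9-1a4b-4a41-9345-550b12dc4dcb&tbm=youchat .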
The problem with this output is that the stories, such as they are, are not grammatical. Which is obviously not optimal for a young child learning to read. I am not knowledgeable enough about Python to know whether a Python script's output can be forced to be grammatical, but it sounds like an interesting exercise for someone conversant in Python and AI prompting.
Thanks. Yeah, very interesting that they aren't grammatical. There are still a few mistakes, like "barn" (the "ar" is pronounced like saying the letter "r" and is different from sounding out "a" and "r" respectively).
As the spouse of an early childhood (pre-school) special education teacher, I can tell you that the number one concern in her corner of education has shifted dramatically from teaching academic skills to social and emotional readiness. Can AI help teach children to read? Maybe. But teaching children to read is easy compared with preparing students to be in a classroom (or life) setting where they're asked to work respectfully alongside peers, care for others, and respect personal boundaries. This, to me, is what society is dealing with at large. You can have all the information you want, and AI can help with that, but if you don't have emotional intelligence, a sense of community, and the skills necessary to successfully navigate life with others, we will not have civilization.
I’m a HS Orchestra Director… it is impossible for me to even conceive the idea of an AI being able to adjust to the minute differences and individuality in every learner's style, content perception, and challenges, never mind assessing and correcting misunderstandings or errors in process alone… even (or I should say, especially) for a 3 yo's reading lessons.
One of the critiques of AI is that it doesn't have enough human context, all it has is language, and mostly in text format, and some images. You point out that it doesn't have a mouth, but why can't it be trained with audio recordings of words? How does the LLM work for visual images (I mean not how does it work to perform tasks, but how does the attention/transformation/optimization algorithm work)?
There is another type of context it doesn't yet have, which is INTERNAL context. As in parts that themselves have been trained to do subtasks and are selected for by the AI itself instead of the training set (which acts as an EXTERNAL environment). And maybe several levels of parts and sub-parts that are all selected by their respective higher levels. This is different than the levels used to create network depth, I think.
How much prior art is around on the type of requests you have been making? Is the internet full of this kind of "beginner-friendly" sentences, or is this something no one before you tried to do? I have a hunch that easy/hard for AI tracks a lot less with easy/hard for humans than it does with well-precedented/novel with respect to the training data.
One of the main messages from my series on teaching early reading is that getting to reading such sentences should be the first main goal of teaching reading, and yet I can't find a single, say, book series or anything like that that focuses on them. I'm unsure as to why, but it's been frustrating. That is to say, I think such "beginner-friendly" sentences are indeed oddly rare, and so that's one reason this might be difficult, just as you say.
Good recommendation, they're a good resource. I have the Bob books and we now use them a lot more (he's already more advanced than when I was doing the sort of lessons described here). But I had a few bones to pick with them early on, just in that even the very first "Level zero" (I think that's what it's called) actually did sometimes just have sounds that simply aren't that common. Also the text is tiny and it's hard for a toddler to focus on tiny text. I beg early reader books - make the font big! Additionally, I found that by making up my own stories I could work on pronunciation, e.g., if he was struggling with the "r" sound we could do lots of words with r's, like "Rat ran on the rug" or something of that nature.
Aha, so I guessed right :) AI is stumbling on unfamiliar ground once again, as it does with simple maths exercises on advanced topics even though it can perfectly reproduce all relevant definitions.
Slightly tangentially, though: Are you sure you want to avoid occasional nonstandard phonemes at all costs, as opposed to carefully pointing them out and treating them as "extra points" exercises? This way you'll have to avoid the words "eat" and "with" for a while, and the word "night" for most of your kid's phonics sequence, which will limit the semantic content of their reading quite significantly. From what I hear, it is hard enough to find kid-friendly reading in the West that is not aggressively infantilizing, and I fear adding this kind of "A Void"-like restriction will make it near-impossible.
Yet another tangent: I have recently been going through one of the less well-maintained languages on Duolingo (Swedish), and I found a similar problem with badly generalizing examples. The worst one was introducing the word "around" through the sentence "They are driving around town" (and it's teaching languages based on English, so you're seeing this sentence written like it is here). Fortunately I knew how to google, use wiktionary *and* google translate (they each have their own weaknesses) and even google books when things got really weird. In the end I don't think these exceptions have hurt me, and I even remember a few words *better* as my confusions and their subsequent resolutions anchored them in my memory. (Duolingo has lots of other weaknesses, which are much more substantial, so I can nevertheless not recommend it.)
@Erik, have you looked at coach.microsoft.com? It won't be able to meet the quite specific requirements you are asking for; however, it does seem a potentially useful tool to aid the development of reading skills. Having said that, I've not tried it with my four-year-old. We are sticking with paper and physical media for now. Screens seem to carry an addiction that bleeds into other uses, even when applied in a specific use case.
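If you used the word “tech” instead of “Sonnet”, (“tech must help…”), you would have had a haiku! 🤣

The "ch" is so confusing to teach... in "tech" it's a "k" sound but in "teach" it's the more common one.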
The state we are in right now is what Ethan Mollick has called the "jagged frontier", where the large-language models are substantially better than the average human at many tasks, and substantially worse than the average human at many others, and you cannot tell without experimentation.
I'm glad you are actually using an AI system to try (if unsuccessfully) a use case. And I'm especially glad to see that you tried multiple prompts. Too much AI scorn and hype comes from single attempts, without either experimentation to improve performance, or repeated trials to see whether the result can be achieved consistently. It's very much an in progress technology; I wouldn't want to make a substantial bet on what it can or cannot do 18 months from now.
The problem with Gen AI is that people keep getting lulled into thinking they are working with a thinking, knowledgeable system. They keep forgetting how Gen AI works. It's only a prediction machine. It's predicting what based on its algorithms the most likely next word. When it gets it right or close to right we are amazed. When it gets it wrong we call it an "hallucination". In reality it's always hallucinating. We just like the results a lot of times. We have to keep it mind that even though it uses words to try and reassure us it does not truly understand. It's just predicting and repeating based on what has been previously written. So what good prompters do is figure out how to make it predict better, not make it understand.
The problem with peanut gallery critics of Gen AI is that they keep getting lulled into thinking that their heuristic understanding of sequence transduction from 10 years ago means they are working with relevant, meaningful information about the modern day. They keep forgetting they don't know how Gen AI actually works. It's not predicting the "most likely next word." That would be an RNN from 2015. Modern LLMs like the one under discussion here use sophisticated attention mechanisms that allow them to weigh the importance of different parts of the input (non-linearly!) while considering long-range dependencies, coherence and complexity, and context. They're not merely "predicting the next word," but building rich, contextual representations of language, which they then work with to generate meaningful output.
Of course, they don't "understand" in the sense that we do. And in particular, Erik's use case above is a great example of this -- there is a subtly multimodal nature to his request, requiring a very distinctly human understanding of language (syllabic structure, which is related to biology, and the notion of "sounding out," which requires linking human voice and human hearing to our linguistic constructs). LLMs are still notoriously bad in this very specific domain, for very obvious reasons -- they do not have ears, tongues, glottises, etc.!
Disingenuous word salad.
Saying that transformers (LLMs) predict the next most likely word skips over some technicalities, but it is generally accurate. The attention mechanism is simply a part of how the next most likely word is chosen. It doesn't change the fact that text is generated sequentially, one token (word) at a time.
Moreover, no, transformers don't have anything that explicitly tracks "coherence and complexity". If someone bolts on additional logic during inference, that in no way changes the fundamental principle of how the model was trained and how it operates.
Text is not "generated" sequentially, it's generated non-linearly, then it's printed sequentially so we can read it. What do you expect? The words in the sentences to be output randomly until it's suddenly readable through diffusion?
Coherence and complexity are qualitative aspects of language, there is nothing in any algorithm, including our own biological ones, that "explicitly" track these things, unless we want to define "complexity" under some sort of lexicographic entropy.
Exactly.
Another thing to keep in mind is that it predicts linearly, token by token. So instructions like "double-check your work" are completely meaningless. Once the prediction is done, it's done; there's no rewind or rethink or considering the whole body of output carefully or anything like that. It only goes forward, previous tokens to next token.
Proximity also matters. Though the chatbots all keep a history of the conversation, that history is limited by its token limit (roughly its "memory") and the farther back in the chat history a reference is, the less weight the contents have in the prediction.
If you want it to retry, the best thing to do is copy the thing it screwed up verbatim and prompt it with something like, "rephrase the following text, but make <XYZ> adjustments:".
This is critically untrue, they absolutely do not predict linearly, and have not done so since the transformer architecture was introduced in 2017. I beg critics of this technology to please have even a passing understanding of how it works.
I have written my own LLM client software for GGUF models, and that is literally exactly what it does. I have more than a passing understanding of how it works. Everything that you talked about here and in the other comment below in regard to self-attention, is in service of next-token prediction. The tokens are produced one at a time.
The model can only work forward; the output is already produced by the time it moves on to the next token. It doesn't have any concept of the N+1 token when it's producing N, because N is a spread of weighted tokens selected by a PRNG, and until the token is selected it can't know what the following token should be.
If you're so-inclined, you can print the tokens in real time as they generate - which I do, so that I can read the response without waiting for generation to complete, and abort if it isn't going in the direction I want it to.
I don't know what the fuck fantasy land you're in, but I beg stans of this technology to please have even a passing understanding of how it works.
Okay, then you are talking about the absolutely trivial mechanics of output, not the operation of the interpretive/predictive architecture, my bad. Yes, it does indeed write its output linearly one token at a time, sort of like how we write our words linearly from left to right, though I don't see how this has much relevance.
But with respect to the actual LLM architecture, you are still operating on an outdated and incorrect understanding. First of all, virtually none of them use PRNG, they use softmax probabilities, and they all have iterative refinement and lookahead capabilities that put the lie to your entire second paragraph. There are also beam search and other parallelization schemes that allow for multiple token stream continuations from multiple branch points, which are not collapsed until all are completed (again using softmax probabilities).
I get that I was being snarky, and I don't mean to discourage investigation and hobbyist exploration of this tech, but the "fantasy land" I'm living in is just literal LLM development
The relevance is in planning and recursive improvement. When I write a sentence, I have the whole sentence in my head before I put it to text. I have a general idea of where I want the whole paragraph to go, and if it's a long narrative or essay, I've in mind an idea of its structure, how I will develop the theme, and where it will conclude.
As I write it, I review the output and sometimes make adjustments to previous output - rewriting words, sentences, or restructuring the whole thing. I've done so on this paragraph already numerous times (and now I come back and have split "this paragraph" into two).
These are all tasks that an LLM is incapable of. Once a token is accepted, it is done and the model moves forward to the next token. You can have other models judging the output and providing feedback as in e.g. speculative decoding, or you can use some other parallelization scheme as you mention (I am not familiar with others in any detail, and I don't have the compute to play with speculative decoding) but this doesn't change the model's operation.
You can look ahead X tokens and select from Y different possibilities generated in parallel, but once a token is produced it's locked in. The model can't subsequently rewind already-generated output. It can only go forward speculatively up to some limit, and select the first token in whatever branch is chosen, and it can only take into account what's within context window while doing it. So okay, you do that, and you infer N1-Nx tokens, then return it. Now that's locked in. Repeat this ten times and you've written a chapter or an essay or whatever, but the LLM at no point can go back to previously generated tokens and revise them. It is still forward-only. All you've accomplished is, effectively, selecting larger forward-blocks of output than a single token.
RE: token selection schemes, I don't know what you're doing, but in my code I see sample going to either argmax if no temperature parameter, or else softmax to yield a probability spread, which is then selected from by PRNG. Are you just sorting and taking the top token?
(now here I've stopped revising because my wife wants attention, so those last couple paragraphs are very LLM stream of consciousness style)
Sure, I think you're making fair points here! I don't think we are even disagreeing on anything all that fundamental, we just have a slightly different frame of reference, or we're talking about a slightly different control surface.
Of course I agree that at the end of a given inference cycle, that output is "locked in." But to me, this is analogous to a sentence actually being written on paper. Even if your thoughts about it have changed, it is in effect "too late" because you've already written it down.
What I am referring to is the actual process of generating the final token stream—this is something that occurs non-linearly, with plenty of lookahead, backpropagation, parallelization, and so on. Put differently, if what you were saying were true within the control surface that I am talking about, then we would have perfectly explainable AI, perfectly explainable deep learning. This is obviously not the case.
As for the fact that you cannot "rewind" already-generated output, I would say we have plenty of approaches no different from the way we would use an eraser to rewrite a written sentence or give our written sentence to someone else to criticize before updating it. RAG methods would be one way, but even just an interactive prompt interface like ChatGPT in which you simply request an update to the previous output achieves the same effect.
Does this make sense to you as well? Can you agree that our disagreement boils down to a frame of reference issue?
Firstly, I would like to see some references demonstrating that Claude Sonnet uses "iterative refinement and lookahead capabilities".
Secondly, this comment pretends that these terms are standard terms of art in AI. I don't think they really are. There are some research papers from last year that use them to describe a particular technique applied at inference time, but that's about it.
Thirdly, if those terms mean what they seem to mean both of them refence, essentially, inference-time hacks. In no way do they change how the underlying transformer model is trained and how it actually functions. Which is what the original comment re "prediction machine" was about.
Before going any further you need to define what you even mean by "predicting the next word," and how that differs from whatever you imagine we are doing when we create sentences ourselves. The token streams output by modern LLMs are in no way created in an ordered, linear fashion one token at a time. The fact that what we end up with is an ordered, linear sentence does not in any way evince this.
Even parallelization alone with beam search, having multiple token streams generating simultaneously from various branch points and comparing them against one another in real-time, already defeats the simplistic premise of a "next-word predictor." This is typical motte-and-bailey argumentation, where initially you try to have people believe that their operation is akin to rudimentary RNN sequence transduction, then when someone actually corrects you suddenly it's all about technicalities and nuance and "well-ackchually"-ing your way into still defining it as a token prediction machine.
It reminds me how people also project humanity onto animals.
I blame Aesop.
Asking the model to produce words with a subset of characters / sounds is hard for modern LLMs for technical reasons related to sub-word tokenization.
But, this is, of course, firstly a nitpick, and second, it is not something that you, as a user, should have to worry about.
In general, I wholeheartedly agree with your assessment. Now and then I ask the model to fix something, or look for an error in the code I program, but this has never lead to a satisfying answer for me. It appears very hard to go from a "good" model to one that is practically reliable in ways that matter to us.
Agreed, there's an odd testing/usage gap that makes me more skeptical of things like benchmark scores.
Regarding tokenization, I had a little section on it but I cut it because, while I've seen a little evidence that some of these sort of problems *might* be caused from tokenization, AFAIK no large-scale frontier model has ever been trained with some far finer-grained tokenization scheme (or token-free), because that's essentially impossible. I think a few have for numbers specifically, but I'm not sure that actually solved all the issues around arithmetic. So it's a common claim to blame tokenization but I do think sometimes it's an excuse, because we don't actually know for sure, it just makes sense as a possible explanation. Additionally, there can be trade-offs, e.g., other tokenization schemes might have other effects, just different ones. Sometimes, you can do work-arounds and create a prompt that the AI can parse in terms of tokens, but then sometimes it still fails. There are also other things Sonnet fails on, like simple logic puzzles involving feathers and bricks (although I've never seen anything as reliable as this). Even if tokenization does contribute in a major way, I'm not sure tokenization alone explains it - I've had other Sonnet failures over other weird tasks I've asked it for, and usually it's just for something that isn't in the training set (like this one). I think asking it questions about phonics is some combination of the two that makes Sonnet absolutely terrible. From a practical perspective though, imo it doesn't really matter, in the sense that right now these things are not capable of being deployed for preschooling and my guess is it would be years before they can get phonics right.
This seems right. I’d expect that at every point in their evolution, LLMs will remain relatively worse at questions requiring lexical unit decomposition than at other tasks.
Just a note of interest: I don't know for certain, but Llama was supposedly tokenized for arithmetic at an individual level. https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/
And yet, at least for me, it appears to still get larger numbers wrong (if I'm testing in the right place, I've never used Llama before). It's definitely *better* but it still gets bigger numbers wrong, making me think tokenization might not always be the fundamental problem as much as it merely amplifies the issues that do exist. But I'd need to double-check all that to be sure.
So glad to see that you are finally coming down from the peak of AI hype. There is still zero evidence of "learning" that goes from one domain to another. If "AI" and "machine learning" had been called from the beginning what they actually are, which is newer fancier search algorithms, far less attention would have been (rightfully) paid to them.
And LA Unified gets scammed again! Will those folks never learn? Maybe they actually could be replaced with AI ...
> Regarding tokenization, I had a little section on it but I cut it because, while I've seen a little evidence that some of these sort of problems *might* be caused from tokenization
It is easy to demonstrate that many of these problems are caused by tokenization by forcing a different tokenization. For example, "strawberry" tokenizes to god-knows-what, but you can force retokenization as individual letters by adding spaces, like 's t r a w b e r r y', and then a LLM is much more capable of answering questions like 'reverse the word' or 'how many letter rs are there in it'.
I would point out that I've been pointing this out for 4 years now (https://gwern.net/gpt-3#bpes), and you should know all about BPEs at this point if you are going to waste time on spelling/phonetics tasks you already know will be sabotaged by BPEs... There is also rigorous evidence from image generation that the tokenization of the LLM part makes a huge difference to their ability to understand & generate text: https://arxiv.org/abs/2212.10562#google
In the case of phonetics, rather than spelling, it's not easy to retokenize, because the knowledge is missing. (Maybe. I didn't get good results out of IPA encoding back in 2020, but LLM capabilities have increased so much that many things now work that didn't in 2020, and IPA might be one of them.)
If you want constrained text generation, like if you were trying to automatically generate Dr Seuss poems, this is a case where sampling constraint approaches can be very useful, to enforce a 'grammar' or 'rules' (I assume you can find a library to analyze sampled words in terms of spelling/phonetics, and then you use that to reject generations which violate the requirements), like https://arxiv.org/abs/2306.03081 https://arxiv.org/abs/2307.09702 https://arxiv.org/abs/2404.09562
> Llama was supposedly tokenized for arithmetic at an individual level.
Tokenization is a big problem but big-endian number encoding is, unrelatedly, also a poor fit for autoregressive calculation (when you sum or multiply or divide on a piece of paper, where do you start? you start with the last and smallest digits, not the first and biggest...); if you fix both, then you can train much better arithmetic: https://arxiv.org/abs/2307.03381
I don’t think it’s a nitpick; it’s important to understand that this is a task that’s unusually difficult for LLMs because of how they represent words. Whether or not a user “should” have to worry about it, in practice it’s incumbent on technology users to understand where the technology is and is not effective, and use the right tool for the right situation rather than concluding that a tool is bad because it doesn’t do everything you expected.
I agree. Tokenization is important here as this article narrowly focuses on a particular type of task AI does extremely poorly: analyzing language at a level more granular than a single word. "Presumably, steps after reading get harder, not easier" - for todays AI, I think tutoring actually becomes much easier beyond this level.
Can you try again but with the inclusion of short stories that you've crafted yourself in the prompt ? I remember reading a paper recently about how good it is at 'in-context learning' (learning from given examples), becoming really good at estimating the given task function, especially when the number of examples increase (10, 20, 30 etc).
That's interesting. When I give it a few example sentences it doesn't help. Nor "think step by step" or "double-check your work" etc. I haven't tried giving 30 examples, but one worry at some point of giving 30 examples is that solutions might trend to becoming re-mixes. E.g., if I give it "Bob sat up" and "Cat slips in mud," it might put together "Cat sat up" but it's basically just remixing. I feel this way about a lot of such prompt wizardry, which has overall made me skeptical about very detailed prompts overcoming LLM issues, since it's hard to separate out "more context" from "I've given you so much context you can just remix what I've given you and get good results."
Thanks. I have a lot of trouble getting a good grasp on what the AI can be good at and what it'd be bad at. It seems really good at things like object recognition, segmentation, voice cloning, face generation, programming etc., and abysmally bad at things you'd expect it to be good at.
For e.g., at a recent hackathon I wanted to see if I can record myself doing a task on my browser (recording a video of me doing it while narrating what I'm doing) and have the AI repeat it but in slightly different contexts and it turned out to be surprisingly bad at it.
No dispute on the content here, but one area I’ve seen incredible real world help is coding. Anymore I *always* have chatgpt open when I’m writing python. The code often needs minor surgery but I use it for very complex tasks.
Slightly less used languages and its worthless thougb (Julia for instance).
Definitely agree. I will say that in my experience, as someone who did a lot of python coding in graduate school (probably not very well) I found that a large chunk of coding was basically stitching together modular functions that essentially already existed. That could have just been my own lack of expertise, certainly, but at least the way that I would do it I can definitely see how something like Sonnet would help if only because most tasks I was stumped by weren't really original, they were things that had answers online and it was just a matter of finding them and slotting them in. Its modularity made it much more amenable to that strategy, so I've been a bit unsurprised at the success of frontier models for coding.
It is starting feel like the hype is hitting an inflection point, at least where education is concerned. The news about Ed, the LAUSD’s ChatBot, didn’t receive much coverage when The74 first broke the story, but I know at least two reporters who are digging in now. The big tech companies are trying to sell AI services to school districts and colleges but at a price that I don’t think many schools can afford.
Your essay points to what I suspect will be the real driver of disillusionment with generative AI. It just isn’t that good at real educational tasks. It is fun to play with and makes for great demos, but when teachers try to use it for lesson plans or to create problem sets it kinda sucks. When more teachers and parents try to use it to do actual work, the hype balloon will deflate.
True. And my suspicion too is that a lot of these "AI-driven apps" are basically the same sort of learning apps as always (which have had only partial success) but with "AI" thrown in to the marketing. As usual, the frontier models are the best all-around at everything, and so if they can't do it, I highly doubt some specific internally-developed AI actually can (and still, say, respond to prompts conversationally).
Note, however, that The Young Lady's Illustrated Primer only worked for Nell because it was backed up by a full-time, loving mother figure, Miranda. The Primer had to have human backup to work at all. Even with human backup, the other two Primers had no such spectacular results, since the human backup was just paid actors going through the motions. Nell's success is not a tale of successful AI, it is a tale of the blazing power of maternal love facilitated by AI.
Andy Matuschak has some interesting comments as a designer & working on education tools about the siren song of the Primer: https://andymatuschak.org/primer/ https://notes.andymatuschak.org/z9R3ho4NmDFScAohj3J8J3Y
Good piece from Andy, thanks for the link. I don't entirely agree with his assessment of the Primer, or what it does, or how much flexibility Miranda had in her interactions with Nell. Also, the Primer taught Nell how to fit in with people from a higher social class, which is a major obstacle for many people. And did the Primer bring her to Constable Moore, her IRL mentor, who taught her all-important lessons about toughness, and fighting, and overcoming her traumatic childhood experiences? Also, he leaves out the fact that Nell's "rogue" Primer, that Lord Finkel-McGraw apparently permitted to go on as an experiment, turned her into the literal founder of a kingdom, a massive achievement. Nonetheless, a good article, and The Diamond Age is an inexhaustible classic.
I have been excited by the possibility of AI to generating drill for my homeschooled children and have yet to be impressed. But the best decodable readers are the Julia Donaldson Songbirds series (produced for Oxford Reading Tree in the UK and available on eBay), which are the only basic readers that are actually witty. She wrote the Gruffalo and is a talented writer. If you need more practice than that (most kids will) the American Language Series of readers by Guyla Nelson are inexpensive and provide 80 pages of short stories for each phonics skill.
The economics wouldn't make sense even if the models were more competent. Have we forgotten that we were promised an education revolution in the 2010s with MOOCs and it never transpired? The problem wasn't tech or data - we did (and do) live at a time when thousands of people are willing to create passable educational resources for next to nothing, since we are all desperate grifters now - but the world largely didn't care. It's not how people want to learn, or not what they want to pay for when a subhuman learning experience can already be had for free with ads.
I've seen a few AI boosters pushing the education angle lately and it smacks of a total failure to find a convincing product use case for the tech. Starved of innovative ideas, all they're asking is: what written materials can we plagiarise without too many quality issues? But whatever the performance of the models, the gig economy will have already filled the space with cheaper, better human labour.
You could probably accomplish most of this with a small enhancement to LLMs so that they only emit completions accepted by a context-free grammar: https://matt-rickard.com/context-free-grammar-parsing-with-llms. In this case I think even regular expressions would work. That way, you can restrict it to avoid generating words with digraphs. However, it's probably not 100% foolproof, and would probably require some tweaking of the grammar. It's definitely not as easy as "type prompt, get back good results" - it shows that LLMs are tools, not some incomprehensible intelligence.
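Roughly what that could look like, as a toy sketch: the pattern below is my own guess at what counts as "decodable" (short-vowel V/VC/CVC words, no digraphs or r-controlled vowels), and it filters whole candidate sentences after the fact rather than constraining the model token by token the way the linked post does.

```python
import re

# Accept only words a beginner can sound out letter by letter:
# a lone vowel ("a"), VC ("on", "up"), or CVC ("cat", "mat").
DECODABLE = re.compile(r"^(?:[aeiou]|[bcdfghjklmnprstvwz]?[aeiou][bdfglmnprstxz])$")

# Letter pairs that break the one-letter-one-sound rule (digraphs,
# r-controlled vowels, vowel teams), rejected outright.
BANNED = ("ch", "sh", "th", "wh", "ck", "ar", "er", "ir", "or", "ur",
          "ee", "ea", "oo", "ay", "igh")

def is_decodable(word: str) -> bool:
    w = word.lower()
    return bool(DECODABLE.match(w)) and not any(pair in w for pair in BANNED)

def accept(sentence: str) -> bool:
    """Keep a candidate sentence only if every word in it is decodable."""
    return all(is_decodable(w) for w in re.findall(r"[a-zA-Z]+", sentence))

# `candidates` stands in for raw model output.
candidates = ["Sam sat on a mat.", "The cat ran to the barn.", "A big dog ran up."]
print([s for s in candidates if accept(s)])
# -> ['Sam sat on a mat.', 'A big dog ran up.']
```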
I'm not sure how capable Sonnet is with Python, relative to GPT4o, but I wonder whether asking it to use Python to generate these word sequences would yield better results.
Edit: I just did this, through you.com's interface. Here are the results: https://you.com/search?q=I+want+you+to+use+Python+to+do+the+following.+I+am+teaching+a+child+to+read.+He+is+learning+via...&cid=c1_25bc2ba9-1a4b-4a41-9345-550b12dc4dcb&tbm=youchat .
The problem with this output is that the stories, such as they are, are not grammatical. Which is obviously not optimal for a young child learning to read. I am not knowledgeable enough about Python to know whether a Python script's output can be forced to be grammatical, but it sounds like an interesting exercise for someone conversant in Python and AI prompting.
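For what it's worth, one way a plain Python script can guarantee grammatical output is to skip free generation entirely and fill fixed sentence templates from hand-checked word lists. A toy sketch (the word lists and templates below are my own assumptions, not anything from the thread): since every template is already a complete grammatical sentence built from decodable words, anything slotted into it stays grammatical and decodable too.

```python
import random

# Hand-checked decodable word lists (all short-vowel CVC words); swap in
# whatever the current lesson allows.
NOUNS = ["cat", "dog", "pig", "hen", "kid", "rat", "bug"]
VERBS = ["sat", "ran", "hid", "dug"]          # simple past, so they fit any subject
PLACES = ["mat", "rug", "bed", "log", "box", "tub"]

# Every template is a full grammatical sentence; only "a", "and", "in", "on"
# are added, and those are all sound-out-able.
TEMPLATES = [
    "A {noun} {verb} on a {place}.",
    "A {noun} {verb} in a {place}.",
    "A {noun} and a {noun2} {verb} on a {place}.",
]

def make_sentence(rng: random.Random) -> str:
    return rng.choice(TEMPLATES).format(
        noun=rng.choice(NOUNS),
        noun2=rng.choice(NOUNS),
        verb=rng.choice(VERBS),
        place=rng.choice(PLACES),
    )

rng = random.Random(0)
for _ in range(5):
    print(make_sentence(rng))   # e.g. "A hen hid in a box."
```

It's rigid compared with what an LLM promises, but the rigidity is exactly what makes the output trustworthy; the model could still be used upstream to propose new words for the lists, with a human checking them.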
Thanks. Yeah, very interesting that they aren't grammatical. There are still a few mistakes, like "barn" (the "ar" is pronounced like saying the letter "r", which is different from sounding out "a" and "r" separately).
As the spouse of an early childhood (pre-school) special education teacher, I can tell you that the number one concern in her corner of education has shifted dramatically from teaching academic skills to social and emotional readiness. Can AI help teach children to read? Maybe. But teaching children to read is easy compared with preparing students to be in a classroom (or life) setting where they're asked to work respectfully alongside peers, care for others, and respect personal boundaries. This, to me, is what society is dealing with at large. You can have all the information you want, and AI can help with that, but if you don't have emotional intelligence, a sense of community, and the skills necessary to successfully navigate life with others, we will not have civilization.
If you used the word “tech” instead of “Sonnet”, (“tech must help…”), you would have had a haiku! 🤣
The "ch" is so confusing to teach... in "tech" it's a "k" sound but in "teach" it's the more common one.
I'm a HS Orchestra Director… it is impossible for me to even conceive of the idea of an AI being able to adjust to the minute differences and individuality in every learner's style, content perception, and challenges, never mind assessing and correcting misunderstandings or errors in process alone… even (or should I say, especially) for a 3 yo's reading lessons.
One of the critiques of AI is that it doesn't have enough human context; all it has is language, mostly in text format, plus some images. You point out that it doesn't have a mouth, but why can't it be trained with audio recordings of words? And how does the LLM work for visual images (I mean not how it performs tasks, but how the attention/transformation/optimization algorithm works)?
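(A rough sketch of the standard answer to the image part, not any particular model's implementation: the image is cut into patches, each patch is flattened and linearly projected into the same kind of token vector a word becomes, and the attention layers then treat those patch tokens exactly like text tokens. All sizes below are arbitrary.)

```python
import numpy as np

# Toy vision-transformer front end: an image becomes a sequence of patch tokens.
rng = np.random.default_rng(0)

image = rng.random((224, 224, 3))   # H x W x channels
patch = 16                          # 16x16-pixel patches
d_model = 64                        # embedding width

# 1. Cut the image into non-overlapping patches and flatten each one.
grid = 224 // patch                                                        # 14 patches per side
patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)  # (196, 768)

# 2. Linearly project every flattened patch into the model's embedding space.
W = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
tokens = patches @ W                # (196, 64): a "sentence" of 196 patch tokens

# 3. Add position information so attention knows where each patch came from
#    (a random stand-in here for learned position embeddings).
tokens += rng.standard_normal(tokens.shape) * 0.02

print(tokens.shape)                 # (196, 64), ready for the same attention layers as text
```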
There is another type of context it doesn't yet have, which is INTERNAL context. As in parts that themselves have been trained to do subtasks and are selected for by the AI itself instead of the training set (which acts as an EXTERNAL environment). And maybe several levels of parts and sub-parts that are all selected by their respective higher levels. This is different than the levels used to create network depth, I think.
How much prior art is around on the type of requests you have been making? Is the internet full of this kind of "beginner-friendly" sentence, or is this something no one before you has tried to do? I have a hunch that easy/hard for AI tracks a lot less with easy/hard for humans than it does with well-precedented/novel with respect to the training data.
One of the main messages from my series on teaching early reading is that getting to reading such sentences should be the first main goal of teaching reading, and yet I can't find a single book series, say, or anything like that that focuses on them. I'm unsure as to why, but it's been frustrating. That is to say, I think such "beginner-friendly" sentences are indeed oddly rare, and so that's one reason this might be difficult, just as you say.
We used the Bob Books for this.
Good recommendation, they're a good resource. I have the Bob Books and we now use them a lot more (he's already more advanced than when I was doing the sort of lessons described here). But I had a few bones to pick with them early on, in that even the very first level ("Level zero", I think it's called) did sometimes include sounds that simply aren't that common. Also, the text is tiny, and it's hard for a toddler to focus on tiny text. I beg of early reader books: make the font big! Additionally, I found that by making up my own stories I could work on pronunciation, e.g., if he was struggling with the "r" sound we could do lots of words with r's, like "Rat ran on the rug" or something of that nature.
The Bob Books! Get the Bob Books! (It has already been suggested but I just wanted to affirm their brilliance)
The Elephant and Piggie series is good here, slightly more advanced than the Bob Books.
Aha, so I guessed right :) AI is stumbling on unfamiliar ground once again, as it does with simple maths exercises on advanced topics even though it can perfectly reproduce all relevant definitions.
Slightly tangentially, though: Are you sure you want to avoid occasional nonstandard phonemes at all costs, as opposed to carefully pointing them out and treating them as "extra points" exercises? This way you'll have to avoid the words "eat" and "with" for a while, and the word "night" for most of your kid's phonics sequence, which will limit the semantic content of their reading quite significantly. From what I hear, it is hard enough to find kid-friendly reading in the West that is not aggressively infantilizing, and I fear adding this kind of "A Void"-like restriction will make it near-impossible.
Yet another tangent: I have recently been going through one of the less well-maintained languages on Duolingo (Swedish), and I found a similar problem with badly generalizing examples. The worst one was introducing the word "around" through the sentence "They are driving around town" (and it teaches languages based on English, so you see the sentence written exactly as it is here). Fortunately I knew how to google, use Wiktionary *and* Google Translate (they each have their own weaknesses), and even Google Books when things got really weird. In the end I don't think these exceptions have hurt me, and I even remember a few words *better* because my confusions and their subsequent resolutions anchored them in my memory. (Duolingo has lots of other, much more substantial weaknesses, so I still can't recommend it.)
@Erik, have you looked at coach.microsoft.com? It won't be able to meet the quite specific requirements you are asking for; however, it does seem a potentially useful tool to aid the development of reading skills. Having said that, I've not tried it with my four-year-old. We are sticking with paper and physical media for now. Screens seem to carry an addiction that bleeds into other uses, even when applied in a specific use case.