LLMs, Statistical Learning, and Explicit Teaching
The Surprising Success of Large Language Models
“The success of large language models is the biggest surprise in my intellectual life. We learned that a lot of what we used to believe may be false and what I used to believe may be false. I used to really accept, to a large degree, the Chomskyan argument that the structures of language are too complex and not manifest in input so that you need to have innate machinery to learn them. You need to have a language module or language instinct, and it’s impossible to learn them simply by observing statistics in the environment.
If it’s true — and I think it is true — that the LLMs learn language through statistical analysis, this shows the Chomskyan view is wrong. This shows that, at least in theory, it’s possible to learn languages just by observing a billion tokens of language.”
–Paul Bloom, in an interview with Tyler Cowen
Challenging the Hypothesis of Innateness
For decades, the Chomskyan view has dominated our understanding of language development. This view argues that language structures are too complex to be learned solely from environmental input and therefore must require some kind of innate linguistic machinery in the brain (a “universal grammar”).
Yet as the quote above from Paul Bloom makes explicit, what LLMs have demonstrated, as a proof of concept, is that the grammatical structures of language do not need to be innate: machines can learn language via statistical associations alone, rather than through explicitly programmed grammatical rules.
We have explored in a previous series on this blog the idea that language may not be a completely innate property of our brains, but rather more of a cultural phenomenon. This parallels the insight–much more widely accepted now–that learning to read is not innate.
The success of LLMs in acquiring language-like abilities through mere statistical analysis of text lends support to this idea: the structures of language can, at least in principle, be learned from exposure.
The Power of Statistical Learning
This revelation–that LLMs can learn language via statistical associations alone, rather than through any explicitly programmed rules–challenges our traditional understanding of language development and points to the power of implicit statistical learning.
However, unlike human children, who can rapidly learn language from relatively sparse input, current frontier LLMs require astronomical amounts of data to be trained. Yet the fact that machines can learn in this way suggests that the structure of language itself lends itself to such implicit learning.
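For readers who want to see what "learning from statistical associations" can look like in miniature, here is a toy sketch. It is not how LLMs are actually built (they use large neural networks, not simple counts), and the four-sentence corpus and function names are made up for illustration. It only counts which word tends to follow which, yet it ends up "knowing" that nouns, not verbs, follow "the" without ever being told what a noun is.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus (an assumption for illustration);
# real LLMs train on vastly more text.
corpus = [
    "the dog chases the cat",
    "the cat chases the mouse",
    "the mouse eats the cheese",
    "the dog eats the bone",
]

def train_bigrams(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follow-counts for `word` into probabilities."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

model = train_bigrams(corpus)
probs = next_word_probs(model, "the")

# Print the learned distribution over words that follow "the".
for candidate, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"the -> {candidate}: {p:.2f}")
```

No grammatical rule was programmed in. The pattern that "the" is followed by a thing rather than an action falls out of the counts alone, which is the basic intuition behind implicit statistical learning, scaled down enormously.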
This insight extends beyond language development and into literacy. We have previously examined seminal papers by Philip Gough and colleagues arguing that learning to read words is more akin to learning a cipher than breaking a code. Rather than learning explicit rules, as from a codebook, we internalize patterns of sounds, letters, and meanings in an algorithmic fashion.
There is a fascinating line of research focused on “statistical learning,” and while there remains much to be learned about this domain, there seems to be an interesting convergence between this research as it relates to reading and as it relates to LLMs.
Reading nerds are already well acquainted with Mark Seidenberg, as he is a steady presence in public communication and debates about reading instruction. What may be somewhat less known about him is that much of his research has focused on computational, connectionist models of reading that have demonstrated how learning to read is a process of statistical learning between sounds, spelling, and meaning. It’s not that he hides this, by the way, but rather that the community of educators who are deep into the “science of reading” stuff don’t seem to be as enticed by abstract stuff like computational models and statistical learning.
But the convergence between connectionist accounts of learning language and learning to read and the advent of LLMs is important to understand, not just from a nerdy stance, which has been mine throughout all these posts, but because LLMs have, again as a proof of concept, demonstrated that implicit learning of statistical associations is fundamental not only to language and to reading, but to our knowledge and experience of the world.
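To make the idea of statistical learning between spelling and sound a little more concrete, here is a minimal sketch. It is emphatically not one of the connectionist models mentioned above, which are neural networks trained on large vocabularies. It is just a frequency count over a hand-labelled mini-lexicon that I made up for illustration, with rough labels ("ee", "eh", "ay") standing in for the vowel sounds. It shows how exposure to words can yield a probabilistic sense of how "ea" is usually pronounced, exceptions included.

```python
from collections import Counter

# Hypothetical hand-labelled mini-lexicon (an assumption for illustration):
# each entry is (word, rough label for the sound of its "ea" vowel).
lexicon = [
    ("bead", "ee"), ("seat", "ee"), ("meat", "ee"), ("team", "ee"),
    ("head", "eh"), ("bread", "eh"),
    ("great", "ay"),
]

def sound_distribution(pairs):
    """Estimate how likely each sound is for the spelling, from counts alone."""
    counts = Counter(sound for _, sound in pairs)
    total = sum(counts.values())
    return {sound: n / total for sound, n in counts.items()}

dist = sound_distribution(lexicon)

# The most frequent mapping becomes the default guess for an unfamiliar word.
best_guess = max(dist, key=dist.get)

for sound, p in sorted(dist.items(), key=lambda item: -item[1]):
    print(f'"ea" -> {sound}: {p:.2f}')
print("default guess for a new 'ea' word:", best_guess)
```

A reader with this kind of implicit knowledge does not need a rule for "ea". More reading simply sharpens the estimates, while a little explicit teaching can flag the exceptions early.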
Connectionist Models: Bridging AI and Human Learning
In fact, Seidenberg himself has repeatedly attempted to communicate the understanding that implicit statistical learning is just as fundamental to learning to read as it is to learning language.
He stirred up some recent controversy on this topic when he suggested that the “SOR” movement has over-corrected in response to previous squishy balanced literacy approaches by focusing too hard on explicit instruction as the cure-all for everything. See his provocative presentation and writing on this topic here: https://seidenbergreading.net/2024/06/24/where-does-the-science-of-reading-go-from-here-2/
To summarize his argument, which dovetails with where we started with LLMs: not everything about learning to read can be taught explicitly, and there is an opportunity cost to over-relying on the explicit teaching of “rules” at the expense of actual reading and writing, which build up the statistical associations needed to become fluent:
“The purpose of explicit learning is to scaffold implicit learning about print, sound, meaning. Explicit instruction is the tip of the iceberg. The larger part under the surface is learned implicitly instead of teaching the whole iceberg.”
—slides on “Where does the Science of Reading go from here?”
In other words, provide only as much explicit instruction as students need in order to spend more time successfully engaged in an increasing volume of reading, writing, and talking.
Balancing Explicit and Implicit Learning in Language and Reading Instruction
In a paper, “The Impact of Language Experience on Language and Reading,” Seidenberg and Maryellen MacDonald also point out that learning to read is easier for children with more advanced spoken language skills, while those with less exposure to the language of instruction (because their linguistic input is more variable) face greater challenges. Children exposed to multiple dialects or languages are learning to navigate multiple language systems, each with its own set of statistical linguistic patterns.
For multilingual and multidialectal learners, it is therefore especially critical to find the right combination of statistical learning and explicit teaching. According to the paper, consistent and increased exposure to the language of instruction is important. This exposure should be complemented by explicit teaching of both oral and written language patterns. And by explicitly comparing and contrasting home languages and dialects with the language used at school–both orally and in writing–students can develop metalinguistic awareness and a deeper understanding of varying language structures. This approach, implemented strategically within a welcoming and supportive classroom, allows students to leverage their existing linguistic knowledge while acquiring new language skills.
Another way of thinking about this, as we’ve explored in another post, is the movement from fuzziness to precision. By seeing, hearing, speaking, and writing an increasing volume of language, students can rapidly begin to make statistical associations. However, especially in the initial stages of learning a new language or learning to read, more effort is required to gain precision. Mistakes will therefore be part of the learning process, and more feedback is needed at the very beginning to course correct.
I’ve written elsewhere about the importance of striking a balance between close reading of shared grade-level texts that are worth reading and ensuring that each and every student reads a steady volume of texts that are more accessible. I’ve also written here about the need for “daily textual feasts” to increase the volume of rich language, knowledge, and critical thinking, as per Dr. Alfred Tatum.
Rethinking Language and Literacy Instruction
In sum, the surprising and awesome ability of LLMs, derived from mere statistical associations, has challenged traditional assumptions about the innate nature of language and, potentially, the role of explicit and implicit instruction in language and literacy learning.
This underscores the need for a comprehensive approach to the teaching of reading and language, in which explicit teaching is strategically balanced with implicit learning opportunities.