<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>computation &#8212; Language &amp; Literacy</title>
    <link>https://languageandliteracy.blog/tag:computation</link>
    <description>Musings about language and literacy and learning</description>
    <pubDate>Fri, 17 Apr 2026 02:25:33 +0000</pubDate>
    <image>
      <url>https://i.snap.as/LIFR67Bi.png</url>
      <title>computation &#8212; Language &amp; Literacy</title>
      <link>https://languageandliteracy.blog/tag:computation</link>
    </image>
    <item>
      <title>Reviewing Claims I’ve Made on LLMs</title>
      <link>https://languageandliteracy.blog/reviewing-claims-ive-made-on-llms?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Novice bunny and expert bunny on bikes&#xA;When I typically begin a series of blogs to conduct nerdy inquiry into an abstract topic, I don&#39;t generally know where I&#39;m going to end up. This series on LLMs was unusual in that in our first post, I outlined pretty much the exact topics I would go on to cover.&#xA;&#xA;Here&#39;s where I had spitballed we might go:&#xA;&#xA;The surprisingly inseparable interconnection between form and meaning&#xA;Blundering our way to computational precision through human communication; Or, the generative tension between regularity and randomness&#xA;The human (and now, machine) capacity for learning and using language may simply be a matter of scale&#xA;Is language as separable from thought (and, for that matter, from the world) as Cormac McCarthy said?&#xA;Implicit vs. explicit learning of language and literacy&#xA;&#xA;Indeed, we then went on to explore each of these areas, in that order. Cool!&#xA;!--more--&#xA;&#xA;Some Hypotheses from This Series&#xA;&#xA;What theories have we raised through this exploration?&#xA;&#xA;1) LLMs gain their uncanny powers from the statistical nature of language itself; &#xA;2) the meaning and experiences of our world are more deeply entwined with the form and structure of our language than we previously imagined; &#xA;3) LLMs may offer us an opportunity to further the convergence between human and machine language; &#xA;4) AI can potentially extend our cognitive abilities, enabling us to process and understand far more information;&#xA;5) Both human and machine learning progresses from fuzzy, imprecise representations to higher precision, and the greater the precision, the greater the effort and practice (or “compute”) that is required; and &#xA;6) LLMs challenge Chomsykan notions of innateness and suggest that implicit, statistical learning alone can lead to gaining the grammatical structure and meaning of a language.&#xA;&#xA;While I’ve been mostly positive and excited about the potential of AI (aside from pointing out how it is accelerating the looming ecological catastrophe that seems to be our trajectory) I should probably pause here to acknowledge that there may be important counterpoints to many of these (perhaps somewhat starry-eyed) hypotheses. &#xA;&#xA;Onto the Counterclaims&#xA;&#xA;Let&#39;s take a more critical look at some of my claims:&#xA;&#xA;1) I claim that language is fundamental to the generative powers of LLMs. &#xA;&#xA;Yet Andrej Karpathy, who is no stranger to LLM development, tweeted: &#xA;&#xA;  It&#39;s a bit sad and confusing that LLMs (&#34;Large Language Models&#34;) have little to do with language; It&#39;s just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.&#xA;&#xA;  They don&#39;t care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. 
If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can &#34;throw an LLM at it.&#34;&#xA;&#xA;I agree that LLMs are performing “statistical modeling of token streams,” and that “for any arbitrary vocabulary of some set of discrete tokens, you can ‘throw an LLM at it.’” &#xA;&#xA;We now have multimodal LLMs that are modeling out of token streams of audio, visual, and text, and will no doubt have ones feeding from additional streams of sensory data as they are increasingly paired with cameras on humans, objects, and robots.&#xA;&#xA;Yet I also think Karpathy undersells that when LLMs suddenly exploded into general public awareness and fascination, it was merely a “historical” fact that they were trained upon vast amounts of human generated text and were able to reproduce and generate human language. As we’ve explored in this and a previous series, there is something about human language itself uniquely adapted for our brain circuitry and the propagation of our culture within social interaction in our world. And being able to communicate with a powerful computational model through the medium of conversational human language has been a revolutionary advent. We are just in the beginning stages of grokking it.&#xA;&#xA;As I tweeted in response to Karpathy, token streams may be applied to anything, but human language seems to be uniquely suited to the advancement of combined human and machine learning. Not only because we rely on it for communication – but furthermore due to the algebraic and statistical nature of our language.&#xA;&#xA;Recent case in point: the viral attention currently on NotebookLM’s Audio Overview. Listening to a conversation, however artificial, resonates with us, because that&#39;s what&#39;s in our social nature. And, surprisingly, it does a fairly good job of surfacing information from across multiple multimodal sources (and soon, across languages) that we find interesting, relevant, and meaningful.&#xA;&#xA;Speaking of NotebookLM Audio Overview. . . here’s one derived from all the blog posts (except this one) from this series, as well as the sources–outlined in post 1–that inspired them all: https://notebooklm.google.com/notebook/a4f35399-e288-4293-b2d2-0489e6b1f037/audio&#xA;&#xA;4) I claim there is great potential for AI to extend our cognitive capabilities&#xA;&#xA;Yet there is a strong case of an equal and commensurate danger that use of LLMs can reduce our cognitive capabilities.&#xA;&#xA;Learning more formal content and skills, like what we learn in school or in a job, requires deliberate effort until we develop an unconscious fluency. If students learning new concepts and skills externally automate their practice of new learning (such as writing or math) to an LLM, then they will not–ironically–be able to develop the automatized internal knowledge and capacity they need to wield powerful tools like AI more effectively.&#xA;&#xA;When “experts” use tools like AI, they know where the gaps are in the output and are able to use it strategically to enhance their own production and output. A few examples of this:&#xA;&#xA;Simon Willison, a programmer who is also a great communicator, uses different LLMs to support his projects, and writes and speaks about how he does so. Here’s a podcast, for example, where he explains how he uses them.&#xA;&#xA;Nicholas Carlini, a research scientist at Google DeepMind, similarly wrote about how he uses AI to support his work.  
&#xA;&#xA;Cal Newport, who writes extensively about how to do “deep work” in a world of distractions, recently wrote in The New Yorker how he has found ChatGPT useful to his writing.&#xA;&#xA;All the people above are highly skilled at what they do – so when they explore and then figure out how to use AI to support their work, they do so in a way that does not diminish their own hard-earned ability, but rather enhances and extends their capabilities.&#xA;&#xA;On the other hand, for students–who are by definition novices in the skills and knowledge they are learning–an over-reliance on AI tools may limit their ability to develop skills such as literacy, critical thinking, problem-solving, and creativity. &#xA;&#xA;Recent reports on AI in education, such as from Cognitive Resonance, Center for American Progress, and Bellwether, have rightfully raised this concern.&#xA;&#xA;And all educators, whether K-12 or in higher ed, are seeing an increasing use of AI by students to complete homework assignments, so this danger of truncating the development of internal capacity is real.&#xA;&#xA;I think the steps we can to take to address this are two-fold:&#xA;&#xA;limit the use of digital technology for learners at the earliest stages of learning, whether learners are preK-3 or learners being introduced to a new concept&#xA;&#xA;move practice of essential skills directly into the classroom as much as possible, while considering how AI could be used to extend, rather than diminish, any practice and feedback outside of the classroom&#xA;&#xA;In a post on ethical use of AI, Jacob Kaplan-Moss argues that fully automated AI is unethical in the public sector due to its inherent biases and potential for unfairness in high-stakes situations. In contrast, the assistive use of AI can enhance human decision-making.&#xA;&#xA;This assistive vs. automated use of AI may be a useful frame for thinking of how AI can be used most ethically and effectively in education. We want AI to be used to assist the learning process, rather than simply automating the solving of math problems or writing essays. This view aligns with Ethan Mollick’s idea of “co-intelligence” as well.&#xA;&#xA;So far, I find the most powerful and interesting assistive applications for AI are more focused on educators (“the experts”), rather than on students (“the novices”). Teachers can leverage AI to support administrative tasks, analysis of student data, and consider additional enhancements of their instruction based on student data.&#xA;&#xA;That said, I don’t think the assistive use cases of AI are only limited to “experts” in a domain. AI can also help to equip those without knowledge and expertise in a specific area with the language they need to navigate learning or real-world communications more effectively. And there are some really interesting use cases of AI for feedback on student thinking and writing, when structured with specific guidelines and criteria and with the teacher in the loop.&#xA;&#xA;But in the context of classroom learning, such uses must be very strategically designed and cautiously incorporated. For example, see this explanation from professor Michael Brenner on how he has begun incorporating AI into his pedagogy. But note this example is from a graduate level math class, so again, that novice vs. expert dynamic is quite different from what we would need to consider at a preK-8 level. 
But even at that graduate level, you can see there is quite a bit of complexity the instructor needed to consider and think through to design his course to leverage LLMs so strategically.&#xA;&#xA;There’s a lot more to unpack here on all sides of the equation. I’ll leave this one here for now, accepting non-closure, and I hope to dig further into these tensions and opportunities in both this space and in my professional work.&#xA;&#xA;6) I claim that LLMs have shown that language can be learned without any innate programming or structure – therefore demonstrating the power of statistical, implicit learning&#xA;&#xA;I’d moved into the “Chomsky is wrong” camp for a while now, but I happened to listen to an interview of Jean-Rémi King recently, a scientist at Meta AI, by Stephen Wilson on The Language Neuroscience podcast (did I tell you I’m a nerd?). Towards the end of the conversation, King warns against writing off Chomsky too readily, and that there is something intrinsic to the human brain in its readiness for language.&#xA;&#xA;I uploaded the relevant portion of the transcript from the interview, and asked Claude AI for a concise summary of King&#39;s main claims, which it willingly obliged (while I’m sure it drew upon an unconscionable amount of energy):&#xA;&#xA;  King argues that human brains likely don&#39;t use the same &#34;next word prediction&#34; principle as large language models for language acquisition, primarily because humans are exposed to far less linguistic data than these models. &#xA;&#xA;  He contends that while language models have shown impressive capabilities, they are extremely inefficient compared to human language learning, suggesting that we&#39;re missing some fundamental principles of how humans acquire language so efficiently.&#xA;&#xA;While I think I’ve tried to temper most of my pronouncements throughout this series, I think it’s important to acknowledge that the fact that LLMs can learn language from statistical associations of word tokens alone does not mean that is exactly how we humans must also learn language.&#xA;&#xA;It is rather a proof of concept that language can be learned in this way (without any innate grammar or teaching of rules). But as King points out, this is via a scale of input that is ridiculously and exponentially larger than that of any child.&#xA;&#xA;That said, there are other Artificial Neural Networks (ANNs), such as in the research of Gašper Beguš, that learn from raw speech in an unsupervised manner, more closely mimicking human language acquisition. His lab has found interesting similarities between these ANNs and the human brain in processing language sounds – a parallel to King’s own research, which has found that LLM models can generate brain-like representations when predicting words from context. &#xA;&#xA;And there will continue to be research into tinier models trained off sparser, and potentially richer, data.&#xA;&#xA;But as King points to, there’s just so much more we need to learn. And this is exactly where I find all of this the most exciting.&#xA;&#xA;Where I may be most rightfully critiqued in my last post, and perhaps in other posts, may be in extrapolating from the theoretical demonstration of LLMs to implications for classrooms. &#xA;&#xA;So let me state my position a bit more clearly in case there was any confusion that I am falling onto the side of the Goodmans or something. 
Children need consistency, stability, clarity, and coherency in their learning experiences, and teaching what is most important to know for a given subject directly and explicitly is critical. For children at the earliest stages of learning abstract skills and content, such as learning to read, explicit and well-structured teaching is essential. At the same time, however, we need to ensure that students have abundant structured opportunities to apply and practice what they are learning – and this is where ensuring they are spending more time reading, writing, and talking–connected to the content of what we are teaching–is essential.&#xA;&#xA;If you have more critiques that I am missing in any of the above, please do share!&#xA;&#xA;Egads, I think I may actually have ANOTHER post left in me after all of this. Who knew LLMs would be such an interesting topic?!&#xA;&#xA;#language #literacy #AI #LLMs #cognition #research #computation #models&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://i.snap.as/2AqWHRP3.jpeg" alt="Novice bunny and expert bunny on bikes"/>
When I begin a series of blog posts to conduct nerdy inquiry into an abstract topic, I typically don&#39;t know where I&#39;m going to end up. This series on LLMs was unusual in that in <a href="https://languageandliteracy.blog/language-and-llms">our first post</a>, I outlined pretty much the exact topics I would go on to cover.</p>

<p>Here&#39;s where I had spitballed we might go:</p>
<ul><li>The surprisingly inseparable interconnection between form and meaning</li>
<li>Blundering our way to computational precision through human communication; Or, the generative tension between regularity and randomness</li>
<li>The human (and now, machine) capacity for learning and using language may simply be a matter of scale</li>
<li>Is language as separable from thought (and, for that matter, from the world) as Cormac McCarthy said?</li>
<li>Implicit vs. explicit learning of language and literacy</li></ul>

<p>Indeed, we then went on to explore each of these areas, in that order. Cool!
</p>

<h2 id="some-hypotheses-from-this-series" id="some-hypotheses-from-this-series">Some Hypotheses from This Series</h2>

<p>What theories have we raised through this exploration?</p>

<p>1) LLMs gain their uncanny powers from <a href="https://languageandliteracy.blog/language-and-llms">the statistical nature of language itself</a>;
2) the meaning and experiences of our world are <a href="https://languageandliteracy.blog/the-algebra-of-language-unveiling-the-statistical-tapestry-of-form-and-meaning">more deeply entwined with the form and structure</a> of our language than we previously imagined;
3) LLMs may offer us an opportunity to further the <a href="https://languageandliteracy.blog/the-pathway-of-human-language-towards-computational-precision-in-llms">convergence between human and machine language</a>;
4) AI can potentially <a href="https://languageandliteracy.blog/scaling-our-capacity-for-processing-information">extend our cognitive abilities</a>, enabling us to process and understand far more information;
5) Both human and machine learning progresses <a href="https://languageandliteracy.blog/the-interplay-of-language-cognition-and-llms-where-fuzziness-meets-precision">from fuzzy, imprecise representations to higher precision</a>, and the greater the precision, the greater the effort and practice (or “compute”) that is required; and
6) LLMs challenge Chomskyan notions of innateness and suggest that <a href="https://write.as/manderson/llms-statistical-learning-and-explicit-teaching">implicit, statistical learning</a> alone can lead to acquiring the grammatical structure and meaning of a language.</p>

<p>While I’ve been mostly positive and excited about the potential of AI (aside from pointing out how it is <a href="https://languageandliteracy.blog/scaling-our-capacity-for-processing-information">accelerating the looming ecological catastrophe</a> that seems to be our trajectory), I should probably pause here to acknowledge that there may be important counterpoints to many of these (perhaps somewhat starry-eyed) hypotheses.</p>

<h2 id="onto-the-counterclaims" id="onto-the-counterclaims">Onto the Counterclaims</h2>

<p>Let&#39;s take a more critical look at some of my claims:</p>

<h3 id="1-i-claim-that-language-is-fundamental-to-the-generative-powers-of-llms" id="1-i-claim-that-language-is-fundamental-to-the-generative-powers-of-llms">1) I claim that language is fundamental to the generative powers of LLMs.</h3>

<p>Yet Andrej Karpathy, who is no stranger to LLM development, <a href="https://x.com/karpathy/status/1835024197506187617">tweeted</a>:</p>

<blockquote><p>It&#39;s a bit sad and confusing that LLMs (“Large Language Models”) have little to do with language; It&#39;s just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.</p>

<p>They don&#39;t care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can “throw an LLM at it.”</p></blockquote>

<p>I agree that LLMs are performing “statistical modeling of token streams,” and that “for any arbitrary vocabulary of some set of discrete tokens, you can ‘throw an LLM at it.’”</p>

<p>We now have multimodal LLMs that model token streams of audio, images, and text, and we will no doubt have models fed by additional streams of sensory data as they are increasingly paired with cameras on humans, objects, and robots.</p>
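
<p>To make concrete what Karpathy means by “any arbitrary vocabulary of some set of discrete tokens,” here is a deliberately tiny sketch of my own (a hypothetical illustration, not anything from Karpathy or an actual LLM): whatever the underlying symbols are – word pieces, robot actions – they all reduce to sequences of integer ids before a model ever sees them.</p>

<pre><code class="language-python"># A toy illustration (mine, not Karpathy's) of "token streams over an
# arbitrary discrete vocabulary": text chunks and robot action choices
# both reduce to sequences of integer ids before a model sees them.

def build_vocab(symbols):
    """Assign each distinct symbol a stable integer id."""
    return {sym: i for i, sym in enumerate(sorted(set(symbols)))}

def encode(stream, vocab):
    """Turn a stream of symbols into the integer token ids a model consumes."""
    return [vocab[sym] for sym in stream]

text_stream   = ["The", " cat", " sat", " on", " the", " mat"]      # word-piece-like chunks
action_stream = ["turn_left", "move_forward", "grasp", "release"]   # robot action choices

for name, stream in [("text", text_stream), ("actions", action_stream)]:
    vocab = build_vocab(stream)
    print(name, "->", encode(stream, vocab))
    # To the model, both are just lists of ints; only the vocabulary differs.
</code></pre>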

<p>Yet I also think Karpathy undersells the fact that, when LLMs suddenly exploded into general public awareness and fascination, it was because they had been trained upon vast amounts of human-generated text and could reproduce and generate human language; that is more than a merely “historical” accident. As we’ve explored in this and <a href="https://languageandliteracy.blog/innate-vs">a previous series</a>, there is something about human language itself uniquely adapted to our brain circuitry and to the propagation of our culture within social interaction in our world. And being able to communicate with a powerful computational model through the medium of conversational human language has been a revolutionary advent. We are just in the beginning stages of grokking it.</p>

<p>As I <a href="https://x.com/mandercorn/status/1835279679650971667">tweeted in response</a> to Karpathy, token streams may be applied to anything, but human language seems to be uniquely suited to the advancement of combined human and machine learning. Not only because we rely on it for communication – but furthermore due to the <a href="https://languageandliteracy.blog/the-algebra-of-language-unveiling-the-statistical-tapestry-of-form-and-meaning">algebraic and statistical nature of our language</a>.</p>

<p>Recent case in point: the viral attention currently on NotebookLM’s Audio Overview. Listening to a conversation, however artificial, resonates with us, because conversation is part of our social nature. And, surprisingly, it does a fairly good job of surfacing information from across multiple multimodal sources (and soon, across languages) that we find interesting, relevant, and meaningful.</p>

<p>Speaking of NotebookLM Audio Overview. . . here’s one derived from all the blog posts (except this one) from this series, as well as the sources–<a href="https://languageandliteracy.blog/language-and-llms">outlined in post 1</a>–that inspired them all: <a href="https://notebooklm.google.com/notebook/a4f35399-e288-4293-b2d2-0489e6b1f037/audio">https://notebooklm.google.com/notebook/a4f35399-e288-4293-b2d2-0489e6b1f037/audio</a></p>

<h3 id="4-i-claim-there-is-great-potential-for-ai-to-extend-our-cognitive-capabilities" id="4-i-claim-there-is-great-potential-for-ai-to-extend-our-cognitive-capabilities">4) I claim there is great potential for AI to extend our cognitive capabilities</h3>

<p>Yet there is a strong case for a commensurate danger: the use of LLMs can also reduce our cognitive capabilities.</p>

<p>Learning more formal content and skills, like what we learn in school or in a job, requires deliberate effort until we develop an unconscious fluency. If students learning new concepts and skills offload their practice of that new learning (such as writing or math) to an LLM, then they will not–ironically–be able to develop the automatized internal knowledge and capacity they need to wield powerful tools like AI more effectively.</p>

<p>When “experts” use tools like AI, they know where the gaps are in its output and are able to use it strategically to enhance their own production. A few examples of this:</p>
<ul><li><p>Simon Willison, a programmer who is <a href="https://simonwillison.net/">also a great communicator</a>, uses different LLMs to support his projects, and writes and speaks about how he does so. <a href="https://newsletter.pragmaticengineer.com/p/ai-tools-for-software-engineers-simon-willison">Here’s a podcast</a>, for example, where he explains how he uses them.</p></li>

<li><p>Nicholas Carlini, a research scientist at Google DeepMind, similarly <a href="https://nicholas.carlini.com/writing/2024/how-i-use-ai.html">wrote about how he uses AI</a> to support his work.</p></li>

<li><p>Cal Newport, who writes extensively about how to do “deep work” in a world of distractions, recently <a href="https://www.newyorker.com/culture/annals-of-inquiry/what-kind-of-writer-is-chatgpt">wrote in The New Yorker</a> how he has found ChatGPT useful to his writing.</p></li></ul>

<p>All the people above are highly skilled at what they do – so when they explore and then figure out how to use AI to support their work, they do so in a way that does not diminish their own hard-earned ability, but rather enhances and extends their capabilities.</p>

<p>On the other hand, for students–who are by definition <strong>novices</strong> in the skills and knowledge they are learning–an over-reliance on AI tools may limit their ability to develop skills such as literacy, critical thinking, problem-solving, and creativity.</p>

<p>Recent reports on AI in education, such as from <a href="https://cognitiveresonance.net/resources.html">Cognitive Resonance</a>, <a href="https://www.americanprogress.org/article/using-learning-science-to-analyze-the-risks-and-benefits-of-ai-in-k-12-education/">Center for American Progress</a>, and <a href="https://bellwether.org/publications/learning-systems/">Bellwether</a>, have rightfully raised this concern.</p>

<p>And all educators, whether K-12 or in higher ed, are seeing an increasing use of AI by students to complete homework assignments, so this danger of truncating the development of internal capacity is real.</p>

<p>I think the steps we can take to address this are two-fold:</p>
<ul><li><p>limit the use of digital technology for learners at the earliest stages of learning, whether they are in preK-3 or are just being introduced to a new concept</p></li>

<li><p>move practice of essential skills directly into the classroom as much as possible, while considering how AI could be used to extend, rather than diminish, any practice and feedback outside of the classroom</p></li></ul>

<p>In <a href="https://jacobian.org/2024/oct/1/ethical-public-sector-ai/">a post on ethical use of AI</a>, Jacob Kaplan-Moss argues that fully automated AI is unethical in the public sector due to its inherent biases and potential for unfairness in high-stakes situations. In contrast, the assistive use of AI can enhance human decision-making.</p>

<p>This assistive vs. automated use of AI may be a useful frame for thinking of how AI can be used most ethically and effectively in education. We want AI to be used to assist the learning process, rather than simply automating the solving of math problems or writing essays. This view aligns with <a href="https://english.elpais.com/technology/2024-10-03/ethan-mollick-analyst-students-who-use-ai-as-a-crutch-dont-learn-anything.html">Ethan Mollick’s idea of “co-intelligence”</a> as well.</p>

<p>So far, I find the most powerful and interesting assistive applications for AI are more focused on educators (“the experts”), rather than on students (“the novices”). Teachers can leverage AI to support administrative tasks, analyze student data, and consider enhancements to their instruction based on that data.</p>

<p>That said, I don’t think the assistive use cases of AI are limited to “experts” in a domain. AI can also help to equip those without knowledge and expertise in a specific area with the language they need to navigate learning or real-world communications more effectively. And there are some really interesting use cases of AI for feedback on student thinking and writing, when structured with specific guidelines and criteria and with the teacher in the loop.</p>

<p>But in the context of classroom learning, such uses must be very strategically designed and cautiously incorporated. For example, <a href="https://x.com/SebastienBubeck/status/1829701643925151757">see this explanation from professor Michael Brenner</a> on how he has begun incorporating AI into his pedagogy. But note this example is from a graduate level math class, so again, that novice vs. expert dynamic is quite different from what we would need to consider at a preK-8 level. But even at that graduate level, you can see there is quite a bit of complexity the instructor needed to consider and think through to design his course to leverage LLMs so strategically.</p>

<p>There’s a lot more to unpack here on all sides of the equation. I’ll leave this one here for now, accepting non-closure, and I hope to dig further into these tensions and opportunities in both this space and in my professional work.</p>

<h3 id="6-i-claim-that-llms-have-shown-that-language-can-be-learned-without-any-innate-programming-or-structure-therefore-demonstrating-the-power-of-statistical-implicit-learning" id="6-i-claim-that-llms-have-shown-that-language-can-be-learned-without-any-innate-programming-or-structure-therefore-demonstrating-the-power-of-statistical-implicit-learning">6) I claim that LLMs have shown that language can be learned without any innate programming or structure – therefore demonstrating the power of statistical, implicit learning</h3>

<p>I’d been in the “Chomsky is wrong” camp for a while now, but I recently happened to listen to an interview of Jean-Rémi King, a scientist at Meta AI, by Stephen Wilson on <a href="https://langneurosci.org/podcast/#ep27">The Language Neuroscience podcast</a> (did I tell you I’m a nerd?). Towards the end of the conversation, King warns against writing off Chomsky too readily, suggesting that there is something intrinsic to the human brain in its readiness for language.</p>

<p>I uploaded the relevant portion of the transcript from the interview and asked Claude AI for a concise summary of King&#39;s main claims, a request it willingly obliged (while, I’m sure, drawing upon an unconscionable amount of energy):</p>

<blockquote><p>King argues that human brains likely don&#39;t use the same “next word prediction” principle as large language models for language acquisition, primarily because humans are exposed to far less linguistic data than these models.</p>

<p>He contends that while language models have shown impressive capabilities, they are extremely inefficient compared to human language learning, suggesting that we&#39;re missing some fundamental principles of how humans acquire language so efficiently.</p></blockquote>

<p>While I’ve tried to temper most of my pronouncements throughout this series, I think it’s important to acknowledge that the fact that LLMs can learn language from statistical associations of word tokens alone does not mean that is exactly how we humans must also learn language.</p>

<p>It is rather a proof of concept that language <em>can</em> be learned in this way (without any innate grammar or teaching of rules). But as King points out, this is via a scale of input that is ridiculously and exponentially larger than that of any child.</p>
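
<p>To make “learning language from statistical associations of word tokens alone” a bit more concrete, here is a toy sketch of my own – nothing like an actual transformer in scale or architecture – a bigram model that does nothing but count which word follows which, with no built-in grammar, and then predicts the most likely next word.</p>

<pre><code class="language-python"># A toy bigram "language model" (my own illustration, orders of magnitude
# simpler than an LLM): it learns only from co-occurrence counts, with no
# innate grammar, yet can still predict plausible next words.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word seen in training."""
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("sat"))   # -> 'on'  (seen twice after 'sat')
print(predict_next("the"))   # -> 'cat' (a four-way tie, broken by first-seen order)
</code></pre>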

<p>That said, there are other Artificial Neural Networks (ANNs), such as in <a href="https://medium.com/@begus.gasper/artificial-and-biological-intelligence-humans-animals-and-machines-142bc3c4b304">the research of Gašper Beguš</a>, that learn from raw speech in an unsupervised manner, more closely mimicking human language acquisition. His lab has found interesting similarities between these ANNs and the human brain in processing language sounds – a parallel to King’s own research, which has found that LLMs can generate brain-like representations when predicting words from context.</p>

<p>And there will continue to be research into tinier models trained on sparser, and potentially richer, data.</p>

<p>But as King points out, there’s just so much more we need to learn. And this is exactly where I find all of this the most exciting.</p>

<p>Where I may be most rightfully critiqued, <a href="https://write.as/manderson/llms-statistical-learning-and-explicit-teaching">in my last post</a> and perhaps in other posts, is in extrapolating from the theoretical demonstration of LLMs to implications for classrooms.</p>

<p>So let me state my position a bit more clearly in case there was any confusion that I am falling onto the side of <a href="https://write.as/manderson/learning-to-read-an-unnatural-act">the Goodmans</a> or something. Children need consistency, stability, clarity, and coherency in their learning experiences, and teaching what is most important to know for a given subject directly and explicitly is critical. For children at the earliest stages of learning abstract skills and content, such as learning to read, explicit and well-structured teaching is essential. At the same time, however, we need to ensure that students have abundant structured opportunities to apply and practice what they are learning – and this is where ensuring they are spending more time reading, writing, and talking–connected to the content of what we are teaching–is essential.</p>

<p>If you have more critiques that I am missing in any of the above, please do share!</p>

<p>Egads, I think I may actually have ANOTHER post left in me after all of this. Who knew LLMs would be such an interesting topic?!</p>

<p><a href="https://languageandliteracy.blog/tag:language" class="hashtag"><span>#</span><span class="p-category">language</span></a> <a href="https://languageandliteracy.blog/tag:literacy" class="hashtag"><span>#</span><span class="p-category">literacy</span></a> <a href="https://languageandliteracy.blog/tag:AI" class="hashtag"><span>#</span><span class="p-category">AI</span></a> <a href="https://languageandliteracy.blog/tag:LLMs" class="hashtag"><span>#</span><span class="p-category">LLMs</span></a> <a href="https://languageandliteracy.blog/tag:cognition" class="hashtag"><span>#</span><span class="p-category">cognition</span></a> <a href="https://languageandliteracy.blog/tag:research" class="hashtag"><span>#</span><span class="p-category">research</span></a> <a href="https://languageandliteracy.blog/tag:computation" class="hashtag"><span>#</span><span class="p-category">computation</span></a> <a href="https://languageandliteracy.blog/tag:models" class="hashtag"><span>#</span><span class="p-category">models</span></a></p>
]]></content:encoded>
      <guid>https://languageandliteracy.blog/reviewing-claims-ive-made-on-llms</guid>
      <pubDate>Mon, 07 Oct 2024 00:10:15 +0000</pubDate>
    </item>
    <item>
      <title>The Pathway of Human Language Towards Computational Precision in LLMs</title>
      <link>https://languageandliteracy.blog/the-pathway-of-human-language-towards-computational-precision-in-llms?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Natural digital&#xA;&#xA;Regularity and irregularity. Decodable and tricky words. Learnability and surprisal. Predictability and randomness. Low entropy and high entropy.&#xA;&#xA;Why do such tensions exist in human language? And in our AI tools developed to both create code and use natural language, how can the precision required for computation co-exist alongside this necessary complexity and messiness of our human language?&#xA;!--more--&#xA;An Algebraic Symphony of Language and Meaning&#xA;&#xA;In our last post, we examined how there is a statistical and algebraic nature to language that drives the power of LLMs, and that the form and meaning of a language may be much more intertwined than we assume, given just how much meaning LLMs are able to approximate via computation of statistical arrays alone.&#xA;&#xA;This interlacement of form and meaning is in relation to where and how words show up repeatedly in sentences and texts, not simply in the form of words themselves. Because all languages contain words that have the same form but different meanings. Some words that share the same form have entirely unrelated meanings (homophony), while other words with the same form have closely related meanings (polysemy). Yet LLMs are able to use them in a more or less “natural” manner due to the high dimensional mappings of word parts in statistical relation to one another – such that word analogies can be calculated mathematically:&#xA;&#xA;  “For example, Google researchers took the vector for biggest, subtracted big, and added small. The word closest to the resulting vector was smallest.”&#xA;  –Large language models, explained with a minimum of math and jargon, Timothy Lee &amp; Sean Trott&#xA;&#xA;That the algebraic and statistical relationship of words in natural language can drive computational models&#39; generative capabilities suggests that language itself may reflect the limitations and potential of AI. And the thing with natural, human language is that while it is endlessly generative, it also tends to be imprecise. For our human usage, gestures and the context of our social interaction, who and when we are speaking to, plays a big role. As long as we get our main message across, we’re good.&#xA;&#xA;Human language is fundamentally communicative and social, and there’s feelings involved.&#xA;&#xA;The Imprecision of Human Expression&#xA;&#xA;Imagine yourself in a bustling restaurant in an international airport, surrounded by people from diverse linguistic backgrounds. You&#39;re trying to communicate with a traveller whose language you don&#39;t speak. What do you do?&#xA;&#xA;You resort to body language. You gesture hyperbolically and make exaggerated facial expressions. You point to objects, mime actions, and mouth simple words you hope the other person might use as a basis for basic understanding.&#xA;&#xA;Depth, nuance, and complexity are not possible (beyond each individual’s imagination) in this most elemental of interactions.&#xA;&#xA;So what is required for depth, nuance, and complexity?&#xA;&#xA;A shared language, whether spoken, written, or signed. In which a small set of sounds, letters, or signs are concatenated in a wide assortment of ways, both commonplace and surprising, to convey a wide assortment of ideas and feelings.&#xA;&#xA;Yet a shared language, while providing a platform for greater depth, may still remain imprecise. 
What is meant to be conveyed is not always exactly what is understood.&#xA;&#xA;There are furthermore gradations of precision in language, beginning with the ephemeral and contextual nature of spoken and signed language, moving into the more ossified form of written language, in which spelling must be exact and word selection must be more intentional. There is also a movement from the language we use with our family, with frequent, commonly used words, to the language we use when writing an academic paper, with domain-specific, rarer words. In education, we often refer to this type of language as Tier 2 or 3 vocabulary.&#xA;&#xA;Tiered Vocabulary&#xA;&#xA;If a person is equipped with more of that academic, domain-specific language, then greater precision in communication can be achieved. Yet the challenge of whether the listener hears and interprets what is intended remains. For example, in this article in Scientific American, “People Have Very Different Understandings of Even the Simplest Words”, they discuss how the more abstract a word, the more it can be tied to an emotional valence and someone’s identity and experiences, rather than a precise meaning.&#xA;&#xA;The Computational Imperative&#xA;&#xA;But in some ways, this inherent fuzziness of our language may be a feature, rather than a bug. It gives us a complex adaptive system for navigating, creating, and communicating in a world of complex adaptive systems.&#xA;&#xA;For computers and computations, however, exactness and precision in language is required – either a line of code input runs the correct function as an output or it doesn’t. So it’s quite interesting that one of the most immediately powerful use cases so far of LLMs seems to be as a natural language interface to develop and review code.&#xA;&#xA;Stephen Wolfram, in a long and interesting explainer on how LLMs work, [“What Is ChatGPT Doing … and Why Does It Work?”](&#xA;https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/), explores some of this tension between computational and natural language. &#xA;&#xA;  “Human language is fundamentally imprecise, not least because it isn’t “tethered” to a specific computational implementation, and its meaning is basically defined just by a “social contract” between its users. But computational language, by its nature, has a certain fundamental precision—because in the end what it specifies can always be “unambiguously executed on a computer”. Human language can usually get away with a certain vagueness. (When we say “planet” does it include exoplanets or not, etc.?) But in computational language we have to be precise and clear about all the distinctions we’re making.”&#xA;&#xA;Computational Irreducibility and the Limits of Predictability and Learning&#xA;&#xA;One of the limitations Wolfram raises between human and computational language is what he terms “computational irreducibility,” a term he uses to describe the difficulty in making accurate predictions for a highly complex system, such as for weather or climate systems. For such systems, it would require performing step-by-step computation based on an initial state, and thus can’t be swiftly calculated by compressing data.&#xA;&#xA;In some ways, this “compression” of information is what we are doing with language as we use more “Tier 2” and “Tier 3” – or academic – words in our speech or writing. 
There is a greater density of information provided in academic speech and writing, in which more abstract words are used to convey complex concepts, and our sentences tend to become more compound and complex. The simpler, more frequent words, phrases, and sentences we use in our everyday speech are more regular and thus, more learnable.&#xA;&#xA;  . . . there’s just a fundamental tension between learnability and computational irreducibility. Learning involves in effect compressing data by leveraging regularities. But computational irreducibility implies that ultimately there’s a limit to what regularities there may be.&#xA;&#xA;  . . . there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable. And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.&#xA;&#xA;Irregularity and Regularity in Language&#xA;&#xA;What’s interesting to note here is that all languages have constructive tensions between regularity and irregularity. This tension may be a process of language being honed over time to be more learnable within our cognitive constraints. We’ve explored some of this before in our post, Irregularity Enhances Learning (Maybe), in which we examined a paper by Michael Ramscar that suggested there is some level of tension between language forms that show up again and again, and the language forms that are more infrequent, but thus inherently gain more of our attention. This relates to the theory of “statistical learning” with which we not only learn language, but also when we map a language to its written form.&#xA;&#xA;For Wolfram, that LLMs are as powerful as they are suggests that human language is actually more statistically regular than we may have thought: &#xA;&#xA;  “my strong suspicion is that the success of ChatGPT implicitly reveals an important “scientific” fact: that there’s actually a lot more structure and simplicity to meaningful human language than we ever knew—and that in the end there may be even fairly simple rules that describe how such language can be put together.”&#xA;&#xA;  And instead what we should conclude is that tasks—like writing essays—that we humans could do, but we didn’t think computers could do, are actually in some sense computationally easier than we thought.&#xA;&#xA;  In other words, the reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought. And in a sense this takes us closer to “having a theory” of how we humans manage to do things like writing essays, or in general deal with language.&#xA;&#xA;And so thus far the unrealized potential, for Wolfram, is that with a greater underlying capability in AI for computational language, it may be able to bridge our more “computationally shallow” human language with the precision required for more complex computations:&#xA;&#xA;  ”its very success gives us a reason to think that it’s going to be feasible to construct something more complete in computational language form. 
And, unlike what we’ve so far figured out about the innards of ChatGPT, we can expect to design the computational language so that it’s readily understandable to humans.”&#xA;&#xA;Decontextualized Language: The Pathway to Precision&#xA;&#xA;On this pathway towards integration of human language and computational language, it’s interesting to consider how in our own language development,  we are able to better “compress information” and develop greater precision in our thinking and communication as we learn and incorporate rarer and more abstract language into our own. We’ve spoken before about “decontextualized language” – the language that takes us beyond the immediate context and moment, and how such language can take us beyond our own delimited feelings and experiences, and into a realm of interpersonal and cultural thought, knowledge, and perspectives. This is the language of storybooks, of science, and – at it’s greatest extreme – of code. We begin teaching this form of language when we engage in storytelling with our children and reading with them and talking to them about books. It becomes increasingly dense and complex as we move into disciplinary study.&#xA;&#xA;There is some evidence that training LLMs on this specific form of language is more powerful – such as this study training a “tiny LLM” on children’s stories. And if you think about what LLMs have been getting trained on thus far – it’s a corpus of written language, not training on conversations using everyday language. As we’ve explored in depth on this blog, written language is not synonymous with oral language – by nature of it being written, it is already more “decontextualized,” and requires more inference and perspective-taking. That LLMs are trained on such a corpus may be, in fact, why their algebraic and statistical magic can be so surprisingly powerful. There is a greater density of information in the written forms of our languages.&#xA;&#xA;Implications for Teaching and Learning&#xA;&#xA;What might all of this say about teaching and learning? Well, so far, one of the facets we’ve highlighted from LLMs is that the statistical nature of language alone can take us pretty far, which suggests that alongside of social interaction and peer engagement and communication, we want to increase the volume of that language exposure and use. And in terms of the nature of the language we want to increase: the more that the form of that language combines precision with abstraction, the greater computational power it can provide. Turning up the dial on decontextualized language use and exposure – in other words – providing our children with “textual feasts,” to use Alfred Tatum’s term, may be the key to enhanced learning.&#xA;&#xA;Sources for Further Exploration&#xA;&#xA;If you are interested in further exploring some of the tensions we began this post with – between regularity and irregularity in language, here’s some further interesting reads to geek out on:&#xA;&#xA;“Source codes in human communication” by Michael Ramscar&#xA;“Expectation-based syntactic comprehension” by Roger Levy&#xA;“Cognitive approaches to uniformity and variability in morphology” by Petar Milin, Neil Bermel, and James Blevins&#xA;&#xA;#language #computation #algorithms #learning #LLMs #cognition]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://i.snap.as/PmjZi6Hw.png" alt="Natural digital"/></p>

<p>Regularity and irregularity. Decodable and tricky words. Learnability and <a href="https://www.sciencedirect.com/science/article/abs/pii/S0010027707001436?via%3Dihub">surprisal</a>. Predictability and randomness. Low entropy and high entropy.</p>

<p>Why do such tensions exist in human language? And in our AI tools developed to both create code and use natural language, how can the precision required for computation co-exist alongside this necessary complexity and messiness of our human language?
</p>

<h2 id="an-algebraic-symphony-of-language-and-meaning" id="an-algebraic-symphony-of-language-and-meaning">An Algebraic Symphony of Language and Meaning</h2>

<p><a href="https://write.as/manderson/the-algebra-of-language-unveiling-the-statistical-tapestry-of-form-and-meaning">In our last post</a>, we examined how there is a statistical and algebraic nature to language that drives the power of LLMs, and that the form and meaning of a language may be much more intertwined than we assume, given just how much meaning LLMs are able to approximate via computation of statistical arrays alone.</p>

<p>This interlacement of form and meaning lies in where and how words show up repeatedly in sentences and texts, not simply in the form of the words themselves, because all languages contain words that have the same form but different meanings. Some words that share the same form have entirely unrelated meanings (homophony), while other words with the same form have closely related meanings (polysemy). Yet LLMs are able to use them in a more or less “natural” manner due to the high-dimensional mappings of word parts in statistical relation to one another – such that word analogies can be calculated mathematically:</p>

<blockquote><p>“For example, Google researchers took the vector for <strong>biggest</strong>, subtracted <strong>big</strong>, and added <strong>small</strong>. The word closest to the resulting vector was <strong>smallest</strong>.”
–<a href="https://www.understandingai.org/p/large-language-models-explained-with">Large language models, explained with a minimum of math and jargon</a>, Timothy Lee &amp; Sean Trott</p></blockquote>

<p>That the algebraic and statistical relationship of words in natural language can drive computational models&#39; generative capabilities suggests that language itself may reflect the limitations and potential of AI. And the thing with natural, human language is that while it is endlessly generative, it also tends to be imprecise. For our human usage, gestures and the context of our social interaction – who we are speaking to, and when – play a big role. As long as we get our main message across, we’re good.</p>

<p>Human language is fundamentally communicative and social, and there are feelings involved.</p>

<h2 id="the-imprecision-of-human-expression" id="the-imprecision-of-human-expression">The Imprecision of Human Expression</h2>

<p>Imagine yourself in a bustling restaurant in an international airport, surrounded by people from diverse linguistic backgrounds. You&#39;re trying to communicate with a traveller whose language you don&#39;t speak. What do you do?</p>

<p>You resort to body language. You gesture hyperbolically and make exaggerated facial expressions. You point to objects, mime actions, and mouth simple words you hope the other person might use as a basis for basic understanding.</p>

<p>Depth, nuance, and complexity are not possible (beyond each individual’s imagination) in this most elemental of interactions.</p>

<p>So what <em>is</em> required for depth, nuance, and complexity?</p>

<p>A shared language, whether spoken, written, or signed. In which a small set of sounds, letters, or signs are concatenated in a wide assortment of ways, both commonplace and surprising, to convey a wide assortment of ideas and feelings.</p>

<p>Yet a shared language, while providing a platform for greater depth, may still remain imprecise. What is meant to be conveyed is not always exactly what is understood.</p>

<p>There are furthermore gradations of precision in language, beginning with the ephemeral and contextual nature of spoken and signed language, moving into the more ossified form of written language, in which spelling must be exact and word selection must be more intentional. There is also a movement from the language we use with our family, with frequent, commonly used words, to the language we use when writing an academic paper, with domain-specific, rarer words. In education, we often refer to this type of language as Tier 2 or 3 vocabulary.</p>

<p><img src="https://i.snap.as/I36of2l3.png" alt="Tiered Vocabulary"/></p>

<p>If a person is equipped with more of that academic, domain-specific language, then greater precision in communication can be achieved. Yet the challenge of whether the listener hears and interprets what is intended remains. For example, the Scientific American article <a href="https://www.scientificamerican.com/article/people-have-very-different-understandings-of-even-the-simplest-words/">“People Have Very Different Understandings of Even the Simplest Words”</a> discusses how the more abstract a word is, the more it tends to be tied to an emotional valence and to someone’s identity and experiences, rather than to a precise meaning.</p>

<h2 id="the-computational-imperative" id="the-computational-imperative">The Computational Imperative</h2>

<p>But in some ways, this inherent fuzziness of our language may be a feature, rather than a bug. It gives us a complex adaptive system for navigating, creating, and communicating in a world of complex adaptive systems.</p>

<p>For computers and computations, however, exactness and precision in language are required – either a line of code runs and produces the correct output, or it doesn’t. So it’s quite interesting that one of the most immediately powerful use cases so far of LLMs seems to be as a natural language interface to develop and review code.</p>

<p>Stephen Wolfram, in a long and interesting explainer on how LLMs work, <a href="https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/">“What Is ChatGPT Doing … and Why Does It Work?”</a>, explores some of this tension between computational and natural language.</p>

<blockquote><p>“Human language is fundamentally imprecise, not least because it isn’t “tethered” to a specific computational implementation, and its meaning is basically defined just by a “social contract” between its users. But computational language, by its nature, has a certain fundamental precision—because in the end what it specifies can always be “unambiguously executed on a computer”. Human language can usually get away with a certain vagueness. (When we say “planet” does it include exoplanets or not, etc.?) But in computational language we have to be precise and clear about all the distinctions we’re making.”</p></blockquote>

<h2 id="computational-irreducibility-and-the-limits-of-predictability-and-learning" id="computational-irreducibility-and-the-limits-of-predictability-and-learning">Computational Irreducibility and the Limits of Predictability and Learning</h2>

<p>One of the limitations Wolfram raises in comparing human and computational language is what he terms “computational irreducibility,” the difficulty of making accurate predictions for a highly complex system, such as a weather or climate system. Predicting such systems requires performing step-by-step computation from an initial state, and so cannot be shortcut by compressing the data.</p>
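
<p>Wolfram’s own canonical illustration of computational irreducibility – borrowed here as my example, since the post itself does not cite it – is an elementary cellular automaton such as Rule 30: the update rule is trivial, yet as far as anyone knows there is no general shortcut to knowing row n other than computing every row before it.</p>

<pre><code class="language-python"># A borrowed illustration of computational irreducibility: Wolfram's Rule 30
# cellular automaton. The rule is trivial, yet the only known general way to
# get row n is to compute every row before it, step by step.
def rule30_step(cells):
    """Apply Rule 30 to one row of 0/1 cells (edges padded with 0)."""
    padded = [0] + cells + [0]
    return [padded[i - 1] ^ (padded[i] | padded[i + 1])   # left XOR (center OR right)
            for i in range(1, len(padded) - 1)]

row = [0] * 15 + [1] + [0] * 15   # a single "on" cell in the middle
for _ in range(16):
    print("".join("#" if c else "." for c in row))
    row = rule30_step(row)        # no known closed form lets us skip these steps
</code></pre>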

<p>In some ways, this “compression” of information is what we are doing with language as we use more “Tier 2” and “Tier 3” – or academic – words in our speech or writing. There is a greater density of information provided in academic speech and writing, in which more abstract words are used to convey complex concepts, and our sentences tend to become more compound and complex. The simpler, more frequent words, phrases, and sentences we use in our everyday speech are more regular and thus, more learnable.</p>
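
<p>One way to put a rough number on that intuition about density – my own back-of-the-envelope illustration, using Shannon entropy rather than anything from Wolfram’s essay – is to note that text drawn from a larger, flatter vocabulary of rarer words carries more bits per word than text built from a few very frequent ones.</p>

<pre><code class="language-python"># A rough sketch (my own illustration) of "information density" in Shannon's
# terms: rarer, more varied word choices carry more bits per word than a few
# very frequent ones. The word counts below are made up for illustration.
from math import log2

def entropy_bits_per_word(freqs):
    """Shannon entropy of a word-frequency distribution, in bits per word."""
    total = sum(freqs.values())
    return -sum((n / total) * log2(n / total) for n in freqs.values())

everyday = {"the": 40, "dog": 20, "ran": 20, "fast": 20}               # few, frequent words
academic = {"the": 10, "canine": 9, "locomotion": 9, "velocity": 9,
            "accelerated": 9, "kinematics": 9, "rapidly": 9}           # more, rarer words

print(round(entropy_bits_per_word(everyday), 2))   # ~1.92 bits per word
print(round(entropy_bits_per_word(academic), 2))   # ~2.81 bits per word
</code></pre>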

<blockquote><p>. . . there’s just a fundamental tension between learnability and computational irreducibility. Learning involves in effect compressing data by leveraging regularities. But computational irreducibility implies that ultimately there’s a limit to what regularities there may be.</p>

<p>. . . there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable. And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.</p></blockquote>

<h2 id="irregularity-and-regularity-in-language" id="irregularity-and-regularity-in-language">Irregularity and Regularity in Language</h2>

<p>What’s interesting to note here is that all languages have constructive tensions between regularity and irregularity. This tension may result from language being honed over time to be more learnable within our cognitive constraints. We’ve explored some of this before in our post, <a href="https://languageandliteracy.blog/irregularity-enhances-learning-maybe">Irregularity Enhances Learning (Maybe)</a>, in which we examined a paper by Michael Ramscar suggesting there is some level of tension between language forms that show up again and again, and the language forms that are more infrequent but thus inherently gain more of our attention. This relates to the theory of “statistical learning,” through which we not only learn language but also map a language to its written form.</p>

<p>For Wolfram, the fact that LLMs are as powerful as they are suggests that human language is actually more statistically regular than we may have thought:</p>

<blockquote><p>“my strong suspicion is that the success of ChatGPT implicitly reveals an important “scientific” fact: that there’s actually a lot more structure and simplicity to meaningful human language than we ever knew—and that in the end there may be even fairly simple rules that describe how such language can be put together.”</p>

<p>And instead what we should conclude is that tasks—like writing essays—that we humans could do, but we didn’t think computers could do, are actually in some sense computationally easier than we thought.</p>

<p>In other words, the reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought. And in a sense this takes us closer to “having a theory” of how we humans manage to do things like writing essays, or in general deal with language.</p></blockquote>

<p>The as-yet unrealized potential, for Wolfram, is that AI with a greater underlying capability for computational language may be able to bridge our more “computationally shallow” human language with the precision required for more complex computations:</p>

<blockquote><p>“its very success gives us a reason to think that it’s going to be feasible to construct something more complete in computational language form. And, unlike what we’ve so far figured out about the innards of ChatGPT, we can expect to design the computational language so that it’s readily understandable to humans.”</p></blockquote>

<h2 id="decontextualized-language-the-pathway-to-precision" id="decontextualized-language-the-pathway-to-precision">Decontextualized Language: The Pathway to Precision</h2>

<p>On this pathway towards integration of human language and computational language, it’s interesting to consider how, in our own language development, we become better able to “compress information” and develop greater precision in our thinking and communication as we learn and incorporate rarer and more abstract language into our own repertoires. We’ve spoken before about <a href="https://write.as/manderson/research-highlight-2-the-language-teachers-use-influences-the-language">“decontextualized language”</a> – the language that takes us beyond the immediate context and moment, and how such language can take us beyond our own delimited feelings and experiences, and into a realm of interpersonal and cultural thought, knowledge, and perspectives. This is the language of storybooks, of science, and – at its greatest extreme – of code. We begin teaching this form of language when we engage in storytelling with our children, reading with them, and talking to them about books. It becomes increasingly dense and complex as we move into disciplinary study.</p>

<p>There is some evidence that training LLMs on this specific form of language is more powerful – <a href="https://www.quantamagazine.org/tiny-language-models-thrive-with-gpt-4-as-a-teacher-20231005/">such as this study</a> training a “tiny LLM” on children’s stories. And if you think about what LLMs have been trained on thus far, it’s a corpus of written language, not transcripts of everyday conversation. As we’ve explored in depth on this blog, written language is not synonymous with oral language – by nature of being written, it is already more “decontextualized,” and requires more inference and perspective-taking. That LLMs are trained on such a corpus may, in fact, be why their algebraic and statistical magic can be so surprisingly powerful. There is a greater density of information in the written forms of our languages.</p>

<h2 id="implications-for-teaching-and-learning" id="implications-for-teaching-and-learning">Implications for Teaching and Learning</h2>

<p>What might all of this say about teaching and learning? Well, so far, one of the facets we’ve highlighted from LLMs is that the statistical nature of language alone can take us pretty far. This suggests that, alongside social interaction and peer engagement and communication, we want to increase the volume of language exposure and use. And in terms of the nature of the language we want to increase: the more that the form of that language combines precision with abstraction, the greater the computational power it can provide. Turning up the dial on decontextualized language use and exposure – in other words, providing our children with <a href="https://languageandliteracy.blog/provide-our-students-with-textual-feasts">“textual feasts,”</a> to use Alfred Tatum’s term – may be the key to enhanced learning.</p>

<h3 id="sources-for-further-exploration" id="sources-for-further-exploration">Sources for Further Exploration</h3>

<p>If you are interested in further exploring some of the tensions we began this post with – between regularity and irregularity in language – here are some further interesting reads to geek out on:</p>
<ul><li><a href="https://arxiv.org/pdf/1904.03991">“Source codes in human communication”</a> by Michael Ramscar</li>
<li><a href="https://www.sciencedirect.com/science/article/abs/pii/S0010027707001436?via%3Dihub">“Expectation-based syntactic comprehension”</a> by Roger Levy</li>
<li><a href="https://www.degruyter.com/document/doi/10.1515/cog-2024-0027/html">“Cognitive approaches to uniformity and variability in morphology”</a> by Petar Milin, Neil Bermel, and James Blevins</li></ul>

<p><a href="https://languageandliteracy.blog/tag:language" class="hashtag"><span>#</span><span class="p-category">language</span></a> <a href="https://languageandliteracy.blog/tag:computation" class="hashtag"><span>#</span><span class="p-category">computation</span></a> <a href="https://languageandliteracy.blog/tag:algorithms" class="hashtag"><span>#</span><span class="p-category">algorithms</span></a> <a href="https://languageandliteracy.blog/tag:learning" class="hashtag"><span>#</span><span class="p-category">learning</span></a> <a href="https://languageandliteracy.blog/tag:LLMs" class="hashtag"><span>#</span><span class="p-category">LLMs</span></a> <a href="https://languageandliteracy.blog/tag:cognition" class="hashtag"><span>#</span><span class="p-category">cognition</span></a></p>
]]></content:encoded>
      <guid>https://languageandliteracy.blog/the-pathway-of-human-language-towards-computational-precision-in-llms</guid>
      <pubDate>Sun, 19 May 2024 15:05:11 +0000</pubDate>
    </item>
    <item>
      <title>Language, Cognition, and LLMs</title>
      <link>https://languageandliteracy.blog/language-and-llms?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[“Semantic gradients,” are a tool used by teachers to broaden and deepen students&#39; understanding of related words by plotting them in relation to one another. They often begin with antonyms at each end of the continuum. Here are two basic examples:&#xA;&#xA;Semantic gradient examples&#xA;&#xA;Now imagine taking this approach and quantifying the relationships between words by adding numbers to the line graph. Now imagine adding another axis to this graph, so that words are plotted in a three dimensional space in their relationships. Then add another dimension, and another . . . heck, make it tens of thousands more dimensions, relating all the words available in your lexicon across a high dimensional space. . .&#xA;&#xA;. . . and you may begin to envision one of the fundamental powers of Large Language Models (LLMs).&#xA;&#xA;!--more--&#xA;&#xA;LLMs Are Powered by Language: Or, Words as a Vast Sea of Interrelated Statistical Arrays of Tokens&#xA;&#xA;At root, the most powerful current forms of AI derive their capacities from decomposing human language into vast arrays of numbers based on their high dimensional statistical relationships and then predicting probabilistically what the next tokens are most likely to be. &#xA;&#xA;There’s a kind of alchemical transformation that occurs that seems to maintain the meaning in the generative pronouncements of the frontier LLMs, all the more amazing because so far the very engineers who have designed the structure for these operations to occur do not fully understand what the models are doing to arrive at their seemingly oracular destinations.&#xA;&#xA;In other words – the power of LLMs seemingly derives from the statistical power of language. There is something in the nature of language itself that seems to provide these computations of vast arrays of numbers with a lattice of our world, enabling LLMs to gain uncanny abilities from superpowered next word prediction. That LLMs have the generative powers they have—and that they have them without any consciousness or social interaction whatsoever—bolsters the argument that there is something about language itself, not just our brains, that is powerful.&#xA;&#xA;An Aside on Power Law Scaling&#xA;&#xA;One of the interesting features of human language is that it exhibits power scaling laws, as with other complex adaptive systems such as animals, cities, or businesses, as I recently examined in this post about Geoffrey West&#39;s fascinating book, Scale. The frequency of word usage, the length of sentences and texts, and the number of words in a language all follow power law distributions. This means that a small number of words are used frequently, while most words are used infrequently, and long sentences and texts are less common than shorter ones. As an interesting parallel, power law scaling is exhibited not only by language itself and through its generative manifestations in LLMs, but furthermore through the data--and the data centers and energy--required for training and using LLMs. Thus far, there is no apparent ceiling for LLM advancement in capability beyond that of the ceiling on the scalability of computer chips, data centers, and training data.&#xA;&#xA;Innate vs. Developed Language: A Review of Our Path Traversed Thus Far&#xA;&#xA;In our series “Innate vs. Developed”, we have explored the nature of language, challenging a widely held view that language is completely and innately hardwired in the human brain. 
Drawing upon “The Language Game” and &#34;Rethinking Innateness” as sources of inspiration, we have considered the notion that language is an emergent, culturally-evolved phenomenon that mounts atop an “inner scaffold” that exists within our brains and further refines and specializes our neural networks through simple repeated social interactions over time. &#xA;&#xA;We also considered how developing proficiency in reading and writing yet further extends and reinforces these channels across our brains – and how developing proficiency in multiple languages and literacies makes those networks even yet more robust.&#xA;&#xA;We went further afield and investigated Cormac McCarthy’s ponderings on a seeming division between language and the ancient parts of our brain that exist before and beyond language. We also investigated the paradoxical nature of language, in that it can both enhance and potentially occlude our connection to our unconscious selves and to our natural world.&#xA;&#xA;I promised at the end of the first post in this series that I would “maybe dig into the relation of cognition and language and literacy a little, and riff on the implications for AI, ANNs, and LLMs.” It’s taken me some time to let all of this ripen, especially given the rapid pace at which LLMs are developing. I think I’m finally starting to gain some perspective on LLMs that may allow me to indulge in a little riffing.&#xA;&#xA;Sources for Spelunking&#xA;&#xA;Before said indulgence in my next post, I’ll first outline a few sources I will draw upon at the outset so you can go off and explore on your own before being further biased by my own rambling.&#xA;&#xA;First, if you are interested in learning more about that analogy of a high dimensional semantic gradient and gaining insight into how LLMs kinda work, I recommend three sources shared by Ethan Mollick (he himself is also an excellent source):&#xA;&#xA;But what is a GPT? Visual intro to transformers&#xA;Large language models, explained with a minimum of math and jargon&#xA;What Is ChatGPT Doing … and Why Does It Work?&#xA;&#xA;Second, if you want to explore some interesting aspects of language itself that are related to LLMs, check out the following:&#xA;&#xA;Uniquely human intelligence arose from expanded information capacity&#xA;Stream of Search (SoS): Learning to Search in Language&#xA;The Structure of Meaning in Language: Parallel Narratives in Linear Algebra and Category Theory&#xA;&#xA;An Anticipation of Where We May Go From Here&#xA;&#xA;From these and other sources, including dabbling with Copilot and Claude and Gemini, I will ponder some of the following points on what computational neural networks may be able to tell us about language and what language may be able to tell us about LLMs – and, ultimately, perhaps, what this all may be able to tell us about teaching and learning:&#xA;&#xA;The surprisingly inseparable interconnection between form and meaning&#xA;Blundering our way to computational precision through human communication; Or, the generative tension between regularity and randomness&#xA;The human (and now, machine) capacity for learning and using language may simply be a matter of scale&#xA;Is language as separable from thought (and, for that matter, from the world) as Cormac McCarthy said? . . . which actually ended up becoming more about fuzziness and precision in language, but hey!&#xA;Implicit vs. 
explicit learning of language and literacy&#xA;&#xA;#language #literacy #LLMs #computation #statistical #learning #ai&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p><a href="https://www.readingrockets.org/classroom/classroom-strategies/semantic-gradients">“Semantic gradients”</a> are a tool used by teachers to broaden and deepen students&#39; understanding of related words by plotting them in relation to one another. They often begin with antonyms at each end of the continuum. Here are two basic examples:</p>

<p><img src="https://i.snap.as/eXGuugvn.png" alt="Semantic gradient examples"/></p>

<p>Now imagine taking this approach and quantifying the relationships between words by adding numbers to the line graph. Now imagine adding another axis to this graph, so that words are plotted in relation to one another in a three-dimensional space. Then add another dimension, and another . . . heck, make it tens of thousands more dimensions, relating all the words available in your lexicon across a high-dimensional space . . .</p>

<p>. . . and you may begin to envision one of the fundamental powers of Large Language Models (LLMs).</p>



<h2 id="llms-are-powered-by-language-or-words-as-a-vast-sea-of-interrelated-statistical-arrays-of-tokens" id="llms-are-powered-by-language-or-words-as-a-vast-sea-of-interrelated-statistical-arrays-of-tokens">LLMs Are Powered by Language: Or, Words as a Vast Sea of Interrelated Statistical Arrays of Tokens</h2>

<p>At root, the most powerful current forms of AI derive their capacities from decomposing human language into vast arrays of numbers based on their high dimensional statistical relationships and then predicting probabilistically what the next tokens are most likely to be.</p>
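<p>As a back-of-the-envelope sketch – and only that, since an actual transformer works very differently under the hood – the core move of “estimate a probability distribution over possible next tokens, then sample from it” can be shown with simple bigram counts over a toy corpus:</p>

<pre><code># A crude sketch of probabilistic next-word prediction using bigram counts.
# Real LLMs learn these statistics over billions of parameters and long
# contexts; this toy version only looks at the single preceding word.
from collections import Counter, defaultdict
import random

random.seed(42)  # for a repeatable demo

corpus = "the cat sat on the mat and the cat ate the fish on the mat".split()

# Count which word follows which.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_word(word):
    """Sample the next word in proportion to how often it followed `word`."""
    counts = following[word]
    words = list(counts.keys())
    weights = list(counts.values())
    return random.choices(words, weights=weights)[0]

# Generate a short continuation, one predicted word at a time.
word = "the"
generated = [word]
for _ in range(6):
    word = next_word(word)
    generated.append(word)
print(" ".join(generated))
</code></pre>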

<p>There’s a kind of alchemical transformation at work that seems to preserve meaning in the generative pronouncements of the frontier LLMs, all the more amazing because, so far, the very engineers who designed the structures for these operations do not fully understand what the models are doing to arrive at their seemingly oracular destinations.</p>

<p>In other words – the power of LLMs seemingly derives from the statistical power of language. There is something in the nature of language itself that seems to provide these computations of vast arrays of numbers with a lattice of our world, enabling LLMs to gain uncanny abilities from superpowered next word prediction. That LLMs have the generative powers they have—and that they have them without any consciousness or social interaction whatsoever—bolsters the argument that there is something about language itself, not just our brains, that is powerful.</p>

<h3 id="an-aside-on-power-law-scaling" id="an-aside-on-power-law-scaling">An Aside on Power Law Scaling</h3>

<p>One of the interesting features of human language is that it exhibits power-law scaling, as do other complex adaptive systems such as animals, cities, or businesses, as I recently examined in <a href="https://schoolecosystem.wordpress.com/2024/03/17/power-law-scaling-and-schools/">this post</a> about Geoffrey West&#39;s fascinating book, <em>Scale</em>. The frequency of word usage, the length of sentences and texts, and the number of words in a language all follow power law distributions. This means that a small number of words are used very frequently while most words are used rarely, and long sentences and texts are less common than shorter ones. As an interesting parallel, power-law scaling shows up not only in language itself and in its generative manifestations in LLMs, but also in the data—and the data centers and energy—required for training and using LLMs. Thus far, there is no apparent ceiling on LLM advancement in capability other than the ceiling on the scalability of computer chips, data centers, and training data.</p>
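<p>You can see a rough version of this Zipf-style pattern yourself: take a text, count word frequencies, and compare frequency against rank. The snippet below is only a minimal sketch, with a placeholder string standing in for a real corpus (the pattern only emerges clearly at scale):</p>

<pre><code># A quick sketch of power-law (Zipf-like) behavior in word frequencies.
# With a real corpus, frequency tends to fall off roughly as 1/rank,
# so frequency * rank stays roughly constant across the top words.
from collections import Counter

text = "the quick brown fox jumps over the lazy dog and the fox chases the dog".lower().split()
counts = Counter(text)

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, freq * rank)
</code></pre>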

<h2 id="innate-vs-developed-language-a-review-of-our-path-traversed-thus-far" id="innate-vs-developed-language-a-review-of-our-path-traversed-thus-far">Innate vs. Developed Language: A Review of Our Path Traversed Thus Far</h2>

<p>In our series <a href="https://languageandliteracy.blog/innate-vs">“Innate vs. Developed”</a>, we have explored the nature of language, challenging a widely held view that language is completely and innately hardwired in the human brain. <a href="https://languageandliteracy.blog/language-like-reading-may-not-be-innate">Drawing upon “The Language Game” and “Rethinking Innateness”</a> as sources of inspiration, we have considered the notion that language is an emergent, culturally-evolved phenomenon that mounts atop <a href="https://languageandliteracy.blog/the-inner-scaffold-for-language-and-literacy">an “inner scaffold”</a> that exists within our brains and further refines and specializes our neural networks through simple repeated social interactions over time.</p>

<p>We also considered how developing proficiency in reading and writing further extends and reinforces these channels across our brains – and how developing proficiency in multiple languages and literacies makes those networks <a href="https://languageandliteracy.blog/accelerating-the-inner-scaffold-across-modalities-and-languages">yet more robust</a>.</p>

<p>We went further afield and <a href="https://languageandliteracy.blog/thinking-inside-and-outside-of-language">investigated Cormac McCarthy’s ponderings</a> on a seeming division between language and the ancient parts of our brain that exist before and beyond language. We also investigated the paradoxical nature of language, in that it can both enhance and <a href="https://languageandliteracy.blog/speaking-ourselves-into-being-and-others-into-silence-the-power-of-language">potentially occlude</a> our connection to our unconscious selves and to our natural world.</p>

<p>I promised at the end of <a href="https://languageandliteracy.blog/language-like-reading-may-not-be-innate">the first post in this series</a> that I would “maybe dig into the relation of cognition and language and literacy a little, and riff on the implications for AI, ANNs, and LLMs.” It’s taken me some time to let all of this ripen, especially given the rapid pace at which LLMs are developing. I think I’m finally starting to gain some perspective on LLMs that may allow me to indulge in a little riffing.</p>

<h2 id="sources-for-spelunking" id="sources-for-spelunking">Sources for Spelunking</h2>

<p>Before said indulgence in my next post, I’ll first outline a few sources I will draw upon at the outset so you can go off and explore on your own before being further biased by my own rambling.</p>

<p>First, if you are interested in learning more about that analogy of a high dimensional semantic gradient and gaining insight into how LLMs kinda work, I recommend three sources <a href="https://x.com/emollick/status/1775355910761681177">shared by Ethan Mollick</a> (he himself is also <a href="https://www.oneusefulthing.org/">an excellent source</a>):</p>
<ul><li><a href="https://www.youtube.com/watch?v=wjZofJX0v4M">But what is a GPT? Visual intro to transformers</a></li>
<li><a href="https://www.understandingai.org/p/large-language-models-explained-with">Large language models, explained with a minimum of math and jargon</a></li>
<li><a href="https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/">What Is ChatGPT Doing … and Why Does It Work?</a></li></ul>

<p>Second, if you want to explore some interesting aspects of language itself that are related to LLMs, check out the following:</p>
<ul><li><a href="https://www.nature.com/articles/s44159-024-00283-3.epdf?sharing_token=dc9WtYt3C_FN2N5q5mmKatRgN0jAjWel9jnR3ZoTv0PIvBIKEnJUrpLA70zYn0mjSaDkgiBUb43hOoUEou9xdgynS0nAWob7QAH5X7gROQMoz5n9acglkBUa_86OzUA1B-Wg9_p5hHRLFUQ95SWsfFXtU8jHuxKnM8_fWZKCoAA%3D">Uniquely human intelligence arose from expanded information capacity</a></li>
<li><a href="https://arxiv.org/abs/2404.03683">Stream of Search (SoS): Learning to Search in Language</a></li>
<li><a href="http://ams.org/journals/notices/202402/rnoti-p174.pdf">The Structure of Meaning in Language: Parallel Narratives in Linear Algebra and Category Theory</a></li></ul>

<h2 id="an-anticipation-of-where-we-may-go-from-here" id="an-anticipation-of-where-we-may-go-from-here">An Anticipation of Where We May Go From Here</h2>

<p>From these and other sources, including dabbling with Copilot and Claude and Gemini, I will ponder some of the following points on what computational neural networks may be able to tell us about language and what language may be able to tell us about LLMs – and, ultimately, perhaps, what this all may be able to tell us about teaching and learning:</p>
<ul><li><a href="https://languageandliteracy.blog/the-algebra-of-language-unveiling-the-statistical-tapestry-of-form-and-meaning">The surprisingly inseparable interconnection between form and meaning</a></li>
<li><a href="https://languageandliteracy.blog/the-pathway-of-human-language-towards-computational-precision-in-llms">Blundering our way to computational precision through human communication; Or, the generative tension between regularity and randomness</a></li>
<li><a href="https://write.as/manderson/scaling-our-capacity-for-processing-information">The human (and now, machine) capacity for learning and using language may simply be a matter of scale</a></li>
<li><a href="https://write.as/manderson/the-interplay-of-language-cognition-and-llms-where-fuzziness-meets-precision">Is language as separable from thought (and, for that matter, from the world) as Cormac McCarthy said?</a> <em>. . . which actually ended up becoming more about fuzziness and precision in language, but hey!</em></li>
<li><a href="https://write.as/manderson/llms-statistical-learning-and-explicit-teaching">Implicit vs. explicit learning of language and literacy</a></li></ul>

<p><a href="https://languageandliteracy.blog/tag:language" class="hashtag"><span>#</span><span class="p-category">language</span></a> <a href="https://languageandliteracy.blog/tag:literacy" class="hashtag"><span>#</span><span class="p-category">literacy</span></a> <a href="https://languageandliteracy.blog/tag:LLMs" class="hashtag"><span>#</span><span class="p-category">LLMs</span></a> <a href="https://languageandliteracy.blog/tag:computation" class="hashtag"><span>#</span><span class="p-category">computation</span></a> <a href="https://languageandliteracy.blog/tag:statistical" class="hashtag"><span>#</span><span class="p-category">statistical</span></a> <a href="https://languageandliteracy.blog/tag:learning" class="hashtag"><span>#</span><span class="p-category">learning</span></a> <a href="https://languageandliteracy.blog/tag:ai" class="hashtag"><span>#</span><span class="p-category">ai</span></a></p>
]]></content:encoded>
      <guid>https://languageandliteracy.blog/language-and-llms</guid>
      <pubDate>Tue, 23 Apr 2024 14:48:41 +0000</pubDate>
    </item>
  </channel>
</rss>