Introduction to Neural Net Modeling in Developmental Psychology
Posted on December 5, 2007
Jeffrey L. Elman introduces the principles of neural networks as they relate to developmental psychology.
Dr. Jeffrey L. Elman: We were asked by the conference organizers to present something as part of the Back to School series and to focus on neural networks and developmental psychology. So the goal today is to present a kind of tutorial. But because there’s limited time, we thought it would be better to focus not on low-level detail but on the operating principles, to try to give you some sense of what makes these things tick and why people have found them useful and, we think, capable of providing insights into a lot of developmental phenomena.
In my overview, what I’d like to do is the following: begin by telling you a little bit about why neural networks. That is, why it is that over two decades ago now, people started looking at this alternative to symbolic models. Having done that, say a little bit about how they work, but again focus on what their characteristics are, because I think that’s the important take-home message.
Learning has been very important to neural networks; it’s probably one of the most exciting aspects I think for many of us. So, I’ll talk about how learning by example works, and give a quick example. And then I want to step back and talk about the bigger picture, because as you probably know, it’s been a controversial enterprise and a lot of the controversy stems from a different position taken on some very important issues that have arisen in trying to understand development.
So, why is it that people turn to these things in the first place? Well, there are a number of things that people do remarkably well that have been resistant to understanding through standard machine approaches, that is, using symbolic models. These include things like the ability to extract patterns out of context. This is a very well known photo by R.C. [James], from the ’70s, showing what I think probably everybody will see; or if not, will see now, as a Dalmatian in the park. It illustrates our ability to extract patterns from a somewhat complex, and often noisy, whole. Our ability to do that has a lot to do with the skills we have at integrating context. So, for example, we see the central figure in this word as the letter A, but we see it as an H here, indicating the role that context plays.
We also have other phenomena that have to do with learning. We like to think of learning as a constant progression uphill towards improvement. Yet we know that in development there are many phenomena in which what initially seems like good performance (in this example, the ability to correctly produce the past tense of the irregular verb "come") starts off fine, then goes through some period of either variability or decline, and eventually ends up at adult-like performance.
So, learning is, as we see, often nonmonotonic. There are retreats from progress, temporary, we hope. And that’s been a challenge.
We’re extraordinarily good at perceiving things that are important to us, such as faces. So you should see, I imagine, these faces as different. And understanding how that process happens has been quite a challenge. It’s particularly interesting, because if you turn faces upside down here, that ability vanishes. So you may see these faces as the same, and yet when you right them, you notice they’re hideously different, again indicating a role of context, expectation, and overall configuration in processing.
Language is perhaps one of the crown jewels in human cognition. And one of the things that kids do, is figure out generalizations from their exposure, and those generalizations are often fairly complex and not always obvious.
So, for instance, the data that seemed to be, or had been claimed to be, available to kids with regard to how you create questions would suggest that, given a declarative statement like “the girl is nice” and the appropriate question, “is the girl nice,” there’s a rule to be learned, which is to take the first "is," in this case, and prepose it, giving you the question form.
Chomsky argued that in fact, this was an example where learning could not happen, because if children made this generalization based on the data available to them, they would be tempted, given the sentence, the declarative, “the girl who is my neighbor is nice,” to incorrectly apply that rule and form the ungrammatical, "is the girl who my neighbor is nice."
Kids don’t do this. So Steve Crain called this the "parade case" of an innate constraint, of innate knowledge, arguing that it could not possibly be learned. Interestingly, that’s been challenged both on empirical grounds by Ben Ambridge and colleagues and also through simulations from John Lewis and myself, but it’s clearly a challenging and interesting question. How can you learn things where the data are not so obvious?
In short then, people do a number of remarkable things and the goals of neural networks have been to try to understand how that happens.
We find subtle patterns; we’re very good at integrating context from a variety of sources. That integration typically obeys multiple constraints. There are times when our knowledge seems to be categorical and discrete and binary, and other times when the behavior is graded. And finally, we learn. So these are all the kinds of things that have pushed people towards neural networks and motivated the interest in them.
What is a neural network? Well, it’s a system, it’s a network composed of simple units that are schematically like simplified neurons which don’t do very much, except they have activations; so these might be encoded in terms of a state or color here, red being active, blue being inactive.
That indicates its level of excitation or inhibition. A network is composed of a variety of these things. And most typically they’re organized in some sort of topology; so there’s an architecture. A very common and early one is one where there’s a layer of inputs that takes information from the world and a layer of outputs that delivers it back to the world or to another part of the system, with mediating units in between, often called hidden units. These things are connected in much the same way that neurons are connected by synapses.
And the strengths of these connections then determine the nature of the way they interact. So the dynamics would typically proceed in this sort of an architecture where an input layer is activated. That activation passes forward to the output and in real time, there’s a continuing ongoing dynamic. What the network knows at any point in time, or what it thinks, you might say instead, is reflected in terms of the specific activation pattern. What it knows is really reflected in the architecture: how many units there are, how they’re arranged, how they’re interconnected, and the nature of the interconnections.
And then finally, learning consists of adjusting those connections through a very simple algorithm so that some are pruned, some are strengthened, some are weakened, some become inhibitory and so on.
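To make the description above concrete, here is a minimal sketch of activation passing forward through an input layer, a hidden layer, and an output layer. The weights, the layer sizes, and the choice of a sigmoid response function are all assumptions made for illustration; none of them come from the talk.

```python
import math

def sigmoid(x):
    """Squash a net input into an activation between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(activations, weights, biases):
    """Compute one layer's activations.

    weights[j][i] is the connection strength from sending unit i
    to receiving unit j; negative values act as inhibition.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        net_input = sum(w * a for w, a in zip(row, activations)) + bias
        outputs.append(sigmoid(net_input))
    return outputs

# Hypothetical connection strengths: 2 input units, 2 hidden units, 1 output unit.
W_HIDDEN = [[2.0, -1.0], [-1.5, 1.0]]
B_HIDDEN = [0.0, 0.0]
W_OUTPUT = [[1.0, -2.0]]
B_OUTPUT = [0.0]

def forward(inputs):
    """Pass activation from the input layer through the hidden layer to the output."""
    hidden = layer_forward(inputs, W_HIDDEN, B_HIDDEN)
    output = layer_forward(hidden, W_OUTPUT, B_OUTPUT)
    return hidden, output

hidden, output = forward([1.0, 0.0])
print("hidden activations:", hidden)
print("output activation:", output)
```

In this sketch the "knowledge" lives entirely in the weight tables, and learning would amount to changing those numbers.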
What this gives you then is a system that has the following kinds of characteristics. First of all, it’s tempting to look at the activation of a single unit and think of that in terms of a hypothesis detector, a grandmother cell, grandfather cell. But more interesting, and I think more useful, is to realize that it’s possible to express patterns as activations across a population of cells.
In this way the information or the representations are distributed spatially. It gives you a couple of things. It gives you enormous expressive power: you have a much richer set of representations available. But it also means that the same units can participate in many patterns, and typically do participate in many patterns, and in turn what that means is that there’s sharing of knowledge. There’s recruiting of old information to serve new purposes.
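A small sketch may help show what it means for the same units to participate in many patterns. The six-unit patterns and the item names below are hypothetical, chosen only to illustrate overlap; they are not taken from any model in the talk.

```python
# Hypothetical activation patterns over a population of 6 binary units.
PATTERNS = {
    "dog": (1, 1, 0, 1, 0, 0),
    "cat": (1, 1, 0, 0, 1, 0),
    "car": (0, 0, 1, 1, 0, 1),
}

def shared_units(a, b):
    """Return the indices of units that are active in both patterns."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == 1 and y == 1]

n_units = 6
# One unit per item (a "grandmother cell" scheme) can name at most n_units items.
localist_capacity = n_units
# Patterns across the whole population: every on/off combination is available.
distributed_capacity = 2 ** n_units

print("dog/cat share units:", shared_units(PATTERNS["dog"], PATTERNS["cat"]))
print("dog/car share units:", shared_units(PATTERNS["dog"], PATTERNS["car"]))
print(localist_capacity, "localist items vs", distributed_capacity, "distributed patterns")
```

Whatever is learned about the connections of the shared units for one item is automatically available to the other, which is one way to picture the sharing of knowledge.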
There’s an interesting example of this from brain imaging work by Fred Dick and colleagues, who played environmental sounds to subjects while they were being scanned in the magnet. So this is an activation pattern recorded while subjects were listening to cows mooing, bells clanging and so on. And what they noted was that these regions tended to be highly active.
They then presented words to the same subjects, asking them to read the words cow, moo, bell, and so on, and found that the patterns of activation overlapped considerably.
So this is an example from a brain of the kind of power that you get from distributed representations. The knowledge that you have from one domain, what’s thought of as a domain, actually transfers over and is used by the other.
Secondly, the representations are graded. That is, one can talk about partial activation and in that sense, partial knowledge: incipient knowledge, knowledge that is coming into being. One can then account for cases where the network, or putatively humans, might show some knowledge that is subthreshold, subliminal.
It also is a way of conveying the probabilistic nature of behavior and of cognition, because these partial activations can be thought of as representing probabilities.
Thirdly, their representations are highly context-dependent, and this is by virtue of the fact that the system is interconnected. Now, it’s not fully interconnected, with everything connected to everything else, so there’s some parceling of knowledge. But there’s the opportunity for knowledge to interact across domains.
Finally, they’re nonlinear. So what I haven’t told you is that the way in which these things respond to inputs is nonlinear. There are regions of great sensitivity and regions where they’re relatively less sensitive. What that means is that sometimes when they’re at their extrema the units may behave in a way that seems to be binary.
You get a very crisp distinction that’s categorical that looks symbolic in fact. There are other regions of great sensitivity where the input can produce dramatically different outputs and you get graded or continuous activations; so there’s differential sensitivity to different inputs.
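The differential sensitivity described above can be illustrated with a sigmoid response function, which is one common choice and is assumed here; the talk does not name a specific function. The sketch compares how much a unit's output changes for the same change in input near the middle of its range and near an extreme.

```python
import math

def sigmoid(x):
    """A common nonlinear response function (an assumption in this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))

def sensitivity(x, step=1.0):
    """Change in output when the net input moves from x to x + step."""
    return sigmoid(x + step) - sigmoid(x)

# Same-sized input change, applied in two different regions.
mid_change = sensitivity(-0.5)     # input moves from -0.5 to 0.5
extreme_change = sensitivity(5.0)  # input moves from 5.0 to 6.0

print("change near the middle:", round(mid_change, 4))
print("change near the extreme:", round(extreme_change, 4))
# Far from zero the unit is effectively on or off, which looks binary.
print("saturated values:", round(sigmoid(-6.0), 4), round(sigmoid(6.0), 4))
```

Near the extremes the output barely moves and looks categorical; near the middle, small input differences produce large, graded differences in output.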
A very important aspect of neural networks, and I think the one that probably at this meeting will be of most relevance and interest to people, is their ability to learn, and it’s a special kind of learning. It’s inductive inference by example, so rather than being programmed or taught what a rule or regularity is, the networks learn the generalizations, they learn abstraction based on concrete evidence.
So, for instance, in a network that has input units and output units, it might be exposed to pairs where the input is two and it’s trained to produce the correct output for that input, in this case four. And then it will be trained on another pattern or various patterns and similar adjustments made in small increments.
You might think that what it will do is simply learn this stuff by rote. Sometimes happens, but more interesting is when it’s presented with a novel input to see what it does and in this case it will produce, given sufficient experience, the correct output of 169, which of course is the square. So it’s learned or induced that sort of regularity based on examples. That’s not a very interesting example, because we have lots of systems that do this much better than neural networks; my hand calculator will do that marvelously well.
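Learning by example with small incremental adjustments can be sketched with a single linear unit trained by the delta rule. This is a deliberately simpler stand-in: the unit learns doubling rather than the squaring from the talk, because one linear unit cannot represent a square, and the learning rate and number of passes are arbitrary assumptions.

```python
def train_linear_unit(examples, learning_rate=0.01, epochs=200):
    """Adjust one connection strength in small steps to reduce output error."""
    weight = 0.0
    for _ in range(epochs):
        for x, target in examples:
            output = weight * x
            error = target - output
            # Delta rule: nudge the weight in proportion to error and input.
            weight += learning_rate * error * x
    return weight

# Training pairs (input, correct output); the rule "double it" is never stated.
EXAMPLES = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

weight = train_linear_unit(EXAMPLES)
novel_output = weight * 13.0  # 13 never appeared during training

print("learned weight:", round(weight, 6))
print("output for novel input 13:", round(novel_output, 6))
```

The unit is never told the rule; it is only corrected on examples, yet it ends up producing an appropriate output for an input it has not seen.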
So what’s interesting is where we know there’s a regularity, we believe there’s regularity, but we’re not sure what it is, and then the networks can be used as discovery devices, because having solved the problem, learned the regularity, one can analyze them.
For example, the pronunciation of O – U varies enormously across context. It could be that that’s simply a list of exceptions that has to be memorized. Probably not simply, because if I were to ask you to pronounce P – L – O – U – T – Y, 99% of you would say plouty. It doesn’t mean you’re right, but the consistency suggests that you’re all doing the same thing. On the other hand, P – L – O – U – G – H – Y shows much more variability. People will say pluffy, pluey, because the pronunciation of G – H will condition the pronunciation of the O – U.
So a very early neural network was designed in fact by Charlie Rosenberg and Terry Sejnowski to try to learn this mapping from orthography, the written form, to pronunciation. And the interesting part of that work is the analysis of how it solves the problem.
Here’s another problem. This is a problem confronted by every infant when it enters the world and that’s the problem of learning language. And one of the first steps of learning language is learning words. The problem is, words do not come segmented in the auditory stream with a sort of auditory analog of the visual white space you find between written words.
So when scanning text, you know what words are, because there’s white space or punctuation. In fluent language, oral language as in gesture, that sort of segmentation is not available. There are no silences that mark the boundaries between words. In fact in this text, and I think I have to hold this up, where are the silences between words? In fact I think there’s only one case where there is a silence between words. More often than not, the silences occur within words. Okay?
So the problem of segmentation, of identifying where the words are, is a prerequisite before you actually learn their content, and it’s a very hard one. How might you tackle this? Well, imagine that we take a simplified version of this task and take written text, but we strip away all of the spaces between words. So we might have a story that runs "many years ago, a boy and a girl" and so on and so on, and we remove the spaces and shove the letters together, so now you have the same segmentation problem arising.
And a network is trained to read these things and process them one at a time. Each time it reads a letter, it’s trained to produce an output: in this first case, the letter A; it reads the second letter and produces the letter N as output; reads an N, produces Y and so on. You may have seen the pattern here, which is that the output is always the next letter.
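The training setup just described, where the target for each letter is simply the letter that follows it, can be sketched as below. The helper name `next_letter_pairs` is hypothetical, and the sentence is the fragment quoted in the talk.

```python
def next_letter_pairs(text):
    """Strip spaces and punctuation, then pair each letter with its successor."""
    stream = "".join(ch for ch in text.lower() if ch.isalpha())
    # Each pair is (letter the network reads, letter it should produce).
    return list(zip(stream, stream[1:]))

pairs = next_letter_pairs("Many years ago, a boy and a girl")

for read, target in pairs[:4]:
    print("reads", read, "-> should produce", target)
```

Note that nothing in the pairs marks where one word ends and the next begins; the fourth pair, for instance, spans the boundary between the first two words.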
So it’s trying to anticipate what’s going to come next. We know that infants are very good at anticipating: through eye movements and a variety of other movements, infants do anticipate and generate expectations about what’s going to change in the world. Now the task of predicting the next letter is of course, short of memorizing the text, impossible or not practical.
But what you might expect is that, if I asked you what the first letter of a word is, you’d use your knowledge of the probability distribution of first letters across the language. So you’d be ill-advised to guess a Z or an X. A better guess would be S or M or T, one of the more common letters.
And as you hear successively more letters, you might imagine that your predictions will get better. Your error in that prediction will decline until you get to the end.
In fact, if you look at the errors produced by this network in predicting various letters, the error at the first letter of a word is relatively high and goes down successively as more letters are scanned. In fact the error maxima typically coincide with the onsets of words. So it would seem to be a very good strategy to simply try to anticipate what’s going to come next, and where uncertainty is high, there is likely to be a boundary between some sort of segment.
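The link between prediction error and word onsets can be illustrated without a neural network at all. The sketch below swaps in simple letter-pair counts in place of the trained network from the talk, and uses a hypothetical four-word lexicon whose words share no letters, so the effect is exaggerated compared with real language.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical lexicon; the words deliberately share no letters.
LEXICON = ["many", "girl", "bush", "poke"]

def make_stream(n_words, seed=0):
    """Concatenate randomly chosen words with no spaces; record word onsets."""
    rng = random.Random(seed)
    words = [rng.choice(LEXICON) for _ in range(n_words)]
    onsets, position = set(), 0
    for word in words:
        onsets.add(position)
        position += len(word)
    return "".join(words), onsets

def surprisal_by_position(stream):
    """Surprisal (in bits) of each letter given only the letter before it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(stream, stream[1:]):
        followers[prev][nxt] += 1
    result = {}
    for i in range(1, len(stream)):
        prev, nxt = stream[i - 1], stream[i]
        probability = followers[prev][nxt] / sum(followers[prev].values())
        result[i] = -math.log2(probability)
    return result

stream, onsets = make_stream(300)
surprisal = surprisal_by_position(stream)
onset_vals = [s for i, s in surprisal.items() if i in onsets]
inside_vals = [s for i, s in surprisal.items() if i not in onsets]
mean_onset = sum(onset_vals) / len(onset_vals)
mean_inside = sum(inside_vals) / len(inside_vals)

print("mean surprisal at word onsets:", round(mean_onset, 3))
print("mean surprisal inside words:", round(mean_inside, 3))
```

Inside a word the next letter is fully predictable in this toy lexicon, while at a word onset the predictor cannot know which word comes next, so uncertainty peaks exactly where the boundaries are.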
In 1996, I think it was, Jen Saffran, Dick Aslin, and Elissa Newport demonstrated that with just, I think, two minutes of exposure to an artificial language, eight-month-old infants were able to listen to fluent speech and extract the segment boundaries or the word boundaries, probably through sensitivity to distributional statistics.
So let me step back now and talk about bigger issues. There’s a wonderful book I highly recommend to you called Rethinking Innateness.
A number of years ago, Liz Bates, Mark Johnson, Annette Karmiloff-Smith, Domenico Parisi, Kim Plunkett, and I, worried about what are some of the implications for development that are suggested by connectionist models. And there were a couple of things that really struck us as different, that is, a different perspective. These remain highly controversial, but we think there’s some important insights here. These have to do with two things.
The form of knowledge, what does knowledge look like in the brain? And secondly, where does that knowledge come from? What are the origins of knowledge? Together these two questions provide a context for a variety of more specific debates.
For example, is knowledge modular and is the brain organized along lines of a modular architecture? A number of people have suggested yes, most famously Jerry Fodor. In neural networks, modularity very often exists as an outcome; it’s not an initial condition. So, by virtue of the demands of the tasks, of the input and experience, modularity in behavior is often exhibited and subdomains are created.
But as Annette Karmiloff-Smith remarked, “Modules are made, they’re constructed from experience. They’re not born and they’re not part of the basic architecture.” They’re typically leaky; so they respect the demands of whatever the domain is, but very often, and very tellingly, there are interactions between what might be called modules, suggesting they’re not quite so modular.
A second issue having to do with the form of knowledge is the extent to which it is isolated, encapsulated in terms of specific domains, and the approach advocated by many folks traditionally working within a symbolic paradigm has been that knowledge is highly specific to a domain. Now obviously what you know about tennis informs you not at all about language. These are specific domains and that’s self-evident.
The question is whether the domains exist as starting points reflected in neural architecture, or whether those too are the result of experience, and the connectionist perspective has been the latter. Secondly, the things that guide us in learning, the constraints, are enormously crucial, but the constraints themselves are not expressed in terms of, say, language as a set of constraints sui generis. Rather, they’re at a lower level and cut across various domains; in conjunction with each other and with experience, they may result in what looks like domain-specific behavior.
And finally, nature and nurture has been a real hot point. The connectionist position has been that there is not preknowledge of the sort that would be made available by evolutionarily designed microcircuits, brain circuits that are set up for specific tasks. Rather there are low level constraints and biases that together with experience shape outcomes. So there’s no preknowledge, rather there’s a set of lower level constraints that operate on experience.