CHAPTER 7 THE BIOLOGICALLY INSPIRED DIGITAL NEOCORTEX

Never trust anything that can think for itself if you can’t see where it keeps its brain.

Arthur Weasley, in J. K. Rowling, Harry Potter and the Prisoner of Azkaban

No, I’m not interested in developing a powerful brain. All I’m after is just a mediocre brain, something like the President of the American Telephone and Telegraph Company.

Alan Turing

A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.

Alan Turing

I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

Alan Turing

A mother rat will build a nest for her young even if she has never seen another rat in her lifetime.1 Similarly, a spider will spin a web, a caterpillar will create her own cocoon, and a beaver will build a dam, even if no contemporary ever showed them how to accomplish these complex tasks. That is not to say that these are not learned behaviors. It is just that these animals did not learn them in a single lifetime—they learned them over thousands of lifetimes. The evolution of animal behavior does constitute a learning process, but it is learning by the species, not by the individual, and the fruits of this learning process are encoded in DNA.

To appreciate the significance of the evolution of the neocortex, consider that it greatly sped up the process of learning (hierarchical knowledge) from thousands of years to months (or less). Even if millions of animals in a particular mammalian species failed to solve a problem (requiring a hierarchy of steps), it required only one to accidentally stumble upon a solution. That new method would then be copied and spread exponentially through the population.

We are now in a position to speed up the learning process by a factor of thousands or millions once again by migrating from biological to nonbiological intelligence. Once a digital neocortex learns a skill, it can transfer that know-how in minutes or even seconds. As one of many examples, at my first company, Kurzweil Computer Products (now Nuance Speech Technologies), which I founded in 1973, we spent years training a set of research computers to recognize printed letters from scanned documents, a technology called omni-font (any type font) optical character recognition (OCR). This particular technology has now been in continual development for almost forty years, with the current product called OmniPage from Nuance. If you want your computer to recognize printed letters, you don’t need to spend years training it to do so, as we did—you can simply download the evolved patterns already learned by the research computers in the form of software. In the 1980s we began working on speech recognition, and that technology, which has also been in continuous development now for several decades, is part of Siri. Again, you can download in seconds the evolved patterns learned by the research computers over many years.

Ultimately we will create an artificial neocortex that has the full range and flexibility of its human counterpart. Consider the benefits. Electronic circuits are millions of times faster than our biological circuits. At first we will have to devote all of this speed increase to compensating for the relative lack of parallelism in our computers, but ultimately the digital neocortex will be much faster than the biological variety and will only continue to increase in speed.

When we augment our own neocortex with a synthetic version, we won’t have to worry about how much additional neocortex can physically fit into our bodies and brains, as most of it will be in the cloud, like most of the computing we use today. I estimated earlier that we have on the order of 300 million pattern recognizers in our biological neocortex. That’s as much as could be squeezed into our skulls even with the evolutionary innovation of a large forehead and with the neocortex taking about 80 percent of the available space. As soon as we start thinking in the cloud, there will be no natural limits—we will be able to use billions or trillions of pattern recognizers, basically whatever we need, and whatever the law of accelerating returns can provide at each point in time.

In order for a digital neocortex to learn a new skill, it will still require many iterations of education, just as a biological neocortex does, but once a single digital neocortex somewhere and at some time learns something, it can share that knowledge with every other digital neocortex without delay. We can each have our own private neocortex extenders in the cloud, just as we have our own private stores of personal data today.

Last but not least, we will be able to back up the digital portion of our intelligence. As we have seen, it is not just a metaphor to state that there is information contained in our neocortex, and it is frightening to contemplate that none of this information is backed up today. There is, of course, one way in which we do back up some of the information in our brains—by writing it down. The ability to transfer at least some of our thinking to a medium that can outlast our biological bodies was a huge step forward, but a great deal of data in our brains continues to remain vulnerable.

Brain Simulations

One approach to building a digital brain is to simulate precisely a biological one. For example, Harvard brain sciences doctoral student David Dalrymple (born in 1991) is planning to simulate the brain of a nematode (a roundworm).2 Dalrymple selected the nematode because of its relatively simple nervous system, which consists of about 300 neurons, and which he plans to simulate at the very detailed level of molecules. He will also create a computer simulation of its body as well as its environment so that his virtual nematode can hunt for (virtual) food and do the other things that nematodes are good at. Dalrymple says it is likely to be the first complete brain upload from a biological animal to a virtual one that lives in a virtual world. As with his simulated nematode, it is open to debate whether even biological nematodes are conscious, although in their struggle to eat, digest food, avoid predators, and reproduce, they do have experiences to be conscious of.

At the opposite end of the spectrum, Henry Markram’s Blue Brain Project is planning to simulate the human brain, including the entire neocortex as well as the old-brain regions such as the hippocampus, amygdala, and cerebellum. His planned simulations will be built at varying degrees of detail, up to a full simulation at the molecular level. As I reported in chapter 4, Markram has discovered a key module of several dozen neurons that is repeated over and over again in the neocortex, demonstrating that learning is done by these modules and not by individual neurons.

Markram’s progress has been scaling up at an exponential pace. He simulated one neuron in 2005, the year the project was initiated. In 2008 his team simulated an entire neocortical column of a rat brain, consisting of 10,000 neurons. By 2011 this expanded to 100 columns, totaling a million cells, which he calls a mesocircuit. One controversy concerning Markram’s work is how to verify that the simulations are accurate. In order to do this, these simulations will need to demonstrate the kind of learning that I discuss below.

He projects simulating an entire rat brain of 100 mesocircuits, totaling 100 million neurons and about a trillion synapses, by 2014. In a talk at the 2009 TED conference at Oxford, Markram said, “It is not impossible to build a human brain, and we can do it in 10 years.”3 His most recent target for a full brain simulation is 2023.4

Markram and his team are basing their model on detailed anatomical and electrochemical analyses of actual neurons. Using an automated device they created called a patch-clamp robot, they are measuring the specific ion channels, neurotransmitters, and enzymes that are responsible for the electrochemical activity within each neuron. Their automated system was able to do thirty years of analysis in six months, according to Markram. It was from these analyses that they noticed the “Lego memory” units that are the basic functional units of the neocortex.


Actual and projected progress of the Blue Brain brain simulation project.


Significant contributions to the technology of robotic patch-clamping were made by MIT neuroscientist Ed Boyden, Georgia Tech mechanical engineering professor Craig Forest, and Forest’s graduate student Suhasa Kodandaramaiah. They demonstrated an automated system with one-micrometer precision that can perform scanning of neural tissue at very close range without damaging the delicate membranes of the neurons. “This is something a robot can do that a human can’t,” Boyden commented.

To return to Markram’s simulation, after simulating one neocortical column, Markram was quoted as saying, “Now we just have to scale it up.”5 Scaling is certainly one big factor, but there is one other key hurdle, which is learning. If the Blue Brain Project brain is to “speak and have an intelligence and behave very much as a human does,” which is how Markram described his goal in a BBC interview in 2009, then it will need to have sufficient content in its simulated neocortex to perform those tasks.6 As anyone who has tried to hold a conversation with a newborn can attest, there is a lot of learning that must be achieved before this is feasible.


The tip of the patch-clamping robot developed at MIT and Georgia Tech scanning neural tissue.


There are two obvious ways this can be done in a simulated brain such as Blue Brain. One would be to have the brain learn this content the way a human brain does. It can start out like a newborn human baby with an innate capacity for acquiring hierarchical knowledge and with certain transformations preprogrammed in its sensory preprocessing regions. But the learning that takes a biological infant to the point of being a person who can hold a conversation would need to occur in a comparable manner in the nonbiological version. The problem with that approach is that a brain that is being simulated at the level of detail anticipated for Blue Brain is not expected to run in real time until at least the early 2020s. Even running in real time would be too slow unless the researchers are prepared to wait a decade or two to reach intellectual parity with an adult human, although real-time performance will get steadily faster as computers continue to grow in price/performance.

The other approach is to take one or more biological human brains that have already gained sufficient knowledge to converse in meaningful language and to otherwise behave in a mature manner and copy their neocortical patterns into the simulated brain. The problem with this method is that it requires a noninvasive and nondestructive scanning technology of sufficient spatial and temporal resolution and speed to perform such a task quickly and completely. I would not expect such an “uploading” technology to be available until around the 2040s. (The computational requirement to simulate a brain at that degree of precision, which I estimate to be 10^19 calculations per second, will be available in a supercomputer according to my projections by the early 2020s; however, the necessary nondestructive brain scanning technologies will take longer.)

There is a third approach, which is the one I believe simulation projects such as Blue Brain will need to pursue. One can simplify molecular models by creating functional equivalents at different levels of specificity, ranging from my own functional algorithmic method (as described in this book) to simulations that are closer to full molecular simulations. The speed of learning can thereby be increased by a factor of hundreds or thousands depending on the degree of simplification used. An educational program can be devised for the simulated brain (using the functional model) that it can learn relatively quickly. Then the full molecular simulation can be substituted for the simplified model while still using its accumulated learning. We can then simulate learning with the full molecular model at a much slower speed.

American computer scientist Dharmendra Modha and his IBM colleagues have created a cell-by-cell simulation of a portion of the human visual neocortex comprising 1.6 billion virtual neurons and 9 trillion synapses, which is equivalent to a cat neocortex. It runs 100 times slower than real time on an IBM BlueGene/P supercomputer consisting of 147,456 processors. The work received the Gordon Bell Prize from the Association for Computing Machinery.

The purpose of a brain simulation project such as Blue Brain and Modha’s neocortex simulations is specifically to refine and confirm a functional model. AI at the human level will principally use the type of functional algorithmic model discussed in this book. However, molecular simulations will help us to perfect that model and to fully understand which details are important. In my development of speech recognition technology in the 1980s and 1990s, we were able to refine our algorithms once the actual transformations performed by the auditory nerve and early portions of the auditory cortex were understood. Even if our functional model was perfect, understanding exactly how it is actually implemented in our biological brains will reveal important knowledge about human function and dysfunction.

We will need detailed data on actual brains to create biologically based simulations. Markram’s team is collecting its own data. There are large-scale projects to gather this type of data and make it generally available to scientists. For example, Cold Spring Harbor Laboratory in New York has collected 500 terabytes of data by scanning a mammal brain (a mouse), which they made available in June 2012. Their project allows a user to explore a brain similarly to the way Google Earth allows one to explore the surface of the planet. You can move around the entire brain and zoom in to see individual neurons and their connections. You can highlight a single connection and then follow its path through the brain.

Sixteen sections of the National Institutes of Health have gotten together and sponsored a major initiative called the Human Connectome Project with $38.5 million of funding.7 Led by Washington University in St. Louis, the University of Minnesota, Harvard University, Massachusetts General Hospital, and the University of California at Los Angeles, the project seeks to create a similar three-dimensional map of connections in the human brain. The project is using a variety of noninvasive scanning technologies, including new forms of MRI, magnetoencephalography (measuring the magnetic fields produced by the electrical activity in the brain), and diffusion tractography (a method to trace the pathways of fiber bundles in the brain). As I point out in chapter 10, the spatial resolution of noninvasive scanning of the brain is improving at an exponential rate. The research by Van J. Wedeen and his colleagues at Massachusetts General Hospital showing a highly regular gridlike structure of the wiring of the neocortex that I described in chapter 4 is one early result from this project.

Oxford University computational neuroscientist Anders Sandberg (born in 1972) and Swedish philosopher Nick Bostrom (born in 1973) have written the comprehensive Whole Brain Emulation: A Roadmap, which details the requirements for simulating the human brain (and other types of brains) at different levels of specificity from high-level functional models to simulating molecules.8 The report does not provide a timeline, but it does describe the requirements to simulate different types of brains at varying levels of precision in terms of brain scanning, modeling, storage, and computation. The report projects ongoing exponential gains in all of these areas of capability and argues that the requirements to simulate the human brain at a high level of detail are coming into place.


An outline of the technological capabilities needed for whole brain emulation, in Whole Brain Emulation: A Roadmap by Anders Sandberg and Nick Bostrom.


An outline of Whole Brain Emulation: A Roadmap by Anders Sandberg and Nick Bostrom.

Neural Nets

In 1964, at the age of sixteen, I wrote to Frank Rosenblatt (1928–1971), a professor at Cornell University, inquiring about a machine called the Mark 1 Perceptron. He had created it four years earlier, and it was described as having brainlike properties. He invited me to visit him and try the machine out.

The Perceptron was built from what he claimed were electronic models of neurons. Input consisted of values arranged in two dimensions. For speech, one dimension represented frequency and the other time, so each value represented the intensity of a frequency at a given point in time. For images, each point was a pixel in a two-dimensional image. Each point of a given input was randomly connected to the inputs of the first layer of simulated neurons. Every connection had an associated synaptic strength, which represented its importance, and which was initially set at a random value. Each neuron added up the signals coming into it. If the combined signal exceeded a particular threshold, the neuron fired and sent a signal to its output connection; if the combined input signal did not exceed the threshold, the neuron did not fire, and its output was zero. The output of each neuron was randomly connected to the inputs of the neurons in the next layer. The Mark 1 Perceptron had three layers, which could be organized in a variety of configurations. For example, one layer might feed back to an earlier one. At the top layer, the output of one or more neurons, also randomly selected, provided the answer. (For an algorithmic description of neural nets, see this endnote.)9


Since the neural net wiring and synaptic weights are initially set randomly, the answers of an untrained neural net are also random. The key to a neural net, therefore, is that it must learn its subject matter, just like the mammalian brains on which it’s supposedly modeled. A neural net starts out ignorant; its teacher—which may be a human, a computer program, or perhaps another, more mature neural net that has already learned its lessons—rewards the student neural net when it generates the correct output and punishes it when it does not. This feedback is in turn used by the student neural net to adjust the strength of each interneuronal connection. Connections that are consistent with the correct answer are made stronger. Those that advocate a wrong answer are weakened.


Over time the neural net organizes itself to provide the correct answers without coaching. Experiments have shown that neural nets can learn their subject matter even with unreliable teachers. If the teacher is correct only 60 percent of the time, the student neural net will still learn its lessons with an accuracy approaching 100 percent.
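
To make this learning rule concrete, here is a minimal single-neuron sketch in Python of the kind of training the text describes: connections that support a correct answer are strengthened, and those that support a wrong answer are weakened. This is not Rosenblatt’s Mark 1 hardware or its exact procedure; the toy task (detecting whether the left column of a 2x2 “image” is dark), the learning rate, and all names are illustrative.

```python
import random

def train_perceptron(samples, epochs=50, rate=0.1):
    """samples: list of (inputs, target) pairs, where inputs is a list of 0/1
    'pixel' values and target is 0 or 1. Returns learned weights and bias."""
    n = len(samples[0][0])
    # Synaptic strengths start out at random values, as in the text.
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            total = sum(w * x for w, x in zip(weights, inputs)) + bias
            fired = 1 if total > 0 else 0       # the neuron fires only above threshold
            error = target - fired              # the "teacher's" reward/punish signal
            for i, x in enumerate(inputs):
                weights[i] += rate * error * x  # strengthen or weaken each connection
            bias += rate * error
    return weights, bias

# Toy task: fire when the left column of a 2x2 image is dark.
samples = [([1, 0, 1, 0], 1), ([1, 1, 1, 1], 1), ([0, 1, 0, 1], 0), ([0, 0, 0, 0], 0)]
print(train_perceptron(samples))
```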

However, limitations in the range of material that the Perceptron was capable of learning quickly became apparent. When I visited Professor Rosenblatt in 1964, I tried simple modifications to the input. The system was set up to recognize printed letters, and would recognize them quite accurately. It did a fairly good job of autoassociation (that is, it could recognize the letters even if I covered parts of them), but fared less well with invariance (that is, generalizing over size and font changes, which confused it).

During the last half of the 1960s, these neural nets became enormously popular, and the field of “connectionism” took over at least half of the artificial intelligence field. The more traditional approach to AI, meanwhile, included direct attempts to program solutions to specific problems, such as how to recognize the invariant properties of printed letters.

Another person I visited in 1964 was Marvin Minsky (born in 1927), one of the founders of the artificial intelligence field. Despite having done some pioneering work on neural nets himself in the 1950s, he was concerned about the great surge of interest in this technique. Part of the allure of neural nets was that they supposedly did not require programming—they would learn solutions to problems on their own. In 1965 I entered MIT as a student with Professor Minsky as my mentor, and I shared his skepticism about the craze for “connectionism.”

In 1969 Minsky and Seymour Papert (born in 1928), the two cofounders of the MIT Artificial Intelligence Laboratory, wrote a book called Perceptrons, which presented a single core theorem: specifically, that a Perceptron was inherently incapable of determining whether or not an image was connected. The book created a firestorm. Determining whether or not an image is connected is a task that humans can do very easily, and it is also a straightforward process to program a computer to make this discrimination. The fact that Perceptrons could not do so was considered by many to be a fatal flaw.


Two images from the cover of the book Perceptrons by Marvin Minsky and Seymour Papert. The top image is not connected (that is, the dark area consists of two disconnected parts). The bottom image is connected. A human can readily determine this, as can a simple software program. A feedforward Perceptron such as Frank Rosenblatt’s Mark 1 Perceptron cannot make this determination.


Perceptrons, however, was widely interpreted to imply more than it actually did. Minsky and Papert’s theorem applied only to a particular type of neural net called a feedforward neural net (a category that does include Rosenblatt’s Perceptron); other types of neural nets did not have this limitation. Still, the book did manage to largely kill most funding for neural net research during the 1970s. The field did return in the 1980s with attempts to use what were claimed to be more realistic models of biological neurons and ones that avoided the limitations implied by the Minsky-Papert Perceptron theorem. Nevertheless, the ability of the neocortex to solve the invariance problem, a key to its strength, was a skill that remained elusive for the resurgent connectionist field.

Sparse Coding: Vector Quantization

In the early 1980s I started a project devoted to another classical pattern recognition problem: understanding human speech. At first, we used traditional AI approaches by directly programming expert knowledge about the fundamental units of speech—phonemes—and rules from linguists on how people string phonemes together to form words and phrases. Each phoneme has distinctive frequency patterns. For example, we knew that vowels such as “e” and “ah” are characterized by certain resonant frequencies called formants, with a characteristic ratio of formants for each phoneme. Sibilant sounds such as “z” and “s” are characterized by a burst of noise that spans many frequencies.

We captured speech as a waveform, which we then converted into multiple frequency bands (perceived as pitches) using a bank of frequency filters. The result of this transformation could be visualized and was called a spectrogram (see page 136).

The filter bank mimics what the human cochlea does, which is the initial step in our biological processing of sound. The software first identified phonemes based on their distinguishing patterns of frequencies and then identified words based on characteristic sequences of phonemes.
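
As a rough illustration of this front end, the sketch below computes a crude spectrogram with NumPy by slicing the waveform into short frames and summing the energy in a handful of frequency bands. A real filter bank (let alone the cochlea) is considerably more sophisticated; the frame length and band count here are arbitrary choices for the example.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25, num_bands=16):
    """Return a (num_frames x num_bands) array of per-band energies,
    a crude software stand-in for a bank of frequency filters."""
    frame_len = int(sample_rate * frame_ms / 1000)
    num_frames = len(signal) // frame_len
    bands = []
    for i in range(num_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2   # energy at each FFT frequency
        # Pool the FFT bins into num_bands equal-width bands.
        edges = np.linspace(0, len(spectrum), num_bands + 1, dtype=int)
        bands.append([spectrum[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
    return np.array(bands)

# Example: a 440 Hz tone sampled at 16 kHz lights up one low band in every frame.
t = np.arange(16000) / 16000.0
print(spectrogram(np.sin(2 * np.pi * 440 * t), 16000).shape)
```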


A spectrogram of three vowels. From left to right: [i] as in “appreciate,” [u] as in “acoustic,” and [a] as in “ah.” The Y axis represents frequency of sound. The darker the band the more acoustic energy there is at that frequency.



A spectrogram of a person saying the word “hide.” The horizontal lines show the formants, which are sustained frequencies that have especially high energy.10


The result was partially successful. We could train our device to learn the patterns for a particular person using a moderate-sized vocabulary, measured in thousands of words. When we attempted to recognize tens of thousands of words, handle multiple speakers, and allow fully continuous speech (that is, speech with no pauses between words), we ran into the invariance problem. Different people enunciated the same phoneme differently—for example, one person’s “e” phoneme may sound like someone else’s “ah.” Even the same person was inconsistent in the way she spoke a particular phoneme. The pattern of a phoneme was often affected by other phonemes nearby. Many phonemes were left out completely. The pronunciation of words (that is, how phonemes are strung together to form words) was also highly variable and dependent on context. The linguistic rules we had programmed were breaking down and could not keep up with the extreme variability of spoken language.

It became clear to me at the time that the essence of human pattern and conceptual recognition was based on hierarchies. This is certainly apparent for human language, which constitutes an elaborate hierarchy of structures. But what is the element at the base of the structures? That was the first question I considered as I looked for ways to automatically recognize fully normal human speech.

Sound enters the ear as a vibration of the air and is converted by the approximately 3,000 inner hair cells in the cochlea into multiple frequency bands. Each hair cell is tuned to a particular frequency (note that we perceive frequencies as tones) and each acts as a frequency filter, emitting a signal whenever there is sound at or near its resonant frequency. As it leaves the human cochlea, sound is thereby represented by approximately 3,000 separate signals, each one signifying the time-varying intensity of a narrow band of frequencies (with substantial overlap among these bands).

Even though it was apparent that the brain was massively parallel, it seemed impossible to me that it was doing pattern matching on 3,000 separate auditory signals. I doubted that evolution could have been that inefficient. We now know that very substantial data reduction does indeed take place in the auditory nerve before sound signals ever reach the neocortex.

In our software-based speech recognizers, we also used filters implemented as software—sixteen to be exact (which we later increased to thirty-two, as we found there was not much benefit to going much higher than this). So in our system, each point in time was represented by sixteen numbers. We needed to reduce these sixteen streams of data into one while at the same time emphasizing the features that are significant in recognizing speech.

We used a mathematically optimal technique to accomplish this, called vector quantization. Consider that at any particular point in time, sound (at least from one ear) was represented by our software by sixteen different numbers: that is, the output of the sixteen frequency filters. (In the human auditory system the figure would be 3,000, representing the output of the 3,000 cochlea inner hair cells.) In mathematical terminology, each such set of numbers (whether 3,000 in the biological case or 16 in our software implementation) is called a vector.

For simplicity, let’s consider the process of vector quantization with vectors of two numbers. Each vector can be considered a point in two-dimensional space.

If we have a very large sample of such vectors and plot them, we are likely to notice clusters forming.


In order to identify the clusters, we need to decide how many we will allow. In our project we generally allowed 1,024 clusters so that we could number them and assign each cluster a 10-bit label (because 2^10 = 1,024). Our sample of vectors represents the diversity that we expect. We tentatively assign the first 1,024 vectors to be one-point clusters. We then consider the 1,025th vector and find the point that it is closest to. If that distance is greater than the smallest distance between any pair of the 1,024 points, we consider it as the beginning of a new cluster. We then collapse the two (one-point) clusters that are closest together into a single cluster. We are thus still left with 1,024 clusters. After processing the 1,025th vector, one of those clusters now has more than one point. We keep processing points in this way, always maintaining 1,024 clusters. After we have processed all the points, we represent each multipoint cluster by the geometric center of the points in that cluster.


We continue this iterative process until we have run through all the sample points. Typically we would process millions of points into 1,024 (2^10) clusters; we’ve also used 2,048 (2^11) or 4,096 (2^12) clusters. Each cluster is represented by one vector that is at the geometric center of all the points in that cluster. Thus the total of the distances of all the points in the cluster to the center point of the cluster is as small as possible.


The result of this technique is that instead of having the millions of points that we started with (and an even larger number of possible points), we have now reduced the data to just 1,024 points that use the space of possibilities optimally. Parts of the space that are never used are not assigned any clusters.

We then assign a number to each cluster (in our case, 0 to 1,023). That number is the reduced, “quantized” representation of that cluster, which is why the technique is called vector quantization. Any new input vector that arrives in the future is then represented by the number of the cluster whose center point is closest to this new input vector.

We can now precompute a table with the distance of the center point of every cluster to every other center point. We thereby have instantly available the distance of this new input vector (which we represent by this quantized point—in other words, by the number of the cluster that this new point is closest to) to every other cluster. Since we are only representing points by their closest cluster, we now know the distance of this point to any other possible point that might come along.
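
Here is a minimal Python sketch of vector quantization. Rather than the incremental merging procedure described above, it uses a basic k-means-style iteration, a common way to arrive at the same kind of codebook: find the cluster centers, quantize any new vector to the number of its nearest center, and precompute the center-to-center distance table. A cluster count of 16 and two-dimensional vectors are used only for readability; the text uses 1,024 clusters and sixteen dimensions.

```python
import random
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_codebook(vectors, k=16, iterations=20):
    """Find k cluster centers (the 'codebook') with a basic k-means loop."""
    centers = random.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: distance(v, centers[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # move each center to the geometric center of its points
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def quantize(vector, centers):
    """Represent a new input vector by the number of its closest cluster."""
    return min(range(len(centers)), key=lambda i: distance(vector, centers[i]))

def distance_table(centers):
    """Precomputed distances between every pair of cluster centers."""
    return [[distance(a, b) for b in centers] for a in centers]

# Toy example with two-dimensional vectors, as in the text's illustration.
data = [[random.gauss(cx, 0.3), random.gauss(cy, 0.3)]
        for cx, cy in [(0, 0), (3, 0), (0, 3), (3, 3)] for _ in range(200)]
centers = build_codebook(data, k=16)
print(quantize([2.9, 0.1], centers))
```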

I described the technique above using vectors with only two numbers each, but working with sixteen-element vectors is entirely analogous to the simpler example. Because we chose vectors with sixteen numbers representing sixteen different frequency bands, each point in our system was a point in sixteen-dimensional space. It is difficult for us to imagine a space with more than three dimensions (perhaps four, if we include time), but mathematics has no such inhibitions.

We have accomplished four things with this process. First, we have greatly reduced the complexity of the data. Second, we have reduced sixteen-dimensional data to one-dimensional data (that is, each sample is now a single number). Third, we have improved our ability to find invariant features, because we are emphasizing portions of the space of possible sounds that convey the most information. Most combinations of frequencies are physically impossible or at least very unlikely, so there is no reason to give equal space to unlikely combinations of inputs as to likely ones. This technique reduces the data to equally likely possibilities. The fourth benefit is that we can use one-dimensional pattern recognizers, even though the original data consisted of many more dimensions. This turned out to be the most efficient approach to utilizing available computational resources.

Reading Your Mind with Hidden Markov Models

With vector quantization, we simplified the data in a way that emphasized key features, but we still needed a way to represent the hierarchy of invariant features that would make sense of new information. Having worked in the field of pattern recognition at that time (the early 1980s) for twenty years, I knew that one-dimensional representations were far more powerful, efficient, and amenable to invariant results. There was not a lot known about the neocortex in the early 1980s, but based on my experience with a variety of pattern recognition problems, I assumed that the brain was also likely to be reducing its multidimensional data (whether from the eyes, the ears, or the skin) using a one-dimensional representation, especially as concepts rose in the neocortex’s hierarchy.

For the speech recognition problem, the organization of information in the speech signal appeared to be a hierarchy of patterns, with each pattern represented by a linear string of elements with a forward direction. Each element of a pattern could be another pattern at a lower level, or a fundamental unit of input (which in the case of speech recognition would be our quantized vectors).

You will recognize this situation as consistent with the model of the neocortex that I presented earlier. Human speech, therefore, is produced by a hierarchy of linear patterns in the brain. If we could simply examine these patterns in the brain of the person speaking, it would be a simple matter to match her new speech utterances against her brain patterns and understand what the person was saying. Unfortunately we do not have direct access to the brain of the speaker—the only information we have is what she actually said. Of course, that is the whole point of spoken language—the speaker is sharing a piece of her mind with her utterance.

So I wondered: Was there a mathematical technique that would enable us to infer the patterns in the speaker’s brain based on her spoken words? One utterance would obviously not be sufficient, but if we had a large number of samples, could we use that information to essentially read the patterns in the speaker’s neocortex (or at least formulate something mathematically equivalent that would enable us to recognize new utterances)?

People often fail to appreciate how powerful mathematics can be—keep in mind that our ability to search much of human knowledge in a fraction of a second with search engines is based on a mathematical technique. For the speech recognition problem I was facing in the early 1980s, it turned out that the technique of hidden Markov models fit the bill rather perfectly. The Russian mathematician Andrei Andreyevich Markov (1856–1922) built a mathematical theory of hierarchical sequences of states. The model was based on the possibility of traversing the states in one chain, and if that was successful, triggering a state in the next higher level in the hierarchy. Sound familiar?


A simple example of one layer of a hidden Markov model. S1 through S4 represent the “hidden” internal states. The P(i,j) transitions each represent the probability of going from state Si to state Sj. These probabilities are determined by the system learning from training data (including during actual use). A new sequence (such as a new spoken utterance) is matched against these probabilities to determine the likelihood that this model produced the sequence.


Markov’s model included probabilities of each state’s successfully occurring. He went on to hypothesize a situation in which a system has such a hierarchy of linear sequences of states, but those are unable to be directly examined—hence the name hidden Markov models. The lowest level of the hierarchy emits signals, which are all we are allowed to see. Markov provided a mathematical technique to compute what the probabilities of each transition must be, based on the observed output. The method was subsequently refined by Norbert Wiener in 1923. Wiener’s refinement also provided a way to determine the connections in the Markov model; essentially any connection with too low a probability was considered not to exist. This is essentially how the human neocortex trims connections—if they are rarely or never used, they are considered unlikely and are pruned away. In our case, the observed output is the speech signal created by the person talking, and the state probabilities and connections of the Markov model constitute the neocortical hierarchy that produced it.
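
One building block of this machinery is computing how likely it is that a given hidden Markov model produced an observed sequence, which is what a new utterance is matched against (as in the figure caption above). The standard forward algorithm does this; below is a small self-contained sketch in Python. The two states, transition matrix, and emission probabilities are invented toy values, far simpler than a real speech model, and estimating those probabilities from data (as described above) would be a separate training step.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Probability that this hidden Markov model produced the observation
    sequence, summed over all hidden state paths (the forward algorithm)."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[prev] * trans_p[prev][s] for prev in states) * emit_p[s][obs]
                 for s in states}
    return sum(alpha.values())

# A toy two-state model emitting quantized acoustic labels "lo" and "hi".
states = ["S1", "S2"]
start_p = {"S1": 0.8, "S2": 0.2}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.1, "S2": 0.9}}
emit_p = {"S1": {"lo": 0.7, "hi": 0.3}, "S2": {"lo": 0.2, "hi": 0.8}}

print(forward_likelihood(["lo", "lo", "hi", "hi"], states, start_p, trans_p, emit_p))
```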

I envisioned a system in which we would take samples of human speech, apply the hidden Markov model technique to infer a hierarchy of states with connections and probabilities (essentially a simulated neocortex for producing speech), and then use this inferred hierarchical network of states to recognize new utterances. To create a speaker-independent system, we would use samples from many different individuals to train the hidden Markov models. By adding in the element of hierarchies to represent the hierarchical nature of information in language, these were properly called hierarchical hidden Markov models (HHMMs).

My colleagues at Kurzweil Applied Intelligence were skeptical that this technique would work, given that it was a self-organizing method reminiscent of neural nets, which had fallen out of favor and with which we had had little success. I pointed out that the network in a neural net system is fixed and does not adapt to the input: The weights adapt, but the connections do not. In the Markov model system, if it was set up correctly, the system would prune unused connections so as to essentially adapt the topology.

I established what was considered a “skunk works” project (an organizational term for a project off the beaten path that has little in the way of formal resources) that consisted of me, one part-time programmer, and an electrical engineer (to create the frequency filter bank). To the surprise of my colleagues, our effort turned out to be very successful, recognizing speech drawn from a large vocabulary with high accuracy.

After that experiment, all of our subsequent speech recognition efforts have been based on hierarchical hidden Markov models. Other speech recognition companies appeared to discover the value of this method independently, and since the mid-1980s most work in automated speech recognition has been based on this approach. Hidden Markov models are also used in speech synthesis—keep in mind that our biological cortical hierarchy is used not only to recognize input but also to produce output, for example, speech and physical movement.

HHMMs are also used in systems that understand the meaning of natural-language sentences, which represents going up the conceptual hierarchy.


Hidden Markov states and possible transitions to produce a sequence of words in natural-language text.


To understand how the HHMM method works, we start out with a network that consists of all the state transitions that are possible. The vector quantization method described above is critical here, because otherwise there would be too many possibilities to consider.

Here is a possible simplified initial topology:


A simple hidden Markov model topology to recognize two spoken words.


Sample utterances are processed one by one. For each, we iteratively modify the probabilities of the transitions to better reflect the input sample we have just processed. The Markov models used in speech recognition code the likelihood that specific patterns of sound are found in each phoneme, how the phonemes influence one another, and the likely orders of phonemes. The system can also include probability networks on higher levels of language structure, such as the order of words, the inclusion of phrases, and so on up the hierarchy of language.

Whereas our previous speech recognition systems incorporated specific rules about phoneme structures and sequences explicitly coded by human linguists, the new HHMM-based system was not explicitly told that there are forty-four phonemes in English, the sequences of vectors that were likely for each phoneme, or what phoneme sequences were more likely than others. We let the system discover these “rules” for itself from thousands of hours of transcribed human speech data. The advantage of this approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts.

Once the network was trained, we began to attempt to recognize speech by considering the alternate paths through the network and picking the path that was most likely, given the actual sequence of input vectors we had seen. In other words, if we saw a sequence of states that was likely to have produced that utterance, we concluded that the utterance came from that cortical sequence. This simulated HHMM-based neocortex included word labels, so it was able to propose a transcription of what it heard.
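
Choosing “the path that was most likely, given the actual sequence of input vectors” is the job of the standard Viterbi algorithm. Here is a minimal sketch, reusing the same toy two-state model as the forward-algorithm example above; in a real recognizer the states would carry phoneme and word labels, which is what turns the best path into a proposed transcription.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable sequence of hidden states for the observations."""
    # best[s] = (probability, path) of the best path so far ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1] + [s])
                 for prev in states),
                key=lambda item: item[0])
            new_best[s] = (prob, path)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Reusing the toy model from the forward-algorithm sketch:
states = ["S1", "S2"]
start_p = {"S1": 0.8, "S2": 0.2}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.1, "S2": 0.9}}
emit_p = {"S1": {"lo": 0.7, "hi": 0.3}, "S2": {"lo": 0.2, "hi": 0.8}}
print(viterbi(["lo", "lo", "hi", "hi"], states, start_p, trans_p, emit_p))
```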

We were then able to improve our results further by continuing to train the network while we were using it for recognition. As we have discussed, simultaneous recognition and learning also take place at every level in our biological neocortical hierarchy.

Evolutionary (Genetic) Algorithms

There is another important consideration: How do we set the many parameters that control a pattern recognition system’s functioning? These could include the number of vectors that we allow in the vector quantization step, the initial topology of hierarchical states (before the training phase of the hidden Markov model process prunes them back), the recognition threshold at each level of the hierarchy, the parameters that control the handling of the size parameters, and many others. We can establish these based on our intuition, but the results will be far from optimal.

We call these parameters “God parameters” because they are set prior to the self-organizing method of determining the topology of the hidden Markov models (or, in the biological case, before the person learns her lessons by similarly creating connections in her cortical hierarchy). This is perhaps a misnomer, given that these initial DNA-based design details are determined by biological evolution, though some may see the hand of God in that process (and while I do consider evolution to be a spiritual process, this discussion properly belongs in chapter 9).

When it came to setting these “God parameters” in our simulated hierarchical learning and recognizing system, we again took a cue from nature and decided to evolve them—in our case, using a simulation of evolution. We used what are called genetic or evolutionary algorithms (GAs), which include simulated sexual reproduction and mutations.

Here is a simplified description of how this method works. First, we determine a way to code possible solutions to a given problem. If the problem is optimizing the design parameters for a circuit, then we define a list of all of the parameters (with a specific number of bits assigned to each parameter) that characterize the circuit. This list is regarded as the genetic code in the genetic algorithm. Then we randomly generate thousands or more genetic codes. Each such genetic code (which represents one set of design parameters) is considered a simulated “solution” organism.

Now we evaluate each simulated organism in a simulated environment by using a defined method to assess each set of parameters. This evaluation is a key to the success of a genetic algorithm. In our example, we would simulate each circuit defined by its parameters and judge it on appropriate criteria (does it perform the required function, how fast is it, and so on). The best-solution organisms (the best designs) are allowed to survive, and the rest are eliminated.

Now we cause each of the survivors to multiply themselves until they reach the same number of solution creatures. This is done by simulating sexual reproduction: In other words, we create new offspring where each new creature draws one part of its genetic code from one parent and another part from a second parent. Usually no distinction is made between male or female organisms; it’s sufficient to generate an offspring from any two arbitrary parents, so we’re basically talking about same-sex marriage here. This is perhaps not as interesting as sexual reproduction in the natural world, but the relevant point here is having two parents. As these simulated organisms multiply, we allow some mutation (random change) in the chromosomes to occur.

We’ve now defined one generation of simulated evolution; now we repeat these steps for each subsequent generation. At the end of each generation we determine how much the designs have improved (that is, we compute the average improvement in the evaluation function over all the surviving organisms). When the degree of improvement in the evaluation of the design creatures from one generation to the next becomes very small, we stop this iterative cycle and use the best design(s) in the last generation. (For an algorithmic description of genetic algorithms, see this endnote.)11
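
To make the loop concrete, here is a minimal genetic algorithm sketch in Python. It evolves a short list of numeric parameters toward a hypothetical target; the fitness function, population size, crossover point, and mutation rate are all illustrative placeholders for whatever evaluation a real problem (such as tuning a recognizer’s parameters) would require.

```python
import random

TARGET = [0.3, 0.7, 0.1, 0.9]        # stand-in for an ideal set of design parameters

def fitness(genes):
    """Higher is better: negative squared error against the hypothetical target."""
    return -sum((g - t) ** 2 for g, t in zip(genes, TARGET))

def evolve(pop_size=100, generations=200, mutation_rate=0.1):
    population = [[random.random() for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate every simulated organism and keep the better half.
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        # Refill the population by "sexual reproduction": each child takes
        # part of its genetic code from each of two arbitrary parents.
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = random.sample(survivors, 2)
            cut = random.randrange(1, len(TARGET))
            child = mom[:cut] + dad[cut:]
            if random.random() < mutation_rate:      # occasional random mutation
                child[random.randrange(len(child))] = random.random()
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

print(evolve())
```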

The key to a genetic algorithm is that the human designers don’t directly program a solution; rather, we let one emerge through an iterative process of simulated competition and improvement. Biological evolution is smart but slow, so to enhance its intelligence we greatly speed up its ponderous pace. The computer is fast enough to simulate many generations in a matter of hours or days, and we’ve occasionally had them run for as long as weeks to simulate hundreds of thousands of generations. But we have to go through this iterative process only once; as soon as we have let this simulated evolution run its course, we can apply the evolved and highly refined rules to real problems in a rapid fashion. In the case of our speech recognition systems, we used them to evolve the initial topology of the network and other critical parameters. We thus used two self-organizing methods: a GA to simulate the biological evolution that gave rise to a particular cortical design, and HHMMs to simulate the cortical organization that accompanies human learning.

Another major requirement for the success of a GA is a valid method of evaluating each possible solution. This evaluation needs to be conducted quickly, because it must take account of many thousands of possible solutions for each generation of simulated evolution. GAs are adept at handling problems with too many variables for which to compute precise analytic solutions. The design of an engine, for example, may involve more than a hundred variables and requires satisfying dozens of constraints; GAs used by researchers at General Electric were able to come up with jet engine designs that met the constraints more precisely than conventional methods.

When using GAs you must, however, be careful what you ask for. A genetic algorithm was used to solve a block-stacking problem, and it came up with a perfect solution…except that it had thousands of steps. The human programmers forgot to include minimizing the number of steps in their evaluation function.

Scott Draves’s Electric Sheep project is a GA that produces art. The evaluation function uses human evaluators in an open-source collaboration involving many thousands of people. The art moves through time, and you can view it at electricsheep.org.

For speech recognition, the combination of genetic algorithms and hidden Markov models worked extremely well. Simulating evolution with a GA was able to substantially improve the performance of the HHMM networks. What evolution came up with was far superior to our original design, which was based on our intuition.

We then experimented with introducing a series of small variations in the overall system. For example, we would make perturbations (minor random changes) to the input. Another such change was to have adjacent Markov models “leak” into one another by causing the results of one Markov model to influence models that are “nearby.” Although we did not realize it at the time, the sorts of adjustments we were experimenting with are very similar to the types of modifications that occur in biological cortical structures.

At first, such changes hurt performance (as measured by accuracy of recognition). But if we reran evolution (that is, reran the GA) with these alterations in place, it would adapt the system accordingly, optimizing it for these introduced modifications. In general, this would restore performance. If we then removed the changes we had introduced, performance would be again degraded, because the system had been evolved to compensate for the changes. The adapted system became dependent on the changes.

One type of alteration that actually helped performance (after rerunning the GA) was to introduce small random changes to the input. The reason for this is the well-known “overfitting” problem in self-organizing systems. There is a danger that such a system will overfit to the specific examples contained in the training sample. By making random adjustments to the input, the more invariant patterns in the data survive, and the system thereby learns these deeper patterns. This helped only if we reran the GA with the randomization feature on.
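
As a minimal sketch of this kind of input randomization, the function below adds a small random perturbation to each element of a training vector before it is used, nudging the learner toward the invariant structure of the data rather than the accidents of any one sample. The noise level shown is an arbitrary illustrative value.

```python
import random

def jitter(vector, noise=0.05):
    """Return a copy of the input vector with a small random perturbation
    added to each element, to discourage overfitting during training."""
    return [x + random.gauss(0, noise) for x in vector]

training_vector = [0.2, 0.8, 0.1, 0.4]   # stand-in for a frame of band energies
print(jitter(training_vector))
```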

This introduces a dilemma in our understanding of our biological cortical circuits. It had been noticed, for example, that there might indeed be a small amount of leakage from one cortical connection to another, resulting from the way that biological connections are formed: The electrochemistry of the axons and dendrites is apparently subject to the electromagnetic effects of nearby connections. Suppose we were able to run an experiment where we removed this effect in an actual brain. That would be difficult to actually carry out, but not necessarily impossible. Suppose we conducted such an experiment and found that the cortical circuits worked less effectively without this neural leakage. We might then conclude that this phenomenon was a very clever design by evolution and was critical to the cortex’s achieving its level of performance. We might further point out that such a result shows that the orderly model of the flow of patterns up the conceptual hierarchy and the flow of predictions down the hierarchy was in fact much more complicated because of this intricate influence of connections on one another.

But that would not necessarily be an accurate conclusion. Consider our experience with a simulated cortex based on HHMMs, in which we implemented a modification very similar to interneuronal cross talk. If we then ran evolution with that phenomenon in place, performance would be restored (because the evolutionary process adapted to it). If we then removed the cross talk, performance would be compromised again. In the biological case, evolution (that is, biological evolution) was indeed “run” with this phenomenon in place. The detailed parameters of the system have thereby been set by biological evolution to be dependent on these factors, so that changing them will negatively affect performance unless we run evolution again. Doing so is feasible in the simulated world, where evolution only takes days or weeks, but in the biological world it would require tens of thousands of years.

So how can we tell whether a particular design feature of the biological neocortex is a vital innovation introduced by biological evolution—that is, one that is instrumental to our level of intelligence—or merely an artifact that the design of the system is now dependent on but could have evolved without? We can answer that question simply by running simulated evolution with and without these particular variations to the details of the design (for example, with and without connection cross talk). We can even do so with biological evolution if we’re examining the evolution of a colony of microorganisms where generations are measured in hours, but it is not practical for complex organisms such as humans. This is another one of the many disadvantages of biology.

Getting back to our work in speech recognition, we found that if we ran evolution (that is, a GA) separately on the initial design of (1) the hierarchical hidden Markov models that were modeling the internal structure of phonemes and (2) the HHMMs’ modeling of the structures of words and phrases, we got even better results. Both levels of the system were using HHMMs, but the GA would evolve design variations between these different levels. This approach still allowed the modeling of phenomena that occur in between the two levels, such as the smearing of phonemes that often happens when we string certain words together (for example, “How are you all doing?” might become “How’re y’all doing?”).

It is likely that a similar phenomenon took place in different biological cortical regions, in that they have evolved small differences based on the types of patterns they deal with. Whereas all of these regions use the same essential neocortical algorithm, biological evolution has had enough time to fine-tune the design of each of them to be optimal for their particular patterns. However, as I discussed earlier, neuroscientists and neurologists have noticed substantial plasticity in these areas, which supports the idea of a general neocortical algorithm. If the fundamental methods in each region were radically different, then such interchangeability among cortical regions would not be possible.

The systems we created in our research using this combination of self-organizing methods were very successful. In speech recognition, they were able for the first time to handle fully continuous speech and relatively unrestricted vocabularies. We were able to achieve a high accuracy rate on a wide variety of speakers, accents, and dialects. The current state of the art as this book is being written is represented by a product called Dragon NaturallySpeaking (Version 11.5) for the PC from Nuance (formerly Kurzweil Computer Products). I suggest that people try it if they are skeptical about the performance of contemporary speech recognition—accuracies are often 99 percent or higher after a few minutes of training on your voice, even with continuous speech and relatively unrestricted vocabularies. Dragon Dictation is a simpler but still impressive free app for the iPhone that requires no voice training. Siri, the personal assistant on contemporary Apple iPhones, uses the same speech recognition technology with extensions to handle natural-language understanding.

The performance of these systems is a testament to the power of mathematics. With them we are essentially computing what is going on in the neocortex of a speaker—even though we have no direct access to that person’s brain—as a vital step in recognizing what the person is saying and, in the case of systems like Siri, what those utterances mean. We might wonder, if we were to actually look inside the speaker’s neocortex, would we see connections and weights corresponding to the hierarchical hidden Markov models computed by the software? Almost certainly we would not find a precise match; the neuronal structures would invariably differ in many details compared with the models in the computer. However, I would maintain that there must be an essential mathematical equivalence to a high degree of precision between the actual biology and our attempt to emulate it; otherwise these systems would not work as well as they do.

LISP

LISP (LISt Processor) is a computer language, originally specified by AI pioneer John McCarthy (1927–2011) in 1958. As its name suggests, LISP deals with lists. Each LISP statement is a list of elements; each element is either another list or an “atom,” which is an irreducible item constituting either a number or a symbol. A list included in a list can be the list itself; hence LISP is capable of recursion. LISP statements can also be recursive indirectly: a list can include a list that includes yet another list, and so on, until one of them refers back to the original list. Because lists can include lists, LISP is also capable of hierarchical processing. A list can be a conditional such that it only “fires” if its elements are satisfied. In this way, hierarchies of such conditionals can be used to identify increasingly abstract qualities of a pattern.

LISP became the rage in the artificial intelligence community in the 1970s and early 1980s. The conceit of the LISP enthusiasts of the earlier decade was that the language mirrored the way the human brain worked—that any intelligent process could most easily and efficiently be coded in LISP. There followed a mini-boomlet in “artificial intelligence” companies that offered LISP interpreters and related LISP products, but when it became apparent in the mid-1980s that LISP itself was not a shortcut to creating intelligent processes, the investment balloon collapsed.

It turns out that the LISP enthusiasts were not entirely wrong. Essentially, each pattern recognizer in the neocortex can be regarded as a LISP statement—each one constitutes a list of elements, and each element can be another list. The neocortex is therefore indeed engaged in list processing of a symbolic nature very similar to that which takes place in a LISP program. Moreover, it processes all 300 million LISP-like “statements” simultaneously.

However, there were two important features missing from the world of LISP, one of which was learning. LISP programs had to be coded line by line by human programmers. There were attempts to automatically code LISP programs using a variety of methods, but these were not an integral part of the language’s concept. The neocortex, in contrast, programs itself, filling its “statements” (that is, the lists) with meaningful and actionable information from its own experience and from its own feedback loops. This is a key principle of how the neocortex works: Each one of its pattern recognizers (that is, each LISP-like statement) is capable of filling in its own list and connecting itself both up and down to other lists. The second missing feature is the size parameters: the expected magnitude of each input and the expected variability of that magnitude. One could create a variant of LISP (coded in LISP) that would allow for handling such parameters, but they are not part of the basic language.

LISP is consistent with the original philosophy of the AI field, which was to find intelligent solutions to problems and to code them directly in computer languages. The first attempt at a self-organizing method that would teach itself from experience—neural nets—was not successful because it did not provide a means to modify the topology of the system in response to learning. The hierarchical hidden Markov model effectively provided that through its pruning mechanism. Today, the HHMM together with its mathematical cousins makes up a major portion of the world of AI.

The observation of the similarity between LISP and the list structure of the neocortex bears on an argument made by those who insist that the brain is too complicated for us to understand. These critics point out that the brain has trillions of connections, and since each one must be there specifically by design, they constitute the equivalent of trillions of lines of code. As we’ve seen, I’ve estimated that there are on the order of 300 million pattern processors in the neocortex—or 300 million lists where each element in the list is pointing to another list (or, at the lowest conceptual level, to a basic irreducible pattern from outside the neocortex). But 300 million is still a reasonably big number of LISP statements and indeed is larger than any human-written program in existence.

However, we need to keep in mind that these lists are not actually specified in the initial design of the nervous system. The brain creates these lists itself and connects the levels automatically from its own experiences. This is the key secret of the neocortex. The processes that accomplish this self-organization are much simpler than the 300 million statements that constitute the capacity of the neocortex. Those processes are specified in the genome. As I will demonstrate in chapter 11, the amount of unique information in the genome (after lossless compression) as applied to the brain is about 25 million bytes, which is equivalent to less than a million lines of code. The actual algorithmic complexity is even less than that, as most of the 25 million bytes of genetic information pertain to the biological needs of the neurons, and not specifically to their information-processing capability. However, even 25 million bytes of design information is a level of complexity we can handle.

Hierarchical Memory Systems

As I discussed in chapter 3, Jeff Hawkins and Dileep George in 2003 and 2004 developed a model of the neocortex incorporating hierarchical lists that was described in Hawkins and Blakeslee’s 2004 book On Intelligence. A more up-to-date and very elegant presentation of the hierarchical temporal memory method can be found in Dileep George’s 2008 doctoral dissertation.12 Numenta has implemented it in a system called NuPIC (Numenta Platform for Intelligent Computing) and has developed pattern recognition and intelligent data-mining systems for such clients as Forbes and Power Analytics Corporation. After working at Numenta, George started a new company called Vicarious Systems with funding from the Founders Fund (managed by Peter Thiel, the venture capitalist behind Facebook, and Sean Parker, the first president of Facebook) and from Good Ventures, led by Dustin Moskovitz, cofounder of Facebook. George reports significant progress in automatically modeling, learning, and recognizing information with a substantial number of hierarchies. He calls his system a “recursive cortical network” and plans applications for medical imaging and robotics, among other fields.

The technique of hierarchical hidden Markov models is mathematically very similar to these hierarchical memory systems, especially if we allow the HHMM system to organize its own connections between pattern recognition modules. As mentioned earlier, HHMMs provide for an additional important element: modeling the expected distribution of the magnitude (on some continuum) of each input in computing the probability that the pattern under consideration exists.

I have recently started a new company called Patterns, Inc., which intends to develop hierarchical self-organizing neocortical models that utilize HHMMs and related techniques for the purpose of understanding natural language. An important emphasis will be on the ability of the system to design its own hierarchies in a manner similar to a biological neocortex. Our envisioned system will continually read a wide range of material such as Wikipedia and other knowledge resources, as well as listen to everything you say and watch everything you write (if you let it). The goal is for it to become a helpful friend answering your questions—before you even formulate them—and giving you useful information and tips as you go through your day.

The Moving Frontier of AI: Climbing the Competence Hierarchy

A long tiresome speech delivered by a frothy pie topping.

A garment worn by a child, perhaps aboard an operatic ship.

Wanted for a twelve-year crime spree of eating King Hrothgar’s warriors; officer Beowulf has been assigned the case.

It can mean to develop gradually in the mind or to carry during pregnancy.

National Teacher Day and Kentucky Derby Day.

Wordsworth said they soar but never roam.

Four-letter word for the iron fitting on the hoof of a horse or a card-dealing box in a casino.

In act three of an 1846 Verdi opera, this Scourge of God is stabbed to death by his lover, Odabella.

—Examples of Jeopardy! queries, all of which Watson got correct. Answers are: meringue harangue, pinafore, Grendel, gestate, May, skylark, shoe. For the eighth query, Watson replied, “What is Attila?” The host responded by saying, “Be more specific?” Watson clarified with, “What is Attila the Hun?,” which is correct.

The computer’s techniques for unraveling Jeopardy! clues sounded just like mine. That machine zeroes in on key words in a clue, then combs its memory (in Watson’s case, a 15-terabyte data bank of human knowledge) for clusters of associations with these words. It rigorously checks the top hits against all the contextual information it can muster: the category name; the kind of answer being sought; the time, place, and gender hinted at in the clue; and so on. And when it feels “sure” enough, it decides to buzz. This is all an instant, intuitive process for a human Jeopardy! player, but I felt convinced that under the hood my brain was doing more or less the same thing.

—Ken Jennings, human Jeopardy! champion who lost to Watson

I, for one, welcome our new robot overlords.

Ken Jennings, paraphrasing The Simpsons, after losing to Watson

Oh my god. [Watson] is more intelligent than the average Jeopardy! player in answering Jeopardy! questions. That’s impressively intelligent.

Sebastian Thrun, former director of the Stanford AI Lab

Watson understands nothing. It’s a bigger steamroller.

Noam Chomsky

Artificial intelligence is all around us—we no longer have our hand on the plug. The simple act of connecting with someone via a text message, e-mail, or cell phone call uses intelligent algorithms to route the information. Almost every product we touch is originally designed in a collaboration between human and artificial intelligence and then built in automated factories. If all the AI systems decided to go on strike tomorrow, our civilization would be crippled: We couldn’t get money from our bank, and indeed, our money would disappear; communication, transportation, and manufacturing would all grind to a halt. Fortunately, our intelligent machines are not yet intelligent enough to organize such a conspiracy.

What is new in AI today is the viscerally impressive nature of publicly available examples. For example, consider Google’s self-driving cars (which as of this writing have gone over 200,000 miles in cities and towns), a technology that will lead to significantly fewer crashes, increased road capacity, relief from the chore of driving, and many other benefits. Driverless cars are already legal to operate on public roads in Nevada with some restrictions, although widespread usage by the public throughout the world is not expected until late in this decade. Technology that intelligently watches the road and warns the driver of impending dangers is already being installed in cars. One such technology is based in part on the successful model of visual processing in the brain created by MIT’s Tomaso Poggio. Called Mobileye, it was developed by Amnon Shashua, a former postdoctoral student of Poggio’s. It is capable of alerting the driver to such dangers as an impending collision or a child running in front of the car and has recently been installed in cars by such manufacturers as Volvo and BMW.

I will focus in this section of the book on language technologies for several reasons. Not surprisingly, the hierarchical nature of language closely mirrors the hierarchical nature of our thinking. Spoken language was our first technology, with written language as the second. My own work in artificial intelligence, as this chapter has demonstrated, has been heavily focused on language. Finally, mastering language is a powerfully leveraged capability. Watson has already read hundreds of millions of pages on the Web and mastered the knowledge contained in these documents. Ultimately machines will be able to master all of the knowledge on the Web—which is essentially all of the knowledge of our human-machine civilization.

English mathematician Alan Turing (1912–1954) based his eponymous test on the ability of a computer to converse in natural language using text messages.13 Turing felt that all of human intelligence was embodied and represented in language, and that no machine could pass a Turing test through simple language tricks. Although the Turing test is a game involving written language, Turing believed that the only way that a computer could pass it would be for it to actually possess the equivalent of human-level intelligence. Critics have proposed that a true test of human-level intelligence should include mastery of visual and auditory information as well.14 Since many of my own AI projects involve teaching computers to master such sensory information as human speech, letter shapes, and musical sounds, I would be expected to advocate the inclusion of these forms of information in a true test of intelligence. Yet I agree with Turing’s original insight that the text-only version of the Turing test is sufficient. Adding visual or auditory input or output to the test would not actually make it more difficult to pass.

One does not need to be an AI expert to be moved by the performance of Watson on Jeopardy! Although I have a reasonable understanding of the methodology used in a number of its key subsystems, that does not diminish my emotional reaction to watching it—him?—perform. Even a perfect understanding of how all of its component systems work—which no one actually has—would not help you to predict how Watson would actually react to a given situation. It contains hundreds of interacting subsystems, and each of these is considering millions of competing hypotheses at the same time, so predicting the outcome is impossible. Doing a thorough analysis—after the fact—of Watson’s deliberations for a single three-second query would take a human centuries.

To continue my own history, in the late 1980s and 1990s we began working on natural-language understanding in limited domains. You could speak to one of our products, called Kurzweil Voice, about anything you wanted, so long as it had to do with editing documents. (For example, “Move the third paragraph on the previous page to here.”) It worked pretty well in this limited but useful domain. We also created systems with medical domain knowledge so that doctors could dictate patient reports. These systems had enough knowledge of fields such as radiology and pathology that they could question the doctor if something in the report seemed unclear, and would guide the physician through the reporting process. These medical reporting systems have evolved into a billion-dollar business at Nuance.

Understanding natural language, especially as an extension to automatic speech recognition, has now entered the mainstream. As of the writing of this book, Siri, the automated personal assistant on the iPhone 4S, has created a stir in the mobile computing world. You can pretty much ask Siri to do anything that a self-respecting smartphone should be capable of doing (for example, “Where can I get some Indian food around here?” or “Text my wife that I’m on my way,” or “What do people think of the new Brad Pitt movie?”), and most of the time Siri will comply. Siri will entertain a small amount of nonproductive chatter. If you ask her what the meaning of life is, she will respond with “42,” which fans of The Hitchhiker’s Guide to the Galaxy will recognize as its “answer to the ultimate question of life, the universe, and everything.” Knowledge questions (including the one about the meaning of life) are answered by Wolfram Alpha, described on page 170. There is a whole world of “chatbots” who do nothing but engage in small talk. If you would like to talk to our chatbot named Ramona, go to our Web site KurzweilAI.net and click on “Chat with Ramona.”

Some people have complained to me about Siri’s failure to answer certain requests, but I notice that these are often the same people who persistently complain about human service providers as well. I sometimes suggest that we try it together, and often it works better than they expect. The complaints remind me of the story of the dog who plays chess. To an incredulous questioner, the dog’s owner replies, “Yeah, it’s true, he does play chess, but his endgame is weak.” Effective competitors are now emerging, such as Google Voice Search.

That the general public is now having conversations in natural spoken language with their handheld computers marks a new era. It is typical that people dismiss the significance of a first-generation technology because of its limitations. A few years later, when the technology does work well, people still dismiss its importance because, well, it’s no longer new. That being said, Siri works impressively for a first-generation product, and it is clear that this category of product is only going to get better.

Siri uses the HMM-based speech recognition technologies from Nuance. The natural-language extensions were first developed by the DARPA-funded “CALO” project.15 Siri has been enhanced with Nuance’s own natural-language technologies, and Nuance offers a very similar technology called Dragon Go!16

The methods used for understanding natural language are very similar to hierarchical hidden Markov models, and indeed HHMM itself is commonly used. Although some of these systems are not specifically labeled as using HMM or HHMM, the mathematics is virtually identical. They all involve hierarchies of linear sequences where each element has a weight, connections that are self-adapting, and an overall system that self-organizes based on learning data. Usually the learning continues during actual use of the system. This approach matches the hierarchical structure of natural language—it is just a natural extension up the conceptual ladder from parts of speech to words to phrases to semantic structures. It would make sense to run a genetic algorithm on the parameters that control the precise learning algorithm of this class of hierarchical learning systems and determine the optimal algorithmic details.
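To give a sense of the underlying mathematics, here is a minimal sketch in Python of the forward computation for a single, non-hierarchical hidden Markov model, the building block from which the hierarchical versions are assembled; the states, probabilities, and observation sequence are invented purely for illustration:

```python
# A minimal sketch of the forward algorithm for one (non-hierarchical) hidden
# Markov model. All states, transition and emission probabilities, and the
# observation sequence are illustrative inventions.

states = ["S1", "S2"]
start_p      = {"S1": 0.8, "S2": 0.2}
transition_p = {"S1": {"S1": 0.7, "S2": 0.3},
                "S2": {"S1": 0.1, "S2": 0.9}}
emission_p   = {"S1": {"a": 0.6, "b": 0.4},
                "S2": {"a": 0.2, "b": 0.8}}

def sequence_probability(observations):
    """Forward algorithm: probability that this model produced the observations."""
    forward = {s: start_p[s] * emission_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        forward = {s: sum(forward[prev] * transition_p[prev][s] for prev in states)
                      * emission_p[s][obs]
                   for s in states}
    return sum(forward.values())

print(sequence_probability(["a", "b", "b"]))
```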

Over the past decade there has been a shift in the way that these hierarchical structures are created. In 1984 Douglas Lenat (born in 1950) started the ambitious Cyc (for enCYClopedic) project, which aimed to create rules that would codify everyday “commonsense” knowledge. The rules were organized in a huge hierarchy, and each rule involved—again—a linear sequence of states. For example, one Cyc rule might state that a dog has a face. Cyc can then link to general rules about the structure of faces: that a face has two eyes, a nose, and a mouth, and so on. We don’t need to have one set of rules for a dog’s face and then another for a cat’s face, though we may of course want to put in additional rules for ways in which dogs’ faces differ from cats’ faces. The system also includes an inference engine: If we have rules that state that a cocker spaniel is a dog, that dogs are animals, and that animals eat food, and if we were to ask the inference engine whether cocker spaniels eat, the system would respond that yes, cocker spaniels eat food. Over the next twenty years, and with thousands of person-years of effort, over a million such rules were written and tested. Interestingly, the language for writing Cyc rules—called CycL—is almost identical to LISP.
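To make the flavor of such rules concrete, here is a hedged sketch in Python of a toy rule hierarchy with a tiny inference engine; the rules shown are illustrative stand-ins, not actual CycL:

```python
# A toy illustration of a rule hierarchy with inheritance and a very small
# inference engine, in the spirit of the example above. Not actual CycL.

is_a = {
    "cocker spaniel": "dog",
    "dog": "animal",
}
properties = {
    "dog": ["has a face"],
    "animal": ["eats food"],
}

def inferred_properties(thing):
    """Walk up the is-a hierarchy, collecting inherited properties."""
    collected = []
    while thing is not None:
        collected.extend(properties.get(thing, []))
        thing = is_a.get(thing)
    return collected

print(inferred_properties("cocker spaniel"))
# ['has a face', 'eats food']  ->  yes, cocker spaniels eat food
```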

Meanwhile, an opposing school of thought believed that the best approach to natural-language understanding, and to creating intelligent systems in general, was through automated learning from exposure to a very large number of instances of the phenomena the system was trying to master. A powerful example of such a system is Google Translate, which can translate to and from fifty languages. That’s 2,450 different translation directions (50 × 49), although for most language pairs, rather than translate language 1 directly into language 2, it will translate language 1 into English and then English into language 2. That reduces the number of translators Google needed to build to ninety-eight (plus a limited number of non-English pairs for which there is direct translation). The Google translators do not use grammatical rules; rather, they create vast databases for each language pair of common translations based on large “Rosetta stone” corpora of translated documents between the two languages. For the six languages that constitute the official languages of the United Nations, Google has used United Nations documents, as they are published in all six languages. For less common languages, other sources have been used.
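For readers who want to check the arithmetic of the pivot-through-English design, a short sketch in Python (assuming the fifty languages include English):

```python
# Counting translation directions versus translators needed when every
# non-English language is translated via English (assumption: the fifty
# languages include English).

languages = 50
directed_pairs = languages * (languages - 1)   # 2,450 distinct translation directions
pivot_translators = 2 * (languages - 1)        # 49 into English + 49 out of English = 98
print(directed_pairs, pivot_translators)
```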

The results are often impressive. DARPA runs annual competitions for the best automated language translation systems for different language pairs, and Google Translate often wins for certain pairs, outperforming systems created directly by human linguists.

Over the past decade two major insights have deeply influenced the natural-language-understanding field. The first has to do with hierarchies. Although the Google approach started with association of flat word sequences from one language to another, the inherent hierarchical nature of language has inevitably crept into its operation. Systems that methodically incorporate hierarchical learning (such as hierarchical hidden Markov models) provided significantly better performance. However, such systems are not quite as automatic to build. Just as humans need to learn approximately one conceptual hierarchy at a time, the same is true for computerized systems, so the learning process needs to be carefully managed.

The other insight is that hand-built rules work well for a core of common basic knowledge. For translations of short passages, this approach often provides more accurate results. For example, DARPA has rated rule-based Chinese-to-English translators higher than Google Translate for short passages. For what is called the tail of a language, which refers to the millions of infrequent phrases and concepts used in it, the accuracy of rule-based systems approaches an unacceptably low asymptote. If we plot natural-language-understanding accuracy against the amount of training data analyzed, rule-based systems have higher performance initially but level off at fairly low accuracies of about 70 percent. In sharp contrast, statistical systems can reach the high 90s in accuracy but require a great deal of data to achieve that.

Often we need a combination of at least moderate performance on a small amount of training data and then the opportunity to achieve high accuracies with a more significant quantity. Achieving moderate performance quickly enables us to put a system in the field and then to automatically collect training data as people actually use it. In this way, a great deal of learning can occur at the same time that the system is being used, and its accuracy will improve. The statistical learning needs to be fully hierarchical to reflect the nature of language, which also reflects how the human brain works.

This is also how Siri and Dragon Go! work—using rules for the most common and reliable phenomena and then learning the “tail” of the language in the hands of real users. When the Cyc team realized that they had reached a ceiling of performance based on hand-coded rules, they too adopted this approach. Hand-coded rules provide two essential functions. First, they offer adequate initial accuracy, so that a trial system can be placed into widespread usage, where it will improve automatically. Second, they provide a solid basis for the lower levels of the conceptual hierarchy so that the automated learning can begin to learn higher conceptual levels.

As mentioned above, Watson represents a particularly impressive example of the approach of combining hand-coded rules with hierarchical statistical learning. IBM combined a number of leading natural-language programs to create a system that could play the natural-language game of Jeopardy! On February 14–16, 2011, Watson competed with the two leading human players: Brad Rutter, who had won more money than anyone else on the quiz show, and Ken Jennings, who had previously held the Jeopardy! championship for the record time of seventy-five days.

By way of context, I had predicted in my first book, The Age of Intelligent Machines, written in the mid-1980s, that a computer would take the world chess championship by 1998. I also predicted that when that happened, we would either downgrade our opinion of human intelligence, upgrade our opinion of machine intelligence, or downplay the importance of chess, and that if history was a guide, we would minimize chess. Both of these things happened in 1997. When IBM’s chess supercomputer Deep Blue defeated the reigning human world chess champion, Garry Kasparov, we were immediately treated to arguments that it was to be expected that a computer would win at chess because computers are logic machines, and chess, after all, is a game of logic. Thus Deep Blue’s victory was judged to be neither surprising nor significant. Many of its critics went on to argue that computers would never master the subtleties of human language, including metaphors, similes, puns, double entendres, and humor.


The accuracy of natural-language-understanding systems as a function of the amount of training data. The best approach is to combine rules for the “core” of the language and a data-based approach for the “tail” of the language.


That is at least one reason why Watson represents such a significant milestone: Jeopardy! is precisely such a sophisticated and challenging language task. Typical Jeopardy! queries include many of these vagaries of human language. What is perhaps not evident to many observers is that Watson not only had to master the language in the unexpected and convoluted queries, but for the most part its knowledge was not hand-coded, either. It obtained that knowledge by actually reading 200 million pages of natural-language documents, including all of Wikipedia and other encyclopedias, comprising 4 trillion bytes of language-based knowledge. As readers of this book are well aware, Wikipedia is not written in LISP or CycL, but rather in natural sentences that have all of the ambiguities and intricacies inherent in language. Watson needed to consider all 4 trillion characters in its reference material when responding to a question. (I realize that Jeopardy! queries are answers in search of a question, but this is a technicality—they ultimately are really questions.) If Watson can understand and respond to questions based on 200 million pages—in three seconds!—there is nothing to stop similar systems from reading the other billions of documents on the Web. Indeed, that effort is now under way.

When we were developing character and speech recognition systems and early natural-language-understanding systems in the 1970s through 1990s, we used a methodology of incorporating an “expert manager.” We would develop multiple systems to do the same thing but would incorporate somewhat different approaches in each one. Some of the differences were subtle, such as variations in the parameters controlling the mathematics of the learning algorithm. Some variations were fundamental, such as including rule-based systems instead of hierarchical statistical learning systems. The expert manager was itself a software program that was programmed to learn the strengths and weaknesses of these different systems by examining their performance in real-world situations. It was based on the notion that these strengths were orthogonal; that is, one system would tend to be strong where another was weak. Indeed, the overall performance of the combined systems with the trained expert manager in charge was far better than any of the individual systems.
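As a rough illustration of the idea, here is a simplified sketch in Python of an expert manager that weights each subsystem’s candidate answers by that subsystem’s historical reliability; the subsystems, weights, and candidate answers are invented for illustration:

```python
# A highly simplified sketch of the "expert manager" idea: several subsystems
# each score candidate answers, and the manager weights each subsystem by its
# learned historical reliability. Subsystems, weights, and candidates are
# invented for illustration.

from collections import defaultdict

def combine(candidate_scores, manager_weights):
    """candidate_scores: {subsystem: {candidate: score}}.
    Returns candidates ranked by the weighted sum of subsystem scores."""
    totals = defaultdict(float)
    for subsystem, scores in candidate_scores.items():
        weight = manager_weights.get(subsystem, 0.0)   # learned from past performance
        for candidate, score in scores.items():
            totals[candidate] += weight * score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

scores = {
    "rule_based":  {"Grendel": 0.4, "Beowulf": 0.6},
    "statistical": {"Grendel": 0.9, "Hrothgar": 0.1},
}
weights = {"rule_based": 0.3, "statistical": 0.7}   # statistical system has proved more reliable
print(combine(scores, weights))                     # "Grendel" wins the weighted vote
```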

Watson works the same way. Using an architecture called UIMA (Unstructured Information Management Architecture), Watson deploys literally hundreds of different systems—many of the individual language components in Watson are the same ones that are used in publicly available natural-language-understanding systems—all of which are attempting to either directly come up with a response to the Jeopardy! query or else at least provide some disambiguation of the query. UIMA is basically acting as the expert manager to intelligently combine the results of the independent systems. UIMA goes substantially beyond earlier systems, such as the one we developed in the predecessor company to Nuance, in that its individual systems can contribute to a result without necessarily coming up with a final answer. It is sufficient if a subsystem helps narrow down the solution. UIMA is also able to compute how much confidence it has in the final answer. The human brain does this also—we are probably very confident of our response when asked for our mother’s first name, but we are less so in coming up with the name of someone we met casually a year ago.

Thus rather than come up with a single elegant approach to understanding the language problem inherent in Jeopardy!, the IBM scientists combined all of the state-of-the-art language-understanding modules they could get their hands on. Some use hierarchical hidden Markov models; some use mathematical variants of HHMM; others use rule-based approaches to code a core set of reliable rules directly. UIMA evaluates the performance of each system in actual use and combines them in an optimal way. There is some misunderstanding in public discussions of Watson because the IBM scientists who created it often focus on UIMA, the expert manager they created. This leads some observers to comment that Watson has no real understanding of language because it is difficult to identify where this understanding resides. Although the UIMA framework also learns from its own experience, Watson’s “understanding” of language cannot be found in UIMA alone but rather is distributed across all of its many components, including the self-organizing language modules that use methods similar to HHMM.

A separate part of Watson’s technology uses UIMA’s confidence estimate in its answers to determine how to place Jeopardy! bets. While the Watson system is specifically optimized to play this particular game, its core language- and knowledge-searching technology can easily be adapted to other broad tasks. One might think that less commonly shared professional knowledge, such as that in the medical field, would be more difficult to master than the general-purpose “common” knowledge that is required to play Jeopardy! Actually, the opposite is the case: Professional knowledge tends to be more highly organized and structured, and less ambiguous, than its commonsense counterpart, so it is highly amenable to accurate natural-language understanding using these techniques. As mentioned, IBM is currently working with Nuance to adapt the Watson technology to medicine.

The conversation that takes place when Watson is playing Jeopardy! is a brief one: A question is posed, and Watson comes up with an answer. (Again, technically, it comes up with a question to respond to an answer.) It does not engage in a conversation that would require tracking all of the earlier statements of all participants. (Siri actually does do this to a limited extent: If you ask it to send a message to your wife, it will ask you to identify her, but it will remember who she is for subsequent requests.) Tracking all of the information in a conversation—a task that would clearly be required to pass the Turing test—is a significant additional requirement but not fundamentally more difficult than what Watson is doing already. After all, Watson has read hundreds of millions of pages of material, which obviously includes many stories, so it is capable of tracking through complicated sequential events. It should therefore be able to follow its own conversations and take that into consideration in its subsequent replies.

Another limitation of the Jeopardy! game is that the answers are generally brief: It does not, for example, pose questions of the sort that ask contestants to name the five primary themes of A Tale of Two Cities. To the extent that it can find documents that do discuss the themes of this novel, a suitably modified version of Watson should be able to respond to this. Coming up with such themes on its own from just reading the book, and not essentially copying the thoughts (even without the words) of other thinkers, is another matter. Doing so would constitute a higher-level task than Watson is capable of today—it is what I call a Turing test–level task. (That being said, I will point out that most humans do not come up with their own original thoughts either but copy the ideas of their peers and opinion leaders.) At any rate, this is 2012, not 2029, so I would not expect Turing test–level intelligence yet. On yet another hand, I would point out that evaluating the answers to questions such as finding key ideas in a novel is itself not a straightforward task. If someone is asked who signed the Declaration of Independence, one can determine whether or not her response is true or false. The validity of answers to higher-level questions such as describing the themes of a creative work is far less easily established.

It is noteworthy that although Watson’s language skills are actually somewhat below those of an educated human, it was able to defeat the two best Jeopardy! players in the world. It could accomplish this because it is able to combine its language ability and knowledge understanding with the perfect recall and highly accurate memories that machines possess. That is why we have already largely assigned our personal, social, and historical memories to them.

Although I’m not prepared to move up my prediction of a computer passing the Turing test by 2029, the progress that has been achieved in systems like Watson should give anyone substantial confidence that the advent of Turing-level AI is close at hand. If one were to create a version of Watson that was optimized for the Turing test, it would probably come pretty close.

American philosopher John Searle (born in 1932) argued recently that Watson is not capable of thinking. Citing his “Chinese room” thought experiment (which I will discuss further in chapter 11), he states that Watson is only manipulating symbols and does not understand the meaning of those symbols. Actually, Searle is not describing Watson accurately, since its understanding of language is based on hierarchical statistical processes—not the manipulation of symbols. The only way that Searle’s characterization would be accurate is if we considered every step in Watson’s self-organizing processes to be “the manipulation of symbols.” But if that were the case, then the human brain would not be judged capable of thinking either.

It is amusing and ironic when observers criticize Watson for just doing statistical analysis of language as opposed to possessing the “true” understanding of language that humans have. Hierarchical statistical analysis is exactly what the human brain is doing when it is resolving multiple hypotheses based on statistical inference (and indeed at every level of the neocortical hierarchy). Both Watson and the human brain learn and respond based on a similar approach to hierarchical understanding. In many respects Watson’s knowledge is far more extensive than a human’s; no human can claim to have mastered all of Wikipedia, which is only part of Watson’s knowledge base. Conversely, a human can today master more conceptual levels than Watson, but that is certainly not a permanent gap.

One important system that demonstrates the strength of computing applied to organized knowledge is Wolfram Alpha, an answer engine (as opposed to a search engine) developed by British mathematician and scientist Dr. Stephen Wolfram (born 1959) and his colleagues at Wolfram Research. For example, if you ask Wolfram Alpha (at WolframAlpha.com), “How many primes are there under a million?” it will respond with “78,498.” It did not look up the answer, it computed it, and following the answer it provides the equations it used. If you attempted to get that answer using a conventional search engine, it would direct you to links where you could find the algorithms required. You would then have to plug those formulas into a system such as Mathematica, also developed by Dr. Wolfram, but this would obviously require a lot more work (and understanding) than simply asking Alpha.
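Skeptical readers can reproduce that particular computation themselves; a short Python sketch using a sieve of Eratosthenes (my own illustration, not Alpha’s method) gives the same figure:

```python
# Verifying the primes-under-a-million figure quoted above by computing it
# directly with a sieve of Eratosthenes, rather than looking it up.

def count_primes_below(n):
    sieve = bytearray([1]) * n
    sieve[0:2] = b"\x00\x00"              # 0 and 1 are not prime
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))  # mark multiples as composite
    return sum(sieve)

print(count_primes_below(1_000_000))      # 78498
```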

Indeed, Alpha consists of 15 million lines of Mathematica code. What Alpha is doing is literally computing the answer from approximately 10 trillion bytes of data that have been carefully curated by the Wolfram Research staff. You can ask a wide range of factual questions, such as “What country has the highest GDP per person?” (Answer: Monaco, with $212,000 per person in U.S. dollars), or “How old is Stephen Wolfram?” (Answer: 52 years, 9 months, 2 days as of the day I am writing this). As mentioned, Alpha is used as part of Apple’s Siri; if you ask Siri a factual question, it is handed off to Alpha to handle. Alpha also handles some of the searches posed to Microsoft’s Bing search engine.

In a recent blog post, Dr. Wolfram reported that Alpha is now providing successful responses 90 percent of the time.17 He also reports an exponential decrease in the failure rate, with a half-life of around eighteen months. It is an impressive system, and uses handcrafted methods and hand-checked data. It is a testament to why we created computers in the first place. As we discover and compile scientific and mathematical methods, computers are far better than unaided human intelligence in implementing them. Most of the known scientific methods have been encoded in Alpha, along with continually updated data on topics ranging from economics to physics. In a private conversation I had with Dr. Wolfram, he estimated that self-organizing methods such as those used in Watson typically achieve about an 80 percent accuracy when they are working well. Alpha, he pointed out, is achieving about a 90 percent accuracy. Of course, there is self-selection in both of these accuracy numbers in that users (such as myself) have learned what kinds of questions Alpha is good at, and a similar factor applies to the self-organizing methods. Eighty percent appears to be a reasonable estimate of how accurate Watson is on Jeopardy! queries, but this was sufficient to defeat the best humans.

It is my view that self-organizing methods such as I articulated in the pattern recognition theory of mind are needed to understand the elaborate and often ambiguous hierarchies we encounter in real-world phenomena, including human language. An ideal combination for a robustly intelligent system would be to combine hierarchical intelligence based on the PRTM (which I contend is how the human brain works) with precise codification of scientific knowledge and data. That essentially describes a human with a computer. We will enhance both poles of intelligence in the years ahead. With regard to our biological intelligence, although our neocortex has significant plasticity, its basic architecture is limited by its physical constraints. Putting additional neocortex into our foreheads was an important evolutionary innovation, but we cannot now easily expand the size of our frontal lobes by a factor of a thousand, or even by 10 percent. That is, we cannot do so biologically, but that is exactly what we will do technologically.

A Strategy for Creating a Mind

There are billions of neurons in our brains, but what are neurons? Just cells. The brain has no knowledge until connections are made between neurons. All that we know, all that we are, comes from the way our neurons are connected.

Tim Berners-Lee

Let’s use the observations I have discussed above to begin building a brain. We will start by building a pattern recognizer that meets the necessary attributes. Next we’ll make as many copies of the recognizer as we have memory and computational resources to support. Each recognizer computes the probability that its pattern has been recognized. In doing so, it takes into consideration the observed magnitude of each input (in some appropriate continuum) and matches these against the learned size and size variability parameters associated with each input. The recognizer triggers its simulated axon if that computed probability exceeds a threshold. This threshold and the parameters that control the computation of the pattern’s probability are among the parameters we will optimize with a genetic algorithm. Because it is not a requirement that every input be active for a pattern to be recognized, this provides for autoassociative recognition (that is, recognizing a pattern based on only part of the pattern being present). We also allow for inhibitory signals (signals that indicate that the pattern is less likely).

Recognition of the pattern sends an active signal up the simulated axon of this pattern recognizer. This axon is in turn connected to one or more pattern recognizers at the next higher conceptual level. Each of the pattern recognizers connected at the next higher conceptual level accepts this pattern as one of its inputs. Each pattern recognizer also sends signals down to pattern recognizers at lower conceptual levels whenever most of a pattern has been recognized, indicating that the rest of the pattern is “expected.” Each pattern recognizer has one or more of these expected-signal input channels. When an expected signal is received in this way, the threshold for recognition of this pattern recognizer is lowered (made easier to reach).
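A minimal sketch in Python, under many simplifying assumptions, may help make the recognizer just described concrete. The Gaussian weighting of each input against its learned size and variability, and all of the specific numbers, are illustrative choices of mine, not a specification:

```python
# A minimal sketch of a single pattern recognizer: each input has a learned
# expected size and variability, recognition fires when the combined evidence
# crosses a threshold, and an "expected" signal from a higher level lowers
# that threshold. The Gaussian scoring and the numbers are illustrative only.

import math

class PatternRecognizer:
    def __init__(self, expected_sizes, variabilities, threshold=0.5):
        self.expected_sizes = expected_sizes      # learned expected magnitude per input
        self.variabilities = variabilities        # learned variability per input
        self.threshold = threshold
        self.expectation_boost = 0.0              # set when a higher level says "expected"

    def expect(self, boost=0.2):
        """Top-down signal: the rest of this pattern is expected, so fire more easily."""
        self.expectation_boost = boost

    def recognize(self, observed_sizes):
        """Return (fires, probability). Missing inputs (None) are tolerated,
        which gives the autoassociative behavior described in the text."""
        scores = []
        for observed, mu, sigma in zip(observed_sizes, self.expected_sizes, self.variabilities):
            if observed is None:
                continue                          # partial patterns can still be recognized
            scores.append(math.exp(-((observed - mu) ** 2) / (2 * sigma ** 2)))
        probability = sum(scores) / len(self.expected_sizes) if scores else 0.0
        fires = probability >= self.threshold - self.expectation_boost
        return fires, probability

r = PatternRecognizer(expected_sizes=[1.0, 2.0, 0.5], variabilities=[0.2, 0.5, 0.1])
print(r.recognize([1.05, 1.9, None]))   # two of three inputs present, still recognized
r.expect()
print(r.recognize([1.05, 1.9, None]))   # same evidence, lower effective threshold
```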

The pattern recognizers are responsible for “wiring” themselves to other pattern recognizers up and down the conceptual hierarchy. Note that all the “wires” in a software implementation operate via virtual links (which, like Web links, are basically memory pointers) and not actual wires. This system is actually much more flexible than that in the biological brain. In a human brain, new patterns have to be assigned to an actual physical pattern recognizer, and new connections have to be made with an actual axon-to-dendrite link. Usually this means taking an existing physical connection that is approximately what is needed and then growing the necessary axon and dendrite extensions to complete the full connection.

Another technique used in biological mammalian brains is to start with a large number of possible connections and then prune the neural connections that are not used. If a biological neocortex reassigns cortical pattern recognizers that have already learned older patterns in order to learn more recent material, then the connections need to be physically reconfigured. Again, these tasks are much simpler in a software implementation. We simply assign new memory locations to a new pattern recognizer and use memory links for the connections. If the digital neocortex wishes to reassign cortical memory resources from one set of patterns to another, it simply returns the memory used by the old pattern recognizers to the available pool and then makes the new assignment. This sort of “garbage collection” and reassignment of memory is a standard feature of the architecture of many software systems. In our digital brain we would also back up old memories before discarding them from the active neocortex, a precaution we can’t take in our biological brains.

There are a variety of mathematical techniques that can be employed to implement this approach to self-organizing hierarchical pattern recognition. The method I would use is hierarchical hidden Markov models, for several reasons. From my personal perspective, I have several decades of familiarity with this method, having used it in the earliest speech recognition and natural-language systems starting in the 1980s. From the perspective of the overall field, there is greater experience with hidden Markov models than with any other approach for pattern recognition tasks. They are also extensively used in natural-language understanding. Many NLU systems use techniques that are at least mathematically similar to HHMM.

Note that not all hidden Markov model systems are fully hierarchical. Some allow for just a few levels of hierarchy—for example, going from acoustic states to phonemes to words. To build a brain, we will want to enable our system to create as many new levels of hierarchy as needed. Also, most hidden Markov model systems are not fully self-organizing. Some have fixed connections, although these systems do effectively prune many of their starting connections by allowing them to evolve zero connection weights. Our systems from the 1980s and 1990s automatically pruned connections with connection weights below a certain level and also allowed for making new connections to better model the training data and to learn on the fly. A key requirement, I believe, is to allow for the system to flexibly create its own topologies based on the patterns it is exposed to while learning. We can use the mathematical technique of linear programming to optimally assign connections to new pattern recognizers.

Our digital brain will also accommodate substantial redundancy of each pattern, especially ones that occur frequently. This allows for robust recognition of common patterns and is also one of the key methods to achieving invariant recognition of different forms of a pattern. We will, however, need rules for how much redundancy to permit, as we don’t want to use up excessive amounts of memory on very common low-level patterns.

The rules regarding redundancy, recognition thresholds, and the effect on the threshold of a “this pattern is expected” indication are a few examples of key overall parameters that affect the performance of this type of self-organizing system. I would initially set these parameters based on my intuition, but we would then optimize them using a genetic algorithm.
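As a toy illustration of that optimization step, the following Python sketch applies a simple genetic algorithm to three such parameters; the fitness function is a stand-in, since in a real system fitness would be recognition accuracy measured on held-out data:

```python
# A toy genetic algorithm tuning three illustrative parameters of the kind
# mentioned above (recognition threshold, redundancy limit, expectation boost).
# The fitness function is a stand-in for "run the recognizer hierarchy and
# score its accuracy," which is far too expensive to show here.

import random

def fitness(params):
    threshold, redundancy, boost = params
    # Hypothetical target values, used only so the toy example has an optimum.
    return -((threshold - 0.55) ** 2 + (redundancy - 3.0) ** 2 + (boost - 0.15) ** 2)

def mutate(params, scale=0.05):
    return tuple(p + random.gauss(0, scale) for p in params)

def genetic_search(generations=200, population_size=30):
    population = [(random.uniform(0, 1), random.uniform(1, 10), random.uniform(0, 0.5))
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]      # keep the fittest half
        population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(population, key=fitness)

print(genetic_search())   # should converge near (0.55, 3.0, 0.15)
```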

A very important consideration is the education of a brain, whether a biological or a software one. As I discussed earlier, a hierarchical pattern recognition system (digital or biological) will only learn about two—preferably one—hierarchical levels at a time. To bootstrap the system I would start with previously trained hierarchical networks that have already learned their lessons in recognizing human speech, printed characters, and natural-language structures. Such a system would be capable of reading natural-language documents but would only be able to master approximately one conceptual level at a time. Previously learned levels would provide a relatively stable basis to learn the next level. The system can read the same documents over and over, gaining new conceptual levels with each subsequent reading, similar to the way people reread and achieve a deeper understanding of texts. Billions of pages of material are available on the Web. Wikipedia itself has about four million articles in the English version.

I would also provide a critical thinking module, which would perform a continual background scan of all of the existing patterns, reviewing their compatibility with the other patterns (ideas) in this software neocortex. We have no such facility in our biological brains, which is why people can hold completely inconsistent thoughts with equanimity. Upon identifying an inconsistent idea, the digital module would begin a search for a resolution, including its own cortical structures as well as all of the vast literature available to it. A resolution might simply mean determining that one of the inconsistent ideas is simply incorrect (if contraindicated by a preponderance of conflicting data). More constructively, it would find an idea at a higher conceptual level that resolves the apparent contradiction by providing a perspective that explains each idea. The system would add this resolution as a new pattern and link to the ideas that initially triggered the search for the resolution. This critical thinking module would run as a continual background task. It would be very beneficial if human brains did the same thing.

I would also provide a module that identifies open questions in every discipline. As another continual background task, it would search for solutions to them in other disparate areas of knowledge. As I noted, the knowledge in the neocortex consists of deeply nested patterns of patterns and is therefore entirely metaphorical. We can use one pattern to provide a solution or insight in an apparently disconnected field.

As an example, recall the metaphor I used in chapter 4 relating the random movements of molecules in a gas to the random movements of evolutionary change. Molecules in a gas move randomly with no apparent sense of direction. Despite this, virtually every molecule in a gas in a beaker, given sufficient time, will leave the beaker. I noted that this provides a perspective on an important question concerning the evolution of intelligence. Like molecules in a gas, evolutionary changes also move every which way with no apparent direction. Yet we nonetheless see a movement toward greater complexity and greater intelligence, indeed to evolution’s supreme achievement of evolving a neocortex capable of hierarchical thinking. So we are able to gain an insight into how an apparently purposeless and directionless process can achieve an apparently purposeful result in one field (biological evolution) by looking at another field (thermodynamics).

I mentioned earlier how Charles Lyell’s insight that minute changes to rock formations by streaming water could carve great valleys over time inspired Charles Darwin to make a similar observation about continual minute changes to the characteristics of organisms within a species. This metaphor search would be another continual background process.

We should provide a means of stepping through multiple lists simultaneously to provide the equivalent of structured thought. A list might be the statement of the constraints that a solution to a problem must satisfy. Each step can generate a recursive search through the existing hierarchy of ideas or a search through available literature. The human brain appears to be able to handle only four simultaneous lists at a time (without the aid of tools such as computers), but there is no reason for an artificial neocortex to have such a limitation.

We will also want to enhance our artificial brains with the kind of intelligence that computers have always excelled in, which is the ability to master vast databases accurately and implement known algorithms quickly and efficiently. Wolfram Alpha uniquely combines a great many known scientific methods and applies them to carefully collected data. This type of system is also going to continue to improve given Dr. Wolfram’s observation of an exponential decline in error rates.

Finally, our new brain needs a purpose. A purpose is expressed as a series of goals. In the case of our biological brains, our goals are established by the pleasure and fear centers that we have inherited from the old brain. These primitive drives were initially set by biological evolution to foster the survival of species, but the neocortex has enabled us to sublimate them. Watson’s goal was to respond to Jeopardy! queries. Another simply stated goal could be to pass the Turing test. To do so, a digital brain would need a human narrative of its own fictional story so that it can pretend to be a biological human. It would also have to dumb itself down considerably, for any system that displayed the knowledge of, say, Watson would be quickly unmasked as nonbiological.

More interestingly, we could give our new brain a more ambitious goal, such as contributing to a better world. A goal along these lines, of course, raises a lot of questions: Better for whom? Better in what way? For biological humans? For all conscious beings? If that is the case, who or what is conscious?

As nonbiological brains become as capable as biological ones of effecting changes in the world—indeed, ultimately far more capable than unenhanced biological ones—we will need to consider their moral education. A good place to start would be with one old idea from our religious traditions: the golden rule.
