Further Readings

If this book whetted your appetite for machine learning and the issues surrounding it, you’ll find many suggestions in this section. Its aim is not to be comprehensive but to provide an entrance to machine learning’s garden of forking paths (as Borges put it). Wherever possible, I chose books and articles appropriate for the general reader. Technical publications, which require at least some computational, statistical, or mathematical background, are marked with an asterisk (*). Even these, however, often have large sections accessible to the general reader. I didn’t list volume, issue, or page numbers, since the web renders them superfluous; likewise for publishers’ locations.

If you’d like to learn more about machine learning in general, one good place to start is with online courses. Of these, the closest in content to this book is, not coincidentally, the one I teach (www.coursera.org/course/machlearning). Two other options are Andrew Ng’s course (www.coursera.org/course/ml) and Yaser Abu-Mostafa’s (http://work.caltech.edu/telecourse.html). The next step is to read a textbook. The closest to this book, and one of the most accessible, is Tom Mitchell’s Machine Learning* (McGraw-Hill, 1997). More up-to-date, but also more mathematical, are Kevin Murphy’s Machine Learning: A Probabilistic Perspective* (MIT Press, 2012), Chris Bishop’s Pattern Recognition and Machine Learning* (Springer, 2006), and An Introduction to Statistical Learning with Applications in R,* by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani (Springer, 2013). My article “A few useful things to know about machine learning” (Communications of the ACM, 2012) summarizes some of the “folk knowledge” of machine learning that textbooks often leave implicit and was one of the starting points for this book. If you know how to program and are itching to give machine learning a try, you can start from a number of open-source packages, such as Weka (www.cs.waikato.ac.nz/ml/weka). The two main machine-learning journals are Machine Learning and the Journal of Machine Learning Research. Leading machine-learning conferences, with yearly proceedings, include the International Conference on Machine Learning, the Conference on Neural Information Processing Systems, and the International Conference on Knowledge Discovery and Data Mining. A large number of machine-learning talks are available at http://videolectures.net. The www.KDnuggets.com website is a one-stop shop for machine-learning resources, and you can sign up for its newsletter to keep up to date with the latest developments.

Prologue

An early list of examples of machine learning’s impact on daily life can be found in “Behind-the-scenes data mining,” by George John (SIGKDD Explorations, 1999), which was also the inspiration for the “day-in-the-life” paragraphs of the prologue. Eric Siegel’s book Predictive Analytics (Wiley, 2013) surveys a large number of machine-learning applications. The term big data was popularized by the McKinsey Global Institute’s 2011 report Big Data: The Next Frontier for Innovation, Competition, and Productivity. Many of the issues raised by big data are discussed in Big Data: A Revolution That Will Change How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier (Houghton Mifflin Harcourt, 2013). The textbook I learned AI from is Artificial Intelligence,* by Elaine Rich (McGraw-Hill, 1983). A current one is Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig (3rd ed., Prentice Hall, 2010). Nils Nilsson’s The Quest for Artificial Intelligence (Cambridge University Press, 2010) tells the story of AI from its earliest days.

Chapter One

Nine Algorithms That Changed the Future, by John MacCormick (Princeton University Press, 2012), describes some of the most important algorithms in computer science, with a chapter on machine learning. Algorithms,* by Sanjoy Dasgupta, Christos Papadimitriou, and Umesh Vazirani (McGraw-Hill, 2008), is a concise introductory textbook on the subject. The Pattern on the Stone, by Danny Hillis (Basic Books, 1998), explains how computers work. Walter Isaacson recounts the lively history of computer science in The Innovators (Simon & Schuster, 2014).

“Spreadsheet data manipulation using examples,”* by Sumit Gulwani, William Harris, and Rishabh Singh (Communications of the ACM, 2012), is an example of how computers can program themselves by observing users. Competing on Analytics, by Tom Davenport and Jeanne Harris (HBS Press, 2007), is an introduction to the use of predictive analytics in business. In the Plex, by Steven Levy (Simon & Schuster, 2011), describes at a high level how Google’s technology works. Carl Shapiro and Hal Varian explain the network effect in Information Rules (HBS Press, 1999). Chris Anderson does the same for the long-tail phenomenon in The Long Tail (Hyperion, 2006).

The transformation of science by data-intensive computing is surveyed in The Fourth Paradigm, edited by Tony Hey, Stewart Tansley, and Kristin Tolle (Microsoft Research, 2009). “Machine science,” by James Evans and Andrey Rzhetsky (Science, 2010), discusses some of the different ways computers can make scientific discoveries. Scientific Discovery: Computational Explorations of the Creative Processes,* by Pat Langley et al. (MIT Press, 1987), describes a series of approaches to automating the discovery of scientific laws. The SKICAT project is described in “From digitized images to online catalogs,” by Usama Fayyad, George Djorgovski, and Nicholas Weir (AI Magazine, 1996). “Machine learning in drug discovery and development,”* by Niki Wale (Drug Development Research, 2001), gives an overview of just that. Adam, the robot scientist, is described in “The automation of science,” by Ross King et al. (Science, 2009).

Sasha Issenberg’s The Victory Lab (Broadway Books, 2012) dissects the use of data analysis in politics. “How President Obama’s campaign used big data to rally individual voters,” by the same author (MIT Technology Review, 2013), tells the story of its greatest success to date. Nate Silver’s The Signal and the Noise (Penguin Press, 2012) has a chapter on his poll aggregation method.

Robot warfare is the theme of P. W. Singer’s Wired for War (Penguin, 2009). Cyber War, by Richard Clarke and Robert Knake (Ecco, 2012), sounds the alarm on cyberwar. My work on combining machine learning with game theory to defeat adversaries, which started as a class project, is described in “Adversarial classification,”* by Nilesh Dalvi et al. (Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining, 2004). Predictive Policing, by Walter Perry et al. (Rand, 2013), is a guide to the use of analytics in police work.

Chapter Two

The ferret brain rewiring experiments are described in “Visual behaviour mediated by retinal projections directed to the auditory pathway,” by Laurie von Melchner, Sarah Pallas, and Mriganka Sur (Nature, 2000). Ben Underwood’s story is told in “Seeing with sound,” by Joanna Moorhead (Guardian, 2007), and at www.benunderwood.com. Otto Creutzfeldt makes the case that the cortex is one algorithm in “Generality of the functional structure of the neocortex” (Naturwissenschaften, 1977), as does Vernon Mountcastle in “An organizing principle for cerebral function: The unit model and the distributed system,” in The Mindful Brain, edited by Gerald Edelman and Vernon Mountcastle (MIT Press, 1978). Gary Marcus, Adam Marblestone, and Tom Dean make the case against it in “The atoms of neural computation” (Science, 2014).

“The unreasonable effectiveness of data,” by Alon Halevy, Peter Norvig, and Fernando Pereira (IEEE Intelligent Systems, 2009), argues for machine learning as the new discovery paradigm. Benoît Mandelbrot explores the fractal geometry of nature in the eponymous book* (Freeman, 1982). James Gleick’s Chaos (Viking, 1987) discusses and depicts the Mandelbrot set. The Langlands program, a research effort that seeks to unify different subfields of mathematics, is described in Love and Math, by Edward Frenkel (Basic Books, 2014). The Golden Ticket, by Lance Fortnow (Princeton University Press, 2013), is an introduction to NP-completeness and the P = NP problem. The Annotated Turing,* by Charles Petzold (Wiley, 2008), explains Turing machines by revisiting Turing’s original paper on them.

The Cyc project is described in “Cyc: Toward programs with common sense,”* by Douglas Lenat et al. (Communications of the ACM, 1990). Peter Norvig discusses Noam Chomsky’s criticisms of statistical learning in “On Chomsky and the two cultures of statistical learning” (http://norvig.com/chomsky.html). Jerry Fodor’s The Modularity of Mind (MIT Press, 1983) summarizes his views on how the mind works. “What big data will never explain,” by Leon Wieseltier (New Republic, 2013), and “Pundits, stop sounding ignorant about data,” by Andrew McAfee (Harvard Business Review, 2013), give a flavor of the controversy surrounding what big data can and can’t do. Daniel Kahneman explains why algorithms often beat intuitions in Chapter 21 of Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011). David Patterson makes the case for the role of computing and data in the fight against cancer in “Computer scientists may have what it takes to help cure cancer” (New York Times, 2011).

More on the various tribes’ paths to the Master Algorithm can be found in the corresponding sections below.

Chapter Three

Hume’s classic formulation of the problem of induction appears in Volume I of A Treatise of Human Nature (1739). David Wolpert derives his “no free lunch” theorem for induction in “The lack of a priori distinctions between learning algorithms”* (Neural Computation, 1996). I discuss the importance of prior knowledge in machine learning in “Toward knowledge-rich data mining”* (Data Mining and Knowledge Discovery, 2007) and misinterpretations of Occam’s razor in “The role of Occam’s razor in knowledge discovery”* (Data Mining and Knowledge Discovery, 1999). Overfitting is one of the main themes of The Signal and the Noise, by Nate Silver (Penguin Press, 2012), who calls it “the most important scientific problem you’ve never heard of.” “Why most published research findings are false,”* by John Ioannidis (PLoS Medicine, 2005), discusses the problem of mistaking chance findings for true ones in science. Yoav Benjamini and Yosef Hochberg propose a way to combat it in “Controlling the false discovery rate: A practical and powerful approach to multiple testing”* (Journal of the Royal Statistical Society, Series B, 1995). The bias-variance decomposition is presented in “Neural networks and the bias/variance dilemma,” by Stuart Geman, Elie Bienenstock, and René Doursat (Neural Computation, 1992). “Machine learning as an experimental science,” by Pat Langley (Machine Learning, 1988), discusses the role of experimentation in machine learning.
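
For readers who want the formula itself: for squared error, the decomposition Geman et al. analyze can be written as below (in standard notation, not copied verbatim from their paper), where f is the true function, f̂ the learner’s prediction, σ² the irreducible noise, and the expectations are taken over training sets:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```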

William Stanley Jevons first proposed viewing induction as the inverse of deduction in The Principles of Science (1874). The paper “Machine learning of first-order predicates by inverting resolution,”* by Stephen Muggleton and Wray Buntine (Proceedings of the Fifth International Conference on Machine Learning, 1988), initiated the use of inverse deduction in machine learning. The book Relational Data Mining,* edited by Sašo Džeroski and Nada Lavrač (Springer, 2001), is an introduction to the field of inductive logic programming, where inverse deduction is studied. “The CN2 induction algorithm,”* by Peter Clark and Tim Niblett (Machine Learning, 1989), summarizes some of the main Michalski-style rule induction algorithms. The rule-mining approach used by retailers is described in “Fast algorithms for mining association rules,”* by Rakesh Agrawal and Ramakrishnan Srikant (Proceedings of the Twentieth International Conference on Very Large Databases, 1994). An example of rule induction for cancer prediction is described in “Carcinogenesis predictions using inductive logic programming,” by Ashwin Srinivasan, Ross King, Stephen Muggleton, and Michael Sternberg (Intelligent Data Analysis in Medicine and Pharmacology, 1997).

The two leading decision tree learners are presented in C4.5: Programs for Machine Learning,* by J. Ross Quinlan (Morgan Kaufmann, 1992), and Classification and Regression Trees,* by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (Chapman and Hall, 1984). “Real-time human pose recognition in parts from single depth images,”* by Jamie Shotton et al. (Communications of the ACM, 2013), explains how Microsoft’s Kinect uses decision trees to track gamers’ motions. “Competing approaches to predicting Supreme Court decision making,” by Andrew Martin et al. (Perspectives on Politics, 2004), describes how decision trees beat legal experts at predicting Supreme Court votes and shows the decision tree for Justice Sandra Day O’Connor.
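
To give a flavor of what learners like C4.5 and CART do at each node, here is a toy Python sketch of choosing a single attribute-value test by information gain; the function names and the restriction to discrete attributes are my choices for illustration, not Quinlan’s or Breiman’s.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the (attribute, value) equality test with the highest information gain.

    rows: list of tuples of discrete attribute values; labels: class labels.
    """
    base = entropy(labels)
    best = (None, None, 0.0)
    for attr in range(len(rows[0])):
        for value in {row[attr] for row in rows}:
            yes = [lbl for row, lbl in zip(rows, labels) if row[attr] == value]
            no = [lbl for row, lbl in zip(rows, labels) if row[attr] != value]
            if not yes or not no:
                continue  # test doesn't split the data
            gain = base - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
            if gain > best[2]:
                best = (attr, value, gain)
    return best
```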

Allen Newell and Herbert Simon formulated the hypothesis that all intelligence is symbol manipulation in “Computer science as empirical inquiry: Symbols and search” (Communications of the ACM, 1976). David Marr proposed his three levels of information processing in Vision* (Freeman, 1982). Machine Learning: An Artificial Intelligence Approach,* edited by Ryszard Michalski, Jaime Carbonell, and Tom Mitchell (Tioga, 1983), gives a snapshot of the early days of symbolist research in machine learning. “Connectionist AI, symbolic AI, and the brain,”* by Paul Smolensky (Artificial Intelligence Review, 1987), gives a connectionist view of symbolist models.

Chapter Four

Sebastian Seung’s Connectome (Houghton Mifflin Harcourt, 2012) is an accessible introduction to neuroscience, connectomics, and the daunting challenge of reverse engineering the brain. Parallel Distributed Processing,* edited by David Rumelhart, James McClelland, and the PDP research group (MIT Press, 1986), is the bible of connectionism in its 1980s heyday. Neurocomputing,* edited by James Anderson and Edward Rosenfeld (MIT Press, 1988), collates many of the classic connectionist papers, including: McCulloch and Pitts on the first models of neurons; Hebb on Hebb’s rule; Rosenblatt on perceptrons; Hopfield on Hopfield networks; Ackley, Hinton, and Sejnowski on Boltzmann machines; Sejnowski and Rosenberg on NETtalk; and Rumelhart, Hinton, and Williams on backpropagation. “Efficient backprop,”* by Yann LeCun, Léon Bottou, Genevieve Orr, and Klaus-Robert Müller, in Neural Networks: Tricks of the Trade, edited by Genevieve Orr and Klaus-Robert Müller (Springer, 1998), explains some of the main tricks needed to make backprop work.
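
As a taste of the classics collected in Neurocomputing, here is Rosenblatt’s perceptron learning rule in a few lines of Python; the OR-function training data and the learning rate below are invented for the example.

```python
def train_perceptron(examples, epochs=100, rate=0.1):
    """Rosenblatt's rule: nudge the weights toward each misclassified example.

    examples: list of (input_vector, target) pairs with targets in {0, 1}.
    """
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            output = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            error = target - output  # 0 if correct, +1 or -1 if wrong
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias

# A single perceptron can learn OR (linearly separable) but, famously, not XOR.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_perceptron(or_data)
```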

Neural Networks in Finance and Investing,* edited by Robert Trippi and Efraim Turban (McGraw-Hill, 1992), is a collection of articles on financial applications of neural networks. “Life in the fast lane: The evolution of an adaptive vehicle control system,” by Todd Jochem and Dean Pomerleau (AI Magazine, 1996), describes the ALVINN self-driving car project. Paul Werbos’s PhD thesis is Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences* (Harvard University, 1974). Arthur Bryson and Yu-Chi Ho describe their early version of backprop in Applied Optimal Control* (Blaisdell, 1969).

Learning Deep Architectures for AI,* by Yoshua Bengio (Now, 2009), is a brief introduction to deep learning. The problem of error signal diffusion in backprop is described in “Learning long-term dependencies with gradient descent is difficult,”* by Yoshua Bengio, Patrice Simard, and Paolo Frasconi (IEEE Transactions on Neural Networks, 1994). “How many computers to identify a cat? 16,000,” by John Markoff (New York Times, 2012), reports on the Google Brain project and its results. Convolutional neural networks, the current deep learning champion, are described in “Gradient-based learning applied to document recognition,”* by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (Proceedings of the IEEE, 1998). “The $1.3B quest to build a supercomputer replica of a human brain,” by Jonathon Keats (Wired, 2013), describes the European Union’s brain modeling project. “The NIH BRAIN Initiative,” by Thomas Insel, Story Landis, and Francis Collins (Science, 2013), describes the BRAIN initiative.

Steven Pinker summarizes the symbolists’ criticisms of connectionist models in Chapter 2 of How the Mind Works (Norton, 1997). Seymour Papert gives his take on the debate in “One AI or Many?” (Daedalus, 1988). The Birth of the Mind, by Gary Marcus (Basic Books, 2004), explains how evolution could give rise to the human brain’s complex abilities.

Chapter Five

“Evolutionary robotics,” by Josh Bongard (Communications of the ACM, 2013), surveys the work of Hod Lipson and others on evolving robots. Artificial Life, by Steven Levy (Vintage, 1993), gives a tour of the digital zoo, from computer-created animals in virtual worlds to genetic algorithms. Chapter 5 of Complexity, by Mitch Waldrop (Touchstone, 1992), tells the story of John Holland and the first few decades of research on genetic algorithms. Genetic Algorithms in Search, Optimization, and Machine Learning,* by David Goldberg (Addison-Wesley, 1989), is the standard introduction to genetic algorithms.
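
For a feel of what Goldberg’s textbook covers, here is a bare-bones genetic algorithm in Python, with truncation selection, single-point crossover, and bitwise mutation; the parameters and the “one-max” fitness function (count the 1s) are arbitrary illustrative choices.

```python
import random

def evolve(fitness, length=20, pop_size=50, generations=100, mutation=0.01):
    """Bare-bones genetic algorithm over bit strings."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]           # keep the fitter half
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            point = random.randrange(1, length)     # single-point crossover
            child = a[:point] + b[point:]
            # Flip each bit with a small probability (mutation).
            child = [bit ^ (random.random() < mutation) for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = evolve(fitness=sum)  # "one-max": maximize the number of 1s
```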

Niles Eldredge and Stephen Jay Gould propose their theory of punctuated equilibria in “Punctuated equilibria: An alternative to phyletic gradualism,” in Models in Paleobiology, edited by T. J. M. Schopf (Freeman, 1972). Richard Dawkins critiques it in Chapter 9 of The Blind Watchmaker (Norton, 1986). The exploration-exploitation dilemma is discussed in Chapter 2 of Reinforcement Learning,* by Richard Sutton and Andrew Barto (MIT Press, 1998). John Holland proposes his solution, and much else, in Adaptation in Natural and Artificial Systems* (University of Michigan Press, 1975).

John Koza’s Genetic Programming* (MIT Press, 1992) is the key reference on this paradigm. An evolved robot soccer team is described in “Evolving team Darwin United,”* by David Andre and Astro Teller, in RoboCup-98: Robot Soccer World Cup II, edited by Minoru Asada and Hiroaki Kitano (Springer, 1999). Genetic Programming III,* by John Koza, Forrest Bennett III, David Andre, and Martin Keane (Morgan Kaufmann, 1999), includes many examples of evolved electronic circuits. Danny Hillis argues that parasites are good for evolution in “Co-evolving parasites improve simulated evolution as an optimization procedure”* (Physica D, 1990). Adi Livnat, Christos Papadimitriou, Jonathan Dushoff, and Marcus Feldman propose that sex optimizes mixability in “A mixability theory of the role of sex in evolution”* (Proceedings of the National Academy of Sciences, 2008). Kevin Lang’s paper comparing genetic programming and hill climbing is “Hill climbing beats genetic search on a Boolean circuit synthesis problem of Koza’s”* (Proceedings of the Twelfth International Conference on Machine Learning, 1995). Koza’s reply is “A response to the ML-95 paper entitled…”* (unpublished; online at www.genetic-programming.com/jktahoe24page.html).

James Baldwin proposed the eponymous effect in “A new factor in evolution” (American Naturalist, 1896). Geoff Hinton and Steven Nowlan describe their implementation of it in “How learning can guide evolution”* (Complex Systems, 1987). The Baldwin effect was the theme of a 1996 special issue* of the journal Evolutionary Computation edited by Peter Turney, Darrell Whitley, and Russell Anderson.

The distinction between descriptive and normative theories was articulated by John Neville Keynes in The Scope and Method of Political Economy (Macmillan, 1891).

Chapter Six

Sharon Bertsch McGrayne tells the history of Bayesianism, from Bayes and Laplace to the present, in The Theory That Would Not Die (Yale University Press, 2011). A First Course in Bayesian Statistical Methods,* by Peter Hoff (Springer, 2009), is an introduction to Bayesian statistics.

The Naïve Bayes algorithm is first mentioned in Pattern Classification and Scene Analysis,* by Richard Duda and Peter Hart (Wiley, 1973). Milton Friedman argues for oversimplified theories in “The methodology of positive economics,” which appears in Essays in Positive Economics (University of Chicago Press, 1966). The use of Naïve Bayes in spam filtering is described in “Stopping spam,” by Joshua Goodman, David Heckerman, and Robert Rounthwaite (Scientific American, 2005). “Relevance weighting of search terms,”* by Stephen Robertson and Karen Sparck Jones (Journal of the American Society for Information Science, 1976), explains the use of Naïve Bayes-like methods in information retrieval.
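
As a sketch of the kind of classifier these papers discuss, here is a minimal Naïve Bayes text classifier in Python with Laplace smoothing; the function names are mine, and a real spam filter would add many refinements.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count words per class. docs: list of (word_list, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # label -> word frequencies
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify_nb(words, class_counts, word_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word | class),
    with add-one (Laplace) smoothing to avoid zero probabilities."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```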

“First links in the Markov chain,” by Brian Hayes (American Scientist, 2013), recounts Markov’s invention of the eponymous chains. “Large language models in machine translation,”* by Thorsten Brants et al. (Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007), explains how Google Translate works. “The PageRank citation ranking: Bringing order to the Web,”* by Larry Page, Sergey Brin, Rajeev Motwani, and Terry Winograd (Stanford University technical report, 1998), describes the PageRank algorithm and its interpretation as a random walk over the web. Statistical Language Learning,* by Eugene Charniak (MIT Press, 1996), explains how hidden Markov models work. Statistical Methods for Speech Recognition,* by Fred Jelinek (MIT Press, 1997), describes their application to speech recognition. The story of HMM-style inference in communication is told in “The Viterbi algorithm: A personal history,” by David Forney (unpublished; online at arxiv.org/pdf/cs/0504020v2.pdf). Bioinformatics: The Machine Learning Approach,* by Pierre Baldi and Søren Brunak (2nd ed., MIT Press, 2001), is an introduction to the use of machine learning in biology, including HMMs. “Engineers look to Kalman filtering for guidance,” by Barry Cipra (SIAM News, 1993), is a brief introduction to Kalman filters, their history, and their applications.
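
To make the random-walk interpretation concrete, here is a toy power-iteration version of PageRank in Python; for simplicity it assumes every page has at least one outgoing link and all link targets appear as keys, which the full algorithm does not require.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank. links: dict page -> list of linked pages."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page...
        new = {p: (1 - damping) / len(pages) for p in pages}
        # ...otherwise it follows a random outgoing link.
        for p, outlinks in links.items():
            for q in outlinks:
                new[q] += damping * rank[p] / len(outlinks)
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(web))
```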

Judea Pearl’s pioneering work on Bayesian networks appears in his book Probabilistic Reasoning in Intelligent Systems* (Morgan Kaufmann, 1988). “Bayesian networks without tears,”* by Eugene Charniak (AI Magazine, 1991), is a largely nonmathematical introduction to them. “Probabilistic interpretation for MYCIN’s certainty factors,”* by David Heckerman (Proceedings of the Second Conference on Uncertainty in Artificial Intelligence, 1986), explains when sets of rules with confidence estimates are and aren’t a reasonable approximation to Bayesian networks. “Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data,” by Eran Segal et al. (Nature Genetics, 2003), is an example of using Bayesian networks to model gene regulation. “Microsoft virus fighter: Spam may be more difficult to stop than HIV,” by Ben Paynter (Fast Company, 2012), tells how David Heckerman took inspiration from spam filters and used Bayesian networks to design a potential AIDS vaccine. The probabilistic or “noisy” OR is explained in Pearl’s book.* “Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base,” by M. A. Shwe et al. (Parts I and II, Methods of Information in Medicine, 1991), describes a noisy-OR Bayesian network for medical diagnosis. Google’s Bayesian network for ad placement is described in Section 26.5.4 of Kevin Murphy’s Machine Learning* (MIT Press, 2012). Microsoft’s player rating system is described in “TrueSkill™: A Bayesian skill rating system,”* by Ralf Herbrich, Tom Minka, and Thore Graepel (Advances in Neural Information Processing Systems 19, 2007).
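
The noisy-OR mentioned above fits in a few lines: the effect fails to occur only if every active cause independently fails to produce it. A sketch with made-up probabilities:

```python
def noisy_or(cause_probs, active):
    """Noisy-OR: the effect fires unless every active cause independently
    fails to trigger it. cause_probs: {cause: P(effect | only that cause)}."""
    failure = 1.0
    for cause in active:
        failure *= 1.0 - cause_probs[cause]
    return 1.0 - failure

# E.g., if flu causes fever with probability 0.9 and a cold with 0.4:
p = noisy_or({"flu": 0.9, "cold": 0.4}, active=["flu", "cold"])  # 0.94
```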

Modeling and Reasoning with Bayesian Networks,* by Adnan Darwiche (Cambridge University Press, 2009), explains the main algorithms for inference in Bayesian networks. The January/February 2000 issue* of Computing in Science and Engineering, edited by Jack Dongarra and Francis Sullivan, has articles on the top ten algorithms of the twentieth century, including MCMC. “Stanley: The robot that won the DARPA Grand Challenge,” by Sebastian Thrun et al. (Journal of Field Robotics, 2006), explains how the eponymous self-driving car works. “Bayesian networks for data mining,”* by David Heckerman (Data Mining and Knowledge Discovery, 1997), summarizes the Bayesian approach to learning and explains how to learn Bayesian networks from data. “Gaussian processes: A replacement for supervised neural networks?,”* by David MacKay (NIPS tutorial notes, 1997; online at www.inference.eng.cam.ac.uk/mackay/gp.pdf), gives a flavor of how the Bayesians co-opted NIPS.
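
As a taste of MCMC, here is a bare-bones Metropolis sampler in Python for a one-dimensional distribution given by its log-density; the Gaussian proposal and its width are arbitrary choices for the sketch.

```python
import math
import random

def metropolis(log_p, start, proposal_std=1.0, samples=10000):
    """Metropolis sampling from an unnormalized log-density log_p over the reals."""
    x, chain = start, []
    for _ in range(samples):
        candidate = x + random.gauss(0, proposal_std)
        # Accept with probability min(1, p(candidate) / p(x)).
        if math.log(random.random()) < log_p(candidate) - log_p(x):
            x = candidate
        chain.append(x)
    return chain

# Example: sample from a standard normal (log-density up to a constant).
draws = metropolis(lambda x: -0.5 * x * x, start=0.0)
```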

The need for weighting the word probabilities in speech recognition is discussed in Section 9.6 of Speech and Language Processing,* by Dan Jurafsky and James Martin (2nd ed., Prentice Hall, 2009). My paper on Naïve Bayes, with Mike Pazzani, is “On the optimality of the simple Bayesian classifier under zero-one loss”* (Machine Learning, 1997; expanded journal version of the 1996 conference paper). Judea Pearl’s book,* mentioned above, discusses Markov networks along with Bayesian networks. Markov networks in computer vision are the subject of Markov Random Fields for Vision and Image Processing,* edited by Andrew Blake, Pushmeet Kohli, and Carsten Rother (MIT Press, 2011). Markov networks that maximize conditional likelihood were introduced in “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,”* by John Lafferty, Andrew McCallum, and Fernando Pereira (International Conference on Machine Learning, 2001).

The history of attempts to combine probability and logic is surveyed in a 2003 special issue* of the Journal of Applied Logic devoted to the subject, edited by Jon Williamson and Dov Gabbay. “From knowledge bases to decision models,”* by Michael Wellman, John Breese, and Robert Goldman (Knowledge Engineering Review, 1992), discusses some of the early AI approaches to the problem.

Chapter Seven

Frank Abagnale details his exploits in his autobiography, Catch Me If You Can, cowritten with Stan Redding (Grosset & Dunlap, 1980). The original technical report on the nearest-neighbor algorithm by Evelyn Fix and Joe Hodges is “Discriminatory analysis: Nonparametric discrimination: Consistency properties”* (USAF School of Aviation Medicine, 1951). Nearest Neighbor (NN) Norms,* edited by Belur Dasarathy (IEEE Computer Society Press, 1991), collects many of the key papers in this area. Locally linear regression is surveyed in “Locally weighted learning,”* by Chris Atkeson, Andrew Moore, and Stefan Schaal (Artificial Intelligence Review, 1997). The first collaborative filtering system based on nearest neighbors is described in “GroupLens: An open architecture for collaborative filtering of netnews,”* by Paul Resnick et al. (Proceedings of the 1994 ACM Conference on Computer-Supported Cooperative Work, 1994). Amazon’s collaborative filtering algorithm is described in “Amazon.com recommendations: Item-to-item collaborative filtering,”* by Greg Linden, Brent Smith, and Jeremy York (IEEE Internet Computing, 2003). (See Chapter 8’s further readings for Netflix’s.) Recommender systems’ contribution to Amazon and Netflix sales is referenced in, among others, Mayer-Schönberger and Cukier’s Big Data and Siegel’s Predictive Analytics (cited earlier). The 1967 paper by Tom Cover and Peter Hart on nearest-neighbor’s error rate is “Nearest neighbor pattern classification”* (IEEE Transactions on Information Theory).
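
The algorithm Fix and Hodges introduced may be the simplest in this book; here it is as a few lines of Python, with Euclidean distance and majority voting standing in for the many variations the literature explores.

```python
from collections import Counter

def knn_classify(query, examples, k=3):
    """Classify by majority vote among the k nearest training examples.

    examples: list of (point, label); points are equal-length numeric tuples.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(examples, key=lambda ex: sq_dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```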

The curse of dimensionality is discussed in Section 2.5 of The Elements of Statistical Learning,* by Trevor Hastie, Rob Tibshirani, and Jerry Friedman (2nd ed., Springer, 2009). “Wrappers for feature subset selection,”* by Ron Kohavi and George John (Artificial Intelligence, 1997), compares attribute selection methods. “Similarity metric learning for a variable-kernel classifier,”* by David Lowe (Neural Computation, 1995), is an example of a feature weighting algorithm.

“Support vector machines and kernel methods: The new generation of learning machines,”* by Nello Cristianini and Bernhard Schölkopf (AI Magazine, 2002), is a mostly nonmathematical introduction to SVMs. The paper that started the SVM revolution was “A training algorithm for optimal margin classifiers,”* by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik (Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992). The first paper applying SVMs to text classification was “Text categorization with support vector machines,”* by Thorsten Joachims (Proceedings of the Tenth European Conference on Machine Learning, 1998). Chapter 5 of An Introduction to Support Vector Machines,* by Nello Cristianini and John Shawe-Taylor (Cambridge University Press, 2000), is a brief introduction to constrained optimization in the context of SVMs.
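
The quadratic-programming procedure of Boser, Guyon, and Vapnik is too long to show here, but a linear SVM can also be trained by subgradient descent on the hinge loss; this Pegasos-style Python sketch (my choice of method, not theirs) gives the idea.

```python
import random

def train_linear_svm(examples, lam=0.01, epochs=100):
    """Subgradient descent on the hinge loss with L2 regularization.

    examples: list of (vector, label) pairs with labels in {-1, +1}.
    """
    w = [0.0] * len(examples[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in random.sample(examples, len(examples)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink the weights toward zero (regularization)...
            w = [(1 - eta * lam) * wi for wi in w]
            # ...and step toward examples on the wrong side of the margin.
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w
```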

Case-Based Reasoning,* by Janet Kolodner (Morgan Kaufmann, 1993), is a textbook on the subject. “Using case-based retrieval for customer technical support,”* by Evangelos Simoudis (IEEE Expert, 1992), explains its application to help desks. IPsoft’s Eliza is described in “Rise of the software machines” (Economist, 2013) and on the company’s website. Kevin Ashley explores case-based legal reasoning in Modeling Legal Arguments* (MIT Press, 1991). David Cope summarizes his approach to automated music composition in “Recombinant music: Using the computer to explore musical style” (IEEE Computer, 1991). Dedre Gentner proposed structure mapping in “Structure mapping: A theoretical framework for analogy”* (Cognitive Science, 1983). “The man who would teach machines to think,” by James Somers (Atlantic, 2013), discusses Douglas Hofstadter’s views on AI.

The RISE algorithm is described in my paper “Unifying instance-based and rule-based induction”* (Machine Learning, 1996).

Chapter Eight

The Scientist in the Crib, by Alison Gopnik, Andy Meltzoff, and Pat Kuhl (Harper, 1999), summarizes psychologists’ discoveries about how babies and young children learn.

The k-means algorithm was originally proposed by Stuart Lloyd at Bell Labs in 1957, in a technical report entitled “Least squares quantization in PCM”* (which later appeared as a paper in the IEEE Transactions on Information Theory in 1982). The original paper on the EM algorithm is “Maximum likelihood from incomplete data via the EM algorithm,”* by Arthur Dempster, Nan Laird, and Donald Rubin (Journal of the Royal Statistical Society B, 1977). Hierarchical clustering and other methods are described in Finding Groups in Data: An Introduction to Cluster Analysis,* by Leonard Kaufman and Peter Rousseeuw (Wiley, 1990).
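
Lloyd’s algorithm is short enough to show in full; this Python sketch alternates assigning points to the nearest center and moving each center to the mean of its cluster, with plain random initialization standing in for the more careful seeding used in practice.

```python
import random

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm. points: list of equal-length numeric tuples."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centers = random.sample(points, k)  # naive initialization
    for _ in range(iterations):
        # Step 1: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 2: move each center to the mean of its cluster.
        centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```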

Principal-component analysis is one of the oldest techniques in machine learning and statistics, having been first proposed by Karl Pearson in 1901 in the paper “On lines and planes of closest fit to systems of points in space”* (Philosophical Magazine). The type of dimensionality reduction used to grade SAT essays was introduced by Scott Deerwester et al. in the paper “Indexing by latent semantic analysis”* (Journal of the American Society for Information Science, 1990). Yehuda Koren, Robert Bell, and Chris Volinsky explain how Netflix-style collaborative filtering works in “Matrix factorization techniques for recommender systems”* (IEEE Computer, 2009). The Isomap algorithm was introduced in “A global geometric framework for nonlinear dimensionality reduction,”* by Josh Tenenbaum, Vin de Silva, and John Langford (Science, 2000).
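
In modern practice the principal components are usually computed via the singular value decomposition; here is a minimal NumPy sketch (the function name and interface are mine).

```python
import numpy as np

def pca(X, n_components):
    """Project centered data onto its directions of greatest variance.

    X: (samples, features) array; returns the reduced coordinates.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T
```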

Reinforcement Learning: An Introduction,* by Richard Sutton and Andrew Barto (MIT Press, 1998), is the standard textbook on the subject. Universal Artificial Intelligence,* by Marcus Hutter (Springer, 2005), is an attempt at a general theory of reinforcement learning. Arthur Samuel’s pioneering research on learning to play checkers is described in his paper “Some studies in machine learning using the game of checkers”* (IBM Journal of Research and Development, 1959). This paper also marks one of the earliest appearances in print of the term machine learning. Chris Watkins’s formulation of the reinforcement learning problem appeared in his PhD thesis Learning from Delayed Rewards* (Cambridge University, 1989). DeepMind’s reinforcement learner for video games is described in “Human-level control through deep reinforcement learning,”* by Volodymyr Mnih et al. (Nature, 2015).
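
Watkins’s Q-learning is compact enough to sketch in Python; the environment interface assumed below (reset, step, actions) is invented for the example and is not from his thesis.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. env is assumed to provide reset() -> state,
    step(state, action) -> (next_state, reward, done), and actions(state)."""
    Q = defaultdict(float)  # (state, action) -> estimated long-term value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            if random.random() < epsilon:  # explore occasionally...
                action = random.choice(actions)
            else:                          # ...otherwise exploit current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(state, action)
            best_next = 0.0 if done else max(Q[(next_state, a)]
                                             for a in env.actions(next_state))
            # Move the estimate toward reward + discounted best next value.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```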

Paul Rosenbloom retells the development of chunking in “A cognitive odyssey: From the power law of practice to a general learning mechanism and beyond” (Tutorials in Quantitative Methods for Psychology, 2006). A/B testing and other online experimentation techniques are explained in “Practical guide to controlled experiments on the Web: Listen to your customers not to the HiPPO,”* by Ron Kohavi, Randal Henne, and Dan Sommerfield (Proceedings of the Thirteenth International Conference on Knowledge Discovery and Data Mining, 2007). Uplift modeling, a multidimensional generalization of A/B testing, is the subject of Chapter 7 of Eric Siegel’s Predictive Analytics (Wiley, 2013).

Introduction to Statistical Relational Learning,* edited by Lise Getoor and Ben Taskar (MIT Press, 2007), surveys the main approaches in this area. My work with Matt Richardson on modeling word of mouth is summarized in “Mining social networks for viral marketing” (IEEE Intelligent Systems, 2005).

Chapter Nine

Model Ensembles: Foundations and Algorithms,* by Zhi-Hua Zhou (Chapman and Hall, 2012), is an introduction to metalearning. The original paper on stacking is “Stacked generalization,”* by David Wolpert (Neural Networks, 1992). Leo Breiman introduced bagging in “Bagging predictors”* (Machine Learning, 1996) and random forests in “Random forests”* (Machine Learning, 2001). Boosting is described in “Experiments with a new boosting algorithm,” by Yoav Freund and Rob Schapire (Proceedings of the Thirteenth International Conference on Machine Learning, 1996).
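
Of the ensemble methods above, bagging is the simplest to show; this Python sketch assumes a learn(data) function that returns classifiers with a predict method, an interface invented for the example.

```python
import random
from collections import Counter

def bagging_predict(train, query, learn, n_models=25):
    """Bagging: train each model on a bootstrap resample of the data
    and combine their predictions by majority vote."""
    models = []
    for _ in range(n_models):
        # Sample with replacement, same size as the original training set.
        sample = [random.choice(train) for _ in range(len(train))]
        models.append(learn(sample))
    votes = Counter(m.predict(query) for m in models)
    return votes.most_common(1)[0][0]
```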

“I, Algorithm,” by Anil Ananthaswamy (New Scientist, 2011), chronicles the road to combining logic and probability in AI. Markov Logic: An Interface Layer for Artificial Intelligence,* which I cowrote with Daniel Lowd (Morgan & Claypool, 2009), is an introduction to Markov logic networks. The Alchemy website, http://alchemy.cs.washington.edu, also includes tutorials, videos, MLNs, data sets, publications, pointers to other systems, and so on. An MLN for robot mapping is described in “Hybrid Markov logic networks,”* by Jue Wang and Pedro Domingos (Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008). Thomas Dietterich and Xinlong Bao describe the use of MLNs in DARPA’s PAL project in “Integrating multiple learning components through Markov logic”* (Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008). “Extracting semantic networks from text via relational clustering,”* by Stanley Kok and Pedro Domingos (Proceedings of the Nineteenth European Conference on Machine Learning, 2008), describes how we used MLNs to learn a semantic network from the Web.

Efficient MLNs with hierarchical class and part structure are described in “Learning and inference in tractable probabilistic knowledge bases,”* by Mathias Niepert and Pedro Domingos (Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015). Google’s approach to parallel gradient descent is described in “Large-scale distributed deep networks,”* by Jeff Dean et al. (Advances in Neural Information Processing Systems 25, 2012). “A general framework for mining massive data streams,”* by Pedro Domingos and Geoff Hulten (Journal of Computational and Graphical Statistics, 2003), summarizes our sampling-based method for learning from open-ended data streams. The FuturICT project is the subject of “The machine that would predict the future,” by David Weinberger (Scientific American, 2011).

“Cancer: The march on malignancy” (Nature supplement, 2014) surveys the current state of the war on cancer. “Using patient data for personalized cancer treatments,” by Chris Edwards (Communications of the ACM, 2014), describes the early stages of what could grow into CanceRx. “Simulating a living cell,” by Markus Covert (Scientific American, 2014), explains how his group built a computer model of a whole infectious bacterium. “Breakthrough Technologies 2015: Internet of DNA,” by Antonio Regalado (MIT Technology Review, 2015), reports on the work of the Global Alliance for Genomics and Health. Cancer Commons is described in “Cancer: A computational disease that AI can cure,” by Jay Tenenbaum and Jeff Shrager (AI Magazine, 2011).

Chapter Ten

“Love, actuarially,” by Kevin Poulsen (Wired, 2014), tells the story of how one man used machine learning to find love on the OkCupid dating site. Dataclysm, by Christian Rudder (Crown, 2014), mines OkCupid’s data for sundry insights. Total Recall, by Gordon Bell and Jim Gemmell (Dutton, 2009), explores the implications of digitally recording everything we do. The Naked Future, by Patrick Tucker (Current, 2014), surveys the use and abuse of data for prediction in our world. Craig Mundie argues for a balanced approach to data collection and use in “Privacy pragmatism” (Foreign Affairs, 2014). The Second Machine Age, by Erik Brynjolfsson and Andrew McAfee (Norton, 2014), discusses how progress in AI will shape the future of work and the economy. “World War R,” by Chris Baraniuk (New Scientist, 2014), reports on the debate surrounding the use of robots in battle. “Transcending complacency on superintelligent machines,” by Stephen Hawking et al. (Huffington Post, 2014), argues that now is the time to worry about AI’s risks. Nick Bostrom’s Superintelligence (Oxford University Press, 2014) considers those dangers and what to do about them.

A Brief History of Life, by Richard Hawking (Random Penguin, 1982), summarizes the quantum leaps of evolution in the eons BC. (Before Computers. Just kidding.) The Singularity Is Near, by Ray Kurzweil (Penguin, 2005), is your guide to the transhuman future. Joel Garreau considers three different scenarios for how human-directed evolution will unfold in Radical Evolution (Broadway Books, 2005). In What Technology Wants (Penguin, 2010), Kevin Kelly argues that technology is the continuation of evolution by other means. Darwin Among the Machines, by George Dyson (Basic Books, 1997), chronicles the evolution of technology and speculates on where it will lead. Craig Venter explains how his team synthesized a living cell in Life at the Speed of Light (Viking, 2013).
