Chapter 3. Life As We Knew It

A poet can survive everything but a misprint.

Oscar Wilde

If we are going to understand epigenetics, we first need to understand a bit about genetics and genes. The basic code for pretty much all independent life on earth, from bacteria to elephants, from Japanese knotweed to humans, is DNA (deoxyribonucleic acid). The phrase ‘DNA’ has become an expression in its own right with increasingly vague meanings. Social commentators may refer to the DNA of a society or of a corporation, by which they mean the real core of values behind an organisation. There’s even been a perfume called after it. The iconic scientific image of the mid-20th century was the atomic mushroom cloud. The double helix of DNA had similar cachet in the later part of the same century.

Science is just as prone to mood swings and fashions as any other human activity. There was a period when the prevailing orthodoxy seemed to be that the only thing that mattered was our DNA script, our genetic inheritance. Chapters 1 and 2 showed that this can’t be the case, as the same script is used differently depending on its cellular context. The field is now possibly at risk of swinging a bit too far in the opposite direction, with hardline epigeneticists almost minimising the significance of the DNA code. The truth is, of course, somewhere in between.

In the Introduction, we described DNA as a script. In the theatre, if a script is lousy then even a wonderful director and a terrific cast won’t be able to create a great production. On the other hand, we have probably all suffered through terrible productions of our favourite plays. Even if the script is perfect, the final outcome can be awful if the interpretation is poor. In the same way, genetics and epigenetics work intimately together to create the miracles that are us and every organic thing around us.

DNA is the fundamental information source in our cells, their basic blueprint. DNA itself isn’t the real business end of things, in the sense that it doesn’t carry out all the thousands of activities required just to keep us alive. That job is mainly performed by the proteins. It’s proteins that carry oxygen around our bloodstream, that turn chips and burgers into sugars and other nutrients that can be absorbed from our guts and used to power our brains, that contract our muscles so we can turn the pages of this book. But DNA is what carries the codes for all these proteins.

If DNA is a code, then it must contain symbols that can be read. It must act like a language. This is indeed exactly what the DNA code does. It might seem odd when we think how complicated we humans are, but our DNA is a language with only four letters. These letters are known as bases, and their full names are adenine, cytosine, guanine and thymine. They are abbreviated to A, C, G and T. It’s worth remembering C, cytosine, in particular, because this is the most important of all the bases in epigenetics.

One of the easiest ways to visualise DNA mentally is as a zip. It’s not a perfect analogy, but it will get us started. Of course, one of the most obvious things that we know about a zip is that it is formed of two strips facing each other. This is also true of DNA. The four bases of DNA are the teeth on the zip. The bases on each side of the zip can link up to each other chemically and hold the zip together. Two bases facing each other and joined up like this are known as a base-pair. The fabric strips that the teeth of a zip are stitched on to are the DNA backbones. There are always two backbones facing each other, like the two sides of the zip, and DNA is therefore referred to as double-stranded. The two sides of the zip are twisted around each other to form a spiral structure – the famous double helix. Figure 3.1 is a stylised representation of what the DNA double helix looks like.

Figure 3.1 A schematic representation of DNA. The two backbones are twisted around each other to form a double helix. The helix is held together by chemical bonds between the bases in the centre of the molecule.


The analogy will only get us so far, however, and that’s because the teeth of the DNA zip aren’t all equivalent. If one of the teeth is an A base, it can only link up with a T base on the opposite strand. Similarly, if there is a G base on one strand, it can only link up with a C on the other one. This is known as the base-pairing principle. If an A tried to link with a C on the opposite strand it would throw the whole shape of the DNA out of kilter, a bit like a faulty tooth on a zip.
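For readers who like to see a rule written down explicitly, the base-pairing principle can be sketched in a few lines of Python. This is only an illustration: the names `PAIRS`, `partner_strand` and `zips_up` are made up for this example, and the sketch ignores the fact that the two strands of real DNA run in opposite chemical directions.

```python
# The base-pairing principle from the text: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand: str) -> str:
    """Return the bases that must sit opposite `strand`, position by position."""
    return "".join(PAIRS[base] for base in strand)


def zips_up(strand_a: str, strand_b: str) -> bool:
    """True if every facing pair of bases obeys the base-pairing principle."""
    return len(strand_a) == len(strand_b) and all(
        PAIRS[a] == b for a, b in zip(strand_a, strand_b)
    )


print(partner_strand("GATTACA"))  # CTAATGT
print(zips_up("A", "C"))          # False: an A facing a C is a faulty tooth
```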

Keeping it pure

The base-pairing principle is incredibly important in terms of DNA function. During development, and even during a lot of adult life, the cells of our bodies divide. They do this so that organs can get bigger as a baby matures, for example. They also divide to replace cells that die off quite naturally. An example of this is the production of white blood cells by the bone marrow, to replace those that are lost in our bodies’ constant battles with infectious micro-organisms. The majority of cell types reproduce by first copying their entire DNA, and then dividing it equally between two daughter cells. This DNA replication is essential. Without it, daughter cells could end up with no DNA, which in most cases would render them completely useless, like a computer that’s lost its operating software.

It’s the copying of DNA before each cell division that shows why the base-pairing principle is so important. Hundreds of scientists have spent their entire careers working out the details of how DNA gets faithfully copied. Here’s the gist of it. The two strands of DNA are pulled apart and then the huge number of proteins involved in the copying (known as the replication complex) get to work.

Figure 3.2 shows in principle what happens. The replication complex moves along each single strand of DNA, and builds up a new strand facing it. The complex recognises a specific base – base C for example – and always puts a G in the opposite position on the strand that it’s building. That’s why the base-pairing principle is so important. Because C has to pair up with G, and A has to pair up with T, the cells can use the existing DNA as a template to make the new strands. Each daughter cell ends up with a new perfect copy of the DNA, in which one of the strands came from the original DNA molecule and the other was newly synthesised.

Figure 3.2 The first stage in replication of DNA is the separation of the two strands of the double helix. The bases on each separated backbone act as the template for the creation of a new strand. This ensures that the two new double-stranded DNA molecules have exactly the same base sequence as the parent molecule. Each new double helix of DNA has one backbone that was originally part of the parent molecule (in black) and one freshly synthesised backbone (in white).
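The copying scheme in Figure 3.2 can be mimicked with a toy model. The sketch below is a deliberate simplification, not a description of the real replication complex: a double helix is represented as a pair of strings, and the function name `replicate` is invented for the illustration.

```python
# Base-pairing rules: A with T, G with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def replicate(double_helix):
    """Separate the two strands and build a new partner for each one.

    `double_helix` is a tuple (strand, facing_strand). Each daughter helix
    keeps one old strand and gains one freshly synthesised strand.
    """
    old_a, old_b = double_helix
    new_partner_for_a = "".join(PAIRS[base] for base in old_a)
    new_partner_for_b = "".join(PAIRS[base] for base in old_b)
    daughter_1 = (old_a, new_partner_for_a)
    daughter_2 = (new_partner_for_b, old_b)
    return daughter_1, daughter_2


parent = ("GATC", "CTAG")
daughter_1, daughter_2 = replicate(parent)
print(daughter_1 == parent and daughter_2 == parent)  # True
```

Both daughters carry exactly the same sequence as the parent, which is the whole point of using each old strand as a template.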


Even in nature, in a system which has evolved over billions of years, nothing is perfect and occasionally the replication machinery makes a mistake. It might try to insert a T where a C should really go. When this happens the error is almost always repaired very quickly by another set of proteins that can recognise that this has happened, take out the wrong base and put in the right one. This is the DNA repair machinery, and one of the reasons it’s able to act is that when the wrong bases pair up, it recognises that the DNA ‘zip’ isn’t done up properly.
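The idea of spotting and fixing a badly fastened tooth can also be sketched in code. The example assumes, purely for illustration, that the old template strand is always the trustworthy one; the names `find_mismatches` and `repair` are hypothetical and say nothing about how the real repair proteins work.

```python
# Base-pairing rules: A with T, G with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def find_mismatches(template: str, new_strand: str) -> list:
    """Positions where the new strand does not pair correctly with the template."""
    return [
        position
        for position, (old, new) in enumerate(zip(template, new_strand))
        if PAIRS[old] != new
    ]


def repair(template: str, new_strand: str) -> str:
    """Swap each wrongly paired base for the one the template demands."""
    fixed = list(new_strand)
    for position in find_mismatches(template, new_strand):
        fixed[position] = PAIRS[template[position]]
    return "".join(fixed)


# A T has been inserted where a C should really go (second position).
print(find_mismatches("GGAT", "CTTA"))  # [1]
print(repair("GGAT", "CTTA"))           # CCTA
```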

The cell puts a huge amount of energy into keeping the DNA copies completely faithful to the original template. This makes sense if we go back to our model of DNA as a script. Consider one of the most famous lines in all of English literature:

O Romeo, Romeo! wherefore art thou Romeo?

If we insert just one extra letter, then no matter how well the line is delivered on stage, its effect is unlikely to be the one intended by the Bard:

O Romeo, Romeo! wherefore fart thou Romeo?

This puerile example illustrates why a script needs to be reproduced faithfully. It can be the same with our DNA – one inappropriate change (a mutation) can have devastating effects. This is particularly true if the mutation is present in an egg or a sperm, as this can ultimately lead to the birth of an individual in whom all the cells carry the mutation. Some mutations have devastating clinical effects. These range from children who age so prematurely that a ten-year-old has the body of a person of 70, to women who are pretty much predestined to develop aggressive and difficult to treat breast cancer before they are 40 years of age. Thankfully, these sorts of genetic mutations and conditions are relatively rare compared with the types of diseases that afflict most people.

The 50,000,000,000,000 or so cells in a human body are all the result of perfect replication of DNA, time after time after time, whenever cells divide after the formation of that single-cell zygote from Chapter 1. This is all the more impressive when we realise just how much DNA has to be reproduced each time one cell divides to form two daughter cells. Each cell contains six billion base-pairs of DNA (half originally came from your father and half from your mother). This sequence of six billion base-pairs is what we call the genome. So every single cell division in the human body was the result of copying 6,000,000,000 base-pairs of DNA. Using the same type of calculation as in Chapter 1, if we count one base-pair every second without stopping, it would take a mere 190 years to count all the base-pairs in the genome of a cell. When we consider that a baby is born just nine months after the creation of the single-celled zygote, we can see that our cells must be able to replicate DNA really fast.
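The 190-year figure is easy to check. The short calculation below assumes an average year of 365.25 days; the variable names are just labels for this example.

```python
BASE_PAIRS_PER_CELL = 6_000_000_000

# Counting one base-pair per second, without stopping.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds

years_to_count = BASE_PAIRS_PER_CELL / SECONDS_PER_YEAR
print(round(years_to_count))  # 190
```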

The three billion base-pairs we inherit from each parent aren’t formed of one long string of DNA. They are arranged into smaller bundles, which are the chromosomes. We’ll delve deeper into these in Chapter 9.

Reading the script

Let’s go back to the more fundamental question of what these six billion base-pairs of DNA actually do, and how the script works. More specifically how can a code that only has four letters (A, C, G and T) create the thousands and thousands of different proteins found in our cells? The answer is surprisingly elegant. It could be described as the modular paradigm of molecular biology but it’s probably far more useful to think of it as Lego.

Lego used to have a great advertising slogan ‘It’s a new toy every day’, and it was very accurate. A large box of Lego contains a limited number of designs, essentially a fairly small range of bricks of certain shapes, sizes and colours. Yet it’s possible to use these bricks to create models of everything from ducks to houses, and from planes to hippos. Proteins are rather like that. The ‘bricks’ in proteins are quite small molecules called amino acids, and there are twenty standard amino acids (different Lego bricks) in our cells. But these twenty amino acids can be joined together in an incredible variety of combinations and lengths, to create an enormous number of proteins.

That still leaves the problem of how even as few as twenty amino acids can be encoded by just four bases in DNA. The way this works is that the cell machinery ‘reads’ DNA in blocks of three base-pairs at a time. Each block of three is known as a codon and may be AAA, or GCG or any other combination of A, C, G and T. From just four bases it’s possible to create sixty-four different codons, more than enough for the twenty amino acids. Some amino acids are coded for by more than one codon. For example, the amino acid called lysine is coded for by AAA and AAG. A few codons don’t code for amino acids at all. Instead they act as signals to tell the cellular machinery that it’s at the end of a protein-coding sequence. These are referred to as stop codons.
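The count of sixty-four can be confirmed by simply listing every possible block of three bases. In the sketch below the three stop codons are written as they appear in DNA in the standard genetic code (TAA, TAG and TGA); the text above does not name them, so treat that detail as an addition for illustration.

```python
from itertools import product

BASES = "ACGT"

# Every ordered block of three bases: 4 x 4 x 4 possibilities.
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

LYSINE_CODONS = {"AAA", "AAG"}        # two codons, one amino acid
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code, DNA spelling

coding_codons = [codon for codon in CODONS if codon not in STOP_CODONS]

print(len(CODONS))         # 64
print(len(coding_codons))  # 61 codons left over to share among 20 amino acids
```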

How exactly does the DNA in our chromosomes act as a script for producing proteins? It does it through an intermediary, a molecule called messenger RNA (mRNA). mRNA is very like DNA although it does differ in a few significant details. Its backbone is slightly different from DNA (hence RNA, which stands for ribonucleic acid rather than deoxyribonucleic acid); it is single-stranded (only one backbone); it replaces the T base with a very similar but slightly different one called U (we don’t need to go into the reason it does this here). When a particular DNA stretch is ‘read’ so that a protein can be produced using that bit of script, a huge complex of proteins unzips the right piece of DNA and makes mRNA copies. The complex uses the base-pairing principle to make perfect mRNA copies. The mRNA molecules are then used as temporary templates at specialised structures in the cell that produce protein. These read the three-letter codon code and stitch together the right amino acids to form the longer protein chains. There is of course a lot more to it than all this, but that’s probably sufficient detail.
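The two steps just described, copying DNA into mRNA and then reading the mRNA three letters at a time, can be sketched as follows. This is a toy version under stated assumptions: the codon table is a tiny hand-picked subset of the standard genetic code, the function names `transcribe` and `translate` are invented for the example, and strand direction is ignored.

```python
# Pairing rules when DNA is copied into mRNA: U takes the place of T.
RNA_PAIRS = {"A": "U", "T": "A", "G": "C", "C": "G"}

# A small subset of the standard genetic code, written in mRNA letters.
CODON_TABLE = {
    "AUG": "Met",
    "AAA": "Lys",
    "AAG": "Lys",
    "GCG": "Ala",
    "UGG": "Trp",
    "UAA": "STOP",
    "UAG": "STOP",
    "UGA": "STOP",
}


def transcribe(dna_template: str) -> str:
    """Build the mRNA that pairs with a DNA template strand."""
    return "".join(RNA_PAIRS[base] for base in dna_template)


def translate(mrna: str) -> list:
    """Read the mRNA one codon at a time until a stop codon is reached."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("TACTTTTTCCGCATT")
print(mrna)             # AUGAAAAAGGCGUAA
print(translate(mrna))  # ['Met', 'Lys', 'Lys', 'Ala']
```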

An analogy from everyday life may be useful here. The process of moving from DNA to mRNA to protein is a bit like controlling an image from a digital photograph. Let’s say we take a photograph on a digital camera of the most amazing thing in the world. We want other people to have access to the image, but we don’t want them to be able to change the original in any way. The raw data file from the camera is like the DNA blueprint. We copy it into another format, that can’t be changed very much – a PDF maybe – and then we email out thousands of copies of this PDF, to everyone who asks for it. The PDF is the messenger RNA. If people want to, they can print paper copies from this PDF, as many as they want, and these paper copies are the proteins. So everyone in the world can print the image, but there is only one original file.

Why so complicated, why not just have a direct mechanism? There are a number of good reasons that evolution has favoured this indirect method. One of them is to prevent damage to the script, the original image file. When DNA is unzipped it is relatively susceptible to damage and that’s something that cells have evolved to avoid. The indirect way in which DNA codes for proteins minimises the period of time for which a particular stretch of DNA is open and vulnerable. The other reason this indirect method has been favoured by evolution is that it allows a lot of control over the amount of a specific protein that’s produced, and this creates flexibility.

Consider the protein called alcohol dehydrogenase (ADH). This is produced in the liver and breaks down alcohol. If we drink a lot of alcohol, the cells of our livers will increase the amounts of ADH they produce. If we don’t drink for a while, the liver will produce less of this protein. This is one of the reasons why people who drink frequently are better able to tolerate the immediate effects of alcohol than those who rarely drink, who will become tipsy very quickly on just a couple of glasses of wine. The more often we drink alcohol, the more ADH protein our livers produce (up to a limit). The cells of the liver don’t do this by increasing the number of copies of the ADH gene. They do this by reading the ADH gene more efficiently, i.e. producing more mRNA copies and/or by using these mRNA copies more efficiently as protein templates.

As we shall see, epigenetics is one of the mechanisms a cell uses to control the amount of a particular protein that is produced, especially by controlling how many mRNA copies are made from the original template.

The last few paragraphs have all been about how genes encode proteins. How many genes are there in our cells? This seems like a simple question but oddly enough there is no agreed figure on this. This is because scientists can’t agree on how to define a gene. It used to be quite straightforward – a gene was a stretch of DNA that encoded a protein. We now know that this is far too simplistic. However, it’s certainly true to say that all proteins are encoded by genes, even if not all genes encode proteins. There are about 20,000 to 24,000 protein-encoding genes in our DNA, a much lower estimate than the 100,000 that scientists thought was a good guess just ten years ago[17].

Editing the script

Most genes in human cells have quite a similar structure. There’s a region at the beginning called the promoter, which binds the protein complexes that copy the DNA to form mRNA. The protein complexes move along through what’s known as the body of the gene, making a long mRNA strand, until they finally fall off at the end of the gene.

Imagine a gene body that is 3,000 base-pairs long, a perfectly sensible length for a gene. The mRNA will also be 3,000 bases long. Each amino acid is encoded by a codon composed of three bases, so we would predict that this mRNA will encode a protein that is 1,000 amino acids long. But, perhaps unexpectedly, what we find is that the protein is usually considerably shorter than this.

If the sequence of a gene is typed out it looks like a long string of combinations of the letters A, C, G and T. But if we analyse this with the right software, we find that we can divide that long string into two types of sequences. The first type is called an exon (for expressed sequence) and an exon can code for a run of amino acids. The second type is called an intron (for intragenic region). This doesn’t code for a run of amino acids. Instead it contains lots of the ‘stop’ codons that signal that the protein should come to an end.

When the mRNA is first copied from the DNA it contains the whole run of exons and introns. Once this long RNA molecule has been created, another multi-sub-unit protein complex comes along. It removes all the intron sequences and then joins up the exons to create an mRNA that codes for a continuous run of amino acids. This editing process is called splicing.
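A little arithmetic shows why splicing makes the protein shorter than the naive prediction. The gene layout below is entirely hypothetical, chosen only so that the segments add up to 3,000 bases, and the sketch ignores stop codons and any stretches of mRNA that are not translated.

```python
# Hypothetical gene body: (kind of segment, length in bases).
GENE_BODY = [
    ("exon", 300),
    ("intron", 900),
    ("exon", 450),
    ("intron", 750),
    ("exon", 600),
]

unspliced_length = sum(length for _, length in GENE_BODY)
spliced_length = sum(length for kind, length in GENE_BODY if kind == "exon")

# Three bases per codon, one codon per amino acid.
naive_protein_length = unspliced_length // 3
actual_protein_length = spliced_length // 3

print(unspliced_length, naive_protein_length)  # 3000 1000
print(spliced_length, actual_protein_length)   # 1350 450
```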

This again seems extremely complicated, but there’s a very good reason that this complex mechanism has been favoured by evolution. It’s because it enables a cell to use a relatively small number of genes to create a much bigger number of proteins. The way this works is shown in Figure 3.3.

Figure 3.3 The DNA molecule is shown at the very top of this diagram. The exons, which code for stretches of amino acids, are shown in the dark boxes. The introns, which don’t code for amino acid sequences, are represented by the white boxes. When the DNA is first copied into RNA, indicated by the first arrow, the RNA contains both the exons and the introns. The cellular machinery then removes some or all of the introns (the process known as splicing). The final messenger RNA molecules can thereby code for a variety of proteins from the same gene, as represented by the various words shown in the diagram. For simplicity, all the introns and exons have been drawn as the same size, but in reality they can vary widely.


The initial mRNA contains all the exons and all the introns. Then it’s spliced to remove the introns. But during this splicing some of the exons may also be removed. Some exons will be retained in the final mRNA, others will be skipped over. The various proteins that this creates may have quite similar functions, or they may differ dramatically. The cell can express different proteins depending on what that cell has to do at a particular time, or because of different signals that it receives. If we define a gene as something that encodes a protein, this mechanism means that just 20,000 or so genes can code for far more than just 20,000 proteins.
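To get a feel for how quickly the possibilities multiply, the sketch below lists every way of keeping at least one exon from a made-up gene with four exons, always preserving their original order. It is an upper bound for simple exon skipping, not a claim about what any real cell does: the exon labels and the function name `splice_variants` are hypothetical, and real genes use only some of the theoretical combinations.

```python
from itertools import combinations

# Hypothetical exon labels for a single gene.
EXONS = ["EX1", "EX2", "EX3", "EX4"]


def splice_variants(exons):
    """Every way of keeping at least one exon, in the original order."""
    variants = []
    for number_kept in range(1, len(exons) + 1):
        # combinations() never reorders its input, so exon order is preserved.
        variants.extend(combinations(exons, number_kept))
    return variants


variants = splice_variants(EXONS)
print(len(variants))                # 15 possible mRNAs from one gene
print(("EX1", "EX3") in variants)   # True: EX2 and EX4 skipped
```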

Whenever we describe the genome we talk about it in very two-dimensional terms, almost like a railway track. Peter Fraser’s laboratory at the Babraham Institute outside Cambridge has published some extraordinary work showing it’s probably nothing like this at all. He works on the genes that code for the proteins required to make haemoglobin, the pigment in red blood cells that carries oxygen all around the body. There are a number of different proteins needed to create the final pigment, and they lie on different chromosomes. Doctor Fraser has shown that in cells that produce large amounts of haemoglobin, these chromosome regions become floppy and loop out like tentacles sticking out of the body of an octopus. These floppy regions mingle together in a small area of the cell nucleus, waving about until they can find each other. By doing this, there is an increased chance that all the proteins needed to create the functional haemoglobin pigment will be expressed together at the same time[18].

Each cell in our body contains 6,000,000,000 base-pairs. About 120,000,000 of these code for proteins. One hundred and twenty million sounds like a lot, but it’s actually only 2 per cent of the total amount. So although we think of proteins as being the most important things our cells produce, about 98 per cent of our genome doesn’t code for protein.

Until recently, the reason that we have so much DNA when so little of it leads to a protein was a complete mystery. In the last ten years we’ve finally started to get a grip on this, and once again it’s connected with regulating gene expression through epigenetic mechanisms. It’s now time to move on to the molecular biology of epigenetics.
