Science commits suicide when it adopts a creed.
One of the most influential books on the philosophy of science is Thomas Kuhn’s The Structure of Scientific Revolutions, published in 1962. One of the claims in Kuhn’s book is that science does not proceed in an orderly, linear and polite fashion, with all new findings viewed in a completely unbiased way. Instead, there is a prevailing theory which dominates a field. When new conflicting data are generated, the theory doesn’t immediately topple. It may get tweaked slightly, but scientists can and often do continue to believe in a theory long after there is sufficient evidence to discount it.
We can visualise the theory as a shed, and the new conflicting piece of data as an oddly shaped bit of builder’s rubble that has been cemented onto the roof. Now, we can probably continue cementing bits of rubble onto the roof for quite some time, but eventually there will come a point when the shed collapses under the sheer weight of odd bits of masonry. In science, this is when a new theory develops, and all those bits of masonry are used to build the foundations of a new shed.
Kuhn described this collapse-and-rebuild as the paradigm shift, introducing the phrase that has now become such a cliché in the high-end media world. The paradigm shift isn’t just based on pure rationality. It involves emotional and sociological changes in the psyches of the upholders of the prevailing theory. Many years before Thomas Kuhn’s book, the great German scientist Max Planck, winner of the 1918 Nobel Prize in Physics, put this rather more succinctly when he wrote that ‘Scientific theories don’t change because old scientists change their minds; they change because old scientists die[122].’
We are in the middle of just such a paradigm shift in biology.
In 1965, the Nobel Prize in Physiology or Medicine was awarded to François Jacob, André Lwoff and Jacques Monod ‘for their discoveries concerning genetic control of enzyme and virus synthesis’. Included in this work was the discovery of messenger RNA (mRNA), which we first met in Chapter 3. mRNA is the relatively short-lived molecule that transfers the information from our chromosomal DNA and acts as the intermediate template for the production of proteins.
We’ve known for many years that there are some other classes of RNA in our cells, specifically molecules called transfer RNA (tRNA) and ribosomal RNA (rRNA). tRNAs are small RNA molecules that can hold a specific amino acid at one end. When an mRNA molecule is read to form a protein, a tRNA carries its amino acid to the correct place on the growing protein chain. This takes place at large structures in the cytoplasm of a cell called ribosomes. The ribosomal RNA is a major component of ribosomes, where it acts like a giant scaffold to hold various other RNA and protein molecules in position. The world of RNA therefore seemed quite straightforward. There were structural RNAs (the tRNA and rRNA) and there was messenger RNA.
For decades, the stars of the molecular biology catwalk were DNA (the underlying code) and proteins (the functional, can-do molecules of the cell). RNA was relegated to being a relatively uninteresting intermediate molecule, carrying information from a blueprint to the workers on the factory floor.
Everyone working in molecular biology accepts that proteins are immensely important. They carry out a huge range of functions that enable life to happen. Therefore, the genes that encode proteins are also incredibly important. Even small changes to these protein-coding genes can result in devastating effects, such as the mutations that cause haemophilia or cystic fibrosis.
But this world view has potentially left the scientific community a bit blinkered. The fact that proteins, and therefore by extension protein-coding genes, are vitally important should not imply that everything else in the genome is unimportant. Yet this is the assumption that has held sway for decades. That’s actually quite odd, because we’ve had access for many years to data showing that proteins can’t be the whole story.
Scientists have recognised for some time that the blueprint is edited by cells before it is delivered to the workers. This is because of introns, which we met in Chapter 3. They are the sequences that are copied from DNA into mRNA, but then spliced out before the message is translated into a protein sequence by the ribosomes. Introns were first identified in 1977[123] and the Nobel Prize for their discovery was awarded to Richard Roberts and Phillip Sharp in 1993.
Back in the 1970s scientists compared simple one-celled organisms and complex creatures like humans. The amount of DNA in their cells seemed surprisingly similar, considering how dissimilar the organisms were. This suggested that some genomes must contain a lot of DNA that isn’t really used for anything, and led to the concept of ‘junk DNA’[124] – chromosome sequences that don’t do anything useful, because they don’t code for proteins. At around the same time a number of labs showed that large amounts of the mammalian genome contain DNA sequences that seem to be repeated over and over again, and don’t code for proteins (repetitive DNA). Because they don’t code for protein, it was assumed they weren’t contributing anything to the cell’s functions. They just appeared to be along for the ride[125][126]. Francis Crick and others coined the phrase ‘selfish DNA’ to describe these regions. These two models, of junk DNA and selfish DNA, have been delightfully described recently as ‘the emerging view of the genome as being largely populated by genetic hobos and evolutionary debris[127]’.
We humans are remarkable, with our trillions of cells, our hundreds of cell types, our multitudes of tissues and organs. Let’s compare ourselves (a little smugly, perhaps) with a distant relative, a microscopic worm, the nematode Caenorhabditis elegans. C. elegans, as we usually call it, is only about one millimetre long and lives in soil. It has many of the same organs as higher animals, such as a gut, mouth and gonads. However, it only consists of around 1,000 cells. Remarkably, as C. elegans develops, scientists have been able to identify exactly how each of these cells arises.
This tiny worm is a powerful experimental tool, because it provides a roadmap for cell and tissue development. Scientists are able to alter expression of a gene and then plot out with great precision the effects of that mutated gene on normal development. In fact, C. elegans has laid the foundation for so many breakthroughs in developmental biology that in 2002 the Nobel Committee awarded the Prize in Physiology or Medicine to Sydney Brenner, Robert Horvitz and John Sulston for their work on this organism.
We can’t fault C. elegans on grounds of utility, but it is clearly a much less complex organism than our good selves. Why are we so much more sophisticated? Given the importance of proteins in cellular function, the original assumption was that complex organisms like mammals have more protein-coding genes than simple creatures like C. elegans. This was a perfectly reasonable hypothesis but it has fallen foul of a phenomenon described by Thomas Henry Huxley. He was Darwin’s great champion in the 19th century and it was Huxley who first described ‘the slaying of a beautiful hypothesis by an ugly fact’.
As DNA sequencing technologies improved in cost and efficiency, numerous labs throughout the world sequenced the genomes of a number of different organisms. They were able to use various software tools to identify the likely protein-coding genes in these different genomes. What they found was really surprising. There were far fewer protein-coding genes than expected. Before the human genome was decoded, scientists had predicted there would be over 100,000 such genes. We now know the real number is between 20,000 and 25,000 genes[128]. Even more oddly, C. elegans contains about 20,200 genes[129], not so very different a number from us.
Not only do we and C. elegans have about the same number of genes, these genes tend to code for pretty much the same proteins. By this we mean that if we analyse the sequence of a gene in human cells, we can find a gene of broadly similar sequence in the nematode worm. So the phenotypic differences between worms and humans aren’t caused by Homo sapiens having more, different or ‘better’ genes.
Admittedly, more complicated organisms tend to splice their genes in more ways than simpler creatures. Using our CARDIGAN example from Chapter 3 as an analogy once again, C. elegans might only be able to make the proteins DIG and DAN whereas mammals would be able to make those two proteins and also CARD, RIGA, CAIN and CARDIGAN.
This certainly would allow humans to generate a much greater repertoire of proteins than the 1mm worm, but it introduces a new problem. How do more complicated organisms regulate their more complicated splicing patterns? This regulation could in theory be controlled solely by proteins, but this in turn creates difficulties. The more proteins a cell needs to regulate in a complicated network, the more proteins it needs to do the regulating. Mathematical models have shown that this rapidly leads to a situation where the number of regulatory proteins we need begins to outstrip the number of proteins we actually possess – clearly a non-starter.
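We can make that arithmetic concrete with a minimal sketch. This is not the published mathematical modelling, just an illustrative toy in which the regulatory burden grows quadratically with the number of genes being coordinated; the constant k is arbitrary, chosen purely to show the cross-over point.

```python
# Toy model only: assume the number of regulatory proteins needed grows
# quadratically with the number of genes being coordinated. The constant
# k is arbitrary and exists purely to illustrate the cross-over.

def regulators_needed(n_genes, k=1e-4):
    """Hypothetical quadratic cost of protein-only regulation."""
    return k * n_genes ** 2

for n in (1_000, 5_000, 10_000, 20_000, 50_000):
    r = regulators_needed(n)
    verdict = "regulators exceed genes" if r > n else "feasible"
    print(f"{n:>6} genes -> ~{r:>9,.0f} regulators needed ({verdict})")
```

Whatever the exact numbers, any faster-than-linear growth in regulatory overhead eventually hits this wall: the regulators outnumber the genes available to encode them.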
Do we have an alternative? We do, and it’s indicated in Figure 10.1.
Figure 10.1 This graph demonstrates that the complexity of living organisms scales much better with the percentage of the genome that doesn’t code for protein (black columns) than it does with the number of base-pairs coding for protein in a genome (white columns). The data are adapted from Mattick, J. (2007), J Exp Biol. 210: 1526–1547.
At one extreme we have the bacteria. Bacteria have very small, highly compacted genomes. Their protein-coding genes cover about 4,000,000 base-pairs, which is about 90 per cent of their genome. Bacteria are very simple organisms and fairly rigid in the way they control their gene expression. But things change as we move further up the evolutionary tree.
The protein-coding genes of C. elegans cover about 24,000,000 base-pairs, but that only accounts for about 25 per cent of their genome. The remaining 75 per cent doesn’t code for protein. By the time we reach humans, the protein-coding regions cover about 32,000,000 base-pairs, but this only represents about 2 per cent of the total genome. There are various ways that we can calculate the protein-coding regions, but they make relatively little difference to the astonishing bottom line. Over 98 per cent of the human genome doesn’t code for protein. All but 2 per cent of our genome is ‘junk’.
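We can check these percentages with some back-of-envelope arithmetic. The total genome sizes below are approximate published values (assumptions for illustration), and, as noted above, the exact coding figures depend on how the protein-coding regions are counted.

```python
# Rough check of the figures quoted above. Total genome sizes are
# approximate (E. coli ~4.5 Mb, C. elegans ~100 Mb, human ~3,200 Mb).

genomes = {
    "bacterium (E. coli)": (4.0e6, 4.5e6),  # (protein-coding bp, total bp)
    "C. elegans":          (24e6,  100e6),
    "human":               (32e6,  3.2e9),
}

for name, (coding, total) in genomes.items():
    pct = 100 * coding / total
    print(f"{name:<20} coding ~{pct:4.1f}%   non-coding ~{100 - pct:4.1f}%")
```

Run with these rough inputs, the bacterium comes out close to 90 per cent coding, the worm close to a quarter, and the human genome around 99 per cent non-coding – comfortably consistent with the ‘over 98 per cent’ figure.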
In other words, the numbers of genes, or the sizes of these genes, don’t scale with complexity. The only feature of a genome that really seems to get bigger as organisms get more complicated is the section that doesn’t code for protein.
So what are these non-coding regions of the genome doing, and why are they so important? It’s when we start to consider this that we begin to notice what a strong effect language and terminology have on human thought processes. These regions are called non-coding, but what we mean is that they don’t code for protein. This isn’t the same as not coding at all.
There is a well-known scientific proverb: absence of evidence is not the same as evidence of absence. For example, in astronomy, once scientists had developed telescopes that could detect infrared radiation, they were able to detect thousands of stars that had never been ‘seen’ before. The stars had always been there, but we couldn’t detect them conclusively until we had an instrument for doing so. A more everyday example might be a mobile phone signal. Such signals are all around us, but we cannot detect them unless we have a mobile phone. In other words, what we find depends very much on how we are looking.
Scientists identify the genes which are expressed in a specific cell type by analysing the RNA molecules. This is done by extracting all the RNA from cells and then analysing it with a variety of techniques, so that you build a database of all the RNA molecules that are present. When researchers in the 1980s first began investigating which genes were expressed in a given cell type, the techniques available were relatively insensitive. They were also designed to detect only mRNA molecules, as these were the ones that were assumed to be important. These methods tended to be good at detecting highly expressed mRNAs and quite poor at detecting the less well-expressed sequences. Another confounding factor was that the software used to analyse mRNA was set so that it would ignore signals originally generated from repetitive, i.e. ‘junk’, DNA.
These techniques served us very well for profiling the mRNA that we were already interested in – the mRNA molecules that coded for proteins. But as we have seen, this only represents about 2 per cent of the genome. It wasn’t until new detection technologies were coupled with hugely increased computing power that we began to realise that something very interesting was happening in the remaining 98 per cent – the non-coding part of our genome.
With these improved methodologies, the scientific world began to appreciate that there was actually a huge amount of transcription going on in the parts of the genome that didn’t code for proteins. Initially this was dismissed as ‘transcriptional noise’. It was suggested that there was a baseline murmur of expression from all over the genome, as if these regions of DNA occasionally produced an RNA molecule that got above a detection threshold. The concept was that although we could detect these molecules with our new, more sensitive equipment, they weren’t really biologically meaningful.
The phrase ‘transcriptional noise’ implies a basically random event. However, the patterns of expression of these non-protein-coding RNAs were different for different cell types, which suggested that their transcription was far from random[130]. For example, there was a lot of this expression in the brain. It’s now become clear that the patterns of expression are different in different brain regions[131]. This effect is reproducible when the various brain regions are compared from different individuals. This isn’t what we would expect if this low-level transcription of RNA was a purely random process.
It is becoming clearer that this transcription from genes that don’t code for protein is actually critically important for cellular function. Oddly, however, we remain caught in a linguistic trap of our own making. The RNA that is produced from these regions, the RNA that was previously under our radar, is still called non-coding RNA (ncRNA). It’s a sloppy shorthand, because what we really mean is non-protein-coding RNA. The ncRNA does, in fact, code for something – it codes for itself, a functional RNA molecule. Unlike mature mRNA, which is an RNA means to a protein end, ncRNAs are themselves the end-points.
This is the paradigm shift. For at least 40 years molecular biologists and geneticists have focused almost exclusively on the genes that code for proteins, and the proteins themselves. There have been exceptions, but we’ve just treated these as the odd bits of rubble on the top of the shed. But non-coding RNAs are finally starting to stand firmly alongside proteins as fully functional molecules. Different but equal.
These ncRNAs are found all over the genome. Some come from introns. Originally it was assumed that the intron sequences spliced out of mRNA simply get degraded by cells. It now seems much more likely that at least some (if not most or all) are actually processed to act as functional ncRNAs in their own right. Others overlap genes, frequently transcribed from the opposite strand to the protein-coding mRNA. Yet others are found in regions where there are no protein-coding genes at all.
We met two ncRNAs in the last chapter. These were Xist and Tsix, the ncRNAs that are required for X inactivation. These are both very long ncRNAs, many thousands of bases in length. When Xist was first identified, it was only the second regulatory ncRNA of this kind to have been described. Current estimates suggest there are thousands of such molecules in the cells of higher mammals, with over 30,000 ‘long’ ncRNAs (defined as having a length greater than 200 bases) reported in mice[132]. Long ncRNAs may actually outnumber protein-coding mRNAs.
In addition to X inactivation, long ncRNAs also appear to play a critical role in imprinting. Many imprinted regions contain a section that encodes a long ncRNA, which silences the expression of surrounding genes. This is similar to the effect of Xist. The protein-coding mRNAs are silenced on the copy of the chromosome which expresses the long ncRNA. For example, there is an ncRNA called Air, expressed in the placenta, exclusively from the paternally inherited mouse chromosome 17. Expression of Air ncRNA represses the nearby Igf2r gene, but only on the same chromosome[133]. This mechanism ensures that Igf2r is only expressed from the maternally inherited chromosome.
The Air ncRNA gave scientists important insights into how these long ncRNAs repress gene expression. The ncRNA remained localised to a specific region in the cluster of imprinted genes, and acted as a magnet for an epigenetic enzyme called G9a. G9a puts a repressive mark on the histone H3 proteins in the nucleosomes deposited on this region of DNA. This histone modification creates a repressive chromatin environment, which switches off the genes.
This finding was particularly important as it provided some of the first insights into a question that had been puzzling epigeneticists. How do histone modifying enzymes, which put on or remove epigenetic marks, get localised to specific regions of the genome? Histone modifying enzymes can’t recognise specific DNA sequences directly, so how do they end up in the right part of the genome?
The patterns of histone modifications are localised to different genes in different cell types, leading to exquisitely well-regulated gene expression. For example, the enzyme known as EZH2 methylates the amino acid called lysine at position 27 on histone H3, but it targets different histone H3 molecules in different cell types. To put it simply, it may methylate histone H3 proteins positioned on gene A in white blood cells but not in neurons. Alternatively, it may methylate histone H3 proteins positioned on gene B in neurons, but not in white blood cells. It’s the same enzyme in both cells, but it’s being targeted differently.
There is increasing evidence that at least some of the targeting of epigenetic modifications can be explained by interactions with long ncRNAs. Jeannie Lee and her colleagues have recently investigated long ncRNAs that bind to a complex of proteins. The complex is called PRC2 and it generates repressive modifications on histones. PRC2 contains a number of proteins, and the one that interacts with the long ncRNAs is probably EZH2. The researchers found that the PRC2 complex bound to literally thousands of different long ncRNA molecules in embryonic stem cells from mice[134]. These long ncRNAs may act as bait. They can stay tethered to the specific region of the genome where they are produced, and then attract repressive enzymes to shut off gene expression. This happens because the repressive enzyme complexes contain proteins like EZH2 that are capable of binding to RNA.
Scientists love to build theories, and in some ways a nice one was shaping up around long ncRNAs. It seemed that they bind to the region from which they are transcribed, and repress gene expression on that same chromosome. But if we go back to our analogy from the start of this chapter, we’d have to say that it’s now becoming clear we have built a pretty small shed and already cemented quite a bit of rubble to the roof.
There’s an amazing family of genes, called HOX genes. When they’re mutated in fruit flies (Drosophila melanogaster) the results are incredible phenotypes, such as legs growing out of the head[135]. There’s a long ncRNA known as HOTAIR, which regulates a region of genes called the HOX-D cluster. Just like the long ncRNAs investigated by Jeannie Lee, HOTAIR binds the PRC2 complex and creates a chromatin region which is marked with repressive histone modifications. But HOTAIR is not transcribed from the HOX-D position on chromosome 2. Instead it is encoded at a different cluster of genes called HOX-C on chromosome 12[136]. No-one knows how or why HOTAIR binds at the HOX-D position.
There’s a related mystery around the best studied of all long ncRNAs, Xist. Xist ncRNA spreads out along almost the entire inactive X chromosome but we really don’t know how. Chromosomes don’t normally become smothered with RNA molecules. There’s no obvious reason why Xist RNA should be able to bind like this, but we know it’s nothing to do with the sequence of the chromosome. The experiments described in the last chapter, where Xist could inactivate an entire autosome as long as it contained an X inactivation centre, showed that Xist just keeps on travelling once it’s on a chromosome. Scientists are basically still completely baffled about these fundamental characteristics of this best-studied of all ncRNAs.
Here’s another surprising thing. Until very recently, all long ncRNAs were thought to repress gene expression. In 2010, Professor Ramin Shiekhattar at the Wistar Institute in Philadelphia identified over 3,000 long ncRNAs in a number of human cell types. These long ncRNAs showed different expression patterns in different human cell types, suggesting they had specific roles. Professor Shiekhattar and his colleagues tested a small number of the long ncRNAs to try to determine their functions. They used well-established experimental methods to knock down expression of their test ncRNAs and then analysed expression of their neighbouring genes. The predicted outcome, and the actual results, are shown in Figure 10.2.
Figure 10.2 ncRNAs were thought to repress expression of target genes. If this hypothesis were correct, then decreasing the expression of a specific ncRNA should result in more expression of the target gene, as the repression diminishes. This is shown in the middle panel. However, it is now becoming clear that a large number of ncRNAs actually drive up expression of their target genes. This has been shown by cases in which experimentally decreasing the expression of an ncRNA has the effect shown in the right-hand panel of this figure.
Twelve ncRNAs were tested, and in seven cases the scientists found the result shown in the right-hand panel of Figure 10.2. This was contrary to expectations, because it suggests that over half of long ncRNAs may actually increase expression of neighbouring genes, not decrease it[137].
Rather pithily, the authors of the paper stated, ‘The precise mechanism by which our ncRNAs function to enhance gene expression is not known.’ It’s a statement that is hard to argue with, and it has the considerable merit of making clear that we currently have no idea how this happens. Ramin Shiekhattar’s work does demonstrate rather convincingly that there is a lot we don’t understand about long ncRNAs, and that we should be wary of creating new dogma too quickly.
We should also be wary of assuming that size is everything and that big is best. The long ncRNAs clearly have major importance in cell function, but there is another, equally important class of ncRNAs that also has a significant impact in the cell. The ncRNAs in this class are short (usually 20–24 bases in length), and they target mRNA molecules, not DNA. This was first shown in our favourite worm, C. elegans.
As we have already discussed, C. elegans is a very useful model system because we know exactly how every cell should normally develop. The timing and sequence of the different stages is very tightly regulated. One of the key regulators is a protein called LIN-14. The lin-14 gene is highly expressed (a lot of LIN-14 protein is produced) during the very early embryo stages, but is down-regulated as the worms move from larval stage 1 to larval stage 2. If the lin-14 gene is mutated, the worm gets the timing of the different stages wrong. If LIN-14 protein persists for too long the worm starts to repeat early developmental stages. If LIN-14 protein is lost too early the worm moves into later larval stages prematurely. Either way, the worm gets very messed up, and normal adult structures don’t develop.
In 1993 two labs working independently showed how lin-14 expression was controlled[138][139]. Unexpectedly, the key event was binding of a small ncRNA to the lin-14 mRNA molecule. This is shown in Figure 10.3. It is an example of post-transcriptional gene silencing, where an mRNA is produced but is prevented from generating a protein. This is a very different way of controlling gene expression from that used by the long ncRNAs.
Figure 10.3 Schematic to demonstrate how expression of microRNAs at specific developmental stages can radically alter expression of a target gene.
The importance of this work is that it laid the foundation for a whole new model for the regulation of gene expression. Small ncRNAs are now known to be a mechanism used by organisms throughout the plant and animal kingdoms to control gene expression. There are various different types of small ncRNAs, but we’ll concentrate mainly on the microRNAs (miRNAs).
At least 1,000 different miRNAs have been identified in mammalian cells. miRNAs are about 21 nucleotides (bases) in length (sometimes slightly shorter or longer) and most of them seem to act as post-transcriptional regulators of gene expression. They don’t stop production of an mRNA; instead they regulate how that mRNA behaves. Typically, they do this by binding to the 3′ untranslated region (3′ UTR) of an mRNA molecule. This region is shown in Figure 10.3. It’s present in the mature mRNA, but it doesn’t code for any amino acids.
When genomic DNA is copied to make mRNA, the original transcript tends to be very long because it contains both exons (which code for amino acids) and introns (which do not). As we saw in Chapter 3, introns are removed during splicing to create an mRNA which codes for protein. But the Chapter 3 description passed over something. There are stretches of RNA at the beginning (known as 5′ UTR) and the end (3′ UTR) which don’t code for amino acids, but don’t get spliced out like introns either. Instead, these non-coding regions are retained on the mature mRNA and act as regulatory sequences. One of the functions of the 3′ UTR in particular is to bind regulatory molecules, including miRNAs.
How does a miRNA bind to an mRNA and what happens when it does? The miRNA and the 3′ UTR of the mRNA only interact if they recognise each other. This uses base-pairing, quite similar to that in double-stranded DNA. G can bind C, and A can bind U (in RNA, U replaces T). Although miRNAs are usually 21 bases in length, they don’t have to match the mRNA over the entire 21 nucleotides. The key region is positions 2 to 8 on the miRNA, known as the seed.
Sometimes the match from 2 to 8 is not perfect, but it’s still close enough for the two molecules to pair up. In these cases, binding of the miRNA prevents translation of the mRNA into protein (this is what happened in the case shown in Figure 10.3). If, however, the match is perfect, the binding of miRNA to mRNA triggers destruction of the mRNA, by enzymes that attach to the miRNA[140]. It’s not yet clear whether positions 9 to 21 of a miRNA also influence, more indirectly, how these small molecules are targeted, or what the consequences of that targeting are. One thing we do know, however, is that a single miRNA can regulate more than one mRNA molecule. We saw in Chapter 3 how one gene could encode lots of different protein molecules, by altering the way in which messenger RNA is spliced. A single miRNA can influence many of these differently spliced versions simultaneously. Alternatively, a single miRNA can also influence quite unrelated proteins that are encoded by different genes but have similar 3′ UTR sequences.
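The seed rule is simple enough to express in a few lines of code. Here is a minimal sketch, using a let-7-like miRNA sequence and a made-up 3′ UTR fragment (both purely illustrative, not real annotated sequences), and considering only perfect Watson-Crick pairing.

```python
# Minimal sketch of miRNA 'seed' matching. Watson-Crick pairing only;
# wobble pairs and imperfect seed matches are ignored for simplicity.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna):
    """The 3' UTR sequence that would base-pair with miRNA positions 2-8.

    Pairing is antiparallel, so the target site is the reverse
    complement of the seed region.
    """
    seed = mirna[1:8]  # positions 2-8 (1-based) as a 0-based slice
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def find_seed_matches(mirna, utr):
    """Start positions in the 3' UTR that perfectly match the seed site."""
    site = seed_site(mirna)
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]

mirna = "UGAGGUAGUAGGUUGUAUAGUU"      # a let-7-like sequence (illustrative)
utr = "AAGCUACCUCAAAAA"               # hypothetical 3' UTR fragment
print(seed_site(mirna))               # -> CUACCUC
print(find_seed_matches(mirna, utr))  # -> [3]
```

Because only seven bases have to match, many different mRNAs can carry a site for the same miRNA – which is exactly why one miRNA can regulate so many targets.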
This can make it very difficult to unravel exactly what a miRNA is doing in a cell, as the effects will vary depending on the cell type and the other genes (protein-coding and non-protein-coding) that the cell is expressing at any one time. That can be important experimentally, but also has significant consequences for normal health and disease. In conditions where there are an abnormal number of chromosomes, for example, it won’t just be protein-coding genes that change in number. There will also be abnormal production of ncRNAs (large and small). Because miRNAs in particular can regulate lots of other genes, the effects of disrupting miRNA copy numbers may be very extensive.
The fact that 98 per cent of the human genome does not code for protein suggests that there has been a huge evolutionary investment in the development of complicated ncRNA-mediated regulatory processes. Some authors have even gone so far as to speculate that ncRNAs are the genetic features that have underpinned the development of Homo sapiens’ greatest distinguishing feature – our higher thought processes[141].
The chimpanzee is our closest relative and its genome was published in 2005[142]. There isn’t one simple, meaningful average figure that we can give to express how similar the human and chimp genomes are. The statistics are actually very complicated, because you have to take into account that different genomic regions (for example, repetitive sections versus single-copy protein-coding gene regions) affect the statistics differently. However, there are two things we can say quite firmly. One is that human and chimp proteins are incredibly similar. About a third of all proteins are exactly the same between us and our knuckle-dragging cousins, and the rest differ only by one or two amino acids. Another thing we have in common is that over 98 per cent of our genomes don’t code for protein. This suggests that both species use ncRNAs to create complex regulatory networks which govern gene and protein expression. But there is one particular difference between chimps and humans which may be very important. It lies in how ncRNA is treated in the cells of the two species.
It’s all to do with a process called editing. It seems that human cells just can’t leave well enough alone, particularly when it comes to ncRNA[143]. Once an ncRNA has been produced, human cells use various mechanisms to modify it yet further. In particular, they will often change the base A to one called I (inosine). Base A can bind to T in DNA, or U in RNA. But base I can pair with A, C or U. This alters the sequences to which an ncRNA can bind and hence regulate.
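A small sketch shows why this matters. The sequences below are hypothetical and the pairing rules deliberately simplified, but the principle holds: swapping A for I lets the same ncRNA fragment pair with targets it previously could not.

```python
# Simplified pairing rules: standard Watson-Crick pairs plus inosine's
# promiscuous pairing with A, C and U. (G:U wobble pairs are ignored.)

PAIRS = {
    "A": {"U"},
    "U": {"A"},
    "G": {"C"},
    "C": {"G"},
    "I": {"A", "C", "U"},  # inosine can pair with several bases
}

def can_pair(ncrna, target):
    """True if every position can base-pair (target read antiparallel)."""
    return all(t in PAIRS[b] for b, t in zip(ncrna, reversed(target)))

unedited = "CAGA"  # hypothetical ncRNA fragment
edited = "CIGI"    # the same fragment after A-to-I editing

for target in ("UCUG", "UCAG", "UCCG"):  # hypothetical target sequences
    print(target, "unedited:", can_pair(unedited, target),
          " edited:", can_pair(edited, target))
```

The unedited fragment pairs with just one of the three targets; the edited version pairs with all three. Editing effectively rewires the regulatory network without touching the DNA.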
We humans, more than any other species, edit our ncRNA molecules to a remarkable degree. Not even other primates carry out this editing to anything like the same extent[144]. We also edit particularly heavily in the brain. This makes editing of ncRNA an attractive candidate process to explain why we are mentally so much more sophisticated than our primate relatives, even though we share so much of our DNA template in common.
In some ways, this is the beauty of ncRNAs. They create a relatively safe method for organisms to use to alter various aspects of cellular regulation. Evolution has probably favoured this mechanism because it is simply too risky to try to improve function by changing proteins. Proteins, you see, are the Mary Poppins of the cell. They are ‘practically perfect in every way’.
Hammers always look pretty similar. Some may be big, some may be small, but in terms of basic design, there’s not much you can change that would make a hammer much better. It’s the same with proteins. The proteins in our bodies have evolved over billions of years. Let’s take just one example. Haemoglobin is the pigment that transports oxygen around our bodies, in the red blood cells. It’s beautifully adept at picking up oxygen in the lungs and releasing it where it’s needed in the tissues. Nobody working in a lab has been able to create an altered version of haemoglobin that does a better job than the natural protein.
Creating a haemoglobin molecule that’s worse than normal is surprisingly easy to do, unfortunately. In fact, that’s what happens in disorders like sickle cell disease, where mutations create poor haemoglobin proteins. A similar situation is true for most proteins. So, unless environmental conditions change dramatically, most alterations to a protein turn out to be a bad thing. Most proteins are as good as they’re going to get.
So how has evolution solved the problem of creating ever more complex and sophisticated organisms? Basically, by altering the regulation of proteins, rather than altering the proteins themselves. This is what can be achieved using complicated networks of ncRNA molecules to influence how, when and to what degree specific proteins are expressed – and there is evidence to show this actually happens.
miRNAs play major roles in the control of pluripotency and cellular differentiation. ES cells can be encouraged to differentiate into other cell types by changing the culture conditions in which they’re grown. When they begin to differentiate, it’s essential that ES cells switch off the gene expression pathways that normally allow them to keep producing additional ES cells (self-renewal). There is a miRNA family called let-7 which is essential for this switch-off process[145].
One of the mechanisms the let-7 family uses is the down-regulation of a protein called Lin28. This implies that Lin28 is a pro-pluripotency protein. It’s therefore not that surprising to discover that Lin28 can act as a Yamanaka factor. Over-expression of Lin28 protein in somatic cells increases the chances of reprogramming them to iPS cells[146].
Conversely, there are other miRNA families that help ES cells to stay pluripotent and self-renewing. Unlike let-7, these miRNAs promote the pluripotent state. In ES cells, the key pluripotency factors such as Oct4 and Sox2 are bound to the promoters of these miRNAs, activating their expression. As the ES cells start to differentiate, these factors fall off the miRNA promoters, and stop driving their expression[147]. Just like the Lin28 protein, these miRNAs also improve reprogramming of somatic cells into iPS cells[148].
When we compare stem cells with their differentiated descendants, we find that they express very different populations of mRNA molecules. This seems reasonable, as the stem and differentiated cells express different proteins. But some mRNAs can take a long time to break down in a cell. This means that when a stem cell starts to differentiate, there will be a period when it still contains many of the stem cell mRNAs. Happily, when the stem cell starts differentiating, it switches on a new set of miRNAs. These target the residual stem cell mRNAs and accelerate their destruction. This rapid degradation of the pre-existing mRNAs ensures that the cell moves into a differentiated state as quickly and irreversibly as possible[149].
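A toy calculation, with half-lives that are assumed purely for illustration, shows how much difference this acceleration makes to how long the old stem cell messages linger.

```python
import math

def fraction_remaining(t_hours, half_life_hours):
    """Simple exponential decay of an mRNA population."""
    return math.exp(-math.log(2) * t_hours / half_life_hours)

# Assumed half-lives for illustration: a stable stem cell mRNA (12 h)
# versus the same mRNA once miRNAs accelerate its destruction (2 h).
for t in (0, 6, 12, 24):
    slow = fraction_remaining(t, 12)
    fast = fraction_remaining(t, 2)
    print(f"t = {t:>2} h   without miRNA: {slow:6.1%}   with miRNA: {fast:6.1%}")
```

On these assumed numbers, a quarter of the stable transcripts are still present a day later, whereas the miRNA-targeted transcripts are essentially gone within hours.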
This is an important safety feature. It’s not good for cells to retain inappropriate stem cell characteristics – it increases the chance they will move down a cancer cell pathway. This mechanism is used even more dramatically in species where embryonic development is very rapid, such as fruit flies or zebrafish. In these species this process ensures that maternally inherited mRNAs supplied by the egg are rapidly degraded as the fertilised egg hands control over to its own newly activated genome[150].
miRNAs are also vital for that all-important phase in imprinting control, the formation of primordial germ cells. A key stage in creation of primordial germ cells is the activation of the Blimp1 protein that we met in Chapter 8. Blimp1 expression is controlled by a complex interplay between Lin28 and let-7 activity[151]. Blimp1 also regulates an enzyme that methylates histones, and a class of proteins known as PIWI proteins. PIWI proteins in turn bind to another type of small ncRNA, known as piRNAs (PIWI-interacting RNAs)[152]. piRNAs and PIWI proteins don’t seem to play much of a role in somatic cells but are required for generation of the male germline[153]. PIWI actually stands for P element-induced wimpy testis. If the piRNAs and PIWI proteins don’t interact properly, the testes in a male foetus don’t form normally.
We are finding more and more instances of cross-talk and interactions between ncRNAs and epigenetic events. Remember that the genetic interlopers, the retrotransposons, are normally methylated in the germline, to prevent their activation. The PIWI pathway is involved in targeting this DNA methylation[154][155]. A substantial number of epigenetic proteins are able to interact with RNA. Binding of non-coding RNAs to the genome may act as the general mechanism by which epigenetic modifications are targeted to the correct chromatin region in a specific cell type[156].
ncRNAs have recently been implicated in Lamarckian transmission of inherited characteristics. In one example, fertilised mouse eggs were injected with a miRNA which targeted a key gene involved in growth of heart tissue. The mice which developed from these eggs had enlarged hearts (cardiac hypertrophy) suggesting that the early injection of the miRNA disturbed the normal developmental processes. Remarkably, the offspring of these mice also had a high frequency of cardiac hypertrophy. This was apparently because the abnormal expression of the miRNA was recreated during generation of sperm in these mice. There was no change in the DNA code of the mice, so this was a clear case of a miRNA driving epigenetic inheritance[157].
But if ncRNAs are so important for cellular function, surely we would expect to find that sometimes diseases are caused by problems with them. Shouldn’t there be lots of examples where defects in production or expression of ncRNAs lead to clinical disorders, aside from the imprinting or X inactivation conditions? Well, yes and no. Because these ncRNAs are predominantly regulatory molecules, acting in networks that are rich in compensatory mechanisms, defects may only have relatively subtle impacts. The problem this creates experimentally is that most genetic screens are good at detecting the major phenotypes caused by mutations in proteins, but may not be so useful for more subtle effects.
There is a small ncRNA called BC1 which is expressed in specific neurons in mice. When researchers at the University of Münster in Germany deleted this ncRNA, the mice seemed fine. But then the scientists moved the mutant animals from the very controlled laboratory setting into a more natural environment. Under these conditions, it became clear that the mutants were not the same as normal mice. They were reluctant to explore their surroundings and were anxious[158]. If they had simply been left in their cages, we would never have appreciated that loss of the BC1 ncRNA actually had a quite pronounced effect on behaviour. A clear case of what we see being dependent on how we look.
The impact of ncRNAs in clinical conditions is starting to come into focus, at least for a few examples. There is a breed of sheep called a Texel, and the kindest description would be that it’s chunky. The Texel is well known for having a lot of muscle, which is a good thing in an animal that’s being bred to be eaten. The muscularity of the breed has been shown to be at least partially due to a change in a miRNA binding site in the 3′ UTR of a specific gene. The protein coded for by this gene is called myostatin, and it normally slows down muscle growth[159]. The impact of the single base change is summarised in Figure 10.4. The final size of the Texel sheep has been exaggerated for clarity.
Figure 10.4 A single base change which is in a part of the myostatin gene that does not code for protein nevertheless has a dramatic impact on the phenotype in the Texel sheep breed. The presence of an A base instead of a G in the myostatin mRNA leads to binding of two specific miRNAs. This alters myostatin expression, resulting in sheep with very pronounced muscle growth.
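The Texel change is the seed-matching logic from earlier in this chapter, run in reverse: the mutation creates a perfect seed site that the normal allele lacks. Here is a sketch with a miR-1-like muscle miRNA and hypothetical UTR fragments (the real myostatin sequence differs).

```python
# One base change creates a miRNA seed site (hypothetical sequences).

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna):
    """Reverse complement of miRNA positions 2-8 (the seed)."""
    return "".join(COMPLEMENT[b] for b in reversed(mirna[1:8]))

mirna = "UGGAAUGUAAAGAAGUAUGUAU"  # a miR-1-like muscle miRNA (illustrative)
normal_utr = "AAGCAUUCCUUU"       # hypothetical wild-type UTR fragment
texel_utr = "AAACAUUCCUUU"        # the same fragment with the G-to-A change

site = seed_site(mirna)           # -> ACAUUCC
print(site in normal_utr)         # False: no binding, normal myostatin levels
print(site in texel_utr)          # True: miRNA binds, myostatin down-regulated
```

A single letter is all it takes: once the site exists, the miRNA machinery does the rest, and the brake on muscle growth is released.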
Tourette’s syndrome is a neurodevelopmental disorder where the patient frequently suffers from involuntary convulsive movements (tics) which in some cases are associated with involuntary swearing. Two unrelated individuals with this disorder were shown to have the same single base change in the 3′ UTR of a gene called SLITRK1[160]. SLITRK1 appears to be required for neuronal development. The base change in the Tourette’s patients introduced a binding site for a short ncRNA called miR-189. This suggests that SLITRK1 expression may be abnormally down-regulated via such binding, at critical points in development. This alteration is only present in a few cases of Tourette’s but raises the tantalising suggestion that mis-regulation of miRNA binding sites in other neuronal genes may be involved in other patients.
Earlier in this chapter we encountered the theory that ncRNAs may have been vitally important for the development of increased brain complexity and sophistication in humans. If that is the case, we might predict that the brain would be particularly susceptible to defects in ncRNA activity and function. Indeed, the Tourette’s cases in the previous paragraph give an intriguing glimpse of such a scenario.
There is a condition in humans called DiGeorge syndrome in which a region of about 3,000,000 bases has been lost from one of the two copies of chromosome 22[161]. This region contains more than 25 genes. It’s probably not surprising that many different organ systems may be affected in patients with this condition, including genito-urinary, cardiovascular and skeletal. Forty per cent of DiGeorge patients suffer seizures and 25 per cent of adults with this condition develop schizophrenia. Mild to moderate mental retardation is also common. Different genes in the 3,000,000 base-pair region probably contribute to different aspects of the disorder. One of the genes is called DGCR8 and the DGCR8 protein is essential for the normal production of miRNAs. Genetically modified mice have been created with just one functional copy of Dgcr8. These mice develop cognitive problems, especially in learning and spatial processing[162]. This supports the idea that miRNA production may be important in neurological function.
We know that ncRNAs are important in the control of cellular pluripotency and cellular differentiation. It’s not much of a leap from that to hypothesise that miRNAs may be important in cancer. Cancer is classically a disease in which cells can keep proliferating. This has parallels with stem cells. Additionally, in cancer, the tumours often look relatively undifferentiated and disorganised under the microscope. This is in contrast to the fully differentiated and well-organised appearance of normal, healthy tissues. There is now a strong body of evidence that ncRNAs play a role in cancer. This role may involve either loss of selected miRNAs or over-expression of other miRNAs, as shown in Figure 10.5.
Figure 10.5 Decreased levels of certain types of microRNAs, or increased levels of others, may each ultimately have the same disruptive effect on gene expression. The end result may be increased expression of genes that drive cells into a highly proliferative state, increasing the likelihood of cancer development.
Chronic lymphocytic leukaemia is the commonest human leukaemia. Approximately 70 per cent of cases of this cancer have lost the ncRNAs called miR-15a and miR-16-1[163]. Cancer is a multi-step disease and a lot of things need to go wrong in an individual cell before it becomes cancerous. The fact that so many cases lacked these particular miRNAs suggested that loss of these sequences happens early in the development of the disease.
An example of the alternative mechanism – over-expression of miRNAs in cancer – is the case of the miR-17-92 cluster. This cluster is over-expressed in a range of cancers[164]. In fact, a considerable number of reports have now been published on abnormal expression of miRNAs in cancer[165]. In addition, a gene called TARBP2 is mutated in some inherited cancer conditions[166]. The TARBP2 protein is involved in normal processing of miRNAs. This strengthens the case for a role of miRNAs in the initiation and development of certain human cancers.
Given the increasing amounts of data suggesting a major role for miRNAs in cancer, it isn’t surprising that scientists began to get excited about the possibilities of using these molecules to treat cancer. The idea would be to replace ‘missing’ miRNAs or to inhibit ones that were over-expressed. The hope was that this could be achieved by dosing cancer patients with the miRNAs, or artificial variants of them. This could also have applications in other diseases where miRNA expression may have become abnormal.
Big pharmaceutical companies are certainly investing heavily in this area. Sanofi-Aventis and GlaxoSmithKline have each formed multi-million dollar collaborations with a company called Regulus Therapeutics in San Diego. They are exploring the development of miRNA replacements or inhibitors, to use in the treatment of diseases ranging from cancer to auto-immune disorders.
There are molecules very like miRNAs called siRNAs (small interfering RNAs). They use much the same processes as miRNA molecules to repress gene expression, especially degradation of mRNA. siRNAs have been used very extensively as research tools, as they can be administered to cells in culture to switch off a gene for experimental investigations. In 2006, Andrew Fire and Craig Mello, the scientists who discovered the underlying phenomenon (RNA interference), were awarded the Nobel Prize in Physiology or Medicine.
Pharmaceutical companies became very interested in using siRNAs as potential new drugs. Theoretically, siRNA molecules could be used to knock down expression of any protein that was believed to be harmful in a disease. In the same year that Fire and Mello were awarded their Nobel Prize, the giant pharmaceutical company Merck paid over one billion US dollars for a siRNA company in California called Sirna Therapeutics. Other large pharmaceutical companies have also invested heavily.
But in 2010 a bit of a chill breeze began to drift through the pharmaceutical industry. Roche, the giant Swiss company, announced that it was stopping its siRNA programmes, despite having spent more than $500 million on them over three years. Its neighbouring Swiss corporation, Novartis, pulled out of a collaboration with a siRNA company called Alnylam in Massachusetts. There are still plenty of other companies who have stayed in this particular game, but it would probably be fair to say there’s a bit more nervousness around this technology than in the past.
One of the major problems with using this kind of approach therapeutically may sound rather mundane. Nucleic acids, such as DNA and RNA, are just difficult to turn into good drugs. Most good existing drugs – ibuprofen, Viagra, anti-histamines – have certain characteristics in common. You can swallow them, they get across your gut wall, they get distributed around your body, they don’t get destroyed too quickly by your liver, they get taken up by cells, and they work their effects on the molecules in or on the cells. Those all sound like really simple things, but they’re often the most difficult things to get right when developing a new drug. Companies will spend tens of millions of dollars – at least – getting this bit right, and it is still a surprisingly hit-and-miss process.
It’s so much worse when trying to create drugs around nucleic acids. This is partly because of their size. An average siRNA molecule is over 50 times larger than a drug like ibuprofen. When creating drugs (especially ones to be taken orally rather than injected) the general rule is, the smaller the better. The larger a drug is, the greater the problems with getting high enough doses into patients, and keeping them in the body for long enough. This may be why a company like Roche has decided it can spend its money more effectively elsewhere. This doesn’t mean that siRNA won’t ever work in the treatment of illnesses, it’s just quite high risk as a business venture. miRNA essentially faces all the same problems, because the nucleic acids are so similar for both approaches.
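The size gap is easy to quantify with rough figures (the average nucleotide mass used here is an approximation).

```python
# Back-of-envelope size comparison between an siRNA duplex and ibuprofen.

AVG_NT_MASS = 320.0     # approximate average mass of one RNA nucleotide, Da
IBUPROFEN_MASS = 206.0  # molecular mass of ibuprofen, Da

sirna_mass = 2 * 21 * AVG_NT_MASS  # two strands of ~21 nucleotides each
ratio = sirna_mass / IBUPROFEN_MASS
print(f"siRNA ~{sirna_mass:,.0f} Da vs ibuprofen {IBUPROFEN_MASS:.0f} Da "
      f"(~{ratio:.0f} times larger)")
```

On these rough numbers the siRNA duplex weighs in at around 13,000 daltons, some 65 times the mass of ibuprofen – comfortably ‘over 50 times larger’.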
Luckily, there is usually more than one way to treat a cat and in the next chapter we’ll see how drugs targeting epigenetic enzymes are already being used to treat patients with certain cancers.