What's in a genome?

In remote Japanese mountain settings thrives a plant with a whole lot of DNA. With the largest known genome of any plant, each of the cells of Paris japonica contains 150 billion letters of instructions for building and maintaining this rare flower. But just how much information lies within this vast chemical manual for life?

It’s not just the largest known genome of any plant. The genome of Paris japonica is the size of 50 human genomes, or the largest animal genome (the marbled lungfish) and the wheat genome combined.
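As a rough check on those comparisons, the arithmetic can be sketched in a few lines of Python. Only the figure of 150 billion base pairs comes from the text above. The sizes used for the human, marbled lungfish and bread wheat genomes are rounded, commonly quoted estimates, included here as assumptions and not as precise measurements.

```python
# Approximate genome sizes in billions of base pairs (Gb).
# Only the Paris japonica figure comes from the article; the others are
# rounded, commonly quoted estimates used here as assumptions.
GENOME_SIZES_GB = {
    "Paris japonica": 150.0,
    "human": 3.0,
    "marbled lungfish": 130.0,
    "bread wheat": 17.0,
}


def fold_difference(larger: str, smaller: str) -> float:
    """How many copies of the smaller genome fit into the larger one."""
    return GENOME_SIZES_GB[larger] / GENOME_SIZES_GB[smaller]


def combined_size(*names: str) -> float:
    """Total size, in Gb, of the named genomes laid end to end."""
    return sum(GENOME_SIZES_GB[name] for name in names)


print(fold_difference("Paris japonica", "human"))         # 50.0
print(combined_size("marbled lungfish", "bread wheat"))   # 147.0
```

With these rounded inputs, the lungfish and wheat genomes together come to roughly 147 billion base pairs, which is close to the 150 billion quoted for Paris japonica.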

However, even this large genome might not be the largest. There is an amoeba with a genome size reported to be up to 600 billion base pairs long, although this figure is contested because it was obtained before the advent of more rigorous and accurate modern genome sequencing methods.

Complexity and size.

Still, it’s likely that this amoeba has a massive genome. But how is it that such a tiny organism, only a single cell in size, can have a longer genome than a human? Surely a human is a much larger and more complicated biological entity? What information can there possibly be in 600 billion base pairs of DNA that can’t be squeezed into much less?

Considering that the smallest animal genome is reportedly less than 20 million base pairs, while some bacteria manage to get by with only 182 genes in total, just what is the minimum information needed to keep an organism ticking by?

It’s clear that, on the whole, with an apparent increase in the complexity of an organism, the genome has similarly become more complicated.

The defining quality of a prokaryote, such as a bacterium, is that it does not contain a nucleus, but instead has most of its DNA in a nucleoid, which is a sort of quasi-chromosome. Eukaryotes, such as people, algae, plants and fungi, package their DNA mainly into a nucleus, folding it nicely into little structures called chromosomes.

The eukaryotic cell is the grown-up version, with everything nicely arranged in handy containers (organelles), each surrounded by a membrane: essentially your parents’ house compared to your student house. There’s a container for everything. Even, occasionally, things that used to be bacteria - such as chloroplasts and mitochondria - which contain their own tiny, circular genomes, reminiscent of their formerly independent existence long ago in the annals of evolutionary time.

However, this nice packaging doesn’t mean the genome is any easier to decipher. Far from it. Whereas bacterial genes come in blocks - with clear and often linear instructions, as any student who has studied the lac operon can attest - eukaryotic genes tend to come in bits and pieces (introns and exons), which makes it difficult to understand entirely what they do.

The fact is, one gene can often be stuck together in several different ways, which could all be read from a different starting point, or even both ways around. There might even be information within the spaces separating the genes, or within the introns (the non-gene bits) that separate the exons (the pieces with most of the instructions).

In organisms with genomes as large as Paris japonica, the marbled lungfish and wheat, there is yet more complexity when we consider that there are often multiple versions of the same “gene.” Whatever one of those is.

Amoebae come in many shape-shifting forms, such as this Amoeba proteus. Credit: Lebendkulturen.de, Shutterstock

Synthetic biology.

This is where synthetic biology comes into its own, or at least is attempting to.

Synthetic biology is a field aiming to get down to the naked nuts and bolts of life, in the pursuit of turning genomes into lego bricks, which can be stuck back together in interesting combinations to essentially allow us to engineer life.

Indeed, the world already has its first synthetic organism, which is a quasi-bacterium invented by the team of Craig Venter. However, we are a long way off being able to do this for a eukaryotic genome, despite the optimism of many scientists.

Yet, the pursuit of this knowledge is of paramount importance to better understanding just what makes up a genome in the first place. As with many aspects of biology, just when we think we understand a phenomenon of molecular genetics, life finds a way to throw some deeper complexity and hidden effects into the mix.

Synthetic biology aims to piece together the jigsaw while defining the pieces themselves. What is important? What is junk? What can we keep and what can we get rid of?

First, though, it’s important to delve into the inherent complexities of solving such a problem.

Genes, proteins and so many in betweens.

At school, we learn the old adage: DNA → RNA → Protein ( → Living organism).

This is the simple version. It’s a bit like explaining how to compose a symphony and suggesting: Musical score → Instruments → Melody ( → Beethoven’s Fifth).

Actually, it’s quite a good analogy, come to think of it. From a relatively simple set of instructions, in the case of music, some symbols on a stave, for DNA, a set of four letters, we get more complex direction.

The letters of DNA become small words known as codons, while the notes on a stave become bars of music. The codons of DNA become sentences, known as genes, while the bars of music come together to make complete phrases. Once those genes and phrases are read, in cells by way of messenger RNA and in an ensemble by talented musicians, the instructions are converted into proteins, or indeed, a sweet melody.
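To make the letters-to-codons-to-protein idea concrete, here is a minimal sketch in Python. The codon table is a deliberately tiny slice of the standard genetic code, and the 15-letter "gene" is a made-up example, not a real sequence.

```python
# A deliberately tiny slice of the standard genetic code: just enough
# codons to translate the example below. The full table has 64 entries.
CODON_TABLE = {
    "AUG": "Met",   # also the usual start signal
    "GCU": "Ala",
    "UGG": "Trp",
    "AAA": "Lys",
    "UAA": "STOP",
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to messenger RNA: swap every T for a U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read the mRNA three letters at a time until a STOP codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


gene = "ATGGCTTGGAAATAA"  # hypothetical 15-letter "gene"
print(translate(transcribe(gene)))  # ['Met', 'Ala', 'Trp', 'Lys']
```

Five three-letter words go in, and a chain of four amino acids comes out, with the last codon acting as the full stop.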

Like with a page of a piece of music, on the face of it the genome looks like it’s made up of all these essential blocks, which come together to make up a whole. However, if you look closely between the bars and the staves, there are extra instructions littered throughout complex musical scores.

There are repeats, which might indicate how many times a melody is to be played over. There are treble and bass clefs which might switch from one bar to the next. The key signature, which indicates which notes are to be played, might also switch between bars and phrases.

Above each of the notes on a stave, indeed within the notes themselves, there is more information. A minim is played for longer than a crotchet, which itself is one quarter of a semibreve. A small hash might indicate that a note should be sharpened, while a dot indicates a note should be played sharply and quickly. There are even more precise instructions, written over the piece, informing the player how to set the tempo, or the loudness.

Then, we have what I am going to term the “jazz effect” of genomics… though a musical piece has many nuts and bolts, or bars and hemisemidemiquavers, there is always room to improvise. But (outside of Baker Street) is that saxophone solo essential for the song to remain the same?

This is similarly true of genomes. With DNA we are still trying to unravel just what the pieces of filler do. What do those long stretches of in-between bits mean in terms of function?


What is this even here for?

It all used to seem so simple. At least at school, when we were taught the basic structure of a gene, with introns and exons. Exons were bits that were copied, then stuck together, to make genes which turn into proteins and make eyes blue and hair brown. The introns were just there for fun.

Then, someone told me about alternative splicing and my mind was blown, again.

So, these “genes” that everyone was so excited about could actually make up to ten proteins, or no proteins at all, depending on which of the exons were stuck together. Great. So those introns are actually really useful.
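A toy model shows how quickly the possibilities multiply. The sketch below, in Python, assumes a hypothetical gene of four exons in which the two middle ones can each be kept or skipped. Real splicing has other modes too, so this is an illustration of the counting and not a model of any actual gene.

```python
from itertools import combinations


def splice_variants(exons: list[str], optional: set[int]) -> list[str]:
    """Enumerate transcripts made by keeping or skipping optional exons.

    Exon order is always preserved; only exons whose index is in
    `optional` may be left out.
    """
    optional_sorted = sorted(optional)
    variants = []
    for r in range(len(optional_sorted) + 1):
        for skipped in combinations(optional_sorted, r):
            kept = [e for i, e in enumerate(exons) if i not in skipped]
            variants.append("-".join(kept))
    return variants


exons = ["E1", "E2", "E3", "E4"]          # hypothetical exon names
isoforms = splice_variants(exons, optional={1, 2})
print(isoforms)
# ['E1-E2-E3-E4', 'E1-E3-E4', 'E1-E2-E4', 'E1-E4']
```

Each optional exon doubles the number of possible transcripts, so two give four variants and three would give eight.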

But how were these genes read by cells? By promoters, right?

Right, but some of these were stronger than others, depending on the proteins that were present in the cell at the time these genes were turned on, or turned off. “Transcription factors,” they say, which become active depending on what is happening to the cell at a certain time.

But wait, there’s more.

Sometimes, these genes can be turned on, but then there are bits of RNA floating around in cells which locate the mRNA generated from gene sequences and get them destroyed. These bits of RNA are so important that they play roles in development, disease and everything in between.

For a while further, we assumed that it was just the exons, and the bits of DNA that proteins attached to, and the bits of RNA floating around regulating them, which seemed important.

Quite complicated enough, thank you very much, what with promoters, enhancers, special DNA motifs telling proteins when to start reading DNA, when to stop reading DNA, when to chop DNA in different parts.

But then Wilfried Haerty came along to Earlham Institute and told us about long non-coding RNA: bits of RNA even longer than the mRNA-destroying microRNA that apparently are also useful, and on which we have already written a fascinating article.

Shapeshifting histones and chromosomes.

Even the scientists interested in studying long non-coding RNA seem to agree that, outside of the 8% of human DNA with a provable evolutionary function, the rest might merely be junk.

But how can we be sure?

Let’s jump back to our organisms with ridiculously long DNA, such as Paris japonica. If this was all merely junk, why would it be so abundant? Why would a cell waste time replicating something that has absolutely no link to the survival or propagation of that cell?

Maybe, within all of that seemingly excess material, there is something that we are missing yet further? Something to unravel, perhaps?

There is, after all, another level of gene regulation that many believe to be the most fascinating of all. Eukaryotes are characterised by having a nucleus, within which DNA is masterfully packaged into structures we know as chromosomes. In a human there are 23 pairs of these structures, with Xs and Ys determining the genitals you exhibit, among other things.

These chromosomes are essentially made up of DNA and proteins, with DNA strings wrapping themselves around a chain of proteins we call histones. When we learn this at university, we refer to the basic structure as beads on a string. These beaded strings then fold back on themselves again and again, until we manage to fit a whole 2 m long string of DNA into a tiny compartment, the nucleus, which has a diameter of only around 6 micrometres (that’s about 0.000006 m).

It’s kind of like DNA crochet.
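The 2 m figure is easy to sanity-check. The sketch below, in Python, multiplies a number of base pairs by the spacing between neighbouring base pairs in the standard DNA double helix, about 0.34 nanometres. The human genome size of roughly 3 billion base pairs per copy is a rounded, commonly quoted estimate used here as an assumption; the Paris japonica figure is the one given earlier in this article.

```python
RISE_PER_BASE_PAIR_M = 0.34e-9  # metres per base pair in B-form DNA


def stretched_length_m(base_pairs: float) -> float:
    """Length of the DNA if pulled out into one straight line."""
    return base_pairs * RISE_PER_BASE_PAIR_M


human_diploid_bp = 2 * 3.0e9   # two copies of ~3 billion bp (assumed)
paris_japonica_bp = 150e9      # one copy, figure from the article

print(f"{stretched_length_m(human_diploid_bp):.2f} m")    # 2.04 m
print(f"{stretched_length_m(paris_japonica_bp):.0f} m")   # 51 m
```

By the same arithmetic, a single copy of the Paris japonica genome would stretch to about 51 m.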

This structure itself is of vital importance to how genes are regulated. This is the field of epigenetics.

At certain points, the histones around which DNA is wrapped can themselves be modified, which can then lead to sections of DNA being unravelled. So, as well as introns and exons, we get stuff called euchromatin and heterochromatin, essentially bits of DNA that are accessible or more tightly packaged, respectively.

So, for various processes in cells, including how plants come to recognise the end of winter, it’s not just the genes that are important, nor the long non-coding RNA, but how accessible that string of DNA is to the proteins that want to read it.

That’s right, DNA is essentially packaged into a molecular wardrobe in winter, then the spring clothes come out just in time for warmer weather… so to speak.

In a human there are 23 pairs of these structures, with Xs and Ys determining the genitals you exhibit, among other things. Credit: Rost9, Shutterstock

The further we delve.

The further we delve into the complexities of the genome, the more complexity we find within.

Considering the amount of information that can be encoded in a single “gene” sequence, from promoters to enhancers, TATA boxes to STOP codons, adenines to thymines and stretches of evolutionarily relevant “junk” DNA, it seems foolish to rule out a function for the abundant DNA stretching between these more obviously useful sequences.

However, with the advent of modern genome sequencing methods, as well as novel practices such as synthetic biology, we’re surely well on the way to finding out whether Paris japonica really needs all this DNA, or whether it’s just showing off.

And if there ever were a time to get interested in the biology of living things, whether you’re a biologist, a mathematician, a physicist, a chemist, a computer scientist, or simply someone who is just fascinated by the natural world… now is perhaps the most interesting of times.
