Aevol

Example of Genome Decoding

This page showcases how genomes are decoded in Aevol.

In Aevol, the Genotype-to-Phenotype map is divided into 6 steps. The first 3 steps are based on sequence decoding and the other 3 on a mathematical formulation of the biological functions.

  1. Transcription
  2. ORF identification
  3. Translation
  4. Folding
  5. Metabolism
  6. Fitness computation

The genome being decoded here was generated using the ‘virus’ example parameter file (see this page). It has evolved for 250,000 generations. It is 674 bp long, has 6 mRNAs, 4 of which are non-coding, and 17 genes.

Download the genome in fasta format

1. Transcription

The genome is first parsed for transcription-initiation sequences (promoters): any sequence whose edit distance to the predefined consensus sequence is at most \(d_{max}\) is considered a promoter.

On the example genome we are exploring here, 6 promoters are found: 4 on the leading strand and 2 on the lagging strand.

One of these is found between positions \(598\) and \(620\) (GFF coordinates) on the lagging strand. Its sequence \(0101011001110010010110\) differs from that of the predefined consensus sequence (\(0101011001110010010110\)) by \(d = 4\) nucleotides. Its expression level is then computed as \(e = 1 - \frac{d}{d_{max} + 1}\). Here \(e = 0.2\) since \(d = d_{max} = 4\).
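The expression-level rule can be sketched in a few lines of Python. This is a minimal sketch rather than Aevol's actual implementation: it assumes the distance is a simple count of mismatching positions between two equal-length binary strings, and the names `CONSENSUS`, `D_MAX`, `mismatches` and `expression_level` are illustrative only.

```python
CONSENSUS = "0101011001110010010110"  # consensus sequence quoted in the text
D_MAX = 4  # value of d_max used in this example (from the text)


def mismatches(candidate: str, consensus: str = CONSENSUS) -> int:
    """Count positions at which two equal-length sequences differ (assumed metric)."""
    if len(candidate) != len(consensus):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(candidate, consensus))


def expression_level(d: int, d_max: int = D_MAX) -> float:
    """Return e = 1 - d / (d_max + 1), or 0.0 when d exceeds d_max (not a promoter)."""
    if d > d_max:
        return 0.0
    return 1.0 - d / (d_max + 1)


print(round(expression_level(4), 10))  # 0.2, as for the promoter discussed here
```

A perfect match (\(d = 0\)) gives \(e = 1\), and each additional mismatch lowers the expression level by \(\frac{1}{d_{max} + 1}\).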

Transcription starts right after the promoter, i.e. at position \(620 - 22 = 598\) (keep in mind that we are dealing with the lagging strand, which means we read in decreasing nucleotide position order), and continues until it reaches a terminator. Here, a terminator is found between positions \(589\) and \(600\); its sequence is \(10010110110\). It is identified as a terminator because this sequence can form a stem-loop structure: the first 4 bases (\(1001\)) are complementary to the last 4 read in reverse (\(0110\)). The stem size is set to 4 and the loop size to 3.
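The stem-loop test can be written as a short check. The sketch below is an illustration under the stated stem and loop sizes, not Aevol's code; the function name `is_terminator` is hypothetical.

```python
COMPLEMENT = {"0": "1", "1": "0"}


def is_terminator(seq: str, stem: int = 4, loop: int = 3) -> bool:
    """True if the first `stem` bases pair with the last `stem` bases read in reverse."""
    if len(seq) != 2 * stem + loop:
        return False
    # Base i of the left arm must be complementary to base i counted from the end.
    return all(COMPLEMENT[seq[i]] == seq[-1 - i] for i in range(stem))


print(is_terminator("10010110110"))  # True
```

The 3 bases in the middle (the loop) are not constrained, so only the two 4-base arms are compared.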

Here, as is often the case in dense virus-like genomes, the whole genome is transcribed onto a single mRNA.

This figure shows the mRNAs upon the genome:

mRNAs identified on the genome


2. Detailed Translation and Folding of an arbitrary gene

2.1. ORF identification

Each transcribed region is searched for translation-initiation sequences.

One of these sequences is found on the mRNA we focused on in the previous section, at position \(516\). Its sequence is \(0110110010000\). It is considered a translation-initiation sequence because it matches the pattern \(011011\)----\(000\), i.e. a Ribosome Binding Site (RBS, \(011011\)) followed, \(4\) bases downstream, by a Start codon (\(000\)).
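Searching for this pattern is a plain string match. The following sketch assumes the mRNA is available as a string of `0`/`1` characters already written in reading direction; the returned offsets are relative to that string, not genome positions, and the name `initiation_sites` is illustrative.

```python
import re

# RBS (011011), any 4 bases, then the Start codon (000). The lookahead makes the
# match zero-width, so overlapping candidates are reported as well.
INITIATION = re.compile(r"(?=011011[01]{4}000)")


def initiation_sites(mrna: str) -> list[int]:
    """Return the 0-based offsets in `mrna` of every translation-initiation sequence."""
    return [m.start() for m in INITIATION.finditer(mrna)]


print(initiation_sites("0110110010000"))  # [0]
```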

A total of 17 translation-initiation sequences are found on this mRNA. For each of them, the subsequent nucleotides will be read 3 at a time until a Stop codon is found, marking the end of the gene.

This figure shows the genes upon the genome:

Genes identified on the genome


2.2. Translation

Translation begins right after the translation-initiation sequence.

For the translation-initiation sequence found at position \(516\) (lagging strand), this is at position \(516 - 13 = 503\), since the sequence is 13 bases long. The following sequence is read one codon (i.e. 3 nucleotides) at a time until a Stop codon is reached.

Here, the whole sequence of the gene is \(0110110010000111100010101010101111100110011000111000000001\) and can be decomposed as follows:

| RBS + 4 bases + Start | Codons to be translated | Stop |
| --- | --- | --- |
| \(011011\ 0010\ 000\) | \(111\ 100\ 010\ 101\ 010\ 101\ 111\ 100\ 110\ 011\ 000\ 111\ 000\ 000\) | \(001\) |

Each codon can then be deciphered using a simple genetic code, resulting in the primary sequence of the protein corresponding to the gene. Note that a Start codon within a CDS is deciphered as an \(H0\).

\(111\) \(100\) \(010\) \(101\) \(010\) \(101\) \(111\) \(100\) \(110\) \(011\) \(000\) \(111\) \(000\) \(000\)
\(H1\) \(M0\) \(W0\) \(M1\) \(W0\) \(M1\) \(H1\) \(M0\) \(H0\) \(W1\) \(H0\) \(H1\) \(H0\) \(H0\)

The primary sequence of the protein is hence \(H1\)-\(M0\)-\(W0\)-\(M1\)-\(W0\)-\(M1\)-\(H1\)-\(M0\)-\(H0\)-\(W1\)-\(H0\)-\(H1\)-\(H0\)-\(H0\)
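The translation step can be reproduced with a small lookup table. The sketch below uses the codon assignments that can be read off the example above (with a Start codon inside the coding sequence deciphered as \(H0\)); the names `GENETIC_CODE`, `LEADER_LENGTH` and `translate` are illustrative and not taken from Aevol.

```python
# Codon table as read off the worked example; "000" (Start) inside a coding
# sequence is deciphered as H0, and "001" is the Stop codon.
GENETIC_CODE = {
    "100": "M0", "101": "M1",
    "010": "W0", "011": "W1",
    "110": "H0", "111": "H1",
    "000": "H0",
}
STOP = "001"
LEADER_LENGTH = 13  # RBS (6) + spacer (4) + Start codon (3)


def translate(gene: str) -> list[str]:
    """Decipher a gene given as initiation sequence + codons + Stop codon."""
    protein = []
    for i in range(LEADER_LENGTH, len(gene) - 2, 3):
        codon = gene[i:i + 3]
        if codon == STOP:
            return protein
        protein.append(GENETIC_CODE[codon])
    raise ValueError("no Stop codon found in frame")


gene_a = "0110110010000111100010101010101111100110011000111000000001"
print("-".join(translate(gene_a)))
# H1-M0-W0-M1-W0-M1-H1-M0-H0-W1-H0-H1-H0-H0
```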

2.3. Folding

This primary sequence is then interpreted to compute the three parameters characterizing the protein function. It is separated into 3 distinct sequences: those made up of the \(M\) codons, of the \(W\) codons, and of the \(H\) codons. Here: \(M0\)-\(M1\)-\(M1\)-\(M0\); \(W0\)-\(W0\)-\(W1\); and \(H1\)-\(H1\)-\(H0\)-\(H0\)-\(H1\)-\(H0\)-\(H0\). We now have 3 binary sequences: M: \(0110\); W: \(001\); and H: \(1100100\).
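Separating the primary sequence into its three binary sequences amounts to grouping the digits by letter. A minimal sketch (the helper name `split_by_parameter` is hypothetical):

```python
def split_by_parameter(protein: list[str]) -> dict[str, str]:
    """Group the digits of a primary sequence by the letter that precedes them."""
    bits = {"M": "", "W": "", "H": ""}
    for residue in protein:  # e.g. "H1" -> letter "H", digit "1"
        bits[residue[0]] += residue[1]
    return bits


protein_a = "H1-M0-W0-M1-W0-M1-H1-M0-H0-W1-H0-H1-H0-H0".split("-")
print(split_by_parameter(protein_a))
# {'M': '0110', 'W': '001', 'H': '1100100'}
```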

These are interpreted using the standard Gray code, which gives the values \(M: 4\), \(W: 1\), and \(H: 71\). Each value is in turn normalized into the range of the corresponding parameter, with respect to the maximum value that can be coded with the number of digits its sequence contains. For instance, the sequence for the \(M\) parameter contains 4 bits, so the maximum codable value is \(2^{4}-1 = 15\), and the \(M\) parameter is defined in \([0, 1]\). We then compute \(m = \frac{4}{15} \approx 0.2666666666\). We do the same for the \(W\) parameter (defined in \([0, W_{max}]\)) and the \(H\) parameter (defined in \([-1, 1]\)). In the current example \(W_{max}\) was set to \(0.1\), so we obtain \(w = 0.1 \times \frac{1}{7} \approx 0.014285714\) and \(h = -1 + 2 \times \frac{71}{127} \approx 0.118110236\).

Finally, the \(h\) parameter of the protein function is scaled by the expression level of the promoter(s) that express that gene, here \(0.2\), so \(h = 0.118110236 \times 0.2 = 0.023622047\).

The final parameters characterizing the protein function are then: \(m = 0.2666666666\), \(w = 0.014285714\), and \(h = 0.023622047\)
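The Gray decoding, normalization and scaling can be checked numerically. The sketch below assumes the bit strings are read most-significant bit first and are non-empty; `gray_to_int` and `normalize` are illustrative names, not Aevol functions.

```python
def gray_to_int(gray: str) -> int:
    """Decode a reflected-binary (standard) Gray code string, MSB first."""
    value = 0
    bit = 0
    for g in gray:
        bit ^= int(g)  # each binary bit is the XOR of all Gray bits read so far
        value = (value << 1) | bit
    return value


def normalize(gray: str, low: float, high: float) -> float:
    """Map the decoded value linearly onto [low, high]."""
    max_value = 2 ** len(gray) - 1
    return low + (high - low) * gray_to_int(gray) / max_value


W_MAX = 0.1       # from the text
EXPRESSION = 0.2  # expression level of the promoter, from the text

m = normalize("0110", 0.0, 1.0)
w = normalize("001", 0.0, W_MAX)
h = normalize("1100100", -1.0, 1.0) * EXPRESSION
print(gray_to_int("0110"), gray_to_int("001"), gray_to_int("1100100"))  # 4 1 71
print(round(m, 6), round(w, 6), round(h, 6))  # 0.266667 0.014286 0.023622
```

Only \(h\) is multiplied by the expression level; \(m\) and \(w\) are left as normalized.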

3. Translation and Folding of additional genes

Metabolism (or phenotype computation) is the process whereby several proteins interact, resulting in the organism’s phenotype. Before we move forward to metabolism, we hence need more than one protein.

Let's call the gene we have already processed \(gene\_ A\) and consider 2 additional genes. Let \(gene\_ B\) be the gene found on the lagging strand at position \(596\), and \(gene\_ C\) the gene found on the lagging strand at position \(630\).

Decoding \(gene\_ B\)

Whole sequence:
\(0110110110000110111010110111010110111010110110110000111100110011011101111111010001\)

Decomposed sequence:
\(0110110110000\)-\(110\)-\(111\)-\(010\)-\(110\)-\(111\)-\(010\)-\(110\)-\(111\)-\(010\)-\(110\)-\(110\)-\(110\)-\(000\)-\(111\)-\(100\)-\(110\)-\(011\)-\(011\)-\(101\)-\(111\)-\(111\)-\(010\)-\(001\)

Protein primary sequence:
\(H0\)-\(H1\)-\(W0\)-\(H0\)-\(H1\)-\(W0\)-\(H0\)-\(H1\)-\(W0\)-\(H0\)-\(H0\)-\(H0\)-\(H0\)-\(H1\)-\(M0\)-\(H0\)-\(W1\)-\(W1\)-\(M1\)-\(H1\)-\(H1\)-\(W0\)

Functional parameters computation

| Parameter | Gray code | Raw value | Normalization | Normalized value |
| --- | --- | --- | --- | --- |
| M | \(01\) | \(1\) | \(\frac{raw\_ value}{2^{2} - 1}\) | 0.3333333333 |
| W | \(000110\) | \(4\) | \(W_{max} \times \frac{raw\_ value}{2^{6} - 1}\) | 0.0063492063 |
| H | \(01010100001011\) | \(6642\) | \(-1 + 2 \times \frac{raw\_ value}{2^{14} - 1}\) | -0.189159495 |

Expression level of the promoter(s) that express \(gene\_ B\): 0.2

Final parameters characterizing the protein function: \(m = 0.3333333333\), \(w = 0.0063492063\), and \(h = -0.037831899\)

Decoding \(gene\_ C\)

Whole sequence:
\(0110110110000111100111011101011001\)

Decomposed sequence:
\(0110110110000\)-\(111\)-\(100\)-\(111\)-\(011\)-\(101\)-\(011\)-\(001\)

Protein primary sequence:
\(H1\)-\(M0\)-\(H1\)-\(W1\)-\(M1\)-\(W1\)

Functional parameters computation

| Parameter | Gray code | Raw value | Normalization | Normalized value |
| --- | --- | --- | --- | --- |
| M | \(01\) | \(1\) | \(\frac{raw\_ value}{2^{2} - 1}\) | 0.3333333333 |
| W | \(11\) | \(2\) | \(W_{max} \times \frac{raw\_ value}{2^{2} - 1}\) | 0.0666666667 |
| H | \(11\) | \(2\) | \(-1 + 2 \times \frac{raw\_ value}{2^{2} - 1}\) | 0.3333333333 |

Expression level of the promoter(s) that express \(gene\_ C\): \(0.2 + 0.2 = 0.4\)

Final parameters characterizing the protein function: \(m = 0.3333333333\), \(w = 0.0666666667\), and \(h = 0.133333333\)
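The decodings of \(gene\_ B\) and \(gene\_ C\) can be reproduced end to end from their whole sequences. The sketch below chains translation, Gray decoding, normalization and scaling under the same assumptions as the text (\(W_{max} = 0.1\), 13-base initiation sequence, expression levels of several promoters summed); `decode_gene` is a hypothetical helper, not part of Aevol.

```python
CODE = {"100": "M0", "101": "M1", "010": "W0", "011": "W1",
        "110": "H0", "111": "H1", "000": "H0"}  # "000" inside a CDS reads as H0
STOP = "001"
W_MAX = 0.1  # from the text


def gray_to_int(gray: str) -> int:
    """Decode a standard Gray code string, MSB first."""
    value = bit = 0
    for g in gray:
        bit ^= int(g)
        value = (value << 1) | bit
    return value


def decode_gene(gene: str, expression: float) -> tuple[float, float, float]:
    """Return (m, w, h) for a gene given as initiation sequence + codons + Stop."""
    bits = {"M": "", "W": "", "H": ""}
    for i in range(13, len(gene) - 2, 3):  # skip the 13-base initiation sequence
        codon = gene[i:i + 3]
        if codon == STOP:
            break
        residue = CODE[codon]
        bits[residue[0]] += residue[1]

    def scaled(gray: str, low: float, high: float) -> float:
        # Assumes at least one codon of each type, so the divisor is non-zero.
        return low + (high - low) * gray_to_int(gray) / (2 ** len(gray) - 1)

    m = scaled(bits["M"], 0.0, 1.0)
    w = scaled(bits["W"], 0.0, W_MAX)
    h = scaled(bits["H"], -1.0, 1.0) * expression  # only h is scaled by expression
    return m, w, h


gene_b = "0110110110000110111010110111010110111010110110110000111100110011011101111111010001"
gene_c = "0110110110000111100111011101011001"
print([round(x, 6) for x in decode_gene(gene_b, 0.2)])        # [0.333333, 0.006349, -0.037832]
print([round(x, 6) for x in decode_gene(gene_c, 0.2 + 0.2)])  # [0.333333, 0.066667, 0.133333]
```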

5. Metabolism

We have computed the parameters of the proteins \(protein\_ A\), \(protein\_ B\), and \(protein\_ C\) corresponding to genes \(gene\_ A\), \(gene\_ B\), and \(gene\_ C\). Their parameters are compiled in this table:

| Protein | \(m\) | \(w\) | \(h\) |
| --- | --- | --- | --- |
| \(protein\_ A\) | \(0.2666666666\) | \(0.014285714\) | \(0.023622047\) |
| \(protein\_ B\) | \(0.3333333333\) | \(0.0063492063\) | \(-0.037831899\) |
| \(protein\_ C\) | \(0.3333333333\) | \(0.0666666667\) | \(0.133333333\) |

To Be Continued…

6. Fitness computation