Aevol

Example of Genome Decoding

This page showcases how genomes are decoded in Aevol.

In Aevol, the Genotype-to-Phenotype map is divided into 6 steps. The first 3 steps are based on sequence decoding and the other 3 on a mathematical formulation of the biological functions.

  1. Transcription
  2. ORF identification
  3. Translation
  4. Folding
  5. Metabolism
  6. Fitness computation

The genome being decoded here has evolved for 200,000 generations in the setup defined by this parameter file. It is 563 bp long and has 2 mRNAs and 12 genes.

Download the genome in fasta format

Download the annotated genome in gff3 format

1. Transcription

The genome is first parsed for transcription-initiation sequences (promoters): any sequence that has an edit distance with the predefined consensus sequence of at most \(d_{max}\) is considered a promoter.

On the example genome we are exploring here, 2 promoters are found: 1 on the leading strand and 1 on the lagging strand.

The promoter on the leading strand is found between positions \(559\) and \(581\) (GFF coordinates), which means it spans the origin of replication. Its sequence \(0101011001110010010110\) differs from that of the predefined consensus sequence (\(0101011001110010010110\)) by \(d = 4\) nucleotides. Its expression level is then computed as \(e = 1 - \frac{d}{d_{max} + 1}\). Here \(e = 0.2\) since \(d = d_{max} = 4\).

The promoter on the lagging strand is found between positions \(513\) and \(535\) (GFF coordinates). Note that the corresponding sequence in the fasta file does not look like a promoter. This is because, being on the lagging strand, it is the reverse complement we are considering, which is here \(0001001001110010010101\), differing from the predefined consensus sequence by \(d = 4\) nucleotides.
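The promoter test and the expression level can be checked with a few lines of Python. This is a minimal sketch, not Aevol's own code: it assumes the distance is a position-by-position mismatch count between equal-length sequences, and the names `mismatches` and `expression_level` are hypothetical. Returning `0.0` when \(d > d_{max}\) is a convention of this sketch only.

```python
CONSENSUS = "0101011001110010010110"  # consensus sequence quoted in the text
D_MAX = 4                             # value of d_max used in this example


def mismatches(seq: str, consensus: str = CONSENSUS) -> int:
    """Count the positions at which seq differs from the consensus.

    Assumption: the distance is a per-position mismatch count.
    """
    if len(seq) != len(consensus):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq, consensus))


def expression_level(d: int, d_max: int = D_MAX) -> float:
    """e = 1 - d / (d_max + 1); 0.0 (sketch convention) if d exceeds d_max."""
    return 1.0 - d / (d_max + 1) if d <= d_max else 0.0


# Lagging-strand promoter, already reverse-complemented as in the text
lagging_promoter = "0001001001110010010101"
d = mismatches(lagging_promoter)
print(d, round(expression_level(d), 3))  # prints: 4 0.2
```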

Because it is more straightforward, we will focus on sequences on the leading strand for the remainder of this example.

Transcription starts right after the promoter and proceeds until it reaches a terminator. Here, a terminator is found between positions \(555\) and \(566\). Its sequence is \(00110100011\). It is identified as a terminator because this sequence can form a stem-loop structure: the first 4 bases (\(0011\)) are the reverse complement of the last 4 (\(0011\)). The stem size is set to 4 and the loop size to 3.
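The stem-loop criterion can be sketched as follows. The function name `is_terminator` is hypothetical, and the check assumes exact pairing along the whole stem, as in the example above.

```python
def is_terminator(seq: str, stem: int = 4, loop: int = 3) -> bool:
    """Return True if seq can fold into a stem-loop of the given sizes.

    The left arm must be the reverse complement of the right arm, i.e.
    base i from the left pairs with base i from the right (0 pairs with 1).
    """
    if len(seq) != 2 * stem + loop:
        return False
    return all(seq[i] != seq[-1 - i] for i in range(stem))


print(is_terminator("00110100011"))  # prints: True
```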

Here, as is often the case in dense virus-like genomes, almost the whole genome is transcribed onto a single mRNA.

This figure shows the mRNAs upon the genome:

mRNAs identified on the genome


2. Detailed Translation and Folding of an arbitrary gene

2.1. ORF identification

Each transcribed region is searched for translation-initiation sequences.

One of these sequences is found at position \(461\) on the mRNA we focused on in the previous section. Its sequence is \(0110111100000\). It is considered a translation-initiation sequence because it matches the pattern \(011011\)----\(000\): a Ribosome Binding Site (RBS, \(011011\)) followed, \(4\) bases downstream, by a Start codon (\(000\)).

A total of 5 translation-initiation sequences are found on this mRNA. For each of them, the subsequent nucleotides will be read 3 at a time until a Stop codon is found, marking the end of the gene.
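The scan for translation-initiation sequences and the codon-by-codon read can be sketched like this. It is an illustration under assumptions: the mRNA is treated as a linear string (no wrap-around at the origin), the Stop codon is taken to be \(001\) as in the worked example on this page, and `find_orfs` is a hypothetical name. The example input is the \(gene\_ B\) sequence given later on this page.

```python
import re

RBS, SPACER, START, STOP = "011011", 4, "000", "001"

# A lookahead is used so that overlapping initiation sequences are all found.
INIT = re.compile(rf"(?=({RBS}[01]{{{SPACER}}}{START}))")


def find_orfs(mrna: str):
    """Yield (position_of_RBS, codons) for each initiation sequence
    whose reading frame reaches a Stop codon before the end of mrna."""
    for match in INIT.finditer(mrna):
        first = match.start() + len(RBS) + SPACER + len(START)
        codons = []
        for i in range(first, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if codon == STOP:
                yield match.start(), codons
                break
            codons.append(codon)


gene_b = "0110110000000011100101100000001"
print(list(find_orfs(gene_b)))
# prints: [(0, ['011', '100', '101', '100', '000'])]
```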

This figure shows the genes upon the genome:

Genes identified on the genome


2.2. Translation

Translation begins right after the translation-initiation sequence.

For the translation-initiation sequence found at position \(461\) (leading strand), this is at position \(474\). The following sequence is read one codon (i.e. 3 nucleotides) at a time until a Stop codon is reached.

Here, the whole sequence of the gene is:
\(0110111100000010011100010111000111110101010101101101010101101100011011011100011110110111001\)
It can be decomposed as follows:

  • RBS + 4 bases + Start: \(011011\ 1100\ 000\)
  • Codons to be translated: \(010 011 100 010 111 000 111 110 101 010 101 101 101 010 101 101 100 011 011 011 100 011 110 110 111\)
  • Stop: \(001\)

Each codon can then be deciphered using a simple genetic code, resulting in the primary sequence of the protein corresponding to the gene. Note that a Start codon within a CDS is deciphered as an \(H0\).

\(010\) \(011\) \(100\) \(010\) \(111\) \(000\) \(111\) \(110\) \(101\) \(010\) \(101\) \(101\) \(101\) \(010\) \(101\) \(101\) \(100\) \(011\) \(011\) \(011\) \(100\) \(011\) \(110\) \(110\) \(111\)
\(W0\) \(W1\) \(M0\) \(W0\) \(H1\) \(H0\) \(H1\) \(H0\) \(M1\) \(W0\) \(M1\) \(M1\) \(M1\) \(W0\) \(M1\) \(M1\) \(M0\) \(W1\) \(W1\) \(W1\) \(M0\) \(W1\) \(H0\) \(H0\) \(H1\)

The primary sequence of the protein is hence \(W0\)-\(W1\)-\(M0\)-\(W0\)-\(H1\)-\(H0\)-\(H1\)-\(H0\)-\(M1\)-\(W0\)-\(M1\)-\(M1\)-\(M1\)-\(W0\)-\(M1\)-\(M1\)-\(M0\)-\(W1\)-\(W1\)-\(W1\)-\(M0\)-\(W1\)-\(H0\)-\(H0\)-\(H1\)
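The translation of this gene can be reproduced with a small lookup table. The codon table below is inferred from the worked example on this page rather than taken from Aevol's source, and `translate` is a hypothetical helper name.

```python
# Codon table inferred from the worked example; a Start codon (000)
# inside a CDS is read as H0, and 001 is the Stop codon.
GENETIC_CODE = {
    "100": "M0", "101": "M1",
    "010": "W0", "011": "W1",
    "110": "H0", "111": "H1",
    "000": "H0",
}
STOP = "001"


def translate(coding: str) -> list[str]:
    """Read coding 3 bases at a time until a Stop codon is reached."""
    primary = []
    for i in range(0, len(coding) - 2, 3):
        codon = coding[i:i + 3]
        if codon == STOP:
            break
        primary.append(GENETIC_CODE[codon])
    return primary


gene_a = ("0110111100000010011100010111000111110101010101101101"
          "010101101100011011011100011110110111001")
# Skip the RBS (6 bases), the spacer (4 bases) and the Start codon (3 bases).
primary = translate(gene_a[13:])
print("-".join(primary))
```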

2.3. Folding

This primary sequence is then interpreted to compute the three parameters characterizing the protein function. It is separated into 3 distinct sequences: those made up of the \(M\) codons, of the \(W\) codons, and of the \(H\) codons.
Here:

  • \(M0\)-\(M1\)-\(M1\)-\(M1\)-\(M1\)-\(M1\)-\(M1\)-\(M0\)-\(M0\)
  • \(W0\)-\(W1\)-\(W0\)-\(W0\)-\(W0\)-\(W1\)-\(W1\)-\(W1\)-\(W1\)
  • \(H1\)-\(H0\)-\(H1\)-\(H0\)-\(H0\)-\(H0\)-\(H1\)

We now have 3 binary sequences:

  • M: \(011111100\)
  • W: \(010001111\)
  • H: \(1010001\)

These are interpreted using the standard Gray code, which gives the values \(M: 168\), \(W: 245\), and \(H: 97\). Each value is in turn normalized to the corresponding parameter range with respect to the maximum value that can be coded with the number of bits its sequence contains. For instance, the sequence for the \(M\) parameter contains 9 bits, so the maximum codable value is \(2^{9}-1 = 511\), and the \(M\) parameter is defined in \([0, 1]\). We then compute \(m = \frac{168}{511} \approx 0.328767123\). We do the same for the \(W\) parameter (defined in \([0, W_{max}]\)) and the \(H\) parameter (defined in \([-1, 1]\)). In the current example \(W_{max}\) was set to \(0.1\), so we obtain \(w = 0.1 \times \frac{245}{511} \approx 0.047945205\) and \(h = −1+2 \times \frac{97}{127} \approx 0.527559055\).

Functional parameters computation

Parameter | Gray code | Raw value | Normalization | Normalized value
M | \(011111100\) | \(168\) | \(\frac{raw\_ value}{2^{9} - 1}\) | 0.328767123
W | \(010001111\) | \(245\) | \(W_{max} \times \frac{raw\_ value}{2^{9} - 1}\) | 0.047945205
H | \(1010001\) | \(97\) | \(−1 + 2 \times \frac{raw\_ value}{2^{7} - 1}\) | 0.527559055

Finally, the \(h\) parameter of the protein function is scaled by the expression level \(e\) of the promoter(s) that express that gene, here \(0.2\), so \(e \times h = 0.2 \times 0.527559055 = 0.105511811\).

The final parameters characterizing the protein function are then: \(m = 0.328767123\), \(w = 0.047945205\), and \(e \times h = 0.105511811\)
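The folding arithmetic above can be reproduced with the following sketch. The names `gray_to_int` and `fold` are hypothetical, and the case of a protein lacking \(M\), \(W\) or \(H\) codons is not covered on this page, so the sketch simply raises an error there.

```python
def gray_to_int(bits: str) -> int:
    """Decode a standard (reflected) Gray code, most significant bit first."""
    value, bit = 0, 0
    for g in bits:
        bit ^= int(g)               # binary bit = previous binary bit XOR Gray bit
        value = (value << 1) | bit
    return value


def fold(primary: list[str], w_max: float = 0.1) -> tuple[float, float, float]:
    """Return (m, w, h) for a primary sequence such as ['W1', 'M0', 'H0']."""
    scaled = {}
    for letter in "MWH":
        bits = "".join(aa[1] for aa in primary if aa[0] == letter)
        if not bits:
            raise ValueError(f"no {letter} codon: case not covered by this sketch")
        scaled[letter] = gray_to_int(bits) / (2 ** len(bits) - 1)
    return scaled["M"], w_max * scaled["W"], -1.0 + 2.0 * scaled["H"]


primary_a = ("W0-W1-M0-W0-H1-H0-H1-H0-M1-W0-M1-M1-M1-"
             "W0-M1-M1-M0-W1-W1-W1-M0-W1-H0-H0-H1").split("-")
m, w, h = fold(primary_a)
e = 0.2  # expression level of the promoter computed earlier
print(round(m, 9), round(w, 9), round(h, 9), round(e * h, 9))
# prints: 0.328767123 0.047945205 0.527559055 0.105511811
```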

3. Translation and Folding of additional genes

Metabolism (or phenotype computation) is the process whereby several proteins interact, resulting in the organism’s phenotype. Before we move forward to metabolism, we hence need more than one protein.

Let's call the gene we have already processed \(gene\_ A\) and consider 2 additional genes. Let \(gene\_ B\) be the gene found on the leading strand at position \(144\), and \(gene\_ C\) the gene found on the lagging strand at position \(184\).

Decoding \(gene\_ B\)

Whole sequence:
\(0110110000000011100101100000001\)

Decomposed sequence:
\(0110110000000\)-\(011\)-\(100\)-\(101\)-\(100\)-\(000\)-\(001\)

Protein primary sequence:
\(W1\)-\(M0\)-\(M1\)-\(M0\)-\(H0\)

Functional parameters computation

Parameter | Gray code | Raw value | Normalization | Normalized value
M | \(010\) | \(3\) | \(\frac{raw\_ value}{2^{3} - 1}\) | 0.428571429
W | \(1\) | \(1\) | \(W_{max} \times \frac{raw\_ value}{2^{1} - 1}\) | 0.1
H | \(0\) | \(0\) | \(−1 + 2 \times \frac{raw\_ value}{2^{1} - 1}\) | −1.0

Expression level \(e\) of the promoter(s) that express \(gene\_ B\): \(0.2\)

Final parameters characterizing the protein function: \(m = 0.428571429\), \(w = 0.1\), and \(e \times h = −0.2\)

Decoding \(gene\_ C\)

Whole sequence (raw):
\(0110000000100000010011011100010001111001001001\)

Since \(gene\_ C\) is on the lagging strand, we need to reverse complement the sequence.

Reverse-complemented whole sequence:
\(0110110110000111011100010011011111101111111001\)
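The reverse complement is obtained by reading the sequence backwards and swapping \(0\) and \(1\). A minimal sketch (the helper name `reverse_complement` is hypothetical):

```python
def reverse_complement(seq: str) -> str:
    """Reverse a binary sequence and swap 0 <-> 1."""
    return seq[::-1].translate(str.maketrans("01", "10"))


raw_gene_c = "0110000000100000010011011100010001111001001001"
print(reverse_complement(raw_gene_c))
# prints: 0110110110000111011100010011011111101111111001
```

Applying the function twice returns the original sequence, which is a convenient sanity check when moving between strands.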

Decomposed sequence:
\(0110110110000\)-\(111\)-\(011\)-\(100\)-\(010\)-\(011\)-\(011\)-\(111\)-\(101\)-\(111\)-\(111\)-\(001\)

Protein primary sequence:
\(H1\)-\(W1\)-\(M0\)-\(W0\)-\(W1\)-\(W1\)-\(H1\)-\(M1\)-\(H1\)-\(H1\)

Functional parameters computation

Parameter | Gray code | Raw value | Normalization | Normalized value
M | \(01\) | \(1\) | \(\frac{raw\_ value}{2^{2} - 1}\) | 0.333333333
W | \(1011\) | \(13\) | \(W_{max} \times \frac{raw\_ value}{2^{4} - 1}\) | 0.086666667
H | \(1111\) | \(10\) | \(−1 + 2 \times \frac{raw\_ value}{2^{4} - 1}\) | 0.333333333

Expression level \(e\) of the promoter(s) that express \(gene\_ C\): \(0.2\)

Final parameters characterizing the protein function: \(m = 0.333333333\), \(w = 0.086666667\), and \(e \times h = 0.066666667\)

5. Metabolism

We have computed the parameters of the proteins \(protein\_ A\), \(protein\_ B\), and \(protein\_ C\) corresponding to genes \(gene\_ A\), \(gene\_ B\), and \(gene\_ C\). Their parameters are compiled in this table:

Protein | \(m\) | \(w\) | \(h\) | \(e \times h\)
\(protein\_ A\) | \(0.328767123\) | \(0.047945205\) | \(0.527559055\) | \(0.105511811\)
\(protein\_ B\) | \(0.428571429\) | \(0.1\) | \(−1.0\) | \(−0.2\)
\(protein\_ C\) | \(0.333333333\) | \(0.086666667\) | \(0.333333333\) | \(0.066666667\)

Each of these proteins contributes to the interval of metabolic functions \((m-w, m+w)\) with an efficacy given by a piecewise-linear function with a maximum of \(e \times h\) at position \(m\) in the functional space. This efficacy decreases linearly as the metabolic function moves further from \(m\), reaching \(0\) at \(m-w\) and \(m+w\).

Proteins are combined together by simply summing their contribution for each metabolic function.
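The triangular contribution of each protein and their plain sum can be sketched as follows, using the \((m, w, e \times h)\) values from the table above. The names `triangle` and `combined` are hypothetical, and the sketch deliberately applies no bounds to the sum.

```python
def triangle(x: float, m: float, w: float, height: float) -> float:
    """Piecewise-linear contribution: height at x = m, 0 outside (m - w, m + w)."""
    return height * max(0.0, 1.0 - abs(x - m) / w)


# (m, w, e * h) for the three proteins decoded on this page
PROTEINS = {
    "protein_A": (0.328767123, 0.047945205, 0.105511811),
    "protein_B": (0.428571429, 0.1, -0.2),
    "protein_C": (0.333333333, 0.086666667, 0.066666667),
}


def combined(x: float) -> float:
    """Unbounded sum of all contributions at metabolic function x."""
    return sum(triangle(x, *params) for params in PROTEINS.values())


for x in (0.26, 0.30, 0.33, 0.37):
    print(x, round(combined(x), 4))
```

At \(x = 0.26\) only \(protein\_ C\) contributes, and at \(x = 0.37\) the raw sum is negative because \(protein\_ B\) outweighs the two activators.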

The figure below zooms in on the phenotypic contribution of the 3 proteins we have detailed above and on their combined phenotypic contribution. We can see that between \(\sim 0.25\) and \(\sim 0.28\) only one protein (\(protein\_ C\)) contributes. From \(\sim 0.28\) to \(\sim 0.33\), both \(protein\_ A\) and \(protein\_ C\) contribute, and their contributions add up. Starting at \(\sim 0.33\), \(protein\_ B\) begins to inhibit the action of \(protein\_ A\) and \(protein\_ C\), finally inhibiting them completely from \(\sim 0.36\).

Partial proteome and corresponding combined phenotypic contribution


Note that the detail of which proteins are combined with which depends on the dosage_effect setting:

Absolute dosage effect

In the absolute dosage effect setting, the phenotype is computed through these steps:

  1. Combine all activating proteins (\(h \gt 0\))
  2. Apply upper bound at 1.0 (maximum activating level)
  3. Combine all inhibiting proteins (\(h \lt 0\))
  4. Apply lower bound at -1.0 (maximum inhibiting level)
  5. Combine activating and inhibiting contributions together
  6. Apply lower bound at 0.0 (functions that are more inhibited than they are activated are not realized)
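The six steps above can be sketched for a single metabolic function \(x\) as follows. This illustrates the listed steps and is not Aevol's implementation: `triangle` and `absolute_phenotype` are hypothetical names, and proteins are given as \((m, w, e \times h)\) tuples.

```python
def triangle(x: float, m: float, w: float, height: float) -> float:
    """Piecewise-linear contribution: height at x = m, 0 outside (m - w, m + w)."""
    return height * max(0.0, 1.0 - abs(x - m) / w)


def absolute_phenotype(x: float, proteins: list[tuple[float, float, float]]) -> float:
    """Phenotype value at x under the absolute dosage effect steps."""
    # Steps 1-2: sum the activating proteins, then cap at 1.0
    activation = min(1.0, sum(triangle(x, m, w, h) for m, w, h in proteins if h > 0))
    # Steps 3-4: sum the inhibiting proteins, then floor at -1.0
    inhibition = max(-1.0, sum(triangle(x, m, w, h) for m, w, h in proteins if h < 0))
    # Steps 5-6: combine both, then floor at 0.0
    return max(0.0, activation + inhibition)


proteins = [
    (0.328767123, 0.047945205, 0.105511811),  # protein_A
    (0.428571429, 0.1, -0.2),                 # protein_B
    (0.333333333, 0.086666667, 0.066666667),  # protein_C
]
for x in (0.30, 0.37):
    print(x, round(absolute_phenotype(x, proteins), 4))
```

Bounding activation and inhibition separately before combining them matters: two activators summing to 1.5 and an inhibitor of -0.4 give 0.6 here, not 1.0.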

Relative dosage effect

In the relative dosage effect setting, the phenotype is computed through these steps:

  1. Combine all proteins
  2. Scale the height of the entire phenotype so that its geometric area equals that of the phenotypic target

6. Fitness computation

The complete proteome and phenotype corresponding to the genome we have used so far is shown in the following 2 figures:

Complete proteome


Complete phenotype


Fitness is computed as a function of the difference between the phenotype \(P\) and the phenotypic target \(E\). This difference, called the “metabolic error”, is defined as \(\Delta := \int_\Omega |E(x) - P(x)| \, dx\). It measures adaptation by penalizing both the under-realization and the over-realization of phenotypic traits. Given the metabolic error of an individual, its fitness \(f\) is given by \(f := \exp(-k \Delta)\), with \(k\) a fixed parameter regulating the selection strength (the higher \(k\), the larger the effect of metabolic error variations on the fitness values).

In our example, \(k = 2000\) and the fitness corresponding to our example genome is \(0.06599596961984196\).
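The two formulas can be sketched numerically as follows. This is an illustration under assumptions: \(\Omega\) is taken to be \([0, 1]\), the integral is approximated with the trapezoidal rule, and the names `metabolic_error` and `fitness` are hypothetical. The phenotypic target of this run is not reproduced on this page, so the example only inverts the reported fitness to recover the metabolic error it implies.

```python
import math
from typing import Callable


def metabolic_error(target: Callable[[float], float],
                    phenotype: Callable[[float], float],
                    lo: float = 0.0, hi: float = 1.0, n: int = 10_000) -> float:
    """Trapezoidal approximation of the integral of |E(x) - P(x)| over [lo, hi].

    Assumption: the functional space is the interval [lo, hi] = [0, 1].
    """
    step = (hi - lo) / n
    ys = [abs(target(lo + i * step) - phenotype(lo + i * step)) for i in range(n + 1)]
    return step * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


def fitness(delta: float, k: float = 2000.0) -> float:
    """f = exp(-k * delta)."""
    return math.exp(-k * delta)


reported_fitness = 0.06599596961984196
implied_delta = -math.log(reported_fitness) / 2000.0  # roughly 0.00136
print(implied_delta, fitness(implied_delta))
```

With \(k = 2000\), a metabolic error of only about \(0.00136\) already brings the fitness down to roughly \(0.066\), which illustrates how strongly \(k\) amplifies small phenotypic differences.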