Baidu Develops Algorithms that Improve mRNA Vaccine Development for COVID-19



Scientists are racing to develop a vaccine against COVID-19, a pandemic that has sickened over four million people and caused 280,000 deaths globally. Among the vaccines under development, mRNA vaccines have emerged as a promising preventive tool because they can be produced rapidly and at scale. For instance, the U.S. biotechnology company Moderna has begun human clinical trials of its mRNA vaccine for COVID-19 and will start phase 2 testing soon. However, widespread adoption of mRNA vaccines has been held back in prior years by the instability of mRNA, which degrades easily and therefore yields a low protein expression level.


Recent findings indicate that an mRNA sequence with more secondary structure yields a more stable and productive mRNA vaccine, but finding a sequence with robust secondary structure remains difficult because exponentially many mRNA sequences encode the same protein.


While this is a typical bioinformatics problem, we believe that efficient algorithms can improve mRNA vaccine development. We are proud to announce LinearDesign, an efficient algorithm for optimized mRNA sequence design. For the SARS-CoV-2 spike protein, the algorithm needs only 16 minutes to design an mRNA sequence that is substantially more stable than the wildtype sequence and randomly generated ones.


We have launched an easy-to-use LinearDesign webserver for public use so that biotech companies and research institutes can utilize our technology, with the paper also released on arXiv.


“The LinearDesign algorithm, developed by Baidu Research in collaboration with Oregon State University and University of Rochester, can theoretically design the mRNA sequence with the most stable secondary structure, helping many mRNA vaccine companies to optimize their vaccine sequence designs,” said Liang Huang, Distinguished Scientist at Baidu USA.


LinearDesign is our latest anti-pandemic research effort, inspired by our previous project LinearFold, the world’s fastest algorithm for RNA secondary structure prediction, which cut the time needed to analyze SARS-CoV-2, the virus that causes COVID-19, from 55 minutes to 27 seconds.


Baidu has also signed a strategic partnership with China’s CDC NIVDC to support anti-pandemic efforts and long-term public health. Baidu will provide AI and big data technologies, including LinearFold and LinearDesign, for genome analysis and vaccine R&D, while jointly establishing a genome sequencing workstation with the NIVDC's emergency tech center.


Developing a COVID-19 vaccine is bound to be a long and challenging journey, which makes international collaboration crucially important. We encourage scientists and researchers to work with us and move quickly to bring a safe and efficacious vaccine to patients.


Why are mRNA vaccines important for preventing the spread of COVID-19?


As one of the most effective ways to prevent disease, a vaccine stimulates the body's immune system to recognize and fight pathogens such as viruses, bacteria, and other microorganisms.


The field is pursuing emerging techniques for more rapid development and large-scale deployment because many infectious diseases, COVID-19 among them, evolve so rapidly that new variants may emerge before vaccines are ready.


The most common type of vaccine is protein vaccine, but its manufacturing process often takes too long, rendering it less desirable for the current pandemic. DNA vaccine, on the other hand, enjoys faster production but suffers from safety issues due to potential integration into the human genome.


Meanwhile, mRNA vaccine, which refers to the direct injection of messenger RNA that is translated into proteins in the human body, has stood out with several beneficial features, including safety, rapid and scalable production, as well as non-infectious and non-integrating properties.


“The reason one might want to use an mRNA is that it should stimulate the immune system in a much more similar way to a real viral infection. And that's advantageous because then the immune system is going to be recognizing a real viral infection much more easily,” said Dr. David H. Mathews, professor in the Department of Biochemistry and Biophysics at the University of Rochester.


Recently, Moderna said its coronavirus vaccine candidate will be evaluated further. The biotech company said it has submitted an investigational new drug (IND) application to the U.S. Food and Drug Administration to evaluate the vaccine candidate, mRNA-1273, in a more extensive study if supported by safety data from an initial study.


Why did we develop LinearDesign?


Despite mRNA vaccine’s promising potential, there remain major hurdles for designing an mRNA sequence that achieves high stability and protein production, both of which are critical concerns for vaccines.


mRNA vaccines may fail due to degradation of mRNA during storage and transportation. At the same time, the mRNA vaccine generates proteins by translating the mRNA in the body. How much protein it can synthesize is directly related to the immune effect.


In a recent paper published in the Proceedings of the National Academy of Sciences, Moderna’s research team demonstrated that secondary structure and codon optimality can increase mRNA stability and protein expression. The problem can therefore be formulated as finding, among the exponentially many synonymous sequences that encode the same protein, the mRNA sequences that score well on both secondary structure stability and codon optimality. This is a hard search problem.


Each amino acid is encoded by a codon, a triplet of adjacent mRNA nucleotides. For example, the start codon AUG encodes methionine, the first amino acid in a protein sequence. Because the genetic code is redundant (4^3 = 64 triplet codons for 20 amino acids plus the stop signal, 21 symbols in total), most amino acids can be encoded by multiple codons. As a result, the search space of mRNA design grows exponentially with protein length. For example, the spike protein of SARS-CoV-2 contains 1,273 amino acids (plus the stop codon, which is part of the mRNA but not part of the protein), giving about 10^632 mRNA candidates.
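To make the size of this search space concrete, here is a minimal Python sketch that counts the synonymous mRNA sequences for a short peptide. It assumes the standard genetic code, written in the compact UCAG ordering; the names CODONS_FOR and count_synonymous are illustrative and are not part of LinearDesign.

```python
from itertools import product

# Standard genetic code in UCAG order: the i-th codon produced by
# product("UCAG", repeat=3) encodes the i-th letter of AMINO ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS_FOR = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append(b1 + b2 + b3)


def count_synonymous(protein):
    """Count the mRNA coding sequences for `protein` (one-letter codes),
    including the choice of stop codon at the end."""
    total = 1
    for aa in protein + "*":
        total *= len(CODONS_FOR[aa])
    return total


if __name__ == "__main__":
    # Methionine (1 codon) x leucine (6 codons) x stop (3 codons)
    print(count_synonymous("ML"))
```

Because the count is a product of per-residue codon choices, every added residue with more than one codon multiplies the number of candidates, which is why the total explodes for a protein with more than a thousand residues.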


To find an optimal mRNA sequence, traditionally scientists have to make random changes to a sequence and then see if they are beneficial. The scientific community is seeking different approaches to solve the problem. For example, Eterna, a browser-based gaming platform developed by Stanford University, is assembling online gamers to develop a safe mRNA vaccine by solving puzzles. Eterna has been using Baidu’s LinearFold algorithm to accelerate the analysis of RNA’s secondary structure.


LinearFold is a successful project that translates a biological challenge into a classical problem in computational linguistics. Inspired by LinearFold, our research team came up with the idea of using computer science to find mRNA sequences that are more stable and productive than the wildtype found in nature. That’s how LinearDesign was developed.


“LinearDesign is software that designs a set of sequences that have structure and use easily read codons. Its speed is key in providing a set of good sequences that can be tested by experiment for their ability to work as vaccines,” says Dr. Mathews.


How does LinearDesign work?


Essentially, we use a dynamic programming algorithm to reduce the search cost from exponential to polynomial. We first use a Deterministic Finite Automaton (DFA), a directed graph with labeled edges and distinct start and end nodes, to represent the codon choices for each amino acid and, from those, for a whole protein. The figure below shows four example DFAs, each representing one amino acid.


Next, we concatenate them into a single DFA D(p) for a protein sequence p = p_1 p_2 ··· p_m by stitching the end node of each DFA to the start node of the next: D(p) = D(p_1) ◦ D(p_2) ◦ ··· ◦ D(p_m) ◦ D(STOP). This DFA represents all possible mRNA sequences that translate into that protein.
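The concatenation can be sketched in a few lines of Python. This is an illustration under our own assumptions, not LinearDesign's implementation: each per-amino-acid DFA is built as a simple trie over its codons (a valid but not necessarily minimal DFA), and the names protein_dfa and enumerate_sequences are hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_FOR = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append(b1 + b2 + b3)


def protein_dfa(protein):
    """Build D(p_1) ◦ ... ◦ D(p_m) ◦ D(STOP) as (edges, start, end).

    A node is (i, prefix): i is the codon index and prefix holds the
    nucleotides already read inside codon i. Node (i, "") serves as both
    the end node of codon DFA i-1 and the start node of codon DFA i,
    which is the stitching step.
    """
    symbols = protein + "*"
    edges = {}
    for i, aa in enumerate(symbols):
        for codon in CODONS_FOR[aa]:
            for k in range(3):
                src = (i, codon[:k])
                dst = (i, codon[:k + 1]) if k < 2 else (i + 1, "")
                edges.setdefault(src, set()).add((codon[k], dst))
    return edges, (0, ""), (len(symbols), "")


def enumerate_sequences(edges, start, end):
    """Yield every mRNA string spelled by a start-to-end path."""
    stack = [(start, "")]
    while stack:
        node, seq = stack.pop()
        if node == end:
            yield seq
            continue
        for label, nxt in edges.get(node, ()):
            stack.append((nxt, seq + label))


if __name__ == "__main__":
    seqs = sorted(enumerate_sequences(*protein_dfa("ML")))
    print(len(seqs), "AUGCUGUGA" in seqs)
```

Enumerating paths is only practical for tiny inputs. The point of the DFA is that it stores all candidates in a graph whose size grows linearly with protein length, so a dynamic program can work on the graph without listing the sequences.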


We then need to find the mRNA sequence with the most stable secondary structure among all paths through the DFA. Here we borrow a tool commonly used in computational linguistics, the stochastic context-free grammar (SCFG), which can represent RNA folding. The mRNA design problem then becomes a simple extension of the single-sequence folding problem to the case of multiple inputs: we find the minimum free energy structure (and its corresponding sequence) among all possible structures for all possible sequences. This can be solved by intersecting the SCFG with the protein DFA.


In other words, optimizing an mRNA vaccine sequence extends the secondary structure calculation (RNA folding) from a single RNA sequence to many RNA sequences at once. Once the candidate sequences are abstracted as a DFA, taking the intersection of the DFA and the SCFG yields the sequence with the most stable secondary structure among them.
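The objective itself (the best structure over all structures of all synonymous sequences) can be illustrated with a brute-force baseline. The sketch below is not the DFA-SCFG intersection and makes two loudly simplifying assumptions: stability is approximated by the number of base pairs from a Nussinov-style dynamic program rather than by free energy, and candidates are enumerated exhaustively, which is feasible only for very short peptides. All function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_FOR = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append(b1 + b2 + b3)

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base pairs in `seq`,
    with at least `min_loop` unpaired bases inside every hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


def brute_force_design(protein):
    """Try every synonymous mRNA (including the stop codon) and return
    the one with the most base pairs, together with that pair count."""
    choices = [CODONS_FOR[aa] for aa in protein + "*"]
    candidates = ("".join(codons) for codons in product(*choices))
    best = max(candidates, key=max_pairs)
    return best, max_pairs(best)


if __name__ == "__main__":
    print(brute_force_design("ML"))
```

This baseline folds each candidate separately, so its cost grows with the number of synonymous sequences. Intersecting the grammar with the DFA avoids that by folding all candidates in a single dynamic program.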


The following figure shows an example of how the DFA and SCFG intersect to generate the sequence "AUGCUGUGA" for the protein "methionine leucine stop".


On this basis, we have extended the algorithm in the following ways:

(1) Borrowing the idea behind LinearFold to reduce the computational complexity from cubic to linear, greatly reducing the time required to design an mRNA sequence;

(2) Providing the top k suboptimal mRNA sequences as alternatives, rather than a single optimal sequence, so that vaccine companies can select the most suitable sequence from among them;

(3) Optimizing secondary structure stability and codon optimality simultaneously, to design an mRNA vaccine sequence with both good stability and high protein expression efficiency.
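For extension (3), the sketch below shows one plausible way to score codon optimality and combine it with stability. It rests on several assumptions that are ours, not the article's: codon optimality is measured with the Codon Adaptation Index (CAI, the geometric mean of each codon's usage relative to the most used synonymous codon), the codon usage numbers in USAGE are invented for illustration, and the weighted combination in joint_score is a hypothetical scalarization rather than LinearDesign's exact objective.

```python
import math

# Hypothetical codon usage frequencies, for illustration only.
USAGE = {
    "M": {"AUG": 1.0},
    "L": {"CUG": 0.40, "CUC": 0.20, "UUG": 0.13,
          "CUU": 0.13, "CUA": 0.07, "UUA": 0.07},
    "*": {"UGA": 0.5, "UAA": 0.3, "UAG": 0.2},
}


def relative_adaptiveness(aa, codon):
    """Usage of `codon` divided by the usage of the most frequent
    synonymous codon for the same amino acid (value in (0, 1])."""
    freqs = USAGE[aa]
    return freqs[codon] / max(freqs.values())


def cai(protein, codons):
    """Geometric mean of the relative adaptiveness over all codons.
    Simplification: every position counts, including Met and stop."""
    logs = [math.log(relative_adaptiveness(aa, c))
            for aa, c in zip(protein, codons)]
    return math.exp(sum(logs) / len(logs))


def joint_score(mfe, cai_value, n_codons, lam):
    """Hypothetical combined objective (lower is better): free energy
    plus a penalty that grows as CAI falls below 1. With lam = 0 this
    reduces to pure stability optimization."""
    return mfe - lam * n_codons * math.log(cai_value)


if __name__ == "__main__":
    print(cai("ML*", ["AUG", "CUG", "UGA"]))
    print(cai("ML*", ["AUG", "CUC", "UGA"]))
```

Sweeping the weight (called lam here) from zero upward would trade stability for codon optimality, which is one way to trace out a family of candidate sequences rather than a single optimum.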


Experimental results


Our experimental results show that LinearDesign can design mRNA sequences efficiently. For the spike protein of SARS-CoV-2, LinearDesign finishes the mRNA sequence design in 1.6 hours with exact search. With the linear-time approximation, the design time drops to 16 minutes (beam size b = 1,000) or 78 seconds (b = 100).



We also compared the stability of our designed sequences with the wildtype sequence and randomly generated sequences. The wildtype sequence, marked with a red circle, folds into a structure with a minimum free energy (MFE) change of -967.8 kcal/mol. Most random sequences, shown as the blue and orange clouds, have free energy changes similar to the wildtype (-987.9 kcal/mol and -1,063.23 kcal/mol on average, respectively). The sequence designed by LinearDesign with exact search has the lowest MFE, -2,477.70 kcal/mol (lower free energy indicates greater stability). The sequence designed with beam size b = 1,000 achieves an MFE of -2,463.8 kcal/mol, only a 0.56% MFE loss relative to the exact-search sequence.

The results of joint MFE and CAI (Codon Adaptation Index, a measure of codon optimality) optimization, shown as the light-blue and magenta curves, are also striking. The light-blue curve lies at the top-left of the figure, indicating that the sequences on it have both stable secondary structures and high expression levels. In fact, this curve is the accessible boundary of all possible sequences, i.e., no sequence can reach the region beyond (to the top-left of) the curve. The points on the curve are good candidates for an mRNA vaccine. For example, the point with λ = 100 has a free energy change of -2,414.6 kcal/mol and a CAI of 0.823, which is only 2.5% away from the MFE of the optimal-MFE sequence but comes with a 0.097 increase in CAI. The magenta curve, shifted slightly to the right of the light-blue curve, is the result of joint optimization using b = 1,000; the small gap shows that the approximation quality is good.
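The percentages quoted above follow directly from the reported free energies. As a quick check, the short Python snippet below recomputes them; the helper name relative_mfe_loss is ours.

```python
def relative_mfe_loss(exact_mfe, other_mfe):
    """Fraction of the exact-search MFE that is given up by a sequence
    whose MFE is `other_mfe` (both values in kcal/mol, both negative)."""
    return (exact_mfe - other_mfe) / exact_mfe


EXACT = -2477.70       # exact search
BEAM_1000 = -2463.8    # beam size b = 1,000
LAMBDA_100 = -2414.6   # joint optimization point with lambda = 100

if __name__ == "__main__":
    print(f"{100 * relative_mfe_loss(EXACT, BEAM_1000):.2f}%")
    print(f"{100 * relative_mfe_loss(EXACT, LAMBDA_100):.1f}%")
```

The first value reproduces the 0.56% loss for beam search, and the second reproduces the roughly 2.5% gap for the λ = 100 point.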


The figure above shows the secondary structures of the wildtype sequence, our designed sequences with b = 1,000 and b = +∞, and designed sequences without base pairing in the 5’-end leader regions.