When you have the flu, your body changes in a multitude of ways. You have a fever, your nose runs, your body aches. What is less well known, however, is that the changes caused by infections extend all the way to the activation of your genes. Infections heighten the expression of genes that control your body’s fighting mechanisms. Moreover, as your body is changing to fight the infection, the flu viruses are also changing to better infect you.
While this is all fairly well understood for the flu, the same is not true for the vast majority of other animal and plant pathogens. One way of understanding how hosts fight off infections is to analyze how a pathogen alters the gene expression of its host. To do that, scientists use a technique called dual RNA-seq: they sequence the RNA of the host and the pathogen at the same time. RNA molecules are the copies a cell makes of its genes when those genes are switched on.
RNA can tell us which genes are activated at any given time, and how active they are. Because of that, it gives us a measure of how genes respond to an infection. These RNA sequences can be obtained at different points in time, to monitor how gene expression changes over the course of the infection. They can also be compared to those of non-infected tissues. Small hiccup: when you have an infection, your genetic material is mixed up with the genetic material of the pathogen. How can we tell which is which?
When you sequence RNA samples, you obtain many small fragments of genetic material, like puzzle pieces. The most common method to separate host and pathogen sequences is to map them onto so-called reference genomes, the previously assembled genome sequences of each species. It’s like building two puzzles using the pictures on the box covers as guides. There’s one big problem, though: we only have the full genome of a few very well-studied species.
Dr. Kayleigh O’Keeffe and Dr. Corbin Jones used a TriCEM grant to test multiple methods of separating the host’s genetic sequences from those of the pathogen when one or both reference genomes are missing. They came up with a workflow in which you choose the best method based on the characteristics of the system you are working on. This workflow was published in the journal Methods in Ecology and Evolution in 2019.
The motivation for the study came from Dr. O’Keeffe’s dissertation research at UNC Chapel Hill. “My own study systems didn’t have reference genomes, so I had to troubleshoot the best way to separate the two sequences,” says Dr. O’Keeffe, who is currently a postdoc at the University of Pennsylvania.
Dr. O’Keeffe, who is an ecologist, enlisted the help of Dr. Jones, an evolutionary biologist, to find the best solution to this challenge. They tested three different methods: mapping their sequences to the genome of a related species (like using similar images to build their puzzles); concatenating (linking together) the reference genomes from a related host and a related pathogen, then mapping their own sequences to this long chimera; and re-assembling the host and pathogen genomes from scratch (carefully putting together pieces that fit, with no guide, until you end up with bigger chunks of puzzle).
The goal was to assemble something close to two complete puzzles with as few misplaced pieces as possible. To test these three methods, Drs. O’Keeffe and Jones used sophisticated bioinformatics methods to simulate sequences from two well-studied species. They blended them together in different proportions, and then tried to separate the simulated pathogen from the simulated host.
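For readers who like to tinker, the idea behind this kind of test can be sketched in a few lines of Python. This is only a toy illustration, not the researchers' actual pipeline: the "genomes" are random strings of letters, the function names (random_genome, mix_reads, assign, misassigned_fraction) are made up for this example, and an exact text search stands in for a real mapping program. Because we know where every simulated fragment came from, we can count how many end up in the wrong puzzle.

```python
import random

def random_genome(length, rng):
    """Stand-in for a real genome: a random string of A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def sample_reads(genome, n_reads, read_len, rng, label):
    """Cut n_reads fragments of read_len letters from random positions."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((label, genome[start:start + read_len]))
    return reads

def mix_reads(host, pathogen, total, pathogen_fraction, read_len, rng):
    """Blend host and pathogen fragments in a chosen proportion."""
    n_pathogen = round(total * pathogen_fraction)
    reads = sample_reads(host, total - n_pathogen, read_len, rng, "host")
    reads += sample_reads(pathogen, n_pathogen, read_len, rng, "pathogen")
    rng.shuffle(reads)
    return reads

def assign(read, references):
    """Toy 'mapper': which reference contains this exact fragment?"""
    hits = [name for name, seq in references.items() if read in seq]
    if len(hits) == 1:
        return hits[0]
    return "ambiguous" if hits else "unmapped"

def misassigned_fraction(reads, references):
    """Share of fragments not assigned to the species they came from."""
    wrong = sum(1 for label, read in reads if assign(read, references) != label)
    return wrong / len(reads)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    host = random_genome(2000, rng)
    pathogen = random_genome(500, rng)
    references = {"host": host, "pathogen": pathogen}
    for fraction in (0.1, 0.5, 0.9):
        reads = mix_reads(host, pathogen, 200, fraction, 30, rng)
        print(fraction, misassigned_fraction(reads, references))
```

In this toy version the references are the very genomes the fragments were cut from, so essentially nothing gets misplaced. The interesting cases in the study arise when the references are missing or come from a related species, which is where the three methods start to differ.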
All of these methods had problems. “Mapping the sequences to the genome of related species is probably the method most people use, but it’s also the one that generated the most errors,” says Dr. O’Keeffe. Concatenating and re-assembling sequences seem to be the most promising paths. Those were the methods where fewer sequence fragments ended up in the wrong place. Which method is best, then? It depends on your type of data! Better refer to their workflow…