Current Projects

My quals are coming up soon. To sum up my interests, I have come up with a (very)+ preliminary title and pseudoabstract:

The effects of genetic variation on RNA: From structure to regulation

In this decade, measuring the genetic variation of a specific population at single-nucleotide resolution has become more common, as new sequencing methodologies have driven experimental costs down dramatically. Traditionally, genotypes are associated with observed phenotypes by searching for possible correlations between allele frequencies and traits (the famous QTL studies, now incarnated in more sophisticated eQTL associations). In my graduate work, I would like to focus on a fairly unexplored terrain of questions regarding how genetic variation can affect RNA products and how, in turn, this transcript-level variation can be translated to upper levels of the phenotype, like protein synthesis. My work will have to explore both protein-coding and non-coding variation, to help distinguish between variations that happen inside and outside coding transcripts. Many questions of interest fall in this field:

  1. What are the mechanistic differences between mutations that occur inside an mRNA and those that occur inside an ncRNA?
  2. How well can RNA transfer a mutation into its protein product or regulation target?
  3. Can we gain information by looking at the correlation between the effects of genotypes in RNA and the observed physical traits (as opposed to simply genotype and physical trait)?

To address question 2, in part, the following project is proposed:

Codon usage bias in terms of mRNA structure

To address question 3, the following project is proposed:

Modelling genetic effects in observed phenotypes as a chain of intermediate phenotypes

It is interesting to see that QTL studies these days do not venture further than performing simple genotype-trait regressions to explain the impact of genetic variation on observed traits. Eric Schadt and his now-disbanded Rosetta group made efforts to get out of this paradigm by using co-expression networks to explain disease. This is a great step forward, but we can certainly do better. Indeed, if you look closely, most eQTL studies explain very little of the trait variance (even the additive effects of millions of measured SNPs are not that great). This is to be expected: after all, you are modelling your phenotype as a simple correlation with your genotype plus some environmental noise. There is no telling whether a) intermediate steps in the mechanism that carries the genetic variation through to the trait have significant predictive power over the variance of the trait, or b) this is indeed a linear correlation. Assuming that b) is too ambitious to be tractable (besides, linear correlations have proven to be fairly good approximations), we can certainly tackle a) in some manner.

We could therefore construct a chain of regressions, each one taking the output of the previous one as its input, in which DNA variations are translated into RNA variations, which are translated into protein variations, which are translated into mechanistic alterations, which are translated into physical traits. Sound familiar? Indeed, this looks like a probabilistic graphical model, and we can make the relations as complicated as we want (heck, as long as it goes in a forward fashion, we could directly apply some kind of belief propagation method to get the final correlation; this of course precludes the possibility of feed-forward loops, but it is still better than a simple regression).

Many more things can be inferred here. For example, look at the DNA-RNA step. As you may realize, information is not passed down in a totally faithful manner: things such as RNA editing interfere with the signal passed from the DNA to the RNA. This means that we have another step of variation that we could take into account. Now look at the RNA-to-protein step. Much more complicated things happen here, but it could be that any variation included in the RNA is overridden by a post-translational modification in the protein product.

Wait, so now everything looks so complicated, perhaps too complicated to model? On the contrary, variation at each step simplifies the model, because it makes it recursive: there is variation at every step of the chain. What gets complicated is that you have to measure each of these variation effects experimentally, not just the genetic variation. But I think it is worth it, because you could know exactly which step of the chain is accounting for your trait's variation, or whether it is a combination of many of these steps. Furthermore, this variation model saves you the complexity of a more general model where all known molecules are connected and interact in some quantitative manner to produce changes in the phenotype you observe. Each of these interactions is encapsulated in the variation at every step.

One last note: suppose you could do these types of studies, measuring all of these variations at the DNA and RNA level, for all the quantitative traits that you can think of. Somehow, and this is a big somehow, I believe that you could measure how well RNA passes on information, i.e. how lossy RNA is as an information channel.

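
To make the chain idea a bit more concrete, here is one way it could be written down. This is only a sketch under assumptions that are mine, not results: the chain is shortened to three steps (the "mechanistic alterations" layer is folded into the last one), every step is linear, and the noise terms are independent of each other and of the genotype. The symbols g (genotype, e.g. allele dosage), r (RNA level), p (protein level) and y (trait) are introduced just for this illustration.

```latex
\begin{align}
  r &= \alpha_r + \beta_r\, g + \varepsilon_r, & \varepsilon_r &\sim (0, \sigma_r^2) \\
  p &= \alpha_p + \beta_p\, r + \varepsilon_p, & \varepsilon_p &\sim (0, \sigma_p^2) \\
  y &= \alpha_y + \beta_y\, p + \varepsilon_y, & \varepsilon_y &\sim (0, \sigma_y^2)
\end{align}

% Substituting each step into the next gives the genotype-trait regression
% that a plain QTL study fits, with slope \beta_y \beta_p \beta_r:
\begin{equation}
  y = \text{const} + \beta_y \beta_p \beta_r\, g
      + \beta_y \beta_p\, \varepsilon_r + \beta_y\, \varepsilon_p + \varepsilon_y
\end{equation}

% and the trait variance splits into one term per step of the chain:
\begin{equation}
  \operatorname{Var}(y) =
      \beta_y^2 \beta_p^2 \beta_r^2 \operatorname{Var}(g)
    + \beta_y^2 \beta_p^2 \sigma_r^2
    + \beta_y^2 \sigma_p^2
    + \sigma_y^2
\end{equation}
```

Under these assumptions, only the first term is visible to a direct genotype-trait regression; the other three are lumped into "noise", even though the second and third could be attributed to specific steps if RNA and protein levels were measured.
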
Things that need to be done here:

  • Get a dataset from a QTL study, preferably an eQTL study. Apply a simple chain of regressions.
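
Until a real dataset is in hand, the to-do item above could be prototyped on simulated data. The following is a minimal sketch, not an actual analysis: the layers (genotype, RNA, protein, trait), the effect sizes and the noise levels are all made up for illustration, and the helper names `fit_line`, `simulate_chain` and `chain_of_regressions` are hypothetical. Each step is an ordinary least-squares regression of one measured layer on the previous one.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - resid.var() / y.var()
    return coef[0], coef[1], r2


def simulate_chain(n=5000, seed=0):
    """Toy data: genotype -> RNA -> protein -> trait (all numbers are assumptions)."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)  # allele dosage 0/1/2
    rna = 1.0 * g + rng.normal(size=n)               # step 1 plus its own noise
    protein = 0.8 * rna + rng.normal(size=n)         # step 2 plus its own noise
    trait = 0.5 * protein + rng.normal(size=n)       # step 3 plus its own noise
    return g, rna, protein, trait


def chain_of_regressions(layers):
    """Regress each layer on the one before it; returns [(slope, r_squared), ...]."""
    steps = []
    for x, y in zip(layers[:-1], layers[1:]):
        _, slope, r2 = fit_line(x, y)
        steps.append((slope, r2))
    return steps


g, rna, protein, trait = simulate_chain()
steps = chain_of_regressions([g, rna, protein, trait])

# The usual QTL-style analysis: trait regressed directly on genotype.
_, direct_slope, direct_r2 = fit_line(g, trait)

# Effect of genotype on trait implied by multiplying the per-step slopes.
chained_slope = float(np.prod([slope for slope, _ in steps]))

for name, (slope, r2) in zip(["DNA->RNA", "RNA->protein", "protein->trait"], steps):
    print(f"{name:15s} slope={slope:.3f}  R^2={r2:.3f}")
print(f"{'direct':15s} slope={direct_slope:.3f}  R^2={direct_r2:.3f}")
print(f"product of step slopes = {chained_slope:.3f}")
```

In this toy setting the product of the step slopes should land close to the direct genotype-trait slope, while the per-step R^2 values show where along the chain the variance is being gained or lost. With real data the interesting cases are exactly the ones where these two views disagree.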