BIS Seminar - 6.23.2020 - Model-averaged estimation of molecular evolution and natural selection in SARS-CoV-1 and SARS-CoV-2 coronaviruses during zoonosis

June 23, 2020

Information

Jeffrey Townsend, PhD
Elihu Professor of Biostatistics and Professor of Ecology and Evolutionary Biology

ID5349


00:19- All right, I see more people joining
00:32Jeff, how long do you have, like an hour?
00:36Less than that?
00:36- I think I can probably finish in less than an hour.
00:40- Less than hour, all right.
00:58I think we should get started.
01:02So hi, everyone.
01:03Welcome to our seminar series on COVID-19,
01:07organized by the Department of Biostatistics.
01:10I'm very pleased to have here today Jeff Townsend,
01:15Professor of Biostatistics, Ecology and Evolutionary Biology
01:20from the Yale School of Public Health.
01:23Thank you, Jeff, for being here today with us.
01:27As usual, you're welcome to write questions
01:30in the chat box or even unmute yourself, if you can,
01:35and other people are not talking.
01:38And, Jeff, why don't you take it from here?
01:42- Okay, thank you very much for the introduction, Laura.
01:45I'm really pleased to have an opportunity to talk
01:46about the work that we've been doing.
01:49I think like many speakers in this series, you know,
01:52we've been doing a lot of work very hard
01:54on a short period to try to get some progress on COVID-19.
01:58Ironically, this is the first work
01:59I think that I started in response to the COVID-19 epidemic
02:03and it's turned out to be a lot of work.
02:07So it's actually gotten the least far.
02:11So we've done a little bit of work, for instance,
02:13on epidemic modeling of COVID-19.
02:15That's already, it's actually been submitted,
02:18I actually have some other work on quarantine
02:20and stuff that turns out to be really interesting
02:24and far along in the research.
02:26And then this work, which I started early on,
02:28which is more evolutionary, and looking at the zoonotic
02:31process has gone a little bit slower.
02:32So what that means is consistent with
02:35many other speakers in this series,
02:35I'm gonna be talking a lot about
02:38the methods that we're going to be using,
02:40which are well developed, and what we're planning to do,
02:43I don't have a lot of results.
02:44But I think that's consistent with these talks in general.
02:47So hopefully, that will be of interest to you
02:49and also be illuminating in terms
02:53of possible research approaches towards this kind of work.
02:58So as Laura mentioned,
03:00I use a lot of evolutionary approaches
03:02to do my analyses of things.
03:04And the title of this talk is model averaged estimation
03:08of molecular evolution and natural selection
03:12in SARS coronavirus one and SARS coronavirus two
03:14coronaviruses during the zoonotic period.
03:18So what was attracting my interest in this particular case
03:21is that it's usually very difficult and challenging,
03:25and I'll get to this later in the talk, to figure
03:27out what's going on during the zoonotic period,
03:29because you don't usually get much sampling there.
03:32So, what I wanted to do was apply some techniques
03:35that I've developed to this problem.
03:38And I will get to those techniques
03:41and the application to this problem.
03:43But I first just wanna give a little bit of introduction,
03:46I think, maybe from a statistics point of view
03:47towards some of the methodologies that we're using,
03:49just so everyone can sort of get on board
03:51at least how I see this as contributing
03:55to interesting statistical questions.
03:57So in a broad sense, if I can get this to move forward.
04:00Here we go.
04:01I think one of the most intriguing
04:03and interesting and challenging areas of mathematics
04:05and statistics is understanding this border
04:08between the discrete and the continuous.
04:09So just one particular
04:13example you can pick out is, if you look at discrete
04:16and continuous distributions that are frequently
04:19in use in statistical probabilistic analyses,
04:21we have the geometric and negative binomial distributions.
04:25And we have the exponential and gamma distributions.
04:30These are essentially waiting for discrete events
04:32when you have a probability over time,
04:33or waiting for the rth event if you
04:35have a probability over time,
04:37and they correspond to the distributions on a continuous
04:39time for the wait for the first event
04:42or the wait for the alpha event.
04:45So there's a real clear correspondence
04:46between these two distributions.
04:48And you can actually see in the mathematics,
04:50how they're similar as well.
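
A minimal Python sketch of that correspondence, not something shown in the talk: the chance of still waiting after n discrete trials (geometric) approaches the chance of still waiting at continuous time t (exponential) as the time step shrinks. The function names and the example rate are illustrative assumptions.

```python
import math

def geometric_tail(p, n):
    """P(first event takes more than n trials), with event probability p per trial."""
    return (1.0 - p) ** n

def exponential_tail(rate, t):
    """P(first event takes longer than time t), with a constant event rate."""
    return math.exp(-rate * t)

# Assumed example values: rate 0.5 per unit time, waiting time 3 units.
rate, t = 0.5, 3.0
for steps_per_unit in (1, 10, 1000):
    p = rate / steps_per_unit      # per-step probability shrinks with the step
    n = int(t * steps_per_unit)    # number of discrete steps covering time t
    print(steps_per_unit, geometric_tail(p, n), exponential_tail(rate, t))
```

With finer time steps the two printed tail probabilities move closer together, which is the discrete-to-continuous link being described.
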
04:53And that correspondence is kind of interesting.
04:54And the reason why I say it's interesting is
04:56because often many of the biggest problems I think
04:59we wrestle with in statistics are when we're trying
05:01to deal with data that is some intermediate
05:04level between continuous and discrete,
05:07and where we're trying to figure out which
05:08approach is the best to use, should we use some sort
05:11of parameterized distribution to address it?
05:13Or should we use some sort of nonparametric
05:17approach based on the discrete?
05:18I'm not sure in any particular case.
05:19But I just wanna mention
05:21that I think that's a very interesting area.
05:22And the technique I'm gonna tell you about
05:23is definitely wrestling with exactly this kind of question.
05:27So what kind of question do I mean?
05:29Well, I mean, questions that deal with state spaces,
05:32over time, or over any discrete or continuous axis.
05:36And you can see in this diagram just give you a picture
05:40of the kinds of problems that one deals with
05:43between discrete and continuous measures.
05:45You can have here it's depicted as time,
05:48you could have a discrete state space,
05:51state space you're measuring over time,
05:53you could have a continuous, sorry,
05:56you're gonna have discrete measurements
05:59where you've got discrete time
06:01in a discrete state space,
06:03you could also have discrete time
06:06and a continuous state space.
06:08You can have continuous, continuous
06:12or you can have discrete, continuous.
06:13And these two on the bottom, two on the left,
06:15sorry, are the relevant ones for
06:17what I wanna talk to you about.
06:19In my research, which is largely focused
06:22on informatic data that we can obtain from sequencing
06:26or other approaches like that.
06:28A lot of what we're trying to do is look at these discrete
06:30linear sequences that have sites, DNA sites or amino acid
06:34sites and trying to understand is there some
06:37pattern in those sites that allows us to understand
06:40something about the biology of the organism
06:41or the biology that we want to know something more about?
06:45So what essentially I'm gonna be doing
06:48is telling you about an approach
06:50that takes essentially discrete items over some X axis
06:54here, which in my case is always going to be
06:56sequence space, like the nucleotides
06:58or the amino acids of a sequence.
07:01And turns it into these kinds of more discrete models.
07:04And then in some, in a procedure that I'm going to tell you
07:07about actually gives us more of a continuous measure
07:10over that space, it's not completely continuous,
07:13it actually is on every site.
07:14But when you work with hundreds of sites,
07:17it turns out to look very continuous
07:20in terms of how it appears.
07:22But it's done with a discrete model
07:23that looks over multiple sites.
07:24So well, I'll tell you how it works in a moment.
07:26And I hope it's of interest to you guys.
07:28So just to introduce that, in general,
07:31the lab has worked on a lot of different kinds of data,
07:34and including things like gene expression data
07:36that borders this discrete continuous measurement.
07:39The old microarrays we used to use gave us
07:43essentially continuous measures of gene expression.
07:44Now we get discrete counts
07:46from our census sequencing approaches.
07:49Then all the sequence data we work with
07:51often ends up being essentially clusters
07:53of sites of various kinds.
07:56And then we also use a lot of phylogenetic inference,
07:59which is another kind of just discrete modeling
08:01in terms of the topology, but it borders
08:03between these two because we have discrete modeling of the
08:07topology, there are certain topologies
08:10that the taxa that we're interested in looking at
08:12that show their relationship to each other.
08:13At the same time, there's also a continuous
08:15measure out of that, which is these branch lengths,
08:17or how diverged these different taxa
08:19are from each other and constructing the phylogeny.
08:22So this sort of border between discrete
08:24and continuous measures, always sort of plagues
08:28and intrigues me, I guess it would be the question.
08:30Okay, so what am I gonna do today?
08:32What I wanna do today is talk about
08:35maximum likelihood model averaging to profile clustering
08:37of site types across discrete linear sequences.
08:40So at the very base level,
08:41how do we take kind of these discrete sequences
08:44of amino acids or nucleotides
08:46and understand whether sites are closer to each other
08:50or farther apart from each other?
08:52This is the question: are they just uniformly
08:53distributed site types across a sequence?
08:55Are they clustered close together or far apart?
08:58Secondly, I'm gonna talk about how we can
09:01then use that approach to understand whether sites
09:04are under selection in a gene expressed in a sequence.
09:07And what I mean by under selection is that,
09:09in fact, sites are changing in a rapid
09:12or at a more rapid pace than you'd expect simply
09:14by mutation alone.
09:16So mutation, of course, is going to introduce
09:18variation into a genetic sequence.
09:19But when you see changes that are happening faster
09:21over time in a population,
09:23than mutation alone would produce,
09:26that implies that every time that mutation is happening,
09:29it's spreading across the population.
09:30And that's why you see that uptick
09:31in the rate of change of those sites.
09:34So we can actually use this clustering approach
09:36to identify regions of the gene that have
09:38that sort of uptick and I'll explain how we do that.
09:41Now lastly, I'm just going to show you a very few slides
09:43on the title of the talk,
09:45which is this model-averaged estimation of the molecular
09:48evolution and natural selection in SARS Coronavirus one
09:51and SARS Coronavirus two during the zoonosis.
09:55So by the time we refer to these,
09:57I'll just let you know we're almost done with the talk.
09:59All right, so to talk about the first one,
10:01maximum likelihood model averaging to profile clustering
10:03of site types across discrete linear sequences.
10:09I just want to... (phone ringing)
10:11Sorry, emphasize that we wanna figure out
10:20whether site types are clustered within a linear sequence.
10:22This sounds like a very straightforward
10:24statistical question. It seems like something
10:27that should have been addressed many, many times
10:28in the statistical literature.
10:29Much to my surprise,
10:30it's actually not terribly well explored.
10:34You have a linear sequence,
10:36it's so long and you have site types of one type
10:38or another. Are they clustered next to each other?
10:39Well, if you know the bounds of the region of interest,
10:42in other words, if you can describe, oh,
10:43I'm interested in this domain right here,
10:46and it's from site to site 90 or some other description.
10:48If you know the bounds,
10:49it's very simple to analyze that kind of data.
10:52You can just quantify the site type proportions
10:55within and outside those bounds.
10:57Use something like a straightforward Fisher's exact
10:59test for significance. Extremely simple problem.
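
When the bounds are known, the test described here can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not code from the talk: the 0/1 encoding of site types, the half-open bounds `start` and `end`, and the function names are hypothetical, and the two-sided p-value sums every table that is no more probable than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

    observed = prob(a)
    lo, hi = max(0, row1 + col1 - total), min(row1, col1)
    # Small tolerance so ties with the observed table are counted.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= observed * (1 + 1e-9))

def known_bounds_test(seq, start, end):
    """Compare the proportion of 1s inside seq[start:end] with the rest."""
    inside = seq[start:end]
    outside = seq[:start] + seq[end:]
    return fisher_exact_two_sided(sum(inside), len(inside) - sum(inside),
                                  sum(outside), len(outside) - sum(outside))

# Hypothetical sequence with all four 1s inside the known region.
print(known_bounds_test([0, 0, 1, 1, 1, 1, 0, 0, 0, 0], 2, 6))
```
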
11:01But what if you don't actually know those bounds?
11:04What if you don't know even what you're looking for exactly?
11:05You just know you're interested in concentrations
11:07of one site type compared to another site type
11:10across some discrete linear sequence,
11:12like this series of zeros and ones you see below.
11:15There's one, zero, zeros, there's one, zero, ones,
11:17there's periods where ones are closer to each other a series
11:20of ones are closer or farther apart from each other.
11:22How should we figure out whether things
11:24are actually clustered in that site?
11:26Or are they random?
11:27So if you don't know exactly where to describe,
11:31or what size you're looking for,
11:33the most common solution people use
11:35is some kind of sliding window,
11:36they take a window over the series,
11:38and they slide it across and say,
11:40"How many are in this window?"
11:41And then you can come up with based on the sliding window
11:44a sort of diagram of the clustering.
11:46And that's an approach that actually does
11:49give a good metric of the clustering
11:51in terms of like you see peaks where there's
11:53a lot of clustering and valleys where there is none.
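
A sliding-window profile of this kind might look like the following sketch. The window length is an arbitrary choice made by the user, and the function name and example sequence are assumptions for illustration.

```python
def sliding_window_counts(seq, window):
    """Number of 1s in each window of the given length, slid one site at a time."""
    counts = [sum(seq[:window])]
    for i in range(window, len(seq)):
        # Update the previous count: add the site entering, drop the site leaving.
        counts.append(counts[-1] + seq[i] - seq[i - window])
    return counts

# Hypothetical sequence; the peak marks the run of three adjacent 1s.
print(sliding_window_counts([1, 0, 0, 1, 1, 1, 0, 0], window=3))
```

Adjacent windows share all but one site, so neighboring counts are strongly correlated by construction.
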
11:56However, significance testing with that kind of approach
11:59is often awkward to construct,
12:00due to strong autocorrelation
12:02among these overlapping windows.
12:04And of course, if you just sort of
12:06take windows arbitrarily from one location to another,
12:09then you're really instituting, (indistinct chatter)
12:13then that causes problems.
12:14Because what if the cluster is really on a border
12:16between two windows, so you have to slide it over and then
12:19you have the autocorrelation.
12:20And it becomes actually statistically
12:21quite challenging to sort of account
12:24for all of those auto correlations.
12:25Secondly, the need to specify that window
12:27size itself presents a user with a procedural ambiguity
12:31that almost inevitably leads to post hoc selection of window
12:34size and can mislead inference that is just the fact that
12:37you have to choose a window size.
12:39And if you don't actually have a good a priori
12:41outside reason to choose it.
12:43It's very hard not to choose a window size
12:44that ends up validating your hypothesis in some way.
12:49So it'd be better if we could just have an approach
12:51that does not require us to place in some
12:53arbitrary parameter that gives us a window size.
12:56So in order to address this question,
12:58a postdoc of mine, Zhang Zhang, who you see below, worked
13:01with me to address it.
13:03Oh, I wanted to say one other thing,
13:04which is that, yes, this has been addressed with some
13:07nonparametric methods that people have developed,
13:11including some rather famous people like Sam Karlin.
13:14And these are methods that do not assume prior knowledge.
13:17And they've been suggested to detect this clustering
13:20in discrete linear sequences.
13:21So you can do runs tests that look for
13:22the longest unbroken run, or the variance of the run
13:26lengths across the entire sequence.
13:27Both of these are indicators of clustering.
13:30Unfortunately, both of those
13:32are not sufficient tests.
13:34They don't use enough of the information
13:36for you to actually have as much power as you'd
13:39like to do the analysis.
13:40And that's because if you use
13:42the longest run length, for instance, of course,
13:44you're only really using a little bit
13:45of information about the entire sequence.
13:47And of course, you're really missing anything
13:49like a cluster of ones that has a bunch of small
13:52clusters that are all next to each other interspersed
13:54with a few of the other type,
13:56so the longest unbroken run doesn't work well.
13:59In terms of power,
14:01if you use the variance of run length,
14:04that gets rid of the fact that you're looking for just one.
14:05But unfortunately, a variance doesn't tell you anything
14:07about the relative position of sites
14:11that are of the same type across the sequence.
14:14So the fact that this one, one, one, one here is close
14:18to the one, one here, and the one another is,
14:20and this, the fact that these are all close to each other,
14:22does not give us the power that it should
14:25for understanding that this region may
14:27be clustered.
14:30So variance of run length is also an underpowered approach.
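
To make the limitation concrete, here is a small sketch (an illustration, not code from the talk) that computes both run-based statistics. The two example sequences are hypothetical: they have identical runs of 1s, so both statistics agree, even though the runs sit close together in one sequence and far apart in the other. The sketch assumes each sequence contains at least one site of the type of interest.

```python
from itertools import groupby
from statistics import pvariance

def run_lengths(seq, site_type=1):
    """Lengths of the unbroken runs of the given site type, in order."""
    return [len(list(group)) for key, group in groupby(seq) if key == site_type]

def run_statistics(seq, site_type=1):
    """Longest run and (population) variance of run lengths."""
    runs = run_lengths(seq, site_type)
    return max(runs), pvariance(runs)

clustered = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
dispersed = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print(run_statistics(clustered), run_statistics(dispersed))
```
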
14:33The most powerful approach that's been published out there,
14:36aside from the ones we've been working on,
14:38is the empirical cumulative distribution function
14:41statistic. That's where you sort of go across the sequence
14:43and just say, "Oh, okay, we're accumulating ones here,
14:47we're accumulating more."
14:49And there's fortunately a number
14:52of highly developed statistical approaches
14:53to look at the empirical distribution and figure
14:55out whether you see an increase beyond
15:00expected during some period. With that ECDF,
15:03the power is better than either of the previous methods,
15:05but it's still not very strong.
15:07It's not clear that it includes all the
15:08information that it should.
15:10And it can be affected.
15:12Research has shown that it can be affected
15:14by the location of the cluster, which is not desirable.
15:16So if you have a cluster on an end,
15:18the ECDF will have less power
15:21or more power compared to a cluster in the middle.
15:23It's also challenging to interpret in the end,
15:26for reasons I'm not gonna go into right away.
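
One simple way to picture an ECDF-style statistic is a Kolmogorov-Smirnov-like maximum gap between the running fraction of 1s and the straight line expected if they were spread evenly. This sketch is a stand-in chosen for illustration, not necessarily the specific published ECDF statistics being referred to, and it assumes the sequence contains at least one 1.

```python
def ecdf_max_deviation(seq):
    """Largest gap between the cumulative fraction of 1s and a uniform spread."""
    total, n = sum(seq), len(seq)
    running, worst = 0, 0.0
    for i, site in enumerate(seq, start=1):
        running += site
        # Observed cumulative share of 1s versus the share expected by position.
        worst = max(worst, abs(running / total - i / n))
    return worst

print(ecdf_max_deviation([1, 1, 0, 0]))  # 1s piled up at one end
print(ecdf_max_deviation([1, 0, 1, 0]))  # 1s spread out
```
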
15:29So what did we do?
15:30What we did was develop a tripartite divide
15:32and conquer approach to model variant sites
15:35based on iterative sub clustering.
15:37And I'll describe it in detail right now.
15:39I'll just tell you the plus and the minus
15:40of this approach at the beginning,
15:42which is it's sort of a bioinformatics approach,
15:45or a bioinformatics statistician's approach,
15:48in that it uses intensive computation
15:50to solve the problem instead of giving
15:52a strict analytical result.
15:55And in fact, what it does is it just says,
15:58Well, if we're interested in clustering in any case,
16:00clusters should be represented by increases in
16:03the probability within some cluster central region
16:06compared to some side regions.
16:08And if we define CS and CE to be anything
16:11from the very beginning to the very end of the sequence,
16:14it encompasses all possible single clusters
16:17within a sequence.
16:19So, for instance, if the cluster were on the far left
16:22we can just define CS to be at zero,
16:25the left hand cluster is nothing and the right hand cluster,
16:28right hand area that has depressed variant-type intensity
16:35would be the other category.
16:38Anyway, so, what we can do is divide any sequence
16:42into three sections, just count up the number
16:44of site types in each one, estimate the maximum
16:46likelihood probability for the site type
16:50to be of the variant type of interest,
16:52say it's glycine amino acids within a protein
16:55or adenine nucleotides within a gene, whatever it is.
17:00So then you can just come up with a null hypothesis,
17:03which is the likelihood under the hypothesis
17:06that these things are located at random
17:09across the whole sequence.
17:11And then an alternate hypothesis
17:14that is invoking a model which involves more parameters,
17:18which then separates into a clustered
17:21versus non-clustered state.
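
The two likelihoods can be sketched as follows. This is a minimal illustration under assumptions rather than the published implementation: sites are coded 0/1, each section gets its own maximum-likelihood probability (its fraction of 1s), and `cs` and `ce` are treated as half-open Python slice bounds.

```python
import math

def bernoulli_loglik(seq):
    """Maximized log-likelihood of one section under a single probability of a 1."""
    n, k = len(seq), sum(seq)
    if k == 0 or k == n:  # empty or uniform sections contribute log(1) = 0
        return 0.0
    p = k / n             # maximum-likelihood estimate for this section
    return k * math.log(p) + (n - k) * math.log(1 - p)

def null_loglik(seq):
    """Sites of interest placed at random across the whole sequence."""
    return bernoulli_loglik(seq)

def cluster_loglik(seq, cs, ce):
    """Left section, central cluster seq[cs:ce], and right section."""
    return (bernoulli_loglik(seq[:cs]) + bernoulli_loglik(seq[cs:ce])
            + bernoulli_loglik(seq[ce:]))

seq = [0, 0, 0, 1, 1, 1, 0, 0, 0]  # hypothetical data
print(null_loglik(seq), cluster_loglik(seq, 3, 6))
```

The clustered model can never fit worse than the null, which is why the extra parameters have to be penalized before the two are compared.
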
17:23So that would be fine if what we really
17:25expected in a sequence was one cluster,
17:27compared to nothing else,
17:29compared to the sort of baseline rate of clustering,
17:33sort of baseline rate of variant types.
17:35But what we really want is an approach
17:39that can take clustering at many, many levels.
17:42So what if there's a cluster within the cluster
17:43or a cluster within the left?
17:45So what you can do is then take each
17:46of these sub clusters you've identified and actually
17:50do the same process on them looking for whether there's
17:53a higher likelihood of the data given another cluster
17:56somewhere within this sequence, et cetera, et cetera.
17:59Now, this sort of dictates a procedure,
18:04which is that you start, you input the sequence,
18:07you start at, you know, the first at
18:09the left and move all the way to the right,
18:11essentially, you find the most likely cluster
18:13among all the possible clusters.
18:15If the cluster is statistically significant,
18:17you then sub sequence each of those three parts,
18:21the left hand part, the central center part
18:24and the right hand part, find the most
18:26likely clusters within each of them.
18:27And proceed doing this until you reach a point
18:30where you can no longer find any statistical evidence
18:32that there is continued clustering within it.
18:34And that's the point at which you stop.
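
The procedure just described can be sketched recursively. Several details here are assumptions made for illustration and are not stated in the talk: an AIC comparison stands in for the significance check, a split is charged five parameters (three probabilities plus two boundaries) against one for the flat model, and the search is brute force over every pair of boundaries.

```python
import math

def loglik(seq):
    """Maximized Bernoulli log-likelihood of a section (0.0 if empty or uniform)."""
    n, k = len(seq), sum(seq)
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def best_cluster(seq):
    """Best (log-likelihood, cs, ce) over all splits except the trivial one."""
    n, best = len(seq), None
    for cs in range(n):
        for ce in range(cs + 1, n + 1):
            if cs == 0 and ce == n:
                continue  # identical to the unclustered model
            ll = loglik(seq[:cs]) + loglik(seq[cs:ce]) + loglik(seq[ce:])
            if best is None or ll > best[0]:
                best = (ll, cs, ce)
    return best

def subcluster(seq, offset=0):
    """Return (start, end, estimated probability) pieces for a non-empty sequence."""
    flat = [(offset, offset + len(seq), sum(seq) / len(seq))]
    if len(seq) < 2:
        return flat
    ll, cs, ce = best_cluster(seq)
    # Assumed AIC = 2 * parameters - 2 * log-likelihood; keep the flat model
    # unless the split has the strictly lower AIC.
    if 2 * 5 - 2 * ll >= 2 * 1 - 2 * loglik(seq):
        return flat
    pieces = []
    for start, end in ((0, cs), (cs, ce), (ce, len(seq))):
        if end > start:  # skip empty side sections
            pieces.extend(subcluster(seq[start:end], offset + start))
    return pieces

print(subcluster([0] * 10 + [1] * 10 + [0] * 10))
```

The recursion stops on its own because every section passed down is strictly shorter than its parent.
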
18:36And then what you can do.
18:37And this, I think, is sort of a key because
18:39at the end of that, what you get is one discrete diagram,
18:42kind of like that diagram I showed you initially,
18:44where it proceeds flat, goes up,
18:46proceeds flat goes down, et cetera.
18:47I'll show you an example of that in a moment.
18:50But what you really wanna do possibly,
18:53right, what I think is really appealing about
18:55this approach is that then you can take
18:56that as one model, the most likely model and you can look
18:59at all the other possible models
19:00that you could have constructed.
19:02And you can use AIC weighting to actually figure
19:05out how much you should believe what is the weight
19:11for every possible model.
19:13And then you can average across those models
19:14to give you a continuous description
19:17of how much clustering you see across the sequence.
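
The weighting and averaging step can be illustrated with standard Akaike weights. In this sketch each candidate model is assumed to be summarized by a per-site probability profile and an AIC value; the function names and the example numbers are hypothetical.

```python
import math

def akaike_weights(aic_values):
    """Relative support for each model: exp(-delta/2), normalized to sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-(a - best) / 2) for a in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

def model_average(profiles, aic_values):
    """Weighted average, site by site, of the per-site probabilities of each model."""
    weights = akaike_weights(aic_values)
    n_sites = len(profiles[0])
    return [sum(w * profile[i] for w, profile in zip(weights, profiles))
            for i in range(n_sites)]

# Two hypothetical step-shaped models of a four-site sequence.
profiles = [[0.1, 0.1, 0.8, 0.8], [0.1, 0.8, 0.8, 0.8]]
print(model_average(profiles, aic_values=[10.0, 12.0]))
```

Averaging step-shaped profiles with different breakpoints is what smooths the single best model into a more continuous curve.
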
19:18And again, the advantage that I mentioned
19:20early on about this,
19:22from my standpoint is I haven't put in anything
19:24about how big a window how big a cluster,
19:26I put in nothing about what I'm expecting
19:28to see out of the sequence.
19:30I'm just asking, what's the most likely description
19:32of this given the AIC penalty for parameterization
19:37and what the result gives me.
19:39So then we have a bunch of different weights
19:41for all our different models.
19:44And what it gives us is something like this.
19:45So on the top, I've shown you the AIC model selection
19:48which is the first thing I showed you
19:49if I just took the most likely description
19:51of this particular sequence.
19:53It's not important what it is; it's PRF
19:55ADHD, which has been widely studied in evolutionary biology.
19:59But if you take this model selection,
20:02the most likely description
20:05given that sub clustering looks something like this
20:07where we have a region with fairly high concentration
20:10of polymorphism, in this case, a valley,
20:14a region, an intermediate level,
20:16a point where we have a lot of polymorphism.
20:19And then it moves and changes across the sequence.
20:21Now, if you then instead take not just that one model,
20:25but a series of models and do the AIC model average,
20:28you get a much more continuous description across
20:30the sequence of what the probability
20:33of site types being different is.
20:36And that enables us to ask a question
20:37that's a little bit more interesting in many cases,
20:41and I'll show you how it enables us to ask questions
20:43about natural selection in a moment.
20:45So in particular, it allows us to get an estimate,
20:49you know of what the probability
20:50is across the entire sequence.
20:51Even though we don't have
20:52observations within the central region
20:54or this barren region here.
20:56We can still estimate what the model-averaged
21:00probabilities of a change occurring in different places
21:02of this gene are, and that enables us
21:05to ask questions that we otherwise could not.
21:08All right, so that's an introduction of MACML.
21:11I'll just mention, and I could give you more detail on this.
21:14This is actually published work,
21:16so you can find it.
21:17But compared to the ECDF statistics,
21:19that approach I just showed you has greater power
21:21to detect heterogeneous clusters
21:23it identifies clusters with greater accuracy and precision
21:26based on the Kullback-Leibler divergence between
21:28the actual distribution of the observed distribution,
21:31sorry, the actual distribution
21:34and the inferred distribution.
21:36It has better power and accuracy across
21:37different levels of clustering,
21:38better power and accuracy across
21:40different sequence lengths,
21:41and better power and accuracy in finding
21:43multiple clusters compared to a single cluster.
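
For readers unfamiliar with the accuracy measure mentioned above, this is the generic discrete Kullback-Leibler divergence. How the published comparison set up the two distributions is not described in the talk, so the inputs here are purely hypothetical.

```python
import math

def kl_divergence(actual, inferred):
    """Sum of p * log(p / q) over outcomes; terms with p = 0 contribute nothing."""
    return sum(p * math.log(p / q) for p, q in zip(actual, inferred) if p > 0)

# Zero when the inferred distribution matches the actual one, positive otherwise.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```
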
21:45The disadvantage is, it's extraordinarily
21:47computationally intensive, and it is prohibitively
21:49so for very long sequences.
21:51So for genes of very long length,
21:53we can't actually run it on the full-length gene
21:55and we have to do some more heuristic processes
21:58to crunch those genes into smaller size.
22:01Which we then can analyze and then build them up.
22:03Again, I won't go into those at the moment.
22:05But the point is that at certain lengths,
22:07it gets just computationally too intensive to go
22:09through all the possible models that could explain the data.
22:13Now, I've talked about the maximum-likelihood averaging
22:17to profile clustering of site types
22:19across discrete linear sequences,
22:21introduced that methodology. Now I'm gonna talk about
22:24how we can apply that methodology
22:26to get us a better idea of which sites are under selection
22:29using what's called a Poisson random fields approach.
22:32And don't worry about that terminology.
22:34You might know it from statistics,
22:37it has to do with a particular observation
22:40in molecular evolutionary biology,
22:42which is why they're using it
22:44and it's not really important for this talk,
22:46why it's called that.
22:48So let's go ahead and talk
22:51about the model-averaged site selection
22:53using Poisson random fields.
22:54So first, I need to give you a little bit of background
22:56in the evolutionary biology for those of you
22:59who haven't had a lot of biology,
23:00so you understand how this fits in with
23:02what we tend to do another strategy.
23:03Of course, evolutionary biologists
23:05are often very interested in understanding
23:06what things are under selection.
23:07And in the context of this talk,
23:09why is that important?
23:10Well, we'd really like to know what things
23:12are under selection in the COVID epidemic,
23:14because we'd like to know what sites
23:16are actually causing the COVID epidemic
23:18to spread more or not, and what sites may have
23:21been important in it prior to zoonosis,
23:24and then, perhaps especially in the context of this talk,
23:26what sites were selected during
23:28that zoonotic process that made this virus perhaps able
23:31to infect humans in the first place.
23:33So what we're doing is,
23:34so to give you an introduction,
23:36I just wanna mention that there are sort of ways
23:39to look at ancient times and understand
23:40whether selection was happening.
23:42And that's this approach
23:44that looks at phylogenetic divergence,
23:45looking at multiple sites and saying,
23:47"Oh, we have a whole bunch of phylogeny
23:49of how these organisms are related."
23:51And then we have a bunch of sites that are for each taxon.
23:55When we see sites like this, for instance,
23:57that has an A and then a couple C's and then a G
24:00in another taxon, we know that this site changed twice
24:03on that phylogeny, at least, right?
24:05So it probably changed from C ancestrally
24:09to an A in this lineage and to a G
24:11in this lineage independently.
24:13And so the fact that it changed twice means
24:16that it's got an elevated rate of change.
24:18And that elevated rate of change is an indication
24:20that there's been positive selection for change.
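
The count of at least two changes can be reproduced with Fitch's small-parsimony algorithm, a standard technique that the talk does not name. The four-taxon tree shape below is an assumption chosen to match the example of one A, two C's and one G.

```python
def fitch(tree):
    """Return (possible ancestral states, minimum changes) for one site.

    A tree is either a leaf (a one-letter string) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_states, left_changes = fitch(tree[0])
    right_states, right_changes = fitch(tree[1])
    shared = left_states & right_states
    if shared:
        # The two subtrees can agree on a state: no extra change needed here.
        return shared, left_changes + right_changes
    # No shared state: at least one change must occur at this node.
    return left_states | right_states, left_changes + right_changes + 1

# Hypothetical topology: ((A, C), (C, G)).
print(fitch((("A", "C"), ("C", "G"))))
```

On this tree the inferred ancestral state is C with two changes, one toward A and one toward G.
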
24:22It's especially likely in sort of pathogen-host
24:25interactions that high rates of change are
24:28because pathogens are changing in order
24:30to not be recognizable by their hosts.
24:33And often the host has recognition proteins
24:35that are changing to still recognize the pathogen,
24:36even as the pathogen is changing.
24:38So these high rates of evolution
24:40are very strong indicators of selection
24:42in host pathogen situations.
24:45So this is one way to study a natural selection.
24:48It does depend, though, on having a lot of data going back
24:52in time because you're actually reliant on these changes
24:55occurring in multiple places on multiple lineages.
24:58Now, at a more recent level, and I'm going to go back
25:02to the middle in a moment.
25:05But at a very recent time, you may have
25:07heard of selective sweep detection.
25:08A couple of methods people use are Tajima's D
25:11or iHS; there's a bunch of other methods that are out now.
25:14And the idea there is to look at polymorphism.
25:16And if you look at an individual, before selection,
25:20this is sort of just an idea diagram,
25:22not what you look at.
25:23But so if you look at an individual who has a variant,
25:26and what you see in a population is that
25:30one individual with variant, a variant that's important
25:33has somehow swept across the population.
25:35So if you see this would be before selection,
25:37there's a lot of variation at a particular locus
25:39in the genome after selection,
25:41that one individual's variant, which contributed
25:44to the reproductive fitness would then imply
25:46that they would spread across the population.
25:50And if they spread across the population,
25:52then the genetic variants that were present
25:54in that original individual spread across
25:56the population as well along with this selected site,
26:00and so you can look for this kind of partial or speedy.
26:04And the selection is going on. Neither
26:07of the approaches that I just talked about
26:09is the approach that I'm doing today.
26:10So I just wanted to introduce those,
26:12so you knew those are different.
26:13And they're different because we're looking
26:15at a more intermediate timescale.
26:16That is, the sweep detection is purely
26:19dependent on polymorphism in the population,
26:21like what's happening in a population right now.
26:24The phylogenetic divergence is purely dependent
26:26on these ancient changes that you get from a phylogeny,
26:28understanding how different species are related
26:31to each other. At an intermediate level,
26:33there are methods that use both the polymorphism
26:35and the divergence.
26:37And the idea here in the McDonald-Kreitman approach,
26:40and the MASS-PRF approach I'm going to tell you
26:42about is that the polymorphism what you see generally
26:46in the population is sort of consistent with this.
26:48Sorry, if I go back to this slide.
26:51With this before selection, you know,
26:53all of these blue sites are assumed
26:55to not be under selection,
26:57and that's generally what we believe in evolutionary biology,
26:59because of empirical data that validates it
27:02is that most sites that you find varying in populations
27:05are not under strong selection.
27:07If they were under strong selection,
27:08they would probably fix, and everyone would have them.
27:11And if they were under negative selection,
27:13they wouldn't rise to a high frequency.
27:14So generally speaking sites that you actually see
27:17change differences between us and our genetics
27:18typically are not affecting anything.
27:20Of course, we spend in our...
27:23In the media, you only hear about the changes
27:24that actually affect things.
27:25And that's because those are important to us,
27:26the ones that don't change anything
27:28we don't really care about.
27:29So nobody talks about that much.
27:30But most of the changes within population or differences
27:33within population don't have much material effect.
27:35So under that hypothesis,
27:37then when you look at polymorphism,
27:39most polymorphism is just an indication
27:41of the underlying mutation rate,
27:43some mutation happened didn't have any effect.
27:45It's drifting up and down in the population.
27:47And so the advantage of that is if you know
27:50that polymorphism is a signature
27:52of just random mutation, it gives us an estimate
27:54of the underlying mutation rate, which we can then compare
27:57to the divergence and using that comparison,
28:00we can understand how organisms are related.
28:02So whether organisms are under selection
28:05or not, if the divergence is high compared
28:07to the polymorphism, that indicates a lot of selection.
28:09That means (indistinct chatter)
28:12in the timescale of the analysis you're doing,
28:14we have a lot of change in the population.
28:17If, on the other hand, you have a lot of polymorphism
28:20and not that much divergence, then that indicates
28:22you've got a lot of change going on,
28:23but it's not actually being directionally
28:26selected because the divergence is much lower.
28:27So how does that test work in practice?
28:30Well, just to step back for one moment,
28:32so we're gonna apply that kind of test.
28:35In this talk I'm applying that test
28:36to the emergence of COVID-19.
28:39I'm actually also applying it to SARS, which is fairly
28:44closely related, the SARS coronavirus 1,
28:46because we have similar data and can apply
28:48the same test in the same way to that data set.
28:52And we're using in addition the SARS-like
28:55coronaviruses in samples that have been sequenced,
28:58basically collected from bats
29:00over the past 20 years or so.
29:02So what you can see here is a phylogeny,
29:05which includes COVID-19 epidemic ongoing now in humans,
29:09the SARS epidemic, which caused some 800 deaths
29:13or so back in the early 2000s.
29:18And what we're doing is analyzing both and looking at,
29:21in particular, the very short internode here
29:25between the most closely related non human infections
29:31and the human infection set that we can see.
29:33And this internode here, also,
29:36between these non human infections and the human
29:39infections we can see here, because the changes
29:42that may have enabled, we don't know,
29:45there may be no changes that enabled it,
29:47maybe this virus throughout
29:49its entire history could have infected humans,
29:51but it just never managed to or never did.
29:53But if there are changes that are unique to this virus
29:56that happened during zoonosis, enabling it to infect us,
29:59they happened on this lineage,
30:00and so we're interested in seeing what those changes are.
30:04And so that's what we're gonna do is we're gonna run
30:06this polymorphism and divergence approach on this lineage.
30:10And what I just want to make (indistinct chatter)
30:13clear to you is the reason
30:14why the polymorphism divergence approach is important is
30:18the phylogenetic approach, the ancient approach
30:20relies on a large clade of data, which we don't have
30:22for that particular lineage here,
30:24we just have the human infection,
30:26which is no longer zoonotic.
30:26And we have this one lineage.
30:28And so what we can do is ancestrally reconstruct
30:30the ancestor of this lineage, which is right here,
30:33actually on the phylogeny,
30:34and also the ancestor right here,
30:37and then use MASS-PRF, this approach that's based
30:40on polymorphism and divergence that I'll explain to you,
30:43on the divergence between that ancestor
30:46and the first ancestor of all the human infections.
30:48And we can take that as the near zoonosis time
30:51and figure out what mutations might
30:53have happened during that time.
30:54All right, so we're gonna do that in both
30:56the COVID-19 and SARS cases.
30:59Now, how does this work in principle?
31:02Well, there's an old approach,
31:03which is not what we're using.
31:05But I have to compare it to in order to
31:06sort of reference it in terms of the literature.
31:09And that is that when you assume
31:11that polymorphism is neutral,
31:13we expect a different proportion of replacement
31:16to synonymous divergence compared to replacement
31:18to synonymous polymorphism in a gene.
31:21So it's just a two by two table here, again,
31:23very simple statistics, where we look at
31:25the number of replacement sites that are divergent,
31:28the number of synonymous sites. Replacement,
31:30again, is when an amino acid change
31:32occurs in a DNA sequence.
31:33DNA sequence changes can either change the amino acid
31:35or not depending on what the sequence of the codon,
31:39the three-base-pair codon, in the DNA sequence is.
31:42So if there's a replacement, we tally it here,
31:44if it's a synonymous change, that doesn't change the amino
31:46acid, we tally it here. These synonymous
31:48changes are presumably neutral because
31:50they don't change anything about your protein.
31:52And then if it's a polymorphic replacement,
31:56then we see it here.
31:57And if it's a synonymous polymorphism we see it here.
31:59So under the hypothesis that I mentioned,
32:01all three of these cells should occur, it should
32:04be sort of changing in exactly the same way
32:06because polymorphic sites, whether they're replacement
32:09or synonymous, we're assuming are neutral;
32:11synonymous sites, whether they're divergent
32:12or polymorphic, we're assuming are neutral.
32:15The only one that under that
32:17assumption is not neutral is these
32:19changes at replacement divergence sites.
32:22So, if this replacement divergence, if the marginals
32:25add up so that this replacement divergence is sort of in
32:29line with all these others, then we assume nothing important
32:30is happening in that gene, it's probably not selected,
32:33it's just neutral changes that are happening there.
32:35If this divergence is higher, though,
32:38then we might conclude that it's under
32:39selection for changes at a rapid pace.
32:41So neutrality yields a DN over DS that's equal
32:44to the PN over PS positive selection means
32:46that the DN DS is greater than the PN PS and negative
32:50selection where changes are actually being selected against
32:53at a high level indicates the DN DS
32:56is gonna be less than PN PS.
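As a minimal sketch of the two-by-two comparison just described, the following Python computes the DN/DS and PN/PS ratios and a two-sided Fisher exact p-value for the table. The function names (`mk_test`, `fisher_exact_two_sided`) and the example counts are hypothetical and do not come from the talk; using Fisher's exact test is also an assumption, since the talk does not name the statistic applied to the table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def mk_test(dn, ds, pn, ps):
    """Compare replacement/synonymous divergence with polymorphism.

    dn, ds: replacement and synonymous divergent (fixed) sites.
    pn, ps: replacement and synonymous polymorphic sites.
    Assumes ds and ps are both non-zero.
    """
    dn_ds = dn / ds
    pn_ps = pn / ps
    if dn_ds > pn_ps:
        direction = "positive"   # excess replacement divergence
    elif dn_ds < pn_ps:
        direction = "negative"   # deficit of replacement divergence
    else:
        direction = "neutral"
    return {
        "dn_ds": dn_ds,
        "pn_ps": pn_ps,
        "direction": direction,
        "p_value": fisher_exact_two_sided(dn, ds, pn, ps),
    }


if __name__ == "__main__":
    # Hypothetical tallies, chosen only to illustrate the calculation.
    print(mk_test(dn=16, ds=2, pn=5, ps=5))
```

The direction label only reports which ratio is larger; whether the difference is meaningful depends on the p-value and on how many sites were tallied.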
32:59All right, now let's get to a little bit of the
33:01complexity on this thing that I mentioned that's called
33:04Poisson random field theory, which quantitatively estimates
33:05gene-wide selection intensity.
33:09So what you can do is take that
33:12same two by two table, and you can say under a model of
33:14selection, what do we actually think is happening here.
33:18And that gives us the ability to estimate the selection
33:20coefficient, which is basically the rate at which that
33:22change allows the virus to increase its reproductive ability
33:25or survival ability in the host.
33:27And that is this gamma term right here
33:32in these terms, and this, these look complicated,
33:34but essentially, these formulas are just saying
33:36that the expectation for a synonymous sorry,
33:39the synonymous and replacement have reversed
33:41on this chart compared to the last,
33:43so don't be confused by that.
33:45But the expectation under synonymous
33:45changes is essentially the mutation rate.
33:48And these terms are just about the sampling properties
33:50of when you sequence how many of these things you get,
33:52I don't need to go into the detail about that here.
33:55Similarly, the polymorphic sequence
33:57is just basically dependent on the mutation rate.
34:00However, the replacement sequences are a little bit more
34:02complicated in that they have to account
34:07for kinds of selection that may be going on.
34:11For reasons that I don't wanna get into
34:12the polymorphic selection, so both of them are depending
34:16on the mutation rate for replacement sites,
34:18and both of them depend on
34:20how much each variant is selected.
34:23Selection does impact the polymorphism
34:25to a certain degree in the sense that if variants
34:27are moving through the population very fast,
34:30that can change how much polymorphism you see.
34:32But then if you use these sampling formulas, and the formula
34:36for the estimate of the strength of selection,
34:38given how many variants we see changing,
34:41you get these formulas for how much replacement
34:44divergence and polymorphism you expect to see.
34:47So this is population genetics that was worked
34:49out by Stan Sawyer and Dan Hartl in 1992.
34:52The only change I'm making in this PRF is,
34:56instead of using the tallies, that is, how many variants
35:00you see, the way the McDonald-Kreitman test uses them,
35:04I'm taking the probabilities of replacement divergence
35:08and the probabilities of polymorphism
35:11and putting them in here.
35:12And the advantage here is that what
35:13I can do with that is what I mentioned earlier,
35:15I can go back to the old MACML
35:18approach, the sequence clustering approach
35:20that I mentioned before, estimating those probabilities
35:25across the entire gene. I can then estimate selection across
35:27the entire gene by using these probabilities for single sites;
35:30I don't have counts for a single site.
35:32So what this allows is for
35:34us to estimate this gamma, maximizing the likelihood of what
35:38gamma is to explain those probabilities we see.
35:42So this is a very complex diagram of how this all works,
35:46again, is a pretty elaborate method of computation.
35:50But again, has the nice properties that I'm not putting
35:53in any I'm not using assumptions
35:55and not putting in any parameters.
35:56They go in.
35:58I just take the polymorphism and divergence and analyze it for
36:01whether sites are clustered in four different categories.
36:04Again, replacement polymorphism,
36:06that's this arc here,
36:07synonymous polymorphism, synonymous divergence, replacement divergence.
36:12We cluster within all four of those categories.
36:15We calculate the model-averaged probability across
36:17all those clusters and merge the data together.
36:20I'm not going to go through the details.
36:22But if you were to do essentially the MACML-like
36:25clustering on those four categories
36:27for a particular gene, replacement polymorphism,
36:30synonymous polymorphism, synonymous and replacement divergence,
36:33and plug those in to the formulas I showed you before,
36:37you're basically plugging into these categories,
36:39you can estimate those formulas.
36:41And in the end, what you get is
36:42an estimate of gamma across nucleotide positions in a gene.
36:49I won't go into this result here;
36:51it's an interesting result for reasons
36:54that are of interest mostly to evolutionary
36:56biologists, but you can see here in this particular gene
36:58that there's a lot of variation in the selection
37:02intensity across the gene.
37:04Now, that is actually really
37:06consistent with what we'd expect.
37:08From a sort of basic biology standpoint.
37:11Different parts of a gene are gonna either
37:13be very strongly selected to stay the same
37:15or they're gonna change, you shouldn't really expect
37:18that all parts of gene are equally likely to change.
37:20And this gives a very nice diagram
37:22that allows you to understand how
37:23it's different across the gene.
37:25So if we compare this kind of approach
37:27to the McDonald-Kreitman test, which again
37:30is just putting in the DN DS, PN PS values
37:33into this two by two table,
37:36and I went through that, the important difference is that
37:39the MK test assumes this intragenic homogeneous selection,
37:42that in fact, a gene has the same selection
37:44across the entire sequence.
37:46The problem with that is if you have one small
37:48region that's under selection,
37:50the averaging out process across that entire gene
37:53can mean that you don't detect the selection there,
37:54even though it may be very strong for that small region.
37:57And so the hope is that MASS-PRF can
38:01identify those regions much better
38:02than MK for instance, would.
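The averaging problem described above can be seen with a toy calculation. This sketch uses made-up per-window tallies (the `windows` list and helper names are hypothetical) and plain ratios; it only illustrates how pooling dilutes a local signal and is not the MASS-PRF method, which relies on model-averaged clustering rather than fixed windows.

```python
# Hypothetical tallies along one gene, one tuple per window:
# (Dn, Ds, Pn, Ps) = replacement/synonymous divergence and polymorphism.
windows = [
    (1, 6, 3, 6),
    (1, 6, 3, 6),
    (6, 1, 1, 2),  # small region with an excess of replacement divergence
    (1, 6, 3, 6),
    (1, 6, 3, 6),
]


def ratios(dn, ds, pn, ps):
    """Return (Dn/Ds, Pn/Ps) for one table of tallies."""
    return dn / ds, pn / ps


def pooled(tables):
    """Sum the tallies column by column, as a gene-wide test would."""
    return tuple(sum(column) for column in zip(*tables))


gene_dn_ds, gene_pn_ps = ratios(*pooled(windows))
hot_dn_ds, hot_pn_ps = ratios(*windows[2])

if __name__ == "__main__":
    print("gene-wide Dn/Ds vs Pn/Ps:", gene_dn_ds, gene_pn_ps)
    print("hot window Dn/Ds vs Pn/Ps:", hot_dn_ds, hot_pn_ps)
```

With these invented numbers the pooled table shows no excess of replacement divergence, while the third window on its own shows a large one.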
38:04And in fact, I went through this already.
38:09I'll just skip past this because I went through it already.
38:13And it does do that.
38:18So this is an example of the McDonald-Kreitman
38:21test here applied to a Drosophila gene.
38:23What you see is a high level
38:27of replacement divergence, which turns out
38:30to indicate high selection.
38:33And you can see here that the DN DS ratio
38:35is about eight to one, whereas the PN PS ratio
38:38is almost even.
38:40So this is a gene that's under very strong selection
38:42based on the McDonald-Kreitman test.
38:45Now, interestingly, so this one works
38:47with homogeneity.
38:49And then if you analyze the Acp26Aa gene
38:55and look for the probability of all four categories.
38:58These are the four categories and of course,
39:01the replacement divergence here is the one
39:04that's most likely to drive selection.
39:06What do you get when you estimate gamma using this?
39:09Well, interestingly, what you see is not something
39:10that's under very strong selection across the entire gene,
39:13but something that's under moderately strong selection,
39:15basically in the second half of the gene,
39:17and then one peak of very strong
39:19selection right around the middle of the gene.
39:21And this is visible because
39:23of a number of changes that occur
39:26in one particular domain of the gene here.
39:28Now, if you look at just the replacement divergence,
39:30you wouldn't be able to figure this out.
39:32Because you see there are other
39:34peaks along here.
39:35Those don't turn out to be so important.
39:36And the reason why they don't turn out to be so important
39:39is that the synonymous divergence, synonymous polymorphism,
39:41and replacement polymorphism
39:42tell us more about the underlying mutation rate,
39:44which says those elevations probably have
39:47something to do with mutation rate, and not necessarily
39:49to do with added divergence.
39:52You can sort of see this elevation
39:54on the right hand side over here compared
39:56to the small dip right here and up here
39:59and the way it all works out mathematically
40:02is we can really see that there's strong selection here.
40:04We can also get what I call model intervals for this.
40:06If you look across all the models,
40:08what are the estimates of selection?
40:11What we get is the 95% model interval for this,
40:14and that's what these very faint gray lines you
40:17may be able to see are. Those allow us to detect whether
40:19these are significant, at least
40:22statistically significant, differences in selection.
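One way to picture a model interval is as a weighted quantile range over the per-model estimates at a site. The sketch below assumes each candidate clustering model contributes one gamma estimate and one weight (for example, an information-criterion weight); the talk does not specify the weighting in this section, so the weights, numbers, and function names here are hypothetical.

```python
def model_average(estimates, weights):
    """Weighted mean of per-model estimates."""
    total = sum(weights)
    return sum(e * w for e, w in zip(estimates, weights)) / total


def model_interval(estimates, weights, level=0.95):
    """Weighted quantile interval across models (assumed definition)."""
    total = sum(weights)
    pairs = sorted(zip(estimates, weights))
    lower_q = (1 - level) / 2
    upper_q = 1 - lower_q

    def quantile(q):
        # Smallest estimate whose cumulative weight reaches q.
        cumulative = 0.0
        for value, weight in pairs:
            cumulative += weight / total
            if cumulative >= q - 1e-12:
                return value
        return pairs[-1][0]

    return quantile(lower_q), quantile(upper_q)


if __name__ == "__main__":
    # Hypothetical gamma estimates at one site under four models.
    gammas = [-1.0, 0.5, 2.0, 6.0]
    weights = [0.01, 0.5, 0.45, 0.04]
    print(model_average(gammas, weights))
    print(model_interval(gammas, weights))
```

Under this reading, a site whose interval lies entirely above zero would be a candidate for positive selection, and one whose interval overlaps zero would not.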
40:24All right, I'm gonna skip through this
40:27just because I don't want to spend the time
40:29but the point is, you can do this for other genes,
40:29and it shows similar results that allow us
40:32to understand where sites are under selection in that gene.
40:34I'll just cover a few more examples
40:37of how we've used this to give you an idea
40:39of what it can look like in a comparison between humans
40:42and chimpanzees where we've run this just to understand
40:44how we've diverged from chimpanzees.
40:47We see a bunch of different examples here.
40:50Again, doing a little bit of comparison to
40:52that traditional McDonald kreitman test
40:54and the mass PRF test.
40:56Here you see a gene which is statistically significant
41:00from a p-value point of view,
41:01based on the MK test, the four
41:04tallies of which are indicated here.
41:07Here's the MASS-PRF profile, and it shows us again
41:10a particular region within this SLC AA
41:12one gene that is under selection.
41:14There are interesting stories behind all of these,
41:17but I'm not gonna take the time to go through them.
41:19Here's another example, and this is an example
41:22where the McDonald-Kreitman test
41:23comes out as not significant.
41:25There's just not that much divergence
41:26compared to the other categories.
41:28But if you do this, spatially with the MASS-PRF test,
41:32you actually see that a very central region there
41:34has very strong selection, and then the rest of the gene
41:37is under almost zero selection or almost no selection.
41:41So this is an example I talked about,
41:43where you could have some very small portion
41:45of the gene under very strong selection,
41:47and the McDonald-Kreitman test wouldn't detect it
41:49because it's averaging over the entire gene.
41:51Similarly, you'll get some genes.
41:52Oops, I didn't mean to do that.
41:54Some genes, here's M gamma over here, where there's a...
41:58Well, let me turn to that one last.
41:59Actually, let me look at TPH First,
42:02there's no statistically significant selection according to the MK test.
42:06And in fact, in our MASS-PRF,
42:08there's no significant selection either;
42:09the error bars are entirely overlapping zero here,
42:12which indicates no selection.
42:15Lastly, here's M gamma.
42:16This is one of the very few examples
42:18we were able to find where the McDonald-Kreitman test did detect
42:21selection where MASS-PRF didn't.
42:24As you can see, there's quite high tallies here,
42:26which means there's a lot of power
42:27to detect selection if it's there,
42:28but it's probably not very strong,
42:30because the numbers are not all that different
42:32from each other.
42:34And McDonald-Kreitman says it's statistically significant.
42:36Now the reason why McDonald-Kreitman is telling us
42:40it's statistically significant, compared to MASS-PRF,
42:41is that actually, I didn't explain this in detail to you,
42:44but McDonald-Kreitman doesn't actually assume
42:47that there's an elevation of rate here.
42:48And so the significance here is actually driven by
42:51the high polymorphic replacement level.
42:53So there's a lot of polymorphic replacements in there.
42:56And what that means is there's some other
43:00kind of selection that isn't a directional selection.
43:01I won't go into the details there.
43:02But the nice thing is that the examples
43:04where we find that McDonald-Kreitman is statistically
43:07significant and MASS-PRF isn't are examples
43:10where in fact MASS-PRF is not designed to detect
43:12that kind of selection and the MK test is.
43:15In general MASS-PRF turned out to be significant
43:18in almost every case where MK tests were not.
43:21Okay, so how can we use this, apply this
43:24to instances like COVID-19, the point of this whole talk,
43:27and I'm just gonna give you one example first
43:30to justify why we think it's a good idea,
43:32because we don't have results on doing it,
43:34at least not many results on doing it to COVID-19
43:36yet, and that is that we applied this to influenza before,
43:39which has some similarities to COVID-19, as everyone knows
43:43and in influenza, again, we're interested in looking across
43:46the gene are there sites that are under selection
43:48because those sites that are under selection
43:50are candidates where we need to be aware that
43:53in fact, vaccines need like for every year they design
43:58a new influenza vaccine, right?
43:58And what they're trying to do is accommodate
44:00the fact that these changes occur on the sites
44:03that are actually susceptible
44:04to your immune system recognizing the influenza virus.
44:08So we need to understand those sites that are changing
44:11and where they are in in order to design
44:13more universal vaccines that maybe could target sites
44:16that won't change rapidly because they can't change
44:19because they're structurally constrained in the virus.
44:22So what we did was apply this MASS-PRF approach
44:25to influenza similarly on a phylogeny
44:29to like I described for Coronavirus.
44:30I don't have the phylogeny in the slide set,
44:33but the point is just looking at the ancestral influenza
44:33and its divergent sites within a particular region.
44:40And what we were able to do is identify a set of sites
44:43that are under selection using MASS-PRF
44:46that are beyond what people had hypothesized
44:48as positive selection sites in the past.
44:50So there's a paper by Westgeest et al. 2012
44:53which is essentially the gold standard for this
44:55and they found a bunch of sites that are all
44:58these circled sites in gray. MASS-PRF
45:00also found those. The orange diagram here
45:03is the MASS-PRF for this gene.
45:09And it also identified other sites
45:10that are under selection as well.
45:14And we're in the process of understanding
45:16better how those can be validated.
45:17But the ultimate point is that
45:20these are important selected sites that may be relevant
45:25to the design of vaccines for influenza.
45:28So similarly, we'd like to illuminate
45:31which sites might be changing rapidly
45:34and under positive selection in Coronavirus,
45:37not only during the human epidemic,
45:39but again during the zoonotic time period.
45:41And so now we're finally coming to the final
45:43part of my talk, which is what we're doing
45:46in terms of the model-averaged estimation of molecular evolution
45:48and natural selection in SARS coronavirus 1
45:51and SARS coronavirus 2
45:53coronaviruses during zoonosis.
45:53But the whole point here is really to
45:56explain to you what I've done, because the results I have,
45:57as I said, are just a few plots of some of the
46:01strongest selection we were able to detect,
46:03because we have to process through a lot more data
46:05before we get a more in depth look at the lesser
46:07selected sites that are on these genes.
46:10And so we looked at this for coronavirus.
46:13This is just a Coronavirus, Getty image that Yale
46:17has used looking at Coronavirus.
46:21And again, as I mentioned,
46:23we're looking at these two sites of where COVID-19
46:26emergence occurred, and where SARS emergence occurred.
46:30And the question is, are there changes
46:33that happen there that are specifically
46:34responsible perhaps for those zoonosis and the only results
46:38I have are just a few results again, highlighting some of
46:40the strongest selection we saw.
46:42This is actually a diagram of the spike
46:44protein which if you've heard much about COVID-19
46:47molecular biology, you probably have heard about the spike
46:50protein, it's what sticks out from the virus.
46:52It's what grabs onto the ACE2 receptor,
46:56and essentially is what most vaccines
46:58that one might design for the virus would target.
47:01And the point is that the receptor binding
47:04domain, which has gotten a lot of press already turns out
47:07to have the selected sites.
47:08You can see them here, here, here and here.
47:12These are sites that are selected,
47:13meaning they're changing rapidly
47:13during the pre zoonotic phase.
47:17So these are sites that are changing, not in humans,
47:20but in the bats in the pangolins.
47:22And whatever other animals that this virus
47:25is spreading among, or has been spreading among
47:27before the zoonosis to humans.
47:29So then the question is, are similar sites under
47:30selection during zoonosis?
47:31And during post zoonosis?
47:36And the answer right now is yes,
47:38it seems kind of similar,
47:39although we don't get the same sites.
47:40So we have to do a little bit
47:42more molecular, you know, staring at this and understanding
47:44it because these results are literally
47:46I got these results today, actually.
47:48So we have to sort of do more of this
47:51and we can actually look in more depth
47:54and get more sites with other approaches
47:55that we haven't implemented at this moment.
47:57But during near zoonosis what you see is again,
47:58the selected sites which are in bright red
48:06are also on the sort of the visible side
48:08of the receptor binding domain
48:13of the spike protein, which is the tip
48:17the outside portion of this gene.
48:23Lastly, if we look post-zoonosis, that's in
48:24the evolution in humans, we again see that
48:26the selected sites are sites that are at this tip region.
48:33Again, none of this is terribly surprising.
48:35The interesting thing is that it kind of indicates
48:36that the zoonosis it kind of indicates consistency.
48:38Again, there's a lot more to do before
48:40we can conclude anything like this,
48:42but the idea we have right now indicates
48:44a good deal of consistency between the selection
48:46that's ongoing in humans, during zoonosis, and pre-zoonosis.
48:51And what that implies is that this may
48:54well have been as I said, very briefly,
48:56during this talk an instance where there's a virus
49:00just circulating around in bats and pangolins
49:01that could have caused this disease at any time,
49:04it's just a matter of whether or not we actually
49:07have exposure to, to those organisms
49:11that allows the transmission to happen.
49:14Consistent with this, I'll just mention
49:17a couple like verbal points,
49:18which is that all the evidence that we have indicates
49:20that this virus spread extremely quickly
49:23from the moment of its zoonosis into humans.
49:26And in fact, in most cases of zoonosis,
49:28we find that that's true,
49:31which is somewhat counterintuitive.
49:33Obviously, it hasn't adapted to humans,
49:34it has adapted to the mammalian immune system.
49:37And so to the extent that our immune system is not
49:39tremendously different from that of bats or pangolins,
49:41it may be not surprising that it can infect us.
49:44But one of the things that is true is that
49:47if it did not spread very quickly,
49:48very easily from the very moment it transmitted to someone,
49:51it would probably lead to a dead end.
49:52In other words, if you don't have
49:55an ability to transmit and spread just from the get go,
49:57the first person who gets infected
50:00is very unlikely to transmit it to someone else.
50:02So it sort of has to be well pre adapted
50:04for a zoonotic event to actually spread in humans.
50:07Now there's, we need more zoonotic events,
50:11God forbid that it actually happens,
50:13to really get a better picture of that.
50:15But the general result and the scientific
50:16literature does seem to show that when zoonosis happens,
50:18the disease is already well set to cause problems.
50:22And the examples that we do have where
50:24it doesn't happen like that, like MERS
50:27or like, well, MERS is a good example.
50:30It's a really deadly disease,
50:31but it doesn't transmit well among humans.
50:32And so that's an example where maybe it's transmitting
50:35to humans, but it's not transmitting among humans.
50:37And it's very hard for that disease
50:40to catch on within the human population
50:43and do human transmission as opposed to zoonotic events.
50:45And that's because it doesn't transmit
50:47and it doesn't usually evolve that ability
50:48to transmit over the short time that
50:51that individuals might get infected.
50:53when they get it, usually from camels.
50:57Okay, so I've showed you those examples.
50:59I just wanna to mention what else we're gonna be doing.
51:02So what I just showed you was actually
51:05some sites
51:06that are under selection in SARS
51:08coronavirus 2 genes.
51:10This is the S gene right here.
51:12That's the spike gene.
51:13We're gonna be looking at that in SARS coronavirus
51:15one and two. We're also going to be looking
51:18at other genes in the genomes.
51:22These have other functions.
51:23The M gene, for instance, is a membrane gene.
51:26So it might be relevant to and the gene
51:28as well might be relevant to vaccine generation.
51:32Like if we could generate a vaccine that targeted
51:35those, maybe they would be unable to change at the same
51:41pace that spike protein would they might be more conserved.
51:44And that might be one approach towards developing a vaccine.
51:46That would be a longer term vaccine because one thing we
51:49have to worry about, of course with this Coronavirus,
51:53is and I have other research that we're doing on
51:55this question, which I'd love to talk about if anyone's
51:57curious, but you can estimate
51:59what the actual waning immunity of it is,
52:00even though we don't have data on that by Looking
52:03at other related species and using the phylogeny
52:05to understand how the how the waning immunity
52:08has evolved across the species
52:09and what the projected or most likely
52:12waning immunity of SARS coronavirus is,
52:15and it's, it tends to be it looks like
52:16it's around 80 weeks or so.
52:18So if we get about 80 weeks of a waning period
52:21of immunity from this, that's not that
52:22much in terms of every two years or so we're gonna have
52:25Coronavirus coming around and in terms of we're going to
52:28be susceptible again to Coronavirus.
52:30Not that we're going to get it every two years.
52:33And what that would mean is that
52:36it's likely to persist as a circulating virus.
52:38And if it remains as deadly as it is that's a serious issue.
52:40So we're gonna really want to have a vaccine.
52:42And we're not necessarily going to wanna have another flu
52:44vaccine that we have to get every year.
52:49So what we really want to do is target
52:51some genes that may be under more constraint
52:53than the receptor binding protein gene, the spike gene.
52:57So anyway, the point is that looking at multiple genes and
53:00trying to understand where conserved regions are and where
53:03regions that are under selection are is important.
53:05And we'll be doing that.
53:07And hopefully some of those results will
53:11help to guide the kind of generation of vaccines,
53:15and also the generation of therapeutics,
53:16because sites that are under
53:19selection are functional.
53:20So if you actually design a therapeutic
53:21that interferes with the sites that are under selection
53:22sort of in an opposite way, from vaccines, vaccines,
53:25we really want to target something that just doesn't change.
53:26With therapeutics, we may want to target
53:27the changing regions, if we can design something
53:30that generically does, because those changing
53:31regions are functional.
53:32In other words, those sites at the end of the spike protein
53:33are clearly ones that do bind the ACE2 receptor.
53:35It's just that they're flexible
53:38about what they are in order to bind it.
53:42So we need to include
53:43all of those changing sites, if we wanna develop
53:46a therapeutic that, for instance, would somehow interfere
53:50with the binding of ACE2 receptors by the spike protein.
53:53So thank you very much for listening to the ongoing work
53:56we're doing on COVID-19.
53:59I would love to entertain any questions that you have.
54:03Let me just take one moment to acknowledge
54:05some of the people that I should acknowledge in this work,
54:09I already showed you a picture of John John earlier,
54:11the picture associated with the MACML approach
54:13that we developed many years ago, 10 years ago basically.
54:15Yinfei Wu has been taking the lead on this project.
54:18She's a master's student.
54:19Yano os Wang was an assistant was in visiting
54:22Assistant Professor Stephen Gaugham,
54:24is in the EEB department
54:26has been helping out with this analysis.
54:28Haley Hassler is in my lab, has been helping out
54:30with phylogenetics Jayveer Singh is an undergrad
54:32who's been doing some of the research work
54:35some of the actually literature research
54:37that has helped us to contextualize
54:39the work we're doing Mofeed Najib
54:41produced those diagrams of the spike protein
54:44with the sites that we have identified
54:46as under selection so far.
54:48Zheng Wang is a long-term collaborator of mine who works
54:54on nearly all the phylogenetic projects
54:56that I do.
54:59And then Alex Dornburg is a long-term collaborator of mine,
55:02now in North Carolina.
55:06He's currently at the North Carolina
55:08Museum of Natural Sciences, and he works on a lot of phylogenetic
55:11projects with me as well.
55:13And by the way, this work, fortunately,
55:16was recently awarded one of the NSF RAPID grants
55:19to do this research.
55:20So we're very pleased to have funding to
55:22continue to work on this as time goes on, which is good
55:25because it's taking quite a lot of work
55:27to do the sequence wrangling
55:29and the analyses themselves.
55:30As I mentioned, they're computationally intensive.
55:32So Alex and I are the PIs on that particular
55:36grant from the NSF.
55:37So we're excited to continue to do that work.
55:41And with that, I think I would
55:42like to entertain any questions you might have.
55:45- Thank you, Jeff, this was great.
55:48I'm sure we have a lot of questions.
55:49Who goes first?
55:54Again, you can type your questions in the
55:59chat box or just unmute.
56:13- I have a quick question.
56:14- Okay.
56:16- You mentioned or you touched a bit on this before,
56:20but how would this compare to site-wise estimates
56:24of omega that you would get from PAML
56:28or a similar program?
56:29- So I'm sorry, I was sort of rushing at the end.
56:32I didn't explain that, in fact, I'm using PAML for some of this.
56:35So I'm using PAML
56:36for the pre-zoonosis analysis and for the post-zoonosis
56:40analysis, because, as I mentioned during the talk,
56:44if you have a large phylogeny
56:46with multiple branches, et cetera, et cetera,
56:49where you can look over that entire phylogeny, then you
56:51can get multiple changes at individual sites,
56:53which is what PAML actually uses to infer selection, right?
56:55You have to have the site change not just once
56:57but twice or three times.
57:02And then it says, oh, that's under selection because
57:07it keeps changing again and again and again.
57:12So PAML allows you to do that
57:13if you have this sort of deep time
57:15or large amount of time and multiple lineages that you're
57:17looking at. The MASS-PRF approach that I'm using enables
57:19you to do that on just a single lineage without needing
57:22multiple changes. I mean, multiple changes
57:23on a single lineage you can't even detect,
57:25because it just looks like one change.
57:26If you have the ancestral sequence, which is what we do,
57:28ancestral state estimation, to get the ancestral sequence,
57:31and if you have the descendant sequence, and an A changes
57:33to T, you don't know if it changed A to G to C to T
57:35or if it just changed A to T. You have no idea; you can
57:36just say it changed once.
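
A minimal sketch of the multiple-hit problem described here, using made-up substitution histories for one site. The function name `observed_changes` and both paths are illustrative assumptions, not part of MASS-PRF or PAML. It only shows that comparing an ancestral state with a descendant state on a single lineage reports one change no matter how many substitutions actually happened in between.

```python
def observed_changes(ancestor, descendant):
    """Count aligned sites that differ between ancestor and descendant."""
    return sum(a != d for a, d in zip(ancestor, descendant))


# Two hypothetical histories for the same site along one lineage.
direct_path = ["A", "T"]               # one substitution
multi_hit_path = ["A", "G", "C", "T"]  # three substitutions

for path in (direct_path, multi_hit_path):
    true_changes = len(path) - 1
    seen = observed_changes(path[0], path[-1])
    print(f"true substitutions: {true_changes}, observed differences: {seen}")
```

Both histories start at A and end at T, so the endpoint comparison cannot tell them apart. That is why repeated change at a site is only visible when several lineages are compared across a phylogeny.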
57:38And so there's no real way to run PAML.
57:40I mean, there is a way, but it's a statistically
57:41really underpowered, terrible thing
57:42to do, to try to run PAML on a single lineage
57:44and figure out whether something's under selection.
57:47The advantage of this approach is that it
57:49can use that polymorphism data, the data on what's
57:51just circulating within populations, as a metric for how
57:54much mutation is occurring.
57:56You can essentially divide out by that.
57:59And then again, because we're integrating over all
58:04these models of how these things change, we're essentially
58:07borrowing information from neighboring sites for what their
58:10rates of change are, et cetera, et cetera,
58:13to estimate what the possible amount
58:15of selection is on all these sites.
58:16So by using the polymorphism data, and by doing this model
58:19averaging approach, we're actually able
58:21to take individual lineages and estimate
58:23the selection on them.
58:25And that's what we're doing in the near-zoonosis analysis
58:29that I showed you in the middle here.
58:33So there are different ways of doing the analysis.
58:35And it's necessitated by the fact that we just have this
58:37one lineage, and there's no way it won't be a single lineage
58:39in any dataset we look at, because for zoonosis
58:42we're going to have human sequences,
58:44we're gonna have some animal sequences,
58:45but we're not going
58:48to have any information about the actual zoonosis.
58:50Even if we knew the first human,
58:52we could just take that as an estimate.
58:54We still probably need some data here.
58:56Maybe you could have the first human
58:58and the first animal that they got it from.
59:00That just doesn't exist.
59:01We don't have that data for any zoonosis.
59:04How would we? We would never be there at the moment.
59:07So we have to assume that there's a number
59:09of transmissions among humans
59:10and a number of transmissions among animals
59:13during that near-zoonotic period.
59:14And it's just a single lineage.
59:16So we can't really run PAML on that,
59:19in summary, because PAML requires multiple
59:21changes and multiple lineages to have power
59:23to actually infer evolutionary change.
59:25MASS-PRF, fortunately, can do that,
59:27because it can look at single lineages.
59:28You can use MK tests as well on single lineages; the MK test
59:33is basically designed to look at single lineages.
59:36But the problem with MK tests, as I mentioned,
59:37is that they're assuming the entire
59:39gene is under selection, which means they don't give you
59:41the scope for understanding whether the receptor
59:44binding sites of the gene are under selection, or something like that.
59:46They often will just give you a result that the gene's not under
59:47selection, which is not true.
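
A rough sketch of why a gene-wide MK table can hide a small selected region. All counts below are invented for illustration, and `mk_summary` is a hypothetical helper, not code from the talk. It computes the usual neutrality index, (Pn/Ps)/(Dn/Ds), and alpha = 1 - NI, first for a short region with excess replacement divergence, then for the constrained remainder of the gene, then for the two pooled as a single gene.

```python
def mk_summary(dn, ds, pn, ps):
    """Neutrality index and alpha from one McDonald-Kreitman table.

    dn, ds: replacement and silent fixed differences (divergence)
    pn, ps: replacement and silent polymorphisms
    """
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, 1 - neutrality_index


# Hypothetical counts: a short region with many replacement substitutions,
# and the rest of the gene, where replacements are rare.
selected_region = dict(dn=10, ds=5, pn=2, ps=5)
constrained_rest = dict(dn=2, ds=40, pn=8, ps=40)
whole_gene = {k: selected_region[k] + constrained_rest[k] for k in selected_region}

for name, counts in [
    ("selected region", selected_region),
    ("constrained rest", constrained_rest),
    ("whole gene", whole_gene),
]:
    ni, alpha = mk_summary(**counts)
    print(f"{name}: NI = {ni:.2f}, alpha = {alpha:.2f}")
```

With these made-up numbers the selected region has a neutrality index well below 1, the constrained remainder has one well above 1, and the pooled gene lands close to 1, which is the kind of gene-wide result that reads as "not under selection".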
59:51- Does that answer your question?
59:54- Yes.
59:55- Great.
60:00- Any other questions?
01:00:04- I have one more if no one else wants to.
01:00:05- Sure, go ahead.
01:00:07- So in B cells, we have mechanisms
01:00:10of mutation that are specifically
01:00:13biased towards replacement mutations.
01:00:17So in the absence of selection,
01:00:18the mutation mechanisms actually cause
01:00:21an omega greater than one.
01:00:24Would this have any way of correcting for that?
01:00:28- So the tricky part is, and I don't know how it might,
01:00:31the tricky part is not so much running the software,
01:00:33which you could certainly do on that.
01:00:37The tricky part would be identifying
01:00:39what polymorphism is in the case of those cells.
01:00:43So if you could identify sets of cells that are undergoing
01:00:47the mutation but aren't under selection in some way, then
01:00:51you could use that as the proxy for what we use here,
01:00:54which is within-population polymorphism,
01:00:57and then estimate that.
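
One way to picture the proxy idea, as a sketch only: if an unselected set of cells gives a baseline ratio of replacement to silent mutations, the observed ratio can be divided by that baseline so a replacement-biased mutation process is not mistaken for selection. The function `baseline_corrected_ratio` and all counts are hypothetical. This is a simple ratio of ratios, not the MASS-PRF calculation.

```python
def baseline_corrected_ratio(obs_repl, obs_silent, base_repl, base_silent):
    """Observed replacement/silent ratio divided by the baseline ratio.

    The baseline counts are assumed to come from cells that mutate by the
    same mechanism but are not under selection.
    """
    observed = obs_repl / obs_silent
    baseline = base_repl / base_silent
    return observed / baseline


# Hypothetical counts: the mutation process alone yields 3 replacements
# per silent change, so a raw ratio of 3 is what no selection looks like.
print(baseline_corrected_ratio(30, 10, 30, 10))
print(baseline_corrected_ratio(60, 10, 30, 10))
```

A corrected value near 1 means the observed excess of replacements is what the mutation bias alone would produce. Values above 1 would point to something beyond that bias.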
01:00:59I just don't know whether you have a way of
01:01:01doing that. If you want to discuss
01:01:03it with me, we could.
01:01:05That's sort of always the key for detecting selection.
01:01:09And, you know, many of you may be aware that I work
01:01:11on cancer in some of the work that I do.
01:01:13It's the same
01:01:18problem that I'm working on there all the time: I'm trying
01:01:21to understand what the baseline mutation rates
01:01:23in cancer and somatic evolution of cells are.
01:01:25Because if I understand the baseline rates,
01:01:27how often those things change
01:01:29from just the mutation alone,
01:01:30then I can always estimate selection.
01:01:32And that's the thing we almost always want to
01:01:34know about in the analysis of sequence data.
01:01:37So, again, it's all about figuring out if there's some piece
01:01:42of the data that can be used to estimate that polymorphism.
01:01:46And the benefit of an approach like
01:01:48this would be, you know, maybe you can estimate that for
01:01:50some portions of the gene but not others; maybe
01:01:52then there's a way that you could use this sort of model
01:01:54averaging approach to get at the underlying rate at which it's
01:01:56happening, even if you can't estimate it
01:01:58for that particular site, for instance.
01:02:00So I think there might be potential to do it,
01:02:02but it just depends, you know, on whether
01:02:04there's a critical, you know, set of data in what you're
01:02:09looking at, which I haven't spent much time
01:02:12looking at back in the day.
01:02:13So I wouldn't know whether there's some way
01:02:15of getting that baseline polymorphism or baseline
01:02:19mutation rate, which essentially amounts to the same thing.
01:02:23It just depends on whether, you know, you're assuming the
01:02:26population sort of has, you know,
01:02:29it's just whether you're looking at a population level,
01:02:31or you have some sort of covariance matrix
01:02:34to better understand the mutation rates themselves.
01:02:36- I think there is a similar population of B cells.
01:02:38- Great, so I encourage you to look into that.
01:02:44- Jeff, I have a quick question.
01:02:47I'm not too familiar with genome sequencing.
01:02:50But I think the clustering problem,
01:02:53the issue and the solution you have,
01:02:55can be applied to many types of data.
01:02:58So I'm kind of confused.
01:02:59So in the diagram where you describe
01:03:02the different steps, you said that you first pick the most
01:03:06likely cluster and then you essentially
01:03:07keep splitting the clusters, right?
01:03:09How do you get the first clusters? Like,
01:03:12is there some randomness in how you split the first?
01:03:16- Oh, sorry, I apologize.
01:03:19I didn't explain it in enough detail.
01:03:22The reason why it's so computationally intensive
01:03:24is that we look at all possible models,
01:03:27all possible, exhaustively.
01:03:29Now, I actually spent a year of my life trying
01:03:31to find a way to develop a Bayesian approach
01:03:34or some approach that would allow me
01:03:38to not look at all possible models,
01:03:40because if you could do that,
01:03:41this would be a great way of doing tons of different things
01:03:45on very large data sets, right?
01:03:47And what amazed me is, I found that
01:03:50it was just an impenetrable problem.
01:03:53If I didn't look at every possible model,
01:03:56I could not get it to work. I can't prove that
01:04:00that's true; like, I don't have any proof that's true.
01:04:04And I would encourage anyone who really wants to dive
01:04:05in there to go ahead.
01:04:06But I'll warn you that I spent a year
01:04:07banging my head against that problem.
01:04:09And when I didn't
01:04:10exhaustively search all the models, I always
01:04:12caused these biases; there was no good way to sample them.
01:04:16I even have ways of sampling the models
01:04:17according to their probability.
01:04:24But even that causes a bias, because sometimes
01:04:31there's a large number of models.
01:04:31So if you think
01:04:34about the set of models, it's a very large set of models.
01:04:35And there isn't actually a huge
01:04:38difference in likelihood between these models.
01:04:42That's the thing.
01:04:45So when you don't exhaustively sample the models,
01:04:49if you just sample some of the most likely models,
01:04:53you actually are sampling just
01:04:56one corner of the space.
01:04:57And it's possible for a bunch of
01:04:59not quite so likely, but reasonable, models
01:05:00that are not in that corner to actually be
01:05:03highly influential on the model average.
01:05:04And so the bottom line is that sampling
01:05:05by trying to pick in the, you know, most likely space doesn't
01:05:06work, and sampling by picking randomly doesn't work.
01:05:07And I could go into more detail about it.
01:05:09But it turned out that I couldn't do it
01:05:10any way other than exhaustive sampling.
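
The contrast between exhaustive averaging and keeping only the most likely models can be sketched on a toy sequence. This is an illustration under stated assumptions, not the MACML or MASS-PRF implementation: a "model" here is any way of cutting a 0/1 event sequence into contiguous segments, each segment gets its maximum-likelihood Bernoulli rate, and models are weighted by AIC with one parameter per rate and one per breakpoint. The function names and the toy data are made up.

```python
from itertools import combinations
from math import exp, log


def segment_fit(events):
    """Return (log-likelihood, rate) for one segment at its ML Bernoulli rate."""
    n, k = len(events), sum(events)
    p = k / n
    if k == 0 or k == n:
        return 0.0, p  # the likelihood is 1 when the rate is exactly 0 or 1
    return k * log(p) + (n - k) * log(1 - p), p


def all_models(seq):
    """Fit every way of cutting seq into contiguous segments.

    Returns a list of (AIC, per-site rates), one entry per model.
    """
    n = len(seq)
    fits = []
    for n_breaks in range(n):
        for breaks in combinations(range(1, n), n_breaks):
            bounds = (0, *breaks, n)
            loglik, rates = 0.0, []
            for start, end in zip(bounds, bounds[1:]):
                ll, p = segment_fit(seq[start:end])
                loglik += ll
                rates.extend([p] * (end - start))
            n_params = 2 * n_breaks + 1  # assumed: segment rates plus breakpoints
            fits.append((2 * n_params - 2 * loglik, rates))
    return fits


def averaged_rates(fits):
    """Akaike-weighted average of per-site rates over the given models."""
    best = min(aic for aic, _ in fits)
    weights = [exp(-(aic - best) / 2) for aic, _ in fits]
    total = sum(weights)
    n_sites = len(fits[0][1])
    return [
        sum(w * rates[i] for w, (_, rates) in zip(weights, fits)) / total
        for i in range(n_sites)
    ]


def top_model_weight_share(fits, k):
    """Fraction of the total Akaike weight held by the k best-scoring models."""
    best = min(aic for aic, _ in fits)
    weights = sorted((exp(-(aic - best) / 2) for aic, _ in fits), reverse=True)
    return sum(weights[:k]) / sum(weights)


# Toy data: 1 marks a site with an event (say, a replacement change).
toy = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
fits = all_models(toy)
exhaustive = averaged_rates(fits)
truncated = averaged_rates(sorted(fits, key=lambda f: f[0])[:5])
print(len(fits), "models")
print("exhaustive:", [round(r, 2) for r in exhaustive])
print("top 5 only:", [round(r, 2) for r in truncated])
print("weight share of top 5:", round(top_model_weight_share(fits, 5), 2))
```

Even for 10 sites there are 512 such models, and the count doubles with each added site under this toy definition. Comparing the two printed profiles, together with the share of weight held by the top five models, shows how much of the average comes from the many less likely models that truncation discards.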
01:05:12So, sorry, I misspoke; that was a mistake.
01:05:14I couldn't do it by any biased approach,
01:05:16as opposed to the exhaustive sampling in
01:05:18the approach that I'm showing you right here.
01:05:21Actually, there are two ways of doing it.
01:05:22One is to sample stochastically,
01:05:23according to likelihood, and the other is to sample exactly
01:05:27across all models, exhaustive sampling. Sampling stochastically works.
01:05:30In fact, it's implemented in the approach that I
01:05:33was just showing. I'm sorry, I just sort of jumped too fast
01:05:35in what I was saying.
01:05:37So sampling stochastically works
01:05:38and sampling exhaustively works. Sampling stochastically is
01:05:40still very computationally intensive.
01:05:42But I couldn't
01:05:44find any way to sort of, you know, importance sample or do
01:05:48some sort of approach that would allow me to get a smaller
01:05:50set of models. If we could do that,
01:05:53that could be really important,
01:05:55because then you could do this
01:05:57on more than, like, 2,000 sites.
01:05:59It's somewhere around 2,000 sites
01:06:00that you start running into real problems with
01:06:04just too much computation time
01:06:06to make it worthwhile.
01:06:07So we could extend this to 10,000, 100,000, you know,
01:06:11potentially really, really large numbers of sites,
01:06:13and really, really sparse sets of sites.
01:06:16If only we could find a way
01:06:19to bias the sampling towards models that are more likely
01:06:24without causing biases in the results.
01:06:26I couldn't find any way to do that.
01:06:27- This seems very much related to tree-based
01:06:30methods, where essentially you, like, split the space
01:06:36and then you average over the models,
01:06:39like the random forest, for example.
01:06:41It is very much related to that, right?
01:06:45- Yeah, I have to say I'm now familiar
01:06:47with those approaches.
01:06:49But when I was completely unfamiliar with them, yeah, I sort
01:06:52of thought about it that way.
01:06:54But you're absolutely right.
01:06:56Yeah, I guess the difference is that here
01:06:57you have a sequence, like one sequence;
01:07:00there you have a space.
01:07:01So you just split in
01:07:02different dimensions, but it is really good.
01:07:05- And I can mention, just to speculate,
01:07:10I'm kind of interested in a number of
01:07:13other ways of applying this.
01:07:15So for instance, the one I've been thinking about
01:07:18and actually worked on a little
01:07:20bit, but haven't gotten very far with, is
01:07:21when you're dealing with event spaces over time.
01:07:22If you have days, and you have individuals, which is
01:07:24prominent for us in public health,
01:07:27individuals who are undergoing events,
01:07:29you end up with a very sparse matrix of events.
01:07:31And so we use these approaches like survival plots,
01:07:38all these approaches that we use to sort of understand
01:07:40how these rare events are happening
01:07:42and how people are changing over this.
01:07:44That event space is actually really sparse.
01:07:45But it's kind of a matrix.
01:07:47And you could do this in two dimensions,
01:07:48not just one, right?
01:07:49So you could model average across two dimensions,
01:07:52and then you could get something.
01:07:53The thing that really appeals to me about that is that,
01:07:55again, this approach
01:08:00builds up, from just this binomial event or
01:08:04no-event stuff, a picture that's very continuous
01:08:09over the space and involves no assumptions
01:08:11about distribution whatsoever.
01:08:12So I'm just wondering if there aren't instances
01:08:14where, you know, we could come up
01:08:17with a better understanding of what's going on
01:08:19with individuals in a matrix such as
01:08:20that by using this approach.
01:08:22And it's an approach
01:08:23that still works even with these sparse spaces, because
01:08:26you can model average over this tremendously large number
01:08:29of models that all have fairly
01:08:33equal likelihood to get a result.
01:08:35So I don't know, that's just sort of a
01:08:37speculation that there might be some interesting
01:08:38ways to approach those problems using this kind
01:08:41of model averaging technique.
01:08:46- Great, I think we should wrap up.
01:08:49Thank you, Jeff, for this great presentation.
01:08:52And thank you all for joining today.
01:08:57The next seminar
01:08:58is gonna be, I think, July 14.
01:09:01So we'll send out invites.
01:09:05All right, thank you, Jeff.
01:09:07Thank you all, bye, bye.