# BIS Seminar - 6.23.2020 - Model-averaged estimation of molecular evolution and natural selection in SARS-CoV-1 and SARS-CoV-2 coronaviruses during zoonosis

June 23, 2020

## Information

Jeffrey Townsend, PhD

Elihu Professor of Biostatistics and Professor of Ecology and Evolutionary Biology

ID5349


- 00:19- All right, I see more people joining
- 00:32Jeff, how long do you have, like an hour?
- 00:36Less than that?
- 00:36- I think I can probably finish in less than an hour.
- 00:40- Less than an hour, all right.
- 00:58I think we should get started.
- 01:02So hi, everyone.
- 01:03Welcome to our seminar series on COVID-19,
- 01:07organized by the Department of Biostatistics.
- 01:10I'm very pleased to have here today Jeff Townsend,
- 01:15Professor of Biostatistics and of Ecology and Evolutionary Biology
- 01:20from the Yale School of Public Health.
- 01:23Thank you, Jeff, for being here today with us.
- 01:27As usual, you're welcome to write questions
- 01:30in the chat box, or even unmute yourself if you can
- 01:35and other people are not talking.
- 01:38And, Jeff, why don't you take it from here?
- 01:42- Okay, thank you very much for the introduction, Laura.
- 01:45I'm really pleased to have an opportunity to talk
- 01:46about the work that we've been doing.
- 01:49I think like many speakers in this series, you know,
- 01:52we've been working very hard
- 01:54over a short period to try to make some progress on COVID-19.
- 01:58Ironically, this is the first work
- 01:59I think that I started in response to the COVID-19 epidemic
- 02:03and it's turned out to be a lot of work.
- 02:07So it's actually gotten the least far.
- 02:11So we've done a little bit of work, for instance,
- 02:13on epidemic modeling of COVID-19.
- 02:15That's already, it's actually been submitted,
- 02:18I actually have some other work on quarantine
- 02:20and stuff that turns out to be really interesting
- 02:24and far along in the research.
- 02:26And then this work, which I started early on,
- 02:28which is more evolutionary, and looking at the zoonotic
- 02:31process has gone a little bit slower.
- 02:32So what that means is consistent with
- 02:35many other speakers in this series,
- 02:35I'm gonna be talking a lot about
- 02:38the methods that we're going to be using,
- 02:40which are well developed, and what we're planning to do,
- 02:43I don't have a lot of results.
- 02:44But I think that's consistent with these talks in general.
- 02:47So hopefully, that will be of interest to you
- 02:49and also be illuminating in terms
- 02:53of possible research approaches towards this kind of work.
- 02:58So as Laura mentioned,
- 03:00I use a lot of evolutionary approaches
- 03:02to do my analyses of things.
- 03:04And the title of this talk is model averaged estimation
- 03:08of molecular evolution and natural selection
- 03:12in SARS coronavirus 1 and SARS coronavirus 2,
- 03:14two coronaviruses, during the zoonotic period.
- 03:18So what was attracting my interest in this particular case
- 03:21is that it's usually very difficult and challenging
- 03:25(and I'll get to this later in the talk) to figure
- 03:27out what's going on during the zoonotic period,
- 03:29because you don't usually get much sampling there.
- 03:32So, what I wanted to do was apply some techniques
- 03:35that I've developed to this problem.
- 03:38And I will get to those techniques
- 03:41and the application to this problem.
- 03:43But I first just wanna give a little bit of introduction,
- 03:46I think, maybe from a statistics point of view
- 03:47towards some of the methodologies that we're using,
- 03:49just so everyone can sort of get on board and see
- 03:51at least how I see this as contributing
- 03:55to interesting statistical questions.
- 03:57So, in a broad sense, if I can get this to move forward.
- 04:00Here we go.
- 04:01I think one of the most intriguing
- 04:03and interesting and challenging areas of mathematics
- 04:05and statistics is understanding this border
- 04:08between the discrete and the continuous.
- 04:09So just one particular
- 04:13example you can pick out is, if you look at discrete
- 04:16and continuous distributions that are frequently
- 04:19in use in statistical probabilistic analyses,
- 04:21we have the geometric and negative binomial distributions.
- 04:25And we have the exponential and gamma distributions.
- 04:30The first two essentially describe waiting for a discrete event
- 04:32when you have a probability over time,
- 04:33or waiting for the rth event if you
- 04:35have a probability over time,
- 04:37and they correspond to the distributions in continuous
- 04:39time for the wait for the first event
- 04:42or the wait for the alpha-th event.
- 04:45So there's a real clear correspondence
- 04:46between these two distributions.
- 04:48And you can actually see in the mathematics,
- 04:50how they're similar as well.
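
As a sketch of that correspondence, here are the standard textbook forms of the four distributions. The symbols $p$ (per-trial probability), $r$ (event count), $\lambda$ (rate), and $\alpha$ (shape) are notation introduced here for illustration; they are not taken from the slides.

```latex
\begin{aligned}
\text{Geometric (wait for the 1st event):}\quad
  & P(K = k) = (1-p)^{k-1}\,p, && k = 1, 2, \dots \\
\text{Negative binomial (wait for the $r$th event):}\quad
  & P(K = k) = \binom{k-1}{r-1}\,p^{r}(1-p)^{k-r}, && k = r, r+1, \dots \\
\text{Exponential (wait for the 1st event):}\quad
  & f(t) = \lambda e^{-\lambda t}, && t \ge 0 \\
\text{Gamma (wait for the $\alpha$th event):}\quad
  & f(t) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\,t^{\alpha-1}e^{-\lambda t}, && t \ge 0
\end{aligned}
```

Setting $p = \lambda\,\Delta t$ and $k = t/\Delta t$ and letting $\Delta t \to 0$ gives $(1-p)^{k} \to e^{-\lambda t}$, which is one way to see the discrete waiting time turn into the continuous one.
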
- 04:53And that correspondence is kind of interesting.
- 04:54And the reason why I say it's interesting is
- 04:56because often many of the biggest problems I think
- 04:59we wrestle with in statistics are when we're trying
- 05:01to deal with data that is at some intermediate
- 05:04level between continuous and discrete,
- 05:07and where we're trying to figure out which
- 05:08approach is the best to use: should we use some
- 05:11sort of parameterized distribution to address it?
- 05:13Or should we use some sort of nonparametric
- 05:17approach based on the discrete?
- 05:18I'm not sure in any particular case.
- 05:19But I just wanna mention
- 05:21that I think that's a very interesting area.
- 05:22And the technique I'm gonna tell you about
- 05:23is definitely wrestling with exactly this kind of question.
- 05:27So what kind of question do I mean?
- 05:29Well, I mean, questions that deal with state spaces,
- 05:32over time, or over any discrete or continuous axis.
- 05:36And this diagram just gives you a picture
- 05:40of the kinds of problems that one deals with
- 05:43between discrete and continuous measures.
- 05:45Here the axis is depicted as time.
- 05:48You could have a discrete state space,
- 05:51a state space you're measuring over time;
- 05:53you could have a continuous... sorry,
- 05:56you could have discrete measurements,
- 05:59where you've got discrete time
- 06:01in a discrete state space,
- 06:03you could also have discrete time
- 06:06and a continuous state space.
- 06:08You can have continuous, continuous
- 06:12or you can have discrete, continuous.
- 06:13And these two on the bottom, the two on the left,
- 06:15sorry, are the relevant ones for
- 06:17what I wanna talk to you about.
- 06:19In my research, which is largely focused
- 06:22on bioinformatic data that we can obtain from sequencing
- 06:26or other approaches like that.
- 06:28A lot of what we're trying to do is look at these discrete
- 06:30linear sequences that have sites, DNA sites or amino acid
- 06:34sites, and try to understand whether there is some
- 06:37pattern in those sites that allows us to understand
- 06:40something about the biology of the organism
- 06:41or the biology that we want to know something more about?
- 06:45So what essentially I'm gonna be doing
- 06:48is telling you about an approach
- 06:50that takes essentially discrete items over some X axis
- 06:54here, which in my case is always going to be
- 06:56sequence space, like the nucleotides
- 06:58or the amino acids of a sequence,
- 07:01and turns it into these kinds of more discrete models.
- 07:04And then a procedure that I'm going to tell you
- 07:07about actually gives us more of a continuous measure
- 07:10over that space, it's not completely continuous,
- 07:13it actually is on every site.
- 07:14But when you work with hundreds of sites,
- 07:17it turns out to look very continuous
- 07:20in terms of how it appears.
- 07:22But it's done with a discrete model
- 07:23that looks over multiple sites.
- 07:24So well, I'll tell you how it works in a moment.
- 07:26And I hope it's of interest to you guys.
- 07:28So just to introduce that, in general,
- 07:31the lab has worked on a lot of different kinds of data,
- 07:34and including things like gene expression data
- 07:36that borders this discrete continuous measurement.
- 07:39The old microarrays we used to use gave us
- 07:43essentially continuous measures of gene expression.
- 07:44Now we get discrete counts
- 07:46from our census sequencing approaches.
- 07:49Then all the sequence data we work with
- 07:51often ends up being essentially clusters
- 07:53of sites of various kinds.
- 07:56And then we also use a lot of phylogenetic inference,
- 07:59which is another kind of just discrete modeling
- 08:01in terms of the topology, but it sits on the border
- 08:03between these two, because we have discrete modeling of the
- 08:07topology: there are certain topologies
- 08:10of the taxa that we're interested in looking at
- 08:12that show their relationship to each other.
- 08:13At the same time, there's also a continuous
- 08:15measure out of that, which is these branch lengths,
- 08:17or how diverged these different taxa
- 08:19are from each other, in constructing the phylogeny.
- 08:22So this sort of border between discrete
- 08:24and continuous measures, always sort of plagues
- 08:28and intrigues me, I guess it would be the question.
- 08:30Okay, so what am I gonna do today?
- 08:32What I wanna do today is talk about
- 08:35maximum likelihood model averaging to profile clustering
- 08:37of site types across discrete linear sequences.
- 08:40So at the very base level,
- 08:41how do we take kind of these discrete sequences
- 08:44of amino acids or nucleotides
- 08:46and understand whether sites are closer to each other
- 08:50or farther apart from each other?
- 08:52That is the question: are they just uniformly
- 08:53distributed site types across a sequence?
- 08:55Are they clustered close together or far apart?
- 08:58Secondly, I'm gonna talk about how we can
- 09:01then use that approach to understand whether sites
- 09:04are under selection in a gene expressed in a sequence.
- 09:07And what I mean by under selection is that,
- 09:09in fact, sites are changing
- 09:12at a more rapid pace than you'd expect simply
- 09:14by mutation alone.
- 09:16So mutation, of course, is going to introduce
- 09:18variation into a genetic sequence.
- 09:19But when you see changes that are happening faster
- 09:21over time in a population
- 09:23than mutation alone would produce,
- 09:26that implies that every time that mutation is happening,
- 09:29it's spreading across the population.
- 09:30And that's why you see that uptick
- 09:31in the rate of change of those sites.
- 09:34So we can actually use this clustering approach
- 09:36to identify regions of the gene that have
- 09:38that sort of uptick and I'll explain how we do that.
- 09:41Now lastly, I'm just going to show you a very few slides
- 09:43on the title of the talk,
- 09:45which is this model-averaged estimation of the molecular
- 09:48evolution and natural selection in SARS Coronavirus one
- 09:51and SARS Coronavirus two during the zoonosis.
- 09:55So by the time we get to these,
- 09:57I'll just let you know we're almost done with the talk.
- 09:59All right, so to talk about the first one,
- 10:01maximum likelihood model averaging to profile clustering
- 10:03of site types across discrete linear sequences.
- 10:09I just want to... (phone ringing)
- 10:11Sorry, emphasize that we wanna figure out
- 10:20whether site types are clustered within a linear sequence.
- 10:22This sounds like a very straightforward
- 10:24statistical question; it seems like something
- 10:27that should have been addressed many, many times
- 10:28in the statistical literature.
- 10:29Much to my surprise,
- 10:30it's actually not terribly well explored.
- 10:34You have a linear sequence,
- 10:36it's so long, and you have site types of one type
- 10:38or another. Are they clustered next to each other?
- 10:39Well, if you know the bounds of the region of interest,
- 10:42in other words, if you can say, oh,
- 10:43I'm interested in this domain right here,
- 10:46and it's from site to site 90 or some other description.
- 10:48If you know the bounds,
- 10:49it's very simple to analyze that kind of data.
- 10:52You can just quantify the site type proportions
- 10:55within and outside those bounds,
- 10:57and use something like a straightforward Fisher's exact
- 10:59test for significance. Extremely simple problem.
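
A minimal sketch of that known-bounds case, assuming the sequence is a list of 0s and 1s and the region of interest is the half-open slice `seq[start:end]`. The function names and the toy sequence are hypothetical, and the two-sided test is written out with the standard library rather than taken from any particular package.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a = variant sites inside the region,  b = other sites inside,
    c = variant sites outside the region, d = other sites outside.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x variant sites inside the region
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def region_test(seq, start, end):
    """Test whether 1s are enriched or depleted in seq[start:end] versus the rest."""
    inside, outside = seq[start:end], seq[:start] + seq[end:]
    a, c = sum(inside), sum(outside)
    return fisher_exact_two_sided(a, len(inside) - a, c, len(outside) - c)

# Toy sequence (made up for illustration): all four 1s fall inside sites 5..8
seq = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p_value = region_test(seq, 5, 9)
print(p_value)
```

The catch is that `start` and `end` have to be chosen before looking at the data for the p-value to mean what it says.
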
- 11:01But what if you don't actually know those bounds?
- 11:04What if you don't know even what you're looking for exactly?
- 11:05You just know you're interested in concentrations
- 11:07of one site type compared to another site type
- 11:10across some discrete linear sequence,
- 11:12like this series of zeros and ones you see below.
- 11:15There's one, zero, zeros; there's one, zero, ones;
- 11:17there are periods where ones are closer to each other, where a series
- 11:20of ones are closer to or farther apart from each other.
- 11:22How should we figure out whether things
- 11:24are actually clustered in that site?
- 11:26Or are they random?
- 11:27So if you don't know exactly where to describe,
- 11:31or what size you're looking for,
- 11:33the most common solution people use
- 11:35is some kind of sliding window,
- 11:36they take a window over the series,
- 11:38and they slide it across and say,
- 11:40"How many are in this window?"
- 11:41And then you can come up with based on the sliding window
- 11:44a sort of diagram of the clustering.
- 11:46And that's an approach that actually does
- 11:49give a good metric of the clustering
- 11:51in terms of like you see peaks where there's
- 11:53a lot of clustering and valleys where there is none.
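
For illustration, a sliding-window profile might be computed as below. This is a sketch only: the function name, the toy sequence, and the window width of 4 are all assumptions made here, not values from the talk.

```python
def sliding_window_counts(seq, window):
    """Count of 1s in every window of the given width, sliding one site at a time."""
    if not 0 < window <= len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    count = sum(seq[:window])
    counts = [count]
    for i in range(window, len(seq)):
        # Add the site entering the window and drop the site leaving it
        count += seq[i] - seq[i - window]
        counts.append(count)
    return counts

# Toy sequence and window width, both chosen arbitrarily for this example
seq = [1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
profile = sliding_window_counts(seq, window=4)
print(profile)  # [1, 1, 2, 2, 3, 3, 2, 2, 1]
```

Adjacent windows share all but one site, so neighboring counts in the profile are not independent, and the `window` argument has to be supplied by the user.
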
- 11:56However, significance testing with that kind of approach
- 11:59is often awkward to construct,
- 12:00due to strong autocorrelation
- 12:02among these overlapping windows.
- 12:04And of course, if you just sort of
- 12:06take windows arbitrarily from one location to another,
- 12:09then you're really instituting, (indistinct chatter)
- 12:13then that causes problems.
- 12:14Because what if the cluster is really on a border
- 12:16between two windows, so you have to slide it over and then
- 12:19you have the autocorrelation.
- 12:20And it becomes actually statistically
- 12:21quite challenging to sort of account
- 12:24for all of those autocorrelations.
- 12:25Secondly, the need to specify that window
- 12:27size itself presents the user with a procedural ambiguity
- 12:31that almost inevitably leads to post hoc selection of window
- 12:34size and can mislead inference. That is just the fact that
- 12:37you have to choose a window size.
- 12:39And if you don't actually have a good a priori
- 12:41outside reason to choose it,
- 12:43it's very hard not to choose a window size
- 12:44that ends up validating your hypothesis in some way.
- 12:49So it'd be better if we could just have an approach
- 12:51that does not require us to place in some
- 12:53arbitrary parameter that gives us a window size.
- 12:56So in order to address this question,
- 12:58a postdoc of mine, Zhang Zhang, whom you see below, worked
- 13:01with me to address it.
- 13:03Oh, I wanted to say one other thing,
- 13:04which is that, yes, this has been addressed with some
- 13:07nonparametric methods that people have developed,
- 13:11including some rather famous people like Sam Karlin.
- 13:14And these are methods that do not assume prior knowledge.
- 13:17And they've been suggested to detect this clustering
- 13:20in discrete linear sequences.
- 13:21So you can do runs tests that look for
- 13:22the longest unbroken run, or the variance of the run
- 13:26lengths across the entire sequence.
- 13:27Both of these are indicators of clustering.
- 13:30Unfortunately, both of those
- 13:32are not sufficient tests.
- 13:34That is, they don't use enough of the information
- 13:36for you to actually have as much power as you'd
- 13:39like to do the analysis.
- 13:40And that's because if you use
- 13:42the longest run length, for instance, of course,
- 13:44you're only really using a little bit
- 13:45of information about the entire sequence.
- 13:47And of course, you're really missing anything
- 13:49like a cluster of ones made up of a bunch of small
- 13:52clusters that are all next to each other, interspersed
- 13:54with a few of the other type,
- 13:56so the longest unbroken run doesn't work well
- 13:59in terms of power.
- 14:01If you use the variance of run length,
- 14:04that gets rid of the fact that you're looking for just one.
- 14:05But unfortunately, a variance doesn't tell you anything
- 14:07about the relative position of sites
- 14:11that are of the same type across the sequence.
- 14:14So the fact that this one, one, one, one here is close
- 14:18to the one, one here, and to another one,
- 14:20the fact that these are all close to each other,
- 14:22does not give us the power that it should
- 14:25for understanding that this region may
- 14:27be clustered.
- 14:30So variance of run length is also an underpowered approach.
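
The two runs statistics can be sketched as follows. The function names and both toy sequences are made up for illustration, and the population variance is an arbitrary choice. The example is built so that the runs sit close together in one sequence and far apart in the other, yet both statistics come out identical.

```python
from itertools import groupby
from statistics import pvariance

def run_lengths(seq, site_type=1):
    """Lengths of each unbroken run of `site_type`, in order along the sequence."""
    return [len(list(group)) for value, group in groupby(seq) if value == site_type]

def longest_run(seq, site_type=1):
    """Length of the longest unbroken run of `site_type` (0 if there is none)."""
    return max(run_lengths(seq, site_type), default=0)

def run_length_variance(seq, site_type=1):
    """Population variance of the run lengths of `site_type` (0 if there are no runs)."""
    runs = run_lengths(seq, site_type)
    return pvariance(runs) if runs else 0.0

# Same three runs of length two; only their positions differ
clustered = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
dispersed = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]

for name, seq in [("clustered", clustered), ("dispersed", dispersed)]:
    print(name, longest_run(seq), run_length_variance(seq))
```

Neither statistic sees where the runs are, which is the loss of positional information described above.
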
- 14:33The most powerful approach that's been published out there,
- 14:36aside from the ones we've been working on,
- 14:38is the empirical cumulative distribution function
- 14:41statistic. That's where you sort of go across the sequence
- 14:43and just say, "oh, okay, we're accumulating ones here,
- 14:47we're accumulating more."
- 14:49And there's fortunately a number
- 14:52of highly developed statistical approaches
- 14:53to look at the empirical distribution and figure
- 14:55out whether you see an increase beyond
- 15:00that expected during some period of that ECDF.
- 15:03The power is better than either of the previous methods,
- 15:05but it's still not very strong.
- 15:07It's not clear that it includes all the
- 15:08information that it should.
- 15:10And it can be affected.
- 15:12Research has shown that it can be affected
- 15:14by the location of the cluster, which is not desirable.
- 15:16So if you have a cluster on an end,
- 15:18the ECDF will have less power
- 15:21or more power compared to a cluster in the middle.
- 15:23It's also challenging to interpret in the end,
- 15:26for reasons I'm not gonna go into right away.
- 15:29So what did we do?
- 15:30What we did was develop a tripartite divide
- 15:32and conquer approach to model variant sites
- 15:35based on iterative sub clustering.
- 15:37And I'll describe it in detail right now.
- 15:39I'll just tell you the plus and the minus
- 15:40of this approach at the beginning,
- 15:42which is it's sort of a bioinformatics approach,
- 15:45or a bioinformatic statistician's approach,
- 15:48in that it uses intensive computation
- 15:50to solve the problem instead of giving
- 15:52a strict analytical result.
- 15:55And in fact, what it does is it just says,
- 15:58Well, if we're interested in clustering in any case,
- 16:00clusters should be represented by increases in
- 16:03the probability within some cluster central region
- 16:06compared to some side regions.
- 16:08And if we define CS and CE to be anything
- 16:11from the very beginning to the very end of the sequence,
- 16:14it encompasses all possible single clusters
- 16:17within a sequence.
- 16:19So, for instance, if the cluster were on the far left,
- 16:22we can just define CS to be at zero:
- 16:25the left-hand region is nothing, and the
- 16:28right-hand area that has depressed variant-type intensity
- 16:35would be the other category.
- 16:38Anyway, so, what we can do is divide any sequence
- 16:42into three sections, just count up the number
- 16:44of site types in each one, estimate the maximum
- 16:46likelihood probability for the site type
- 16:50to be of the variant type of interest,
- 16:52say it's glycine amino acids within a protein
- 16:55or adenine nucleotides within a gene, whatever it is.
- 17:00So then you can just come up with a null hypothesis,
- 17:03which is the likelihood under the hypothesis
- 17:06that these things are located at random
- 17:09across the whole sequence.
- 17:11And then an alternate hypothesis
- 17:14that invokes a model which involves more parameters,
- 17:18which then separates the sequence into a clustered
- 17:21versus non-clustered state.
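
One way to picture the single-cluster step is an exhaustive search over every start and end (`cs` and `ce` below, standing in for the CS and CE of the talk). This is a sketch under assumptions: each segment is treated as independent Bernoulli sites at its own maximum-likelihood rate, and the function names and toy sequence are hypothetical rather than the published implementation.

```python
from math import log

def segment_loglik(ones, length):
    """Bernoulli log-likelihood of a segment, evaluated at its own ML rate ones/length."""
    if length == 0 or ones == 0 or ones == length:
        return 0.0  # empty or homogeneous segments fit perfectly
    p = ones / length
    return ones * log(p) + (length - ones) * log(1 - p)

def best_single_cluster(seq):
    """Return (null log-lik, (best log-lik, cs, ce)) over all three-part splits."""
    n = len(seq)
    prefix = [0]
    for site in seq:
        prefix.append(prefix[-1] + site)  # prefix[i] = number of 1s in seq[:i]
    null_ll = segment_loglik(prefix[n], n)  # one rate for the whole sequence
    best = (null_ll, 0, n)
    for cs in range(n):
        for ce in range(cs + 1, n + 1):
            if cs == 0 and ce == n:
                continue  # that split is just the null model
            ll = (segment_loglik(prefix[cs], cs)                      # left
                  + segment_loglik(prefix[ce] - prefix[cs], ce - cs)  # center
                  + segment_loglik(prefix[n] - prefix[ce], n - ce))   # right
            if ll > best[0]:
                best = (ll, cs, ce)
    return null_ll, best

# Toy sequence (made up): a block of four 1s at sites 6..9
seq = [0] * 6 + [1] * 4 + [0] * 10
null_ll, (alt_ll, cs, ce) = best_single_cluster(seq)
print(null_ll, alt_ll, cs, ce)
```

The three-part model can never fit worse than the null, so the comparison between the two has to account for the extra parameters before the cluster is accepted.
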
- 17:23So that would be fine if what we really
- 17:25expected in a sequence was one cluster,
- 17:27compared to nothing else,
- 17:29compared to the sort of baseline rate,
- 17:33the sort of baseline rate of variant types.
- 17:35But what we really want is an approach
- 17:39that can take clustering at many, many levels.
- 17:42So what if there's a cluster within the cluster
- 17:43or a cluster within the left region?
- 17:45So what you can do is then take each
- 17:46of these sub clusters you've identified and actually
- 17:50do the same process on them looking for whether there's
- 17:53a higher likelihood of the data given another cluster
- 17:56somewhere within this sequence, et cetera, et cetera.
- 17:59Now, this sort of dictates a procedure,
- 18:04which is that you input the sequence,
- 18:07you start at
- 18:09the left and move all the way to the right,
- 18:11essentially, you find the most likely cluster
- 18:13among all the possible clusters.
- 18:15If the cluster is statistically significant,
- 18:17you then subsequence each of those three parts,
- 18:21the left-hand part, the central part
- 18:24and the right hand part, find the most
- 18:26likely clusters within each of them.
- 18:27And proceed doing this until you reach a point
- 18:30where you can no longer find any statistical evidence
- 18:32that there is continued clustering within it.
- 18:34And that's the point at which you stop.
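
The recursion just described might look roughly like this. It is a sketch, not the published algorithm: the stopping rule here is a fixed log-likelihood gain (`min_gain`, a hypothetical parameter set to 4.0 as a stand-in for a significance or information-criterion check), and the brute-force search is only practical for short sequences.

```python
from math import log

def loglik(ones, length):
    """Bernoulli log-likelihood of a segment at its own ML rate."""
    if length == 0 or ones in (0, length):
        return 0.0
    p = ones / length
    return ones * log(p) + (length - ones) * log(1 - p)

def subcluster(seq, lo, hi, min_gain=4.0):
    """Recursively split seq[lo:hi]; return a list of (start, end, rate) segments."""
    n = hi - lo
    total = sum(seq[lo:hi])
    flat = [(lo, hi, total / n)] if n else []  # the unsplit description
    if n < 2:
        return flat
    null_ll = loglik(total, n)
    best_ll, best_cs, best_ce = null_ll, lo, hi
    for cs in range(lo, hi):
        for ce in range(cs + 1, hi + 1):
            if cs == lo and ce == hi:
                continue  # identical to the unsplit model
            ll = (loglik(sum(seq[lo:cs]), cs - lo)
                  + loglik(sum(seq[cs:ce]), ce - cs)
                  + loglik(sum(seq[ce:hi]), hi - ce))
            if ll > best_ll:
                best_ll, best_cs, best_ce = ll, cs, ce
    if best_ll - null_ll < min_gain:
        return flat  # assumed stopping rule: not enough support for another cluster
    segments = []
    for start, end in [(lo, best_cs), (best_cs, best_ce), (best_ce, hi)]:
        segments.extend(subcluster(seq, start, end, min_gain))
    return segments

# Toy sequence (made up): a block of four 1s at sites 6..9
seq = [0] * 6 + [1] * 4 + [0] * 10
segments = subcluster(seq, 0, len(seq))
print(segments)
```

Each recursive call works on a strictly shorter stretch, so the procedure always terminates, and the returned segments form the step-like profile that a single selected model gives.
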
- 18:36And then what you can do.
- 18:37And this, I think, is sort of a key because
- 18:39at the end of that, what you get is one discrete diagram,
- 18:42kind of like that diagram I showed you initially,
- 18:44where it proceeds flat, goes up,
- 18:46proceeds flat goes down, et cetera.
- 18:47I'll show you an example of that in a moment.
- 18:50But what you really wanna do possibly,
- 18:53right, what I think is really appealing about
- 18:55this approach is that then you can take
- 18:56that as one model, the most likely model and you can look
- 18:59at all the other possible models
- 19:00that you could have constructed.
- 19:02And you can use AIC weighting to actually figure
- 19:05out how much you should believe each one, that is, the weight
- 19:11for every possible model.
- 19:13And then you can average across those models
- 19:14to give you a continuous description
- 19:17of how much clustering you see across the sequence.
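
As a sketch of the weighting and averaging step, the standard Akaike-weight formula can be applied to a handful of candidate models. The log-likelihoods, parameter counts, and per-site profiles below are invented numbers purely to show the mechanics, and the function names are hypothetical.

```python
from math import exp

def akaike_weights(log_likelihoods, n_params):
    """Akaike weights from each model's log-likelihood and parameter count."""
    aic = [2 * k - 2 * ll for ll, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    rel = [exp(-0.5 * (a - best)) for a in aic]  # relative likelihood of each model
    total = sum(rel)
    return [r / total for r in rel]

def model_average(profiles, weights):
    """Weighted average of per-site probability profiles, one profile per model."""
    n_sites = len(profiles[0])
    return [sum(w * profile[i] for w, profile in zip(weights, profiles))
            for i in range(n_sites)]

# Three hypothetical models of a six-site sequence (all numbers are made up)
log_likelihoods = [-10.0, -6.0, -5.5]
n_params = [1, 5, 9]
profiles = [
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.2],  # no cluster
    [0.1, 0.1, 0.6, 0.6, 0.1, 0.1],  # one cluster at sites 2..3
    [0.1, 0.1, 0.9, 0.3, 0.1, 0.1],  # a cluster within that cluster
]
weights = akaike_weights(log_likelihoods, n_params)
averaged = model_average(profiles, weights)
print(weights)
print(averaged)
```

Because every model contributes in proportion to its weight, the averaged profile changes more gradually along the sequence than any single step-like model does.
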
- 19:18And again, the advantage that I mentioned
- 19:20early on about this,
- 19:22from my standpoint, is that I haven't put in anything
- 19:24about how big a window or how big a cluster.
- 19:26I put in nothing about what I'm expecting
- 19:28to see out of the sequence.
- 19:30I'm just asking, what's the most likely description
- 19:32of this given the AIC penalty for parameterization,
- 19:37and that's what the result gives me.
- 19:39So then we have a bunch of different weights
- 19:41for all our different models.
- 19:44And what it gives us is something like this.
- 19:45So on the top, I've shown you the AIC model selection
- 19:48which is the first thing I showed you
- 19:49if I just took the most likely description
- 19:51of this particular sequence.
- 19:53It's not important what it is; it's
- 19:55Adh, which has been widely studied in evolutionary biology.
- 19:59But if you take this model selection,
- 20:02the most likely description
- 20:05given that sub clustering looks something like this
- 20:07where we have a region with fairly high concentration
- 20:10of polymorphism, then in this case a valley,
- 20:14a region at an intermediate level,
- 20:16a point where we have a lot of polymorphism.
- 20:19And then it moves and changes across the sequence.
- 20:21Now, if you then instead take not just that one model,
- 20:25but a series of models and do the AIC model average,
- 20:28you get a much more continuous description across
- 20:30the sequence of what the probability
- 20:33of site types being different is.
- 20:36And that enables us to ask a question
- 20:37that's a little bit more interesting in many cases,
- 20:41and I'll show you how it enables us to ask questions
- 20:43about natural selection in a moment.
- 20:45So in particular, it allows us to get an estimate,
- 20:49you know of what the probability
- 20:50is across the entire sequence.
- 20:51Even though we don't have
- 20:52observations within the central region
- 20:54or this barren region here.
- 20:56We can still estimate what the model-averaged
- 21:00probability of a change occurring in different places
- 21:02of this gene is, and that enables us
- 21:05to ask questions that we otherwise could not.
- 21:08All right, so that's an introduction of MACML.
- 21:11I'll just mention, and I could give you more detail on this.
- 21:14This is actually published work,
- 21:16so you can find it.
- 21:17But compared to the ECDF statistics,
- 21:19that approach I just showed you has greater power
- 21:21to detect heterogeneous clusters
- 21:23it identifies clusters with greater accuracy and precision
- 21:26based on the Kullback-Leibler divergence between
- 21:28the observed distribution,
- 21:31sorry, the actual distribution
- 21:34and the inferred distribution.
- 21:36It has better power and accuracy across
- 21:37different levels of clustering,
- 21:38better power and accuracy across
- 21:40different sequence lengths,
- 21:41and better power and accuracy in finding
- 21:43multiple clusters compared to a single cluster.
- 21:45The disadvantage is, it's extraordinarily
- 21:47computationally intensive, and it is prohibitively
- 21:49so for very long sequences.
- 21:51So for genes of very long length,
- 21:53we can't actually run it on the full-length gene
- 21:55and we have to do some more heuristic processes
- 21:58to crunch those genes into smaller size.
- 22:01Which we then can analyze and then build them up.
- 22:03Again, I won't go into those at the moment.
- 22:05But the point is that at certain lengths,
- 22:07it gets just computationally too intensive to go
- 22:09through all the possible models that could explain the data.
- 22:13Now, I've talked about the maximum-likelihood averaging
- 22:17to profile clustering of site types
- 22:19across discrete linear sequences,
- 22:21and introduced that methodology. Now I'm gonna talk about
- 22:24how we can apply that methodology
- 22:26to get us a better idea of which sites are under selection
- 22:29using what's called a Poisson random fields approach.
- 22:32And don't worry about that terminology.
- 22:34You might know it from statistics,
- 22:37it has to do with a particular observation
- 22:40in molecular evolutionary biology,
- 22:42which is why they're using it
- 22:44and it's not really important for this talk,
- 22:46why it's called that.
- 22:48So let's go ahead and do that, and talk
- 22:51about the model-averaged site selection
- 22:53using Poisson random fields.
- 22:54So first, I need to give you a little bit of background
- 22:56in the evolutionary biology for those of you
- 22:59who haven't had a lot of biology,
- 23:00so you understand how this fits in with
- 23:02what we tend to do another strategy.
- 23:03Of course, evolutionary biologists
- 23:05are often very interested in understanding
- 23:06what things are under selection.
- 23:07And in the context of this talk,
- 23:09why is that important?
- 23:10Well, we'd really like to know what things
- 23:12are under selection in the COVID epidemic,
- 23:14because we'd like to know what sites
- 23:16are actually causing the COVID epidemic
- 23:18to spread more or not, and what sites may have
- 23:21been important in it prior to zoonosis,
- 23:24and then, perhaps, especially in the context of this talk,
- 23:26what sites were selected during
- 23:28that zoonotic process that made this virus perhaps able
- 23:31to infect humans in the first place.
- 23:33So what we're doing is,
- 23:34so to give you an introduction,
- 23:36I just wanna mention that there are sort of ways
- 23:39to look at ancient times and understand
- 23:40whether selection was happening.
- 23:42And that's this approach
- 23:44that looks at phylogenetic divergence,
- 23:45looking at multiple sites and saying,
- 23:47"Oh, we have a phylogeny
- 23:49of how these organisms are related."
- 23:51And then we have a bunch of sites that are for each taxon.
- 23:55When we see sites like this, for instance,
- 23:57that has an A and then a couple of C's and then a G
- 24:00in another taxon, we know that this site changed twice
- 24:03on that phylogeny, at least, right?
- 24:05So it probably changed from C ancestrally
- 24:09to an A in this lineage and to a G
- 24:11in this lineage independently.
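
The "at least twice" reasoning can be written as a one-line lower bound: on any tree, a site showing k distinct states needs at least k - 1 changes. The sketch below uses a hypothetical function name and the A, C, C, G column from the example; it is only a bound and does not reconstruct where on the tree the changes happened.

```python
def min_changes_lower_bound(site_column):
    """Fewest substitutions any tree would need to explain the states at one site."""
    if not site_column:
        return 0
    return len(set(site_column)) - 1

# One alignment column, one state per taxon
column = ["A", "C", "C", "G"]
print(min_changes_lower_bound(column))  # 2
```

A site that is identical in every taxon gives 0, so a higher bound flags sites that have changed repeatedly.
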
- 24:13And so the fact that it changed twice means
- 24:16that it's got an elevated rate of change.
- 24:18And that elevated rate of change is an indication
- 24:20that there's been positive selection for change.
- 24:22It's especially likely in sort of host-pathogen
- 24:25interactions that high rates of change occur
- 24:28because pathogens are changing in order
- 24:30to not be recognizable by their hosts.
- 24:33And often the host has recognition proteins
- 24:35that are changing to still recognize the pathogen,
- 24:36even as the pathogen is changing.
- 24:38So these high rates of evolution
- 24:40are very strong indicators of selection
- 24:42in host pathogen situations.
- 24:45So this is one way to study natural selection.
- 24:48It does depend, though, on having a lot of data going back
- 24:52in time, because you're actually reliant on these changes
- 24:55occurring in multiple places on multiple lineages.
- 24:58Now, at a more recent level, and I'm going to go back
- 25:02to the middle in a moment,
- 25:05at a very recent time, you may have
- 25:07heard of selective sweep detection.
- 25:08A couple of methods people use are Tajima's D
- 25:11or iHS; there's a bunch of other methods that are out now.
- 25:14And the idea there is to look at polymorphism.
- 25:16And if you look at an individual, before selection,
- 25:20this is sort of just an idea diagram,
- 25:22not what you look at.
- 25:23But so if you look at an individual who has a variant,
- 25:26what you see in a population is that
- 25:30one individual's variant, a variant that's important,
- 25:33has somehow swept across the population.
- 25:35So this would be before selection:
- 25:37there's a lot of variation at a particular locus
- 25:39in the genome. After selection,
- 25:41that one individual's variant, which contributed
- 25:44to the reproductive fitness, would then imply
- 25:46that they would spread across the population.
- 25:50And if they spread across the population,
- 25:52then the genetic variants that were present
- 25:54in that original individual spread across
- 25:56the population as well along with this selected site,
- 26:00and so you can look for this kind of partial or complete sweep
- 26:04while the selection is going on. Neither
- 26:07of the approaches that I just talked about
- 26:09is the approach that I'm doing today.
- 26:10So I just wanted to introduce those,
- 26:12so you knew those are different.
- 26:13And they're different because we're looking
- 26:15at a more intermediate timescale.
- 26:16The sweep detection is purely
- 26:19dependent on polymorphism in the population,
- 26:21like what's happening in a population right now.
- 26:24The phylogenetic divergence is purely dependent
- 26:26on these ancient changes that you get from a phylogeny,
- 26:28understanding how different species are related
- 26:31to each other. At an intermediate level,
- 26:33our methods use both the polymorphism
- 26:35and the divergence.
- 26:37And the idea here in the McDonald-Kreitman approach,
- 26:40and the master approach I'm going to tell you
- 26:42about is that the polymorphism what you see generally
- 26:46in the population is sort of consistent with this.
- 26:48Sorry, if I go back to this slide.
- 26:51With this before selection, you know,
- 26:53all of these blue sites are assumed
- 26:55to not be under selection,
- 26:57and that's generally what we believe in evolutionary biology,
- 26:59because of empirical data that validates it:
- 27:02most sites that you find varying in populations
- 27:05are not under strong selection.
- 27:07If they were under strong positive selection,
- 27:08they would probably fix, and everyone would have them.
- 27:11And if they were under negative selection,
- 27:13they wouldn't rise to a high frequency.
- 27:14So generally speaking, sites where you actually see
- 27:17changes, differences between us in our genetics,
- 27:18typically are not affecting anything.
- 27:20Of course, we spend in our...
- 27:23In the media, you only hear about the changes
- 27:24that actually affect things.
- 27:25And that's because those are important to us,
- 27:26the ones that don't change anything
- 27:28we don't really care about.
- 27:29So nobody talks about that much.
- 27:30But most of the changes within population or differences
- 27:33within population don't have much material effect.
- 27:35So under that hypothesis,
- 27:37then when you look at polymorphism,
- 27:39most polymorphism is just an indication
- 27:41of the underlying mutation rate,
- 27:43some mutation happened that didn't have any effect.
- 27:45It's drifting up and down in the population.
- 27:47And so the advantage of that is if you know
- 27:50that polymorphism is a signature
- 27:52of just random mutation, it gives us an estimate
- 27:54of the underlying mutation rate, which we can then compare
- 27:57to the divergence and using that comparison,
- 28:00we can understand how organisms are related.
- 28:02So as to whether organisms are under selection
- 28:05or not: if the divergence is high compared
- 28:07to the polymorphism, that indicates a lot of selection.
- 28:09That means (indistinct chatter)
- 28:12in the timescale of the analysis you're doing,
- 28:14we have a lot of change in the population.
- 28:17And on the other hand, if you have a lot of polymorphism
- 28:20and not that much divergence, then that indicates
- 28:22you've got a lot of change going on,
- 28:23but it's not actually being directionally
- 28:26selected because the divergence is much lower.
- 28:27So how does that test work in practice?
- 28:30Well, just to step back for one moment,
- 28:32so we're gonna apply that kind of test.
- 28:35In this talk I'm applying that test
- 28:36to the emergence of COVID-19.
- 28:39I'm actually also applying it to SARS, which is fairly
- 28:44closely related, the SARS coronavirus one,
- 28:46because we have similar data and can apply
- 28:48the same test in the same way to that data set.
- 28:52And we're using in addition the SARS-like
- 28:55coronaviruses in a sample that had been sequenced,
- 28:58basically collected from bats
- 29:00over the past 20 years or so.
- 29:02So what you can see here is a phylogeny,
- 29:05which includes the COVID-19 epidemic ongoing now in humans,
- 29:09the SARS epidemic, which caused some 400 deaths
- 29:13or so back in the early 2000s.
- 29:18And what we're doing is analyzing both and looking at,
- 29:21in particular, the very short internode here
- 29:25between the most closely related non-human infections
- 29:31and the human infection set that we can see.
- 29:33And this internode here, also,
- 29:36between these non human infections and the human
- 29:39infections we can see here, because the changes
- 29:42that may have enabled, we don't know,
- 29:45there may be no changes that enabled it,
- 29:47maybe this virus throughout
- 29:49its entire history could have infected humans,
- 29:51but it just never managed to or never did.
- 29:53But if there are changes that are unique to this virus
- 29:56that happened during zoonosis, enabling it to infect us,
- 29:59they happened on this lineage,
- 30:00and so we're interested in seeing what those changes are.
- 30:04And so that's what we're gonna do is we're gonna run
- 30:06this polymorphism and divergence approach on this lineage.
- 30:10And what I just want to make (indistinct chatter)
- 30:13clear to you is the reason
- 30:14why the polymorphism divergence approach is important is
- 30:18the phylogenetic approach, the ancient approach
- 30:20relies on a large clade of data, which we don't have
- 30:22for that particular lineage here,
- 30:24we just have the human infection,
- 30:26which is no longer zoonotic.
- 30:26And we have this one lineage.
- 30:28And so what we can do is ancestrally reconstruct
- 30:30the ancestor of this lineage, which is right here,
- 30:33actually on the phylogeny,
- 30:34and also the ancestor right here,
- 30:37and then use MASS-PRF, this approach that's based
- 30:40on polymorphism and, as I'll explain to you,
- 30:43on the divergence between that ancestor
- 30:46and the first ancestor of all the human infections.
- 30:48And we can take that as the near zoonosis time
- 30:51and figure out what mutations might
- 30:53have happened during that time.
- 30:54All right, so we're gonna do that in both
- 30:56the COVID-19 and SARS cases.
- 30:59Now, how does this work in principle?
- 31:02Well, there's an old approach,
- 31:03which is not what we're using.
- 31:05But I have to compare it to in order to
- 31:06sort of reference it in terms of the literature.
- 31:09And that is that when you assume
- 31:11that polymorphism is neutral,
- 31:13we expect a different proportion of replacement
- 31:16to synonymous divergence compared to replacement
- 31:18to synonymous polymorphism in a gene.
- 31:21So it's just a two by two table here, again,
- 31:23very simple statistics, where we look at
- 31:25the number of replacement sites that are divergent,
- 31:28the number of synonymous sites. Replacement,
- 31:30again, is when an amino acid change
- 31:32occurs in a DNA sequence.
- 31:33DNA sequence changes can either change the amino acid
- 31:35or not, depending on what the sequence of the codon,
- 31:39the three-base-pair codon in the DNA sequence, is.
- 31:42So if there's a replacement, we tally it here,
- 31:44if it's a synonymous change, that doesn't change the amino
- 31:46acid, we tally it here. These
- 31:48synonymous changes are presumably neutral because
- 31:50they don't change anything about your protein.
- 31:52And then if it's a polymorphic replacement,
- 31:56then we see it here.
- 31:57And if it's a synonymous polymorphism we see it here.
- 31:59So under the hypothesis that I mentioned,
- 32:01all three of these cells should occur, it should
- 32:04be sort of changing in exactly the same way
- 32:06because polymorphic sites, whether they're replacement
- 32:09or synonymous, we're assuming are neutral;
- 32:11synonymous sites, whether they're divergent
- 32:12or polymorphic, we're assuming are neutral.
- 32:15The only ones that under that
- 32:17assumption are not neutral are these
- 32:19changes at replacement divergence sites.
- 32:22So, if this replacement divergence, if the marginals
- 32:25add up so that this replacement divergence is sort of in
- 32:29line with all these others, then we assume nothing important
- 32:30is happening in that gene, it's probably not selected,
- 32:33it's just neutral changes that are happening there.
- 32:35If this divergence is higher, though,
- 32:38then we might conclude that it's under
- 32:39selection for changes at a rapid pace.
- 32:41So neutrality yields a DN over DS that's equal
- 32:44to the PN over PS; positive selection means
- 32:46that the DN/DS is greater than the PN/PS; and negative
- 32:50selection, where changes are actually being selected against
- 32:53at a high level, indicates the DN/DS
- 32:56is gonna be less than PN/PS.
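
The two-by-two comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example counts are hypothetical (not from the talk), and Fisher's exact test is assumed here as the significance test for the table.

```python
from math import comb


def mk_direction(dn, ds, pn, ps):
    """Compare DN/DS with PN/PS via cross-products (avoids division by zero)."""
    lhs, rhs = dn * ps, pn * ds
    if lhs > rhs:
        return "positive"   # DN/DS > PN/PS: excess replacement divergence
    if lhs < rhs:
        return "negative"   # DN/DS < PN/PS: replacement changes selected against
    return "neutral"        # DN/DS == PN/PS


def fisher_exact_two_sided(dn, ds, pn, ps):
    """Two-sided Fisher's exact test on the table [[dn, ds], [pn, ps]]."""
    row1 = dn + ds                  # all divergent changes
    col1 = dn + pn                  # all replacement changes
    total = dn + ds + pn + ps
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability of k replacement changes among divergent ones
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    k_min = max(0, row1 - (total - col1))
    k_max = min(row1, col1)
    p_obs = prob(dn)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    return sum(p for p in (prob(k) for k in range(k_min, k_max + 1))
               if p <= p_obs * (1 + 1e-9))


# Illustrative counts only (assumed for this sketch)
dn, ds, pn, ps = 7, 17, 2, 42
print(mk_direction(dn, ds, pn, ps), fisher_exact_two_sided(dn, ds, pn, ps))
```

Note that this gives one answer for the whole gene, which is the limitation the rest of the talk addresses.
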
- 32:59All right, now let's get to a little bit of the
- 33:01complexity on this thing that I mentioned that's called
- 33:04Poisson random field theory, which quantitatively estimates
- 33:05gene-wide selection intensity.
- 33:09So what you can do is take that
- 33:12same two by two table, and you can say under a model of
- 33:14selection, what do we actually think is happening here.
- 33:18And that gives us the ability to estimate the selection
- 33:20coefficient, which is basically the rate at which that
- 33:22change allows the virus to increase its reproductive ability
- 33:25or survival ability in the host.
- 33:27And that is this gamma term right here
- 33:32in these terms. And these look complicated,
- 33:34but essentially, these formulas are just saying
- 33:36that the expectation for a synonymous sorry,
- 33:39the synonymous and replacement have reversed
- 33:41on this chart compared to the last,
- 33:43so don't be confused by that.
- 33:45But the expectation under synonymous
- 33:45changes is essentially the mutation rate.
- 33:48And these terms are just about the sampling properties
- 33:50of when you sequence how many of these things you get,
- 33:52I don't need to go into the detail about that here.
- 33:55Similarly, the polymorphic sequence
- 33:57is just basically dependent on the mutation rate.
- 34:00However, the replacement sequences are a little bit more
- 34:02complicated in that they have to account
- 34:07for kinds of selection that may be going on.
- 34:11For reasons that I don't wanna get into
- 34:12the polymorphic selection, so both of them are depending
- 34:16on the mutation rate for replacement sites,
- 34:18and both of them depend on
- 34:20how much each variant is selected.
- 34:23Selection does impact the polymorphism
- 34:25to a certain degree in the sense that if variants
- 34:27are moving through the population very fast,
- 34:30that can change how much polymorphism you see.
- 34:32But then if you use these sampling formulas, and the formula
- 34:36for the estimate of the strength of selection,
- 34:38given how many variants we see changing,
- 34:41you get these formulas for how much replacement
- 34:44divergence and polymorphism you expect to see.
- 34:47So this is population genetics that was worked
- 34:49out by Stan Sawyer and Dan Hartl in 1992.
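
As a rough illustration of how a selection coefficient gamma feeds into expected divergence, the sketch below computes the standard diffusion-theory factor 2γ / (1 − e^(−2γ)), the fixation rate of a selected mutation relative to a neutral one. The function name is hypothetical, and the exact scaling of gamma (population size times selection coefficient) is an assumption here; the full Poisson random field formulas also contain sampling terms that are not reproduced.

```python
import math


def relative_fixation_rate(gamma):
    """Fixation rate of a mutation with scaled selection gamma, relative to neutral.

    Uses 2*gamma / (1 - exp(-2*gamma)); the limit as gamma -> 0 is 1.
    """
    if abs(gamma) < 1e-12:
        return 1.0
    # -expm1(-2*gamma) equals 1 - exp(-2*gamma), computed stably near zero
    return 2.0 * gamma / -math.expm1(-2.0 * gamma)


for g in (-2.0, 0.0, 2.0):
    print(g, relative_fixation_rate(g))
```

Positive gamma inflates replacement divergence relative to the neutral expectation and negative gamma deflates it, which is the pattern the two-by-two table picks up.
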
- 34:52The only change I'm making in this PRF is,
- 34:56instead of using the tallies, which is how many variants
- 35:00you see, the way the McDonald-Kreitman test uses them,
- 35:04I'm taking the probabilities of replacement divergence
- 35:08and the probabilities of synonymous polymorphism
- 35:11and putting them in here.
- 35:12And the advantage here is that what
- 35:13I can do with that is what I mentioned earlier,
- 35:15I can go back to the old MACML
- 35:18approach, the sequence clustering approach
- 35:20that I mentioned before. Estimating those probabilities
- 35:25across the entire gene, I can then estimate selection across
- 35:27the entire gene by using these single-site probabilities,
- 35:30even where I don't have changes for a single site.
- 35:32So what this allows
- 35:34us to do is estimate this gamma, maximizing the likelihood of what
- 35:38gamma is to explain those probabilities.
- 35:42So this is a very complex diagram of how this all works.
- 35:46Again, it's a pretty elaborate method of computation.
- 35:50But again, it has the nice properties that I'm not putting
- 35:53in any, I'm not using assumptions
- 35:55and not putting in any parameters.
- 35:56They go in.
- 35:58I just take the polymorphism and divergence and analyze them for
- 36:01whether sites are clustered, in four different categories.
- 36:04Again, replacement polymorphism,
- 36:06that's this arc here,
- 36:07synonymous polymorphism, synonymous divergence, replacement divergence:
- 36:12we cluster within all four of those categories.
- 36:15We calculate the model-averaged probability across
- 36:17all those clusters and merge the data together.
- 36:20I'm not going to go through the details.
- 36:22But just, if you were to do essentially the MACML-like
- 36:25clustering on those four categories
- 36:27for a particular gene, replacement polymorphisms
- 36:30and synonymous polymorphisms, synonymous and replacement divergence,
- 36:33and if you plug those in to the formulas I showed you before,
- 36:37you're basically plugging into these categories,
- 36:39you can estimate those formulas.
- 36:41And in the end, what you get is
- 36:42an estimate of gamma across nucleotide positions in a gene.
- 36:49I won't go into this result here;
- 36:51it's an interesting result for reasons
- 36:54that are mostly of interest only to evolutionary
- 36:56biologists, but you can see here in this particular gene
- 36:58that there's a lot of variation in the selection
- 37:02intensity across the gene.
- 37:04Now, that is actually really
- 37:06consistent with what we'd expect
- 37:08from a sort of basic biology standpoint.
- 37:11Different parts of a gene are gonna either
- 37:13be very strongly selected to stay the same
- 37:15or they're gonna change, you shouldn't really expect
- 37:18that all parts of a gene are equally likely to change.
- 37:20And this gives a very nice diagram
- 37:22that allows you to understand how
- 37:23it's different across the gene.
- 37:25So if we compare this kind of approach
- 37:27to the McDonald-Kreitman test, which again
- 37:30is just putting in the DN, DS, PN, PS values
- 37:33into this two by two table,
- 37:36and I went through that, the important difference is that
- 37:39the MK test assumes this intragenic homogeneous selection,
- 37:42that in effect a gene has the same selection
- 37:44across the entire sequence.
- 37:46The problem with that is if you have one small
- 37:48region that's under selection,
- 37:50the averaging out process across that entire gene
- 37:53can mean that you don't detect the selection there,
- 37:54even though it may be very strong for that small region.
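
The averaging-out problem just described can be illustrated with a toy example. This sketch uses a plain sliding window, which is a much cruder stand-in for the model-averaged clustering in MASS-PRF, and the gene length and site positions are made up purely for illustration.

```python
def window_densities(hits, window):
    """Fraction of marked sites in each sliding window of the given width."""
    n = len(hits)
    return [sum(hits[i:i + window]) / window for i in range(n - window + 1)]


# Hypothetical gene of 300 sites; 1 marks a replacement-divergent site.
gene = [0] * 300
for pos in (12, 95, 141, 144, 147, 150, 152, 155, 158, 260):
    gene[pos] = 1

gene_wide = sum(gene) / len(gene)            # one average for the whole gene
local = window_densities(gene, window=20)    # 20-site sliding windows
peak_start = max(range(len(local)), key=local.__getitem__)

print(f"gene-wide density: {gene_wide:.3f}")
print(f"peak window starts at site {peak_start}, density {local[peak_start]:.2f}")
```

The gene-wide average is low even though one short region carries most of the changes, which is why a single per-gene table can miss localized selection.
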
- 37:57And so the hope is that MASS-PRF can
- 38:01identify those regions much better
- 38:02than MK for instance, would.
- 38:04And in fact, I went through this already.
- 38:09I'll just skip past this because I went through it already.
- 38:13And in fact it does do that.
- 38:18So this is an example of the McDonald-Kreitman
- 38:21test here applied to a Drosophila gene.
- 38:23What you see is this high level
- 38:27of replacement divergence, which turns out
- 38:30to indicate high selection.
- 38:33And you can see here that the DN to DS ratio
- 38:35is about eight to one, whereas the PN to PS ratio
- 38:38is almost even.
- 38:40So this is a gene that's under very strong selection
- 38:42based on the McDonald-Kreitman test.
- 38:45Now, interestingly, so this one works
- 38:47with an assumption of homogeneity.
- 38:49And then if you analyze the Acp26Aa gene
- 38:55and look for the probability of all four categories.
- 38:58These are the four categories and of course,
- 39:01the replacement divergence here is the one
- 39:04that's most likely to drive selection.
- 39:06What do you get when you estimate gamma using this?
- 39:09Well, interestingly, what you see is not something
- 39:10that's under very strong selection across the entire gene,
- 39:13but something that's under moderately strong selection,
- 39:15basically in the second half of the gene,
- 39:17and then one peak of very strong
- 39:19selection right around the middle of the gene.
- 39:21And this is visible here because
- 39:23of a number of changes that occur
- 39:26in one particular domain of the gene here.
- 39:28Now, if you look at just the replacement divergence,
- 39:30you wouldn't be able to figure this out.
- 39:32Because you see there are other
- 39:34peaks along here.
- 39:35Those don't turn out to be so important.
- 39:36And the reason why they don't turn out to be so important
- 39:39is that the synonymous divergence, synonymous polymorphism,
- 39:41and replacement polymorphism
- 39:42tell us more about the underlying mutation rate,
- 39:44which says those elevations probably have
- 39:47something to do with mutation rate, and not necessarily
- 39:49to do with added divergence.
- 39:52You can sort of see this elevation
- 39:54on the right hand side over here compared
- 39:56to the small dip right here and up here
- 39:59and the way it all works out mathematically
- 40:02is we can really see that there's strong selection here.
- 40:04We can also get what I call model intervals for this.
- 40:06If you look across all the models,
- 40:08what are the estimates of selection?
- 40:11What we get is the 95% model interval for this.
- 40:14And that's what these very faint gray lines you
- 40:17may be able to see are. Those allow us to detect whether
- 40:19these are significant, at least
- 40:22statistically significant, differences in selection.
- 40:24All right, I'm gonna skip through this
- 40:27just because I want to save the time,
- 40:29but the point is, you can do this for other genes,
- 40:29and it shows similar results that allow us
- 40:32to understand where sites are under selection in that gene.
- 40:34I'll just cover a few more examples
- 40:37of how we've used this to give you an idea
- 40:39of what it can look like. In a comparison between humans
- 40:42and chimpanzees, where we've run this just to understand
- 40:44how we've diverged from chimpanzees,
- 40:47we see a bunch of different examples here.
- 40:50Again, doing a little bit of comparison to
- 40:52that traditional McDonald-Kreitman test
- 40:54and the MASS-PRF test.
- 40:56Here you see a gene which is statistically significant
- 41:00from most people's point of view,
- 41:01based on the MK test, the four categories
- 41:04or the four tallies of which are indicated here.
- 41:07Here's the MASS-PRF profile, and it shows us again
- 41:10a particular region within this SLC AA
- 41:12one gene that is under selection.
- 41:14There are interesting stories behind all of these,
- 41:17but I'm not gonna take the time to go through them.
- 41:19Here's another example, and this is an example
- 41:22where the McDonald-Kreitman test
- 41:23comes out as not significant.
- 41:25There's just not that much divergence
- 41:26compared to the other categories.
- 41:28But if you do this, spatially with the MASS-PRF test,
- 41:32you actually see that a very central region there
- 41:34has very strong selection, and then the rest of the gene
- 41:37is under almost zero selection or almost no selection.
- 41:41So this is an example I talked about,
- 41:43where you could have some very small portion
- 41:45of the gene under very strong selection.
- 41:47And the McDonald-Kreitman test wouldn't detect it
- 41:49because it's averaging over the entire gene.
- 41:51Similarly, you'll get some genes.
- 41:52Oops, I didn't mean to do that.
- 41:54Some genes, here's M gamma over here, where there's a...
- 41:58Well, let me turn to that one last.
- 41:59Actually, let me look at TPH First,
- 42:02there's no statistically significant selection according to the MK test.
- 42:06And in fact, in our MASS-PRF,
- 42:08there's no significant selection either;
- 42:09the error bars are entirely overlapping zero here,
- 42:12which indicates no selection.
- 42:15Lastly, here's M gamma.
- 42:16This is one of the very few examples
- 42:18we were able to find where the McDonald-Kreitman test did detect
- 42:21selection where MASS-PRF didn't.
- 42:24As you can see, there's quite high tallies here,
- 42:26which means there's a lot of power
- 42:27to detect selection if it's there,
- 42:28but it's probably not very strong,
- 42:30because the numbers are not all that different
- 42:32from each other.
- 42:34And McDonald-Kreitman says it's statistically significant.
- 42:36Now the reason why McDonald-Kreitman is telling us
- 42:40it's statistically significant, compared to MASS-PRF,
- 42:41is that actually, I didn't explain this in detail to you,
- 42:44but McDonald-Kreitman doesn't actually assume
- 42:47that there's an elevation of rate here.
- 42:48And so the significance here is actually driven by
- 42:51the high polymorphic replacement level.
- 42:53So there's a lot of polymorphic replacements in there.
- 42:56And what that means is there's some other
- 43:00kind of selection that isn't a directional selection.
- 43:01I won't go into the details there.
- 43:02But the nice thing is that the examples
- 43:04where we find that McDonald-Kreitman is statistically
- 43:07significant and MASS-PRF isn't are examples
- 43:10where in fact MASS-PRF is not designed to detect
- 43:12that kind of selection and the MK test is.
- 43:15In general MASS-PRF turned out to be significant
- 43:18in almost every case that MK tests were not.
- 43:21Okay, so how can we apply this
- 43:24to instances like COVID-19, the point of this whole talk?
- 43:27I'm just gonna give you one example first
- 43:30to justify why we think it's a good idea,
- 43:32because we don't have results on doing it,
- 43:34at least not many results on doing it, to COVID-19
- 43:36yet. And that is that we applied this to influenza before,
- 43:39which has some similarities to COVID-19, as everyone knows.
- 43:43And in influenza, again, we're interested in looking across
- 43:46the gene: are there sites that are under selection?
- 43:48Because those sites that are under selection
- 43:50are candidates where we need to be aware that
- 43:53in fact, vaccines need, like, every year they design
- 43:58a new influenza vaccine, right?
- 43:58And what they're trying to do is accommodate
- 44:00the fact that these changes occur on the sites
- 44:03that are actually susceptible
- 44:04to your immune system recognizing the influenza virus.
- 44:08So we need to understand those sites that are changing
- 44:11and where they are in order to design
- 44:13more universal vaccines that maybe could target sites
- 44:16that won't change rapidly because they can't change
- 44:19because they're structurally constrained in the virus.
- 44:22So what we did was apply this MASS-PRF approach
- 44:25to influenza, on a phylogeny similar
- 44:29to what I described for coronavirus.
- 44:30I don't have the phylogeny in the slide set,
- 44:33but the point is just looking at the ancestral influenza
- 44:36and its divergent sites within a particular region.
- 44:40And what we were able to do is identify a set of sites
- 44:43that are under selection using MASS-PRF
- 44:46that are beyond what people had hypothesized
- 44:48as positive selection sites in the past.
- 44:50So there's a paper by Westgeest et al. 2012,
- 44:53which is essentially the gold standard for this,
- 44:55and they found a bunch of sites that are all
- 44:58these sites circled in gray. MASS-PRF
- 45:00also found those. The orange diagram here
- 45:03is the MASS-PRF profile for this gene.
- 45:09And it also identified other sites
- 45:10that are under selection as well.
- 45:14And we're in the process of understanding
- 45:16better how those can be validated.
- 45:17But the ultimate point is that
- 45:20these are important selected sites that may be relevant
- 45:25to the design of vaccines for influenza.
- 45:28So similarly, we'd like to illuminate
- 45:31which sites might be changing rapidly
- 45:34and under positive selection in Coronavirus,
- 45:37not only during the human epidemic,
- 45:39but again during the zoonotic time period.
- 45:41And so now we're finally coming to the final
- 45:43part of my talk, which is what we're doing
- 45:46in terms of the model-averaged estimation of molecular evolution
- 45:48and natural selection in SARS coronavirus
- 45:51one and SARS coronavirus two,
- 45:53coronaviruses during zoonosis.
- 45:53But the whole point here is really to
- 45:56explain to you what I've done, because the results I have,
- 45:57as I said, are just a few plots of some of the
- 46:01strongest selection we were able to check,
- 46:03because we have to process through a lot more data
- 46:05before we get a more in depth look at the lesser
- 46:07selected sites that are on these genes.
- 46:10And so we looked at this for coronavirus.
- 46:13This is just a Coronavirus, Getty image that Yale
- 46:17has used looking at Coronavirus.
- 46:21And again, as I mentioned,
- 46:23we're looking at these two sites of where COVID-19
- 46:26emergence occurred, and where SARS emergence occurred.
- 46:30And the question is, are there changes
- 46:33that happen there that are specifically
- 46:34responsible perhaps for those zoonoses? And the only results
- 46:38I have are just a few results again, highlighting some of
- 46:40the strongest selection we saw.
- 46:42This is actually a diagram of the spike
- 46:44protein which if you've heard much about COVID-19
- 46:47molecular biology, you probably have heard about the spike
- 46:50protein, it's what sticks out from the virus.
- 46:52It's what grabs onto the ACE2 receptor,
- 46:56and essentially is what most vaccines
- 46:58that one might design for the virus would target.
- 47:01And the point is that the receptor-binding
- 47:04domain, which has gotten a lot of press already turns out
- 47:07to have the selected sites.
- 47:08You can see them here, here, here and here.
- 47:12These are sites that are selected,
- 47:13meaning they're changing rapidly
- 47:13during the pre-zoonotic phase.
- 47:17So these are sites that are changing, not in humans,
- 47:20but in the bats, in the pangolins,
- 47:22and in whatever other animals that this virus
- 47:25is spreading among, or has been spreading among
- 47:27before the zoonosis to humans.
- 47:29So then the question is, are similar sites under
- 47:30selection during zoonosis?
- 47:31And during post zoonosis?
- 47:36And the answer right now is yes,
- 47:38it seems kind of similar,
- 47:39although we don't get the same sites.
- 47:40So we have to do a little bit
- 47:42more molecular, you know, staring at this and understanding
- 47:44it because these results are literally
- 47:46I got these results today, actually.
- 47:48So we have to sort of do more of this
- 47:51and we can actually look in more depth
- 47:54and get more sites with other approaches
- 47:55that we haven't implemented at this moment.
- 47:57But during near zoonosis what you see is again,
- 47:58the selected sites which are in bright red
- 48:06are also on the sort of the visible side
- 48:08of the receptor-binding domain
- 48:13of the spike protein, which is the tip
- 48:17the outside portion of this gene.
- 48:23Lastly, if we look post-zoonosis, that's in
- 48:24the evolution within humans, we again see that
- 48:26the selected sites are sites that are at this tip region.
- 48:33Again, none of this is terribly surprising.
- 48:35The interesting thing is that it kind of indicates
- 48:36that the zoonosis it kind of indicates consistency.
- 48:38Again, there's a lot more to do before
- 48:40we can conclude anything like this,
- 48:42but the idea we have right now indicates
- 48:44a good deal of consistency between the selection
- 48:46that's ongoing in humans, during zoonosis, and pre-zoonosis.
- 48:51And what that implies is that this may
- 48:54well have been as I said, very briefly,
- 48:56during this talk an instance where there's a virus
- 49:00just circulating around in bats and pangolins
- 49:01that could have caused this disease at any time,
- 49:04it's just a matter of whether or not we actually
- 49:07have exposure to those organisms
- 49:11that allows the transmission to happen.
- 49:14Consistent with this, I'll just mention
- 49:17a couple like verbal points,
- 49:18which is that all the evidence that we have indicates
- 49:20that this virus spread extremely quickly
- 49:23from the moment of its zoonosis into humans.
- 49:26And in fact, in most cases of zoonosis,
- 49:28we find that that's true,
- 49:31which is somewhat counterintuitive.
- 49:33Obviously, it hasn't adapted to humans,
- 49:34it has adapted to the mammalian immune system.
- 49:37And so to the extent that our immune system is not
- 49:39tremendously different from that of bats or pangolins,
- 49:41it may be not surprising that it can infect us.
- 49:44But one of the things that is true is that
- 49:47if it did not spread very quickly,
- 49:48very easily from the very moment it transmitted to someone,
- 49:51it would probably lead to a dead end.
- 49:52In other words, if you don't have
- 49:55an ability to transmit and spread just from the get go,
- 49:57the first person who gets infected
- 50:00is very unlikely to transmit it to someone else.
- 50:02So it sort of has to be well pre adapted
- 50:04for a zoonotic event to actually spread in humans.
- 50:07Now, we would need more zoonotic events,
- 50:11God forbid that actually happens,
- 50:13to really get a better picture of that.
- 50:15But the general result in the scientific
- 50:16literature does seem to show that when zoonosis happens,
- 50:18the disease is already well set to cause problems.
- 50:22And the examples that we do have where
- 50:24it doesn't happen like that, like MERS,
- 50:27or like, well, MERS is a good example.
- 50:30It's a really deadly disease,
- 50:31but it doesn't transmit well among humans.
- 50:32And so that's an example where maybe it's transmitting
- 50:35to humans, but it's not transmitting among humans.
- 50:37And it's very hard for that disease
- 50:40to catch on within the human population
- 50:43and do human transmission as opposed to zoonotic events.
- 50:45And that's because it doesn't transmit
- 50:47and it doesn't usually evolve that ability
- 50:48to transmit over the short time that
- 50:51individuals might be infected
- 50:53when they get it, usually from camels.
- 50:57Okay, so I've showed you those examples.
- 50:59I just wanna mention what else we're gonna be doing.
- 51:02So what I just showed you was actually
- 51:05for SARS coronavirus two, some sites
- 51:06that are under selection in SARS
- 51:08coronavirus two genes.
- 51:10This is the S gene right here.
- 51:12That's the spike gene.
- 51:13We're gonna be looking at that in SARS coronavirus
- 51:15one and two. We're also going to be looking
- 51:18at other genes in the genomes.
- 51:22These have other functions.
- 51:23The M gene, for instance, is a membrane gene.
- 51:26So it might be relevant to and the gene
- 51:28as well might be relevant to vaccine generation.
- 51:32Like if we could generate a vaccine that targeted
- 51:35those, maybe they would be unable to change at the same
- 51:41pace that the spike protein would; they might be more conserved.
- 51:44And that might be one approach towards developing a vaccine
- 51:46that would be a longer-term vaccine, because one thing we
- 51:49have to worry about, of course with this Coronavirus,
- 51:53is and I have other research that we're doing on
- 51:55this question, which I'd love to talk about if anyone's
- 51:57curious, but you can estimate
- 51:59what the actual waning immunity of it is,
- 52:00even though we don't have data on that, by looking
- 52:03at other related species and using the phylogeny
- 52:05to understand how the waning immunity
- 52:08has evolved across the species
- 52:09and what the projected or most likely
- 52:12waning immunity of SARS coronavirus is,
- 52:15and it tends to be, it looks like
- 52:16it's around 80 weeks or so.
- 52:18So if we get about 80 weeks of a waning period
- 52:21of immunity from this, that's not that
- 52:22much: every two years or so we're gonna have
- 52:25coronavirus coming around, in terms of we're going to
- 52:28be susceptible again to coronavirus.
- 52:30Not that we're going to get it every two years.
- 52:33And what that would mean is that
- 52:36it's likely to persist as a circulating virus.
- 52:38And if it remains as deadly as it is that's a serious issue.
- 52:40So we're gonna really want to buy a vaccine.
- 52:42And we're not necessarily going to wanna have another flu
- 52:44vaccine that we have to get every year.
- 52:49So what we really want to do is target
- 52:51some genes that may be under more constraint
- 52:53than the receptor-binding protein gene, the spike gene.
- 52:57So anyway, the point is that looking at multiple genes,
- 53:00trying to understand where conserved regions are and where
- 53:03regions that are under selection are, is important.
- 53:05And we'll be doing that.
- 53:07And hopefully some of those results will
- 53:11help to guide the kind of generation of vaccines,
- 53:15and also the generation of therapeutics,
- 53:16because sites that are under
- 53:19selection are functional.
- 53:20So if you actually design a therapeutic
- 53:21that interferes with the sites that are under selection
- 53:22sort of in an opposite way from vaccines. With vaccines,
- 53:25we really want to target something that just doesn't change.
- 53:26With therapeutics, we may want to target
- 53:27the changing regions, if we can design something
- 53:30that generically does, because those changing
- 53:31regions are functional.
- 53:32In other words, those sites at the end of the spike protein
- 53:33are clearly ones that do bind the ACE2 receptor.
- 53:35It's just that they're flexible
- 53:38about what they are in order to bind it.
- 53:42So we need to include
- 53:43all of those changing sites, if we wanna develop
- 53:46a therapeutic that, for instance, would somehow interfere
- 53:50with the binding of ACE2 receptors by the spike protein.
- 53:53So thank you very much for listening to the ongoing work
- 53:56we're doing on COVID-19.
- 53:59I would love to entertain any questions that you have.
- 54:03Let me just take one moment to acknowledge
- 54:05some of the people that I should acknowledge in this work,
- 54:09I already showed you a picture of John John earlier,
- 54:11the picture associated with the MACML approach
- 54:13that we developed many years ago, 10 years ago basically.
- 54:15Yinfei Wu has been taking the lead on this project.
- 54:18She's a master's student.
- 54:19Yano os Wang was a visiting
- 54:22assistant professor. Stephen Gaugham,
- 54:24who is in the EEB department,
- 54:26has been helping out with this analysis.
- 54:28Haley Hassler, who is in my lab, has been helping out
- 54:30with phylogenetics. Jayveer Singh is an undergrad
- 54:32who's been doing some of the research work,
- 54:35some of the actual literature research
- 54:37that has helped us to contextualize
- 54:39the work we're doing. Mofeed Najib
- 54:41produced those diagrams of the spike protein
- 54:44with the sites that we have identified
- 54:46as under selection so far,
- 54:48Zheng Wang is a long term collaborator of mine who works
- 54:54on nearly all the phylogenetic projects
- 54:56that I do, who works with me.
- 54:59And then Alex Thornburg is a long-term collaborator of mine,
- 55:02now in North Carolina.
- 55:06Well, he's currently at the North Carolina
- 55:08Museum of sciences, but he works on a lot of phylogenetic
- 55:11projects with me as well.
- 55:13And by the way, all of this, fortunately,
- 55:16was recently awarded one of the NSF RAPID grants
- 55:19to do this research.
- 55:20So we're very pleased to have funding to
- 55:22continue to work on this as time goes on, which is good
- 55:25because it's taking quite a lot of work
- 55:27to do the sequence wrangling
- 55:29and the analyses themselves.
- 55:30As I mentioned, they're computationally intensive.
- 55:32So Alex and I were the PIs on that particular
- 55:36grant from the NSF.
- 55:37So we're excited to continue to do that work.
- 55:41And with that, I think I would
- 55:42like to entertain any questions you might have.
- 55:45- Thank you, Jeff, this was great.
- 55:48I'm sure we have a lot of questions.
- 55:49Who goes first?
- 55:54Again, you can type the questions in the
- 55:59chat box or just unmute.
- 56:13- I have a quick question.
- 56:14- Okay.
- 56:16- You mentioned or you touched a bit on this before,
- 56:20but how would this compare to site-wise estimates
- 56:24of omega that you would get from PAML
- 56:28or a similar program?
- 56:29- So I'm sorry, I sort of was rushing at the end.
- 56:32I didn't explain that. In fact, I'm using PAML for some of it.
- 56:35So I'm using PAML
- 56:36for the pre zoonosis analysis, and for the post zoonosis
- 56:40analysis, because as I mentioned during the talk,
- 56:44if you have a large phylogeny
- 56:46with multiple branches, et cetera, et cetera,
- 56:49where you can look over that entire phylogeny then you
- 56:51can get multiple changes at individual sites,
- 56:53which is what PAML actually uses to infer selection, right?
- 56:55You have to have the site change not just once
- 56:57but twice or three times.
- 57:02And then it says, oh, that's under selection because
- 57:07it keeps changing again and again and again.
- 57:12So PAML allows you to do that
- 57:13if you have this sort of deep time
- 57:15or large amount of time and multiple lineages that you're
- 57:17looking at. The MASS-PRF approach that I'm using enables
- 57:19you to do that on just a single lineage without needing
- 57:22multiple changes. I mean, multiple changes
- 57:23on a single lineage you can't even detect,
- 57:25because it just looks like one change.
- 57:26If you have the ancestral sequence, which is what we do,
- 57:28ancestral state estimation, to get the ancestral sequence,
- 57:31and if you have the descendant sequence, and A changes
- 57:33to T, you don't know if it changed A to G to C to T
- 57:35or if it just changed A to T. You have no idea; you can
- 57:36just say it changed once.
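To make the point about hidden changes concrete, here is a minimal Python sketch (not part of the talk or of any of the software mentioned). It counts how many distinct substitution histories are consistent with seeing A in the ancestor and T in the descendant. The function name `count_paths` and the restriction to at most three steps are illustrative assumptions.

```python
from itertools import product

BASES = "ACGT"

def count_paths(ancestor, descendant, n_changes):
    """Count histories with exactly n_changes substitutions that start at
    `ancestor` and end at `descendant` (every step must change the base)."""
    count = 0
    # Enumerate every possible sequence of intermediate bases.
    for middle in product(BASES, repeat=n_changes - 1):
        path = (ancestor, *middle, descendant)
        if all(x != y for x, y in zip(path, path[1:])):
            count += 1
    return count

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, "change(s):", count_paths("A", "T", k), "possible histories")
```

All of these histories produce the same ancestor and descendant pair, so comparing the two sequences on a single lineage only ever reveals one difference at that site.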
- 57:38And so there's no real way to run PAML;
- 57:40well, there is a way, but it's statistically
- 57:41a really underpowered, terrible thing
- 57:42to do to try to run PAML on a single lineage
- 57:44and figure out whether something's under selection.
- 57:47The advantage of this approach is that it
- 57:49can use that polymorphism data, the data of, like, what's
- 57:51just circulating within populations, as a metric for how
- 57:54much mutation is occurring.
- 57:56You can essentially divide out by that
- 57:59and then again, because we're integrating over all
- 58:04these models of how these things change, we're essentially
- 58:07borrowing information from neighboring sites for what their
- 58:10rates of change are, et cetera et cetera
- 58:13to estimate what the possible amount
- 58:15of selection is on all these sites.
- 58:16So by using the polymorphism data, and by doing this model
- 58:19averaging approach, we're actually able
- 58:21to take individual lineages and estimate
- 58:23the selection on them.
- 58:25And that's what we're doing in the near zoonosis analysis
- 58:29that I showed you in the middle here.
- 58:33So there are different ways of doing the analysis.
- 58:35And it's necessitated by the fact that we just have this
- 58:37one lineage and there's no way it won't be a single lineage
- 58:39in any dataset we look at because for zoonosis,
- 58:42we're going to have human sequences,
- 58:44we're gonna have some animal sequences,
- 58:45we're not going to know we're not going
- 58:48to have any information about the actual zoonosis.
- 58:50Even if we knew the first human,
- 58:52we could just take that as an estimate.
- 58:54We still probably need some data here.
- 58:56Maybe you could have the first human
- 58:58and the first animal that you got it from.
- 59:00That just doesn't exist.
- 59:01We don't have that data for any zoonosis.
- 59:04How would we? We would never be there at the moment.
- 59:07So we have to assume that there's a number
- 59:09of transmissions among humans
- 59:10and a number of transmissions among animals
- 59:13during that near zoonotic period.
- 59:14And it's just a single lineage.
- 59:16So we can't really run PAML on that,
- 59:19in summary, because PAML requires multiple
- 59:21changes, multiple lineages, to have power
- 59:23to actually infer evolutionary change.
- 59:25MASS-PRF, fortunately, can do that,
- 59:27because you can look at single lineages.
- 59:28You can use MK tests as well on single lineages; the MK test
- 59:33is basically designed to look at single lineages.
- 59:36But the problem with MK tests, as I mentioned,
- 59:37is that they're assuming the entire
- 59:39gene is under selection, which means it doesn't give you
- 59:41the scope or understanding about receptor-binding
- 59:44gene sites under selection or something like that.
- 59:46It often will just give you a result that the gene's not under
- 59:47selection, which is not true.
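For readers unfamiliar with the MK (McDonald-Kreitman) test mentioned here, the following is a minimal Python sketch of the whole-gene version. It compares fixed differences with polymorphisms in a 2x2 table using Fisher's exact test. The function names and the counts in the usage example are illustrative assumptions, not results from the talk, and this is not the MASS-PRF method.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is as likely as, or less likely than, the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

def mk_test(dn, ds, pn, ps):
    """dn, ds: fixed replacement / synonymous differences between lineages.
    pn, ps: replacement / synonymous polymorphisms within populations.
    Assumes ds, ps and dn are all non-zero."""
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(dn, ds, pn, ps)

if __name__ == "__main__":
    # Illustrative counts only.
    ni, p = mk_test(dn=7, ds=17, pn=2, ps=42)
    print(f"neutrality index = {ni:.3f}, p = {p:.4f}")
```

A neutrality index below 1 suggests an excess of fixed replacement changes. Because the counts are pooled over the whole gene, a few strongly selected sites can be diluted by many constrained ones, which is the limitation described above.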
- 59:51- Does that answer your question?
- 59:54- Yes.
- 59:55- Great.
- 60:00- Any other questions?
- 01:00:04- I have one more if no one else wants to.
- 01:00:05- Sure, go ahead.
- 01:00:07- So in B cells, we have mechanisms
- 01:00:10of mutation that are specifically
- 01:00:13biased towards replacement mutations.
- 01:00:17So in the absence of selection,
- 01:00:18the mutation mechanisms actually cause
- 01:00:21an omega greater than one.
- 01:00:24Would this have any way of correcting for that?
- 01:00:28- So the tricky part is, and I don't know how it might,
- 01:00:31the tricky part is not so much running the software,
- 01:00:33which you could certainly do on that.
- 01:00:37The tricky part would be identifying
- 01:00:39what polymorphism is, in the case of those cells.
- 01:00:43So if you could identify sets of cells that are undergoing
- 01:00:47the mutation but aren't under selection in some way, then
- 01:00:51you could use that as the proxy for what we use here,
- 01:00:54which is within-population polymorphism,
- 01:00:57and then estimate that.
- 01:00:59I just don't know whether you have a way of
- 01:01:01doing that. If you want to discuss
- 01:01:03it with me, we could.
- 01:01:05That's sort of always the key for detecting selection.
- 01:01:09And, you know, many of you may be familiar that I work
- 01:01:11on cancer in some of the work that I do.
- 01:01:13It's the same
- 01:01:18problem that I'm working on there all the time: I'm trying
- 01:01:21to understand what the baseline mutation rates
- 01:01:23in cancer and somatic evolution of cells are.
- 01:01:25Because if I understand the baseline rates,
- 01:01:27how often those things change,
- 01:01:29just the mutation alone,
- 01:01:30then I can always estimate selection.
- 01:01:32And that's the thing we almost always want to
- 01:01:34know about in the analysis of sequence data.
- 01:01:37So, again, it's all about figuring out if there's some piece
- 01:01:42of the data that can be used to estimate that polymorphism
- 01:01:46and an approach like this, the benefit of an approach like
- 01:01:48this would be, you know, maybe you can estimate that for
- 01:01:50some portions of the gene, but not others, you know, maybe
- 01:01:52then there's a way that you could use this sort of model
- 01:01:54averaging approach to get at the underlying rate that it's
- 01:01:56happening, even if you can't estimate
- 01:01:58for that particular site, for instance.
- 01:02:00So I think there might be potential to do it,
- 01:02:02but it just depends, you know, on whether
- 01:02:04there's a critical, you know, set of data in what you're
- 01:02:09looking at, which I haven't spent much time
- 01:02:12looking at back in the day.
- 01:02:13So I wouldn't know whether there's some way
- 01:02:15of getting that baseline polymorphism or baseline
- 01:02:19mutation rate, which essentially amounts to the same thing.
- 01:02:23It just depends on whether, you know,
- 01:02:26you're assuming the population, you know,
- 01:02:29it's just whether you're looking at a population level,
- 01:02:31or you have some sort of covariance matrix
- 01:02:34to better understand the mutation rate itself.
- 01:02:36- I think there is a similar population of B cells.
- 01:02:38- Great, so I encourage you to look into that.
- 01:02:44- Jeff, I have a quick question.
- 01:02:47I'm not too familiar with genome sequencing.
- 01:02:50But I think the clustering problem,
- 01:02:53the issue and the solution you have,
- 01:02:55can be applied to many types of data.
- 01:02:58So I'm kind of confused.
- 01:02:59In the diagram where you describe
- 01:03:02the different steps, you said that you first pick the most
- 01:03:06likely cluster and then you essentially
- 01:03:07keep splitting the clusters, right?
- 01:03:09How do you get the first clusters? Like,
- 01:03:12is there some randomness in how you split the first?
- 01:03:16- Oh, sorry, I apologize.
- 01:03:19I didn't explain it in enough detail.
- 01:03:22The reason why it's so computationally intensive
- 01:03:24is we look at all possible models,
- 01:03:27all possible, exhaustively.
- 01:03:29Now, I actually spent a year of my life trying
- 01:03:31to find a way to develop a Bayesian approach
- 01:03:34or some approach that would allow me
- 01:03:38to not look at all possible models, you know,
- 01:03:40because if you could do that,
- 01:03:41this would be a great way for doing tons of different things
- 01:03:45on very large data sets, right?
- 01:03:47And what amazed me is, I found that
- 01:03:50it was just an impenetrable problem.
- 01:03:53If I didn't look at every possible model,
- 01:03:56I could not get it to work. I couldn't prove that
- 01:04:00that's true; like, I don't have any proof that's true.
- 01:04:04And I would encourage anyone who really wants to dive
- 01:04:05in there, go ahead.
- 01:04:06But I'll warn you that I spent a year
- 01:04:07banging my head against that problem.
- 01:04:09And when I didn't
- 01:04:10exhaustively search all the models, I always
- 01:04:12caused these biases; like, there was no way to sample them.
- 01:04:16I even have ways of sampling the models
- 01:04:17according to their probability.
- 01:04:24But even that causes a bias because sometimes
- 01:04:31there's a large number.
- 01:04:31So if you look at the, if you think
- 01:04:34about the set of models, it's a very large set of models.
- 01:04:35And there isn't actually a huge amount
- 01:04:38of likelihood differences between these models.
- 01:04:42That's the thing.
- 01:04:45So when you don't exhaustively sample the models,
- 01:04:49if you just sample some of the most likely models,
- 01:04:53you actually are sampling just
- 01:04:56one corner of the space.
- 01:04:57And it's possible for a bunch of
- 01:04:59not quite so likely models, but reasonable models
- 01:05:00that are not in that corner to sort of be actually
- 01:05:03highly influential on the model average.
- 01:05:04And so the bottom line is like sampling
- 01:05:05by trying to pick in the you know, most likely space doesn't
- 01:05:06work sampling by picking randomly doesn't work.
- 01:05:07And I could go into more detail about it.
- 01:05:09But it turned out that I couldn't do it
- 01:05:10any way other than exhaustive sampling.
- 01:05:12So, I say that... sorry, I made a mistake there.
- 01:05:14I couldn't do it by any biased approach
- 01:05:16towards that exhaustive sampling.
- 01:05:18In the approach that I'm showing you right here,
- 01:05:21actually, there are two ways of doing it.
- 01:05:22One is to sample stochastically,
- 01:05:23according to likelihood, and the other is to sample exactly
- 01:05:27across all of them. Exhaustive sampling certainly works.
- 01:05:30In fact, it's implemented in the approach that I
- 01:05:33was just showing, I'm sorry, I just sort of jumped too fast
- 01:05:35to say what I was saying.
- 01:05:37So sampling stochastically works
- 01:05:38and sampling exhaustively works. Sampling stochastically is
- 01:05:40still very computationally intensive.
- 01:05:42But I couldn't
- 01:05:44find any way to sort of, you know, importance sample or do
- 01:05:48some sort of approach that would allow me to get a smaller
- 01:05:50set of models, which would then if we could do that,
- 01:05:53that could be really important,
- 01:05:55because then you could do this
- 01:05:57on more than, like, 2,000 sites.
- 01:05:59It's somewhere around 2,000 sites
- 01:06:00that you start running into real problems with
- 01:06:04just too much computation time
- 01:06:06to make it worthwhile.
- 01:06:07So we could extend this to 10,000, 100,000, you know,
- 01:06:11potentially really, really large numbers of sites,
- 01:06:13and really, really sparse sets of sites.
- 01:06:16If only we could find a way
- 01:06:19to bias the sampling towards models that are more likely
- 01:06:24without causing biases in the results.
- 01:06:26I couldn't find any way to do that.
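The following is a deliberately simplified Python sketch of exhaustive model averaging over clustering models on a linear sequence of event / no-event sites. It is not the MACML or MASS-PRF implementation. It assumes each model has only a single contiguous cluster with its own rate, uses Akaike weights, and charges an assumed four parameters per cluster model. The name `model_average` and the data are hypothetical.

```python
import math

def bernoulli_loglik(k, n):
    """Maximized log-likelihood of k events among n sites sharing one rate."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def model_average(events):
    """Average per-site event rates over every single-cluster model."""
    n = len(events)
    prefix = [0]
    for e in events:
        prefix.append(prefix[-1] + e)
    total = prefix[-1]

    # Null model: one rate everywhere (1 parameter).
    models = [(2 * 1 - 2 * bernoulli_loglik(total, n), 0, n, total / n, total / n)]
    # Exhaustively enumerate every contiguous window [start, end) as the cluster.
    for start in range(n):
        for end in range(start + 1, n + 1):
            if start == 0 and end == n:
                continue  # identical to the null model
            k_in, n_in = prefix[end] - prefix[start], end - start
            k_out, n_out = total - k_in, n - n_in
            ll = bernoulli_loglik(k_in, n_in) + bernoulli_loglik(k_out, n_out)
            # Assumed penalty: two rates plus two boundaries.
            models.append((2 * 4 - 2 * ll, start, end, k_in / n_in, k_out / n_out))

    best = min(m[0] for m in models)
    raw = [math.exp(-0.5 * (m[0] - best)) for m in models]
    norm = sum(raw)
    rates = [0.0] * n
    for r, (_, start, end, rate_in, rate_out) in zip(raw, models):
        for i in range(n):
            rates[i] += (r / norm) * (rate_in if start <= i < end else rate_out)
    return rates, len(models)

if __name__ == "__main__":
    events = [0] * 8 + [1, 1, 0, 1, 1] + [0] * 7  # hypothetical data
    rates, n_models = model_average(events)
    print(n_models, "models;", [round(r, 2) for r in rates])
```

Even in this one-cluster version the number of models grows as n(n+1)/2, and allowing clusters to be split recursively makes the model space much larger, which is consistent with the computational cost described above.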
- 01:06:27- This seems very much related to tree-based
- 01:06:30methods, where essentially you've got, like, to split the space
- 01:06:36and then you model-average a lot of models,
- 01:06:39like the random forest, for example,
- 01:06:41so it's very much related to that, right?
- 01:06:45- Yeah, I have to say I'm now familiar
- 01:06:47with those approaches.
- 01:06:49But when I was completely unfamiliar with it, yeah, I sort
- 01:06:52of thought about it that way.
- 01:06:54But you're absolutely right.
- 01:06:56Yeah, I guess the difference is that here
- 01:06:57you have a sequence, like one sequence;
- 01:07:00there you have a space.
- 01:07:01So you just split in
- 01:07:02different dimensions, but it is really good.
- 01:07:05- And I can mention, just to speculate,
- 01:07:10I'm kind of interested in a number of
- 01:07:13other ways of applying this.
- 01:07:15So for instance, the one I've been thinking about
- 01:07:18and actually worked on a little
- 01:07:20bit, but haven't gotten very far with, is
- 01:07:21when you're dealing with event spaces over time,
- 01:07:22like if you have days, and you have individuals, like,
- 01:07:24prominent for us in public health,
- 01:07:27like individuals who are undergoing events,
- 01:07:29you end up with a very sparse matrix of events.
- 01:07:31And so we use these approaches like survival plots,
- 01:07:38all these approaches that we use to sort of understand
- 01:07:40how these rare events are happening,
- 01:07:42and how people are changing over this,
- 01:07:44that event space is actually really sparse.
- 01:07:45But it's kind of a matrix.
- 01:07:47And you could do this in two dimensions,
- 01:07:48not just one, right?
- 01:07:49So you could model average across two dimensions,
- 01:07:52and then you could get something
- 01:07:53that... the thing that really appeals to me about that is that,
- 01:07:55again, this approach really
- 01:08:00builds up, only from this binomial event,
- 01:08:04no-event stuff, a picture that's very continuous
- 01:08:09over the space and involves no assumptions
- 01:08:11about distribution whatsoever.
- 01:08:12So I'm just wondering if there aren't instances
- 01:08:14where, you know, we could come up
- 01:08:17with a better understanding of what's going on
- 01:08:19with individuals in a matrix such as
- 01:08:20that by using this approach.
- 01:08:22And it's an approach
- 01:08:23that still works even with these sparse spaces, because
- 01:08:26you can model average over these tremendously large numbers
- 01:08:29of models that all have fairly
- 01:08:33equal likelihood to get a result.
- 01:08:35So I don't know, that's just sort of a
- 01:08:37speculation that there might be some interesting
- 01:08:38ways to approach those problems using this kind
- 01:08:41of model averaging technique.
- 01:08:46- Great, I think we should wrap up.
- 01:08:49Thank you, Jeff, for this great presentation.
- 01:08:52And thank you all for joining today.
- 01:08:57The next seminar
- 01:08:58is gonna be, I think, July 14.
- 01:09:01So we'll send out invites.
- 01:09:05All right, thank you, Jeff.
- 01:09:07Thank you all, bye, bye.