BIS Seminar - 6.23.2020 - Model-averaged estimation of molecular evolution and natural selection in SARS-COV-1 and SARS-CoV-2 coronaviruses during zoonosis

June 23, 2020
  • 00:19- All right, I see more people joining
  • 00:32Jeff, how long do you have, like an hour?
  • 00:36Less than that?
  • 00:36- I think I can probably finish in less than an hour.
  • 00:40- Less than an hour, all right.
  • 00:58I think we should get started.
  • 01:02So hi, everyone.
  • 01:03Welcome to our seminar series on COVID-19,
  • 01:07organized by the Department of Biostatistics.
  • 01:10I'm very pleased to have here today Jeff Townsend,
  • 01:15Professor of Biostatistics and of Ecology and Evolutionary Biology
  • 01:20from the Yale School of Public Health.
  • 01:23Thank you, Jeff, for being here today with us.
  • 01:27As usual, you're welcome to write questions
  • 01:30in the chat box or even unmute yourself, if you can,
  • 01:35and other people are not talking.
  • 01:38And, Jeff, why don't you take it from here?
  • 01:42- Okay, thank you very much for the introduction, Laura.
  • 01:45I'm really pleased to have an opportunity to talk
  • 01:46about the work that we've been doing.
  • 01:49I think like many speakers in this series, you know,
  • 01:52we've been working very hard
  • 01:54over a short period to try to make some progress on COVID-19.
  • 01:58Ironically, this is the first work
  • 01:59I think that I started in response to the COVID-19 epidemic
  • 02:03and it's turned out to be a lot of work.
  • 02:07So it's actually gotten the least far.
  • 02:11So we've done a little bit of work, for instance,
  • 02:13on epidemic modeling of COVID-19.
  • 02:15That's already, it's actually been submitted,
  • 02:18I actually have some other work on quarantine
  • 02:20and stuff that turns out to be really interesting
  • 02:24and far along in the research.
  • 02:26And then this work, which I started early on,
  • 02:28which is more evolutionary, and looking at the zoonotic
  • 02:31process has gone a little bit slower.
  • 02:32So what that means is consistent with
  • 02:35many other speakers in this series,
  • 02:35I'm gonna be talking a lot about
  • 02:38the methods that we're going to be using,
  • 02:40which are well developed, and what we're planning to do,
  • 02:43I don't have a lot of results.
  • 02:44But I think that's consistent with these talks in general.
  • 02:47So hopefully, that will be of interest to you
  • 02:49and also be illuminating in terms
  • 02:53of possible research approaches towards this kind of work.
  • 02:58So as Laura mentioned,
  • 03:00I use a lot of evolutionary approaches
  • 03:02to do my analyses of things.
  • 03:04And the title of this talk is model averaged estimation
  • 03:08of molecular evolution and natural selection
  • 03:12in the SARS-CoV-1 and SARS-CoV-2
  • 03:14coronaviruses during the zoonotic period.
  • 03:18So what was attracting my interest in this particular case
  • 03:21is that it's usually very difficult and challenging,
  • 03:25and I'll get to this later in the talk, to figure
  • 03:27out what's going on during the zoonotic period,
  • 03:29because you don't usually get much sampling there.
  • 03:32So, what I wanted to do was apply some techniques
  • 03:35that I've developed to this problem.
  • 03:38And I will get to those techniques
  • 03:41and the application to this problem.
  • 03:43But I first just wanna give a little bit of introduction,
  • 03:46I think, maybe from a statistics point of view
  • 03:47towards some of the methodologies that we're using,
  • 03:49just so everyone can sort of get on board with
  • 03:51at least how I see this as contributing
  • 03:55to interesting statistical questions.
  • 03:57So, in a broad sense, if I can get this to move forward.
  • 04:00Here we go.
  • 04:01I think one of the most intriguing
  • 04:03and interesting and challenging areas of mathematics
  • 04:05and statistics is understanding this border
  • 04:08between the discrete and the continuous.
  • 04:09So one particular
  • 04:13example you can pick out is, if you look at discrete
  • 04:16and continuous distributions that are frequently
  • 04:19in use in statistical probabilistic analyses,
  • 04:21we have the geometric and negative binomial distributions.
  • 04:25And we have the exponential and gamma distributions.
  • 04:30These are essentially waiting for discrete events
  • 04:32when you have a probability over time.
  • 04:33We're waiting for the r-th event if you
  • 04:35have a probability over time,
  • 04:37and they correspond to the distributions in continuous
  • 04:39time for the wait for the first event
  • 04:42or the wait for the alpha event.
  • 04:45So there's a real clear correspondence
  • 04:46between these two distributions.
  • 04:48And you can actually see in the mathematics,
  • 04:50how they're similar as well.
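
The correspondence described here can be checked numerically. The following is a minimal sketch, not something shown in the talk: it assumes a per-step event probability `p = rate / steps_per_unit` (the names `rate`, `t`, and `steps_per_unit` are illustrative) and compares the geometric waiting-time survival probability with the exponential one as the time steps get finer.

```python
import math

def geometric_survival(p, k):
    """P(no event in the first k discrete trials), each with event probability p."""
    return (1.0 - p) ** k

def exponential_survival(rate, t):
    """P(no event before continuous time t) for events arriving at a constant rate."""
    return math.exp(-rate * t)

rate, t = 2.0, 1.5  # illustrative values only
for steps_per_unit in (10, 100, 10_000):
    p = rate / steps_per_unit        # per-step event probability (assumed discretization)
    k = int(t * steps_per_unit)      # number of discrete steps that make up time t
    print(steps_per_unit, geometric_survival(p, k), exponential_survival(rate, t))
```

As the steps get finer, the discrete value approaches the continuous one; the same kind of limit links the negative binomial and the gamma.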
  • 04:53And that correspondence is kind of interesting.
  • 04:54And the reason why I say it's interesting is
  • 04:56because often many of the biggest problems I think
  • 04:59we wrestle with in statistics are when we're trying
  • 05:01to deal with data that is some intermediate
  • 05:04level between continuous and discrete,
  • 05:07and where we're trying to figure out which
  • 05:08approach is the best to use. Should we use some
  • 05:11sort of parameterized distribution to address it?
  • 05:13Or should we use some sort of nonparametric
  • 05:17approach based on the discrete?
  • 05:18I'm not sure in any particular case.
  • 05:19But I just wanna mention
  • 05:21that I think that's a very interesting area.
  • 05:22And the technique I'm gonna tell you about
  • 05:23is definitely wrestling with exactly this kind of question.
  • 05:27So what kind of question do I mean?
  • 05:29Well, I mean, questions that deal with state spaces,
  • 05:32over time, or over any discrete or continuous axis.
  • 05:36And you can see in this diagram, just to give you a picture
  • 05:40of the kinds of problems that one deals with
  • 05:43between discrete and continuous measures.
  • 05:45You can have here it's depicted as time,
  • 05:48you could have a discrete state space,
  • 05:51state space you're measuring over time,
  • 05:53you could have a continuous, sorry,
  • 05:56you're gonna have discrete measurements
  • 05:59where you've got discrete time
  • 06:01in a discrete state space,
  • 06:03you could also have discrete time
  • 06:06and a continuous state space.
  • 06:08You can have continuous, continuous
  • 06:12or you can have discrete, continuous.
  • 06:13And these two on the bottom, the two on the left,
  • 06:15sorry, are the relevant ones for
  • 06:17what I wanna talk to you about.
  • 06:19In my research, which is largely focused
  • 06:22on informatic data that we can obtain from sequencing
  • 06:26or other approaches like that.
  • 06:28A lot of what we're trying to do is look at these discrete
  • 06:30linear sequences that have sites, DNA sites or amino acid
  • 06:34sites and trying to understand is there some
  • 06:37pattern in those sites that allows us to understand
  • 06:40something about the biology of the organism
  • 06:41or the biology that we want to know something more about?
  • 06:45So what essentially I'm gonna be doing
  • 06:48is telling you about an approach
  • 06:50that takes essentially discrete items over some X axis
  • 06:54here, which in my case is always going to be
  • 06:56sequence space, like the nucleotides
  • 06:58or the amino acids of a sequence.
  • 07:01And turns it into these kinds of more discrete models.
  • 07:04And then, in a procedure that I'm going to tell you
  • 07:07about actually gives us more of a continuous measure
  • 07:10over that space, it's not completely continuous,
  • 07:13it actually is on every site.
  • 07:14But when you work with hundreds of sites,
  • 07:17it turns out to look very continuous
  • 07:20in terms of how it appears.
  • 07:22But it's done with a discrete model
  • 07:23that looks over multiple sites.
  • 07:24So well, I'll tell you how it works in a moment.
  • 07:26And I hope it's of interest to you guys.
  • 07:28So just to introduce that, in general,
  • 07:31the lab has worked on a lot of different kinds of data,
  • 07:34and including things like gene expression data
  • 07:36that borders this discrete-continuous measurement.
  • 07:39The old microarrays we used to use gave us
  • 07:43essentially continuous measures of gene expression.
  • 07:44Now we get discrete counts
  • 07:46from our sequencing approaches.
  • 07:49Then all the sequence data we work with
  • 07:51often ends up being essentially clusters
  • 07:53of sites of various kinds.
  • 07:56And then we also use a lot of phylogenetic inference,
  • 07:59which is another kind of just discrete modeling
  • 08:01in terms of the topology, but it borders
  • 08:03between these two, because we have discrete modeling of the
  • 08:07topology, there are certain topologies
  • 08:10that the taxa that we're interested in looking at
  • 08:12that show their relationship to each other.
  • 08:13At the same time, there's also a continuous
  • 08:15measure out of that, which is these branch lengths,
  • 08:17or how diverged these different taxa
  • 08:19are from each other and constructing the phylogeny.
  • 08:22So this sort of border between discrete
  • 08:24and continuous measures, always sort of plagues
  • 08:28and intrigues me, I guess it would be the question.
  • 08:30Okay, so what am I gonna do today?
  • 08:32What I wanna do today is talk about
  • 08:35maximum likelihood model averaging to profile clustering
  • 08:37of site types across discrete linear sequences.
  • 08:40So at the very base level,
  • 08:41how do we take kind of these discrete sequences
  • 08:44of amino acids or nucleotides
  • 08:46and understand whether sites are closer to each other
  • 08:50or farther apart from each other
  • 08:52That is the question: are the site types just uniformly
  • 08:53distributed across a sequence?
  • 08:55Are they clustered close together or far apart?
  • 08:58Secondly, I'm gonna talk about how we can
  • 09:01then use that approach to understand whether sites
  • 09:04are under selection in a gene expressed in a sequence.
  • 09:07And what I mean by under selection is that,
  • 09:09in fact, sites are changing
  • 09:12at a more rapid pace than you'd expect simply
  • 09:14by mutation alone.
  • 09:16So mutation, of course, is going to introduce
  • 09:18variation into a genetic sequence.
  • 09:19But when you see changes that are happening faster
  • 09:21over time in a population,
  • 09:23than mutation alone would produce,
  • 09:26that implies that every time that mutation is happening,
  • 09:29it's spreading across the population.
  • 09:30And that's why you see that uptick
  • 09:31in the rate of change of those sites.
  • 09:34So we can actually use this clustering approach
  • 09:36to identify regions of the gene that have
  • 09:38that sort of uptick and I'll explain how we do that.
  • 09:41Now lastly, I'm just going to show you a very few slides
  • 09:43on the title of the talk,
  • 09:45which is this model-averaged estimation of the molecular
  • 09:48evolution and natural selection in SARS coronavirus 1
  • 09:51and SARS coronavirus 2 during the zoonosis.
  • 09:55So by the time we get to these,
  • 09:57I'll just let you know we're almost done with the talk.
  • 09:59All right, so to talk about the first one,
  • 10:01maximum likelihood model averaging to profile clustering
  • 10:03of site types across discrete linear sequences.
  • 10:09I just want to... (phone ringing)
  • 10:11Sorry, emphasize that we wanna figure out
  • 10:20whether site types are clustered within a linear sequence.
  • 10:22This sounds like a very straightforward
  • 10:24statistical question; it seems like something
  • 10:27that should have been addressed many, many times
  • 10:28in the statistical literature.
  • 10:29Much to my surprise,
  • 10:30it's actually not terribly well explored.
  • 10:34You have a linear sequence,
  • 10:36it's so long, and you have site types of one type
  • 10:38or another: are they clustered next to each other?
  • 10:39Well, if you know the bounds of the region of interest,
  • 10:42in other words, if you can describe it, oh,
  • 10:43I'm interested in this domain right here,
  • 10:46and it's from site to site 90 or some other description.
  • 10:48If you know the bounds,
  • 10:49it's very simple to analyze that kind of data.
  • 10:52You can just quantify the site type proportions
  • 10:55within and outside those bounds.
  • 10:57and use something like a straightforward Fisher's exact
  • 10:59test for significance: an extremely simple problem.
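
When the bounds are known, the test described above reduces to a 2x2 table of site-type counts inside and outside the region. The sketch below is one way to do that using only the standard library; the function names and the toy sequence are assumptions for illustration and do not come from the talk.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def known_region_pvalue(seq, start, end):
    """Compare the proportion of 1-coded sites in seq[start:end] with the rest."""
    inside = seq[start:end]
    outside = seq[:start] + seq[end:]
    a, b = sum(inside), len(inside) - sum(inside)
    c, d = sum(outside), len(outside) - sum(outside)
    return fisher_exact_two_sided(a, b, c, d)

toy = [0] * 10 + [1] * 5 + [0] * 10   # hypothetical sequence, region of interest 10..15
print(known_region_pvalue(toy, 10, 15))
```

This only works because `start` and `end` are supplied in advance, which is exactly the assumption that fails in the problem discussed next.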
  • 11:01But what if you don't actually know those bounds?
  • 11:04What if you don't know even what you're looking for exactly?
  • 11:05You just know you're interested in concentrations
  • 11:07of one site type compared to another site type
  • 11:10across some discrete linear sequence,
  • 11:12like this series of zeros and ones you see below.
  • 11:15There are ones and zeros,
  • 11:17and there are periods where a series
  • 11:20of ones are closer to or farther apart from each other.
  • 11:22How should we figure out whether things
  • 11:24are actually clustered in that site?
  • 11:26Or are they random?
  • 11:27So if you don't know exactly where to describe,
  • 11:31or what size you're looking for,
  • 11:33the most common solution people use
  • 11:35is some kind of sliding window,
  • 11:36they take a window over the series,
  • 11:38and they slide it across and say,
  • 11:40"How many are in this window?"
  • 11:41And then you can come up with based on the sliding window
  • 11:44a sort of diagram of the clustering.
  • 11:46And that's an approach that actually does
  • 11:49give a good metric of the clustering
  • 11:51in terms of like you see peaks where there's
  • 11:53a lot of clustering and valleys where there is none.
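
The sliding-window idea can be sketched in a few lines. This is a minimal illustration, with the site type of interest coded as 1 and a hypothetical function name; note that the window size must be supplied by the user.

```python
def sliding_window_counts(seq, window):
    """Count sites of interest (coded 1) in every window of the given size."""
    if not 0 < window <= len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    count = sum(seq[:window])
    counts = [count]
    for i in range(window, len(seq)):
        # Slide one site to the right: add the entering site, drop the leaving one.
        count += seq[i] - seq[i - window]
        counts.append(count)
    return counts

toy = [1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # hypothetical sequence
for window in (3, 5):
    print(window, sliding_window_counts(toy, window))
```

Adjacent windows share all but one site, so the counts are strongly correlated, and the profile changes with the chosen window size.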
  • 11:56However, significance testing with that kind of approach
  • 11:59is often awkward to construct,
  • 12:00due to strong autocorrelation
  • 12:02among these overlapping windows.
  • 12:04And of course, if you just sort of
  • 12:06take windows arbitrarily from one location to another,
  • 12:09then you're really instituting, (indistinct chatter)
  • 12:13then that causes problems.
  • 12:14Because what if the cluster is really on a border
  • 12:16between two windows, so you have to slide it over and then
  • 12:19you have the autocorrelation.
  • 12:20And it becomes actually statistically
  • 12:21quite challenging to sort of account
  • 12:24for all of those autocorrelations.
  • 12:25Secondly, the need to specify that window
  • 12:27size itself presents a user with a procedural ambiguity
  • 12:31that almost inevitably leads to post hoc selection of window
  • 12:34size and can mislead inference that is just the fact that
  • 12:37you have to choose a window size.
  • 12:39And if you don't actually have a good a priori
  • 12:41outside reason to choose it.
  • 12:43It's very hard not to choose a window size
  • 12:44that ends up validating your hypothesis in some way.
  • 12:49So it'd be better if we could just have an approach
  • 12:51that does not require us to place in some
  • 12:53arbitrary parameter that gives us a window size.
  • 12:56So in order to address this question,
  • 12:58a postdoc of mine, Zhang Zhang, who you see below, worked
  • 13:01with me to address it.
  • 13:03Oh, I wanted to say one other thing,
  • 13:04which is that, yes, this has been addressed with some
  • 13:07nonparametric methods that people have developed,
  • 13:11including some rather famous people like Sam Karlin.
  • 13:14And these are methods that do not assume prior knowledge.
  • 13:17And they've been suggested to detect this clustering
  • 13:20in discrete linear sequences.
  • 13:21So you can do runs tests that look for
  • 13:22the longest unbroken run, or the variance of the run
  • 13:26lengths across the entire sequence.
  • 13:27Both of these are indicators of clustering.
  • 13:30Unfortunately, both of those
  • 13:32are not sufficient tests,
  • 13:34in that they don't use enough of the information
  • 13:36for you to actually have as much power as you'd
  • 13:39like to do the analysis.
  • 13:40And that's because if you use like
  • 13:42the longest run length, for instance, of course,
  • 13:44you're only really using a little bit
  • 13:45of information about the entire sequence.
  • 13:47And of course, you're really missing anything
  • 13:49like a cluster of ones that has a bunch of small
  • 13:52clusters that are all next to each other interspersed
  • 13:54with a few of the other type,
  • 13:56so the longest unbroken run doesn't work well.
  • 13:59In terms of power,
  • 14:01if you use the variance of run length,
  • 14:04that gets rid of the fact that you're looking for just one.
  • 14:05But unfortunately, a variance doesn't tell you anything
  • 14:07about the relative position of sites
  • 14:11that are of the same type across the sequence.
  • 14:14So the fact that this one, one, one, one here is close
  • 14:18to the one, one here, and to another one,
  • 14:20the fact that these are all close to each other,
  • 14:22does not give us the power that it should
  • 14:25for understanding that this region may
  • 14:27be clustered.
  • 14:30So variance of run length is also an underpowered approach.
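
A small sketch can show why the two runs statistics lose information. The helper names and the two toy sequences below are hypothetical; the point is only that both statistics depend on the run lengths and not on where the runs sit.

```python
from itertools import groupby
from statistics import pvariance

def run_lengths(seq, site_type=1):
    """Lengths of the unbroken runs of the given site type, in order of appearance."""
    return [sum(1 for _ in group) for value, group in groupby(seq) if value == site_type]

def longest_run(seq, site_type=1):
    return max(run_lengths(seq, site_type), default=0)

def run_length_variance(seq, site_type=1):
    runs = run_lengths(seq, site_type)
    return pvariance(runs) if runs else 0.0

# Same runs of ones, placed close together versus spread far apart.
clustered = [1, 1, 0, 1, 1, 0, 1, 1] + [0] * 12
dispersed = [1, 1] + [0] * 6 + [1, 1] + [0] * 6 + [1, 1] + [0] * 2

for name, seq in (("clustered", clustered), ("dispersed", dispersed)):
    print(name, longest_run(seq), run_length_variance(seq))
```

Both sequences give the same longest run and the same variance, even though the runs in the first are packed together and the runs in the second are not.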
  • 14:33The most powerful approach that's been published out there,
  • 14:36aside from the ones we've been working on,
  • 14:38is the empirical cumulative distribution function
  • 14:41statistic; that's where you sort of go across the sequence
  • 14:43and just say, "oh, okay, we're accumulating ones here,
  • 14:47we're accumulating more."
  • 14:49And there's fortunately a number
  • 14:52of highly developed statistical approaches
  • 14:53to look at the empirical distribution and figure
  • 14:55out whether you see an increase beyond
  • 15:00what's expected during some period of that ECDF.
  • 15:03The power is better than either of the previous methods,
  • 15:05but it's still not very strong.
  • 15:07It's not clear that it includes all the
  • 15:08information that it should.
  • 15:10And it can be affected.
  • 15:12Research has shown that it can be affected
  • 15:14by the location of the cluster, which is not desirable.
  • 15:16So if you have a cluster on an end,
  • 15:18the ECDF will have less power
  • 15:21or more power compared to a cluster in the middle.
  • 15:23It's also challenging to interpret in the end,
  • 15:26for reasons I'm not gonna go into right away.
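
As a rough illustration of the ECDF idea, the sketch below computes one simple statistic: the largest gap between the running fraction of sites of interest and what a uniform spread would give. This is an assumed, simplified stand-in, not necessarily the specific ECDF statistic used in the published comparisons.

```python
def ecdf_max_deviation(seq, site_type=1):
    """Largest gap between the running fraction of sites of interest and a uniform spread."""
    n = len(seq)
    total = sum(1 for s in seq if s == site_type)
    if total == 0:
        return 0.0
    seen, worst = 0, 0.0
    for i, s in enumerate(seq, start=1):
        seen += (s == site_type)
        # Fraction of sites of interest accumulated so far, versus fraction of the sequence covered.
        worst = max(worst, abs(seen / total - i / n))
    return worst

print(ecdf_max_deviation([1, 0, 1, 0, 1, 0, 1, 0]))   # evenly spread toy sequence
print(ecdf_max_deviation([1, 1, 1, 1, 0, 0, 0, 0]))   # all ones at one end
```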
  • 15:29So what did we do?
  • 15:30What we did was develop a tripartite divide
  • 15:32and conquer approach to model variant sites
  • 15:35based on iterative sub clustering.
  • 15:37And I'll describe it in detail right now.
  • 15:39I'll just tell you the plus and the minus
  • 15:40of this approach at the beginning,
  • 15:42which is that it's sort of a bioinformatics approach,
  • 15:45or a bioinformatics statistician's approach,
  • 15:48in that it uses intensive computation
  • 15:50to solve the problem instead of giving
  • 15:52a strict analytical result.
  • 15:55And in fact, what it does is it just says,
  • 15:58Well, if we're interested in clustering in any case,
  • 16:00clusters should be represented by increases in
  • 16:03the probability within some cluster central region
  • 16:06compared to some side regions.
  • 16:08And if we define CS and CE to be anything
  • 16:11from the very beginning to the very end of the sequence,
  • 16:14it encompasses all possible single clusters
  • 16:17within a sequence.
  • 16:19So, for instance, if the cluster were on the far left
  • 16:22we can just define CS to be at zero,
  • 16:25the left-hand region is nothing, and the right-hand
  • 16:28area that has depressed variant-type intensity
  • 16:35would be the other category.
  • 16:38Anyway, so, what we can do is divide any sequence
  • 16:42into three sections, just count up the number
  • 16:44of site types in each one, estimate the maximum
  • 16:46likelihood probability for the site type
  • 16:50to be of the variant type of interest,
  • 16:52say it's glycine amino acids within a protein
  • 16:55or adenine nucleotides within a gene, whatever it is.
  • 17:00So then you can just come up with a null hypothesis,
  • 17:03which is the likelihood under the hypothesis
  • 17:06that these things are located at random
  • 17:09across the whole sequence.
  • 17:11And then an alternative hypothesis
  • 17:14that invokes a model which involves more parameters,
  • 17:18which then separates into a clustered
  • 17:21versus non-clustered state.
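
The null-versus-alternative comparison just described can be sketched as follows. This is a minimal illustration under stated assumptions: sites are coded 0/1, each of the three segments gets its own maximum-likelihood Bernoulli probability, and `cs`/`ce` follow the talk's cluster start and cluster end. The function names and the toy sequence are hypothetical, and the published method may parameterize the segments differently.

```python
import math

def bernoulli_loglik(ones, length):
    """Maximized Bernoulli log-likelihood of a segment containing `ones` sites of interest."""
    if length == 0 or ones == 0 or ones == length:
        return 0.0
    p = ones / length
    return ones * math.log(p) + (length - ones) * math.log(1 - p)

def best_single_cluster(seq):
    """Scan every (cs, ce) pair; return (log-likelihood, cs, ce) of the best three-way split."""
    n = len(seq)
    prefix = [0]
    for s in seq:
        prefix.append(prefix[-1] + s)   # prefix[i] = number of ones in seq[:i]
    best = (-math.inf, 0, n)
    for cs in range(n):
        for ce in range(cs + 1, n + 1):
            if cs == 0 and ce == n:
                continue                # this split is just the null model
            ll = (bernoulli_loglik(prefix[cs], cs)
                  + bernoulli_loglik(prefix[ce] - prefix[cs], ce - cs)
                  + bernoulli_loglik(prefix[n] - prefix[ce], n - ce))
            if ll > best[0]:
                best = (ll, cs, ce)
    return best

toy = [0] * 10 + [1, 1, 0, 1, 1, 1] + [0] * 10      # hypothetical sequence
null_ll = bernoulli_loglik(sum(toy), len(toy))       # one probability for the whole sequence
alt_ll, cs, ce = best_single_cluster(toy)
print("null:", null_ll, "alternative:", alt_ll, "cluster:", (cs, ce))
```

Because `cs` can be 0 and `ce` can be the sequence length, clusters at either end are covered by the same scan.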
  • 17:23So that would be fine if what we really
  • 17:25expected in a sequence was one cluster,
  • 17:27compared to nothing else,
  • 17:29compared to the sort of baseline rate of clustering,
  • 17:33sort of baseline rate of variant types.
  • 17:35But what we really want is an approach
  • 17:39that can take clustering at many, many levels.
  • 17:42So what if there's a cluster within the cluster
  • 17:43or a cluster within the left?
  • 17:45So what you can do is then take each
  • 17:46of these sub clusters you've identified and actually
  • 17:50do the same process on them looking for whether there's
  • 17:53a higher likelihood of the data given another cluster
  • 17:56somewhere within this sequence, et cetera, et cetera.
  • 17:59Now, this sort of dictates a procedure,
  • 18:04which is that you start, you input the sequence,
  • 18:07you start at, you know, the first at
  • 18:09the left and move all the way to the right,
  • 18:11essentially, you find the most likely cluster
  • 18:13among all the possible clusters.
  • 18:15If the cluster is statistically significant,
  • 18:17you then sub sequence each of those three parts,
  • 18:21the left-hand part, the center part
  • 18:24and the right hand part, find the most
  • 18:26likely clusters within each of them.
  • 18:27And proceed doing this until you reach a point
  • 18:30where you can no longer find any statistical evidence
  • 18:32that there is continued clustering within it.
  • 18:34And that's the point at which you stop.
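
The iterative procedure above is naturally recursive. The sketch below is an assumed, simplified version: it uses an AIC comparison as the stopping rule and counts a split as five parameters (three probabilities plus two boundaries). Both choices are assumptions made for illustration, and all names are hypothetical.

```python
import math

def loglik(ones, length):
    """Maximized Bernoulli log-likelihood of one segment."""
    if length == 0 or ones in (0, length):
        return 0.0
    p = ones / length
    return ones * math.log(p) + (length - ones) * math.log(1 - p)

def best_split(seq):
    """Best (log-likelihood, cs, ce) over all three-way splits of seq."""
    n = len(seq)
    prefix = [0]
    for s in seq:
        prefix.append(prefix[-1] + s)
    best = (-math.inf, 0, n)
    for cs in range(n):
        for ce in range(cs + 1, n + 1):
            if (cs, ce) == (0, n):
                continue
            ll = (loglik(prefix[cs], cs) + loglik(prefix[ce] - prefix[cs], ce - cs)
                  + loglik(prefix[n] - prefix[ce], n - ce))
            if ll > best[0]:
                best = (ll, cs, ce)
    return best

def subcluster(seq, offset=0):
    """Return (start, end, probability) blocks forming a piecewise-constant profile."""
    n = len(seq)
    if n == 0:
        return []
    ones = sum(seq)
    single = [(offset, offset + n, ones / n)]
    if n < 2:
        return single
    split_ll, cs, ce = best_split(seq)
    aic_single = 2 * 1 - 2 * loglik(ones, n)
    aic_split = 2 * 5 - 2 * split_ll          # assumed parameter count for a split
    if aic_split >= aic_single:
        return single                          # no support for further clustering: stop
    blocks = []
    for a, b in ((0, cs), (cs, ce), (ce, n)):  # recurse into left, center, right
        blocks += subcluster(seq[a:b], offset + a)
    return blocks

toy = [0] * 10 + [1, 1, 0, 1, 1, 1] + [0] * 10   # hypothetical sequence
blocks = subcluster(toy)
for block in blocks:
    print(block)
```

The result is a single step-shaped description of the sequence, which is the one "most likely model" referred to next.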
  • 18:36And then what you can do,
  • 18:37and this, I think, is sort of key, because
  • 18:39at the end of that, what you get is one discrete diagram,
  • 18:42kind of like that diagram I showed you initially,
  • 18:44where it proceeds flat, goes up,
  • 18:46proceeds flat goes down, et cetera.
  • 18:47I'll show you an example of that in a moment.
  • 18:50But what you really wanna do possibly,
  • 18:53right, what I think is really appealing about
  • 18:55this approach is that then you can take
  • 18:56that as one model, the most likely model and you can look
  • 18:59at all the other possible models
  • 19:00that you could have constructed.
  • 19:02And you can use AIC weighting to actually figure
  • 19:05out how much you should believe what is the weight
  • 19:11for every possible model.
  • 19:13And then you can average across those models
  • 19:14to give you a continuous description
  • 19:17of how much clustering you see across the sequence.
  • 19:18And again, the advantage that I mentioned
  • 19:20early on about this,
  • 19:22from my standpoint is I haven't put in anything
  • 19:24about how big a window how big a cluster,
  • 19:26I put in nothing about what I'm expecting
  • 19:28to see out of the sequence.
  • 19:30I'm just asking, what's the most likely description
  • 19:32of this given the AIC penalty for parameterization
  • 19:37and what the result gives me.
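
The weighting and averaging step can be sketched with Akaike weights, where each model's weight is proportional to exp(-ΔAIC/2). To keep the example small, the candidate set here is only the null model plus every single-cluster split, with assumed parameter counts of 1 and 5; the full method considers far more models, and all names below are hypothetical.

```python
import math

def loglik(ones, length):
    """Maximized Bernoulli log-likelihood of one segment."""
    if length == 0 or ones in (0, length):
        return 0.0
    p = ones / length
    return ones * math.log(p) + (length - ones) * math.log(1 - p)

def candidate_models(seq):
    """Yield (log-likelihood, n_params, per-site profile) for the null and each single-cluster split."""
    n, total = len(seq), sum(seq)
    yield loglik(total, n), 1, [total / n] * n
    for cs in range(n):
        for ce in range(cs + 1, n + 1):
            if (cs, ce) == (0, n):
                continue
            ll, profile = 0.0, []
            for a, b in ((0, cs), (cs, ce), (ce, n)):
                if b > a:
                    ones = sum(seq[a:b])
                    ll += loglik(ones, b - a)
                    profile += [ones / (b - a)] * (b - a)
            yield ll, 5, profile                  # assumed parameter count for a split

def model_averaged_profile(seq):
    """Average the per-site probabilities of all candidate models using Akaike weights."""
    models = list(candidate_models(seq))
    aics = [2 * k - 2 * ll for ll, k, _ in models]
    best = min(aics)
    raw = [math.exp(-(a - best) / 2) for a in aics]   # relative likelihood of each model
    norm = sum(raw)
    weights = [r / norm for r in raw]
    return [sum(w * m[2][i] for w, m in zip(weights, models)) for i in range(len(seq))]

toy = [0] * 10 + [1, 1, 0, 1, 1, 1] + [0] * 10   # hypothetical sequence
profile = model_averaged_profile(toy)
print([round(p, 3) for p in profile])
```

No window or cluster size is supplied anywhere; the smoother, more continuous profile comes entirely from weighting the competing models.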
  • 19:39So then we have a bunch of different weights
  • 19:41for all our different models.
  • 19:44And what it gives us is something like this.
  • 19:45So on the top, I've shown you the AIC model selection
  • 19:48which is the first thing I showed you
  • 19:49if I just took the most likely description
  • 19:51of this particular sequence.
  • 19:53It's not important what it is it's PRF
  • 19:55ADHD, which has been widely studied in evolutionary biology.
  • 19:59But if you take this model selection,
  • 20:02the most likely description
  • 20:05given that sub clustering looks something like this
  • 20:07where we have a region with fairly high concentration
  • 20:10of polymorphism, in this case, a valley,
  • 20:14a region at an intermediate level,
  • 20:16a point where we have a lot of polymorphism.
  • 20:19And then it moves and changes across the sequence.
  • 20:21Now, if you then instead take not just that one model,
  • 20:25but a series of models and do the AIC model average,
  • 20:28you get a much more continuous description across
  • 20:30the sequence of what the probability
  • 20:33of site types being different is.
  • 20:36And that enables us to ask a question
  • 20:37that's a little bit more interesting in many cases,
  • 20:41and I'll show you how it enables us to ask questions
  • 20:43about natural selection in a moment.
  • 20:45So in particular, it allows us to get an estimate,
  • 20:49you know of what the probability
  • 20:50is across the entire sequence.
  • 20:51Even though we don't have
  • 20:52observations within the central region
  • 20:54or this barren region here.
  • 20:56We can still estimate what the model-averaged
  • 21:00probability of a change appearing in different places
  • 21:02of this gene is, and that enables us
  • 21:05to ask questions that we otherwise could not.
  • 21:08All right, so that's an introduction to MACML.
  • 21:11I'll just mention, and I could give you more detail on this.
  • 21:14This is actually published work,
  • 21:16so you can find it.
  • 21:17But compared to the ECDF statistics,
  • 21:19that approach I just showed you has greater power
  • 21:21to detect heterogeneous clusters
  • 21:23it identifies clusters with greater accuracy and precision
  • 21:26based on the Kullback-Leibler divergence between
  • 21:28the actual distribution of the observed distribution,
  • 21:31sorry, the actual distribution
  • 21:34and the inferred distribution.
  • 21:36It has better power and accuracy across
  • 21:37different levels of clustering,
  • 21:38better power and accuracy across
  • 21:40different sequence lengths,
  • 21:41and better power and accuracy in finding
  • 21:43multiple clusters compared to a single cluster.
  • 21:45The disadvantage is, it's extraordinarily
  • 21:47computationally intensive, and it is prohibitively
  • 21:49so for very long sequences.
  • 21:51So for genes of very long length,
  • 21:53we can't actually run it on the full-length gene
  • 21:55and we have to do some more heuristic processes
  • 21:58to crunch those genes into smaller size.
  • 22:01Which we then can analyze and then build them up.
  • 22:03Again, I won't go into those at the moment.
  • 22:05But the point is that at certain lengths,
  • 22:07it gets just computationally too intensive to go
  • 22:09through all the possible models that could explain the data.
  • 22:13Now, I've talked about the maximum-likelihood averaging
  • 22:17to profile clustering of site types
  • 22:19across discrete linear sequences,
  • 22:21and introduced that methodology. Now I'm gonna talk about
  • 22:24how we can apply that methodology
  • 22:26to get us a better idea of which sites are under selection
  • 22:29using what's called a Poisson random fields approach.
  • 22:32And don't worry about that terminology.
  • 22:34You might know it from statistics,
  • 22:37it has to do with a particular observation
  • 22:40in molecular evolutionary biology,
  • 22:42which is why they're using it
  • 22:44and it's not really important for this talk,
  • 22:46why it's called that.
  • 22:48So let's go ahead and talk
  • 22:51about the model-averaged site selection
  • 22:53using Poisson random fields.
  • 22:54So first, I need to give you a little bit of background
  • 22:56in the evolutionary biology for those of you
  • 22:59who haven't had a lot of biology,
  • 23:00so you understand how this fits in with
  • 23:02what we tend to do another strategy.
  • 23:03Of course, evolutionary biologists
  • 23:05are often very interested in understanding
  • 23:06what things are under selection.
  • 23:07And in the context of this talk,
  • 23:09why is that important?
  • 23:10Well, we'd really like to know what things
  • 23:12are under selection in the COVID epidemic,
  • 23:14because we'd like to know what sites
  • 23:16are actually causing the COVID epidemic
  • 23:18to spread more or not, and what sites may have
  • 23:21been important in it prior to zoonosis,
  • 23:24and then, perhaps especially in the context of this talk,
  • 23:26what sites were selected during
  • 23:28that zoonotic process that made this virus perhaps able
  • 23:31to infect humans in the first place.
  • 23:33So what we're doing is,
  • 23:34so to give you an introduction,
  • 23:36I just wanna mention that there are sort of ways
  • 23:39to look at ancient times and understand
  • 23:40whether selection was happening.
  • 23:42And that's this approach
  • 23:44that looks at phylogenetic divergence,
  • 23:45looking at multiple sites and saying,
  • 23:47"Oh, we have a whole phylogeny
  • 23:49of how these organisms are related."
  • 23:51And then we have a bunch of sites that are for each taxon.
  • 23:55When we see sites like this, for instance,
  • 23:57that has an A, and then a couple of C's, and then a G
  • 24:00in another taxon, we know that this site changed twice
  • 24:03on that phylogeny, at least right?
  • 24:05So it probably changed from C ancestrally
  • 24:09to an A in this lineage and to a G
  • 24:11in this lineage independently.
  • 24:13And so the fact that it changed twice means
  • 24:16that it's got an elevated rate of change.
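
The "changed at least twice" reasoning is a parsimony count. The sketch below uses Fitch's small-parsimony algorithm on an assumed four-taxon tree; the tree shape and tip states are hypothetical stand-ins for the slide, chosen so that the tips carry an A, two C's, and a G.

```python
def fitch(tree):
    """Return (candidate ancestral states, minimum number of changes) for a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}, 0                       # a tip: its observed nucleotide, no changes
    left_states, left_changes = fitch(tree[0])
    right_states, right_changes = fitch(tree[1])
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    # No state in common: at least one change is needed at this node.
    return left_states | right_states, left_changes + right_changes + 1

tree = (("A", "C"), ("C", "G"))                # hypothetical four-taxon tree
states, changes = fitch(tree)
print("ancestral state candidates:", states, "minimum changes:", changes)
```

On this assumed tree the count is two changes with C as the ancestral state, which matches the reading that the site moved from C to A on one lineage and from C to G on another.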
  • 24:18And that elevated rate of change is an indication
  • 24:20that there's been positive selection for change.
  • 24:22It's especially likely in sort of pathogen-host
  • 24:25interactions that high rates of change are
  • 24:28because pathogens are changing in order
  • 24:30to not be recognizable by their hosts.
  • 24:33And often the host has recognition proteins
  • 24:35that are changing to still recognize the pathogen,
  • 24:36even as the pathogen is changing.
  • 24:38So these high rates of evolution
  • 24:40are very strong indicators of selection
  • 24:42in host pathogen situations.
  • 24:45So this is one way to study a natural selection.
  • 24:48It does depend, though, on having a lot of data going back
  • 24:52in time because you're actually reliant on these changes
  • 24:55occurring in multiple places on multiple lineages.
  • 24:58Now, at a more recent level, and I'm going to go back
  • 25:02to the middle in a moment.
  • 25:05But at a very recent time, you may have
  • 25:07heard of selective sweep detection,
  • 25:08a couple of methods people use are Tajima's D
  • 25:11or iHS; there's a bunch of other methods that are out now.
  • 25:14And the idea there is to look at polymorphism.
  • 25:16And if you look at an individual, before selection,
  • 25:20this is sort of just an idea diagram,
  • 25:22not what you look at.
  • 25:23But so if you look at an individual who has a variant,
  • 25:26and what you see in a population is that
  • 25:30one individual with a variant, a variant that's important,
  • 25:33has somehow swept across the population.
  • 25:35So if you see this would be before selection,
  • 25:37there's a lot of variation at a particular locus
  • 25:39in the genome after selection,
  • 25:41that one individuals variant which contributed
  • 25:44to the reproductive fitness would then imply
  • 25:46that they would spread across the population.
  • 25:50And if they spread across the population,
  • 25:52then the genetic variants that were present
  • 25:54in that original individual spread across
  • 25:56the population as well along with this selected site,
  • 26:00and so you can look for this kind of partial or complete sweep
  • 26:04as a sign that selection is going on. Neither
  • 26:07of the approaches that I just talked about
  • 26:09is the approach that I'm using today.
  • 26:10So I just wanted to introduce those,
  • 26:12so you knew those are different.
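Editor's illustration of the polymorphism-based statistic named above: a minimal sketch of Tajima's D computed from aligned sequences, using the standard constants from Tajima (1989). The toy alignments and the function name `tajimas_d` are assumptions for this example and are not data from the talk.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (needs n >= 4)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences for a defined variance")
    length = len(seqs[0])
    # S: number of segregating (variable) sites.
    s = sum(1 for i in range(length) if len({q[i] for q in seqs}) > 1)
    if s == 0:
        return 0.0
    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: only rare (singleton) variants, as expected after a sweep.
after_sweep = ["TAAA", "ATAA", "AATA", "AAAT", "AAAA"]
# Toy data: two deeply divergent haplotypes at intermediate frequency.
balanced = ["AAAA", "AAAA", "TTTT", "TTTT"]
print(tajimas_d(after_sweep), tajimas_d(balanced))
```

An excess of rare variants gives a negative D, which is the pattern sweep scans look for; variants at intermediate frequency give a positive D.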
  • 26:13And they're different because we're looking
  • 26:15at a more intermediate timescale.
  • 26:16The sweep detection is purely
  • 26:19dependent on polymorphism in the population,
  • 26:21like what's happening in a population right now.
  • 26:24The phylogenetic divergence is purely dependent
  • 26:26on these ancient changes that you get from a phylogeny,
  • 26:28understanding how different species are related
  • 26:31to each other. At an intermediate level,
  • 26:33our methods use both the polymorphism
  • 26:35and the divergence.
  • 26:37And the idea here in the McDonald-Kreitman approach,
  • 26:40and the MASS-PRF approach I'm going to tell you
  • 26:42about is that the polymorphism what you see generally
  • 26:46in the population is sort of consistent with this.
  • 26:48Sorry, if I go back to this slide.
  • 26:51With this before selection, you know,
  • 26:53all of these blue sites are assumed
  • 26:55to not be under selection,
  • 26:57and generally what we believe in evolutionary biology,
  • 26:59because of empirical data that validates it,
  • 27:02is that most sites that you find varying in populations
  • 27:05are not under strong selection.
  • 27:07If they were under strong positive selection,
  • 27:08they would probably fix, and everyone would have them.
  • 27:11And if they were under negative selection,
  • 27:13they wouldn't rise to a high frequency.
  • 27:14So generally speaking, the sites where you actually see
  • 27:17differences between us in our genetics
  • 27:18typically are not affecting anything.
  • 27:20Of course, we spend in our...
  • 27:23In the media, you only hear about the changes
  • 27:24that actually affect things.
  • 27:25And that's because those are important to us,
  • 27:26the ones that don't change anything
  • 27:28we don't really care about.
  • 27:29So nobody talks about that much.
  • 27:30But most of the changes within population or differences
  • 27:33within population don't have much material effect.
  • 27:35So under that hypothesis,
  • 27:37then when you look at polymorphism,
  • 27:39most polymorphism is just an indication
  • 27:41of the underlying mutation rate,
  • 27:43some mutation happened that didn't have any effect.
  • 27:45It's drifting up and down in the population.
  • 27:47And so the advantage of that is if you know
  • 27:50that polymorphism is a signature
  • 27:52of just random mutation, it gives us an estimate
  • 27:54of the underlying mutation rate, which we can then compare
  • 27:57to the divergence and using that comparison,
  • 28:00we can understand how organisms are related.
  • 28:02So whether organisms are under selection
  • 28:05or not. If the divergence is high compared
  • 28:07to the polymorphism, that indicates a lot of selection.
  • 28:09That means (indistinct chatter)
  • 28:12in the timescale of the analysis you're doing,
  • 28:14we have a lot of change in the population.
  • 28:17If, on the other hand, you have a lot of polymorphism
  • 28:20and not that much divergence, then that indicates
  • 28:22you've got a lot of change going on,
  • 28:23but it's not actually being directionally
  • 28:26selected because the divergence is much lower.
  • 28:27So how does that test work in practice?
  • 28:30Well, just to step back for one moment,
  • 28:32so we're gonna apply that kind of test.
  • 28:35In this talk I'm applying that test
  • 28:36to the emergence of COVID-19.
  • 28:39I'm actually applying it also to SARS, which is fairly
  • 28:44closely related, SARS coronavirus 1,
  • 28:46because we have similar data and can apply
  • 28:48the same test in the same way to that data set.
  • 28:52And we're using in addition the SARS-like
  • 28:55coronaviruses in a sample that has been sequenced,
  • 28:58basically collected from bats
  • 29:00over the past 20 years or so.
  • 29:02So what you can see here is a phylogeny,
  • 29:05which includes the COVID-19 epidemic ongoing now in humans,
  • 29:09the SARS epidemic, which caused some 400 deaths
  • 29:13or so back in the early 2000s.
  • 29:18And what we're doing is analyzing both and looking at,
  • 29:21in particular, the very short internode here
  • 29:25between the most closely related non-human infections
  • 29:31and the human infection set that we can see.
  • 29:33And this internode here, also,
  • 29:36between these non human infections and the human
  • 29:39infections we can see here, because the changes
  • 29:42that may have enabled, we don't know,
  • 29:45there may be no changes that enabled it,
  • 29:47maybe this virus throughout
  • 29:49its entire history could have infected humans,
  • 29:51but it just never managed to or never did.
  • 29:53But if there are changes that are unique to this virus
  • 29:56that happened during zoonosis, enabling it to infect us,
  • 29:59they happened on this lineage,
  • 30:00and so we're interested in seeing what those changes are.
  • 30:04And so that's what we're gonna do is we're gonna run
  • 30:06this polymorphism and divergence approach on this lineage.
  • 30:10And what I just want to make (indistinct chatter)
  • 30:13clear to you is the reason
  • 30:14why the polymorphism divergence approach is important is
  • 30:18the phylogenetic approach, the ancient approach
  • 30:20relies on a large clade of data, which we don't have
  • 30:22for that particular lineage here,
  • 30:24we just have the human infection,
  • 30:26which is no longer zoonotic.
  • 30:26And we have this one lineage.
  • 30:28And so what we can do is ancestrally reconstruct
  • 30:30the ancestor of this lineage, which is right here,
  • 30:33actually on the phylogeny,
  • 30:34and also the ancestor right here,
  • 30:37and then use MASS-PRF, this approach that's based
  • 30:40on polymorphism and divergence, which I'll explain to you,
  • 30:43on the divergence between that ancestor
  • 30:46and the first ancestor of all the human infections.
  • 30:48And we can take that as the near zoonosis time
  • 30:51and figure out what mutations might
  • 30:53have happened during that time.
  • 30:54All right, so we're gonna do that in both
  • 30:56the COVID-19 and SARS cases.
  • 30:59Now, how does this work in principle?
  • 31:02Well, there's an old approach,
  • 31:03which is not what we're using.
  • 31:05But I have to compare it to in order to
  • 31:06sort of reference it in terms of the literature.
  • 31:09And that is that when you assume
  • 31:11that polymorphism is neutral,
  • 31:13we expect a different proportion of replacement
  • 31:16to synonymous divergence compared to replacement
  • 31:18to synonymous polymorphism in a gene.
  • 31:21So it's just a two by two table here, again,
  • 31:23very simple statistics, where we look at
  • 31:25the number of replacement sites that are divergent and
  • 31:28the number of synonymous sites. Replacement,
  • 31:30again, is when an amino acid change
  • 31:32occurs in a DNA sequence.
  • 31:33DNA sequence changes can either change the amino acid
  • 31:35or not, depending on what the sequence of the codon,
  • 31:39the three-base-pair codon in the DNA sequence, is.
  • 31:42So if there's a replacement, we tally it here,
  • 31:44if it's a synonymous change, that doesn't change the amino
  • 31:46acid, we tally it here.
  • 31:48Synonymous changes are presumably neutral because
  • 31:50they don't change anything about your protein.
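Editor's illustration of the replacement versus synonymous distinction above: a minimal sketch that classifies a codon change using the standard genetic code. The helper name `classify_change` and the example codons are assumptions for this example; the talk does not describe its own implementation.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_change(codon_before, codon_after):
    """Label a codon change as 'synonymous' or 'replacement'."""
    aa_before = CODON_TABLE[codon_before.upper()]
    aa_after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if aa_before == aa_after else "replacement"


# CTT and CTC both encode leucine; GAA (Glu) to GTA (Val) changes the protein.
print(classify_change("CTT", "CTC"))
print(classify_change("GAA", "GTA"))
```

Each classified change would then be tallied into one cell of the two-by-two table, depending on whether it is a fixed difference (divergence) or still varying within the population (polymorphism).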
  • 31:52And then if it's a polymorphic replacement,
  • 31:56then we see it here.
  • 31:57And if it's a synonymous polymorphism we see it here.
  • 31:59So under the hypothesis that I mentioned,
  • 32:01all three of these cells should
  • 32:04be sort of changing in exactly the same way,
  • 32:06because polymorphic sites, whether they're replacement
  • 32:09or synonymous, we're assuming are neutral;
  • 32:11synonymous sites, whether they're divergent
  • 32:12or polymorphic, we're assuming are neutral.
  • 32:15The only category that under that
  • 32:17assumption is not neutral is the replacement
  • 32:19changes at replacement divergence sites.
  • 32:22So, if this replacement divergence, if the marginals
  • 32:25add up so that this replacement divergence is sort of in
  • 32:29line with all these others, then we assume nothing important
  • 32:30is happening in that gene, it's probably not selected,
  • 32:33it's just neutral changes that are happening there.
  • 32:35If this divergence is higher, though,
  • 32:38then we might conclude that it's under
  • 32:39selection for changes at a rapid pace.
  • 32:41So neutrality yields a DN/DS that's equal
  • 32:44to the PN/PS; positive selection means
  • 32:46that the DN/DS is greater than the PN/PS; and negative
  • 32:50selection, where changes are actually being selected against
  • 32:53at a high level, indicates the DN/DS
  • 32:56is gonna be less than PN/PS.
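Editor's illustration of the two-by-two comparison above: a minimal sketch of a McDonald-Kreitman style test using a two-sided Fisher exact test written with the standard library only. The counts are hypothetical, chosen to resemble a DN/DS of about eight to one against an even PN/PS; the function names are assumptions, not the speaker's code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    return sum(
        table_prob(x)
        for x in range(lowest, highest + 1)
        if table_prob(x) <= observed * (1 + 1e-9)
    )


def mk_test(dn, ds, pn, ps):
    """Compare DN/DS with PN/PS; rows are divergence and polymorphism."""
    p_value = fisher_exact_two_sided(dn, ds, pn, ps)
    # Cross-products avoid dividing by a zero count.
    if dn * ps > ds * pn:
        direction = "positive"   # DN/DS > PN/PS
    elif dn * ps < ds * pn:
        direction = "negative"   # DN/DS < PN/PS
    else:
        direction = "neutral"    # DN/DS == PN/PS
    return direction, p_value


# Hypothetical tallies: DN=24, DS=3 (8 to 1) versus PN=4, PS=4 (even).
direction, p_value = mk_test(24, 3, 4, 4)
print(direction, round(p_value, 4))
```

The test reports only whether the gene-wide ratios differ; it does not estimate how strong selection is, which is what the Poisson random field treatment adds.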
  • 32:59All right, now let's get to a little bit of the
  • 33:01complexity of this thing that I mentioned that's called
  • 33:04Poisson random field theory, which quantitatively estimates
  • 33:05gene-wide selection intensity.
  • 33:09So what you can do is take that
  • 33:12same two by two table, and you can say under a model of
  • 33:14selection, what do we actually think is happening here.
  • 33:18And that gives us the ability to estimate the selection
  • 33:20coefficient, which is basically the rate at which that
  • 33:22change allows the virus to increase its reproductive ability
  • 33:25or survival ability in the host.
  • 33:27And that is this gamma term right here
  • 33:32in these terms. These look complicated,
  • 33:34but essentially, these formulas are just saying
  • 33:36that the expectation for a synonymous... sorry,
  • 33:39the synonymous and replacement have reversed
  • 33:41on this chart compared to the last,
  • 33:43so don't be confused by that.
  • 33:45But the expectation under synonymous
  • 33:45changes is essentially the mutation rate.
  • 33:48And these terms are just about the sampling properties
  • 33:50of when you sequence how many of these things you get,
  • 33:52I don't need to go into the detail about that here.
  • 33:55Similarly, the polymorphic sequence
  • 33:57is just basically dependent on the mutation rate.
  • 34:00However, the replacement terms are a little bit more
  • 34:02complicated in that they have to account
  • 34:07for kinds of selection that may be going on.
  • 34:11For reasons that I don't wanna get into,
  • 34:12with the polymorphic selection, both of them depend
  • 34:16on the mutation rate for replacement sites,
  • 34:18and both of them depend on
  • 34:20how much each variant is selected.
  • 34:23Selection does impact the polymorphism
  • 34:25to a certain degree in the sense that if variants
  • 34:27are moving through the population very fast,
  • 34:30that can change how much polymorphism you see.
  • 34:32But then if you use these sampling formulas, and the formula
  • 34:36for the estimate of the strength of selection,
  • 34:38given how many variants we see changing,
  • 34:41you get these formulas for how much replacement
  • 34:44divergence and polymorphism you expect to see.
  • 34:47So this is population genetics that was worked
  • 34:49out by Stan Sawyer and Dan Hartl in 1992.
  • 34:52The only change I'm making in MASS-PRF is,
  • 34:56instead of using the tallies of how many variants
  • 35:00you see, the way the McDonald-Kreitman test uses them,
  • 35:04I'm taking the probabilities of replacement divergence
  • 35:08and the probabilities of polymorphism
  • 35:11and putting them in here.
  • 35:12And the advantage here is that what
  • 35:13I can do with that is what I mentioned earlier:
  • 35:15I can go back to the MACML
  • 35:18approach, the sequence clustering approach
  • 35:20that I mentioned before. Estimating those probabilities
  • 35:25across the entire gene, I can then estimate selection across
  • 35:27the entire gene by using these probabilities at a single site;
  • 35:30I don't have tallies of changes for a single site.
  • 35:32So what this allows
  • 35:34us to do is estimate this gamma, maximizing the likelihood of what
  • 35:38gamma is given the probabilities that we see.
  • 35:42So this is a very complex diagram of how this all works;
  • 35:46again, it is a pretty elaborate method of computation.
  • 35:50But again, it has the nice property that I'm not
  • 35:53using assumptions
  • 35:55and not putting in any parameters
  • 35:56that go in.
  • 35:58I just take the polymorphism and divergence and analyze
  • 36:01whether sites are clustered in four different categories.
  • 36:04Again, replacement polymorphism,
  • 36:06that's this arc here,
  • 36:07synonymous polymorphism, synonymous divergence, and replacement divergence;
  • 36:12we cluster within all four of those categories.
  • 36:15We calculate the model-averaged probability across
  • 36:17all those clusters and merge the data together.
  • 36:20I'm not going to go through the details.
  • 36:22But if you were to do essentially the MACML-like
  • 36:25clustering on those four categories
  • 36:27for a particular gene, replacement polymorphism,
  • 36:30synonymous polymorphism, synonymous divergence, and replacement divergence,
  • 36:33and if you plug those into the formulas I showed you before,
  • 36:37you're basically plugging into these categories, and
  • 36:39you can estimate those formulas.
  • 36:41And in the end, what you get is
  • 36:42an estimate of gamma across nucleotide positions in a gene.
  • 36:49I won't go into this result here;
  • 36:51it's an interesting result for reasons
  • 36:54that are of interest mostly to evolutionary
  • 36:56biologists, but you can see here in this particular gene
  • 36:58that there's a lot of variation in the selection
  • 37:02intensity across the gene.
  • 37:04Now, that is actually really
  • 37:06consistent with what we'd expect
  • 37:08from a sort of basic biology standpoint.
  • 37:11Different parts of a gene are gonna either
  • 37:13be very strongly selected to stay the same
  • 37:15or they're gonna change; you shouldn't really expect
  • 37:18that all parts of a gene are equally likely to change.
  • 37:20And this gives a very nice diagram
  • 37:22that allows you to understand how
  • 37:23it's different across the gene.
  • 37:25So if we compare this kind of approach
  • 37:27to the McDonald-Kreitman test, which again
  • 37:30is just putting the DN/DS and PN/PS values
  • 37:33into this two-by-two table,
  • 37:36and I went through that, the important difference is that
  • 37:39the MK test assumes this intragenic homogeneous selection,
  • 37:42that in fact a gene has the same selection
  • 37:44across the entire sequence.
  • 37:46The problem with that is if you have one small
  • 37:48region that's under selection,
  • 37:50the averaging out process across that entire gene
  • 37:53can mean that you don't detect the selection there,
  • 37:54even though it may be very strong for that small region.
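Editor's illustration of the averaging problem above: a minimal sketch with hypothetical counts for three regions of one gene, showing how a strong signal in a small central region is diluted when tallies are pooled gene-wide. This is not MASS-PRF itself, which clusters sites and averages over models; it only compares a pooled ratio with per-region ratios, and every name and number here is an assumption.

```python
def mk_ratio(dn, ds, pn, ps):
    """(DN/DS) / (PN/PS); values well above 1 suggest positive selection."""
    return (dn * ps) / (ds * pn)


# Hypothetical (DN, DS, PN, PS) tallies for three regions of one gene.
regions = {
    "flank_5prime": (1, 10, 3, 10),
    "central": (9, 2, 1, 3),
    "flank_3prime": (1, 10, 3, 10),
}

# Pool the four tallies across regions, as a gene-wide MK table would.
gene_wide = tuple(sum(column) for column in zip(*regions.values()))
whole_gene_ratio = mk_ratio(*gene_wide)
per_region = {name: mk_ratio(*counts) for name, counts in regions.items()}

print(gene_wide, round(whole_gene_ratio, 2))
print({name: round(value, 2) for name, value in per_region.items()})
```

With these made-up numbers the central region has a ratio of 13.5 while the pooled gene sits near 1.6, which is the kind of dilution a spatially resolved estimate of gamma is meant to avoid.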
  • 37:57And so the hope is that MASS-PRF can
  • 38:01identify those regions much better
  • 38:02than MK, for instance, would.
  • 38:04And in fact, I went through this already.
  • 38:09I'll just skip past this because I went through it already.
  • 38:13And it does do that.
  • 38:18So this is an example of the McDonald-Kreitman
  • 38:21test here applied to a Drosophila gene.
  • 38:23What you see is a high level
  • 38:27of replacement divergence, which turns out
  • 38:30to indicate high selection.
  • 38:33And you can see here that the DN/DS ratio
  • 38:35is about eight to one, whereas the PN/PS ratio
  • 38:38is almost even.
  • 38:40So this is a gene that's under very strong selection
  • 38:42based on the McDonald-Kreitman test.
  • 38:45Now, interestingly, so this one works
  • 38:47with a homogeneity assumption.
  • 38:49And then if you analyze the Acp26Aa gene
  • 38:55and look for the probability of all four categories.
  • 38:58These are the four categories and of course,
  • 39:01the replacement divergence here is the one
  • 39:04that's most likely to drive selection.
  • 39:06What do you get when you estimate gamma using this?
  • 39:09Well, interestingly, what you see is not something
  • 39:10that's under very strong selection across the entire gene,
  • 39:13but something that's under moderately strong selection,
  • 39:15basically in the second half of the gene,
  • 39:17and then one peak of very strong
  • 39:19selection right around the middle of the gene.
  • 39:21And this visible peak occurs because
  • 39:23of a number of changes that occur
  • 39:26in one particular domain of the gene here.
  • 39:28Now, if you look at just the replacement divergence,
  • 39:30you wouldn't be able to figure this out.
  • 39:32Because you see there are other
  • 39:34peaks along here.
  • 39:35Those don't turn out to be so important.
  • 39:36And the reason why they don't turn out to be so important
  • 39:39is that the synonymous divergence, synonymous polymorphism, and
  • 39:41replacement polymorphism
  • 39:42tell us more about the underlying mutation rate,
  • 39:44which says those elevations probably have
  • 39:47something to do with mutation rate, and not necessarily
  • 39:49to do with added divergence.
  • 39:52You can sort of see this elevation
  • 39:54on the right hand side over here compared
  • 39:56to the small dip right here and up here
  • 39:59and the way it all works out mathematically
  • 40:02is we can really see that there's strong selection here.
  • 40:04We can also get what I call model intervals for this.
  • 40:06If you look across all the models,
  • 40:08what are the estimates of selection?
  • 40:11What we get is the 95% model interval for this.
  • 40:14And that's what these very faint gray lines that you
  • 40:17may be able to see are. Those allow us to detect whether
  • 40:19these are significant, or at least
  • 40:22statistically significant, differences in selection.
  • 40:24All right, I'm gonna skip through this
  • 40:27just because I don't want to spend the time,
  • 40:29but the point is, you can do this for other genes,
  • 40:29and it shows similar results that allow us
  • 40:32to understand where sites are under selection in that gene.
  • 40:34I'll just cover a few more examples
  • 40:37of how we've used this to give you an idea
  • 40:39of what it can look like in a comparison between humans
  • 40:42and chimpanzees where we've run this just to understand
  • 40:44how we've diverged from chimpanzees.
  • 40:47We see a bunch of different examples here.
  • 40:50Again, doing a little bit of comparison to
  • 40:52that traditional McDonald-Kreitman test
  • 40:54and the MASS-PRF test.
  • 40:56Here you see a gene that is statistically significant
  • 41:00from the MK test's point of view,
  • 41:01based on the four categories,
  • 41:04of the four tallies of which are indicated here.
  • 41:07Here's the MASS-PRF profile, and it shows us again
  • 41:10a particular region within this SLC AA
  • 41:12one gene that is under selection.
  • 41:14There are interesting stories behind all of these,
  • 41:17but I'm not gonna take the time to go through them.
  • 41:19Here's another example, and this is an example
  • 41:22where the McDonald-Kreitman test
  • 41:23comes out as not significant.
  • 41:25There's just not that much divergence
  • 41:26compared to the other categories.
  • 41:28But if you do this, spatially with the MASS-PRF test,
  • 41:32you actually see that a very central region there
  • 41:34has very strong selection, and then the rest of the gene
  • 41:37is under almost zero selection or almost no selection.
  • 41:41So this is an example I talked about,
  • 41:43where you could have some very small portion
  • 41:45of the gene under very strong selection.
  • 41:47And the McDonald-Kreitman test wouldn't detect it
  • 41:49because it's averaging over the entire gene.
  • 41:51Similarly, you'll get some genes.
  • 41:52Oops, I didn't mean to do that.
  • 41:54Some genes, here's M gamma over here, where there's a...
  • 41:58Well, let me turn to that one last.
  • 41:59Actually, let me look at TPH first.
  • 42:02There's no statistically significant selection according to the MK test.
  • 42:06And in fact, in our MASS-PRF,
  • 42:08there's no significant selection either;
  • 42:09the error bars are entirely overlapping zero here,
  • 42:12which indicates no selection.
  • 42:15Lastly, here's M gamma.
  • 42:16This is one of the very few examples
  • 42:18we were able to find where the McDonald-Kreitman test did detect
  • 42:21selection where MASS-PRF didn't.
  • 42:24As you can see, there's quite high tallies here,
  • 42:26which means there's a lot of power
  • 42:27to detect selection if it's there,
  • 42:28but it's probably not very strong,
  • 42:30because the numbers are not all that different
  • 42:32from each other.
  • 42:34And McDonald-Kreitman says it's statistically significant.
  • 42:36Now the reason why McDonald-Kreitman is telling us
  • 42:40it's statistically significant, compared to MASS-PRF,
  • 42:41is that actually, I didn't explain this in detail to you,
  • 42:44but McDonald-Kreitman doesn't actually assume
  • 42:47that there's an elevation of rate here.
  • 42:48And so the significance here is actually driven by
  • 42:51the high polymorphic replacement level.
  • 42:53So there's a lot of polymorphic replacements in there.
  • 42:56And what that means is there's some other
  • 43:00kind of selection that isn't a directional selection.
  • 43:01I won't go into the details there.
  • 43:02But the nice thing is that in the examples
  • 43:04where we find that McDonald-Kreitman is statistically
  • 43:07significant and MASS-PRF isn't, they are examples
  • 43:10where in fact MASS-PRF is not designed to detect
  • 43:12that kind of selection and the MK test is.
  • 43:15In general MASS-PRF turned out to be significant
  • 43:18in almost every case where MK tests were not.
  • 43:21Okay, so how can we apply this
  • 43:24to instances like COVID-19, the point of this whole talk?
  • 43:27I'm just gonna give you one example first
  • 43:30to justify why we think it's a good idea,
  • 43:32because we don't have results on doing it,
  • 43:34at least not many results on doing it to COVID-19
  • 43:36yet, and that is that we applied this to influenza before,
  • 43:39which has some similarities to COVID-19, as everyone knows
  • 43:43and in influenza, again, we're interested in looking across
  • 43:46the gene are there sites that are under selection
  • 43:48because those sites that are under selection
  • 43:50are candidates where we need to be aware that
  • 43:53in fact, vaccines need like for every year they design
  • 43:58a new influenza vaccine, right?
  • 43:58And what they're trying to do is accommodate
  • 44:00the fact that these changes occur on the sites
  • 44:03that are actually susceptible
  • 44:04to your immune system recognizing the influenza virus.
  • 44:08So we need to understand those sites that are changing
  • 44:11and where they are in order to design
  • 44:13more universal vaccines that maybe could target sites
  • 44:16that won't change rapidly because they can't change
  • 44:19because they're structurally constrained in the virus.
  • 44:22So what we did was apply this MASS-PRF approach
  • 44:25to influenza, similarly on a phylogeny
  • 44:29to what I described for coronavirus.
  • 44:30I don't have the phylogeny in the slide set,
  • 44:33but the point is just looking at the ancestral influenza
  • 44:36and its divergent sites within a particular region.
  • 44:40And what we were able to do is identify a set of sites
  • 44:43that are under selection using MASS-PRF
  • 44:46that are beyond what people had hypothesized
  • 44:48as positive selection sites in the past.
  • 44:50So there's a paper by Westgeest et al. (2012),
  • 44:53which is essentially the gold standard for this,
  • 44:55and they found a bunch of sites that are all
  • 44:58these circled sites in gray. MASS-PRF
  • 45:00also found those; the orange diagram here
  • 45:03is the MASS-PRF profile for this gene.
  • 45:09And it also identified other sites
  • 45:10that are under selection as well.
  • 45:14And we're in the process of understanding
  • 45:16better how those can be validated.
  • 45:17But the ultimate point is that
  • 45:20these are important selected sites that may be relevant
  • 45:25to the design of vaccines for influenza.
  • 45:28So similarly, we'd like to illuminate
  • 45:31which sites might be changing rapidly
  • 45:34and under positive selection in coronavirus,
  • 45:37not only during the human epidemic,
  • 45:39but again during the zoonotic time period.
  • 45:41And so now we're finally coming to the final
  • 45:43part of my talk, which is what we're doing
  • 45:46in terms of the model-averaged estimation of molecular evolution
  • 45:48and natural selection in SARS coronavirus 1
  • 45:51and SARS coronavirus 2
  • 45:53coronaviruses during zoonosis.
  • 45:53But the whole point here is really to
  • 45:56explain to you what I've done, because the results I have,
  • 45:57as I said, are just a few plots of some of the
  • 46:01strongest selection we were able to detect,
  • 46:03because we have to process through a lot more data
  • 46:05before we get a more in-depth look at the lesser
  • 46:07selected sites that are on these genes.
  • 46:10And so we looked at this for coronavirus.
  • 46:13This is just a Coronavirus, Getty image that Yale
  • 46:17has used looking at Coronavirus.
  • 46:21And again, as I mentioned,
  • 46:23we're looking at these two sites of where COVID-19
  • 46:26emergence occurred, and where SARS emergence occurred.
  • 46:30And the question is, are there changes
  • 46:33that happen there that are specifically
  • 46:34responsible perhaps for those zoonoses? And the only results
  • 46:38I have are just a few results, again highlighting some of
  • 46:40the strongest selection we saw.
  • 46:42This is actually a diagram of the spike
  • 46:44protein which if you've heard much about COVID-19
  • 46:47molecular biology, you probably have heard about the spike
  • 46:50protein; it's what sticks out from the virus.
  • 46:52It's what grabs onto the ACE2 receptor,
  • 46:56and essentially is what most vaccines
  • 46:58that one might design for the virus would target.
  • 47:01And the point is that the receptor binding
  • 47:04domain, which has gotten a lot of press already, turns out
  • 47:07to have the selected sites.
  • 47:08You can see them here, here, here and here.
  • 47:12These are sites that are selected,
  • 47:13meaning they're changing rapidly
  • 47:13during the pre-zoonotic phase.
  • 47:17So these are sites that are changing, not in humans,
  • 47:20but in the bats and the pangolins
  • 47:22and whatever other animals that this virus
  • 47:25is spreading among, or has been spreading among
  • 47:27before the zoonosis to humans.
  • 47:29So then the question is, are similar sites under
  • 47:30selection during zoonosis?
  • 47:31And during post zoonosis?
  • 47:36And the answer right now is yes,
  • 47:38it seems kind of similar,
  • 47:39although we don't get the same sites.
  • 47:40So we have to do a little bit
  • 47:42more molecular, you know, staring at this and understanding
  • 47:44it because these results are literally
  • 47:46I got these results today, actually.
  • 47:48So we have to sort of do more of this
  • 47:51and we can actually look in more depth
  • 47:54and get more sites with other approaches
  • 47:55that we haven't implemented at this moment.
  • 47:57But during near zoonosis what you see is again,
  • 47:58the selected sites which are in bright red
  • 48:06are also on the visible side
  • 48:08of the receptor binding domain
  • 48:13of the spike protein, which is the tip,
  • 48:17the outside portion of this gene.
  • 48:23Lastly, if we look post-zoonosis, that's in
  • 48:24the evolution in humans, we again see that
  • 48:26the selected sites are sites that are at this tip region.
  • 48:33Again, none of this is terribly surprising.
  • 48:35The interesting thing is that it kind of indicates
  • 48:36that the zoonosis... it kind of indicates consistency.
  • 48:38Again, there's a lot more to do before
  • 48:40we can conclude anything like this,
  • 48:42but the idea we have right now indicates
  • 48:44a good deal of consistency between the selection
  • 48:46that's ongoing in humans, during zoonosis, and pre-zoonosis.
  • 48:51And what that implies is that this may
  • 48:54well have been as I said, very briefly,
  • 48:56during this talk an instance where there's a virus
  • 49:00just circulating around in bats and pangolins
  • 49:01that could have caused this disease at any time,
  • 49:04it's just a matter of whether or not we actually
  • 49:07have exposure to those organisms
  • 49:11that allows the transmission to happen.
  • 49:14Consistent with this, I'll just mention
  • 49:17a couple like verbal points,
  • 49:18which is that all the evidence that we have indicates
  • 49:20that this virus spread extremely quickly
  • 49:23from the moment of its zoonosis into humans.
  • 49:26And in fact, in most cases of zoonosis,
  • 49:28we find that that's true,
  • 49:31which is somewhat counterintuitive.
  • 49:33Obviously, it hasn't adapted to humans,
  • 49:34it has adapted to the mammalian immune system.
  • 49:37And so to the extent that our immune system is not
  • 49:39tremendously different from that of bats or pangolins,
  • 49:41it may be not surprising that it can infect us.
  • 49:44But one of the things that is true is that
  • 49:47if it did not spread very quickly,
  • 49:48very easily from the very moment it transmitted to someone,
  • 49:51it would probably lead to a dead end.
  • 49:52In other words, if you don't have
  • 49:55an ability to transmit and spread just from the get go,
  • 49:57the first person who gets infected
  • 50:00is very unlikely to transmit it to someone else.
  • 50:02So it sort of has to be well pre adapted
  • 50:04for a zoonotic event to actually spread in humans.
  • 50:07Now, we would need more zoonotic events,
  • 50:11God forbid that actually happens,
  • 50:13to really get a better picture of that.
  • 50:15But the general result in the scientific
  • 50:16literature does seem to show that when zoonosis happens,
  • 50:18the disease is already well set to cause problems.
  • 50:22And the examples that we do have where
  • 50:24it doesn't happen like that, like MERS
  • 50:27or like, well, MERS is a good example.
  • 50:30It's a really deadly disease,
  • 50:31but it doesn't transmit well among humans.
  • 50:32And so that's an example where maybe it's transmitting
  • 50:35to humans, but it's not transmitting among humans.
  • 50:37And it's very hard for that disease
  • 50:40to catch on within the human population
  • 50:43and do human transmission as opposed to zoonotic events.
  • 50:45And that's because it doesn't transmit
  • 50:47and it doesn't usually evolve that ability
  • 50:48to transmit over the short time that
  • 50:51individuals might be infected
  • 50:53when they get it, usually from camels.
  • 50:57Okay, so I've shown you those examples.
  • 50:59I just wanna mention what else we're gonna be doing.
  • 51:02So what I just showed you was actually
  • 51:05for SARS coronavirus 2: some sites
  • 51:06that are under selection in SARS
  • 51:08coronavirus 2 genes.
  • 51:10This is the S gene right here.
  • 51:12That's the spike gene.
  • 51:13We're gonna be looking at that in SARS coronavirus
  • 51:151 and 2. We're also going to be looking
  • 51:18at other genes in the genomes.
  • 51:22These have other functions.
  • 51:23The M gene, for instance, is a membrane gene.
  • 51:26So it might be relevant, and the N gene
  • 51:28as well might be relevant, to vaccine generation.
  • 51:32Like, if we could generate a vaccine that targeted
  • 51:35those, maybe they would be unable to change at the same
  • 51:41pace that the spike protein would; they might be more conserved.
  • 51:44And that might be one approach towards developing a vaccine
  • 51:46that would be a longer-term vaccine, because one thing we
  • 51:49have to worry about, of course, with this coronavirus,
  • 51:53and I have other research that we're doing on
  • 51:55this question, which I'd love to talk about if anyone's
  • 51:57curious, but you can estimate
  • 51:59what the actual waning immunity of it is,
  • 52:00even though we don't have data on that, by looking
  • 52:03at other related species and using the phylogeny
  • 52:05to understand how the waning immunity
  • 52:08has evolved across the species
  • 52:09and what the projected or most likely
  • 52:12waning immunity of SARS coronavirus is,
  • 52:15and it looks like
  • 52:16it's around 80 weeks or so.
  • 52:18So if we get about 80 weeks of a waning period
  • 52:21of immunity from this, that's not that
  • 52:22much: it means every two years or so we're gonna have
  • 52:25coronavirus coming around, in the sense that we're going to
  • 52:28be susceptible again to coronavirus,
  • 52:30not that we're going to get it every two years.
  • 52:33And what that would mean is that
  • 52:36it's likely to persist as a circulating virus.
  • 52:38And if it remains as deadly as it is, that's a serious issue.
  • 52:40So we're gonna really want to have a vaccine.
  • 52:42And we're not necessarily going to wanna have another flu
  • 52:44vaccine that we have to get every year.
  • 52:49So what we really want to do is target
  • 52:51some genes that may be under more constraint
  • 52:53than the receptor-binding protein gene, the spike gene.
  • 52:57So anyway, the point is that looking at multiple genes to
  • 53:00try to understand where conserved regions are and where
  • 53:03regions that are under selection are is important.
  • 53:05And we'll be doing that.
  • 53:07And hopefully some of those results will
  • 53:11help to guide the kind of generation of vaccines,
  • 53:15and also the generation of therapeutics,
  • 53:16because sites that are under
  • 53:19selection are functional.
  • 53:20So if you actually design a therapeutic,
  • 53:21you want one that interferes with the sites that are under selection,
  • 53:22sort of in an opposite way from vaccines. With vaccines,
  • 53:25we really want to target something that just doesn't change.
  • 53:26With therapeutics, we may want to target
  • 53:27the changing regions, if we can design something
  • 53:30that generically does, because those changing
  • 53:31regions are functional.
  • 53:32In other words, those sites at the end of the spike protein
  • 53:33are clearly ones that do bind ACE2.
  • 53:35It's just that they're flexible
  • 53:38about what they are in order to bind it.
  • 53:42So we need to include
  • 53:43all of those changing sites if we wanna develop
  • 53:46a therapeutic that, for instance, would somehow interfere
  • 53:50with the binding of ACE2 receptors by the spike protein.
  • 53:53So thank you very much for listening to the ongoing work
  • 53:56we're doing on COVID-19.
  • 53:59I would love to entertain any questions that you have.
  • 54:03Let me just take one moment to acknowledge
  • 54:05some of the people that I should acknowledge in this work,
  • 54:09I already showed you a picture of John John earlier,
  • 54:11who is associated with the MACML approach
  • 54:13that we developed many years ago, 10 years ago basically.
  • 54:15Yinfei Wu has been taking the lead on this project.
  • 54:18She's a master's student.
  • 54:19Yano os Wang was a visiting
  • 54:22assistant professor. Stephen Gaugham
  • 54:24is in the EEB department and
  • 54:26has been helping out with this analysis.
  • 54:28Haley Hassler is in my lab and has been helping out
  • 54:30with phylogenetics. Jayveer Singh is an undergrad
  • 54:32who's been doing some of the research work,
  • 54:35actually some of the literature research
  • 54:37that has helped us to contextualize
  • 54:39the work we're doing. Mofeed Najib
  • 54:41produced those diagrams of the spike protein
  • 54:44with the sites that we have identified
  • 54:46as under selection so far.
  • 54:48Zheng Wang is a long-term collaborator of mine who works
  • 54:54with me on nearly all the phylogenetic projects
  • 54:56that I do.
  • 54:59And then Alex Thornburg is a long-term collaborator of mine,
  • 55:02now in North Carolina.
  • 55:06Well, he's currently at the North Carolina
  • 55:08Museum of sciences, but he works on a lot of phylogenetic
  • 55:11projects with me as well.
  • 55:13And by the way, all of this, fortunately,
  • 55:16was recently awarded one of the NSF RAPID grants
  • 55:19to do this research.
  • 55:20So we're very pleased to have funding to
  • 55:22continue to work on this as time goes on, which is good
  • 55:25because it's taking quite a lot of work
  • 55:27to do the sequence wrangling
  • 55:29and the analyses themselves.
  • 55:30As I mentioned, they're computationally intensive.
  • 55:32So Alex and I were the PIs on that particular
  • 55:36grant from the NSF.
  • 55:37So we're excited to continue to do that work.
  • 55:41And with that, I think I would
  • 55:42like to entertain any questions you might have.
  • 55:45- Thank you, Jeff, this was great.
  • 55:48I'm sure we have a lot of questions.
  • 55:49Who goes first?
  • 55:54Again, you can type the questions in the
  • 55:59chat box or just unmute.
  • 56:13- I have a quick question.
  • 56:14- Okay.
  • 56:16- You mentioned or you touched a bit on this before,
  • 56:20but how would this compare to site-wise estimates
  • 56:24of omega that you would get from PAML
  • 56:28or a similar program?
  • 56:29- So I'm sorry, I sort of was rushing at the end.
  • 56:32I didn't explain that, in fact, I'm using PAML for some of it.
  • 56:35So I'm using PAML
  • 56:36for the pre-zoonosis analysis and for the post-zoonosis
  • 56:40analysis, because as I mentioned during the talk,
  • 56:44if you have a large phylogeny
  • 56:46with multiple branches, et cetera, et cetera,
  • 56:49where you can look over that entire phylogeny then you
  • 56:51can get multiple changes at individual sites,
  • 56:53which is what PAML actually uses to infer selection, right?
  • 56:55You have to have the site change not just once
  • 56:57but twice or three times.
  • 57:02And then it says, oh, that's under selection because
  • 57:07it keeps changing again and again and again.
  • 57:12So PAML allows you to do that
  • 57:13if you have this sort of deep time
  • 57:15or large amount of time and multiple lineages that you're
  • 57:17looking at. The MASS-PRF approach that I'm using enables
  • 57:19you to do that on just a single lineage without needing
  • 57:22multiple changes. I mean, multiple changes
  • 57:23on a single lineage you can't even detect,
  • 57:25because it just looks like one change.
  • 57:26If you have the ancestral sequence, which is what we do,
  • 57:28ancestral state estimation, to get the ancestral sequence,
  • 57:31and if you have the descendant sequence, and A changes
  • 57:33to T, you don't know if it changed A to G to C to T
  • 57:35or if it just changed A to T. You have no idea; you can
  • 57:36just say it changed once.
  • 57:38And so there's no real way to run PAML.
  • 57:40There is a way, but it's a statistically
  • 57:41really underpowered, terrible thing
  • 57:42to do to try to run PAML on a single lineage
  • 57:44and figure out whether something's under selection.
  • 57:47The advantage of this approach is because it
  • 57:49can use that polymorphism data, the data of, like, what's
  • 57:51just circulating within populations, as a metric for how
  • 57:54much mutation is occurring.
  • 57:56You can essentially divide out by that
  • 57:59and then again, because we're integrating over all
  • 58:04these models of how these things change, we're essentially
  • 58:07borrowing information from neighboring sites for what their
  • 58:10rates of change are, et cetera et cetera
  • 58:13to estimate what the possible amount
  • 58:15of selection is on all these sites.
  • 58:16So by using the polymorphism data, and by doing this model
  • 58:19averaging approach, we're actually able
  • 58:21to take individual lineages and estimate
  • 58:23the selection on them.
  • 58:25And that's what we're doing in the near-zoonosis analysis
  • 58:29that I showed you in the middle here.
  • 58:33So there are different ways of doing the analysis.
  • 58:35And it's necessitated by the fact that we just have this
  • 58:37one lineage and there's no way it won't be a single lineage
  • 58:39in any dataset we look at because for zoonosis,
  • 58:42we're going to have human sequences,
  • 58:44we're gonna have some animal sequences,
  • 58:45we're not going to know we're not going
  • 58:48to have any information about the actual zoonosis.
  • 58:50Even if we knew the first human,
  • 58:52we could just take that as an estimate.
  • 58:54We still probably need some data here.
  • 58:56Maybe you could have the first human
  • 58:58and the first animal that you got it from.
  • 59:00That just doesn't exist.
  • 59:01We don't have that data for any zoonosis.
  • 59:04How would we? We would never be there at the moment.
  • 59:07So we have to assume that there's a number
  • 59:09of transmissions among humans
  • 59:10and a number of transmissions among animals
  • 59:13during that near zoonotic period.
  • 59:14And it's just a single lineage.
  • 59:16So we can't really run PAML on that,
  • 59:19in summary, because PAML requires multiple
  • 59:21changes on multiple lineages to have power
  • 59:23to actually infer evolutionary change.
  • 59:25MASS-PRF, fortunately, can do that,
  • 59:27because you can look at single lineages.
  • 59:28You can use MK tests as well on single lineages;
  • 59:33they're basically designed to look at single lineages.
  • 59:36But the problem with MK tests, as I mentioned,
  • 59:37is that they're assuming the entire
  • 59:39gene is under selection, which means they don't give you
  • 59:41the scope or understanding about receptor-binding
  • 59:44sites being under selection or something like that.
  • 59:46They often will just give you a result that the gene's not under
  • 59:47selection, which is not true.
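
The gene-wide MK (McDonald-Kreitman) comparison mentioned above can be sketched in a few lines. This is only an illustrative sketch, not the speaker's pipeline: the function names are made up for this example, the four counts are illustrative inputs rather than results from the talk, and the code assumes the synonymous counts are non-zero. It shows how within-population polymorphism is used to "divide out" the mutational input, and why the result is a single number for the whole gene rather than a per-site profile.

```python
import math


def mk_summary(dn, ds, pn, ps):
    """Gene-wide MK summary from four counts.

    dn, ds: fixed (divergent) replacement and synonymous differences
    pn, ps: polymorphic replacement and synonymous sites
    Assumes ds > 0, ps > 0 and dn > 0 (no zero-count correction here).
    """
    neutrality_index = (pn / ps) / (dn / ds)
    alpha = 1.0 - neutrality_index  # rough share of adaptive replacements
    return neutrality_index, alpha


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Illustrative counts only (not data from this study).
dn, ds, pn, ps = 7, 17, 2, 42
ni, alpha = mk_summary(dn, ds, pn, ps)
p_value = fisher_exact_two_sided(dn, ds, pn, ps)
print(f"NI = {ni:.3f}, alpha = {alpha:.3f}, p = {p_value:.4f}")
```

Because all sites in the gene are pooled into one 2x2 table, a handful of strongly selected sites can be swamped by the rest of the gene, which is the limitation described above.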
  • 59:51- Does that answer your question?
  • 59:54- Yes.
  • 59:55- Great.
  • 60:00- Any other questions?
  • 01:00:04- I have one more if no one else wants to.
  • 01:00:05- Sure, go ahead.
  • 01:00:07- So in B cells, we have mutation mechanisms
  • 01:00:10that are specifically
  • 01:00:13biased towards replacement mutations.
  • 01:00:17So in the absence of selection,
  • 01:00:18the mutation mechanisms actually cause
  • 01:00:21an omega greater than one.
  • 01:00:24Would this have any way of correcting for that?
  • 01:00:28- So the tricky part is, and I don't know how it might,
  • 01:00:31the tricky part is not so much running the software,
  • 01:00:33which you could certainly do on that.
  • 01:00:37The tricky part would be identifying
  • 01:00:39what polymorphism is, in the case of those cells.
  • 01:00:43So if you could identify sets of cells that are undergoing
  • 01:00:47the mutation but aren't under selection in some way, then
  • 01:00:51you could use that as the proxy for what we use here,
  • 01:00:54which is within-population polymorphism,
  • 01:00:57and then estimate that.
  • 01:00:59I just don't know whether you have a way of
  • 01:01:01doing that. If you want to discuss
  • 01:01:03it with me, we could.
  • 01:01:05That's sort of always the key for detecting selection.
  • 01:01:09And, you know, many of you may be familiar that I work
  • 01:01:11on cancer in some of the work that I do.
  • 01:01:13It's the same
  • 01:01:18problem that I'm working on there all the time: I'm trying
  • 01:01:21to understand what the baseline mutation rates
  • 01:01:23in cancer and somatic evolution of cells are.
  • 01:01:25Because if I understand the baseline rates,
  • 01:01:27how often those things change
  • 01:01:29just from mutation alone,
  • 01:01:30then I can always estimate selection.
  • 01:01:32And that's the thing we almost always want to
  • 01:01:34know about in the analysis of sequence data.
  • 01:01:37So, again, it's all about figuring out if there's some piece
  • 01:01:42of the data that can be used to estimate that polymorphism
  • 01:01:46and an approach like this, the benefit of an approach like
  • 01:01:48this would be, you know, maybe you can estimate that for
  • 01:01:50some portions of the gene, but not others, you know, maybe
  • 01:01:52then there's a way that you could use this sort of model
  • 01:01:54averaging approach to get at the underlying rate that it's
  • 01:01:56happening, even if you can't estimate
  • 01:01:58for that particular site, for instance.
  • 01:02:00So I think there might be potential to do it,
  • 01:02:02but it just depends, you know, on whether
  • 01:02:04there's a critical, you know, set of data in what you're
  • 01:02:09looking at, which I haven't spent much time
  • 01:02:12looking at back in the day.
  • 01:02:13So I wouldn't know whether there's some way
  • 01:02:15of getting that baseline polymorphism or baseline
  • 01:02:19mutation rate, which essentially amounts to the same thing.
  • 01:02:23It just depends on whether, you know, you're assuming the
  • 01:02:26population sort of has, you know,
  • 01:02:29it's just whether you're looking at a population level,
  • 01:02:31or you have some sort of covariance matrix
  • 01:02:34to better understand the mutation rates themselves.
  • 01:02:36- I think there is a similar population of B cells.
  • 01:02:38- Great, so I encourage you to look into that.
  • 01:02:44- Jeff, I have a quick question.
  • 01:02:47I'm not too familiar with genome sequencing.
  • 01:02:50But I think the clustering problem,
  • 01:02:53the issue and the solution you have,
  • 01:02:55can be applied to many types of data.
  • 01:02:58So I'm kind of confused.
  • 01:02:59So in the diagram where you describe
  • 01:03:02the different steps, you said that you first pick the most
  • 01:03:06likely cluster and then you essentially
  • 01:03:07keep splitting the clusters, right?
  • 01:03:09How do you get the first clusters? Like,
  • 01:03:12is there some randomness in how you split the first?
  • 01:03:16- Oh, sorry, I apologize.
  • 01:03:19I didn't explain it in enough detail.
  • 01:03:22The reason why it's so computationally intensive
  • 01:03:24is that we look at all possible models,
  • 01:03:27all possible, exhaustively.
  • 01:03:29Now, I actually spent a year of my life trying
  • 01:03:31to find a way to develop a Bayesian approach
  • 01:03:34or some approach that would allow me
  • 01:03:38to not look at all possible models, you know,
  • 01:03:40because if you could do that,
  • 01:03:41this would be a great way for doing tons of different things
  • 01:03:45on very large data sets, right?
  • 01:03:47And what amazed me is, I found that
  • 01:03:50it was just an impenetrable problem.
  • 01:03:53If I didn't look at every possible model,
  • 01:03:56I could not get it to work. I couldn't prove that
  • 01:04:00that's true; like, I don't have any proof that's true.
  • 01:04:04And I would encourage anyone who really wants to dive
  • 01:04:05in there, go ahead.
  • 01:04:06But I'll warn you that I spent a year
  • 01:04:07banging my head against that problem.
  • 01:04:09And when I didn't
  • 01:04:10exhaustively search all the models, I always
  • 01:04:12caused these biases; like, there was no way to sample them.
  • 01:04:16I even have ways of sampling the models
  • 01:04:17according to their probability.
  • 01:04:24But even that causes a bias because sometimes
  • 01:04:31there's a large number.
  • 01:04:31So if you look at the, if you think
  • 01:04:34about the set of models, it's a very large set of models.
  • 01:04:35And there isn't actually a huge amount
  • 01:04:38of likelihood differences between these models.
  • 01:04:42That's the thing.
  • 01:04:45So when you don't exhaustively sample the models,
  • 01:04:49if you just sample some of the most likely models,
  • 01:04:53you actually are sampling just
  • 01:04:56one corner of the space.
  • 01:04:57And it's possible for a bunch of
  • 01:04:59not quite so likely models, but reasonable models
  • 01:05:00that are not in that corner to sort of be actually
  • 01:05:03highly influential on the model average.
  • 01:05:04And so the bottom line is, like, sampling
  • 01:05:05by trying to pick in the, you know, most likely space doesn't
  • 01:05:06work; sampling by picking randomly doesn't work.
  • 01:05:07And I could go into more detail about it.
  • 01:05:09But it turned out that I couldn't do it
  • 01:05:10any way other than exhaustive sampling.
  • 01:05:12So, I say that... sorry, I misspoke.
  • 01:05:14I couldn't do it by any biased approach.
  • 01:05:16As for that exhaustive sampling,
  • 01:05:18the approach that I'm showing you right here,
  • 01:05:21actually, there are two ways of doing it.
  • 01:05:22One is to sample stochastically,
  • 01:05:23according to likelihood, and the other is to sample exactly
  • 01:05:27across all models. Exhaustive sampling certainly works.
  • 01:05:30In fact, it's implemented in the approach that I
  • 01:05:33was just showing. I'm sorry, I just sort of jumped too fast
  • 01:05:35in what I was saying.
  • 01:05:37So sampling stochastically works
  • 01:05:38and sampling exhaustively works. Sampling stochastically is
  • 01:05:40still very computationally intensive.
  • 01:05:42But I couldn't
  • 01:05:44find any way to sort of, you know, importance sample or do
  • 01:05:48some sort of approach that would allow me to get a smaller
  • 01:05:50set of models. If we could do that,
  • 01:05:53that could be really important,
  • 01:05:55because then you could do this
  • 01:05:57on more than like 2000 sites.
  • 01:05:59It's somewhere around 2000 sites
  • 01:06:00that you start running into real problems with
  • 01:06:04just too much computation time
  • 01:06:06to make it worthwhile.
  • 01:06:07So we could extend this to 10,000, 100,000, you know,
  • 01:06:11potentially really, really large numbers of sites,
  • 01:06:13and really, really sparse sets of sites,
  • 01:06:16if only we could find a way
  • 01:06:19to bias the sampling towards models that are more likely
  • 01:06:24without causing biases in the results.
  • 01:06:26I couldn't find any way to do it.
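
To make the idea of exhaustive model averaging concrete, here is a deliberately simplified sketch. It is not the MACML or MASS-PRF implementation: it assumes a toy model family in which each model is a single contiguous cluster with its own event rate plus a background rate everywhere else, and it assumes Akaike weights as the way to turn likelihoods into model weights. All function and variable names are hypothetical. What it does illustrate is that every candidate model is evaluated, and that the per-site estimate is a weighted average over all of them rather than the answer from the single best model.

```python
import math
from itertools import combinations


def bernoulli_loglik(k, n):
    """Maximized log-likelihood of k events among n sites sharing one rate."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def model_average(events):
    """Average per-site event rates over every single-cluster model.

    events: list of 0/1 values along a linear sequence (event / no event).
    Returns (averaged per-site rates, model weights).
    """
    n = len(events)
    prefix = [0]
    for e in events:
        prefix.append(prefix[-1] + e)
    total = prefix[-1]

    # Each model: (AIC, cluster start, cluster end, rate inside, rate outside).
    # Null model: one rate for the whole sequence (1 parameter).
    models = [(2 * 1 - 2 * bernoulli_loglik(total, n), 0, n, total / n, total / n)]
    for start, end in combinations(range(n + 1), 2):
        if start == 0 and end == n:
            continue  # identical to the null model
        k_in, n_in = prefix[end] - prefix[start], end - start
        k_out, n_out = total - k_in, n - n_in
        loglik = bernoulli_loglik(k_in, n_in) + bernoulli_loglik(k_out, n_out)
        # Assumed parameter count: two rates plus two boundaries.
        models.append((2 * 4 - 2 * loglik, start, end, k_in / n_in, k_out / n_out))

    best = min(m[0] for m in models)
    raw = [math.exp(-0.5 * (m[0] - best)) for m in models]
    norm = sum(raw)
    weights = [r / norm for r in raw]

    averaged = [0.0] * n
    for w, (_, start, end, r_in, r_out) in zip(weights, models):
        for i in range(n):
            averaged[i] += w * (r_in if start <= i < end else r_out)
    return averaged, weights


# Toy sequence: a dense patch of events in an otherwise empty region.
events = [0] * 10 + [1, 1, 0, 1, 1] + [0] * 10
averaged, weights = model_average(events)
print([round(r, 2) for r in averaged])
```

Even in this toy version the number of models grows quadratically with sequence length, and it grows much faster once models with several clusters are allowed, which gives a sense of why the exhaustive search described above becomes expensive at a couple of thousand sites.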
  • 01:06:27- This seems very much related to tree based
  • 01:06:30methods where essentially you've got, like split the space
  • 01:06:36and then you model of geology models,
  • 01:06:39like the random forest, for example,
  • 01:06:41is very much related to that, right?
  • 01:06:45- Yeah, I have to say I am now familiar
  • 01:06:47with those approaches.
  • 01:06:49But when I was completely unfamiliar with them, yeah, I sort
  • 01:06:52of thought about it that way.
  • 01:06:54But you're absolutely right.
  • 01:06:56- Yeah, I guess the difference is that here
  • 01:06:57you have a sequence, like one sequence;
  • 01:07:00there you have a space.
  • 01:07:01So you just split in
  • 01:07:02different dimensions, but it is really good.
  • 01:07:05- And I can mention, just to speculate,
  • 01:07:10I'm kind of interested in a number of
  • 01:07:13other ways of applying this.
  • 01:07:15So for instance, the one I've been thinking about
  • 01:07:18and actually worked on a little
  • 01:07:20bit, but haven't gotten very far with, is, like,
  • 01:07:21when you're dealing with event spaces over time,
  • 01:07:22like if you have days, and you have individuals, like,
  • 01:07:24prominently for us in public health,
  • 01:07:27individuals who are undergoing events,
  • 01:07:29you end up with a very sparse matrix of events.
  • 01:07:31And so we use these approaches like survival plots,
  • 01:07:38all these approaches that we use to sort of understand
  • 01:07:40how these rare events are happening
  • 01:07:42and how people are changing over this.
  • 01:07:44That event space is actually really sparse.
  • 01:07:45But it's kind of a matrix.
  • 01:07:47And you could do this in two dimensions,
  • 01:07:48not just one, right?
  • 01:07:49So you could model average across two dimensions,
  • 01:07:52and then you could get something...
  • 01:07:53The thing that really appeals to me about that is that,
  • 01:07:55again, this approach really
  • 01:08:00only builds up, from this binomial event /
  • 01:08:04no-event stuff, a picture that's very continuous
  • 01:08:09over the space and involves no assumptions
  • 01:08:11about distribution whatsoever.
  • 01:08:12So I'm just wondering if there aren't instances
  • 01:08:14where, you know, we could come up
  • 01:08:17with a better understanding of what's going on
  • 01:08:19with individuals in a matrix such as
  • 01:08:20that by using this approach.
  • 01:08:22And it's an approach
  • 01:08:23that still works even with these sparse spaces, because
  • 01:08:26you can model average over these tremendously large numbers
  • 01:08:29of models that all have fairly
  • 01:08:33equal likelihood to get a result.
  • 01:08:35So I don't know, that's just sort of a
  • 01:08:37speculation that there might be some interesting
  • 01:08:38ways to approach those problems using this kind
  • 01:08:41of model averaging technique.
  • 01:08:46- Great, I think we should wrap up.
  • 01:08:49Thank you, Jeff, for this great presentation.
  • 01:08:52And thank you all for joining today.
  • 01:08:57The next seminar
  • 01:08:58is gonna be, I think, July 14.
  • 01:09:01So we'll send out invites.
  • 01:09:05All right, thank you, Jeff.
  • 01:09:07Thank you all, bye, bye.