YSPH Biostatistics Seminar: “Feature Aggregation in Causal Discovery for High-dimensional Data: Application to Targeting the ‘Gut-Brain-Axis’ via the Microbiome Diversity”

September 21, 2023
  • 00:01<v ->All right, I'm very excited</v>
  • 00:03to introduce our speaker for today.
  • 00:04We have Dr. Meghan Short.
  • 00:06Dr. Short has completed fellowships
  • 00:08at the Glenn Biggs Institute for Alzheimer's
  • 00:10and Neurodegenerative Diseases,
  • 00:12and at Harvard's Huttenhower Lab.
  • 00:14Currently, Dr. Short is an assistant professor
  • 00:17at Tufts University.
  • 00:18Let's give a warm welcome to Dr. Short.
  • 00:31<v ->Hi, everyone. Thank you for being here.</v>
  • 00:34Can you all hear me okay?
  • 00:35<v ->Sign in if you're registered.</v>
  • 00:38<v ->All right, so, today, I'm going to talk about a project</v>
  • 00:41that I worked on as part of my postdoc
  • 00:43down at UT Health San Antonio
  • 00:46with the Glenn Biggs Institute for Alzheimer's
  • 00:49and Neurodegenerative Diseases,
  • 00:51and I wanted to talk about this as a...
  • 00:56None of the sort of methods that I'm gonna talk about
  • 00:58in this talk are particularly new.
  • 01:01This wasn't sort of a methods development project.
  • 01:04So the sort of main network method I'll talk about
  • 01:08is about a decade old at this point, at least,
  • 01:10but what's nice about it is that
  • 01:13with increasing availability
  • 01:15of high dimensional biomedical data,
  • 01:17it's sort of seeing more use cases,
  • 01:20and it's not something that, at least, I learned about
  • 01:22in my graduate program in biostatistics,
  • 01:24but it's something that I thought
  • 01:26would be good to talk about today
  • 01:28since it's such a useful method.
  • 01:32So let's see if I advance.
  • 01:35There we go.
  • 01:36So I'll start just by giving a quick introduction.
  • 01:39I know that when I was in grad school, I always wanted,
  • 01:43I thought it was interesting
  • 01:44to hear about people's career paths
  • 01:46as I was considering my own.
  • 01:48So I started in biology as a field.
  • 01:53I studied salt marsh ecology as an undergrad,
  • 01:56and then by the end of undergrad,
  • 01:58I was interested in getting more into sort of a human,
  • 02:00more directly human-focused environment,
  • 02:02and so I considered public health.
  • 02:05I learned about statistics
  • 02:06as part of my research in undergrad
  • 02:08and wanted to continue with that so I participated in SIBS,
  • 02:11which is a program that you may be aware of,
  • 02:14and that was my first intro to biostat.
  • 02:16I was a graduate student at Boston University.
  • 02:19I had the fortune of working
  • 02:21with the Framingham Heart Study,
  • 02:22which is where the data comes from
  • 02:24that I'll be talking to you about today,
  • 02:26which is a really interesting study,
  • 02:27and I'll give more details on it in a few slides.
  • 02:29That was sort of my introduction
  • 02:31to working with epidemiological data.
  • 02:34After grad school, I continued on,
  • 02:36again, to UT Health San Antonio,
  • 02:38and then following that to a postdoc at Harvard
  • 02:42looking at developing methods for microbiome analysis.
  • 02:47So if you have any interest in that,
  • 02:49feel free to approach me,
  • 02:51although I'm not gonna talk about that today,
  • 02:54and then as of March this year,
  • 02:57I started as an assistant professor at Tufts Medicine
  • 03:00where I'm working on a variety of projects
  • 03:02but a lot related to sort of omics data
  • 03:06and aging and longevity.
  • 03:12So I'll start today's talk with a bit of motivation
  • 03:15for why network-based analyses were a good fit
  • 03:18for looking at sort of the proteome in Alzheimer's disease.
  • 03:24So first of all, Alzheimer's disease
  • 03:27is a very prevalent condition.
  • 03:30Many of you may be like me and know some family members
  • 03:33or people who have been affected by it.
  • 03:36It's very common and expected to be more so
  • 03:40as populations age, and it's a leading cause of mortality,
  • 03:44disability, and poor health among seniors,
  • 03:47and one interesting feature of this disease
  • 03:49is that precursors of it can appear years to decades
  • 03:52before symptoms manifest.
  • 03:55So those precursors can include indicators
  • 03:58that are visible on brain MRIs,
  • 04:01performance on neurocognitive testing, changes in gait,
  • 04:05even changes in sense of smell,
  • 04:08and cerebrospinal fluid markers, such as tau and amyloid.
  • 04:17Because of this, there's interest in being able to find
  • 04:21plasma biomarkers for Alzheimer's disease
  • 04:24and related dementias.
  • 04:25ADRD is an acronym we'll be using sort of throughout.
  • 04:30Since there are indicators
  • 04:33of sort of pre-disease development
  • 04:35years to decades before, being able to detect those,
  • 04:37either earlier or in a less invasive or expensive way,
  • 04:40is very useful,
  • 04:44and so when I say invasive, I mentioned CSF markers,
  • 04:50such as tau and amyloid, can predict dementia,
  • 04:53but that involves doing a lumbar puncture
  • 04:56versus something like a blood draw, which is easier to do.
  • 05:01Another good aspect of trying to find biomarkers
  • 05:04is that you can get a sense of biological processes
  • 05:07that are involved in disease development,
  • 05:10and that can hopefully lead to either preventative
  • 05:13or therapeutic interventions.
  • 05:19What makes this difficult?
  • 05:21So in my case, I was looking at proteins.
  • 05:24There are thousands and thousands to select from,
  • 05:27and you get sort of this inherent trade off
  • 05:30between trying to control a false positive rate
  • 05:33for all these multiple tests that you may be performing,
  • 05:36but if you effectively control the false positive rate,
  • 05:39you're going to likely end up with low statistical power.
  • 05:42There's this trade off between...
  • 05:45It's sort of a needle in a haystack.
  • 05:47Another thing that has tended to be true
  • 05:50is that there is not very good replicability across studies.
  • 05:53So one study may find 20 biomarkers
  • 05:57and maybe one or two of them
  • 05:59may replicate in a different study.
  • 06:01So there's a lot of noise that ends up coming through.
  • 06:08The approach that I took in this project
  • 06:10was to use network analysis
  • 06:13to analyze the protein data,
  • 06:17and the motivation there is to try and capture
  • 06:20subtle but consistent variation in groups of proteins.
  • 06:23I'll refer to them as modules during this talk.
  • 06:28It does just a few things, so first of all,
  • 06:30it reduces the dimensionality
  • 06:32of the statistical testing problem that you have.
  • 06:35So rather than testing each protein individually
  • 06:37and having to adjust for all of those multiple tests,
  • 06:40you can sort of reduce the space
  • 06:43to a smaller number of tests
  • 06:46where the proteins within each group being tested
  • 06:49are inter-correlated with one another,
  • 06:52and unlike other dimensionality reduction methods,
  • 06:55something like a principal components analysis
  • 06:57that you may be familiar with,
  • 06:59the network method has sort of a benefit of looking
  • 07:03not just at, say, correlations
  • 07:06or relationships between pairs of proteins,
  • 07:09but, also, at sort of the correlational neighborhood
  • 07:11of what common neighbors
  • 07:13those proteins share in the network.
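To make the reduction in testing burden concrete, here is a minimal sketch of the per-test threshold under a Bonferroni correction. It uses the counts given later in the talk (about 1,300 proteins on the assay, four modules); the 0.05 family-wise level and the function name are assumptions for illustration only.

```python
# Per-test significance threshold under a Bonferroni correction.
ALPHA = 0.05  # assumed family-wise error rate


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Return the per-test p-value cutoff that keeps the family-wise rate at alpha."""
    return alpha / n_tests


per_protein = bonferroni_threshold(1300)  # one test per protein
per_module = bonferroni_threshold(4)      # one test per module

print(f"per-protein threshold: {per_protein:.2e}")
print(f"per-module threshold:  {per_module:.4f}")
```

With far fewer tests, each test faces a much less stringent cutoff, which is where the gain in power comes from.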
  • 07:18Another benefit of or sort of way
  • 07:22that we try to get around some of the pitfalls
  • 07:23of proteomic analysis is by focusing on biological pathways
  • 07:29instead of on individual proteins themselves.
  • 07:32So within groups of proteins that we find to be of interest
  • 07:36or possibly associated with dementia outcomes,
  • 07:40we use a tool called over-representation analysis,
  • 07:43which I'll talk about later,
  • 07:45but it essentially tries to pinpoint biological pathways
  • 07:48that may be overrepresented by the proteins
  • 07:51that are found to be associated with the outcome,
  • 07:54and the hope there is to find,
  • 07:56to get sort of insights that are more robust across studies
  • 08:01and, hopefully, address some of the issues
  • 08:03with replicability.
  • 08:08Okay, so that's sort of the motivation for this study,
  • 08:11and, now, I'll talk a little bit about the data.
  • 08:18The data for this study
  • 08:19comes from the Framingham Heart Study,
  • 08:22which has been going on for a very long time.
  • 08:24It started in 1948 in the town of Framingham, Massachusetts,
  • 08:29and at the time they enrolled,
  • 08:31they reached out to two-thirds of the population of the town
  • 08:34to try and enroll them in this epidemiological study.
  • 08:36It was one of the first ones of its kind,
  • 08:39and people would come in for exams every few years,
  • 08:42and they would take all of this information about them,
  • 08:45and then follow them for outcomes.
  • 08:47Cardiovascular outcomes was really
  • 08:49the sort of outcome of interest when it first started.
  • 08:53Over the years, they've then enrolled offspring
  • 08:57of the original cohort participants
  • 08:59as well as grandchildren and third generation,
  • 09:02and then as sort of the demographics
  • 09:06of Framingham have changed over the years,
  • 09:09if you're only enrolling descendants
  • 09:10of people who lived there in 1948,
  • 09:12you're not gonna capture that.
  • 09:13So they also have been enrolling omni cohorts
  • 09:15to reflect sort of more diverse populations (indistinct).
  • 09:21Again, they were sort of aiming
  • 09:23towards identifying risk factors
  • 09:25and etiologies of cardiovascular disease,
  • 09:29but as those populations age,
  • 09:31brain health and cognition is also an important outcome,
  • 09:34and so they've measured sort of cognitive outcomes
  • 09:39and incidence of dementia as well, and, of course,
  • 09:41those things are also related to cardiovascular.
  • 09:48For our study in particular,
  • 09:51we were using the offspring cohort,
  • 09:53and at their examination cycle five,
  • 09:55which was in the early 90s, they collected blood samples,
  • 10:00and froze the plasma from those samples,
  • 10:03and years later, when they sort of had
  • 10:06these broader proteomic analysis assays available,
  • 10:11they measured the plasma proteome,
  • 10:14I'll talk about the methods for that on the next slide,
  • 10:18but they did this in about 1,900 participants
  • 10:21who were approximately aged 55 when the blood was drawn.
  • 10:24So this is sort of a middle-aged cohort,
  • 10:26generally, cognitively healthy
  • 10:29and a little more than half women.
  • 10:33The main outcomes that we looked at in this study
  • 10:35are MRI-based measures, so brain MRIs were taken
  • 10:41about 10 years or so, five to 10 years
  • 10:45after the initial blood draws, and those had...
  • 10:51The sort of outcomes that I looked at there are
  • 10:54total brain volume as well as the volume of the hippocampus
  • 10:57and then a measure called white matter hyperintensities,
  • 11:01which is sort of a measure of vascular injury in the brain,
  • 11:06and a reason to look at those outcomes is that
  • 11:10I mentioned there are sort of precursors of dementia
  • 11:13or risk factors for dementia that can be identified on MRI,
  • 11:16those are some of the big ones.
  • 11:19Especially since we had a middle-aged cohort,
  • 11:22you may not see a lot of incident dementia,
  • 11:24and so being able to detect proteins
  • 11:27that are associated with some of those precursors
  • 11:29is a way of getting at this issue.
  • 11:34We did also look at incident dementia.
  • 11:36So we had about 20 years of follow-up,
  • 11:37which is one of the strengths of this,
  • 11:40looking in this particular sample,
  • 11:42and we had 128 incident cases of dementia
  • 11:46of which 94 of them were classified
  • 11:48as Alzheimer's type dementia.
  • 11:53We also had a replication cohort.
  • 11:55I mentioned the importance of replication,
  • 11:58and so we worked with collaborators
  • 12:00at the University of Washington and their cohort study
  • 12:04called the Cardiovascular Health Study,
  • 12:06which has sites, I think, four different sites around the US
  • 12:09and has measures of the same proteomic platform
  • 12:13and same outcomes that we're looking at in the study.
  • 12:19The assay that we used to measure proteins
  • 12:23is called SOMAscan.
  • 12:24It's by this company called SomaLogic.
  • 12:27They use these single-stranded DNA aptamers
  • 12:29that are designed to specifically bind
  • 12:31to different proteins, and you can sort of tag them
  • 12:35that way and measure their concentrations.
  • 12:38In our sample, the assay had 1,300 proteins,
  • 12:42which that's even sort of becoming dated now.
  • 12:45I think the latest version
  • 12:46has something like 7,000 proteins.
  • 12:48So there's a lot that can be measured with this,
  • 12:51but there is some sort of bias towards, I think,
  • 12:57molecules that sort of have some evidence
  • 12:59of being important in cardiovascular disease.
  • 13:01So it's not an entirely sort of agnostic choice of proteins,
  • 13:06but it does get a pretty wide range.
  • 13:11Okay, so that's a description of the data,
  • 13:15and, now, I want to dig in a bit
  • 13:17to the network methods that we used.
  • 13:20So this is sort of a graphical abstract
  • 13:24from their original paper,
  • 13:28describing this weighted gene
  • 13:29correlation network analysis method.
  • 13:32So that's what WGCNA stands for.
  • 13:34I put gene in parentheses because they've started
  • 13:37dropping that from the name when it gets used elsewhere
  • 13:40because, originally, it was developed
  • 13:42for gene expression data, but it's been found to have use
  • 13:45in other high dimensional data sets as well,
  • 13:48and so in our case, we're using it to analyze proteins,
  • 13:52but the language here makes reference to gene expression.
  • 13:57So just broadly, what this method does
  • 14:01is you get a co-expression network,
  • 14:04and I'll sort of give details on the next few slides,
  • 14:08but the idea is that the network is based
  • 14:10on co-occurrence or correlation in your sample.
  • 14:14So there's not really information coming from outside.
  • 14:17You're not even considering your outcome at all.
  • 14:19It's just looking at the space of the proteins
  • 14:21and which proteins are correlated with one another.
  • 14:26Once you've identified this sort of network matrix,
  • 14:29you use a hierarchical clustering algorithm
  • 14:32to define modules.
  • 14:34It's a little small here, but I'll show a bigger example.
  • 14:37Basically, you have a dendrogram,
  • 14:39and you see that if sort of proteins
  • 14:42are on this x-axis of this figure here.
  • 14:44I'll do the mouse for people who are online.
  • 14:48You get these sort of bands or groups of proteins
  • 14:51that are highly correlated with one another
  • 14:53and not correlated with other proteins.
  • 14:58So that is where those sort of protein groups come from.
  • 15:02Once you have those, you can use a numerical summary
  • 15:06of each protein group as sort of a feature or a predictor
  • 15:10in a regression or some sort of analysis
  • 15:13to try and relate the modules or groups
  • 15:15to external information.
  • 15:16So that's how we relate our protein groups
  • 15:20to dementia outcomes in this study.
  • 15:24There's also the possibility
  • 15:25of looking at relationships between modules.
  • 15:28So I mentioned the modules in the network
  • 15:32are highly inter-correlated
  • 15:33among the proteins within them,
  • 15:36but there may also be some correlation between modules,
  • 15:38and that could be important to look at as well,
  • 15:41and then within modules, you may have
  • 15:44tens or hundreds of proteins, and so trying to figure out
  • 15:47which proteins within those modules
  • 15:50are driving any associations you see
  • 15:52is sort of a final step that can be
  • 15:55useful for getting sort of biological meaning
  • 15:57out of these associations.
  • 16:02So that's a broad overview.
  • 16:03This is sort of a more graphical abstract from our study,
  • 16:08and I'll sort of go through bit by bit
  • 16:11the different pieces of the analysis.
  • 16:15So, again, this WGCNA step is sort of the first step
  • 16:18of getting from this protein expression matrix
  • 16:20where you have sort of your proteins by participants,
  • 16:24and using the sort of correlations in your sample
  • 16:27to come up with these modules of co-expressed proteins.
  • 16:33The first step in doing that
  • 16:35is to make a pairwise correlation or similarity matrix.
  • 16:39So if you have n proteins,
  • 16:40then that becomes an n by n matrix
  • 16:43where each cell is describing
  • 16:45the similarity or correlation
  • 16:47between protein i and protein j in your sample.
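As a rough illustration of this first step, the sketch below builds an n-by-n similarity matrix from simulated data. The data, dimensions, and variable names are all made up for the example; the absolute Pearson correlation is used because that is the similarity the talk goes on to describe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_participants, n_proteins = 200, 6

# Simulated expression matrix: rows are participants, columns are proteins.
expression = rng.normal(size=(n_participants, n_proteins))
# Make proteins 0 and 1 co-vary so at least one pair is strongly correlated.
expression[:, 1] = expression[:, 0] + 0.3 * rng.normal(size=n_participants)

# n-by-n similarity: absolute Pearson correlation between every pair of proteins.
similarity = np.abs(np.corrcoef(expression, rowvar=False))

print(similarity.shape)
print(similarity.round(2))
```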
  • 16:52You then use this to create
  • 16:54what's called an adjacency matrix, which is,
  • 16:56I'll talk about more in the next slide,
  • 16:58but is sort of a more networky way
  • 17:02of describing the association between proteins,
  • 17:05and then a topological overlap matrix,
  • 17:08which then takes into account
  • 17:10not only the correlation between proteins
  • 17:13but their shared neighborhood, and then, again,
  • 17:16that is what is used to cluster the proteins.
  • 17:23So to get into a bit more detail
  • 17:25about sort of the network construction,
  • 17:30again, you describe the network as an n by n matrix
  • 17:33with the number of nodes or genes, proteins, et cetera,
  • 17:36and, in our case, we use to describe the similarity,
  • 17:39just a simple correlation,
  • 17:42absolute value of the correlation,
  • 17:44between a given node i and j.
  • 17:48The adjacency is then a measure of whether or how strongly
  • 17:51the nodes are connected in the network.
  • 17:53So the idea being that
  • 17:56nodes that have very high correlations
  • 17:58are particularly interesting.
  • 18:00Nodes that have moderate to low correlations
  • 18:02are probably not informative
  • 18:03is sort of the underlying idea,
  • 18:07and so if you look at sort of this figure here,
  • 18:12the correlation or similarity is on the x-axis,
  • 18:15and then the adjacency is on the y, and so if you use
  • 18:19what's called an unweighted network approach,
  • 18:22you pick a threshold value, here, it's 0.8,
  • 18:25and you say that anything with a similarity less than 0.8
  • 18:28is considered to not be a connection in the network,
  • 18:31and everything greater than 0.8
  • 18:33is considered to be a connection.
  • 18:34So it's sort of a binary yes or no.
  • 18:38What WGCNA does that was novel
  • 18:42was to introduce a weighting
  • 18:45where sort of the downside of this unweighted metric is that
  • 18:49if you have a correlation of 0.79,
  • 18:52that could be useful to know, but it counts as a zero.
  • 18:55So you're losing information,
  • 18:57and so what the weighted network does
  • 19:00is it uses a sort of power transformation
  • 19:03to get from sort of the straight correlation
  • 19:07shown in this red line,
  • 19:08and sort of depending on this power value that you use,
  • 19:12you weight more or less towards the higher correlations
  • 19:16in your network, and when you fit this model
  • 19:20or when you sort of build the network, your choice of beta
  • 19:24is sort of one of the parameters that you choose going in,
  • 19:27and there's ways to sort of measure
  • 19:30which gives the best fit to the data.
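The contrast between the two adjacency definitions can be sketched as follows. The 0.8 threshold is the one from the slide; the power of 6 is an arbitrary value chosen for illustration, and the function names are hypothetical rather than taken from any package.

```python
import numpy as np


def hard_adjacency(similarity, threshold=0.8):
    """Unweighted network: 1 if the similarity reaches the threshold, else 0."""
    return (np.asarray(similarity) >= threshold).astype(float)


def soft_adjacency(similarity, power=6):
    """Weighted network: raise the similarity to a power so that high
    correlations are emphasized but moderate ones are not discarded."""
    return np.asarray(similarity, dtype=float) ** power


s = np.array([0.30, 0.79, 0.81, 0.95])
print("hard:", hard_adjacency(s))
print("soft:", soft_adjacency(s).round(3))
```

In the hard-threshold version, 0.79 and 0.81 land on opposite sides of the cutoff even though they are nearly identical; in the weighted version they receive similar weights.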
  • 19:38So then once you have your sort of unweighted
  • 19:41or weighted adjacency matrix,
  • 19:45then is the part where you account for shared neighbors.
  • 19:48So this is this topological overlap matrix that is created,
  • 19:52so, basically, this measure omega of connectedness.
  • 19:58The equation, I don't find super sort of intuitive,
  • 20:01but the components are...
  • 20:03This is the sum, so u are, basically,
  • 20:05all of the nodes other than i and j
  • 20:07that you're looking at the connectedness between,
  • 20:10and so you're summing up
  • 20:11the sort of common connection strength between i and u
  • 20:15and j and u as a product.
  • 20:18So if i and j both have a strong connection
  • 20:22to this other node, then that's adding to this term l,
  • 20:27and then these k terms here
  • 20:29are just the individual connections between, no,
  • 20:32each sort of the node i of interest
  • 20:34and other nodes in the network,
  • 20:37but I find sort of the easiest or most intuitive explanation
  • 20:41from this original paper shows that for the unweighted case,
  • 20:46omega is equal to one if the node with fewer connections
  • 20:50has all of its neighbors
  • 20:51also as connections of the other node.
  • 20:53So the connections of node i
  • 20:55are a subset of the connections of node j,
  • 20:59and, also, i and j are directly connected.
  • 21:01So that's sort of the most interconnected
  • 21:03that those two nodes can be,
  • 21:05and then the least interconnected they can be
  • 21:08is if they are not connected to one another,
  • 21:10and they don't share any neighbors.
  • 21:11So that would be sort of the zero case.
  • 21:16So this a value can either take on
  • 21:18the unweighted or the weighted case,
  • 21:20and in our sample with WGCNA,
  • 21:23we're using those sort of weighted network connections
  • 21:26that just adds more information
  • 21:28into this topological overlap matrix.
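The terms described above correspond to the usual topological overlap formula, omega_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij sums a_iu * a_uj over the other nodes u and k_i is the total connection strength of node i. The sketch below is one way to compute it with NumPy; the function name and the small example network are made up, and this is not the WGCNA package's own code.

```python
import numpy as np


def topological_overlap(adjacency):
    """Topological overlap matrix from a symmetric adjacency matrix."""
    a = np.array(adjacency, dtype=float)
    np.fill_diagonal(a, 0.0)            # ignore self-connections
    l = a @ a                           # l[i, j] = sum over u of a[i, u] * a[u, j]
    k = a.sum(axis=1)                   # k[i] = connectivity of node i
    k_min = np.minimum.outer(k, k)      # min(k[i], k[j]) for every pair
    tom = (l + a) / (k_min + 1.0 - a)
    np.fill_diagonal(tom, 1.0)          # a node overlaps fully with itself
    return tom


# Unweighted toy network with five nodes; node 4 is isolated.
example = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
tom = topological_overlap(example)
print(tom.round(2))
```

Nodes 0 and 1 are directly connected and node 0's only other neighbor (node 2) is also a neighbor of node 1, so their overlap is one. Node 4 shares nothing with node 0, so their overlap is zero. A weighted adjacency matrix can be passed in the same way.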
  • 21:36Okay.
  • 21:39So, now, once you have the topological overlap matrix,
  • 21:45again, this measure of sort of interconnectedness
  • 21:48accounting for shared neighbors,
  • 21:51then you can use hierarchical clustering
  • 21:53to divide those proteins
  • 21:57into groups based on their similarity,
  • 22:00and this is the results from our analysis.
  • 22:03So sort of on the x-axis,
  • 22:06you have the different proteins, you have the dendrogram,
  • 22:09which represents the hierarchical clustering
  • 22:11of the topological overlap matrix,
  • 22:14and then you have this dynamic tree cut algorithm
  • 22:20which then defines these clusters
  • 22:22which are shown in colors on the bottom based on the tree.
  • 22:26So you see this huge branch down here.
  • 22:28That's gonna be this black cluster.
  • 22:30There's this other cluster over here in green,
  • 22:33and so there's, again, a few more parameters
  • 22:37that you can use to decide how those cuts are made,
  • 22:40and, in some cases, you can sort of merge branches
  • 22:43that have correlation with one another,
  • 22:45and my general advice
  • 22:48for when you're doing this on real data
  • 22:49is to try different values
  • 22:51and see how robust the network is
  • 22:53to choosing different values because, in our case,
  • 22:56it tended to be pretty consistent
  • 22:59where we saw four modules pretty much regardless.
  • 23:02I think if we merged,
  • 23:04if we really cranked up one of the merging parameters,
  • 23:06we would get to three,
  • 23:07but other than that it sort of stayed put.
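The clustering step can be sketched with SciPy on simulated data. One caveat: SciPy does not provide the dynamic tree cut algorithm mentioned above, so this example swaps in a simple cut that asks for a fixed number of clusters. The data, the power of 6, and the choice of average linkage are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n = 300

# Two simulated modules of four proteins each, driven by separate latent factors.
f1, f2 = rng.normal(size=(2, n))
block1 = f1[:, None] + 0.5 * rng.normal(size=(n, 4))
block2 = f2[:, None] + 0.5 * rng.normal(size=(n, 4))
expression = np.hstack([block1, block2])

# Weighted adjacency, then topological overlap.
a = np.abs(np.corrcoef(expression, rowvar=False)) ** 6
np.fill_diagonal(a, 0.0)
k = a.sum(axis=1)
tom = (a @ a + a) / (np.minimum.outer(k, k) + 1.0 - a)

# Cluster on dissimilarity = 1 - overlap, with a zero diagonal.
dissimilarity = 1.0 - tom
np.fill_diagonal(dissimilarity, 0.0)
tree = linkage(squareform(dissimilarity, checks=False), method="average")

# Stand-in for dynamic tree cut: request exactly two clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Rerunning with different powers or cut settings is a quick way to check how stable the module assignments are, in the spirit of the advice above.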
  • 23:13Okay.
  • 23:15So the next step is trying to get
  • 23:18a numerical summary measure of the groups of proteins
  • 23:22that we've identified from our network.
  • 23:25So from these modules of co-expressed proteins,
  • 23:28we then use, basically, a principal components analysis
  • 23:33to get what we call an eigenprotein
  • 23:35or it was called an eigengene in the original paper.
  • 23:39What it is is, essentially, a weighted sum
  • 23:43of the values of each of the proteins in the module,
  • 23:47and the weights correspond to sort of how well correlated
  • 23:50that protein is with the overall module.
  • 23:53So if a protein has a high weight in the module,
  • 23:56it means that it's sort of the most interconnected
  • 23:58in the module or sort of best represents the overall module.
  • 24:04So each person is going to have
  • 24:06an eigenprotein value for each module,
  • 24:16and when we look at the sort of weights
  • 24:18within each of the modules, so just to sort of orient us,
  • 24:22on the x-axis are each of the module eigengenes
  • 24:27or eigenproteins, and then each sort of bar
  • 24:34on the y is a different protein.
  • 24:36In this case, we're only including
  • 24:39proteins that fall into one of the four modules.
  • 24:42There were, also, if you notice on the last slide,
  • 24:45plenty of proteins that didn't fall into any module
  • 24:48and were sort of the extras, so to speak,
  • 24:51and if you were to expand this down
  • 24:54and include more rows with those,
  • 24:56that would sort of show those, but for purposes of this,
  • 25:00we're just including ones
  • 25:01that fell into at least one of the four,
  • 25:04and each of these bars represents a correlation
  • 25:09between the individual protein
  • 25:11and the overall eigenprotein.
  • 25:13So for these blocks of red,
  • 25:15it's sort of the higher weighted proteins
  • 25:17that are within, in this example, module one,
  • 25:21module two, three, and four, and then you can see,
  • 25:24if you look sort of laterally from these proteins,
  • 25:28it's the correlation of these proteins
  • 25:30with the other modules.
  • 25:31So the idea being we wanna see sort of blocks of red,
  • 25:36and then not a lot of correlation
  • 25:38between the blocks and other modules,
  • 25:40which is what we see.
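A minimal sketch of the eigenprotein and of the protein-to-eigenprotein correlations shown in the heat map, assuming the summary is the first principal component of the standardized module data. The simulated module, the sign convention, and the function name are illustrative assumptions.

```python
import numpy as np


def eigenprotein(module_expression):
    """First principal component scores of a participants-by-proteins matrix."""
    x = np.asarray(module_expression, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardize each protein
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, 0] * s[0]
    # The sign of a principal component is arbitrary; orient it so that it
    # tracks the average standardized level of the module.
    if np.corrcoef(scores, z.mean(axis=1))[0, 1] < 0:
        scores = -scores
    return scores


rng = np.random.default_rng(3)
n = 250
latent = rng.normal(size=n)
module = latent[:, None] + 0.5 * rng.normal(size=(n, 5))  # five co-expressed proteins

scores = eigenprotein(module)  # one value per participant

# Correlation of each protein with the module eigenprotein.
membership = np.array(
    [np.corrcoef(module[:, j], scores)[0, 1] for j in range(module.shape[1])]
)
print(membership.round(2))
```

Computing the same correlations against the eigenproteins of the other modules would fill in the off-block cells of the heat map.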
  • 25:46All right, now that we've constructed our network,
  • 25:50and we've come up with numerical summary measures
  • 25:52for each of the protein groups that we've identified,
  • 25:56that is sort of the input or the predictor
  • 25:59for these associations with outcomes.
  • 26:02So for the MRI measures, which, again,
  • 26:04are total brain volume, hippocampal volume,
  • 26:07and white matter hyperintensities,
  • 26:09we use just a simple or, you know,
  • 26:12linear regression with covariates,
  • 26:14and then a Cox proportional hazards regression,
  • 26:17we use to predict incident dementia
  • 26:20and, specifically, Alzheimer's type dementia.
  • 26:26These are the regression equations.
  • 26:28Again, these eigenproteins are,
  • 26:30they're sort of one for each module.
  • 26:32So we'll run a separate regression analysis
  • 26:35for modules one, two, three, and four.
  • 26:37We adjust for age and age squared, sex, education.
  • 26:42APOE is a gene that confers a lot of risk
  • 26:44for Alzheimer's disease.
  • 26:45So it's associated with the outcomes,
  • 26:47and we include it as a covariate,
  • 26:49and then a measure of time lag
  • 26:51between when the blood was sampled
  • 26:53and when the MRI was taken to account for any differences
  • 26:57between people or the time difference,
  • 27:01and for dementia, it's slightly simpler regression equation.
  • 27:06We only adjust for age, sex, and APOE status.
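The linear model for an MRI outcome can be sketched with ordinary least squares on simulated data. Every variable, coefficient, and coding choice below is made up for illustration; none of it reproduces the study data or results. The Cox model for incident dementia would need a survival analysis library and is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated covariates (all hypothetical).
age = rng.normal(55, 8, size=n)
sex = rng.integers(0, 2, size=n)            # 0/1 indicator
education = rng.integers(0, 4, size=n)      # ordinal category
apoe = rng.integers(0, 2, size=n)           # 0/1 carrier status
lag = rng.uniform(5, 10, size=n)            # years between blood draw and MRI

# Standardized eigenprotein for one module.
eigen = rng.normal(size=n)
eigen = (eigen - eigen.mean()) / eigen.std(ddof=1)

# Simulated outcome with a known eigenprotein effect.
true_effect = 0.4
outcome = (
    1.0 + true_effect * eigen - 0.02 * age + 0.1 * sex
    + rng.normal(scale=0.5, size=n)
)

# Design matrix: intercept, eigenprotein, age, age squared, and other covariates.
age_c = age - age.mean()                    # centering reduces collinearity with age^2
X = np.column_stack([np.ones(n), eigen, age_c, age_c**2, sex, education, apoe, lag])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"estimated effect per SD of eigenprotein: {coef[1]:.3f}")
```

The same fit would be repeated once per module, giving four eigenprotein coefficients per outcome.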
  • 27:13All right, so next, I will show
  • 27:17the results in the Framingham Heart Study.
  • 27:21So from the four modules that we tested,
  • 27:24there were two that we identified to have
  • 27:27some association with outcomes.
  • 27:29The first is module two.
  • 27:31I gave it sort of a name clearance and synaptic maintenance,
  • 27:35and I'll talk about how I arrived
  • 27:37at that name for the module in a bit.
  • 27:40It has 165 proteins in it.
  • 27:44Some of the highest weighted proteins sort of give an idea
  • 27:47of which ones are sort of most highly weighted
  • 27:51or sort of most correlated with the eigenprotein.
  • 27:56I'll talk about how we got to these
  • 27:59in another slide as well,
  • 28:01but, basically, this is from that
  • 28:02over-representation analysis
  • 28:04where you're trying to identify biological pathways
  • 28:06that are important or overrepresented
  • 28:09by proteins in those modules.
  • 28:12So we have the Axon guidance pathway
  • 28:14was most strongly associated with this module,
  • 28:21and then in terms of relating to outcomes,
  • 28:25total brain volume
  • 28:26was the only significant association that we saw.
  • 28:29So since this is a linear regression,
  • 28:33effect greater than zero means a positive association.
  • 28:37So we see that for larger values
  • 28:40of the eigenprotein for module two,
  • 28:42we saw larger total brain volume.
  • 28:44So it's sort of a protective effect
  • 28:47since brain atrophy is what is the risk factor for dementia,
  • 28:53and then for incident dementia,
  • 28:55we did not see a significant effect
  • 28:56after correcting our p-values
  • 28:58using a Bonferroni correction.
  • 29:00You'll notice that the confidence interval excludes one,
  • 29:04which would be the null value,
  • 29:05and that's just because that's based
  • 29:06on the non-Bonferroni corrected value,
  • 29:10but after testing for or adjusting for the four modules
  • 29:14that we tested, we didn't see a significant association.
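How an unadjusted 95% interval can exclude one while the Bonferroni-corrected test is not significant can be shown with a small sketch. The hazard ratio and standard error below are invented numbers, not the study's estimates, and the Wald-type calculation and function name are assumptions for illustration.

```python
import math
from statistics import NormalDist


def wald_summary(hazard_ratio, se_log_hr, n_tests=1, alpha=0.05):
    """Return the Bonferroni-adjusted p-value and the matching confidence interval."""
    norm = NormalDist()
    log_hr = math.log(hazard_ratio)
    z = log_hr / se_log_hr
    p_raw = 2 * (1 - norm.cdf(abs(z)))
    p_adj = min(1.0, p_raw * n_tests)
    # Critical value widens when alpha is split across n_tests comparisons.
    crit = norm.inv_cdf(1 - alpha / (2 * n_tests))
    ci = (
        math.exp(log_hr - crit * se_log_hr),
        math.exp(log_hr + crit * se_log_hr),
    )
    return p_adj, ci


# Hypothetical estimate: hazard ratio 0.80 with standard error 0.10 on the log scale.
p_unadjusted, ci_unadjusted = wald_summary(0.80, 0.10)
p_bonferroni, ci_bonferroni = wald_summary(0.80, 0.10, n_tests=4)

print(f"unadjusted: p = {p_unadjusted:.3f}, CI = {ci_unadjusted}")
print(f"Bonferroni: p = {p_bonferroni:.3f}, CI = {ci_bonferroni}")
```

In this made-up case the unadjusted interval sits entirely below one, while the interval widened for four tests crosses one.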
  • 29:18It is nice at least that the direction of effect
  • 29:22is what we would expect
  • 29:23based on our total brain volume association,
  • 29:26which is that higher values of M2
  • 29:31correspond to sort of a lower incident dementia occurrence.
  • 29:38The second module that we found to be associated
  • 29:41with total brain volume was this M4,
  • 29:44which I will call sort of an inflammation-related module.
  • 29:47It had 42 proteins in it.
  • 29:50The highlighted pathway there
  • 29:52was cytokine-cytokine receptor interactions,
  • 29:55so these sort of immune signaling molecules,
  • 29:57and in this case, the association
  • 30:00was in the opposite direction
  • 30:01where higher values of this module four eigenprotein
  • 30:05are associated with lower total brain volume.
  • 30:07So it's sort of a risk conferring module
  • 30:10and, again, similar to what we saw here, not a significant,
  • 30:14sort of an annoyingly borderline association
  • 30:17between this and dementia, but, again,
  • 30:20the direction of effect is what we would expect
  • 30:24based on our observed association with brain volume,
  • 30:29and, also, I'll just mention that I standardize
  • 30:31the eigenprotein so that the effect sizes
  • 30:34correspond to a standard deviation increase in eigenprotein.
  • 30:37So it's a little bit...
  • 30:39One sort of drawback I would say
  • 30:40of these methods is the interpretation
  • 30:43since a standard deviation increase, in this case,
  • 30:47depends entirely on the sample that you're using.
  • 30:49So it's really just sort of a direction of effect
  • 30:54more than anything.
  • 30:56So to try and get at some of, get a better understanding
  • 31:00of how these modules relate to our data
  • 31:03or sort of what may be responsible
  • 31:06for some of the associations we see,
  • 31:08this is a map of the correlations
  • 31:12between different demographic variables
  • 31:15and each of the modules, and I mentioned that we have
  • 31:18a replication cohort as well, the CHS.
  • 31:20So these two bars, sort of the two columns,
  • 31:23show the two different cohorts that were included.
  • 31:28So I put blue arrows to show the covariates
  • 31:31that were included in our regression model,
  • 31:34and you can see that there are some correlations
  • 31:35between, say, sex and the modules,
  • 31:38not really anything with APOE carrier status,
  • 31:42maybe some education associations,
  • 31:44and some associations with age.
  • 31:46So it's good that we adjusted for those in our models.
  • 31:49However, you can also see there are a lot of other factors,
  • 31:53cardiovascular risk factors,
  • 31:54such as systolic blood pressure, BMI,
  • 31:58fasting glucose that have associations with these modules.
  • 32:02So we wanted to see if any of those could perhaps explain
  • 32:05the associations that we saw.
  • 32:10So I'm repeating sort of our standard model here,
  • 32:14which was what I showed results from previously.
  • 32:17The expanded model that we considered
  • 32:19included a bunch of these risk factors,
  • 32:23basically, something representing BMI,
  • 32:27hypertension, sort of lipid dysregulation, and diabetes,
  • 32:33and I also included smoking as well,
  • 32:37and we also included a measure of kidney function,
  • 32:40which can also be an indicator of cardiovascular disease.
  • 32:45So for module two,
  • 32:48I'm repeating the sort of effects we saw
  • 32:50from the standard model here,
  • 32:53and when you adjust for the expanded set of covariates,
  • 32:56your effect is attenuated by half,
  • 32:58and it's no longer significantly associated.
  • 33:01So what that says is, either you have
  • 33:04a sort of confounding issue
  • 33:08where the association you're seeing between these proteins
  • 33:12and total brain volume is really just an effect
  • 33:16of sort of poor cardiovascular health
  • 33:20or better cardiovascular health
  • 33:22or you may think that it might be
  • 33:25some sort of mediation effect
  • 33:26where perhaps the risk associated
  • 33:31between the proteins and the sort of total brain volume
  • 33:34could be mediated
  • 33:35by some poor cardiovascular health outcomes,
  • 33:41and then for module four,
  • 33:43again, this sort of inflammation module,
  • 33:45we don't see any real effect attenuation.
  • 33:48Regardless of whether you adjust
  • 33:49for cardiovascular factors or not,
  • 33:52it's still associated with total brain volume,
  • 33:54which suggests it's sort of a different mechanism
  • 33:57or a lack of confounding between
  • 33:59or based on cardiovascular health.
  • 34:05Okay, so I mentioned
  • 34:08in the sort of initial graphical abstract
  • 34:12that once you find protein modules
  • 34:14associated with your outcomes of interest,
  • 34:16it can be good to look within the proteins of those modules
  • 34:19to try and find sort of subsets
  • 34:21or specific proteins that may be driving the associations.
  • 34:26So for modules two and four,
  • 34:27where we found associations with brain volume,
  • 34:30we wanted to see if we removed proteins one at a time
  • 34:35based on their sort of increasing weight,
  • 34:37so remove the lowest weighted proteins in the modules first,
  • 34:42what sort of happened to the strength of the associations.
  • 34:46So these are both associations with total brain volume.
  • 34:49It's sort of the p-value on the y-axis,
  • 34:53and you can see that as you remove, say, from module two,
  • 34:57the first 20 proteins or so,
  • 34:59you're really not seeing a difference
  • 35:01in the effect of the overall module with total brain volume,
  • 35:05which suggests that those proteins
  • 35:07aren't really impacting the association,
  • 35:11whereas beyond that point, once you start removing proteins,
  • 35:15the association becomes less strong,
  • 35:17and so that's suggesting that those proteins
  • 35:20may have more of an impact on sort of the overall module,
  • 35:25and so for both of these modules, we identified the spot
  • 35:29where sort of the based on the lowest p-value,
  • 35:32which proteins were
  • 35:35sort of the most important in the module.
  • 35:37I wanna emphasize that we didn't use this to...
  • 35:41So for things like dementia, if you were to run this,
  • 35:44since we didn't see a strong association
  • 35:47or a significant association beforehand,
  • 35:50we didn't sort of use that to try and find a subset
  • 35:52that were significantly associated
  • 35:54because I would call that cheating.
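The removal procedure described above can be sketched roughly as follows. This is a hedged illustration on simulated data, not the study's code. Two assumptions are worth flagging: the module score here is a plain weighted sum of the remaining proteins (the talk does not say whether the eigenprotein was recomputed at each step), and the absolute Pearson correlation with the outcome stands in for the regression p-value plotted in the talk (for a fixed sample size, a larger absolute correlation corresponds to a smaller p-value).

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def removal_curve(data, weights, outcome):
    """Drop the lowest-weighted proteins one at a time and track the association."""
    # Protein indices ordered from lowest to highest absolute weight.
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]))
    curve = []
    for n_removed in range(len(order)):  # the highest-weighted protein is always kept
        keep = order[n_removed:]
        score = [sum(weights[j] * row[j] for j in keep) for row in data]
        curve.append((n_removed, abs(pearson(score, outcome))))
    return curve

# Simulated module: six proteins, outcome driven by the two highest-weighted ones.
random.seed(0)
weights = [0.05, 0.10, 0.20, 0.40, 0.60, 0.90]  # hypothetical module weights
data = [[random.gauss(0, 1) for _ in weights] for _ in range(200)]
outcome = [-(row[4] + row[5]) + random.gauss(0, 1) for row in data]

curve = removal_curve(data, weights, outcome)
for n_removed, strength in curve:
    print(n_removed, round(strength, 3))
```

In a setup like this one, removing the low-weighted proteins changes the association very little, and it only weakens once the proteins that actually drive the outcome start to drop out.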
  • 36:01Okay, so the last piece that I'll talk about
  • 36:05in terms of teasing apart associations
  • 36:09or sort of understanding protein within the modules
  • 36:13is this functional enrichment
  • 36:16or over-representation analysis within the modules.
  • 36:20So based on the ones, sort of the significant modules
  • 36:24or significantly associated modules with the outcomes,
  • 36:28there is this software called STRING
  • 36:31that does a few different things, but what I used it for
  • 36:35is doing an over-representation analysis
  • 36:38of biological pathways.
  • 36:41So the idea is that there are annotation databases
  • 36:45for proteins that sort of group them
  • 36:48into biological functions
  • 36:51or pathways that they're involved in,
  • 36:53and the idea is that if you have a module
  • 36:55that has more proteins than you would expect
  • 36:58from a given pathway,
  • 36:59then that's sort of the over-representation piece,
  • 37:02and it indicates that that biological pathway
  • 37:05might be important in whatever functions
  • 37:08the module is carrying out.
  • 37:12So this is just a screen grab of one example.
  • 37:16So this is from module four.
  • 37:18So you can see the annotation database is over on the left.
  • 37:22So KEGG is one of them.
  • 37:24Gene Ontology is another,
  • 37:26and so you have these sort of observed proteins,
  • 37:30and then the background is sort of the total number
  • 37:33of proteins that are in the pathway,
  • 37:36and the idea being that if you were to grab, I don't know,
  • 37:39however many proteins out of the background,
  • 37:41like how many would you expect to be in this module
  • 37:45due to chance, and do we have sort of over-representation
  • 37:49compared to what we would expect?
  • 37:51And so for module four,
  • 37:52the cytokine-cytokine receptor interaction
  • 37:55was the strongest overrepresented pathway,
  • 37:59and then you can sort of look at these others that
  • 38:03have some sort of false discovery rate greater than 0.05,
  • 38:08and so I found the KEGG pathways, personally,
  • 38:11to be the most informative.
  • 38:12Gene Ontology tends to be a lot more specific,
  • 38:15which may be more useful for targeting
  • 38:18certain sort of therapeutic processes
  • 38:21or something like that,
  • 38:22but so depending on the scale that is important to you,
  • 38:25you can sort of use different annotations.
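The "how many would you expect due to chance" idea is commonly formalized as a hypergeometric tail probability. The sketch below assumes that formulation; it is not taken from STRING's code, and the counts in the example are invented. The function name and parameters are hypothetical.

```python
from math import comb

def overrepresentation_pvalue(background_size, pathway_size, module_size, observed):
    """P(X >= observed) when module_size proteins are drawn at random,
    without replacement, from a background containing pathway_size
    proteins annotated to the pathway."""
    upper = min(pathway_size, module_size)
    total = comb(background_size, module_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, module_size - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

# Invented counts: 1000 background proteins, 50 annotated to the pathway,
# a module of 40 proteins, 8 of which fall in the pathway.
expected = 40 * 50 / 1000  # proteins expected in the pathway by chance
p_value = overrepresentation_pvalue(1000, 50, 40, 8)
p_trivial = overrepresentation_pvalue(1000, 50, 40, 0)  # seeing at least zero is certain
print(expected, p_value)
```

Because many pathways are tested at once, the per-pathway p-values would then be adjusted for multiple testing, which is where the false discovery rate column comes from.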
  • 38:31Okay, so the last thing I wanted to talk about,
  • 38:33with the Framingham data in particular,
  • 38:38was sort of getting back to our motivation
  • 38:40for doing a network analysis in the first place.
  • 38:43So the sort of contrast or comparator would be to do
  • 38:47individual protein analyses where you're running
  • 38:49a regression model for each protein that you're analyzing,
  • 38:53and so we did that as a point of comparison.
  • 38:55So for total brain volume, there were like a dozen proteins
  • 38:59that were associated with total brain volume.
  • 39:02One was associated with hippocampal volume,
  • 39:04and two were associated with Alzheimer's disease
  • 39:07at an FDR value of less than 0.1.
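For the protein-by-protein comparison, each protein gets its own regression p-value, and those p-values are then adjusted together. The talk does not name the FDR procedure, so the sketch below assumes the Benjamini-Hochberg adjustment, which is a common default; the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values, one per protein-level regression.
pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvalues)
significant = [q < 0.1 for q in adjusted]  # threshold of 0.1, as in the talk
```

With thousands of proteins the adjustment is much harsher than in this four-protein toy case, which is part of the motivation for testing a handful of modules instead.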
  • 39:11So what was interesting,
  • 39:14especially with the brain volume results,
  • 39:16and, again, that was where we had seen
  • 39:17associations with these modules,
  • 39:19some of the proteins that were significantly associated
  • 39:23were from module two and module four and others weren't.
  • 39:29So what I get from that is a few things.
  • 39:32One is that some proteins
  • 39:34that are associated with the outcome
  • 39:36are sort of individually associated
  • 39:39but not sort of detectable
  • 39:41within sort of a larger network of proteins
  • 39:44that are associated with that outcome,
  • 39:46and then the other is that
  • 39:48for those that are within the modules,
  • 39:51we would only be getting information
  • 39:53about sort of a few of the proteins in the modules,
  • 39:56whereas, as we see here,
  • 40:00the associations tend or continue to get stronger
  • 40:03with sort of looking at the broader network
  • 40:06around sort of the most highly weighted proteins.
  • 40:09So you're getting a bit more information
  • 40:10about proteins that may be associated
  • 40:13with total brain volume
  • 40:14and maybe at some of the biological processes
  • 40:17compared to if you're looking at things individually,
  • 40:20but, again, because you're seeing associations
  • 40:22that you don't catch with the modules,
  • 40:23it's sort of important to look at both,
  • 40:25and you get sort of complementary information
  • 40:28from the two approaches.
  • 40:33So a caveat,
  • 40:36I mentioned issues with lack
  • 40:37with sort of difficulties in replication.
  • 40:40We replicated this analysis
  • 40:42in the Cardiovascular Health Study,
  • 40:44and we did so by taking the same module,
  • 40:47so module two and module four,
  • 40:50taking the same weights from those proteins
  • 40:52and applying them to the protein concentrations
  • 40:56in the Cardiovascular Health Study.
  • 40:59So we didn't do a network reconstruction or anything
  • 41:02in the different study.
  • 41:03We were just seeing if these modules replicated
  • 41:07in their associations with outcomes in a different cohort.
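The replication step, carrying the weights over rather than rebuilding the network, amounts to a weighted sum in the new cohort. A minimal sketch, with hypothetical protein names and values, and assuming the replication cohort's concentrations have already been pre-processed the same way as the discovery cohort's:

```python
def module_scores(concentrations, weights):
    """Apply discovery-cohort module weights to another cohort.

    concentrations: one dict per participant, mapping protein name to value.
    weights: dict mapping protein name to its weight in the module.
    """
    scores = []
    for row in concentrations:
        missing = [protein for protein in weights if protein not in row]
        if missing:
            raise KeyError(f"proteins not measured in this cohort: {missing}")
        scores.append(sum(weights[protein] * row[protein] for protein in weights))
    return scores

# Hypothetical weights learned in the discovery cohort.
weights = {"protein_a": 0.5, "protein_b": -0.25}

# Hypothetical replication-cohort data; protein_c is not in the module and is ignored.
replication = [
    {"protein_a": 2.0, "protein_b": 4.0, "protein_c": 9.0},
    {"protein_a": 1.0, "protein_b": 0.0, "protein_c": 1.0},
]
scores = module_scores(replication, weights)
```

The resulting scores would then go into the same outcome models as before. The explicit check for missing proteins matters in practice, since two cohorts rarely share an identical panel.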
  • 41:10So in this case, it's really not seeing much
  • 41:14in terms of association with both total brain volume
  • 41:18and we also looked at dementia out of interest
  • 41:22since things were sort of close in our cohort,
  • 41:26but, really, we're not seeing much in terms of associations.
  • 41:31Part of the reason for that,
  • 41:33so there are not that many cohorts
  • 41:36that are available that have a large proteomic panel
  • 41:39with the same proteins that we were looking at
  • 41:41as well as MRI and incident dementia outcomes,
  • 41:45and, in this case, the demographics of the cohort
  • 41:48are fairly different from (indistinct) Framingham.
  • 41:51So about 20 years older on average.
  • 41:56I'm just including the sort of first few rows
  • 41:58of our table one, but you can see differences in education,
  • 42:01systolic blood pressure, and the same is true
  • 42:03of a lot of the other cardiovascular risk factors.
  • 42:06So it's a very different cohort,
  • 42:08and digging a bit into the literature
  • 42:10about sort of proteins over the life course,
  • 42:13it's not too surprising that we don't see
  • 42:16the same associations, but it does sort of,
  • 42:19it's a good cautionary message
  • 42:20about drawing conclusions too far
  • 42:23based on sort of one set of data
  • 42:25or one set of demographics.
  • 42:30Just to put these results in context,
  • 42:32so our module four included
  • 42:36a lot of immune-related signaling molecules
  • 42:38like interleukins, TNF receptor proteins,
  • 42:41which are both types of cytokines, and have been associated
  • 42:44with Alzheimer's disease previously,
  • 42:47in particular, interleukin-1 beta was in our module four,
  • 42:52and it had been found to be elevated
  • 42:53in AD cases in a meta-analysis.
  • 42:56However, other biomarkers that have been sort of validated
  • 43:00in other cohorts were not identified in our module.
  • 43:08In module two, we saw axon guidance pathway proteins
  • 43:11including ephrins, netrins, and semaphorins,
  • 43:13which have been associated with AD in previous work,
  • 43:17and complement cascades also have been associated
  • 43:20with AD probably for the reason
  • 43:22of inducing these immune cells called microglia
  • 43:27in the brain to, basically, eat up
  • 43:31cells in response to amyloid deposition.
  • 43:35So there's some biologically plausible mechanisms
  • 43:37that could be associated with these modules
  • 43:42in Alzheimer's disease,
  • 43:46and the last thing I'll say is talking about some sort
  • 43:49of other ways of approaching this problem,
  • 43:51so as I mentioned, the CHS cohort
  • 43:54has different underlying characteristics,
  • 43:56and so it may well have a different network structure.
  • 43:59So one thing that could be good to do
  • 44:02is to look at sort of consensus modules across the cohorts
  • 44:07where you construct networks in each cohort,
  • 44:09and then look at where the overlaps are,
  • 44:12and you can get sort of a more,
  • 44:14hopefully, more robust network across cohorts,
  • 44:18and then there are other network-based approaches
  • 44:20that can incorporate external information.
  • 44:22So, again, our network approach
  • 44:24was just based on correlation in our dataset,
  • 44:27whereas other methods use sort of those annotation databases
  • 44:33and that sort of thing to construct the networks
  • 44:35and sort of decide how strong the similarities between nodes
  • 44:39or the strength of connections will be.
  • 44:41So that's another approach,
  • 44:43and then the last thing I'll say is that
  • 44:45I'm sort of still using this kind of method
  • 44:48now in work with longevity and aging
  • 44:51and trying to apply it to metabolomics,
  • 44:54so metabolites data in cohorts related to those outcomes.
  • 45:02So thank you all for being here.
  • 45:04Thank you, my collaborators.
  • 45:06This is the folks down at UT.
  • 45:09I'll say that (indistinct).
  • 45:11Thank you.
  • 45:16<v ->Thank you for wonderful presentation.</v>
  • 45:18We're open for questions.
  • 45:20So let's start with people in the room.
  • 45:21Any questions?
  • 45:23<v ->Got one over here.</v> <v ->Perfect, thank you.</v>
  • 45:25<v Audience>Yeah, so my research interest</v>
  • 45:26is about the cancer, and, also,
  • 45:28we're interested in your study.
  • 45:30So I've got some technical issues about this project.
  • 45:35So the first issue that,
  • 45:36how do you do the normalization in your process?
  • 45:41<v ->Yeah, great question.</v>
  • 45:42So yeah, I totally glossed over
  • 45:44all the pre-processing stuff.
  • 45:47So before doing the network construction,
  • 45:51I log transformed the protein concentrations
  • 45:54to reduce skewness.
  • 45:56There was a standardization within,
  • 45:58there were sort of two phases of runs of protein modules,
  • 46:02so I sort of standardized within those batches,
  • 46:06and then after that, I did a rank normalized
  • 46:11or inverse normal rank transformation to sort of-
  • 46:16(audience speaks indistinctly) <v ->What's that?</v>
  • 46:17<v ->(indistinct) normalization?</v> <v ->Basically.</v>
  • 46:19Yeah, yeah, yeah.
  • 46:20So that was sort of the data pre-processing.
  • 46:23So I think I, you know,
  • 46:26I've thought about sort of the pros and cons
  • 46:28of those things as well and I think my biggest qualm
  • 46:31with the way that I did it is sort of interpretability,
  • 46:34because, yeah, sort of what does it mean
  • 46:37to be at one quantile versus another
  • 46:39where you have this huge dynamic range
  • 46:40of protein concentrations?
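The three pre-processing steps mentioned in this answer (log transform, standardization within batch, inverse normal rank transformation) could look something like the sketch below. It is an illustration under assumptions, not the study's pipeline: the concentrations and batch labels are invented, the rank offset of 0.5 is one of several conventions, and ties are broken by input order rather than averaged.

```python
import math
import statistics
from statistics import NormalDist

def standardize_within_batch(values, batches):
    """Z-score each value using the mean and SD of its own batch."""
    out = [0.0] * len(values)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_values = [values[i] for i in idx]
        mean = statistics.fmean(batch_values)
        sd = statistics.stdev(batch_values)
        for i in idx:
            out[i] = (values[i] - mean) / sd
    return out

def inverse_normal_rank(values, offset=0.5):
    """Replace each value by the normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):  # ties keep input order (assumption)
        ranks[i] = rank
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / n) for r in ranks]

# Invented concentrations for one protein, measured in two batches.
concentrations = [120.0, 85.0, 4000.0, 310.0, 95.0, 150.0, 2200.0, 60.0]
batches = [1, 1, 1, 1, 2, 2, 2, 2]

log_values = [math.log(c) for c in concentrations]
standardized = standardize_within_batch(log_values, batches)
transformed = inverse_normal_rank(standardized)
```

After the last step only the ordering of participants survives, which is the interpretability trade-off raised in the answer: a fixed gap on the transformed scale can correspond to very different gaps in actual concentration.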
  • 46:42<v Audience>So another question is that</v>
  • 46:44I know that in your project,
  • 46:46the modules identification is very important.
  • 46:48So I wonder,
  • 46:53you have talked a little bit
  • 46:54about how to answer the modules,
  • 46:57but so can you explain a little bit more
  • 47:00about how you gonna bring modules from the data?
  • 47:08<v ->I'm not sure, can you say a little bit more?</v>
  • 47:11<v Audience>Yeah, so in your previous pages,</v>
  • 47:13I think you talked a little bit about the clustering
  • 47:17of the modules so that we know
  • 47:18that there are four main modules.
  • 47:22<v ->Yes.</v> <v ->In the whole dataset.</v>
  • 47:24So what is the name of that algorithm
  • 47:28and how it basically work?
  • 47:31<v ->Yeah, so the clustering itself was done</v>
  • 47:36using an algorithm called H+.
  • 47:41To be honest, I'm not too sure
  • 47:43about sort of the details of it.
  • 47:45It can use any dissimilarity measure,
  • 47:48which, in our case, comes from the TOM matrix, but-
  • 47:52<v Audience>So this is the algorithm that we separate</v>
  • 47:55the whole proteins into four different modules
  • 47:58so that we can analyze it one by one.
  • 48:00<v ->Yeah, yeah, yeah, yeah.</v> <v ->Yeah,</v>
  • 48:01so I also noticed that
  • 48:07in the weighted protein expression network analysis,
  • 48:13you talk about the beta values.
  • 48:16<v ->Yes.</v> <v ->That you use that</v>
  • 48:20like the soft threshold. <v ->Yeah.</v>
  • 48:23<v Audience>To make the genes to be more important</v>
  • 48:28if that is the thing that you wanna analyze.
  • 48:31So in this process, I want to know how you would make sure
  • 48:35the value of the beta in this process.
  • 48:39<v ->So sorry, we have to end 'cause it's 12:15.</v>
  • 48:42I know others have classes and everything.
  • 48:44Maybe you guys can discuss a little bit.
  • 48:46<v ->Yeah, (indistinct), yeah.</v> <v ->Maybe if you have time.</v>
  • 48:48Please, if you're registered,
  • 48:49make sure you signed in on a sign in sheet.
  • 48:51There's three of 'em.
  • 48:52You only have to sign on one of them,
  • 48:54and then one-fourth page reflections will be due
  • 48:57before the next speaker's time to speak.
  • 48:59(indistinct talking)