YSPH Biostatistics Seminar: “Enhancing Biostatistics and Health Informatics Research Through Collaborative Cloud-Based Data Science Tools”

October 12, 2023

Information

Stephen Larson, CEO and Co-founder, MetaCell
Adria Haimann, Business Development Executive, MetaCell

October 10, 2023

ID10841

00:02<v ->All right.</v>
00:04In the interest of time, let's go ahead and get started.
00:08Hey everybody,
00:09thank you so much for coming today to this week's seminar.
00:14It's my pleasure to introduce Stephen Larson
00:16and Adria Haimann from MetaCell.
00:20Just a few words of context here.
00:24We've talked about, we've had people,
00:26we started this semester with somebody from the hospital.
00:28We've had people from academia,
00:30we've had people from pharmaceutical companies.
00:33And so very excited to present something different.
00:37So MetaCell is a company that works
00:40in sort of the research space.
00:43Near and dear to my heart.
00:44They've been, from their beginning, I think,
00:46very active in the computational neuroscience community.
00:52We both contributed to a project called NetPyNE
00:56for building models of computational neurons.
01:01But more broadly, they work in the greater
01:04health informatics space.
01:06And they're going to tell us a little bit
01:10about how we can enhance biostatistics
01:12and health informatics research
01:13through collaborative cloud-based data science tools.
01:16So let's welcome them.
01:20<v ->Thank you very much. Good afternoon everyone.</v>
01:22I can see some of the back of your heads,
01:24so I can imagine that I'm also, you know,
01:26virtually looking at your faces.
01:28Thanks so much for having us.
01:30I'm Adria Haimann and I work alongside Stephen at MetaCell.
01:33And as already mentioned, today we're gonna share with you
01:35some insights into how academics are using cloud-based
01:39collaboration tools to enhance their research.
01:42But before I kind of begin with this,
01:43I wanna provide you with some context.
01:45So, 10 years ago I was in your position,
01:48I was studying health economics
01:50at the London School of Economics,
01:52and I had joined a research team
01:54at the European Observatory for Health.
01:56And I was relatively new to this field
01:57and kind of found myself in a Catch-22
02:00that maybe you can relate to.
02:02So I wanted to know how can someone or a student or postdoc
02:05or researcher discover the best way to collaborate
02:08on their research and use new tools
02:10if you have fairly minimal experience,
02:12either in academia or in industry.
02:14So that's essentially what we want to show you today
02:17and what we'd love to share with you,
02:19if you could go to the next slide,
02:21which is kind of a collection of key topics
02:24of how researchers are doing just that,
02:27while also getting the most out of their data.
02:29So during this seminar,
02:31we're gonna cover different methods that you can share
02:33data analysis and introduce you to a specific cloud-based
02:36collaboration platform
02:38that we've created called Cloud Workspaces.
02:41And then we'll run you through some examples
02:43of how researchers are using this platform,
02:45as well as how we've formed an industry partnership.
02:48And then lastly, we wanna show you kind of other ways
02:50that this tool can be used in academic settings.
02:53And then of course, we'll open it up to you guys
02:55and encourage you to ask us questions
02:57on any of these topics.
02:59So I'll hand over to Stephen now.
03:02<v ->Thanks Adria for that great introduction.</v>
03:04And hello to all of you.
03:07I currently see you as tiny, tiny pixels on my screen
03:12because of the way this is viewed.
03:13So as much as I'd love to be there in person
03:16and looking into the whites of your eyes,
03:17I'm not gonna get that chance.
03:18But, I think we have a really good robust discussion
03:23for you guys that I hope you'll find very interesting.
03:27And thank you very much again to Robert for the invitation.
03:30So similar backstory on myself,
03:35I went through undergraduate training at MIT
03:40in computer science, did a master's in AI
03:43before it was cool again,
03:45and then shipped off to UCSD for a PhD
03:51in neuroscience with a computational specialization.
03:54So very much familiar with the academic experience
03:59and I'm really excited to share with you
04:06some of the things that I've learned since leaving academia.
04:09And one of those things
04:10has been to start this company, MetaCell,
04:14which I basically started as I was wrapping up my PhD
04:16and I kind of realized that I wanted to serve science
04:22in a different way than was gonna be possible
04:27just within the confines of academia
04:29because I realized that I was a builder
04:31and to build software that could,
04:36software tools that could be useful to, you know,
04:40tools that I would have wanted to have had myself, as
04:43a graduate student.
04:44I would need to kind of put a professional team of folks
04:48together that, you know, really came outta industry
04:51and that are kind of hard to hire in academia.
04:54So the story of this slide is, since then,
04:58all the different great groups
04:59that we've had a chance to work with,
05:01and you'll see a really kind of motley crew of logos
05:05that are present here from, you know,
05:08really, really big pharma companies
05:12to universities like Yale, you guys are on here,
05:14other universities that we've had the chance to work with,
05:18and then biotech companies,
05:21med device companies that we work with,
05:25some in the US, lots internationally.
05:29And realizing that, you know,
05:31the core thing that unifies all the work
05:34that we've been doing over time is the way
05:36that sort of math and computation can help us
05:40understand the life sciences.
05:41So hence I come to you today in a biostatistics seminar
05:46to talk about, you know,
05:47some of the other pieces of the puzzle
05:50that go into advancing the life sciences in that way.
05:56So, let's start with a really simple, simple example, right?
06:04So let's say you're doing some kind of analysis
06:08on some kind of bio data, okay?
06:13Perhaps in the statistics context, you're using SAS.
06:17In a computational neuroscience context,
06:20you may be using Python and the Python suite of tools.
06:26Some in the statistics field are using R, open source,
06:29you know, statistics packages.
06:30Whatever it is, you've got some data, you know,
06:33maybe you're analyzing it on behalf of yourself,
06:35maybe you're analyzing on behalf of your lab,
06:37the group that you're working with.
06:38Maybe you're analyzing it in terms of a company.
06:41Whatever it is,
06:42you wanna share that data analysis with somebody else.
06:44You're probably gonna have to gather
06:47some history of those commands together.
06:50Maybe it's packaged up as a script, maybe not.
06:53You're gonna send that file
06:54to somebody else very often.
06:57And then you're also gonna wanna somehow
06:59collect the outputs of that, right?
07:01The figures, the diagrams, the summary statistics,
07:05the result of T-tests, you know,
07:08things like this, right?
07:09And send that output somewhere, right?
07:12So, you know, that is a problem time immemorial.
07:16And you know, as long as I've been, you know,
07:20working in this space still, you know,
07:23it's very common to just do this
07:25and maybe send this over email, right?
07:29It's still a practice that I'm sure you know, happens.
07:32And so, and that's probably just fine, you know,
07:35in many small circumstances.
07:37But as that scales up, there's problems of reproducibility,
07:42there's problems of, you know,
07:44keeping track of who sent what.
07:46Email is not a great file management system.
07:48So we've been thinking a lot over the course of our company,
07:55which is, we've been around now,
07:56this is our 13th year about how, you know,
08:00the cloud and the internet basically can come into that
08:02in any better way than sending email along.
08:05And so we've thought a lot about, you know,
08:08what starts to happen when there's a computer that lives
08:11in the cloud that multiple people can jump into and join.
08:15And what is, you know, how does that work in general?
08:18It's something that's not just us doing, right?
08:22This is an idea that's been there for a while.
08:24Anybody familiar with like, say Python Notebooks, right,
08:27are aware of this idea.
08:29There's tools like Google Colab,
08:31and then we've even been talking to major universities,
08:34like we've been having a conversation
08:35with Harvard Medical School,
08:37where they've been working in collaboration with Amazon
08:39to kind of work together with them to set up computers
08:43that are in the cloud.
08:44Similarly, of course, there's gonna be what happens with,
08:49at like, at your local university
08:50with your local computing infrastructure.
08:52Typically that's based around supercomputers that are there
08:56for doing like really powerful computations or calculations.
08:59Things that are very data intensive.
09:01A workspace in the cloud is sort of in between.
09:02So it's kind of like, you know,
09:05just a laptop that isn't your physical laptop,
09:09but it's like a laptop that's somewhere else in the cloud
09:11that you can log into and do some analysis with.
09:14And it basically lives as long as you wanna do that analysis
09:16and then it goes away
09:18if you don't need that analysis anymore
09:20or it can stay there as long as your lab is around, right?
09:22And then go away if you don't need it anymore.
09:25So the idea is then in this story,
09:27instead of just gathering the history of commands,
09:29sending the file and sending the output of the file,
09:31what if, right you could do all that in the context
09:34of a computer that multiple people
09:37can join and look at, right?
09:39Work in that same environment.
09:40When you log out,
09:41it's exactly where you left it, right?
09:43Like if you know your computer gets misplaced
09:47or you drop it, you know, off a bridge into a river,
09:50like, doesn't matter 'cause
09:51all this stuff is preserved, right?
09:54So, how does that idea start to change the basic practice
09:57of interacting with data and doing analysis like this
10:02if you were to change that one variable okay?
10:05So that's sort of the starting premise for our chat today.
10:09So, you know, what that might look like is, you know,
10:13a session one-on-one or two-on-one with multiple people
10:16where you get, you know, perhaps one of you in the future.
10:22In the case that we've been doing in our company,
10:24one of our staff members, who has experience
10:28in doing a different kind of data analysis.
10:32In our case, we work on a variety of problems,
10:36but one of the major ones we worked on
10:37is like the imaging of calcium signals
10:42in neural tissue okay?
10:45But you know, you might be on a call like this one and just
10:49the same way that you might meet with your lab members on a
10:50Zoom call, you might meet with someone
10:54with experience in data analysis or biostatistics
10:56that is not in your lab or not even in your organization.
11:01It might be somewhere remote,
11:02maybe at another university or in a company like ours.
11:06But what they might get as the experience of that is
11:13jointly logging into this workspace that lives in the cloud.
11:17And if SAS is the thing you wanna use,
11:20you might find a whole SAS instance there
11:22in a desktop that you can log into.
11:25But the point being that multiple people now can type on it
11:27as opposed to like physically handing your laptop around
11:30in the lab or even just screen sharing it
11:33in some kind of a lab meeting, right?
11:35It's actually allowing for people to jump into the same
11:38application and literally like trade off
11:40on like typing commands into it.
11:43Kind of like what you get with a Google Document
11:46or a Google Spreadsheet, right?
11:48That real-time collaboration,
11:49but now for any kind of application.
11:52So that's one experience you might have.
11:54Not just SAS, right?
11:56So a Jupyter Notebook, as I mentioned before,
11:58is another thing that you can use.
11:59And those of you who might be using,
12:01again, the more open source technologies,
12:03if you might be using R Statistics or using Python
12:05or whatnot, you'd be familiar with, you know,
12:08a Jupyter Notebook.
12:11So it's based around, you know,
12:13this idea of putting a computer in the cloud,
12:16multiple folks logging into it,
12:18and then being able to sort of transport
12:21your expertise around the world.
12:25Because in addition to the knowledge of doing analysis
12:31being shipped around,
12:32data can also come into this workspace
12:34as an intermediate space that's private to a given lab,
12:39but allows for a different kind of model on sharing data
12:43where it sort of stays under the control of the lab,
12:47you know, whoever puts it there can take it back,
12:49that kind of thing.
12:51Okay so we've been exploring this model
12:54and we've also been talking to other organizations
12:57and universities about this model and how to use it,
13:00how to implement it, right?
13:02As I mentioned, we've been talking to folks like
13:05at Harvard Medical School that partner with Amazon
13:08to bring these sorts of instances into their
13:11labs and what can be done with it.
13:13So I'm gonna wanna talk a little bit
13:14about like some of those details,
13:16and I'm saying it here in the context of our product,
13:19but I'm not trying to sell you anything.
13:20I'm really trying to talk about it
13:21more in the context of what can be done.
13:24So thinking about it, like,
13:28so I mentioned SAS as an example.
13:29I mentioned Jupyter Notebooks as an example,
13:31but there might be other kinds of software
13:34that are more particular to a use case,
13:36like MATLAB's another one that could be installed.
13:38But there might be even more specific software
13:40that might need to be set up or run.
13:44Sometimes, for example, survey software
13:47where you might collect data from a very particular kind of
13:52survey system and you need something to work with it.
13:54So imagine that,
13:55like for the use case that you might have, right,
13:58you could have a workspace that is set up
14:02so that all that software comes pre-built
14:03once you set it up.
14:05Much like, you know, having laptops
14:07that have come pre-configured with a certain set of tools,
14:10but instead of handing out physical laptops,
14:12it's on the cloud.
14:14The virtual collaboration,
14:15I think I've gone through a lot, the multiple workspace,
14:18I think I mentioned also.
14:20Data security I kinda mentioned, you know,
14:23anybody who's doing data analysis
14:26with anybody who has, you know,
14:29talking to somebody that they weren't the ones
14:30to collect it, I'm sure has run into challenges
14:32where folks are reticent to, you know, share data.
14:37So that's why in this context,
14:38it's really important to note that like, you know,
14:41we can lock that environment down
14:42and make sure that only the people that can log into it
14:44have access to it, that's a really important point.
14:47So it's not really like the data
14:49are going out of somebody's control.
14:51Again, they're kept in a place
14:52where anybody who wants to can remove
14:53that data again and delete it.
14:57And then if there were to be very computationally aggressive
15:01things to do, it's very easy to scale it up.
15:05And that's something that folks also like.
15:10So how, you know, how are ways that this kind of workspace
15:14can support biostatistics research
15:17and data analysis in general.
15:18So I mentioned data science as a service
15:20a little bit in this example.
15:22So this would be the case where any organization
15:26who say doesn't have biostatistics
15:29or data science expertise local to them
15:32might be interested in sort of renting time
15:36or having some part-time person come in to help with that.
15:40And that's a model that we've seen work well
15:42both for labs and for companies.
15:44One way in which labs really like it is new PIs
15:49with a startup package that just, you know,
15:51first few weeks into their appointment
15:54with an R1, right, no staff yet.
15:57Nobody, but they're coming in with data from their previous,
16:03you know, from their postdoc basically.
16:06And what do they do, right?
16:07They need to write grants, they need to like hire staff,
16:10they need to do all these things.
16:12So we've actually found labs are very happy
16:15in that circumstance just to get going, you know,
16:19to be like, "Hey, I have this data,
16:20I haven't analyzed it yet.
16:21I really wanna put it in my grant proposals.
16:23I just need somebody to kind of sit with me virtually
16:27and run through this data,
16:30so that I can get these figures
16:33made and get my grant out, right?"
16:34And I just don't have time
16:36to bring on a full person to do that.
16:37So data science as a service can be very useful for that.
16:40Data standardization and sharing as a service.
16:42So, you know, I'm not sure how much it's affecting folks
16:46in the room, but the NIH over time
16:48has gotten increasingly serious about making data sharing
16:55happen for real for real,
16:56and not for fake for real, right?
16:58And so this year in particular,
17:01a new policy from NIH has come out, the DMS policy,
17:05where they're really, really asking for even, you know,
17:09grant proposals to have a whole data management
17:11strategy figured out upon submission.
17:15And even, you know, saying you need to set aside
17:19some budget for that
17:20'cause it turns out data sharing doesn't happen for free,
17:22doesn't happen for free, you know,
17:24for PIs for their time, right?
17:26So that's also something where, okay,
17:29I don't have the expertise to figure out
17:30which of the billion databases I might share my data in.
17:34Could somebody come in and help do that?
17:36Well how do you do that?
17:37You know, when I did work in the neuroinformatics
17:41space as a graduate student
17:43and I was trying to help figure out for neuroscientists
17:47how to get data that they had, you know, collected
17:50in a very laborious process of experimental collection,
17:55was trying to help them share their data
17:57'cause they wanted to comply with these policies
17:59even back then, you know, very frequently I would
18:04get the challenge of like,
18:05"Yeah, it's in a hard drive under my desk, right?
18:08Physical hard drive sitting under my desk, right?"
18:10Like, okay, so you can go pick it up and like take it away
18:14and do something with it.
18:15But you know, they don't have the expertise, you know,
18:19locally to even know, okay, now we're gonna plug it in
18:22and we gotta look through it
18:23and like, oh, the PhD student left three years ago.
18:27And like, how do I do that?
18:27So the idea of, okay, if all we can do is like take that
18:31hard drive from under the desk
18:33and like plug it in the cloud, share it on Dropbox,
18:37okay, something like this or you know,
18:39have a conduit to get it to the cloud,
18:41share that folder in a workspace online
18:43and then have somebody else that does this all the time
18:47like go through all that and do their best to start,
18:49you know, documenting what they find,
18:51maybe raising questions that they might find, you know,
18:54to present to the PI,
18:55"Hey, I know your PhD student left three years ago,
18:58but you know, can you tell me a little bit
18:59about this experimental methodology?"
19:01There's now at least a hope that you can start,
19:03you know, standardizing that data,
19:05sharing it in a better way,
19:06making the NIH not come kick down your door
19:09with the data sharing police force
19:11that I'm sure they're setting up now.
19:14Okay probably not.
19:16Okay a third way is through workshops.
19:21And I'll have some specific examples
19:23a little bit later about this one.
19:25But if you think about, you know,
19:27the experience of either physically traveling
19:30or doing what we're doing here
19:31and then being exposed to software, right?
19:36It's one thing to have slides show
19:37you pretty pictures of what software looks like.
19:39And it's another thing to say basically like,
19:43"Hey, log into, like go right now on your laptops
19:47and go hit this address"
19:50and like, here's your login and like while I'm explaining it
19:53to you, check it out, play with it, right?
19:57So we've actually found that also to be a really valuable
20:00way to do an extra level of education and demonstration,
20:05especially for tools built in academia,
20:09which generally have a pretty small audience, right?
20:11Not a lot of people use them maybe necessarily,
20:14or it's like a very niche community.
20:16So the total number of humans is not great.
20:18So to have the ability right now in a live session
20:21to be like, "Let me show you this software, you log in right
20:24now, play with it," can move the needle a lot on getting folks
20:27to use stuff that really will be tools
20:31that will actually help them a lot.
20:33And then lastly, you know,
20:35collaborations between labs, right?
20:38Hey, we just set up a consortium,
20:40it's a five-lab consortium
20:41and we're all studying this thing, right?
20:44It's a collaboration between the folks that are generating
20:46the data and the folks who are gonna analyze the data.
20:48Okay, great, we got this really smart set of mathematicians
20:50who are gonna do all these great statistics, awesome.
20:53How do you get the data from point A to point B?
20:55Well email, right?
20:58So what if you can improve that, right?
21:01Or you know, the context of, you know,
21:04we also find companies wanna collaborate with each other
21:06and then universities and companies wanna collaborate
21:08with each other also, right?
21:10So in ways that I haven't already listed,
21:13but just collaborations of whatever variety.
21:17So when it comes down to those things, right,
21:19it's one step better than just sharing on Dropbox
21:22and being like, here are the data, go check it out
21:24'cause you're keeping the analysis all together, right?
21:29It adds a layer of reproducibility
21:31to those kinds of collaborations,
21:32which are hard to match in addition to all the other things,
21:36all the great best practices for reproducibility.
21:40Okay so that's four ways to use cloud workspaces
21:43to support biostatistics research.
21:47So let's, you know, I think I've kind of walked through this
21:51example already verbally,
21:52but I did have a slide specifically for it.
21:54So like this happens in research all the time.
21:57There's a lab that needs a particular analysis completed
22:00and they don't have the expertise in lab.
22:01What can be done?
22:02So typically the alternatives are, you know,
22:04bring in some student or a postdoc or collaborate
22:07with a lab that has some mathematical expertise
22:09to perform analysis.
22:11But that can be quite time consuming, you know,
22:13that might not deliver the results you're looking for.
22:16Secondly, right for folks who might, you know,
22:20be in a position, like I mentioned
22:21with early lab set up, right?
22:25Engaging some part-time data scientists from industry
22:27could help work on particular problems as needed.
22:31And that's interesting both perhaps
22:33from the perspective of me as a company,
22:35but also maybe interesting for yourselves
22:38thinking about a path through industry
22:41where you might be able to do biostatistics
22:45for multiple organizations at once, not just one at a time.
22:50And then it's also interesting,
22:51as I mentioned, from the perspective of folks
22:53that have the problem that need to get the analysis done.
22:57Okay so some case studies, does this happen?
23:03I sort of mentioned abstractly, it does,
23:05but these are five cases that we've worked on in our company
23:10and they are, many of them have a,
23:14well they all have the theme
23:15of being calcium imaging data, okay?
23:18So here, you know, swap out biostatistics
23:20for looking at data that comes from a microscope.
23:23But at the end of the day,
23:25that data from a microscope is basically a video stream,
23:31generally black and white images
23:33that then have to be post-processed.
23:36And from that video stream there's a spatial component
23:39of looking at a field of neurons under a microscope
23:44and a time component.
23:46Like how did those, you know,
23:49neurons' activity change over time.
23:51But there's a lot of like statistical challenges
23:54that have to go into that.
23:55You need to separate the neurons out from each other, okay?
23:58They kind of overlap on each other.
24:00So looking at a video stream, you're not always sure, right?
24:04If I'm looking at one neuron or two neurons.
24:06So you have to do some spatial analysis
24:08to separate those out.
24:09And then you wanna do some sort of peak finding over time.
24:13What you kind of wanna extract out is a time series
24:15of however many neurons you've detected
24:17in your field of view
24:19and then start to do some additional analysis.
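The peak-finding step described above can be sketched on a single extracted trace. This is a minimal illustration in plain Python, assuming a hypothetical fluorescence trace and a simple mean-plus-two-standard-deviations threshold. Neither the data nor the threshold rule comes from the talk, and the spatial separation of overlapping neurons is not shown.

```python
from statistics import mean, stdev


def find_peaks(trace, n_sd=2.0):
    """Return indices of local maxima that exceed mean + n_sd * SD of the trace.

    The threshold rule is an illustrative assumption, not the speakers' method.
    """
    threshold = mean(trace) + n_sd * stdev(trace)
    peaks = []
    for i in range(1, len(trace) - 1):
        # A sample is a local maximum if it rises above its left neighbour
        # and is not exceeded by its right neighbour.
        is_local_max = trace[i] > trace[i - 1] and trace[i] >= trace[i + 1]
        if is_local_max and trace[i] > threshold:
            peaks.append(i)
    return peaks


# Hypothetical trace for one neuron: flat baseline with two decaying transients.
trace = [1.0] * 30
trace[10], trace[11] = 5.0, 3.0
trace[22], trace[23] = 6.0, 3.5

peaks = find_peaks(trace)
print(peaks)
```

In a real pipeline the same function would be applied to one trace per detected neuron, and the resulting event times would feed whatever protocol-specific analysis comes next.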
24:21And that additional analysis will be based on
24:24the specifics of the experimental setup
24:26and like, you know, what part of the brain were you looking at?
24:30What was your protocol that you applied
24:33and what kind of expectations
24:37do you have about the time series that you extracted?
24:41So these organizations that we work with, I guess, you know,
24:45four out of five are universities.
24:48So DGIST is the Daegu Gyeongbuk Institute of Science and Technology
24:51in South Korea, McGill University in Canada,
24:58University of Pennsylvania, UPenn, and University of Alabama.
25:04And then Maze, which is a small pharma company
25:09in San Francisco and they're all doing calcium imaging work.
25:14And I think we served all of these organizations
25:18within the same span of about six months.
25:22Each one of them had brought different data to the table.
25:27They're all generally in this form of video data
25:29with the calcium imaging to extract.
25:33All five of them were served
25:34by the same data scientist on our side,
25:38gentleman whose picture you saw earlier
25:41but they had very different scientific protocols, right?
25:44So it wasn't necessary that one person full-time
25:47over six months worked on each of these projects, right?
25:50Instead we have one individual,
25:52who's able to jump from project to project
25:54and check back in with multiple PIs/business leaders,
26:01managers to check in on the results of that, right?
26:05And that person never left their home, right?
26:08So our company is also fully remote, which is nice.
26:13And so I think that's a really powerful demonstration
26:17of what's possible for this kind of analysis,
26:19whereby, you know, essentially organizations
26:25in multiple different countries
26:27and a different continent in one case, right,
26:29can all be served by the same person,
26:33having roughly the same skillset of data analysis
26:36but working on data that addresses very different scientific
26:40questions all at the same time.
26:43Okay, so that's a thing.
26:47And each one of these, I should say,
26:49has been done in this collaboration model that I mentioned
26:51where there's one workspace per organization, right?
26:57So each organization has their own workspace,
26:59they log into it, they can see the results
27:01of the data science work that happens.
27:04They have all in one way or the other,
27:06put data into the workspace, right?
27:09And, they've all sort of been able to pull figures back out
27:13again and direct the flow of analysis in the direction
27:19that they wanted through Zoom calls,
27:22like the one that I mentioned
27:23generally on like a weekly basis
27:25or every couple weeks check in.
27:28So yeah, a little bit more about the team behind that
27:34in terms of thinking about like what it takes
27:35to make that happen.
27:37While there is a little bit of like finding those labs
27:39and figuring out that they have that problem,
27:42which is not taken care of
27:45by the individuals on this screen.
27:46But I mentioned, I mentioned Phil, the PhD;
27:50another PhD, who's worked with us
27:52as data scientist is Marcus.
27:55And then kind of orchestrating behind the scenes,
27:57the standing up of these workspaces
27:59is a software architect, Zoran.
28:04Phil is in the New York area, New York City area.
28:07Marcus is in China and Zoran is in the Netherlands.
28:13So again, interesting to think about the different
28:16geographies where folks come from being able to serve people
28:19in different geographies,
28:21but all of them when it comes to a project,
28:23like the central organizing node is a workspace.
28:27That is the thing that helps
28:28coordinate a lot of this together.
28:31There are a few other technologies that help.
28:34Those of you familiar with like a Kanban board
28:37or just really any kind of task driven software,
28:39you know, you can bring that to bear as well.
28:42So one of the ways you can organize work a little bit better
28:44than just sending emails back and forth
28:46is to encapsulate each task,
28:50break each task down into a card on a Kanban board.
28:53We like the tool called Trello,
28:56but there's lots of them out there
28:58that can be used for such things.
29:00And then, you know, one card per task
29:02is a nice way to organize things.
29:04And then using a practice from software engineering,
29:07you can actually sort of estimate
29:09in roughly how many hours, you know,
29:12the data scientists might think it would take
29:15to do a given task
29:16and then use that as a way to figure out
29:18like how long it's gonna take
29:20to do a certain kind of analysis.
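The estimation practice described above can be sketched as a small calculation. This is a minimal illustration, assuming a hypothetical list of cards with made-up titles, column names, and hour estimates; it does not reflect the data model of Trello or any other Kanban tool.

```python
from collections import defaultdict

# Hypothetical cards: one per task, each with an estimate in hours.
cards = [
    {"title": "Load raw videos", "column": "done", "estimate_hours": 2},
    {"title": "Segment neurons", "column": "doing", "estimate_hours": 6},
    {"title": "Extract time series", "column": "to-do", "estimate_hours": 4},
    {"title": "t-test on survey data", "column": "to-do", "estimate_hours": 1.5},
]


def hours_by_column(cards):
    """Sum the estimated hours of the cards in each board column."""
    totals = defaultdict(float)
    for card in cards:
        totals[card["column"]] += card["estimate_hours"]
    return dict(totals)


totals = hours_by_column(cards)
# Work still outstanding is everything that is not yet in the "done" column.
remaining = totals.get("to-do", 0) + totals.get("doing", 0)
print(totals, remaining)
```

Summing only the cards that are not yet done gives a rough figure for how long the rest of an analysis might take.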
29:21This is a practice that we actually use
29:23across my company for all sorts of tasks,
29:25not just data science,
29:26really organizing kind of everything that we do
29:28on the basis of making cards like this
29:31and moving things across.
29:32And I'm still surprised
29:33how many organizations don't use this.
29:36I have lots of friends in academia
29:38that do this just for their labs.
29:39You guys might do this in your labs, I don't know.
29:40But for organizing oneself,
29:44even if you do meet in person,
29:46having this sort of set up in the cloud
29:48can be very helpful for organizing work.
29:52Not sure how new or not new this is
29:54to those of you in the room, but something we use.
29:57And then of course there's Slack,
29:58which I think has pretty good adoption amongst academia.
30:03We do find almost every lab that we talk to
30:06pretty much is on Slack or some version of it.
30:10Companies are using Microsoft Teams,
30:12which I personally like less,
30:13but you know, but we use that too.
30:17But basically, you know,
30:20one thing that we do that maybe others don't do
30:23is to connect a Kanban board like
30:26the one that you saw to spit out notifications
30:28in a Slack channel at the same time,
30:31which can be really nice if you are a Slack based person
30:35to just like be able to see how tasks are changing
30:37and evolving in the feed,
30:40which then doesn't require an extra conversation, right?
30:42Like "Hey, so we agreed on Monday that you were gonna,
30:45you know, do that t-test on this survey data,
30:50how's that going right?"
30:52Well if they've moved that card,
30:55which was like T-test on survey data from the to-do column
30:58to the doing column,
30:59a little notification's gonna pop up in Slack.
31:02And then when they write a comment like, "Yep, you know,
31:04I ran the test and wasn't statistically significant,"
31:07then that's gonna pop up also.
31:09That comment will then be relayed into Slack.
31:11So then when you go back to check in,
31:13you don't have to ask that question.
31:13It's like, "Yep, I saw that it happened
31:15and by the way I saw that it happened on Tuesday,
31:18you know, now it's Wednesday, you know.
31:20I forgot to check back in with you about it."
31:23So like that idea of asynchronous work can happen
31:25in this cloud-based context also, which again,
31:29like we use also in all other parts
31:31of our company can be really helpful
31:33for moving projects along in lots of ways.
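The card-to-Slack relay described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the integration the speakers actually run: the talk does not name the Kanban tool or how it is connected, so the `CardEvent` fields, the `format_event` and `build_request` helpers, and the placeholder URL are all hypothetical. The one real interface assumed is Slack's incoming-webhook convention of POSTing a JSON body with a `text` field.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CardEvent:
    """A hypothetical Kanban event: a card was moved or received a comment."""
    card_title: str
    user: str
    from_column: Optional[str] = None
    to_column: Optional[str] = None
    comment: Optional[str] = None


def format_event(event: CardEvent) -> str:
    """Turn a card event into the one-line message shown in the channel."""
    if event.comment is not None:
        return f'{event.user} commented on "{event.card_title}": {event.comment}'
    return (f'{event.user} moved "{event.card_title}" '
            f'from {event.from_column} to {event.to_column}')


def build_request(webhook_url: str, event: CardEvent) -> urllib.request.Request:
    """Build (but do not send) the POST request for an incoming webhook."""
    body = json.dumps({"text": format_event(event)}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    moved = CardEvent("T-test on survey data", "Phil", "To do", "Doing")
    print(format_event(moved))
    # To actually post, replace the placeholder with a real webhook URL and call:
    # urllib.request.urlopen(build_request("https://example.invalid/hook", moved))
```

Keeping the formatting separate from the network call means the message text can be checked without a Slack workspace, and the same formatter could feed a different chat tool.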
31:37So yeah I've told you a lot
31:42about a particular example then of doing work.
31:44I wanna call Adria back in here
31:47to extend a little bit more in a partnership example
31:52that we've had some experience with.
31:53So back to you Adria.
31:55<v ->Thanks, so one thing that Stephen mentioned was, you know,</v>
31:58another challenge we might face is,
32:00okay, where do we go find people who have data that
32:03they might need help with?
32:04And we were thinking about where does data come from, right?
32:08And so one area that data's generated
32:12from is through devices and manufacturers
32:15make devices that are sitting in labs.
32:17So we thought of the idea of let's have discussions
32:20with these manufacturers
32:21and see if we could form some sort of partnership.
32:24Now when you're forming a partnership in industry,
32:27you need to think about why that would benefit both sides
32:29in order to kind of engage your prospective partner
32:33as to why they should talk to you right?
32:34So one thing that we identified was that
32:37a key aim of manufacturers
32:39is to provide additional support
32:41to their customers or make sure,
32:43hey, I have a customer or a lab that has data
32:45and then what if there's an aspect of their data
32:48they don't know how to do something
32:51or they don't know what to do,
32:52maybe they'll stop using my device down the line
32:54because the data's just not useful to them at this point
32:57'cause they're lacking a skillset.
32:59So we thought of an idea whereby
33:01we could approach device manufacturers
33:03and kind of explain what Stephen explained
33:05about our data science as a service offering and say,
33:09"Hey look, we could form a partnership with you,
33:11whereby as an offering, in addition to extending a warranty
33:15on your device, you could offer custom analysis support
33:19or data science support to any interested customers,
33:22whereby they could use cloud workspaces
33:24to put their data that they're collecting
33:26and then they could work with someone like Phil
33:28to solve a challenge that they might have."
33:31And so we actually successfully
33:33did form such a partnership quite recently.
33:36And if you go to the next slide,
33:38you'll see, so we are now working
33:40with a company called Neurophotometrics.
33:43They produce a device that does the imaging
33:46that Stephen previously described.
33:48And what our partnership involves is we essentially offer
33:53cloud workspaces as a solution to their customers,
33:56whereby when they collect their data,
33:59they can then work on our cloud workspaces alongside Phil
34:02or ourselves and we can work with them
34:03to solve any challenges they might need.
34:06Now who are these customers of Neurophotometrics?
34:08They are a bunch of different labs kind of
34:11all over the world as well.
34:12Mostly academics, some in industry as well.
34:14And so it's that way for us as an organization
34:17to kind of find potential labs
34:20we didn't even know had the challenge.
34:22And then it's also solving the problem
34:25for Neurophotometrics of how do you keep your
34:26customers happy if you don't really offer a service
34:29they're already kind of asking of you
34:31as a follow-on for providing this device.
34:33So, so far the partnership is fairly new.
34:37It seems to be working quite well so far
34:40and we're meeting new people
34:41and already getting kind of more projects
34:43like Stephen described for Phil to work on.
34:45So we'll see how it goes.
34:46But this is just one way to show you
34:47that it's not just about kind
34:49of solving a problem for a customer,
34:51it's about where do you find your customers
34:53and that could be through an industry partnership.
34:57<v ->Awesome, thanks for that.</v>
35:02So I mentioned one other model earlier, which is workshops.
35:08I think I talked about that example for a bit.
35:11And we have done a few of them actually as well
35:17in the computational neuroscience space.
35:18So now the space near and dear
35:21to our work with Robert.
35:25So one of those projects was a collaboration
35:28actually with Brown University on something
35:31called the Human Neocortical Neurosolver.
35:34We have kind of a neuroscience bias in the company.
35:38We like doing those sorts of things.
35:39So we did a workshop also.
35:44We helped facilitate a workshop
35:46that allowed a software tool
35:49that came out of this particular collaboration to be shown.
35:56And, let me show you a little bit more.
36:00So in this case, I'm actually gonna switch
36:04away from the Human Neocortical Neurosolver
36:05and also show you an example with NetPyNE,
36:07which is the thing that Robert mentioned earlier
36:09that we work with as well.
36:11It's similar to HNN.
36:13In both cases there's a computational model
36:15of a neuron, okay?
36:16Just think of like, you know,
36:18a spatial model of a neuron that has a cell body
36:22and has an axon and dendrite, that kind of thing.
36:25And you wanna simulate something about it.
36:28And so you have a specialized piece of software
36:34that knows how to look at the model of a neuron,
36:38the way that it's shaped
36:40and how to get signals out of it basically, right?
36:44So in collaboration with NetPyNE also a software platform
36:50called Open Source Brain at UCL
36:52that we've been partnering with for a while.
36:54You might have something that looks like this.
36:58So what you can do in a workshop context
37:03with something like a workspace that's really exciting,
37:05as I mentioned to you before is have people
37:07put hands on with the software itself.
37:09And this is one of those pictures
37:11from one of those workshops that we did,
37:14I think this one was specifically NetPyNE
37:16where you can kind of see what everybody's looking at.
37:18So everybody brought laptops in, right?
37:20And they're able to launch in this case
37:23they're literally, you can see several of 'em,
37:25like this one up in front and this one over here,
37:27they literally have exactly the same screen up
37:29that is being shown, you know, in the screen share,
37:33not because they're logged into a Zoom,
37:34but 'cause they're actually logged into essentially
37:37a workspace environment where they can also like, you know,
37:40change parameters around.
37:41So you can get this hands-on tutorial effect
37:43in a workshop, in this context.
37:46That is kind of hard to do any other way
37:50if you don't have that.
37:53If it's deployed as web-based software,
37:55that makes it a little bit easier.
37:56But if it's not, you know,
37:57if it's something that's traditionally supposed
37:59to be on a desktop,
37:59then this is kind of the only way to do something like that.
38:03And this was at an academic conference,
38:06I think CNS that gets held.
38:09So yeah, from all that today then
38:15kind of wrapping up the part where I just,
38:17we just talk at you and I hope those questions
38:20that you guys have, what do we sort of talk about today?
38:23Like how can some cloud-based data science tools
38:26help enhance the ability to do biostatistics
38:29and health informatics research?
38:31I've been, you know, leaning on some examples
38:32that are heavily neuroscience based,
38:34but we kind of think that that's not the thing
38:36that's particular to this, right?
38:37It's still, you know, as I started at the beginning,
38:40you know, doing some analysis, you know,
38:42sharing the results of the commands
38:45that we're using in the analysis
38:47and then sharing the output of that analysis, right?
38:48Like that's where we began.
38:50I think that's common to every technique.
38:51We're bringing some kind of science and math
38:53to bear on some data, right?
38:55So what we're finding is that, you know,
38:57using cloud-based platforms
38:59really can help us facilitate collaborative research,
39:02allowing colleagues to share data and work together.
39:05You can help labs efficiently gain access
39:08to additional data science support if that's desirable.
39:10That they, you know, otherwise might struggle to get
39:14or is just kind of unaffordable.
39:15Doesn't make sense 'cause there's too much of a person.
39:19And then finally in the last example, right,
39:21you can facilitate, you know,
39:23distance workshops that allow much more immediate
39:26hands-on experience with certain software.
39:29So with all that, I will thank you all for listening
39:36to us for a full 40 minutes
39:38and happy to take any questions that you have on this
39:41or any other thing I can help directly.
39:44Thank you very much.
39:46<v ->Thank you so much.</v>
39:50Does anybody have any questions for our presenters?
39:57I'll start if there's no questions.
40:01So data science as a service, growth industry.
40:07People want jobs.
40:10What's your take on the industry on that?
40:13<v ->We are about 18 months into our exploration of the market.</v>
40:22We have seen growth so far.
40:25We think there's more to go.
40:28I showed you those five labs,
40:30I think in total maybe served certainly more than a dozen,
40:35I wanna say maybe like 15 and like labs plus companies or so
40:3815, 16, in those 18 months.
40:43We had to figure out lots of other stuff along the way.
40:45But we think there's a need, you know, like I mentioned
40:52and folks that have the skillset to, you know,
40:56provide that data science service
40:58that are continually in demand.
41:00So I'm gonna say yes, it's growing.
41:04We're always wondering in industry how fast, you know,
41:08that's always the question,
41:10but it's definitely not shrinking.
41:13<v Robert>Alright, that's an exciting option.</v>
41:18<v Participant>Yeah just really quick,</v>
41:20what happens with authorship?
41:22If you work with the lab very closely on a project,
41:26they come out with a really good publication.
41:31How do you deal with that in this industry?
41:36<v ->Yeah, great question. Thank you.</v>
41:40So as a company,
41:44we don't require to have our data scientists listed
41:51as co-authors on papers.
41:55I think from an ethical perspective
42:02in the case where the contributions that the data scientist
42:05has made are very significant
42:09you know, sometimes PIs have asked the question to us,
42:13you know, what sort of acknowledgement
42:15would you like of the data scientist?
42:18And if the PI feels that, say, you know,
42:21someone who has a PhD who works with us
42:23has done enough work that it merits authorship,
42:27they're free to add that person.
42:28We don't require that.
42:30Otherwise, you know, an acknowledgement's nice always, right?
42:33But also not required.
42:37I think, you know, sometimes the nature
42:40of the contribution really matters.
42:42So, you know, as a company it's a little bit
42:47like how much do you acknowledge
42:49the vendor of your microscope, right?
42:53You might say, okay, I did this on a Nikon microscope
42:56or you know, but you might write that more
42:58as a method section.
42:59And then if like a technician came out
43:00and like helped you calibrate it,
43:02you're probably not gonna give
43:03that person an authorship either.
43:05But you might acknowledge them if they did extensive help
43:07that like led to some novel process.
43:10So on the whole, it's a case by case conversation
43:15that scales based on the level of the contribution,
43:17but it's not the first thing that we think of.
43:19It's not like, "Hey, because we did anything for you,
43:21please put us on a paper."
43:23Definitely don't do it that way.
43:24It's more the opposite, which is like, you know,
43:27we're gonna do a thing for you.
43:28Probably, you don't need to cite us.
43:30But if it gets up to a certain point
43:33and we kind of mutually agree that that's appropriate,
43:35then we're happy to discuss that.
43:41<v ->Thank you for sharing Stephen.</v>
43:42So I have a quick question too.
43:44So if you're running on data sets,
43:47one cell may take really long time to run,
43:50then how do you solve the concurrency issue?
43:53Let's say there's multiple people collaborating online
43:56that when the cell is running,
44:00what if some other, another party just clicked stop
44:04or doing something random?
44:06How do you solve the issue that people are on the same page
44:08when something takes really long time to run?
44:13<v ->Yeah, great question.</v>
44:14So a few ways,
44:18one nice thing about a cloud workspace is that
44:22we can expand the number of processors
44:25and the amount of memory kind of
44:28behind the scenes transparently.
44:31So basically you can like log out of the workspace
44:35and in five minutes log back into the workspace
44:38and we've like doubled the processing speed
44:40and like doubled the memory.
44:42So we tend to keep our default instance
44:45at like a reasonable like laptop,
44:47like probably not a high end.
44:49And then when we discover cases like what you're talking
44:52about where like, yeah, no, that cell requires a lot
44:56and we kind of know a little bit in advance,
44:57like we're gonna wanna run that a lot, right?
44:59We might do this, which is we might
45:01like just beef it up, right?
45:03And that's cool that we can do that.
45:07And then the question becomes like,
45:10does that need to run, you know, 24/7,
45:12does it need to run every day,
45:13every week, every month right?
45:15We think a little bit about that
45:16because then there's some additional costs on our side.
45:18If you're gonna do it for like an afternoon,
45:20it's like really not, it's not worth making any additional,
45:24you know, requests of somebody.
45:27But there's another part of your question I wanna get at
45:28too, which is like maybe overriding each other, right?
45:33So that can happen.
45:34And that's a little bit like software specific.
45:38So like in a Jupyter Notebook, you could,
45:43if you don't coordinate a little bit with your lab member,
45:45like overwrite something in one cell at one time, right?
45:49The other person didn't notice.
45:50So for that, we have some best practices, you know.
45:54By far the most common, you know, example that we see is,
45:59is like two or fewer people collaborating,
46:01but if it were three or four,
46:03we'd probably recommend that they do a best practice
46:05of like, you know, while you're doing work that's separate
46:08and you're not like talking to each other,
46:10do work on separate copies of the thing, right?
46:13And then come together in a meeting
46:15and like put it back together, right?
46:17Usually is the better practice if you're say,
46:20working on a Jupyter Notebook,
46:22and you know, communicate, you know,
46:25using some other method like a meeting like this.
46:28So yeah so those are the two aspects.
46:30On the one side, if it's computation intensive,
46:32we can make it bigger.
46:33If it's actually about people overwriting each other,
46:35we recommend some best practices
46:37for communicating outside of the workspace.
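The "work on separate copies, then put it back together" practice can be supported with a small check that shows which notebook cells both collaborators touched. This is a rough sketch, not a tool mentioned in the talk: it assumes each copy is a Jupyter notebook in the standard JSON layout (a top-level `cells` list whose entries carry a `source` and, in recent notebook versions, an `id`), and the function names are made up for illustration. Dedicated notebook diff and merge tools exist and would be the better choice for real merges.

```python
import json
from typing import Dict, List, Set


def load_notebook(path: str) -> dict:
    """Read a .ipynb file, which is plain JSON on disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cell_sources(notebook: dict) -> Dict[str, str]:
    """Map each cell id to its source text (stored as a str or a list of str)."""
    sources = {}
    for index, cell in enumerate(notebook.get("cells", [])):
        cell_id = cell.get("id", f"index-{index}")  # fall back to position
        source = cell.get("source", "")
        sources[cell_id] = "".join(source) if isinstance(source, list) else source
    return sources


def changed_cells(base: dict, copy: dict) -> Set[str]:
    """Ids of cells whose source differs from the base, including added or removed cells."""
    before, after = cell_sources(base), cell_sources(copy)
    return {cid for cid in before.keys() | after.keys()
            if before.get(cid) != after.get(cid)}


def conflicting_cells(base: dict, copy_a: dict, copy_b: dict) -> List[str]:
    """Cells edited in both copies: the ones to talk through in the meeting."""
    return sorted(changed_cells(base, copy_a) & changed_cells(base, copy_b))


if __name__ == "__main__":
    base = {"cells": [{"id": "a", "source": "x = 1"}, {"id": "b", "source": "y = 2"}]}
    copy_a = {"cells": [{"id": "a", "source": "x = 10"}, {"id": "b", "source": "y = 2"}]}
    copy_b = {"cells": [{"id": "a", "source": "x = 5"}, {"id": "b", "source": "y = 3"}]}
    print(conflicting_cells(base, copy_a, copy_b))  # prints ['a']
```

Cells changed in only one copy can be taken as-is; only the overlap needs a conversation, which matches the advice to reconcile in a meeting rather than inside the shared workspace.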
46:42<v ->Other questions?</v>
46:47All right, I have one more question.
46:50So like in the old days,
46:53people would buy a nice computer for their lab or maybe a
46:57couple of nice computers and like then everybody
47:00would log in to that and it was a one-time cost, right?
47:05And so how have you found, I don't know,
47:09I mean, so it's a very different model for
47:14both academia and industry, wherever that's trying
47:18to transition from this one time cost
47:21where now, you know, you might still be using this computer
47:2410 years later for good and ill
47:29versus sort of this continuous cloud-based thing.
47:34I don't know,
47:35do you have any words of wisdom on this transition?
47:39Because it seems like, you know, you pay
47:42for a cloud computer and if it's on constantly,
47:46it eats up a lot of money.
47:48<v ->Yeah, yeah.</v>
47:49So really good question.
47:53So I think and-
47:54<v ->Lose control of your data also, which to some extent,</v>
47:58like somebody else has your data.
48:00<v ->In theory, yes.</v>
48:02But you know, I think some of this is just like a journey
48:06and a transition that, you know, scientists are making.
48:09Those of us, like yourself,
48:11who are more software engineer minded,
48:13have been comfortable with the idea of say, you know,
48:16like all of our company's data, for example,
48:18is kind of in Google's cloud,
48:21Google's workspace technically.
48:22None of it is sitting under my desk, right?
48:25But we've gotten a level of comfort about data ownership
48:28based on essentially trust and agreements
48:32and our understanding of how certain sections
48:34of disk are like cordoned off, you know, for ourselves
48:38and relying on some of those best practices.
48:40But to get to the heart of your question,
48:44I think the best metaphor is like
48:45buying a house versus renting an apartment, right?
48:48So, you know, going down to Apple
48:51and picking up a laptop or Dell or whatever you wanna use,
48:55right, is that's the buy model.
48:56And we're super comfortable with that.
48:58The cloud model is more the like renting the apartment.
49:01And certainly people make the choice,
49:03you know, not to rent sometimes
49:05because it's like, doesn't work out economically, right?
49:07It's like, "Hey, I'm throwing money away."
49:09Sometimes people throw, right?
49:11But what is the advantage of renting, right?
49:13The advantage of renting is, you know,
49:16if a thing breaks in your rented apartment,
49:17it's not on you to go pay extra money to go fix it.
49:20That's on the person who owns it.
49:21Similarly, if something breaks with your cloud workspace,
49:24you know, you call us and you're like,
49:26"Hey, this thing didn't work,
49:27please fix it, right?"
49:29And then there's this scaling thing, right?
49:31Which is like, if you go back to Apple and you're like,
49:32"Actually can you add like double the CPU
49:37and double the memory?"
49:39They'll be like, yes, you can pay us for that,
49:41but it's gonna take a while, right?
49:43And it's not gonna happen flexibly and scalably.
49:44So I think it fits into a different space, right?
49:48Obviously these two come together,
49:50I'm talking to you on a physical laptop that I own, right?
49:52But I'm also using cloud instances to do things.
49:56So I think it's like, it fits into this niche where like,
50:00actually the most useful computer for this purpose,
50:03this collaborative purpose
50:05is a rented one, right rather than an owned one.
50:08And you know, maybe that means when I'm not using it,
50:11I'm not paying for it at all, basically, right?
50:13Like, if I'm like paused on this collaboration,
50:15then I'm like actually not paying for it at all,
50:17but then I can bring 'em back in six months and start
50:18paying for it again.
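The rent-versus-buy trade-off comes down to simple arithmetic, sketched below. All numbers are hypothetical and chosen only to make the comparison concrete. No prices were quoted in the talk, and the sketch ignores maintenance, support, and hardware ageing, which the speaker counts on the renting side of the ledger.

```python
def rental_cost(hourly_rate: float, hours_used: float) -> float:
    """Cost of a cloud workspace that is billed only while it is running."""
    return hourly_rate * hours_used


def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """Hours of use at which renting has cost as much as buying outright."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_price / hourly_rate


if __name__ == "__main__":
    price = 3000.0  # hypothetical one-time cost of a lab computer
    rate = 0.5      # hypothetical cloud cost per running hour

    print(break_even_hours(price, rate))   # prints 6000.0
    # A collaboration that runs 10 hours a week for 26 weeks, then pauses:
    print(rental_cost(rate, 10 * 26))      # prints 130.0
    # The same instance left on around the clock for a year:
    print(rental_cost(rate, 24 * 365))     # prints 4380.0
```

Under these made-up figures, an intermittently used workspace costs a small fraction of the purchase price, while an always-on one overtakes it within a year, which is the concern raised in the question.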
50:20So this is what I hope that folks take away is like,
50:22it opens up a lot of new possibilities.
50:24And the ones that we've gotten
50:26are certainly not the only ones.
50:27There's just like lots more
50:28that you can imagine or envision.
50:32But, but yeah, it's a mindset change
50:35and it's one that I think, you know,
50:37requires some adapting, yeah.
50:42<v ->All right. Thank you so much.</v>
50:44<v ->I have a question for you guys</v>
50:45if there's not another question for me.
50:48<v ->There's a question on the screen.</v>
50:51<v ->Sorry, I have a question.</v>
50:54I think piggy-backing off of that question-
50:58<v ->Hi hello. Hi Noelle.</v>
51:00<v ->Actually Hi.</v>
51:02I used to like physical like pieces of data
51:08and like having physical hard drives.
51:10So like what is the security for data that's on the cloud?
51:16<v ->Yeah, so folks like,</v>
51:24we ourselves build these cloud instances
51:30on the back of three major providers,
51:32whose names you'll recognize,
51:33Amazon, Google, and Microsoft okay?
51:37Those are the big three cloud providers
51:40and they make a guarantee to us
51:43and then we make a guarantee to our customers
51:46about the data protection.
51:47So it's kind of like a layer cake.
51:49And the foundation of it begins with, do you trust Amazon?
51:52Do you trust Google? Do you trust Microsoft?
51:53Some people say yes, some people say no,
51:56but fundamentally they are the ones that, you know,
51:59build data centers, right where the physical aspect
52:04of these computers actually live.
52:05So, you know, this virtual computer,
52:07maybe if you go and like,
52:09"Hey, show me the hard drive where this lives."
52:12You're gonna go out to like, I don't know,
52:14Washington State near some power plant basically,
52:18where it's very economical to set this up, right?
52:21So they then guarantee like,
52:25how do you know that that's safe, right?
52:27Well they guarantee that they're following industry
52:30standards to secure those facilities, to lock them down,
52:35to like continually maintain and manage the networks
52:41that are there to patch the servers
52:44that they're using to keep ahead of any security faults.
52:47So there's one layer of this
52:49where we rely on these big providers to do their jobs.
52:52And despite the last 15, 20 years of like hacks
52:57that you've heard about and whatnot that happened in industry,
53:00these three providers so far have managed to avoid
53:03being hacked in any major way.
53:05Like you've not heard of like Amazon getting hacked,
53:08Google getting hacked, Microsoft getting hacked.
53:10If tomorrow Amazon gets hacked, then yeah,
53:13we're all worried okay?
53:14And then we probably would need to shift around.
53:16But so there's a fundamental guarantee
53:19that like all cloud kind of relies on
53:21and it's like good to talk about it
53:23because like we all have to kind of trust these,
53:27you know, these large providers.
53:29But they also invest,
53:31I'd say millions or hundreds of millions of dollars
53:34in computer security.
53:35Like if you're in the field of computer security,
53:38like, you know these guys because they are sort
53:41of world leaders in this sort of thing.
53:44Microsoft, you know, notably was involved in doing some
53:48forensic analysis on like Russian hacking back in 2016.
53:52Like they were some of the first people to notice
53:55that a state actor like Russia was on the scene
53:58doing the various things, taking over computers.
54:00So generally the community of software engineers
54:05that do cloud work know these things
54:07and kind of rely on Google, Amazon, and Microsoft
54:11to like make these investments in computer security.
54:14And notably like, I don't go like set up my own data center
54:18because I know that I would have to invest millions
54:21of dollars in having an equivalently good computer security
54:25team to like watch out for Russia,
54:27who by the way also invests hundreds of millions of dollars
54:30to try to hack these things.
54:31So, the world of computer security is a problem.
54:35So there's that level of trust, okay?
54:37And then on top of that, you have to trust one more level,
54:39which is the group that like sets up the workspace.
54:41So you kinda have to trust, like if it's from us,
54:43you have to kind of trust us that we're not screwing
54:45something up on top of all of those protections
54:48'cause it is possible to do that at the level of like,
54:51you know, Jupyter Notebook that our logins are well used.
54:55So we also invest in using industry standard
54:59like login protocols, so that only the people that we say
55:02can log in can log in, right?
55:04There's a layer of software security there that, you know,
55:07we have to be on top of patching at one level also.
55:11So these are all the things that make that secure.
55:13And the last thing would be like,
55:15do you or don't you trust us to like not to,
55:18to not go in and do something nefarious with your data
55:21even though we're the only ones that can control it.
55:23So you trust that nobody else can get into it,
55:25but do you trust us?
55:26And then that becomes,
55:27yeah a question of like, you know,
55:29going back and checking your references, you know,
55:32talking to other PIs, making sure that something nefarious
55:35hasn't happened, you know, there.
55:37And you probably wanna gain some confidence on that.
55:39But what we've found is that organizations
55:42are getting more and more comfortable with that.
55:43Dropbox is a publicly traded company,
55:46lots of people put stuff on Dropbox.
55:48When you put something on Dropbox,
55:49you're essentially trusting Dropbox.
55:51Dropbox is also built on one of these
55:53three providers same way, right?
55:55So it's that kind of idea
55:57that takes some getting used to but you know,
56:01becomes increasingly useful to do this kind of work on.
56:05And we see large banks and large pharma companies
56:07having taken their time to also adopt cloud
56:10large financial institutions.
56:13But over time there's been increasing comfort
56:15as some of these security questions
56:17have been, you know, asked and answered.
56:20So bit of a long answer,
56:22but thank you for the question 'cause it's important.
56:27<v ->Alright, thanks so much.</v>
56:28In the interest of time,
56:29I think we're gonna have to stop it here, thanks again.
56:32Really appreciate. (audio garbles)
56:37<v ->Thank you guys. Thank you all for your time.</v>
56:40<v ->Have a great day.</v>