Michael Schatz is an assistant professor of quantitative biology at Cold Spring Harbor Laboratory, where he heads the Schatz Lab, and an adjunct professor of Computer Science at Stony Brook University.
His research focuses on the development of scalable algorithms and systems to analyze DNA sequences, concentrating on the assembly and alignment of next-generation sequencing reads and related downstream analyses. These systems have been used to reconstruct the genomes of previously unsequenced organisms, probe sequence variations, and explore a host of biological features across the tree of life. He is particularly interested in capitalizing on the latest advances in distributed and parallel computing to advance the state of the art in bioinformatics and genomics.
Michael Schatz: My interest in Cloud computing relates to kind of this data analysis, data discovery problem of being able to scan through very large volumes of DNA sequences. A lot of the technologies that were developed for Cloud computing were actually entirely invented in other disciplines. So in particular, large-scale Internet companies like Google and Facebook and Twitter had developed these technologies out of necessity. So one of the key technologies that I utilize, that I look to, is a technology called MapReduce. It was invented at Google and for a long time this was their secret sauce, if you will, for being able to do these very large studies of many trillions of web pages. Scanning through trillions of web pages is not so different than scanning through trillions of DNA sequences. A lot of the approaches that you would use for those studies are exactly the same. So I borrowed heavily from kind of that sort of community, the text mining database community, and then any sort of discipline where there tends to be large volumes of data, these technologies are rapidly gaining traction just because they are so powerful.
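As a rough illustration of the MapReduce pattern mentioned here, the following is a minimal single-machine sketch in Python that counts k-mers (length-k substrings) across a set of DNA reads. It is not Schatz's software or Google's implementation; the function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the toy reads are assumptions made for this example. A real deployment would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (kmer, 1) pair for every length-k substring of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical toy input: two short reads.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, 3)
print(counts)
```

Each read is processed independently in the map step, which is what lets the same approach spread over many machines, whether the input is web pages or DNA sequences.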
The first main technical challenge is if we have many thousands of genomes we want to study, how can we load all that information into the Cloud, right? The way you would want to do that is, you know, through your web browser or through your computer, but the Internet capacity is only so big and if you have to ship, you know, this conceptual pile of two miles of DVDs, to bring that around on the Internet takes too long. There are some ways to overcome this. It’s a little bit funny to think about it, but in some ways the most practical way to ship very large data sets is to use FedEx or UPS or some sort of physical shipment of hard drives through the mail. It’s not, you know, it’s not the sexy application that you would want for an Internet company, but that’s the practical way to do it.
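The trade-off between network upload and shipping drives can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the data set size (100 terabytes) and the link speed (100 megabits per second) are assumed figures, not numbers from the interview, and the helper `transfer_days` is a hypothetical name.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_terabytes, link_megabits_per_second):
    """Estimate days needed to move a data set over a network link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Mbit = 10**6 bits)
    and a link that is fully and continuously used, which is optimistic.
    """
    size_bits = size_terabytes * 10**12 * 8
    seconds = size_bits / (link_megabits_per_second * 10**6)
    return seconds / SECONDS_PER_DAY


# Assumed example: 100 TB over a sustained 100 Mbit/s connection.
days = transfer_days(100, 100)
print(f"{days:.1f} days")  # roughly 92.6 days
```

Under these assumptions the upload takes about three months, whereas a box of hard drives sent by courier typically arrives within days, which is the point being made about physical shipment.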
So that’s the main technical barrier. And then storing data in the Cloud opens up a lot of other challenges. In particular, there are a lot of privacy concerns about making sure that that data is really well guarded. Your genome has a lot of information about who you are, what sort of diseases you’re susceptible to. It could say a lot about your family, about your children, about your ancestors. You know, it’s precious information that we definitely don’t want to expose without giving it some consideration. So the concern is, if all this genetic information is in the Cloud and you’re not careful about how that data is protected, it could leak out, it could, you know, it could accidentally be exposed to other people. And then also, if big archives are made that have collected many thousands of people, this could suddenly become an attractive target for attackers.
So today we’re a little bit guarded in the sense that this genetic information is decentralized in many different labs so that if there’s a breach in one lab it’s relatively localized. If everything gets aggregated together it becomes a little bit more risky because it becomes a little bit more attractive as a target. I think these challenges can be overcome, the encrypting technologies, the authentication technologies. They exist; they certainly exist. And there are companies that run with the highest level of security at Amazon Cloud or another Cloud resource. It is certainly possible to do so, but we’ve just got to be so certain that we get it right on the first try, right? We don’t want to create this big database that has all this genetic information and then accidentally leave it vulnerable. So we just have to be really careful about how that’s engineered.
Directed / Produced by
Jonathan Fowler & Elizabeth Rodd
Michael Schatz: So in autism, I collaborate with some folks at Cold Spring Harbor Lab where we’re participating in a project that’s been sponsored by the Simons Foundation. And the idea there is over the last...