DNA In the Cloud

What’s the Big Idea?

In 2001, the human genome was published online for all to see: at once a Nobel-worthy scientific breakthrough, a testament to international collaboration, and a fundamental shift in the way biological research would be conducted for the foreseeable future. The price tag? A billion dollars.

Ten years later, the time and cost of sequencing DNA has dropped exponentially to millions and then hundreds of thousands of dollars, leading some theorists to wonder whether the age of the over the counter DNA test is upon us.

It’s true that we’re now privvy to some of the most arcane secrets of the human body. But it will take at least another few decades to interpret all that data, says Michael Schatz, of Cold Spring Harbor Laboratory. Schatz works in the field of bioinformatics, at the intersection of computer science and biology. Watch the video:

In a recent paper, he explained, “Sequencing throughput has recently been improving at a rate of about 5-fold per year 1, while computer performance generally follows ‘Moore’s Law,’ doubling only every 18 or 24 months.” The speed at which scientists are making new discoveries in biotechnology is actually outpacing the speed of our computers, causing an information bottleneck.

What’s the Significance?

Schatz believes the solution lies in cloud computing. He hopes to use Google’s algorithms to sort through the genomic data deluge. “Our genome is a molecule about three billion bases long, but today there is no technology that can just read off all of these individual nucleotides,” he told Big Think. “Instead, the technology sequences little tiny fragments from here and here and here and here and here. How can we interpret what the entire genome is from all these little snippets?”

What we need, he argues, is better technology. If researchers could rapidly scan large volumes of DNA sequences in the same way Google scans the Internet, they could make meaningful comparisons.

For instance, Google’s MapReduce allows the company to conduct studies of trillions of web pages at a time — a massive undertaking that is involves large data sets, just like the study of the human genome. “Large-scale Internet companies like Google and Facebook and Twitter developed these technologies out of necessity,” says Schatz. “They’re rapidly gaining traction just because they are so powerful. A lot of the approaches that you would use for those studies are exactly the same.”

Of course, there are the privacy concerns. (The risk of theft can be mitigated by storing different bits of data in different locations, according to Schatz.) But cloud computing has been used by the US federal government, pharmaceutical and Internet companies, scientific labs, and bioinformatics services to transfer and store sensitive information. Why not by geneticists?

The first printout of the human genome to be presented as a series of books, displayed at the Wellcome Collection.