Reconstructing the Genome

One of my other major interests is a computational problem called genome assembly. The genome, again, is this large molecule, but the way we can sequence it is through little tiny fragments. So the analogy is something like this: take the dictionary, or some very big book. Actually, take many copies of that same book and shred them up into little tiny fragments, like fortune-cookie-size fragments. The computational problem is then: given this large collection of short fragments of DNA sequence, how can we put them back together to form the whole genome? That is the problem called genome assembly.
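To make the shredded-book analogy a little more concrete, here is a minimal Python sketch of one simple way to approach it: repeatedly merge the two fragments with the longest overlap. This is a toy illustration only, not the method of any assembler mentioned in this talk. The names `shred`, `overlap`, and `greedy_assemble`, the `min_len` cutoff, and the example sequence are all made up for the illustration, and the sketch assumes error-free fragments from a single strand with no repeated stretches.

```python
def shred(genome, read_len, step):
    """Cut a sequence into overlapping fragments (hypothetical helper)."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0  # no overlap of at least min_len characters


def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that sit entirely inside another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join; return separate pieces
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags)
                 if k not in (best_i, best_j) and f not in merged]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    genome = "ATGGCGTGCAATCC"      # made-up toy sequence
    reads = shred(genome, 8, 3)    # three overlapping fragments
    print(greedy_assemble(reads[::-1]))
```

On this toy input the fragments join back into the original string. Real data is much harder: reads contain errors, come from both strands, and genomes are full of repeats, which is exactly where a greedy strategy like this one can join the wrong pieces.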

This is one of the bedrock problems of genomics because without assembly there would be no way to study larger sequences. And there’s been a lot of theory developed, a lot of methods developed, a lot of improvements to these ideas on how to go about assembling genomes. But it is very much a rapidly changing, rapidly maturing discipline as new sequencing technologies are brought on board, as new computational methods are applied, as new ideas are brought in.

Two years ago, for the first time, there was a big international competition called “The Assemblathon.” Everybody got the same set of data, and the competition was to see what’s the best way to put that data back together into a reconstructed genome, and how that best reconstruction compares to the actual truth.

In this international competition, about 20 different labs around the world participated, contributing about 70 different assemblies of the same genome. In the case of “The Assemblathon,” it was a synthetic genome made by a computer program, and that gave us more power to accurately measure how everyone did. One kind of surprising outcome was, first, that none of the assemblers were perfect. None of them were able to take all this data and perfectly reconstruct the genome. And also, there was quite a lot of variation in how successful the different teams were at putting the genome back together.

Depending on your outlook, this was either a little bit disconcerting or a little bit of an opportunity. It’s disconcerting in the sense that these genome reconstructions form the foundation for many, many studies in comparative genomics, the basis for evolutionary studies, the basis for many billions of dollars in research. But none of the software for assembling genomes got it quite right. They all had problems in one way or another. It’s also an opportunity, though. Putting on my computer scientist hat, it’s an opportunity for me in the sense that work remains to be done to create better assemblers, better software and computational systems to put all this information together.