In a lab on the ground floor of deCODE Genetics’ building in Reykjavik, Iceland, robots quietly go about chip-typing or “SNP-typing” the latest of 155,000 or so Icelanders. SNPs are single-nucleotide polymorphisms: small variations in the genetic code that can underlie disease or health, or the presence or absence of a particular condition. A floor up, Illumina HiSeq X machines, worth millions of dollars each, take a few days to come up with a complete human genome. They’ve done so for about 3,600 Icelanders at latest count.
All that sequencing means an enormous amount of data. The As, Ts, Cs, and Gs add up, and of course, the idea here is to actually make use of those letters. Across the building from the sequencing labs, I spoke with Hakon Gudbjartsson, deCODE’s VP for informatics, on the challenges and methods for dealing with mountains of data. Each individual person sequenced accounts for around 100 gigabytes of data, and it’s data, he says, “that requires a lot of organization.”
The primary means to organize the genetic information is a database the company developed that is known as GOR, or genomically ordered relations. Traditional databases, like Oracle or MySQL, organize data in tables that don’t quite make sense for genetic information; Gudbjartsson says using such methods creates bottlenecks when trying to retrieve the data. The GOR database organizes genetic code according to a “reference build,” essentially placing the data in sequential order.
“It’s a database that organizes the downstream data according to the position in the genome,” Gudbjartsson says. All the specific variations observed also fit in based on their physical place. “Whether it’s a SNP or… a copy number variation, anything. All the tables are basically ordered according to the genome.”
That ordering allows the design of algorithms to query the information in a much more efficient fashion. Researchers can even “stream” the genomic information, instead of calling up one specific spot.
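To make the idea concrete, here is a minimal Python sketch of why position ordering helps. It is not GOR itself, whose internals are not described here; the row layout, the function names range_query and stream_merged, and the sample variants are all invented for illustration. The only assumption carried over from the description above is that every table is sorted by chromosome and position.

```python
import heapq
from bisect import bisect_left
from typing import Iterator, List, Tuple

# Hypothetical row layout: (chromosome, position, description).
Row = Tuple[int, int, str]

# Two illustrative tables, each already sorted by (chromosome, position).
snps: List[Row] = [(1, 100, "SNP A>G"), (1, 250, "SNP C>T"), (2, 50, "SNP G>A")]
cnvs: List[Row] = [(1, 180, "CNV gain"), (2, 40, "CNV loss")]


def range_query(table: List[Row], chrom: int, start: int, end: int) -> List[Row]:
    """Return rows on `chrom` with start <= position <= end.

    Because the table is sorted, two binary searches locate the slice
    without scanning every row.
    """
    lo = bisect_left(table, (chrom, start, ""))
    hi = bisect_left(table, (chrom, end + 1, ""))
    return table[lo:hi]


def stream_merged(*tables: List[Row]) -> Iterator[Row]:
    """Lazily yield rows from several sorted tables in genome order."""
    return heapq.merge(*tables)


if __name__ == "__main__":
    print(range_query(snps, 1, 90, 200))
    for row in stream_merged(snps, cnvs):
        print(row)
```

The merge step is the “streaming” part: since every table shares the same ordering, different kinds of variation can be read side by side in a single pass along the genome, rather than looked up one position at a time.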
The goal at deCODE (now owned by Amgen) is to take this impressive collection of genetic data and match it up with rich clinical and genealogical data as well. Iceland is home to only 320,000 people or so, and all Icelanders can trace their lineages to 1650, and often further back. Along with good recent medical records, that means the ability to take a genotype and match it up with a phenotype. The company’s researchers have produced some impressive results, perhaps most famously the discovery of a gene in 2012 that confers almost complete protection against development of Alzheimer’s disease.
This rich data set has created some controversy. Some critics have expressed discomfort with aggressive DNA collection methods (all participants do sign consent forms) and the apparent ability to make “data inferences” based on available data and those rich genealogical and clinical records. (Essentially, even if an individual doesn’t give consent for sequencing, but enough others do, the close genetic connections in Iceland could allow the researchers to fill in the gaps.) However, Gudbjartsson points out that everyone at deCODE signs an agreement to never actually use such inferences.
Gudbjartsson says that GOR can already run on several hundred computers simultaneously. “GOR should be elastic,” he adds, noting that, “we foresee growth.” It is already a challenge to efficiently transpose the full sequencing data from the 3,600 or so completed genomes onto the chip data for the 155,000 Icelanders in the database (a way to find common variations, and match with phenotypes). But sequencing will inevitably get even faster and cheaper. The full genome for all Icelanders, or other populations around the world, will present an even greater challenge for data manipulation.
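A rough back-of-the-envelope calculation gives a sense of the scale involved. The sketch below uses only the figures quoted in this article (around 100 gigabytes per sequenced person, about 3,600 genomes so far, roughly 320,000 Icelanders). It assumes the per-genome size stays constant and ignores compression and derived data, so the results are order-of-magnitude estimates, not deCODE’s actual storage numbers.

```python
GB_PER_GENOME = 100  # "around 100 gigabytes" per sequenced person, as quoted


def total_terabytes(genomes: int, gb_per_genome: float = GB_PER_GENOME) -> float:
    """Estimate total storage in terabytes (decimal units: 1 TB = 1000 GB)."""
    return genomes * gb_per_genome / 1000


if __name__ == "__main__":
    sequenced = total_terabytes(3_600)     # genomes completed so far
    everyone = total_terabytes(320_000)    # approximate population of Iceland
    print(f"Sequenced so far: {sequenced:,.0f} TB")
    print(f"Whole population: {everyone:,.0f} TB ({everyone / 1000:,.0f} PB)")
```

Under those assumptions, the genomes sequenced to date come to about 360 terabytes, and sequencing the whole country would mean something on the order of 32 petabytes.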