Sequencing genomes is like ripping up a piece of paper into a million pieces, blowing them into the wind, then finding and assembling the paper by identifying patterns. Though this job sounds tedious—impossible even—completing it would be a big breakthrough when applied to the field of bioinformatics.
Genomic assets are present in huge stacks of data that are difficult to organize and analyze. With the Snowflake cloud data platform, data from multiple sources can be collated, queried, analyzed, and presented to support researchers, scientists, and geneticists with predictions.
This article walks through the steps involved in genome sequencing and shows how Snowflake Cloud Genomics can simplify the process for researchers, data scientists, and geneticists. It covers the history of human genome sequencing, its applications, the sequencing steps, genomics on the Snowflake Data Cloud, and the future of genome sequencing.
History of Human Genome Sequencing
In 1961, Marshall Nirenberg and his colleagues cracked the “code for life,” eight years after James Watson and Francis Crick described the double-helix structure of DNA. A series of explorations followed, leading to the launch of the Human Genome Project in 1990, when bioscientists set out to sequence the 3 billion letters of the human genome within 15 years. More of the sequence was uncovered over the following years, and by 2007 a map of human genetic variation had been published using Solexa’s next-gen DNA sequencing technology. However, the cost of sequencing remained high until 2009, when it finally dropped. More discoveries followed: 20,687 protein-coding genes were identified in 2012, 100,000 cancer genomes were sequenced in 2018, and the SARS-CoV-2 virus was sequenced in 2020.
Applications of Genome Sequencing
The Human Genome Project has reached many milestones since its inception in 1990, but the real breakthrough was achieved only recently. On March 31, 2022, the final missing pieces of human DNA were published, completing the whole sequence of a human genome with about 3 billion nucleotides. This genomic fingerprint can now help researchers understand and fight many genetic diseases that were previously incurable, such as Huntington’s disease, progeria, hemophilia, and sickle cell anemia.
This was only possible thanks to a sequencing technology that could read thousands of nucleotides, cut them into smaller fragments, and then assemble the repetitive regions. Genome sequencing has already found its way into healthcare applications like pediatrics, drug trials, and cancer treatments. With the completed sequence, it may now also be used for cell repair and for treating genetic diseases.
Previously, point mutations that caused major disorders could neither be cured nor prevented from being passed down to future generations. However, CRISPR-Cas9 technology (CRISPR stands for clustered regularly interspaced short palindromic repeats), which was developed from the natural defense mechanism of bacteria against viruses, can now be used to add, replace, delete, or edit specific regions of DNA to treat these diseases.
The Steps in Genome Sequencing
The sequencing journey begins with bioscientists extracting DNA from the cells of a sample, such as bacterial cells or tissue. The DNA is then cut into fragments of the desired size (typically in the 100-5,000 bp range) with molecular scissors. The resulting fragments are converted to double-stranded DNA, and oligonucleotide adapters are then attached to the ends of these fragments so that the cDNA molecules can be recognized.
Next, scientists make many copies of these fragments using the polymerase chain reaction (PCR) method to create a DNA library. This library is then loaded into the DNA sequencer, which reads the sequence of nucleotides (A, T, C, and G). Millions of such reads are produced and later put together to complete the whole genome sequence, which for a human contains billions of nucleotides.
A number of vendors provide kits for making sequencing libraries. One of them is Illumina, which Snowflake named one of its winning Data Drivers in 2021 for its unified product analytics. Snowflake enables rapid data collection from Illumina sequencing instruments to allow real-time processing.
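The idea of putting short reads back together can be illustrated with a toy Python sketch. It is not how production assemblers work (those deal with sequencing errors, both DNA strands, and repeats, usually with graph-based methods); it is a minimal greedy merge that assumes error-free reads from one strand. The reference string and the function names `shred`, `overlap`, and `greedy_assemble` are made up for illustration.

```python
def shred(genome: str, read_len: int, step: int) -> list[str]:
    """Cut a sequence into overlapping reads of fixed length."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to join; fragments stay separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return max(reads, key=len)  # longest assembled stretch


# Made-up 15-letter reference, chosen so that every 3-letter window is unique.
REFERENCE = "ACGTTGCAAGCTCCA"
reads = shred(REFERENCE, read_len=7, step=4)
print(greedy_assemble(reads) == REFERENCE)
```

Because every 3-letter window in this toy reference is unique, each overlap of three or more letters corresponds to a true adjacency, which is why the greedy merge recovers the original string here. Real genomes contain long repeats, so this shortcut does not carry over.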
Genomics on Snowflake Data Cloud
In the past, genome sequencing was very time-consuming and costly because of its complexity and the large volumes of data involved. However, Next Generation Sequencing (NGS) technology can reduce both time and cost, making genome sequencing more feasible for healthcare needs.
Genomic data is produced in massive quantities and thus requires large storage and computing capacities. The Snowflake data cloud allows sequencing outcomes to be extracted from sequencers like Illumina's and further supports analysis of the genomic reads. The Snowflake platform can store both structured and semi-structured data using compressed columnar storage. It can also co-locate structured phenotype data and semi-structured annotation data, which opens up new possibilities. Genomic assets buried in a stack of VCF (Variant Call Format) files can be extracted for the interactive data exploration and analysis that is critical for research and assessment applications.
The Snowflake data cloud can power genome analytics by supporting genome variants and gene expression data. Once the data is extracted from different sources and the genomic sequence is fed into the Snowflake cloud, Snowflake offers different approaches that make this data easier for people to interpret through genome analytics. Your Snowflake cloud contains both your genome sequence data and other metadata from clinical databases extracted through FHIR APIs* (Fast Healthcare Interoperability Resources – Application Programming Interfaces). This allows you to find the markers in the sequence and match them with the patient data to find anomalies and understand the connection between the genome and dysfunctions in patients.
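To make the VCF point concrete, here is a small Python sketch of turning VCF data lines into records with structured columns plus a nested, semi-structured INFO field. It relies only on the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The sample lines are invented, and the function names are illustrative rather than part of any Snowflake tooling. In Snowflake, a nested field like this would typically be a candidate for a VARIANT column, though the exact loading path depends on the pipeline.

```python
import json

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> dict:
    """Turn 'DP=100;AF=0.5;DB' into a dict; bare keys are treated as flags."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed VCF columns of one data line."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    record["INFO"] = parse_info(record["INFO"])
    return record


def parse_vcf(text: str) -> list[dict]:
    """Skip header lines (starting with '#') and parse the rest."""
    return [parse_vcf_line(line) for line in text.splitlines()
            if line and not line.startswith("#")]


# Invented example content, for illustration only.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM"] + VCF_COLUMNS[1:]),
    "\t".join(["chr1", "12345", "demo1", "A", "G", "50", "PASS", "DP=100;AF=0.5;DB"]),
    "\t".join(["chr2", "500", ".", "C", "G,T", ".", "PASS", "."]),
])

for rec in parse_vcf(SAMPLE_VCF):
    print(json.dumps(rec))
```

Each printed line is one JSON document, a shape that is convenient for loading into a table that mixes ordinary columns with a semi-structured one.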
Let us understand how Snowflake data cloud can engage with a genomic firm to enable sequencing of large genomic datasets.
* The HL7® FHIR® (Fast Healthcare Interoperability Resources) standard defines how healthcare information can be exchanged between different computer systems regardless of how it is stored in those systems.
A genomic organization works on genome sequencing to produce insights that help us learn how the body works. The work involves multiple steps, each of which generates data that can be handled on a management platform. Every stage also incurs a cost that the platform can assess and reveal. Direct costs arise in sampling, lab preparation, processing, the sequencing chip, sequencing device infrastructure, analysis, and reporting. Indirect costs of infrastructure, personnel, security, and different types of analysis are added on top. Monitoring the sequencing steps, including ligation, amplification, sequencing, and analysis, on the Snowflake cloud data platform can be useful for tracking both processes and costs, which can then be easily reported.
The Snowflake platform stores and co-locates structured phenotype data and semi-structured annotation data of the genome sequence. With this, bioscientists can query specific genetic markers both horizontally (one patient across positions) and vertically (one position across patients). The sequence that is generated is an extremely long string of DNA code that makes no sense unless it is analyzed. The analysis identifies markers of a person's physical characteristics, or misplaced nucleotides that could be the cause of a genetic disease. Such reporting requires deep analytics over a huge volume of data. Snowflake's healthcare analytics solution helps genomic organizations turn read sequences into actionable insights that can be used in healthcare.
With the Snowflake platform, this is possible. Snowflake provides not only a vast data warehouse but also integrations with partners like Tableau, which delivers interactive analytics; Alteryx, which offers advanced predictive analysis of large data sets; and Squark, which uses machine learning (ML) and augmented intelligence to uncover human genomic patterns and predict human characteristics.
While the Snowflake data cloud helps with data extraction, storage, and predictive analysis of genome sequencing, it goes a step further by providing integrations with the healthcare system. Besides the sequencer, the Snowflake platform can consolidate data from multiple sources like EHR systems, insurance claims, and other lab results.
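As a rough illustration of tracking cost per stage, the Python sketch below rolls up cost events by stage and by sample. The event layout, the sample IDs, and every number are placeholders invented for this example; they are not real prices and not a Snowflake schema.

```python
from collections import defaultdict

# Hypothetical cost events: (sample_id, stage, cost). All values are placeholders.
EVENTS = [
    ("S1", "sampling", 10.0),
    ("S1", "ligation", 5.0),
    ("S1", "amplification", 7.5),
    ("S1", "sequencing", 100.0),
    ("S1", "analysis", 20.0),
    ("S2", "sampling", 10.0),
    ("S2", "sequencing", 120.0),
]


def summarize_costs(events):
    """Return (total cost per stage, total cost per sample)."""
    by_stage = defaultdict(float)
    by_sample = defaultdict(float)
    for sample_id, stage, cost in events:
        by_stage[stage] += cost
        by_sample[sample_id] += cost
    return dict(by_stage), dict(by_sample)


by_stage, by_sample = summarize_costs(EVENTS)
for stage, total in by_stage.items():
    print(f"{stage}: {total}")
```

On a data platform the same roll-up would more likely be a grouped query over an events table; the plain-Python version only shows the shape of the calculation.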
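The horizontal and vertical access patterns can be sketched with two queries. The example below uses Python's built-in sqlite3 module as a local stand-in for a warehouse. The `genotypes` table, its columns, and the rows are assumptions made up for illustration, not an actual Snowflake schema.

```python
import sqlite3

# Invented genotype calls: (patient_id, chrom, pos, genotype).
ROWS = [
    ("P1", "chr1", 100, "A/A"),
    ("P1", "chr1", 200, "A/G"),
    ("P2", "chr1", 100, "A/G"),
    ("P2", "chr1", 200, "G/G"),
    ("P3", "chr1", 100, "A/A"),
]


def build_db(rows):
    """Create an in-memory table holding one row per patient and position."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genotypes "
        "(patient_id TEXT, chrom TEXT, pos INTEGER, genotype TEXT)"
    )
    conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?, ?)", rows)
    return conn


def horizontal(conn, patient_id):
    """One patient across positions."""
    return conn.execute(
        "SELECT chrom, pos, genotype FROM genotypes "
        "WHERE patient_id = ? ORDER BY chrom, pos",
        (patient_id,),
    ).fetchall()


def vertical(conn, chrom, pos):
    """One position across patients."""
    return conn.execute(
        "SELECT patient_id, genotype FROM genotypes "
        "WHERE chrom = ? AND pos = ? ORDER BY patient_id",
        (chrom, pos),
    ).fetchall()


conn = build_db(ROWS)
print(horizontal(conn, "P1"))
print(vertical(conn, "chr1", 100))
```

Both queries read the same long-format table and differ only in which column they filter on.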
The Future of Genome Sequencing
The Human Genome Project started in 1990 with the aim of extracting the gene sequence of a human. It took bioscientists more than three decades to complete the sequence of 3 billion nucleotides. Along the way, scientists made many intriguing discoveries, such as CRISPR technology, which paved the way for base editing and for work on point mutations to eliminate previously incurable diseases like progeria and sickle cell anemia. Rare diseases caused by mitochondrial inheritance and chromosome abnormalities, such as fibrosis, hemochromatosis, Turner syndrome, and myoclonic epilepsy, can now be better understood.
Moreover, scientists have also gained a more accurate picture of the internal structure of our DNA and the molecular clockwork inside it. Scientists have long imagined creating tiny disease-fighting robots that could be sent inside the human body to repair damage and fight diseases like cancer. The thought of tiny micro-bots working across our DNA strands gives us hope for a better life ahead.
To learn more about Snowflake, check out the Apexon – Snowflake partnership.
This blog was cowritten by Dr. Jagadisan Suresh and Prashant Kumar