This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.
The Genome Aggregation Database (gnomAD) has made its catalog of genetic data available on Azure Open Datasets. Led by researchers at the Broad Institute of MIT and Harvard, gnomAD is the world’s largest public collection of human genetic variation and brings together data from numerous large-scale sequencing projects, including population and disease-specific genetic studies. A near-ubiquitous resource for human genetics research and clinical variant interpretation, it is used in clinical genetic diagnostic pipelines worldwide, and the consortium's online data browser has received over 20 million page views to date.
The gnomAD dataset is now accessible on Azure at no cost as part of the Genomics Data Lake within the Azure Open Datasets Program. Azure users will no longer need to pay transfer fees or long-term storage costs to access or to maintain a personal copy of gnomAD. Azure Open Datasets containers can be accessed directly from services like Azure Synapse and Azure Machine Learning, as part of a workflow in Cromwell on Azure, or from deployed HPC infrastructure on Azure.
By democratizing access to gnomAD data through this collaboration, the gnomAD Consortium hopes to accelerate breakthrough genomic discoveries that enhance the scientific community’s understanding of human genetics and result in solutions that improve the lives of people all over the world.
As the industry anticipates further exponential growth of human genomic datasets over the next few years, the gnomAD Consortium and Microsoft Azure believe that the computational genomics community can benefit from free access to shared datasets. By reducing unnecessary duplication of terabyte- and petabyte-scale genomic datasets, we as a community save scarce environmental, capital, and human resources that would otherwise be spent maintaining many copies across separate institutions. With this collaboration, gnomAD and the Broad Institute hope to provide an avenue for more individuals and organizations to participate in creative research in human genomics, with potential downstream benefits to us all.
The gnomAD data available on Azure includes:
- All official gnomAD release data, comprising summary statistics and annotations for over 241 million unique short human genetic variants and 335,000 structural variants observed in over 141,000 healthy adult individuals across a diverse range of genetic ancestry groups
- Standard “truth” sets used to assess and validate variant calls
- Interval lists and other resources used in the creation of gnomAD releases
- Data from the gnomAD Consortium's latest collection of papers in Nature
How to access the data
The data is stored in the Azure Open Data Storage Account, within the "gnomad" container: https://azureopendatastorage.blob.core.windows.net/gnomad
To list the contents of the container with AzCopy, run:
azcopy ls https://azureopendatastorage.blob.core.windows.net/gnomad/
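For readers who do not have AzCopy installed, the same kind of listing can be sketched in Python with only the standard library, by calling the Azure Blob Storage "List Blobs" REST operation on the container URL above. This is a minimal sketch, not an official client: it assumes the container permits anonymous listing (as the unauthenticated azcopy command suggests), it reads only the first page of results and does not follow continuation markers, and the function names (`build_list_url`, `parse_listing`, `list_container`) are hypothetical helpers introduced for illustration.

```python
"""List one level of a public Azure Storage container using only the standard library."""
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Container URL taken from the post.
CONTAINER_URL = "https://azureopendatastorage.blob.core.windows.net/gnomad"


def build_list_url(container_url, prefix="", delimiter="/", max_results=100):
    """Build a List Blobs request URL for one virtual-directory level."""
    params = {"restype": "container", "comp": "list", "maxresults": str(max_results)}
    if prefix:
        params["prefix"] = prefix
    if delimiter:
        params["delimiter"] = delimiter
    return container_url.rstrip("/") + "?" + urllib.parse.urlencode(params)


def parse_listing(xml_text):
    """Return (prefixes, blobs) parsed from a List Blobs XML response.

    prefixes: names of virtual directories at this level
    blobs: (name, size_in_bytes) tuples; size is None if not reported
    """
    root = ET.fromstring(xml_text)
    prefixes = [el.findtext("Name") for el in root.iter("BlobPrefix")]
    blobs = []
    for el in root.iter("Blob"):
        size = el.findtext("Properties/Content-Length")
        blobs.append((el.findtext("Name"), int(size) if size else None))
    return prefixes, blobs


def list_container(prefix="", container_url=CONTAINER_URL):
    """Fetch and parse the first page of results (requires network access)."""
    url = build_list_url(container_url, prefix)
    with urllib.request.urlopen(url, timeout=30) as resp:
        # utf-8-sig drops a leading byte-order mark if the service sends one.
        return parse_listing(resp.read().decode("utf-8-sig"))
```

Calling `list_container()` returns the top-level virtual directories and blobs; passing one of the returned directory names as `prefix` descends one level. For bulk downloads, AzCopy or an official Azure SDK remains the more practical choice.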
This post was contributed by Grace Tiao, Associate Director of Computational Genomics at the Broad Institute, and Jer-Ming Chia, Principal PM in Azure Compute.