Genomics tertiary analysis and data lakes using AWS Glue and Amazon Athena creates a scalable environment in AWS to prepare genomic data for large-scale analysis, to query genomics data lakes, or both. It helps IT infrastructure architects, administrators, data scientists, software engineers, and DevOps professionals build, package, and deploy libraries used for genomics data transformation. It also provisions data ingestion pipelines for genomics data preparation and cataloging, and runs interactive queries against genomics data lakes.
Data output from secondary analysis can be large and complex. For example, Variant Call Format (VCF) files need to be converted to file formats optimized for big data (such as Parquet) and incorporated into existing genomics datasets. The data catalog needs to be updated with the appropriate schema and version so that users can find the data they need and work with it within a defined, semantically consistent data model. Annotation datasets and phenotypic data must be processed, cataloged, and ingested into existing data lakes so that users can build cohorts, aggregate data, and enrich result sets with data from annotation sources. With data governance and granular data access controls, you can protect your data while providing sufficient data access to the research and informatics communities. Genomics tertiary analysis and data lakes using AWS Glue and Amazon Athena simplifies this process.
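As a rough illustration of the VCF preparation step described above, the following Python sketch flattens VCF data lines into one row per alternate allele, a shape that maps naturally onto a columnar format such as Parquet. It is not the transformation library this guidance deploys. The function name `vcf_to_rows` and the output column names are assumptions made for this example, and the sketch reads only the first seven fixed VCF columns (INFO and per-sample fields are ignored). Writing the rows to Parquet is left to a separate tool and is not shown.

```python
def vcf_to_rows(lines):
    """Flatten VCF text lines into a list of dicts, one per ALT allele.

    Hypothetical helper for illustration only: it handles the fixed
    columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and skips the rest.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, "##" meta-information lines, and the "#CHROM" header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, variant_id, ref, alt, qual, filt = fields[:7]
        # A multi-allelic site lists ALT alleles separated by commas;
        # emit one row per allele so each row describes a single variant.
        for allele in alt.split(","):
            rows.append(
                {
                    "chrom": chrom,
                    "pos": int(pos),
                    # "." is the VCF placeholder for a missing value.
                    "id": None if variant_id == "." else variant_id,
                    "ref": ref,
                    "alt": allele,
                    "qual": None if qual == "." else float(qual),
                    "filter": filt,
                }
            )
    return rows
```

Splitting multi-allelic sites into separate rows is a design choice for this sketch; it keeps every row the same shape, which tends to make later filtering and joins against annotation data simpler.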
This guidance sets up a genomics data lake on Amazon Simple Storage Service (Amazon S3) and a genomics and annotation ingestion pipeline using AWS Glue ETL jobs and crawlers. It demonstrates how to use Amazon Athena to analyze and interpret data in the genomics data lake and to create drug response reports from within a Jupyter notebook.
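To make the Athena step more concrete, here is a minimal sketch of how a notebook cell might submit a query and collect the results. It is not the notebook shipped with this guidance. The helper name `run_athena_query` is an assumption, as are any database, table, or bucket names you pass to it. The client argument is expected to behave like a boto3 Athena client, using its `start_query_execution`, `get_query_execution`, and `get_query_results` operations. Only the first page of results is read.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run a SQL query through an Athena-style client and return rows as dicts.

    Hypothetical helper for illustration: reads only the first results page
    and treats every value as a string, as Athena returns them.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Athena query {status['State']}: {reason}")

    result_rows = client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    if not result_rows:
        return []

    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in result_rows[0]["Data"]]
    return [
        dict(zip(header, [cell.get("VarCharValue") for cell in row["Data"]]))
        for row in result_rows[1:]
    ]
```

With boto3 installed and AWS credentials configured, `boto3.client("athena")` could be passed as `client`, with `output_location` set to an S3 URI that you own for query results. Taking the client as a parameter also lets the function be exercised against a stub, without an AWS account.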