Genomic annotations augment variant data by providing context for each change in the genome. For example, annotations help answer questions like does this mutation cause a change in the protein-coding sequence of a gene? If so, how does it change the protein? Or is the mutation in a low information part of the genome, also known as “junk DNA”? And everything in between. Recently (since release 0.4), Glow, an open-source toolkit for large-scale genomic analysis, introduced the capability to ingest genomic annotation data from the GFF3 (Generic Feature Format Version 3) flat-file format. GFF3 was proposed by the Sequence Ontology Project in 2013 and has become the de-facto format for genome annotation. This format is widely used by genome browsers and databases such as NCBI RefSeq and GenBank. GFF3 is a 9-column tab-separated text format that typically carries the majority of the annotation data in the ninth column. This column is called attributes
and stores the annotations as a semi-colon separated list of <tag>=<value>
entries. As a result, although GFF3 files can be read as Apache Spark™ DataFrames using Spark’s standard csv
data source, the resulting DataFrame is unwieldy for query and data manipulation of annotation data, because the whole list of attribute tag-value pairs for each sequence will appear as a single semicolon-separated string in the attributes
column of the DataFrame.
Glow’s new and flexible gff
Spark data source addresses this challenge. While reading the GFF3 file, the gff
data source parses the attributes
column of the file to create an appropriately typed column for each tag. In each row, this column will contain the value corresponding to that tag in that row (or null
if the tag does not appear in the row). Consequently, all tags in the GFF3 attributes
column will have their own corresponding column in the Spark DataFrame, making annotation data query and manipulation much easier.
Ingesting GFF3 Annotation Data
Like any other Spark data source, reading GFF3 files using Glow’s gff
data source can be done in a single line of code. Figure 1 shows how we can ingest the annotations of the Homo Sapiens genome assembly GRCh38.p13 from a GFF3 file (obtained from RefSeq) as shown below. Here, we have also filtered the annotations to chromosome 22 in order to use the resulting annotations_df
DataFrame in continuation of our example. The annotations_df
alias is for the same purpose as well.
import glow glow.register(spark) gff_path = '/databricks-datasets/genomics/gffs/GCF_000001405.39_GRCh38.p13_genomic.gff.bgz' annotations_df = spark.read.format('gff').load(gff_path) \ .filter("seqid = 'NC_000022.11'") \ .alias('annotations_df')
In addition to reading uncompressed .gff
files, the gff data source supports all compression formats supported by Spark’s csv
data source, including .gz
and .bgz
. It is strongly recommended to use splittable compression formats like .bgz
instead of .gz
to enable parallelization of the read process.
Schema
Let us have a closer look at the schema of the resulting DataFrame, which was automatically inferred by Glow’s gff
data source:
annotations_df.printSchema() root |-- seqId: string (nullable = true) |-- source: string (nullable = true) |-- type: string (nullable = true) |-- start: long (nullable = true) |-- end: long (nullable = true) |-- score: double (nullable = true) |-- strand: string (nullable = true) |-- phase: integer (nullable = true) |-- ID: string (nullable = true) |-- Name: string (nullable = true) |-- Parent: array (nullable = true) | |-- element: string (containsNull = true) |-- Target: string (nullable = true) |-- Gap: string (nullable = true) |-- Note: array (nullable = true) | |-- element: string (containsNull = true) |-- Dbxref: array (nullable = true) | |-- element: string (containsNull = true) |-- Is_circular: boolean (nullable = true) |-- align_id: string (nullable = true) |-- allele: string (nullable = true) . . . |-- transl_table: string (nullable = true) |-- weighted_identity: string (nullable = true)
This schema has 100 fields (not all shown here). The first eight fields (seqId
, source
, type
, start
, end
, score
, strand
, and phase
), here referred to as the “base” fields, correspond to the first eight columns of the GFF3 format cast in the proper data types. The rest of the fields in the inferred schema are the result of parsing the attributes
column of the GFF3 file. Fields corresponding to any “official” tag (those referred to as “tags with a pre-defined meaning” in the GFF3 format description), if present in the GFF3 file, are automatically assigned the appropriate data types. The official fields are then followed by the “unofficial” fields (fields corresponding to any other tag) in alphabetical order. In the example above, ID
, Name
, Parent
, Target
, Gap
, Note
, Dbxref
, and Is_circular
are the official fields, and the rest are the unofficial fields. The gff data source discards the comments, directives, and FASTA lines that may be in the GFF3 file.
As it is not uncommon for the official tags to be spelled differently in terms of letter case and underscore usage within and/or across different GFF3 files the gff
data source is designed to be insensitive to letter case and underscore when extracting official tags from the attributes
field. For example, the official tag Dbxref
will be correctly extracted as an official field even if it appears as dbxref or dbx_ref
in the GFF3 file. Please see Glow documentation for more details.
Like other Spark data sources, Glow’s gff
data source is also able to accept a user-specified schema through the .schema
command. The data source behavior in this case is also designed to be quite flexible. More specifically, the fields (and their types) in the user-specified schema are treated as the list of fields, whether base, official, or unofficial, to be extracted from the GFF3 file (and cast to the specified types). Please see the Glow documentation for more details on how user-specified schemas can be used.
Example: Gene Transcripts and Transcript Exons
With the annotation tags extracted as individual DataFrame columns using Glow’s gff
data source, query and data preparation over genetic annotations becomes as easy as writing common Spark commands in the user’s API of choice. As an example, here we demonstrate how simple queries can be used to extract data regarding hierarchical grouping of genomic features from the annotations_df
created earlier.
One of the main advantages of the GFF3 format compared to older versions of GFF is the improved presentation of feature hierarchies (see the GFF3 format description for more details). Two examples of such hierarchies are:
- Transcripts of a gene (here, gene is the “parent” feature and its transcripts are the “children” features).
- Exons of a transcript (here, the transcript is the parent and its exons are the children).
In the GFF3 format, the parents of the feature in each row are identified by the value of the parent
tag in the attributes
column, which includes the ID(s) of the parent(s) of the row. Glow’s gff
data source extracts this information as an array of parent ID(s) in a column of the resulting DataFrame called parent
.
Assume we would like to create a DataFrame, called gene_transcript_df
, which, for each gene on chromosome 22, provides some basic information about the gene and all its transcripts. As each row in the annotations_df
of our example has at most a single parent, the parent_child_df
DataFrame created by the following query will help us in achieving our goal. This query joins annotations_df
with a subset of its own columns on the parent
column as the key. Figure 2 shows a small section of the parent_child_df
dataframe.
from pyspark.sql.functions import * parent_child_df = annotations_df \ .join( annotations_df.select('id', 'type', 'name', 'start', 'end').alias('parent_df'), col('annotations_df.parent')[0] == col('parent_df.id') # each row in annotation_df has at most one parent ) \ .orderBy('annotations_df.start', 'annotations_df.end') \ .select( 'annotations_df.seqid', 'annotations_df.type', 'annotations_df.start', 'annotations_df.end', 'annotations_df.id', 'annotations_df.name', col('annotations_df.parent')[0].alias('parent_id'), col('parent_df.Name').alias('parent_name'), col('parent_df.type').alias('parent_type'), col('parent_df.start').alias('parent_start'), col('parent_df.end').alias('parent_end') ) \ .alias('parent_child_df')
Having the parent_child_df
DataFrame, we can now write the following simple function, called parent_child_summary
, which, given this DataFrame, the parent type, and the child type, generates a DataFrame containing basic information on each parent of the given type and all its children of the given type.
from pyspark.sql.dataframe import * def parent_child_summary(parent_child_df: DataFrame, parent_type: str, child_type: str) -> DataFrame: return parent_child_df \ .select( 'seqid', col('parent_id').alias(f'{parent_type}_id'), col('parent_name').alias(f'{parent_type}_name'), col('parent_start').alias(f'{parent_type}_start'), col('parent_end').alias(f'{parent_type}_end'), col('id').alias(f'{child_type}_id'), col('start').alias(f'{child_type}_start'), col('end').alias(f'{child_type}_end'), ) \ .where(f"type == '{child_type}' and parent_type == '{parent_type}'") \ .groupBy( 'seqid', f'{parent_type}_id', f'{parent_type}_name', f'{parent_type}_start', f'{parent_type}_end' ) \ .agg( collect_list( struct( f'{child_type}_id', f'{child_type}_start', f'{child_type}_end' ) ).alias(f'{child_type}s') ) \ .orderBy( f'{parent_type}_start', f'{parent_type}_end' ) \ .alias(f'{parent_type}_{child_type}_df')
Now we can generate our intended gene_transcript_df
DataFrame, shown in Figure 3, with a single call to this function:
gene_transcript_df = parent_child_summary(parent_child_df, 'gene', 'transcript')
In each row of this DataFrame, the transcripts
column contains the ID, start and end of all transcripts of the gene in that row as an array of structs.
The same function can now be used to generate any parent-child feature summary. For example, we can generate the information of all exons of each transcript on chromosome 22 with another call to the parent_child_summary
function as shown below. Figure 4 shows the generated transcript_exon_df
DataFrame.
transcript_exon_df = parent_child_summary(parent_child_df, 'transcript', 'exon')
Example Continued: Integration with Variant Data
Glow has data sources to ingest variant data from common flat file formats such as VCF, BGEN, and PLINK. Combining Glow’s variant data sources with the new gff
data source, users can seamlessly annotate their variant DataFrames by joining them with annotation DataFrames.
As an example, let us load the chromosome 22 variants of the 1000 Genome Project (on GRCh38 genome assembly) from a VCF file (obtained from the project’s ftp site). Figure 5 shows the resulting variants_df
.
vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"" variants_df = spark.read \ .format("vcf") \ .load(vcf_path) \ .alias('variants_df')
Figure 6 shows a DataFrame which, for each variant on a gene on chromosome 22, provides the information of the variant as well as the exon, transcript, and gene on which the variant resides. This is computed using the following query with two joins. Note that the first two exploded DataFrames can also be constructed directly from parent_child_df
. Here, since we had already defined gene_transcript_df
and transcript_exon_df
, we generated these exploded DataFrames simply by applying the explode
function followed by Glow’s expand_struct function on them.
from glow.functions import * gene_transcript_exploded_df = gene_transcript_df \ .withColumn('transcripts', explode('transcripts')) \ .withColumn('transcripts', expand_struct('transcripts')) \ .alias('gene_transcript_exploded_df') transcript_exon_exploded_df = transcript_exon_df \ .withColumn('exons', explode('exons')) \ .withColumn('exons', expand_struct('exons')) \ .alias('transcript_exon_exploded_df') variant_exon_transcript_gene_df = variants_df \ .join( transcript_exon_exploded_df, (variants_df.start < transcript_exon_exploded_df.exon_end) & (transcript_exon_exploded_df.exon_start < variants_df.end) ) \ .join( gene_transcript_exploded_df, transcript_exon_exploded_df.transcript_id == gene_transcript_exploded_df.transcript_id ) \ .select( col('variants_df.contigName').alias('variant_contig'), col('variants_df.start').alias('variant_start'), col('variants_df.end').alias('variant_end'), col('variants_df.referenceAllele'), col('variants_df.alternateAlleles'), 'transcript_exon_exploded_df.exon_id', 'transcript_exon_exploded_df.exon_start', 'transcript_exon_exploded_df.exon_end', 'transcript_exon_exploded_df.transcript_id', 'transcript_exon_exploded_df.transcript_name', 'transcript_exon_exploded_df.transcript_start', 'transcript_exon_exploded_df.transcript_end', 'gene_transcript_exploded_df.gene_id', 'gene_transcript_exploded_df.gene_name', 'gene_transcript_exploded_df.gene_start', 'gene_transcript_exploded_df.gene_end' ) \ .orderBy( 'variant_contig', 'variant_start', 'variant_end' )
Try Glow!
Glow is installed in the Databricks Genomics Runtime (Azure | AWS) and is optimized for improved performance when using cloud computing to analyze large genomics datasets. Learn more about our genomics solutions and how we’re helping to further human and agricultural genome and other genetic research and enable advances like population-scale next-generation sequencing in the Databricks Unified Analytics Platform for Genomics and try out a preview today.
--
Try Databricks for free. Get started today.
The post Integrating large-scale Genomic Variation and Annotation Data with Glow appeared first on Databricks.