Variant typing/recall with variation graphs

This past year I've been working with the vgteam on the variation graph toolkit, or vg. vg has grown nicely into a very functional software package for graph construction, read mapping, SNP/MNP calling, and visualization.

I've been working on methods for detecting structural variants using the graph, both novel SVs in the sample being analyzed and shared SVs known to exist in a population. I'm certainly not the only one, and at least two other variant callers exist in vg already. Currently I'm testing a method for genotyping known indels of any size, but I'd like to extend this to other, more complex forms of structural variation as well as novel variant detection.

SV callers tend to struggle to call insertions, mostly because their sequence is often not in the reference. We can get around this today using long reads, but we might still want to call insertions/deletions against an existing reference and we may not want to sequence every sample on a long read platform.

As a proof of concept I implemented a simple variant caller for variants present in the graph, which avoids some of the difficulties inherent in calling variants on a graph [1]. I call the process variant recall here, though typing or regenotyping would also be accurate names; the interface is accessed through the vg genotype CLI.

Here's an example of how to run the pipeline, some quick results on simulated data, and a description of how it works on real yeast data. Right now it's restricted to insertions and deletions (i.e. ignoring inversions and copy-number changes), but I'm working on extending the caller to include these as time goes on. I'll be presenting this work at AACR in early April.

If you want to run the following, you'll need vg, bcftools, samtools, and sambamba set up, plus a modern version of Python with PySam installed to run lumpyexpress. You'll also need delly (or its prebuilt binary) and lumpy, as well as my ggsv repo for the scripts I refer to; make sure the paths in those scripts match your setup. Once that's done, we can begin.

Constructing, indexing, mapping and recalling

We'll first construct a graph using a FASTA file containing our reference, another containing the sequences of any insertions we have, and a corresponding VCF containing the variants we want to type:

vg construct -S -f -a -I my_insertions.fa -r my_reference.fa -v my_variants.vcf > graph.vg

The -S flag tells vg to include structural variants when building our graph. -I tells vg which FASTA file contains our insertion sequences. -f and -a tell vg to use flat alternate alleles (i.e. they are pre-aligned to the reference) and to store the alleles of each variant as paths in the graph, which lets us access them later.
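To make these inputs concrete, here's a toy set of files of the shape that vg construct -S -f -a -I consumes. The exact INFO tags (SVTYPE, END, SVLEN) and the convention of matching VCF record IDs to insertion FASTA names are my assumptions for illustration, so double-check them against the vg documentation for your version:

```shell
# Toy reference: one 60 bp contig (made-up sequence, for illustration only).
cat > my_reference.fa <<'EOF'
>chr1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
EOF

# Insertion sequences; the FASTA IDs are what the VCF records refer to.
cat > my_insertions.fa <<'EOF'
>ins1
TTTTTTTTTTTTTTTTTTTT
EOF

# Minimal VCF: one 20 bp insertion and one 15 bp deletion (tab-separated).
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=chr1,length=60>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t9\tins1\tA\t<INS>\t.\tPASS\tSVTYPE=INS;END=9;SVLEN=20\n'
  printf 'chr1\t29\tdel1\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=44;SVLEN=-15\n'
} > my_variants.vcf
```

With files of this shape, the construct command above has everything it needs; again, verify the INFO tags and ID-matching convention against your vg version before relying on them.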

We can then index our graph, map our reads, and recall our variants.

vg index -x graph.xg -g graph.gcsa -k 11 graph.vg
vg map -t 16 -f reads1.fq -f reads2.fq -x graph.xg -g graph.gcsa > reads.gam
vg genotype -F my_reference.fa -I my_insertions.fa -V my_variants.vcf -G reads.gam graph.vg x > recall.vcf

The x is an empty placeholder for a GAM index, which we skip creating because it's expensive to build (though it does speed up calling for large GAMs). Other than that, we've created all of our inputs as we went along.

Performance on simulated data

What's the performance like? To test it out I simulated a ~1.0 megabase genome and 1000 non-overlapping insertion/deletion/inversion variants between 10 and 1500 basepairs using the scripts in ggsv. You can run the full analysis with ./mega_sim.sh, or use the commands below individually if you prefer.

python gen_big_descrip.py 1000 > big_descrip.descrip.txt
## You'll need to do a "tail -n 1 big_descrip.descrip.txt" to 
## determine the necessary size of the simulated reference genome (third column; add 2000 to avoid the coverage fluctuations at the end of the genome);
## here we'll assume it's a nice 1000000 (1 megabase)
python sim_random_data.py -g 1000000 -d big_descrip.descrip.txt > mod.fa
## Convert our descrip to a vcf
python descrip_to_vcf.py big_descrip.descrip.txt > big.vcf
## we also get a file, "Random.original.fa," that contains our unmodified genome.
## Grab just our modified genome (without insertion seqs, for read simulation)
head -n 2 mod.fa > coregenome.mod.fa
## construct our graph, then index it
vg construct -r Random.original.fa -v big.vcf -S -a -f -I mod.fa > graph.vg
vg index -x graph.xg -g graph.gcsa -k 11 graph.vg
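One caveat on the "head -n 2" step above: it only grabs the whole modified genome if each FASTA record occupies exactly two lines. For line-wrapped FASTA, an awk one-liner that stops at the second header is safer (samtools faidx would also do the job); here's a demo on a toy two-record file:

```shell
# Toy two-record FASTA with a wrapped first record.
printf '>mod\nACGT\nACGT\n>ins1\nTTTT\n' > demo.fa

# Keep everything up to (but not including) the second header line.
awk '/^>/{n++} n==2{exit} {print}' demo.fa
# >mod
# ACGT
# ACGT
```

For the pipeline above, the equivalent would be `awk '/^>/{n++} n==2{exit} {print}' mod.fa > coregenome.mod.fa`.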

I then simulated reads with ART:

art_illumina -ss HS25 -p -na -i coregenome.mod.fa -f 20 -l 125 -m 700 -s 50 -o alt
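As a sanity check on the simulation depth: at fold coverage f, read length l, and genome size g, ART should emit roughly g × f / (2l) read pairs, so about 80,000 pairs (160,000 reads) here:

```shell
# Expected read pairs = genome_size * coverage / (2 * read_length)
echo $(( 1000000 * 20 / (2 * 125) ))   # 80000
```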

Now we have a bunch of paired-end reads from our homozygous ALT simulated genome, labeled alt1.fq and alt2.fq. We'll use these to compare to Delly and lumpy-sv + svtyper.

## Map reads
vg map -t 4 -x graph.xg -g graph.gcsa -f alt1.fq -f alt2.fq > mapped.gam
## Map reads to linear ref with BWA
bwa index Random.original.fa
bwa mem -t 4 -R"@RG\tID:rg\tSM:rg" Random.original.fa alt1.fq alt2.fq | samtools view -bSh - > alt.bam
## Generates alt.sorted.bam and a BAI index.
sambamba sort alt.bam

Now we'll run our various calling pipelines:

## vg
vg genotype -F Random.original.fa -I mod.fa -V big.vcf -G mapped.gam graph.vg x > recall.vcf

## Next up is Delly: deletions, insertions, inversions, and regenotyping, in that order.
delly call -t DEL -g Random.original.fa -o delly.del.bcf alt.sorted.bam
delly call -t INS -g Random.original.fa -o delly.ins.bcf alt.sorted.bam
delly call -t INV -g Random.original.fa -o delly.inv.bcf alt.sorted.bam
delly call -v big.vcf -g Random.original.fa -o delly.regeno.vcf alt.sorted.bam

## We need to do a little more prep for lumpy
## Use the encapsulated extract script from ggsv
extract.sh alt.sorted.bam
## Exclude proper pairs, unmapped reads/mates, secondary alignments, and duplicates
## (flag mask 1294 = 2+4+8+256+1024) to keep just the discordant pairs, then sort.
sambamba view -h --num-filter /1294 alt.sorted.bam | samtools sort -@ 2 -o alt.discords.sorted.bam -
lumpyexpress -B alt.sorted.bam -D alt.discords.sorted.bam -S alt.splitters.sorted.bam -o alt.lumpy.vcf

## I couldn't get my VCF to play nice with svtyper, so I just used Lumpy's, meaning this isn't a very fair comparison.
svtyper -i alt.lumpy.vcf -o alt.svtyper.vcf

I've wrapped the above commands into the mega_sim.sh script in ggsv; to run it, just do "./mega_sim.sh", assuming your paths are set up correctly. It will occasionally fail because a simulated variant falls off the end of the genome; just run it again if this occurs.

Let's look at the number of calls and timings for each of the above commands. I've taken times from mega_sim.sh on a 4-core (2.7 GHz) desktop with 32 GB of RAM. Remember, our calls should all be "1/1" since our reads come from a simulated homozygous ALT genome.

## total time for vg genotype: ~1.5 seconds 
## BWA mapping: ~7 seconds
## vg mapping: ~50 seconds - 1 min
## Time to run delly:
## DEL: 13s INS: 11s regenotype: 2s
## Time to run lumpy + svtyper: 2.4s + 4.3s

We could review our calls using the following commands:

## How many insertions should we have?
grep -c "INS" big.vcf 
grep "INS" recall.vcf | grep -c "1/1" 

## Deletions:
grep -c "DEL" big.vcf 
grep "DEL" recall.vcf | grep -c "1/1" 

## Inversions:
grep -c "INV" big.vcf 
grep "INV" recall.vcf | grep -c "1/1"

But I've gone ahead and wrapped that in the assess_calls.sh script, so we'll use that. How does the performance of lumpy / delly compare? NB: This is a rough comparison. We'll just count the number of homozygous alt calls for DEL/INV/INS as above, though it's probable that some of our variants are called with just breakends or might be in the wrong positions.
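The tallies that assess_calls.sh prints come down to grouping calls by SVTYPE and genotype. Here's a minimal awk sketch of that grouping on made-up records (in these toy lines, column 8 holds INFO and column 10 holds the sample genotype):

```shell
# Toy recall output (tab-separated VCF body; headers omitted for brevity).
printf 'chr1\t10\t.\tA\t<INS>\t.\tPASS\tSVTYPE=INS\tGT\t1/1\n'  > toy_calls.vcf
printf 'chr1\t30\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t1/1\n' >> toy_calls.vcf
printf 'chr1\t50\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t0/1\n' >> toy_calls.vcf

# Count hom-alt calls per SV type.
awk -F'\t' '!/^#/ {
    type = ($8 ~ /SVTYPE=INS/) ? "INS" : ($8 ~ /SVTYPE=DEL/) ? "DEL" : "OTHER"
    if ($10 ~ /^1\/1/) homalt[type]++
    total[type]++
}
END { for (t in total) print t, homalt[t] "/" total[t], "hom alt" }' toy_calls.vcf | sort
# DEL 1/2 hom alt
# INS 1/1 hom alt
```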

./assess_calls.sh
OG vcf has 1000 variants
333 INS
339 DEL
328 INV

VG recall will spit back all variants, but how well does it type them?
Of 333 insertions, 333 are labeled hom alt.
Of 339 deletions, 339 are labeled hom alt.
Of 328 inversions, 0 are labeled hom alt.

DELLY INS calls 11 variants
11 INS, 11 homozygous alt.

DELLY DEL calls 329 variants
329 DELs, 328 homozygous alt.

[E::hts_open] fail to open file 'alt.inv.sv.delly.bcf'
Failed to open alt.inv.sv.delly.bcf: No such file or directory
DELLY INV calls 0 variants
0 INV, 0 homozygous alt.

Delly's regenotype calls 339 deletions, 0 are hom alt.
Delly's regenotype calls 0 insertions, 0 are hom alt.

Lumpy calls 455 variants, and SVTYPER spits out 455
This many insertions: 0, and 0 hom alts
This many deletions: 328, and 318 hom alts
This many inversions: 17, and 6 hom alts

In short, lumpy and delly remain very capable variant callers for deletions and inversions. vg's pipeline is a bit quicker at the calling step on this test set, but that advantage is lost in mapping. We are far more sensitive to insertions than either lumpy or delly with the parameters used, though it will require further tuning to see whether this holds.

"Correctly" called genotypes (1/1) from our simulated homozygous alt genome for each caller. Calls that are not 1/1 are in low alpha, while correct calls are in high alpha. We see that no caller is very good at calling inversions, vg has a significant advantage at calling insertions, and all callers perform similarly for deletions. NB: these calls aren't checked for accurate location, just number and genotype, so it's possible lumpy/delly may have false positives in this set.

One cool thing to note about vg: if the insertion sequences are unique, we can use them as evidence during calling. Check out this plot of insertion size vs. the number of supporting reads:

Using the insertion sequence as evidence provides many more reads than if we base our calls solely on discordant pairs / split reads.

Runtimes (single-threaded) for the various pipelines and their stages on our small test set (1Mbp, 20X coverage, 1000 variants). We see that vg's mapper takes 5-8X longer than BWA MEM but that calling is much faster. SVTYPER provides the shortest overall runtime by almost a factor of two, however it does not call insertions and miscalls several deletions. Delly must be run in multiple passes, one for each variant type (INS, DEL, INV, etc.). If we were only interested in one variant type its numbers would be more comparable to SVTYPER.

Using recall to validate SVs called using yeast PacBio assemblies

Yue et al published a nice paper on comparative yeast genomics in 2016 in which they provide PacBio-based assemblies as well as Illumina reads for 12 yeast strains. They call a bunch of SVs using Assemblytics, a pipeline from Maria Nattestad in the Schatz lab. I reran their Assemblytics pipeline (which I wrapped to output pseudo-VCF files, rather than BEDs) and collected a bunch of insertions/deletions relative to the SGD2010 reference. I filtered these to keep only variants longer than 20bp, sorted them, and created a variant graph from this list of deletions/insertions and the SGD2010 reference:

vg construct -r SGD2010.fa -v yue.vcf -S -a -f -I yue_insertion_sequences.fa > yue.vg
vg index -x yue.xg -g yue.gcsa -k 11 yue.vg

I'm just fudging commands here, but I'm happy to send my scripts if anyone wants. The data is publicly available here.

Then, I mapped the Illumina reads for each strain against this pangenome graph and ran the recall pipeline on each GAM:

for i in $yue_folders
    do
        vg map -t 4 -x yue.xg -g yue.gcsa -f $i/reads1.fq -f $i/reads2.fq > $i/mapped.gam
        vg genotype -F SGD2010.fa -I yue_insertion_sequences.fa -V yue.vcf -G $i/mapped.gam > $i/recall.vcf
    done
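Per-strain summaries like the counts below can be pulled straight out of each recall.vcf. Here's a sketch over a made-up folder layout (strain_A and its toy records are placeholders, not the Yue accessions):

```shell
# Toy per-strain recall output (hypothetical folder layout).
mkdir -p strain_A
printf 'chr1\t10\t.\tA\t<INS>\t.\tPASS\tSVTYPE=INS\tGT\t1/1\n'  > strain_A/recall.vcf
printf 'chr1\t30\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t0/1\n' >> strain_A/recall.vcf

# One line per strain: hom-alt insertion and deletion counts.
for f in strain_*/recall.vcf; do
    ins=$(grep "SVTYPE=INS" "$f" | grep -c "1/1")
    del=$(grep "SVTYPE=DEL" "$f" | grep -c "1/1")
    echo "$(dirname "$f") INS:$ins DEL:$del"
done
# strain_A INS:1 DEL:0
```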

We find 1086 SVs above 20bp in the Yue PacBio assemblies using Assemblytics, with the above size distribution. 300bp is about the size we would expect for mobile element insertions.

We get the same order of het / hom alt calls in our yeast data. We see the majority of sites are called hom ref, with a much lower number of non-genotyped calls. From this it looks to me like most of our variants have evidence in the reads (i.e. we don't see any massive deletions, for example, that might disrupt our ability to map reads to specific variants).

Most strains have only a fraction of the ALT variants, and the REF strain assembly has relatively few variants relative to the reference compared to the other strains.

We see between 25 and 100 hom. alt. variants per yeast strain, with numbers split roughly evenly between insertions and deletions:

The "REF" strain is SRR4255, which is the same strain as the reference.

We find support for the alternate allele (a 0/1 or 1/1 genotype call) in our Illumina reads for about 60% of our PacBio-derived deletions and 30% of our PacBio-derived insertions.

It seems longer variants are more likely to be supported than those that are shorter. This could be due to read mismapping, or it could be that the Assemblytics pipeline is calling relatively small false positive variants.
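One way to eyeball that length effect is to dump variant length next to genotype and plot the two. This sketch assumes an SVLEN INFO tag like the one in my simulated VCFs, demonstrated on toy records:

```shell
# Toy calls with SVLEN tags (hypothetical values).
printf 'chr1\t10\tins1\tA\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=300\tGT\t1/1\n'  > toy_recall.vcf
printf 'chr1\t30\tdel1\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-15\tGT\t0/0\n' >> toy_recall.vcf

# Emit "length genotype" pairs for plotting size vs. call support.
awk -F'\t' '!/^#/ && match($8, /SVLEN=-?[0-9]+/) {
    len = substr($8, RSTART + 6, RLENGTH - 6) + 0
    split($10, gt, ":")
    print (len < 0 ? -len : len), gt[1]
}' toy_recall.vcf
# 300 1/1
# 15 0/0
```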

Current work

I'm currently trying to extend our codebase (and our comparisons to other tools) to include inversions. We need to do some major reengineering to handle duplications the right way, but that's also in the works. Also, we have a ton of caveats to running this pipeline (insertions must differ significantly in their sequence, variant alleles must be flattened before construction, no flanking context is used, variants must be biallelic...) that I'm still working through. Finally, I need to validate my results for yeast SV calling against those from Yue et al; I've been in touch with the authors but we haven't had a chance to compare things just yet.

Conclusions:

We can accurately genotype simulated insertions and deletions using the current vg recall pipeline. The pipeline also looks like it is useful on real data. Calling is ~5-10X faster than Delly or Lumpy, but mapping is 5-10X slower than BWA mem and constitutes the majority of runtime currently.

Footnotes:

  1. Graph coordinates don't always translate well to and from VCF, so we might get back variants that are slightly different (e.g. off by one) from those we put in. Our current pipeline for detecting novel variants requires inserting all paths in the reads into the graph, which is computationally expensive.
  2. Currently, the recall pipeline can't handle overlapping variants, which are awkward in VCF anyway. It doesn't do repeats (though it will one day). The genotyping function also fails when too many reads support a given allele, as the likelihood function calculates a factorial which overflows integer/double bounds somewhere above 100!. Also, the model assumes diploidy and biallelic variants, but we plan to extend this at some point as well. Lastly, we don't use the GAM index as I got tired of dealing with them; this means our runtime is linear in the size of the GAM, rather than a large constant (based on RocksDB query performance).
  3. Citations due: I used the genotype likelihood function from SpeedSeq, another awesome piece of software from Chiang et al in Gabor Marth's lab. vg is the work of Erik Garrison, Jouni Siren, Adam Novak, Glenn Hickey, Jordan Eizinga, myself, Will Jones, Orion Buske, Mike Lin, Benedict Paten, and Richard Durbin; it's in prep for submission right now. Lumpy is from Ryan Layer et al, published in Genome Biology 2014. Delly is from Tobias Rausch et al, published in Bioinformatics in 2012. Richard Durbin and Stephen Chanock advised me on this project, and Heng Li contributed a lot from a single meeting up at the Broad in November. Simon Tavare had valuable input during my first-year viva in October.