Variant typing/recall with variation graphs

This past year I've been working with the vgteam on the variation graph toolkit, or vg. vg has grown nicely into a very functional software package for graph construction, read mapping, SNP/MNP calling, and visualization.

I've been working on methods for detecting structural variants using the graph, both novel SVs in the sample being analyzed and shared SVs known to exist in a population. I'm certainly not the only one, and at least two other variant callers exist in vg already. Currently I'm testing a method for genotyping known indels of any size, but I'd like to extend this to other, more complex forms of structural variation as well as novel variant detection.

SV callers tend to struggle to call insertions, mostly because their sequence is often not in the reference. We can get around this today using long reads, but we might still want to call insertions/deletions against an existing reference and we may not want to sequence every sample on a long read platform.

As a proof of concept I implemented a simple variant caller for variants present in the graph, which avoids some of the difficulties inherent in calling variants on a graph [1]. I call the process variant recall here, though typing or regenotyping would also be accurate names; the interface is accessed through the vg genotype CLI.

Here's an example of how to run the pipeline, some quick results on simulated data, and a description of how it works on real yeast data. Right now it's restricted to insertions and deletions (i.e. ignoring inversions and copy-number changes), but I'm working on extending the caller to include these as time goes on. I'll be presenting this work at AACR in early April.

If you want to run the following, you'll need to set up vg, bcftools, samtools, and sambamba, and have a modern version of Python with pysam installed to run lumpyexpress. You'll also want to install delly (or download the binary) and lumpy. Finally, download my ggsv repo to get the scripts I refer to, and make sure all the paths match between my scripts and your setup. Once that's done, we can begin.
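Before diving in, a quick sanity check along these lines will flag anything missing from your PATH (the tool list is just the one above; adjust to taste):

for tool in vg bcftools samtools sambamba lumpyexpress svtyper delly art_illumina python
do
    command -v $tool > /dev/null || echo "missing: $tool"
done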

Constructing, indexing, mapping and recalling

We'll first construct a graph using a FASTA file containing our reference, another containing the sequences of any insertions we have, and a corresponding VCF containing the variants we want to type:

vg construct -S -f -a -I my_insertions.fa -r my_reference.fa -v my_variants.vcf > graph.vg

The -S flag tells vg to include structural variants when building our graph. -I tells vg which FASTA file contains our insertion sequences. -f and -a tell vg to use flat alternate alleles (i.e. they are pre-aligned to the reference) and to store the alleles of each variant as paths in the graph, which allows us to access them later.

We can then index our graph, map our reads, and recall our variants.

vg index -x graph.xg -g graph.gcsa -k 11 graph.vg
vg map -t 16 -f reads1.fq -f reads2.fq -x graph.xg -g graph.gcsa > reads.gam
vg genotype -F my_reference.fa -I my_insertions.fa -V my_variants.vcf -G reads.gam graph.vg x > recall.vcf

The x is actually an empty placeholder for a GAM index, which we skip creating since it's expensive (though it does speed up the calling process for large GAMs). Other than that we've created all of our inputs as we went along.

Performance on simulated data

What's the performance like? To test it out I simulated a ~1.0 megabase genome and 1000 non-overlapping insertion/deletion/inversion variants between 10 and 1500 basepairs using the scripts in ggsv. You can run the full analysis with ./mega_sim.sh, or use the commands below individually if you so choose.

python gen_big_descrip.py 1000 > big_descrip.descrip.txt
## You'll need to do a "tail -n 1 big_descrip.descrip.txt" to 
## determine the necessary size of the simulated reference genome (third column; add 2000 to avoid the coverage fluctuations at the end of the genome);
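## For example, something like this prints the size to pass to -g below
## (assuming the third column of the last record is its end coordinate, as described above):
tail -n 1 big_descrip.descrip.txt | awk '{print $3 + 2000}'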
## here we'll assume it's a nice 1000000 (1 megabase)
python sim_random_data.py -g 1000000 -d big_descrip.descrip.txt > mod.fa
## Convert our descrip to a vcf
python descrip_to_vcf.py big_descrip.descrip.txt > big.vcf
## we also get a file, "Random.original.fa," that contains our unmodified genome.
## Grab just our modified genome (without insertion seqs, for read simulation)
head -n 2 mod.fa > coregenome.mod.fa
## construct our graph, then index it
vg construct -r Random.original.fa -v big.vcf -S -a -f -I mod.fa > graph.vg
vg index -x graph.xg -g graph.gcsa -k 11 graph.vg

I then simulated reads with ART:

art_illumina -ss HS25 -p -na -i coregenome.mod.fa -f 20 -l 125 -m 700 -s 50 -o alt

Now we have a bunch of paired-end reads from our homozygous ALT simulated genome, labeled alt1.fq and alt2.fq. We'll use these to compare to Delly and lumpy-sv + svtyper.

## Map reads
vg map -t 4 -x graph.xg -g graph.gcsa -f alt1.fq -f alt2.fq > mapped.gam
## Map reads to linear ref with BWA
bwa index Random.original.fa
bwa mem -t 4 -R"@RG\tID:rg\tSM:rg" Random.original.fa alt1.fq alt2.fq | samtools view -bSh - > alt.bam
## Generates alt.sorted.bam and a BAI index.
sambamba sort alt.bam

Now we'll run our various calling pipelines:

## vg
vg genotype -F Random.original.fa -I mod.fa -V big.vcf -G mapped.gam graph.vg x > recall.vcf

## Next up: Delly deletions, insertions, inversions, and regenotyping, in that order.
delly call -t DEL -g Random.original.fa -o delly.del.bcf alt.sorted.bam
delly call -t INS -g Random.original.fa -o delly.ins.bcf alt.sorted.bam
delly call -t INV -g Random.original.fa -o delly.inv.bcf alt.sorted.bam
delly call -v big.vcf -g Random.original.fa -o delly.regeno.vcf alt.sorted.bam

## We need to do a little more prep for lumpy
## Use the encapsulated extract script from ggsv
extract.sh alt.sorted.bam
sambamba view -h --num-filter /1294 alt.sorted.bam | samtools sort -@ 2 -o alt.discords.sorted.bam -
lumpyexpress -B alt.sorted.bam -D alt.discords.sorted.bam -S alt.splitters.sorted.bam -o alt.lumpy.vcf

## I couldn't get my VCF to play nice with svtyper, so I just used Lumpy's, meaning this isn't a very fair comparison.
svtyper -i alt.lumpy.vcf -o alt.svtyper.vcf

I've wrapped the above commands into the mega_sim.sh script in ggsv; to run it, just do ./mega_sim.sh, assuming paths are set up correctly. It will occasionally fail because a variant falls off the end of the genome; just run it again if this happens.

Let's look at the number of calls and timings for each of the above commands. I've taken times from mega_sim.sh on a 4-core (2.7 GHz) desktop with 32 GB of RAM. Remember, our calls should all be "1/1" since our reads are from a simulated homozygous ALT.

## total time for vg genotype: ~1.5 seconds 
## BWA mapping: ~7 seconds
## vg mapping: ~50 seconds - 1 min
## Time to run delly:
## DEL: 13s INS: 11s regenotype: 2s
## Time to run lumpy + svtyper: 2.4s + 4.3s

We could review our calls using the following commands:

## How many insertions should we have?
grep -c "INS" big.vcf 
grep "INS" recall.vcf | grep -c "1/1" 

## Deletions:
grep -c "DEL" big.vcf 
grep "DEL" recall.vcf | grep -c "1/1" 

## Inversions:
grep -c "INV" big.vcf
grep "INV" recall.vcf | grep -c "1/1"

But I've gone ahead and wrapped that in the assess_calls.sh script, so we'll use that. How does the performance of lumpy / delly compare? NB: This is a rough comparison. We'll just count the number of homozygous alt calls for DEL/INV/INS as above, though it's probable that some of our variants are called with just breakends or might be in the wrong positions.
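For reference, the counting inside assess_calls.sh boils down to a loop like the one below; this is an illustrative sketch rather than the script itself:

for type in INS DEL INV
do
    echo "$type in the truth set: $(grep -c $type big.vcf)"
    echo "$type called hom alt by vg: $(grep $type recall.vcf | grep -c '1/1')"
done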

./assess_calls.sh
OG vcf has 1000 variants
333 INS
339 DEL
328 INV

VG recall will spit back all variants, but how well does it type them?
Of 333 insertions, 333 are labeled hom alt.
Of 339 deletions, 339 are labeled hom alt.
Of 328 inversions, 0 are labeled hom alt.

DELLY INS calls 11 variants
11 INS, 11 homozygous alt.

DELLY DEL calls 329 variants
329 DELs, 328 homozygous alt.

[E::hts_open] fail to open file 'alt.inv.sv.delly.bcf'
Failed to open alt.inv.sv.delly.bcf: No such file or directory
DELLY INV calls 0 variants
0 INV, 0 homozygous alt.

Delly's regenotype calls 339 deletions, 0 are hom alt.
Delly's regenotype calls 0 insertions, 0 are hom alt.

Lumpy calls 455 variants, and SVTYPE spits out 455
This many insertions: 0, and 0 hom alts
This many deletions: 328, and 318 hom alts
This many inversions: 17, and 6 hom alts

In short, it seems that lumpy and delly are still very admirable variant callers for deletions and inversions. VG's pipeline is a bit quicker on this test set, but any advantage is lost in mapping. We are way more sensitive to insertions than either lumpy or delly based on the parameters used, but it will require further tuning to see if this holds.

"Correctly" called genotypes (1/1) from our simulated homozygous alt genome for each caller. Calls that are not 1/1 are in low alpha, while correct calls are in high alpha. We see that no caller is very good at calling inversions, vg has a significant advantage at calling insertions, and all callers perform similarly for deletions. NB: these calls aren't checked for accurate location, just number and genotype, so it's possible lumpy/delly may have false positives in this set.

One cool thing to note about vg: if the insertion sequences are unique, we can use them as evidence during calling. Check out this plot of insertion size vs. the number of supporting reads:

Using the insertion sequence as evidence provides many more reads than if we base our calls solely on discordant pairs / split reads.

Runtimes (single-threaded) for the various pipelines and their stages on our small test set (1Mbp, 20X coverage, 1000 variants). We see that vg's mapper takes 5-8X longer than BWA MEM, but that calling is much faster. SVTYPER provides the shortest overall runtime by almost a factor of two; however, it does not call insertions and miscalls several deletions. Delly must be run in multiple passes, one for each variant type (INS, DEL, INV, etc.). If we were only interested in one variant type, its numbers would be more comparable to SVTYPER's.

Using recall to validate SVs called using yeast PacBio assemblies

Yue et al. published a nice paper on comparative yeast genomics in 2016 in which they provide PacBio-based assemblies as well as Illumina reads for 12 yeast strains. They call a bunch of SVs using Assemblytics, a pipeline from Maria Nattestad in the Schatz lab. I reran their Assemblytics pipeline (which I wrapped to output pseudo-VCF files, rather than BEDs) and collected a bunch of insertions/deletions relative to the SGD2010 reference. I filtered these to keep only variants longer than 20bp, sorted them, and created a variation graph from this list of deletions/insertions and the SGD2010 reference:

vg construct -r SGD2010.fa -v yue.vcf -S -a -f -I yue_insertion_sequences.fa > yue.vg
vg index -x yue.xg -g yue.gcsa -k 11 yue.vg

I'm just fudging commands here, but I'm happy to send my scripts if anyone wants. The data is publicly available here.
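For example, the >20bp filter was along these lines (file names are placeholders, and this assumes the pseudo-VCFs carry SVLEN in their INFO fields and that you have a reasonably recent bcftools):

bcftools view -i 'ABS(INFO/SVLEN)>20' assemblytics_calls.vcf > yue.unsorted.vcf
bcftools sort yue.unsorted.vcf > yue.vcf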

Then, I mapped the Illumina reads for each strain against this pangenome graph and ran the recall pipeline on each GAM:

for i in $yue_folders
    do
        vg map -t 4 -x yue.xg -g yue.gcsa -f $i/reads1.fq -f $i/reads2.fq > $i/mapped.gam
        vg genotype -F SGD2010.fa -I yue_insertion_sequences.fa -V yue.vcf -G $i/mapped.gam yue.vg x > $i/recall.vcf
    done

We find 1086 SVs above 20bp in the Yue PacBio assemblies using Assemblytics, with the above size distribution. 300bp is about the size we would expect for mobile element insertions.

We get the same order of het / hom alt calls in our yeast data. We see the majority of sites are called hom ref, with a much lower number of non-genotyped calls. From this it looks to me like most of our variants have evidence in the reads (i.e. we don't see any massive deletions, for example, that might disrupt our ability to map reads to specific variants).

Most strains have only a fraction of the ALT variants, and the REF strain assembly has relatively few variants relative to the reference compared to the other strains.

We see between 25 and 100 hom. alt. variants per yeast strain, with numbers split roughly evenly between insertions and deletions:

The "REF" strain is SRR4255, which is the same strain as the reference.

We find support for the alternate allele (a 0/1 or 1/1 genotype call) in our Illumina reads for about 60% of our PacBio-derived deletions and 30% of our PacBio-derived insertions.

It seems longer variants are more likely to be supported than those that are shorter. This could be due to read mismapping, or it could be that the Assemblytics pipeline is calling relatively small false positive variants.

Current work

I'm currently trying to extend our codebase (and our comparisons to other tools) to include inversions. We need to do some major reengineering to handle duplications the right way, but that's also in the works. There are also a ton of caveats to running this pipeline (insertions must differ significantly in their sequence, variant alleles must be flattened before construction, no flanking context is used, variants must be biallelic...) that I'm still working through. Finally, I need to validate my results for yeast SV calling against those from Yue et al.; I've been in touch with the authors but we haven't had a chance to compare things just yet.

Conclusions:

We can accurately genotype simulated insertions and deletions using the current vg recall pipeline. The pipeline also looks like it is useful on real data. Calling is ~5-10X faster than Delly or Lumpy, but mapping is 5-10X slower than BWA mem and constitutes the majority of runtime currently.

Footnotes:

  1. Graph coordinates don't always translate well to and from VCF, so we might get back variants that are slightly different (e.g. off by one) from those we put in. Our current pipeline for detecting novel variants requires inserting all paths in the reads into the graph, which is computationally expensive.
  2. Currently, the recall pipeline can't handle overlapping variants, which are awkward in VCF anyway. It doesn't do repeats (though it will one day). The genotyping function also fails when too many reads support a given allele, as the likelihood function calculates a factorial which overflows integer/double bounds somewhere above 100!. Also, the model assumes diploidy and biallelic variants, but we plan to extend this at some point as well. Lastly, we don't use the GAM index as I got tired of dealing with it - this means our runtime is linear in the size of the GAM, rather than being a large constant (based on the RocksDB query performance).
  3. Citations due: I used the genotype likelihood function from SpeedSeq, another awesome piece of software from Chiang et al in Gabor Marth's lab. vg is the work of Erik Garrison, Jouni Siren, Adam Novak, Glenn Hickey, Jordan Eizenga, myself, Will Jones, Orion Buske, Mike Lin, Benedict Paten, and Richard Durbin; it's in prep for submission right now. Lumpy is from Ryan Layer et al, published in Genome Biology in 2014. Delly is from Tobias Rausch et al, published in Bioinformatics in 2012. Richard Durbin and Stephen Chanock advised me on this project, and Heng Li contributed a lot from a single meeting up at the Broad in November. Simon Tavare had valuable input during my first-year viva in October.

Installing TensorFlow

This is another quick post on installation difficulties and how to alleviate them. We're looking at TensorFlow as an ML solution for many of the things we are exploring with vg. It's awesome that it's free and open-source, and the community is growing by the day. However, installation isn't always a breeze.

I first tried to install tensorflow using pip a la Google's instructions (I already had python-dev and pip on my system):

sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl

This fails with an error saying that the wheel isn't supported on my platform. There's a simple workaround for this on StackOverflow, but it still wouldn't work for me. After updating pip, I tried the local install method referenced in the TensorFlow docs:

wget https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
sudo pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl

This seemed to work, but then when I cracked open python and tried import tensorflow as tf, I got another error, even though I'm on Ubuntu 14.04 and not Mac OS.
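If you hit the same thing, it's worth checking which protobuf the Python interpreter is actually picking up before rebuilding anything from source; something like:

python -c "import google.protobuf; print(google.protobuf.__version__)"
pip show protobuf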

The solution was to update my protobuf to the bleeding edge:

git clone --recursive https://github.com/google/protobuf.git
cd protobuf/
./autogen.sh
./configure --prefix=/usr
make -j 4
make check ## All tests passed here
sudo make install
sudo ldconfig ## Refresh the shared library cache so the new libs are found

At this point, I had installed the C++ version of protobuf and could compile things with protoc, but I still needed the python bindings.

## Still in protobuf dir
cd python/
python setup.py build
python setup.py test ## Fails ~ 2% of all tests
python setup.py install

And only then could I test TensorFlow and run the examples. If you're installing locally, all the instructions should be about the same, but you'll need to use ./configure --prefix=/your/install/dir and ensure that you add the relevant directories to LD_LIBRARY_PATH and LD_INCLUDE_PATH. Hopefully the next post is on doing something neat with TensorFlow now that I've got it installed!

Parallel Make Tips

I spent a lot of time fixing makefiles these past weeks. It seems there isn't much about debugging makefiles on the internet, so I'll place this here as a way to collate a bunch of StackOverflow posts.

VG has quite a few dependencies and lots of individual code modules, and a serial make build takes about 20 minutes. Travis CI builds are even worse, taking over 30 minutes at times (maybe something to do with virtualization performance?). Early on we had parallel builds working, but when I introduced vg deconstruct I inadvertently (and unknowingly) broke them. Our parallel builds would work for a while and fail out, forcing us to finish each one with a serial run.

Debugging

All of our issues came down to missing Make dependencies for various targets. To debug this, I went through each file and made sure that the #include lines matched the dependencies in the Makefile. I also had some ghost targets/dependencies, where I had misspelled a dependency and Make had never complained. Once I'd made sure all the includes were set as dependencies, I would kick off a parallel build and wait to see the dreaded #Error on the command line.

There has got to be a better way to do this...

But I haven't found it yet. Sometimes running make -n (dry run) would help, as I could see what was happening without all the debug messages from packages being built. I could probably also write a little BASH/Python to find the include/dependency discrepancies, but I've been distracted with other things.
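For what it's worth, even a quick one-liner gets most of the way there - dump the local includes per source file and compare them against the Makefile by eye (paths are illustrative):

for f in src/*.cpp src/*.hpp
do
    echo "== $f"
    grep '#include "' "$f"
done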

Telling Make what to make of Makefile lines

I kept getting an ambiguous warning that my recursive make invocations weren't being recognized as make processes, so they were being executed serially. I just added a + to each of those rules in the vg Makefile to fix this. Thanks again, StackOverflow!
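Concretely, the fix looks like this in a rule that kicks off a sub-make (directory and target names are made up, and recipe lines must be tab-indented):

deps/somelib/libsomething.a:
	+cd deps/somelib && $(MAKE)

The + prefix marks the line as a recursive make invocation, so MAKEFLAGS (including the jobserver) get passed down instead of the sub-build silently falling back to -j1.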

Ensure Make target is a file

I had originally used a dummy target with no corresponding file on disk, which meant Make could never tell that the build was complete. I think I'll avoid things like make all and stick to real file targets from now on. I even use hidden files for pre-build dependencies such as setting up folders (e.g. touching a file named .pre_build).
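In Makefile terms that pattern is just the following (names illustrative); other targets then list .pre_build as a prerequisite, and because it's a real file the setup only runs once:

.pre_build:
	mkdir -p bin lib obj
	touch .pre_build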

Build executable off of the library, not a crap ton of object files

I had originally patched up vg to build the executable from a ton of object files that were also bundled up into a library for others to use. This was pretty silly on my part. By making the executable depend on the library and the library depend on the object files, I made the build even quicker and ensured that the binary and library contained identical code. I should have done this in the first place but didn't yet know any better.
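A sketch of that dependency chain (illustrative names, not vg's actual Makefile; the bin/lib/obj directories come from a pre-build step like the one above):

SRCS := $(wildcard src/*.cpp)
OBJS := $(patsubst src/%.cpp,obj/%.o,$(filter-out src/main.cpp,$(SRCS)))

obj/%.o: src/%.cpp
	$(CXX) $(CXXFLAGS) -Iinclude -c -o $@ $<

lib/libmytool.a: $(OBJS)
	ar rcs $@ $^

bin/mytool: obj/main.o lib/libmytool.a
	$(CXX) -o $@ obj/main.o -Llib -lmytool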

Results

vg used to take 20 minutes to build in serial and up to ten to build in parallel. I'm consistently getting builds under four minutes with make -j 4, both on my virtual machine on a Macbook Pro and my quad-core desktop. Incremental builds are fixed again, and everyone is much happier.

Build Sanity in Bioinformatics Projects

package_name
    README.md
    Makefile AND/OR configure AND/OR install.sh
    |___bin
    |___lib
    |___include
    |___src
    |___doc
    |___test

I've been wrangling a new-to-me C++ project over the past few weeks, and I've realized that it is by no means the first bioinformatics package to use a non-canonical build format. I don't mean to pick on the field. Science will always come first, but there are good reasons for following solid design practices even if it's just a small piece of code for in-house use. You never know if a piece of code will become important enough to publish at a later date.

Let me back up and first explain what a canonical build format is. Most software packages use the same layout in their basic design, with a folder hierarchy to separate code, libraries, binaries, etc. This usually looks like the diagram this post opens with. bin contains any binaries once they are built; lib contains third-party libraries as well as any generated during the build; include contains all header files in the package; and src contains source code. That's it. The doc folder, while nice, is not strictly necessary. You may also want a test folder and additional top-level files if you're using a test harness. While this isn't a flat structure, it's about as easy as a hierarchical system could be. A good README contains instructions for building as well as a list of external dependencies to be installed, so that there should be no guesswork in building your code. Hiltmon has a great post on the topic of project structure that I've drawn from extensively in creating this post.

This is the organization of choice for most C++ projects and almost all modern Java projects. In fact many Java build systems, such as Maven, strongly enforce such structure. In Maven you have to modify the configuration of the build system to even create a different type of structure. For emphasis: you can't modify your directory structure without modifying the build system itself, usually a more difficult task than simply obliging Maven's demand for order. While that's a pretty intense requirement, after many years of dealing with both styles I'm beginning to understand why.

The vast majority of users will do the same ./configure; make; make install sequence the moment they get your code. If they see a configure script, they'll know to run that first. I'm guilty of regularly typing ./configure at the command line before realizing there isn't even a configure script inside the folder. In Java, there's a similar gravitation towards mvn package and other basic commands. Since users are accustomed to this, the setup is easy on the user. It behooves developers who want users to utilize their code to employ such a user-friendly structure. More users means more citations and (hopefully) good press.

Additionally, the directory structure above mimics other packages. It even mirrors Unix, with its /usr/local/* hierarchy for system-wide header files and libraries. This format works, and it provides a sort of common interface between packages. It makes linking against packages easy, and if you add a make install target you can even install to the system-level directories with just a handful of copy commands in your makefile. Plus, your package becomes nearly plug-and-play compatible with build systems like Gradle and Maven if it's in Java. For C/C++, you may have to toy with a NAR plugin, but it should still be possible to move to an advanced build system if your package is relatively simple and organized like so.
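That make install target really can be just a few copies (names and system paths here are illustrative; swap in whatever prefix you prefer):

install: bin/mytool lib/libmytool.a
	cp bin/mytool /usr/local/bin/
	cp lib/libmytool.a /usr/local/lib/
	cp include/*.hpp /usr/local/include/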

There are, I guess, some downsides to this structure, of course. Perhaps it's conceptually hard to grasp for certain packages. There's also the pain of migrating from an old organization scheme to a new one. Another major argument against this structure is that it isn't as easy to include header files in your code, but this is easily remedied using a mixture of environment variables (LD_LIBRARY_PATH, LD_INCLUDE_PATH, CPATH, C_INCLUDE_PATH, and LIBRARY_PATH), good old GNU Make, and the -L, -I, and -l flags to GCC/ICC. You can even set these in a script that gets sourced from your Makefile to make sure they're set when the build kicks off. In all, I think the benefits of canon outweigh any negatives. Stick to a simple design like this, and your users (and fellow developers) will thank you.
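For example, compiling against and linking to a dependency that follows the same layout comes down to something like this (all paths and names are illustrative):

g++ -Iinclude -I/path/to/dep/include -c src/mytool.cpp -o obj/mytool.o
g++ -o bin/mytool obj/mytool.o -L/path/to/dep/lib -ldep
export LD_LIBRARY_PATH=/path/to/dep/lib:$LD_LIBRARY_PATH  ## only needed at runtime if libdep is shared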


The Beast of November: Conquering the NSF GRFP Application

The NSF GRFP is one of the highest-profile fellowships in the natural sciences. Tens of thousands of students apply each year and nearly 2,000 will be awarded. As I’ve had a lot of questions from others asking how to improve their proposals, I’d like to give a quick rundown of the NSF format and what to focus on in each section. This post in no way guarantees you’ll win an NSF. It is simply a collection of anecdotes and tidbits of advice I received prior to submitting my proposal last year. Nearly every piece of advice came from my PI at the time, who produces on average one NSF winner a year from his lab. Nonetheless, the selection process is very stochastic. Your mileage may vary, no implied warranty, etc.

The NSF has four essential parts: the research proposal, the personal statement, the recommendation letters, and your curriculum vitae. By this point in the application process your fate with regards to your letters and CV is sealed, so I’ll start with the parts you can still change.

The Research Proposal:

The research proposal should be, above all else, feasible. If you cannot demonstrate that you have sufficient background knowledge or that you can obtain it in your first year as a graduate student, your project is too ambitious. If your project cannot be completed in three to four years on an honest timeline and a modest budget, then it simply won’t get funded. Also, make sure your objectives are largely independent - if the first one fails, you should still be able to complete the other two.

Formatting matters more than you would think on the NSF. If anything, it ensures you don’t leave anything essential out. Here is a basic outline for a proposal:

Introduction: this will set up why you think this work is necessary.

Goals and background: what your long-term goals are, why you want to do this project, why it might be important, what your overarching hypothesis is, and why your selected graduate institution is a good place to do this research. Use a sentence that clearly states your goals - “My long term goal is to… My objective for this project is to…”

Research Approach: List your objectives, one by one as individual paragraphs. Use this format: Objective 1: One Sentence Objective. Hypothesis 1: One Sentence Hypothesis. And then describe in detail what you’re going to do.

Expected Outcomes and Intellectual Merit: Why are you doing this project and what is your goal or final product? How can it be extended? Why is it novel or superior compared to existing approaches?

References: cite your sources in a proper format, if necessary.

 

Many students get hung up on the idea that "hot" areas of science perhaps do better than others. There is really no way to say with confidence if a real bias exists, but my opinion is that it shouldn't matter. You should not write your proposal on a subject just because it is in vogue. You should write about what you are passionate about and what you understand (or want to understand). If you do this, and your proposal is well-written, you will do much better than writing a lesser proposal on a sexy topic.

Your proposal should have clear hypotheses and build experiments to test them. You should state these hypotheses explicitly after stating your aims so that the committee knows why you’re performing each aim. Spend a lot of time making sure the aims and the hypotheses naturally align - if an aim cannot answer its hypothesis, then you should make sure you’re answering the right question and using the right experimental approach.

You should tailor your proposal to the NSF’s goals. The NSF exists to explore basic principles of science. Proposals that are too applied to a specific field may fall outside the purview of the NSF. For example, a project that attempts to find new drug candidates would be a great proposal for the NIH; however, the NSF would probably decline to fund it and instead tell you to apply to the relevant organization. This means you cannot use a proposal for another application word-for-word. Tweak your scientific approach to answer questions of first principles.

The Personal Statement:

 For some reason it seems that students really struggle with the personal statement. This is unfortunate for many, as I think the personal statement is where the NSF is won or lost (see Broader Impacts). Creating a good proposal is largely objective and can be done with the help of your PI, but the personal statement is the product of the student alone. However, there are a few tenets that helped me guide my personal statement.

 First, use a simple format. I broke mine into two sections: “Personal Statement, Relevant Background and Future Goals” and “Broader Impacts.” The first section tells a coherent story (in roughly chronological order) of how I got interested in my subject, what my research experience so far is, and how I will use my previous experience to guide my future research and realize my long-term goals. The second details how I will help the scientific community. I think this is among the best ways to write the statement: broader impacts should be separate from the background to highlight it, and any more than two sections wastes space and looks clunky.

Second, avoid cliches. Thousands of students will pen the words, “I have been interested in [some scientific field] ever since I can remember.” It doesn’t matter if it’s true or not: you simply can’t stand out if you begin that way. Try something different. I am no expert here - I used a similar cliche/motif in my proposal, but I think it would have been stronger without it.

Third - avoid humblebrag. Make sure you set a professional tone that, while explaining your accomplishments, does not overly embellish them. The sentence “At TACC, I worked for two years in the life science group on projects related to plant genomics; during that time I contributed to three publications,” is fine; the sentence “As I was the only undergraduate at TACC, it was quite an accomplishment to be on three papers,” is humblebrag. It can be easy to accidentally use such prose, but doing so can significantly degrade the quality of your statement.

Fourth, try to focus on your progression as a scientist and how you arrived at where you are. You may leave out a significant portion of your story in doing so, and that’s okay. It’s especially important that you put your experience in context in this section - don’t talk about anything that doesn’t directly impact your development as a scientist. The NSF committee doesn’t need to know about your non-science related volunteer experiences in this section because you’ll connect them to your broader impacts and they’ll appear in your CV.

The broader impacts section is likely the one that wins the NSF. In this section you will describe how you plan to improve the scientific community. This is NOT where you talk about the scientific merit of your work - the NSF clarifies this every year and yet students keep misinterpreting what the committee desires. The NSF has a mandate to encourage participation in STEM fields, and they want to know how you will help to continue this mission. This is the section where you can talk about volunteer experience, travel, etc. Suggest a feasible community initiative and describe why you’re capable of getting it implemented. Describe how it will benefit the scientific community. Working with younger students is often a good starting place. I suggested that I would start a supercomputing team at UCSD. As I had already competed with one at UT for two years, I knew first-hand what made for a successful team. I also suggested that if the initiative were successful, I would attempt to start a similar initiative for high-school students. I also had a paragraph about helping coordinate Software Carpentry workshops that would help students from experimental backgrounds bridge into the computational side of biology, much as I had done. I included lots of specifics, which I think demonstrated that I thought my proposals were feasible.

The Letters of Recommendation:

Letters of recommendation should be stellar; simply put, a plain letter is a negative one. Make sure that the writers know you well. They should address your abilities, and you should make sure that each recommender has a draft of your proposal (or at least a short description) so that they can comment on it. If they show enthusiasm in their letter it will often carry over into the review process. A draft also gives the writer something to comment on if they do not know you well enough to comment on your abilities.

 

The CV:

    Your CV should be formatted simply, just like everything else. A single colored line or a highlighted name is about all the embellishment that you should have. The focus should be on describing your activities and accomplishments in a way that those unfamiliar with them can understand quickly.

    Order matters with your CV. A good order of sections is:

  1. Name and Contact Info

  2. Education, including current classes and GPA

  3. Honors and Awards

  4. Research Experience

  5. Publications

  6. Skills, Relevant volunteer activities, etc.

It might be okay to flip research experience and publications, but otherwise this is a safe order. You want to highlight the things that make you unique - if you put research experience first, the committee will have to dig to locate why you’re different from the other tens of thousands of applicants. If you have awards first, they can quickly see that others have recognized your potential already.

I will keep editing this page periodically as I think of relevant information. It was written rather quickly, so please excuse any errors (but please point them out in the comments). Best of luck and remember, no award defines you as a scientist - we are defined by the work we do and the contributions we make to knowledge and other people. If you aren't successful this year, it doesn't mean you won't succeed - to even be applying puts you among an elite group. Keep your chin up and your head in the books; you'll be just fine.