Alignment Compare webapp

Variant Compare webapp

For more information contact us at

Publication [Open access]

If you use VarSim in your work, please cite the following:
John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam
VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications
Bioinformatics first published online December 17, 2014 doi:10.1093/bioinformatics/btu828

Data Access

Pre-generated reads and variants are avaliable on Amazon S3. It is in a requester-pays bucket so the requester must pay for the data transmission costs outside of S3. Currently the cost Amazon lists is 0.09 $/GB. You will also need an AWS account to access the data.

Bucket URI: s3://varsim

Instructions on how to access the data

  1. Create an account at
  2. Download an AWS S3 client, this example will use
  3. It must be the beta version of s3cmd (1.53+), if you use pip, use the command pip -vvv install s3cmd --pre
  4. Configure your s3cmd with s3cmd configure, get authentication details from then "access keys"
  5. List files with the command s3cmd ls --add-header=x-amz-request-payer:requester s3://varsim/*
  6. Download files with the command s3cmd get --add-header=x-amz-request-payer:requester s3://varsim/[file that you want]

There is a readme.txt in the root directory with details of the data.

Download VarSim

Latest pre-built version:

For source code, see "releases".

VarSim on PrecisionFDA

VarSim is also available as an app on the PrecisionFDA portal. If you have an account, please check out this note

System Requirements

For variant simulation and read generation/simulation:

For alignment and variant calling validation:

Building VarSim

VarSim uses maven to build some of the tools. Please make sure maven is installed on your system.

To build, execute mvn package

Running VarSim

Type ./ -h for help. is a helper script for the somatic workflow. This is in its early stages. Please contact us if you have trouble with it

Quick Start Guide for Germline Simulation

This quick start guide will provide steps for generating a random genome with pre-specified and random variants. Then generate reads from this genome with ART. Finally, results of analysis on the output of secondary analysis is plotted.

Step 1: Create a directory varsim_run

Download varsim to varsim_run
tar xfz varsim-0.7.1.tar.gz

Step 2: Download the following files to varsim_run:

  1. Reference genome in FASTA format [hs37d5.fa.gz]
  2. Concatented insert sequences as [insert_seq.txt]
  3. DGV annotations [GRCh37_hg19_supportingvariants_2013-07-23.txt]
  4. dbSNP 141 file [All.vcf.gz]
  5. (optional) Extra variants to include in simulation variants.vcf

by running the following commands

gunzip hs37d5.fa.gz

Step 3: Download Samtools to index the reference genome

mkdir samtools
cd samtools
tar xfj samtools-bcftools-htslib-1.0_x64-linux.tar.bz2
cd ..
samtools/samtools-bcftools-htslib-1.0_x64-linux/bin/samtools faidx hs37d5.fa

Step 4: Dowload ART to varsim_run under an the directory ART by running

mkdir ART
cd ART
tar xfz artbinvanillaicecream031114linux64tgz.tgz
cd ..

The structure of varsim_run should look like

- All.vcf.gz  
- \ART  
- - \art_bin_VanillaIceCream  
- - - ART files ...
- - artbinvanillaicecream031114linux64tgz.tgz
- GRCh37_hg19_supportingvariants_2013-07-23.txt	
- hs37d5.fa  
- hs37d5.fa.fai  
- insert_seq.txt  
- \samtools  
- - samtools files...
- VarSim.jar
- varsim-0.7.1.tar.gz

Step 5: Run the following command to generate the simulated genome and reads to 30x depth. Replace the values in square brackets with the appropriate values. This will take a few hours to run. The last --vcfs option is optional and only required if you want to add additional variants to the simluation. It is recommended to run GATK LeftAlignAndTrimVariants on any VCF file that is used as input.

./ --vc_in_vcf All.vcf.gz --sv_insert_seq insert_seq.txt \
--sv_dgv GRCh37_hg19_supportingvariants_2013-07-23.txt \
--reference hs37d5.fa --id simu --read_length 100 --vc_num_snp 3000000 --vc_num_ins 100000 \
--vc_num_del 100000 --vc_num_mnp 50000 --vc_num_complex 50000 --sv_num_ins 2000 \
--sv_num_del 2000 --sv_num_dup 200 --sv_num_inv 1000 --sv_percent_novel 0.01 \
--vc_percent_novel 0.01 --mean_fragment_size 350 --sd_fragment_size 50 \
--vc_min_length_lim 0 --vc_max_length_lim 49 --sv_min_length_lim 50 \
--sv_max_length_lim 1000000 --nlanes 5 --total_coverage 1 \
--simulator_executable ART/art_bin_VanillaIceCream/art_illumina --out_dir out --log_dir log --work_dir work \
--simulator art  \
--vcfs [Optional VCF file to include, variants.vcf]

The reads will be generated in the out directory. A script is also provided which runs the above steps and generates reads for 1X coverage. Run the secondary analysis tools (alignment and variant calling) on those. The ground truth VCF file is also in the out directory called simu.truth.vcf

Step 6: After running the alignment and variant calling we can evaluate the results. In order to validate the variants run the following command:

java -jar VarSim.jar vcfcompare -true_vcf simu.truth.vcf -prefix simu [VCF from result of secondary analysis] 

This will output a JSON file that can be used as input to the VCF Compare webapp []

Step 7: In order to validate the alignments run the following command:

java -jar VarSim.jar samcompare -prefix simu [BAM files from result of secondary analysis]

This will output a JSON file that can be used as input to the Alignment Compare webapp []

Quick Start Guide for Somatic Simulation [in writing]

Step 1: Make sure you have already gone through the Quick Start Guide for germline simulation

Step 2: Currently VarSim can only simulate random sets of variants from the COSMIC database. This limitation will be addressed in a future version. Hence, this step will be to acquire the COSMIC database VCF.

Download two two cosmic VCF files from . CosmicCodingMuts.vcf.gz and CosmicNonCodingVariants.vcf.gz

After downloading the 2 cosmic VCFs to the directory varsim_run run the following command:

        cat <(gzip -dc CosmicCodingMuts.vcf.gz) <(gzip -dc CosmicNonCodingVariants.vcf.gz) | gzip -c > cosmic.vcf.gz

This will make a concatentated VCF. Although this is not a valid VCF file, VarSim is ok with it. You can also provide your own COSMIC VCF file too. Just make sure it is a single file. VarSim uses the "COS" in the cosmic SNP annotations to separate them out. This will change in future versions.

Step 3: Simulate tumor reads and variants by running the following command: --reference hs37d5.fa --id cosmic --som_num_snp 10000 \
        --som_num_ins 2000 --som_num_del 2000 \
        --som_num_mnp 200 \
        --som_num_complex 200 \
        --cosmic_vcf cosmic.vcf.gz \
        --normal_vcf out/simu.truth.vcf \
        --nlanes 5 --total_coverage 1 \
        --simulator art \
        --simulator_executable ART/art_bin_VanillaIceCream/art_illumina \
        --out_dir som_out --log_dir som_log --work_dir som_work &> somatic.log

Step 4 [optional]: Mix in some normal reads into the tumor. In order to roughly simulate normal contamination you can take advantage of the multiple lanes easily mix the normal and tumor reads.

For example, in order to simulate reads with 50x coverage of a normal sample and 50x coverage of a tumor sample with tumor allele frequency of 0.3. One possible way to achieve this is to simulate 7 lanes of normal sample at 70x coverage and 3 lanes of tumor sample at 30x coverage. Take the first 5 lanes of the simulated normal sample as the pure normal. Then take the remaining two lanes of normal with all the simulated tumor samples lanes as the contaminated tumor.

Step 5: Run somatic analysis pipeline with the simulated reads

Step 6: Validate alignments and variants called in the same way as before for the germline case. The truth VCF is in the file som_out/cosmic_somatic.vcf.

Command Line Options

	VarSim: A high-fidelity simulation validation framework

optional arguments:
  -h, --help            show this help message and exit
  --out_dir DIR         Output directory for the simulated genome, reads and
                        variants (default: out)
  --work_dir DIR        Work directory, currently not used (default: work)
  --log_dir DIR         Log files of all steps are kept here (default: log)
  --reference FASTA     Reference genome that variants will be inserted into
                        (default: None)
  --seed seed           Random number seed for reproducibility (default: 0)
  --sex Sex             Sex of the person (MALE/FEMALE) (default: MALE)
  --id ID               Sample ID to be put in output VCF file (default: None)
  --simulator SIMULATOR
                        Read simulator to use (default: art)
  --varsim_jar PATH     Path to VarSim.jar (default: /net/kodiak/volumes/river
  --read_length LENGTH  Length of read to simulate (default: 100)
  --nlanes INTEGER      Number of lanes to generate, coverage will be divided
                        evenly over the lanes. Simulation is parallized over
                        lanes. Each lane will have its own pair of files
                        (default: 1)
  --total_coverage FLOAT
                        Total coverage to simulate (default: 1.0)
  --mean_fragment_size FLOAT
                        Mean fragment size to simulate (default: 350)
  --sd_fragment_size FLOAT
                        Standard deviation of fragment size to simulate
                        (default: 50)
  --vcfs VCF [VCF ...]  Addtional list of VCFs to insert into genome, priority
                        is lowest ... highest (default: [])
                        Force output bases to be only ACTGN (default: False)
  --filter              Only use PASS variants for simulation (default: False)
  --keep_temp           Keep temporary files after simulation (default: False)

Pipeline control options. Disable parts of the pipeline.:
  --disable_rand_vcf    Disable sampling from the provided small variant VCF
                        (default: False)
  --disable_rand_dgv    Disable sampline from the provided DGV file (default:
                        Disable diploid genome simulation (default: False)
  --disable_sim         Disable read simulation (default: False)

Small variant simulation options:
  --vc_num_snp INTEGER  Number of SNPs to sample from small variant VCF
                        (default: 0)
  --vc_num_ins INTEGER  Number of insertions to sample from small variant VCF
                        (default: 0)
  --vc_num_del INTEGER  Number of deletions to sample from small variant VCF
                        (default: 0)
  --vc_num_mnp INTEGER  Number of MNPs to sample from small variant VCF
                        (default: 0)
  --vc_num_complex INTEGER
                        Number of complex variants to sample from small
                        variant VCF (default: 0)
  --vc_percent_novel FLOAT
                        Percent variants sampled from small variant VCF that
                        will be moved to novel positions (default: 0)
  --vc_min_length_lim INTEGER
                        Min length of small variant to accept [inclusive]
                        (default: 0)
  --vc_max_length_lim INTEGER
                        Max length of small variant to accept [inclusive]
                        (default: 99)
  --vc_in_vcf VCF       Input small variant VCF, usually dbSNP (default: None)
  --vc_prop_het FLOAT   Proportion of heterozygous small variants (default:

Structural variant simulation options:
  --sv_num_ins INTEGER  Number of insertions to sample from DGV (default: 20)
  --sv_num_del INTEGER  Number of deletions to sample from DGV (default: 20)
  --sv_num_dup INTEGER  Number of duplications to sample from DGV (default:
  --sv_num_inv INTEGER  Number of inversions to sample from DGV (default: 20)
  --sv_percent_novel FLOAT
                        Percent variants sampled from DGV that will be moved
                        to novel positions (default: 0)
  --sv_min_length_lim min_length_lim
                        Min length of structural variant to accept [inclusive]
                        (default: 100)
  --sv_max_length_lim max_length_lim
                        Max length of structural variant to accept [inclusive]
                        (default: 1000000)
  --sv_insert_seq FILE  Path to file containing concatenation of real
                        insertion sequences (default: None)
  --sv_dgv DGV_FILE     DGV file containing structural variants (default:

DWGSIM options:
  --dwgsim_start_e first_base_error_rate
                        Error rate on the first base (default: 0.0001)
  --dwgsim_end_e last_base_error_rate
                        Error rate on the last base (default: 0.0015)
  --dwgsim_options DWGSIM_OPTIONS
                        DWGSIM command-line options (default: )

ART options:
  --profile_1 profile_file1
                        ART error profile for first end (default: None)
  --profile_2 profile_file2
                        ART error profile for second end (default: None)
  --art_options ART_OPTIONS
                        ART command-line options (default: )