VarSim

For more information contact us at bina.rd@roche.com

Publication [Open access]

If you use VarSim in your work, please cite the following:

John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam

VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications

Bioinformatics first published online December 17, 2014 doi:10.1093/bioinformatics/btu828
        

Data Access

Pre-generated reads and variants are available on Amazon S3. The data is in a requester-pays bucket, so the requester must pay for the data transmission costs outside of S3. Currently the cost Amazon lists is 0.09 $/GB. You will also need an AWS account to access the data.

Bucket URI: s3://varsim
    

Instructions on how to access the data

Create an account at aws.amazon.com
Download an AWS S3 client. This example uses s3cmd (http://s3tools.org/s3cmd).
It must be the beta version of s3cmd (1.53+). If you use pip, install it with the command pip -vvv install s3cmd --pre
Configure s3cmd with s3cmd configure. Get the authentication details from https://console.aws.amazon.com/iam/home?#security_credential under "access keys".
List files with the command s3cmd ls --add-header=x-amz-request-payer:requester s3://varsim/*
Download files with the command s3cmd get --add-header=x-amz-request-payer:requester s3://varsim/[file that you want]

There is a readme.txt in the root directory with details of the data.
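The access steps above can be sketched in Python as follows. This is a minimal illustration that only assembles the s3cmd command lines shown in the instructions and estimates the transfer cost from the 0.09 $/GB rate quoted above. The helper names are hypothetical, and the rate may have changed since this was written.

```python
import shlex

# Header and bucket taken from the instructions above.
REQUESTER_PAYS_HEADER = "x-amz-request-payer:requester"
BUCKET = "s3://varsim"


def s3cmd_ls():
    """Return the argument list for listing the bucket (hypothetical helper)."""
    return ["s3cmd", "ls", f"--add-header={REQUESTER_PAYS_HEADER}", f"{BUCKET}/*"]


def s3cmd_get(key):
    """Return the argument list for downloading one file (hypothetical helper)."""
    return ["s3cmd", "get", f"--add-header={REQUESTER_PAYS_HEADER}", f"{BUCKET}/{key}"]


def transfer_cost_usd(size_gb, usd_per_gb=0.09):
    """Estimate the transfer cost; the default rate is the one quoted above."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return size_gb * usd_per_gb


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is downloaded.
    print(shlex.join(s3cmd_ls()))
    print(shlex.join(s3cmd_get("readme.txt")))
    print(f"Estimated cost for 100 GB: ${transfer_cost_usd(100):.2f}")
```

The argument lists can be passed to subprocess.run once s3cmd is configured.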

Download and Build VarSim

Git is now required to download VarSim.

git clone https://github.com/bioinform/varsim.git
cd varsim
./build.sh

Docker

Available at varsim. Alternatively, try out docker/test.sh in the VarSim repo.

VarSim on PrecisionFDA

VarSim is also available as an app on the PrecisionFDA portal. If you have an account, please check out this note

System Requirements

For variant simulation and read generation/simulation:

32GB free RAM
Enough free disk space to store twice the number of reads generated
ART or dwgsim installed
Currently only supports human genome simulation

For alignment and variant calling validation:

8GB free RAM

Running VarSim

Type ./varsim.py -h for help.

varsim_somatic.py is a helper script for the somatic workflow. It is in its early stages; please contact us if you have trouble with it.

Quick Start Guide for Germline Simulation

This quick start guide walks through generating a random genome with pre-specified and random variants, generating reads from this genome with ART, and finally plotting the results of validating the output of secondary analysis.

Step 1: Follow the Download and Build VarSim section to download and install VarSim and auxiliary programs.

Step 2: Run the following commands to generate the simulated genome and reads. This will take a few minutes to run.

cd tests/quickstart_test 
./quickstart.sh

The reads and the ground truth VCF will be generated in the out directory, i.e. out/lane*.fq.gz and out/simu.truth.vcf. By default, quickstart.sh generates very low coverage data; to increase coverage, adjust the --total_coverage option. Afterwards, an aligner and a variant caller can be run to generate variant calls.
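To get a feel for how much data a given --total_coverage implies, the usual coverage arithmetic can be sketched as below. This is a rough estimate under stated assumptions, not a description of how VarSim or ART compute read counts internally: it assumes paired-end reads of equal length and a user-supplied genome length, and the function name is hypothetical.

```python
import math


def read_pairs_per_lane(total_coverage, genome_length, read_length=100, nlanes=1):
    """Approximate read pairs per lane for paired-end sequencing.

    coverage = (pairs * 2 * read_length) / genome_length, solved for pairs
    and divided evenly over the lanes.
    """
    if min(total_coverage, genome_length, read_length, nlanes) <= 0:
        raise ValueError("all arguments must be positive")
    bases_needed = total_coverage * genome_length
    pairs_total = bases_needed / (2 * read_length)
    return math.ceil(pairs_total / nlanes)


if __name__ == "__main__":
    # Assumed genome length of 3 Gbp, used here only for illustration.
    print(read_pairs_per_lane(30.0, 3_000_000_000, read_length=100, nlanes=5))
```

Multiplying the result by the number of lanes gives a starting point for estimating the disk space the FASTQ files will need.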

Step 3: After running alignment and variant calling, we can evaluate the results. To validate the variants, run the following command:

# assumes you are in the quickstart_test folder
../../opt/miniconda2/bin/python ../../compare_vcf.py --true_vcf out/simu.truth.vcf --out_dir validation --reference hs37d5.fa --vcfs [VCF from result of secondary analysis]

This will output a JSON file (validation/augmented_report.json) that can be used as input to the VCF Compare webapp [http://bioinform.github.io/varsim/webapp/variant_compare.html]. True positive calls are in validation/augmented_tp.vcf.gz, false negative calls in validation/augmented_fn.vcf.gz, and false positive calls in validation/augmented_fp.vcf.gz.
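For a quick summary without the webapp, the three output VCFs named above can be counted directly. The sketch below assumes that every non-header line in each file is one call, which may not match how the JSON report counts variants. The function names are hypothetical, and the JSON report remains the authoritative result.

```python
import gzip
import os


def count_vcf_records(path):
    """Count non-empty, non-header lines in a gzipped VCF."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def precision_recall(tp, fp, fn):
    """Return (precision, recall); NaN where a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


def summarize_validation(validation_dir="validation"):
    """Summarize the TP/FP/FN files written by compare_vcf.py."""
    counts = {
        key: count_vcf_records(os.path.join(validation_dir, f"augmented_{key}.vcf.gz"))
        for key in ("tp", "fp", "fn")
    }
    precision, recall = precision_recall(counts["tp"], counts["fp"], counts["fn"])
    return {**counts, "precision": precision, "recall": recall}

# Example (run from the quickstart_test folder after Step 3):
#   print(summarize_validation("validation"))
```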

Step 4: To validate the alignments, run the following command:

java -jar VarSim.jar samcompare -prefix simu [BAM files from result of secondary analysis]

This will output a JSON file that can be used as input to the Alignment Compare webapp [http://bioinform.github.io/varsim/webapp/alignment_compare.html].

Quick Start Guide for Somatic Simulation [under construction]

Step 1: Make sure you have already gone through the Quick Start Guide for Germline Simulation.

Step 2: Currently VarSim can only simulate random sets of variants from the COSMIC database. This limitation will be addressed in a future version. Hence, this step will be to acquire the COSMIC database VCF.

Download the two COSMIC VCF files, CosmicCodingMuts.vcf.gz and CosmicNonCodingVariants.vcf.gz, from http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/download

After downloading the two COSMIC VCFs to the directory varsim_run, run the following command:

        cat <(gzip -dc CosmicCodingMuts.vcf.gz) <(gzip -dc CosmicNonCodingVariants.vcf.gz) | gzip -c > cosmic.vcf.gz

This will make a concatenated VCF. Although this is not a valid VCF file, VarSim is OK with it. You can also provide your own COSMIC VCF file. Just make sure it is a single file. VarSim uses the "COS" in the COSMIC SNP annotations to separate them out. This will change in future versions.
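On systems without bash process substitution, the same concatenation can be sketched in Python. This is an illustrative equivalent of the cat command above, not part of VarSim; the function name is hypothetical. Like the shell version, it simply appends the decompressed files, so the header of the second file ends up in the middle of the output.

```python
import gzip
import shutil


def concat_gzipped(inputs, output):
    """Decompress each input in order and write one gzipped output file."""
    with gzip.open(output, "wb") as out:
        for path in inputs:
            with gzip.open(path, "rb") as src:
                # Stream in chunks so large VCFs are never fully loaded in memory.
                shutil.copyfileobj(src, out)

# Example, using the file names from Step 2:
#   concat_gzipped(
#       ["CosmicCodingMuts.vcf.gz", "CosmicNonCodingVariants.vcf.gz"],
#       "cosmic.vcf.gz",
#   )
```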

Step 3: Simulate tumor reads and variants by running the following command:

        varsim_somatic.py --reference hs37d5.fa --id cosmic --som_num_snp 10000 \
        --som_num_ins 2000 --som_num_del 2000 \
        --som_num_mnp 200 \
        --som_num_complex 200 \
        --cosmic_vcf cosmic.vcf.gz \
        --normal_vcf out/simu.truth.vcf \
        --nlanes 5 --total_coverage 1 \
        --simulator art \
        --simulator_executable ART/art_bin_VanillaIceCream/art_illumina \
        --out_dir som_out --log_dir som_log --work_dir som_work &> somatic.log

Step 4 [optional]: Mix some normal reads into the tumor. To roughly simulate normal contamination, you can take advantage of the multiple lanes to easily mix the normal and tumor reads.

For example, suppose you want to simulate 50x coverage of a normal sample and 50x coverage of a tumor sample with a tumor allele frequency of 0.3. One possible way to achieve this is to simulate 7 lanes of the normal sample at 70x coverage and 3 lanes of the tumor sample at 30x coverage. Take the first 5 lanes of the simulated normal sample as the pure normal. Then combine the remaining 2 normal lanes with all of the simulated tumor lanes to form the contaminated tumor.
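The lane arithmetic in this example can be checked with a short sketch. It assumes every lane carries the same coverage (10x per lane here) and, as one reading of the 0.3 figure, that somatic variants are heterozygous in a diploid tumor, so the expected allele fraction is half the fraction of reads that come from the tumor. The names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MixPlan:
    pure_normal_coverage: float
    mixed_tumor_coverage: float
    tumor_read_fraction: float
    expected_het_allele_fraction: float


def plan_mix(normal_lanes, tumor_lanes, pure_normal_lanes, coverage_per_lane):
    """Split simulated lanes into a pure normal and a contaminated tumor sample."""
    if not 0 <= pure_normal_lanes <= normal_lanes:
        raise ValueError("pure_normal_lanes must be between 0 and normal_lanes")
    if tumor_lanes <= 0 or coverage_per_lane <= 0:
        raise ValueError("tumor_lanes and coverage_per_lane must be positive")
    spare_normal_lanes = normal_lanes - pure_normal_lanes
    mixed_lanes = spare_normal_lanes + tumor_lanes
    tumor_fraction = tumor_lanes / mixed_lanes
    return MixPlan(
        pure_normal_coverage=pure_normal_lanes * coverage_per_lane,
        mixed_tumor_coverage=mixed_lanes * coverage_per_lane,
        tumor_read_fraction=tumor_fraction,
        # Assumption: heterozygous somatic variants in a diploid tumor.
        expected_het_allele_fraction=tumor_fraction / 2,
    )


if __name__ == "__main__":
    # 7 normal lanes (70x) and 3 tumor lanes (30x), 5 lanes kept as pure normal.
    print(plan_mix(normal_lanes=7, tumor_lanes=3, pure_normal_lanes=5, coverage_per_lane=10.0))
```

With the numbers from the example, both samples end up at 50x and three of the five lanes in the contaminated tumor come from the tumor simulation.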

Step 5: Run the somatic analysis pipeline with the simulated reads.

Step 6: Validate alignments and variants called in the same way as before for the germline case. The truth VCF is in the file som_out/cosmic_somatic.vcf.

Command Line Options

	VarSim: A high-fidelity simulation validation framework

optional arguments:
  -h, --help            show this help message and exit
  --out_dir DIR         Output directory for the simulated genome, reads and
                        variants (default: out)
  --work_dir DIR        Work directory, currently not used (default: work)
  --log_dir DIR         Log files of all steps are kept here (default: log)
  --reference FASTA     Reference genome that variants will be inserted into
                        (default: None)
  --seed seed           Random number seed for reproducibility (default: 0)
  --sex Sex             Sex of the person (MALE/FEMALE) (default: MALE)
  --id ID               Sample ID to be put in output VCF file (default: None)
  --simulator SIMULATOR
                        Read simulator to use (default: art)
  --varsim_jar PATH     Path to VarSim.jar (default: /net/kodiak/volumes/river
                        /shared/users/johnmu/github/varsim/VarSim.jar)
  --read_length LENGTH  Length of read to simulate (default: 100)
  --nlanes INTEGER      Number of lanes to generate, coverage will be divided
                        evenly over the lanes. Simulation is parallized over
                        lanes. Each lane will have its own pair of files
                        (default: 1)
  --total_coverage FLOAT
                        Total coverage to simulate (default: 1.0)
  --mean_fragment_size FLOAT
                        Mean fragment size to simulate (default: 350)
  --sd_fragment_size FLOAT
                        Standard deviation of fragment size to simulate
                        (default: 50)
  --vcfs VCF [VCF ...]  Addtional list of VCFs to insert into genome, priority
                        is lowest ... highest (default: [])
  --force_five_base_encoding
                        Force output bases to be only ACTGN (default: False)
  --filter              Only use PASS variants for simulation (default: False)
  --keep_temp           Keep temporary files after simulation (default: False)

Pipeline control options. Disable parts of the pipeline.:
  --disable_rand_vcf    Disable sampling from the provided small variant VCF
                        (default: False)
  --disable_rand_dgv    Disable sampline from the provided DGV file (default:
                        False)
  --disable_vcf2diploid
                        Disable diploid genome simulation (default: False)
  --disable_sim         Disable read simulation (default: False)

Small variant simulation options:
  --vc_num_snp INTEGER  Number of SNPs to sample from small variant VCF
                        (default: 0)
  --vc_num_ins INTEGER  Number of insertions to sample from small variant VCF
                        (default: 0)
  --vc_num_del INTEGER  Number of deletions to sample from small variant VCF
                        (default: 0)
  --vc_num_mnp INTEGER  Number of MNPs to sample from small variant VCF
                        (default: 0)
  --vc_num_complex INTEGER
                        Number of complex variants to sample from small
                        variant VCF (default: 0)
  --vc_percent_novel FLOAT
                        Percent variants sampled from small variant VCF that
                        will be moved to novel positions (default: 0)
  --vc_min_length_lim INTEGER
                        Min length of small variant to accept [inclusive]
                        (default: 0)
  --vc_max_length_lim INTEGER
                        Max length of small variant to accept [inclusive]
                        (default: 99)
  --vc_in_vcf VCF       Input small variant VCF, usually dbSNP (default: None)
  --vc_prop_het FLOAT   Proportion of heterozygous small variants (default:
                        0.6)

Structural variant simulation options:
  --sv_num_ins INTEGER  Number of insertions to sample from DGV (default: 20)
  --sv_num_del INTEGER  Number of deletions to sample from DGV (default: 20)
  --sv_num_dup INTEGER  Number of duplications to sample from DGV (default:
                        20)
  --sv_num_inv INTEGER  Number of inversions to sample from DGV (default: 20)
  --sv_percent_novel FLOAT
                        Percent variants sampled from DGV that will be moved
                        to novel positions (default: 0)
  --sv_min_length_lim min_length_lim
                        Min length of structural variant to accept [inclusive]
                        (default: 100)
  --sv_max_length_lim max_length_lim
                        Max length of structural variant to accept [inclusive]
                        (default: 1000000)
  --sv_insert_seq FILE  Path to file containing concatenation of real
                        insertion sequences (default: None)
  --sv_dgv DGV_FILE     DGV file containing structural variants (default:
                        None)

DWGSIM options:
  --dwgsim_start_e first_base_error_rate
                        Error rate on the first base (default: 0.0001)
  --dwgsim_end_e last_base_error_rate
                        Error rate on the last base (default: 0.0015)
  --dwgsim_options DWGSIM_OPTIONS
                        DWGSIM command-line options (default: )

ART options:
  --profile_1 profile_file1
                        ART error profile for first end (default: None)
  --profile_2 profile_file2
                        ART error profile for second end (default: None)
  --art_options ART_OPTIONS
                        ART command-line options (default: )
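As an illustration of how the options above combine, the sketch below assembles a varsim.py command line in Python. It uses only flags listed in the help text; the input file names in the example are hypothetical placeholders, and the sketch makes no claim about which options are required. A real ART run may need additional settings, such as the error profiles under ART options.

```python
def build_varsim_command(reference, sample_id, small_variant_vcf, dgv_file,
                         insert_seq_file, total_coverage=1.0, nlanes=1,
                         num_snp=0, seed=0, out_dir="out"):
    """Assemble an argument list for varsim.py (hypothetical helper)."""
    return [
        "./varsim.py",
        "--reference", reference,
        "--id", sample_id,
        "--vc_in_vcf", small_variant_vcf,   # small variant VCF, usually dbSNP
        "--vc_num_snp", str(num_snp),
        "--sv_dgv", dgv_file,               # DGV structural variants
        "--sv_insert_seq", insert_seq_file,
        "--total_coverage", str(total_coverage),
        "--nlanes", str(nlanes),
        "--seed", str(seed),
        "--simulator", "art",
        "--out_dir", out_dir,
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute the paths of your own inputs.
    command = build_varsim_command(
        "hs37d5.fa", "simu", "dbsnp.vcf", "dgv.txt", "insert_seq.txt",
        total_coverage=30.0, nlanes=5, num_snp=3000000,
    )
    print(" ".join(command))
```

Keeping the command as a list makes it easy to pass to subprocess.run and to log alongside the seed for reproducibility.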

References/Tools

vcf2diploid: Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54. Download link
DWGSIM: Nils Homer. GitHub repository
ART: Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012 Feb 15;28(4):593-4. doi: 10.1093/bioinformatics/btr708. Epub 2011 Dec 23. Download link