Alignment Compare webapp
Variant Compare webapp
For more information contact us at bina.rd@roche.com
Pre-generated reads and variants are available on Amazon S3. They are in a requester-pays bucket, so the requester must pay for the cost of transferring the data out of S3. Currently the cost Amazon lists is 0.09 $/GB. You will also need an AWS account to access the data.
Install s3cmd (1.53+) from http://s3tools.org/s3cmd. If you use pip, use the command
pip -vvv install s3cmd --pre
Run s3cmd configure; get the authentication details from https://console.aws.amazon.com/iam/home?#security_credential (then open "access keys").
s3cmd ls --add-header=x-amz-request-payer:requester s3://varsim/*
s3cmd get --add-header=x-amz-request-payer:requester s3://varsim/[file that you want]
There is a readme.txt in the root directory with details of the data.
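The listing and download commands above can also be driven from a script. The following is a minimal sketch, not part of VarSim: the helper names `s3cmd_args`, `estimate_transfer_cost` and `fetch` are hypothetical, the default rate is the 0.09 $/GB figure quoted above (which Amazon may change), and `fetch` assumes s3cmd is installed and already configured.

```python
import subprocess

# Header that marks the request as requester-pays, as in the commands above.
REQUESTER_PAYS = "--add-header=x-amz-request-payer:requester"
BUCKET = "s3://varsim/"


def s3cmd_args(action, key):
    """Build the s3cmd argument list for 'ls' or 'get' on the varsim bucket."""
    if action not in ("ls", "get"):
        raise ValueError("action must be 'ls' or 'get'")
    return ["s3cmd", action, REQUESTER_PAYS, BUCKET + key]


def estimate_transfer_cost(size_gb, usd_per_gb=0.09):
    """Rough transfer cost in USD at the rate quoted in this README."""
    return size_gb * usd_per_gb


def fetch(key):
    """Download one object; needs a configured s3cmd on PATH (assumption)."""
    subprocess.run(s3cmd_args("get", key), check=True)
```

For example, `s3cmd_args("ls", "*")` reproduces the listing command shown above, and `estimate_transfer_cost(100)` gives roughly 9 USD for 100 GB.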
Requires git to download VarSim now
git clone https://github.com/bioinform/varsim.git
cd varsim
./build.sh
Available at varsim. Alternatively, try out docker/test.sh in the VarSim repo.
VarSim is also available as an app on the PrecisionFDA portal. If you have an account, please check out this note.
For variant simulation and read generation/simulation:
For alignment and variant calling validation:
Type ./varsim.py -h for help.
varsim_somatic.py is a helper script for the somatic workflow.
This is in its early stages. Please contact us if you have trouble with it.
This quick start guide walks through generating a random genome with pre-specified and random variants, then generating reads from this genome with ART. Finally, the results of evaluating the output of secondary analysis are plotted.
Step 1: Follow the Download and Build VarSim section to download and install VarSim and auxiliary programs.
Step 2: Run the following command to generate the simulated genome and reads. Replace the values in square brackets with the appropriate values. This will take a few minutes to run.
cd tests/quickstart_test
./quickstart.sh
The reads as well as the ground truth VCF will be generated in the out directory, i.e. out/lane*.fq.gz and out/simu.truth.vcf.
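A quick way to check that the run produced what is described above is to list the lane files and count the records in the truth VCF. This is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of VarSim, and a VCF record is taken to be any non-empty line that does not start with `#`.

```python
import glob
import gzip
import os


def list_lane_files(out_dir="out"):
    """Return the simulated read files matching out/lane*.fq.gz, sorted by name."""
    return sorted(glob.glob(os.path.join(out_dir, "lane*.fq.gz")))


def count_vcf_records(path):
    """Count non-header, non-empty lines in a plain or gzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1 for line in handle if line.strip() and not line.startswith("#")
        )


if __name__ == "__main__":
    # Paths follow the quick start layout described above.
    print(list_lane_files("out"))
    truth = os.path.join("out", "simu.truth.vcf")
    if os.path.exists(truth):
        print(count_vcf_records(truth), "truth variants")
```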
By default, quickstart.sh generates very low coverage data. To increase coverage, adjust the --total_coverage option. Afterwards, an aligner and a variant caller can be run to generate variant calls.
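When choosing a value for --total_coverage, it can help to estimate how many read pairs each lane will contain. The sketch below uses the standard relation coverage = (number of bases sequenced) / (genome length), and assumes paired-end reads with coverage split evenly across lanes. The function name and the genome length in the usage note are illustrative assumptions, and the read simulator's exact counts may differ.

```python
def read_pairs_per_lane(total_coverage, genome_length, read_length=100, nlanes=1):
    """Estimate read pairs per lane for paired-end reads.

    Each pair contributes 2 * read_length bases, and the total coverage is
    assumed to be divided evenly over the lanes.
    """
    if nlanes < 1 or read_length < 1 or genome_length < 1:
        raise ValueError("nlanes, read_length and genome_length must be positive")
    coverage_per_lane = total_coverage / nlanes
    return int(round(coverage_per_lane * genome_length / (2 * read_length)))
```

For a hypothetical 3,000,000,000 bp genome, `read_pairs_per_lane(30, 3_000_000_000, read_length=100, nlanes=3)` gives 150,000,000 pairs per lane.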
Step 3: After running the alignment and variant calling we can evaluate the results. In order to validate the variants run the following command:
# assume you are in the quickstart_test folder
../../opt/miniconda2/bin/python ../../compare_vcf.py --true_vcf out/simu.truth.vcf --out_dir validation --reference hs37d5.fa --vcfs [VCF from result of secondary analysis]
This will output a JSON file (validation/augmented_report.json) that can be used as input to the VCF Compare webapp [http://bioinform.github.io/varsim/webapp/variant_compare.html].
True positive calls are in validation/augmented_tp.vcf.gz; false negative calls are in validation/augmented_fn.vcf.gz; false positive calls are in validation/augmented_fp.vcf.gz.
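For a quick summary without the webapp, the three VCFs listed above can be counted directly. The following is a rough sketch, not VarSim's own accounting: it treats every non-header line as one call, so the figures may differ from those in the JSON report (whose schema is not assumed here). The function names are hypothetical.

```python
import gzip
import os


def count_records(path):
    """Count non-header, non-empty lines in a gzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return sum(
            1 for line in handle if line.strip() and not line.startswith("#")
        )


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN); 0.0 if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def summarize(validation_dir="validation"):
    """Compute precision and recall from augmented_{tp,fp,fn}.vcf.gz."""
    counts = {
        kind: count_records(
            os.path.join(validation_dir, "augmented_%s.vcf.gz" % kind)
        )
        for kind in ("tp", "fp", "fn")
    }
    return precision_recall(counts["tp"], counts["fp"], counts["fn"])
```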
Step 4: In order to validate the alignments run the following command:
java -jar VarSim.jar samcompare -prefix simu [BAM files from result of secondary analysis]
This will output a JSON file that can be used as input to the Alignment Compare webapp [http://bioinform.github.io/varsim/webapp/alignment_compare.html]
Step 1: Make sure you have already gone through the Quick Start Guide for germline simulation.
Step 2: Currently VarSim can only simulate random sets of variants from the COSMIC database. This limitation will be addressed in a future version. Hence, this step will be to acquire the COSMIC database VCF.
Download the two COSMIC VCF files CosmicCodingMuts.vcf.gz and CosmicNonCodingVariants.vcf.gz from http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/download .
After downloading the two COSMIC VCFs to the directory varsim_run, run the following command:
cat <(gzip -dc CosmicCodingMuts.vcf.gz) <(gzip -dc CosmicNonCodingVariants.vcf.gz) | gzip -c > cosmic.vcf.gz
This will make a concatenated VCF. Although this is not a valid VCF file, VarSim accepts it. You can also provide your own COSMIC VCF file; just make sure it is a single file. VarSim uses the "COS" in the COSMIC SNP annotations to separate them out. This will change in future versions.
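On systems where process substitution is not available, the same concatenation can be done in Python. This is a minimal sketch of an equivalent to the shell command above, with a hypothetical function name; like that command, it keeps the header lines of both inputs, so the result is not a strictly valid VCF.

```python
import gzip
import shutil


def concat_gzip(inputs, output):
    """Decompress each input in order and recompress into one gzip file."""
    with gzip.open(output, "wb") as out:
        for path in inputs:
            with gzip.open(path, "rb") as src:
                # Stream in chunks so large VCFs are never fully in memory.
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    concat_gzip(
        ["CosmicCodingMuts.vcf.gz", "CosmicNonCodingVariants.vcf.gz"],
        "cosmic.vcf.gz",
    )
```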
Step 3: Simulate tumor reads and variants by running the following command:
varsim_somatic.py --reference hs37d5.fa --id cosmic --som_num_snp 10000 \
    --som_num_ins 2000 --som_num_del 2000 \
    --som_num_mnp 200 \
    --som_num_complex 200 \
    --cosmic_vcf cosmic.vcf.gz \
    --normal_vcf out/simu.truth.vcf \
    --nlanes 5 --total_coverage 1 \
    --simulator art \
    --simulator_executable ART/art_bin_VanillaIceCream/art_illumina \
    --out_dir som_out --log_dir som_log --work_dir som_work &> somatic.log
Step 4 [optional]: Mix some normal reads into the tumor. To roughly simulate normal contamination, you can take advantage of the multiple lanes to easily mix the normal and tumor reads.
For example, suppose you want to simulate reads with 50x coverage of a normal sample and 50x coverage of a tumor sample with a tumor allele frequency of 0.3. One possible way to achieve this is to simulate 7 lanes of the normal sample at 70x coverage and 3 lanes of the tumor sample at 30x coverage. Take the first 5 lanes of the simulated normal sample as the pure normal. Then take the remaining 2 normal lanes together with all the simulated tumor lanes as the contaminated tumor.
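The arithmetic behind this example can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, every lane is assumed to carry the same coverage, and the last figure assumes that somatic variants are heterozygous in the tumor genome, so the expected allele fraction is half the fraction of tumor-derived reads.

```python
def plan_contamination(normal_lanes, tumor_lanes, coverage_per_lane, pure_normal_lanes):
    """Summarize coverage when leftover normal lanes are mixed into the tumor."""
    if not 0 <= pure_normal_lanes <= normal_lanes:
        raise ValueError("pure_normal_lanes must be between 0 and normal_lanes")
    mixed_normal_lanes = normal_lanes - pure_normal_lanes
    mixed_lanes = mixed_normal_lanes + tumor_lanes
    if mixed_lanes == 0:
        raise ValueError("the contaminated tumor sample needs at least one lane")
    tumor_read_fraction = tumor_lanes / mixed_lanes
    return {
        "normal_coverage": pure_normal_lanes * coverage_per_lane,
        "tumor_sample_coverage": mixed_lanes * coverage_per_lane,
        "tumor_read_fraction": tumor_read_fraction,
        # Assumption: heterozygous somatic variants, so half the tumor reads
        # at a variant site carry the alternate allele.
        "expected_het_allele_fraction": tumor_read_fraction / 2,
    }
```

With the numbers from the example, `plan_contamination(7, 3, 10, 5)` yields 50x for the pure normal, 50x for the contaminated tumor, a tumor read fraction of 0.6 and, under the heterozygosity assumption, an expected allele fraction of 0.3.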
Step 5: Run somatic analysis pipeline with the simulated reads
Step 6: Validate alignments and variants called in the same way as before for the germline case.
The truth VCF is in the file som_out/cosmic_somatic.vcf.
VarSim: A high-fidelity simulation validation framework

optional arguments:
  -h, --help            show this help message and exit
  --out_dir DIR         Output directory for the simulated genome, reads and
                        variants (default: out)
  --work_dir DIR        Work directory, currently not used (default: work)
  --log_dir DIR         Log files of all steps are kept here (default: log)
  --reference FASTA     Reference genome that variants will be inserted into
                        (default: None)
  --seed seed           Random number seed for reproducibility (default: 0)
  --sex Sex             Sex of the person (MALE/FEMALE) (default: MALE)
  --id ID               Sample ID to be put in output VCF file (default: None)
  --simulator SIMULATOR
                        Read simulator to use (default: art)
  --varsim_jar PATH     Path to VarSim.jar (default:
                        /net/kodiak/volumes/river/shared/users/johnmu/github/varsim/VarSim.jar)
  --read_length LENGTH  Length of read to simulate (default: 100)
  --nlanes INTEGER      Number of lanes to generate, coverage will be divided
                        evenly over the lanes. Simulation is parallized over
                        lanes. Each lane will have its own pair of files
                        (default: 1)
  --total_coverage FLOAT
                        Total coverage to simulate (default: 1.0)
  --mean_fragment_size FLOAT
                        Mean fragment size to simulate (default: 350)
  --sd_fragment_size FLOAT
                        Standard deviation of fragment size to simulate
                        (default: 50)
  --vcfs VCF [VCF ...]  Addtional list of VCFs to insert into genome, priority
                        is lowest ... highest (default: [])
  --force_five_base_encoding
                        Force output bases to be only ACTGN (default: False)
  --filter              Only use PASS variants for simulation (default: False)
  --keep_temp           Keep temporary files after simulation (default: False)

Pipeline control options. Disable parts of the pipeline.:
  --disable_rand_vcf    Disable sampling from the provided small variant VCF
                        (default: False)
  --disable_rand_dgv    Disable sampline from the provided DGV file (default:
                        False)
  --disable_vcf2diploid
                        Disable diploid genome simulation (default: False)
  --disable_sim         Disable read simulation (default: False)

Small variant simulation options:
  --vc_num_snp INTEGER  Number of SNPs to sample from small variant VCF
                        (default: 0)
  --vc_num_ins INTEGER  Number of insertions to sample from small variant VCF
                        (default: 0)
  --vc_num_del INTEGER  Number of deletions to sample from small variant VCF
                        (default: 0)
  --vc_num_mnp INTEGER  Number of MNPs to sample from small variant VCF
                        (default: 0)
  --vc_num_complex INTEGER
                        Number of complex variants to sample from small
                        variant VCF (default: 0)
  --vc_percent_novel FLOAT
                        Percent variants sampled from small variant VCF that
                        will be moved to novel positions (default: 0)
  --vc_min_length_lim INTEGER
                        Min length of small variant to accept [inclusive]
                        (default: 0)
  --vc_max_length_lim INTEGER
                        Max length of small variant to accept [inclusive]
                        (default: 99)
  --vc_in_vcf VCF       Input small variant VCF, usually dbSNP (default: None)
  --vc_prop_het FLOAT   Proportion of heterozygous small variants (default:
                        0.6)

Structural variant simulation options:
  --sv_num_ins INTEGER  Number of insertions to sample from DGV (default: 20)
  --sv_num_del INTEGER  Number of deletions to sample from DGV (default: 20)
  --sv_num_dup INTEGER  Number of duplications to sample from DGV (default:
                        20)
  --sv_num_inv INTEGER  Number of inversions to sample from DGV (default: 20)
  --sv_percent_novel FLOAT
                        Percent variants sampled from DGV that will be moved
                        to novel positions (default: 0)
  --sv_min_length_lim min_length_lim
                        Min length of structural variant to accept [inclusive]
                        (default: 100)
  --sv_max_length_lim max_length_lim
                        Max length of structural variant to accept [inclusive]
                        (default: 1000000)
  --sv_insert_seq FILE  Path to file containing concatenation of real
                        insertion sequences (default: None)
  --sv_dgv DGV_FILE     DGV file containing structural variants (default:
                        None)

DWGSIM options:
  --dwgsim_start_e first_base_error_rate
                        Error rate on the first base (default: 0.0001)
  --dwgsim_end_e last_base_error_rate
                        Error rate on the last base (default: 0.0015)
  --dwgsim_options DWGSIM_OPTIONS
                        DWGSIM command-line options (default: )

ART options:
  --profile_1 profile_file1
                        ART error profile for first end (default: None)
  --profile_2 profile_file2
                        ART error profile for second end (default: None)
  --art_options ART_OPTIONS
                        ART command-line options (default: )