If you use SomaticSeq in your work, please cite the following:
The SEQC2/MAQC-IV Consortium has published numerous whole-genome and whole-exome sequencing replicates from multiple sequencing centers for a pair of tumor-normal reference samples, along with a high-confidence somatic mutation reference call set. These resources can be used to train machine learning classifiers (e.g., Sahraeian SME et al. Genome Biol 2022) or to evaluate algorithms and pipelines (e.g., Xiao W et al. Nat Biotechnol 2021). This work is published as:
SomaticSeq is the tool with which Bina Technologies, Inc. ranked #1 and #2 in INDEL and SNV, respectively, in Stage 5 of the ICGC-TCGA DREAM Somatic Mutation Calling Challenge.
Documentation can be directly downloaded here.
A quick 8-minute video explaining SomaticSeq: Fang LT, et al. Genome Biol (2015)
SEQC2 somatic mutation reference data and call sets (1st place award, 2021 MCBIOS/MAQC Annual Meeting): Fang LT, et al. Nat Biotechnol (2021) / PubMed / SharedIt
How to run SomaticSeq v3.6.3 on precisionFDA: run in train or prediction mode
somaticseq_parallel.py \
  --output-directory $OUTPUT_DIR \
  --genome-reference GRCh38.fa \
  --inclusion-region genome.bed \
  --exclusion-region blacklist.bed \
paired \
  --tumor-bam-file tumor.bam \
  --normal-bam-file matched_normal.bam \
  --mutect2-vcf MuTect2/variants.vcf \
  --varscan-snv VarScan2/variants.snp.vcf \
  --varscan-indel VarScan2/variants.indel.vcf \
  --jsm-vcf JointSNVMix2/variants.snp.vcf \
  --somaticsniper-vcf SomaticSniper/variants.snp.vcf \
  --vardict-vcf VarDict/variants.vcf \
  --muse-vcf MuSE/variants.snp.vcf \
  --lofreq-snv LoFreq/variants.snp.vcf \
  --lofreq-indel LoFreq/variants.indel.vcf \
  --scalpel-vcf Scalpel/variants.indel.vcf \
  --strelka-snv Strelka/variants.snv.vcf \
  --strelka-indel Strelka/variants.indel.vcf
somaticseq_parallel.py \
  --output-directory $OUTPUT_DIR \
  --genome-reference GRCh38.fa \
  --inclusion-region genome.bed \
  --exclusion-region blacklist.bed \
single \
  --bam-file tumor.bam \
  --mutect2-vcf MuTect2/variants.vcf \
  --varscan-vcf VarScan2/variants.vcf \
  --vardict-vcf VarDict/variants.vcf \
  --lofreq-vcf LoFreq/variants.vcf \
  --scalpel-vcf Scalpel/variants.indel.vcf \
  --strelka-vcf Strelka/variants.vcf
--somaticseq-train: FLAG (takes no argument) to invoke training mode. It also requires the following inputs, as well as R and the ada package in R.
--truth-snv: if you have a ground truth VCF file for SNV
--truth-indel: if you have a ground truth VCF file for INDEL
--classifier-snv: classifier (.RData file) previously built for SNV
--classifier-indel: classifier (.RData file) previously built for INDEL
--inclusion-region and/or --exclusion-region: require BEDTools in your path.
--threads X: place this before the paired option to run with X threads. It simply creates multiple BED files (each consisting of 1/X of the total base pairs) so that SomaticSeq can run on each of those sub-BED files in parallel, and then merges the results. This requires bedtools in your path.
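
To make the --threads behaviour more concrete, here is a minimal Python sketch of dividing a set of BED regions into at most X chunks with roughly equal numbers of base pairs. It is an illustration only, not SomaticSeq's actual implementation: the function name split_regions and the in-memory (chrom, start, end) tuples are assumptions made for this example, whereas SomaticSeq itself works on BED files and relies on bedtools.

```python
from typing import List, Tuple

# Hypothetical in-memory form of a BED line: (chrom, start, end),
# 0-based and half-open, as in the BED format.
Region = Tuple[str, int, int]


def split_regions(regions: List[Region], n_chunks: int) -> List[List[Region]]:
    """Split regions into at most n_chunks lists of roughly equal total length.

    A region is cut in two when it straddles a chunk boundary.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")

    total = sum(end - start for _, start, end in regions)
    # Ceiling division, so n_chunks chunks are always enough to hold every base.
    target = -(-total // n_chunks)

    chunks: List[List[Region]] = [[]]
    filled = 0  # bases already placed in the current chunk
    for chrom, start, end in regions:
        while start < end:
            if filled == target:
                # Current chunk is full: open a new one.
                chunks.append([])
                filled = 0
            take = min(end - start, target - filled)
            chunks[-1].append((chrom, start, start + take))
            start += take
            filled += take
    return chunks


if __name__ == "__main__":
    example = [("chr1", 0, 100), ("chr2", 50, 150)]
    for i, chunk in enumerate(split_regions(example, 4), start=1):
        print(i, chunk)
```

Under these assumptions, each returned chunk could be written to its own sub-BED file and passed as the inclusion region of one parallel job, with the per-chunk results merged afterwards.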