If you use SomaticSeq in your work, please cite the following:
                
The SEQC2/MAQC-IV Consortium has published numerous whole-genome and whole-exome sequencing replicates from multiple sequencing centers for a pair of tumor-normal reference samples, along with a high-confidence somatic mutation reference call set. These resources can be used to train machine learning classifiers (e.g., Sahraeian SME et al. Genome Biol 2022) or to evaluate algorithms and pipelines (e.g., Xiao W et al. Nat Biotechnol 2021). This work is published as:
                
With SomaticSeq, Bina Technologies, Inc. ranked #1 and #2 in INDEL and SNV calling, respectively, in Stage 5 of the ICGC-TCGA DREAM Somatic Mutation Calling Challenge.
Documentation can be directly downloaded here.
A quick 8-minute video explaining SomaticSeq
Fang LT, et al. Genome Biol (2015)

SEQC2 somatic mutation reference data and call sets
1st place award, 2021 MCBIOS/MAQC Annual Meeting
Fang LT, et al. Nat Biotechnol (2021) / PubMed / SharedIt

How to run SomaticSeq v3.6.3 on precisionFDA
Run in train or prediction mode
            
            
somaticseq_parallel.py \
  --output-directory  $OUTPUT_DIR \
  --genome-reference  GRCh38.fa \
  --inclusion-region  genome.bed \
  --exclusion-region  blacklist.bed \
  paired \
  --tumor-bam-file    tumor.bam \
  --normal-bam-file   matched_normal.bam \
  --mutect2-vcf       MuTect2/variants.vcf \
  --varscan-snv       VarScan2/variants.snp.vcf \
  --varscan-indel     VarScan2/variants.indel.vcf \
  --jsm-vcf           JointSNVMix2/variants.snp.vcf \
  --somaticsniper-vcf SomaticSniper/variants.snp.vcf \
  --vardict-vcf       VarDict/variants.vcf \
  --muse-vcf          MuSE/variants.snp.vcf \
  --lofreq-snv        LoFreq/variants.snp.vcf \
  --lofreq-indel      LoFreq/variants.indel.vcf \
  --scalpel-vcf       Scalpel/variants.indel.vcf \
  --strelka-snv       Strelka/variants.snv.vcf \
  --strelka-indel     Strelka/variants.indel.vcf
                    
                
somaticseq_parallel.py \
  --output-directory  $OUTPUT_DIR \
  --genome-reference  GRCh38.fa \
  --inclusion-region  genome.bed \
  --exclusion-region  blacklist.bed \
  single \
  --bam-file          tumor.bam \
  --mutect2-vcf       MuTect2/variants.vcf \
  --varscan-vcf       VarScan2/variants.vcf \
  --vardict-vcf       VarDict/variants.vcf \
  --lofreq-vcf        LoFreq/variants.vcf \
  --scalpel-vcf       Scalpel/variants.indel.vcf \
  --strelka-vcf       Strelka/variants.vcf
                    
--somaticseq-train: flag, taking no argument, that invokes training mode. Training also requires the ground truth inputs below, and R with the ada package installed.
--truth-snv: ground truth VCF file for SNVs, if you have one
--truth-indel: ground truth VCF file for INDELs, if you have one
--classifier-snv: classifier (.RData file) previously built for SNVs
--classifier-indel: classifier (.RData file) previously built for INDELs
--inclusion-region and/or --exclusion-region: these require BEDTools in your path.
--threads X: place this before the paired option to run with X threads. It simply creates multiple BED files (each covering 1/X of the total base pairs) so that SomaticSeq runs on each of those sub-BED files in parallel, and then merges the results. This requires BEDTools in your path.
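
To illustrate the idea behind --threads X, the region-splitting step can be sketched in a few lines of Python. This is only a minimal illustration of dividing BED regions into X parts of roughly equal total length. It is not SomaticSeq's own implementation (which relies on BEDTools), and the function name split_regions is hypothetical.

```python
from math import ceil


def split_regions(regions, n_parts):
    """Split BED-style (chrom, start, end) regions into at most n_parts lists.

    Each list covers about 1/n_parts of the total base pairs. A region that
    straddles a boundary is cut in two. Hypothetical helper for illustration.
    """
    total = sum(end - start for _, start, end in regions)
    target = ceil(total / n_parts)  # base pairs allotted to each part

    parts = [[]]
    filled = 0  # base pairs already placed in the current part
    for chrom, start, end in regions:
        while start < end:
            if filled == target:
                # Current part is full: open the next one.
                parts.append([])
                filled = 0
            take = min(target - filled, end - start)
            parts[-1].append((chrom, start, start + take))
            start += take
            filled += take
    return parts


if __name__ == "__main__":
    # Two regions totalling 150 bp, split for 2 threads (75 bp each).
    demo = [("chr1", 0, 100), ("chr2", 0, 50)]
    for i, part in enumerate(split_regions(demo, 2), start=1):
        for chrom, start, end in part:
            print(f"part {i}\t{chrom}\t{start}\t{end}")
```

Each resulting part could then be written out as its own sub-BED file and processed independently, with the per-part results merged at the end.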