Publication [Open access]
If you use LongISLND in your work, please cite the following:
Bayo Lau, Marghoob Mohiyuddin, John C. Mu, Li Tai Fang, Narges Bani Asadi, Carolina Dallett, and Hugo Y.K. Lam
LongISLND: In silico Sequencing of Lengthy and Noisy Datatypes
Bioinformatics first published online September 25, 2016
doi:10.1093/bioinformatics/btw602
Introduction
LongISLND is a read simulator that profiles the characteristics of third-generation, single-molecule sequencing technologies and simulates reads accordingly. The general software architecture is easily extensible, as demonstrated by the emulation of Pacific Biosciences (PacBio) multi-pass sequencing with P5 and P6 chemistries, producing data in FASTQ, H5, and the latest PacBio BAM format. Read on for example applications to PacBio and Oxford Nanopore (ONT) data.
Download LongISLND
Github repository: https://github.com/bioinform/longislnd
System Requirements
The following must be installed:
- Java 1.8
- Maven (to build from source)
- Python 2.7 (for convenience scripts and usage examples)
- samtools (for convenience scripts and usage examples)
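Before building, it can help to confirm that the required executables are visible on the PATH. The following is a minimal sketch, not part of LongISLND: the helper name missing_tools is hypothetical, the executable names java, mvn and samtools are taken from the list above, and the sketch itself assumes Python 3 (shutil.which), even though the bundled scripts target Python 2.7. It checks presence only, not versions such as Java 1.8.

```python
import shutil

# Executable names taken from the requirements list above. Python itself is
# omitted because this sketch is already running under an interpreter.
REQUIRED_TOOLS = ["java", "mvn", "samtools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```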
Installing LongISLND
From Source
- change current working directory into the source code directory
- execute linux_build.sh, which assumes wget, Java 1.8 (java) and Maven (mvn) are available.
- accept HDF Java Products' license and accept default installation location to proceed.
From Binary Download
- download a binary package from https://github.com/bioinform/longislnd/releases
- execute tar xzf ${TAR_GZ} to unpack the tar.gz package
- change current working directory into the binary directory
- execute linux_build.sh, which assumes wget and curl are available.
- accept HDF Java Products' license and accept default installation location to proceed.
Running LongISLND
The Java JAR can be run through the following convenience Python scripts, which also demonstrate its usage.
simulate.py
- execute with -h for a list of options
- simulate sequences given a model and a genome
sample.py
- execute with -h for a list of options
- please refer to the Usage Examples below for input preparation.
- builds the model used by simulate.py.
- for PacBio reads, the script looks for files named:
  - *.fofn, each storing a list of bax.h5 read files
  - *.fofn.cmp.h5, each storing the alignment of the corresponding *.fofn
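The naming convention above can be checked before invoking sample.py. This is a minimal sketch under stated assumptions: find_fofn_pairs is a hypothetical helper, not part of LongISLND, and it only tests that each *.fofn has a sibling *.fofn.cmp.h5 in the same directory. How sample.py itself locates these files is not shown here.

```python
import glob
import os


def find_fofn_pairs(directory):
    """Return (paired, unpaired) lists of *.fofn paths found in `directory`.

    A *.fofn file counts as paired when a sibling file with the same name
    plus the ".cmp.h5" suffix exists next to it.
    """
    paired, unpaired = [], []
    for fofn in sorted(glob.glob(os.path.join(directory, "*.fofn"))):
        if os.path.isfile(fofn + ".cmp.h5"):
            paired.append(fofn)
        else:
            unpaired.append(fofn)
    return paired, unpaired
```

Any path reported in the second list is a read list with no alignment file beside it, so it is worth checking before starting a long learning run.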
Troubleshooting
- Consider setting the maximum memory for Java to match that of your computer. For example, java -Xmx20g sets the maximum Java heap size to 20 GB.
- The memory setting can be specified for sample.py and simulate.py with --jvm_opt " -Xmx20g "
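When the scripts are launched from another Python program, the memory option can be assembled as an argument list. The sketch below is illustrative only: jvm_opt_args and build_command are hypothetical helpers, the interpreter name python is an assumption about your environment, and other_args stands in for the scripts' own options, which are listed by -h and not reproduced here.

```python
def jvm_opt_args(max_heap_gb):
    """Return the --jvm_opt argument pair for a maximum Java heap in GB."""
    if max_heap_gb <= 0:
        raise ValueError("max_heap_gb must be positive")
    # The surrounding spaces mirror the quoting shown above (" -Xmx20g ").
    return ["--jvm_opt", " -Xmx{}g ".format(int(max_heap_gb))]


def build_command(script, max_heap_gb, other_args=()):
    """Assemble an argument list for sample.py or simulate.py.

    `other_args` is a placeholder for the script's own options; run the
    script with -h to see what it accepts.
    """
    return ["python", script] + jvm_opt_args(max_heap_gb) + list(other_args)
```

The resulting list can be passed to subprocess.check_call, which avoids shell quoting problems with the spaces inside the --jvm_opt value.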
Usage Examples
These examples demonstrate LongISLND end to end: downloading data, alignment, learning, and simulation.
They have been tested only on Linux because of various dependencies.
Installing Aligners
- PacBio
This sets up PacBio's SMRTAnalysis2.3, if it is not already available to you. All download_and_align.sh scripts in the PacBio examples assume an installation location of sampling_example/smrtanalysis; please change those scripts if you want to use your own version of SMRTAnalysis.
- Change working directory to sampling_example.
- execute setup_smrt23.sh
  - download and build PacBio's SMRTAnalysis2.3
  - (to avoid complications) select NONE in answering the question "What job management system will you be using?"
- ONT
This sets up GraphMap (Sovic, I. et al (2016). Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun, 7.) and Samtools (H. Li, et al (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics. 25(16)), if they are not already available to you. sampling_example/ont_ecoli/download_and_align.sh assumes installation locations of sampling_example/graphmap and sampling_example/samtools-1.2; please change the script if you want to use your own versions of the aligners.
- Change working directory to sampling_example.
- execute setup_ont.sh
  - download and build GraphMap
  - download and build Samtools
PacBio P6 E. coli
- Change working directory to sampling_example/ecoli.
- execute download_and_align.sh
  - download E. coli data and assembly
  - align to corresponding assembly
  - generate *.fofn and *.fofn.cmp.h5 used by sample.py
- execute learn_and_simulate.sh
  - invoke sample.py to learn a model from aligned data
  - invoke simulate.py to simulate reads and output in PacBio's BAM format, stored in the default out directory.
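The two steps above can also be driven from Python, for example inside a larger pipeline. This is a minimal sketch, not part of LongISLND: example_steps and run_example are hypothetical helpers, and running the scripts through bash is an assumption about how they are meant to be invoked.

```python
import subprocess

# Script names as given in the steps above.
DEFAULT_SCRIPTS = ("download_and_align.sh", "learn_and_simulate.sh")


def example_steps(example_dir, scripts=DEFAULT_SCRIPTS):
    """Return (working_directory, command) pairs in execution order."""
    return [(example_dir, ["bash", name]) for name in scripts]


def run_example(example_dir, scripts=DEFAULT_SCRIPTS):
    """Run each script inside example_dir, stopping at the first failure."""
    for cwd, command in example_steps(example_dir, scripts):
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(command, cwd=cwd)
```

Calling run_example("sampling_example/ecoli") from the repository root would run the download and alignment first and the learning and simulation second, so a failed download does not start a learning run on missing data.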
ONT R7.3 E. coli
- Change working directory to sampling_example/ont_ecoli.
- execute download_and_align.sh
  - download E. coli data (Loman, N.J. et al (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Meth, 12(8), 733–735)
  - extract sequences
  - align to reference
- execute learn_and_simulate.sh
  - invoke sample.py to learn a model from aligned data
  - invoke simulate.py to simulate reads and output in the FASTQ format, stored in the default out directory.
PacBio P6 CHM1
This example is storage-, I/O-, and compute-intensive because of the size of human-scale sequencing data.
- Change working directory to sampling_example/p6_chm1.
- execute download_and_align.sh
  - download a subset of P6 CHM1 data
  - download the MHAP assembly
  - align to corresponding assembly
  - generate *.fofn and *.fofn.cmp.h5 used by sample.py
- execute learn.sh
  - learn a model from aligned data
  - NUM_THREADS can be changed to parallelize learning. At least 3 GB of memory per thread is recommended.
- please use simulate.py as in sampling_example/ecoli/learn_and_simulate.sh to simulate
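The 3 GB-per-thread guideline above can be turned into a quick calculation when choosing NUM_THREADS. This is a minimal sketch under stated assumptions: max_learning_threads is a hypothetical helper, and the memory and core counts are values you supply for your own machine.

```python
def max_learning_threads(available_memory_gb, cpu_count, gb_per_thread=3):
    """Largest thread count that keeps at least gb_per_thread GB per thread.

    The result is capped by cpu_count and never drops below 1, so a machine
    with less memory than gb_per_thread still gets a single thread (which
    is then below the recommended memory).
    """
    by_memory = int(available_memory_gb // gb_per_thread)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    # 20 GB of memory and 16 cores: memory is the limit, 20 // 3 = 6.
    print(max_learning_threads(20, 16))  # 6
    # 64 GB of memory and 8 cores: the core count is the limit.
    print(max_learning_threads(64, 8))   # 8
```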