LongISLND

For more information contact us at bina.rd@roche.com

Publication [Open access]

If you use LongISLND in your work, please cite the following:

Bayo Lau, Marghoob Mohiyuddin, John C. Mu, Li Tai Fang, Narges Bani Asadi, Carolina Dallett, and Hugo Y.K. Lam

LongISLND: In silico Sequencing of Lengthy and Noisy Datatypes

Bioinformatics, first published online September 25, 2016
doi:10.1093/bioinformatics/btw602

Introduction

LongISLND is a read simulator that profiles the characteristics of third-generation, single-molecule sequencing technologies and simulates reads accordingly. The general software architecture is easily extensible, as demonstrated by the emulation of Pacific Biosciences (PacBio) multi-pass sequencing with P5 and P6 chemistries, producing data in FASTQ, H5, and the latest PacBio BAM format. Read on for application examples with PacBio and Oxford Nanopore (ONT) data.

Download LongISLND

GitHub repository: https://github.com/bioinform/longislnd

System Requirements

The following must be installed:

Java 1.8
Maven (to build from source)
Python 2.7 (for convenience scripts and usage examples)
samtools (for convenience scripts and usage examples)
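
Before building, it can help to confirm that the required commands are reachable. The following is a minimal sketch, not part of LongISLND: the helper names find_on_path and missing_tools are hypothetical, the command names (java, mvn, samtools) are taken from this README, and the check only looks at PATH without verifying versions.

```python
import os


def find_on_path(name):
    """Return the full path of an executable called `name`, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if find_on_path(name) is None]


if __name__ == "__main__":
    # Command names assumed from this README: java, mvn, samtools.
    missing = missing_tools(["java", "mvn", "samtools"])
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required commands were found on PATH.")
```

An empty result from missing_tools only means the commands exist; whether java is actually version 1.8 still has to be checked separately.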

Installing LongISLND

From Source

change current working directory into the source code directory
execute linux_build.sh, which assumes wget, Java 1.8 (java) and Maven (mvn) are available.
accept HDF Java Products' license and accept default installation location to proceed.

From Binary Download

download a binary package from https://github.com/bioinform/longislnd/releases
execute tar xzf ${TAR_GZ} to unpack the tar.gz package
change current working directory into the binary directory
execute linux_build.sh, which assumes wget and curl are available.
accept HDF Java Products' license and accept default installation location to proceed.

Running LongISLND

The Java JAR can be run through the following convenience Python scripts, which also demonstrate its usage.

simulate.py

execute with -h for a list of options
simulate sequences given a model and a genome

sample.py

execute with -h for a list of options
please refer to the Usage Examples below for input preparation.
build model used by simulate.py.
for PacBio reads, the script will look for files whose names are:

*.fofn, each storing a list of bax.h5 read files
*.fofn.cmp.h5, each storing the alignment of the corresponding *.fofn
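
The naming convention above can be verified before running sample.py. This is a minimal sketch under the assumption that the *.fofn files and their alignments sit together in one directory; the function name check_pacbio_inputs is hypothetical and not part of LongISLND.

```python
import glob
import os


def check_pacbio_inputs(input_dir):
    """Pair each *.fofn in input_dir with its expected *.fofn.cmp.h5.

    Returns (paired, unpaired): paired is a list of (fofn, cmp_h5) tuples,
    unpaired lists the *.fofn files whose alignment file is missing.
    """
    paired, unpaired = [], []
    for fofn in sorted(glob.glob(os.path.join(input_dir, "*.fofn"))):
        cmp_h5 = fofn + ".cmp.h5"  # alignment file name expected for this fofn
        if os.path.isfile(cmp_h5):
            paired.append((fofn, cmp_h5))
        else:
            unpaired.append(fofn)
    return paired, unpaired
```

Any file reported in unpaired has no matching alignment under this convention.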

Troubleshooting

Consider setting the maximum Java heap size to match the memory available on your machine. For example, java -Xmx20g allows the JVM to use up to 20 GB.
The memory setting can be specified for sample.py and simulate.py with --jvm_opt " -Xmx20g "
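
As an illustration, the sketch below builds an argument list carrying the --jvm_opt setting. The helper name with_jvm_heap is hypothetical; only --jvm_opt and the -Xmx value come from this README, and all other script options are left out (run the scripts with -h to list them).

```python
def with_jvm_heap(command, gigabytes):
    """Return a copy of `command` with a --jvm_opt heap limit appended."""
    if gigabytes <= 0:
        raise ValueError("heap size must be positive")
    # The surrounding spaces mirror the README's example: --jvm_opt " -Xmx20g "
    return list(command) + ["--jvm_opt", " -Xmx%dg " % gigabytes]


# Other simulate.py options are omitted here; run it with -h to list them.
cmd = with_jvm_heap(["python", "simulate.py"], 20)
print(cmd)
```

The resulting list can be passed to subprocess.check_call once the remaining options have been added.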

Usage Examples

These examples demonstrate LongISLND end to end: downloading data, aligning, learning a model, and simulating reads.

They have been tested only on Linux because of various dependencies.

Installing Aligners

PacBio

This sets up PacBio's SMRTAnalysis 2.3, if it is not already available to you. All download_and_align.sh scripts in the PacBio examples assume an installation location of sampling_example/smrtanalysis; please change those scripts if you want to use your own version of SMRTAnalysis.

Change working directory to sampling_example.
execute setup_smrt23.sh

download and build PacBio's SMRTAnalysis2.3
(to avoid complications) select NONE in answering the question "What job management system will you be using?"

ONT

This sets up GraphMap (Sovic, I. et al (2016). Fast and sensitive mapping of nanopore sequencing reads with graphmap. Nat Commun, 7.) and Samtools (H. Li, et al (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics. 25(16)), if they are not already available to you. sampling_example/ont_ecoli/download_and_align.sh assumes installation locations of sampling_example/graphmap and sampling_example/samtools-1.2; please change the script if you want to use your own versions of the aligners.

Change working directory to sampling_example.
execute setup_ont.sh

download and build GraphMap
download and build Samtools

PacBio P6 E. coli

Change working directory to sampling_example/ecoli.
execute download_and_align.sh

download ecoli data and assembly
align to corresponding assembly
generate *.fofn and *.fofn.cmp.h5 used by sample.py

execute learn_and_simulate.sh

invoke sample.py to learn model from aligned data
invoke simulate.py to simulate reads and output in PacBio's BAM format, stored in the default out directory.
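
The order of the steps in this example can be sketched as a small driver. This is only an illustration: the names ECOLI_STEPS and run_steps are hypothetical, the paths are assumed to be relative to the directory that contains sampling_example, and invoking the scripts through bash is an assumption.

```python
import subprocess

# (working directory, script) pairs in the order described in this README.
# The first step is only needed if SMRTAnalysis 2.3 is not yet installed.
ECOLI_STEPS = [
    ("sampling_example", "setup_smrt23.sh"),
    ("sampling_example/ecoli", "download_and_align.sh"),
    ("sampling_example/ecoli", "learn_and_simulate.sh"),
]


def run_steps(steps, runner=subprocess.check_call):
    """Run each script from its own directory, stopping at the first failure."""
    for workdir, script in steps:
        # Assumption: the scripts are invoked through bash.
        runner(["bash", script], cwd=workdir)
```

Calling run_steps(ECOLI_STEPS) would run the three scripts in sequence; because subprocess.check_call raises on a non-zero exit status, a failed download or alignment stops the run before learning starts.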

ONT R7.3 E. coli

Change working directory to sampling_example/ont_ecoli.
execute download_and_align.sh

download ecoli data (Loman, N.J. et al (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Meth, 12(8), 733–735)
extract sequences
align to reference

execute learn_and_simulate.sh

invoke sample.py to learn model from aligned data
invoke simulate.py to simulate reads and output in the FASTQ format, stored in the default out directory.

PacBio P6 CHM1

This example is storage-, I/O-, and compute-intensive because of the size of human-scale sequencing data.

Change working directory to sampling_example/p6_chm1.
execute download_and_align.sh

download a subset of P6 CHM1 data
download the MHAP assembly
align to corresponding assembly
generate *.fofn and *.fofn.cmp.h5 used by sample.py

execute learn.sh

learn model from aligned data
NUM_THREADS can be changed to parallelize learning. At least 3 GB of memory per thread is recommended.
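
The per-thread memory recommendation gives a simple upper bound on NUM_THREADS for a given memory budget. A minimal sketch follows; the function name max_learning_threads is hypothetical, and how the budget relates to the JVM heap setting used by learn.sh is not covered here.

```python
def max_learning_threads(available_gb, gb_per_thread=3):
    """Largest thread count that leaves at least gb_per_thread GB per thread.

    Returns 0 when available_gb is below the per-thread recommendation.
    """
    if available_gb < 0 or gb_per_thread <= 0:
        raise ValueError("memory sizes must be positive")
    return int(available_gb // gb_per_thread)
```

For instance, a 20 GB budget supports at most 6 threads under this rule.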

To simulate reads, use simulate.py as in sampling_example/ecoli/learn_and_simulate.sh.