First Steps with ddRAGE

Installation

The easiest way to install ddRAGE is by using conda and the bioconda channel:

me@machine:~$ conda create -c bioconda --name ddrage ddrage
me@machine:~$ source activate ddrage
(ddrage) me@machine:~$ ddrage

This will install ddRAGE and all of its dependencies into a conda environment named ddrage. After activating the environment, ddrage can be executed from the command line. Another option is to use pip and install ddRAGE from PyPI (using the name ddrage):

me@machine:~$ pip install ddrage

You can also download the source code and install it from the local checkout:

me@machine:~$ git clone https://bitbucket.org/genomeinformatics/rage ddrage
me@machine:~$ cd ddrage
me@machine:~/ddrage$ pip install .

Whichever installation method you choose, the ddrage executable will be available on your PATH.
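To confirm that the installation put the executables on the PATH, a small shell helper along the following lines may be useful. This is only a sketch: the function name missing_tools is hypothetical and not part of ddRAGE, and it relies solely on the standard command -v builtin.

```shell
# Print every given command name that cannot be found on the PATH.
# Returns 0 when all commands are available, 1 otherwise.
# NOTE: missing_tools is a hypothetical helper, not shipped with ddRAGE.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}
```

For example, calling missing_tools ddrage randomize_fastq should print nothing once the environment is set up correctly.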

Simulate Reads

To start ddRAGE with the default parameters, use the installed executable:

me@machine:~$ ddrage

This creates a dataset and prints a list of the created files. The generated files are stored in a subfolder of the current directory whose name contains the current date and time and the string _ddRAGEdataset. Dataset parameters, such as the number of individuals, the number of loci, the coverage, and the read length, can be changed with command line parameters:

me@machine:~$ ddrage --loci 30 --nr-individuals 5 --coverage 30 --read-length 150
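Because the output folder name contains a timestamp, scripts that run ddRAGE usually need to look the folder up afterwards. The following sketch assumes only what is stated above, namely that the folder name contains the string _ddRAGEdataset; the function name latest_dataset_dir is hypothetical.

```shell
# Print the most recently modified directory in the current folder whose
# name contains "_ddRAGEdataset"; return 1 if there is none.
# NOTE: latest_dataset_dir is a hypothetical helper, not part of ddRAGE.
latest_dataset_dir() {
    # -t sorts by modification time (newest first), -d lists the
    # directories themselves instead of their contents.
    latest=$(ls -td -- *_ddRAGEdataset*/ 2>/dev/null | head -n 1)
    [ -n "$latest" ] || return 1
    # Strip the trailing slash added by the glob pattern.
    echo "${latest%/}"
}
```

Directory names containing newlines are not handled by this sketch.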

The number of individuals that can be simulated is limited by the barcode file in use, so simulating datasets with many individuals may require a different set of barcodes. The default barcode set supports up to 24 individuals in normal mode and up to 96 individuals when all four p7 barcodes are combined:

me@machine:~$ ddrage -l 1000 --nr-individuals 80 --combine-p7-bcs

For anything larger than this, the barcode set huge (-b huge) can be used, which supports up to 1461 individuals.
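The limits above can be turned into a small lookup, sketched here as a shell function. The function name barcode_options is hypothetical; the thresholds (24, 96, 1461) and the options --combine-p7-bcs and -b huge are taken from the text above, and no other ddRAGE behavior is assumed.

```shell
# Print the barcode-related ddrage options needed to simulate the given
# number of individuals, based on the limits 24, 96, and 1461.
# NOTE: barcode_options is a hypothetical helper, not part of ddRAGE.
barcode_options() {
    n=$1
    if [ "$n" -le 24 ]; then
        :                           # default barcode set is sufficient
    elif [ "$n" -le 96 ]; then
        echo "--combine-p7-bcs"     # use all four p7 barcodes
    elif [ "$n" -le 1461 ]; then
        echo "-b huge"              # switch to the huge barcode set
    else
        echo "more than 1461 individuals are not covered by these sets" >&2
        return 1
    fi
}
```

The output can then be spliced into a call, for example ddrage --nr-individuals 500 $(barcode_options 500), leaving the substitution unquoted so that -b and huge are passed as separate arguments.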

A full list of parameters is given in the documentation. The cookbook contains example parameter sets that generate common patterns of ddRAD datasets.

Randomize Read Order

ddRAGE writes reads to the FASTQ files in the order in which they were simulated. Because this ordering can make the dataset unrealistically easy for some analysis tools, the FASTQ files should be randomized before they are used for a realistic assessment. This can be done with the randomize_fastq script that is installed alongside ddRAGE:

me@machine:~$ ls
ddRAGEdataset_ATCACG_1.fastq   ddRAGEdataset_ATCACG_2.fastq
me@machine:~$ randomize_fastq ddRAGEdataset_ATCACG_1.fastq ddRAGEdataset_ATCACG_2.fastq ddRAGEdataset_ATCACG_randomized_1.fastq.gz ddRAGEdataset_ATCACG_randomized_2.fastq.gz
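Randomization should change only the order of the reads, so a quick sanity check is to compare read counts before and after. The sketch below assumes standard four-line FASTQ records and handles both plain and gzip-compressed files; the function name count_reads is hypothetical and not part of ddRAGE.

```shell
# Print the number of reads in a FASTQ file, assuming four lines per record.
# Files ending in .gz are decompressed on the fly.
# NOTE: count_reads is a hypothetical helper, not part of ddRAGE.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}
```

With the files from the example above, count_reads ddRAGEdataset_ATCACG_1.fastq and count_reads ddRAGEdataset_ATCACG_randomized_1.fastq.gz would be expected to print the same number.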

Tools

ddRAGE also installs some useful auxiliary tools:

  • visualize_bbd shows the effect that different parameter values have on the beta-binomial distribution used to simulate coverage profiles. When installing via pip, this tool additionally requires the BBD-visualization extra packages (pip install ddrage[BBD-visualization], or pip install .[BBD-visualization] for a local installation).

  • learn_qmodel extracts a profile of quality values from a set of FASTQ files so that it can be used by ddRAGE.

  • remove_annotation removes the information that ddRAGE adds to the FASTQ headers, since not all analysis tools can handle it.