Assumptions and Notation

Error Model

Because most ddRADSeq pipelines use Illumina sequencers, the error model used to simulate sequencing errors is Illumina-specific as well. In short, every base has a fixed probability (default 0.01) of showing a base flip, i.e. a substitution error. This includes the p7 index barcode sequences.
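The per-base error model can be illustrated with a minimal sketch. The function name add_sequencing_errors and the uniform choice among the three alternative bases are assumptions made for this illustration; they are not taken from the ddRAGE source, whose actual implementation may differ.

```python
import random

BASES = "ACGT"


def add_sequencing_errors(seq, error_prob=0.01, rng=None):
    """Return (sequence with errors, list of 0-based error positions).

    Hypothetical helper: each base is flipped independently with
    probability error_prob.
    """
    rng = rng or random.Random()
    out = []
    error_positions = []
    for pos, base in enumerate(seq):
        if rng.random() < error_prob:
            # Assumption: the replacement is drawn uniformly from the
            # bases that differ from the original one.
            alternatives = [b for b in BASES if b != base]
            out.append(rng.choice(alternatives))
            error_positions.append(pos)
        else:
            out.append(base)
    return "".join(out), error_positions
```

Passing a seeded random.Random instance as rng makes the simulated errors reproducible. The returned positions are 0-based and count from the start of whatever sequence was passed in.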

Sequence Position Notation

All positions reported by ddRAGE are 0-based, meaning that the first base in a sequence is found at index 0. This is consistent with the results generated by Stacks. In addition, ddRAGE distinguishes between read positions and sequence positions:

Read Position

A position relative to the read start, including all auxiliary sequences such as barcodes. Since positions are 0-based, the first base of the genomic sequence is at read position len(auxiliary sequences). The read position is equivalent to the position of a base in the sequence line of the FASTQ record.

Sequence Position

A position relative to the beginning of the genomic sequence information, i.e. the position in the read after all auxiliary sequences have been removed. A read position can be obtained by adding the offset value (the summed lengths of p5_barcode, spacer, and binding_site) to the sequence position.
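The relation between the two notations can be sketched as follows. The function names and the example auxiliary sequences are hypothetical and chosen only so that the offset is 16; they are not part of the ddRAGE API.

```python
def aux_offset(p5_barcode, spacer, binding_site):
    """Total length of the auxiliary sequences preceding the genomic part."""
    return len(p5_barcode) + len(spacer) + len(binding_site)


def to_read_position(seq_pos, p5_barcode, spacer, binding_site):
    """Convert a 0-based sequence position to a 0-based read position."""
    return seq_pos + aux_offset(p5_barcode, spacer, binding_site)


def to_sequence_position(read_pos, p5_barcode, spacer, binding_site):
    """Convert a 0-based read position to a 0-based sequence position."""
    offset = aux_offset(p5_barcode, spacer, binding_site)
    if read_pos < offset:
        # The position lies inside the auxiliary sequences and therefore
        # has no sequence position.
        raise ValueError("read position lies within the auxiliary sequences")
    return read_pos - offset


# Made-up auxiliary sequences with a combined length of 8 + 2 + 6 = 16.
EXAMPLE_AUX = ("ATCACGAT", "CG", "TGCAGG")
print(to_read_position(4, *EXAMPLE_AUX))       # 20
print(to_sequence_position(20, *EXAMPLE_AUX))  # 4
```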

Which is used where?

Both notations have advantages and disadvantages, depending on where they are used. Sequencing errors can arise in all parts of a read and hence must be given as read positions. However, once the auxiliary sequences have been removed, read positions no longer point to the correct base. To deal with this problem, ddRAGE reports both the read position and the sequence position for all mutations. If two positions are given, as in p5@20(4), the position in parentheses is the sequence position and the one outside the parentheses is the read position.

Example: A SNP would show up in a FASTQ file like this:

@nameline ... mutations:'p5@20(4):A>G'
/aux_seq/ACGTGCGT
+
/aux_qvs/IIIIIIII

assuming the length of the auxiliary sequences to be 16. Hence, the read position of the SNP is reported as 20, while the sequence position, without the auxiliary sequences, is 4.
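A downstream script could split such an annotation into its parts roughly as shown here. This is a sketch that covers only the SNP form shown in the example; the regular expression, the function name parse_snp, and the assumption that the prefix is either p5 or p7 are illustrative and not taken from ddRAGE itself. Other mutation types may use a different layout.

```python
import re

# Matches annotations of the form p5@20(4):A>G
SNP_RE = re.compile(
    r"(?P<mate>p[57])@(?P<read_pos>\d+)\((?P<seq_pos>\d+)\)"
    r":(?P<from_base>[ACGTN])>(?P<to_base>[ACGTN])"
)


def parse_snp(annotation):
    """Split a SNP annotation into mate, positions, and bases."""
    match = SNP_RE.fullmatch(annotation)
    if match is None:
        raise ValueError(f"not a SNP annotation: {annotation!r}")
    fields = match.groupdict()
    # Both positions are 0-based integers.
    fields["read_pos"] = int(fields["read_pos"])
    fields["seq_pos"] = int(fields["seq_pos"])
    return fields


snp = parse_snp("p5@20(4):A>G")
# The difference between the two positions is the auxiliary length.
print(snp["read_pos"] - snp["seq_pos"])  # 16
```

In the example record, the base at sequence position 4 of ACGTGCGT is a G, which matches the second base of the A>G annotation.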

A sequencing error is reported as:

@nameline ... p5_seq_errors:'3'
ACGTA
+
III"I

The base at index 3 (the fourth base in the sequence), in this case the ‘T’, is the sequencing error.

Quality Values

The simulated quality values use the Sanger / Illumina 1.8 format. Quality values are sampled from a position-specific distribution learned from real ddRAD data sets. To define a custom distribution, see the input files chapter. To learn a distribution from a given data set, use the learn_qmodel script as described in the tools chapter.
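In the Sanger / Illumina 1.8 format, each quality character encodes a Phred score as its ASCII code minus 33, and a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below decodes the quality line from the sequencing error example above; the function names are chosen for this illustration only.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality line into integer Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probabilities(quality_line, offset=33):
    """Convert a FASTQ quality line into per-base error probabilities."""
    return [10 ** (-q / 10) for q in phred_scores(quality_line, offset)]


# Quality line of the sequencing error example: the base at index 3
# carries a much lower quality value than its neighbours.
print(phred_scores('III"I'))  # [40, 40, 40, 1, 40]
```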

Ploidy

All genomes created by ddRAGE are diploid. Hence, each individual can only possess either one or two different alleles for each mutation at a locus.