Output Format

ddRAGE writes three types of output files: simulated reads that can be analyzed, supplementary files that contain additional information about the reads as well as statistics, and ground truth files that contain the expected results of an analysis.

Reads

The reads are written to disk as FASTQ files. Because paired-end reads are simulated, two files are created. The file with the suffix _1.fastq contains the p5 (forward) reads, while the file with the suffix _2.fastq contains the p7 (reverse) reads.

The reads are sorted by locus ID and individual. While this makes it easy to validate results manually, it might create unreasonably easy problem instances for analysis tools. To ensure a realistic problem instance, please consider shuffling the FASTQ files, as described here.
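Shuffling paired files only works if both files receive the same permutation, otherwise the p5 and p7 mates no longer line up. The following is a minimal sketch of that idea, not the procedure referenced above. It assumes plain four-line FASTQ records and files small enough to fit in memory; the function names and the seed parameter are made up for this illustration.

```python
import random
from itertools import islice


def read_fastq_records(path):
    """Return a list of 4-line FASTQ records (name, sequence, plus, qualities)."""
    records = []
    with open(path) as handle:
        while True:
            # Normalize line endings so records can be reordered safely.
            record = [line.rstrip("\n") + "\n" for line in islice(handle, 4)]
            if not record:
                break
            if len(record) != 4:
                raise ValueError(f"truncated FASTQ record in {path}")
            records.append(record)
    return records


def shuffle_paired_fastq(in_1, in_2, out_1, out_2, seed=None):
    """Shuffle two paired FASTQ files with the same random permutation."""
    reads_1 = read_fastq_records(in_1)
    reads_2 = read_fastq_records(in_2)
    if len(reads_1) != len(reads_2):
        raise ValueError("paired FASTQ files differ in number of reads")

    # One shared permutation keeps read i of file 1 paired with read i of file 2.
    order = list(range(len(reads_1)))
    random.Random(seed).shuffle(order)

    with open(out_1, "w") as handle_1, open(out_2, "w") as handle_2:
        for index in order:
            handle_1.writelines(reads_1[index])
            handle_2.writelines(reads_2[index])
    return len(order)
```

Passing a fixed seed makes the shuffled data set reproducible, which can be convenient when comparing analysis tools on the same instance.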

Annotation

Each effect simulated for a specific read is annotated in the name line of that read in the FASTQ file. As usual for current Illumina sequencers, the name line starts with a CASAVA style header like

@instrument:29:42:72:23:0:0 1:N:0:ACTTGA

which contains information provided by the sequencer. The instrument name is always instrument, and the following six entries (run id, flowcell id, lane, tile, x-coordinate, y-coordinate) are populated with random integer values. Run id, flowcell id, and lane are chosen uniformly from [0, 10000), and tile, x-coordinate, and y-coordinate from [0, 1000000). With roughly 10^30 possible combinations, accidental collisions between name lines are very unlikely.

In the second group, the pair entry is set to 1 for p5 reads and to 2 for p7 reads. The filtered entry is N (no) for all reads and none of the control bits is set. The most interesting part here is the p7 index barcode.

After the CASAVA line, separated by a space, follows a space separated list of key-value pairs. Each of these pairs is formatted as key:'value'. The key is guaranteed to contain no spaces; the value can contain spaces. The following keys can occur:

Key	Value	Format/Type	Example
read_from	Name of the individual from which this read originates.	Individual name as string	read_from:'Individual 20'
at_locus	For valid reads: Name/Number of the locus from which this read originates. For singletons/HRL reads: Locus type and Name/Number of origin locus.	Integer, String, String	at_locus:'42', at_locus:'singleton_35', at_locus:'hrl_23'
p5_bc	Sequence of the p5 barcode.	String	p5_bc:'ACTTGA'
p7_bc	Sequence of the p7 barcode.	String	p7_bc:'GGCTAC'
rID	Pseudounique read ID.	String	rID:'fjibo'
type	Individual event type that created this read.	String	type:'common', type:'PCR copy'
p5_seq_errors	Sequencing errors in the p5 read.	Semicolon separated positions	p5_seq_errors:'17;23'
p7_seq_errors	Sequencing errors in the p7 read.	Semicolon separated positions	p7_seq_errors:'17;23'
p7_barcode_seq_errors	Sequencing errors in the p7 index barcode.	Semicolon separated positions	p7_barcode_seq_errors:'5'
mutations	Types of mutations in this read. Can be empty for het. mutations including the common allele (Allele 0). For a full list see Notation of Mutations.	String	mutations:'p7@36(20):A>T'
genotype	Which allele has been applied to this read, along with the zygosity.	String	genotype:'het. Allele 1', genotype:'hom. Allele 42'
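The name line layout described above can be read back with a few lines of Python. This is only a sketch: the function names and the dictionary keys used for the CASAVA entries (run, flowcell, lane, ...) are chosen for this example, and it assumes that values are enclosed in plain ASCII single quotes in the actual files (typographic quotes are accepted as well).

```python
import re

# key:'value' pairs; the key contains no spaces, the value may contain spaces.
PAIR_PATTERN = re.compile(r"([^\s:]+):['‘’]([^'‘’]*)['‘’]")

CASAVA_FIRST = ("instrument", "run", "flowcell", "lane", "tile", "x", "y")
CASAVA_SECOND = ("pair", "filtered", "control", "index")


def parse_name_line(name_line):
    """Split an annotated FASTQ name line into CASAVA entries and annotations."""
    parts = name_line.strip().lstrip("@").split(" ", 2)
    if len(parts) < 2:
        raise ValueError("expected a CASAVA style name line")
    casava = dict(zip(CASAVA_FIRST, parts[0].split(":")))
    casava.update(zip(CASAVA_SECOND, parts[1].split(":")))
    # Everything after the two CASAVA groups is the key-value annotation.
    annotation_text = parts[2] if len(parts) > 2 else ""
    annotations = {
        match.group(1): match.group(2)
        for match in PAIR_PATTERN.finditer(annotation_text)
    }
    return casava, annotations


def parse_positions(value):
    """Turn a semicolon separated position list such as '17;23' into integers."""
    return [int(position) for position in value.split(";") if position]
```

For example, parse_name_line returns the individual name under the key read_from, and parse_positions can be applied to the p5_seq_errors, p7_seq_errors, and p7_barcode_seq_errors values.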

Note that some ddRAD analysis tools cannot handle this addition to the FASTQ name line. For these, the annotation has to be removed using the remove_annotation script, as described in the tools chapter.
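To illustrate what removing the annotation amounts to, the sketch below keeps only the two CASAVA groups of each name line. It does not reproduce the remove_annotation script, which remains the documented route; the function names are hypothetical, and plain four-line FASTQ records are assumed.

```python
def strip_annotation(name_line):
    """Keep only the two CASAVA groups of an annotated name line."""
    return " ".join(name_line.split(" ")[:2]).rstrip("\n")


def strip_fastq_annotations(lines):
    """Yield FASTQ lines with the annotation removed from every name line."""
    for index, line in enumerate(lines):
        if index % 4 == 0:  # the name line is the first line of each record
            yield strip_annotation(line) + "\n"
        else:
            yield line
```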

Supplementary Files

The supplementary files contain additional information about the data set and the simulated events. They consist of:

An annotation file containing a list of the used parameters, sequences etc.

A statistics file in PDF format. This is a graphical aggregation of important parameters of the simulation. It includes plots for the distribution of locus types, the number of SNPs, etc.

A barcode file containing barcodes and spacers used for individuals in the data set.

Ground Truth

The ground truth generated by ddRAGE is saved in YAML format. It contains three separate documents:

Individual information: Containing the auxiliary sequences associated with the individuals in the sample.
Locus entries: One entry for each valid locus, named Locus i, where i is the locus number, starting with 0. Each locus entry contains the locus sequence, coverage, and genotype information for the individuals.
HRL entries: One entry per HRL locus, named HRL Locus i. Contains the coverage for each individual at the HRL locus.

These three segments are saved as separate YAML documents in the same file (separated by lines containing only ---). They can be unpacked using the load_all function supplied by most YAML readers.
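As a sketch of how the three documents might be unpacked in Python, the example below uses safe_load_all from PyYAML, a third-party package that is assumed to be installed. The function names are hypothetical, and the order of the returned values follows the order in which the documents are listed above.

```python
def unpack_ground_truth(documents):
    """Return (individuals, loci, hrl_loci) from the three loaded documents."""
    documents = list(documents)
    if len(documents) != 3:
        raise ValueError(f"expected 3 YAML documents, found {len(documents)}")
    individuals, loci, hrl_loci = documents
    return individuals, loci, hrl_loci


def load_ground_truth(path):
    """Load a ground truth file with PyYAML (third-party, assumed installed)."""
    import yaml  # imported here so unpack_ground_truth works without PyYAML

    with open(path) as handle:
        # safe_load_all lazily yields one Python object per YAML document,
        # so the documents are consumed while the file is still open.
        return unpack_ground_truth(yaml.safe_load_all(handle))
```

safe_load_all is used instead of load_all because it does not construct arbitrary Python objects, which is the safer default for files that may come from other sources.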

Specification

Key	Content	Content Format
Individual Information	Auxiliary sequences for all individuals in the sample.	One dictionary per individual containing: ‘dbr’, ‘p5 bc’, ‘p5 spacer’, ‘p5 overhang’, ‘p7 bc’, ‘p7 spacer’, ‘p7 overhang’.
Locus `i` (one entry per locus)	Simulated events for the locus.	allele coverages (dict: allele name -> int), allele frequency (dict: allele name -> float), total locus coverage (int), nr of id reads (int), individual genotypes (dict: individual name -> allele; allele is a dict: allele name -> (cov (int), mutations (list of string representation of all mutations of the allele))
HRL locus `j` (one entry per HRL)	Coverages for the individuals at the HRL locus.	Dictionary mapping individual names (str) to coverage values (int).

Notation of Mutations

The four kinds of mutations simulated by ddRAGE, namely SNPs, insertions, deletions, and null alleles, are notated as follows:

Mutation Type	Representation	Example	Translation
SNP	[p5,p7]@[seq. pos]([read pos]):[base from]>[base to]	p5@33(54):A>T	An A>T polymorphism in the p5 read at genomic position 33 (without auxiliary sequences) and read position 54 (including auxiliary sequences).
Insertion	[p5,p7]@[seq. pos]([read pos]):+[insert bases]	p5@33(54):+ACG	An insertion of the sequence ACG in the p5 read after genomic position 33 (read position 54).
Deletion	[p5,p7]@[seq. pos]([read pos]):-[deleted bases]	p5@33(54):-T	A deletion of the sequence T in the p5 read after genomic position 33 (read position 54).
Null Alleles with alternative sequences	p7:NA_alternative, p5:NA_alternative	p7:NA_alternative	Null alleles changing the whole p5 or p7 sequence of a read.
Null Alleles with dropout	p7:NA_dropout, p5:NA_dropout	p5:NA_dropout	Null alleles preventing reads from being generated (can only be seen in the _gt file).

Regular expression for SNPs, Insertions, and Deletions:

import re

mutation_string = "p5@33(54):+ACG"
# Groups: read (p5/p7), sequence position, read position (in parentheses), change.
reg_exp = re.compile(r"(p\d)@(\d+)\((\d+)\):(.*)")
read_direction, genomic_position, read_position, diff = reg_exp.search(mutation_string).groups()
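Building on the notation table, a slightly fuller sketch can also classify the mutation and recognize the two null allele forms. The function name and the keys of the returned dictionary are made up for this example; only the string formats are taken from the table above.

```python
import re

# [p5|p7]@[seq. pos]([read pos]):[change] for SNPs, insertions, and deletions.
POSITIONAL = re.compile(r"(p[57])@(\d+)\((\d+)\):(.+)")
# p5:NA_alternative, p7:NA_dropout, and so on.
NULL_ALLELE = re.compile(r"(p[57]):NA_(alternative|dropout)")


def parse_mutation(text):
    """Parse one mutation string into a dictionary describing it."""
    match = NULL_ALLELE.fullmatch(text)
    if match:
        return {"read": match.group(1), "kind": "null allele",
                "variant": match.group(2)}

    match = POSITIONAL.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {text!r}")
    read, seq_pos, read_pos, change = match.groups()
    result = {"read": read, "seq_pos": int(seq_pos), "read_pos": int(read_pos)}

    if change.startswith("+"):
        result.update(kind="insertion", bases=change[1:])
    elif change.startswith("-"):
        result.update(kind="deletion", bases=change[1:])
    elif ">" in change:
        base_from, base_to = change.split(">", 1)
        result.update(kind="SNP", base_from=base_from, base_to=base_to)
    else:
        raise ValueError(f"unrecognized change in mutation string: {text!r}")
    return result
```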

Genomic Position vs. Read Position

If two positions are listed for a mutation, the position in parentheses is the read position, while the other is the sequence (genomic) position.

The read position describes the position measured from the beginning of the read, including all auxiliary sequences. This is equivalent to the position of the mutation in the reads in the FASTQ files.

The sequence position, on the other hand, denotes the position in the genomic sequence of the reads. This is helpful when only the genomic sequence is present and all auxiliary sequences have been removed during analysis.

Barcode file

The barcode file contains a header and one line for each individual in the dataset. Each line contains, in this order and separated by tabs, the following information:

individual name
p5 barcode
p7 barcode
p5 spacer sequence
p7 spacer sequence
individual annotation
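A file with this layout can be read with the csv module from the Python standard library. The sketch below assumes a single header line and the column order listed above; the names BarcodeEntry and read_barcode_file as well as the field names are chosen for this example.

```python
import csv
from collections import namedtuple

BarcodeEntry = namedtuple(
    "BarcodeEntry",
    ["individual", "p5_barcode", "p7_barcode", "p5_spacer", "p7_spacer", "annotation"],
)


def read_barcode_file(path):
    """Read a tab separated barcode file, skipping its single header line."""
    entries = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header line
        for row in reader:
            if not row:
                continue
            # Pad short rows so that a missing trailing column becomes "".
            entries.append(BarcodeEntry(*(row + [""] * 6)[:6]))
    return entries
```

Spacers can be empty, so empty fields are kept as empty strings rather than being dropped.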