Parameters¶
Detailed list of parameters:
Naming and Output Parameters¶
These option allow to modify the general handling of the data set, where it is stored, how it is named and similar aspects.
--name NAME
Name for the data set. If none is given, the name ddRAGEdataset will be used. Each data set will be written to an own folder in the output path to avoid naming conflicts. This name will appear in the final file name identifying the data set like:
$ ddrage ddRAGEdataset_TAGCTT_1.fastq $ ddrage --name foobar foobar_TAGCTT_1.fastq
-o OUTPUT_PATH_PREFIX --output OUTPUT_PATH_PREFIX
Prefix of the output path, i.e. the folder in which the output files created by ddRAGE are stored. As default, the current folder is used.
$ ddrage -o /tmp/rage_data $ cd /tmp/rage_data $ ls ddRAGEdataset_GGCTAC_1.fastq ddRAGEdataset_GGCTAC_2.fastq ddRAGEdataset_GGCTAC_barcodes.txt ddRAGEdataset_GGCTAC_gt.yaml logs/
Dataset Parameters¶
These option specify the content of the data set. This includes the number of loci and individuals, the simulated coverage quality as well as coverage boundaries.
-n NR_INDIVIDUALS, --nr-individuals NR_INDIVIDUALS
Number of individuals in the sample (Default: 3). The specific individuals will be extracted from the barcodes file, which poses an upper bound on the number of individuals. In the default barcode set (
full.txt
) a maximum of 24 individuals can be picked per p7 barcode. To increase the number the barcodes file has to be modified or a different barcodes file needs to be chosen. The filehuge.txt
can support up to 1462 individuals, but uses barcodes of length 10 to keep individuals distinguishable.-l LOCI, --loci LOCI
Number of loci for which reads will be simulated for the selected individuals (Default: 3) or path to FASTA file. If a FASTA file is given, the sequences contained will be used to create the locus sequences. In this case the number of loci simulated is the number of sequences in the file.
This parameter greatly influences the size of the resulting data set.
-r READ_LENGTH, --read-length READ_LENGTH
Total sequence length of the reads (including overhang, barcodes, etc., Default: 100).
This refers to the length of reads as they are returned by the sequencer and has to be distinguished from sequence length by which we refer to the sequence information from the individual. Consider the following p5 read:
ACGTGA G TAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN | | | | | | | genomic sequence (40 bp) | | enzyme overhang (3bp) | spacer (1bp) barcode (6bp) -> read length = 6 + 1 + 3 + 40 bp = 50bp
Mind that the length of the spacer sequence depends on the individual. Hence, p5 sequences in the same data set with the same read length will yield different length sequences. To trim all genomic sequences to the same length use the
--truncate-at
parameter. As the generated data sets are simulated to already be demultiplexed using the p7 index (barcode) there is no variation in the p7 sequence length, as all individuals use the same p7 spacer sequence, which are linked to the barcode sequence (unless the--combine-p7-bcs
parameter is used).The officially supported and well tested range of read lengths is 50 - 500bp. However, shorter or longer lengths are possible and will most likely also produce reasonable results.
-c COV, --coverage COV
Expected (target) coverage that will be created by normal duplication and snps (Default: 30).
This is the targeted sequencing depth dₛ. In actual ddRAD data sets this value is seldom reached precisely, hence a random process (coverage model) is used in order to simulated more realistic detests. The expected value of both available models (Poisson and Beta-binomial) is dₛ, but deviations to both sides, more reads or less reads than expected, can be reached. The model can be specified with the
--coverage-model
parameter.--hrl-number
Number of HRLs that will be added, given as fraction of total locus size (Default: 0.05). Example:
-l 100 --hrl-number 0.1
for 10 HRLs.--no-singletons
Disable generation of singleton reads. If set, no singletons will be simulated.
--diversity DIVERSITY
Handles the number of alleles created per locus. Default: 1.0, increase for more alleles / genotypes per locus.
This will be used as the λ parameter for a Zero Truncated Poisson Distribution, which is used to pick a number of different alleles for the locus. A higher λ means more alleles, thereby increasing the possible genotypes.
Example: Three out of six individuals receive a heterozygous SNP event for a specific locus. For each individual two alleles are chosen from the model. Using λ = 1.0 the expected number of different alleles will be 3: the true sequence (also called root allele, R), and 2 mutation alleles (the expected value of ZTPD(1.0) is 1.58: A, B). So only three different homozygous genotypes can be created: RA, RB, AB. Using λ = 5.0 the expected number of alleles is 6 and the number of different combinations rises to 10.
--gc-content
GC content of the generated sequences. This is used to create a skewed distribution of bases that satisfies the desired GC content. The bases within a class (GC and AT respectively) are chosen uniformly. The default value is 0.5
Example 1: The default value is 0.5, meaning that 50% of the bases will be GC, the rest will be AT. Within these classes the probability is chosen uniformly, yielding probabilities of 0.25 for each base.
Example 2:
--gc-content 0.1
the probabilities for A and T are both 0.45 ((1 - 0.9) / 2) while the probabilities of C and G are 0.05 = 0.1 / 2.-q QUALITY_MODEL, --quality-model QUALITY_MODEL
Model from which quality values will be sampled. The model is given as a .qmodel file.
--single-end, --se
Write a single-end dataset. Only writes a p5 FASTQ file. Default: False
--overlap, --ol
Overlap factor (between 0 and 1.0) of randomly generated reads. This value describes how much the ends of the p5 and the p7 read will overlap. Default 0
--multiple-p7-barcodes, --combine-p7-bcs
Combine individuals with multiple p7 barcodes in one output file. This simulates the bahaviour of an unsplit read file. Files created like this can be split up using the split_by_p7_barcode tool. Default: False
Coverage Model Parameters¶
--coverage-model {PD, BBD}
Mathematical model used to sample coverage values (Default: betabinomial). Currently two models are supported. The BBD model (default) uses a Beta-binomial distribution which creates coverage values with high variance and can be modeled to different coverage profiles using the
--BBD-alpha
and--BBD-beta
parameters. The PD model uses a Poisson distribution to sample coverage values. This creates a less variant set of distributions and creates easier instances for analysis.--BBD-alpha
Alpha parameter of the Beta-binomial distribution (Default: 6). Higher values increase the left tailing of the coverage distribution, if the BBD model is used.
--BBD-beta
Beta parameter of the Beta-binomial distribution (Default: 2). Higher values increase the right tailing of the coverage distribution, if the BBD model is used.
--max-pcr-copies MAX_PCR_COPY_NR
Maximum number of PCR copies that can be created for each finalized (potentially mutated and multiplied) read (Default: 3).
--hrl-max-cov MAX_COVERAGE, --hrl-max-coverage MAX_COVERAGE
Maximum coverage for Highly Repetitive Loci (HRLs) (Default: 2000). The minimum coverage is determined as μ + 2σ of the main coverage generating function.
Sequence Parameters¶
These options allow to change the auxiliary sequences used to construct the reads. Please note, that both the overhang and the restriction site of the enzymes used have to be specified. A list of enzymes and their associated parameters can be found here.
-d DBR, --dbr DBR
DBR sequence, used as unique molecular identifier (UMI), in IUPAC ambiguity code. Default: ‘NNNNNNMMGGACG’
--p5-overhang P5_OVERHANG
Sequence of the p5 overhang. Default: ‘TGCAT’ (NsiI)
--p7-overhang P7_OVERHANG
Sequence of the p7 overhang. Default: ‘TAC’ (Csp6I)
--p5-rec-site P5_REC_SITE
Sequence of the p5 recognition site. Default: ‘ATGCAT’ (Csp6I)
--p7-rec-site P7_REC_SITE
Sequence of the p7 recognition site. Default: ‘GTAC’ (NsiI)
-b BARCODE_SET, --barcodes BARCODE_SET
Path to barcodes file or predefined barcode set like ‘barcodes’, ‘small’ or ‘full’. Default: ‘barcodes’, a generic population. See input formats for more information.
Event Probabilities¶
The following parameters influence the probabilities of different event occurring. This includes individual event types, the probabilities of different mutation types, zygosity, and several other aspects:
--event-probabilities
Probability profile for the distribution of event types (common, dropout, mutation; in this order). Each entry can be given as a float or a string of python code (see example above) which is helpful for small probability values.
Example:
$ ddrage --event-probabilities 0.9 0.05 0.05 -> common 90%, dropout 5%, mutation 5% (Default)
--mutation-type-probabilities
Probability profile for the distribution of mutation types (snp, insertion, deletion, p5 na alternative, p7 na alternative, p5 na dropout, p7 na dropout; in this order). Each entry can be given as a float or a string of python code (see example above) which is helpful for small probability values.
Example:
$ ddrage --mutation-type-probabilities 0.8999 0.05 0.05 '0.0001*0.001' '0.0001*0.05' '0.0001*0.899' '0.0001*0.05' -> snp 89.99%, insertion 5%, deletion 5%, p5 na alternative 0.00001% , p7 na alternative 0.0005%, p5 na dropout 0.00899%, p7 na dropout 0.0005% (Default)
--prob-heterozygous PROB_HETEROZYGOCITY
Probability of mutations being heterozygous. Default: 0.5
--prob-incomplete-digestion PROB_INCOMPLETE_DIGESTION
Probability of incomplete digestion for common and mutation type individuals. Default: 0.1
--rate-incomplete-digestion PROB_INCOMPLETE_DIGESTION
Expected fraction of reads that are being lost in the event of Incomplete Digestion. Default: 0.2
--prob-pcr-copy PROB_PCR_COPY
Probability that a (potentially mutated and multiplied) read will receive pcr copies. This influences the simulated pcr copy rate. Default: 0.2
--hrl-pcr-copies HRL_PCR_COPIES
Probability of PCR copies for HRL reads in relation to normal reads. Default: 0.9, i.e. the probability for a PCR copy of a HRL read is
prob_pcr_copy * hrl_pcr copies = 0.2 * 0.9 = 0.18
--singleton-pcr-copies SINGLETON_PCR_COPIES
Probability of PCR copies for singleton reads in relation to normal reads. Default: 1/3, i.e. the probability for a PCR copy of a singleton read is
prob_pcr_copy * singleton_pcr_copies = 0.2 * (1/3) = 0.0666...
-e PROB_SEQ_ERROR, --prob-seq-error PROB_SEQ_ERROR
Probability of sequencing substitution errors. Default: 0.01
Debugging and User Output¶
-v, --verbose
Increase verbosity of output:
-v: Show progress of simulation, including current simulation phase and a percentage of loci finished. -vv: Print used parameters after simulation. This is similar to the content of the annotation output file. -vvv: Show details for each simulated locus, including the simulated types.
-z, --zip
Write reads as .fastq.gz files. The .gz suffix is automatically added.
--DEBUG
Set debug-friendly values for Locus distribution (Probabilities for common, dropout, mutation are all 1/3).
--version
Print the version number.