Parameters
Detailed list of parameters:
Naming and Output Parameters
These options modify the general handling of the data set: where it is stored, how it is named, and similar aspects.
--name NAME
Name for the data set. If none is given, the name ddRAGEdataset is used. Each data set is written to its own folder in the output path to avoid naming conflicts. The name appears in the final file names and identifies the data set, for example:
$ ddrage
ddRAGEdataset_TAGCTT_1.fastq
$ ddrage --name foobar
foobar_TAGCTT_1.fastq
-o OUTPUT_PATH_PREFIX, --output OUTPUT_PATH_PREFIX
Prefix of the output path, i.e. the folder in which the output files created by ddRAGE are stored. By default, the current folder is used.
$ ddrage -o /tmp/rage_data
$ cd /tmp/rage_data
$ ls
ddRAGEdataset_GGCTAC_1.fastq  ddRAGEdataset_GGCTAC_2.fastq  ddRAGEdataset_GGCTAC_barcodes.txt  ddRAGEdataset_GGCTAC_gt.yaml  logs/
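The naming pattern visible in the two listings above can be sketched in Python. This is only an illustration: the helper `expected_output_files` is hypothetical and not part of ddRAGE, and the assumption that every data set produces exactly these four files is taken from the single `ls` listing shown above.

```python
def expected_output_files(name="ddRAGEdataset", barcode="GGCTAC"):
    """Hypothetical helper: file names following the pattern in the listing above.

    `barcode` is the six-letter tag that appears in the file names
    (GGCTAC and TAGCTT in the examples).
    """
    prefix = f"{name}_{barcode}"
    return [
        f"{prefix}_1.fastq",       # first read file
        f"{prefix}_2.fastq",       # second read file
        f"{prefix}_barcodes.txt",  # barcodes used for this data set
        f"{prefix}_gt.yaml",       # annotation file
    ]


for file_name in expected_output_files(name="foobar", barcode="TAGCTT"):
    print(file_name)
```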
Dataset Parameters
These options specify the content of the data set. This includes the number of loci and individuals, the simulated coverage quality as well as coverage boundaries.
-n NR_INDIVIDUALS, --nr-individuals NR_INDIVIDUALS
Number of individuals in the sample (Default: 3). The individuals are taken from the barcodes file, which imposes an upper bound on the number of individuals. In the default barcode set (full.txt) a maximum of 24 individuals can be picked per p7 barcode. To increase this number, the barcodes file has to be modified or a different barcodes file has to be chosen. The file huge.txt supports up to 1462 individuals, but uses barcodes of length 10 to keep individuals distinguishable.
-l LOCI, --loci LOCI
Number of loci for which reads will be simulated for the selected individuals (Default: 3), or the path to a FASTA file. If a FASTA file is given, the sequences it contains are used to create the locus sequences. In this case the number of simulated loci equals the number of sequences in the file.
This parameter greatly influences the size of the resulting data set.
-r READ_LENGTH, --read-length READ_LENGTH
Total sequence length of the reads (including overhang, barcodes, etc.; Default: 100).
This refers to the length of the reads as they are returned by the sequencer. It has to be distinguished from the sequence length, by which we mean the length of the sequence information from the individual. Consider the following p5 read:
ACGTGA G TAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|      | |   |
|      | |   genomic sequence (40 bp)
|      | enzyme overhang (3bp)
|      spacer (1bp)
barcode (6bp)

-> read length = 6 + 1 + 3 + 40 bp = 50bp
Mind that the length of the spacer sequence depends on the individual. Hence, p5 reads in the same data set with the same read length contain genomic sequences of different lengths. To trim all genomic sequences to the same length, use the --truncate-at parameter. As the generated data sets are simulated to be already demultiplexed by the p7 index (barcode), there is no variation in the p7 sequence length: all individuals use the same p7 spacer sequence, which is linked to the barcode sequence (unless the --combine-p7-bcs parameter is used).
The officially supported and well-tested range of read lengths is 50 - 500 bp. However, shorter or longer lengths are possible and will most likely also produce reasonable results.
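The read length arithmetic from the p5 diagram can be written out as a short Python sketch. It only uses the four components shown in the diagram (barcode, spacer, overhang, genomic sequence); the function names are hypothetical, and the alternative spacers are made-up values that illustrate how the spacer length changes the genomic length at a fixed read length.

```python
def p5_read_length(barcode, spacer, overhang, genomic_length):
    """Read length as the sum of the components shown in the diagram."""
    return len(barcode) + len(spacer) + len(overhang) + genomic_length


def genomic_length(read_length, barcode, spacer, overhang):
    """Genomic bases left over once the auxiliary sequences are subtracted."""
    return read_length - len(barcode) - len(spacer) - len(overhang)


# The example from the diagram: 6 + 1 + 3 + 40 = 50
print(p5_read_length("ACGTGA", "G", "TAC", 40))

# Same read length, different (illustrative) spacers: the genomic part varies.
for spacer in ("", "G", "GT"):
    print(repr(spacer), genomic_length(50, "ACGTGA", spacer, "TAC"))
```

With a fixed read length of 50, the three spacers leave 41, 40 and 39 genomic bases respectively, which is the variation that --truncate-at is meant to remove.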
-c COV, --coverage COV
Expected (target) coverage that will be created by normal duplication and SNPs (Default: 30).
This is the targeted sequencing depth dₛ. In actual ddRAD data sets this value is seldom reached precisely, hence a random process (coverage model) is used to simulate more realistic data sets. The expected value of both available models (Poisson and Beta-binomial) is dₛ, but deviations in both directions (more or fewer reads than expected) can occur. The model can be specified with the --coverage-model parameter.
Figure: Coverage profiles for valid (1, 2) and invalid (3, 4) locus types. Valid loci sample coverage values from a coverage model, while singletons always have a coverage of 1 (before adding PCR copies) and HRLs receive a very high coverage.
--hrl-number
Number of HRLs that will be added, given as a fraction of the total number of loci (Default: 0.05). Example: -l 100 --hrl-number 0.1 for 10 HRLs.
--no-singletons
Disable generation of singleton reads. If set, no singletons will be simulated.
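The --hrl-number example (-l 100 --hrl-number 0.1 for 10 HRLs) amounts to a single multiplication. The sketch below assumes that the product is rounded to the nearest integer; how ddRAGE actually handles fractional results is not stated here, and `hrl_count` is a hypothetical name.

```python
def hrl_count(loci, hrl_fraction=0.05):
    """Number of HRLs for a given number of loci (rounding is an assumption)."""
    return round(loci * hrl_fraction)


print(hrl_count(100, 0.1))  # the example from the text
print(hrl_count(100))       # with the default fraction of 0.05
```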
--diversity DIVERSITY
Controls the number of alleles created per locus. Default: 1.0; increase for more alleles / genotypes per locus.
This will be used as the λ parameter for a Zero Truncated Poisson Distribution (ZTPD), which is used to pick a number of different alleles for the locus. A higher λ means more alleles, thereby increasing the possible genotypes.
Figure: Probability density for three values of λ, along with the expected value (E) for the respective parameter. This is equivalent to the expected (average) number of alleles at all loci.
Example: Three out of six individuals receive a heterozygous SNP event for a specific locus. For each individual two alleles are chosen from the model. Using λ = 1.0 the expected number of different alleles is 3: the true sequence (also called root allele, R) and 2 mutation alleles (A, B; the expected value of ZTPD(1.0) is 1.58). So only three different heterozygous genotypes can be created: RA, RB, AB. Using λ = 5.0 the expected number of alleles is 6 and the number of different combinations rises to 10.
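The expected value quoted for ZTPD(1.0) follows from the standard formula for a zero-truncated Poisson distribution, E[X] = λ / (1 - e^(-λ)). The following sketch evaluates that formula and the corresponding probability mass function. It illustrates the distribution itself and is not taken from the ddRAGE source.

```python
import math


def ztp_mean(lam):
    """Expected value of a zero-truncated Poisson distribution with rate lam."""
    return lam / (1.0 - math.exp(-lam))


def ztp_pmf(k, lam):
    """P(X = k) for a zero-truncated Poisson distribution (k >= 1)."""
    if k < 1:
        return 0.0
    return (lam ** k * math.exp(-lam)) / (math.factorial(k) * (1.0 - math.exp(-lam)))


for lam in (1.0, 5.0):
    print(f"lambda = {lam}: E[X] = {ztp_mean(lam):.2f}")
```

For λ = 1.0 this gives approximately 1.58, matching the value stated in the example.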
--gc-content
GC content of the generated sequences. This is used to create a skewed distribution of bases that satisfies the desired GC content. The bases within a class (GC and AT respectively) are chosen uniformly. The default value is 0.5.
Example 1: The default value is 0.5, meaning that 50% of the bases will be GC, the rest will be AT. Within these classes the probability is chosen uniformly, yielding probabilities of 0.25 for each base.
Example 2: With --gc-content 0.1 the probabilities for A and T are both 0.45 = (1 - 0.1) / 2, while the probabilities of C and G are both 0.05 = 0.1 / 2.
-q QUALITY_MODEL, --quality-model QUALITY_MODEL
Model from which quality values will be sampled. The model is given as a .qmodel file.
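The two --gc-content examples can be reproduced with a few lines of Python. The sketch splits the GC share evenly between G and C and the remainder evenly between A and T, as described for that parameter; the function names and the use of `random.choices` for drawing a sequence are illustrative assumptions, not ddRAGE's implementation.

```python
import random


def base_probabilities(gc_content=0.5):
    """Per-base probabilities for a given GC content (uniform within each class)."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    p_gc = gc_content / 2          # probability of G, and of C
    p_at = (1.0 - gc_content) / 2  # probability of A, and of T
    return {"A": p_at, "C": p_gc, "G": p_gc, "T": p_at}


def random_sequence(length, gc_content=0.5, rng=None):
    """Draw a sequence whose bases follow base_probabilities(gc_content)."""
    rng = rng or random.Random()
    probs = base_probabilities(gc_content)
    bases = list(probs)
    return "".join(rng.choices(bases, weights=[probs[b] for b in bases], k=length))


print(base_probabilities(0.5))
print(base_probabilities(0.1))
print(random_sequence(40, gc_content=0.1, rng=random.Random(1)))
```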
--single-end, --se
Write a single-end data set. Only a p5 FASTQ file is written. Default: False
--overlap, --ol
Overlap factor (between 0 and 1.0) of randomly generated reads. This value describes how much the ends of the p5 and the p7 read will overlap. Default: 0
--multiple-p7-barcodes, --combine-p7-bcs
Combine individuals with multiple p7 barcodes in one output file. This simulates the behaviour of an unsplit read file. Files created like this can be split up using the split_by_p7_barcode tool. Default: False
Coverage Model Parameters
--coverage-model {PD, BBD}
Mathematical model used to sample coverage values (Default: betabinomial). Currently two models are supported. The BBD model (default) uses a Beta-binomial distribution, which creates coverage values with high variance and can be shaped into different coverage profiles using the --BBD-alpha and --BBD-beta parameters. The PD model uses a Poisson distribution to sample coverage values. This creates coverage values with less variance and thus easier instances for analysis.
--BBD-alpha
Alpha parameter of the Beta-binomial distribution (Default: 6). Higher values increase the left tailing of the coverage distribution, if the BBD model is used.
--BBD-beta
Beta parameter of the Beta-binomial distribution (Default: 2). Higher values increase the right tailing of the coverage distribution, if the BBD model is used.
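To get a feeling for the difference in variance between the two coverage models, the textbook moments of both distributions can be compared. The sketch below is built on an assumption: it picks the number of trials n of the Beta-binomial distribution so that its mean equals the target coverage. How ddRAGE parametrizes its BBD model internally is not described here and may differ.

```python
def beta_binomial_moments(n, alpha, beta):
    """Mean and variance of a Beta-binomial distribution (standard formulas)."""
    mean = n * alpha / (alpha + beta)
    variance = (n * alpha * beta * (alpha + beta + n)) / (
        (alpha + beta) ** 2 * (alpha + beta + 1)
    )
    return mean, variance


target_coverage = 30  # default of --coverage
alpha, beta = 6, 2    # defaults of --BBD-alpha and --BBD-beta

# Assumption: choose n such that n * alpha / (alpha + beta) == target_coverage.
n = round(target_coverage * (alpha + beta) / alpha)

bbd_mean, bbd_variance = beta_binomial_moments(n, alpha, beta)
poisson_mean = poisson_variance = target_coverage  # Poisson: variance == mean

print(f"BBD:     mean = {bbd_mean}, variance = {bbd_variance}")
print(f"Poisson: mean = {poisson_mean}, variance = {poisson_variance}")
```

Under this assumption both models share the same expected coverage, while the Beta-binomial variance is larger than the Poisson variance.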
--max-pcr-copies MAX_PCR_COPY_NR
Maximum number of PCR copies that can be created for each finalized (potentially mutated and multiplied) read (Default: 3).
--hrl-max-cov MAX_COVERAGE, --hrl-max-coverage MAX_COVERAGE
Maximum coverage for Highly Repetitive Loci (HRLs) (Default: 2000). The minimum coverage is determined as μ + 2σ of the main coverage generating function.
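The lower bound μ + 2σ for HRL coverage can be illustrated for the Poisson model, where the variance equals the mean. This is a sketch of the arithmetic only; `hrl_min_coverage` is a hypothetical helper, and whether ddRAGE rounds the result is not specified here.

```python
import math


def hrl_min_coverage(mean, variance):
    """Lower coverage bound for HRLs: mean plus two standard deviations."""
    return mean + 2 * math.sqrt(variance)


# Poisson model with the default target coverage of 30 (variance == mean).
print(round(hrl_min_coverage(30, 30), 2))
```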
Sequence Parameters
These options change the auxiliary sequences used to construct the reads. Please note that both the overhang and the restriction site of the enzymes used have to be specified. A list of enzymes and their associated parameters can be found here.
-d DBR, --dbr DBR
DBR sequence, used as unique molecular identifier (UMI), in IUPAC ambiguity code. Default: ‘NNNNNNMMGGACG’
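A DBR pattern in IUPAC ambiguity code describes a set of concrete sequences. The sketch below expands the default pattern using the standard IUPAC nucleotide codes: it draws one random concrete DBR and counts how many distinct DBRs the pattern allows. The function names are hypothetical and the uniform choice per position is an assumption.

```python
import math
import random

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def random_dbr(pattern="NNNNNNMMGGACG", rng=random):
    """Draw one concrete DBR sequence matching the IUPAC pattern."""
    return "".join(rng.choice(IUPAC[symbol]) for symbol in pattern)


def dbr_count(pattern="NNNNNNMMGGACG"):
    """Number of distinct concrete sequences the pattern can produce."""
    return math.prod(len(IUPAC[symbol]) for symbol in pattern)


print(random_dbr(rng=random.Random(42)))
print(dbr_count())  # 4**6 * 2**2 distinct DBRs for the default pattern
```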
--p5-overhang P5_OVERHANG
Sequence of the p5 overhang. Default: ‘TGCAT’ (NsiI)
--p7-overhang P7_OVERHANG
Sequence of the p7 overhang. Default: ‘TAC’ (Csp6I)
--p5-rec-site P5_REC_SITE
Sequence of the p5 recognition site. Default: ‘ATGCAT’ (NsiI)
--p7-rec-site P7_REC_SITE
Sequence of the p7 recognition site. Default: ‘GTAC’ (Csp6I)
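For the default values listed above, each overhang is a suffix of the corresponding recognition site (TGCAT in ATGCAT, TAC in GTAC). A simple consistency check before starting a simulation could look like the sketch below. That this suffix relation holds for every enzyme of interest is an assumption, and `overhang_matches_site` is a hypothetical helper, not a ddRAGE function.

```python
def overhang_matches_site(overhang, rec_site):
    """True if the overhang is a suffix of the recognition site."""
    return rec_site.upper().endswith(overhang.upper())


# Default overhang / recognition site pairs from the parameter list.
defaults = {
    "p5": ("TGCAT", "ATGCAT"),
    "p7": ("TAC", "GTAC"),
}

for side, (overhang, rec_site) in defaults.items():
    print(side, overhang, rec_site, overhang_matches_site(overhang, rec_site))
```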
-b BARCODE_SET, --barcodes BARCODE_SET
Path to a barcodes file or name of a predefined barcode set like ‘barcodes’, ‘small’ or ‘full’. Default: ‘barcodes’, a generic population. See input formats for more information.
Event Probabilities
The following parameters influence the probabilities of different events occurring. This includes individual event types, the probabilities of different mutation types, zygosity, and several other aspects:
--event-probabilities
Probability profile for the distribution of event types (common, dropout, mutation; in this order). Each entry can be given as a float or as a string of Python code (for example '0.0001*0.05'), which is helpful for small probability values.
Example:
$ ddrage --event-probabilities 0.9 0.05 0.05
-> common 90%, dropout 5%, mutation 5% (Default)
--mutation-type-probabilities
Probability profile for the distribution of mutation types (snp, insertion, deletion, p5 na alternative, p7 na alternative, p5 na dropout, p7 na dropout; in this order). Each entry can be given as a float or as a string of Python code (for example '0.0001*0.05'), which is helpful for small probability values.
Example:
$ ddrage --mutation-type-probabilities 0.8999 0.05 0.05 '0.0001*0.001' '0.0001*0.05' '0.0001*0.899' '0.0001*0.05'
-> snp 89.99%, insertion 5%, deletion 5%, p5 na alternative 0.00001%, p7 na alternative 0.0005%, p5 na dropout 0.00899%, p7 na dropout 0.0005% (Default)
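The percentages in the example above can be checked by evaluating the entries and verifying that the profile sums to 1. The sketch below uses a deliberately restricted stand-in parser that only understands plain numbers and products such as '0.0001*0.05'. It is not ddRAGE's own parsing code, which is described as accepting Python expressions.

```python
import math


def parse_probability(entry):
    """Evaluate a number or a product like '0.0001*0.05' (restricted stand-in)."""
    if isinstance(entry, (int, float)):
        return float(entry)
    return math.prod(float(factor) for factor in entry.split("*"))


names = ["snp", "insertion", "deletion", "p5 na alternative",
         "p7 na alternative", "p5 na dropout", "p7 na dropout"]
entries = [0.8999, 0.05, 0.05,
           "0.0001*0.001", "0.0001*0.05", "0.0001*0.899", "0.0001*0.05"]

probabilities = [parse_probability(entry) for entry in entries]

for name, probability in zip(names, probabilities):
    print(f"{name}: {probability:.5%}")

# A probability profile should sum to 1 (up to floating point error).
total = math.fsum(probabilities)
print("sums to 1:", math.isclose(total, 1.0))
```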
--prob-heterozygous PROB_HETEROZYGOCITY
Probability of mutations being heterozygous. Default: 0.5
--prob-incomplete-digestion PROB_INCOMPLETE_DIGESTION
Probability of incomplete digestion for common and mutation type individuals. Default: 0.1
--rate-incomplete-digestion PROB_INCOMPLETE_DIGESTION
Expected fraction of reads that are lost in the event of incomplete digestion. Default: 0.2
--prob-pcr-copy PROB_PCR_COPY
Probability that a (potentially mutated and multiplied) read will receive PCR copies. This influences the simulated PCR copy rate. Default: 0.2
--hrl-pcr-copies HRL_PCR_COPIES
Probability of PCR copies for HRL reads in relation to normal reads. Default: 0.9, i.e. the probability for a PCR copy of an HRL read is prob_pcr_copy * hrl_pcr_copies = 0.2 * 0.9 = 0.18
--singleton-pcr-copies SINGLETON_PCR_COPIES
Probability of PCR copies for singleton reads in relation to normal reads. Default: 1/3, i.e. the probability for a PCR copy of a singleton read is prob_pcr_copy * singleton_pcr_copies = 0.2 * (1/3) = 0.0666...
-e PROB_SEQ_ERROR, --prob-seq-error PROB_SEQ_ERROR
Probability of sequencing substitution errors. Default: 0.01
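The effective PCR copy probabilities for HRL and singleton reads are products of the base probability and the respective relative factor. The following sketch reproduces the two calculations given for --hrl-pcr-copies and --singleton-pcr-copies; the function name is hypothetical.

```python
def effective_pcr_copy_probability(prob_pcr_copy=0.2, relative_factor=1.0):
    """PCR copy probability for a read type, relative to normal reads."""
    return prob_pcr_copy * relative_factor


hrl = effective_pcr_copy_probability(0.2, 0.9)          # --hrl-pcr-copies 0.9
singleton = effective_pcr_copy_probability(0.2, 1 / 3)  # --singleton-pcr-copies 1/3

print(f"HRL reads:       {hrl:.4f}")
print(f"singleton reads: {singleton:.4f}")
```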
Debugging and User Output
-v, --verbose
Increase verbosity of output:
-v: Show progress of simulation, including current simulation phase and a percentage of loci finished.
-vv: Print used parameters after simulation. This is similar to the content of the annotation output file.
-vvv: Show details for each simulated locus, including the simulated types.
-z, --zip
Write reads as .fastq.gz files. The .gz suffix is automatically added.
--DEBUG
Set debug-friendly values for the locus distribution (probabilities for common, dropout, mutation are all 1/3).
--version
Print the version number.