Changelog¶

Version 1.8.1¶

Updated BBD visualization code to work with bokeh >= 3.1.1

Version 1.8.0¶

Added support up to Python 3.12. Please note that the bioconda installation is currently only supported for Python versions up to 3.10. For newer Python versions, ddrage can be installed from PyPI via pip.
Removed support for Python <=3.7
Updated project structure to use pyproject.toml
Added compatibility for bokeh >=2.5.0.
Minor documentation updates.

Version 1.7.1¶

Bugs fixed¶

Fixed newer version of scipy to prevent errors with the import of comb.

Version 1.7.0¶

Bugs fixed¶

In low coverage scenarios, it is possible for incomplete digestion (ID) to affect all reads of an individual at a locus. This update fixes a possible crash due to an empty list when this conincides with a homozygous mutation event.

Documentation¶

Fixed some typos.

Other changes¶

Added support for Python 3.8.

Version 1.6.3¶

Other changes¶

Refactored the content of the annotation output file to be more informative. Added a visualization for the distribution of simulated read types (valid, PCR duplicate, HRL, singleton, etc.) and clarified the names of the read types.

Version 1.6.2¶

Bugs fixed¶

Singletons from individuals not in the data set¶

Singletons with p5-p7 barcode combinations that were not in the barcode file could be created. More precisely, additional barcode combinations of specified p5 and p7 barcodes were created.

This is no longer possible.

Version 1.6.0¶

New features¶

Added the --no-singletons parameter to disable singleton generation.

Documentation¶

Fixed description of -o parameter to reflect actual behaviour.
Fixed some typos.

Version 1.5.2¶

Other changes¶

Control bits in the CASAVA header are now set to 0.
Added the --version flag to show the installed version of ddRAGE.
Fixed broken link in file format documentation.

Version 1.5.1¶

Added compatibility with Python 3.7, restored compatibility with Python 3.5.

Version 1.5.0¶

New features¶

Splitting files by p7 barcode¶

After creating a multi-p7 barcode set using the --multiple-p7-barcodes parameter, the split_by_p7_barcode tool can be used to splits the generated FASTQ files up by their p7 barcode.

Example:

$ rage --multiple p7 barcodes
Simulating reads from 3 individuals at 3 loci with a coverage of 30.

Created output files:
    p5 reads                  data_folder/ddRAGEdataset_2_p7_barcodes_1.fastq
    p7 reads                  data_folder/ddRAGEdataset_2_p7_barcodes_2.fastq
    ground truth              data_folder/ddRAGEdataset_2_p7_barcodes_gt.yaml
    barcode file              data_folder/ddRAGEdataset_2_p7_barcodes_barcodes.txt
    annotation file           data_folder/logs/ddRAGEdataset_2_p7_barcodes_annotation.txt
    statistics file           data_folder/logs/ddRAGEdataset_2_p7_barcodes_statstics.pdf

$ cat data_folder/logs/ddRAGEdataset_2_p7_barcodes_annotation.txt
#  Ind.      p5 bc   p7 bc   p5 spc  p7 spc  Annotation
Individual 05        ACAGTG  ATCACG  AC              Annotation 1
Individual 12        CTTGTA  ATCACG  GAC             Annotation 1
Individual 54        GCCAAT  TAGCTT          AT      Annotation 3

The files contain reads with two different p7 barcodes (ATCACG and TAGCTT). To split them up, call split_by_p7_barcode file_1.fq file_2.fq and pass the two FASTQ files as parameters:

$ split_by_p7_barcode data_folder/ddRAGEdataset_2_p7_barcodes_1.fastq data_folder/ddRAGEdataset_2_p7_barcodes_2.fastq

Found new barcode: TAGCTT
Writing to:
  -> reads_TAGCTT_1.fastq
  -> reads_TAGCTT_2.fastq

Found new barcode: GGCTAC
Writing to:
  -> reads_GGCTAC_1.fastq
  -> reads_GGCTAC_2.fastq

This leaves you with two FASTQ files for each barcode, that are placed in the current working folder. The tool preserves the file ending, hence if you pass two .fq.gz files, the output will also be in gzipped FASTQ format.

If these target files are already present, you need to pass the --force parameter to overwrite them.

Bugs fixed¶

Index error when placing SNPs in the multiple p7 barcodes case¶

When simulating reads with multiple p7 barcodes, the length variability of the p7 reads was not taken into account. This resulted in SNPs being placed in a region that was not present in some reads, causing ddRAGE to crash with:

IndexError: bytearray index out of range

This does no longer occur.

Documentation¶

Fixed example barcodes files, which contained an invalid combination of indexes.

Other changes¶

Pseudounique CASAVA headers¶

Some analysis tools have problems with reads with duplicate names. Until now this was quite likely to happen, since only two entries of the (simulated) CASAVA header were random. Now, the run, flowcell_id, and lane fields are filled with a random integer between 0 and 10000. The lane, tile, xpos, and ypos fields contain a random integer between 0 and 1000000000. This should avoid collisions for most data sets.

Version 1.4.0¶

New features¶

p5 ID reads¶

ID reads are now simulated for both the p5 and the p7 side of the read. Before only p7 ID reads were simulated. To account for the lower probability of p5 ID reads (the p5 cutter is a rare cutter so incompletely digested fragments are unlikely to pass size selection in the ddRAD pipeline) 1% of the ID events are on the p5 side of the read.

PCR rates for HRLs and singletons¶

The PCR copy rate relative to valid reads can now be changed using the --hrl-pcr-copies and --singleton-pcr-copies parameters respectively. Both take a fraction and are used to modify the basic --prob-pcr-copy parameter. For example, with --prob-pcr-copy 0.1 and --hrl-pcr-copies 1 --singleton-pcr-copies 0.2, PCR duplicates for HRL reads are as likely as for valid reads, while PCR duplicates for singletons only occur with a chance of 0.1 * 0.2 = 0.02 per read.

Version 1.3.1¶

Bugfixes¶

Fixed bug in remove_annotation script that caused it to crash.

Version 1.3.0¶

New features¶

Barcode files¶

A barcodes file, containing a list of individuals in the sample and their associated barcodes, is automatically written as output.

Two larger standard barcode files have been added as default barcode sets. The big barcode file contains 91 p5 barcodes of length 6 and one p7 barcode of the same length. The huge barcode file contains 1461 p5 barcodes of length 10 and one p7 barcode of the same length. These two barcode sets can be accessed with the -b parameter, like: -b huge.

Added the --get-barcodes parameter, which copies the default barcode files to a local folder named barcode_files. No existing files are overwritten by this. This can be used to extract the barcode files if ddRAGE has been installed via conda or pip.

Zipped output¶

FASTQ files can be written as gzipped files, by passing the -z parameter to ddRAGE. Note that the randomize_fastq script is unable to read gzipped files. However, it can write gzipped files, by passing a file name ending with ".gz" as output file.

me@machine:~/$ randomize_fastq ddRAGEds_ATCACG_1.fastq ddRAGEds_ATCACG_2.fastq ddRAGEds_ATCACG_randomized_1.fastq.gz ddRAGEds_ATCACG_randomized_2.fastq.gz

The ability to read zipped input has been added to the remove_annotation script.

Paired-end quality models¶

The learn_qmodel script now supports different models for p5 and p7 reads. This change replaces the old plain-text .qmodel files with binary .qmodel.npz files. Additionally the script can now show the progress of the analysis (-v, opens a constantly updating plot), can write a plot of the learned distribution (-p a pdf file with the same name prefix as the output file), and plot the distribution for a given quality model file (-s custom.qmodel.npz). New quality models have also been added.

Single-end mode¶

Single-end datasets can now be simulated using the --single-end parameter. Only a p5 read file will be written and no mutations or sequencing errors are written for the p7 read.

Fragment mode¶

A FASTA file can now be passed to the -l, --loci (former --nr-loci) parameter to create reads from the contained sequences. This allows to simulate reads from a reference genome. The number of simulated loci is the number of sequences in the file.

Overlap¶

The overlap of simulate reads can now be influenced with the --overlap parameter. The default value (0.0) means that reads do not overlap, and the maximum value (1.0) makes reads overlap significantly (the exact value depends on the adaptor setup of the reads). In fragment mode, the overlap is determined from the length of the sequences in the FASTA file and this parameter has no effect.

New Mutation Types¶

In addition to p7 null allele mutation that alter the p7 sequence, three additional mutation types have been added:

p5 na mutations that alter the p5 seqeunce

p5 dropout mutations that make one allele drop out

p7 dropout mutations that make one allele drop out

The --mutation-type-probability parameter has been apadted to now use 7 probabilities:

PROB_SNP PROB_INSERTION PROB_DELETION PROB_P5_NA_MUTATION PROB_P7_NA_MUTATION PROB_P5_NA_DROPOUT PROB_P7_NA_DROPOUT

In Order to make entering small probabilities easier, each of these values can now be written as a small equation in python syntax. To do this put the equation in single or double quotes:

python ddrage.py --mutation-type-probabilities 0.8999 0.05 0.05 '0.0001*(1/24)' '0.0001*(7/24)' '0.0001/3' '0.0001/3'

Other changes¶

Name change¶

We fully renamed the program to ddRAGE including all file paths, file names, etc.

File names¶

Removed colons from the default ISO timestamp folder names. These caused escaping issues and have been replaced with dots.

old: 2017-09-18T11:14:09_ddRAGEdataset
new: 2017-09-18T11.14.09_ddRAGEdataset

Consensus sequences in YAML file¶

The consensus sequence reported in the YAML file are now the longest read sequence found in the dataset. Individuals with long barcodes will have less of this sequence present in the generated reads, since read lengths are truncated to a fixed length determined by the -r parameter.

Performance Improvements¶

Several improvements, drastically reducing the memory footprint while also reducing runtime.

Other¶

Fixed several typos in documentation, plots, and source code.

Bugfixes¶

Fixed randomize_fastq not working when writing to stdout when using only one input file.

Version 1.2.0¶

New features¶

Visualization of BBD parameters¶

Added bokeh visualization of BBD parameter choice which is available using the visualize_bbd script:

$ visualize_bbd

This opens a browser window displaying an interactive plot of the BBD that can be used to select alpha- and beta-parameters.

Removal of FASTQ annotations¶

Added remove_annotation script to remove annotations written in the FASTQ name lines of RAGE files, since some analysis tools can not handle the extended name lines:

$ remove_annotation RAGEdataset_ATCACG_1.fastq RAGEdataset_ATCACG_2.fastq

The simulated files will remain unchanged and two new files without annotation are written. The extracted annotations are written to a new file with the _annotation.txt suffix. This file contains one line per read in the FASTQ file.

Other changes¶

Also mad some minor fixes in the documentation and added a list of restriction enzymes to the docs.

Version 1.1.0¶

Added learn_qmodel script, which allows generating a .qmodel file from a set of FASTQ files.

Version 1.0.0¶

Initial release.