Koverage.scripts

combineCoverage.py

Combine the coverage statistics from all samples

This script will take the coverage information from each samples' coverage file and output the population-wide counts for each contig.

collect_coverage_stats - Read and add the counts from all samples
print_sample_coverage - Print the combined coverage statistics for all contigs

`collect_coverage_stats(input_file)`

Combine the mapped coverage stats for all samples.

Parameters:	`input_file` (`str`) – Text TSV file (Sample Contig Count RPM RPKM RPK TPM Hitrate Variance)

Returns:	`all_coverage`( `dict` ) – key (str): contig ID value (dict): count (int): number of reads rpm (float): reads per million rpkm (float): reads per kilobase million tpm (float): transcripts per million

`print_sample_coverage(output_file, all_coverage)`

Print the combined coverage statistics from collect_coverage_stats().

Parameters:	`output_file` (`str`) – Text TSV filepath for writing `all_coverage` (`dict`) – key (str): contig ID value (dict): count (int): number of reads rpm (float): reads per million rpkm (float): reads per kilobase million tpm (float): transcripts per million

combineKmerCoverage.py

Combine the kmer-based coverage statistics from all samples

This script will take the kmer-based coverage information from each samples' coverage file and output the population-wide counts for each contig.

collect_kmer_coverage_stats - Read and add the kmer counts from all samples
print_kmer_coverage - Print the combined kmer coverage statistics for all contigs

`collect_kmer_coverage_stats(input_file)`

Combine the kmer coverage stats for all samples.

Parameters:	`input_file` (`str`) – Text TSV file (Sample Contig Sum Mean Median Hitrate Variance)

Returns:	`allCoverage`( `dict` ) – key (str): contig ID value (dict): sum (int): sum of kmer hits mean (float): mean kmer depth median (float): median kmer depth

`print_kmer_coverage(allCoverage, output_file, lines_per_batch=1000)`

Print the combined kmer coverage statistics from collect_kmer_coverage_stats().

Parameters:	`output_file` (`str`) – Gzipped Text TSV filepath for writing `allCoverage` (`dict`) – key (str): contig ID value (dict): sum (int): sum of kmer hits mean (float): mean kmer depth median (float): median kmer depth `lines_per_batch` (`int`, default: `1000` ) – Number of lines to compress and write at a time

kmerScreen.py

Screen the sample for reference sampled kmers

This script will parse the reference sampled kmers and query them from the sample jellyfish db

trimmed_variance - Calculate variance from a list of integers
output_print_worker - Take output lines from a queue and print to zstandard-compressed TSV
process_counts - Process the kmer depths of the ref sampled kmers and the sample jellyfish database

`output_print_worker(out_queue=None, out_file=None)`

Worker to take the output lines for printing, compress with gzip, and print to the output file.

Parameters:	`out_queue` (`Queue`, default: `None` ) – Queue with lines for printing to output file `out_file` (`str`, default: `None` ) – Output file for writing gzipped output

`process_counts(kmer_counts, sample_name, contig_name)`

Process the kmer depths of the ref sampled kmers and the sample jellyfish database.

Parameters:	`kmer_counts` (`list`) – list of kmer depths `sample_name` (`str`) – name of the sample `contig_name` (`str`) – contig ID

Returns:	`out_line`( `str` ) – output line for printing to output file, or None

`ref_kmer_parser_worker(ref_kmers=None, jellyfish_db=None, out_queue=None, sample_name=None, cmd=None)`

Parse the processed reference kmer file (zstd-compressed) and query kmers from the Jellyfish database.

Parameters:

ref_kmers (str, default: None ) –

Filepath of the sampled kmers from ref fasta (zstd-compressed; "contigID kmer kmer kmer...")
jellyfish_db (str, default: None ) –

Filepath of the jellyfish database for the sample
out_queue (Queue, default: None ) –

Queue of the lines of output for compression and writing to the output file.
sample_name (str, default: None ) –

Name of the sample
cmd (list, default: None ) –

jellyfish command. The command is passed here to allow for unit testing without invoking jellyfish.

`trimmed_variance(data, trim_frac=0.05)`

Calculate the variance, minus the top x percent of outliers

Parameters:	`data` (`list`) – list of int kmer depths for calculating variance `trim_frac` (`float`, default: `0.05` ) – fraction of top depths to trim (outlier handling)

Returns:	`variance`( `float` ) – variance of the trimmed kmer depths

minimapWrapper.py

Run minimap2, parse its output, calculate counts on the fly

This script will run minimap2 of a sample's reads against the reference FASTA. We use a wrapper instead of a snakemake rule to avoid additional read/writes for every sample. PAF files of alignments can optionally be saved.

worker_mm_to_count_paf_queues - read minimap2 output and pass to queues for processing and saving PAF
worker_mm_to_count_queues - read minimap2 output and pass to queue for processing only
worker_paf_writer - read minimap2 output from queue and write to zstandard-zipped file
worker_count_and_print - read minimap2 output from queue, calculate counts, print to output files
build_mm2cmd - return the minimap2 command based on presence of R2 file
start_workers - start queues and worker threads

`build_mm2cmd(**kwargs)`

Return the minimap2 command

Parameters:	`kwargs`** (`dict`, default: `{}` ) – threads (int): Number of worker threads to use minimap_mode (str): Mapping preset for minimap2 ref_idx (str): Reference indexed file r1_file (str): Forward reads file r2_file (str): Reverse reads file (or "" for SE reads/longreads)

Returns:	`mm2cmd`( `list` ) – minimap2 command for opening with subprocess

`contig_lens_from_fai(file_path)`

Collect the sequence IDs from the reference fasta file

Parameters:	`file_path` (`str`) – File path of reference fasta fai index file

Returns:	`ctg_lens`( `list` ) – 0: Sequence ID (str) 1: contig length (int)

`start_workers(queue_counts, queue_paf, pipe_minimap, **kwargs)`

Start workers for reading the minimap output and parsing to queue(s) for processing

Parameters:	`queue_counts` (`Queue`) – queue to use for putting minimap2 output for collecting counts `pipe_minimap` (`pipe`) – subprocess pipe for minimap2 for reading `kwargs`** (`dict`, default: `{}` ) – paf_file (str): PAF file for writing save_pafs (bool): flag for if PAF files should be saved

`worker_count_and_print(count_queue, contig_lengths, **kwargs)`

Collect the counts from minimap2 queue and calc counts on the fly

Parameters:	`count_queue` (`Queue`) – queue of minimap2 output for reading `contig_lengths` (`list`) – 0: Sequence ID (str) 1: contig length (int) `kwargs`** (`dict`, default: `{}` ) – bin_width (int): Width of bins for hitrate and variance estimation output_counts (str): filepath for writing output count stats

`worker_mm_to_count_paf_queues(pipe, count_queue, paf_queue)`

Read minimap2 output and slot into queues for collecting coverage counts, and saving the paf file.

Parameters:	`pipe` (`pipe`) – minimap2 pipe for reading `count_queue` (`Queue`) – queue for putting for counts `paf_queue` (`Queue`) – queue for putting for saving paf

`worker_mm_to_count_queues(pipe, count_queue)`

Read minimap2 output and slot into queues for collecting coverage counts

Args: pipe (pipe): minimap2 pipe for reading count_queue (Queue): queue for putting for counts

`worker_paf_writer(paf_queue, paf_dir, sample, chunk_size=100)`

Read minimap2 output from queue and write to zstd-zipped file

Parameters:	`paf_queue` (`Queue`) – queue of minimap2 output for reading `paf_dir` (`str`) – dir for saving paf files

refSampleKmer.py

Sample kmers from the reference FASTA file

This script will parse the reference FASTA and sample kmers from each contig.

parse_fasta - Read the fasta(.gz) file
contigs_to_queue - Parse the fasta file and populate the processing queue with the id and sequence
string_to_kmers - Sample kmers from a sequence
process_contigs - Parse the processing queue, get sampled kmers, push output lines to writing queue
output_printer - Print the output lines to zstandard-zipped file

`contigs_to_queue(file, queue_put, available_threads, queue_hold=1000)`

Parse the fasta file and populate the processing queue with the id and sequence

Parameters:	`file` (`str`) – Reference FASTA file, optionally gzipped `queue_put` (`Queue`) – Queue of IDs and sequences for processing; {'id': id, 'seq': seq} `available_threads` (`int`) – Number of worker threads `queue_hold` (`int`, default: `1000` ) – Number of entries in the processing before slowing down (for memory management)

`output_printer(queue, outfile, chunk_size=1000)`

Print the output lines to zstandard-zipped file

Parameters:	`queue` (`Queue`) – Queue of output lines for writing `outfile` (`str`) – Output file for writing `chunk_size` (`int`, default: `1000` ) – number of lines to compress and write at a time

`parse_fasta(file)`

Read the fasta(.gz) file

Parameters:	`file` (`str`) – Reference FASTA file, optionally gzipped

Returns:	`generator` – A generator object that yields stripped lines from the FASTA file.

`process_contigs(in_queue, out_queue, **kwargs)`

Parse the processing queue, get sampled kmers, push output lines to writing queue

Parameters:	`in_queue` (`Queue`) – Contig IDs and sequences to process `out_queue` (`Queue`) – Output lines for writing `kwargs`** (`dict`, default: `{}` ) – kspace (int): Sample every n-th kmer ksize (int): Length of kmers kmin (int): Min number of kmers to sample kmax (int): Max number of kmers to sample

`string_to_kmers(seq, **kwargs)`

Sample kmers from a sequence

Parameters:	`seq` (`str`) – Contig's sequence `kwargs`** (`dict`, default: `{}` ) – kspace (int): Sample every n-th kmer ksize (int): Length of kmers kmin (int): Min number of kmers to sample kmax (int): Max number of kmers to sample

Returns:	`kmers`( `list` ) – list of kmers sampled from input contig sequence

sampleCoverage.py

Calculate the coverage stats for a sample for each contig

This script will parse the raw count summary for a sample and calculate the output coverage stats for each contig.

calculate_coverage_stats_from_counts - Read in the library size and the counts from minimapWrapper.py, calculate rpm, rpkm, and rpk, write the counts

`calculate_coverage_stats_from_counts(**kwargs)`

Read in the library size and the counts from minimapWrapper.py, calculate rpm, rpkm, and rpk, write counts for sample

Kwargs

count_file (str): filepath to pickle file of contig lengths and np.array count objects bin_width (int): bin width size output_file (str): filepath of ouptut file for writing sample (str): sample name