Koverage.scripts

combineCoverage.py

Combine the coverage statistics from all samples

This script will take the coverage information from each samples' coverage file and output the population-wide counts for each contig.

  • collect_coverage_stats - Read and add the counts from all samples
  • print_sample_coverage - Print the combined coverage statistics for all contigs

collect_coverage_stats(input_file)

Combine the mapped coverage stats for all samples.

Parameters:
  • input_file (str) –

    Text TSV file (Sample Contig Count RPM RPKM RPK TPM Hitrate Variance)

Returns:
  • all_coverage( dict ) –
    • key (str): contig ID
    • value (dict):
      • count (int): number of reads
      • rpm (float): reads per million
      • rpkm (float): reads per kilobase million
      • tpm (float): transcripts per million

print_sample_coverage(output_file, all_coverage)

Print the combined coverage statistics from collect_coverage_stats().

Parameters:
  • output_file (str) –

    Text TSV filepath for writing

  • all_coverage (dict) –
    • key (str): contig ID
    • value (dict):
      • count (int): number of reads
      • rpm (float): reads per million
      • rpkm (float): reads per kilobase million
      • tpm (float): transcripts per million

combineKmerCoverage.py

Combine the kmer-based coverage statistics from all samples

This script will take the kmer-based coverage information from each samples' coverage file and output the population-wide counts for each contig.

  • collect_kmer_coverage_stats - Read and add the kmer counts from all samples
  • print_kmer_coverage - Print the combined kmer coverage statistics for all contigs

collect_kmer_coverage_stats(input_file)

Combine the kmer coverage stats for all samples.

Parameters:
  • input_file (str) –

    Text TSV file (Sample Contig Sum Mean Median Hitrate Variance)

Returns:
  • allCoverage( dict ) –
    • key (str): contig ID
    • value (dict):
      • sum (int): sum of kmer hits
      • mean (float): mean kmer depth
      • median (float): median kmer depth

print_kmer_coverage(allCoverage, output_file, lines_per_batch=1000)

Print the combined kmer coverage statistics from collect_kmer_coverage_stats().

Parameters:
  • output_file (str) –

    Gzipped Text TSV filepath for writing

  • allCoverage (dict) –
    • key (str): contig ID
    • value (dict):
      • sum (int): sum of kmer hits
      • mean (float): mean kmer depth
      • median (float): median kmer depth
  • lines_per_batch (int, default: 1000 ) –

    Number of lines to compress and write at a time

kmerScreen.py

Screen the sample for reference sampled kmers

This script will parse the reference sampled kmers and query them from the sample jellyfish db

  • trimmed_variance - Calculate variance from a list of integers
  • output_print_worker - Take output lines from a queue and print to zstandard-compressed TSV
  • process_counts - Process the kmer depths of the ref sampled kmers and the sample jellyfish database

output_print_worker(out_queue=None, out_file=None)

Worker to take the output lines for printing, compress with gzip, and print to the output file.

Parameters:
  • out_queue (Queue, default: None ) –

    Queue with lines for printing to output file

  • out_file (str, default: None ) –

    Output file for writing gzipped output

process_counts(kmer_counts, sample_name, contig_name)

Process the kmer depths of the ref sampled kmers and the sample jellyfish database.

Parameters:
  • kmer_counts (list) –

    list of kmer depths

  • sample_name (str) –

    name of the sample

  • contig_name (str) –

    contig ID

Returns:
  • out_line( str ) –

    output line for printing to output file, or None

ref_kmer_parser_worker(ref_kmers=None, jellyfish_db=None, out_queue=None, sample_name=None, cmd=None)

Parse the processed reference kmer file (zstd-compressed) and query kmers from the Jellyfish database.

Parameters:
  • ref_kmers (str, default: None ) –

    Filepath of the sampled kmers from ref fasta (zstd-compressed; "contigID kmer kmer kmer...")

  • jellyfish_db (str, default: None ) –

    Filepath of the jellyfish database for the sample

  • out_queue (Queue, default: None ) –

    Queue of the lines of output for compression and writing to the output file.

  • sample_name (str, default: None ) –

    Name of the sample

  • cmd (list, default: None ) –

    jellyfish command. The command is passed here to allow for unit testing without invoking jellyfish.

trimmed_variance(data, trim_frac=0.05)

Calculate the variance, minus the top x percent of outliers

Parameters:
  • data (list) –

    list of int kmer depths for calculating variance

  • trim_frac (float, default: 0.05 ) –

    fraction of top depths to trim (outlier handling)

Returns:
  • variance( float ) –

    variance of the trimmed kmer depths

minimapWrapper.py

Run minimap2, parse its output, calculate counts on the fly

This script will run minimap2 of a sample's reads against the reference FASTA. We use a wrapper instead of a snakemake rule to avoid additional read/writes for every sample. PAF files of alignments can optionally be saved.

  • worker_mm_to_count_paf_queues - read minimap2 output and pass to queues for processing and saving PAF
  • worker_mm_to_count_queues - read minimap2 output and pass to queue for processing only
  • worker_paf_writer - read minimap2 output from queue and write to zstandard-zipped file
  • worker_count_and_print - read minimap2 output from queue, calculate counts, print to output files
  • build_mm2cmd - return the minimap2 command based on presence of R2 file
  • start_workers - start queues and worker threads

build_mm2cmd(**kwargs)

Return the minimap2 command

Parameters:
  • **kwargs (dict, default: {} ) –
    • threads (int): Number of worker threads to use
    • minimap_mode (str): Mapping preset for minimap2
    • ref_idx (str): Reference indexed file
    • r1_file (str): Forward reads file
    • r2_file (str): Reverse reads file (or "" for SE reads/longreads)
Returns:
  • mm2cmd( list ) –

    minimap2 command for opening with subprocess

contig_lens_from_fai(file_path)

Collect the sequence IDs from the reference fasta file

Parameters:
  • file_path (str) –

    File path of reference fasta fai index file

Returns:
  • ctg_lens( list ) –

    0: Sequence ID (str) 1: contig length (int)

start_workers(queue_counts, queue_paf, pipe_minimap, **kwargs)

Start workers for reading the minimap output and parsing to queue(s) for processing

Parameters:
  • queue_counts (Queue) –

    queue to use for putting minimap2 output for collecting counts

  • pipe_minimap (pipe) –

    subprocess pipe for minimap2 for reading

  • **kwargs (dict, default: {} ) –
    • paf_file (str): PAF file for writing
    • save_pafs (bool): flag for if PAF files should be saved

worker_count_and_print(count_queue, contig_lengths, **kwargs)

Collect the counts from minimap2 queue and calc counts on the fly

Parameters:
  • count_queue (Queue) –

    queue of minimap2 output for reading

  • contig_lengths (list) –

    0: Sequence ID (str) 1: contig length (int)

  • **kwargs (dict, default: {} ) –
    • bin_width (int): Width of bins for hitrate and variance estimation
    • output_counts (str): filepath for writing output count stats

worker_mm_to_count_paf_queues(pipe, count_queue, paf_queue)

Read minimap2 output and slot into queues for collecting coverage counts, and saving the paf file.

Parameters:
  • pipe (pipe) –

    minimap2 pipe for reading

  • count_queue (Queue) –

    queue for putting for counts

  • paf_queue (Queue) –

    queue for putting for saving paf

worker_mm_to_count_queues(pipe, count_queue)

Read minimap2 output and slot into queues for collecting coverage counts

Args: pipe (pipe): minimap2 pipe for reading count_queue (Queue): queue for putting for counts

worker_paf_writer(paf_queue, paf_dir, sample, chunk_size=100)

Read minimap2 output from queue and write to zstd-zipped file

Parameters:
  • paf_queue (Queue) –

    queue of minimap2 output for reading

  • paf_dir (str) –

    dir for saving paf files

refSampleKmer.py

Sample kmers from the reference FASTA file

This script will parse the reference FASTA and sample kmers from each contig.

  • parse_fasta - Read the fasta(.gz) file
  • contigs_to_queue - Parse the fasta file and populate the processing queue with the id and sequence
  • string_to_kmers - Sample kmers from a sequence
  • process_contigs - Parse the processing queue, get sampled kmers, push output lines to writing queue
  • output_printer - Print the output lines to zstandard-zipped file

contigs_to_queue(file, queue_put, available_threads, queue_hold=1000)

Parse the fasta file and populate the processing queue with the id and sequence

Parameters:
  • file (str) –

    Reference FASTA file, optionally gzipped

  • queue_put (Queue) –

    Queue of IDs and sequences for processing; {'id': id, 'seq': seq}

  • available_threads (int) –

    Number of worker threads

  • queue_hold (int, default: 1000 ) –

    Number of entries in the processing before slowing down (for memory management)

output_printer(queue, outfile, chunk_size=1000)

Print the output lines to zstandard-zipped file

Parameters:
  • queue (Queue) –

    Queue of output lines for writing

  • outfile (str) –

    Output file for writing

  • chunk_size (int, default: 1000 ) –

    number of lines to compress and write at a time

parse_fasta(file)

Read the fasta(.gz) file

Parameters:
  • file (str) –

    Reference FASTA file, optionally gzipped

Returns:
  • generator

    A generator object that yields stripped lines from the FASTA file.

process_contigs(in_queue, out_queue, **kwargs)

Parse the processing queue, get sampled kmers, push output lines to writing queue

Parameters:
  • in_queue (Queue) –

    Contig IDs and sequences to process

  • out_queue (Queue) –

    Output lines for writing

  • **kwargs (dict, default: {} ) –
    • kspace (int): Sample every n-th kmer
    • ksize (int): Length of kmers
    • kmin (int): Min number of kmers to sample
    • kmax (int): Max number of kmers to sample

string_to_kmers(seq, **kwargs)

Sample kmers from a sequence

Parameters:
  • seq (str) –

    Contig's sequence

  • **kwargs (dict, default: {} ) –
    • kspace (int): Sample every n-th kmer
    • ksize (int): Length of kmers
    • kmin (int): Min number of kmers to sample
    • kmax (int): Max number of kmers to sample
Returns:
  • kmers( list ) –

    list of kmers sampled from input contig sequence

sampleCoverage.py

Calculate the coverage stats for a sample for each contig

This script will parse the raw count summary for a sample and calculate the output coverage stats for each contig.

  • calculate_coverage_stats_from_counts - Read in the library size and the counts from minimapWrapper.py, calculate rpm, rpkm, and rpk, write the counts

calculate_coverage_stats_from_counts(**kwargs)

Read in the library size and the counts from minimapWrapper.py, calculate rpm, rpkm, and rpk, write counts for sample

Kwargs

count_file (str): filepath to pickle file of contig lengths and np.array count objects bin_width (int): bin width size output_file (str): filepath of ouptut file for writing sample (str): sample name