Felix Thalén edited this page Apr 23, 2024 · 8 revisions

About

Patchwork is an alignment-based program for retrieving and concatenating phylogenetic markers from whole-genome sequencing (WGS) data. The program searches the provided DNA query contigs against one or more amino acid reference sequences. Multiple, overlapping hits are merged to derive a single, continuous sequence for each provided reference sequence.

Synopsis

usage: Patchwork.jl [--contigs PATH [PATH...]]
                    --reference PATH [PATH...] [--search-results PATH]
                    [--database PATH] [--matrix NAME]
                    [--custom-matrix PATH]
                    [--species-delimiter CHARACTER]
                    [--fasta-extension STRING] [--wrap-column NUMBER]
                    [--no-plots] [--output-dir PATH] [--overwrite]
                    [--query-gencode NUMBER] [--strand STRING]
                    [--min-orf NUMBER] [--fast] [--mid-sensitive]
                    [--sensitive] [--more-sensitive]
                    [--very-sensitive] [--ultra-sensitive]
                    [--iterate [MODE...]] [--frameshift NUMBER]
                    [--evalue NUMBER] [--min-score NUMBER]
                    [--max-target-seqs NUMBER] [--top NUMBER]
                    [--max-hsps NUMBER] [--id PERCENTAGE]
                    [--query-cover PERCENTAGE]
                    [--subject-cover PERCENTAGE] [--masking MODE]
                    [--len NUMBER] [--gapopen NUMBER]
                    [--gapextend NUMBER] [--retain-stops]
                    [--retain-ambiguous] [--no-trimming]
                    [--window-size NUMBER]
                    [--required-distance NUMBER] [--threads NUMBER]
                    [--block-size NUMBER] [--version] [-h]

Alignment-based retrieval and concatenation of phylogenetic markers
from whole-genome sequencing data

optional arguments:
  --version             show version information and exit
  -h, --help            show this help message and exit

input/output:
  --contigs PATH [PATH...]
                        PATH to 1+ nucleotide sequence files in FASTA
                        or FASTQ format. Can be GZip compressed.
  --reference PATH [PATH...]
                        PATH to 1+ amino acid sequence files in the
                        FASTA format.
  --search-results PATH
                        PATH to a tabular DIAMOND output file, with
                        one header line in format: 6 qseqid sseqid
                        pident length mismatch gapopen qstart qend
                        sstart send evalue bitscore qframe sseq seq.
  --database PATH       Path to a subject DIAMOND or BLAST database to
                        search against.
  --matrix NAME         Specifies the NAME of the scoring matrix
  --custom-matrix PATH  PATH to a custom scoring matrix
  --species-delimiter CHARACTER
                        Set the CHARACTER used to separate the OTU
                        from the rest in sequence IDs (type: Char,
                        default: '@')
  --fasta-extension STRING
                        Filetype extension used for output FASTA files
                        (default: ".fas")
  --wrap-column NUMBER  Wrap output sequences at column NUMBER. 0 = no
                        wrap (type: Int64, default: 0)
  --no-plots            Do not include plots
  --output-dir PATH     Write output files to this directory PATH
                        (default: "patchwork_output")
  --overwrite           Overwrite old content in the output directory

DIAMOND BLASTX:
  --query-gencode NUMBER
                        Genetic code used for translation of query
                        sequences. A list of possible values can be
                        found on the NCBI website. Standard Code is
                        used by default (type: Int64)
  --strand STRING       Specifies the strand of the query. Possible
                        values are: 'both', 'plus', and 'minus'. Both
                        strands are searched by default
  --min-orf NUMBER      DIAMOND ignores translated sequences with
                        smaller open reading frames. Default is:
                        disabled for sequences smaller than 30, 20 for
                        sequences smaller than 100, and 40 otherwise.
                        Set to 1 to disable (type: Int64)
  --fast                Set DIAMOND sensitivity mode to 'fast'.
  --mid-sensitive       Set DIAMOND sensitivity mode to
                        'mid-sensitive'.
  --sensitive           Set DIAMOND sensitivity mode to 'sensitive'.
  --more-sensitive      Set DIAMOND sensitivity mode to
                        'more-sensitive'.
  --very-sensitive      Set DIAMOND sensitivity mode to
                        'very-sensitive'.
  --ultra-sensitive     Set DIAMOND sensitivity mode to
                        'ultra-sensitive'.
  --iterate [MODE...]   Set DIAMOND option --iterate. In version
                        2.0.12 or higher, you can optionally specify a
                        space-separated list of sensitivity modes to
                        iterate over. Allowed values are 'fast',
                        'mid-sensitive', 'sensitive',
                        'more-sensitive', 'very-sensitive',
                        'ultra-sensitive', 'default' and none
                        (default: ["PATCHWORK_OFF"])
  --frameshift NUMBER   Allow frameshift in DIAMOND and set frameshift
                        penalty. Without this option, frameshift is
                        disabled entirely (type: Int64)
  --evalue NUMBER       Only report DIAMOND hits with lower e-values
                        than the given value (type: Float64)
  --min-score NUMBER    Only report DIAMOND hits with bitscores >= the
                        given value. Overrides the --evalue option
                        (type: Float64)
  --max-target-seqs NUMBER
                        The maximum number of subject sequences that
                        DIAMOND may report per query. Default is 25;
                        setting it to 0 will report all hits (type:
                        Int64)
  --top NUMBER          Discard DIAMOND hits outside the given
                        percentage range of the top alignment score.
                        This option overrides --max-target-seqs (type:
                        Int64)
  --max-hsps NUMBER     Maximum number of HSPs DIAMOND may report per
                        target sequence for each query. Default is
                        reporting only the highest-scoring HSP.
                        Setting this option to 0 will report all
                        alternative HSPs (type: Int64)
  --id PERCENTAGE       Discard DIAMOND hits with less sequence
                        identity than the given percentage (type:
                        Float64)
  --query-cover PERCENTAGE
                        Discard DIAMOND hits with less query cover
                        than the given percentage (type: Float64)
  --subject-cover PERCENTAGE
                        Discard DIAMOND hits with less subject cover
                        than the given percentage (type: Float64)
  --masking MODE        Set the DIAMOND mode for repeat masking. Note
                        that, contrary to the DIAMOND default (tantan
                        masking enabled), Patchwork disables masking
                        by default. Set to 1 to enable tantan masking,
                        or to 2 to enable default BLASTP SEG masking.
                        The latter requires a DIAMOND version >=
                        2.0.12 (type: Int64, default: 0)

alignment:
  --len NUMBER          Discard DIAMOND hits shorter than the provided
                        NUMBER (type: Int64)
  --gapopen NUMBER      Set the gap open penalty to this positive
                        NUMBER (type: Int64)
  --gapextend NUMBER    Set the gap extension penalty to this positive
                        NUMBER (type: Int64)
  --retain-stops        Do not remove stop codons (`*`) in the output
                        sequences
  --retain-ambiguous    Do not remove ambiguous characters from the
                        output sequences

sliding window:
  --no-trimming         Skip sliding window-based trimming of
                        alignments
  --window-size NUMBER  Specifies the NUMBER of positions to average
                        across (type: Int64, default: 4)
  --required-distance NUMBER
                        Specifies the average distance required (type:
                        Float64, default: -7.0)

resources:
  --threads NUMBER      Number of threads to utilize (type: Int64,
                        default: 12)
  --block-size NUMBER   Billions of sequence letters to be processed
                        at a time. A larger block size leads to
                        increased performance at the expense of disk
                        and memory usage. Values >20 are not
                        recommended. (type: Float64, default: 2.0)