============================================================
HLRMDB METHODS & COMMAND EXCERPTS
Last updated: 2025-08-12
Intended use: This page summarises methods and provides brief command excerpts.
============================================================

[A] PREPROCESSING: QC & HOST REMOVAL
------------------------------------------------------------
Objective
  Clean raw reads, summarise quality, and remove host-derived sequences.
Resources
  fastp; filtlong; NanoPlot; minimap2; Bowtie2; SeqKit; samtools; GRCh38.
Command excerpts
  $ fastp -i short_R1.fq.gz -I short_R2.fq.gz -o clean_R1.fq.gz -O clean_R2.fq.gz
  $ filtlong [-1 clean_R1.fq.gz -2 clean_R2.fq.gz] long.fq.gz > long.filtered.fq
  $ NanoPlot --fastq long.filtered.fq -o nanoplot_out/
  $ minimap2 -ax map-ont GRCh38.mmi long.filtered.fq | samtools view -b -f 4 | \
      samtools fastq - | pigz -p <threads> > long.nonhost.fq.gz
  $ bowtie2 -x GRCh38 -1 clean_R1.fq.gz -2 clean_R2.fq.gz | samtools view -b -f 12 -F 256 | \
      samtools fastq \
        -1 >(pigz -p <threads> > short.nonhost_R1.fq.gz) \
        -2 >(pigz -p <threads> > short.nonhost_R2.fq.gz) -

[B] ASSEMBLY-FREE PROFILING
------------------------------------------------------------
Objective
  Read-level taxonomic profiling and AMR/virulence screening without assembly.
Resources
  Kraken2; Abricate (NCBI, VFDB); SeqKit.
Command excerpts
  $ kraken2 --db <kraken2_db> --threads <threads> --report sample.k2.report reads.fq.gz
  $ seqkit fq2fa long.nonhost.fq.gz > long.nonhost.fa
  $ abricate --db ncbi long.nonhost.fa > amr.tsv
  $ abricate --db vfdb long.nonhost.fa > vfdb.tsv

[C] ASSEMBLY & POLISHING
------------------------------------------------------------
Objective
  De novo assembly from long-read or hybrid data, followed by polishing and QC.
Resources
  metaSPAdes (hybrid) or metaFlye (long-read only); NextPolish; MetaQUAST.
Command excerpts
  $ metaspades.py -1 short.nonhost_R1.fq.gz -2 short.nonhost_R2.fq.gz \
      [--nanopore|--pacbio] long.nonhost.fq.gz -o asm_spades/
  $ flye --meta [--nano-raw|--pacbio-raw|--pacbio-hifi] long.nonhost.fq.gz --out-dir asm_flye/
  $ nextPolish run.cfg
  $ metaquast polished_assembly.fasta -o quast_out/

[D] BINNING & REFINEMENT
------------------------------------------------------------
Objective
  Recover MAGs using complementary binners and refine consensus bins.
Resources
  MetaWRAP (MetaBAT2/MaxBin2/CONCOCT); SemiBin2; DAS Tool.
Command excerpts
  $ metawrap binning -a polished_assembly.fasta -o bins/ short.nonhost_R1.fq short.nonhost_R2.fq
  $ minimap2 -ax [map-ont|map-pb] polished_assembly.fasta long.nonhost.fq.gz | samtools sort -o long.bam
  $ SemiBin2 single_easy_bin -i polished_assembly.fasta -b long.bam -o semibin2_out/
  $ DAS_Tool -i concoct.tsv,maxbin2.tsv,metabat2.tsv,semibin2.tsv -c polished_assembly.fasta -o das_out/

[E] QUALITY, TAXONOMY & FUNCTIONAL ANNOTATION
------------------------------------------------------------
Objective
  Assess bin quality, assign taxonomy, and perform per-MAG structural/functional annotation.
Scope (data level)
  Bin-level (per MAG).
Typical inputs/outputs
  Inputs: MAG FASTA (*.fa) from refined bins.
  Outputs: QC metrics (completeness/contamination), GTDB taxonomy tables,
           Prokka GFF/FAA/FFN, eggNOG annotations.
Resources
  CheckM2; GTDB-Tk; Prokka; eggNOG-mapper.
Command excerpts
  $ checkm2 predict -x fa --input das_bins/ --output checkm2_out/
  $ gtdbtk classify_wf --genome_dir das_bins/ --extension fa --out_dir gtdbtk_out/
  $ prokka bin.fa --outdir prokka/bin --prefix bin
  $ emapper.py -i prokka/bin/bin.faa -o bin --output_dir emap/bin

[F] GENE- & PATHWAY-LEVEL (ASSEMBLY-BASED)
------------------------------------------------------------
Objective
  Predict ORFs, build a non-redundant gene catalogue, quantify genes (hybrid),
  and annotate functions, taxonomy and AMR.
Scope (data level)
  Gene-catalogue level (ORF-based).
Typical inputs/outputs
  Inputs: non-redundant ORFs (genes.nr.*), optionally short-read pairs for quantification.
  Outputs: TPM/count matrices (Salmon; hybrid only), functional/KO/COG/GO/CAZy roll-ups,
           AMR annotations (RGI), species-level gene annotations (BASTA).
Resources
  Prodigal; CD-HIT; SeqKit; Salmon (hybrid); eggNOG-mapper; DIAMOND (NR or CAZy/dbCAN); BASTA; RGI.
Command excerpts
  # 1) ORF prediction → non-redundant gene set
  $ prodigal -p meta -i polished_assembly.fasta -d genes.nucl.fa -a genes.prot.fa
  $ cd-hit-est -i genes.nucl.fa -o genes.nr.nucl.fa -c 0.95 -aS 0.9
  $ seqkit translate genes.nr.nucl.fa > genes.nr.prot.fa

  # 2) Gene-level quantification (for weighted functional summaries; hybrid only)
  $ salmon index -t genes.nr.nucl.fa -i idx
  $ salmon quant -i idx -l A -1 short.nonhost_R1.fq.gz -2 short.nonhost_R2.fq.gz -o quant_out/
  # Export TPM/counts as gene.TPM / gene.count as needed

  # 3) Functional & taxonomic annotation of the gene catalogue
  # 3a) eggNOG (functions, KOs, COGs, GOs)
  $ emapper.py -i genes.nr.prot.fa -o eggnog_out/

  # 3b) NR homology + BASTA taxonomy (species assignment for genes)
  $ diamond blastp -d nr.dmnd -q genes.nr.prot.fa -o nr.m8
  $ basta sequence -q QUIET --alen 100 --identity 80 --evalue 1e-5 --minimum 3 --best_hit 1 --maj_perc 60 \
      -d nr.m8 species_annotation.txt prot

  # 3c) CAZy/dbCAN (carbohydrate-active enzymes; optional HMM route not shown here)
  $ diamond makedb --in dbCAN_protein.faa -d dbCAN
  $ diamond blastp -d dbCAN -q genes.nr.prot.fa -o cazy.tsv

  # 3d) Resistome (AMR) annotation
  $ rgi main -i genes.nr.prot.fa -t protein -a DIAMOND --include_loose -o rgi/genes
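
Supplementary sketch: TPM/count export for [F] step 2
  The export of gene.TPM / gene.count noted in step 2 of [F] is left to the reader.
  A minimal sketch follows, assuming Salmon's standard quant.sf column order
  (Name, Length, EffectiveLength, TPM, NumReads). The function name
  export_salmon_tables and the output prefix are illustrative only and are not
  part of the HLRMDB pipeline.

```shell
# Sketch only: split a Salmon quant.sf into two-column TPM and count tables.
# Assumption: quant.sf is tab-separated with the header
#   Name  Length  EffectiveLength  TPM  NumReads
# The function name and output naming are hypothetical.
export_salmon_tables() {
    quant_sf="$1"      # path to quant.sf, e.g. quant_out/quant.sf
    out_prefix="$2"    # output prefix, e.g. "gene" -> gene.TPM and gene.count

    # Column 1 = gene ID, column 4 = TPM (header line is kept)
    cut -f 1,4 "$quant_sf" > "${out_prefix}.TPM"
    # Column 1 = gene ID, column 5 = NumReads (estimated read count)
    cut -f 1,5 "$quant_sf" > "${out_prefix}.count"
}
```

  Example call after `salmon quant` has finished:
  $ export_salmon_tables quant_out/quant.sf gene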