Supplementary inforamtion, scripts and data files presented in the manuscript Hobbs et al.
View the Project on GitHub HobnobMancer/SI_Hobbs_et_al_2023_cazyme_classifier_evaluation
This repository contains supplementary information for analyses reported in Hobbs et al. (2023), independently and comprehensively evaluating the CAZyme classifiers dBCAN2, dbCAN3, dbCAN4, eCAMI and CUPP.
All files (includng all tables, CSV files and figures) generated by this report are available in the archive report-dbcan-CUPP.zip
file.
All files (includng all tables, CSV files and figures) generated by this report are available in the archive report-all-classifiers.zip
file.
The output from pyrewton
is available in the results.zip
archive.
A citation for this work will be added once available. At the present please cite this repository as the source, the DOI:10.5281/zenodo.8123588, and the authors (in order): Emma E. M. Hobbs1,2,3, Tracey, M. Gloster1, Leighton Prichard2.
- School of Biology and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews, Fife, KY16 9ST, UK
- Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow G4 ORE, UK
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
@misc{Hobbs2023,
author = "Emma E. M. Hobbs and Tracey M. Gloster and Leighton Pritchard",
title = "Independent and Comprehensive
Benchmarking of CAZyme classifiers: dbCAN,
Hotpep, CUPP and eCAMI",
howpublished = "\url{https://hobnobmancer.github.io/SI_Hobbs_et_al_2023_cazyme_classifier_evaluation/}",
year = 2023,
note = "Version 1. DOI:10.5281/zenodo.8123588"
}
To reproduce the analyses, run all commands provided in the walkthrough from the root of this directory.
CAZyme classifiers analyse query protein sequence and predict CAZyme domains and associated CAZy family annotations. This enables exploratory analysis of CAZyme complements not presently catalogued in the CAZy database (www.cazy.org). Each CAZyme classifier implements a different method to predict CAZy family annotations.
We previously published an evaluation of dbCAN2, eCAMI and CUPP (Hobbs et al., 2021). Since then, two standalone versions of dbCAN have been released (dbCAN3 and dbCAN4). Additionally, the previous analysis was limited to 70 genomes, over weighted towards bacterial genomes. To address these points, we present here an independent and comprehensive evaluation of the CAZyme classifiers:
Evaluating performance of:
The CAZyme classifier dbCAN is available as a webserver and a standalone tool. In this evaluation the standalone tool was used, and is referred to as dbCAN, references to the webserver are defined as the dbCAN webserver. The version numbers of the standalone tool and the webserver are independent of one another:
Each version of dbCAN implements multiple sequence alignment and modelling tools:
All references to implementing HMMER and DIAMOND refer to the implementation of these tools by dbCAN. For this evaluation, specifically the implementation of HMMER and DIAMOND by dbCAN4.
Han Zhang and others, dbCAN2: a meta server for automated carbohydrate-active enzyme annotation, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W95–W101
dbCAN4 and dbCAN-sub (dbCAN)
Zheng J, Ge Q, Yan Y, Zhang X, Huang L, Yin Y. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Res. 2023 Jul 5;51(W1):W115-W121eCAMI
Xu J, Zhang H, Zheng J, Dovoedo P, Yin Y. eCAMI: simultaneous classification and motif identification for enzyme annotation. Bioinformatics. 2020 Apr 1;36(7):2068-2075CUPP
Barrett, K., Lange, L. Peptide-based functional annotation of carbohydrate-active enzymes by conserved unique peptide patterns (CUPP). Biotechnol Biofuels 12, 102 (2019).HMMER
Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755-63.DIAMOND
Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60 (2015). https://doi.org/10.1038/nmeth.3176 \
All raw figure files presented in the complete report in the manuscript can be found in the results/
directory.
Owing to the size of the data sets used, the figures are consequently compressed in the final manuscript. This remote repository contains the original full size, high resolution figures.
Additionally, some analyses are only briefly mentioned in the manuscript. The full method and results of these analyses are stored in this repository.
You can use this repository like a website, to browse and see how we performed the analysis, or you can download it to inspect, verify, reproduce and build on our analysis.
You can use git
to clone this repository to your local hard drive:
git clone git@github.com:HobnobMancer/SI_Hobbs_et_al_2023_cazyme_classifier_evaluation.git
Or you can download it as a compressed .zip
archive from this link.
Please raise an issue at the corresponding GitHub
page:
The structure of this repository:
???
You can use this archive to browse, validate, reproduce, or build on the phylogenomics analysis for the Hobbs et al. (2023) manuscript.
We recommend creating a conda environment specific for this activity, for example using the commands:
conda create -n pyrewton python=3.9 cazomevolve mmseqs2 -y
conda activate pyrewton
conda install --file requirements.txt -y -c bioconda -c conda-forge -c predector
pip3 install pyrewton
The installation instructions for dbCAN
v==2.0.11 can be found here and were followed to install dbCAN for the analysis presented in the manuscript.
v==3.0.7
v==4.0.0
ADD INSTALLATION INSRUCTIONS
tools
dbcan_2
dbcan_3
dbcan_4
A separate virual environment was created for each version of dbCAN
Several of the data files required to repeat the analyses presented in the manuscript are stored (available for use) in the repo. These files are stored in the data/
directory:
Configure using cazy_webscraper
(Hobbs _et al., 2022) to download all data from the CAZy database, and compile the data into a local CAZyme database.
cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets Emma E. M. Hobbs, Tracey M. Gloster, Leighton Pritchard bioRxiv 2022.12.02.518825; doi: https://doi.org/10.1101/2022.12.02.518825
The CAZy database was downloaded on 2023-07-26 using the bash script download_cazy.sh
.
scripts/cazy/download_cazy.sh <email>
This generated the local CAZyme database data/cazy/cazy-db_2023-07-26.db
.
Retrieved the test sets from the previous evaluation (presented in the following poster Hobbs et al., 2021), which comprised 40 bacterial and 30 eukaryotic genomes.
The bash script get_cazy_sampling.sh
was used to coordinate querying the local CAZyme database (data/cazy/cazy-db_2023-07-26.db
) to extract the number of unique GenBank protein IDs associated with each species listed in CAZy. These data were written to a CSV file by the bash script (data/test_sets/cazy_species_sizes.csv
).
The CSV file was mannually explore to identify an addition 10 eukaryotic genomes, with the assembly status of ‘complete’ in the NCBI Assembly database and containing at least 100 proteins listed in the local CAZyme database , to be added to the test sets from the previous evaluation.
Output:
data/test_sets/test_sets-summary.csv
.The Python script csv_to_yml.py
was used to generate a Yaml file summarising the test sets, and to write out all the NCBI Assembly version accessions to a plain text file (data/test_sets/test_sets.txt
).
scripts/test_sets/csv_to_yml.py
Output:
data/test_sets/test_sets-genomes.txt
data/test_sets/test_sets-taxs.yaml
GenBank assemblies were used for this analysis not reference (RefSeq genomes) because CAZy primarily annotates protein sequences from the NCBI GenBank database, and thus contains very few proteins from the NCBI RefSeq database.
The plain text file of NCBI Assembly version accessions (data/test_sets/test_sets-genomes.txt
) was used as input to ncbi-genome-download
tool to download the protein FASTA files for each assembly. This was coordinated using the bash script download_genomes.sh
:
scripts/test_sets/download_genomes.sh
The protein FASTA files were written to the data/test_sets/proteomes
directory.
pyrewton
was used to generate a data set per downloaded genome .faa file, consisting of 100 randomly selected proteins that are listed in the local CAZy database (from here on referred to as ‘known CAZymes’), and the 100 protein sequences with the highest BLAST Score Ratio against the 100 selected known CAZymes (from here on referred to as ‘known non-CAZymes’).
This was coordinated using the bash script create_test_sets.sh
scripts/test_sets/create_test_sets.sh
Output:
.faa
). These are stored in data/test_sets/test_sets/test_sets/
directory.data/test_sets/test_sets/alignment_scores
by pyrewton
.data/test_sets/test_sets/cazome_coverage_2023_07_26-10_38_27.txt
)data/test_sets/test_sets/test_set_composition_2023_07_26-10_38_27.txt
)Each CAZyme classifier was used to parse all test sets FASTA files. A separate bash script was used to coordinate running each CAZyme classifier:
dbCAN_2:
scripts/tools/run_dbcan_2.sh
dbCAN_3:
scripts/tools/run_dbcan_3.sh
dbCAN_4:
scripts/tools/run_dbcan_4.sh
CUPP:
scripts/tools/run_cupp.py
The output for all tools was written to the data/tool_outputs
directory.
Each tool is given its own subdirectory within data/tool_outputs/
.
Each test set is given it own subdirectory within each tool subdirectory:
└── data
└── tools_output
├── cupp
│ ├── GCA_900635675.1
│ └── GCA_900638075.1
│ ├── cupp.log
│ ├── cupp_output.fasta
│ └── cupp_output.fasta.log
├── dbcan_2
│ ├── GCA_900635675.1
│ └── GCA_900638075.1
│ ├── dbcan.log
│ ├── diamond.out
│ ├── hmmer.out
│ ├── Hotpep.out
│ ├── overview.txt
│ └── uniInput
├── dbcan_3
│ ├── GCA_900635675.1
│ └── GCA_900638075.1
│ ├── dbcan.log
│ ├── diamond.out
│ ├── eCAMI.out
│ ├── hmmer.out
│ ├── overview.txt
│ └── uniInput
└── dbcan_4
├── GCA_900635675.1
└── GCA_900638075.1
├── dbcan.log
├── dbsub.out
├── diamond.out
├── dtemp.out
├── hmmer.out
├── overview.txt
└── uniInput
Use pyrewton
to calculate performance statistics for each level of classification.
Use the YAML file data/test_sets/test_sets-kingdoms.yaml
and the --tax_groups
flag to evaluate the performance across all test sets and per taxonomic kingdom.
scripts/calculate_stats.sh
The levels of classification that are evaluated:
The performance statistics calculated for each level of CAZyme classification:
Statistics were calculated across the entire data set, and per kingdom (bacteria and eukaryotes).
Statistics were calculated for each classifier and per tool witihn each classifier:
The output (a series of dataframes) were written to the results/
directory. Note, owing to the number and sizes of the files, the results/
directory is provided as a ZIP archive file in this repository.
The RMarkdown notebook notebooks/2023-cazyme-classifier-eval-dbcan-CUPP.Rmd
was used to interrogate and visulise the performance statistics calculated by pyrewton
.
A HTML version of notebook can be viewed here.
All output from the notebook is available in the report/
directory. Note, owing to the number and sizes of the files, the report/
directory is provided as a ZIP archive file in this repository. Due to its size, this directory is provided as a ZIP file (report-dbcan-cupp.zip
).
After running the RMarkdown notebook, the Python script summarise_family_populations.py
can be run to generate a CSV file listing the number of unique NCBI protein version accessions listed in the local CAZyme database for each CAzy family, the total number of proteins included across all test sets per CAZy family, and the percentage of the CAZy family population represented in the test sets. This file is written to data/test_sets/family_representation.csv
.
The RMarkdown notebook notebooks/2023-cazyme-classifier-eval.Rmd
was used to interrogate and visulise the performance statistics calculated by pyrewton
. This evaluated the performance across all 13 prediction tools.
A HTML version of notebook can be viewed here.
All output from the notebook is available in the report/
directory. Note, owing to the number and sizes of the files, the report/
directory is provided as a ZIP archive file in this repository. Due to its size, this directory is provided as a ZIP file (report-all-classifiers.zip
).
After running the RMarkdown notebook, the Python script summarise_family_populations.py
can be run to generate a CSV file listing the number of unique NCBI protein version accessions listed in the local CAZyme database for each CAzy family, the total number of proteins included across all test sets per CAZy family, and the percentage of the CAZy family population represented in the test sets. This file is written to data/test_sets/family_representation.csv
.
To explore if the consistently poor performance of the CAZyme classifiers for CAZy families GT25 and GT31 was the result of high sequence diversity within these families, an all-versus-all DIAMOND pairwise alignment analysis was performed for each family.
# downloads protein sequences from NCBI GenBank for GT25 and GT31
# Download protein sequences are stored in the local CAZyme database
scripts/seq_diversity/download_seqs.sh
scripts/seq_diversity/extract_seqs.sh
FASTA files written to:
data/sequences/gt25.fasta
data/sequences/gt31.fasta
scripts/seq_diversity/run_diamond.sh
Output written to:
data/sequences/sequences/gt25.diamond.out
data/sequences/sequences/gt31.diamond.out
The Jupyter notebook notebooks/gt_sequence_diversity.ipynb
was used to calculate the BLAST Score Ratio to facilitate comparing the degree of sequence similarity across all and to visualise the BSR.
The figures generated using this notebook are written to:
results/sequences/gt25-clustermap.pdf
- presented in the SIresults/sequences/gt25-clustermap-fullSized.pdf
- all labels are readable and is provided in the online repositoryresults/sequences/gt31-clustermap.pdf
- presented in the SIresults/sequences/gt31-clustermap-fullSized.pdf
- all labels are readable and is provided in the online repository