Reference Databases
MARTi does not download or manage classification databases automatically. The databases must be pre-built and accessible to the MARTi Engine at runtime, and you must provide valid paths to them in the MARTi configuration file.
Pre-built BLAST databases
Users can provide a pre-built BLAST database, such as the nucleotide sequence database (nt) or Prokaryotic RefSeq database, or build and use a custom BLAST database.
The easiest way to obtain the latest pre-built BLAST databases is to run the update_blastdb.pl script that ships with the BLAST+ command line tools (Perl is a prerequisite). To view the script's documentation, run it without any arguments.
To view all available BLAST databases, run the following command:
update_blastdb.pl --showall
To download one of these pre-built BLAST databases, run the script followed by any relevant options and the name(s) of the BLAST databases to download. For example:
update_blastdb.pl --decompress ref_prok_rep_genomes
Custom BLAST databases
If you want to make a custom BLAST database from FASTA files, you can use the makeblastdb tool distributed with the BLAST+ command line tools. Before running the command, ensure that each sequence has a unique identifier and that you have created an additional file that maps these identifiers to NCBI taxids, with one sequence identifier and its taxid per line, separated by a space (see the BLAST+ documentation for more). Then you can build your database with a command similar to this:
makeblastdb -in zymo_mock.fasta -parse_seqids -blastdb_version 5 -title "Zymo mock" -dbtype nucl -taxid_map taxid_map.txt
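The two prerequisites above (unique identifiers and a taxid map) can be checked and prepared with standard shell tools. The following is a minimal sketch and not part of MARTi or BLAST+. The function names find_duplicate_ids and make_taxid_map are hypothetical, the identifier is assumed to be the first whitespace-delimited token of each FASTA header, and make_taxid_map assumes every sequence in the given file belongs to the same taxon.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print every sequence identifier that occurs more than once in a FASTA file.
# The identifier is taken to be the first token of the header, without the ">".
find_duplicate_ids() {
    local fasta="$1"
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort | uniq -d
}

# Emit one "<sequence-id> <taxid>" line per sequence in a FASTA file.
# Assumption: all sequences in the file share the single taxid passed in.
make_taxid_map() {
    local fasta="$1" taxid="$2"
    awk -v taxid="$taxid" '/^>/ { sub(/^>/, "", $1); print $1, taxid }' "$fasta"
}
```

For example, running find_duplicate_ids zymo_mock.fasta should print nothing if all identifiers are unique. For a FASTA file containing only Escherichia coli sequences (NCBI taxid 562), make_taxid_map ecoli.fasta 562 >> taxid_map.txt would append its entries to the map; ecoli.fasta is an illustrative file name.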
CARD
If specified in the configuration file, the MARTi Engine will also BLAST reads against the Comprehensive Antibiotic Resistance Database (CARD) for AMR gene identification. To use the CARD database, you will need to:
Download both the CARD Data and CARD Ontology files from the CARD website.
Extract the contents of each file into a single directory.
Create a BLAST database from the FASTA sequences:
makeblastdb -in nucleotide_fasta_protein_homolog_model.fasta -dbtype nucl
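Before pointing MARTi at a BLAST database, it can help to confirm that the index files were actually written. The sketch below is an illustration and not part of MARTi; the function name blast_nucl_db_exists is hypothetical. It relies on the BLAST+ convention that a nucleotide database has a .nin index file next to its path prefix, or a .nal alias file when the database is split into volumes. Because the makeblastdb command above does not set -out, the prefix in that case is the FASTA file name itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if a nucleotide BLAST database with this path prefix appears to exist.
# Single-volume databases have <prefix>.nin; multi-volume ones have <prefix>.nal.
# This only checks that files are present, not that the database is intact.
blast_nucl_db_exists() {
    local prefix="$1"
    [[ -f "${prefix}.nin" || -f "${prefix}.nal" ]]
}
```

A call such as blast_nucl_db_exists nucleotide_fasta_protein_homolog_model.fasta && echo "CARD database found" gives a quick check. For a more thorough one, blastdbcmd -db followed by the prefix and -info prints a summary of the database if BLAST+ can open it.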
Kraken2 Databases
To classify reads with Kraken2, users must provide a Kraken2-compatible database, either by building one themselves or by downloading a pre-built one.
Pre-built databases are available from: https://benlangmead.github.io/aws-indexes/k2
To build a custom Kraken2 database, refer to the official Kraken2 manual: https://ccb.jhu.edu/software/kraken/MANUAL.html#custom-databases
After downloading or building a database, provide the full path to the database directory in the MARTi configuration file.
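A Kraken2 database directory normally contains three files: hash.k2d, opts.k2d and taxo.k2d. As a rough sanity check before adding the path to the configuration file, something like the following sketch could be used. It is not part of MARTi or Kraken2, and the function name kraken2_db_ok is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if the directory contains the three core Kraken2 database files.
# Reports the first missing file on stderr and returns 1 otherwise.
kraken2_db_ok() {
    local dir="$1" f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [[ ! -f "${dir}/${f}" ]]; then
            echo "missing: ${dir}/${f}" >&2
            return 1
        fi
    done
    return 0
}
```

For instance, kraken2_db_ok /path/to/kraken2_db && echo "database looks complete", where /path/to/kraken2_db is a placeholder for your own database directory. The check only confirms the files are present, not that a download finished without corruption.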
Centrifuge Databases
MARTi also supports Centrifuge for classification.
Pre-built Centrifuge databases are available from: https://benlangmead.github.io/aws-indexes/centrifuge
Note: These pre-built databases are quite old (2016–2018) and do not reflect the latest available reference sequences. We recommend building your own up-to-date Centrifuge database where possible.
Users can build their own Centrifuge database using the centrifuge-download and centrifuge-build commands. Here is an example for constructing a basic metagenomic database for Centrifuge:
# download NCBI taxonomy
centrifuge-download -o taxonomy taxonomy
# download RefSeq genomes for archaea, bacteria, viruses and fungi
centrifuge-download -o library -m -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map
# add the human reference genome
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map
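# concatenate all sequences into one file (overwrites input-sequences.fna, so re-running does not duplicate sequences)
find library -name '*.fna' -exec cat {} + > input-sequences.fna
# build the Centrifuge index (-p sets the number of threads; adjust it to your machine)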
centrifuge-build -p 100 \
--conversion-table seqid2taxid.map \
--taxonomy-tree taxonomy/nodes.dmp \
--name-table taxonomy/names.dmp \
input-sequences.fna metagenome
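When centrifuge-build finishes, the index is written as a set of files sharing the base name given as the last argument (metagenome above), with suffixes .1.cf, .2.cf and .3.cf. The sketch below checks that these files exist. It is not part of MARTi or Centrifuge, and the function name centrifuge_index_ok is hypothetical. Some Centrifuge versions may write additional files, which this check ignores.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if <base>.1.cf, <base>.2.cf and <base>.3.cf all exist.
# Reports the first missing file on stderr and returns 1 otherwise.
centrifuge_index_ok() {
    local base="$1" n
    for n in 1 2 3; do
        if [[ ! -f "${base}.${n}.cf" ]]; then
            echo "missing: ${base}.${n}.cf" >&2
            return 1
        fi
    done
    return 0
}
```

For the build above, centrifuge_index_ok metagenome && echo "index files present" would be run from the directory where the index was written.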