Config file format

Each time you run a MARTi analysis on a sequencing run, you need to specify a config file which provides the details of the analysis to be performed.

This config file is generated by the MARTi launcher front-end (Desktop) or GUI (cluster/HPC).

The following table specifies the meaning of the parameters in the file. Keywords in bold are mandatory, others are optional.

Sample and global settings

| Keyword | Example | Meaning |
| --- | --- | --- |
| SampleName | BAMBI_1D_18042017 | Sample name |
| RawDataDir | /path/to/dir | Run directory - specifically the path to the directory containing the fastq_pass, fastq_fail etc. directories. If guppy/dorado was run separately, the directory containing the fastq directory. |
| SampleDir | /path/to/dir | Path to directory to use for MARTi analysis files (will be created if it doesn’t exist) |
| ProcessBarcodes | 01,02,03 | If a barcoded sample, indicates which barcodes to process |
| BarcodeId<n> | BarcodeSampleId1 | Sample ID to use for barcode n |
| Scheduler | local | Job scheduler to use - either “local” or “slurm”. |
| Queue | ei-medium | The default job submission queue. Currently only required for SLURM and equates to the partition name. |
| MaxJobs | 4 | Specifies the maximum number of concurrent jobs that can be run by the scheduler (local or SLURM). |
| InactivityTimeout | 10 | How long (seconds) before giving up waiting for new reads to appear. After this timeout, all remaining analysis will be completed and analysis will stop. Default timeout is 10 seconds. |
| StopProcessingAfter | 50000 | Stop analysis after this number of reads. Default behaviour is no limit. |
| schedulerFileTimeout | 600000 | For SLURM, the allowed time (in milliseconds) between a job completing and an output file appearing before concluding a failure. Default 600000 (i.e. 10 minutes). |
| SchedulerFileWriteDelay | 30000 | For SLURM, the delay (in milliseconds) after a job completes and its output file appears before MARTi attempts to read it. Default 30000 (i.e. 30 seconds). |
| SchedulerResubmissionAttempts | 2 | For SLURM, how many times to try resubmitting a failed job before giving up. |
| TaxonomyDir | /path/to/dir | Specifies location of NCBI taxonomy files (i.e. the directory containing nodes.dmp and names.dmp). |
| AccessionMap | /path/to/file | Specifies an accession map for mapping accession IDs to taxa. This is generated from the NCBI accession2taxid data by a separate tool. This option should not be required for normal MARTi operation. |
| ConvertFastQ | n/a | Deprecated. |
| ReadsPerBlast | 4000 | BLAST chunk size - reads are batched into bundles of this number before BLASTing. |
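
As an illustration of the ReadsPerBlast setting, the sketch below groups read IDs into bundles of a fixed size. It is not MARTi's own code: the function name `batch_reads` and the handling of a final, shorter bundle are assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def batch_reads(read_ids: Iterable[str], reads_per_blast: int) -> Iterator[List[str]]:
    """Yield lists of at most reads_per_blast read IDs, in input order."""
    if reads_per_blast <= 0:
        raise ValueError("reads_per_blast must be positive")
    batch: List[str] = []
    for read_id in read_ids:
        batch.append(read_id)
        if len(batch) == reads_per_blast:
            yield batch
            batch = []
    if batch:
        # Assumption: leftover reads form one final, shorter bundle.
        yield batch


# Ten hypothetical reads with a chunk size of 4.
chunks = list(batch_reads((f"read{i}" for i in range(10)), 4))
print([len(chunk) for chunk in chunks])  # [4, 4, 2]
```

A larger chunk size means fewer, longer-running jobs; a smaller one gives results sooner at the cost of more job submissions.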

Pre-filtering settings

| Keyword | Example | Meaning |
| --- | --- | --- |
| ReadFilterMinQ | 9 | Minimum mean quality value. Reads with mean Q below this are not processed. Default 0 (all reads). |
| ReadFilterMinLength | 150 | Minimum read length. Reads shorter than this are not processed. Default 150. |
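
The two pre-filter thresholds can be pictured with the following minimal sketch. It assumes that "mean Q" is the plain average of the per-base Phred scores; MARTi may compute the mean differently, and the function name `passes_prefilter` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def passes_prefilter(sequence: str, qualities: Sequence[int],
                     min_q: float = 0.0, min_length: int = 150) -> bool:
    """Return True if a read survives the length and mean-quality filters."""
    if len(sequence) < min_length:
        return False
    # Assumption: mean Q is the arithmetic mean of the per-base Phred scores.
    if qualities and mean(qualities) < min_q:
        return False
    return True


# Equivalent of ReadFilterMinQ:9 with the default ReadFilterMinLength of 150.
print(passes_prefilter("A" * 200, [12] * 200, min_q=9))  # True
print(passes_prefilter("A" * 100, [12] * 100, min_q=9))  # False, too short
print(passes_prefilter("A" * 200, [7] * 200, min_q=9))   # False, mean Q too low
```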

LCA classification settings

These Lowest Common Ancestor settings apply to BLAST results (see below).

| Keyword | Example | Meaning |
| --- | --- | --- |
| LCAMaxHits | 100 | Maximum number of BLAST hits to consider in LCA assignment. Default 100. |
| LCAScorePercent | 90 | Only consider hits within this percentage of the top hit for a given read. Default 90. |
| LCAMinIdentity | 70 | Only consider hits with this minimum identity. Default 70. |
| LCAMinLength | 150 | Minimum length of alignment to consider. Default 150. |
| LCAMinReadLength | 100 | Minimum length of read to consider. Default 0. Note, this comes after ReadFilterMinLength, so if set to a value lower than that it will have no effect. |
| LCAMinQueryCoverage | 70 | Only consider hits with this minimum percent coverage of query. Default 0. |
| LCAMinCombinedScore | 120 | Only consider hits where identity % added to query coverage % is greater than this value. Default 0. |
| LCALimitToSpecies | n/a | Limit LCA classification to species level and no lower. |
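
To show how these thresholds might combine, here is a rough sketch of hit filtering followed by a lowest common ancestor lookup. Several details are assumptions rather than MARTi's documented behaviour: the `Hit` class, the toy lineages used in place of NCBI taxonomy, the reading of LCAScorePercent as "bitscore at least this percentage of the best bitscore", and the order in which the filters run. LCAMinReadLength and LCALimitToSpecies are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hit:
    lineage: Sequence[str]   # root-to-leaf taxon names (toy stand-in for NCBI taxonomy)
    bitscore: float
    identity: float          # percent identity
    length: int              # alignment length
    query_coverage: float    # percent of the read covered by the alignment


def filter_hits(hits: List[Hit], max_hits: int = 100, score_percent: float = 90.0,
                min_identity: float = 70.0, min_length: int = 150,
                min_query_coverage: float = 0.0,
                min_combined_score: float = 0.0) -> List[Hit]:
    """Apply the LCA thresholds to the hits of a single read."""
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h.bitscore, reverse=True)[:max_hits]
    # Assumption: a hit is "within score_percent of the top hit" when its
    # bitscore is at least score_percent % of the best bitscore.
    cutoff = ranked[0].bitscore * score_percent / 100.0
    return [h for h in ranked
            if h.bitscore >= cutoff
            and h.identity >= min_identity
            and h.length >= min_length
            and h.query_coverage >= min_query_coverage
            and h.identity + h.query_coverage > min_combined_score]


def lowest_common_ancestor(hits: List[Hit]) -> str:
    """Return the deepest taxon shared by the lineages of all hits."""
    if not hits:
        return "unclassified"
    shared = list(hits[0].lineage)
    for hit in hits[1:]:
        common = 0
        for ours, theirs in zip(shared, hit.lineage):
            if ours != theirs:
                break
            common += 1
        shared = shared[:common]
    return shared[-1] if shared else "root"


hits = [
    Hit(("Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli"), 1000, 95, 900, 90),
    Hit(("Bacteria", "Enterobacteriaceae", "Shigella", "Shigella flexneri"), 950, 94, 880, 88),
    Hit(("Bacteria", "Bacillaceae", "Bacillus", "Bacillus subtilis"), 600, 80, 500, 50),
]
print(lowest_common_ancestor(filter_hits(hits)))                    # Enterobacteriaceae
print(lowest_common_ancestor(filter_hits(hits, score_percent=50)))  # Bacteria
```

Loosening LCAScorePercent lets the weaker Bacillus hit through, which pushes the assignment up the tree; tighter thresholds tend to give deeper, more specific assignments.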

BLAST processes

You can run multiple BLAST processes. Each begins with the keyword BlastProcess.

| Keyword | Example | Meaning |
| --- | --- | --- |
| BlastProcess | n/a | Defines the start of a BLAST process |
| Name | nt | Name of process |
| Program | megablast | BLAST algorithm to use, e.g. megablast, blastn |
| Database | /path/to/db | Database (path)name. Note, this should be the same as you would specify to the BLAST command line with the -db parameter, i.e. it is typically a prefix, or may point to the FASTA file that the database was built from. |
| UseToClassify | n/a | Use BLAST results for classification (can only be set for 1 classification process) |
| TaxaFilter | /path/to/file.txt | Taxa filter file to use with BLAST (e.g. to filter to bacteria/viruses) |
| MaxE | 0.001 | Max E value for BLAST |
| MaxTargetSeqs | 100 | Maximum number of target sequences for BLAST |
| RunMeganEvery | n/a | Deprecated. |
| BlastThreads | 4 | Number of threads to use when running BLAST. Note: for SLURM scheduler, MARTi also uses this value for the SLURM --cpus-per-task option. |
| Memory | 16G | For SLURM scheduler, the memory to use per BLAST job. Passed with the SLURM --mem parameter. |
| Queue | ei-medium | The job submission queue to use. Can be left out and the default queue (see above) will be used. Currently only required for SLURM and equates to the partition name. |
| Dust | 15 64 1 | Dust string to be passed on to all BLAST commands for this BLAST process (optional). |
| Options | -ungapped | Any additional options to pass to BLAST (multiple options can be separated with spaces) |

Diamond processes

Diamond can be used to classify reads against a Diamond database that is built with taxonomy information. For example, to build a compatible diamond database using NCBI taxonomy, use the command

diamond makedb --threads 8 --in nr.gz -d nr.diamond-2.0.9 --taxonmap prot.accession2taxid.FULL.gz --taxonnodes nodes.dmp --taxonnames names.dmp

The fields --taxonmap, --taxonnodes, and --taxonnames must be specified for the database to be compatible with MARTi.

Diamond processes are a subset of BLAST processes, with the Program field set to diamond. All compatible fields from the BLAST process are passed through to Diamond. Diamond processes have an additional options field to specify the sensitivity mode (or any other options). See below for an example.

Centrifuge processes

You can run multiple Centrifuge processes. Each begins with the keyword CentrifugeProcess.

| Keyword | Example | Meaning |
| --- | --- | --- |
| CentrifugeProcess | n/a | Defines the start of a Centrifuge process |
| Name | cent_nt | Name of process |
| Database | /path/to/db | Path to Centrifuge database |
| UseToClassify | n/a | Use Centrifuge results for classification (can only be set for 1 classification process) |
| CentrifugeThreads | 4 | Number of threads to use when running Centrifuge. Note: for SLURM scheduler, MARTi also uses this value for the SLURM --cpus-per-task option. |
| Memory | 16G | For SLURM scheduler, the memory to use per Centrifuge job. Passed with the SLURM --mem parameter. |
| Queue | ei-medium | The job submission queue to use. Can be left out and the default queue (see above) will be used. Currently only required for SLURM and equates to the partition name. |
| MinHitLen | 500 | This value is passed to the Centrifuge option --min-hitlen for this process. |
| TaxaFilter | 544,550 | Passes through to Centrifuge’s --exclude-taxids option, which is described as “a comma-separated list of taxonomic IDs that will be excluded in classification procedure. The descendants from these IDs will also be excluded.” |
| Options | --reorder | Any additional options to pass to Centrifuge (multiple options can be separated with spaces) |

Kraken2 processes

You can run multiple Kraken2 processes. Each begins with the keyword Kraken2Process.

| Keyword | Example | Meaning |
| --- | --- | --- |
| Kraken2Process | n/a | Defines the start of a Kraken2 process |
| Name | k2_refseq | Name of process |
| Database | /path/to/db/ | Path to directory containing Kraken2 database |
| UseToClassify | n/a | Use Kraken2 results for classification (can only be set for 1 classification process) |
| Kraken2Threads | 4 | Number of threads to use when running Kraken2. Note: for SLURM scheduler, MARTi also uses this value for the SLURM --cpus-per-task option. |
| Memory | 16G | For SLURM scheduler, the memory to use per Kraken2 job. Passed with the SLURM --mem parameter. |
| Queue | ei-medium | The job submission queue to use. Can be left out and the default queue (see above) will be used. Currently only required for SLURM and equates to the partition name. |
| Options | --confidence 0.01 | Any additional options to pass to Kraken2 (multiple options can be separated with spaces) |

AMR Walkout

| Keyword | Example | Meaning |
| --- | --- | --- |
| WalkoutMinDistance | 50 | Minimum distance from AMR hit that host hit must extend |
| WalkoutMinID | 80 | Minimum percentage identity for an AMR hit |
| WalkoutMinLength | 100 | Minimum length for an AMR hit alignment |
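
One way to read the three walkout thresholds is sketched below. This is an interpretation, not MARTi's implementation: it assumes both hits are described by start and end coordinates on the same read, and that the host hit qualifies if it extends far enough past the AMR hit on at least one side. The function names are hypothetical, and the default arguments are simply the example values from the table.

```python
def amr_hit_ok(identity: float, length: int,
               min_id: float = 80.0, min_length: int = 100) -> bool:
    """Check an AMR hit against WalkoutMinID and WalkoutMinLength."""
    return identity >= min_id and length >= min_length


def host_hit_extends(amr_start: int, amr_end: int,
                     host_start: int, host_end: int,
                     min_distance: int = 50) -> bool:
    """Check that the host hit reaches min_distance beyond the AMR hit.

    Assumption: coordinates are positions on the read, and extending on
    either side of the AMR hit is enough.
    """
    left_overhang = amr_start - host_start
    right_overhang = host_end - amr_end
    return left_overhang >= min_distance or right_overhang >= min_distance


# AMR gene aligned to read positions 1000-1800.
print(amr_hit_ok(identity=92.5, length=800))        # True
print(host_hit_extends(1000, 1800, 200, 2500))      # True, host flanks the gene
print(host_hit_extends(1000, 1800, 990, 1810))      # False, only 10 bases either side
```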

Metadata

Metadata blocks are optional blocks that contain data describing the collection of samples. A metadata block could describe the whole run or a subset of barcodes.

| Keyword | Example | Meaning |
| --- | --- | --- |
| Metadata | n/a | Defines the start of a metadata block |
| Location | 52.62170,1.21900 | GPS coordinates of location where sample was collected. |
| Date | 31/10/23 | Date of sample collection |
| Time | 11:41 | Time of sample collection |
| Temperature | 21.7C | Temperature at location at time of collection. |
| Humidity | 49% | Humidity at location at time of collection. |
| Keywords | field,potatoes,infected | Comma-separated list of keywords to describe the sample. Used for searching. |
| Barcodes | 01,02,03,04,05 | Optional comma-separated list of barcodes for which this metadata applies. Do not include this field to use metadata for all barcodes. |
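
The rule for the Barcodes field (present means "only these barcodes", absent means "every barcode") could be expressed as in the sketch below. The dictionary representation of a metadata block and the helper names are assumptions for illustration only.

```python
from typing import Dict, Tuple


def applies_to(block: Dict[str, str], barcode: int) -> bool:
    """A metadata block without a Barcodes field applies to every barcode."""
    listed = block.get("Barcodes")
    if not listed:
        return True
    return barcode in {int(item) for item in listed.split(",") if item.strip()}


def parse_location(value: str) -> Tuple[float, float]:
    """Split a 'latitude,longitude' string into two floats."""
    latitude, longitude = (float(part) for part in value.split(","))
    return latitude, longitude


# Hypothetical blocks: one for the whole run, one for barcodes 01 to 05.
run_wide = {"Location": "52.62170,1.21900", "Keywords": "field,potatoes,infected"}
subset = {"Temperature": "21.7C", "Barcodes": "01,02,03,04,05"}

print(applies_to(run_wide, 7))   # True
print(applies_to(subset, 3))     # True
print(applies_to(subset, 7))     # False
print(parse_location(run_wide["Location"]))
```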

Example

Example file:

SampleName:BAMBI_1D_19092017_MARTi
RawDataDir:/Users/leggettr/Documents/Datasets/BAMBI_1D_19092017_MARTi
SampleDir:/Users/leggettr/Documents/Projects/MARTiTest/BAMBI_1D_19092017_MARTi
ProcessBarcodes:
BarcodeId1:SampleNameHere

Scheduler:local
LocalSchedulerMaxJobs:4

InactivityTimeout:10
StopProcessingAfter:50000000

TaxonomyDir:/Users/leggettr/Documents/Databases/taxonomy_6Jul20
LCAMaxHits:20
LCAScorePercent:90
LCAMinIdentity:60
LCAMinQueryCoverage:0
LCAMinCombinedScore:0
LCAMinLength:50

ConvertFastQ

ReadsPerBlast:8000

ReadFilterMinQ:9
ReadFilterMinLength:500

BlastProcess
    Name:nt
    Program:megablast
    Database:/Users/leggettr/Documents/Databases/nt_30Jan2020_v5/nt
    TaxaFilter:/Users/leggettr/Documents/Datasets/bacteria_viruses.txt
    MaxE:0.001
    MaxTargetSeqs:25
    BlastThreads:4
    UseToClassify

BlastProcess
    Name:card
    Program:blastn
    Database:/Users/leggettr/Documents/Databases/card/nucleotide_fasta_protein_homolog_model.fasta
    MaxE:0.001
    MaxTargetSeqs:100
    BlastThreads:1

Metadata
    Location:52.62170,1.21900
    Date:31/10/23
    Time: 11:41
    Temperature:21.7C
    Humidity:49%
    Keywords:bambi
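
For readers who want to inspect such a file programmatically, the following is a minimal parsing sketch based only on the layout of the example above. It is not MARTi's parser. It assumes that a block starts at one of the four block keywords and ends at the next blank line, that keys and values are separated by the first colon, and that bare keywords such as ConvertFastQ or UseToClassify act as flags. The `_type` key and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

BLOCK_KEYWORDS = {"BlastProcess", "CentrifugeProcess", "Kraken2Process", "Metadata"}


def parse_config(text: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Return (global settings, list of blocks) from MARTi-style config text."""
    settings: Dict[str, str] = {}
    blocks: List[Dict[str, str]] = []
    current = settings
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = settings  # assumption: a blank line closes the current block
            continue
        if line in BLOCK_KEYWORDS:
            current = {"_type": line}
            blocks.append(current)
            continue
        # Split on the first colon only, so values such as "11:41" stay intact.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()  # bare flags get an empty value
    return settings, blocks


example = """\
SampleName:BAMBI_1D_19092017_MARTi
Scheduler:local
ConvertFastQ

BlastProcess
    Name:nt
    Program:megablast
    UseToClassify

Metadata
    Time: 11:41
"""

settings, blocks = parse_config(example)
print(settings["SampleName"])             # BAMBI_1D_19092017_MARTi
print([block["_type"] for block in blocks])  # ['BlastProcess', 'Metadata']
```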

Different classification processes can be performed in the same MARTi process (but only one classification process can have the “UseToClassify” field). The example below shows a config file that classifies reads using Kraken2, and searches for AMR hits using BLAST and the CARD database. Note that if a BLAST/CARD process is used, a walkout analysis giving the putative host taxa for AMR genes is only performed if a BLAST process is used to classify the reads.

SampleName:BAMBI_1D_19092017_MARTi
RawDataDir:/Users/leggettr/Documents/Datasets/BAMBI_1D_19092017_MARTi
SampleDir:/Users/leggettr/Documents/Projects/MARTiTest/BAMBI_1D_19092017_MARTi
ProcessBarcodes:
BarcodeId1:SampleNameHere

Scheduler:local
LocalSchedulerMaxJobs:4

InactivityTimeout:10
StopProcessingAfter:50000000

TaxonomyDir:/Users/leggettr/Documents/Databases/taxonomy_6Jul20
LCAMaxHits:20
LCAScorePercent:90
LCAMinIdentity:60
LCAMinQueryCoverage:0
LCAMinCombinedScore:0
LCAMinLength:50

ConvertFastQ

ReadsPerBlast:8000

ReadFilterMinQ:9
ReadFilterMinLength:500

Kraken2Process
    Name:refseq_16
    Database:/Users/leggettr/Documents/Databases/kraken2/k2_standard_16gb_20231009/
    Kraken2Threads:4
    UseToClassify

BlastProcess
    Name:card
    Program:blastn
    Database:/Users/leggettr/Documents/Databases/card/nucleotide_fasta_protein_homolog_model.fasta
    MaxE:0.001
    MaxTargetSeqs:100
    BlastThreads:1
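
The two constraints described before this example (a single UseToClassify process, and walkout only when that process is a BLAST process) can be checked with a small sketch like the one below. The list-of-dictionaries representation, the `_type` key and the function names are assumptions for illustration; whether a Diamond process counts as a BLAST process for walkout purposes is not stated here, so the sketch does not try to distinguish it.

```python
from typing import Dict, List, Optional


def classifying_process(blocks: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the single block flagged UseToClassify, or None if there is none."""
    flagged = [block for block in blocks if "UseToClassify" in block]
    if len(flagged) > 1:
        names = ", ".join(block.get("Name", "?") for block in flagged)
        raise ValueError(f"UseToClassify set on more than one process: {names}")
    return flagged[0] if flagged else None


def walkout_possible(blocks: List[Dict[str, str]]) -> bool:
    """Walkout needs reads to be classified by a BLAST process."""
    classifier = classifying_process(blocks)
    return classifier is not None and classifier["_type"] == "BlastProcess"


# The Kraken2 + CARD configuration shown above, as hypothetical parsed blocks.
kraken_config = [
    {"_type": "Kraken2Process", "Name": "refseq_16", "UseToClassify": ""},
    {"_type": "BlastProcess", "Name": "card", "Program": "blastn"},
]
print(classifying_process(kraken_config)["Name"])  # refseq_16
print(walkout_possible(kraken_config))             # False
```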

To classify using Diamond and a compatible database, use a BlastProcess with the Program field set to diamond. For example:

SampleName:BAMBI_1D_19092017_MARTi
RawDataDir:/Users/leggettr/Documents/Datasets/BAMBI_1D_19092017_MARTi
SampleDir:/Users/leggettr/Documents/Projects/MARTiTest/BAMBI_1D_19092017_MARTi
ProcessBarcodes:
BarcodeId1:SampleNameHere

Scheduler:local
LocalSchedulerMaxJobs:4

InactivityTimeout:10
StopProcessingAfter:50000000

TaxonomyDir:/Users/leggettr/Documents/Databases/taxonomy_6Jul20
LCAMaxHits:20
LCAScorePercent:90
LCAMinIdentity:60
LCAMinQueryCoverage:0
LCAMinCombinedScore:0
LCAMinLength:50

ConvertFastQ

ReadsPerBlast:8000

ReadFilterMinQ:9
ReadFilterMinLength:500

BlastProcess
    Name:diamond-nr
    Program:diamond
    Database:/Users/leggettr/Documents/Databases/diamond/nr.diamond-2.0.9
    MaxE:0.001
    MaxTargetSeqs:100
    BlastThreads:2
    options: --sensitive --range-culling

Processing Barcodes Example

The following example demonstrates how to configure MARTi to process multiple barcodes.

RunName:Sample_Name
RawDataDir:/path/to/data/reads
SampleDir:/path/to/marti_output/Sample_Name

ProcessBarcodes:01,02,03,04,05,06,07,08,09,10,11,12
BarcodeId1:Kessingland1
BarcodeId2:Kessingland2
BarcodeId3:CarltonMarshes1
BarcodeId4:CarltonMarshes2
BarcodeId5:ThetfordForest1
BarcodeId6:ThetfordForest2
BarcodeId7:CityCentre1
BarcodeId8:CityCentre2
BarcodeId9:Brancaster1
BarcodeId10:Brancaster2
BarcodeId11:FoxleyWood1
BarcodeId12:FoxleyWood2

Scheduler:local
MaxJobs:64
InactivityTimeout:10
StopProcessingAfter:0
TaxonomyDir:/path/to/databases/taxonomy/taxdump_2024_03_09
ReadFilterMinQ:8
ReadFilterMinLength:150
ConvertFastQ
ReadsPerBlast:10000

BlastProcess
    Name:nt
    Program:megablast
    Database:/path/to/databases/blast/ncbi/nt_20240305/nt
    NegativeTaxaFilter:/path/to/results/marti/exclude/other_sequences_taxids.txt
    MaxE:0.001
    MaxTargetSeqs:25
    UseToClassify

LCAMaxHits:100
LCAScorePercent:90.0
LCAMinIdentity:75
LCAMinQueryCoverage:0
LCAMinCombinedScore:0
LCAMinLength:150

The ProcessBarcodes line specifies which barcodes MARTi should analyse during the run. The lines following ProcessBarcodes (e.g., BarcodeId1:Kessingland1) are used to assign custom names to each barcode. If these lines are omitted, MARTi will assign default names using the run name followed by the barcode number (e.g., Sample_Name_bc01).
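
The naming rule can be sketched as follows. The two-digit zero padding is inferred from the single example name Sample_Name_bc01, and the function name and argument layout are hypothetical.

```python
from typing import Dict


def barcode_sample_names(run_name: str, process_barcodes: str,
                         barcode_ids: Dict[int, str]) -> Dict[str, str]:
    """Map each barcode in ProcessBarcodes to a custom or default sample name."""
    names: Dict[str, str] = {}
    for token in process_barcodes.split(","):
        token = token.strip()
        if not token:
            continue
        number = int(token)
        # Assumption: default names are zero-padded to two digits.
        default = f"{run_name}_bc{number:02d}"
        names[token] = barcode_ids.get(number, default)
    return names


# Only barcode 1 has a BarcodeId line; the others fall back to defaults.
print(barcode_sample_names("Sample_Name", "01,02,03", {1: "Kessingland1"}))
# {'01': 'Kessingland1', '02': 'Sample_Name_bc02', '03': 'Sample_Name_bc03'}
```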

Users can also rename barcodes after running MARTi. This can be done through the GUI or by creating an ids.json file in the MARTi output directory. For this example, the file would be placed at /path/to/marti_output/Sample_Name/ids.json.

Here is an example of an ids.json file to rename two samples after the analysis has been completed:

{
    "Kessingland1": "Kessingland1_Autumn24",
    "Kessingland2": "Kessingland2_Autumn24"
}
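
If the renames are produced by a script, a file of this shape can be written with Python's standard json module. This is only a convenience sketch: the helper name is hypothetical, and it assumes, following the example above, that keys are the current sample names and values are the new ones. A temporary directory stands in for the MARTi output directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def write_ids_json(output_dir: str, renames: Dict[str, str]) -> Path:
    """Write a current-name to new-name mapping as ids.json in output_dir."""
    path = Path(output_dir) / "ids.json"
    path.write_text(json.dumps(renames, indent=4) + "\n", encoding="utf-8")
    return path


renames = {
    "Kessingland1": "Kessingland1_Autumn24",
    "Kessingland2": "Kessingland2_Autumn24",
}

# A temporary directory is used here in place of /path/to/marti_output/Sample_Name.
with tempfile.TemporaryDirectory() as output_dir:
    ids_path = write_ids_json(output_dir, renames)
    written = json.loads(ids_path.read_text(encoding="utf-8"))
    print(ids_path.name, written == renames)  # ids.json True
```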