SYNOPSIS
metacache build+query -targets <sequence file/directory>... [OPTION]...
metacache build+query [OPTION]... -targets <sequence file/directory>...
metacache build+query -targets <sequence file/directory>... -query <sequence file/directory>... [OPTION]...
metacache build+query -targets <sequence file/directory>... [OPTION]... -query <sequence file/directory>...
metacache build+query [OPTION]... -targets <sequence file/directory>... -query <sequence file/directory>...
DESCRIPTION
Create a new database of reference sequences (usually genomic sequences) and use it to map (other) sequences to their most likely taxon of origin.
This mode is mainly recommended for use with the GPU version.
REQUIRED PARAMETERS
<sequence file/directory>...
FASTA or FASTQ files containing genomic sequences
(complete genomes, scaffolds, contigs, ...) that shall
be used as representatives of an organism/taxon.
If directory names are given, they will be searched for
sequence files (at most 10 levels deep).
BASIC OPTIONS
-taxonomy <path> directory with taxonomic hierarchy data (see NCBI's
taxonomic data files)
-taxpostmap <file>
Files with sequence to taxon id mappings that are used as
alternative source in a post processing step.
default: 'nucl_(gb|wgs|est|gss).accession2taxid'
-sequence-id-format (smart|ncbi|gi|filename|leadingword)
Method used for extracting sequence IDs from filenames and
sequence headers. Sequence IDs are also used to assign taxa
to reference sequences.
Available types are:
smart : try NCBI > genbank > filename
ncbi : NCBI-style accession/accession.version
gi : genbank identifier
filename : filename without extension
leadingword : first stretch of non-whitespace characters
default: smart
-silent|-verbose information level during build:
silent => none / verbose => most detailed
default: neither => only errors/important info
SKETCHING (SUBSAMPLING)
-kmerlen <k> number of nucleotides/characters in a k-mer
default: 16
-sketchlen <s> number of features (k-mer hashes) per sampling window
default: 16
-winlen <w> number of letters in each sampling window
default: 127
-winstride <l> distance between window starting positions
default: 112 (w-k+1)
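The default stride follows directly from the window and k-mer lengths. The sketch below only illustrates that arithmetic; the function name `window_starts` and the handling of the sequence tail are assumptions made for this example, not MetaCache's actual implementation.

```python
def window_starts(seq_len, winlen=127, kmerlen=16, winstride=None):
    """Return start offsets of sampling windows over a sequence.

    Assumption (not taken from the help text): windows start at offset 0,
    advance by `winstride`, and a shorter final window is kept as long as
    it still contains at least one complete k-mer.
    """
    if winstride is None:
        winstride = winlen - kmerlen + 1  # documented default: w-k+1
    if seq_len < kmerlen:
        return []
    last_start = seq_len - kmerlen  # last offset where a k-mer still fits
    return list(range(0, last_start + 1, winstride))


default_stride = 127 - 16 + 1      # 112 with the documented defaults
overlap = 127 - default_stride     # 15 letters, i.e. k-1
starts = window_starts(1000)       # begins 0, 112, 224, ...
```

With a stride of w-k+1, consecutive windows overlap by k-1 letters, so every k-mer of the sequence lies completely inside at least one window. A larger stride would skip some k-mers; a smaller one would sample some of them twice.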
ADVANCED OPTIONS
-reset-taxa Attempts to re-rank all sequences after the main build
phase using '.accession2taxid' files. This will reset the
taxon id of a reference sequence even if a taxon id could
be obtained from other sources during the build phase.
default: off
-max-locations-per-feature <#>
maximum number of reference sequence locations to be
stored per feature;
If the value is too high it will significantly impact
querying speed. Note that an upper hard limit is always
imposed by the data type used for the hash table bucket
size (set with compilation macro
'-DMC_LOCATION_LIST_SIZE_TYPE').
default: 254
-remove-overpopulated-features
Removes all features that have reached the maximum allowed
amount of locations per feature. This can improve querying
speed and can be used to remove non-discriminative
features.
default: off
Not available in the GPU version.
-remove-ambig-features <rank>
Removes all features that have more distinct reference
sequences on the given taxonomic rank than set by
'-max-ambig-per-feature'. This can decrease the database
size significantly at the expense of sensitivity. Note
that the lower the given taxonomic rank is, the more
pronounced the effect will be.
Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain
default: off
Not available in the GPU version.
-max-ambig-per-feature <#>
Maximum number of allowed different reference sequence
taxa per feature if option '-remove-ambig-features' is
used.
Not available in the GPU version.
-max-load-fac <factor>
maximum hash table load factor;
This can be used to trade off larger memory consumption
for speed and vice versa. A lower load factor will improve
speed, a larger one will improve memory efficiency.
default: 0.800000
Not available in the GPU version.
-parts <#> Splits the database into multiple parts. Each part
contains a separate hash table.
default: 1
-save-db <database filename>
Save database to disk after querying.
QUERY PARAMETERS
<sequence file/directory>...
FASTA or FASTQ files containing genomic sequences (short
reads, long reads, contigs, complete genomes, ...) that
shall be classified.
* If directory names are given, they will be searched for
sequence files (at most 10 levels deep).
* If no input filenames or directories are given,
MetaCache will run in interactive query mode. This can be
used to load the database into memory only once and then
query it multiple times with different query options.
MAPPING RESULTS OUTPUT
-out <file> Redirect output to file <file>.
If not specified, output will be written to stdout. If
more than one input file was given all output will be
concatenated into one file.
-split-out <file> Generate output and statistics for each input file
separately. For each input file <in> an output file with
name <file>_<in> will be written.
PAIRED-END READ HANDLING
-pairfiles Interleave paired-end reads from two consecutive files, so
that the nth read from file m and the nth read from file
m+1 will be treated as a pair. If more than two files are
provided, their names will be sorted before processing.
Thus, the order defined by the filenames determines the
pairing, not the order in which they were given in the
command line.
-pairseq Two consecutive sequences (1+2, 3+4, ...) from each file
will be treated as paired-end reads.
-insertsize <#> Maximum insert size to consider.
default: sum of lengths of the individual reads
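The two pairing modes can be pictured with a small sketch. The helper names are hypothetical, plain lexicographic sorting is an assumption (the text above only says the names "will be sorted"), and rejecting an odd number of inputs is a choice made for this example.

```python
def pair_files(filenames):
    """Mimic '-pairfiles': pair consecutive files after sorting.

    Assumption: names are sorted lexicographically, and only when more
    than two files are given, as described for '-pairfiles'.
    """
    ordered = sorted(filenames) if len(filenames) > 2 else list(filenames)
    if len(ordered) % 2 != 0:
        raise ValueError("expected an even number of files")
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered), 2)]


def pair_sequences(reads):
    """Mimic '-pairseq': pair consecutive reads (1+2, 3+4, ...) of one file."""
    if len(reads) % 2 != 0:
        raise ValueError("expected an even number of reads")
    return [(reads[i], reads[i + 1]) for i in range(0, len(reads), 2)]


# Command-line order is b_2, a_1, b_1, a_2; pairing follows the sorted names.
file_pairs = pair_files(["b_2.fq", "a_1.fq", "b_1.fq", "a_2.fq"])
read_pairs = pair_sequences(["r1/1", "r1/2", "r2/1", "r2/2"])
```

Because pairing depends on the sorted names, mate files should be named so that each pair ends up adjacent after sorting (for example `sample_1.fq` and `sample_2.fq`).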
CLASSIFICATION
-lowest <rank> Do not classify on ranks below <rank>
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
default: sequence
-highest <rank> Do not classify on ranks above <rank>
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
default: domain
-hitmin <t> Sets classification threshold to <t>.
A read will not be classified if fewer than t features from
the database match. Higher values will increase precision
at the expense of sensitivity.
default: 0
-hitdiff <d> Sets candidate LCA threshold to <d> percent.
Determines whether only the candidate with the most hits is
used as the classification result or whether the taxa of other
candidates are considered as well.
All candidates (taxa) will be included that have at least
d% as many hits above the '-hitmin' threshold as the
candidate with the most hits.
default: 100
-maxcand <#> maximum number of reference taxon candidates to consider
for each query;
A large value can significantly decrease the querying
speed!
default: 2
-cov-percentile <p>
Remove the p-th percentile of hit reference sequences with
the lowest coverage. Classification is done using only the
remaining reference sequences. This can help to reduce
false positives, especially when your input data has a high
sequencing coverage.
This feature decreases the querying speed!
default: off
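How '-hitmin', '-hitdiff' and '-maxcand' interact can be sketched as follows. This is one possible reading of the descriptions above, not MetaCache's actual code: the function name and the input dictionary are hypothetical, and "hits above the '-hitmin' threshold" is interpreted as the hit count minus t.

```python
def select_candidates(hits, hitmin=0, hitdiff=100, maxcand=2):
    """Pick the candidate taxa that would feed into the classification.

    hits: hypothetical mapping {taxon name: number of matching features}.
    Assumed reading of the options:
      * only the `maxcand` taxa with the most hits are looked at,
      * no classification if the best taxon has fewer than `hitmin` hits,
      * a taxon is kept if its hits above `hitmin` reach `hitdiff` percent
        of the best taxon's hits above `hitmin`.
    """
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)[:maxcand]
    if not ranked or ranked[0][1] < hitmin:
        return []  # read stays unclassified
    cutoff = (ranked[0][1] - hitmin) * hitdiff / 100
    return [taxon for taxon, n in ranked if n - hitmin >= cutoff]


hits = {"A": 20, "B": 15, "C": 4}
only_best = select_candidates(hits)                                   # defaults
relaxed = select_candidates(hits, hitmin=4, hitdiff=50, maxcand=3)
unclassified = select_candidates(hits, hitmin=25)
```

Under this reading, the default of 100 percent keeps only the best candidate (plus exact ties), while lower values let more taxa contribute to the result. The sketch stops at selecting candidates and does not model how their taxa are then combined.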
GENERAL OUTPUT FORMATTING
-no-summary Don't show result summary & mapping statistics at the end
of the mapping output
default: off
-no-query-params Don't show query settings at the beginning of the mapping
output
default: off
-no-err Suppress all error messages.
default: off
CLASSIFICATION RESULT FORMATTING
-no-map Don't report classification for each individual query
sequence; show summaries only (useful for quick tests).
default: off
-mapped-only Don't list unclassified reads/read pairs.
default: off
-taxids Print taxon ids in addition to taxon names.
default: off
-taxids-only Print taxon ids instead of taxon names.
default: off
-omit-ranks Do not print taxon rank names.
default: off
-separate-cols Prints *all* mapping information (rank, taxon name, taxon
ids) in separate columns (see option '-separator').
default: off
-separator <text> Sets string that separates output columns.
default: '\t|\t'
-comment <text> Sets string that precedes comment (non-mapping) lines.
default: '# '
-queryids Show a unique id for each query.
Note that in paired-end mode a query is a pair of two read
sequences. This option will always be activated if option
'-hits-per-ref' is given.
default: off
-lineage Report complete lineage for per-read classification
starting with the lowest rank found/allowed and ending
with the highest rank allowed. See also options '-lowest'
and '-highest'.
default: off
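For downstream scripts, the defaults of '-separator' and '-comment' are enough to split mapping output into columns. The following is a minimal sketch; the function name and the sample lines are made up for illustration, and the number and meaning of the columns depend on the formatting options in use.

```python
def parse_mapping_output(lines, separator="\t|\t", comment="# "):
    """Split mapping lines into columns and skip comment lines.

    The defaults mirror the documented defaults of '-separator' and
    '-comment'. No assumption is made about what each column contains.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(comment):
            continue  # blank line or comment (non-mapping) line
        rows.append(line.split(separator))
    return rows


# Hypothetical sample lines, NOT real MetaCache output.
sample = [
    "# some comment line\n",
    "read1\t|\tspecies\t|\tEscherichia coli\n",
    "read2\t|\tgenus\t|\tSalmonella\n",
]
rows = parse_mapping_output(sample)
```

If '-separator' or '-comment' is changed on the command line, the same values need to be passed to the parser.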
ANALYSIS: ABUNDANCES
-abundances <file>
Show absolute and relative abundance of each taxon.
If a valid filename is given, the list will be written to
this file.
default: off
-abundance-per <rank>
Show absolute and relative abundances for each taxon on
one specific rank.
Classifications on higher ranks will be estimated by
distributing them down according to the relative
abundances of classifications on or below the given rank.
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
If '-abundances <file>' was given, this list will be
printed to the same file.
default: off
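The estimation step described for '-abundance-per' can be illustrated with a simplified, single-level example. It assumes that counts on or below the chosen rank are already summed per taxon and that each higher-rank count is split only among that taxon's children; all names and numbers are hypothetical, and MetaCache's actual procedure may differ.

```python
def distribute_down(counts_at_rank, counts_above, parent_of):
    """Estimate per-taxon abundances on one rank.

    counts_at_rank: {taxon: reads classified on (or below) the chosen rank}
    counts_above:   {higher-rank taxon: reads classified only that far}
    parent_of:      {taxon on the chosen rank: its higher-rank parent}
    Reads in `counts_above` are split among the parent's children in
    proportion to the children's own counts.
    """
    estimated = {taxon: float(n) for taxon, n in counts_at_rank.items()}
    for parent, extra in counts_above.items():
        children = [t for t in counts_at_rank if parent_of.get(t) == parent]
        total = sum(counts_at_rank[t] for t in children)
        if total == 0:
            continue  # nothing to distribute to; these reads stay unassigned
        for t in children:
            estimated[t] += extra * counts_at_rank[t] / total
    return estimated


species = {"E. coli": 30, "E. fergusonii": 10, "S. enterica": 60}
genus_only = {"Escherichia": 8}
parents = {"E. coli": "Escherichia", "E. fergusonii": "Escherichia",
           "S. enterica": "Salmonella"}
estimate = distribute_down(species, genus_only, parents)
relative = {t: n / sum(estimate.values()) for t, n in estimate.items()}
```

Here the 8 reads classified only as Escherichia are split 3:1 between the two Escherichia species, matching their 30:10 species-level counts.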
ANALYSIS: RAW DATABASE HITS
-tophits For each query, print top feature hits in database.
default: off
-allhits For each query, print all feature hits in database.
default: off
-locations Show locations in candidate reference sequences.
Activates option '-tophits'.
default: off
-hits-per-ref <file>
Shows a list of all hits for each reference sequence.
If this condensed list is all you need, you should
deactivate the per-read mapping output with '-no-map'.
If a valid filename is given after '-hits-per-ref', the
list will be written to a separate file.
Option '-queryids' will be activated and the lowest
classification rank will be set to 'sequence'.
default: off
ANALYSIS: ALIGNMENTS
-align Show semi-global alignment to best candidate reference
sequence.
Original files of reference sequences must be available.
This feature decreases the querying speed!
default: off
ADVANCED: GROUND TRUTH BASED EVALUATION
-ground-truth Report correct query taxa if known.
Queries need to have either a 'taxid|<number>' entry in
their header or a sequence id that is also present in the
database.
This feature decreases the querying speed!
default: off
-precision Report precision & sensitivity by comparing query taxa
(ground truth) and mapped taxa.
Queries need to have either a 'taxid|<number>' entry in
their header or a sequence id that is also found in the
database.
This feature decreases the querying speed!
default: off
-taxon-coverage Report true/false positives and true/false negatives. This
option turns on '-precision', so ground truth data needs
to be available.
This feature decreases the querying speed!
default: off
ADVANCED: PERFORMANCE TUNING / TESTING
-threads <#> Sets the maximum number of parallel threads to use.
default (on this machine): 8
-batch-size <#> Process <#> many queries (reads or read pairs) per thread
at once.
default (on this machine): 4096
-query-limit <#> Classify at most <#> queries (reads or read pairs) per
input file.
default: 9223372036854775807
EXAMPLES
Build database from sequence file 'genomes.fna' and query all sequences in 'myreads.fna':
metacache build+query -targets genomes.fna -query myreads.fna
Build database with latest complete genomes from the NCBI RefSeq and query interactively:
download-ncbi-genomes refseq/bacteria myfolder
download-ncbi-genomes refseq/viruses myfolder
download-ncbi-taxonomy myfolder
metacache build+query -targets myfolder -taxonomy myfolder
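When calls like these are driven from a script, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of MetaCache; it only uses options documented above and follows the synopsis form in which options sit between '-targets' and '-query'.

```python
import shlex


def build_query_command(targets, queries=(), taxonomy=None, out=None, extra=()):
    """Assemble an argument list for 'metacache build+query'.

    Assumes a 'metacache' executable is on PATH when the list is run.
    `extra` takes further documented options verbatim,
    e.g. ["-lowest", "species"].
    """
    if not targets:
        raise ValueError("at least one target file or directory is required")
    cmd = ["metacache", "build+query", "-targets", *targets]
    if taxonomy:
        cmd += ["-taxonomy", taxonomy]
    if out:
        cmd += ["-out", out]
    cmd += list(extra)
    if queries:
        cmd += ["-query", *queries]
    return cmd


cmd = build_query_command(["genomes.fna"], ["myreads.fna"])
as_text = shlex.join(cmd)  # shell-quoted string for logging
interactive = build_query_command(["myfolder"], taxonomy="myfolder")
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Omitting `queries` reproduces the second example above, which starts the interactive query mode.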