Commit 6d3372c

chore: Update README.md, check gc
1 parent 7f36f31 commit 6d3372c

3 files changed

+34
-25
lines changed

README.md

+17-15
@@ -1,5 +1,6 @@
11
[![Rust](https://img.shields.io/badge/built_with-Rust-dca282.svg)](https://www.rust-lang.org/)
22
[![License](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://github.com/St4NNi/jam-rs/blob/main/LICENSE)
3+
[![Crates.io](https://img.shields.io/crates/v/jam-rs.svg)](https://crates.io/crates/jam-rs)
34
[![Codecov](https://codecov.io/github/St4NNi/jam-rs/coverage.svg?branch=main)](https://codecov.io/gh/St4NNi/jam-rs)
45
[![Dependency status](https://deps.rs/repo/github/St4NNi/jam-rs/status.svg)](https://deps.rs/repo/github/St4NNi/jam-rs)
56
___
@@ -9,7 +10,7 @@ ___
910
Just another minhash (jam) implementation. A high performance minhash variant to screen extremely large (metagenomic) datasets in a very short timeframe.
1011
Implements parts of the ScaledMinHash / FracMinHash algorithm described in [sourmash](https://joss.theoj.org/papers/10.21105/joss.00027).
1112

12-
Unlike traditional implementations like [sourmash](https://joss.theoj.org/papers/10.21105/joss.00027) or [mash](https://doi.org/10.1186/s13059-016-0997-x) this version tries to specialise more on estimating the containment of small sequences in large sets. This is intended to be used to screen terabytes of data in just a few seconds / minutes.
13+
Unlike traditional implementations like [sourmash](https://joss.theoj.org/papers/10.21105/joss.00027) or [mash](https://doi.org/10.1186/s13059-016-0997-x) this version tries to focus on estimating the containment of small sequences in large sets by (optionally) introducing an intentional bias towards smaller sequences. This is intended to be used to screen terabytes of data in just a few seconds / minutes.
1314

1415
### Installation
1516

@@ -19,17 +20,17 @@ A pre-release is published via [crates.io](https://crates.io/) to install it use
1920
cargo install jam-rs
2021
```
2122

22-
If you want the bleeding edge development release you can install via git:
23+
If you want the bleeding edge development release you can install it via git:
2324

2425
```bash
2526
cargo install --git https://github.com/St4NNi/jam-rs
2627
```
2728

2829
### Comparison
2930

30-
- [xxhash3](https://github.com/DoumanAsh/xxhash-rust) or [ahash-fallback](https://github.com/tkaitchuck/aHash/wiki/AHash-fallback-algorithm) (for kmer < 32) instead of [murmurhash3](https://github.com/mhallin/murmurhash3-rs)
31+
- Multiple algorithms: [xxhash3](https://github.com/DoumanAsh/xxhash-rust), [ahash-fallback](https://github.com/tkaitchuck/aHash/wiki/AHash-fallback-algorithm) (for kmer < 32) and legacy [murmurhash3](https://github.com/mhallin/murmurhash3-rs)
3132
- No Jaccard similarity, since it is meaningless when comparing small embedded sequences against large sets
32-
- (coming soon) optimisations for specificity and sensitivity (and speed) specifically for search of small sequences in assembled metagenomes
33+
- Additional filter and sketching options to increase specificity and sensitivity for small sequences in collections of large assembled metagenomes
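
The point about Jaccard similarity versus containment in the list above can be illustrated with a minimal sketch. It is not taken from jam-rs: the function names `containment` and `jaccard` are hypothetical, and plain `HashSet<u64>` values stand in for sketches of k-mer hashes.

```rust
use std::collections::HashSet;

/// Fraction of the query's hashes that also occur in the reference.
/// (Hypothetical helper, not the jam-rs API.)
fn containment(query: &HashSet<u64>, reference: &HashSet<u64>) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    let shared = query.intersection(reference).count();
    shared as f64 / query.len() as f64
}

/// Jaccard similarity: shared hashes divided by the size of the union.
/// (Hypothetical helper, not the jam-rs API.)
fn jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    let shared = a.intersection(b).count();
    let union = a.len() + b.len() - shared;
    if union == 0 {
        0.0
    } else {
        shared as f64 / union as f64
    }
}
```

For a query of four hashes that is fully embedded in a reference of one thousand hashes, the containment is 1.0 while the Jaccard similarity is only 0.004, which is why the latter says little about small sequences in large sets.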
3334

3435
### Scaling methods
3536

@@ -44,12 +45,12 @@ If `KmerCountScaling` and `MinMaxAbsoluteScaling` are used together the minimum
4445

4546
```console
4647
$ jam
47-
Just another minhasher, obviously blazingly fast
48+
Just another (genomic) minhasher (jam), obviously blazingly fast
4849

4950
Usage: jam [OPTIONS] <COMMAND>
5051

5152
Commands:
52-
sketch Sketches one or more files and writes the result to an output file
53+
sketch Sketch one or more files and write result to output file (or stdout)
5354
merge Merge multiple input sketches into a single sketch
5455
dist Estimate distance of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size
5556
help Print this message or the help of the given subcommand(s)
@@ -67,7 +68,7 @@ The easiest way to sketch files is to use the `jam sketch` command. This accepts
6768

6869
```console
6970
$ jam sketch
70-
Sketch one or more files and write result to output file (or stdout)
71+
Sketch one or more files and write the result to an output file (or stdout)
7172

7273
Usage: jam sketch [OPTIONS] [INPUT]...
7374

@@ -76,13 +77,13 @@ Arguments:
7677

7778
Options:
7879
-o, --output <OUTPUT> Output file
79-
-k, --kmer-size <KMER_SIZE> kmer size all sketches to be compared must have the same size [default: 21]
80+
-k, --kmer-size <KMER_SIZE> kmer size, all sketches must have the same size to be compared [default: 21]
8081
--fscale <FSCALE> Scale the hash space to a minimum fraction of the maximum hash value (FracMinHash)
8182
--kscale <KSCALE> Scale the hash space to a minimum fraction of all k-mers (SizeMinHash)
8283
-t, --threads <THREADS> Number of threads to use [default: 1]
8384
-f, --force Overwrite output files
84-
--nmin <NMIN> Minimum number of k-mers (per record) to be hashed
85-
--nmax <NMAX> Maximum number of k-mers (per record) to be hashed
85+
--nmin <NMIN> Minimum number of k-mers (per record) to be hashed, bottom cut-off
86+
--nmax <NMAX> Maximum number of k-mers (per record) to be hashed, top cut-off
8687
--format <FORMAT> Change to other output formats [default: bin] [possible values: bin, sourmash]
8788
--algorithm <ALGORITHM> Change the hashing algorithm [default: default] [possible values: default, ahash, xxhash, murmur3]
8889
--singleton Create a separate sketch for each sequence record
@@ -95,9 +96,9 @@ Calculate the distance for one or more inputs vs. a large set of database sketch
9596

9697
```console
9798
$ jam dist
98-
Calculate distance of a (small) sketch against one or more sketches as database. Requires all sketches to have the same kmer size
99+
Estimate containment of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size
99100

100-
Usage: jam dist [OPTIONS] --input <INPUT> --database <DATABASE>
101+
Usage: jam dist [OPTIONS] --input <INPUT>
101102

102103
Options:
103104
-i, --input <INPUT> Input sketch or raw file
@@ -106,12 +107,13 @@ Options:
106107
-c, --cutoff <CUTOFF> Cut-off value for similarity [default: 0.0]
107108
-t, --threads <THREADS> Number of threads to use [default: 1]
108109
-f, --force Overwrite output files
110+
--stats Use the Stats params for restricting results
111+
--gc-lower <GC_LOWER> Use GC stats with a lower bound of x% (gc_lower and gc_upper must be set)
112+
--gc-upper <GC_UPPER> Use GC stats with an upper bound of y% (gc_lower and gc_upper must be set)
109113
-h, --help Print help
110114
```
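
As a rough illustration of what a GC range filter such as `--gc-lower` / `--gc-upper` could check, here is a small sketch. It assumes GC content is the percentage of `G` and `C` bases in a sequence and that both bounds are inclusive; how jam-rs actually computes and stores its GC stats is not shown in this commit, and `gc_percent` and `within_gc_bounds` are hypothetical names.

```rust
/// Percentage of G/C bases in a sequence (case-insensitive).
/// Returns None for an empty sequence. Hypothetical helper, not the jam-rs API.
fn gc_percent(seq: &[u8]) -> Option<f64> {
    if seq.is_empty() {
        return None;
    }
    let gc = seq
        .iter()
        .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
        .count();
    Some(100.0 * gc as f64 / seq.len() as f64)
}

/// True if the GC percentage lies within [lower, upper] (inclusive, by assumption).
fn within_gc_bounds(seq: &[u8], lower: u8, upper: u8) -> bool {
    match gc_percent(seq) {
        Some(p) => p >= f64::from(lower) && p <= f64::from(upper),
        None => false,
    }
}
```

With bounds of 40 and 60, a sequence such as `GGCCAATT` (50% GC) would pass, while `AAAA` (0% GC) would be filtered out.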
111115

112116

113-
114-
115117
#### Merge
116118

117119
Merge multiple sketches into one large one.
@@ -138,7 +140,7 @@ This project is licensed under the MIT license. See the [LICENSE](LICENSE) file
138140

139141
### Disclaimer
140142

141-
jam-rs is still in early active development and not ready for production use. Use at your own risk. Once a stable version is released additional information and installation guidelines will be added.
143+
jam-rs is still in active development and not ready for production use. Use at your own risk.
142144

143145
### Credits
144146

src/cli.rs

+9-9
@@ -7,8 +7,8 @@ use std::path::PathBuf;
77
#[command(bin_name = "jam")]
88
#[command(version = "0.1.0-beta.1")]
99
#[command(
10-
about = "Just another minhasher, obviously blazingly fast",
11-
long_about = "A heavily optimized minhash implementation that focuses less on accuracy and more on quick scans of large datasets."
10+
about = "Just another (genomic) minhasher (jam), obviously blazingly fast",
11+
long_about = "An optimized minhash implementation that focuses on quick scans for small sequences in large datasets."
1212
)]
1313
pub struct Cli {
1414
#[command(subcommand)]
@@ -38,7 +38,7 @@ pub enum HashAlgorithms {
3838

3939
#[derive(Debug, Subcommand, Clone)]
4040
pub enum Commands {
41-
/// Sketch one or more files and write result to output file (or stdout)
41+
/// Sketch one or more files and write the result to an output file (or stdout)
4242
#[command(arg_required_else_help = true)]
4343
Sketch {
4444
/// Input file(s), one directory or one file with list of files to be hashed
@@ -48,7 +48,7 @@ pub enum Commands {
4848
#[arg(short, long)]
4949
#[arg(value_parser = clap::value_parser!(std::path::PathBuf))]
5050
output: Option<PathBuf>,
51-
/// kmer size all sketches to be compared must have the same size
51+
/// kmer size, all sketches must have the same size to be compared
5252
#[arg(short = 'k', long = "kmer-size", default_value = "21")]
5353
kmer_size: u8,
5454
/// Scale the hash space to a minimum fraction of the maximum hash value (FracMinHash)
@@ -57,10 +57,10 @@ pub enum Commands {
5757
/// Scale the hash space to a minimum fraction of all k-mers (SizeMinHash)
5858
#[arg(long)]
5959
kscale: Option<u64>,
60-
/// Minimum number of k-mers (per record) to be hashed
60+
/// Minimum number of k-mers (per record) to be hashed, bottom cut-off
6161
#[arg(long)]
6262
nmin: Option<u64>,
63-
/// Maximum number of k-mers (per record) to be hashed
63+
/// Maximum number of k-mers (per record) to be hashed, top cut-off
6464
#[arg(long)]
6565
nmax: Option<u64>,
6666
/// Change to other output formats
@@ -84,7 +84,7 @@ pub enum Commands {
8484
#[arg(value_parser = clap::value_parser!(std::path::PathBuf))]
8585
output: PathBuf,
8686
},
87-
/// Estimate distance of a (small) sketch against a subset of one or more sketches as database.
87+
/// Estimate containment of a (small) sketch against a subset of one or more sketches as database.
8888
/// Requires all sketches to have the same kmer size
8989
#[command(arg_required_else_help = true)]
9090
Dist {
@@ -104,10 +104,10 @@ pub enum Commands {
104104
/// Use the Stats params for restricting results
105105
#[arg(long)]
106106
stats: bool,
107-
/// Use GC stats with an upper bound of x% and a lower bound of y%
107+
/// Use GC stats with a lower bound of x% (gc_lower and gc_upper must be set)
108108
#[arg(long)]
109109
gc_lower: Option<u8>,
110-
/// Use GC stats with an upper bound of x% and a lower bound of y%
110+
/// Use GC stats with an upper bound of y% (gc_lower and gc_upper must be set)
111111
#[arg(long)]
112112
gc_upper: Option<u8>,
113113
},

src/main.rs

+8-1
@@ -64,7 +64,14 @@ fn main() {
6464

6565
let gc_bounds = match (gc_lower, gc_upper) {
6666
(Some(l), Some(u)) => Some((l, u)),
67-
_ => None,
67+
(None, None) => None,
68+
_ => {
69+
cmd.error(
70+
ErrorKind::ArgumentConflict,
71+
"Both gc_lower and gc_upper must be set",
72+
)
73+
.exit();
74+
}
6875
};
6976

7077
let mut input_sketch = Vec::new();
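
The behavioural change in this hunk is that supplying only one of the two GC bounds is now an error instead of being silently ignored. The same three-way match can be sketched without clap as follows. This is an illustration only: the hypothetical `resolve_gc_bounds` returns a `Result`, whereas the commit itself reports the problem through `cmd.error(ErrorKind::ArgumentConflict, ...)` and exits.

```rust
/// Combine two optional GC bounds into an optional (lower, upper) pair.
/// Hypothetical stand-in for the match in `main`: an Err replaces the
/// clap error-and-exit path used in the actual commit.
fn resolve_gc_bounds(
    lower: Option<u8>,
    upper: Option<u8>,
) -> Result<Option<(u8, u8)>, String> {
    match (lower, upper) {
        // Both bounds given: filtering is enabled.
        (Some(l), Some(u)) => Ok(Some((l, u))),
        // Neither given: no GC filtering.
        (None, None) => Ok(None),
        // Exactly one given: reject instead of silently dropping it.
        _ => Err("Both gc_lower and gc_upper must be set".to_string()),
    }
}
```

Before this commit the catch-all arm mapped to `None`, so a lone `--gc-lower 30` would have run without any GC filtering and without a warning.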
