readme updates
mwlon committed Oct 29, 2023
1 parent 4ce1182 commit bbac876
Showing 5 changed files with 15 additions and 26 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -16,7 +16,6 @@ with high compression ratio and moderately fast speed.
* lossless; preserves ordering and exact bit representation
* nth-order delta encoding
* compresses faster or slower depending on compression level from 0 to 12
* fully streaming decompression
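
To illustrate the "nth-order delta encoding" feature listed above, here is a minimal sketch of the general technique. The function names `delta_encode` and `delta_decode` are hypothetical and are not part of pco's API; pco's actual implementation may differ. Each pass stores the first value and replaces the sequence with its consecutive differences, so smooth data collapses to small, repetitive deltas.

```rust
/// Applies first-order delta encoding `order` times (hypothetical helper,
/// not pco's API). Returns the first value saved by each pass, plus the
/// remaining deltas.
fn delta_encode(nums: &[i64], order: usize) -> (Vec<i64>, Vec<i64>) {
    let mut firsts = Vec::with_capacity(order);
    let mut deltas = nums.to_vec();
    for _ in 0..order {
        if deltas.is_empty() {
            break; // nothing left to difference
        }
        firsts.push(deltas[0]);
        // Wrapping arithmetic keeps the transform lossless on overflow.
        deltas = deltas
            .windows(2)
            .map(|w| w[1].wrapping_sub(w[0]))
            .collect();
    }
    (firsts, deltas)
}

/// Inverts `delta_encode` by taking cumulative sums, starting from the
/// first value saved by the last pass and working backwards.
fn delta_decode(firsts: &[i64], deltas: &[i64]) -> Vec<i64> {
    let mut nums = deltas.to_vec();
    for &first in firsts.iter().rev() {
        let mut out = Vec::with_capacity(nums.len() + 1);
        let mut acc = first;
        out.push(acc);
        for &d in &nums {
            acc = acc.wrapping_add(d);
            out.push(acc);
        }
        nums = out;
    }
    nums
}
```

For example, second-order encoding of the squares `[1, 4, 9, 16, 25]` saves the first values `[1, 3]` and leaves the constant deltas `[2, 2, 2]`, which are much cheaper to compress than the original numbers.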

**Data types:**
`u32`, `u64`, `i32`, `i64`, `f32`, `f64`
@@ -56,25 +55,26 @@ multiple chunks per file.
| page | interleaving w/ wrapping format | \>1k numbers |
| batch | decompression | 256 numbers (fixed) |

The standalone format is essentially a minimal implementation of a wrapped format.
It supports batched decompression and seeking, but not nullability, multiple
columns, random access, or other niceties.
The standalone format is a minimal implementation of a wrapped format.
It supports batched decompression only; no nullability, multiple
columns, random access, seeking, or other niceties.
It is mainly useful for quick proofs of concept (sometimes by the CLI).

<img alt="pco compression and decompression steps" title="compression and decompression steps" src="./images/processing.svg" />

## Etymology

The names pcodec and pco were chosen for these reasons:
* "Pico" suggests that it makes very small things.
* Pco is reminiscent of qco, its preceding format.
* Pco is reminiscent of qco, its predecessor.
* Pco is reminiscent of PancakeDB (Pancake COmpressed). Though PancakeDB is now
history, it had a good name.
* Pcodec is short, provides some semantic meaning, and should be easy to
search for.

The names are used for these purposes:
* pco => the library and data format
* pco_cli => the binary crate name
* pco\_cli => the binary crate name
* pcodec => the binary CLI and the repo

## Extra
17 changes: 3 additions & 14 deletions pco/README.md
@@ -3,7 +3,7 @@
**⚠️ Both the API and the data format are unstable for the 0.0.0-alpha.\*
releases. Do not depend on pco for long-term storage yet. ⚠️**

## Usage as a Standalone Format
## Quick Start

```rust
use pco::standalone::{auto_compress, auto_decompress};
@@ -32,7 +32,7 @@ To run something right away, try
[the benchmarks](../bench/README.md).

For a lower-level standalone API that allows writing one chunk at a time /
streaming reads, see [the docs.rs documentation](https://docs.rs/pco/latest/pco/).
batched reads, see [the docs.rs documentation](https://docs.rs/pco/latest/pco/).

## Usage as a Wrapped Format

@@ -58,15 +58,4 @@ implementations are insufficient)
`pco::data_types::UnsignedLike` and
`pco::data_types::FloatLike`.

### Seeking and Statistics

Each chunk has a metadata section containing
* the total count of numbers in the chunk,
* the bins for the chunk and relative frequency of each bin,
* and the size in bytes of the compressed body.

Using the compressed body size, it is easy to seek through the whole file
and collect a list of all the chunk metadatas.
One can aggregate them to obtain the total count of numbers in the whole file
and even an approximate histogram.
This is typically about 100x faster than decompressing all the numbers.
The maximum legal precision of a custom data type is currently 128 bits.
4 changes: 2 additions & 2 deletions pco/src/chunk_config.rs
@@ -113,8 +113,7 @@ impl ChunkConfig {
}
}

/// `PagingSpec` specifies how a chunk is split into pages
/// (default: equal pages up to 1,000,000 numbers each).
/// `PagingSpec` specifies how a chunk is split into pages.
#[derive(Clone, Debug)]
#[non_exhaustive]
pub enum PagingSpec {
@@ -130,6 +129,7 @@ pub enum PagingSpec {
ExactPageSizes(Vec<usize>),
}

/// Default: equal pages up to 1,000,000 numbers each.
impl Default for PagingSpec {
fn default() -> Self {
Self::EqualPagesUpTo(DEFAULT_MAX_PAGE_SIZE)
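
As an illustration of the documented default ("equal pages up to 1,000,000 numbers each"), the following sketch shows one plausible way to split a chunk of `n` numbers. It assumes that "equal pages up to" means using the fewest pages that respect the maximum, with sizes as even as possible; that interpretation and the function name `equal_page_sizes` are assumptions, not taken from pco's source.

```rust
/// Splits `n` numbers into the fewest pages of at most `max_page_size`
/// numbers each, keeping page sizes within 1 of each other.
/// Hypothetical helper illustrating an assumed reading of
/// `PagingSpec::EqualPagesUpTo`; not pco's actual code.
fn equal_page_sizes(n: usize, max_page_size: usize) -> Vec<usize> {
    assert!(max_page_size > 0, "max_page_size must be positive");
    if n == 0 {
        return Vec::new();
    }
    // Ceiling division: the fewest pages that can hold n numbers.
    let n_pages = (n + max_page_size - 1) / max_page_size;
    let base = n / n_pages;
    let remainder = n % n_pages;
    // The first `remainder` pages take one extra number each.
    (0..n_pages)
        .map(|i| if i < remainder { base + 1 } else { base })
        .collect()
}
```

Under this assumption, a chunk of 2,500,000 numbers with the default maximum would be split into three pages of 833,334, 833,333, and 833,333 numbers rather than two full pages and one half-sized page.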
4 changes: 2 additions & 2 deletions pco/src/data_types/mod.rs
@@ -125,9 +125,9 @@ pub trait UnsignedLike:
/// wouldn't preserve ordering and would cause pco to fail. In this example,
/// one needs to flip the sign bit and, if negative, the rest of the bits.
pub trait NumberLike: Copy + Debug + Display + Default + PartialEq + 'static {
/// A number from 0-255 that corresponds to the number's data type.
/// A number from 1-255 that corresponds to the number's data type.
///
/// Each `NumberLike` implementation should have a different `HEADER_BYTE`.
/// Each `NumberLike` implementation should have a different `DTYPE_BYTE`.
/// This byte gets written into the file's header during compression, and
/// if the wrong header byte shows up during decompression, the decompressor
/// will return an error.
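
The doc comment in the hunk above describes an order-preserving mapping from floats to unsigned integers: flip the sign bit and, if the number is negative, the rest of the bits too. A minimal sketch of that rule for `f64` follows. The function names are hypothetical, and this is not presented as pco's implementation of `NumberLike`.

```rust
const SIGN_BIT: u64 = 1u64 << 63;

/// Maps an f64 to a u64 such that the unsigned ordering matches the float
/// ordering (hypothetical helper, not pco's API).
fn f64_to_ordered_u64(x: f64) -> u64 {
    let bits = x.to_bits();
    if bits & SIGN_BIT != 0 {
        // Negative: flip the sign bit and all the other bits.
        !bits
    } else {
        // Non-negative: flip only the sign bit.
        bits ^ SIGN_BIT
    }
}

/// Inverse of `f64_to_ordered_u64`; restores the exact bit representation.
fn ordered_u64_to_f64(u: u64) -> f64 {
    let bits = if u & SIGN_BIT != 0 {
        // Top bit set means the original was non-negative.
        u ^ SIGN_BIT
    } else {
        // Top bit clear means the original was negative.
        !u
    };
    f64::from_bits(bits)
}
```

Because both directions operate on raw bits, the round trip is exact even for `-0.0` and infinities, which is what a lossless format needs. Simply reinterpreting the float's bits as unsigned would sort negative numbers above positive ones and in reverse order.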
4 changes: 2 additions & 2 deletions pco_cli/README.md
@@ -2,7 +2,7 @@

## Setup

You can compress, decompress, and inspect .pco files using our simple CLI.
You can compress, decompress, and inspect standalone .pco files using our simple CLI.
Follow this setup:

1. Install Rust: https://www.rust-lang.org/tools/install
@@ -52,7 +52,7 @@ This command prints numbers in a .pco file to stdout.
Examples:

```shell
pcodec decompress --limit 10 in.pco
pcodec decompress --limit 256 in.pco
```

### Inspect