🉑char parser

aliases:

🧩char parsers

A char parser is a #🧩parser for parsing individual chars.

Every char parser has a char set, which is like a list of all the inputs that kind of parser recognizes as being characters.
Every char parser also has a 🉑char class specifier, which is a subset of the char set that a specific parser instance is configured to accept.

Char parsers always read the current position in the input.

If they don’t recognize that part of the input as a valid character, they will panic with 😬UnrecognizedByte.
If the character is part of the char set but not part of the 🉑char class specifier, they will ⛔nope with ⛔UnexpectedChar.
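
To make the three outcomes concrete, here is a minimal sketch of an ASCII-only reader. It is not the library's implementation: readAsciiChar, the result objects, and the error class are hypothetical stand-ins, and end of input is left out on purpose.

```javascript
// Hypothetical model of the three outcomes; none of these names are the library's API.
class UnrecognizedByte extends Error {}

function readAsciiChar(input, position, accepts) {
    if (position < 0 || position >= input.length) {
        // End of input is not modelled in this sketch.
        throw new RangeError("position out of bounds")
    }
    const code = input.charCodeAt(position)
    if (code > 0x7f) {
        // Outside the char set entirely: a hard failure (the "panic" case).
        throw new UnrecognizedByte(`not an ASCII char at position ${position}`)
    }
    const char = input[position]
    if (!accepts(char)) {
        // In the char set, but not in the char class: a soft failure (the "nope" case).
        return { ok: false, reason: "UnexpectedChar", position }
    }
    // In the char class: succeed and advance past the char.
    return { ok: true, value: char, position: position + 1 }
}

const isDigit = c => c >= "0" && c <= "9"
const hit = readAsciiChar("7a", 0, isDigit)   // accepted: "7" is a digit
const miss = readAsciiChar("7a", 1, isDigit)  // soft failure: "a" is ASCII but not a digit
```

Calling readAsciiChar("é", 0, isDigit) would throw instead of returning, because "é" is not in the ASCII char set at all.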

It sounds simple, but let’s dig deeper.

building

To build a 🉑char parser instance, you take a char parser constructor like 🉑ascii or 🉑unicode.

You give it an initial 🉑char class specifier, getting a parser for a specific char class. You can then combine multiple char classes using 🉑char class operators.

import {ascii}
const letter = ascii("letter") // parses letters
const digit = ascii("digit") // parses digits
const letterDigit = letter.or(digit) // parses either

What’s a char?

Char parsers define what char means.

A char parser can produce JavaScript strings of length 2 or higher as a char because these strings count as chars according to its definition.

For instance, as far as the 🉑ascii parser is concerned, the only chars are ASCII characters, but with the DOS newline sequence tacked on. That is, given the string:

abc\n\n\n\r\n\r

It will be parsed as the chars:

a b c \n \n \n \r\n \r

As you can see from the ASCII parser, char parsers will always parse the longest (in terms of JavaScript length) character they accept.
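
The longest-match rule can be sketched in a few lines. splitAsciiChars is a hypothetical helper written for this page, not something the library exports, and it assumes the input is already pure ASCII.

```javascript
// Hypothetical helper, not part of the library: splits ASCII input into chars,
// treating the DOS newline "\r\n" as a single char.
function splitAsciiChars(input) {
    const chars = []
    let i = 0
    while (i < input.length) {
        // Longest match wins: prefer the two-unit "\r\n" over a lone "\r".
        const length = input.startsWith("\r\n", i) ? 2 : 1
        chars.push(input.slice(i, i + length))
        i += length
    }
    return chars
}

const chars = splitAsciiChars("abc\n\n\n\r\n\r")
// chars holds 8 entries: a, b, c, \n, \n, \n, \r\n, \r
```

The trailing "\r" stays a char of its own because no "\n" follows it.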

For the 🉑unicode parser, a single char is any Unicode codepoint, which can be a string of length $2$ due to how JavaScript implements the standard. The DOS newline sequence is included there as well.
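
Here is what that looks like in plain JavaScript, using only built-in string methods. Nothing in this snippet comes from the library.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so JavaScript stores it
// as a surrogate pair of two UTF-16 code units.
const smiley = "\u{1F600}"

const units = smiley.length              // 2 code units
const codePoint = smiley.codePointAt(0)  // 0x1f600, read as a single code point
const byCodePoint = Array.from(smiley)   // iterates by code point: one element
```

So a parser whose char is a codepoint has to hand back a string of length 2 for this input.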

Custom char parsers could produce even longer strings as chars. Examples include Hangul syllabic blocks, complex emoji like 🧑‍🧑‍👧🏽, and combining diacritics.
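
As a rough illustration of such longer chars, the standard Intl.Segmenter API splits a string into grapheme clusters. This is only one possible definition of a char, and it is an assumption of this sketch that a custom char parser would follow it.

```javascript
// Built-in API, not part of the library: splits text into grapheme clusters.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
const graphemes = text => Array.from(segmenter.segment(text), part => part.segment)

const accented = "e\u0301"          // "e" followed by a combining acute accent
const hangul = "\u1112\u1161\u11AB" // three conjoining jamo forming one syllable block

const accentedChars = graphemes(accented) // one cluster, JavaScript length 2
const hangulChars = graphemes(hangul)     // one cluster, JavaScript length 3
```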

Char parsers are different

Char parsers are constructed using a similar functional API as the rest of the library, but this similarity is just superficial. Char parsers actually behave very differently from other parsers, down to their underlying structure and implementation.

This can be shown using a simple example.

For normal parsers, the more combinators you apply, the slower the resulting parser will perform. This is directly related to the complexity of the resulting parse tree.

const A = int.pipe(
    or(hex),
    must("must not be greater than 100", x => x < 100),
    pre("the number: "),
)

const B = int

Char parsers support built-in 🛠️tuners that are similar to combinators, but since these tuners are not combinators, they don’t result in a parse tree.

Moreover, even char parsers for parsing elaborate character classes will perform as well as something like letter or digit!

const A = letter

const B = letter.or(digit).not(upper).invert()

// They will both perform the same

How’s that possible?

A space-time trade off.
Doing more work during parser construction to do less work during execution.
Fancy, space-efficient data structures.

However, it all boils down to a single bigass table.

You can think of a char parser as a giant table of characters. The characters in the table are the parser’s entire char set.

Each entry has a $\mathbf{1}$ or $0$ attached to it. Like this:

Char	Yes?
a	$0$
b	$\mathbf{1}$
c	$\mathbf{1}$
d	$0$
e	$0$
…
1	$\mathbf{1}$
2	$0$

The char class of a parser is the set of rows that are marked with $\mathbf{1}$. Now, it’s pretty obvious that no matter how crazy the table gets, checking whether a character is allowed or not is just a single lookup.

This table is the secret to char parsers. While constructing it might be somewhat expensive, once it’s there, it will perform as fast as JavaScript possibly can.

The table is an abstraction that’s very close to the implementation. The 🉑ascii parser implements it with a single contiguous block of memory — a Uint8Array.
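
The idea can be sketched with a 128-entry Uint8Array. This is not the library's code: AsciiClass and fromPredicate are hypothetical names, not(x) is assumed to mean "except x", and the "\r\n" char is ignored for brevity.

```javascript
// Hypothetical sketch; class and method names are illustrative, not the library's.
class AsciiClass {
    constructor(table) {
        this.table = table // Uint8Array(128): 1 = accepted, 0 = rejected
    }
    static fromPredicate(accepts) {
        const table = new Uint8Array(128)
        for (let code = 0; code < 128; code++) {
            table[code] = accepts(String.fromCharCode(code)) ? 1 : 0
        }
        return new AsciiClass(table)
    }
    // Each operator does its work up front and returns a new flat table.
    or(other) {
        return new AsciiClass(this.table.map((bit, i) => bit | other.table[i]))
    }
    not(other) {
        return new AsciiClass(this.table.map((bit, i) => bit & (other.table[i] ^ 1)))
    }
    invert() {
        return new AsciiClass(this.table.map(bit => bit ^ 1))
    }
    // Matching is one array read, however the class was built.
    accepts(char) {
        return this.table[char.charCodeAt(0)] === 1
    }
}

const letter = AsciiClass.fromPredicate(c => /[A-Za-z]/.test(c))
const digit = AsciiClass.fromPredicate(c => /[0-9]/.test(c))
const upper = AsciiClass.fromPredicate(c => /[A-Z]/.test(c))

// Accepts everything except lowercase letters and digits.
const fancy = letter.or(digit).not(upper).invert()
```

Both letter.accepts("a") and fancy.accepts("A") cost the same single lookup, because every operator has already been folded into the table.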

The 🉑unicode parser, which has to deal with 300,000 characters allocated in a potential space of over a million, is the one that needs to use the fancy data structure. But conceptually and functionally it’s basically the same.
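
The source doesn't say which structure the 🉑unicode parser uses. One common option for sparse code point sets is a two-stage table, sketched below under that assumption; buildTwoStageTable and accepts are hypothetical names.

```javascript
// Hypothetical sketch of one possible layout (a two-stage table).
// This is an assumption, not a description of the library's internals.
const PAGE_BITS = 8
const PAGE_SIZE = 1 << PAGE_BITS          // 256 code points per page
const PAGE_COUNT = 0x110000 >> PAGE_BITS  // 4352 pages cover all of Unicode

function buildTwoStageTable(ranges) {
    const pages = [new Uint8Array(PAGE_SIZE)]      // page 0 is shared by every empty region
    const pageIndex = new Uint16Array(PAGE_COUNT)  // all entries start at 0, the empty page
    for (const [first, last] of ranges) {
        for (let cp = first; cp <= last; cp++) {
            const slot = cp >> PAGE_BITS
            if (pageIndex[slot] === 0) {
                // First accepted code point in this region: allocate a real page.
                pageIndex[slot] = pages.length
                pages.push(new Uint8Array(PAGE_SIZE))
            }
            pages[pageIndex[slot]][cp & (PAGE_SIZE - 1)] = 1
        }
    }
    return { pageIndex, pages }
}

// Still constant time: two array reads instead of one.
function accepts(table, codePoint) {
    const page = table.pages[table.pageIndex[codePoint >> PAGE_BITS]]
    return page[codePoint & (PAGE_SIZE - 1)] === 1
}

// Greek and Coptic (U+0370 to U+03FF) plus Emoticons (U+1F600 to U+1F64F).
const table = buildTwoStageTable([[0x0370, 0x03ff], [0x1f600, 0x1f64f]])
const alphaAccepted = accepts(table, 0x03b1)  // true: Greek small letter alpha
const latinAccepted = accepts(table, 0x0041)  // false: "A" is in neither range
```

Only regions that contain accepted code points get their own page, so this table needs three pages rather than 4352.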
