🉑char parser
aliases:
- 🧩char parsers
A char parser is a #🧩parser for parsing individual chars.
- Every char parser has a char set, which is like a list of all the inputs that kind of parser recognizes as being characters.
- Every char parser also has a 🉑char class specifier, which is a subset of the char set that a specific parser instance is configured to accept.
Char parsers always read the current position in the input.
- If they don’t recognize that part of the input as a valid character, they will panic with 😬UnrecognizedByte.
- If the character is part of the char set but not part of the 🉑char class specifier, they will ⛔nope with ⛔UnexpectedChar.
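The two failure modes above can be sketched as a standalone function. This is only an illustration, not the library's implementation: `parseChar`, `isDigit`, and the shape of the returned objects are hypothetical, the char set is assumed to be plain ASCII, a panic is modelled as a thrown error, and the position is assumed to be inside the input.

```javascript
// Hypothetical class specifier: this instance only accepts ASCII digits.
const isDigit = (code) => code >= 48 && code <= 57;

// Hypothetical model of a char parser reading the current position.
function parseChar(input, position, accepts) {
    const code = input.charCodeAt(position);
    if (!(code < 128)) {
        // Not in the char set at all: modelled here as a thrown error ("panic").
        throw new Error("UnrecognizedByte");
    }
    if (!accepts(code)) {
        // A valid char, but not one this instance accepts: a soft failure ("nope").
        return { ok: false, reason: "UnexpectedChar" };
    }
    return { ok: true, value: input[position] };
}
```

With these assumptions, `parseChar("7", 0, isDigit)` succeeds, `parseChar("x", 0, isDigit)` returns the soft failure, and `parseChar("\u00e9", 0, isDigit)` throws.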
It sounds simple, but let’s dig deeper.
To build a 🉑char parser instance, you take a char parser constructor such as ascii and give it an initial 🉑char class specifier, getting a parser for a specific char class. You can then combine multiple char classes using 🉑char class operators.
import {ascii}
const letter = ascii("letter") // parses letters
const digit = ascii("digit") // parses digits
const letterDigit = letter.or(digit) // parses either
Char parsers define what char means.
A char parser can produce JavaScript strings of length 2 or higher as a char because these strings count as chars according to its definition.
For instance, as far as the 🉑ascii parser is concerned, the only chars are the ASCII characters plus the DOS newline sequence \r\n, which it treats as a single char. That is, given the string:
abc\n\n\n\r\n\r
It will be parsed as the chars:
a b c \n \n \n \r\n \r
As you can see from the ASCII parser, char parsers will always parse the longest (in terms of JavaScript length) character they accept.
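The longest-match rule can be illustrated with a small splitter. This is a sketch under the assumptions stated in the text (ASCII only, with `\r\n` counting as one char); `splitAsciiChars` is a hypothetical helper and not part of the library.

```javascript
// Hypothetical helper: splits a string into chars the way the text describes
// the ascii parser, always preferring the longest char it can accept.
function splitAsciiChars(input) {
    const result = [];
    let i = 0;
    while (i < input.length) {
        if (input.charCodeAt(i) > 127) {
            throw new Error(`Not an ASCII character at index ${i}`);
        }
        // Longest match first: try the two-unit "\r\n" before a single unit.
        const isDosNewline = input[i] === "\r" && input[i + 1] === "\n";
        const length = isDosNewline ? 2 : 1;
        result.push(input.slice(i, i + length));
        i += length;
    }
    return result;
}

const asciiChars = splitAsciiChars("abc\n\n\n\r\n\r");
// asciiChars: ["a", "b", "c", "\n", "\n", "\n", "\r\n", "\r"]
```

The trailing `\r` stays a char of its own because no `\n` follows it.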
For the 🉑unicode parser, a single char is any Unicode codepoint, which can be a string of length 1 or 2.
Custom char parsers could produce even longer strings as chars. Examples include Hangul syllabic blocks, complex emoji like 🧑🧑👧🏽, and combining diacritics.
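To make the lengths concrete, the snippet below compares JavaScript `length` (UTF-16 code units) with the number of code points for a few strings. It uses only built-in string behavior; the variable names are illustrative, and nothing here is taken from the library.

```javascript
const latin = "a";          // U+0061: one UTF-16 code unit
const emoji = "\u{1F600}";  // U+1F600: stored as a surrogate pair
// Three emoji joined by zero-width joiners, rendered as one family emoji.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// JavaScript length counts UTF-16 code units.
const codeUnitLengths = [latin.length, emoji.length, family.length];
// codeUnitLengths: [1, 2, 8]

// Spreading a string iterates by code point instead.
const codePointCounts = [[...latin].length, [...emoji].length, [...family].length];
// codePointCounts: [1, 1, 5]
```

A codepoint-based parser would see the family emoji as five chars, which is why a custom char parser would be needed to treat it as one.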
Char parsers are constructed using a functional API similar to the rest of the library, but this similarity is just superficial. Char parsers actually behave very differently from other parsers, down to their underlying structure and implementation.
This can be shown using a simple example.
For normal parsers, the more combinators you apply, the slower the resulting parser will perform. This is directly related to the complexity of the resulting parse tree.
const A = int.pipe(
or(hex),
must("must not be greater than 100", x => x < 100),
pre("the number: "),
)
const B = int
Char parsers support built-in 🛠️tuners that are similar to combinators, but since tuners are not combinators, they don’t result in a parse tree.
Moreover, even char parsers for parsing elaborate character classes will perform as well as something like letter or digit!
const A = letter
const B = letter.or(digit).not(upper).invert()
// They will both perform the same
Char parsers achieve this through:
- A space-time trade-off.
- Doing more work during parser construction to do less work during execution.
- Fancy, space-efficient data structures.
However, it all boils down to a single bigass table.
You can think of a char parser as a giant table of characters. The characters in the table are the parser’s entire char set.
Each entry has a mark saying whether the parser accepts that character:
| Char | Yes? |
|---|---|
| a | |
| b | |
| c | |
| d | |
| e | |
| … | |
| 1 | |
| 2 | |
The char class of a parser is the set of rows that are marked as accepted.
This table is the secret to char parsers. While constructing it might be somewhat expensive, once it’s there, it will perform as fast as JavaScript possibly can.
The table is an abstraction that’s very close to the implementation. The 🉑ascii parser implements it with a single contiguous block of memory: a Uint8Array.
The 🉑unicode parser, which has to deal with 300,000 characters allocated in a potential space of over a million, is the one that needs to use the fancy data structure. But conceptually and functionally it’s basically the same.
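The table idea can be sketched for the ASCII case as follows. This is a minimal model, not the library's code: `AsciiClass` and its methods are hypothetical names that only mirror the tuners mentioned above, and the table is assumed to hold one byte per ASCII character.

```javascript
// Hypothetical model of a table-backed ASCII char class.
class AsciiClass {
    constructor(table) {
        this.table = table; // Uint8Array(128): 1 = accepted, 0 = rejected
    }
    // Build a table by testing every ASCII character once, at construction time.
    static from(predicate) {
        const table = new Uint8Array(128);
        for (let code = 0; code < 128; code++) {
            table[code] = predicate(String.fromCharCode(code)) ? 1 : 0;
        }
        return new AsciiClass(table);
    }
    // Each tuner does its work up front and returns another flat table.
    or(other) {
        return new AsciiClass(this.table.map((bit, code) => bit | other.table[code]));
    }
    not(other) {
        return new AsciiClass(this.table.map((bit, code) => bit & (other.table[code] ^ 1)));
    }
    invert() {
        return new AsciiClass(this.table.map((bit) => bit ^ 1));
    }
    // Execution is a single array lookup, however the table was built.
    accepts(char) {
        return this.table[char.charCodeAt(0)] === 1;
    }
}

const letters = AsciiClass.from((c) => /[a-zA-Z]/.test(c));
const digits = AsciiClass.from((c) => /[0-9]/.test(c));
const uppers = AsciiClass.from((c) => /[A-Z]/.test(c));

// Two different constructions that end up with the same table.
const viaTuners = letters.or(digits).not(uppers);
const direct = AsciiClass.from((c) => /[a-z0-9]/.test(c));
const sameTable = viaTuners.table.every((bit, code) => bit === direct.table[code]);
// sameTable: true
```

In this model, chaining more tuners makes construction slower but leaves `accepts` unchanged, which is the trade-off the text describes.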
A char parser can be summed up using its char class — the list of characters that it accepts. Two char parsers of the same type that have the same char class are functionally identical, no matter how they were constructed.
That is, even though char parser construction can be just as elaborate as using combinators, it doesn’t result in a parse tree. The result is a single parser.
const result = letter.or(digit).not(range("a", "b"))
You start with a base char class, which is a standard group of characters that the parser