Make tokenizer more generic #3


Closed

opened 2023-03-18 22:28:53 +00:00 by zyxw59 · 0 comments


(Migrated from github.com)

Currently the tokenizer hard-codes the characters that can occur together in a token. There should be a way to provide an alternate implementation.

Some things probably should still be hard-coded, namely the number and string tokenizers, since they have more complex internal state than a simple character-by-character lexer. Also I don't think the whitespace skipping would need to be genericized.

The simplest API I can imagine is something like this:

```rust
pub trait TokenizerState {
    /// Get the tokenizer state for the given initial character.
    fn token_start(c: char) -> Self; // `c` will never be a digit, double quote, or whitespace
    /// Match the next character, returning `None` if it is not part of the same token.
    fn token_continue(self, c: char) -> Option<Self>;
}
```
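To make the proposal concrete, here is a rough sketch of how an implementation and a driver loop might fit together. Everything apart from the trait's two methods is hypothetical: `SimpleState`, `is_operator_char`, and `split_tokens` are names invented for illustration, and the `Sized` bound is an addition I'd expect to be needed so that `Option<Self>` is well-formed. The driver simply returns an error offset when it sees a digit or double quote at the start of a token, standing in for the hard-coded number and string tokenizers.

```rust
/// The trait from the proposal, with a `Sized` bound added (an assumption)
/// so that `Option<Self>` can be used as a return type.
pub trait TokenizerState: Sized {
    /// Get the tokenizer state for the given initial character.
    /// `c` will never be a digit, double quote, or whitespace.
    fn token_start(c: char) -> Self;
    /// Match the next character, returning `None` if it is not part of the same token.
    fn token_continue(self, c: char) -> Option<Self>;
}

/// Hypothetical state type: identifiers, runs of operator characters,
/// and single-character tokens for everything else.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimpleState {
    Identifier,
    Operator,
    Single,
}

/// Hypothetical helper: characters that may be combined into one operator token.
fn is_operator_char(c: char) -> bool {
    "+-*/<>=!&|".contains(c)
}

impl TokenizerState for SimpleState {
    fn token_start(c: char) -> Self {
        if c.is_alphabetic() || c == '_' {
            SimpleState::Identifier
        } else if is_operator_char(c) {
            SimpleState::Operator
        } else {
            SimpleState::Single
        }
    }

    fn token_continue(self, c: char) -> Option<Self> {
        match self {
            SimpleState::Identifier if c.is_alphanumeric() || c == '_' => Some(self),
            SimpleState::Operator if is_operator_char(c) => Some(self),
            // `Single` tokens never continue; anything else ends the token.
            _ => None,
        }
    }
}

/// Hypothetical driver: splits `input` into token slices using `S`.
/// Whitespace skipping stays hard-coded here. A digit or double quote at the
/// start of a token is where the real tokenizer would hand off to its number
/// or string lexer; this sketch just returns the byte offset as an error.
pub fn split_tokens<S: TokenizerState>(input: &str) -> Result<Vec<&str>, usize> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();
    while let Some((start, c)) = chars.next() {
        if c.is_whitespace() {
            continue;
        }
        if c.is_ascii_digit() || c == '"' {
            return Err(start);
        }
        let mut state = S::token_start(c);
        let mut end = start + c.len_utf8();
        // Feed characters to the state until it rejects one.
        while let Some(&(idx, next)) = chars.peek() {
            match state.token_continue(next) {
                Some(new_state) => {
                    state = new_state;
                    end = idx + next.len_utf8();
                    chars.next();
                }
                None => break,
            }
        }
        tokens.push(&input[start..end]);
    }
    Ok(tokens)
}
```

With this sketch, `split_tokens::<SimpleState>("foo <= bar_1")` would yield the slices `foo`, `<=`, and `bar_1`. Because `token_continue` takes `self` by value and returns the next state, an implementation is free to change state mid-token, which a plain "is this character allowed" predicate could not express.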
