Make tokenizer more generic #3

Closed
opened 2023-03-18 22:28:53 +00:00 by zyxw59 · 0 comments
zyxw59 commented 2023-03-18 22:28:53 +00:00 (Migrated from github.com)

Currently the tokenizer hard-codes the characters that can occur together in a token. There should be a way to provide an alternate implementation.

Some things probably should still be hard-coded, namely the number and string tokenizers, since they have more complex internal state than a simple character-by-character lexer. Also, I don't think whitespace skipping needs to be made generic.

The simplest API I can imagine is something like this:

```rust
// `Sized` is required because `token_continue` returns `Option<Self>`.
pub trait TokenizerState: Sized {
    /// Get the tokenizer state for the given initial character.
    fn token_start(c: char) -> Self; // `c` will never be a digit, double quote, or whitespace
    /// Match the next character, returning `None` if it is not part of the same token.
    fn token_continue(self, c: char) -> Option<Self>;
}
```
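
To make the proposal concrete, here is a minimal sketch of how such a trait might be implemented and driven. It is not the project's actual tokenizer: `SimpleState`, `classify`, and `tokenize` are hypothetical names invented for illustration, the number and string branches are simplistic stand-ins for the hard-coded tokenizers, and the `Sized` supertrait is an assumption added so that `Option<Self>` is well-formed.

```rust
/// The proposed trait, restated so this sketch is self-contained.
/// Assumption: a `Sized` bound, needed for the `Option<Self>` return type.
pub trait TokenizerState: Sized {
    fn token_start(c: char) -> Self;
    fn token_continue(self, c: char) -> Option<Self>;
}

/// Hypothetical state: identifiers, operator runs, and single-character tokens.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimpleState {
    Identifier,
    Operator,
    Other,
}

impl SimpleState {
    /// Classify a single character (illustrative rules only).
    fn classify(c: char) -> Self {
        if c.is_alphanumeric() || c == '_' {
            SimpleState::Identifier
        } else if "+-*/<>=!&|".contains(c) {
            SimpleState::Operator
        } else {
            SimpleState::Other
        }
    }
}

impl TokenizerState for SimpleState {
    fn token_start(c: char) -> Self {
        Self::classify(c)
    }

    fn token_continue(self, c: char) -> Option<Self> {
        // A token continues only while characters stay in the same class;
        // `Other` (including whitespace and quotes) never continues a token.
        match (self, Self::classify(c)) {
            (SimpleState::Identifier, SimpleState::Identifier) => Some(SimpleState::Identifier),
            (SimpleState::Operator, SimpleState::Operator) => Some(SimpleState::Operator),
            _ => None,
        }
    }
}

/// Hypothetical driver: whitespace, numbers, and strings stay hard-coded,
/// and everything else is delegated to the `TokenizerState` implementation.
pub fn tokenize<S: TokenizerState>(input: &str) -> Vec<String> {
    let mut chars = input.chars().peekable();
    let mut tokens = Vec::new();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next(); // skip whitespace
        } else if c.is_ascii_digit() {
            // Stand-in for the number tokenizer: a run of ASCII digits.
            let mut text = String::new();
            while let Some(&d) = chars.peek() {
                if !d.is_ascii_digit() {
                    break;
                }
                text.push(d);
                chars.next();
            }
            tokens.push(text);
        } else if c == '"' {
            // Stand-in for the string tokenizer: no escape handling.
            let mut text = String::from(c);
            chars.next();
            for d in chars.by_ref() {
                text.push(d);
                if d == '"' {
                    break;
                }
            }
            tokens.push(text);
        } else {
            // Generic path: `c` is never a digit, double quote, or whitespace here.
            let mut state = S::token_start(c);
            let mut text = String::from(c);
            chars.next();
            while let Some(&d) = chars.peek() {
                match state.token_continue(d) {
                    Some(next) => {
                        state = next;
                        text.push(d);
                        chars.next();
                    }
                    None => break,
                }
            }
            tokens.push(text);
        }
    }
    tokens
}
```

With this sketch, `tokenize::<SimpleState>("foo <= bar_1")` splits the input into `foo`, `<=`, and `bar_1`. Because `token_continue` takes `self` by value, each step either yields the next state or ends the token, so an implementation cannot accidentally keep using a state after the token has ended.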
Reference
mle/selkirk#3