An example tokenizer

A tokenizer is anything of type

(str: string) =>
  Iterator<
    Token<string>,
    Token<null> | TokenizationError
  >

In particular, a generator function will work.

function *tokenizer(str: string) {
  // Yield the tokens here.
}
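As a rough illustration of how a generator lines up with that type, here is a minimal sketch. The `SrcPosition` and `Token` classes below are simplified stand-ins written for this example (an assumption, not the library's own definitions), and `charTokenizer` is a hypothetical name. Values passed to `yield` become the iterator's items, while the value passed to `return` becomes the final result, delivered with `done: true`.

```typescript
// Simplified stand-ins for the library's classes (assumption for this sketch).
class SrcPosition {
  constructor(public line: number, public col: number, public i: number) {}
}

class Token<Kind extends string | null> {
  constructor(
    public kind: Kind,
    public start: SrcPosition,
    public end: SrcPosition,
  ) {}
}

// Hypothetical tokenizer: one 'char' token per character, then "end of input".
function* charTokenizer(str: string): Generator<Token<'char'>, Token<null>> {
  for (let i = 0; i < str.length; i++) {
    yield new Token<'char'>(
      'char',
      new SrcPosition(0, i, i),
      new SrcPosition(0, i + 1, i + 1),
    );
  }

  const end = new SrcPosition(0, str.length, str.length);
  return new Token<null>(null, end, end);
}

const it = charTokenizer('ab');
const first = it.next();  // yielded token for 'a'
const second = it.next(); // yielded token for 'b'
const last = it.next();   // returned "end of input" token

const kinds = [first.value.kind, second.value.kind, last.value.kind];
const doneFlags = [first.done, second.done, last.done];
```

Here `kinds` is `['char', 'char', null]` and `doneFlags` is `[false, false, true]`: only the last step carries the returned token.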

Our example tokenizer yields two kinds of tokens: Token<'word'> and Token<'number'>. It skips whitespace, which here means spaces and newlines only.

We first extend Token to create WordToken and NumberToken. These tokens will store, along with their kind ('word' or 'number'), the word or number they represent.

// `tokenizer.ts`
export class WordToken extends Token<'word'> {
  constructor(
    public word: string,
    start: SrcPosition,
    end: SrcPosition,
  ) {
    super('word', start, end);
  }
}

export class NumberToken extends Token<'number'> {
  constructor(
    public number: string,
    start: SrcPosition,
    end: SrcPosition,
  ) {
    super('number', start, end);
  }
}

Now we define a few helper functions. Each takes the input string and the current position, advances the position past whatever it consumed, and returns either what it matched or a boolean indicating whether it consumed anything.

handleWhitespace updates the position to skip a whitespace character, returning true if it did so.

// `tokenizer.ts`
const handleWhitespace = (str: string, position: SrcPosition) => {
  if (str[position.i] === '\n') {
    position.line++;
    position.col = 0;
    position.i++;
    
    return true;
  }
  
  if (str[position.i] === ' ') {
    position.col++;
    position.i++;
    
    return true;
  }
  
  return false;
};
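To see how the position evolves, the sketch below calls the helper in a loop over a short input. It repeats a compact copy of handleWhitespace so that it runs on its own, and it uses a stand-in `SrcPosition` class (an assumption: a plain class with public `line`, `col` and `i` fields).

```typescript
// Stand-in for the library's SrcPosition (assumption for this sketch).
class SrcPosition {
  constructor(public line: number, public col: number, public i: number) {}
}

// Compact copy of handleWhitespace, repeated so the sketch is self-contained.
const handleWhitespace = (str: string, position: SrcPosition): boolean => {
  if (str.charAt(position.i) === '\n') {
    position.line++;
    position.col = 0;
    position.i++;
    return true;
  }

  if (str.charAt(position.i) === ' ') {
    position.col++;
    position.i++;
    return true;
  }

  return false;
};

const input = '  \n a'; // two spaces, a newline, one space, then 'a'
const position = new SrcPosition(0, 0, 0);

// Keep skipping until the helper reports that it had no effect.
let skipped = 0;
while (handleWhitespace(input, position)) skipped++;

// Four characters were skipped; position now points at 'a':
// line 1, col 1, i 4.
```

Note that a tab or a carriage return would stop the loop just like a letter does, since the helper only recognises `' '` and `'\n'`.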

handleCharClass is a factory for our helper functions. The function it returns tries to match the longest possible run of characters from charClass, starting at the current position. If that run is empty, it returns null. Otherwise, if the character immediately after the run is in noFollow, it returns a TokenizationError. Otherwise, it advances the position and returns the matched string as an instance of TokenClass.

// `tokenizer.ts`
const letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
const digits = '0123456789';

type TokenConstructor<T> = new (
  value: string,
  start: SrcPosition,
  end: SrcPosition,
) => T;

const handleCharClass =
  <T>(
    charClass: string,
    noFollow: string,
    TokenClass: TokenConstructor<T>,
  ) =>
  (str: string, position: SrcPosition) =>
{
  let j = position.i;
  
  while (charClass.includes(str[j])) {
    j++;
  }
  
  if (position.i === j) return null;
  
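  if (noFollow.includes(str[j])) {
    // Report the error at the offending character, so the column
    // has to advance by the length of the matched run as well.
    return new TokenizationError(
      new SrcPosition(position.line, position.col + j - position.i, j),
    );
  }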
  
  const token = new TokenClass(
    str.slice(position.i, j),
    new SrcPosition(position.line, position.col, position.i),
    new SrcPosition(position.line, position.col + j - position.i, j),
  );
  
  position.col += j - position.i;
  position.i = j;
  
  return token;
};
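The three possible outcomes may be easier to see in a stripped-down form. The sketch below is a simplified counterpart of the returned function, not the code above: `matchRun` and `MatchResult` are hypothetical names, it works on a plain index instead of a SrcPosition, and it returns plain objects instead of Token and TokenizationError instances.

```typescript
type MatchResult =
  | { kind: 'none' }                              // nothing from charClass at `start`
  | { kind: 'error'; at: number }                 // run is followed by a forbidden character
  | { kind: 'match'; text: string; end: number }; // a usable run

// Hypothetical, simplified counterpart of the function returned by handleCharClass.
const matchRun = (
  str: string,
  start: number,
  charClass: string,
  noFollow: string,
): MatchResult => {
  let j = start;

  // Extend the run while the next character belongs to charClass.
  while (j < str.length && charClass.includes(str.charAt(j))) j++;

  if (j === start) return { kind: 'none' };

  // A run that touches a noFollow character is rejected as a whole.
  if (j < str.length && noFollow.includes(str.charAt(j))) {
    return { kind: 'error', at: j };
  }

  return { kind: 'match', text: str.slice(start, j), end: j };
};

const lowercase = 'abcdefghijklmnopqrstuvwxyz';
const decimalDigits = '0123456789';

const matched = matchRun('abc 1', 0, lowercase, decimalDigits);
const rejected = matchRun('abc1', 0, lowercase, decimalDigits);
const nothing = matchRun('123', 0, lowercase, decimalDigits);
```

Here `matched` is `{ kind: 'match', text: 'abc', end: 3 }`, `rejected` is `{ kind: 'error', at: 3 }`, and `nothing` is `{ kind: 'none' }`. The second case is what noFollow is for: without it, `'abc1'` would quietly split into a word followed by a number.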

Lastly, we define our tokenizer using a generator function.

// `tokenizer.ts`
const handleWord = handleCharClass(letters, digits, WordToken);
const handleNumber = handleCharClass(digits, letters, NumberToken);

export function *tokenizeWordsAndNumbers(str: string) {
  const position = new SrcPosition(0, 0, 0);
  
  while (position.i < str.length) {
    if (handleWhitespace(str, position)) continue;
    
    const word = handleWord(str, position);
    
    if (word) {
      yield word;
      continue;
    }
    
    const number = handleNumber(str, position);
    
    if (number) {
      yield number;
      continue;
    }
    
    return new TokenizationError(position);
  }
  
  // The "end of input" token.
  return new Token(
    null,
    new SrcPosition(position.line, position.col, position.i),
    new SrcPosition(position.line, position.col, position.i),
  );
}
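Note that the end-of-input token and the TokenizationError are both delivered through `return`, not `yield`. A `for...of` loop discards a generator's return value, so code that wants to inspect it has to call `next()` itself. The sketch below shows one way to do that; `drain` and `splitWords` are hypothetical names, and `splitWords` is only a small stand-in generator so that the example runs without the library.

```typescript
// Runs an iterator to completion, keeping both the yielded items
// and the value it finally returned.
function drain<T, R>(iterator: Iterator<T, R>): { items: T[]; result: R } {
  const items: T[] = [];
  let step = iterator.next();

  while (!step.done) {
    items.push(step.value);
    step = iterator.next();
  }

  // Once `done` is true, `value` holds the generator's return value.
  return { items, result: step.value };
}

// Hypothetical stand-in tokenizer: yields words separated by single spaces,
// returns 'end' on success or 'error' when it meets an empty word.
function* splitWords(str: string): Generator<string, 'end' | 'error'> {
  for (const part of str.split(' ')) {
    if (part === '') return 'error';
    yield part;
  }

  return 'end';
}

const ok = drain(splitWords('hello world'));
const bad = drain(splitWords('hello  world')); // two spaces in a row
```

Here `ok` is `{ items: ['hello', 'world'], result: 'end' }` and `bad` is `{ items: ['hello'], result: 'error' }`. With tokenizeWordsAndNumbers, `result` would be either the `Token<null>` end marker or a TokenizationError.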

To use our tokenizer, we just need to pass it to a Parser constructor.

// `parser.ts`
import { Parser } from 'lr-parser-typescript';

import { StartingSymbol } from './grammar';
import { tokenizeWordsAndNumbers } from './tokenizer';

export const parser = new Parser(StartingSymbol, {
  tokenizer: tokenizeWordsAndNumbers,
});

We're done! 🎉 Our grammar can now use 'word' and 'number' tokens.

// `grammar.ts`
import { Repeat, SyntaxTreeNode } from 'lr-parser-typescript';

export class StartingSymbol extends SyntaxTreeNode {
  static pattern = new Repeat('word', {
    delimiter: 'number',
  });
}

// `index.ts`
import { parser } from './parser';

parser.parse('hello 123 world'); // An instance of StartingSymbol.
