Overview of rexical

This text has been taken from Jeff Nyman's blog.

Introduction

Rex is a lexical analyzer generator. That means it builds programs that break provided input into recognized pieces. As an example, a lexical analyzer might take a written document as input and count the words in that document. That would imply there is a rule that specifies how to recognize a word. It might be a simple rule, like “any bit of text that is separated by spaces.” When using Rex, you will construct a “specification file.” This file will tell Rex how to build a corresponding lexical analyzer. That lexical analyzer will be generated in Ruby code. (Lex does the same thing but outputs C code instead.)
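To make the word-counting example concrete, here is a minimal sketch in plain Ruby (no Rex involved), where the “rule” for recognizing a word is simply “text separated by spaces”:

text = "a lexical analyzer breaks provided input into pieces"
words = text.split(/\s+/)  # a "word" is any run of non-space text
puts words.length          # => 8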

So what goes in this specification file? Within this Rex file, you will have a series of rules that Rex translates into the lexical analyzer. Each rule, in turn, consists of a pattern and some code to be run when that pattern is matched. (In Lex, any text that isn’t matched is simply copied to standard output; a Rex-generated lexer instead raises an error for input it cannot match, as we’ll see later.) The idea is that you want to make sure all possible inputs are caught by the lexer or that the lexer appropriately handles situations where it finds input that it does not know how to match.
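As a preview, a specification file has this general shape; the class name and rules here are placeholders, not part of the tutorial to come. A rule with no action, like the \s+ rule below, matches input and silently discards it, which is the usual way to skip whitespace; inside an action, the matched string is available as text:

class MyLanguage
rule
  \s+
  \d+    { puts "Matched a number: #{text}" }
end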

Your First Rexical Program

Let’s play around a bit. Follow these steps:

Create a folder called test_language. This will be our project folder. Within that folder, create a file called test_language.rex. This will be our specification file.

The main responsibility of Rex is to tokenize the code of a language into unit tokens, called terminals. Generally the goal is to allow those terminals to be understood by a parser, like Racc. For now, however, we’ll just use Rex on its own. Enter the following into your file:

class TestLanguage
 
end

To make sure things work, let’s compile this file. Perform the following command:

rex test_language.rex -o lexer.rb

Here I’m just using the rexical tool to compile the rex file into a file called lexer.rb. If you look at the generated file, you will see that the generated class inherits from Racc::Parser. So let’s keep in mind what this lexer is for. It’s for performing lexical analysis, of course. Lexical analysis, however, is just the first phase: it takes various inputs and converts them into a stream of symbols, which are fed into the second phase, the parser. Any lexer should raise errors for invalid characters and for anything it can’t find a symbol for. Take a bit of time to look at the generated lexer.rb file. Don’t worry about understanding all of it. Just get a feel for it.
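To make “stream of symbols” a bit more concrete: a lexer typically hands the parser a sequence of tokens, each pairing a token type with a value. The token names below are made up purely for illustration:

tokens = [[:NUMBER, 2], [:PLUS, "+"], [:NUMBER, 3]]
# Each token is a [TYPE, value] pair; a parser such as Racc consumes
# these one at a time to recognize larger structures like "2 + 3".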

Also note that any time you make a change to the test_language.rex file, you must rerun the above command to generate a new lexer.rb.

In order for this lexer to be in any way useful, you do have to provide some rules. Specifically, you will include a rules section. The rules section associates regular expression patterns with Ruby logic. The idea is that when the lexer sees text that matches a pattern you specified, the lexer will execute the associated Ruby code. Add the following to your test_language.rex file:

class TestLanguage
rule
  u     { puts "Single u." }
  uu    { puts "Double u." }
end

Here you have two rules. The first specifies that if a single u is input and matched, a certain bit of logic should execute; the second specifies that if two u’s together are input and matched, a certain other bit of logic should execute. Think of these lines as rex expressions. These expressions are patterns. Here the patterns are simple: literal values of ‘u’. But the patterns can (and usually will) be specified by regular expressions. The idea is that when provided input, Rexical will search for strings that match your rex expressions.
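For instance, staying close to our example, a rule could use the regular expression u+ to match any run of one or more u’s. This is a sketch, separate from the running example, using the text variable mentioned earlier:

class TestLanguage
rule
  u+    { puts "A run of u's: #{text}" }
end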

There are, however, some tricky bits to this matching. To show that, create a file called test_language.rb. Then add the following to it:

require './lexer.rb'
 
class TestLanguageTester
  @evaluator = TestLanguage.new
  @evaluator.tokenize("u")
end

Now try to run that file. You will be told that there is no tokenize method to execute. Well, that’s true — there isn’t. You have to create one. So add the following to your test_language.rex file:

class TestLanguage
rule
  u     { puts "Single u." }
  uu    { puts "Double u." }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

The call to scan_setup is a call to a method in the lexer.rb file. This sets up the lexical scanner to do its thing. Likewise, a call is made to the next_token method, also defined in lexer.rb. This grabs the next token from the input, according to the rules section of the file. Notice how the tokenize method that we defined is placed in an inner section? That’s important: anything in an inner section is copied verbatim into the class in the lexer.rb file. If you check, you’ll find your tokenize method inside lexer.rb near the bottom of the file.
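One aside before moving on: puts returns nil, and a rule whose action returns nil produces no token, so the tokens array collected by tokenize will actually come back empty here. When you eventually want tokenize to hand real tokens to a parser, the convention is for each action to return a [TYPE, value] pair. A sketch with made-up token names, keeping the rest of the file the same:

class TestLanguage
rule
  u     { [:SINGLE_U, text] }
  uu    { [:DOUBLE_U, text] }

inner
  # ... the same tokenize method as above ...
end

The first element of each pair is the token type a parser like Racc will dispatch on; the second is the token’s value.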

Make sure to regenerate the lexer.rb file and then run your test_language.rb file again. You should get this output:

Single u.

Okay, now change the test_language.rb file like this:

require './lexer.rb'
 
class TestLanguageTester
  @evaluator = TestLanguage.new
  @evaluator.tokenize("uu")
end

Run the file again. This time you get:

Single u.
Single u.

Hmm. Is that what you expected? Or did you expect to receive the text “Double u”? What’s happening here is that when given input, Rex tries each rule in the order it appears in the file and uses the first pattern that matches at the current position. Here’s where I enter into some confusion, though. It seems that in the world of Lex, preference will be given to longer matches even if they are later in the file. That does not seem to be the case in Rex. To test this, change the test_language.rex file as follows:

class TestLanguage
rule
  uu    { puts "Double u." }
  u     { puts "Single u." }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

Here I’ve just switched the ordering of the rules. Now try to run your logic again and you’ll find the following:

  • @evaluator.tokenize("u") results in “Single u.”
  • @evaluator.tokenize("uu") results in “Double u.”

So let’s try one more addition to the test_language.rex file:

class TestLanguage
rule
  uu    { puts "Double u." }
  u     { puts "Single u." }
  uuu   { puts "Triple u." }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

Now you should find the following:

  • @evaluator.tokenize("u") results in “Single u.”
  • @evaluator.tokenize("uu") results in “Double u.”
  • @evaluator.tokenize("uuu") results in “Double u.” and “Single u.”

If you were to put the triple u rule first, above all the other rules in the rex file, you would find the following:

  • @evaluator.tokenize("u") results in “Single u.”
  • @evaluator.tokenize("uu") results in “Double u.”
  • @evaluator.tokenize("uuu") results in “Triple u.”

So the trick here is that ordering matters: at each position in the input, Rex applies the first rule (in file order) whose pattern matches there, which means longer patterns must be listed before any shorter patterns that would otherwise shadow them. Play around with this simple example to see how things work.
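As a trace of the “uuu” case under the ordering uu, u, uuu, the scanning proceeds roughly like this:

position 0: try uu -> matches "uu"   => prints "Double u."
position 2: try uu -> no match; try u -> matches "u"   => prints "Single u."
position 3: end of input, scanning stops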

If something doesn’t match any rules at all, Rex will provide an error saying that it could not match that particular token. What some people will do is add a last rule that matches anything and either does nothing at all or prints out some kind of friendly message. Here’s an example of what you could do:

class TestLanguage
rule
  uu    { puts "Double u." }
  u     { puts "Single u." }
  uuu   { puts "Triple u." }
  .     { puts "Could not match." }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

The . is a regular expression that matches any single character, which here effectively means “match anything the earlier rules didn’t.” Try @evaluator.tokenize("y") and you should get the message “Could not match.” You should find that all the previous examples work as before.
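If you would rather keep the lexer strict, the catch-all’s action is ordinary Ruby, so it could raise instead of printing. A sketch, reusing the matched string text in the message:

  .     { raise "Could not match: #{text}" }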
