Blog • How to implement an easiest compiler

This article will lead the reader to make a simple C-like (but much simpler than C) language compiler.

Here, is the reference code.

This language supports:

2 basic types: int and double
Arrays in these basic types
2 kinds of control flow statements if (with optionally else) and while
Basic mathematical operators +、-、*、/、<、<=、>、>=、==、!=

The author assumes the user has a good command of C language¹ and knows how to use regex, flex and bison.

If you are not familiar with flex and bison, refer to another article of mine。

The whole compiling process

The compiler follows the following process²：

routine

Obviously there are a lot of code which we don't need to write, and it is both fortunate and unfortunate for us to be born in such an era with rich tools.

Lexical Analysis

Here is the key part of flex code:

"char"                      {yylval.type=Char; return TYPE;}
"int"                       {yylval.type=Int; return TYPE;}
"double"                    {yylval.type=Double; return TYPE;}
"string"                    {yylval.type=String; return TYPE;}
"if"                        {return IF;}
"else"                      {return ELSE;}
"while"                     {return WHILE;}
"print"                     {return PRINT;}
([-])?[0-9]+                {yylval.int_value=atoi(yytext); return INT_LITERAL;}
([-])?[0-9]+\.[0-9]*        {yylval.double_value=atof(yytext); return DOUBLE_LITERAL;}
[a-zA-Z][0-9a-zA-Z_]*       {yylval.string_value=strdup(yytext); return IDENTIFY;}
"=="                        {return EQUAL;}
"!="                        {return NONEQUAL;}
"<"                         {return *yytext;}
"<="                        {return LESSEQ;}
">"                         {return *yytext;}
">="                        {return GREATEREQ;}
"+"                         {return *yytext;}
"-"                         {return *yytext;}
"*"                         {return *yytext;}
"/"                         {return *yytext;}
"["                         {return *yytext;}
"]"                         {return *yytext;}
[ \t]                       {}
.                           {return *yytext;}

Gramma analysis

Here is the key part of the bison code:

program:
    program statement           
    |                          
    ;

assign:
    '=' expression             
    ;

defineStatement:
    TYPE IDENTIFY ';'
    | TYPE IDENTIFY '[' INT_LITERAL ']' ';' 
    | TYPE IDENTIFY assign ';'
    ;

assignStatement:
    referenceExpression assign ';'
    ;

statementList:
    statementList statement                 
    |                                       
    ;

block:
    '{' statementList '}' 
    ;

ifStatement:
    IF expression block                   
    | IF expression block ELSE block        
    ;

whileStatement:
    WHILE expression block                  
;

printStatement:
    PRINT '(' expression ')' ';'
;

statement:
    defineStatement
    | assignStatement
    | block
    | ifStatement
    | whileStatement
    | printStatement
;

referenceExpression:
    IDENTIFY 
    | IDENTIFY '[' expression ']' 

atomExpression:
    INT_LITERAL                            
    | DOUBLE_LITERAL                      
    | referenceExpression                  
    | '(' expression ')'                  
    ;

unaryOperator:
    '+'
    | '-'
    | '!'
    ;

unaryExpression:
    atomExpression                        
    | unaryOperator atomExpression         
;

binaryOrAtomExpression:
    unaryExpression                                    
    | binaryOrAtomExpression '+' unaryExpression       
    | binaryOrAtomExpression '-' unaryExpression   
    | binaryOrAtomExpression '*' unaryExpression        
    | binaryOrAtomExpression '/' unaryExpression      
    | binaryOrAtomExpression '<' unaryExpression       
    | binaryOrAtomExpression '>' unaryExpression       
    | binaryOrAtomExpression LESSEQ     unaryExpression 
    | binaryOrAtomExpression GREATEREQ  unaryExpression
    | binaryOrAtomExpression EQUAL      unaryExpression 
    | binaryOrAtomExpression NONEQUAL   unaryExpression 
;

expression:
    binaryOrAtomExpression 
    ;

Note that the corresponding behavior code is not given here, because we need to understand AST first to understand these behaviors.

AST（abstract syntax tree）

An abstract syntax tree organizes the parsed syntax units into a tree.

Basically each statement and expression has a correspond node on the AST.

AST design of this compiler

Some key concepts are borrowed from LLVM's AST design.

ast

So we have %union and %type:

%union {
    int int_value;
    double double_value;
    char* string_value;
    ASTNode* node;
    SymbolType type;
};

%token <type> TYPE
%token <string_value> IDENTIFY
%token <int_value> INT_LITERAL
%token <double_value> DOUBLE_LITERAL
%token <type> STRING INT DOUBLE
%token <void> IF ELSE WHILE FOR
%token <void> LESSEQ GREATEREQ EQUAL NONEQUAL
%token <void> PRINT

%left '>' '<' LESSEQ GREATEREQ EQUAL NONEQUAL
%left '+' '-'
%left '*' '/'

%type <node> statement assignStatement statementList block ifStatement whileStatement defineStatement printStatement
%type <node> expression referenceExpression assign atomExpression unaryExpression binaryOrAtomExpression
%type <node> program

The whole program is a CompoundStatement, we need to store the result in a global variable:

CompoundStatement *result;

Then we can finally add the behavior part in the bison code, take program and binaryOrAtomExpression as an example:

program:
    program statement           {add_statement((CompoundStatement *)$1, (Statement*)$2);}
    |                           {result=create_compound_statement(); $$=(ASTNode*)result;}
    ;

binaryOrAtomExpression:
    unaryExpression                                     {$$=$1;}
    | binaryOrAtomExpression '+' unaryExpression        {$$=(ASTNode*)create_binary_operation_result('+',(RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression '-' unaryExpression        {$$=(ASTNode*)create_binary_operation_result('-',(RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression '*' unaryExpression        {$$=(ASTNode*)create_binary_operation_result('*',(RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression '/' unaryExpression        {$$=(ASTNode*)create_binary_operation_result('/',(RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression '<' unaryExpression        {$$=(ASTNode*)create_binary_operation_result('<',(RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression '>' unaryExpression        {$$=(ASTNode*)create_binary_operation_result('>',(RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression LESSEQ     unaryExpression {$$=(ASTNode*)create_binary_operation_result(LESSEQ, (RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression GREATEREQ  unaryExpression {$$=(ASTNode*)create_binary_operation_result(GREATEREQ,(RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression EQUAL      unaryExpression {$$=(ASTNode*)create_binary_operation_result(EQUAL, (RValue*)$1,(RValue*)$3);}
    | binaryOrAtomExpression NONEQUAL   unaryExpression {$$=(ASTNode*)create_binary_operation_result(NONEQUAL, (RValue*)$1,(RValue*)$3);}

Symbol Table and Scope

Since this compiler does not support multiple source files and we don't have complex data types, design of symbol table here is relatively simple, and a global symbol table stack can be maintained to handle scopes:

typedef enum {
    Bool,
    Int,
    String,
    Double,
    Array
} SymbolType;

typedef struct {
    SymbolType type;
    bool mutable;
    char *name;
    size_t namespace_id;
} Symbol;

typedef struct {
    Symbol base;
    SymbolType elementType;
    size_t length;
} ArraySymbol;

typedef struct {
    size_t length;
    Symbol **symbols;
    size_t namespace_id;
} SymbolTableFrame;

typedef struct {
    size_t length;
    SymbolTableFrame **frames;
} SymbolTableStack;

SymbolTableStack symbol_table_stack = {0, NULL};

void push_frame();

void pop_frame();

void add_symbol(Symbol *symbol);

Symbol *get_symbol(char *name);

get_symbol look for symbol which name is name for the top to button of the stack.

On add_symbol, the namespace_id of symbol is set to the namespace_id of frame, which appears as part of the variable name during code generation.

When parsing:

block:
    '{' {push_frame();} statementList '}'   {$$=$3;pop_frame();}
    ;

在使用 symbol 时，由于刚刚进入的 block 对应的 frame 在栈顶附近，故会优先在这个 frame 中寻找名字为 name的符号，在这个 frame 中找不到时才会逐级向上寻找，这样就实现了作用域。

Every time we encounter a block, just push a frame, and when you leave the block, you pop the frame out, so that the symbol declared in the block will get the corresponding block's namespace_id.

When using symbol, since the frame corresponding to the block you just entered is near the top of the stack, the symbol named name will be searched first in this frame, and searched up level by level if not found. In such a way we implemented scoping of variables.

Target Code Generation

In order to let our software cross-platform and make use of llvm's excellent code optimization capabilities³, our target code is LLVM IR.

LLVM IR兼有高级语言和汇编的特点，比如：

LLVM IR is strongly typed
Many control structures in LLVM IR are similar to assembly, such as if, while and other control structures are represented by br jumps
The "local variable" in LLVM IR is quite a "register", but LLVM IR has an infinite number of such kind of "registers"
LLVM IR is an SSA form of IR, so a "register" can only be assigned a value once.
Many operations of LLVM are similar to assembly code, for example %a = add i32 1, %b is similar to add %a, 1, %b

Variable definition

Use alloca to allocate a space on stack:

%i_0 = alloca i32

Note alloca returns an address.

Assign value to variable

Use store：

store i32 0, i32* %i_0

It means we put 0 into the space pointed by %i_0.

Read the value of variable

Use load：

%temp = load i32, i32* %i_0

Which means take the value in the space pointed by %i_0 to register %temp.

Calculation

Take + as an example:

%temp_22 = add i32 %temp_21, 1

Which means put %temp_21 + 1 into %temp_22.

For float numbers, use fadd instead of add.

Compare

Use icmp and the way you want to compare:

%temp_2 = icmp slt i32 %temp_1, 10

slt is "Signed Less Than", ie. <.

For floats, there are fcmp, olt, etc.

Jump

Use br for jumping

Conditionless

br label %label1

With condition

br i1 %condition, label %condition_true, label %condition_false

That's all we need to know for accomplishing our task.

Then let's do the code generation.

Take variable declaration as an example:

static void generate_code(DeclareStatement *node) {
    printf("%%%s_%zu = alloca %s\n", node->variable->name, node->variable->namespace_id, type_name(node->variable));
    // Generate an assign statement if there is an initial value
    if (node->initial != NULL) {
        ((Statement *) (node->initial))->generate_code((Statement *) (node->initial));
    }
}

Assign:

static void generate_code(AssignStatement *node) {
    ((LValue *) (node->lhs))->generate_lvalue_code((LValue *) (node->lhs));
    node->rhs->generate_rvalue_code(node->rhs);
    char *rvalue_ir = node->rhs->rvalue_ir(node->rhs);
    char *lvalue_ir = node->lhs->lvalue_ir(node->lhs);
    printf("store %s %s, %s* %s\n",
           type_string(((RValue *) (node->lhs))->type),
           rvalue_ir,
           type_string(((RValue *) (node->lhs))->type),
           lvalue_ir);
    free(rvalue_ir);
    free(lvalue_ir);
}

For an expression we heed to handle it's rvalue and lvalue separately.

// rvalue needs to read to the register first
static void generate_rvalue_code(VariableReference *rValue) {
char *rvalue_ir_string = ((RValue *) rValue)->rvalue_ir((RValue*) rValue);
    char *lvalue_ir_string = ((LValue*) rValue)->lvalue_ir((LValue *) rValue);
    printf("%s = load %s, %s* %s\n",
           rvalue_ir_string,
           type_name(rValue->variable),
           type_name(rValue->variable),
           lvalue_ir_string);
    free(rvalue_ir_string);
    free(lvalue_ir_string);
}

// while lvalue doesn't
static void generate_lvalue_code(VariableReference *lValue) {
}

// variable name when it is used as a lvalue
static char *lvalue_ir(VariableReference *lValue) {
    char *result = malloc(128);
    sprintf(result, "%%%s_%zu", lValue->variable->name, lValue->variable->namespace_id);
    return result;
}

// variable name when it is used as a rvalue
static char *rvalue_ir(VariableReference *rValue) {
    char *result = malloc(128);
    sprintf(result, "%%temp_%zu", rValue->temp_register_id);
    return result;
}

How to generate other codes can be easily figured out by the reader. If you are interested in my implementation, please read the code.

References

Compilation principle

Three famous animal books:

"Compilers: Principles, Techniques, and Tools" - "Dragon Book"
"Modern Compiler Implementation" - "Tiger Book"
"Advanced Compiler Design and Implementation" - "Whale Book", however this book is mostly about backend optimization

Lex & Grammar you may refer to

LLVM IR

llvm official cite, especially its language reference
mapping high level constructs to llvm ir

The code will use object-oriented programming in C language, see Implementing object-oriented programming in pure C. Using this technique makes me feel that using C language is far more better than using C++ (Linus is not wrong at all).

In fact, I did a little bit of semantic analysis when constructing the AST.

Actually it's because I'm lazy. ;)