This article will lead the reader to make a simple C-like (but much simpler than C) language compiler.
Here, is the reference code.
This language supports:
- 2 basic types:
int
anddouble
- Arrays in these basic types
- 2 kinds of control flow statements
if
(with optionallyelse
) andwhile
- Basic mathematical operators
+
、-
、*
、/
、<
、<=
、>
、>=
、==
、!=
The author assumes the user has a good command of C language1 and knows how to use regex, flex and bison.
If you are not familiar with flex and bison, refer to another article of mine。
The whole compiling process
The compiler follows the following process2:
Obviously there are a lot of code which we don't need to write, and it is both fortunate and unfortunate for us to be born in such an era with rich tools.
Lexical Analysis
Here is the key part of flex code:
"char" {yylval.type=Char; return TYPE;}
"int" {yylval.type=Int; return TYPE;}
"double" {yylval.type=Double; return TYPE;}
"string" {yylval.type=String; return TYPE;}
"if" {return IF;}
"else" {return ELSE;}
"while" {return WHILE;}
"print" {return PRINT;}
([-])?[0-9]+ {yylval.int_value=atoi(yytext); return INT_LITERAL;}
([-])?[0-9]+\.[0-9]* {yylval.double_value=atof(yytext); return DOUBLE_LITERAL;}
[a-zA-Z][0-9a-zA-Z_]* {yylval.string_value=strdup(yytext); return IDENTIFY;}
"==" {return EQUAL;}
"!=" {return NONEQUAL;}
"<" {return *yytext;}
"<=" {return LESSEQ;}
">" {return *yytext;}
">=" {return GREATEREQ;}
"+" {return *yytext;}
"-" {return *yytext;}
"*" {return *yytext;}
"/" {return *yytext;}
"[" {return *yytext;}
"]" {return *yytext;}
[ \t] {}
. {return *yytext;}
Gramma analysis
Here is the key part of the bison code:
program:
program statement
|
;
assign:
'=' expression
;
defineStatement:
TYPE IDENTIFY ';'
| TYPE IDENTIFY '[' INT_LITERAL ']' ';'
| TYPE IDENTIFY assign ';'
;
assignStatement:
referenceExpression assign ';'
;
statementList:
statementList statement
|
;
block:
'{' statementList '}'
;
ifStatement:
IF expression block
| IF expression block ELSE block
;
whileStatement:
WHILE expression block
;
printStatement:
PRINT '(' expression ')' ';'
;
statement:
defineStatement
| assignStatement
| block
| ifStatement
| whileStatement
| printStatement
;
referenceExpression:
IDENTIFY
| IDENTIFY '[' expression ']'
atomExpression:
INT_LITERAL
| DOUBLE_LITERAL
| referenceExpression
| '(' expression ')'
;
unaryOperator:
'+'
| '-'
| '!'
;
unaryExpression:
atomExpression
| unaryOperator atomExpression
;
binaryOrAtomExpression:
unaryExpression
| binaryOrAtomExpression '+' unaryExpression
| binaryOrAtomExpression '-' unaryExpression
| binaryOrAtomExpression '*' unaryExpression
| binaryOrAtomExpression '/' unaryExpression
| binaryOrAtomExpression '<' unaryExpression
| binaryOrAtomExpression '>' unaryExpression
| binaryOrAtomExpression LESSEQ unaryExpression
| binaryOrAtomExpression GREATEREQ unaryExpression
| binaryOrAtomExpression EQUAL unaryExpression
| binaryOrAtomExpression NONEQUAL unaryExpression
;
expression:
binaryOrAtomExpression
;
Note that the corresponding behavior code is not given here, because we need to understand AST first to understand these behaviors.
AST(abstract syntax tree)
An abstract syntax tree organizes the parsed syntax units into a tree.
Basically each statement and expression has a correspond node on the AST.
AST design of this compiler
Some key concepts are borrowed from LLVM's AST design.
So we have %union
and %type
:
%union {
int int_value;
double double_value;
char* string_value;
ASTNode* node;
SymbolType type;
};
%token <type> TYPE
%token <string_value> IDENTIFY
%token <int_value> INT_LITERAL
%token <double_value> DOUBLE_LITERAL
%token <type> STRING INT DOUBLE
%token <void> IF ELSE WHILE FOR
%token <void> LESSEQ GREATEREQ EQUAL NONEQUAL
%token <void> PRINT
%left '>' '<' LESSEQ GREATEREQ EQUAL NONEQUAL
%left '+' '-'
%left '*' '/'
%type <node> statement assignStatement statementList block ifStatement whileStatement defineStatement printStatement
%type <node> expression referenceExpression assign atomExpression unaryExpression binaryOrAtomExpression
%type <node> program
The whole program is a CompoundStatement
, we need to store the result in a global variable:
CompoundStatement *result;
Then we can finally add the behavior part in the bison
code, take program
and binaryOrAtomExpression
as an example:
program:
program statement {add_statement((CompoundStatement *)$1, (Statement*)$2);}
| {result=create_compound_statement(); $$=(ASTNode*)result;}
;
binaryOrAtomExpression:
unaryExpression {$$=$1;}
| binaryOrAtomExpression '+' unaryExpression {$$=(ASTNode*)create_binary_operation_result('+',(RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression '-' unaryExpression {$$=(ASTNode*)create_binary_operation_result('-',(RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression '*' unaryExpression {$$=(ASTNode*)create_binary_operation_result('*',(RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression '/' unaryExpression {$$=(ASTNode*)create_binary_operation_result('/',(RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression '<' unaryExpression {$$=(ASTNode*)create_binary_operation_result('<',(RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression '>' unaryExpression {$$=(ASTNode*)create_binary_operation_result('>',(RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression LESSEQ unaryExpression {$$=(ASTNode*)create_binary_operation_result(LESSEQ, (RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression GREATEREQ unaryExpression {$$=(ASTNode*)create_binary_operation_result(GREATEREQ,(RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression EQUAL unaryExpression {$$=(ASTNode*)create_binary_operation_result(EQUAL, (RValue*)$1,(RValue*)$3);}
| binaryOrAtomExpression NONEQUAL unaryExpression {$$=(ASTNode*)create_binary_operation_result(NONEQUAL, (RValue*)$1,(RValue*)$3);}
Symbol Table and Scope
Since this compiler does not support multiple source files and we don't have complex data types, design of symbol table here is relatively simple, and a global symbol table stack can be maintained to handle scopes:
typedef enum {
Bool,
Int,
String,
Double,
Array
} SymbolType;
typedef struct {
SymbolType type;
bool mutable;
char *name;
size_t namespace_id;
} Symbol;
typedef struct {
Symbol base;
SymbolType elementType;
size_t length;
} ArraySymbol;
typedef struct {
size_t length;
Symbol **symbols;
size_t namespace_id;
} SymbolTableFrame;
typedef struct {
size_t length;
SymbolTableFrame **frames;
} SymbolTableStack;
SymbolTableStack symbol_table_stack = {0, NULL};
void push_frame();
void pop_frame();
void add_symbol(Symbol *symbol);
Symbol *get_symbol(char *name);
get_symbol
look for symbol which name is name
for the top to button of the stack.
On add_symbol
, the namespace_id
of symbol
is set to the namespace_id
of frame
, which appears as part of the variable name during code generation.
When parsing:
block:
'{' {push_frame();} statementList '}' {$$=$3;pop_frame();}
;
在使用 symbol
时,由于刚刚进入的 block
对应的 frame
在栈顶附近,故会优先在这个 frame
中寻找名字为 name
的符号,在这个 frame
中找不到时才会逐级向上寻找,这样就实现了作用域。
Every time we encounter a block
, just push
a frame
, and when you leave the block
, you pop
the frame
out, so that the symbol
declared in the block
will get the corresponding block
's namespace_id
.
When using symbol
, since the frame
corresponding to the block
you just entered is near the top of the stack, the symbol named name
will be searched first in this frame
, and searched up level by level if not found. In such a way we implemented scoping of variables.
Target Code Generation
In order to let our software cross-platform and make use of llvm's excellent code optimization capabilities3, our target code is LLVM IR.
LLVM IR兼有高级语言和汇编的特点,比如:
- LLVM IR is strongly typed
- Many control structures in LLVM IR are similar to assembly, such as if, while and other control structures are represented by
br
jumps - The "local variable" in LLVM IR is quite a "register", but LLVM IR has an infinite number of such kind of "registers"
- LLVM IR is an SSA form of IR, so a "register" can only be assigned a value once.
- Many operations of LLVM are similar to assembly code, for example
%a = add i32 1, %b
is similar toadd %a, 1, %b
Variable definition
Use alloca
to allocate a space on stack:
%i_0 = alloca i32
Note alloca
returns an address.
Assign value to variable
Use store
:
store i32 0, i32* %i_0
It means we put 0
into the space pointed by %i_0
.
Read the value of variable
Use load
:
%temp = load i32, i32* %i_0
Which means take the value in the space pointed by %i_0
to register %temp
.
Calculation
Take +
as an example:
%temp_22 = add i32 %temp_21, 1
Which means put %temp_21 + 1
into %temp_22
.
For float numbers, use fadd
instead of add
.
Compare
Use icmp
and the way you want to compare:
%temp_2 = icmp slt i32 %temp_1, 10
slt
is "Signed Less Than", ie. <
.
For floats, there are fcmp
, olt
, etc.
Jump
Use br
for jumping
Conditionless
br label %label1
With condition
br i1 %condition, label %condition_true, label %condition_false
That's all we need to know for accomplishing our task.
Then let's do the code generation.
Take variable declaration as an example:
static void generate_code(DeclareStatement *node) {
printf("%%%s_%zu = alloca %s\n", node->variable->name, node->variable->namespace_id, type_name(node->variable));
// Generate an assign statement if there is an initial value
if (node->initial != NULL) {
((Statement *) (node->initial))->generate_code((Statement *) (node->initial));
}
}
Assign:
static void generate_code(AssignStatement *node) {
((LValue *) (node->lhs))->generate_lvalue_code((LValue *) (node->lhs));
node->rhs->generate_rvalue_code(node->rhs);
char *rvalue_ir = node->rhs->rvalue_ir(node->rhs);
char *lvalue_ir = node->lhs->lvalue_ir(node->lhs);
printf("store %s %s, %s* %s\n",
type_string(((RValue *) (node->lhs))->type),
rvalue_ir,
type_string(((RValue *) (node->lhs))->type),
lvalue_ir);
free(rvalue_ir);
free(lvalue_ir);
}
For an expression we heed to handle it's rvalue and lvalue separately.
// rvalue needs to read to the register first
static void generate_rvalue_code(VariableReference *rValue) {
char *rvalue_ir_string = ((RValue *) rValue)->rvalue_ir((RValue*) rValue);
char *lvalue_ir_string = ((LValue*) rValue)->lvalue_ir((LValue *) rValue);
printf("%s = load %s, %s* %s\n",
rvalue_ir_string,
type_name(rValue->variable),
type_name(rValue->variable),
lvalue_ir_string);
free(rvalue_ir_string);
free(lvalue_ir_string);
}
// while lvalue doesn't
static void generate_lvalue_code(VariableReference *lValue) {
}
// variable name when it is used as a lvalue
static char *lvalue_ir(VariableReference *lValue) {
char *result = malloc(128);
sprintf(result, "%%%s_%zu", lValue->variable->name, lValue->variable->namespace_id);
return result;
}
// variable name when it is used as a rvalue
static char *rvalue_ir(VariableReference *rValue) {
char *result = malloc(128);
sprintf(result, "%%temp_%zu", rValue->temp_register_id);
return result;
}
How to generate other codes can be easily figured out by the reader. If you are interested in my implementation, please read the code.
References
Compilation principle
Three famous animal books:
- "Compilers: Principles, Techniques, and Tools" - "Dragon Book"
- "Modern Compiler Implementation" - "Tiger Book"
- "Advanced Compiler Design and Implementation" - "Whale Book", however this book is mostly about backend optimization
Lex & Grammar you may refer to
LLVM IR
The code will use object-oriented programming in C language, see Implementing object-oriented programming in pure C. Using this technique makes me feel that using C language is far more better than using C++ (Linus is not wrong at all).
In fact, I did a little bit of semantic analysis when constructing the AST.
Actually it's because I'm lazy. ;)