(Sadly, I will not be sharing my source code, in the interest of upholding academic integrity by not posting solutions to an ongoing piece of coursework.)
So how exactly does a compiler work?
(skip this if you know how a compiler works)
A compiler is nothing more than a piece of software that takes a program written in one language as an input and generates a program in another language as an output. This means that compilers first need to interpret the source file. They do this by splitting the source code up into a stream of tokens (the basic building blocks of the language, e.g. int, =, [), a process called lexing. These tokens are then fed into a syntactic analyzer, which groups tokens together using grammar rules to generate a tree representation of the source code. Several optimization passes are then applied to this tree; this is the most complex and computationally expensive part of most modern compilers. The optimized tree is finally traversed to generate code in the target language.
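For example, for the statement "int x = a + 1;" the first two stages produce roughly the following (the token and node names here are illustrative, not our exact ones):

source:       int x = a + 1;
token stream: int_token identifier("x") '=' identifier("a") '+' constant(1) ';'
syntax tree:
  declaration
    type: int
    name: "x"
    initializer: add
      left:  identifier "a"
      right: constant 1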
Implementation: The Parser
Lexical and syntactic analysis are quite mature fields, and implementing a good parser can be a quite complicated process, so we opted to use preexisting tools to generate the parser for us. To generate our lexer we used flex. Our flex file mostly contains a list of rules mapping regular expressions to the tokens they should return, for example:
// definition section
d [0-9]
l [a-zA-Z_]
%% // rules section
"int" {return(int_token);} // if matching "int", return an int token (the token can't be named after the C++ keyword `int`)
{l}({l}|{d})* {yylval.string = new std::string(yytext); return(identifier);} // a letter or underscore followed by any combination of letters, digits, or underscores returns an identifier token, with yylval.string being set to the string value of the identifier.
%% // user code section
Flex generates a C++ file that defines the function yylex(), which consumes input until it matches a rule, after which it runs our action for that rule. In the case of our identifier, it sets yylval and then returns from yylex() with the token type. Each time yylex() is called, it picks up again where it left off, allowing us to iterate through the tokens in the source file. Flex achieves this by using a DFA to decide how to match tokens.
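As a minimal sketch (not our actual driver; only yylex() and yytext come from flex, the rest is illustrative), a caller can walk the whole token stream like this:

#include <iostream>

extern int yylex();  // defined in the flex-generated file
extern char *yytext; // the text matched by the last rule

void dump_tokens() {
    int token;
    while ((token = yylex()) != 0) { // yylex() returns 0 once the input is exhausted
        std::cout << "token " << token << " from text '" << yytext << "'\n";
    }
}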
The next step is to organize these tokens into a tree resembling the structure of the source code. To do this we used bison, a program that generates a parser from a grammar defined in our .y file (as a side note, the reason bison uses the .y file extension is that bison is a free and open source version of a similar program called YACC), which has a similar structure to our flex file. The bison utility creates a parser implementation file that implements a function called yyparse(), which parses the tokens generated by calling yylex(). The example below shows how assignment expressions are defined:
// declaration section
%token mul_assign div_assign mod_assign add_assign //etc...
%type <node> conditional_expression assignment_expression assignment_operator //etc...
%% // rules section
// an assignment expression is either a conditional expression, or a unary expression followed by an assignment operator followed by an assignment expression.
assignment_expression
// upon a grammar rule being matched, the rendered (what the template gets turned into by bison) version of the C++ template in the {} brackets at the end of each case is run.
// $1, $2, etc. represent the $$ value generated by the rule that matched the nth component (or the raw yylval value in the case of terminal components).
: conditional_expression // implicit {$$ = $1;}
| unary_expression assignment_operator assignment_expression { $$ = new assignment($1, $2, $3); }
;
assignment_operator
: '=' { $$ = assign_op_equals; }
| mul_assign { $$ = mul_op_assign; }
| div_assign { $$ = div_op_assign; }
| mod_assign { $$ = mod_op_assign; }
| add_assign { $$ = add_op_assign; }
;
%% // user code section
As well as defining the grammar, the .y file also contains some other key components, such as the root node declaration, the definition of our parse function, and declarations for all of the variants of the AST node types (also used to generate yylval).
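A minimal sketch, assuming names like g_root and parse_file (ours differ), of what the parse entry point in the user-code section might look like:

#include <cstdio>

extern int yyparse(); // defined in the bison-generated file
extern FILE *yyin;    // the stream flex reads tokens from
extern node *g_root;  // set by the root grammar rule's action

node *parse_file(const char *path) {
    yyin = std::fopen(path, "r");
    if (yyin == nullptr) { return nullptr; }
    g_root = nullptr;
    yyparse();         // repeatedly calls yylex() and builds the AST
    std::fclose(yyin);
    return g_root;     // root of the finished tree
}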
Implementation: Code Generation
Now that we have generated an AST, the next step would normally be to implement a series of optimizations. This is a complicated process and often involves converting the source code into an intermediate representation (IR) for further analysis. One of the most common approaches is to emit an IR like LLVM IR, which can then be optimized and compiled to any instruction set supported by LLVM. Seeing as the goal of this project was to build a simple compiler, we skipped the optimization steps and simply generated our RISC-V assembly straight from the AST.
The source code for our AST is defined over several files, with each node inheriting from a base node and implementing a few basic functions, namely a constructor, a destructor, a debug print, and, most importantly, emitrisc, which generates the assembly code.
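As a rough sketch (assuming these names; our real base class has a few more helpers), the shared interface looks something like this:

class node {
public:
    virtual ~node() {}
    // print a human-readable form of the subtree, for debugging
    virtual void debug_print(std::ostream &stream) const = 0;
    // emit RISC-V assembly for the subtree rooted at this node
    virtual void emitrisc(std::ostream &stream, context &context) const = 0;
};

Take for example a for loop: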
void for_loop::emitrisc(std::ostream &stream, context &context) const { // (the class can't be named after the C++ keyword `for`)
    std::string address = std::to_string(reinterpret_cast<unsigned long long>(this)); // generate a unique label suffix - each node in the AST lives at a unique address
    context.push_continue("post_expr" + address); // continue jumps to the post expression,
    context.push_break("endloop" + address);      // and break jumps past the end of the loop; push both labels to the global context for the body to use
    expression_init_->emitrisc(stream, context);  // generate init statement (EG: int i)
    stream << "j " << "check" << address << "\n"; // jump to checking the condition
    stream << "body" << address << ":\n";         // label for where the body of the for loop begins
    statement_->emitrisc(stream, context);        // generate code for the body
    stream << "post_expr" << address << ":\n";    // label for post expression (EG: i++)
    if (expression_post_ != nullptr) {            // generate post expression if it exists
        expression_post_->emitrisc(stream, context);
    }
    stream << "check" << address << ":\n";        // label for condition (EG: i < 10)
    expression_check_->emitrisc(stream, context); // generate condition logic
    stream << "bnez a0, " << "body" << address << "\n"; // output of condition logic is stored in a0, so branch back to the body if it is non-zero
    stream << "endloop" << address << ":\n";      // endloop label, the target of break statements
    context.pop_continue();                       // pop labels for break and continue
    context.pop_break();
}
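The push_continue/push_break calls above rely on the shared context keeping stacks of jump targets. A minimal sketch of that part, assuming these names match the calls above (the real context tracks more, such as variable locations):

#include <string>
#include <vector>

class context {
    std::vector<std::string> continue_labels_; // innermost loop's continue target on top
    std::vector<std::string> break_labels_;    // innermost loop's break target on top
public:
    void push_continue(const std::string &label) { continue_labels_.push_back(label); }
    void pop_continue() { continue_labels_.pop_back(); }
    void push_break(const std::string &label) { break_labels_.push_back(label); }
    void pop_break() { break_labels_.pop_back(); }
    // a break or continue node simply emits a jump to the innermost label,
    // e.g. stream << "j " << context.break_label() << "\n";
    const std::string &continue_label() const { return continue_labels_.back(); }
    const std::string &break_label() const { return break_labels_.back(); }
};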
As the for loop example shows, each node calls emitrisc on its necessary child nodes, effectively traversing the tree and emitting code as it goes. Some nodes are more complex, requiring a few extra functions; for example, identifiers need to behave differently based on whether they are being assigned to or read from.
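A hedged sketch of that last point (the names, the stack-based storage, and the stack_offset helper here are assumptions, not our exact implementation): an identifier node can expose a second emitter for the assignment path, so an assignment node stores through one while ordinary expressions load through the other.

#include <ostream>
#include <string>

class identifier : public node {
    std::string name_;
public:
    identifier(std::string name) : name_(std::move(name)) {}
    void debug_print(std::ostream &stream) const override { stream << name_; }
    // read path: load the variable's current value into a0
    // (stack_offset() is a hypothetical helper returning the variable's offset from sp)
    void emitrisc(std::ostream &stream, context &context) const override {
        stream << "lw a0, " << context.stack_offset(name_) << "(sp)\n";
    }
    // write path: store whatever is in a0 into the variable's stack slot
    void emitrisc_store(std::ostream &stream, context &context) const {
        stream << "sw a0, " << context.stack_offset(name_) << "(sp)\n";
    }
};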