
Last updated Nov 16, 2021

# Anatomy of a compiler

*Figure: compilation pipeline diagram*

# Front-End analysis

  1. Lexical analysis: the process of taking the source code as a stream of characters and splitting it into tokens. Tokens are sequences of characters that have a collective meaning. (A minimal sketch follows this list.)

  2. Syntax analysis & Parsing: the process of parsing the sequence of tokens produced by lexical analysis and outputting a parse tree or a derivation.

  3. Semantic Analysis: in semantic analysis we check the code for errors that are semantic rather than syntactic, such as improper arguments, access violations and undeclared variables. An example of a semantic error:

     ```python
     foo = [1, 2, 3]
     foo + 2  # You can't add an int to a list
     ```
  4. Intermediate Code Generation: in this step we create the intermediate representation of the source code. The intermediate representation should be easy to generate from the source and easy to translate into the target program. A very common form is three-address code (TAC), a sequence of simple instructions with at most three operands. (A generation sketch follows this list.)

     ```
     real code           TAC
     ------------------  -----------
     a = ( c + b ) * 2   t1 = c + b
                         a = t1 * 2
     ```
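
To make step 1 concrete, here is a minimal Python sketch of a lexer for the kind of statement shown in the table above. The token names and the rule set are illustrative assumptions, not taken from any particular compiler:

```python
import re

# Ordered token rules (illustrative). The first rule that matches wins;
# see "How rules are implemented" below for what real scanners do instead.
TOKEN_RULES = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),           # whitespace is stripped, not emitted
]

def tokenize(source):
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_RULES:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "SKIP":
                    yield (name, m.group())   # a (token, lexeme) pair
                pos += m.end()
                break
        else:
            raise SyntaxError(f"invalid token at position {pos}")

print(list(tokenize("a = ( c + b ) * 2")))
# [('IDENT', 'a'), ('OP', '='), ('LPAREN', '('), ('IDENT', 'c'), ...]
```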
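Step 4 can be sketched just as compactly. Assuming the statement has already been parsed into a tuple-based AST (the tuple encoding and the temporary-naming scheme are assumptions for illustration), TAC falls out of a post-order walk of the tree:

```python
def gen_tac(node, code):
    """Emit TAC for an AST node; return the name holding its value."""
    if isinstance(node, str):      # a variable or constant leaf
        return node
    op, left, right = node         # e.g. ("+", left_subtree, right_subtree)
    l = gen_tac(left, code)
    r = gen_tac(right, code)
    t = f"t{len(code) + 1}"        # fresh temporary for each instruction
    code.append(f"{t} = {l} {op} {r}")
    return t

# a = (c + b) * 2
code = []
result = gen_tac(("*", ("+", "c", "b"), "2"), code)
code.append(f"a = {result}")
print("\n".join(code))
# t1 = c + b
# t2 = t1 * 2
# a = t2
```

A real code generator would fold the final copy into the last instruction, giving exactly the TAC shown in the table above.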

# Back-End analysis (Synthesis)

  1. Intermediate Code Optimisation: this stage accepts the intermediate representation generated in Intermediate Code Generation and applies several optimisation techniques to it, including but not limited to (a sketch of two of these follows this list):
    • suppressing code generation for unreachable code segments,
    • removing unused variables,
    • eliminating multiplication by 1 and addition by 0,
    • loop optimisation (e.g., moving computations whose operands are not modified in the loop out of the loop body),
    • common sub-expression elimination.
  2. Object Code Generation: in this step the target program is generated. The output is usually either machine code or assembly.
  3. Object Code Optimisation: a non-mandatory step that processes the generated machine code and applies hardware-specific optimisations such as special instructions, pipelining and branch prediction.
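
As a concrete illustration of the first stage, here is a minimal Python sketch of two of the optimisations listed above, algebraic simplification and removal of unused variables, on straight-line TAC encoded as (dest, op, arg1, arg2) tuples. The encoding and the "copy" pseudo-op are assumptions for illustration:

```python
def simplify(ins):
    """Algebraic simplification: x * 1 -> x and x + 0 -> x become copies."""
    dest, op, a, b = ins
    if (op, b) in (("*", "1"), ("+", "0")):
        return (dest, "copy", a, None)
    return ins

def eliminate_dead(code, live_out):
    """Backward pass: keep an instruction only if its result is read later."""
    live, kept = set(live_out), []
    for dest, op, a, b in reversed(code):
        if dest in live:
            kept.append((dest, op, a, b))
            live.discard(dest)     # this definition satisfies later uses
            live.update(x for x in (a, b) if x is not None and not x.isdigit())
    return kept[::-1]

code = [("t1", "+", "x", "0"),     # becomes a copy of x
        ("t2", "*", "t1", "1"),    # becomes a copy of t1
        ("t3", "*", "x", "2"),     # never read again -> dead
        ("y",  "+", "t2", "x")]
print(eliminate_dead([simplify(i) for i in code], live_out={"y"}))
# [('t1', 'copy', 'x', None), ('t2', 'copy', 't1', None), ('y', '+', 't2', 'x')]
```

A real optimiser would follow this with copy propagation, after which the two copy instructions disappear as well.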

# Lexical Analysis

*Figure: lexical analysis diagram*

A lexical analyser takes a stream of characters and produces tokens as its output. The particular character sequences it recognises as instances of tokens are called lexemes: a lexeme is the actual sequence of characters forming a token. The scanner’s task is to pick the tokens out of an input stream, but it has no idea where those tokens belong. It can therefore only detect errors caused by invalid tokens; it cannot detect out-of-place tokens, mismatched parentheses and the like. The lexical analyser is also a convenient place to strip out comments and unnecessary white space.

# How rules are implemented

When scanning a stream of characters, an analyser might encounter situations where multiple rules match the same string, or where one rule matches a substring of a longer string matched by another rule. To keep the outcome well defined in those cases, the scanner uses these two principles:

  1. Longest match (maximal munch): the scanner picks the rule that matches the longest possible string from the current position.
  2. Rule priority: if two rules match the same longest string, the rule listed first wins.
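
A minimal Python sketch of both principles, using an illustrative rule set (the token names are assumptions):

```python
import re

# Rules in priority order: the keyword rule is listed before the identifier rule.
RULES = [(name, re.compile(p)) for name, p in [
    ("IF",    r"if"),
    ("IDENT", r"[a-z]+"),
    ("GE",    r">="),
    ("GT",    r">"),
    ("SKIP",  r"\s+"),
]]

def next_token(src, pos):
    # Try every rule and keep the longest match; on a tie the earlier
    # rule wins, because we only replace on a strictly longer match.
    best_name, best_end = None, pos
    for name, pattern in RULES:
        m = pattern.match(src, pos)
        if m and m.end() > best_end:
            best_name, best_end = name, m.end()
    if best_name is None:
        raise SyntaxError(f"invalid token at position {pos}")
    return best_name, best_end

def scan(src):
    pos, out = 0, []
    while pos < len(src):
        name, end = next_token(src, pos)
        if name != "SKIP":
            out.append((name, src[pos:end]))
        pos = end
    return out

print(scan("ifx >= y"))  # [('IDENT', 'ifx'), ('GE', '>='), ('IDENT', 'y')]
print(scan("if x"))      # [('IF', 'if'), ('IDENT', 'x')]
# 'ifx' is one identifier (longest match), not the keyword 'if' plus 'x',
# and '>=' is one token, not '>' then '='; 'if' alone is the keyword
# because the IF rule is listed first (rule priority).
```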

# Scanner Implementation with Regular Expressions and Finite Automata

# Important terminology

# From RegEx to Automata

A finite automaton can be used to implement a scanner in the following manner:

  1. Nondeterministic Finite Automata

    What sets an NFA apart from a Deterministic Finite Automaton (DFA) is that a state can have ε moves (moves that don't consume the input stream) and multiple moves from the same state associated with the same character. These features allow us to more easily generate NFAs from regular expressions.

    An NFA that accepts the RegEx `(0|1)*(000|111)(0|1)*`:

    *Figure: the NFA for this RegEx*

    When generating an NFA from a regular expression, one can follow the following rules:

    • Rule 1: An NFA that accepts any single symbol from the alphabet

      *Figure: Rule 1 NFA*

    • Rule 2: An NFA that accepts only ε

      *Figure: Rule 2 NFA*

    • Rule 3: An ε-NFA that accepts `r1|r2`

      *Figure: Rule 3 NFA*

    • Rule 4: An ε-NFA that accepts `r1r2`

      *Figure: Rule 4 NFA*

    • Rule 5: An ε-NFA that accepts `r1*`

      *Figure: Rule 5 NFA*

  2. Converting from NFA to Deterministic Finite Automata

    In the process of converting an NFA to a DFA we use a technique called subset construction. Each state in the resulting DFA is a set of states from the original NFA; the DFA works over the same alphabet, and its start state is the ε-closure of the NFA's start state. Given a set of NFA states, an input symbol x takes us to every NFA state reachable by an x move followed by any number of ε moves; the combined set of all those states forms one state of the DFA. Each DFA state is therefore a subset of S (the set of states of the NFA), so the DFA can have at most $2^n$ states, n being the size of S. (A sketch follows.)
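
Here is a compact Python sketch of subset construction, run on a small hand-built ε-NFA for the RegEx `a(b|ε)`. The dictionary encoding of the automaton is an assumption for illustration:

```python
# ε-NFA for a(b|ε): states 0..3, "eps" marks ε moves, state 3 accepts.
NFA = {
    (0, "a"):   {1},
    (1, "eps"): {3},      # the ε branch: accept without reading a 'b'
    (1, "b"):   {2},
    (2, "eps"): {3},
}
START, ACCEPT = 0, {3}

def eps_closure(states):
    """All states reachable from `states` using only ε moves."""
    stack, seen = list(states), set(states)
    while stack:
        for t in NFA.get((stack.pop(), "eps"), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(alphabet=("a", "b")):
    start = eps_closure({START})
    seen, todo, dfa = {start}, [start], {}
    while todo:
        state = todo.pop()
        for x in alphabet:
            move = set()                    # every x move from any member...
            for s in state:
                move |= NFA.get((s, x), set())
            target = eps_closure(move)      # ...then ε-close the result
            dfa[(state, x)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, dfa

start, dfa = subset_construction()
for (state, x), target in sorted(dfa.items(), key=str):
    accepting = " (accepting)" if target & ACCEPT else ""
    print(sorted(state), f"--{x}-->", sorted(target), accepting)
```

The empty subset acts as a dead state. With four NFA states the construction could in principle produce up to $2^4 = 16$ DFA states, but only four distinct subsets actually arise here.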

# flex Overview

flex allows you to specify the scanner you want by means of patterns to match and actions to apply. It then uses your specification to generate an NFA, converts it into an equivalent DFA and emits C code that implements that automaton. You can learn more about flex in its manual or by running info flex.

# A flex Input File

```
%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines
```

The optional Declarations and User subroutines sections are copied verbatim into the generated C file. In the Definitions section you specify options for the scanner and set up definitions that give names to regular expressions. The only required section, Rules, lets you specify the patterns that identify your tokens and the actions to perform upon recognising them. Patterns in flex are ordinary RegEx rules.

# flex Global variables

The yylex function takes no arguments and returns an integer: it is the token-grabbing function, and the integer identifies the kind of token that was matched. Everything else about the match is communicated through global variables that the caller can read:

  • yytext: points to the matched lexeme (the exact text of the token),
  • yyleng: the length of the matched lexeme,
  • yyin: the FILE* the scanner reads its input from,
  • yylval: where an action can store the token's semantic value for the parser (declared by the parser generator, e.g. bison).

