Compiler internals
This page is meant to help anyone find their way around the code base of the Nemerle compiler. It is already quite a large project, so a good deal of explanation is required.
The compilation process is divided into several more or less separate phases. Most of them transform some representation of the compiled program, and chained together they yield the binary executable. The passes are glued together in passes.n, and the chain can be customized to some degree by changing compilation options.
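The idea of chaining passes can be illustrated with a small sketch. It is written in Python rather than Nemerle and is not taken from passes.n: the pass names, the `build_pipeline` helper and the `stop_after` option are all hypothetical, chosen only to show how an option might shorten the chain.

```python
def lex(source):
    """Toy pass 1: split source text into whitespace-separated tokens."""
    return source.split()


def group_pairs(tokens):
    """Toy pass 2: group the token list into consecutive pairs."""
    return [tuple(tokens[i:i + 2]) for i in range(0, len(tokens), 2)]


# Hypothetical pass table: each pass consumes the output of the previous one.
PASSES = [("lex", lex), ("group", group_pairs), ("count", len)]


def build_pipeline(passes, stop_after=None):
    """Glue passes into one function; stop_after is an assumed option."""
    selected = []
    for name, fn in passes:
        selected.append(fn)
        if name == stop_after:
            break

    def run(program):
        # Feed the result of every pass into the next one.
        for fn in selected:
            program = fn(program)
        return program

    return run
```

Calling `build_pipeline(PASSES)` runs all three toy passes, while `build_pipeline(PASSES, stop_after="lex")` stops after the first one, in the same spirit as an option that ends compilation early.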
We now describe each pass in the order in which it runs when compiling a typical program.
Before we actually load and analyse any source code, we look at the options and library references specified on the command line. We need to load every class present in the libraries specified by the user (like -ref:bla.dll) and in those needed by the compiler itself (like mscorlib.dll).
All the code devoted to loading external metadata is gathered in ncc/external/. For every assembly we load its classes and place them in the global namespace tree (described in hierarchy building). We analyze the contents of classes lazily, when they are first referenced somewhere in the program. When this happens for some class Foo, we build an instance of a special subclass of Nemerle.Compiler.TypeInfo, which contains information about the external type.
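The lazy analysis described above can be sketched as follows. This is a Python illustration, not the code from ncc/external/: the class `LazyExternalType` and the `read_members` callback are hypothetical stand-ins for the real TypeInfo subclass and the metadata reader.

```python
class LazyExternalType:
    """Placeholder for an external class whose contents are read on demand."""

    def __init__(self, full_name, read_members):
        self.full_name = full_name
        self._read_members = read_members  # assumed callback that reads metadata
        self._members = None               # nothing is analysed yet

    def get_members(self):
        # Analyse the class only on first use, then reuse the cached result.
        if self._members is None:
            self._members = self._read_members(self.full_name)
        return self._members


reads = []  # records which classes were actually analysed


def fake_reader(full_name):
    """Stand-in for reading members from an assembly."""
    reads.append(full_name)
    return ["ToString", "GetHashCode"]


foo = LazyExternalType("Foo", fake_reader)
bar = LazyExternalType("Bar", fake_reader)  # registered but never referenced
first = foo.get_members()
second = foo.get_members()
```

Here both types are registered, but only Foo is ever analysed, and only once, because Bar is never referenced.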
At this stage macros are also loaded and placed in the namespace tree. This is important because macros introduce new syntax, so by the time we begin parsing, the syntax extensions associated with macros are already loaded. The nice thing is that you can tell the compiler to load macros from a library other than the default Nemerle.Macros.dll, and in this way change quite a lot of Nemerle's syntax.
The lexer transforms the text of the compiled files into so-called tokens, which simply represent things like identifiers, numbers, operators, etc. Lexing is done inside the lexer classes and consists of loading a file from disk, going through every character in the file, ignoring whitespace (but counting it to produce correct location information) and returning tokens one by one when requested.
One thing to note here is that the lexer can be asked to recognize a new keyword. This is because our macros allow new syntax with new keywords to be loaded while a file is being parsed (by the using Name.Space; directive).
We also have special lexer subclasses, which analyze a given string instead of a file from disk (LexerString).
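A minimal sketch of such a lexer, in Python rather than Nemerle, is given below. It works on a string (in the spirit of LexerString), skips whitespace while still tracking line and column, hands out tokens on request, and accepts new keywords at any time. The names `StringLexer`, `add_keyword` and `next_token` and the token kinds are assumptions made for this illustration, not the compiler's actual API.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str  # "keyword", "identifier", "number" or "operator"
    text: str
    line: int
    col: int


class StringLexer:
    def __init__(self, text, keywords=()):
        self.text = text
        self.pos = 0
        self.line = 1
        self.col = 1
        self.keywords = set(keywords)

    def add_keyword(self, word):
        """Register a keyword while lexing is already in progress."""
        self.keywords.add(word)

    def _advance(self):
        # Consume one character and keep the location up to date.
        ch = self.text[self.pos]
        self.pos += 1
        if ch == "\n":
            self.line += 1
            self.col = 1
        else:
            self.col += 1
        return ch

    def _take_while(self, accept):
        start = self.pos
        while self.pos < len(self.text) and accept(self.text[self.pos]):
            self._advance()
        return self.text[start:self.pos]

    def next_token(self):
        """Return the next token, or None at the end of input."""
        self._take_while(str.isspace)  # skipped, but still counted
        if self.pos >= len(self.text):
            return None
        line, col = self.line, self.col
        ch = self.text[self.pos]
        if ch.isalpha() or ch == "_":
            word = self._take_while(lambda c: c.isalnum() or c == "_")
            kind = "keyword" if word in self.keywords else "identifier"
            return Token(kind, word, line, col)
        if ch.isdigit():
            return Token("number", self._take_while(str.isdigit), line, col)
        return Token("operator", self._advance(), line, col)
```

Because keywords are consulted at the moment a word is read, a call to `add_keyword` made between two `next_token` calls affects only the tokens that follow, which mirrors how a using directive can enable new keywords in the middle of a file.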
The pre-parsing phase groups the stream of tokens obtained from the lexer into a tree of parentheses. We distinguish four kinds of them: {}, (), [] and <[ ]>. Tokens inside those parentheses are further divided into groups delimited by special separator tokens. This way we have a general skeleton of the code (a tree based on matched parentheses) quite early in the compilation process, which is useful for our syntax extensions.
This pass is quite simple: it sets the Next field of each Token to point to the next token in the stream. For every pair of braces a special node is also created, which holds its inner token stream in its Child field. One more important thing happens in this phase: the preparser recognizes using Name.Space; statements and enables the syntax extensions provided by the loaded macros (they are looked up in the namespace tree by namespace).
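The grouping step can be sketched as below. This is a simplified Python illustration, not the actual preparser: tokens are plain strings, the `next` and `child` attributes play the role of the Next and Child fields, and the further splitting of each group by separator tokens is left out. The names `preparse`, `to_list` and `is_group` are hypothetical.

```python
CLOSER = {"{": "}", "(": ")", "[": "]", "<[": "]>"}


class Token:
    def __init__(self, text, is_group=False):
        self.text = text
        self.is_group = is_group
        self.next = None   # next token on the same nesting level
        self.child = None  # first inner token (group nodes only)


def preparse(texts):
    """Link a flat list of token strings into a tree of matched brackets."""
    pos = 0

    def parse_level(closer):
        nonlocal pos
        head = tail = None
        while pos < len(texts):
            text = texts[pos]
            pos += 1
            if text == closer:
                return head  # this nesting level is complete
            if text in CLOSER:
                node = Token(text + CLOSER[text], is_group=True)
                node.child = parse_level(CLOSER[text])
            elif text in CLOSER.values():
                raise SyntaxError(f"unexpected {text!r}")
            else:
                node = Token(text)
            if head is None:
                head = node
            else:
                tail.next = node
            tail = node
        if closer is not None:
            raise SyntaxError(f"missing {closer!r}")
        return head

    return parse_level(None)


def to_list(token):
    """Render the linked structure as nested lists for inspection."""
    out = []
    while token is not None:
        out.append([token.text, to_list(token.child)] if token.is_group else token.text)
        token = token.next
    return out
```

For the input `["f", "(", "x", ",", "y", ")", "{", "}"]` the top-level stream contains three nodes: `f`, a `()` group holding `x , y`, and an empty `{}` group. Mismatched brackets are reported as soon as they are met.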
The first thing the compiler does with a parsed program is to analyse its general structure: it notes all the classes that are defined, builds the inheritance relation between them, adds members, and checks overrides and interface implementations. Between some of these steps, attribute macros are expanded. These operations are gathered in the ncc/hierarchy/ directory. Here is the exact list of tasks performed, with short explanations:
- Walk the parsed types in all files, create a TypeBuilder for every type and add it to the global namespace tree. Add nested types to a special list in their enclosing types. Here we also expand macros and delegates into their underlying classes. It all happens in ScanTypeHierarchy.n.
- We build the generic environment for every class, with special care for nested types and for merging partial classes (which were gathered in the previous step). This happens in the make_tyenvs function of TypeBuilder.
- Now we expand macros marked as BeforeInheritance.
- Now we build the inheritance relation for classes. Every class (TypeBuilder) is set up with information about all its parents and interfaces, together with substitution information (for example, a class may implement some interface under a particular generic instantiation, like class Int32 : IComparable [Int32]). Here we also determine which type is an interface, a struct, etc., and make them inherit some faked types (for instance, classes are subtypes of object and structs of System.ValueType).
- Macros with BeforeTypedMembers are expanded here.
- Next we create objects representing the members of classes. Every TypeBuilder iterates over its members (in parsed form), creates instances of MemberBuilder subclasses, binds the types specified in their headers (a lookup in the namespace tree is performed), checks some consistency rules about member attributes, etc.
- Macros with the WithTypedMembers flag are expanded on class members, which are now supplied as various kinds of MemberBuilder.
- We check that every interface has been fully implemented, either implicitly or through the explicit implements keyword.
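The ordering of the tasks above, with macro expansion interleaved between the structural steps, can be summarised in a short sketch. It is a Python illustration only: the stage labels paraphrase the list, `run_stages` is a hypothetical helper, and the assumption that each stage finishes for all types before the next one starts is an inference from the list rather than something read from the code in ncc/hierarchy/.

```python
# Stage labels paraphrase the task list; they are not identifiers from the compiler.
STAGES = [
    "scan type hierarchy",
    "build type environments",
    "expand BeforeInheritance macros",
    "bind inheritance",
    "expand BeforeTypedMembers macros",
    "add members",
    "expand WithTypedMembers macros",
    "check interface implementations",
]


def run_stages(type_names, handlers):
    """Run every stage over all types before moving on to the next stage."""
    log = []
    for stage in STAGES:
        handler = handlers.get(stage, lambda name: None)  # default: do nothing
        for name in type_names:
            handler(name)
            log.append((stage, name))
    return log
```

Under that assumption, a BeforeInheritance macro attached to one class has been expanded before the parents of any other class are bound, which is what lets such macros still change what a class inherits from.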
- NamespaceTree holds the whole hierarchy of objects present in the compilation (both loaded from external libraries and defined in the current compilation) according to their names and their nesting inside namespaces and classes. Each class or macro can be found there by its fully qualified name (for example, System.Collections.ArrayList is a leaf on the path through System, Collections and ArrayList). This data structure is organised as a tree, where each node may have some children (nested objects) and a value. The value may be one of several variant options, like Cached for a single, already created TypeInfo object or MacroCall holding an IMacro instance, etc. The code is located in ncc/hierarchy/NamespaceTree.n.
- GlobalEnv represents the set of imported namespaces (by the using Name.Space; construct), the declared using aliases and the current namespace. It is meant as the global context in which a given program fragment resides. Every identifier generated by the parser contains a reference to its context (this is especially useful for macros). This way, when we see for example WriteLine and the current global env contains using System.Console; we can interpret it as System.Console.WriteLine.
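The two structures above can be sketched together. This is a Python illustration under several assumptions: the tree stores plain strings instead of the real variant values (Cached, MacroCall and so on), members such as WriteLine are stored as ordinary leaves for simplicity, using aliases are left out, and the lookup order in `resolve` (current namespace first, then opened namespaces, then the root) is a guess, not the compiler's documented behaviour.

```python
class Node:
    def __init__(self):
        self.children = {}  # nested namespaces, classes, ...
        self.value = None   # payload; None for a plain namespace node


class NamespaceTree:
    def __init__(self):
        self.root = Node()

    def add(self, full_name, value):
        """Store a value at the end of the path named by full_name."""
        node = self.root
        for part in full_name.split("."):
            node = node.children.setdefault(part, Node())
        node.value = value

    def lookup(self, full_name):
        """Return the value stored under full_name, or None if absent."""
        node = self.root
        for part in full_name.split("."):
            node = node.children.get(part)
            if node is None:
                return None
        return node.value


class GlobalEnv:
    def __init__(self, tree, current_namespace="", opened=()):
        self.tree = tree
        self.current_namespace = current_namespace
        self.opened = list(opened)  # namespaces imported with using

    def resolve(self, name):
        """Return the fully qualified name for a short name, or None."""
        # Assumed search order: current namespace, opened ones, then the root.
        for prefix in [self.current_namespace] + self.opened + [""]:
            full = f"{prefix}.{name}" if prefix else name
            if self.tree.lookup(full) is not None:
                return full
        return None


tree = NamespaceTree()
tree.add("System.Collections.ArrayList", "type")
tree.add("System.Console.WriteLine", "method")
env = GlobalEnv(tree, opened=["System.Console"])
```

With this setup `env.resolve("WriteLine")` yields `System.Console.WriteLine`, while `env.resolve("ArrayList")` finds nothing until System.Collections is opened as well.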
Some more info is on the page about type inference.