Compiler Overview

Introduction

A compiler writer builds bridges between people and machines, and every day this task becomes more challenging.

  1. Software engineers want abstractions
  2. Hardware engineers want efficiency

A compiler has to serve both.

Homework

Most homework follows this pattern:

  1. We talk about some algorithm, using pseudocode
  2. You implement that algorithm
  3. You write up a blog post explaining what happened
  4. If you like, up to three people can submit a homework together

Since most programming languages have a JSON library, you can do the homework in any language you like.

Readings

We are going to critically read current research papers:

  1. Each person leads a paper discussion (up to two people can sign up as a team to present the same paper)
  2. Everyone reads the paper; the leader goes over its contents, pros, and cons
  3. The leader writes a blog post (possibly including insights from the discussion)
  4. The blog post is due one week after the presentation
  5. I recommend that people pair up (two people going over a paper beforehand is a lot easier)
  6. I listed a lot of papers, but if there is a different paper you want to present, let me know

Project

Everybody gets to do a compiler-related project:

  1. You will need to get a proposal approved halfway through the term
  2. You submit a blog post reporting on what happened
  3. If you like, up to three people can submit a project together

Homework 0

Write a paragraph to introduce yourself in a reply to the Canvas introductions discussion topic. Add a picture if you can. Mention a compilers topic you’d like to learn about someday, either in this class or beyond.

Pick a paper from the weekly schedule whose discussion you will lead. Claim the paper by opening a pull request (at the class GitHub) against the weekly.qmd file, filling in your name on the LEADER: line. (I encourage teams of two to sign up for the same paper.)

Once everyone has signed up and I see which papers are covered, I’ll finalize the dates and times.

Add a text file containing “done” to Canvas assignment 0 to indicate you have written the introduction and claimed a paper.

  1. For this assignment, you just need to submit a response to the Canvas assignment indicating that you are done, after you write your introduction in Canvas

  2. For other assignments you should:

    1. Write a blog post describing your work, and submit it via a pull request to the GitHub page
    2. Add a response to the Canvas assignment giving the name of your blog post

My plan is that grades, personal details, and the like stay in Canvas, and everything else becomes public and goes on the GitHub website.

Early Compilers

Originally, a compiler was a person doing calculations.

Hidden Figures


In 1952, Grace Hopper wrote an operational link-loader, which she called a compiler. She later said, “Nobody believed that,” and that she “had a running compiler and nobody would touch it. They told me computers could only do arithmetic.”

Grace Hopper

FORTRAN

In 1957, John Backus and his team created the first commercial compiler, FORTRAN (14 people worked on it for about four years).

Their paper is located at https://dl.acm.org/doi/10.1145/1455567.1455599.

The name stands for “formula translation.” It’s in upper case because at that time, computer character sets did not include lower case.

The FORTRAN project was begun in the summer of 1954. Its purpose was to reduce by a large factor the task of preparing scientific problems for IBM’s next large computer, the 704. If it were possible for the 704 to code problems for itself and produce as good programs as human coders (but without the errors), it was clear that large benefits could be achieved. For it was known that about two-thirds of the cost of solving most scientific and engineering problems on large computers was that of problem preparation. Furthermore, more than 90 per cent of the elapsed time for a problem was usually devoted to planning, writing, and debugging the program. In many cases the development of a general plan for solving a problem was a small job in comparison to the task of devising and coding machine procedures to carry out the plan.

The goal of the FORTRAN project was to enable the programmer to specify a numerical procedure using a concise language like that of mathematics and obtain automatically from this specification an efficient 704 program to carry out the procedure. It was expected that such a system would reduce the coding and debugging task to less than one-fifth of the job it had been.

FORTRAN was provided for the IBM 1401 computer by an innovative 63-phase compiler that ran entirely in its core memory of only 8000 (six-bit) characters.

Compiler Development Model

In these early years, the vendor development model was:

  1. build a new machine
  2. design a new language
  3. implement the compiler

Vendors sometimes built compilers but often used small startup compiler companies.


Compilers stabilized on a classic structure built around an intermediate representation (IR). The IR is machine independent.

  1. Front end - parse the program into IR
  2. Middle end - machine-independent optimizations and analyses
  3. Back end - machine-specific work, where machine code is generated

```{mermaid}
%%| echo: false
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR;
A[Front end]--IR--> B[Middle end];
B--IR--> C[Back end];
A--IR--> C;
```
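
As a rough sketch, you can think of the stage boundaries as function signatures. The `IR` type and function names below are hypothetical, just to show that the IR is the only interchange format between stages:

```python
# A schematic of the classic three-stage structure (hypothetical names;
# real compilers are, of course, vastly larger).

def front_end(source: str) -> "IR":
    """Parse the source program and lower it to the machine-independent IR."""
    ...

def middle_end(ir: "IR") -> "IR":
    """Run machine-independent analyses and optimizations: IR in, IR out."""
    ...

def back_end(ir: "IR") -> bytes:
    """Do the machine-specific work, ending with machine-code emission."""
    ...

def compile_program(source: str) -> bytes:
    # The stages communicate only through the IR, so a vendor can reuse
    # an existing front end and middle end and write just a new back end.
    return back_end(middle_end(front_end(source)))
```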



This course focuses on stage 2 (the middle end).

A goal of this course is to explain how to transform a program automatically, while preserving its semantics, in such a way that the new program is more efficient according to a well-defined metric.

There are many ways to compare the performance of programs:

  1. Time
  2. Space
  3. Energy

GCC

In 1987, GCC was released. It formalized the IR and was more or less open source. Within the stages, compiler writers could use any data structures, but at the edges they had to use the single IR. Adding an optimization or reordering optimizations was quite hard.

Vendors could use one front end, one middle end and only need to write a new back end.


This ended almost all of the compiler startups: the front end and middle end were now free.

In GCC the IR is somewhat C-based: for instance, there are pointers, but there is no simple way to express garbage collection without hacks.

LLVM

In about 2006, LLVM (originally “Low Level Virtual Machine”) appeared. It changed the model to look like a library.

The core of LLVM is the intermediate representation (IR), a low-level programming language similar to assembly. IR is a strongly typed reduced instruction set computer (RISC) instruction set which abstracts away most details of the target.
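
For example, a function that adds two 32-bit integers might look like this in LLVM IR (an illustrative snippet, not drawn from the course materials):

```llvm
; Every value carries an explicit type, and the instructions are
; simple RISC-style operations.
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}
```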


```{mermaid}
%%| echo: false
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR;
A[Front end]--IR--> B0[OPT0] --IR--> B1[OPT1] --IR--> B2[OPT2] --IR--> BN[OPTn] --IR--> C[Back end];
A--IR--> C;
```



Optimizations became passes. A user could mix and match, running some optimizations but not others to compile a specific program. It became easy for people to add a pass, which led to lots of academic research and lots of experiments.
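
For example, LLVM’s `opt` tool runs a chosen sequence of passes over an IR file. A sketch of such an invocation (the particular passes are an arbitrary illustration; the `-passes` syntax assumes a recent LLVM with the new pass manager):

```sh
opt -passes='mem2reg,instcombine,gvn' input.ll -S -o output.ll
```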


Bril

In this course we are going to use an IR called Bril (Big Red Intermediate Language), which is a very simplified version of LLVM IR, and we are going to string passes together using UNIX pipes.

```{mermaid}
%%| echo: false
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR;
A[Text version of Bril]--> B0[Bril in JSON] --> B1["new pass"] --> B2[Bril interpreter];
```

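A minimal sketch of what a pass can look like, assuming the standard Bril JSON layout (a program is an object with a "functions" list, and each function has an "instrs" list). This example pass just deletes `nop` instructions:

```python
import json
import sys

def remove_nops(prog):
    """A trivial example pass: drop every 'nop' instruction."""
    for func in prog["functions"]:
        # Labels have no "op" key, so .get() leaves them untouched.
        func["instrs"] = [i for i in func["instrs"] if i.get("op") != "nop"]
    return prog

if __name__ == "__main__":
    program = json.load(sys.stdin)                # Bril JSON from the previous stage
    json.dump(remove_nops(program), sys.stdout)   # pass it down the pipe
```

You would then run it in a pipeline such as `bril2json < prog.bril | python3 remove_nops.py | brili`.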

Cost of a compiler

Compilers are massive and expensive to build.

| Compiler | Year Started | Developers | Lines of Code | Est. Cost ($) |
|---|---|---|---|---|
| GCC 9.2.0 | 1988 | 617 | 5,591,759 | 425,747,279 |
| LLVM 8.0.1 | 2001 | 1,210 | 6,877,489 | 529,894,190 |
| OpenJDK 14+10 | 2007 | 883 | 7,955,827 | 616,517,789 |
| v8 7.8.112 | 2008 | 736 | 3,043,793 | 225,195,832 |
| Rust 1.37.0 | 2010 | 2,737 | 852,877 | 59,109,425 |
| Swift | 2010 | 857 | 665,238 | 45,535,689 |
| Intel Graphics 1.0.10 | 2018 | 149 | 694,688 | 46,934,626 |

source


Some observations:

  1. Production compilers are expensive.
  2. IR does not change easily.
  3. Much of compiler technology is old.
  4. There is a vast difference between production and student projects.

Compiler Assumptions (How many are still true?)

  1. The time to compile a program should be roughly linear. So, non-linear algorithms can only be used if they work on a small part of a program.
  2. Users are OK with large programs taking minutes to compile.
  3. Compilers run on machines that are memory-limited.
  4. Compilers run on single-threaded machines.
  5. Most targets are C-like.

Some changes since the early 2000s:

  1. Integrated development environments. When you type `a.b`, what has to happen?
  2. DSLs (domain-specific languages, for example for AI)
  3. More kinds of hardware

How well do compilers do?

At the scale of data centers, every single performance percent matters! Just take a look at Google’s (and others’) publicly available numbers on data-center expenditures. We are talking about billions of dollars. A single percent improvement can mean millions of dollars from more program features or improved utilization.

proebsting’s law

proebsting’s law

“Compiler advances double computing power every 18 years,” while hardware computing horsepower doubles every 18 months. How would you prove this?
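
A quick back-of-the-envelope reading of the claim: doubling every 18 years means compilers buy about $2^{1/18} \approx 1.04$, or 4% per year, while hardware doubling every 18 months buys about $2^{12/18} \approx 1.59$, or 60% per year. One way to test it is to compare the same programs compiled with optimizations disabled and fully enabled: if roughly 36 years of compiler development yields about a 4x speedup, that is one doubling every 18 years.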

one attempt

why compilers are not better

Talk given by Arch Robison of KAI (Kuck & Associates, Inc.)

Compile-time program optimizations are similar to poetry: more are written than actually published in commercial compilers. Hard economic reality is that many interesting optimizations have too narrow an audience to justify their cost in a general-purpose compiler and custom compilers are too expensive to write.

effects of optimization

Remove the performance penalty for:

  1. using higher-level constructs
  2. safety checks (e.g., array bounds checks)
  3. writing clean, simple code (there should be no benefit to applying loop unrolling by hand)
  4. abstract data types (ADTs), which should be as efficient as primitive types

Over time, hardware has become more of a challenge for compilers; for example, cache behavior is not predictable at compile time, so compilers have to guess.

And hardware can omit features if the compiler can deal with them; for example, pipeline interlocks (MIPS originally stood for “Microprocessor without Interlocked Pipeline Stages”).
