A compiler writer builds bridges between people and machines, and every day this task becomes more challenging.
Software engineers want abstractions.
Hardware engineers want efficiency.
A compiler has to deliver both.
Homework
Most homework follows the same pattern:
We talk about some algorithm using pseudocode.
You implement that algorithm.
You write up a blog post explaining what happened.
If you like, up to 3 people can submit a homework together.
Since most programming languages have a JSON library, you can do the homework in any language you like.
Readings
We are going to critically read current research papers.
Each person leads a paper discussion (up to two people can sign up as a team to present the same paper).
Everyone reads the paper; the leader goes over the contents, pros, and cons.
The leader writes a blog post (possibly including insights from the discussion).
The blog post is due one week after the presentation.
I recommend that people pair up (two people going over a paper beforehand is a lot easier).
I listed a lot of papers, but if there is a different paper you want to present, let me know.
Project
Everybody gets to do a compiler-related project.
You will need to get a proposal approved halfway through the term.
You submit a blog post reporting on what happened.
If you like, up to 3 people can submit a project together.
Homework 0
Write a paragraph introducing yourself in a reply to the Canvas introductions discussion topic. Add a picture if you can. Mention a compilers topic you’d like to learn about someday, either in this class or beyond.
Pick a paper from the weekly schedule whose discussion you will lead. Claim the paper by opening a pull request (at the class GitHub) against the weekly.qmd file, filling in your name on the LEADER: line. (I encourage teams of two to sign up for the same paper.)
Once everyone has signed up and I can see which papers are covered, I’ll finalize the dates and times.
Add a text file containing "done" to Canvas assignment 0 to indicate that you have written your introduction and claimed a paper. For this assignment, that response is all you need to submit.
For other assignments you should:
Write a blog post describing your work, and submit it via a pull request to the GitHub page.
Add a response to the Canvas assignment giving the name of your blog post.
My plan is that grades, personal details, and the like stay in Canvas; everything else becomes public and goes on the GitHub website.
Early Compilers
Originally, a compiler was a person doing calculations.
In 1952, Grace Hopper wrote an operational link-loader, which she called a compiler. She later said, “Nobody believed that,” and that she “had a running compiler and nobody would touch it. They told me computers could only do arithmetic.”
Grace Hopper
FORTRAN
In 1957, John Backus created the first commercial compiler, FORTRAN (14 people worked on it for about 4 years).
The name stands for formula translation. It’s in upper case because the computers of that era did not support lower case.
The FORTRAN project was begun in the summer of 1954. Its purpose was to reduce by a large factor the task of preparing scientific problems for IBM’s next large computer, the 704. If it were possible for the 704 to code problems for itself and produce as good programs as human coders (but without the errors), it was clear that large benefits could be achieved. For it was known that about two-thirds of the cost of solving most scientific and engineering problems on large computers was that of problem preparation. Furthermore, more than 90 per cent of the elapsed time for a problem was usually devoted to planning, writing, and debugging the program. In many cases the development of a general plan for solving a problem was a small job in comparison to the task of devising and coding machine procedures to carry out the plan.
The goal of the FORTRAN project was to enable the programmer to specify a numerical procedure using a concise language like that of mathematics and obtain automatically from this specification an efficient 704 program to carry out the procedure. It was expected that such a system would reduce the coding and debugging task to less than one-fifth of the job it had been.
FORTRAN was provided for the IBM 1401 computer by an innovative 63-phase compiler that ran entirely in its core memory of only 8000 (six-bit) characters.
Compiler Development Model
In these early years, the vendor development model was:
build a new machine
design a new language
implement the compiler
Vendors sometimes built the compilers themselves but often relied on small compiler startups.
Compilers stabilized on a classic structure built around an intermediate representation (IR). The IR is machine independent.
Front end - parses the program into the IR
Middle end - machine-independent optimizations and analyses
Back end - the machine-specific part, where target code is generated
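To make the split concrete, here is a toy sketch in Python (every name and the miniature source language are made up purely for illustration; real compilers are vastly more involved). The IR is a list of stack-machine instructions that knows nothing about the target; only the back end does.

```python
# Toy three-stage compiler (illustrative sketch only; all names hypothetical).

def front_end(source: str) -> list:
    """Parse a tiny language of integer additions, e.g. "1 + 2 + x",
    into a machine-independent stack IR."""
    ir = []
    operands = [t for t in source.split() if t != "+"]
    for i, tok in enumerate(operands):
        ir.append(("push", int(tok)) if tok.isdigit() else ("load", tok))
        if i > 0:
            ir.append(("add",))
    return ir

def middle_end(ir: list) -> list:
    """Machine-independent optimization: constant-fold
    `push k; push m; add` into `push (k + m)`."""
    out = []
    for instr in ir:
        if (instr == ("add",) and len(out) >= 2
                and out[-1][0] == "push" and out[-2][0] == "push"):
            m = out.pop()[1]
            k = out.pop()[1]
            out.append(("push", k + m))
        else:
            out.append(instr)
    return out

def back_end(ir: list) -> str:
    """Machine-specific stage: emit textual "assembly" for an imaginary target."""
    templates = {"push": "PUSH #{}", "load": "LOAD {}", "add": "ADD"}
    return "\n".join("  " + templates[op[0]].format(*op[1:]) for op in ir)

print(back_end(middle_end(front_end("1 + 2 + x"))))
# prints PUSH #3 / LOAD x / ADD -- the middle end folded 1 + 2 into 3
```

Note that the constant folding never mentions the target; that is what makes the middle end reusable across machines.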
A goal of this course is to explain how to transform a program automatically, while preserving its semantics, in such a way that the new program is more efficient according to a well-defined metric.
There are many ways to compare the performance of programs:
Time
Space
Energy
gcc
In 1987, GCC was released. It formalized the IR and was more or less open source. Within a stage, compiler writers could use any data structures they liked, but at the edges between stages they had to use the single shared IR. Adding an optimization, or reordering the optimizations, is quite hard.
Vendors could reuse the front end and middle end and only needed to write a new back end.
This ended almost all of the compiler startups: the front end and middle end were now free.
In GCC the IR is somewhat C-based; for instance, there are pointers, but there is no simple way to talk about garbage collection without hacks.
LLVM
In about 2006, LLVM (originally Low Level Virtual Machine) appeared. It changed the model to look like a library.
The core of LLVM is the intermediate representation (IR), a low-level programming language similar to assembly. The IR is a strongly typed, reduced instruction set computer (RISC)-style instruction set that abstracts away most details of the target.
Optimizations are organized into passes. A user can mix and match: run some optimizations but not others to compile a specific program. It became easy for people to add a pass, which enabled lots of academic research and lots of experiments.
bril
In this course we are going to use an IR called BRIL, which is a very simplified version of LLVM IR, and we are going to string passes together using UNIX pipes.
```{mermaid}
%%| echo: false
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
  A[Text version of BRIL] --> B0[BRIL in JSON] --> B1["new pass"] --> B2[BRIL interpreter]
```
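Concretely (assuming the standard BRIL tools: bril2json, which converts the text form to JSON, and brili, the interpreter), a pass is just a filter: it reads a BRIL program as JSON on stdin and writes the transformed program as JSON on stdout. Here is a minimal sketch of a pass that deletes nop instructions:

```python
import json
import sys

# Minimal BRIL pass (sketch): read a program's JSON form from stdin,
# delete every `nop` instruction, and write the result to stdout.
# Assumes the standard BRIL JSON layout: a top-level "functions" list
# whose entries each carry an "instrs" list of instructions and labels.

def remove_nops(program: dict) -> dict:
    for func in program["functions"]:
        func["instrs"] = [
            instr for instr in func["instrs"]
            if instr.get("op") != "nop"  # labels have no "op" key, so they survive
        ]
    return program

if __name__ == "__main__":
    json.dump(remove_nops(json.load(sys.stdin)), sys.stdout, indent=2)
```

Strung together with UNIX pipes, the diagram above becomes something like `bril2json < prog.bril | python3 remove_nops.py | brili`.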
There is a vast difference between production compilers and student projects like these.
Compiler Assumptions (How many are still true?)
The time to compile a program should be roughly linear in its size. So, non-linear algorithms can only be used if they work on a small part of a program.
Users are ok with large programs taking minutes to compile
Compilers run on machines that are memory-limited.
Compilers run on single-threaded machines.
Most targets are C-like.
Some changes since the early 2000s:
Integrated development environments. When you type a.b, what has to happen for the IDE to offer completions?
DSLs (domain-specific languages, for example for AI)
More kinds of hardware
How well do compilers do?
At the scale of data centers, every single percent of performance matters! Just take a look at Google’s (and others’) publicly available numbers on data center expenditures. We are talking about billions of dollars. A single percent improvement can mean millions of dollars from more program features or improved utilization.
Compile-time program optimizations are similar to poetry: more are written than are actually published in commercial compilers. The hard economic reality is that many interesting optimizations have too narrow an audience to justify their cost in a general-purpose compiler, and custom compilers are too expensive to write.
Effects of optimization
Remove the performance penalty for:
using higher-level constructs
safety checks (e.g., array bounds checks)
writing clean, simple code (there is no benefit to applying loop unrolling by hand; see the sketch after this list)
Encourage ADTs that are as efficient as primitive types.
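As an illustration of that point about hand optimization, here is what loop unrolling looks like, sketched by hand in Python (hypothetical; in a compiled language the compiler performs this rewrite itself, which is exactly why you should write the clean version):

```python
# The clean version you should write:
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc

# What an unrolling optimization effectively produces (sketched by hand
# here only to show the transformation; do not write code like this):
def total_unrolled(xs):
    acc = 0
    n = len(xs)
    i = 0
    while i + 4 <= n:        # four elements per iteration means fewer
        acc += xs[i]         # loop-control tests and jumps
        acc += xs[i + 1]
        acc += xs[i + 2]
        acc += xs[i + 3]
        i += 4
    while i < n:             # handle the leftover elements
        acc += xs[i]
        i += 1
    return acc
```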
Over time, hardware has become more of a challenge for compilers; for example, cache behavior is not predictable at compile time, so compilers have to guess.
And hardware can omit features if the compiler can deal with them; for example, pipeline interlocks, where the compiler schedules instructions to avoid the hazards instead.