In Bril and LLVM a program can use an arbitrary number of registers,
void f() {
int x0 = compute(0);
int x1 = compute(1);
// ...
int x99 = compute(99);
// --- 100 possibly different values were computed
// --- where are those values stored?
consume(x0);
consume(x1);
// ...
consume(x99);
}
one option
Allocate a memory location for each local variable. All of the \(x_N\) variables live in memory.
When the function is called, it allocates a new area on the stack called the stack frame and uses it to store local variables.
This means that adding two variables, takes two loads, one add, and one store so it is very slow .
Compiling code in this way is very fast because we need to make almost no decisions: a variable reference always becomes a memory load,
faster would be to keep variables in registers but, we have a limited set of registers
Register allocation: is assigning a value in the program to a register for storage. The register allocator decides how to shuffle values between memory and registers, and between register.
In Bril and LLVM we have virtual registers - as many as you want. The register allocator has to rewrite the instructions to use physical registers. Since the number of physical registers is limited, The allocator might insert additional instructions:
The locations in memory are usually on the stack and are called spill-slots
virtual register code | physical register code | assigments |
---|---|---|
{v0 -> r0, v1 -> r1} | ||
store r1, [sp+0] | ||
add v2, v0, v1 | add r1, r0, r1 | {v0 -> r0, v1->[sp+0], v2->r1} |
sub v3, v2, v0 | sub r1, r1, r0 | |
load r0, [sp+0] | ||
mul v4, v3, v1 | mul r0, r1, r0 | {v4->r0} |
store v4, [sp+48] | store r0, [sp+48] |
if you do register allocation for code that is not in SSA, this is NP-complete But if you do it on code that is in SSA, the time complexity is polynomial.
There are lots of approximate algorithms- all complicated, lots of machines have extra constraints for instance there is a GPU load instruction that read 128 bits from memory and puts the value into 4 consecutive registers
however, there is no known fast algorithm for deciding which registers to spill
I used this code
def select_spill_register(self, live: Set[str], interference_graph) -> str:
# for each register in live set, that can be spilled, find the register with the most neighbors in the interference graph
max_neighbors = -1
spill_register = None
for reg in live:
if reg in self.spilled_registers:
continue
if can_spill(reg)
neighbors = len(interference_graph.graph[reg])
if neighbors > max_neighbors:
max_neighbors = neighbors
spill_register = reg
self.spilled_registers.append(spill_register)
return spill_register
never spill the stack pointer
Before and after the allocator, we have the same instructions (except for those added by the allocator)
Assume we have a machine with an infinite register set and a second machine with a finite register set.
Correct means both programs executed on these two machines get the same answer for all possible inputs
How do we test this equivalence?
pick a random program and a random input. interpret and see if the result is the same.
Could try more random inputs, could generate more random programs (fuzzer tools)
Could reasonably confident but not 100% and very expensive
// original // allocated
ld v0, [A] ld r0, [A]
ld v1, [B] ld r1, [B]
ld v2, [C] ld r2, [C]
add v3, v0, v1 add r0, r0, r1
add v4, v2, v3 add r0, r2, r0
return v4 return r0
v0 -> vn 1 = ld [A] r0 -> vn 1 = ld [A]
v1 -> vn 2 = ld [B] r1 -> vn 2 = ld [B]
v2 -> vn 3 = ld [C] r2 -> vn 3 = ld [C]
v3 -> vn 4 = add 1,2 r0 -> vn 4 = add 1,2
v4 -> vn 5 = add 3,4 r0 -> vn 5 = add 3,4
return vn 5 return vn 5
// original // allocated
ld v0, [A] ld r0, [A]
write r0 holds v0
ld v1, [B] ld r1, [B]
write r1 holds v1
ld v2, [C] ld r2, [C]
write r2 holds v2
add v3, v0, v1 add r0, r0, r1
read r0 - holds v0 - match
read r1 - holds v1 - match
write r0 holds v3
add v4, v2, v3 add r0, r2, r0
read r2 holds v2 - match
read r0 holds v3 - match
write r0 holds v4
return v4 return r0
read r0 holds v4 match
## check more then one program (all programs at once)
This requires a proof that the two programs get the same result - this is an active research question -
some success but not easy
not used in production
## best we can do is generate lots of programs check each one
```{mermaid}
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
A[Virtual code]
B[Register allocator]
C[Machine code]
D[Checker]
E[Fuzzing engine]
A--> B
B--> C
C --> D
D --> E
E --> A
A --> D
We could use a fuzzer to generate random programs or we could use a test set
for each instruction we need to form pairs - virtual and physical register that holds the same value
for instruction v and p, check that the arguments are equal, if not fail add the pair dest of v == dest of p
does not matter what the original op code was, just need register names
for each physical-register | vertual register or Unknown for each spill slot | verturaL register or unknown
Treat the allocated program as containing:
Spill
update spill slot state to virtual register if we know the target
Reload
copy state from spill slot to physical register
copy state from physical register to destination
for each read:
check if the state of the physcial register is the corresponding virtual regster- if not we found a bug
for each dest, update the state of the physical dest to the virtural register
@main {
a: int = const 1;
b: int = const 2;
c: int = const 3;
d: int = const 4;
e: int = const 5;
f: int = const 6;
g: int = const 7;
h: int = add a b;
i: int = mul c d;
j: int = sub e f;
k: int = div g a;
l: int = add h i;
m: int = mul j k;
print l m;
}
Assume we have 5 registers (will need to spill)
In Bril we allocate a stack by:
pr1: int = const 7; ## register that sets the size of the stack
pr3: ptr<int> = alloc pr1; ## allocate a block of memory for the stack
Store into the stack, at fixed offset 2, assume stack ptr is in pr3
pr2: int = const 2;
pr2 = ptradd pr3 pr2;
store pr2 pr1;
load from the stack,
pr2: int = const 1;
pr2 = ptradd pr3 pr2;
pr2 = load pr2;
@main {
pr1: int = const 7;
pr3: ptr<int> = alloc pr1; allocate stack with 7 ints
pr1: int = const 1;
pr2: int = const 2;
pr2 = ptradd pr3 pr2; stack[2] = pr1
store pr2 pr1;
pr2: int = const 2;
pr4: int = const 3;
pr1: int = const 4;
pr5: int = const 6;
pr5 = ptradd pr3 pr5; stack[6] = pr1
store pr5 pr1;
pr1: int = const 5;
pr5: int = const 4;
pr5 = ptradd pr3 pr5; stack[3] = pr1
store pr5 pr1;
pr1: int = const 6;
pr5: int = const 5;
pr5 = ptradd pr3 pr5; stack[5] = pr1
store pr5 pr1;
pr1: int = const 7;
pr5: int = const 3;
pr5 = ptradd pr3 pr5; stack[3] = pr1
store pr5 pr1;
pr5: int = const 2;
pr5 = ptradd pr3 pr5; pr5 = stack[2]
pr5 = load pr5;
pr2: int = add pr5 pr2;
pr5: int = const 0;
pr5 = ptradd pr3 pr5; stack[0] = pr2
store pr5 pr2;
pr1: int = const 6;
pr1 = ptradd pr3 pr1; pr1 = stack[6]
pr1 = load pr1;
pr2: int = mul pr4 pr1;
pr4: int = const 1;
pr4 = ptradd pr3 pr4; pr2 = stack[1]
store pr4 pr2;
pr2: int = const 4;
pr2 = ptradd pr3 pr2; pr2 = stack[4]
pr2 = load pr2;
pr1: int = const 5;
pr1 = ptradd pr3 pr1; pr1 = stack[5]
pr1 = load pr1;
pr4: int = sub pr2 pr1;
pr2: int = const 2;
pr2 = ptradd pr3 pr2; pr3 = stack[2]
pr2 = load pr2;
pr1: int = const 3;
pr1 = ptradd pr3 pr1; pr1 = stack[3]
pr1 = load pr1;
pr1: int = div pr1 pr2;
pr5: int = const 0;
pr5 = ptradd pr3 pr5; pr5 = stack[-]
pr5 = load pr5;
pr2: int = const 1;
pr2 = ptradd pr3 pr2; pr2 = stack[1]
pr2 = load pr2;
pr2: int = add pr5 pr2;
pr1: int = mul pr4 pr1;
print pr2 pr1;
free pr3;
}
@main {
pr1: int = const 1;
stack[2] = pr1
pr2: int = const 2;
pr4: int = const 3;
pr1: int = const 4;
stack[6] = pr1
pr1: int = const 5;
stack[3] = pr1
pr1: int = const 6;
stack[5] = pr1
pr1: int = const 7;
stack[3] = pr1
pr5 = stack[2]
pr2: int = add pr5 pr2;
stack[0] = pr2
pr1 = stack[6]
pr2: int = mul pr4 pr1;
pr2 = stack[1]
pr2 = stack[4]
pr1 = stack[5]
pr4: int = sub pr2 pr1;
pr3 = stack[2]
pr1 = stack[3]
pr1: int = div pr1 pr2;
pr5 = stack[0]
pr2 = stack[1]
pr2: int = add pr5 pr2;
pr1: int = mul pr4 pr1;
print pr2 pr1;
}
pr3 is the stack pointer
compiler has to mark all the spills and reloads
virtual | physical | regs | stack |
---|---|---|---|
a=const | pr1 = const | pr1=a | |
stack[2]=pr1 | pr1=a | stack[2] = a | |
b=const | pr2=const | pr1=a, pr2=b | stack[2] = a |
c=const | pr1=const | pr1=c, pr2=b | stack[2] = a |
stack[6]=pr1 | pr1=c, pr2=b | stack[2] = a, stack[6]=c | |
d=const | pr4=const | pr1=c, pr2=b, pr4=d | stack[2] = a, stack[6]=c |
e=const | pr1= const | pr1=e, pr2=b, pr4=d | stack[2] = a, stack[6]=c |
stack[5]=pr1 | pr1=e, pr2=b, pr4=d | stack[2] = a, stack[5] = e, stack[6]=c | |
f=const | pr1=const | pr1=f, pr2=b, pr4=d | stack[2] = a, stack[5] = e, stack[6]=c |
stack[4]=pr1 | pr1=f, pr2=b, pr4=d | stack[2] = a, stack[4] = f, stack[5] = e, stack[6]=c | |
g=const | pr1=const | pr1=g, pr2=b, pr4=d | stack[2] = a, stack[4] = f, stack[5] = e, stack[6]=c |
stack[3] = pr1 | pr1=g, pr2=b, pr4=d | stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c | |
pr5= stack[2] | pr1=g, pr2=b, pr4=d, pr5=a | stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c | |
h=a+b | pr2=pr5+pr2 | pr1=g, pr2=h, pr4=d, pr5=a | stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c |
stack[0]=pr2 | pr1=g, pr2=h, pr4=d, pr5=a | stack[0]=h, stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c | |
pr1=stack[6] | pr1=c, pr2=h, pr4=d, pr5=a | stack[0]=h, stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c | |
i=c*d | pr2=pr1*pr4 | pr1=c, pr2=i, pr4=d, pr5=a | stack[0]=h, stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c |
stack[1]=pr2 | pr1=c, pr2=i, pr4=d, pr5=a | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c | |
pr2=stack[4] | pr1=c, pr2=f, pr4=d, pr5=a | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c | |
pr1=stack[5] | pr1=e, pr2=f, pr4=d, pr5=a | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c | |
j=e-f | pr4=pr1-pr2 | pr1=e, pr2=f, pr4=j, pr5=a | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c |
pr2=stack[2] | pr1=e, pr2=a, pr4=j, pr5=a | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c | |
pr1=stack[3] | pr1=g, pr2=a, pr4=j, pr5=a | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c | |
k=g/a | pr1= pr1/pr2 | pr1=k, pr2=a, pr4=j, pr5=a | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c |
pr5=stack[0] | pr1=k, pr2=a, pr4=j, pr5=h | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c | |
pr2= stack[1] | pr1=k, pr2=i, pr4=j, pr5=h | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c | |
l=h+i | pr2=pr5+pr2 | pr1=k, pr2=l, pr4=j, pr5=h | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c |
m= j*k | pr1=pr4*pr1 | pr1=m, pr2=l, pr4=j, pr5=h | stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c |
print l,m | print pr2,pr1 |