Testing Register allocators

This is based on the cranelift compiler used for web-assembly and rust

what is register allocation

In Bril and LLVM a program can use an arbitrary number of registers,

void f() {
    int x0 = compute(0);
    int x1 = compute(1);
    // ...
    int x99 = compute(99);
    
    // --- 100 possibly different values were computed 
    // --- where are those values stored?
    
    consume(x0);
    consume(x1);
    // ...
    consume(x99);
}

storing variables

one option

Allocate a memory location for each local variable. All of the \(x_N\) variables live in memory.

When the function is called, it allocates a new area on the stack called the stack frame and uses it to store local variables.

This means that adding two variables, takes two loads, one add, and one store so it is very slow .

Compiling code in this way is very fast because we need to make almost no decisions: a variable reference always becomes a memory load,

faster would be to keep variables in registers but, we have a limited set of registers

Definitions

Register allocation: is assigning a value in the program to a register for storage. The register allocator decides how to shuffle values between memory and registers, and between register.

In Bril and LLVM we have virtual registers - as many as you want. The register allocator has to rewrite the instructions to use physical registers. Since the number of physical registers is limited, The allocator might insert additional instructions:

spills/reloads

stores (called spills) to move a register to memory
loads (called reloads) to move memory to a register
moves to copy from one register to another

The locations in memory are usually on the stack and are called spill-slots

example of register allocation on a machine with two physical registers

virtual register code	physical register code	assigments
		{v0 -> r0, v1 -> r1}
	store r1, [sp+0]
add v2, v0, v1	add r1, r0, r1	{v0 -> r0, v1->[sp+0], v2->r1}
sub v3, v2, v0	sub r1, r1, r0
	load r0, [sp+0]
mul v4, v3, v1	mul r0, r1, r0	{v4->r0}
store v4, [sp+48]	store r0, [sp+48]

complexity

if you do register allocation for code that is not in SSA, this is NP-complete But if you do it on code that is in SSA, the time complexity is polynomial.

There are lots of approximate algorithms- all complicated, lots of machines have extra constraints for instance there is a GPU load instruction that read 128 bits from memory and puts the value into 4 consecutive registers

however, there is no known fast algorithm for deciding which registers to spill

I used this code

    def select_spill_register(self, live: Set[str], interference_graph) -> str:
        # for each register in live set, that can be spilled, find the register with the most neighbors in the interference graph
        max_neighbors = -1
        spill_register = None
        for reg in live:
            if reg in self.spilled_registers:
                continue
            if can_spill(reg)
                neighbors = len(interference_graph.graph[reg])
                if neighbors > max_neighbors:
                    max_neighbors = neighbors
                    spill_register = reg
        self.spilled_registers.append(spill_register)
        return spill_register

never spill the stack pointer

How to Verify Correctness of an allocator?

Before and after the allocator, we have the same instructions (except for those added by the allocator)

Assume we have a machine with an infinite register set and a second machine with a finite register set.

Correct means both programs executed on these two machines get the same answer for all possible inputs

how do we test this?

How do we test this equivalence?

pick a random program and a random input. interpret and see if the result is the same.

Could try more random inputs, could generate more random programs (fuzzer tools)

Could reasonably confident but not 100% and very expensive

use value numbering check one program, all possible inputs

// original               // allocated 
ld v0, [A]                ld r0, [A]
ld v1, [B]                ld r1, [B]
ld v2, [C]                ld r2, [C]
add v3, v0, v1            add r0, r0, r1
add v4, v2, v3            add r0, r2, r0
return v4                 return r0

v0 -> vn 1 = ld [A]         r0 -> vn 1 = ld [A]
v1 -> vn 2 = ld [B]         r1 -> vn 2 = ld [B]
v2 -> vn 3 = ld [C]         r2 -> vn 3 = ld [C]
v3 -> vn 4 = add 1,2        r0 -> vn 4 = add 1,2 
v4 -> vn 5 = add 3,4        r0 -> vn 5 = add 3,4 
return vn 5               return vn 5

weak value numbering

// original               // allocated 
ld v0, [A]                ld r0, [A]
                          write r0 holds v0
ld v1, [B]                ld r1, [B]
                          write r1 holds v1 
ld v2, [C]                ld r2, [C]
                          write r2 holds v2
add v3, v0, v1            add r0, r0, r1
                          read r0 - holds  v0 - match
                          read r1 - holds  v1 - match
                          write r0 holds v3
add v4, v2, v3            add r0, r2, r0
                          read r2 holds v2 - match
                          read r0 holds v3 - match
                          write r0 holds v4 
return v4                 return r0
                          read r0 holds v4 match 



##  check more then one program  (all programs at once)

This requires a proof that the two programs get the same result - this is an active research question -  

some success but not easy 

not used in production 

## best we can do is generate lots of programs check each one 

```{mermaid}
%%{init: {"flowchart": {"htmlLabels": false}} }%%

graph LR
A[Virtual code]
B[Register allocator]
C[Machine code]
D[Checker]
E[Fuzzing engine]
A--> B
B--> C
C --> D
D --> E
E --> A
A --> D

We could use a fuzzer to generate random programs or we could use a test set

algorithm (linear in number of instructions)

for each instruction we need to form pairs - virtual and physical register that holds the same value

for instruction v and p, check that the arguments are equal, if not fail add the pair dest of v == dest of p

does not matter what the original op code was, just need register names

state (for each physcial register)

for each physical-register | vertual register or Unknown for each spill slot | verturaL register or unknown

Treat the allocated program as containing:

Spill , : copy data (symbol representing virtual register) from a register to a spill slot.

update spill slot state to virtual register if we know the target

Reload , : copy data from a spill slot to a register.

copy state from spill slot to physical register

copy , : move data from one CPU register to another (N.B.: only regalloc-inserted moves are recognized as a Move, not moves in the original input program.)

copy state from physical register to destination

Op read:, read_orig: write: write_orig:: some arbitrary operation that reads some registers and writes some other registers.

for each read:

check if the state of the physcial register is the corresponding virtual regster- if not we found a bug

for each dest, update the state of the physical dest to the virtural register

an example in Bril

@main {
  a: int = const 1;
  b: int = const 2;
  c: int = const 3;
  d: int = const 4;
  e: int = const 5;
  f: int = const 6;
  g: int = const 7;
  h: int = add a b;
  i: int = mul c d;
  j: int = sub e f;
  k: int = div g a;
  l: int = add h i;
  m: int = mul j k;
  print l m;
}

Assume we have 5 registers (will need to spill)

In Bril we allocate a stack by:

  pr1: int = const 7;              ## register that sets the size of the stack
  pr3: ptr<int> = alloc pr1;       ## allocate a block of memory for the stack

Store into the stack, at fixed offset 2, assume stack ptr is in pr3

  pr2: int = const 2;
  pr2 = ptradd pr3 pr2;
  store pr2 pr1;

load from the stack,

  pr2: int = const 1;
  pr2 = ptradd pr3 pr2;
  pr2 = load pr2;

register allocated code

@main {
               pr1: int = const 7;
               pr3: ptr<int> = alloc pr1;  allocate stack with 7 ints 
  pr1: int = const 1;
               pr2: int = const 2; 
               pr2 = ptradd pr3 pr2;   stack[2] = pr1
              store pr2 pr1;
  pr2: int = const 2;
  pr4: int = const 3;
  pr1: int = const 4;

                pr5: int = const 6;
                pr5 = ptradd pr3 pr5;  stack[6] = pr1
                store pr5 pr1;
  pr1: int = const 5;
                pr5: int = const 4;
                pr5 = ptradd pr3 pr5; stack[3] = pr1
                store pr5 pr1;
  pr1: int = const 6;
                 pr5: int = const 5;
                 pr5 = ptradd pr3 pr5; stack[5] = pr1 
                 store pr5 pr1;
  pr1: int = const 7;
                 pr5: int = const 3;
                 pr5 = ptradd pr3 pr5;  stack[3] = pr1
                 store pr5 pr1;
  
                 pr5: int = const 2;
                 pr5 = ptradd pr3 pr5;  pr5 = stack[2]
                 pr5 = load pr5;
  pr2: int = add pr5 pr2;
                 pr5: int = const 0;
                 pr5 = ptradd pr3 pr5; stack[0] = pr2
                 store pr5 pr2;
  
                 pr1: int = const 6;
                 pr1 = ptradd pr3 pr1; pr1 = stack[6]
                 pr1 = load pr1;
  pr2: int = mul pr4 pr1;
                 pr4: int = const 1;
                 pr4 = ptradd pr3 pr4; pr2 = stack[1] 
                 store pr4 pr2;
  
                 pr2: int = const 4;
                 pr2 = ptradd pr3 pr2; pr2 = stack[4]
                 pr2 = load pr2;
  
                 pr1: int = const 5;
                 pr1 = ptradd pr3 pr1; pr1 = stack[5]
                 pr1 = load pr1;

  pr4: int = sub pr2 pr1;
                 pr2: int = const 2;
                 pr2 = ptradd pr3 pr2; pr3 = stack[2] 
                 pr2 = load pr2;
  
                 pr1: int = const 3;
                 pr1 = ptradd pr3 pr1; pr1 = stack[3]
                 pr1 = load pr1;
  pr1: int = div pr1 pr2;
                 pr5: int = const 0;
                 pr5 = ptradd pr3 pr5; pr5 = stack[-]
                 pr5 = load pr5;
  
                 pr2: int = const 1;
                 pr2 = ptradd pr3 pr2; pr2 = stack[1]
                 pr2 = load pr2;
  pr2: int = add pr5 pr2;
  pr1: int = mul pr4 pr1;
  print pr2 pr1;
  free pr3;      
}

simplified

@main {
  pr1: int = const 1;
  stack[2] = pr1
  pr2: int = const 2;
  pr4: int = const 3;
  pr1: int = const 4;
  stack[6] = pr1
  pr1: int = const 5;
  stack[3] = pr1
  pr1: int = const 6;
  stack[5] = pr1 
  pr1: int = const 7;
  stack[3] = pr1
  pr5 = stack[2]
  pr2: int = add pr5 pr2;
  stack[0] = pr2
  pr1 = stack[6]
  pr2: int = mul pr4 pr1;
  pr2 = stack[1] 
  pr2 = stack[4]
  pr1 = stack[5]
  pr4: int = sub pr2 pr1;
  pr3 = stack[2] 
  pr1 = stack[3]
  pr1: int = div pr1 pr2;
  pr5 = stack[0]
  pr2 = stack[1]
  pr2: int = add pr5 pr2;
  pr1: int = mul pr4 pr1;
  print pr2 pr1;      
}

check

pr3 is the stack pointer

compiler has to mark all the spills and reloads

virtual	physical	regs	stack
a=const	pr1 = const	pr1=a
	stack[2]=pr1	pr1=a	stack[2] = a
b=const	pr2=const	pr1=a, pr2=b	stack[2] = a
c=const	pr1=const	pr1=c, pr2=b	stack[2] = a
	stack[6]=pr1	pr1=c, pr2=b	stack[2] = a, stack[6]=c
d=const	pr4=const	pr1=c, pr2=b, pr4=d	stack[2] = a, stack[6]=c
e=const	pr1= const	pr1=e, pr2=b, pr4=d	stack[2] = a, stack[6]=c
	stack[5]=pr1	pr1=e, pr2=b, pr4=d	stack[2] = a, stack[5] = e, stack[6]=c
f=const	pr1=const	pr1=f, pr2=b, pr4=d	stack[2] = a, stack[5] = e, stack[6]=c
	stack[4]=pr1	pr1=f, pr2=b, pr4=d	stack[2] = a, stack[4] = f, stack[5] = e, stack[6]=c
g=const	pr1=const	pr1=g, pr2=b, pr4=d	stack[2] = a, stack[4] = f, stack[5] = e, stack[6]=c
	stack[3] = pr1	pr1=g, pr2=b, pr4=d	stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c
	pr5= stack[2]	pr1=g, pr2=b, pr4=d, pr5=a	stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c
h=a+b	pr2=pr5+pr2	pr1=g, pr2=h, pr4=d, pr5=a	stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c
	stack[0]=pr2	pr1=g, pr2=h, pr4=d, pr5=a	stack[0]=h, stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c
	pr1=stack[6]	pr1=c, pr2=h, pr4=d, pr5=a	stack[0]=h, stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c
i=c*d	pr2=pr1*pr4	pr1=c, pr2=i, pr4=d, pr5=a	stack[0]=h, stack[2] = a, stack[3]=g, stack[4] = f, stack[5] = e, stack[6]=c
	stack[1]=pr2	pr1=c, pr2=i, pr4=d, pr5=a	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
	pr2=stack[4]	pr1=c, pr2=f, pr4=d, pr5=a	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
	pr1=stack[5]	pr1=e, pr2=f, pr4=d, pr5=a	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
j=e-f	pr4=pr1-pr2	pr1=e, pr2=f, pr4=j, pr5=a	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
	pr2=stack[2]	pr1=e, pr2=a, pr4=j, pr5=a	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
	pr1=stack[3]	pr1=g, pr2=a, pr4=j, pr5=a	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
k=g/a	pr1= pr1/pr2	pr1=k, pr2=a, pr4=j, pr5=a	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
	pr5=stack[0]	pr1=k, pr2=a, pr4=j, pr5=h	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
	pr2= stack[1]	pr1=k, pr2=i, pr4=j, pr5=h	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
l=h+i	pr2=pr5+pr2	pr1=k, pr2=l, pr4=j, pr5=h	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
m= j*k	pr1=pr4*pr1	pr1=m, pr2=l, pr4=j, pr5=h	stack[0]=h, stack[1]=i, stack[2]=a, stack[3]=g, stack[4]=f, stack[5]=e, stack[6]=c
print l,m	print pr2,pr1