Writeups for Dennis Yurichev's Reverse Engineering Challenges (#2-#11)

March 10, 2019 ❖ Tags: writeup, reverse-engineering, arm, x86

As mentioned in the (now deleted) post I wrote describing my plans for 2019, one of my goals this year is to get through at least 50 of the exercises on Dennis Yurichev's challenges.re. I've decided to document my progress in the form of writeups for the challenges I complete, batched in sets of ten exercises. For each challenge, I'll try to explain the intuitions that brought me closer to answering the recurring question from Yurichev, "[w]hat does this code do?"

Challenge #2

In nearly all of the challenges available on the site, we're given equivalent disassembly listings of a function, f, as generated by different compilers on different processor architectures, and we're asked to describe what the code does. For now, I've decided to take it easy and only pay attention to the disassemblies for GCC on x86, as that's what I've done the most work with. We aren't given a target operating system, but I think it's reasonable to assume that the x86 code uses the cdecl calling convention.

Although I stayed within my comfort zone in terms of instruction set architecture, I refrained from my usual habit of converting the disassembly listing to AT&T syntax for once.

Below is a rough translation of the disassembly listing to C. My process is relatively unchanged from the workflow I described in an older post.

unsigned f(unsigned a)
{
        // mov    eax,DWORD PTR [esp+0x4]
        // bswap  eax
        a = ((a & 0xff) << 24)
                | ((a & 0xff00) << 8)
                | ((a & 0xff0000) >> 8)
                | ((a & 0xff000000) >> 24);

        // mov    edx,eax
        // and    eax,0xf0f0f0f
        // and    edx,0xf0f0f0f0
        // shr    edx,0x4
        // shl    eax,0x4
        // or     eax,edx
        a = ((a & 0xf0f0f0f) << 4) | ((a & 0xf0f0f0f0) >> 4);

        // mov    edx,eax
        // and    eax,0x33333333
        // and    edx,0xcccccccc
        // shr    edx,0x2
        // shl    eax,0x2
        // or     eax,edx
        a = ((a & 0x33333333) << 2) | ((a & 0xcccccccc) >> 2);

        // and    eax,0x55555555
        // and    edx,0xaaaaaaaa
        // add    eax,eax
        // shr    edx,1
        // or     eax,edx
        a = ((a & 0x55555555) << 1) | ((a & 0xaaaaaaaa) >> 1);

        // ret
        return a;
}

I think it should make sense that add eax,eax is mathematically equivalent to imul eax,2, but it takes another step to see that it's also equivalent to shl eax,1, which is represented in the C code as << 1. This isn't terribly complicated, but it's an optimization detail that some might not be familiar with. bswap is an instruction I was unfamiliar with, so I consulted my favorite x86 reference. It reverses the byte order of the word in the register – that is, it converts its endianness. If you don't know what that means, I'd suggest reading the page in the ISA reference.

The code seems nonsensical at first, but we can compile it and inspect the output given some test values.

#include <stdio.h>

unsigned f(unsigned a);         /* the translation above */

int main(void)
{
        unsigned i;

        for (i = 0; i <= 256; i++) {
                printf("%010u %08x\n", i, i);
                printf("%010u %08x\n", f(i), f(i));
                printf("\n");
        }
        return 0;
}

Which produces:

0000000000 00000000
0000000000 00000000

0000000001 00000001
2147483648 80000000

0000000002 00000002
1073741824 40000000

0000000003 00000003
3221225472 c0000000

0000000004 00000004
0536870912 20000000
...

What's happening might not be immediately obvious, but there's a pattern in the column of input/output represented in hexadecimal. Let's have a look at the binary representations of a few pairs:

bin(0x00000001) # --> '0b00000000000000000000000000000001'
bin(0x80000000) # --> '0b10000000000000000000000000000000'

bin(0x00000003) # --> '0b00000000000000000000000000000011'
bin(0xc0000000) # --> '0b11000000000000000000000000000000'

# ...

bin(0x0000004d) # --> '0b00000000000000000000000001001101'
bin(0xb2000000) # --> '0b10110010000000000000000000000000'

My answer to the question is that f reverses the bits of the word it is given.

Challenge #3

This time, we're given an array of 64 32-bit integers and a hint that "[t]he algorithm is well-known, but I've changed [the] constant so it wouldn't be googleable."

int f(unsigned n)
{
        unsigned a, b;

        // mov  edx, edi
        // shr  edx
        // or   edx, edi
        // mov  eax, edx
        a = b = (n >> 1) | n;

        // shr  eax, 2
        // or   eax, edx
        // mov  edx, eax
        a = b = (a >> 2) | b;

        // shr  edx, 4
        // or   edx, eax
        // mov  eax, edx
        a = b = (b >> 4) | a;

        // shr  eax, 8
        // or   eax, edx
        // mov  edx, eax
        a = b = (a >> 8) | b;

        // shr  edx, 16
        // or   edx, eax
        b = (b >> 16) | a;

        // imul eax, edx, 79355661 ; 0x4badf0d
        // shr  eax, 26
        a = (b * 0x4badf0d) >> 26;

        // mov  eax, DWORD PTR v[0+rax*4]
        // ret
        return v[a];
}

The first thing that stood out to me was the presence of -1 in the array of integers. Testing from 0 to UINT_MAX, the only n that returns -1 is 0. Interesting. It's also worth noting that the array contains every integer from 0 to 31, so this function is using some rule to map the input space onto [0, 31].

If we inspect the values of f for test values from 0 to UINT_MAX:

f(0x00000000) = -1
f(0x00000001) = 31
f(0x00000002) = 30
f(0x00000003) = 30
f(0x00000004) = 29
f(0x00000005) = 29
f(0x00000006) = 29
f(0x00000007) = 29
f(0x00000008) = 28
...

There's a pattern of exponential growth here – every result occurs twice as frequently as the previous result. Mathematically, this is \(31 - \lfloor \log_2(n) \rfloor\), where \(\lfloor x \rfloor\) is the floor function (the largest integer less than or equal to \(x\)). This can be verified by comparing the result of f to the following function for some test values:

#include <math.h>       /* link with -lm */

int my_f(unsigned n)
{
        return 31 - ((int) (log(n) / log(2)));
}

Challenge #4

This time around we're given an additional question to answer: "Some versions have the 0x1010101 constant, some do not. Why?" I decided that I'd reverse the x86 disassembly first, and then compare it to the other architectures.

unsigned f(unsigned a)
{
        // mov    edx,edi
        // shr    edx,1
        // and    edx,0x55555555
        // sub    edi,edx
        a -= ((a >> 1) & 0x55555555);

        // mov    eax,edi
        // shr    edi,0x2
        // and    eax,0x33333333
        // and    edi,0x33333333
        // add    edi,eax
        a = (a & 0x33333333) + ((a >> 2) & 0x33333333);

        // mov    eax,edi
        // shr    eax,0x4
        // add    eax,edi
        // and    eax,0xf0f0f0f
        // imul   eax,eax,0x1010101
        // shr    eax,0x18
        // ret
        return (((a + (a >> 4)) & 0xf0f0f0f) * 0x1010101) >> 0x18;
}

The past few challenges have shown us that a good way of reversing these bit-twiddling functions is to test a few input values and look at the binary representations of the input and output values.

 In: 00000000
Out: 0

 In: 00000001
Out: 1

 In: 00000010
Out: 1

 In: 00000011
Out: 2

...

 In: 00001100
Out: 2

 In: 00001101
Out: 3

 In: 00001110
Out: 3

 In: 00001111
Out: 4

It doesn't take much effort to see that the function is counting the number of bits set in the input. This was particularly interesting to me as I was asked to derive this algorithm for a past job interview (though I wasn't able to in the time given).

This holds for numbers larger than 0xff, too. The first step replaces each 2-bit field with its bit count, the second does the same for each 4-bit field, and the masked add produces a count per byte; the multiplication by 0x1010101 then sums the four byte counts into the top byte. In other words, f is the classic parallel population count, and it's correct for any 32-bit input.

With that, we can move onto the second question. The disassemblies for x86, ARM64, and Thumb have the 0x1010101 constant, while the disassemblies for ARM and MIPS do not.

Returning to the strategy of inspecting binary representations:

00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000001
00000001 00000001 00000001 00000001

00000000 00000000 00000000 00000010
00000010 00000010 00000010 00000010

...

00000000 00000000 00000000 00010000
00010000 00010000 00010000 00010000

00000000 00000000 00000000 00010001
00010001 00010001 00010001 00010001

00000000 00000000 00000000 00010010
00010010 00010010 00010010 00010010

...

00000000 00000000 00000000 11111101
11111101 11111101 11111101 11111101

00000000 00000000 00000000 11111110
11111110 11111110 11111110 11111110

00000000 00000000 00000000 11111111
11111111 11111111 11111111 11111111

It would appear that multiplying a 32-bit integer by 0x1010101 propagates the low byte to the three bytes above it. This makes sense when you notice that the multiplication is paired with a shr of 0x18 – moving the highest byte into the lowest byte.

Though, this doesn't really answer the question. What difference is there between the architectures that use the 0x1010101 and the architectures that don't? ARM and MIPS still do a shift by 0x18, so what's going on?

Looking at the ARM example, the instructions around the shift are:

ADD      r0,r0,r0,LSL #16
ADD      r0,r0,r0,LSL #8
LSR      r0,r0,#24

For MIPS, it looks like:

sll     $3,$2,8
addu    $2,$3,$2
sll     $3,$2,16
addu    $2,$2,$3
j       $31
srl     $2,$2,24

Both of these (humor me, I know the orders are different) are equivalent to:

a = (a << 8) + a;
a = (a << 16) + a;
a = a >> 24;

And, with some test values, we can see that this is equivalent to multiplication by 0x1010101 and shifting by 24.

unsigned a(unsigned n)
{
        n = (n << 8) + n;
        n = (n << 16) + n;
        return n >> 24;
}

unsigned b(unsigned n)
{
        return (n * 0x1010101) >> 24;
}

#include <limits.h>
#include <stdio.h>

int main(void)
{
        for (unsigned i = 0; i < UINT_MAX; i++) {
                if (a(i) != b(i)) {
                        printf("%u\n", i);
                }
        }
        return 0;
}

I suspect the reason it doesn't show up in the ARM or MIPS disassemblies is due to the fixed-width instruction encoding. The compiler likely decided it would be less efficient to work with the 0x1010101 constant than to break it up into a pair of shifts and additions.

Challenge #5

This is the first challenge we're given that has loops and conditionals, as indicated by the telltale labels starting with ".L". Another initial observation is that the first instruction in f operates on %rsi, and the third operates on %rcx, so it's very likely that this function has four parameters.

Translation to C is more involved than it was with the previous challenges, but it is valuable, as it makes the purpose of f very clear. In lieu of an analysis of inputs and outputs, I'll provide a few notes on the process of translation. First, cmp gave me a bit of trouble, as I've been out of practice for some time and the difference between AT&T and Intel syntax threw me for a loop. Fortunately, the Wikibooks page on x86 assembly covers this in detail. In AT&T syntax, the order is cmp subtrahend, minuend, while in Intel syntax, the order is cmp minuend, subtrahend. The subtrahend is subtracted from the minuend, so, in Intel syntax, cmp rcx, rsi; ja .L10 will jump if %rcx is greater than %rsi (an unsigned comparison, since ja is "jump if above").

Looking further into the function, there is some dereferencing with BYTE PTR, which tipped me off that this was probably a function operating on a string.

There's a curious push rbx, followed by a pop rbx before the ret. I ignored this initially, taking it to be register preservation. It was. An intuition of what's worth ignoring is valuable in reverse engineering.

Upon reaching .L16, there are a lot of registers in use. It helped to look at each register in isolation and see how they were used. For example, %r10 is used in the following instructions: xor r10d, r10d, add r10, 1, lea rax, [rdi+r10], and cmp r10, r11. This is very typical of a for-loop counter. %r9 on the other hand only shows up in two instructions: mov r9d, 1, and cmovne r8d, r9d. %r9 is just used as a source of 1 for cmovne, since there are no encodings for cmovne that have an immediate source.

cmovne was unfamiliar to me, so I looked it up in my favorite x86 reference. It's a conditional move. movzx was similarly unfamiliar. It simply loads %bl with the source byte and zeroes out the higher portion of the register.

char *f(char *a, unsigned b, char *c, unsigned d)
{
        // cmp rcx, rsi
        // ja .L10
        if (d > b) {
                // .L10:
                // xor eax, eax
                // ret
                return NULL;
        }


        // sub rsi, rcx
        // add rsi, 1
        // mov r11, rsi
        b = b - d + 1;

        // je .L10
        if (b == 0) {
                // .L10:
                // xor eax, eax
                // ret
                return NULL;
        }

        // test rcx, rcx
        // jne .L16
        // mov rax, rdi
        // ret
        if (d == 0) {
                return a;
        }

        // .L16:
        // push rbx
        // xor r10d, r10d
        // mov r9d, 1
        // ...
        // cmp r10, r11
        // jne .L4
        for (int i = 0; i != b; i++) {
                // xor r8d, r8d
                unsigned ret = 0;

                // .L4:
                // lea rax, [rdi+r10]
                // xor esi, esi
                // ...
                // add rsi, 1
                // cmp rsi, rcx
                // jne .L8
                for (int j = 0; j != d; j++) {
                        // movzx   ebx, BYTE PTR [rdx+rsi]
                        // cmp BYTE PTR [rax+rsi], bl
                        // cmovne  r8d, r9d
                        if (a[i + j] != c[j]) {
                                ret = 1;
                        }
                }

                // test r8d, r8d
                // je .L12
                if (!ret) {
                        // .L12:
                        // pop rbx
                        // ret
                        return a + i;
                }
        }

        // xor eax, eax
        // pop rbx
        // ret
        return NULL;
}

The variable names I chose are pretty opaque, but if you stare at this long enough, it should be pretty clear that f returns a pointer to the first occurrence of c within a (note that it returns a + i, not an offset). b and d are just the lengths of a and c respectively, so this is a length-delimited substring search, much like strstr.

Challenge #6

An additional hint given for this exercise is that, "[t]his is one of the simplest exercises I made, but still this code can be served as useful library function and is certainly used in many modern real-world applications." I'll leave the relative addresses in my annotations of the disassembly, as it appears to be PIC.

For the sake of showing the mapping between assembly instructions and C code, I'll first give a translation that uses goto, followed by a cleaned up version.

//  0: push   rbp
//  1: mov    rbp,rsp
//  4: mov    QWORD PTR [rbp-0x8],rdi
//  8: mov    QWORD PTR [rbp-0x10],rsi
void f(char *a, short *b)
{
_start:
        //  c: mov    rax,QWORD PTR [rbp-0x8]
        // 10: movzx  eax,BYTE PTR [rax]
        // 13: movsx  dx,al
        // 17: mov    rax,QWORD PTR [rbp-0x10]
        // 1b: mov    WORD PTR [rax],dx
        *b = *a;

        // 1e: mov    rax,QWORD PTR [rbp-0x10]
        // 22: movzx  eax,WORD PTR [rax]
        // 25: test   ax,ax
        // 28: jne    2c
        // 2a: jmp    38
        if (*b != 0) {
                // 2c: add    QWORD PTR [rbp-0x8],0x1
                // 31: add    QWORD PTR [rbp-0x10],0x2
                // 36: jmp    c
                a++;
                b++;
                goto _start;
        }

        // 38: pop    rbp
        // 39: ret
}
void f(char *a, short *b)
{
        while ((*b++ = *a++) != '\0')
                ;
}

Cool. Yurichev wasn't lying, this is a damn simple exercise, but it is something that's used in nearly every C program. It's strcpy – with the small twist that the destination is written in 16-bit units (note the WORD PTR store and the add of 0x2), so each character is widened on the way, much like converting an ASCII string to UTF-16.

Challenge #7

This exercise gives the same hint as last time, and similarly uses address offsets instead of symbols.

Control flow isn't as initially obvious as some of the past exercises, but the first instruction is a pretty good tell that this function takes a char * as a parameter, and the test dl,dl was a good tell that the control flow depends on the individual characters in that parameter. The 0x41 in that lea esi,[rdx-0x41] instruction stood out to me, as 0x41 is 'A' in ASCII, and the 0x20 in the add edx,0x20 was also a big clue, as 'a' - 'A' is 0x20.

void f(char *a)
{
        char *cur;

        //  0: movzx edx,BYTE PTR [rdi]
        //  3: mov   rax,rdi
        //  6: mov   rcx,rdi
        //  9: test  dl,dl
        //  b: je    29
        // 29: repz ret
        if (*a == '\0')
                return;

        // 6: mov rcx,rdi
        cur = a;

        // 25: test   dl,dl
        // 27: jne    10
        while (*cur != '\0') {
                // 10: lea    esi,[rdx-0x41]
                // 13: cmp    sil,0x19
                // 17: ja     1e
                // 19: add    edx,0x20
                // 1c: mov    BYTE PTR [rcx],dl
                if ((unsigned char)(*cur - 0x41) <= 0x19)       /* ja is unsigned */
                        *cur += 0x20;

                // 1e: add    rcx,0x1
                // 22: movzx  edx,BYTE PTR [rcx]
                cur++;
        }

        // 29: repz ret
}

Just from the tells outlined in the previous paragraph, I don't even need to run f to know that it converts a to lowercase. Note that ja makes the range check an unsigned comparison, so any byte outside 'A' through 'Z' (a space character, say) wraps around to a large value, fails the check, and passes through untouched.

Challenge #8

The hint we're given this time is, "[t]his is one of the busiest algorithms under the hood, though, usually hidden from programmers. It implements one of the most popular algorithms in computer science. It features recursion and a callback function."

In preparation for an exercise that would likely be more difficult than the past few, I did a couple of quick perusals to get a basic idea of the control flow, the parameters, and the return values. The mov rbp,rdx early on indicates that there are at least three parameters.

There's a push rbp instruction, but no explicit creation of a stack frame. There are also push r12 and push rbx instructions. These all occur at the beginning of the function, so we see some register preservation and an indication of which registers are going to be used in the code.

I find that a lot of reverse engineering involves getting good footing, so this is the information you want when starting out.

What I normally try to find out next is whether the parameters and return type are integers or pointers. mov rsi,QWORD PTR [rbx], after %rsi was moved into %rbx, is a good tell that the second parameter is a pointer, likely to an array of pointers, as it's dereferenced as QWORD PTR. The call r12 tells me that the first parameter is the callback that was mentioned in the hint. The js 40 after testing the callback's return value tells me that its return value is signed – probably an int, not a pointer – and the pair of mov rsi,QWORD PTR [rbx] and mov rdi,rbp before the call indicate that it takes two parameters.

void *f(int (*a)(void *, void *), void **b, void *c)
{
        int ret;
        // 0: push r12
        // 2: test rsi,rsi
        // ...
        // 10: je  32
        if (b == NULL) {
                // 32: pop rbx
                // 33: pop rbp
                // 34: xor eax,eax
                // 36: pop r12
                // 38: ret
                return NULL;
        }

        // r12 <- a
        // rbx <- b
        // rbp <- c

        while (1) {
                // (This code path is also duplicated at 49-54. The branch that
                // contains the duplicated code has been omitted, as the same
                // effect arises from this loop continuing to iterate.)
                //
                // 18: mov  rsi,QWORD PTR [rbx]
                // 1b: mov  rdi,rbp
                // 1e: call r12
                ret = a(c, *b);         // rdi = c, rsi = *b

                // 21: test eax,eax
                // 23: je 56
                if (ret == 0) {
                        // 56: mov rax,rbx
                        // 59: pop rbx
                        // 5a: pop rbp
                        // 5b: pop r12
                        // 5d: ret
                        return b;
                }

                // 25: js 40
                else if (ret < 0) {
                        // 40: mov rbx,QWORD PTR [rbx+0x10]
                        b = b[2];       // 0x10 / sizeof(void *)

                        // 44: test rbx,rbx
                        // 47: je   32
                        if (b == NULL) {
                                // 32: pop rbx
                                // 33: pop rbp
                                // 34: xor eax,eax
                                // 36: pop r12
                                // 38: ret
                                return NULL;
                        }
                }

                else {
                        // 27: mov rbx,QWORD PTR [rbx+0x18]
                        b = b[3];       // 0x18 / sizeof(void *)

                        // 2b: test rbx,rbx
                        // 30: jne 18
                        if (b == NULL) {
                                // 32: pop rbx
                                // 33: pop rbp
                                // 34: xor eax,eax
                                // 36: pop r12
                                // 38: ret
                                return NULL;
                        }
                }
        }
}

In deriving meaning from this, I have a bit of an advantage; I've just recently implemented this exact algorithm for my university's computer systems principles course. This is the search function for a binary search tree, which takes an arbitrary comparison function, a, and returns the first node for which a returns 0. The function returns NULL if the item is not in the tree. c is some sort of "data" parameter for the callback function – likely the key being searched for – which is simply passed through on every comparison, which is why f never inspects it directly.

b is probably a pointer to a struct looking something like the following:

struct tree_node {
        char data[0x10];
        struct tree_node *left;
        struct tree_node *right;
};

as QWORD PTR [rbx+0x10] is followed when a returns something less than 0 (represented in the struct as left), and QWORD PTR [rbx+0x18] is followed when a returns something greater than 0 (represented in the struct as right).

This exercise is a little unusual. The hint mentions recursion, but this algorithm is entirely iterative. Perhaps it was implemented recursively in C, and the compiler performed some sort of tail-call optimization? I honestly have no idea.
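To support the tail-call theory, here's a hedged guess at what the original C might have looked like (the names and the struct layout are my assumptions, derived from the offsets above). Both recursive calls are in tail position, so GCC at -O2 can rewrite them into exactly the kind of loop seen in the listing:

```c
#include <stddef.h>

struct tree_node {
        char data[0x10];                /* the key lives somewhere in here */
        struct tree_node *left;         /* offset 0x10 */
        struct tree_node *right;        /* offset 0x18 */
};

/* Recursive binary search tree lookup with a user-supplied comparison. */
static struct tree_node *search(int (*cmp)(void *key, void *data),
                                struct tree_node *node, void *key)
{
        if (node == NULL)
                return NULL;

        int r = cmp(key, node->data);
        if (r == 0)
                return node;                            /* found it */
        if (r < 0)
                return search(cmp, node->left, key);    /* tail call */
        return search(cmp, node->right, key);           /* tail call */
}

/* Example comparator: the first 4 bytes of data hold an int key. */
static int cmp_int(void *key, void *data)
{
        int a = *(int *)key, b = *(int *)data;

        return (a > b) - (a < b);
}
```

With cmp_int and a three-node tree (root 5, left 3, right 8), searching for 3 lands on the left child and searching for a missing key returns NULL, matching the reversed code's behavior.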

Challenge #9

The hint we're given this time is, "[n]ow that's easy." I certainly hope it is.

This is the first challenge we're given that uses libc. It's also the first challenge in which we see the compiler using binary search to optimize a conditional with more than one branch. I tend to write these out as switch statements whenever I see them, but it's perfectly reasonable for a compiler to optimize an if in the same way.

#include <stdio.h>
#include <stdlib.h>

int f(char *a)
{
        // sub rsp, 8
        // movzx eax, BYTE PTR [rdi]
        switch (*a) {
        // cmp al, 89
        // je .L3
        case 'Y':
        // cmp al, 121
        // jne .L2
        case 'y':
                // .L3:
                // mov eax, 1
                // add rsp, 8
                // ret
                return 1;

        // jle .L21
        // ...
        // .L21:
        // cmp al, 78
        // je .L6
        case 'N':
        // ...
        // cmp al, 110
        // je .L6
        case 'n':
                // .L6:
                // xor eax, eax
                // add rsp, 8
                // ret
                return 0;

        default:
                // .L2:
                // mov edi, OFFSET FLAT:.LC0
                // call puts
                // xor edi, edi
                // call exit
                puts("error!");
                exit(0);
        }
}

Yurichev wasn't lying, this was an easy challenge. In fact, if I were reverse engineering a binary and came across something like this, I probably wouldn't bother translating the assembly to equivalent C. It's a function that converts a character to a boolean (in the sense of a prompt that asks the user for 'Y' or 'N' – "Yes" or "No") and exits prematurely if the character wouldn't make sense in that context.
