[DEFCON 2018] Doublethink – 8-Architecture Assembly Polyglot

I played with PPP in the DEFCON CTF 2018 competition, held in Las Vegas from August 10 to August 12. The competition was fierce this year, with 24 teams in the finals (up from just 12 the previous year). This year, it was organized by a new group of people – the Order of the Overflow – taking over from Legitimate Business Syndicate the past year. They chose to shake up the competition format a bit by introducing a new “King of the Hill” challenge type where competitors strove to build the best possible solutions to a given challenge, alongside the traditional “Attack-Defense” challenges where competitors directly attacked each other’s systems.

One of these King of the Hill challenges, called Doublethink, asked users to build assembly code that would simultaneously run on up to 12 different machine architectures, from present (AMD64, ARM64, MIPS) to far past (PDP-1, PDP-8, Data General Nova, and several others), to “future” (RISC-V, Hexagon DSP, MMIX, and even Clemency, the custom architecture used in last year’s DEFCON CTF). To qualify, the code had to output a flag string provided in a file or memory out to the console, making it a limited kind of shellcode. Competitors were graded according to how many architectures their code ran on.

Although we were aware that there might be bugs in the scoring system that would enable us to “cheat” (given that other King of the Hill-style challenges at CTFs have featured such intentional errors, in the spirit of a hacking competition), we didn’t end up finding any such bugs. Instead, we ended up building a polyglot shellcode that really did run on 8 different architectures (64-bit Intel, Royal Precision LGP-30, Donald Knuth’s MIX, PDP-1, PDP-8, PDP-10, DEFCON 2017’s cLEMENCy, and Data General Nova), and really did output a flag on every single one. This is the writeup on how we managed this feat in less than 24 hours.

The structure of the challenge was as follows: you connected to a server (for which we were given source code), which asked you for 4096 bytes of code:

[**] Who controls the past controls the future.
[**] Who controls the present controls the past.
[**]
[**] Take control.
[**]
[**] Shellcode:

You could then attempt to control one of the “present” architectures (ARM64, AMD64 or MIPSel), after which it would allow you to select up to seven “past” architectures (LGP-30, PDP-1, PDP-8, PDP-10, MIX, IBM-1401, Nova), and up to four “future” architectures (RISC-V, Hexagon, MMIX, Clemency). Your shellcode was tested on each one, and at the end you were scored for how many architectures you managed to control; thus, the maximum possible score was 12.

Getting Started

We started off by downloading all the documentation we could find on these architectures, and thinking about the general structure of the assembly code. Crucially, since every architecture would start executing our code at the first instruction, the first few instructions would have to contain the same sequence of bits in every architecture, yet be interpreted differently in each one. We decided that it would be easiest if we built the shellcode such that it eventually jumped away to an architecture-specific bit of code somewhere else in our 4096-byte shellcode buffer, which meant that the main challenge was trying to prevent the various architectures from crashing before they got to their assigned jump instruction.

First of all, we had to write shellcode that would print out the flag for each architecture. For ARM64, AMD64, MIPSel, RISC-V, Hexagon and MMIX, the shellcode had to open a file called “flag”, read it in, and output it to standard output. For the rest of the architectures, the flag file was preloaded into a special place in memory, from which our shellcode would have to read the flag and output it. The shellcode additionally had to be relocatable – i.e. it had to be possible to easily modify the shellcode to run at a particular address, so that we could stick it somewhere in the buffer and jump to it.

One factor which massively complicated the process was the fact that most of the architectures were not byte-oriented. The scoring script would break your shellcode into bits depending on the architecture’s word size, meaning that we would actually have to craft our shellcode at the bit level rather than reasoning about it at the byte level. To give you a quick rundown of what we had to deal with:

– AMD64: 8-bit-unit variable-length instructions, memory addressed in 8-bit bytes
– ARM64: 32-bit fixed-length instructions, memory addressed in 8-bit bytes
– MIPS: 32-bit fixed-length instructions, memory addressed in 8-bit bytes
– IBM-1401: 7-bit-unit variable-length instructions, memory addressed in 7-bit units
– LGP-30: 31-bit fixed-length instructions, memory addressed in 31-bit word
– MIX: 30-bit fixed-length instructions, memory addressed in 31-bit words
– NOVA: 16-bit fixed-length instructions, memory addressed in 16-bit words
– PDP-1: 18-bit fixed-length instructions, memory addressed in 18-bit words
– PDP-8: 12-bit fixed-length instructions, memory addressed in 12-bit words
– PDP-10: 36-bit fixed-length instructions, memory addressed in 36-bit words
– Clemency: 9-bit variable-length instructions (middle-endian), memory address in 9-bit “bytes”
– Hexagon: 32-bit instructions, memory addressed in 8-bit bytes
– MMIX: 32-bit fixed-length instructions, memory addressed in 8-bit bytes
– Risc-V: 16-bit-unit variable-length instructions, memory addressed in 8-bit bytes

So, as you can plainly see, there is a lot of variability. While the variable bit lengths make it possible to “misalign” architectures relative to each other, it doesn’t help us tremendously at the very start of the shellcode.

Humble Beginnings: AMD64, and LGP-30 (Royal Precision Electronic Computer Company, 1956)

On Friday night, this challenge appeared at 7:50pm, 10 minutes before the contest network shutdown for the day. I quickly uploaded a cheap `cat flag` AMD64 shellcode to get a few points for one-architecture control:

00000000  4831C0            xor rax, rax
00000003  488D3D18000000    lea rdi, [rel bin]
0000000A  50                push rax                ; NULL
0000000B  54                push rsp
0000000C  5A                pop rdx                 ; envp -> [NULL]
0000000D  488D4F0B          lea rcx, [rdi+11]
00000011  51                push rcx                ; cmd
00000012  488D4F08          lea rcx, [rdi+8]
00000016  51                push rcx                ; "-c"
00000017  488D4F05          lea rcx, [rdi+5]
0000001B  51                push rcx                ; "sh"
0000001C  54                push rsp
0000001D  5E                pop rsi                 ; argv -> ["sh", "-c", cmd, NULL]
0000001E  B03B              mov al, 59              ; __NR_execve
00000020  0F05              syscall
00000022  2F62696E2F736800  bin: db "/bin/sh", 0
0000002A  2D6300            arg1: db "-c", 0
0000002D  63617420666C616700    cmd: db "cat flag", 0

After the contest went offline, four of us broke off to try and build the super shellcode.

I started off looking at LGP-30. This was a table-sized machine, first sold in 1956, that retailed for $47,000 (equivalent to $423,000 in 2017). It has sixteen opcodes, and surprisingly no way to perform indirect memory accesses. Instead, to do an indirect memory access, you simply overwrote the address part of a direct memory access with a new address, then executed the direct access instruction. It did, however, have distinct instructions for printing data and receiving input.

Luckily for us, for LGP-30, the top 12 bits of each instruction are actually ignored by the computer, with the opcode stored in bits 12-15. This means that, with some small constraints, we could make the first 2 bytes of the shellcode an AMD64 jump instruction, allowing us to get a 2-architecture polyglot relatively easily.

Here’s the LGP-30 shellcode that actually prints the flag (which is stored at address 13,37):

00,00 b 1337   # bring data at 13,37 into the accumulator
00,01 y 0002+b # store the address portion of the accumulator into the next instruction
00,02 p 0000   # print data (the data to write has been set by the previous instruction)
00,03 b 0000+b # bring instruction 00,00 into the accumulator
00,04 a 0007+b # add [00,07] = 1
00,05 y 0000+b # store the address portion instruction back to 00,00, changing the address it loads
00,06 u 0000+b # jump back to 00,00
00,07 z 0001   # constant value loaded by instruction 00,04

The notation “+b” means to add the base address of the code, to make it relocatable. Note that many tasks in LGP-30 are accomplished by directly overwriting the operand field of other instructions, since the machine lacked the ability to perform indirect accesses or offset accesses, and only had a single working register (the accumulator).

Here’s what the first polyglot shellcode starts with, as a bitstring, where “z” means a don’t-care bit (this becomes more relevant later, as we figure out what we can and can’t modify freely):

1110101101111010zz000001001010z

As AMD64 assembly:

1110101101111010 EB7A jmp short 0x7c

and as LGP-30 assembly:

1110101101111010zz000001001010z u 0110 # jump to 01,10

Then, we store the AMD64 shellcode at byte address 0x7c (code address 0x133707c, bit address 992 bits), and the LGP-30 shellcode at word address 10 (code address 01,10, bit address 310). Note that, because the first instruction is a jump in LGP-30, the low 4 bits of the AMD-64 jmp instruction are constrained to be 0x0A (the opcode for the LGP-30 jump), but the other 12 bits are free (so for example “EB8A jmp short 0x8c” would be OK too).

Adding MIX to the mix (Donald Knuth, 1968)

MIX is a hypothetical computer developed by Donald Knuth for his famous “The Art of Computer Programming” book set. Emulators, assemblers and even compilers have been built for this system.

Zachary Wade figured out that the shellcode for this platform was exceptionally simple, consisting of a single 30-bit instruction called OUT. This instruction, when executed, copies a “block” of memory to a specified device starting at a given address. For the console (device 19), a block is 70 characters, which is more than enough to print a flag (and trailing garbage, which is ignored by the scoring script). So, the shellcode is simply

OUT 3137,19

and because it’s just one instruction, no JMP is needed. All we needed to do was ensure that the instructions leading up to this were safely ignored by the processor; luckily, the first instruction executed is a LD (load) instruction that loads a random (but valid) address, so it doesn’t crash. Thus, we can stick the MIX shellcode at word address 1 (bit address 30), and we have a three-architecture shellcode.

By this time it was 1:30am, so I retired to my room to continue hacking.

PDP-8 (12-bit architecture, Digital Equipment Corporation, 1965)

Although I probably should’ve gone to bed to be awake for Saturday, I really wanted to add another architecture, so I stayed up building a PDP-8 shellcode. In the end, I found a usable print routine online, and adapted it to the contest setup. Notably, the PDP-8 requires you to actually wait for the teletype output device to become ready before outputting a character – and the SIMH emulator actually simulates this, silently discarding output sent to an unready simulated teletype. The code for the PDP-8 is definitely a bit hairier:

00 tls:        6046   # wake up printer
01 cla cll:    7300   # clear AC and link
02 tad flag:   1214+b # grab flag address
03 dca 10:     3010   # deposit address of string to auto-index register 10
04 cla cll:    7300   # clear AC and link
05 tad i 10:   1410   # load character
06 sna:        7450   # skip if non-zero
07   hlt:      7402   # halt
10 tsf:        6041   # printer ready?
11   jmp .-1:  5010+b # no? loop!
12 tls:        6046   # yes? print!
13 jmp loop:   5004+b # get next char
14 <flag>:     1336

Note that everything is octal, so that a single 12-bit instruction is nicely coded as 4 octal digits. The PDP-8 also features some really fun features, such as the “microcoded” 7xxx instructions, which execute up to 7 different operations in a single instruction depending on the flags that are set (7300 = “cla cll” is one such example of such an instruction). Luckily, jumps are really easy – just “5000” + the address to jump to.

For some silly reason, late at night I thought it’d be a good idea to switch the “present” architecture to ARM64 instead of AMD64, because I thought its 32-bit instruction set would be easier to manipulate. Now the shellcode is getting a little crazy:

Here, MIX and ARM64 both execute some arithmetic operations on random registers before running shellcode, which is perfectly OK since they don’t crash 🙂

Well, now we’re successfully up to 4 architectures, and it only took me three hours…so now it’s 4:30am and I really need to sleep.

PDP-1 (18-bit architecture, Digital Equipment Corporation, 1959)

We wake up on Saturday, I contemplate working on a different problem but decide this one is more interesting. I started by making some room in the shellcode, moving the MIX instruction down several places. PDP-1 turns out to be pretty easy – a jump is the 6 octal bits 60 or 62, followed by the 12-bit address to jump to. The assembly code, donated by Matthew Savage, looked like this:

230077 LIO I 77
440077 IDX 77
662077 RIL 6
730003 TYO I
662077 RIL 6
730003 TYO I
662077 RIL 6
730003 TYO I
600100 JMP 100

This uses address 077 (one word before the program memory) as a counter, and just prints out all of the memory as successive 6-bit words. It jumps to address 100, which cleverly makes it loop forever without having to adjust the base address (since it will just re-execute the jump).

With the MIX code moved down, it was pretty easy to stuff in the PDP-1 jump at the second 18-bit word, meaning that we had 5 architectures before the contest officially opened on Saturday morning.

cLEMENCy (9-bit “bytes”, 27-bit words, DEFCON CTF, 2017)

cLEMENCy was the crazy custom architecture developed by LegitBS for last year’s DEFCON CTF finals. It features 9-bit “bytes” (which we called nytes), 27-bit words, and “middle-endian” memory storage…in short, it’s a totally crazy thing.

In cLEMENCy, the flag is just stored in a fixed position in memory. There’s a super handy “memcpy” instruction (called DMT – direct memory transfer) and a memory-mapped terminal output buffer, so my first attempt at the shellcode was simply

# load r1 = flag address
ml r1, 0
mh r1, 0x10040
# load r2 = output address
ml r2, 0
mh r2, 0x14040
# load r3 = size
ml r3, 0xFFF
# copy
dmt r2, r1, r3
# trigger terminal output
stt r3, [r2+0x2000, 1]
ht

Done, right? Run it and…total garbage. What happened? The flag is stored in cLEMENCy’s memory as a series of nytes, one nyte per byte of the original flag (with the high bit set to 0). But then, when cLEMENCy goes to output the flag, it packs 9-bit nytes into successive 8-bit bytes, meaning that it outputs 9 regular bytes for each 8 nytes – this shifts the bits around so the output is garbled. Alas, to solve this problem we had to develop a bit packer routine that would properly “pre-pack” the nytes so they came out right when printed. Luckily, we had a super solid assembler from last year, and Max Serrano to develop the shellcode. Just a couple hours later, we had functional cLEMENCy shellcode.

#define MOVI(r,imm) ml r, {(imm) & 0x3ff} ;\
   mh r, {(imm) >> 10}

_start:
 MOVI(R9, 0x4010000) // flag_page
 MOVI(R0, 0x5010000) // out_page
 ml R2, 13 // FLAG_LEN
 ml R1, 1 // shift_amount
 ml R3, 0 // output_idx
 ml R4, 0 // i
 ml R28, 0x1ff // mask
 ml R27, 8 // sub
 ml R26, 9 // modulus
_begin_loop:
 cm R4, R2
 bge $_end
 xr R5, R5, R5 // curr
 // curr = flag_page[i]
 AD R20, R9, R4
 LDS R5, [R20]
 // curr = curr << shift_amount
 SL R5, R5, R1
 // curr = curr & mask;
 AN R5, R5, R28
 // R6 = flag_page[i+1]
 ADI R4, R4, 1
 AD R20, R9, R4
 LDS R6, [R20]
 // R6 = (R6 >> (8 - shift_amount)
 SB R7, R27, R1
 SR R6, R6, R7
 // R7 = ((1 << shift_amount) - 1)
 ML R7, 1
 SL R7, R7, R1
 SBI R7, R7, 1
 // R6 = (R6 & R7)
 AN R6, R6, R7
 // curr |= R6
 OR R5, R5, R6
 // out_page[output_idx] = curr
 AD R21, R0, R3
 STS R5, [R21]
 // shift_amount = (shift_amount + 1) % 9
 ADI R1, R1, 1
 MD R1, R1, R26
 cmi R1, 1
 be $_begin_loop
 ADI R3, R3, 1
 b $_begin_loop
_end:
 MOVI(R8, 0x5012000)
 STT R2, [R8]
 ht

I got really lazy trying to “pack” the jump in. Instead, what I did was I just ran the cLEMENCy emulator on the code, and noted the first instruction where it crashed. cLEMENCy’s emulator, as it turns out, will happily skip over instructions it cannot decode, and only runs into trouble if you access bad memory with a real instruction. When I ran the existing code, I found a crash at nyte address 11 (bit address 99), so I just shoved the jump instruction at that address – luckily it didn’t conflict too hard with anything existing!

PDP-10 (36-bit architecture, Digital Equipment Corporation, 1966)

DEC really went a little nuts with this machine, featuring a whopping 36 bits per word. Matthew again provided the shellcode:

0:	201040001337   MOVEI 1,1337
1:	201100000400   MOVEI 2,400
2:	200001000000   MOVE 0,0(1)
3:	322000000013+b JUMPE 0,13
4:	202000000033   MOVEM 0,33
5:	436100000033   IORM 2,33
6:	700200012000   WRAPR 12000
7:	332000000033   SKIPE 0,33
10:	324000000007+b JUMPA 0,7
11:	271040000001   ADDI 1,1
12:	324000000002+b JUMPA 0,2
13:	254200000000   HALT 0

Jumps on this architecture are the 9 bits “324” followed by a bunch of zeros, then the 18-bit jump address. I couldn’t code this without a bunch of stuff breaking, so I had to do a bunch of re-jiggering. At some point, I decided that AMD64 was still easier to reason about than the very complicated ARM64 encoding scheme (seriously just look at this page to see the craziness), so I switched back to AMD64 (luckily very quick to do after a bit of fiddling). The nice thing about AMD64 is that you can make some variable-length instructions “eat” many sequential bytes as large 32- or even 64-bit constants, making it possible to skip lots of bytes easily.

I stuck a PDP-10 halt at the first open spot – instruction 5 (bit address 180), and reworked the existing code until I could hit the halt at that address. Then I just replaced it with a jump instruction to jump to our shellcode. 7 architectures!!

Nova (16-bit architecture, Data General, 1969)

Nova turned out to be pretty easy. Ryan Goulden found out that our existing code contained an indirect jump to address 0225 (i.e. it loaded the value at 0225 and jumped to whatever that was), so by sticking an address to our shellcode there, we were in business. So he whipped up a simple bit of shellcode that prints out the 13 characters of the flag:

0:	LDA 2,2
1:	JMP 3
2:	JMP -41,2
3:	LDA 0,0,2
4:	MOVS 0,0
5:	DOAS 0,TTO
6:	SKPDN TTO
7:	JMP 6
10:	MOVS 0,0
11:	DOAS 0,TTO
12:	SKPDN TTO
13:	JMP 12
14:	LDA 0,1,2
15:	MOVS 0,0
16:	DOAS 0,TTO
17:	SKPDN TTO
20:	JMP 17
21:	MOVS 0,0
22:	DOAS 0,TTO
23:	SKPDN TTO
24:	JMP 23
25:	LDA 0,2,2
26:	MOVS 0,0
27:	DOAS 0,TTO
30:	SKPDN TTO
...
107:	MOVS 0,0
110:	DOAS 0,TTO
111:	SKPDN TTO
112:	JMP 111
113:	HALT

Because of the indirect jump, it’s really easy to just stash this code basically anywhere, and put a pointer to it at 0225. Job done!

Even with 8 architectures, we were only in 3rd place on this challenge. We later found out that both 1st (HITCON) and 2nd (Dragon Sector) found a bug that let them trick the scoring system for several architectures, so we were the only ones with 8 real architectures.

Shortly after we submitted this monstrosity, the organizers indicated that the challenge would be retired soon, so we stopped working on it. I’m fairly confident that we could have gotten up to all 12 architectures with a real solution, but it would have likely taken longer than the time allotted for the challenge (and, I’m sad to say, took a good bit of manpower away from other challenges – alas, the dangers of nerdsniping are real).

The final code and data are posted online at https://github.com/nneonneo/doublethink.