NES Just-in-time Compilation for Fun and Little Profit #

Image of SNES controller

Project goals #

A JIT compiler from NMOS 6502 machine code (minus decimal mode) to x86_64 and arm64 (this will likely be partially or entirely based on existing works)
HLE-style abstraction for NES PPU
Peek-poke, debug and extension support for the emulator through LuaJIT

Description #

The Nintendo Entertainment System comes from an era without creature comforts such as OpenGL, Vulkan and the wider world of computer graphics standards. Instead, it is a (sometimes too) clever mish-mash of bespoke discrete logic, generating graphics tied closely to the underlying analog TV format, and audio from a fixed set of waveform generators.

Everything from polling controller inputs to rendering sprites is a delicate dance of assembler trickery and precise timings. This led many game developers to employ interesting tactics to bend the system to their will. Accurate emulation, therefore, relies on matching these behaviours precisely, often to the detriment of readability and performance.

The purpose of this project is to, for the most part, ignore the “quirks” of the NES (e.g. the sprite overflow bug). Later, if the user wishes to run a game ROM which takes advantage of a known quirk, dynamic patching will be used to ensure the expected “broken” behaviour is followed. This allows for a few things:

The “essence” of the NES’s rendering pipeline is intuitively expressed.
The emulator can be expanded and poked without having to worry about breaking broken behaviour.
Common optimizations can be applied which may have been prohibited by “quirks”.

Also (and crucially to this project), this allows us to begin to construct consistent APIs around the NES’s behaviours. For example, we could develop a graphics rendering API atop the NES’s PPU which makes (at least some) sense from a modern game developer’s perspective. Note, though, that the purpose of this project is not to build a game engine.

Existing Works #

A Rust-based pure emulator, incomplete, written by myself to self-teach Rust.
A scarily similar work by the great Andrew Kelley (jamulator). I followed Andrew’s Zig dev-logs on Vimeo some time ago. Perhaps I picked up the idea for this project via unconscious osmosis.

Exploiting LLVM IR and the `inkwell` Rust bindings #

Let’s write a simple sum function in LLVM using the inkwell bindings. These bindings yield a safe (in the Rust sense) abstraction around the LLVM C/C++ API. The NES is an 8-bit system so, staying true to that, we will employ LLVM’s 8-bit integer type (i8) for our arithmetic. The following program outputs the corresponding machine code (I’m using an Apple M1-based Mac for this) for this sum function in a .s assembly file. The program is also instrumented to allow for JIT invocation of the sum function, but we’ll revisit this later.

use inkwell::OptimizationLevel;
use inkwell::builder::Builder;
use inkwell::context::Context;
use inkwell::execution_engine::{ExecutionEngine, JitFunction};
use inkwell::module::Module;
use inkwell::targets::{Target, TargetMachine, CodeModel, RelocMode, FileType};
use std::error::Error;
use std::path::Path;

type SumFunc = unsafe extern "C" fn(u8, u8, u8) -> u8;

struct CodeGen<'ctx> {
    context: &'ctx Context,
    module: Module<'ctx>,
    builder: Builder<'ctx>,
    execution_engine: ExecutionEngine<'ctx>,
}

impl<'ctx> CodeGen<'ctx> {
    fn jit_compile_sum(&self) -> Option<JitFunction<SumFunc>> {
        let i8_type = self.context.i8_type();
        let fn_type = i8_type.fn_type(&[i8_type.into(), i8_type.into(), i8_type.into()], false);
        let function = self.module.add_function("sum", fn_type, None);
        let basic_block = self.context.append_basic_block(function, "entry");

        self.builder.position_at_end(basic_block);

        let x = function.get_nth_param(0)?.into_int_value();
        let y = function.get_nth_param(1)?.into_int_value();
        let z = function.get_nth_param(2)?.into_int_value();

        let sum = self.builder.build_int_add(x, y, "sum");
        let sum = self.builder.build_int_add(sum, z, "sum");

        self.builder.build_return(Some(&sum));

        unsafe { self.execution_engine.get_function("sum").ok() }
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let context = Context::create();
    let module = context.create_module("sum");
    let execution_engine = module.create_jit_execution_engine(OptimizationLevel::None)?;
    let codegen = CodeGen {
        context: &context,
        module,
        builder: context.create_builder(),
        execution_engine,
    };
    let path = Path::new("out.s");
    let target_triple = TargetMachine::get_default_triple();
    let target = Target::from_triple(&target_triple)?;
    let target_machine = target.create_target_machine(
        &target_triple,
        "apple-m1",
        "",
        OptimizationLevel::Default,
        RelocMode::Default,
        CodeModel::Default
    ).ok_or("Unable to initialize target machine")?;

    let sum = codegen.jit_compile_sum().ok_or("Unable to JIT compile `sum`")?;

    let x = 1u8;
    let y = 2u8;
    let z = 3u8;

    unsafe {
        println!("{} + {} + {} = {}", x, y, z, sum.call(x, y, z));
        assert_eq!(sum.call(x, y, z), x + y + z);
    }

    target_machine.write_to_file(&codegen.module, FileType::Assembly, &path)?;

    Ok(())
}

According to the LLVM Language Reference Manual, “if the sum has unsigned overflow, the result returned is the mathematical result modulo 2^n, where n is the bit width of the result.” This should come in handy. It sounds like we should get NES arithmetic behaviour for free in our generated native machine code. But wait a minute…

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_sum
	.p2align	2
_sum:
	.cfi_startproc
	add	w8, w0, w1
	add	w0, w8, w2
	ret
	.cfi_endproc

.subsections_via_symbols

The generated add instructions use aarch64 w* registers, which are 32 bits wide. There is also no additional logic imposed by LLVM to give the behaviour that the manual described. Huh. This is actually just the result of LLVM identifying that nothing in this function ever observes the upper 24 bits of the result, so it does not bother to rectify the addition. More specifically, LLVM neither extends the i8 operands to fill the 32-bit registers nor truncates the sum: the low 8 bits of a 32-bit sum depend only on the low 8 bits of its operands, so they already hold the result modulo 2^8.

It’s worth noting that LLVM doesn’t do any checks akin to “has the addition overflowed into bit 8 (zero-indexed)? If so, truncate the result.” You can instruct LLVM to “poison” a value which overflowed its type (via the nuw and nsw flags on the add), but that’s a somewhat unrelated topic. Instead, what we observe in the generated machine code of a less trivial example, which actually uses the result, is the following.

(Note: This applies to the aarch64 backend, which I am using)

The backend loads the i8 by emitting an ldrb instruction.
The backend sign-extends the value to fill the register by emitting an sxtb instruction. Note that this keeps the style of overflowing arithmetic we expect. A signed value of -1 will go from having a representation of 0b11111111 to 0b11111111111111111111111111111111. See that adding 1 to either of these values would cause an overflow (in the unsigned sense), producing a value of 0.
The backend does something with the value and ultimately stores the lower 8 bits of the word-length (32-bit) register using strb.

The key to the arithmetic, therefore, is the sign extension on the way in paired with the truncating store on the way out: only the low 8 bits of the register are ever observed.
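The load, extend, add and store sequence above can be modelled in plain Rust as a sanity check. This is only a sketch, not LLVM output: `add_via_wide_register` and `all_byte_pairs_wrap` are hypothetical helper names, and the casts merely stand in for sxtb, a zero-extending load and strb. It suggests that the stored byte equals the modulo-2^8 sum whichever way the operands are widened, because the upper 24 bits never reach the low byte.

```rust
/// Hypothetical model of the backend's sequence: widen each byte to 32 bits,
/// add in the wide register, then keep only the low byte.
fn add_via_wide_register(a: u8, b: u8, sign_extend: bool) -> u8 {
    let widen = |v: u8| -> u32 {
        if sign_extend {
            // Stand-in for `sxtb`: copy bit 7 into bits 8..=31.
            v as i8 as i32 as u32
        } else {
            // Stand-in for a zero-extending load: bits 8..=31 are cleared.
            v as u32
        }
    };
    // 32-bit add, as performed on the w* registers.
    let wide = widen(a).wrapping_add(widen(b));
    // Stand-in for `strb`: only the low 8 bits are stored.
    wide as u8
}

/// Exhaustively compares the model against 8-bit wrapping addition.
fn all_byte_pairs_wrap() -> bool {
    (0..=u8::MAX).all(|a| {
        (0..=u8::MAX).all(|b| {
            let expected = a.wrapping_add(b);
            add_via_wide_register(a, b, true) == expected
                && add_via_wide_register(a, b, false) == expected
        })
    })
}
```

Calling `all_byte_pairs_wrap()` walks all 65,536 operand pairs, which is cheap enough to run as a unit test next to the code generator.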

NES 6502 JIT Strategy #

The NES uses the Ricoh 2A03 8-bit processor, based on the MOS Technology 6502 core and fabricated on an NMOS process. It excludes a decimal mode, though some games try to use it. The 6502 machine code is relatively trivial to fully understand, at least by today’s standards. This project will make heavy use of the LLVM compiler infrastructure, as well as the ORC JIT API. In essence, the 6502 machine code will be transformed into LLVM IR, which will then be JIT compiled into native machine code. Stores to known CPU/PPU-mapped addresses will be trapped and rewritten into interactions with the rendering API, which is yet to be defined.
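As a rough illustration of the store-trapping idea, the translator could classify each store address it can resolve before deciding what IR to emit. The sketch below is an assumption about how that might look, not the project's design: `StoreTarget` and `classify_store` are hypothetical names, and the ranges follow the commonly documented NES CPU memory map (2 KiB of RAM mirrored through $1FFF, eight PPU registers mirrored through $3FFF, OAM DMA at $4014).

```rust
/// Hypothetical classification of a 6502 store target on the NES CPU bus.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StoreTarget {
    /// Internal RAM; the payload is the offset into the 2 KiB array.
    Ram(u16),
    /// PPU register index 0..=7 ($2000-$2007 and their mirrors).
    PpuRegister(u8),
    /// $4014: starts a sprite (OAM) DMA transfer.
    OamDma,
    /// Remaining APU and I/O registers in $4000-$4017.
    ApuIo(u16),
    /// Everything else, which is mostly cartridge space.
    Other(u16),
}

/// Decides whether a store is plain memory or needs to be trapped.
fn classify_store(addr: u16) -> StoreTarget {
    match addr {
        // RAM repeats every 0x0800 bytes.
        0x0000..=0x1FFF => StoreTarget::Ram(addr & 0x07FF),
        // PPU registers repeat every 8 bytes.
        0x2000..=0x3FFF => StoreTarget::PpuRegister((addr & 0x0007) as u8),
        // Must come before the broader APU/IO range below.
        0x4014 => StoreTarget::OamDma,
        0x4000..=0x4017 => StoreTarget::ApuIo(addr),
        _ => StoreTarget::Other(addr),
    }
}
```

Under this sketch, a store classified as `PpuRegister` or `OamDma` would be lowered to a call into the rendering API rather than a raw memory write. Stores through indirect or indexed addressing modes, whose targets are only known at run time, would need the same check emitted into the generated code.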