Compile (Linux):
mkdir build
cd build
cmake ..
make -j
cd ..
After compiling: First create a folder for the output and create a folder for the input. Add all of the CGO/DGO files into the input folder.
build/jak_disassembler config/jak1_ntsc_black_label.jsonc in_folder/ out_folder/
The config
folder has settings for the disassembly. Currently Jak 2 and Jak 3 are not as well supported as Jak 1.
The ObjectFileDB
tracks unique object files. The games have a lot of duplicated objected files, and object files with the same names but different contents, so ObjectFileDB
is used to create a unique name for each unique object file. It generates a file named dgo.txt
which maps its names to the original name and which DGO files it appears in. The ObjectFileDB
extracts all object files from a DGO file, decompressing the DGO first if needed. (note: Jak 2 demo DGOs do not decompress properly). Each object file has a number of segments, which the game can load to separate places. Sometimes there is just a single "data" segment, and other times there are three segments:
top-level
is executed at the end of the linking process, then discarded and goes in a special temporary heapmain
is loaded and linked onto the specified heapdebug
is loaded and linked onto the debug heap
This function interprets breaks the object file's data into segments, and processes the link data. The data is stored as a sequence of LinkedObjectWord
s, which contain extra data from the link. The LinkedObjectWord
s are stored by segment in a LinkedObjectFile
, which also contains a list of Label
s that allow LinkedObjectWord
s to refer to other LinkedObjectWord
s. Note that a Label
can have a byte-offset into a word, which GOAL uses to load non-4-byte-aligned bytes and halfwords, and also to represent a pair
object.
This function looks through the LinkedObjectFile
s and splits each segment into data and code zones.
The only files with code zones are from object files with three segments, and the code always comes first. The end of the code zone is found by looking for the last GOAL function
object, then finding the end of this object by looking one word past the last jr ra
instruction. This assumes that the last function in each segment doesn't have an extra inline assembly jr ra
somewhere in the middle, but functions with multiple jr ra
's are extremely rare (and not generated by the GOAL compiler without the use of inline assembly), so this seems like a safe assumption for now.
The code zones are scanned for GOAL function
types, which are in front every GOAL function, and used to create Functions
. Each Function
is disassembled into EE Instructions, which also adds Label
s for branch instructions, and can also contain linking data when appropriate. The final step is to look for instructions which use the fp
register to reference static data, and insert the apprioriate Label
s. GOAL uses the following fp
relative addressing modes:
lw
,lwc1
,ld
relative to thefp
register to load static data.daddiu
to create a pointer to fp-relative data within +/-2^15
bytes- Sequence of
ori
,daddu
to generate a pointer that reaches within+2^16
bytes - Sequence of
lui
,ori
,daddu
to generate any 32-bit offset fromfp
.
The last two are only found in very large object files, and GOALDIS doesn't handle these.
The fp
register is set with this sequence. The function prologue only sets fp
if it is needed in the function.
;; goal function call, t9 contains the function address
jalr ra, t9
sll v0, ra, 0
;; example goal function prologue:
daddiu sp, sp, -16
sd ra, 0(sp)
sd fp, 8(sp)
or fp, t9, r0
Note: there are a few hacks to avoid generating labels when fp
is used as a temporary register in inline assembly. Like ignoring stores/loads of fp
from the stack (kernel does this to suspend resume a thread), or ignoring fp
when used with the PEXTLW
function, or totally skipping this step for a single object file in Jak 2 (effect-control
).
This step simply renames labels with L1
, L2
, .... It should happen before any custom label naming as it will overwrite all label names.
Looks for static linked lists and attempts to print them. Doesn't support printing everything, but can print nested lists, strings, numbers, and symbols.
Dumps words in each segment like hexdump
. There's an option to only run this on v3
object files, which contain data, as opposed to v2
which are typically large data.
Like write_object_file_words
, but code is replaced with disassembly. There's a config option to avoid running this on object files with no functions, as these are usually large data files which are uninteresting to view as a binary dump and slow to dump.
Look at branch intstructions and their destinations to find all basic blocks. Implemented in find_blocks_in_function
as part of analyze_functions
. This works for Jak 1, 2 and 3.
This will help us find stack variables and make sure that the prologue/epilogue are ignored by the statement generation.
A "full" prologue looks like this:
daddiu sp, sp, -208
sd ra, 0(sp)
sd fp, 8(sp)
or fp, t9, r0 ;; set fp to the address of this function
sq s3, 128(sp)
sq s4, 144(sp)
sq s5, 160(sp)
sq gp, 176(sp)
swc1 f26, 192(sp)
swc1 f28, 196(sp)
swc1 f30, 200(sp)
GOAL will leave out instructions that aren't needed. This prologue is "decoded" into:
Total stack usage: 0xd0 bytes
$fp set? : yes
$ra set? : yes
Stack variables : yes, 112 bytes at sp + 16
Saved gprs: gp s5 s4 s3
Saved fprs: f30 f28 f26
A similar process is done for the epilogue, and it is checked against the prologue.
The prologue is removed from the first basic block and the epilogue + alignment padding is removed from the last one.
Currently the focus is to get these working for Jak 1. But it shouldn't be much extra work to support Jak 2/3.
When possible, we should guess function names. It's not always possible because GOAL supports anonymous lambda functions, like for example:
(lambda ((x int) (y int)) (+ x y))
which will generate a GOAL function
object without a name.
But these are pretty uncommon, and the majority of GOAL functions are
- Normal functions, which are stored into a
symbol
with the same name as the function - Methods, which are stored into the method table of their
type
with themethod-set!
function. Sadly we can't get the name of methods, but we can get their ID (to figure out the inheritance hierarchy) and what type they are defined for. - State handlers / behaviors (not yet fully understood)
- Virtual state handlers / behaviors (not yet fully understood)
Currently the state/behavior stuff isn't well understood, or used in the early initialization of the game, so name guessing won't worry about this for now.
The majority of GOAL types have a compiler-generated inpsect
method which prints their fields. We should detect these methods in the previous function name guessing step, and then read through them to determine the data layout of the type.
The basic blocks should be built into a graph and annotated with control flow patterns, like if
, cond
, and
, and various loops. To do this, register liveliness will be determined for each instruction.
Instructions (or sequences of instructions that should not be separated) should be converted into Statement
s, which represent something like (add! r1 r2 r3)
. The registers should be mapped to variables, using as many variables as possible, as we don't know at this point if a register will be holding the same GOAL variable at different instructions.
Variable
s should get types determined by arguments of the function, which should then be propagated to other Statement
s in the function, and can then refine the argument types of other functions. This process should be repeated until things stop changing.
Variables which are actually the same variable will be merged. The point at which variables are first defined/declared will be determined based on liveliness and then expanded to come up with a scope nesting that doesn't cross control flow boundaries.
Due to the the simple single pass GOAL compiler design, we build a tree which represents how Statements can be combined to eliminate variables. As an extremely simple example:
(set! r1 thing1)
(set! r2 thing2)
(add-int! r4 r2 r3)
(mult-int! r1 r4)
can be collapsed to
(* thing1 (+ thing2 r3))
But
(set! r2 thing2)
(add-int! r4 r2 r3)
(set! r1 thing1)
(mult-int! r1 r4)
can be collapsed to
(let ((temp0 (+ thing2 r3)))
(+ thing1 temp0)
)
and this difference will actually reflect the difference in how the code was originally written! This is a huge advantage over existing decompilers, which will be unable to tell the subtle difference between the two.
Lots of GOAL language features are implemented with macros, so once the s-expression nesting is recovered, we can pattern match to undo macros very precisely.