Skip to content

Commit ef6afa4

Browse files
authored
adds experimental Ghidra disassembler and lifting backend (#1326)
* initial scaffolding * adds support for nameless registers in the disasm backend We don't want to model unique (temporary) registers as normal physical registers, so we add an option to set the name to -1 to indicate that the register is a unique anonymous variable. * the minimal implementation * introduces sub-instructions to the disassembler There was some provision for that in the backend but never fully implemented. This feature enables seambless integeration of the ghidra backend but may also used for VLIW architecture, e.g., hexagon llvm backend is using it. * fixes the handling of the unique namespace * implements semantics of sequential instructions * specifies semantics for most of the pcode instructions * passes the basic instruction info to the sequential semantics value * fixes the semantics of LOAD * improves handling of pcode namespaces Fixes varnode classification (no longer treating addresses as registers). Also to prevent name clashes between virtual variables from different scopes we use unique prefixes (aka shortcuts in ghidra's parlance) to distinguish between them. This also makes the generated code more readable and closer to the originally genereated pcode. * fixes the subpiece implementation * puts all pcode opcodes into the pcode namespace Since they are the same for each architecture and now we have this feature. * implements support for user-defined opcodes we translate `CALLOTHER(<name>,<arg1>,...,<argN>)` to `<name>(<arg1>,...,<argN>)` * translates local (intra-instruction) branches into GOTOs Branches in p-code are overloaded by the type of destination. If the destination is a virtual address then it is a normal branch and if it is a constant (a varnode from the constant namespace) then it is an intra-instruction branch that represents the inner instruction logic. * fixes the satisfies function to account for subinstructions before that it was only looking into the top-level instruction * an attempt to pack subinstructions inside an instruction (breaks lots of stuff) * publishes subinstructions I will probably forfeit this approach, still investigating. * introduces the null object to the knowledge base It was actually already there, hidden in the [obj] domain. Now it is properly documented with well defined semantics. * uses the null object to represent unlabeled blocks in the lifters Also, more lifters now respect the passed label to the blk operator. * removes the subinstruction slot from instruction * adds labels reification to BIL semantics, also reifies gotos all using special encoding * implements intra-instruction gotos * adds the sequence number documentation. * fixes error hanlding in the goto-subinstruction primitive * implements a proper disassembler factory that scans for ldefs files So far not working quite correctly, as the default variables (like word size, etc) are not properly set. Investigating... * adds a proper processor initialization * implements proper command-line interface Now the plugin is able to list the targets and pass the path to ghidra root. * adds a tentative --x86-backend option to enable ghidra for x86 * fixes offset and address calculation * adds `is-symbol` semantic primitive * fixes overloading of the Primus Lisp semantic definitions The overloading was prevented by the attributes computation, which expected no overloads. Also, makes error message more readable. * tries to overload p-code operations based on their operand types * passes operands types per each operand, removes extra opcodes It looks like that not only branches are overloaded in p-code but all operations, e.g., we can have `INT_ADD (mem:x) (mem:y) (mem:z)` that represents `mem[x] <- mem[y] + mem[z]`. We could also resolve this overloading by adding suffixes to operations, e.g., `INT_ADDmr`, `INT_ADDmm`, and so on, but will explode the number of opcodes, especially in the presence of user-defined opcodes. * catches the bad or unimplemented instructions during decoding ghidra raises an exception if an instruction is not valid or there is no semantics for it. * adds signed ordering Primus Lisp primitives and implemented corresponding pcode operations * passes full information about the operand type from the backend In p-code the semantics of an operation is defined by the types of its operands. In case of the unique variables the type is not known to us so we have to pass it. This commit extends the previous approach, where we were passing only the kind of the operand (mem vs imm) to the full type qualification, where the type of memory is represented with Nil and the type of immediates is represented by its size in the number of bits. * adds the missing BOOL_NEGATE operation * fixes the negation operator (pcode represent bool as byte) * removes aliased registers from the register table * fixes the selection of the default backend for x86 it should be llvm if not specified otherwise * switches to caseless ordering of variables Changes and documents the ordering of variables. Variables are no compared caseless and the ordering is made loosely compatible with the caseless lexicographical ordering of the textual representation of variables' identifiers, e.g., ``` ``` * enables ghidra backend for the arm targets Right now it is disabled by default, use `--arm-backend=ghidra` to enable. * uses pcode-x86 as the CT language for pcode in x86 targets * adds ghidra backend to mips * minor pretty-printing tweaks to make things more readable * improves primitives performance in Primus Lisp With this optimization Primus Lisp-based lifters run five times faster. This especially important for ghidra backend lifters, which are fully dependent on Primus Lisp. The idea is to let the primitive implementors provide the body of their primitive so that every primitive is not computed via the semantics promise but is invoked directly. Another big idea is to provide such an interface that will allow to factor out computations that are common to the target. The same idea could be extrapolated to all promises. * optimization: improves name resolution in Primus Lisp Uses maps for names, not lists. Surprisingly not much of improvement, something about 5%. * optimizes unit computatation by hoisting it out of the loop * optimization: do not request lisp arguments if not necessary To compute list arguments we need to invoke the theory and reflect them into it. The resulting work is discarded if the name is not bound. The optimization checks if the name is bound by the program and only after that asks for the list of arguments. * adds ghidra to powerpc (works out of the box) * adds the `--x86-64-backend` option it is just an alias for `--x86-backend`, for consistency * enables ghidra for riscv (doesn't run out of the box) * enables the ghidra backend in CI/CD We will build Ghidra only on supported targets, right now it is Ubuntu Bionic. We will soon add more targets and more packages. The bap packages will be now split into `bap-core`, `bap`, and `bap-extra`. The `bap-core` will contain the minimal part of platform without analyses. The `bap` package will include most of the analyses, finally, `bap-extra` will include heavy-weight analysis (in terms of extra dependencies and build times), e.g., the symbolic executor, and the ghidra backend. * moves the ghidra install section lower * downgrades the ubuntu version on CI/CD * tries to fix the macOS build * disables ghidra in the opam/opam used in CI
1 parent 2d94822 commit ef6afa4

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

71 files changed

+1842
-412
lines changed

.github/workflows/build-and-test.yml

+8-2
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ on:
66

77
jobs:
88
build:
9-
runs-on: ubuntu-latest
9+
runs-on: ubuntu-18.04
1010

1111
env:
1212
OPAMJOBS: 2
@@ -32,7 +32,13 @@ jobs:
3232
run: opam pin add bap . --no-action
3333

3434
- name: Install system dependencies
35-
run: opam depext -u bap
35+
run: opam depext -u bap-extra
36+
37+
- name: Install Ghidra
38+
run: |
39+
sudo add-apt-repository ppa:ivg/ghidra -y
40+
sudo apt-get install libghidra-dev -y
41+
sudo apt-get install libghidra-data -y
3642
3743
- name: Install opam dependencies
3844
run: |

.github/workflows/build-dev-repo.yml

-1
Original file line numberDiff line numberDiff line change
@@ -43,7 +43,6 @@ jobs:
4343
if: matrix.os == 'macos-latest'
4444
run: |
4545
rm -rf /usr/local/bin/2to3
46-
brew unlink gcc@8
4746
brew unlink gcc@9
4847
brew update
4948
brew upgrade

.github/workflows/build-from-opam.yml

+9-3
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ jobs:
1010
- 4.11.x
1111
- 4.08.x
1212

13-
runs-on: ubuntu-latest
13+
runs-on: ubuntu-18.04
1414

1515
env:
1616
OPAMJOBS: 2
@@ -23,11 +23,17 @@ jobs:
2323
ocaml-compiler: ${{ matrix.ocaml-compiler }}
2424
dune-cache: true
2525

26+
- name: Install Ghidra
27+
run: |
28+
sudo add-apt-repository ppa:ivg/ghidra -y
29+
sudo apt-get install libghidra-dev -y
30+
sudo apt-get install libghidra-data -y
31+
2632
- name: Add the testing Repository
2733
run: opam repo add bap git://github.com/BinaryAnalysisPlatform/opam-repository#testing
2834

2935
- name: Install system dependencies
30-
run: opam depext -u bap
36+
run: opam depext -u bap-extra
3137

3238
- name: Cleanup the Caches
3339
run: sudo apt clean --yes
@@ -36,7 +42,7 @@ jobs:
3642
run: df -h
3743

3844
- name: Build and Install BAP Packages
39-
run: opam clean -a; opam install bap
45+
run: opam clean -a; opam install bap-extra
4046

4147
- uses: actions/upload-artifact@v2
4248
if: ${{ always() }}

.github/workflows/nightly-testing.yml

+7-12
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ jobs:
1010
strategy:
1111
matrix:
1212
os:
13-
- ubuntu-latest
13+
- ubuntu-18.04
1414
ocaml-compiler:
1515
- 4.11.x
1616
- 4.08.x
@@ -29,30 +29,25 @@ jobs:
2929
dune-cache: true
3030
cache-prefix: nightly
3131

32-
- name: Configure Homebrew
33-
if: matrix.os == 'macos-latest'
32+
- name: Install Ghidra
3433
run: |
35-
rm -rf /usr/local/bin/2to3
36-
brew unlink gcc@8
37-
brew unlink gcc@9
38-
brew update
39-
brew upgrade
40-
echo 'LLVM_CONFIG=/usr/local/opt/llvm@9/bin/llvm-config' >> $GITHUB_ENV
34+
sudo add-apt-repository ppa:ivg/ghidra -y
35+
sudo apt-get install libghidra-dev -y
36+
sudo apt-get install libghidra-data -y
4137
4238
- name: Add the Testing Repository
4339
run: opam repo add bap git://github.com/BinaryAnalysisPlatform/opam-repository#testing
4440
- name: Install System Dependencies
45-
run: opam depext -u bap
41+
run: opam depext -u bap-extra
4642

4743
- name: Install radare2 Dependencies
4844
run: opam depext -u bap-radare2
4945

5046
- name: Cleanup the Caches
51-
if: matrix.os == 'ubuntu-latest'
5247
run: sudo apt clean --yes
5348

5449
- name: Build and Install BAP
55-
run: opam install bap bap-radare2
50+
run: opam install bap-extra bap-radare2
5651

5752
- name: Checkout the Tests
5853
uses: actions/checkout@v2

.github/workflows/publish-docker-image.yml

+1-1
Original file line numberDiff line numberDiff line change
@@ -21,4 +21,4 @@ jobs:
2121
with:
2222
push: true
2323
tags: binaryanalysisplatform/bap:latest
24-
file: docker/ubuntu/xenial/Dockerfile
24+
file: docker/ubuntu/bionic/Dockerfile

.github/workflows/release.yml

+6
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,12 @@ jobs:
2424
ocaml-compiler: ocaml-variants.4.11.2+flambda
2525
dune-cache: true
2626

27+
- name: Install Ghidra
28+
run: |
29+
sudo add-apt-repository ppa:ivg/ghidra -y
30+
sudo apt-get install libghidra-dev -y
31+
sudo apt-get install libghidra-data -y
32+
2733
- name: Add the testing Repository
2834
run: opam repo add bap git://github.com/BinaryAnalysisPlatform/opam-repository#testing
2935
- name: Build deb packages

docker/ubuntu/bionic/Dockerfile

+5-2
Original file line numberDiff line numberDiff line change
@@ -2,13 +2,16 @@ FROM ocaml/opam2:ubuntu-18.04
22

33
WORKDIR /home/opam
44

5-
RUN sudo apt-get update \
5+
RUN sudo add-apt-repository ppa:ivg/ghidra -y
6+
&& sudo apt-get install libghidra-dev -y
7+
&& sudo apt-get install libghidra-data -y
8+
&& sudo apt-get update \
69
&& opam switch 4.09 \
710
&& eval "$(opam env)" \
811
&& opam remote set-url default https://opam.ocaml.org \
912
&& opam repo add bap git://github.com/BinaryAnalysisPlatform/opam-repository --all \
1013
&& opam update \
11-
&& opam depext --install bap --yes -j 1 \
14+
&& opam depext --install bap-extra --yes -j 1 \
1215
&& opam clean -acrs \
1316
&& rm -rf /home/opam/.opam/4.0[2-8,10] \
1417
&& rm -rf /home/opam/.opam/4.09/.opam-switch/sources/* \

lib/arm/arm_target.ml

+27-3
Original file line numberDiff line numberDiff line change
@@ -384,6 +384,7 @@ let enable_arch () =
384384
let llvm_a32 = CT.Language.declare ~package "llvm-armv7"
385385
let llvm_t32 = CT.Language.declare ~package "llvm-thumb"
386386
let llvm_a64 = CT.Language.declare ~package "llvm-aarch64"
387+
let pcode = CT.Language.declare ~package "pcode-arm"
387388

388389
module Dis = Disasm_expert.Basic
389390

@@ -459,6 +460,27 @@ let before_thumb2 t = t < LE.v6t2 || t < EB.v6t2
459460
let is_64bit t = LE.v8a <= t || EB.v8a <= t || Bi.v8a <= t
460461
let is_thumb_only t = LE.v7m <= t || EB.v7m <= t || Bi.v7m <= t
461462

463+
let is_big t = Theory.Target.endianness t = Theory.Endianness.eb
464+
let is_little t = Theory.Target.endianness t = Theory.Endianness.le
465+
466+
let register_pcode () =
467+
Dis.register pcode @@ fun t ->
468+
let triple = match is_64bit t,is_little t,is_big t with
469+
| true,true,_ -> "ARM:LE:32:v8"
470+
| true,_,true -> "ARM:BE:32:v8"
471+
| true,_,_ -> "ARM:LEBE:32:v8LEInstruction"
472+
| false,true,_ -> "ARM:LE:32:v7"
473+
| false,_,true -> "ARM:BE:32:v7"
474+
| false,_,_ -> "ARM:LEBE:32:v7LEInstruction" in
475+
Dis.create ~backend:"ghidra" triple
476+
477+
let enable_pcode () =
478+
register_pcode ();
479+
KB.promise Theory.Label.encoding @@ fun label ->
480+
Theory.Label.target label >>| fun t ->
481+
if is_arm t then pcode
482+
else Theory.Language.unknown
483+
462484
let guess_encoding interworking label target =
463485
if is_arm target then
464486
if is_64bit target then !!llvm_a64 else
@@ -472,7 +494,7 @@ let guess_encoding interworking label target =
472494
| false -> !!llvm_a32
473495
else !!CT.Language.unknown
474496

475-
let enable_decoder ?interworking () =
497+
let enable_llvm ?interworking () =
476498
let open KB.Syntax in
477499
register llvm_a32 "armv7";
478500
register llvm_t32 "thumbv7" ~attrs:"+thumb2";
@@ -481,7 +503,9 @@ let enable_decoder ?interworking () =
481503
CT.Label.target label >>= guess_encoding interworking label
482504

483505

484-
let load ?interworking () =
506+
let load ?interworking ?(backend="llvm") () =
485507
enable_loader ();
486508
enable_arch ();
487-
enable_decoder ?interworking ()
509+
if String.equal backend "llvm"
510+
then enable_llvm ?interworking ()
511+
else enable_pcode ()

lib/arm/arm_target.mli

+1-1
Original file line numberDiff line numberDiff line change
@@ -115,4 +115,4 @@ val llvm_a64 : Theory.language
115115
a symbol there with an odd address (which is used to indicate
116116
thumb encoding) then interworking is enabled.
117117
*)
118-
val load : ?interworking:bool -> unit -> unit
118+
val load : ?interworking:bool -> ?backend:string -> unit -> unit

lib/bap/bap.mli

+46
Original file line numberDiff line numberDiff line change
@@ -2412,6 +2412,18 @@ module Std : sig
24122412
val intrinsic: string Attribute.t
24132413

24142414

2415+
(** [label] a named code location.
2416+
2417+
@since 2.4.0 *)
2418+
val label : string Attribute.t
2419+
2420+
2421+
(** [goto] represents a control-flow transfer to a named label.
2422+
2423+
@since 2.4.0 *)
2424+
val goto : string Attribute.t
2425+
2426+
24152427
(** Core Theory specification of BIL. *)
24162428
module Theory : sig
24172429

@@ -6976,6 +6988,40 @@ module Std : sig
69766988
by evaluating in many languages, e.g. Python, Js, etc *)
69776989
val pp_adt : Format.formatter -> t -> unit
69786990

6991+
6992+
(** Subinstruction Sequence Number.
6993+
6994+
A subinstruction sequence number plays the role of an address
6995+
for sub-instruction (which otherwise share the same physical
6996+
address).
6997+
6998+
Each subinstruction is having a unique address across the
6999+
whole program (not only unique across to other subinstructions
7000+
of the same instruction) and much like [Theory.Label.for_addr]
7001+
it is possible to get a label that corresponds to an
7002+
instruction with the given sequence number using
7003+
[Seqnum.label].
7004+
7005+
The sequence number is represented with an integer to enable
7006+
address arithemetics. A subinstruction that follows a
7007+
subinstruction with the sequence number [N] has the sequence
7008+
number [N+1].
7009+
7010+
@since 2.4.0
7011+
*)
7012+
module Seqnum : sig
7013+
type t = int
7014+
7015+
7016+
(** [label seqnum] returns the program label that corresponds
7017+
to [seqnum]. *)
7018+
val label : ?package:string -> t -> Theory.Label.t KB.t
7019+
7020+
(** [slot] for accessing the sequence number of a subinstruction. *)
7021+
val slot : (Theory.program, t option) KB.slot
7022+
end
7023+
7024+
69797025
(** {3 Prefix Tree}
69807026
This module provides a trie data structure where a sequence of
69817027
instructions is used as a key (and an individual instruction

lib/bap/bap_project.ml

+11-11
Original file line numberDiff line numberDiff line change
@@ -39,22 +39,22 @@ let memory_slot = KB.Class.property Theory.Unit.cls "unit-memory"
3939
~desc:"annotated memory regions of the unit"
4040
Memmap.domain
4141

42-
let with_filename spec target code memory path =
42+
let with_filename spec target _code memory path f =
4343
let open KB.Syntax in
4444
let width = Theory.Target.code_addr_size target in
4545
let bias = query spec Image.Scheme.bias |> Option.map
4646
~f:(fun x -> Bitvec.(int64 x mod modulus width)) in
47+
Theory.Unit.for_file path >>= fun unit ->
48+
KB.sequence [
49+
KB.provide Image.Spec.slot unit spec;
50+
KB.provide Theory.Unit.bias unit bias;
51+
KB.provide Theory.Unit.target unit target;
52+
KB.provide Image.Spec.slot unit spec;
53+
KB.provide Theory.Unit.path unit (Some path);
54+
KB.provide memory_slot unit memory;
55+
] >>= fun () ->
4756
KB.promising Theory.Label.unit ~promise:(fun _ ->
48-
Theory.Unit.for_file path >>= fun unit ->
49-
KB.sequence [
50-
KB.provide Image.Spec.slot unit spec;
51-
KB.provide Theory.Unit.bias unit bias;
52-
KB.provide Theory.Unit.target unit target;
53-
KB.provide Image.Spec.slot unit spec;
54-
KB.provide Theory.Unit.path unit (Some path);
55-
KB.provide memory_slot unit memory;
56-
] >>| fun () ->
57-
Some unit)
57+
!!(Some unit)) f
5858

5959

6060
module State = struct

lib/bap_core_theory/bap_core_theory.mli

+40-5
Original file line numberDiff line numberDiff line change
@@ -138,11 +138,12 @@
138138
be bound to expressions. Sometimes variables are typed, sometimes
139139
they are just identifiers with not associated type.
140140
141-
In the Core Theory all variables are sorted, i.e., the have an
142-
associated value sort. Variables are also having scope and
143-
extent. Finally, variables could be mutable or immutable.
141+
In the Core Theory all variables are sorted, i.e., they have an
142+
associated value sort. Variables are also having scope (lexical
143+
visibility), and extent (lifetime) Finally, variables could be
144+
mutable or immutable.
144145
145-
A physical variable is a global mutable variable with infinite
146+
A physical variable is a global mutable variable with the infinite
146147
scope and extent. They are used to refer predefined (micro)
147148
architectural locations of a modeled system, e.g., registers,
148149
memory banks, caches, register files, etc. Global variables has
@@ -778,6 +779,10 @@ module Theory : sig
778779
Since such structures are required to be monomorphic, the
779780
sort type index should be removed using the [forget] function,
780781
before a sort could be stored in it.
782+
783+
Note, that the type index is only removed from the meta
784+
language (OCaml) type, but is preserved in the value term,
785+
so it could be reconstructed (refined) later.
781786
*)
782787
module Top : sig
783788
type t = unit sort [@@deriving bin_io, compare, sexp]
@@ -1080,7 +1085,9 @@ module Theory : sig
10801085
*)
10811086
module Var : sig
10821087
type 'a t
1088+
10831089
type ident [@@deriving bin_io, compare, sexp]
1090+
10841091
type ord
10851092

10861093

@@ -1161,7 +1168,21 @@ module Theory : sig
11611168
@since 2.3.0 *)
11621169
val pp : Format.formatter -> 'a t -> unit
11631170

1164-
(** Variable identifiers. *)
1171+
(** Variable identifiers.
1172+
1173+
Identifiers are compared caseless, otherwise the order loosely
1174+
matches the lexicographical order of the textual
1175+
representation. Identifiers of virtual variables are ordered
1176+
before identifiers of physical variables and mutable virtual
1177+
variables are ordered before immutable. Identifiers of a
1178+
versioned variable are ordered in the ascending order of their
1179+
versions. And identifiers of virtual variables are ordered in
1180+
the ascending order of their numeric values, e.g., `#2`
1181+
is ordered before `#123`.
1182+
1183+
@before 2.4.0 the ordering was unspecified but wasn't caseless.
1184+
@since 2.4.0 the ordering is caseless
1185+
*)
11651186
module Ident : sig
11661187
type t = ident [@@deriving bin_io, compare, sexp]
11671188
include Stringable.S with type t := t
@@ -1174,6 +1195,10 @@ module Theory : sig
11741195
11751196
This module enables construction of complex data structures on
11761197
variables, e.g., [Set.empty (module Theory.Var.Top)].
1198+
1199+
The variables are ordered by their identifiers so that two
1200+
variables with the same name but different sorts are compared
1201+
equal.
11771202
*)
11781203
module Top : sig
11791204
type nonrec t = unit t [@@deriving bin_io, compare, sexp]
@@ -1782,6 +1807,16 @@ module Theory : sig
17821807
val is_subroutine : (program, bool option) KB.slot
17831808

17841809

1810+
(** [fresh] a fresh label (a shortcut for [KB.create cls]).
1811+
1812+
@since 2.4.0 *)
1813+
val fresh : t knowledge
1814+
1815+
(** [null] is a shortcut for [KB.null cls].
1816+
1817+
@since 2.4.0 *)
1818+
val null : t
1819+
17851820
(** [for_addr x] generates a link to address [x].
17861821
17871822
It is guaranteed that every call [for_addr ~package x] with

0 commit comments

Comments
 (0)