Module Disasm.Driver

Disassembler Driver.

This is a low-level interface to the CFG reconstruction and disassembling engine. It is used by BAP's high-level components, such as the recursive-descent disassembler, so there is in general no need to use it directly, unless you're devising a custom disassembly pipe-line.

The disassembler is driven and controlled by the knowledge base, so it is possible to modify the behavior of the BAP disassembler layer through the knowledge base and turn the default recursive-descent mode into something more conservative, e.g., speculative, superset, shingled, or even probabilistic disassembler.

Algorithm

Memory Classification

When the driver is fed with a new memory region (using the Driver.scan function), it uses the knowledge base to initially classify addresses that belong to this region.

For each byte in the region, it creates a temporary core-theory:program object and sets is label-address property to the address of that byte. It then queries if it is a function start (core-theory:is-function) and if it is known to be code or data (if the core-theory:is-valid property is (true) then it is classified as code and if it (false) then it is data, otherwise, it is undetermined). All objects created during classification are deleted immediately after the query and never committed to the knowledge base (they are scoped objects). Therefore it is fine to speculate and assume that all bytes are code by providing (true) to the is-valid property of each byte.

Disassembling

The disassembling starts from each function start (as identified by the previous step) and continues until there is no more unprocessed function starts and all addresses, which were classified as code, are either successfully disassembled or proved to be data.

During disassembling the driver will discover more jump destinations and add them to the worklist. The default mode is speculative, i.e., when we meet a barrier, we continue disassembling beyond it. If the worklist is empty, but the set of addresses that were classified as code (in the first step) is still not empty, which means that these addresses are not reachable from the initially provided starting points (function starts) then the minimal address is extracted from the set and is assumed to be a start of a basic block and added to the worklist.

The process continues until both the worklist and the set of code addresses are empty. When the process converges the knowledge base will contain all disassembled instructions (though some of them might be invalid). The result of the disassembly is the value that contains information sufficient to reconstruct the control-flow graph of the program. It could be queried directly, using various accessors or folded over with the explore function, which is a generalized control-flow graph building function.

Backtracking

The disassembler has a backtracking mechanism that enables it to track each disassembled byte back to the memory byte that was initially marked as code or a function start. When we identify an instruction chain that is invalid, i.e., when data follow a machine instruction or when its destination is some data, we retract the whole chain of instructions. This ensures that all valid instructions belong to at least one valid instruction chain. The justification for not including invalid instructions chains into the disassembly is that such chains will unconditionally switch the CPU into the invalid instruction state which will terminate the program. Since such a chain can't include system calls or CPU exceptions (both are not barriers) it can't have any side-effects visible outside of the process so it could be safely ignored.

Delay slots

Any instruction could have a delay (core-theory:delay) that is greater than zero. In that case the execution order of the instructions will not be equal to the linear order of the instructions addresses and m instructions that follow the delayed instruction will be executed before that instruction (put in the basic block before it), where m is the size of the delay slot, expressed in instructions.

type state

information necessary to build the control-flow graph.

  • since 2.2.0 implements [bin_io]
val bin_shape_state : Core_kernel.Bin_prot.Shape.t
val bin_size_state : state Core_kernel.Bin_prot.Size.sizer
val bin_write_state : state Core_kernel.Bin_prot.Write.writer
val bin_writer_state : state Core_kernel.Bin_prot.Type_class.writer
val bin_read_state : state Core_kernel.Bin_prot.Read.reader
val __bin_read_state__ : (int -> state) Core_kernel.Bin_prot.Read.reader
val bin_reader_state : state Core_kernel.Bin_prot.Type_class.reader
val bin_state : state Core_kernel.Bin_prot.Type_class.t
type insns

abstract type for a sequence of instructions.

type jump

abstract representation of a jump instruction.

val init : state

init the initial disassembler state.

val merge : state -> state -> state

merge x y is a sum of information in states x and y.

  • since 2.2.0
val equal : state -> state -> bool

equal x y is true if x and y denote equal graphs.

  • since 2.2.0

scan mem state updates the state.

If bytes in mem were already scanned then returns state without changes. Nothing is stored in the knowledge base.

If it is a new memory region then classifies and disassembles the whole region. For each disassembled address p an object Theory.Label.for_addr p is stored in the knowledge base. It will contain a fully disassembled and lifted instruction. All instructions that are subroutine entry points will have is-subroutine property set to (true).

See sec:Algorithm for the detailed description of the algorithm.

val explore : ?entries:addr Core_kernel.Sequence.t -> ?entry:addr -> ?follow:(addr -> bool Bap_knowledge.knowledge) -> block:(mem -> insns -> 'n Bap_knowledge.knowledge) -> node:('n -> 'c -> 'c Bap_knowledge.knowledge) -> edge:('n -> 'n -> 'c -> 'c Bap_knowledge.knowledge) -> init:'c -> state -> 'c Bap_knowledge.knowledge

explore ~block ~node ~edge ~init state builds a control-flow graph from the state.

This function is a generalized fold function that calls the user provided functions to construct the graph, which has abstract type 'g.

  • block mem insns creates a basic block of type 'n which covers memory mem; insns is the sequence of instructions that constitute that memory, use list_insns insns to get instructions in the linear order of their addresses, or execution_order to get the execution order (which might be different from linear in the presence of delayed and speculative instructions).
  • node x g inserts the node x into graph g;
  • edge x y g inserts an edge between nodes x and y;
  • init is the initial graph;
  • follow x if it returns true then the function will follow this destination. Defaults to a function that always evaluates to true.
  • entry is the entry point from which to build the graph, if absent, then all basic blocks will be consecuitively, in the order of ascending addresses, used as the entry points.
  • entries is the sequence of entry points, if both entry and entries are specified then entry is consed with entries.
  • since 2.2.0 the optional [entries] parameter was added.
val list_insns : ?rev:bool -> insns -> Bap_core_theory.Theory.Label.t list

list_insns xs returns the list of instructions in the ascending order of their addresses (or descending if rev is true (defaults to false.

execution_order xs reruns a list of instructions in the order in which they will be executed by the target CPU, which could be different when instructions are delayed or speculatively executed.

Low-level interface

All functions in this interface were made availabe in BAP 2.2.0 unless stated otherwise.

val subroutines : state -> Core_kernel.Set.M(Addr).t

subroutines state is a set of subroutine entry points that either were provided through the knowledge base or later discovered as destinations of call instructions.

  • since 2.2.0
val blocks : state -> Core_kernel.Set.M(Addr).t

blocks state is the set of addresses of instructions that start basic blocks.

  • since 2.2.0
val is_data : state -> addr -> bool

is_data state x is true if x was classified as data.

  • since 2.2.0
val is_subroutine : state -> addr -> bool

is_subroutine s x is Set.mem (subroutines s) x.

  • since 2.2.0
val is_block : state -> addr -> bool

is_block s x is Set.mem (blocks s) x.

  • since 2.2.0
val is_jump : state -> addr -> bool

is_jump s x is true if x is the maximal address of an instruction in the basic block that terminates with a jump.

Note, that when jumps are delayed, the linear order of instructions differs from the execution order, so the address of the last instruction is not the maximal address in the basic block.

  • since 2.2.0
val jump : state -> addr -> jump option

jump state src if src is the last in the linear order instruction of a basic block that terminates with a jump instruction then returns the descrption of that jump instruction.

  • since 2.2.0
val destinations : jump -> Core_kernel.Set.M(Addr).t

destinations jump returns the set of resolved destinations.

  • since 2.2.0
val is_call : jump -> bool

is_call jump is true if jump is call.

  • since 2.2.0
val is_barrier : jump -> bool

is_barrier jump is true if jump is barrier.

  • since 2.2.0