Module Bap_byteweight.Bytes

Default implementation that uses a memory chunk as the domain.

include V2.S with type key = Bap.Std.mem and type corpus = Bap.Std.mem and type token := Bap.Std.word
include V1.S with type key = Bap.Std.mem with type corpus = Bap.Std.mem
type t
include Bin_prot.Binable.S with type t := t
val bin_size_t : t Bin_prot.Size.sizer
val bin_write_t : t Bin_prot.Write.writer
val bin_read_t : t Bin_prot.Read.reader
val __bin_read_t__ : (int -> t) Bin_prot.Read.reader
val bin_shape_t : Bin_prot.Shape.t
val bin_writer_t : t Bin_prot.Type_class.writer
val bin_reader_t : t Bin_prot.Type_class.reader
val bin_t : t Bin_prot.Type_class.t
include Ppx_sexp_conv_lib.Sexpable.S with type t := t
val t_of_sexp : Sexplib0__.Sexp.t -> t
val sexp_of_t : t -> Sexplib0__.Sexp.t
type key = Bap.Std.mem
type corpus = Bap.Std.mem
val create : unit -> t

create () creates an empty instance of the byteweight decider.

val train : t -> max_length:int -> (key -> bool) -> corpus -> unit

train decider ~max_length test corpus trains the decider on the specified corpus. The test function classifies extracted substrings. The max_length parameter bounds the maximum length of substrings.

val length : t -> int

length decider is the total number of distinct substrings known to the decider.

next t ~length ~threshold data begin finds the next positive chunk.

Returns the offset, greater than begin, of the next longest substring of up to the given length for which h1 / (h0 + h1) > threshold.

This is a specialization of the next_if function from the extended V1.V2.S interface.

val next : t -> length:int -> threshold:float -> corpus -> int -> int option
val pp : Stdlib.Format.formatter -> t -> unit

pp ppf decider prints all chunks known to the decider.

val next_if : t -> length:int -> f:(key -> int -> stats -> bool) -> corpus -> int -> int option

next_if t ~length ~f data begin finds the next chunk that satisfies f.

Finds the next offset greater than begin of a string of the given length for which a substring s with length n and statistics stats was observed, such that f s n stats is true.

val fold : t -> init:'b -> f:('b -> Bap.Std.word list -> stats -> 'b) -> 'b

fold t ~init ~f applies f to all chunks known to the decider.

val find : t -> length:int -> threshold:float -> corpus -> Bap.Std.addr list

find mem ~length ~threshold corpus extracts the addresses of all memory chunks of the specified length that were classified positively under the given threshold.

val find_if : t -> length:int -> f:(key -> int -> stats -> bool) -> corpus -> Bap.Std.addr list

find_if mem ~length ~f corpus finds all positively classified chunks.

This is a generalization of the find function with an arbitrary thresholding function.

It scans the input corpus using the next_if function and collects all positive results.
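
The scan-and-collect pattern described above can be sketched without BAP. The sketch below is only an illustration, not the library's implementation: collect and next_in are hypothetical names, and an int array stands in for the corpus. It assumes, as documented for next_if, that each call returns an offset strictly greater than the one it was given, or None when nothing else matches.

```ocaml
(* Hypothetical helpers, not part of Bap_byteweight. *)

(* [collect ~next start] calls [next] repeatedly, feeding each returned
   offset back in, and gathers the offsets in ascending order.
   Assumes [next pos] returns [Some off] with [off > pos], or [None]. *)
let collect ~next start =
  let rec loop acc pos =
    match next pos with
    | None -> List.rev acc
    | Some off -> loop (off :: acc) off
  in
  loop [] start

(* Toy stand-in for next_if: the first index greater than [pos]
   whose element satisfies [f]. *)
let next_in arr ~f pos =
  let n = Array.length arr in
  let rec go i =
    if i >= n then None
    else if f arr.(i) then Some i
    else go (i + 1)
  in
  go (pos + 1)

(* Indices of all elements greater than 5. *)
let hits = collect ~next:(next_in [| 1; 8; 3; 9; 9 |] ~f:(fun x -> x > 5)) (-1)
```

Here hits is [1; 3; 4]. The real find_if returns addresses rather than integer offsets; the sketch shows only the control flow.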

val find_using_bayes_factor : t -> min_length:int -> max_length:int -> float -> corpus -> Bap.Std.addr list

find_using_bayes_factor sigs ~min_length ~max_length threshold mem classifies function starts using the Bayes factor procedure.

Returns a list of addresses in mem that have a signature in sigs with length min_length <= n <= max_length and the Bayes factor greater than threshold.

The Bayes factor is the ratio between the posterior probabilities of two hypotheses, the h1 hypothesis that the given sequence of bytes occurs at the function start, and the dual h0 hypothesis,

k = P(h1|s)/P(h0|s) = (P(s|h1)/P(s|h0)) * (P(h1)/P(h0)),

where

  • P(hN|s) is the probability of the hypothesis hN given the sequence of bytes s as the evidence,
  • P(s|hN) is the probability of the sequence of bytes s, given the hypothesis hN,
  • P(hN) is the prior probability of the hypothesis hN.

Given that m is the total number of occurrences of a sequence of bytes s at the beginning of a function, and n is the total number of occurrences of s in the middle of a function, we compute P(s|h1) and P(s|h0) as

  • P(s|h1) = m / (m+n),
  • P(s|h0) = 1 - P(s|h1) = n / (m+n).

Given that q is the total number of substrings in sigs of length min_length <= l <= max_length and p is the total number of substrings of the length l that start functions, we compute prior probabilities as,

  • P(h1) = p / q,
  • P(h0) = 1 - P(h1).

The resulting factor is a value 0 < k < infinity that quantifies the strength of the evidence that a given substring gives in support of the hypothesis h1. Levels below 1 support hypothesis h0, levels above 1 give some support for h1, with the following interpretations (Kass and Raftery (1995)),

        Bayes Factor          Strength

        1 to 3.2              Weak
        3.2 to 10             Substantial
        10 to 100             Strong
        100 and greater       Decisive
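
As a worked illustration of the formulas above, the following self-contained sketch computes k from the counts m, n, p, and q and maps the result onto the strength labels in the table. It does not call the BAP API, and the names bayes_factor and strength are hypothetical, chosen for this example only. The label returned for k < 1 is likewise an assumption, since the table starts at 1.

```ocaml
(* Hypothetical helpers, not part of Bap_byteweight. *)

(* k = (P(s|h1) / P(s|h0)) * (P(h1) / P(h0)), with
   P(s|h1) = m / (m+n),  P(s|h0) = n / (m+n),
   P(h1)   = p / q,      P(h0)   = 1 - P(h1).
   Degenerate counts (n = 0, m + n = 0, or p = q) yield infinity or nan,
   as usual for float division; this sketch does not guard against them. *)
let bayes_factor ~m ~n ~p ~q =
  let m = float_of_int m and n = float_of_int n in
  let p = float_of_int p and q = float_of_int q in
  let ps_h1 = m /. (m +. n) and ps_h0 = n /. (m +. n) in
  let p_h1 = p /. q in
  let p_h0 = 1. -. p_h1 in
  (ps_h1 /. ps_h0) *. (p_h1 /. p_h0)

(* Strength labels following the table above. *)
let strength k =
  if k < 1. then "Supports h0"
  else if k < 3.2 then "Weak"
  else if k < 10. then "Substantial"
  else if k < 100. then "Strong"
  else "Decisive"
```

For example, with m = 30, n = 10, p = 1, and q = 2, the likelihood ratio is 0.75 / 0.25 = 3 and the prior ratio is 0.5 / 0.5 = 1, so k = 3, which falls in the Weak band.
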
val find_using_threshold : t -> min_length:int -> max_length:int -> float -> corpus -> Bap.Std.addr list

find_using_threshold sigs ~min_length ~max_length threshold mem classifies function starts using a simple thresholding procedure.

Returns a list of addresses in mem that have a signature s in sigs with length min_length <= n <= max_length and the sample probability P1(s) of starting a function greater than threshold,

P1(s) = m / (m+n), where

  • m - the total number of occurrences of s at the beginning of a function in sigs;
  • n - the total number of occurrences of s not at the beginning of a function in sigs.
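
The thresholding rule can be illustrated with a minimal sketch that does not depend on BAP. The names p1 and is_start are hypothetical and exist only in this example; the library applies the rule internally to the signatures stored in sigs.

```ocaml
(* Hypothetical helpers, not part of Bap_byteweight. *)

(* Sample probability P1(s) = m / (m+n) that s starts a function.
   When m + n = 0 the result is nan, which never exceeds a threshold. *)
let p1 ~m ~n = float_of_int m /. float_of_int (m + n)

(* A substring is accepted when P1(s) is strictly greater than threshold. *)
let is_start ~threshold ~m ~n = p1 ~m ~n > threshold
```

With m = 3 and n = 1, P1(s) is 0.75, so the substring is accepted at a threshold of 0.5 and rejected at 0.9.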