`Bap_byteweight.Bytes`

Default implementation that uses memory chunks as the domain.

```
include V2.S
with type key = Bap.Std.mem
and type corpus = Bap.Std.mem
and type token := Bap.Std.word
```

`include V1.S with type key = Bap.Std.mem with type corpus = Bap.Std.mem`

`include Bin_prot.Binable.S with type t := t`

`val bin_size_t : t Bin_prot.Size.sizer`

`val bin_write_t : t Bin_prot.Write.writer`

`val bin_read_t : t Bin_prot.Read.reader`

`val __bin_read_t__ : ( int -> t ) Bin_prot.Read.reader`

`val bin_writer_t : t Bin_prot.Type_class.writer`

`val bin_reader_t : t Bin_prot.Type_class.reader`

`val bin_t : t Bin_prot.Type_class.t`

`type key = Bap.Std.mem`

`type corpus = Bap.Std.mem`

`val create : unit -> t`

`create ()` creates an empty instance of the byteweight decider.

`train decider ~max_length test corpus` trains the `decider` on the specified `corpus`. The `test` function classifies extracted substrings. The `max_length` parameter bounds the maximum length of substrings.

`val length : t -> int`

`length decider` returns the total number of distinct substrings known to the decider.

`next t ~length ~threshold data begin` returns the next positive chunk.

Returns an offset that is greater than `begin` of the next longest substring up to the given `length`, for which `h1 / (h0 + h1) > threshold`.

This is a specialization of the `next_if` function from the extended `V1.V2.S` interface.

`val pp : Stdlib.Format.formatter -> t -> unit`

`pp ppf decider` prints all chunks known to the decider.

`next_if t ~length ~f data begin` returns the next chunk that satisfies `f`.

Finds the next offset greater than `begin` of a string of the given `length` for which a substring `s` with length `n` and statistics `stats` was observed, such that `f s n stats` is `true`.

`val fold : t -> init:'b -> f:( 'b -> Bap.Std.word list -> stats -> 'b ) -> 'b`

`fold t ~init ~f` applies `f` to all chunks known to the decider.

`val t : t Bap_byteweight_signatures.data`

`val find : t -> length:int -> threshold:float -> corpus -> Bap.Std.addr list`

`find mem ~length ~threshold corpus` extracts the addresses of all memory chunks of the specified `length` that were classified positively under the given `threshold`.

```
val find_if :
t ->
length:int ->
f:( key -> int -> stats -> bool ) ->
corpus ->
Bap.Std.addr list
```

`find_if mem ~length ~f corpus` finds all positively classified chunks.

This is a generalization of the `find` function with an arbitrary thresholding function.

It scans the input corpus using the `next_if` function and collects all positive results.
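The relationship between `find`, `find_if`, and `next_if` can be illustrated with a toy model in plain OCaml. This is a sketch under loud assumptions, not the library's implementation: the corpus is an ordinary `string` rather than `Bap.Std.mem`, the decider is a `Hashtbl` from substrings to counts, the `stats` record is a hypothetical stand-in for the library's `stats` type, and the toy `next_if` searches from the start offset inclusive to keep the scan loop simple.

```ocaml
(* Toy model only: hypothetical stand-in for the library's [stats]. *)
type stats = { h0 : int; h1 : int }

(* [next_if tbl ~length ~f data start] is the first offset at or after
   [start] where some known substring of at most [length] bytes
   satisfies [f]. Longer candidates are tried before shorter ones. *)
let next_if tbl ~length ~f data start =
  let n = String.length data in
  let rec at_offset off =
    if off >= n then None
    else
      let rec try_len len =
        if len <= 0 then false
        else
          let s = String.sub data off len in
          match Hashtbl.find_opt tbl s with
          | Some stats when f s len stats -> true
          | _ -> try_len (len - 1)
      in
      if try_len (min length (n - off)) then Some off
      else at_offset (off + 1)
  in
  at_offset start

(* Scan the whole corpus with [next_if] and collect every hit. *)
let find_if tbl ~length ~f data =
  let rec loop start acc =
    match next_if tbl ~length ~f data start with
    | None -> List.rev acc
    | Some off -> loop (off + 1) (off :: acc)
  in
  loop 0 []

(* [find] is [find_if] with the fixed predicate h1 / (h0 + h1) > threshold. *)
let find tbl ~length ~threshold data =
  let f _s _len { h0; h1 } =
    float_of_int h1 /. float_of_int (h0 + h1) > threshold
  in
  find_if tbl ~length ~f data

let () =
  let tbl = Hashtbl.create 8 in
  Hashtbl.replace tbl "\x55\x48" { h0 = 1; h1 = 9 };
  let offsets = find tbl ~length:2 ~threshold:0.5 "\x90\x55\x48\x90" in
  List.iter (Printf.printf "%d\n") offsets (* prints 1 *)
```

Passing the predicate as a function is what lets the same scan serve any thresholding rule, which is the point of the `find_if` generalization.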

```
val find_using_bayes_factor :
t ->
min_length:int ->
max_length:int ->
float ->
corpus ->
Bap.Std.addr list
```

`find_using_bayes_factor sigs mem` classifies function starts using the Bayes factor procedure.

Returns a list of addresses in `mem` that have a signature in `sigs` with length `min_length <= n <= max_length` and a Bayes factor greater than `threshold`.

The Bayes factor is the ratio between the posterior probabilities of two hypotheses, the `h1` hypothesis that the given sequence of bytes occurs at the function start, and the dual `h0` hypothesis,

`k = P(h1|s)/P(h0|s) = (P(s|h1)/P(s|h0)) * (P(h1)/P(h0))`,

where

- `P(hN|s)` is the probability of the hypothesis `hN` given the sequence of bytes `s` as the evidence;
- `P(s|hN)` is the probability of the sequence of bytes `s`, given the hypothesis `hN`;
- `P(hN)` is the prior probability of the hypothesis `hN`.

Given that `m` is the total number of occurrences of a sequence of bytes `s` at the beginning of a function, and `n` is the total number of occurrences of `s` in the middle of a function, we compute `P(s|h1)` and `P(s|h0)` as

`P(s|h1) = m / (m+n)`, `P(s|h0) = 1 - P(s|h1) = n / (m+n)`.

Given that `q` is the total number of substrings in `sigs` of length `min_length <= l <= max_length` and `p` is the total number of substrings of length `l` that start functions, we compute the prior probabilities as

`P(h1) = p / q`, `P(h0) = 1 - P(h1)`.

The resulting factor is a value `0 < k < infinity` that quantifies the strength of the evidence that a given substring gives in support of the hypothesis `h1`. Levels below `1` support hypothesis `h0`; levels above `1` give some support to `h1`, with the following interpretations (Kass and Raftery (1995)):

| Bayes Factor    | Strength    |
| --------------- | ----------- |
| 1 to 3.2        | Weak        |
| 3.2 to 10       | Substantial |
| 10 to 100       | Strong      |
| 100 and greater | Decisive    |
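The interpretation scale can be encoded as a small classification function, which may help when choosing a `threshold` value. The `strength` type and `strength_of_bayes_factor` are hypothetical names for this sketch, not library definitions. The treatment of boundary values (a factor of exactly `3.2` falls in the higher band, and exactly `1` counts as weak support) is an assumption, since the scale does not say which side a boundary belongs to.

```ocaml
(* Hypothetical labels for the Kass and Raftery scale quoted above. *)
type strength = Supports_h0 | Weak | Substantial | Strong | Decisive

(* Assumption: each band includes its lower bound and excludes its upper. *)
let strength_of_bayes_factor k =
  if k < 1. then Supports_h0
  else if k < 3.2 then Weak
  else if k < 10. then Substantial
  else if k < 100. then Strong
  else Decisive
```

For example, passing `10.` as the threshold would, under this reading, keep only addresses whose evidence is at least strong.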

```
val find_using_threshold :
t ->
min_length:int ->
max_length:int ->
float ->
corpus ->
Bap.Std.addr list
```

`find_using_threshold sigs mem` classifies function starts using a simple thresholding procedure.

Returns a list of addresses in `mem` that have a signature `s` in `sigs` with length `min_length <= n <= max_length` and the sample probability `P1(s)` of starting a function greater than `threshold`,

`P1(s) = m / (m+n)`, where

- `m` is the total number of occurrences of `s` at the beginning of a function in `sigs`;
- `n` is the total number of occurrences of `s` not at the beginning of a function in `sigs`.
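The thresholding rule itself is a one-line computation, sketched below in plain OCaml. The function names are hypothetical and introduced only for illustration. Returning `None` for a sequence that was never observed (`m + n = 0`) is a choice made for this sketch; the text above does not specify how the library handles that case.

```ocaml
(* [sample_probability ~m ~n] is P1(s) = m / (m + n), or [None] when the
   sequence was never observed and the ratio is undefined. *)
let sample_probability ~m ~n =
  if m + n = 0 then None
  else Some (float_of_int m /. float_of_int (m + n))

(* A sequence is classified as a function start when P1(s) is strictly
   greater than [threshold]; unobserved sequences never pass. *)
let passes_threshold ~threshold ~m ~n =
  match sample_probability ~m ~n with
  | Some p1 -> p1 > threshold
  | None -> false
```

Unlike the Bayes factor, this rule ignores the prior probability of a function start, so a sequence seen once, and only at a function start, gets `P1(s) = 1`.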