Bap_strings_detectorString computation detector.
The detector uses maximum aposteriori likelihood estimator (MAP) to detect code that operates with textual data.
We define textual data as a contiguous blocks of characters from a specified alphabet.
In our model a code is a sequence of instructions. Each instruction reads or writes characters. We're observing a set of characters accessed by an instruction (without any distinguishing whether it was a write or a read operation). The characters are produced by unknown values that are used during the computation. Some of these values are textual data. Any particular instruction either works with data produced by textual values or not. Moreover, we're looking for contiguous sequences of instructions that work with textual value.
The detector doesn't depend on a particular instruction implementation, instead it should be called after each instruction (or whatever granularity a user prefers) with a sequence of characters that were observed on that step.
The detector distinguishes between two competing hypotheses. The null hypothesis is that we are observing a subsequence of instructions that doesn't interact with textual values. The alternative hypothesis states that the observed subsequence operates with one or more textual values.
The detector operates in two different modes. In the search mode it looks for a subsequence that operates with textual values. Once it is found it switches to the second mode, and tries to find the end of this sequence, under an assumption that textual values have finite lengths. At the second mode we're switching hypotheses. We now assume as H0 a hypothesis that we're still processing textual values, and the H1 is that there are no longer textual values. Once we found the start and the end of the sequence, as well as all the characters that we attribute to the textual values the detection is considered finished.
The detection relies on a fact, that when a code processes textual values, built from a specified alphabet, the characters from this alphabet occurs slightly more often, then the it is prescribed by the uniform distribution.
val create : 
  ?alpha:float ->
  ?beta:float ->
  ?p1:float ->
  ?ps:float ->
  ?len_pdf:(float -> float) ->
  Core_kernel.Char.Set.t ->
  'a tcreate alphabet creates a detector.
run detector trace runs a detector on a sequence on a trace represented as a sequence of bytes accessed during an execution. Returns a sequence of char sequences, where each subsequence is represented as a string and contains characters that were assumed to belong to the textual data.
val result : 'a decision -> ('a * char) listval chars : 'a decision -> stringval data : ?rev:bool -> 'a decision -> 'a listval pp : Stdlib.Format.formatter -> 'a t -> unitval pp_decision : Stdlib.Format.formatter -> 'a decision -> unitval pp_stats : Stdlib.Format.formatter -> 'a t -> unit