Module Std.Graphlib

Generic Graph Library

module Make (Node : Regular.Std.Opaque.S) (Edge : Core_kernel.T) : Graph with type node = Node.t and type Node.label = Node.t and type Edge.label = Edge.t

Make(Node)(Edge) creates a module that implements Graph interface and has unlabeled nodes of type Node.t and edges labeled with Edge.t

module Labeled (Node : Regular.Std.Opaque.S) (NL : Core_kernel.T) (EL : Core_kernel.T) : Graph with type node = (Node.t, NL.t) labeled and type Node.label = (Node.t, NL.t) labeled and type Edge.label = EL.t

Labeled(Node)(Node_label)(Edge_label) creates a graph structure with both nodes and edges labeled with abitrary types.

val create : (module Graph with type t = 'c and type Edge.label = 'b and type Node.label = 'a) -> ?nodes:'a list -> ?edges:('a * 'a * 'b) list -> unit -> 'c

create (module G) ~nodes ~edges () creates a graph using implementation provided by module G. Example:

module G = Graphlib.String.Bool;;
let g = Graphlib.create (module G) ~edges:[
    "entry", "loop", true;
    "loop", "exit", false;
    "loop", "loop", true] ()
val union : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> 'c -> 'c -> 'c

union (module G) g1 g2 returns a graph g that is a union of graphs g1 and g2, i.e., contains all nodes and edges from this graphs.

Postcondition:

          - N(g) = N(g1) ∪ N(g2).
          - E(g) = E(g1) ∪ E(g2).
val inter : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> 'c -> 'c -> 'c

inter (module G) g1 g2 returns a graph g that is an intersection of graphs g1 and g2, i.e., it contain and edges from this graphs.

Postcondition:

          - N(g) = N(g1) ∩ N(g2).
          - E(g) = E(g1) ∩ E(g2).
val to_dot : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?graph_attrs:('c -> graph_attr list) -> ?node_attrs:('n -> node_attr list) -> ?edge_attrs:('e -> edge_attr list) -> ?string_of_node:('n -> string) -> ?string_of_edge:('e -> string) -> ?channel:Core_kernel.Out_channel.t -> ?formatter:Stdlib.Format.formatter -> ?filename:string -> 'c -> unit

to_dot (module G) ~filename:"graph.dot" g dumps graph g using dot format. This is a customizable version of printing function. For most cases it will be enough to use G.pp or G.to_string function. Use this function, if you really need to customize your output.

  • parameter graph_attrs

    a list of global graph attributes;

  • parameter node_attrs

    a list of node specific attributes;

  • parameter edge_attrs

    a list of edge specific attributes;

  • parameter string_of_node

    used to print nodes;

  • parameter string_of_edge

    used to print edges;

  • parameter channel

    where to output the graph;

  • parameter formatter

    where to output the graph;

  • parameter filename

    where to output the graph;

    Note: if no output parameter is provided, the graph will not be outputted. More than one output targets is OK. For example, to_dot (module G) ~filename:"graph.dot" ~channel:stdout g will output graph g into both file named "graph.dot" and standard output.

    Note: if string_of_node function is not provided, then graph nodes will be labeled with the reverse post order number.

depth_first_search (module G) ~init g. It is the most important algorithm of the Graphlib. It builds a forest of spanning trees of a graph, classifies graph edges and numbers nodes. It is a Swiss-army knife, that is very useful in implementing many other algorithms. You can think of this function as fold on steroids. But unlike fold, that accepts only one function, the depth_first_search accepts 5 different functions, that will be called on different situations, allowing you to «fill in the blanks» of your algorithm.

Although depth_first_search doesn't allow you to drive the walk itself, there're still ways to do this, using filtered function. That allows you to hide nodes or edges from the walker, thus effectively erasing them from a graph, without even touching it.

  • parameter rev

    if true, then the graph g is traversed in a reverse direction. This is essentially the same, as reversing the graph, but make sure, that you've adjusted the start node.

  • parameter start

    if specified, then the traverse will be started from the node that is equal to node start. Otherwise the traverse is started from the first node of a graph as returned by G.nodes, i.e., usually it is an arbitrary node.

  • parameter start_tree

    node state is called on each new spanning tree started by the algorithm. If all nodes are reachable from the start node, then this function will be called only once. If all nodes of a graph are connected, then this function, will be called only once.

  • parameter enter_node

    pre node state is called when a node is first discovered by the traversal. The number is a preorder number, also known as depth-first number or dfnum. All nodes are entered in a pre-order.

  • parameter leave_node

    rpost node state is called when all successors of a node are left (finished). The provided number is a reverse post order number, that also defines a topological sorting order of a graph. All nodes, are left in a post order.

  • parameter enter_edge

    kind edge state is called when and edge is first discovered. Edge kinds are described below. The destination of the edge may not be discovered (i.e., entered) yet. But the source is already entered (but not finished).

  • parameter leave_edge

    kind edge state is called when the edge destination is at least started.

    Edges classification

    An edge in a spanning tree, produced by a depth first walk, can belong to one of the following category (kind):

    • Tree edges constitutes a spanning tree T of a graph;
    • Forward edges go from an ancestor to a descendants in a tree T;
    • Back edges go from descendants to ancestors in T, including node itself (they are also known as cycle edges).
    • Cross edges - all other edges, i.e., such edges for which doesn't go from ancestors to descendants or vice verse. They are possible since, tree defines only partial ordering.

    With respect to a pre-order and reverse post-ordering numbering the source x and a destination y of an edge with a given kind satisfy to the following inequalities:

                +---------+-----------------+---------------------+
                | Tree    | pre[x] < pre[y] | rpost[x] < rpost[y] |
                | Forward | pre[x] < pre[y] | rpost[x] < rpost[y] |
                | Back    | pre[x] ≥ pre[y] | rpost[x] ≥ rpost[y] |
                | Cross   | pre[x] > pre[y] | rpost[x] < rpost[y] |
                +---------+-----------------+---------------------+

    Note: since there can be more than one valid order of traversal of the same graph, (and thus more than one valid spanning tree), depending on a traversal the same edges can be classified differently. With the only exception, that a back edge will be always a back edge, disregarding the particular order.

    Complexity

    The algorithm is linear in time. It uses constant stack space. In fact, for small graphs it uses stack, but for large graphs dynamically switches to a heap storage. The space complexity is bounded by linear function of the graph depth.

val depth_first_visit : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?rev:bool -> ?start:'n -> 'c -> init:'s -> ('n, 'e, 's) dfs_visitor -> 's

depth_first_visit (module G) ~init visitor g allows to specify visiting functions using object. That opens space for re-usability and using open recursion.

class ['n, 'e, 's] dfs_identity_visitor : ['n, 'e, 's] dfs_visitor

base class with all methods defaults to nothing.

val reverse_postorder_traverse : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?rev:bool -> ?start:'n -> 'c -> 'n Regular.Std.seq

returns a sequence of nodes in reverse post order.

val postorder_traverse : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?rev:bool -> ?start:'n -> 'c -> 'n Regular.Std.seq

returns a sequence of nodes in post order

val dominators : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?rev:bool -> 'c -> 'n -> 'n tree

dominators (module G) g entry builds a dominators tree for a given graph.

Definition: a walk is a sequence of alternating nodes and edges, where each edge's endpoints are the preceding and following nodes in the sequence.

Definition: a node v is reachable if there exists a walk starting from entry and ending with v.

Definition: node u dominates v if u = v or if all walks from entry to v contains u.

Definition: node u strictly dominates v if it dominates v and u <> v.

Definition: node u immediately dominates v if it strictly dominates v and there is no other node that strictly dominates v and is dominated by u.

Algorithm computes a dominator tree t that has the following properties:

  1. Sets of graph nodes and tree nodes are equal;
  2. if node u is a parent of node v, then node u immediately dominates node v;
  3. if node u is an ancestors of node v, then node u strictly dominates node v;
  4. if node v is a child of node u, then node u immediately dominates node v;
  5. if node v is a descendant of node u, then node u strictly dominates node v.

If every node of graph g is reachable from a provided entry node, then properties (2) - (5) are reversible, i.e., an if statement can be read as iff, and the tree is unique.

To get a post-dominator tree, reverse the graph by passing true to rev and pass exit node as a starting node.

Note: although it is not imposed by the algotihm, but it is a good idea to have an entry node, that doesn't have any predecessors. Usually, this is what is silently assumed in many program analysis textbooks, but is not true in general for control-flow graphs that are reconstructed from binaries.

Note: all nodes that are not reachable from the specified entry node are parented by the entry node.

val dom_frontier : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?rev:bool -> 'c -> 'n tree -> 'n frontier

dom_frontier (module G) g dom_tree calculates dominance frontiers for all nodes in a graph g.

The dominance frontier of a node d is the set of all nodes n such that d dominates an immediate predecessor of n, but d does not strictly dominate n. It is the set of nodes where d's dominance stops.

val strong_components : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> 'c -> 'n partition

strong_components (module G) g partition graph into strongly connected components. The top of each component is a root node, i.e., a node that has the least pre-order number.

val shortest_path : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?weight:('e -> int) -> ?rev:bool -> 'c -> 'n -> 'n -> 'e path option

shortest_path (module G) ?weight ?rev g u v Find a shortest path from node u to node v.

  • parameter weight

    defines a weight of each edge. It defaults to 1.

  • parameter rev

    allows to reverse graph.

val is_reachable : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?rev:bool -> 'c -> 'n -> 'n -> bool

is_reachable (module G) ?rev g u v is true if node v is reachable from node u in graph g. If rev is true, then it will solve the same problem but on a reversed graph.

val fold_reachable : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?rev:bool -> init:'a -> f:('a -> 'n -> 'a) -> 'c -> 'n -> 'a

fold_reachable (module G) ?rev ~init ~f g n applies function f to all nodes reachable from node g in graph g. If rev is true, then the graph is reversed.

For example, the following will build a set of reachable nodes: fold_reachable (module G) ~init:G.Node.Set.empty ~f:Set.add

val compare : (module Graph with type node = 'n and type t = 'a) -> (module Graph with type node = 'n and type t = 'b) -> 'a -> 'b -> int

compare (module G1) (module G2) g1 g2 compares two graphs, with different implementation but the same node type.

val filtered : (module Graph with type edge = 'e and type node = 'n and type t = 'c) -> ?skip_node:('n -> bool) -> ?skip_edge:('e -> bool) -> unit -> (module Graph with type edge = 'e and type node = 'n and type t = 'c)

let module G' = filtered (module G) ?skip_node ?skip_edge () creates a new module G' that can be used at any place instead of G, but that will hide nodes and edges, for which functions skip_node and skip_edge return true.

Example:

let killed_edges = G.Edge.Hash_set.create () in
let module G = Graphlib.filtered (module G)
    ~skip_edge:(Hash_set.mem killed_edges) () in
let rec loop g () =
  (* use (module G) as normal *)
  Hash_set.add killed_edges some_edge;
  (* all edges added to [killed_edges] will no be visible *)
val view : (module Graph with type edge = 'e and type node = 'n and type t = 'c and type Edge.label = 'b and type Node.label = 'a) -> node:(('n -> 'f) * ('f -> 'n)) -> edge:(('e -> 'd) * ('d -> 'e)) -> node_label:(('a -> 'p) * ('p -> 'a)) -> edge_label:(('b -> 'r) * ('r -> 'b)) -> (module Graph with type edge = 'd and type node = 'f and type t = 'c and type Edge.label = 'r and type Node.label = 'p)

view (module G) ~node ~edge ~node_label ~edge_label creates a proxy module, that will transform back and forward elements of graph, using corresponding functions.

module To_ocamlgraph (G : Graph) : Graph.Sig.P with type t = G.t and type V.t = G.node and type E.t = G.edge and type V.label = G.Node.label and type E.label = G.Edge.label

To_ocamlgraph(G) returns a module that implements OCamlGraph interface for a persistent graph.

module Of_ocamlgraph (G : Graph.Sig.P) : Graph with type t = G.t and type node = G.V.t and type edge = G.E.t and type Node.label = G.V.label and type Edge.label = G.E.label

Of_ocamlgraph(O) creates an adapter module, that implements Graphlib interface on top of the module implementing OCamlGraph interface.

module Filtered (G : Graph) (P : Predicate with type node = G.node and type edge = G.edge) : Graph with type t = G.t and type node = G.node and type edge = G.edge and module Node = G.Node and module Edge = G.Edge

functorized version of a filter function.

module Mapper (G : Graph) (N : Isomorphism with type s = G.node) (E : Isomorphism with type s = G.edge) (NL : Isomorphism with type s = G.Node.label) (EL : Isomorphism with type s = G.Edge.label) : Graph with type t = G.t and type node = N.t and type edge = E.t and type Node.label = NL.t and type Edge.label = EL.t

functorized version of Graphlib.view function.

val fixpoint : (module Graph with type node = 'n and type t = 'c) -> ?steps:int -> ?start:'n -> ?rev:bool -> ?step:(int -> 'n -> 'd -> 'd -> 'd) -> init:('n, 'd) Solution.t -> equal:('d -> 'd -> bool) -> merge:('d -> 'd -> 'd) -> f:('n -> 'd -> 'd) -> 'c -> ('n, 'd) Solution.t

fixpoint ~equal ~init ~merge ~f g computes a solution for a system of equations denoted by graph g, using the initial approximation init (obtained either with Solution.create or from the previous calls to fixpoint).

The general representation of the fixpoint equation is

          x(i) = f(i) (a(1,i) x(1) %  ... % a(n,i) x(n)),

where

  • x(i) is the value of the i'th variable (node);
  • a(s,d) is 1 if there is an edge from the node s to the node d and 0 otherwise;
  • % the merge operator;
  • f(i) is the transfer function for the node i.

A solution is obtained through a series of iterations until the fixed point is reached, i.e., until the system stabilizes. The total number of iterations could be bound by an arbitrary number. If the maximum number of iterations is reached before the system stabilizes then the solution is not complete. An incomplete solution could be resumed later, or used as it is (for example, in case of ascending chain the solution is always a lower approximation of a real solution, so it is always safe to use it).

  • parameter steps

    the upper bound to the number of iterations the solver can make.

  • parameter start

    the entry node of the graph

  • parameter rev

    if true then graph is visited in the reverse order, defaults to false.

  • parameter step

    a function that is called every time a new value of a variable is obtained (an extension point for narrowing/widening).

  • parameter equal

    compares two approximations for equivalence

  • parameter init

    initial approximation

  • parameter merge

    the operator for path merging (usually meet/join depending on a starting point).

  • parameter f

    the transfer function

    Data Flow Analysis and Abstract Interpretation

    The fixpoint function implements a general fixed pointed iterative solver, that is suitable for implementing data-flow analysis or abstract interpreters. This section will provide some insight on how a particular choice of parameters affects the result of computation. We will start with a small introduction to the theory of Data Flow Analysis and Abstract Interpretation, with respect to the solution of a system of fixed point equations.

    Introduction

    The data domain is a set of values equipped with a partial ordering operation (L,<=), also know as a poset. We assume, that the poset is bounded complete, i.e., there are two special elements of the set that are called top and bot (the bottom value). The top element is the greatest element of the set L, i.e., for all x in L, x <= top. Correspondingly, the bottom element is the least element of the set L, i.e., for all x in L, bot <= x. It is not required by the framework that the poset has both or any of the bounds, however their presence makes the explanation easier, and since any poset could be artificially extended with these two elements, their introduction will not introduce a loss of generality. Since values of the set L represent information, the partial ordering between two pieces of information a and b, e.g., a <= b, tells us that a contains no more information than b. Therefore the top value contains all the information, representable in the set, correspondingly the bot value represents an absence of information.

    A solution to a fixed point equation (i.e., an equation of the form x = f(x)) could be obtainted by starting from some initial approximation x0 and repeatedly applying f to it, until the solution is found, i.e., x = f(... f(f(x0)) ...). In general, a function may have multiple (or no at all) fixed points. If a function has several fixed points, then we can distinguish two extremums - the least fixed point lfp and the greatest fixed point gfp w.r.t the ordering of the poset L. Assuming that function f is positive monotonic function (i.e., x <= y implies that f(x) <= f(y)), thechoice of the initial value x0 denotes which of the two fixed points is computed. When we start with the bot value, we are ascending from it until the least fixed point is obtained. Dually, if we will start with the top value, we will descend until the maximal fixpoint is reached. Assuming that both fixpoints are lower approximations of the ground truth, we can easily see, that the maximal fixpoint solution is more attractive, as it bears more information than the minimal (unless both are the same). However, the ascending search, bears a nice property, that any intermediate solution is an over-approximation of the ground truth (i.e., we are monotonically aggregating facts).

    In general, a function may not have a solution at all, or the solution may not be computable in practice, i.e., when the chain of function applications x = f(... f(x0) ...) is either infinite or effectively infinite (e.g., 2^64 applications). The Tarksi theorem states that if L is complete and f is monotone, then it has a fixed point. If a poset has a limited height (maximum length of an ascending chain of elements, such that x0 < x1 < .. < xM) then we will obtain a solution in no more than M steps. However, if M is very big, or infinite, then the solution won't be find in general, and the computation may not terminate.

    This brings us to the main distinction between Abstract Interpretation and Data Flow Analysis. The latter usually deals with lattices that have finite heights, while the former deals with very big and infinite lattices. To accelerate the convergence of a chain of approximations, a special technique called widening is used. Widening can be seen as an operation that jumps over several applications of the function f, hence it actually accelerates the convergence. Given, that widening jumps forward in the chain, it is possible to overshoot the fixed point solution. This can potential involve a lossage of precision (in case of the descending chain) or to an incorrect solution (in case of the ascending chain).

    So far, we were considering only one equation. However, we need to solve a system of equations, denoted by a graph (as graph denotes a boolean adjacency matrix A = {a(i,j)}, such that a(i,j) is 1 if there is an edge from i to j, and 0 otherwise). To solve the system of equations, we need to find such vector x1,..,xM that solves all equations. In general, a system may not have a solution (an over-constrainted system), may have one solution, and may have many solutions (under constrained system). In our case, the system is not linear, as each equation is function of type L -> L not L^M -> N, since all input variables are merged with some operator, usually meet or join. A correct choice of the merge opeator ensures correctness and convergence of the solution.

    The meet operator is a commutative, associative, and idempotent operator, such that if z = meet x y, then z <= x && z <= y and for all w in L, w <= x && w <= y implies w <= z. The z value is called the greatest lower bound (glb, or inf) of x and y. Intuitively, the meet operator takes two pieces of information and removes all contradictions. Thus z is the maximal consensus between x and y. The top element is the neutral element with respect to the meet operation, e.g., meet x top = x. A consequent application of the meet operation builds a descending chain of approximations, i.e., the amount of information is reduced on each step.

    The join operator is dual to meet and is defined correspondingly (with the flipped order). A join of x and y is called the least upper bound (lub, sup) of x and y. Intuitively, the join operator takes two non-contradictory pieces information and returns their concatenation. The bot element is the neutral element with respect to the join operation, e.g., join x bot = x. A consequent application of the join operation builds an ascending chain of approximations, i.e., the amount of information is increased on each step.

    Using the fixpoint function

    The fixpoint interface is trying to be as general as possible, but at the same time easy to use. The interface, allows a user to choose the direction of approximation (ascending vs. descending) and how to accelerate the convergence in case of tall lattices by applying narrowing and widening. The implementation fixes the iteration strategy (by always using the topological ordering). In case if you need a fixed point solver, that allows you to use different iteration strategies, the fixpoint1 library provides a descent alternative.

    1: http://pop-art.inrialpes.fr/people/bjeannet/bjeannet-forge/fixpoint/

    Using fixpoint for Classical Data Flow analysis

    The classical Data Flow analysis uses the descending chain of approximations in a complete lattice with finite (usually very small) height. Thus, the fixpoint solution is the greatest (maximal) fixed point (a maximal set of facts on which all equations agree). If the transfer function f is monotone and distributive, then it is the meet-over-paths (mop) solution, in a more general case gfp <= mop and mop <= t (where t is the ground truth). (Note, gfp is of course the best solution to the system of equations, as it is the maximal of all solutions. Both mop and t are not solutions to the given system, but define the true properties of a system under test, that we are trying to model with the system of fixed point equations. The fact that gfp is smaller, indicates that our model looses the precision (i.e., it is overconstrained). For example, the meet-over-path solution is a meet of all possible paths, even those that are infeasible, thus the system is overconstrained as we are taking into account system behaviors that will never happen, this is however safe, to overconstraint a system, as it will give us an over-approximation of a real system behavior).

    The fixpoint function could be easily applied to both forward and backward data flow problem, as backward problem could be seen as a forward problem on a reversed graph. To effectively reverse a graph, set the rev flag to true and set the enter parameter to the exit node.

    Using fixpoint for Abstract Interpretation

    Abstract Interpretation usually deals with complex and large lattices, and applies different heuristics to ensure termination with the minimum loss of precision. This usually ends up in applying widening and narrowing operations. Since widening accelerates by jumping forward in the chain of approximations, it can easily overshoot a fixed point, and in case of the ascending chain this may lead to a solution that is an under approximation of the ground truth. Thus, abstract interpretation is usually applied in the descending order. In this case the widen operation may still overshot the maximal fixpoint, that will lead to an over-approximation of the ground truth, that is safe, but still looses precision. Therefore, it is necessary to apply widening as rarely as possible. Unfortunately, a question where and when to apply widening is undecidable by itself, that's why heuristics are used. The fixpoint function, provides a limited capabilities to control widening via the step i n x x' function that is called every time a new i'th approximation x' for variable n is computed. The step function must return an upper bound (not necessary the least one) of the previous approximation x and the new approximation x'. The default, implementation just returns x'. An alternative implementation may widen x' if the number of steps i in the chain is higher than some threshold, and/or if x is a widening point (e.g., the loop header).

    Note: terms widening and narrowing comes from the interval analysis where they were first introduced, and correspond to the widening of an interval (usually up to infinitiy) and narrowing a widened interval based on some heurisitic.

    Using fixpoint for general iterative approximation

    In a general case, the fixpoint function could be used to compute successive approximations of a solution to a system of (in)equations, even if f is not monotone, and the lattice is not finite. The termination could be guaranteed by limiting the maximum number of iterations. And the correctness could be ensured by starting from the bottom element, and using the ascending chain of approximations. In that case, even a partially complete solution would be an over-approximation of a real solution. The obtained partial solution could be later resumed (and possibly extended with newly obtained facts).

    Implementation

    The fixpoint uses the Kildall iterative algorithm. And applies equations in reverse postorder (topological order). The solution is represented as an abstract finite mapping, that is also used to specify the initial set of constraints and the initial value of unconstrained variables. It is possible to specify more than one constraint to the system of equation (as opposed to the classical approach where the constraint denotes only the input for the entry node).

type scheme

name generation scheme

type 'a symbolizer = 'a -> string

a function that gives a name for a value of type 'a

val create_scheme : next:(string -> string) -> string -> scheme

create_scheme ~next init create a name generator, that will start with init and apply next on it infinitly.

val symbols : scheme

lower case symbols, starting from 'a' and moving up to 'z'. As 'z' is reached, all foregoing symbols will have a form of 'node_N' where 'N' is an increasing natural number.

val numbers : scheme

numbers from zero to inifinity (Sys.max_int in fact)

empty string

val nothing : scheme

empty string

val by_given_order : scheme -> ('a -> 'a -> int) -> 'a Core_kernel.Sequence.t -> 'a symbolizer
val by_natural_order : scheme -> ('a -> 'a -> int) -> 'a Core_kernel.Sequence.t -> 'a symbolizer
module Dot : sig ... end

Generic dot printer.