Skip to content

Technical documentation

Yulei Sui edited this page Jul 17, 2015 · 38 revisions

This documentation describes the detailed internal working of SVF including Memory Model, Pointer Analysis, and Value-Flow Construction based on its design

##1. MemoryModel SVF parses the LLVM bit code and creates its internal symbol table information for pointer analysis. We adopt the convention of LLVM by separating all symbols into two kinds: ValSym represents a register LLVM Value, which is a top-level pointer; and ObjSym represents an abstract memory object, which is an address-taken variables of a pointer. Some objects are labelled specially. For example, symbols may not be determined statically are labeled with BlackHole e.g., UndefValue in LLVM. Constant objects are labeled with ConstantObj.

###1.1 Abstract Memory Object Every abstract memory object (MemObj) including global, stack, heap, is parsed at a memory allocation site. Each object has its own type information recorded in ObjTypeInfo. An object with struct type (StInfo) is modelled with all its field information (FieldInfo) including offset and type of each field. Nested structs are flattened. Array of structs are modelled with stride information.

###1.2 Program Assignment Graph (PAG) SVF transforms LLVM instructions into a graph representation PAG. Each node (PAGNode) on it represents a pointer, and each edge (PAGEdge) represents a constraint between two pointers.

####1.2.1 PAGNode PAGNode

There are two main kinds of PAGNode: ValPN represents a pointer, ObjPN represent an abstract object. GepObjPN which is subclass of ObjPN, represent a field of an aggregate object. GepValPN, which is subclass of ValPN, represent an introduced dummy node for field sensitivity when handling external library calls (e.g., memcpy, where pointers (LLVM Values) point to the fields of an struct does not explicitly appear at a instruction). RetNode represents the unique return of a procedure, VarargNode denotes a variadic argument of a procedure.

####1.2.2 PAGEdge PAGEdge

We classify PAGEdge into the following categories:

  • AddrPE                 // at a memory allocation site (ValPN <-- ObjPN)
  • CopyPE                 // at a PHINode, CastInst or SelectInst (ValPN <-- ValPN)
  • StorePE                 // at a StoreInst (ValPN <-- ValPN)
  • LoadPE                 // at a LoadInst (ValPN <-- ValPN)
  • GepPE                 // at a GetElementPtrInst (ValPN <-- ValPN)
  • CallPE                 // from an actual to a formal parameter (ValPN <-- ValPN)
  • RetPE                 // from a RetNode to a return value at a call site (ValPN <-- ValPN)

####1.2.3 PAGBuilder PAG is built using PAGBuilder by handling the following LLVM instructions and expressions:

  • AllocaInst
  • PHINode
  • StoreInst
  • LoadInst
  • GetElementPtrInst
  • CallInst
  • InvokeInst
  • ReturnInst
  • CastInst
  • SelectInst
  • IntToPtrInst
  • ExtractValueInst
  • ConstantExpr
  • GlobalValue

All the other instructions in LLVM are not handled, but they can also be translated into the basic edge types in PAG.

###1.3 Constraint Graph Constraint Graph (ConstraintGraph) is a copy of PAG for inclusion-based pointer analysis. PAG serves as the basic graph for all analyses in SVF, so its edge is fixed, however edges of ConstraintGraph can be changed during constraint resolution.

##2. Pointer Analysis PointerAnalysis

PointerAnalysis is the root class for all implementations. To implement different pointer analysis algorithms, a user can choose different implementations based on different points-to data structures (PTData). For example Flow-sensitive and flow-insensitive pointer analysis can choose BVDataPTAImpl as base class, which uses bit vector as its core data structure to map a pointer to its bit vector based points-to set. Context- and path-sensitive analysis requires a context or path condition as a qualifier before each pointer, CondPTAImpl can be a good choice for their implementations. CondPTAImpl maps a condition variable to its points-to set.

Every pointer analysis implementation operates on a graph representation with a solver for resolving constraints. For example, an Andersen’s analysis can be easily written by choosing a proper solver for deriving transitive closure rules on an inclusion-based constraint graph (ConsG). Similarly, a flow-sensitive analysis can be implemented by following flow-sensitive strong/weak updates rules with a points-to propagation solver on the sparse value-flow graph (VFG).

##3. Value-Flow Construction

An inter-procedural sparse value-flow graph (SVFG) for a program is a multi-edged directed graph that captures the def-use chains of both top-level pointers and address-taken objects.

svfgframework

###3.1 Memory SSA The def-use chains for top-level variables are readily available in LLVM IR. Address-taken variables are accessed indirectly at loads and stores. Their def-use chains are built in several steps. First, a pointer analysis is performed to generate points-to information for top-level pointers. Second, a load p = ∗q is annotated with a function μ(o) (LoadMU) for each variable o that may be pointed to by q to represent a potential use of o at the load. Similarly, a store ∗p = q is annotated with o = χ(o) (StoreCHI) for each variable o that may be pointed to by p to represent a potential def and use of o at the store. A callsite cs is also annotated with μ(o) (CallMU) and o = χ(o) (CallCHI) for each variable o to capture inter-procedural uses and defs of o. Likewise, o = χ(o) (EntryCHI) and μ(o) (RetMU) is annotated at the entry and exit of a procedure f to mimic the parameter (return) passing for the non-local variable o. Memory SSA PHI o = mssaPhi(o) (MSSAPHI) is inserted at the control flow joint point to merge multiple definitions of object o. Third, all the address-taken variables are converted to SSA form, with each μ(o) being treated as a use of o and each o = χ(o) as both a def and use of o.

mssa-cha

###3.2 Sparse Value-Flow Graph (SVFG)

Given a program with annotated μ/χ functions after SSA conversion, SVFG is constructed by connecting the def-uses for each converted SSA variable.

####3.2.1 SVFGNode

A SVFG Node is one of the following:

SVFGNode

(1) A statement (A PAGEdge)

  • AddrPE
  • CopyPE
  • GepPE
  • LoadPE
  • StorePE

(2) A memory region definition

  • FormalIN                 // EntryCHI
  • FormalOut                 // RetMU
  • ActualIn                 // CallMU
  • ActualOut                 // CallCHI
  • MSSAPHI                 // MSSAPHI

(3) A parameter

  • FormalParm                 //Formal Parameter
  • ActualParm                 //Actual Parameter
  • FormalRet                 // Procedure Return Variable
  • ActualRet                 //CallSite Return Variable

####3.2.2 SVFGEdge A SVFG Edge represents value-flow dependence from a variable's definition to its use The value-flow is one of the following:

SVFGEdge

(1) Direct value-flow for top-level pointers

(2) Indirect value-flow for address-taken objects

####3.2.3 SVFG Optimization Some of the SVFG nodes can be optimized away to make the graph more compact.

The following nodes can be removed: ActualParm ActualIn FormalRet FormalOut