Skip to content

Modified version of Daikon

Juan Carlos Alonso Valenzuela edited this page May 24, 2024 · 10 revisions

Table of contents

  1. Daikon
  2. Invariants taxonomy
  3. Daikon container configuration options
  4. Modifications on Daikon
    1. Minimizing false positives
    2. Added invariants
    3. Disabled invariants
    4. Enabled invariants
    5. Changes in computeConfidence functions
    6. Suppression of derived variables
    7. Invariants disabled in the default Daikon configuration

Daikon

Daikon is an open-source tool that applies machine learning techniques to detect likely invariants in programs by monitoring test executions. This monitoring process involves observing the program state at designated program points, initially considering all possible invariants as valid. Those invariants that are not violated by any execution are reported as likely invariants. Daikon operates by analyzing an instrumented version of a software execution, generated by an instrumenter or front-end. This instrumentation produces a declaration file and a data trace file. The declaration file describes the structure of program points in terms of input and output variables. The data trace file contains the values assigned to the variables in each execution. Instrumenters are available for various programming languages and data formats, including Java, Perl, C++, and CSV.

Invariants taxonomy

Overall, our customized version of Daikon supports a total of 105 distinct types of invariants for REST APIs, classified into five categories:

  • Arithmetic comparisons (48 invariants). Specify numerical bounds (e.g., size(return.artists[]) >= 1) and relations between numerical fields (e.g., input.limit >= size(return.items[])).
  • Array properties (23 invariants). Represent comparisons between arrays, such as subsets, supersets, or fields that are always member of an array (e.g., return.hotel.hotelId in input.hotelIds[]).
  • Specific formats (22 invariants). Specify restrictions regarding the expected format (e.g., return.href is Url) or length (e.g., LENGTH(return.id)==22) of string fields.
  • Specific values (9 invariants). Restrict the possible values of fields (e.g., return.visibility one of {"public", "private"}).
  • String comparisons (3 invariants). Specify relations between string fields, such as equality (e.g., input.name == return.name) or substrings (e.g., input.id is a substring of return.href).

Daikon container configuration options

This section describes the flags and configuration options available when running the provided Docker image containing our modified Daikon version.

  • use_modified_daikon_version (default: true): If true, the invariants will be detected using the modified version of Daikon proposed in our paper. If false, it will use the default version of Daikon, except for 13 types of invariants that were suppressed to avoid a combinatorial explosion of reported invariants.
  • use_agora_pp (default: true). If true, the heuristic defined in section Minimizing false positives for automatically suppressing false positives caused by invariants of the arithmetic comparison category is applied. Unless explicitly stated otherwise, if the default configuration of Daikon is being used (i.e., use_modified_daikon_version=false), this parameter is set to false by default.
  • suppress_input_invariants (default: true): If true, the invariants that only affect input variables will not be reported.
  • suppress_arithmetic_comparisons (default: false): If true, the invariant of the Arithmetic comparisons category will not be reported.
  • suppress_string_comparisons (default: false): If true, the invariant of the String comparisons category will not be reported.
  • suppress_specific_formats (default: false): If true, the invariant of the Specific formats category will not be reported.
  • suppress_specific_values (default: false): If true, the invariant of the Specific values category will not be reported.
  • suppress_array_properties (default: false): If true, the invariant of the Array properties category will not be reported.

For instance, to run our modified version of Daikon suppressing the invariants of the Arithmetic comparisons category, we use the following command:

docker run --rm -v C:\myUser\instrumentation:/files javalenzuela/daikon_agora java -jar daikon_modified.jar /files/declsFile.decls /files/dtraceFile.dtrace suppress_arithmetic_comparisons=true > invariants_without_arithmetic_comparisons.csv

Modifications on Daikon

This section describes the changes performed in Daikon version 5.8.10 (released on November 2021)

Section Invariants disabled in the default Daikon configuration contains the 13 invariants that we disabled in the default Daikon configuration to avoid a combinatorial explosions of string comparisons and a high number of false positives.

Minimizing false positives

In the evaluation of the first version of AGORA, we identified the invariants of the arithmetic comparison category as the cause of 3 out of every 4 false positives. Specifically, most false positives were invariants comparing two independent output properties. For example, the invariant return.duration_ms > size(return.artists[]) states that the duration of a song in milliseconds should always be greater than the number of artists in the song. While this may be true in most cases, it clearly does not represent the expected API behaviour and is considered a false positive. Finding counterexamples to automatically rule out these false positives is extremely unlikely, and the number of reported false positives can become unbearable if the API has several numerical response fields.

To address this, AGORA integrates a novel heuristic to minimize false positives derived from arithmetic comparisons based on the following observation: parameters with a different value range are unlikely to be related. In the previous example, for instance, the duration of a song in milliseconds may oscillate in the order of thousands, whereas the number of artists of a song may vary from 1 to around 10.

In practice, we propose a heuristic that discards the invariants that represent arithmetic inequalities (i.e., x < y, x <= y, x > y and x >= y) if there is no overlap between the values of the two variables involved. Specifically, we modified a total of 32 Daikon invariants so that invariants are not reported if the maximum observed value for one of the involved variables (e.g., 10 artists) is less than the minimum observed value of the other variable (e.g., 30K milliseconds).

This heuristic is enabled by default, to disable it, set the boolean parameter use_agora_pp of the modified version of Daikon to false.

Added invariants

  1. IsUrl: Indicates that the value of a string variable is always a URL. Represented as x is Url.

  2. SequenceStringElementsAreUrl: Indicates that all elements of an array of strings are URLs. Represented as All the elements of x are URLs.

  3. FixedLengthString: Indicates that the value of a string variable always has a fixed length n. Represented as LENGTH(x)==n.

  4. SequenceFixedLengthString: Indicates that all the elements of an array of strings have a fixed length n. Represented as All the elements of x have LENGTH=n.

  5. IsNumeric: Indicates that the characters of a string variable are always numeric. Represented as x is Numeric.

  6. SequenceStringElementsAreNumeric: Indicates that the characters of all the elements of an array of strings are always numeric. Represented as All the elements of x are Numeric.

  7. IsEmail: Indicates that the value of a string type variable is always an email. Represented as x is Email.

  8. SequenceStringElementsAreEmail: Indicates that all elements of an array of strings are emails. Represented as All the elements of x are emails.

  9. IsDateDDMMYYYY: Indicates that the value of a string variable is always a date following the format DD/MM/YYYY (the separator can be “/” or “-“). Represented as x is a Date. Format: DD/MM/YYYY.

  10. SequenceStringElementsAreDateDDMMYYYY: Indicates that all the elements of an array of strings are dates following the format DD/MM/YYYY. Represented as All the elements of x are dates. Format: DD/MM/YYYY.

  11. IsDateMMDDYYYY: Indicates that the value of a string variable is always a date following the format MM/DD/YYYY (the separator can be “/” or “-“). Represented as x is a Date. Format: MM/DD/YYYY.

  12. SequenceStringElementsAreDateMMDDYYYY: Indicates that all the elements of an array of strings are dates following the format MM/DD/YYYY. Represented as All the elements of x are dates. Format: MM/DD/YYYY.

  13. IsDateYYYYMMDD: Indicates that the value of a string variable is always a date following the format YYYY/MM/DD (the separator can be “/” or “-“). Represented as x is a Date. Format: YYYY/MM/DD.

  14. SequenceStringElementsAreDateYYYYMMDD: Indicates that all the elements of an array of strings are dates following the format YYYY/MM/DD. Represented as All the elements of x are dates. Format: YYYY/MM/DD.

  15. IsHour: Indicates that the value of a string variable is always a time in 24-hour format. Represented as x is Hour: HH:MM 24-hour format, optional leading 0.

  16. SequenceStringElementsAreHour: Indicates that all elements of an array of strings are hours in 24-hour format. Represented as All the elements of x are Hours: HH:MM 24-hour format, optional leading 0.

  17. IsHourAMPM: Indicates that the value of a string variable is always a time in 12-hour format. Represented as x is Hour: HH:MM 12-hour format, optional leading 0.

  18. SequenceStringElementsAreHourAMPM: Indicates that all elements of an array of strings are hours in 12-hour format. Represented as All the elements of x are Hours: HH:MM 12-hour format, optional leading 0, mandatory meridiems (AM/PM).

  19. IsHourWithSeconds: Indicates that the value of a string variable is always a time in 24-hour format, including seconds. Represented as x is Hour: HH:MM:SS 24-hour format with optional leading 0.

  20. SequenceStringElementsAreHourWithSeconds: Indicates that all elements of an array of strings are hours in 24-hour format, including seconds. Represented as All the elements of x are Hours: HH:MM:SS 24-hour format with optional leading 0.

  21. IsTimestampYYYYMMHHThhmmssmm: Indicates that the value of a string variable is always a timestamp. Represented as x is Timestamp. Format: YYYY-MM-DDTHH:MM:SS.mmZ (Miliseconds are optional).

  22. SequenceStringElementsAreTimestampYYYYMMHHThhmmssmm: Indicates that all elements of an array of strings are timestamps. Represented as All the elements of x are Timestamps. Format: YYYY-MM-DDTHH:MM:SS.mmZ (Miliseconds are optional).

Disabled invariants

Section 5.5 Invariant list of the Daikon user manual contains a description of each one of these invariants.

  1. daikon.inv.unary.scalar.NonZero
  2. daikon.inv.unary.scalar.NonZeroFloat
  3. daikon.inv.unary.scalar.RangeInt.PowerOfTwo
  4. daikon.inv.unary.sequence.EltNonZero
  5. daikon.inv.unary.sequence.EltNonZeroFloat
  6. daikon.inv.unary.sequence.EltRangeInt.PowerOfTwo
  7. daikon.inv.binary.twoScalar.IntNonEqual
  8. daikon.inv.binary.twoScalar.FloatNonEqual
  9. daikon.inv.binary.twoScalar.LinearBinary
  10. daikon.inv.binary.twoScalar.LinearBinaryFloat
  11. daikon.inv.binary.twoString.StringNonEqual
  12. daikon.inv.binary.twoString.StringLessThan
  13. daikon.inv.binary.twoString.StringGreaterThan
  14. daikon.inv.binary.twoString.StringLessEqual
  15. daikon.inv.binary.twoString.StringGreaterEqual
  16. daikon.inv.binary.twoSequence.SeqSeqStringEqual
  17. daikon.inv.binary.twoSequence.SeqSeqStringLessThan
  18. daikon.inv.binary.twoSequence.SeqSeqStringGreaterThan
  19. daikon.inv.binary.twoSequence.SeqSeqStringLessEqual
  20. daikon.inv.binary.twoSequence.SeqSeqStringGreaterEqual
  21. daikon.inv.binary.twoSequence.PairwiseStringLessThan
  22. daikon.inv.binary.twoSequence.PairwiseStringGreaterThan
  23. daikon.inv.binary.twoSequence.PairwiseStringLessEqual
  24. daikon.inv.binary.twoSequence.PairwiseStringGreaterEqual
  25. daikon.inv.ternary.threeScalar.LinearTernary
  26. daikon.inv.ternary.threeScalar.LinearTernaryFloat
  27. daikon.inv.binary.twoScalar.NumericInt.Divides
  28. daikon.inv.binary.twoScalar.NumericInt.Square
  29. daikon.inv.binary.twoScalar.NumericFloat.Divides
  30. daikon.inv.binary.twoScalar.NumericFloat.Square
  31. daikon.inv.binary.twoSequence.PairwiseNumericInt.Divides
  32. daikon.inv.binary.twoSequence.PairwiseNumericInt.Square
  33. daikon.inv.binary.twoSequence.PairwiseNumericFloat.Divides
  34. daikon.inv.binary.twoSequence.PairwiseNumericFloat.Square
  35. daikon.inv.binary.twoSequence.PairwiseLinearBinary
  36. daikon.inv.binary.twoSequence.PairwiseLinearBinaryFloat

Enabled invariants

Section 5.5 Invariant list of the Daikon user manual contains a description of each one of these invariants.

  1. daikon.inv.binary.twoString.StdString.SubString: We applied two modifications on this invariant:

    • We added a property that sets a minimum string length (2 by default) in order to avoid reporting that a string that is always empty is always a substring of every other string in the program point.

    • This invariant may report redundant information that would bloat our results, for example, if x is a substring of y and y is a substring of z, Daikon would report the invariants: “x is substring of y”, “y is substring of z” and “x is substring of z”. In this situation we consider the “x is substring of z” invariant redundant and modified Daikon so it would not report redundant invariants, in order to avoid bloating the results.

  2. daikon.inv.binary.twoSequence.SubSequence

  3. daikon.inv.binary.twoSequence.SubSequenceFloat

  4. daikon.inv.binary.twoSequence.SubSet

  5. daikon.inv.binary.twoSequence.SubSetFloat

  6. daikon.inv.binary.twoSequence.SuperSequence

  7. daikon.inv.binary.twoSequence.SuperSequenceFloat

  8. daikon.inv.binary.twoSequence.SuperSet

  9. daikon.inv.binary.twoSequence.SuperSetFloat

Changes in computeConfidence functions

At the end of the Daikon execution, the computeConfidence function is executed for every invariant that has not been falsified, returning a number between 0 and 1. If this number achieves a certain threshold, the invariant is considered to be statistically justified and it is reported to the user. We found some invariants for which Daikon always returned a confidence of 1.

protected double computeConfidence() {
   return CONFIDENCE_JUSTIFIED;
}

We modified these functions to return 1 only if there was at least one occurrence of the invariant during the program execution.

protected double computeConfidence() {
    if (ppt.num_samples() == 0) {
      return CONFIDENCE_UNJUSTIFIED;
    }
return CONFIDENCE_JUSTIFIED;
}

These are the invariants for which we modified the computeConfidence function:

  • Unary:
    • Scalar:
      • RangeFloat
      • RangeInt
    • Sequence:
      • EltRangeFloat
      • EltRangeInt
  • Binary:
    • SequenceString:
      • MemberString
    • SequenceScalar:
      • Member
      • MemberFloat
    • TwoSequence:
      • Reverse
    • ReverseFloat
      • SubSequence
      • SubSequenceFloat
      • Subset

Suppression of derived variables

A derived variable is an expression that does not appear directly in the instrumented program, but which Daikon uses as a variable for invariant detection. For example, for array variables, Daikon generates a derived variable whose value is its size, allowing to detect invariants such as input.limit >= size(return.items).

However, there are derived variables that do not provide any relevant information, this is the case of the derived variable orig(), which indicates the value of a variable at the input program point and is used to make comparisons with the value of the same variable at the output program point. In the context of black box testing of RESTful APIs, the value of the input parameters does not change at any time, so Daikon would only detect equality invariants between the parameter value at the input and output (input.limit == orig(input.limit)) that will always be satisfied. For this reason, this derived variable has been disabled.

In some cases, Daikon applies the shift derived variable (consisting on decreasing the value of a numerical variable by a shift value) to try to find relations between numerical variables (e.g., input-1 > return), we disabled this derived variable because it would most likely report misleading information.

Invariants disabled in the default Daikon configuration

Section 5.5 Invariant list of the Daikon user manual contains a description of each one of these invariants.

  1. daikon.inv.binary.twoString.StringNonEqual
  2. daikon.inv.binary.twoString.StringLessThan
  3. daikon.inv.binary.twoString.StringGreaterThan
  4. daikon.inv.binary.twoString.StringLessEqual
  5. daikon.inv.binary.twoString.StringGreaterEqual
  6. daikon.inv.binary.twoSequence.SeqSeqStringLessThan
  7. daikon.inv.binary.twoSequence.SeqSeqStringGreaterThan
  8. daikon.inv.binary.twoSequence.SeqSeqStringLessEqual
  9. daikon.inv.binary.twoSequence.SeqSeqStringGreaterEqual
  10. daikon.inv.binary.twoSequence.PairwiseStringLessThan
  11. daikon.inv.binary.twoSequence.PairwiseStringGreaterThan
  12. daikon.inv.binary.twoSequence.PairwiseStringLessEqual
  13. daikon.inv.binary.twoSequence.PairwiseStringGreaterEqual