
Search Expressions

William W. Kimball, Jr., MBA, MSIS edited this page Sep 18, 2022 · 16 revisions
  1. Introduction
  2. Supported Search Operators
  3. Searching Descendants

Introduction

Some YAML Path segment types support search expressions. These are distinct from Search Keywords.

The general form of a search expression is KEY[OPERAND OPERATOR TERM] with white-space being optional, where:

  • KEY is the name of the node (Hash, Array, Array of Hashes, or Set) to search; leave it empty to search the document root.
  • [ and ] are both required.
  • OPERAND has five applications:
    • When searching Hashes, this is the name of the Hash's immediate child key whose value is to be searched.
    • When searching Hashes, set this to . in order to search the names of the Hash's immediate child keys.
    • When searching Arrays of Hashes, this is the name of one child key appearing in each child Hash, the value of which is to be searched.
    • When searching Arrays of scalar values or Sets, this is always . to indicate a search against each element.
    • In all cases, this may be a YAML Path for searching descendent nodes within the same expression.
  • OPERATOR is one of the Supported Search Operators listed below.
  • TERM is the value to seek. If the term contains white-space, it must use String demarcation (' or "). If the term is a Python Regular Expression, it must be demarcated by a non-white-space symbol of your choosing. The escape symbol (\) is permitted within String values, but within a Python Regular Expression it becomes a literal part of the expression rather than an escape symbol.
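
The parts of the general form can be pictured with a small Python sketch. This is only an illustration of the KEY[OPERAND OPERATOR TERM] layout, not how YAML Path itself parses expressions; the function name split_search_expression and its pattern are hypothetical, and the sketch deliberately ignores inversion (!) and Regular Expression terms that contain a ].

```python
import re

# Illustrative pattern only (an assumption, not YAML Path's parser):
# KEY, then [ OPERAND OPERATOR TERM ] with optional white-space.
_SEARCH_EXPRESSION = re.compile(
    r"^\s*(?P<key>[^\[\]]*?)\s*"
    r"\[\s*(?P<operand>[^=<>^$%!\s\[\]]+)\s*"
    r"(?P<operator>=~|==|<=|>=|=|<|>|\^|\$|%)\s*"
    r"(?P<term>.+?)\s*\]\s*$"
)


def split_search_expression(expression):
    """Split a simple search expression into (key, operand, operator, term)."""
    match = _SEARCH_EXPRESSION.match(expression)
    if match is None:
        raise ValueError(f"not a simple search expression: {expression!r}")
    term = match["term"]
    # Drop a surrounding pair of String demarcation symbols, if present.
    if len(term) >= 2 and term[0] == term[-1] and term[0] in "'\"":
        term = term[1:-1]
    return match["key"], match["operand"], match["operator"], term


print(split_search_expression("products_array[dimensions.weight==4]"))
print(split_search_expression('hash[name = "string with spaces"]'))
print(split_search_expression("[. = foo]"))  # empty KEY: the document root
```

Note how the empty KEY in the last call corresponds to searching the document root, and how the OPERAND in the first call is itself a dotted path.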

Supported Search Operators

The following search operators are supported:

  • = (or ==): tests for identical values.
  • <: returns results less than a test value.
  • >: returns results greater than a test value.
  • <=: returns results less than or equal to a test value.
  • >=: returns results greater than or equal to a test value.
  • ^: returns results which start with a given term.
  • $: returns results which end with a given term.
  • %: returns results which contain a given term.
  • =~: returns results which match a Python Regular Expression, except that you do not have to escape uses of the backslash character (\) unless you want it to be literal, like \\. For this special operator, the next non-white-space character becomes the Python Regular Expression demarcation symbol, marking both the start and end of the expression. While the most popular symbol is /, virtually any symbol can be used provided it does not appear within the expression. Note that within the Python Regular Expression, YAML Path escape symbols are neither possible nor necessary; any \ instead becomes a literal part of the expression.
  • !: inverts the search. This special symbol can appear anywhere before the operand or the operator. It is most often placed immediately before the operator it inverts. Double-inversion is meaningless -- most likely a composition error -- so it is deliberately trapped and will generate an exception.

Non-demarcated white-space symbols are removed from search expressions. When the test value must contain white-space characters, demarcate it like any String, using a pair of ' or " symbols: hash[OPERAND = "string with spaces"].
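
As a rough mental model of the operators above, the following Python sketch evaluates one value against one term. It is an assumption-laden illustration rather than YAML Path's implementation: the helper name node_matches is hypothetical, equality is compared on string forms, the ordering operators are assumed to compare numerically, and =~ is assumed to match anywhere in the value.

```python
import operator as op
import re

# Ordering operators, assumed here to compare numerically.
_COMPARISONS = {"<": op.lt, ">": op.gt, "<=": op.le, ">=": op.ge}


def node_matches(value, oper, term, invert=False):
    """Return whether value satisfies `oper term`; invert models the ! symbol."""
    if oper in ("=", "=="):
        result = str(value) == str(term)
    elif oper in _COMPARISONS:
        result = _COMPARISONS[oper](float(value), float(term))
    elif oper == "^":
        result = str(value).startswith(str(term))
    elif oper == "$":
        result = str(value).endswith(str(term))
    elif oper == "%":
        result = str(term) in str(value)
    elif oper == "=~":
        # Assumption: the expression may match anywhere within the value.
        result = re.search(str(term), str(value)) is not None
    else:
        raise ValueError(f"unsupported operator: {oper!r}")
    return not result if invert else result


print(node_matches(4, "=", "4"))
print(node_matches("doohickey", "^", "doo"))
print(node_matches("widget", "=~", r"^w.*t$"))
print(node_matches("widget", "=", "doodad", invert=True))
```

The invert flag stands in for a single ! in the expression; the sketch does not model the exception that a double inversion raises.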

Searching Descendants

The OPERAND may be a YAML Path, enabling deep searches against descendent nodes. When any descendant matches the expression, the result is the value of the node to which the Search Expression is attached, not the matching descendant itself. An illustration explains this best, because the precise behavior may seem to differ (it really doesn't) between searching Arrays-of-Hashes and complex Hashes.

Consider this document:

---
products_hash:
  doodad:
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  doohickey:
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4
  widget:
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4
 
products_array:
  - product: doodad
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - product: doohickey
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4
  - product: widget
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4

These are two different representations of the same data. Whereas products_hash is a complex Hash, products_array is an Array-of-Hashes.

Suppose you wanted all products which have a weight of 4. You might search both structures at once using a YAML Path like products_*.*.dimensions[weight=4]. This will produce 4 results (two from each structure), all of them just the number 4:

4
4
4
4

That's not particularly useful unless we just wanted a tally of the total number of products with that exact weight.

What if we wanted to know the availability of all products with a weight of 4? A search against descendent nodes answers this: products_*.*[dimensions.weight=4].availability. We again get 4 results, but this time each one is the availability data of a matching product:

{"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}
{"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}
{"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}
{"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}

That's a lot more useful! As you can see, the descendent node search yielded the value of each product node which matched the Search Expression.
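
The difference between the two queries can be mimicked in plain Python. The sketch below is not YAML Path code; it uses a trimmed copy of products_hash (only the fields needed here) and two hypothetical helpers, dig and search_descendants, to show that the test is applied to a descendant while the result is the node the expression is attached to.

```python
# Trimmed copy of products_hash from the document above (times and most
# dimensions omitted to keep the sketch short).
products_hash = {
    "doodad": {
        "availability": {"start": {"date": "2020-10-10"}, "stop": {"date": "2020-10-29"}},
        "dimensions": {"weight": 10},
    },
    "doohickey": {
        "availability": {"start": {"date": "2020-08-01"}, "stop": {"date": "2020-09-25"}},
        "dimensions": {"weight": 4},
    },
    "widget": {
        "availability": {"start": {"date": "2020-01-01"}, "stop": {"date": "2020-01-01"}},
        "dimensions": {"weight": 4},
    },
}


def dig(node, dotted_path):
    """Follow a dotted path such as 'dimensions.weight' down from node."""
    for key in dotted_path.split("."):
        node = node[key]
    return node


def search_descendants(nodes, operand_path, term):
    """Keep each node whose descendant at operand_path equals term."""
    return [node for node in nodes if dig(node, operand_path) == term]


# Like products_hash.*.dimensions[weight=4]: only the matched values come back.
weights = [dig(p, "dimensions.weight") for p in products_hash.values()
           if dig(p, "dimensions.weight") == 4]

# Like products_hash.*[dimensions.weight=4].availability: whole product nodes
# are kept, so any other child of each match can be selected afterwards.
matches = search_descendants(products_hash.values(), "dimensions.weight", 4)
availability = [product["availability"] for product in matches]

print(weights)
print(availability)
```

Because search_descendants returns the product nodes rather than the weights, the follow-on selection of availability has something to work with.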

This has one notable consequence for products_hash: it is impossible to get the name of each matching product from this technique alone. The product key (which is the product's name when using a Hash data structure like this) is excluded from the search results because it is the ancestor, outside of the search operation. Should you really want the name of each product with a weight of 4, you would need to use [name()] from the available Search Keywords (available since version 3.5.0). Doing so would look like products_hash.*[dimensions.weight=4][name()] and yield:

doohickey
widget

On the other hand, working with an Array-of-Hashes is a bit simpler. Such a data structure can return just the matching product names without requiring Search Keywords because each Array element is a whole record, including each product's name as part of its record. Such a query would look like: products_array[dimensions.weight==4].product. Because each record has a product field (the product name, in this case), this produces:

doohickey
widget
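
The same contrast can be seen in a plain Python analogy, again using trimmed copies of the two structures rather than YAML Path itself. In the Hash form the name lives in the key, outside each matched value, so it has to be recovered separately (the role [name()] plays); in the Array-of-Hashes form the name is an ordinary field of each matched record.

```python
# Trimmed copies of the two structures; only the fields used below are kept.
products_hash = {
    "doodad": {"dimensions": {"weight": 10}},
    "doohickey": {"dimensions": {"weight": 4}},
    "widget": {"dimensions": {"weight": 4}},
}
products_array = [
    {"product": "doodad", "dimensions": {"weight": 10}},
    {"product": "doohickey", "dimensions": {"weight": 4}},
    {"product": "widget", "dimensions": {"weight": 4}},
]

# Hash: the matched values carry no name, so the key must be asked for
# explicitly, much like appending [name()] to the YAML Path.
hash_names = [name for name, product in products_hash.items()
              if product["dimensions"]["weight"] == 4]

# Array-of-Hashes: each matched record already contains its own name.
array_names = [record["product"] for record in products_array
               if record["dimensions"]["weight"] == 4]

print(hash_names)
print(array_names)
```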