-
Notifications
You must be signed in to change notification settings - Fork 25
TimeSeries
TimeSeries is a part of our Scala DSL that allows one to group timestamped records
by periods of time such as weeks, months, quarters etc., by transforming the UTC timestamp columns of each record.
Clickhouse is fast enough that we are able to store our raw unprocessed data directly, while we are still able to use it to generate complex reports "on-the-fly". Clickhouse however, does not offer a simple way to apply deterministic interval grouping. By having a simple operator to group our data by its timestamp on any interval for any timezone allows for flexibility in queries (that clickhouse otherwise does not offer) as well as to create aggregations / views on our data.
For offering this functionality, we introduced a new class MultiInterval
.
By understanding what it does, the clickhouse DSL feature becomes more obvious.
Take a look at below example:
val jan = new DateTime("2018-01-01T00:00:00+01:00")
val dec = new DateTime("2018-12-31T23:59:59+01:00")
//This represents an interval of the entire year of 2018, with subintervals of 2 months, in Amsterdam/Europe timezone.
val yearBiMonthlyAmsterdam = new MultiInterval(jan, dec, MultiDuration(2,TimeUnit.Month));
If we now do yearBiMonthlyAmsterdam.subIntervals()
we will receive a Seq[Interval]
with joda.time.Interval
in the same timezone (jan/feb, mar/apr, jun/jul etc. etc.)
If we would provide 2nd of January as the start date, some days within the MultiInterval belong to a subinterval which doesn't fit. MultiInterval automatically expands
its interval to fit the entirety of its subintervals ranges.
Therefore we will see the below behaviour:
val midJan = new DateTime("2018-01-21T00:00:00+01:00")
val janToDec = MultiInterval(midJan, dec, MultiDuration(2,TimeUnit.Month));
janTwoToDec.getStart();
//org.joda.time.DateTime = 2018-01-01T00:00:00+02:00
ActionID | ActionType | Timestamp | UserID |
---|---|---|---|
1 | Click | 1330684623 | 1 |
2 | Click | 1331348034 | 2 |
3 | Buy | 1323523246 | 2 |
For the above table, which we will call ActionTable
we could do:
select(
uniq(ActionTable.ActionID) as action_count,
uniq(ActionTable.ActionID) as user_count,
ActionTable.ActionType,
timeSeries(ActionTable.TimeStamp, yearBiMonthlyAmsterdam) as interval_start_time
)
.from(
ActionTable
)
.groupBy(
interval_start_time,
ActionType
)
In this example, we provide the MultiDuration
of 2 month subintervals for 2018, in amsterdam timezone (see previous code examples for the definition of this interval). The timeSeries operator will bring back the timestamp to the start of a bi-monthly interval, also taking into account the timezone set for this MultiInterval. Important! It assumes that the timestamp of the record is UTC timezone.
The subIntervals of a multi interval are both in clickhouse as well as in scala/java deterministic. That means that if you would take a "5 month" interval, the first interval within 2018 might actually start in november of 2017, as 5 months do not align with 12 months a year.
The reference point for these deterministic "chunks" is decided upon by the clickhouse getRelativeXXXNum functions.
See their documentation for more information on this: https://clickhouse.yandex/docs/en/query_language/functions/date_time_functions/#torelativeyearnum
Todo pages:
- Table schemas
- SELECT statements
- Simple SELECT
- DISTINCT
- Inner queries
- JOIN
- GROUP
- Array operators
- Aggregation operators (e.g. uniqState, uniqMerge)
- COMPOSING of multiple queries
- Composition operators
<+:
,+
and:+>
- Composition operators
- Using custom types in the DSL
- Explaining the query parsing process
- The QueryValue typeclass