This repository has been archived by the owner on Jan 27, 2020. It is now read-only.

Linking Multiple Instances

Utz Westermann edited this page Mar 27, 2018 · 26 revisions

Schedoscope can use views maintained by a different Schedoscope instance. This enables you to run two (or more) instances independently, where the views of one Schedoscope instance depend on views managed by a different instance. Such dependencies are called external dependencies; a view at the end of an external dependency is called an external view.

Prerequisites

Multiple Schedoscope instances do not talk to each other directly via a network protocol, and they do not have to be available at the same time; hence, their communication needs to be buffered. They synchronize only via the shared Hive metastore, so all linked instances must use the same metastore.

Setup

No setup or configuration is necessary on the foreign Schedoscope instance. You only have to adapt the instance that wants to access external views.

1. Include the foreign Schedoscope Library

To build on top of the views of another Schedoscope instance, you first have to include the foreign Schedoscope instance's view definitions in your classpath, for example as a dependency declared with Maven or the build tool of your choice. Without these definitions, you cannot reference external views in your own view code.

2. Enable the Feature in schedoscope.conf

This section has to be included in your schedoscope.conf:

external-dependencies {

  #
  # This setting allows you to use external dependencies and operate several schedoscope instances in conjunction.
  #

  enabled = true

  #
  # A list of prefixes of packages with internal views. Every package not starting with a string in this list
  # will be treated as external and can not be referenced from the client or used as dependency if not flagged as
  # external
  #

  home = ["${env}.datahub", "${env}.datamart"]

  #
  # Toggles checks whether internal views are used as external views and vice versa
  #

  checks = true
}

You have to include all internal packages in the home list. Views declared outside of these packages can only be used as external dependencies. Conversely, views inside these packages cannot be used as external dependencies. As long as checks are enabled, Schedoscope will notify you if you violate these rules.
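The rule above can be illustrated with a small, self-contained sketch. It is not Schedoscope's actual implementation; the object `HomePackages` and its methods are hypothetical names, and the only behavior taken from the configuration comments is the prefix match against the home list.

```scala
// Illustrative only: mirrors the rule described above, not Schedoscope's own code.
object HomePackages {

  // Hypothetical helper: a package is internal if it starts with
  // one of the prefixes configured in the home list.
  def isInternal(viewPackage: String, home: Seq[String]): Boolean =
    home.exists(prefix => viewPackage.startsWith(prefix))

  // Returns a message if the usage violates the rule, None otherwise.
  def check(viewPackage: String, usedAsExternal: Boolean, home: Seq[String]): Option[String] = {
    val internal = isInternal(viewPackage, home)
    if (internal && usedAsExternal)
      Some(s"$viewPackage is internal and must not be used as an external dependency")
    else if (!internal && !usedAsExternal)
      Some(s"$viewPackage is external and must be flagged as external")
    else
      None
  }
}
```

With `env` resolved to `dev`, for instance, a view in `dev.datahub.products` would count as internal, while a view in a package such as `dev.shop` could only be referenced as an external dependency.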

Usage

Include an External Dependency

case class Products(shopCode: Parameter[String],
                        year: Parameter[String],
                       month: Parameter[String],
                         day: Parameter[String]) extends View
  with PointOccurrence
  with JobMetadata
  with DailyParameterization {

  // ExternalShop is defined in the foreign instance's view library;
  // wrapping it in external(...) flags the dependency as external.
  val shop = dependsOn(() => external(ExternalShop(shopCode)))

  val productId = fieldOf[String]
  val productName = fieldOf[String]

  // Fill this view from the external view's table.
  transformVia(() =>
    HiveTransformation(insertInto(
      this,
      s"""SELECT * FROM ${shop().n}""")))
}

Materializing a View with external Dependencies

The process of materializing a view with external dependencies does not change. Upon receiving a materialize command, the view simply checks the Hive metastore to see whether the external dependency has changed.
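As a rough mental model of this check, the following sketch compares timestamps. It is an assumption-laden illustration, not Schedoscope's logic: the names `ExternalState` and `ExternalCheck` are hypothetical, and the single timestamp field stands in for whatever state the metastore actually records.

```scala
// Illustrative model only; the field name is an assumption,
// not Schedoscope's metastore schema.
final case class ExternalState(lastTransformationTimestamp: Long)

object ExternalCheck {

  // The dependent view needs to be recomputed if the external view
  // was transformed after the dependent view was last materialized.
  def needsRematerialization(ownLastMaterialized: Long, external: ExternalState): Boolean =
    external.lastTransformationTimestamp > ownLastMaterialized
}
```

The point of the sketch is that the decision relies only on state read from the metastore, so the foreign instance does not need to be running at that moment.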

Scheduling between Instances

If you want to trigger a linked Schedoscope instance as soon as the topmost views in your Schedoscope instance have been materialized, you can do this by introducing a toplevel view. The toplevel view depends on all views at the end of your data pipeline. On the toplevel view, you can register a shell transformation that triggers the linked Schedoscope instance, e.g., via a curl command.
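The script text for such a shell transformation could be assembled along the following lines. This is a minimal sketch: `TriggerCommand` is a hypothetical helper, and the host, port, and path are placeholders, so the URL layout should be taken from the linked instance's actual HTTP interface rather than from this example.

```scala
object TriggerCommand {

  // All parameters are placeholders; the URL layout is an assumption,
  // not a documented Schedoscope endpoint.
  def curl(host: String, port: Int, path: String): String = {
    val cleanPath = path.stripPrefix("/")
    // --fail makes curl exit with a non-zero status on HTTP errors,
    // --silent suppresses the progress output.
    s"curl --fail --silent http://$host:$port/$cleanPath"
  }
}
```

Using `--fail` lets a failed trigger show up as a non-zero exit status of the shell command instead of passing unnoticed.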

Things to Consider

Operating linked Schedoscope instances is not maintenance-free. If you make drastic changes in one of your Schedoscope instances, all linked instances might be affected as well. For example, resetting checksums and timestamps on a large scale has to be coordinated across the linked instances.
