A unit testing framework for the Cascading data processing platform.


              .-..   ____  __    _  _  __ _   ___  ____  ____    ..-.
.------------|  ||  (  _ \(  )  / )( \(  ( \ / __)(  __)(  _ \   ||  |------------.
'------------|  ||   ) __// (_/\) \/ (/    /( (_ \ ) _)  )   /   ||  |------------'
              '-''  (__)  \____/\____/\_)__) \___/(____)(__\_)   ''-'

Use a Plunger to push test data through your Cascading pipework. Catch the output in a Bucket and check it for correctness.

Start using

You can obtain plunger from Maven Central:

  • For Cascading 3.x.x use plunger-3.0.2.
  • For Cascading 2.x.x use plunger-2.1.1.
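
If you use Maven, a dependency declaration along the following lines should pull in plunger. The group ID is inferred from the com.hotels.plunger package used in the examples below; verify the exact coordinates on Maven Central:

```xml
<dependency>
  <groupId>com.hotels</groupId>
  <artifactId>plunger</artifactId>
  <version>3.0.2</version> <!-- or 2.1.1 for Cascading 2.x.x -->
  <scope>test</scope>
</dependency>
```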

Overview

plunger is a unit testing framework for the Cascading platform. It helps you write small, fast unit tests for your Assemblies with fine-grained assertions. A fluent API allows you to rapidly construct test data for delivery via Data instances and then make assertions on the data captured by Buckets. All test data is maintained in memory to avoid unnecessary I/O overhead. Utility methods are provided as a bridge between input/output files and their in-memory equivalents so that the same style of testing can be applied to Flows and Cascades.

Testing Assemblies

An end-to-end example

Plunger plunger = new Plunger();

Data corpus = new DataBuilder(new Fields("line"))
    .addTuple("one word the found other")
    .addTuple("other waterfalls found of")
    .build();
Pipe words = plunger.newNamedPipe("words", corpus);
    
Pipe assemblyToTest = new WordCountAssembly(words);

Bucket bucket = plunger.newBucket(new Fields("word", "count"), assemblyToTest);

List<Tuple> actual = bucket.result().asTupleList();
assertThat(actual.size(), is(7));
assertThat(actual.get(0), is(new Tuple("found", 2)));

The above example exhibits many of the core features provided by plunger. What follows is a cookbook of the typical usage scenarios:

Build test data incrementally

Data employees = new DataBuilder(new Fields("first", "last", "age"))
    .addTuple("bob", "smith", 28)
    .copyTuple().set("last", "jones")
    .build();
// bob, smith, 28
// bob, jones, 28

Only provide the data you need for the test

Data big = new DataBuilder(manyManyFields)
    .withFields(new Fields("id", "name", "modified_date"))
    .addTuple(1, "bill", "2013-01-01")
    .addTuple(2, "dave", "2001-01-02")
    .build();

Apply your test data to your assembly and capture the output

Pipe inputPipe = plunger.newPipe(inputData);
Pipe assembly = new SmithFilterAssembly(inputPipe);
Bucket bucket = plunger.newBucket(outputFields, assembly);

Make assertions on the captured data

List<TupleEntry> tupleEntries = bucket.result().asTupleEntryList();
assertThat(tupleEntries.size(), is(1));
assertThat(tupleEntries.get(0).getString("first"), is("Bob"));

Note that plunger requires that you connect all of your Pipes and create all of your Buckets before attempting to obtain the result from any of your Buckets.

Create a sorted view of captured data

To increase the predictability of the results you can optionally pass in one or more sort fields to order the tuples:

List<Tuple> tuples = bucket.result().orderBy(new Fields("age")).asTupleList();
Tuple tupleWithLowestAge = tuples.get(0);

Testing Flows and Cascades

It's nice to be able to perform integration tests on your entire Flow or Cascade. In these circumstances it's normal for both the input and output of the flows to be real files. Constructing and maintaining sets of input and expected output files is an arduous task, and it is non-trivial to perform anything other than coarse-grained assertions on file-based output. To assist with the development of such tests, plunger allows you to sink data to a file, so that you can construct your test data with the fluent API while supplying it to the flows under test in file-based form. Conversely, Data instances can be constructed from a Tap, allowing you to read the output files of your flow into memory and then make fine-grained assertions on them in exactly the same manner as when testing an Assembly. This approach is especially useful when your test data must be stored with a scheme that cannot easily be created by hand - binary or compressed formats, for example. At this time plunger can perform direct writes and reads using both local and Hadoop taps, including PartitionTaps and 'multi-tap' variants.

Create test data and sink to a file

Tap sink = new FileTap(new TextDelimited(fields), "new_test_data_file.tsv");

Data employees = new DataBuilder(fields)
    .addTuple(1, "bill", "2013-01-01")
    .addTuple(2, "dave", "2001-01-02")
    .build();
    
Plunger.writeData(employees).toTap(sink);
// Creates the file 'new_test_data_file.tsv' and writes two records

Load flow generated output files into memory

Tap generated = new Hfs(new TextDelimited(outputFields), "output");

// Loads the contents of 'output'    
List<Tuple> actual = Plunger.readDataFromTap(generated).asTupleList();

assertThat(actual.size(), is(7));

Other data related features

Pretty printing data

During the development of your tests it can be useful to see what data is being delivered and collected. The Data class provides a prettyPrinter method to deliver a tabular view of its data for quick inspection by you - the developer. To increase the readability of the output you can optionally pass one or more sort fields to the orderBy method to order the results. The sort uses the natural ordering of the types of each field. You can also supply a column filter using the withFields method. Considering our earlier example:

Bucket bucket = plunger.newBucket(new Fields("word", "count"), assembly);
bucket.result().prettyPrinter().print();

The code above will output the following to System.out (you can supply another PrintStream with the printTo method if you wish):

word    count
found   2
other   2
the     1
...

Testing Aggregators, Buffers, and Functions

For the most part, it's fairly straightforward to test Filter, Function, Aggregator, and Buffer classes using only your favourite mocking framework. However, it is often the case with aggregator implementations - and to a lesser extent functions - that we need to hold some state between invocations in the operation's Context. It is not always possible to implement this behaviour with our mocks, and even when it is, the resulting code can be rather verbose. plunger provides some stub classes to facilitate the testing of aggregators and functions that use the OperationCall.Context. Additionally, they allow sets of test data to be fluently declared and operation output captured for later validation, in much the same way as the DataBuilder and Bucket classes. Here is an aggregator example.

Aggregator<Context> aggregator = new MyLast(FIELDS);

AggregatorCallStub<Context> stubCall = Plunger.<Context>newAggregatorCallStubBuilder(GROUP_FIELDS, FIELDS)
    .newGroup(1)
    .addTuple("2013-01-01")
    .addTuple("2013-01-02")
    .build()
    .complete(mockFlowProcess, aggregator);

List<TupleEntry> collected = stubCall.result().asTupleEntryList();
assertThat(collected.size(), is(1));
assertThat(collected.get(0), is(new TupleEntry(FIELDS, new Tuple("2013-01-02"))));

Note that all tuples added using addTuple are associated with the group declared by the most recent newGroup call. The FunctionCallStub and BufferCallStub classes operate in a very similar manner.

Assertions

Verifying serialization

When running Cascading jobs on Hadoop it is often a requirement that your Cascading classes and their dependencies are Serializable. However, this is not necessary when running test jobs in local mode. Consequently serialization is often overlooked during development, and problems arise only when first deploying to a Hadoop environment. To help identify these issues early in the development process plunger provides a convenient assertion which you can use to check your Assemblies, Functions, Filters, and so on:

import static com.hotels.plunger.asserts.PlungerAssert.serializable;
...
Pipe assembly = new WordCountAssembly(wordsPipe);
assertThat(assembly, is(serializable())); // Fails if WordCountAssembly cannot be serialized

The PlungerAssert class also provides a traditional assertion method if that is more your style: assertSerializable(Object).
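
Under the hood, a serializability check of this kind can be implemented with nothing but the JDK: round-trip the object through Java serialization and fail if any step throws. The sketch below is illustrative only - it is not plunger's actual implementation, and the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerializationCheck {

  // Returns true if the candidate survives a full serialize/deserialize round trip.
  static boolean isSerializable(Object candidate) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(candidate); // throws NotSerializableException on failure
      }
      try (ObjectInputStream in = new ObjectInputStream(
          new ByteArrayInputStream(bytes.toByteArray()))) {
        in.readObject(); // also verify it can be read back
      }
      return true;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isSerializable("a string"));   // true: String is Serializable
    System.out.println(isSerializable(new Object())); // false: Object is not
  }
}
```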

TupleEntry matching

When verifying the results of your assemblies it can be time consuming to interrogate a TupleEntry for all its expected values and write assertions for each individually. The resulting code can also be rather verbose. plunger provides a Matcher with convenient overloads to enable simpler assertions of your output:

import static com.hotels.plunger.asserts.PlungerAssert.tupleEntry;
...
TupleEntry result = ...
Fields fields = new Fields("word", "count");
assertThat(result, is(tupleEntry(new TupleEntry(fields, new Tuple("found", 2)))));
assertThat(result, is(tupleEntry(fields, new Tuple("found", 2))));
assertThat(result, is(tupleEntry(fields, "found", 2)));

Debug output

For the most part Cascading flows can be debugged with your favourite IDE's debugger and Cascading's LocalFlowConnector. However, sometimes it's useful to quickly see what fields and values are flowing through your pipes. Cascading provides the cascading.operation.Debug operation for printing the data in your pipes out to either STDOUT or STDERR. plunger tries to simplify this a step further with the Dump assembly:

pipe = new Dump(pipe); // To STDOUT with no prefix
pipe = new Dump("prefix:\t", pipe, SYSERR); // Prefix all output, use a PrintStream of our choosing

Building

This project uses the Maven build system. It also naturally has dependencies on some Cascading artifacts which can be found in the ConJars repository. To use this repository you may need to add the following stanza to your Maven repository configuration:

<repository>
  <id>conjars.org</id>
  <url>http://conjars.org/repo</url>
</repository>

Dependencies

Plunger expects the following dependencies to be provided:

  • Cascading SDK ≥ 2.6.1
  • Hadoop ≥ 2.4.0.2.1.3.0-563
  • JUnit ≥ 4.11
  • Hamcrest core ≥ 1.3

Earlier versions may work but have not been tested.
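
As a sketch, a consuming project might declare some of these along the following lines (artifact IDs, scopes, and versions are illustrative; the Hadoop version listed above is a vendor build, shown here as plain 2.4.0):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.11</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.hamcrest</groupId>
  <artifactId>hamcrest-core</artifactId>
  <version>1.3</version>
  <scope>test</scope>
</dependency>
```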

Credits

Created by Elliot West, with thanks to: Dave Maughan, Patrick Duin, James Grant, Adrian Woodhead, Sven Zethelius.

Legal

This project is available under the Apache 2.0 License.

Copyright 2014-2017 Expedia Inc.
