Data Prep

A collection of libraries, a pipeline plugin, and a CDAP service for performing data cleansing, transformation, and filtering using a set of data manipulation instructions (directives). These instructions are either generated using an interactive visual tool or created manually.

New Features

More here on upcoming features.

  • User Defined Directives (UDDs) let you create custom functions to transform records within CDAP Data Prep, also known as Wrangler. CDAP comes with a comprehensive library of functions, but there are some omissions and some specific cases for which UDDs are the solution. Additional information on how to build your own custom directives is here.

    • Migrating directives from version 1.0 to version 2.0 here
    • Information about Grammar here
    • TokenType values supported by the system here
    • Custom Directive Implementation Internals here
  • A new capability that allows CDAP Administrators to restrict the directives that are accessible to their users. More information on configuring can be found here
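
The two ideas in the list above, custom directives and administrator-restricted directives, can be pictured with a small self-contained sketch. It does not use the Wrangler UDD API: `DirectiveRegistrySketch`, `reverseText`, and the directive names `text-reverse` and `invoke-http` as used here are hypothetical stand-ins chosen for illustration, and a record is modelled as a plain `Map`. The real interfaces and configuration keys are described in the documentation linked above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Illustrative only: hypothetical stand-in types, not the Wrangler UDD API.
class DirectiveRegistrySketch {
  // A "directive" here is simply a function from one record to a transformed record.
  private final Map<String, UnaryOperator<Map<String, Object>>> directives = new HashMap<>();
  // Directive names an administrator has chosen to block (hypothetical configuration).
  private final Set<String> excluded;

  DirectiveRegistrySketch(Set<String> excluded) {
    this.excluded = excluded;
  }

  void register(String name, UnaryOperator<Map<String, Object>> directive) {
    directives.put(name, directive);
  }

  // Applies the named directive to every record, refusing restricted or unknown names.
  List<Map<String, Object>> apply(String name, List<Map<String, Object>> records) {
    if (excluded.contains(name)) {
      throw new IllegalStateException("Directive '" + name + "' is restricted");
    }
    UnaryOperator<Map<String, Object>> directive = directives.get(name);
    if (directive == null) {
      throw new IllegalArgumentException("Unknown directive '" + name + "'");
    }
    List<Map<String, Object>> out = new ArrayList<>();
    for (Map<String, Object> record : records) {
      out.add(directive.apply(record));
    }
    return out;
  }

  // Example custom directive: reverses the string held in the given column.
  static UnaryOperator<Map<String, Object>> reverseText(String column) {
    return record -> {
      Map<String, Object> copy = new HashMap<>(record); // leave the input record untouched
      Object value = copy.get(column);
      if (value instanceof String) {
        copy.put(column, new StringBuilder((String) value).reverse().toString());
      }
      return copy;
    };
  }

  public static void main(String[] args) {
    DirectiveRegistrySketch registry = new DirectiveRegistrySketch(Set.of("invoke-http"));
    registry.register("text-reverse", reverseText("body"));
    Map<String, Object> record = new HashMap<>();
    record.put("body", "abc");
    System.out.println(registry.apply("text-reverse", List.of(record)));
  }
}
```

In this sketch a restricted name fails fast with an exception before any record is touched, which is one reasonable way to surface an administrator's restriction to the user.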

Demo Videos and Recipes

Videos and screencasts are a good way to learn, so we have compiled short, simple screencasts that show some of the features of Data Prep. Additional videos can be found here.

Videos

Recipes

Available Directives

These directives are currently available:

| Directive | Description |
| --- | --- |
| **Parsers** | |
| JSON Path | Uses a DSL (a JSON path expression) for parsing JSON records |
| Parse as AVRO | Parses an AVRO-encoded message, either as binary or JSON |
| Parse as AVRO File | Parses an AVRO data file |
| Parse as CSV | Parses an input record as comma-separated values |
| Parse as Date | Parses dates using natural language processing |
| Parse as Excel | Parses an Excel file |
| Parse as Fixed Length | Parses as a fixed-length record with specified widths |
| Parse as HL7 | Parses Health Level 7 Version 2 (HL7 V2) messages |
| Parse as JSON | Parses a JSON object |
| Parse as Log | Parses access log files such as those from Apache HTTPD and nginx servers |
| Parse as Protobuf | Parses a Protobuf-encoded in-memory message using a descriptor |
| Parse as Simple Date | Parses date strings |
| Parse XML To JSON | Parses an XML document into a JSON structure |
| Parse as Currency | Parses a string representation of currency into a number |
| **Output Formatters** | |
| Write as CSV | Converts a record into CSV format |
| Write as JSON | Converts the record into a JSON map |
| Write JSON Object | Composes a JSON object based on the fields specified |
| Format as Currency | Formats a number as currency as specified by locale |
| **Transformations** | |
| Changing Case | Changes the case of column values |
| Cut Character | Selects parts of a string value |
| Set Column | Sets the column value to the result of an expression execution |
| Find and Replace | Transforms string column values using a "sed"-like expression |
| Index Split | (Deprecated) |
| Invoke HTTP | Invokes an HTTP Service (Experimental, potentially slow) |
| Quantization | Quantizes a column based on specified ranges |
| Regex Group Extractor | Extracts the data from a regex group into its own column |
| Setting Character Set | Sets the encoding and then converts the data to a UTF-8 String |
| Setting Record Delimiter | Sets the record delimiter |
| Split by Separator | Splits a column based on a separator into two columns |
| Split Email Address | Splits an email ID into an account and its domain |
| Split URL | Splits a URL into its constituents |
| Text Distance (Fuzzy String Match) | Measures the difference between two sequences of characters |
| Text Metric (Fuzzy String Match) | Measures the difference between two sequences of characters |
| URL Decode | Decodes from the application/x-www-form-urlencoded MIME format |
| URL Encode | Encodes to the application/x-www-form-urlencoded MIME format |
| Trim | Functions for trimming white spaces around string data |
| **Encoders and Decoders** | |
| Decode | Decodes a column value as one of base32, base64, or hex |
| Encode | Encodes a column value as one of base32, base64, or hex |
| **Unique ID** | |
| UUID Generation | Generates a universally unique identifier (UUID) |
| **Date Transformations** | |
| Diff Date | Calculates the difference between two dates |
| Format Date | Custom patterns for date-time formatting |
| Format Unix Timestamp | Formats a UNIX timestamp as a date |
| **Lookups** | |
| Catalog Lookup | Static catalog lookup of ICD-9, ICD-10-2016, ICD-10-2017 codes |
| Table Lookup | Performs lookups into Table datasets |
| **Hashing & Masking** | |
| Message Digest or Hash | Generates a message digest |
| Mask Number | Applies substitution masking on the column values |
| Mask Shuffle | Applies shuffle masking on the column values |
| **Row Operations** | |
| Filter Row if Matched | Filters rows that match a pattern for a column |
| Filter Row if True | Filters rows if the condition is true |
| Filter Row Empty or Null | Filters rows that are empty or null |
| Flatten | Separates the elements in a repeated field |
| Fail on condition | Fails processing when the condition evaluates to true |
| Send to Error | Filters records to an error collector |
| Send to Error And Continue | Filters records to an error collector and continues processing |
| Split to Rows | Splits based on a separator into multiple records |
| **Column Operations** | |
| Change Column Case | Changes column names to either lowercase or uppercase |
| Changing Case | Changes the case of column values |
| Cleanse Column Names | Sanitizes column names, following specific rules |
| Columns Replace | Alters column names in bulk |
| Copy | Copies values from a source column into a destination column |
| Drop Column | Drops a column in a record |
| Fill Null or Empty Columns | Fills column value with a fixed value if null or empty |
| Keep Columns | Keeps specified columns from the record |
| Merge Columns | Merges two columns by inserting a third column |
| Rename Column | Renames an existing column in the record |
| Set Column Header | Sets the names of columns, in the order they are specified |
| Split to Columns | Splits a column based on a separator into multiple columns |
| Swap Columns | Swaps column names of two columns |
| Set Column Data Type | Converts the data type of a column |
| **NLP** | |
| Stemming Tokenized Words | Applies the Porter stemmer algorithm for English words |
| **Transient Aggregators & Setters** | |
| Increment Variable | Increments a transient variable with a record of processing |
| Set Variable | Sets a transient variable with a record of processing |
| **Functions** | |
| Data Quality | Data quality check functions. Checks for date, time, etc. |
| Date Manipulations | Functions that can manipulate dates |
| DDL | Functions that can manipulate the definition of data |
| JSON | Functions that can be useful in transforming your data |
| Types | Functions for detecting the type of data |
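
To make the one-line descriptions above more concrete, here is a rough sketch of what two of the listed directives do to a single value: Split Email Address and Quantization. This is plain Java written for illustration, not Wrangler's implementation; the class and method names, the `account`/`domain` keys, and the choice to represent ranges as lower bounds in a `TreeMap` are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: approximates the described behaviour, not Wrangler's code.
class DirectiveSemanticsSketch {

  // Splits an email ID at its last '@' into an account part and a domain part.
  // If there is no '@', both parts are null (an assumption made for this sketch).
  static Map<String, String> splitEmail(String email) {
    Map<String, String> parts = new LinkedHashMap<>();
    int at = (email == null) ? -1 : email.lastIndexOf('@');
    if (at < 0) {
      parts.put("account", null);
      parts.put("domain", null);
    } else {
      parts.put("account", email.substring(0, at));
      parts.put("domain", email.substring(at + 1));
    }
    return parts;
  }

  // Maps a number to the label of the range it falls in. Each key is the
  // inclusive lower bound of a range that extends up to the next key.
  // Values below the smallest lower bound have no range and yield null.
  static String quantize(double value, TreeMap<Double, String> lowerBounds) {
    Map.Entry<Double, String> entry = lowerBounds.floorEntry(value);
    return entry == null ? null : entry.getValue();
  }

  public static void main(String[] args) {
    System.out.println(splitEmail("root@example.com"));

    TreeMap<Double, String> bounds = new TreeMap<>();
    bounds.put(0.0, "low");
    bounds.put(10.0, "medium");
    bounds.put(100.0, "high");
    System.out.println(quantize(42.0, bounds));
  }
}
```

The directive documentation linked from the table remains the reference for the actual arguments and edge-case behaviour of each directive.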

Performance

Initial performance tests show that, with a highly complex set of directives for transforming data, DataPrep processes about 106K records per second. The rates below are specified in records per second.

| Directive Complexity | Column Count | Records | Size | Mean Rate |
| --- | --- | --- | --- | --- |
| High (167 Directives) | 426 | 127,946,398 | 82,677,845,324 | 106,367.27 |
| High (167 Directives) | 426 | 511,785,592 | 330,711,381,296 | 105,768.93 |
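
As a rough reading of these figures, the sketch below derives two quantities that the table does not state directly: the run time implied by each row and the average size of a record. It assumes that Mean Rate is total records divided by elapsed time and that Size is in bytes; neither is confirmed by the table. Under those assumptions the first run works out to roughly 20 minutes, and both runs average roughly 646 bytes per record.

```java
// Illustrative arithmetic on the published figures; the interpretation of the
// "Mean Rate" and "Size" columns is an assumption, as noted above.
class ThroughputSketch {

  // Elapsed time implied by a record count and a mean rate in records/second.
  static double elapsedSeconds(long records, double recordsPerSecond) {
    return records / recordsPerSecond;
  }

  // Average size of one record, in the same unit as totalSize.
  static double averageRecordSize(long totalSize, long records) {
    return (double) totalSize / records;
  }

  public static void main(String[] args) {
    long[] records = {127_946_398L, 511_785_592L};
    long[] sizes = {82_677_845_324L, 330_711_381_296L};
    double[] rates = {106_367.27, 105_768.93};

    for (int i = 0; i < records.length; i++) {
      double seconds = elapsedSeconds(records[i], rates[i]);
      double perRecord = averageRecordSize(sizes[i], records[i]);
      System.out.printf("run %d: %.1f s (%.1f min), %.2f per record%n",
          i + 1, seconds, seconds / 60.0, perRecord);
    }
  }
}
```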

Contact

Mailing Lists

CDAP User Group and Development Discussions:

The cdap-user mailing list is primarily for users who use the product to develop applications or build plugins for applications. You can expect questions from users, release announcements, and any other discussions that we think will be helpful to users.

IRC Channel

CDAP IRC Channel: #cdap on irc.freenode.net

Slack Team

CDAP Users on Slack: cdap-users team

License and Trademarks

Copyright © 2016-2019 Cask Data, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Cask is a trademark of Cask Data, Inc. All rights reserved.

Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.
