Storage Formats
Storage formats specify the format in which view data is to be stored. Storage is declared using the storedAs clause. If no storedAs clause is given, the TextFile storage format is used as the default.
def storedAs(f: StorageFormat, additionalStoragePathPrefix: String, additionalStoragePathSuffix: String)
Parameters:
- f: the storage format to use (see below);
- additionalStoragePathPrefix: optional prefix to put before the view's storage name (see the description of locationPathBuilder in the Section "Customizing Storage Paths");
- additionalStoragePathSuffix: optional suffix to put after the view's storage name (see the description of locationPathBuilder in the Section "Customizing Storage Paths").
Examples:
A simple Parquet storage format declaration:
storedAs(Parquet())
A Parquet storage format declaration with an additional storage path prefix:
storedAs(Parquet(), additionalStoragePathPrefix="tables")
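For orientation, storedAs is invoked inside a view's body. The following minimal sketch shows a complete view declaration around such a call; the view name and field are hypothetical, and the import of the storage format case classes from org.schedoscope.dsl.storageformats reflects current Schedoscope versions and may differ in yours:

package schedoscope.example

import org.schedoscope.dsl.View
import org.schedoscope.dsl.storageformats.Parquet

// Hypothetical view: a single-field table stored as Parquet,
// with an additional "tables" prefix in its storage path.
case class Products() extends View {
  val id = fieldOf[String]

  storedAs(Parquet(), additionalStoragePathPrefix = "tables")
}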
Here is a description of currently supported storage formats:
TextFile

With this storage format, view data is stored in the Hive TEXTFILE format.
case class TextFile(val fieldTerminator: String = null, collectionItemTerminator: String = null, mapKeyTerminator: String = null, lineTerminator: String = null, serDe: String = null, serDeProperties: Map[String, String] = null, fullRowFormatCreateTblStmt: Boolean = false) extends StorageFormat
TextFile supports various optional parameters that can be used to override the separator characters, in line with the options offered by Hive's TEXTFILE format:
- fieldTerminator: character to use to separate fields;
- collectionItemTerminator: character to use to separate collection items in collection-typed fields;
- mapKeyTerminator: character to use to separate keys and values in map-typed fields;
- lineTerminator: character to terminate the record;
- serDe: custom or native SerDe;
- serDeProperties: specify optional SERDEPROPERTIES;
- fullRowFormatCreateTblStmt: for Hive versions prior to 0.13, instead of using STORED AS TEXTFILE, expands CREATE TABLE with STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'.
For example, if you define a View in the following way:
storedAs(
  TextFile(
    fieldTerminator = """\\001""",
    collectionItemTerminator = """\002""",
    mapKeyTerminator = """\003""",
    lineTerminator = """\n"""))
.. Schedoscope's HiveQL will automatically add the following clauses to the generated CREATE EXTERNAL TABLE statement:
[...]
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\\001'
LINES TERMINATED BY '\n'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
[...]
As one can see, all parameters are optional. One could simply specify:
storedAs(TextFile())
.. and Schedoscope's HiveQL would simply add:
[...]
STORED AS TEXTFILE
[...]
For compatibility with Hive versions prior to 0.13, one could specify the following:
storedAs(TextFile(fullRowFormatCreateTblStmt = true))
.. and Schedoscope's HiveQL will change STORED AS TEXTFILE to:
[...]
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
[...]
Additionally, Schedoscope provides a way to use custom Hive SerDes. One can, for example, specify the JSON storage format, and the SerDe will be added automatically. However, let us assume, for the sake of explanation, that one would like to use a different SerDe. Here is an example of how one could do it:
storedAs(TextFile(serDe = "com.amazon.elasticmapreduce.JsonSerde"))
Schedoscope's HiveQL will generate the following:
[...]
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
STORED AS TEXTFILE
[...]
Additionally, one might want to add custom SerDe properties, for example when using CSV/TSV. Here is an example of how one could do so:
storedAs(TextFile(serDe = "org.apache.hadoop.hive.serde2.OpenCSVSerde",
  serDeProperties = Map(
    "separatorChar" -> """\t""",
    "escapeChar" -> """\\"""
  )))
Schedoscope's HiveQL will generate the following:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar' = '\\',
'separatorChar' = '\t'
)
STORED AS TEXTFILE
[...]
For more information about Hive native SerDes, please consult Hive's developer guide.
Avro

View data can be stored in Avro format:

case class Avro(schemaPath: String, fullRowFormatCreateTblStmt: Boolean = true)
Parameters:
- schemaPath: path to the Avro schema file. The path is relative to the view's avroSchemaPathPrefix, which defaults to /hdp/dev/global/datadictionary/schema/avro for the dev environment. This prefix can be changed by setting a different avroSchemaPathPrefixBuilder for a view;
- fullRowFormatCreateTblStmt: for Hive versions prior to 0.13, instead of using STORED AS AVRO, expands the CREATE TABLE statement with ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'.
For example, if you define a View in the following way:
storedAs(Avro("myUniqueExamplePath"))
.. Schedoscope's HiveQL will automatically add the following corresponding clauses:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = 'hdfs:///hdp/<ENV>/global/datadictionary/schema/avro/myUniqueExamplePath'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
[...]
Parquet

View data can be stored using Parquet:

case class Parquet(fullRowFormatCreateTblStmt: Boolean = false)
Parameters:
- fullRowFormatCreateTblStmt: for Hive versions prior to 0.13, instead of using STORED AS PARQUET, expands the CREATE TABLE statement with ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'.
For example, if you define a View in the following way:
storedAs(Parquet())
.. Schedoscope's HiveQL will automatically add the following clause to the CREATE EXTERNAL TABLE statement:
[...]
STORED AS PARQUET
[...]
For compatibility with Hive versions prior to 0.13, one could specify the following:
storedAs(Parquet(fullRowFormatCreateTblStmt = true))
.. and Schedoscope's HiveQL will change STORED AS PARQUET to:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
[...]
Optimized Row Columnar (ORC)

View data can also be stored using the Optimized Row Columnar format:

case class OptimizedRowColumnar(fullRowFormatCreateTblStmt: Boolean = false)
Parameters:
- fullRowFormatCreateTblStmt: for Hive versions prior to 0.13, instead of using STORED AS ORC, expands the CREATE TABLE statement with ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'.
For example, if you define a View in the following way:
storedAs(OptimizedRowColumnar())
.. Schedoscope's HiveQL will automatically add the following clause to the CREATE EXTERNAL TABLE statement:
[...]
STORED AS ORC
[...]
For compatibility with Hive versions prior to 0.13, one could specify the following:
storedAs(OptimizedRowColumnar(fullRowFormatCreateTblStmt = true))
.. and Schedoscope's HiveQL will change the clause STORED AS ORC to:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
[...]
Record Columnar File (RCFile)

View data can be stored using older formats, such as the Record Columnar File format:

case class RecordColumnarFile(fullRowFormatCreateTblStmt: Boolean = false)
Parameters:
- fullRowFormatCreateTblStmt: for Hive versions prior to 0.13, instead of using STORED AS RCFILE, expands CREATE TABLE with STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'.
For example, if you define a View in the following way:
storedAs(RecordColumnarFile())
.. Schedoscope's HiveQL will automatically add the following clause:
[...]
STORED AS RCFILE
[...]
For compatibility with Hive versions prior to 0.13, one could specify the following:
storedAs(RecordColumnarFile(fullRowFormatCreateTblStmt = true))
.. and Schedoscope's HiveQL will change the clause STORED AS RCFILE to:
[...]
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
[...]
JSON

View data can be stored in JSON format:

case class Json(serDe: String = "org.apache.hive.hcatalog.data.JsonSerDe", serDeProperties: Map[String, String] = null, fullRowFormatCreateTblStmt: Boolean = false)
Parameters:
- serDe: custom or native SerDe;
- serDeProperties: specify optional SERDEPROPERTIES;
- fullRowFormatCreateTblStmt: for Hive versions prior to 0.13, instead of using STORED AS TEXTFILE, expands the CREATE TABLE statement with STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'.
For example, if you define a View in the following way:
storedAs(Json())
.. Schedoscope's HiveQL will automatically add the following clauses:
[...]
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
[...]
For compatibility with Hive versions prior to 0.13, one could specify:
storedAs(Json(fullRowFormatCreateTblStmt = true))
.. and Schedoscope's HiveQL will change the clause STORED AS TEXTFILE to:
[...]
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
[...]
For more information about Hive native SerDes, please consult Hive's developer guide.
CSV / TSV

View data can also be stored in CSV / TSV format:

case class Csv(serDe: String = "org.apache.hadoop.hive.serde2.OpenCSVSerde", serDeProperties: Map[String, String] = null, fullRowFormatCreateTblStmt: Boolean = false)
Parameters:
- serDe: custom or native SerDe;
- serDeProperties: specify optional SERDEPROPERTIES;
- fullRowFormatCreateTblStmt: for Hive versions prior to 0.13, instead of using STORED AS TEXTFILE, expands CREATE TABLE with the clause STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'.
For example, if you define a View in the following way:
storedAs(Csv())
.. Schedoscope's HiveQL will automatically add the following clause to the CREATE EXTERNAL TABLE statement:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
[...]
In case one would like to specify SerDe properties (e.g. for TSV), one could pass them via the Csv constructor as follows:

storedAs(Csv(
  serDeProperties = Map(
    "separatorChar" -> """\t""",
    "escapeChar" -> """\\""",
    "quoteChar" -> "'"
  )))
.. which would generate the following SQL:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar' = ''',
'escapeChar' = '\\',
'separatorChar' = '\t'
)
STORED AS TEXTFILE
[...]
For compatibility with Hive versions prior to 0.13, one could specify:
storedAs(Csv(fullRowFormatCreateTblStmt = true))
.. and Schedoscope's HiveQL will change STORED AS TEXTFILE to:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
[...]
For more information about Hive native SerDes, please consult Hive's developer guide.
S3

Note: this feature is still experimental. Please test it before rolling it out to production.

Besides HDFS, Schedoscope supports storing view data with cloud providers, starting with AWS S3.
case class S3(bucketName: String, storageFormat: StorageFormat, uriScheme: String = "s3n")
Parameters:
- bucketName: S3 bucket name;
- storageFormat: the Schedoscope case class defining the storage format used;
- uriScheme: the URI scheme associated with the Hadoop version used as well as the AWS region; defaults to "s3n". For more information, please check the AWS documentation.
Note that the s3n URI scheme is assumed by default. If one needs, for example, to use the AWS Frankfurt region (eu-central-1), one can specify "s3a", as in the following example:
storedAs(S3("schedoscope-bucket-test", OptimizedRowColumnar(), "s3a"))
.. Schedoscope's HiveQL will automatically add the following clause to the CREATE EXTERNAL TABLE statement:
[...]
STORED AS ORC
LOCATION 's3a://schedoscope-bucket-test/dev/test/views/<your-view-name-here>'
[...]
InOutputFormat

If one wants to use a storage format that is not supported by Schedoscope out of the box, one can use the InOutputFormat storage format:
case class InOutputFormat(input: String, output: String, serDe: Option[String] = None)
Using this storage format, the following example illustrates an alternative way of specifying ORC storage:
storedAs(
  InOutputFormat(
    "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde")
  )
)
Schedoscope's HiveQL will generate the following:
[...]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
[...]
Customizing Storage Paths

A Schedoscope view represents a Hive table partition stored in HDFS. Views offer the following properties with regard to their HDFS storage location:
- fullPath: the absolute path to the folder of the Hive table partition represented by the view. For example, the fullPath of the view schedoscope.example.osm.processed.NodesWithGeohash(p("2013"), p("12")) would be /hdp/dev/schedoscope/example/osm/processed/nodes_with_geohash/year=2013/month=12 in the default environment dev. For a view v, v.fullPath is the concatenation of v.tablePath and v.partitionPath.
- tablePath: the absolute path to the storage location of the Hive table to which the partition represented by the view belongs. For example, the tablePath of the view schedoscope.example.osm.processed.NodesWithGeohash(p("2013"), p("12")) would be /hdp/dev/schedoscope/example/osm/processed/nodes_with_geohash in the default environment dev. For a view v, v.tablePath is the concatenation of v.dbPath and the view's storage name v.n.
- n: a view's storage name. The name is derived from the view's class name by lowercasing all characters and adding underscores at camel-case boundaries. For instance, the storage name n of schedoscope.example.osm.processed.NodesWithGeohash(p("2013"), p("12")) would be nodes_with_geohash.
- dbPath: the folder where all Hive tables belonging to a module are stored. The name is derived from a view's package name, prefixing it with /hdp and the Schedoscope environment. For example, the dbPath of the view schedoscope.example.osm.processed.NodesWithGeohash(p("2013"), p("12")) would be /hdp/dev/schedoscope/example/osm/processed in the default environment dev.
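To make these constituents concrete, here is a sketch of how they decompose for the example view above in the default dev environment (illustrative only; in practice, views are typically instantiated by Schedoscope itself):

val v = NodesWithGeohash(p("2013"), p("12"))

v.dbPath        // /hdp/dev/schedoscope/example/osm/processed
v.n             // nodes_with_geohash
v.tablePath     // /hdp/dev/schedoscope/example/osm/processed/nodes_with_geohash
v.partitionPath // /year=2013/month=12
v.fullPath      // /hdp/dev/schedoscope/example/osm/processed/nodes_with_geohash/year=2013/month=12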
We have introduced the different constituents of fullPath above because each of them is created by a builder function that can be overridden on a per-view basis. Thus, it is possible to implement different path naming schemes if so desired.
These builders are as follows and can be replaced by assigning custom functions to them (see the sketch after this list):
- var moduleNameBuilder: () => String: by default, returns the name of a view's package in lower-case-underscore format.
- var dbPathBuilder: String => String: for a given environment, the default implementation produces dbPath using moduleNameBuilder. It does this by building a path from the lower-case-underscore format of moduleNameBuilder, replacing _ with / and prepending /hdp/dev/ for the default dev environment.
- tablePathBuilder: String => String: for a given environment, the default builder computes tablePath using dbPathBuilder and n. The latter will be surrounded by additionalStoragePathPrefix and additionalStoragePathSuffix, if set.
- partitionPathBuilder: () => String: this builder creates a view's partitionPath. By default, this is the standard Hive /partitionColumn=partitionValue pattern.
- fullPathBuilder: String => String: for a given environment, the default builder produces fullPath by concatenating tablePathBuilder and partitionPathBuilder.
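As an illustration, here is a minimal sketch of overriding one of these builders for a single view. The view name and the flat partition layout are hypothetical; since the builders are plain vars, they can be reassigned in the view body:

import org.schedoscope.dsl.{Parameter, View}

case class NodesWithFlatPartitions(
  year: Parameter[String],
  month: Parameter[String]) extends View {

  val version = fieldOf[Int]

  // Hypothetical flat layout: produces partition path /2013-12
  // instead of the default Hive-style /year=2013/month=12.
  partitionPathBuilder = () => s"/${year.v.get}-${month.v.get}"
}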
Moreover, the schema path prefix for the Avro storage format can be changed by assigning a custom function to:

- avroSchemaPathPrefixBuilder: String => String: for a given environment env, the default implementation produces the path /hdp/${env}/global/datadictionary/schema/avro.
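For example, one could point Avro schema resolution to a different directory tree like this (the target path is made up for illustration):

// Hypothetical: resolve Avro schemas from /hdp/<env>/schemata/avro instead
avroSchemaPathPrefixBuilder = (env: String) => s"/hdp/${env}/schemata/avro"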