- Improved performance of the `catalog_ext.has_table` function by executing a dummy SQL query rather than listing the entire database; most noticeable with databases that have many tables.
- Some minor changes to help in a spark-on-kubernetes environment:
  - In addition to setting `PYSPARK_SUBMIT_ARGS`, also explicitly set config params so they are picked up by an already-running JVM.
  - Register a handler to stop the spark session on python termination to deal with SPARK-27927.
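A rough sketch of the termination-handler idea from the spark-on-kubernetes entry above, using plain `pyspark` and `atexit`; the exact hook sparkly registers internally is an assumption here:

```python
import atexit

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()

# Stop the JVM-side session when the python process exits, so executors are not
# left running after the driver terminates (the SPARK-27927 scenario).
atexit.register(spark.stop)
```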
- Removed the `has_package` and `has_jar` functions, which are incomplete checks (resulting in false negatives) and are merely syntactic sugar.
- Added options (class variables) `name` and `app_id_template` to autogenerate a unique value for the spark option `spark.app.id`, which can help to preserve spark history data for all sessions across restarts. This functionality can be disabled by setting `app_id_template` to `None` or `''`.
- Drop support of Python 2.7
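A minimal sketch of the new class variables on a `SparklySession` subclass; only the attribute names and the `None`/`''` opt-out come from the entry above, the rest is illustrative:

```python
from sparkly import SparklySession

class MySession(SparklySession):
    name = 'my-app'
    # Leave `app_id_template` at its default to get an auto-generated, unique
    # spark.app.id, or disable the behaviour entirely:
    app_id_template = None  # or ''
```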
- Run integration tests using Python 3.7
- Drop tests for Elastic 6.x
- Use Kafka 2.8.0 for integration tests
- Support 0.9.x `pymysql` in `sparkly.testing.MysqlFixture`
- Fix support for using multiple sparkly sessions during tests
- SparklySession does not persist modifications to os.environ
- Support ElasticSearch 7 by making type optional.
- Extend `SparklyCatalog` to work with database properties:
  - `spark.catalog_ext.set_database_property`
  - `spark.catalog_ext.get_database_property`
  - `spark.catalog_ext.get_database_properties`
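A short usage sketch; the method names come from the entry above, while the argument order and return shapes are assumptions:

```python
from sparkly import SparklySession

spark = SparklySession()

# Attach arbitrary key/value metadata to a database and read it back.
spark.catalog_ext.set_database_property('my_db', 'comment', 'nightly snapshot')
spark.catalog_ext.get_database_property('my_db', 'comment')  # 'nightly snapshot'
spark.catalog_ext.get_database_properties('my_db')           # {'comment': 'nightly snapshot'}
```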
- Allow newer versions of the `six` package (avoid dependency hell)
- Migrate to spark 2.4.0
- Fix testing.DataType to use new convention to get field type
- Add argmax function to sparkly.functions
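This is not the sparkly API itself, but a plain-PySpark sketch of the argmax pattern the helper covers: picking the value of one column on the row where another column is maximal.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master('local[1]').getOrCreate()
df = spark.createDataFrame(
    [('a', 1, 10), ('a', 3, 30), ('b', 2, 20)],
    ['key', 'score', 'value'],
)

# Per key, take `value` from the row with the highest `score`.
df.groupBy('key').agg(
    F.max(F.struct('score', 'value'))['value'].alias('value_at_max_score'),
).show()
```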
- Fix port issue with reading and writing via `by_url`: `urlparse` returns `netloc` with the port included, which breaks read and write from MySQL and Cassandra.
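For reference, the standard-library behaviour behind this fix: `netloc` keeps the port, while `hostname` does not.

```python
from urllib.parse import urlparse

url = urlparse('mysql://localhost:3306/my_db/my_table')
url.netloc    # 'localhost:3306' -- host and port together
url.hostname  # 'localhost'
url.port      # 3306
```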
- Add `port` argument to `CassandraFixture` and `MysqlFixture`
- Add `Content-Type` header to `ElasticFixture` to support ElasticSearch 6.x
- Update `elasticsearch-hadoop` connector to `6.5.4`
- Update image tag for elasticsearch to `6.5.4`
- Fix write_ext.kafka: run foreachPartition instead of mapPartitions, because the returned value can exceed spark.driver.maxResultSize.
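A small plain-PySpark illustration of the difference; the Kafka producer details are omitted and the function body is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
df = spark.range(1000)

def send_partition(rows):
    # Push each row to the external sink (e.g. a Kafka producer).
    for row in rows:
        pass  # placeholder for the actual produce call

# foreachPartition is a pure side effect: nothing is shipped back to the driver.
df.foreachPartition(send_partition)

# mapPartitions, by contrast, returns data; materializing that result on the
# driver is what could exceed spark.driver.maxResultSize.
```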
- Respect PYSPARK_SUBMIT_ARGS if it is already set by appending SparklySession related options at the end instead of overwriting.
- Fix additional_options to always override SparklySession.options when a session is initialized
- Fix ujson dependency on environments where redis-py is already installed
- Access or initialize SparklySession through get_or_create classmethod
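A minimal sketch of the classmethod in use, assuming it takes no required arguments; the subclass and its options are illustrative:

```python
from sparkly import SparklySession

class MySession(SparklySession):
    options = {'spark.sql.shuffle.partitions': '4'}

# The first call builds the session; later calls return the same instance
# instead of constructing a new one.
spark = MySession.get_or_create()
```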
- Amend `sparkly.functions.switch_case` to accept a user-defined function for deciding whether the switch column matches a specific case
- Overwrite existing tables in the metastore
- Add functions module and provide switch_case column generation and multijoin
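Not the sparkly API itself, but a plain-PySpark sketch of what the two helpers boil down to: a chain of `when`/`otherwise` calls for switch_case, and a fold of pairwise joins for multijoin.

```python
from functools import reduce

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master('local[1]').getOrCreate()
df = spark.createDataFrame([('CA',), ('NY',), ('TX',)], ['state'])

# switch_case: pick a value depending on which case the switch column matches.
df.withColumn(
    'state_name',
    F.when(F.col('state') == 'CA', 'California')
     .when(F.col('state') == 'NY', 'New York')
     .otherwise('Other'),
).show()

# multijoin: join a whole list of dataframes on the same key.
dfs = [df, df, df]
joined = reduce(lambda left, right: left.join(right, on='state', how='inner'), dfs)
```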
- Add implicit test target import and extended assertEqual variation
- Support writing to redis:// and rediss:// URLs
- Add LRU cache that persists DataFrames under the hood
- Add ability to check whether a complex type defines specific fields
- `spark.sql.shuffle.partitions` in `SparklyTest` should be set to a string, because an `int` value breaks integration testing in Spark 2.0.2.
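A short sketch of the string-valued option; attaching the session to `SparklyTest` via a `session` attribute is an assumption here:

```python
from sparkly import SparklySession
from sparkly.testing import SparklyTest

class MySession(SparklySession):
    # Pass config values as strings; an int here broke integration
    # testing on Spark 2.0.2.
    options = {'spark.sql.shuffle.partitions': '4'}

class MyTestCase(SparklyTest):
    session = MySession
```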
- Add instant iterative development mode. See `sparkly-testing --help` for more details.
- Use in-memory db for Hive Metastore in `SparklyTest` (faster tests).
- `spark.sql.shuffle.partitions = 4` for `SparklyTest` (faster tests).
- `spark.sql.warehouse.dir = <random tmp dir>` for `SparklyTest` (no side effects).
- Fix: remove backtick quoting from catalog utils to ease work with different databases.
- Add ability to specify custom maven repositories.
- Make it possible to override default value of spark.sql.catalogImplementation
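A sketch covering this entry and the custom-maven-repositories one above; the `repositories` attribute name and the repository URL are assumptions, and the package coordinates simply reuse the elasticsearch-hadoop connector mentioned elsewhere in this changelog:

```python
from sparkly import SparklySession

class MySession(SparklySession):
    # Resolve spark packages from an additional maven repository.
    repositories = ['http://maven.example.com/artifactory/libs-release']
    packages = ['org.elasticsearch:elasticsearch-hadoop:6.5.4']

    # Override the default catalog implementation when Hive support is not needed.
    options = {'spark.sql.catalogImplementation': 'in-memory'}

spark = MySession()
```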
- Add KafkaWatcher to facilitate testing of writing to Kafka
- Fix a few minor pyflakes warnings and typos
- Fix: #40 write_ext.kafka ignores errors.
- Migrate to Spark 2, Spark 1.6.x isn't supported by sparkly 2.x.
- Rename `SparklyContext` to `SparklySession` and derive it from `SparkSession`.
- Use built-in csv reader.
- Replace `hms` with `catalog_ext`.
- `parse_schema` is now consistent with the `DataType.simpleString` method.
- Fix: kafka import error.
- Kafka reader and writer.
- Kafka fixtures.
- Initial open-source release.
- Features:
  - Declarative definition of application dependencies (spark packages, jars, UDFs)
  - Readers and writers for ElasticSearch, Cassandra, MySQL
  - DSL for interaction with Apache Hive Metastore