- A new trait `ConnectorInterface` that simplifies the use of custom connectors (see the sketch after this list)
- New traits in `io.github.setl.internal`:
  - `CanVacuum`
  - `CanUpdate`
  - `CanPartition`
  - `CanWait`
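A hedged illustration of the `ConnectorInterface` entry above: a custom in-memory connector. The import paths and the abstract members implemented here (`setConf`, `read`, `write`) are assumptions for this sketch, not a verbatim copy of the trait definition.

```scala
import io.github.setl.config.Conf
import io.github.setl.storage.connector.ConnectorInterface
import org.apache.spark.sql.DataFrame

// Hypothetical custom connector keeping data in memory; member signatures are assumed.
class InMemoryConnector extends ConnectorInterface {
  private var buffer: Option[DataFrame] = None

  override def setConf(conf: Conf): Unit = ()   // nothing to configure in this sketch

  override def read(): DataFrame =
    buffer.getOrElse(throw new NoSuchElementException("Nothing has been written yet"))

  override def write(df: DataFrame, suffix: Option[String]): Unit = buffer = Some(df)

  override def write(df: DataFrame): Unit = write(df, None)
}
```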
- New IO methods in SparkRepository:
  - `drop`
  - `delete`
  - `create`
  - `vacuum`
  - `awaitTermination`
  - `stopStreaming`
- Parameters of the method `DeltaConnector.update`
- Parameters of the method `DeltaConnector.partition`
- Parameter readCache in Setl.setSparkRepository was renamed to cacheData to avoid ambiguity
- Deprecated `FileConnector.delete()` to avoid ambiguity (use `FileConnector.drop()` instead)
- Upgraded spark-cassandra-connector to 3.0.0 for the mvn profile `spark_3.0`
- New logo
- Update Delta version to v1.0 (PR #234)
- DeltaConnector reader options (PR #170)
- Deprecated methods and constructors
- Change group id to io.github.setl-framework (PR #192)
- Spark 3.0 support
- Downgraded default hadoop version to 3.2.0
- Save mode in DynamoDB Connector
- Updated spark-cassandra-connector from 2.4.2 to 2.5.0 (PR #117)
- Updated spark-excel-connector from 0.12.4 to 0.13.1 (PR #117)
- Updated spark-dynamodb-connector from 1.0.1 to 1.0.4 (PR #117)
- Updated scalatest (scope test) from 3.1.0 to 3.1.2 (PR #117)
- Updated postgresql (scope test) from 42.2.9 to 42.2.12 (PR #117)
- Added pipeline dependency check before starting the spark job (PR #114)
- Added default Spark job group and description (PR #116)
- Added `StructuredStreamingConnector` (PR #119)
- Added `DeltaConnector` (PR #118)
- Added `ZipArchiver` that can zip files/directories (PR #124)
- Fixed path separator in FileConnectorSuite that caused test failures
- Fixed `Setl.hasExternalInput` that always returned false (PR #121)
- Fixed cross building issue (#111)
- Changed benchmark unit of time to seconds (#88)
- Improved test coverage
- The master URL of SparkSession can now be overwritten in local environment (#74)
- `FileConnector` now lists paths correctly for nested directories (#97)
- Added Mermaid diagram generation to Pipeline (#51)
- Added `showDiagram()` method to Pipeline that prints the Mermaid code and generates the live editor URL 🎩🐰✨ (#52); see the usage sketch below
- Added Codecov report and Scala API doc
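For the `showDiagram()` entry above, a minimal usage sketch (the factory class `MyFactory` is hypothetical and defined elsewhere):

```scala
import io.github.setl.Setl

val setl: Setl = Setl.builder().withDefaultConfigLoader().getOrCreate()
val pipeline = setl.newPipeline()

pipeline.addStage(classOf[MyFactory])   // MyFactory: some Factory[_] defined elsewhere
pipeline.showDiagram()                  // prints the Mermaid code and the live-editor URL
```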
- Added `delete` method in `JDBCConnector` (#82)
- Added `drop` method in `DBConnector` (#83)
- Added support for both of the following two Spark configuration styles in SETL builder (#86)
  ```hocon
  setl.config {
    spark {
      spark.app.name = "my_app"
      spark.sql.shuffle.partitions = "1000"
    }
  }

  setl.config_2 {
    spark.app.name = "my_app"
    spark.sql.shuffle.partitions = "1000"
  }
  ```
- BREAKING CHANGE: Renamed DCContext to Setl
- Changed the default application environment config path into setl.environment
- Changed the default context config path into setl.config
- Optimized DeliverableDispatcher
- Optimized PipelineInspector (#33)
- Fixed issue of DynamoDBConnector that didn't take user configuration
- Fixed issue of CompoundKey annotation. SparkRepository now correctly handles columns having multiple compound keys. (#36)
- Added support for private variable delivery (#24)
- Added empty SparkRepository as placeholder (#30)
- Added annotation Benchmark that could be used on methods of an AbstractFactory (#35)
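A hedged sketch of the Benchmark annotation in use; the factory below is hypothetical, and its read/process/write/get skeleton is assumed rather than stated in this entry:

```scala
import io.github.setl.annotation.{Benchmark, Delivery}
import io.github.setl.transformation.Factory
import org.apache.spark.sql.Dataset

case class User(name: String, age: Int)

// Hypothetical factory: the annotated methods are the ones whose execution time is measured.
class DeduplicationFactory extends Factory[Dataset[User]] {
  @Delivery var users: Dataset[User] = _
  private var output: Dataset[User] = _

  override def read(): this.type = this

  @Benchmark
  override def process(): this.type = { output = users.dropDuplicates(); this }

  @Benchmark
  override def write(): this.type = this

  override def get(): Dataset[User] = output
}
```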
- BREAKING CHANGE: replace the Spark compatible version by the Scala compatible version in the artifact ID. The old artifact id dc-spark-sdk_2.4 was changed to dc-spark-sdk_2.11 (or dc-spark-sdk_2.12)
- Upgraded dependencies
- Added Scala 2.12 support
- Removed SparkSession from Connector and SparkRepository constructor (old constructors are kept but now deprecated)
- Added Column type support in FindBy method of SparkRepository and Condition
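A brief sketch of the Column support described above; the `Condition` overload taking a Column and the repository wiring are assumptions:

```scala
import io.github.setl.storage.Condition
import io.github.setl.storage.repository.SparkRepository
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

case class User(name: String, age: Int)

// `userRepository` stands for any registered SparkRepository[User].
def adults(userRepository: SparkRepository[User]): Dataset[User] =
  userRepository.findBy(Condition(col("age") > 18))
```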
- Added methods setConnector and setRepository in Setl that accept objects of type Connector/SparkRepository
- Added read cache into spark repository to avoid consecutive disk IO.
- Added option autoLoad in the Delivery annotation so that DeliverableDispatcher can still handle the dependency injection in the case where the delivery is missing but a corresponding repository is present.
- Added option condition in the Delivery annotation to pre-filter loaded data when autoLoad is set to true.
- Added option id in the Delivery annotation. DeliveryDispatcher will match deliveries by the id in addition to the payload type. By default the id is an empty string ("").
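A sketch combining the three Delivery options above (autoLoad, condition, id); the factory and case classes are hypothetical, and the read/process/write/get skeleton is assumed:

```scala
import io.github.setl.annotation.Delivery
import io.github.setl.transformation.Factory
import org.apache.spark.sql.Dataset

case class User(name: String, age: Int)
case class Order(orderId: String, userName: String)

class StatsFactory extends Factory[Long] {

  // If no Deliverable of type Dataset[User] was set on the pipeline, the dispatcher
  // falls back to a registered SparkRepository[User] (autoLoad) and pre-filters
  // the loaded data with `condition`.
  @Delivery(autoLoad = true, condition = "age > 18")
  var users: Dataset[User] = _

  // Matched by this id in addition to the payload type.
  @Delivery(id = "archived_orders")
  var orders: Dataset[Order] = _

  private var output: Long = _

  override def read(): this.type = this
  override def process(): this.type = { output = users.count() + orders.count(); this }
  override def write(): this.type = this
  override def get(): Long = output
}
```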
- Added setConnector method in DCContext. Each connector should be delivered with an ID. By default the ID will be its config path.
- Added support of wildcard path for SparkRepository and Connector
- Added JDBCConnector
- Added SnappyCompressor.
- Added method persist(persistence: Boolean) into Stage and Factory to activate/deactivate output persistence. By default the output persistence is set to true.
- Added implicit method `filter(cond: Set[Condition])` for Dataset and DataFrame (see the sketch below).
- Added `setUserDefinedSuffixKey` and `getUserDefinedSuffixKey` to SparkRepository.
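For the implicit `filter(cond: Set[Condition])` entry above, a hedged usage sketch; the import that brings the implicit into scope is not named in this changelog, so the path below is an assumption:

```scala
import io.github.setl.storage.Condition
import io.github.setl.util.FilterImplicits._   // assumed location of the implicit enrichment
import org.apache.spark.sql.DataFrame

def frenchAdults(df: DataFrame): DataFrame =
  df.filter(Set(
    Condition("country", "=", "FR"),
    Condition("age", ">", "18")
  ))
```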
- Added @Compress annotation. SparkRepository will compress all columns having this annotation by using a Compressor (the default compressor is XZCompressor)
  ```scala
  case class CompressionDemo(@Compress col1: Seq[Int],
                             @Compress(compressor = classOf[GZIPCompressor]) col2: Seq[String])
  ```
- Added interface Compressor and implemented XZCompressor and GZIPCompressor
- Added SparkRepositoryAdapter[A, B]. It will allow a SparkRepository[A] to write/read a data store of type B by using an implicit DatasetConverter[A, B]
- Added trait Converter[A, B] that handles the conversion between an object of type A and an object of type B
- Added abstract class DatasetConverter[A, B] that extends a Converter[Dataset[A], Dataset[B]]
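A hedged sketch of the converter machinery described above; the member names (`convertFrom`/`convertTo`) and the import path are assumptions:

```scala
import io.github.setl.storage.DatasetConverter
import org.apache.spark.sql.{Dataset, SparkSession}

case class UserV1(name: String, age: Int)
case class UserV2(fullName: String, age: Int)

// With such an implicit converter in scope, a SparkRepository[UserV1] wrapped in a
// SparkRepositoryAdapter can read from and write to a UserV2-shaped data store.
implicit val userConverter: DatasetConverter[UserV1, UserV2] =
  new DatasetConverter[UserV1, UserV2] {
    private val spark = SparkSession.getActiveSession.get
    import spark.implicits._

    override def convertFrom(v2: Dataset[UserV2]): Dataset[UserV1] =
      v2.map(u => UserV1(u.fullName, u.age))

    override def convertTo(v1: Dataset[UserV1]): Dataset[UserV2] =
      v1.map(u => UserV2(u.name, u.age))
  }
```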
- Added auto-correction for the `SparkRepository.findBy(conditions)` method when we filter by case class field name instead of column name
- Added DCContext that simplifies the creation of SparkSession, SparkRepository, Connector and Pipeline
- Added a builder for ConfigLoader to simplify the instantiation of a ConfigLoader object
- Added `readStandardJSON` and `writeStandardJSON` methods into JSONConnector to read/write standard JSON format files
- Added sequential mode in class `Stage`. Users can turn it on by setting `parallel` to false (see the sketch below).
- Added external data flow description in pipeline description
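A hedged usage sketch for the sequential mode described in the Stage entry above; the `parallel(...)` setter form is an assumption, and the factory classes are the hypothetical ones from the earlier sketches:

```scala
import io.github.setl.workflow.Stage

// Factories of a stage run in parallel by default; disabling parallelism
// makes them run one after another (sequential mode).
val stage = new Stage()
  .parallel(false)
  .addFactory(classOf[DeduplicationFactory])
  .addFactory(classOf[StatsFactory])
```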
- Added method `beforeAll` into `ConfigLoader`
- Added new methods `addStage` and `addFactory` that take a class object as input. The instantiation will be handled by the stage.
- Removed implicit argument encoder from all methods of Repository trait
- Added new get method to Pipeline: `get[A](cls: Class[_ <: Factory[_]]): A`.
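For example (Pipeline's import path is assumed; `StatsFactory` is the hypothetical factory from the earlier sketch, with `Long` as its output type):

```scala
import io.github.setl.workflow.Pipeline

// Retrieve the output of a specific factory by its class object.
def statsOf(pipeline: Pipeline): Long =
  pipeline.get[Long](classOf[StatsFactory])
```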
- Added `Delivery` annotation to handle inputs of a Factory

  ```scala
  class Foo {
    @Delivery(producer = classOf[Factory1], optional = true)
    var input1: String = _

    @Delivery(producer = classOf[Factory2])
    var input2: String = _
  }
  ```
- Added an optional argument `suffix` in `FileConnector` and `SparkRepository`
- Added method `partitionBy` in `FileConnector` and `SparkRepository`
- Added possibility to filter by name pattern when a FileConnector is trying to read a directory. To do this, add `filenamePattern` into the configuration file
- Added possibility to create a `Conf` object from a Map: `Conf(Map("a" -> "A"))`
- Improved Hadoop and S3 compatibility of connectors
- Added `DispatchManager` class. It will dispatch its deliverable object to setters (denoted by @Delivery) of a factory
- Added `Deliverable` class, which contains a payload to be delivered
- Added `PipelineInspector` to describe a pipeline
- Added `FileConnector` and `DBConnector`
- Fixed issue of file path containing whitespace character(s) in the URI creation (52eee322aacd85e0b03a96435b07c4565e894934)
- Removed `EnrichedConnector`
- Removed V1 interfaces
- Added a second argument to CompoundKey to handle primary and sort keys
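An illustrative use of the two-argument form; the key names and positions below are examples only:

```scala
import io.github.setl.annotation.CompoundKey

// "partition"/"sort" name the key kind, "1" gives the field's position within that key.
case class Event(@CompoundKey("partition", "1") country: String,
                 @CompoundKey("sort", "1") timestamp: Long,
                 payload: String)
```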
- Added `Conf` into `SparkRepositoryBuilder` and changed all the set methods of `SparkRepositoryBuilder` to use the conf object
- Changed package name `io.github.setl.annotations` to `io.github.setl.annotation`
- Added annotation `ColumnName`, which could be used to replace the current column name with an alias in the data storage (see the sketch below).
- Added annotation `CompoundKey`. It could be used to define a compound key for databases that only allow one partition key
- Added sheet name into arguments of ExcelConnector
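An illustrative use of the `ColumnName` annotation described above (the case class is hypothetical):

```scala
import io.github.setl.annotation.ColumnName

// The field is called `name` in Scala but stored as column "user_name" in the data store.
case class User(@ColumnName("user_name") name: String, age: Int)
```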
- Added DynamoDB V2 repository
- Added auxiliary constructors of case class `Condition`
- Added SchemaConverter
- Added DynamoDB Repository
- Removed scope provided from connectors and TypeSafe config
- Added DynamoDB Connector
- Removed unnecessary Type variable in `Connector`
- Added `ConnectorBuilder` to directly build a connector from a Typesafe `Config` object
- Added auxiliary constructor in `SparkRepositoryBuilder`
- Added enumeration `AppEnv`
- Changed spark version to 2.4.3
- Added `SparkRepositoryBuilder` that allows creation of a `SparkRepository` for a given class without creating a dedicated `Repository` class
- Added Excel support for `SparkRepository` by creating `ExcelConnector`
- Added `Logging` trait
- Fixed `Factory` class covariance issue (0764d10d616c3171d9bfd58acfffafbd8b9dda15)
- Added documentation
- Added changelog
- Changed `.gitlab-ci.yml` to speed up CI
- Added unit tests
- Added `.gitlab-ci.yml`