build(deps-dev): bump dbldatagen from 0.3.5 to 0.4.0 #637

Merged: ireneisdoomed merged 3 commits into dev from dependabot/pip/dbldatagen-0.4.0 on Jun 13, 2024

@dependabot dependabot bot commented on behalf of github Jun 10, 2024

Bumps dbldatagen from 0.3.5 to 0.4.0.

Release notes

Sourced from dbldatagen's releases.

release 0.4.0

This release adds the following new features:

  • various bug fixes
  • support for Constraints
  • support for standard datasets

The new standard dataset feature allows creation of synthetic data sets in just a couple of lines of code for benchmarking / optimization and other purposes

release/v.0.3.6.1

Hot fixes post v0.3.6

  • Updates to documentation
  • Updates to enable dbldatagen to work better with Databricks Connect
  • bumped version

Release v0.3.6

This release includes fixes for use of dbldatagen on the Databricks shared clusters

Changelog

Sourced from dbldatagen's changelog.

Version 0.4.0

Changed

  • Updated minimum pyspark version to be 3.2.1, compatible with Databricks runtime 10.4 LTS or later
  • Modified data generator to allow specification of constraints to the data generation process
  • Updated documentation for generating text data.
  • Modified data distributions to use abstract base classes
  • Migrated data distribution tests to use pytest
  • Additional standard datasets

Added

  • Added classes for constraints on the data generation via new package dbldatagen.constraints
  • Added support for standard data sets via the new package dbldatagen.datasets

Version 0.3.6 Post 1

Changed

  • Updated docs for complex data types / JSON to correct code examples
  • Updated license file in public docs

Fixed

  • Fixed scenario where DataAnalyzer is used on dataframe containing a column named summary

Version 0.3.6

Changed

  • Updated readme to include details on which versions of Databricks runtime support Unity Catalog shared access mode.
  • Updated code to use default parallelism of 200 when using a shared Spark session
  • Updated code to use Spark's SQL function element_at instead of array indexing due to incompatibility

Notes

  • This version changes the minimum supported Databricks runtime to 10.4 LTS and later releases.
  • While there are no known incompatibilities with Databricks 9.1 LTS, we will not test against this release

@dependabot dependabot bot added the dependencies and python labels Jun 10, 2024
Bumps [dbldatagen](https://github.com/databrickslabs/data-generator) from 0.3.5 to 0.4.0.
- [Release notes](https://github.com/databrickslabs/data-generator/releases)
- [Changelog](https://github.com/databrickslabs/dbldatagen/blob/master/CHANGELOG.md)
- [Commits](databrickslabs/dbldatagen@release/v0.3.5...release/v0.4.0)

---
updated-dependencies:
- dependency-name: dbldatagen
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot force-pushed the dependabot/pip/dbldatagen-0.4.0 branch from 82f3b2e to 5bf615a Compare June 10, 2024 08:41
@ireneisdoomed ireneisdoomed left a comment

Constraints is an interesting feature that can be useful in scenarios we currently have:

  • When the mock data requires specific conditions at the time of generation.
    • The library already covered examples like this one, where we define a list of possible values, but these additions add much more flexibility.
  • When the mock data requires specific conditions at the time of usage in a particular unit test.
    • For example here, where we need a study locus with an empty ldSet.

From their documentation, this is an example of how it works:

import dbldatagen as dg

data_rows = 10000000

dataspec = dg.DataGenerator(spark, rows=data_rows, partitions=8)

dataspec = (
    dataspec.withColumn("name", "string", template=r"\\w \\w|\\w a. \\w")
    .withColumn(
        "product_sku", "string", minValue=1000000, maxValue=1000000 + 1000, prefix="dr", random=True
    )
    .withColumn("email", "string", template=r"\\w.\\w@\\w.com")
    .withColumn("qty_ordered", "int", minValue=1, maxValue=10, distribution="normal", random=True)
    .withColumn("unit_price", "float", minValue=1.0, maxValue=30.0, step=0.01, distribution="normal",
                baseColumn="product_sku", baseColumnType="hash")
    .withColumn("order_ts", "timestamp", begin="2020-01-01 01:00:00",
                end="2020-12-31 23:59:00",
                interval="1 minute", random=True )
    .withColumn("shipping_ts", "timestamp", begin="2020-01-05 01:00:00",
                end="2020-12-31 23:59:00",
                interval="1 minute", random=True, percentNulls=0.1)
    .withSqlConstraint("shipping_ts is null or shipping_ts > order_ts")
)
df1 = dataspec.build()
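
To make the constraint in that example concrete, here is a minimal plain-Python sketch of the predicate `shipping_ts is null or shipping_ts > order_ts`. It does not use dbldatagen or Spark, and it says nothing about how the library enforces constraints internally. The helper name `satisfies_constraint` and the sample rows are hypothetical, and the sketch assumes that `order_ts` is never null and that rows failing the predicate are excluded from the result.

```python
from datetime import datetime
from typing import Optional


def satisfies_constraint(order_ts: datetime, shipping_ts: Optional[datetime]) -> bool:
    """Plain-Python mirror of: shipping_ts is null or shipping_ts > order_ts.

    Hypothetical helper for illustration only; assumes order_ts is not null.
    """
    return shipping_ts is None or shipping_ts > order_ts


# Hand-written sample rows as (order_ts, shipping_ts) pairs.
rows = [
    (datetime(2020, 3, 1, 10, 0), datetime(2020, 3, 4, 9, 30)),  # shipped after ordering
    (datetime(2020, 6, 1, 12, 0), None),                         # not shipped yet (null)
    (datetime(2020, 9, 1, 8, 0), datetime(2020, 8, 30, 8, 0)),   # shipped before ordering
]

# Keep only the rows that satisfy the predicate.
kept = [row for row in rows if satisfies_constraint(*row)]
```

Only the third row violates the predicate, which is the kind of inconsistent record the constraint is meant to keep out of the mock data.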

I wouldn't spend time changing what we currently have, but it's worth knowing this exists.

@ireneisdoomed ireneisdoomed merged commit 976ee30 into dev Jun 13, 2024
4 checks passed
@ireneisdoomed ireneisdoomed deleted the dependabot/pip/dbldatagen-0.4.0 branch June 13, 2024 09:29
project-defiant pushed a commit that referenced this pull request Jun 14, 2024
project-defiant pushed a commit that referenced this pull request Jul 12, 2024