Add data-init component #2271

Merged: 2 commits, Jul 9, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -17,6 +17,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- QueryTemplate component
- Support for packaging application from the database.
- Added DataInit component

#### Bug Fixes

29 changes: 27 additions & 2 deletions docs/content/api/components/dataset.md
@@ -8,14 +8,16 @@
Dataset(self,
identifier: str,
db: dataclasses.InitVar[typing.Optional[ForwardRef('Datalayer')]] = None,
-uuid: str = <factory>,
+uuid: None = <factory>,
*,
upstream: "t.Optional[t.List['Component']]" = None,
artifacts: 'dc.InitVar[t.Optional[t.Dict]]' = None,
select: 't.Optional[Query]' = None,
sample_size: 't.Optional[int]' = None,
random_seed: 't.Optional[int]' = None,
creation_date: 't.Optional[str]' = None,
-raw_data: 't.Optional[t.Sequence[t.Any]]' = None) -> None
+raw_data: 't.Optional[t.Sequence[t.Any]]' = None,
+pin: 'bool' = False) -> None
```
| Parameter | Description |
|-----------|-------------|
@@ -28,6 +30,29 @@ Dataset(self,
| random_seed | The random seed to use for sampling. |
| creation_date | The date the dataset was created. |
| raw_data | The raw data for the dataset. |
| pin | Whether to pin the dataset. If True, the dataset loads the data from the database every time. If False, the dataset caches the data after it is applied to the db. |

A dataset is an immutable collection of documents.

## `DataInit`

```python
DataInit(self,
identifier: str,
db: dataclasses.InitVar[typing.Optional[ForwardRef('Datalayer')]] = None,
uuid: None = <factory>,
*,
upstream: "t.Optional[t.List['Component']]" = None,
artifacts: 'dc.InitVar[t.Optional[t.Dict]]' = None,
data: 't.List[t.Dict]',
table: 'str') -> None
```
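| Parameter | Description |
|-----------|-------------|
| identifier | Identifier of the leaf. |
| db | Datalayer instance. |
| uuid | UUID of the leaf. |
| upstream | A list of upstream components. |
| artifacts | A dictionary of artifacts paths and `DataType` objects. |
| data | The data to initialize. |
| table | The table to insert the data into. |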

DataInit(identifier: str, db: dataclasses.InitVar[typing.Optional[ForwardRef('Datalayer')]] = None, uuid: None = <factory>, *, upstream: "t.Optional[t.List['Component']]" = None, artifacts: 'dc.InitVar[t.Optional[t.Dict]]' = None, data: 't.List[t.Dict]', table: 'str')

17 changes: 17 additions & 0 deletions docs/content/apply_api/data_init.md
@@ -0,0 +1,17 @@
# `DataInit`

- Used to automatically insert initialization data during application build.

***Usage pattern***

```python
from superduperdb.components.dataset import DataInit

# `db` is assumed to be an already-connected Datalayer instance.
data = [{"x": i, "y": [1, 2, 3]} for i in range(10)]
data_init = DataInit(data=data, table="documents", identifier="test_data_init")

db.apply(data_init)
```

***Explanation***

- When `db.apply(data_init)` is executed, `DataInit` inserts `data` into the specified `table`.
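
The flow described above (apply the component, then a post-create hook inserts the rows) can be sketched without a real database. This is a toy model only: `ToyDB` and `ToyDataInit` are hypothetical names, not superduperdb classes, and the rule that the hook runs once per identifier is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


class ToyDB:
    """Hypothetical in-memory stand-in for a Datalayer."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[Dict[str, Any]]] = {}
        self.components: Dict[str, "ToyDataInit"] = {}

    def insert(self, table: str, rows: List[Dict[str, Any]]) -> None:
        # Append rows to the named table, creating it if needed.
        self.tables.setdefault(table, []).extend(rows)

    def apply(self, component: "ToyDataInit") -> None:
        # Run the post-create hook only the first time an identifier is seen.
        if component.identifier not in self.components:
            self.components[component.identifier] = component
            component.post_create(self)


@dataclass
class ToyDataInit:
    identifier: str
    table: str
    data: List[Dict[str, Any]] = field(default_factory=list)

    def post_create(self, db: ToyDB) -> None:
        # Mirrors the idea of inserting the initial data into the target table.
        db.insert(self.table, self.data)


toy_db = ToyDB()
rows = [{"x": i, "y": [1, 2, 3]} for i in range(10)]
toy_init = ToyDataInit(identifier="test_data_init", table="documents", data=rows)

toy_db.apply(toy_init)
toy_db.apply(toy_init)  # no-op in this toy model: the hook already ran
print(len(toy_db.tables["documents"]))  # prints 10
```

Keeping the insert inside the hook, rather than in the constructor, means that building a `ToyDataInit` has no side effects until it is applied.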
3 changes: 2 additions & 1 deletion superduperdb/components/component.py
@@ -106,14 +106,15 @@ class Component(Leaf):
    that can be saved into a database.

    :param artifacts: A dictionary of artifacts paths and `DataType` objects
    :param upstream: A list of upstream components
    """

    type_id: t.ClassVar[str] = 'component'
    leaf_type: t.ClassVar[str] = 'component'
    _artifacts: t.ClassVar[t.Sequence[t.Tuple[str, 'DataType']]] = ()
    set_post_init: t.ClassVar[t.Sequence] = ('version',)
    changed: t.ClassVar[set] = set([])

    upstream: t.Optional[t.List["Component"]] = None
    artifacts: dc.InitVar[t.Optional[t.Dict]] = None

    @property
22 changes: 22 additions & 0 deletions superduperdb/components/dataset.py
@@ -94,3 +94,25 @@ def __str__(self):
        return f'Dataset(identifier={self.identifier}, select={self.select})'

    __repr__ = __str__


class DataInit(Component):
    """A data initialization component.

    :param data: The data to initialize.
    :param table: The table to insert the data.
    """

    data: t.List[t.Dict]
    table: str

    def post_create(self, db: Datalayer) -> None:
        """Called after the first time this component is created.

        Generally used if ``self.version`` is important in this logic.

        :param db: the db that creates the component.
        """
        super().post_create(db)
        self.init()
        db[self.table].insert(self.data).execute()
17 changes: 16 additions & 1 deletion test/unittest/component/test_dataset.py
@@ -2,7 +2,7 @@

import pytest

-from superduperdb.components.dataset import Dataset
+from superduperdb.components.dataset import DataInit, Dataset


@pytest.mark.parametrize("db", DBConfig.EMPTY_CASES, indirect=True)
@@ -32,3 +32,18 @@ def test_dataset_pin(db, pin):
        assert len(dataset.data) == 10
    else:
        assert len(dataset.data) == 20


@pytest.mark.parametrize("db", DBConfig.EMPTY_CASES, indirect=True)
def test_init_data(db):
    db.cfg.auto_schema = True
    data = [{"x": i, "y": [1, 2, 3]} for i in range(10)]
    data_init = DataInit(data=data, table="documents", identifier="test_data_init")

    db.apply(data_init)

    data = list(db["documents"].select().execute())
    assert len(data) == 10
    for i, d in enumerate(data):
        assert d["x"] == i
        assert d["y"] == [1, 2, 3]