- Remove conditional import of `SQLAlcRow`, as the minimum Airflow version has been pinned to >= 2.7 #2146
- Replace `openlineage-airflow` with the Apache Airflow OSS provider `apache-airflow-providers-openlineage` #2103
- Bump up minimum version of `apache-airflow` to 2.7 #2103
- Bump up minimum version of Python to 3.8 #2103
- Allow users to disable schema check and creation on `load_file` #1922
- Allow users to disable schema check and creation on `transform` #1925
- Add support for Excel files #1978
- Support loading metadata columns from stage into table for Snowflake #2023
- Add `openlineage_dataset_uri` in Databricks #1919
- Fix QueryModifier issue on Snowflake #1962
- Fix AstroCustomXcomBackend circular import issue #1943
- Add an example DAG for using dynamic task with dataframe #1912
- Improve `example_load_file` DAG task names #1958
- Limit `databricks-sql-connector<2.9.0` #2013
- Fix Snowflake QueryModifier issue #1962
- Add support for Pandas 2, Airflow 2.6.3 and Python 3.11 #1989
- Update the WASB connection #1994
- Fix AstroCustomXcomBackend circular import issue. #1943
- Add MySQL support #1801
- Add support to load from Azure blob storage into Databricks #1561
- Add argument `skip_on_failure` to `CleanupOperator` #1837 by @scottleechua
- Add `query_modifier` to `raw_sql`, `transform` and `transform_file`, which allows users to define SQL statements to be run before the main query statement #1898. For example, this feature can be used to add Snowflake query tags to a SQL statement:

```python
from astro import sql as aql
from astro.query_modifier import QueryModifier


@aql.run_raw_sql(
    results_format="pandas_dataframe",
    conn_id="sqlite_default",
    query_modifier=QueryModifier(pre_queries=["ALTER team_1", "ALTER team_2"]),
)
def dummy_method():
    return "SELECT 1+1"
```
- Upgrade astro-runtime to 7.4.2 #1878
- Raise an exception when dataframes larger than expected are passed to `aql.dataframe` #1839
- Revert breaking change introduced in 1.5.0, re-allowing `aql.transform` to receive a SQL filepath #1879
- Update open lineage documentation #1881
- Support Apache Airflow 2.6 #1899, with internal serialization changes
- Add basic `Tiltfile` for local dev #1819
- Fix AstroCustomXcomBackend circular import issue. #1943
- Support using SQL operators (`run_raw_sql`, `transform`, `dataframe`) to convert a Pandas dataframe into a table when using a DuckDB in-memory database #1848
- Fix code coverage issues #1815
- Upgrade astro-runtime to 7.4.1 #1858
- Restore pandas load option classes: `PandasCsvLoadOptions`, `PandasJsonLoadOptions`, `PandasNdjsonLoadOptions` and `PandasParquetLoadOptions` #1795
- Add Openlineage facets for Microsoft SQL server. #1752
- Use `use_native_support` param in `load_file` operator for table creation #1756
- Resolved `pandas-gbq` dependency issue #1768
- Fix Minio support for Snowflake #1767
- Add handler param in `database.run_sql()` method #1773
- Add support for Microsoft SQL server. #1538
- Add support for DuckDB. #1695
- Add `result_format` and `fail_on_empty` params to the `run_raw_sql` operator #1584 (see the sketch below)
- Add support for `validation_mode` as part of the `COPY INTO` command for Snowflake #1689
- Add support for native transfers from Azure Blob Storage to Snowflake in `LoadFileOperator` #1675
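A minimal sketch of the new `run_raw_sql` parameters from the first entry above; the connection id, table, and query are placeholders, and the spelling `results_format` follows the usage shown elsewhere in this changelog:

```python
from astro import sql as aql


@aql.run_raw_sql(
    conn_id="sqlite_default",           # placeholder connection
    results_format="pandas_dataframe",  # hand the result back as a DataFrame
    fail_on_empty=True,                 # fail the task instead of returning an empty result
)
def fetch_orders():
    return "SELECT * FROM orders"
```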
- Use cache to reduce redundant database calls #1488
- Remove default `copy_options` as part of `SnowflakeLoadOptions`. All `copy_options` are now supported as part of `SnowflakeLoadOptions` as per documentation #1689
- Remove `load_options` from `File` object #1721
- Render SQL code with parameters in `BaseSQLDecoratedOperator` #897
- Fix handling of multiple dataframes in the `run_raw_sql` operator #1700
- Add documentation around Microsoft SQL support with example DAG. #1538
- Add documentation around DuckDB support with example DAG. #1695
- Add documentation for `validation_mode` as part of the `COPY INTO` command for Snowflake #1689
- Add documentation and example DAGs for `SnowflakeLoadOptions`, covering the available options around `copy_options` and `file_options` #1689
- Fix the documentation to run the quickstart example described in the Python SDK README #1716
- Add cleanup DAG to clean snowflake tables created as part of CI when the runners fail as part of GitHub actions. #1663
- Run example DAGs on astro-cloud and collect the results. #1499
- Consolidated `PandasCsvLoadOptions`, `PandasJsonLoadOptions`, `PandasNdjsonLoadOptions` and `PandasParquetLoadOptions` into a single `PandasLoadOptions` class #1722
- Implement `check_table` operator to validate data quality at the table level #1239 (sketch below)
- Add `check_column` operator to validate data quality for columns in a table/dataframe #1239
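A hedged sketch of the two data-quality operators named above; the table is a placeholder and the check dictionaries follow the conventions of Airflow's SQL check operators, which these wrap:

```python
from astro import sql as aql
from astro.table import Table

movies = Table(name="imdb_movies", conn_id="sqlite_default")  # placeholder table

# Table-level check: assert the table is non-empty.
aql.check_table(
    dataset=movies,
    checks={"row_count_check": {"check_statement": "COUNT(*) >= 1"}},
)

# Column-level check: assert the rating column contains no NULLs.
aql.check_column(
    dataset=movies,
    column_mapping={"rating": {"null_check": {"equal_to": 0}}},
)
```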
- Support "s3" conn type for S3Location #1647
- Add the documentation and example DAG for Azure blob storage #1598
- Fix dead link in documentation #1596
- Update README with newly supported location and database #1596
- Update configuration reference for XCom #1646
- Add step to generate constraints in Python SDK release process #1474
- Add document to showcase the use of the `check_table` and `check_column` operators #1631
- Install `google-cloud-sdk-gke-gcloud-auth-plugin` in benchmark CI job #1557
- Pin `sphinx-autoapi==2.0.0` version for docs build #1609
- Support SFTP as file location docs #1481
- Support FTP as file location docs #1482
- Add support for Azure Blob Storage (only non-native implementation) #1275, #1542
- Add databricks delta table support docs #1352, #1397, #1452, #1476, #1480, #1555
- Add sourceCode facet to `aql.dataframe()` and `aql.transform()` as part of OpenLineage integration #1537
- Enhance `LoadFileOperator` so that users can send pandas attributes through `PandasLoadOptions` docs #1466
- Enhance `LoadFileOperator` so that users can send Snowflake-specific load attributes through `SnowflakeLoadOptions` docs #1516
- Expose `get_file_list_func` to users so that it returns an iterable File list from a given destination file storage #1380
- Deprecate `export_table_to_file` in favor of `export_to_file` (`ExportTableToFileOperator` and the `export_table_to_file` operator will be removed in astro-python-sdk 1.5.0) #1503
- `LoadFileOperator` operator checks for `conn_type` and `conn_id` provided to `File` #1471
- Generate constraints on releases and pushes (not PRs) #1472
- Change `export_file` to `export_table_to_file` in the documentation #1477
- Enhance documentation to describe the new XCom requirements from Astro SDK 1.3.3 and Airflow 2.5 #1483
- Add documentation around `LoadOptions` with example DAGs #1567
- Refactor snowflake merge function for easier maintenance #1493
- Disable Custom serialization for Back-compat #1453
- Use different approach to get location for Bigquery tables #1449
- Fix the `run_raw_sql()` operator failing when the handler returns `None`, which broke the serialization logic #1431
- Update the deprecation warning for the `export_file()` operator #1411
- The Dataframe operator now allows a user to either `append` to a table or `replace` a table with the `if_exists` parameter #1379 (sketch below)
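A minimal sketch of the `if_exists` parameter described above; the table, connection id, and function body are placeholders:

```python
import pandas as pd

from astro import sql as aql
from astro.table import Table


@aql.dataframe(if_exists="replace")  # or "append" to keep existing rows
def daily_scores() -> pd.DataFrame:
    return pd.DataFrame({"team": ["a", "b"], "score": [10, 7]})


scores = daily_scores(output_table=Table(name="scores", conn_id="sqlite_default"))
```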
- Fix the `aql.cleanup()` operator failing because the attribute `output` was only implemented in 2.4.0 #1359
- Fix backward compatibility with `apache-airflow-providers-snowflake==4.0.2` #1351
- `LoadFileOperator` returns a dataframe if not using an XCom backend #1348, #1337
- Fix the functionality to create region-specific temporary schemas when they don't exist in the same region #1369
- Cross-link to the API reference page from the Operators page #1383
- Improve the integration tests to count the number of rows impacted for database operations #1273
- Run python-sdk tests with Airflow 2.5.0 and fix the CI failures #1232, #1351, #1317, #1337
- Deprecate `export_file` before renaming to `export_table_to_file` #1411
- Remove the need to use a custom XCom backend for storing dataframes when XCom pickling is disabled #1334, #1331, #1319
- Add support for Google Drive to be used as `FileLocation` #1044. Example of loading a file from Google Drive to Snowflake:

```python
import os

from astro import sql as aql
from astro.files import File
from astro.table import Metadata, Table

SNOWFLAKE_CONN_ID = "snowflake_default"  # placeholder connection id

aql.load_file(
    input_file=File(
        path="gdrive://sample-google-drive/sample.csv", conn_id="gdrive_conn"
    ),
    output_table=Table(
        conn_id=SNOWFLAKE_CONN_ID,
        metadata=Metadata(
            database=os.environ["SNOWFLAKE_DATABASE"],
            schema=os.environ["SNOWFLAKE_SCHEMA"],
        ),
    ),
)
```
- Use `DefaultExtractor` from OpenLineage. Users need not set the environment variable `OPENLINEAGE_EXTRACTORS` to use OpenLineage #1223, #1292
- Generate constraints files for multiple Python and Airflow versions that display the set of "installable" constraints for a particular Python (3.7, 3.8, 3.9) and Airflow version (2.2.5, 2.3.4, 2.4.2) #1226
- Improve the logs when native transfers fall back to Pandas, and indicate the fallback in `LoadFileOperator` #1263
- Temporary tables should be cleaned up, even with mapped tasks, via `aql.cleanup()` #963
- Update the name and namespace as per the new OpenLineage naming conventions #1281
- Delete the Snowflake stage when `LoadFileOperator` fails #1262
- Update the documentation for Google Drive support. #1044
- Update the documentation to remove the environment variable `OPENLINEAGE_EXTRACTORS`, which is no longer needed to use OpenLineage #1292
- Fix the GCS path in `aql.export_file` in the example DAGs #1339
- When `if_exists` is set to `replace` in the Dataframe operator, replace the table rather than append to it. This change fixes a regression on the Dataframe operator which caused it to append content to an output table instead of replacing it #1260
- Pass the table metadata `database` value to the underlying Airflow `PostgresHook` instead of `schema`, since `schema` was renamed to `database` in Airflow #1276
- Include description on pickling and usage of custom Xcom backend in README.md #1203
- Investigate and fix tests that are filling up Snowflake database with tmp tables as part of our CI execution. #738
- Make `openlineage` an optional dependency #1252
- Update `snowflake-sqlalchemy` version #1228
- Raise error if dataframe is empty #1238
- Raise an error on database mismatch for an operation #1233
- Pass `task_id` to be used for the parent class on `LoadFileOperator` init #1259
- Add support for Minio #750
- Open Lineage support: add extractor for `ExportFileOperator` and `DataframeOperator` #903, #1183
- Add check for missing conn_id on transform operator. #1152
- Raise error when the `copy into` query fails in Snowflake #890
- Transform op: database/schema is not picked from the table's metadata #1034
- Change the namespace for Open Lineage #1179
- Add `LOAD_FILE_ENABLE_NATIVE_FALLBACK` config to globally disable native fallback #1089
- Add `OPENLINEAGE_EMIT_TEMP_TABLE_EVENT` config to emit events for temp tables in Open Lineage #1121
- Fix issue with fetching the table row count for Snowflake #1145
- Generate unique Open Lineage namespace for Sqlite based operations #1141
- Include a section in the docs to cover file patterns for the native path of GCS to BigQuery #800
- Add guide for Open Lineage integration with Astro Python SDK #1116
- Pin SQLAlchemy version to >=1.3.18,<1.4.42 #1185
- Remove dependency on `AIRFLOW__CORE__ENABLE_XCOM_PICKLING`. Users can set new environment variables, namely `AIRFLOW__ASTRO_SDK__XCOM_STORAGE_CONN_ID` and `AIRFLOW__ASTRO_SDK__XCOM_STORAGE_URL`, and use a custom XCom backend, namely `AstroCustomXcomBackend`, which enables the XCom data to be saved to an S3 or GCS location #795, #997
- Added OpenLineage support for `LoadFileOperator`, `AppendOperator`, `TransformOperator` and `MergeOperator` #898, #899, #902, #901 and #900
- Add `TransformFileOperator` that:
  - parses a SQL file with templating
  - applies all needed parameters
  - runs the SQL to return a table object

  To keep the `aql.transform_file` function, the function can return `TransformFileOperator().output` in a similar fashion to the merge operator #892 (sketch below)
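A hedged sketch of the resulting `aql.transform_file` usage; the SQL file path, template variables, and connection id are placeholders, and exact parameter names may vary between releases:

```python
from astro import sql as aql
from astro.table import Table

# sql/top_rated.sql might contain: SELECT * FROM {{input_table}} WHERE rating > 8
top_rated = aql.transform_file(
    file_path="sql/top_rated.sql",  # templated SQL file
    parameters={"input_table": Table(name="imdb_movies", conn_id="sqlite_default")},
)
```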
- Add the implementation of row count for `BaseTable` #1073
- Improved handling of Snowflake identifiers for a smooth experience with the `dataframe`, `run_raw_sql` and `load_file` operators #917, #1098
- Fix `transform_file` so it does not depend on the `transform` decorator #1004
- Set the CI to run and publish benchmark reports once a week #443
- Fix cyclic dependency and improve import time. Reduces the import time for `astro/databases/__init__.py` from 23.254 seconds to 0.062 seconds #1013
- Create GETTING_STARTED.md #1036
- Document the Open Lineage facets published by Astro Python SDK. #1086
- Documentation changes to specify permissions needed for running BigQuery jobs. #896
- Document the details on custom XCOM. #1100
- Document the benchmarking process. #1017
- Include a detailed description on the default Dataset concept in Astro Python SDK. #1092
- NFS volume mount in Kubernetes to test benchmarking from local to databases. #883
- Add filetype when resolving path in case of loading into dataframe #881
- Fix postgres performance regression (example from one_gb file - 5.56min to 1.84min) #876
- Add native autodetect schema feature #780
- Allow users to disable auto addition of inlets/outlets via airflow.cfg #858
- Support for Datasets introduced in Airflow 2.4 #786, #808, #862, #871
  - `inlets` and `outlets` will be automatically set for all the operators.
  - Users can now schedule DAGs on `File` and `Table` objects. Example:

```python
from datetime import datetime

from airflow import DAG
from astro import sql as aql
from astro.files import File
from astro.table import Table

input_file = File(
    path="https://raw.githubusercontent.com/astronomer/astro-sdk/main/tests/data/imdb_v2.csv"
)
imdb_movies_table = Table(name="imdb_movies", conn_id="sqlite_default")
top_animations_table = Table(name="top_animation", conn_id="sqlite_default")
START_DATE = datetime(2022, 9, 1)


@aql.transform()
def get_top_five_animations(input_table: Table):
    return """
        SELECT title, rating
        FROM {{input_table}}
        WHERE genre1='Animation'
        ORDER BY rating desc
        LIMIT 5;
    """


with DAG(
    dag_id="example_dataset_producer",
    schedule=None,
    start_date=START_DATE,
    catchup=False,
) as load_dag:
    imdb_movies = aql.load_file(
        input_file=input_file,
        task_id="load_csv",
        output_table=imdb_movies_table,
    )

with DAG(
    dag_id="example_dataset_consumer",
    schedule=[imdb_movies_table],
    start_date=START_DATE,
    catchup=False,
) as transform_dag:
    top_five_animations = get_top_five_animations(
        input_table=imdb_movies_table,
        output_table=top_animations_table,
    )
```
- Dynamic Task Templates: tasks that can be used with Dynamic Task Mapping (Airflow 2.3+)
- Create `upstream_tasks` parameter for dependencies independent of data transfers #585
- Avoid loading whole file into memory with load_operator for schema detection #805
- Directly pass the file to native library when native support is enabled #802
- Create file type for patterns for schema auto-detection #872
- Add compat module for typing the execute `context` in operators #770
- Fix SQL injection issues #807
- Add response_size to run_raw_sql and warn about db thrashing #815
- Update quick start example #819
- Add links to docs from README #832
- Fix Astro CLI doc link #842
- Add configuration details from settings.py #861
- Add section explaining table metadata #774
- Fix docstring for run_raw_sql #817
- Add missing docs for Table class #788
- Add the readme.md example dag to example dags folder #681
- Add reason for enabling XCOM pickling #747
- Skip folders while processing paths in load_file operator when file pattern is passed. #733
- Limit Google Protobuf for compatibility with bigquery client. #742
- Added a check to create a table only when `if_exists` is `replace` in `aql.load_file` for Snowflake #729
- Fix the file type for NDJSON files in the data transfer job from AWS S3 to Google BigQuery #724
- Create a new version of imdb.csv with lowercase column names and update the examples to use it, so this change is backwards-compatible. #721, #727
- Skip folders while processing paths in the load_file operator when a file pattern is passed #733
- Updated the Benchmark docs for GCS to Snowflake and S3 to Snowflake of `aql.load_file` #712, #707
- Restructured the documentation in the `project.toml`, quickstart, readthedocs and README.md #698, #704, #706
- Make astro-sdk-python compatible with major versions of Google Providers #703
- Consolidate the documentation requirements for sphinx. #699
- Add CI/CD triggers on release branches with dependency on tests. #672
- Improved the performance of `aql.load_file` by supporting database-specific (native) load methods. This is now the default behaviour. Previously, the Astro SDK Python would always use Pandas to load files to SQL databases, which passed the data through the worker node and slowed the performance. #557, #481

  Introduced new arguments to `aql.load_file`:
  - `use_native_support` for data transfer if available on the destination (defaults to `use_native_support=True`)
  - `native_support_kwargs`, a keyword argument to be used by the method involved in the native support flow
  - `enable_native_fallback`, which can be used to fall back to the default transfer (defaults to `enable_native_fallback=True`)

  Now, there are three modes:
  - `Native`: the default; uses a BigQuery load job in the case of BigQuery, and Snowflake `COPY INTO` using an external stage in the case of Snowflake.
  - `Pandas`: how datasets were previously loaded. To enable this mode, use the argument `use_native_support=False` in `aql.load_file` (see the sketch below).
  - `Hybrid`: attempts to use the native strategy to load a file to the database and, if the native strategy fails, falls back to Pandas with relevant log warnings. #557
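A minimal sketch of forcing the `Pandas` mode described above; the file path, table, and connection ids are placeholders:

```python
from astro import sql as aql
from astro.files import File
from astro.table import Table

aql.load_file(
    input_file=File(path="s3://my-bucket/data.csv", conn_id="aws_default"),
    output_table=Table(name="my_table", conn_id="snowflake_default"),
    use_native_support=False,  # skip the native path and load via Pandas
)
```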
- Allow users to specify the table schema (column types) in which a file is being loaded by using `table.columns`. If this table attribute is not set, the Astro SDK still tries to infer the schema by using Pandas (the previous behaviour) #532
- Add example DAG for Dynamic Map Task with Astro-SDK #377, `airflow-2.3.0`
- The `aql.dataframe` argument `identifiers_as_lower` (which was `boolean`, with default set to `False`) was replaced by the argument `columns_names_capitalization` (`string` with possible values `["upper", "lower", "original"]`, default is `lower`) #564 (see the sketch after this list)
- `aql.load_file` previously changed the capitalization of all column titles to uppercase by default; now it makes them lowercase by default. The old behaviour can be achieved by using the argument `columns_names_capitalization="upper"` #564
- `aql.load_file` attempts to load files to BigQuery and Snowflake by using native methods, which may have pre-requirements to work. To disable this mode, use the argument `use_native_support=False` in `aql.load_file` #557, #481
- `aql.dataframe` will raise an exception if the default Airflow XCom backend is being used. To solve this, either use an external XCom backend, such as S3 or GCS, or set the configuration `AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True` #444
- Change the declaration of the default Astro SDK temporary schema from using `AIRFLOW__ASTRO__SQL_SCHEMA` to `AIRFLOW__ASTRO_SDK__SQL_SCHEMA` #503
- Renamed `aql.truncate` to `aql.drop_table` #554
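A minimal sketch of the `columns_names_capitalization` argument that replaced `identifiers_as_lower`, as flagged in the list above; the function body is a placeholder:

```python
import pandas as pd

from astro import sql as aql


@aql.dataframe(columns_names_capitalization="original")  # "upper" | "lower" (default) | "original"
def passthrough(df: pd.DataFrame) -> pd.DataFrame:
    return df  # column names keep the chosen capitalization
```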
- Fix missing Airflow task terminal states in `CleanupOperator` #525
- Allow chaining `aql.drop_table` (previously `truncate`) tasks using the Task Flow API syntax #554, #515
- Improved the performance of `aql.load_file` for several file cases
- Get configurations via the Airflow Configuration manager #503
- Change catching `ValueError` and `AttributeError` to `DatabaseCustomError` #595
- Unpin pandas upperbound dependency #620
- Remove markupsafe from dependencies #623
- Added `extend_existing` to Sqla Table object #626
- Move config to store DF in XCom to settings file #537
- Make the operator names consistent #634
- Use `exc_info` for exception logging #643
- Update query for getting the BigQuery table schema #661
- Use lazily evaluated type annotations from PEP 563 #650
- Provide Google Cloud credentials env var for BigQuery #679
- Handle breaking changes for Snowflake provider versions 3.2.0 and 3.1.0 #686
Enhancement:
- Fail `LoadFileOperator` operator when `input_file` does not exist #467
- Create scripts to launch benchmark testing to Google cloud #432
- Bump Google Provider for google extra #294
Breaking Change:
- The `aql.merge` interface changed. Argument `merge_table` changed to `target_table`; `target_columns` and `merge_column` were combined into the `columns` argument; `merge_keys` changed to `target_conflict_columns`; `conflict_strategy` changed to `if_conflicts`. More details can be found at #422, #466
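A hedged sketch of the renamed `aql.merge` interface; tables, columns, and the conflict strategy value are placeholders:

```python
from astro import sql as aql
from astro.table import Table

aql.merge(
    target_table=Table(name="customers", conn_id="sqlite_default"),  # was merge_table
    source_table=Table(name="customers_staging", conn_id="sqlite_default"),
    columns=["id", "email"],         # was target_columns / merge_column
    target_conflict_columns=["id"],  # was merge_keys
    if_conflicts="update",           # was conflict_strategy
)
```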
Enhancement:
- Document (new) load_file benchmark datasets #449
- Made improvement to benchmark scripts and configurations #458, #434, #461, #460, #437, #462
- Performance evaluation for loading datasets with Astro Python SDK 0.9.2 into BigQuery #437
Bug fix:
- Change `export_file` to return a `File` object #454
Bug fix:
- Table unable to have Airflow templated names #413
Enhancements:
- Introduction of the user-facing `Table`, `Metadata` and `File` classes
Breaking changes:
- The operator `save_file` became `export_file`
- The tasks `load_file`, `export_file` (previously `save_file`) and `run_raw_sql` should be used with `Table`, `Metadata` and `File` instances
- The decorators `dataframe`, `run_raw_sql` and `transform` should be used with `Table` and `Metadata` instances
- The operators `aggregate_check`, `boolean_check`, `render` and `stats_check` were temporarily removed
- The class `TempTable` was removed. It is possible to declare temporary tables by using `Table(temp=True)`. All temporary table names are prefixed with `_tmp_`. If the user decides to name a `Table`, it is no longer temporary, unless the user enforces it to be.
- The only mandatory property of a `Table` instance is `conn_id`. If no metadata is given, the library will try to extract the schema and other information from the connection object. If that is missing, it will default to the `AIRFLOW__ASTRO__SQL_SCHEMA` environment variable. (See the sketch after this list.)
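A minimal sketch of the temporary-table rules above; the connection id is a placeholder, and the `astro.table` import path reflects later releases and may differ here:

```python
from astro.table import Table  # import path may differ in this release

tmp = Table(temp=True, conn_id="sqlite_default")         # name auto-generated with _tmp_ prefix
named = Table(name="results", conn_id="sqlite_default")  # a named table is not temporary
bare = Table(conn_id="sqlite_default")                   # only conn_id is mandatory; unnamed tables are temporary
```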
Internals:
- Major refactor introducing the `Database`, `File`, `FileType` and `FileLocation` concepts
Enhancements:
- Add support for Airflow 2.3 #367.
Breaking change:
- We have renamed the artifacts we release to `astro-sdk-python` from `astro-projects`. `0.8.4` is the last version for which we have published both `astro-sdk-python` and `astro-projects`.
Bug fix:
- Do not attempt to create a schema if it already exists #329.
Bug fix:
- Support dataframes from different databases in dataframe operator #325
Enhancements:
- Add integration test case for `SqlDecoratedOperator` to test execution of raw SQL #316
Bug fix:
- Snowflake transform without `input_table` #319
Feature:
- `load_file` support for nested NDJSON files #257
Breaking change:
- `aql.dataframe` switches the capitalization to lowercase by default. This behaviour can be changed by using `identifiers_as_lower` #154
Documentation:
- Fix commands in README.md #242
- Add scripts to auto-generate Sphinx documentation
Enhancements:
- Improve type hints coverage
- Improve Amazon S3 example DAG, so it does not rely on pre-populated data #293
- Add example DAG to load/export from BigQuery #265
- Fix usages of mutable default args #267
- Enable DeepSource validation #299
- Improve code quality and coverage
Bug fixes:
- Support `gcpbigquery` connections #294
- Support `params` argument in `aql.render` to override SQL Jinja template values #254
- Fix `aql.dataframe` when the table arg is absent #259
Others:
- Refactor integration tests, so they can run across all supported databases #229, #234, #235, #236, #206, #217
Feature:
- `load_file` to a Pandas dataframe, without SQL database dependencies #77
Documentation:
- Simplify README #101
- Add Release Guidelines #160
- Add Code of Conduct #101
- Add Contribution Guidelines #101
Enhancements:
- Add SQLite example #149
- Allow customization of `task_id` when using `dataframe` #126
- Use standard AWS environment variables, as opposed to `AIRFLOW__ASTRO__CONN_AWS_DEFAULT` #175
Bug fixes:
- Fix `merge` `XComArg` support #183
- Fixes to `load_file`
- Fixes to `render`
- Fix `transform`, so it works with SQLite #159
Features:
- Support SQLite #86
- Support users who can't create schemas #121
- Ability to install optional dependencies (amazon, google, snowflake) #82
Enhancements:
- Change `render` so it creates a DAG as opposed to a TaskGroup #143
- Allow users to specify a custom version of `snowflake_sqlalchemy` #127