-
Hi @nkruskamp! Can you take a look at this guide that explains provisional nodes? I believe the …
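A minimal sketch of the pattern that guide describes, assuming a pytask version that ships `DirectoryNode` (the folder, pattern, and task name below are made up for illustration):

```python
from pathlib import Path
from typing import Annotated

from pytask import DirectoryNode

# A DirectoryNode is a provisional node: pytask resolves the matching
# files only at execution time, so a task can depend on a folder whose
# contents are unknown when the DAG is built.
def task_merge_files(
    paths: Annotated[
        list[Path], DirectoryNode(root_dir=Path("downloads"), pattern="*.csv")
    ],
    produces: Path = Path("build/merged.csv"),
) -> None:
    # Concatenate whatever CSVs exist in downloads/ at run time.
    produces.write_text("\n".join(p.read_text() for p in paths))
```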
-
Hi @tobiasraabe, this is great, and exactly what I needed to get the tasks running. Another situation I have to address is when a task has an unknown number of inputs that are not saved in the same folder but instead have to be specified individually. Using a function to build the dictionary of arguments for each repetition, each task run could have 1 to n input files with the exact paths listed. My initial test is to pass the input files as a list, together with an equal-length list of sheet names (or other needed variables), and that seems to work, but I wanted to see if there is a more pytask-idiomatic way to accomplish this? Thanks again for your help!
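Here is a rough sketch of what I mean, using pytask's repetition pattern (the spec-building helper, file names, and sheet names are hypothetical):

```python
from pathlib import Path

from pytask import task

# Hypothetical helper: returns one entry per task run, pairing 1..n input
# files with an equal-length list of sheet names.
def build_run_specs() -> dict[str, dict]:
    return {
        "run_a": {
            "paths": [Path("in/a1.xlsx"), Path("in/a2.xlsx")],
            "sheets": ["data", "data_extra"],
        },
        "run_b": {
            "paths": [Path("elsewhere/b.xlsx")],
            "sheets": ["data"],
        },
    }

for run_id, spec in build_run_specs().items():

    @task(id=run_id)
    def task_process(
        paths: list[Path] = spec["paths"],
        sheets: list[str] = spec["sheets"],
        produces: Path = Path(f"build/{run_id}.csv"),
    ) -> None:
        # Each Path passed as a default becomes a dependency of this
        # repetition, so pytask re-runs it when any input file changes.
        ...
```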
-
Hi All, I'm looking for some help on structuring my tasks correctly when I have a fixed or unknown number of input files.
In the past I have worked with dask using a set of parquet files on disk, or used duckdb to treat a folder of files as a database. Sometimes I will specify the number of partitions to create, and sometimes I let dask do the partitioning. I think this is separate from the dask + pytask option that does distributed computing of tasks.
So my question is: what is the best way in the pytask workflow to pass this "distributed" data to a task?
I've included a toy example (I'm not sure it would actually run, but hopefully it gets the point across) of a basic workflow that takes a collection of input CSVs, uses dask to read them, partitions them, and saves them to disk, followed by a function that does some sort of analysis and outputs results. For the hand-off between the first and second task, is this the best approach, or is there something else?
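For concreteness, here is an untested sketch of the pipeline I mean (the paths, partition count, and the use of `DirectoryNode` as a product are assumptions on my part):

```python
from pathlib import Path
from typing import Annotated

import dask.dataframe as dd
from pytask import DirectoryNode

PARTITION_DIR = Path("build/partitioned")

def task_partition(
    csv_files: Annotated[
        list[Path], DirectoryNode(root_dir=Path("data"), pattern="*.csv")
    ],
) -> Annotated[None, DirectoryNode(root_dir=PARTITION_DIR, pattern="*.parquet")]:
    # Read every input CSV, repartition, and write parquet files to disk.
    # The DirectoryNode product lets pytask track however many partition
    # files dask decides to write.
    df = dd.read_csv([str(p) for p in csv_files])
    df = df.repartition(npartitions=8)
    df.to_parquet(PARTITION_DIR)

def task_analyze(
    partitions: Annotated[
        list[Path], DirectoryNode(root_dir=PARTITION_DIR, pattern="*.parquet")
    ],
    produces: Path = Path("build/results.csv"),
) -> None:
    # Load the partitioned dataset back and compute a summary.
    df = dd.read_parquet(PARTITION_DIR)
    df.describe().compute().to_csv(produces)
```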