Dataset.Tabular is NOT loading the specified file on storage #21419

afogarty85 · 2021-10-26T18:49:51Z

Package Name: azureml.core.
Package Version: 1.34.0
Operating System: W10
Python Version: 3.6.9

Describe the bug
When loading a Tabular file, it is not reading the file that is there.

To Reproduce
Steps to reproduce the behavior:

Following this guide on how to use these files:
https://github.com/Azure/MachineLearningNotebooks/blob/122df6e84622136690801685b183af5a04d77dec/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-showcasing-dataset-and-pipelineparameter.ipynb

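from azureml.core import Dataset
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

# build data set configurations (ws is an existing Workspace object)
stack_rank = Dataset.Tabular.from_delimited_files([(ws.datastores['fs'], '/RAW/Daily/stack_rank_daily.csv')])
stack_rank_param = PipelineParameter(name="stack_rank_param", default_value=stack_rank)
stack_rank_ds_consumption = DatasetConsumptionConfig("stack_rank_dataset", stack_rank_param)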

# register it to see its location
stack_rank = stack_rank.register(workspace=ws,
                                 name='stack_rank',
                                 description='stack_rank data',
                                 create_new_version=True)

# here is its registration
"registration": {
    "id": "721b9763-b2e5-4524-a620-de3df1ed4403",
    "name": "stack_rank",
   etc

# examine the registration
dataset = Dataset.get_by_id(ws, '721b9763-b2e5-4524-a620-de3df1ed4403')
dataset.to_pandas_dataframe()
# its shape is (1406, 142)
# it SHOULD be (320, 142)

Expected behavior
I expected the file on storage to load. If I delete the file, AML rightly says that the file has disappeared. If I upload the right file, shape (320, 142), AML will continue to load the one shaped (1406, 142).

rakshith91 · 2021-10-27T17:14:25Z

Thank you for reporting the issue. Someone from the team will take a look asap

ghost · 2021-11-04T00:21:50Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

Issue Details

Author:	afogarty85
Assignees:	SaurabhSharma-MSFT
Labels:	`question`, `Machine Learning`, `Service Attention`, `customer-reported`, `needs-team-attention`, `ML-CoreUI`
Milestone:	-

ynpandey · 2021-11-04T20:19:42Z

@afogarty85 Does the CSV file that you are using contain multi-line values? From our documentation:
By default (support_multi_line=False), all line breaks, including those in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This should be set to True when the delimited files are known to contain quoted line breaks.

afogarty85 · 2021-11-04T20:41:39Z

Thanks for the help!

By doing something like:

stack_rank = Dataset.Tabular.from_delimited_files([(ws.datastores['fs'], '/RAW/Daily/stack_rank_daily.csv')], support_multi_line=True)

I get another error: "Cannot load any data from the specified path. Make sure the path is accessible and contains data.\nThe Dataflow produced no records."

If I set:

stack_rank = Dataset.Tabular.from_delimited_files([(ws.datastores['fs'], '/RAW/Daily/stack_rank_daily.csv')], support_multi_line=False)

The data populates, but again it is inflated.

ynpandey · 2021-11-04T20:51:42Z

@afogarty85 Is there any way for you to share the data file with us if it does not contain any sensitive information? This may help us in reproducing the problem at our end.

afogarty85 · 2021-11-04T21:11:08Z

Unfortunately no. Could you perhaps generate dummy data in a pandas dataframe that would highlight this?

all line breaks, including those in quoted field values, will be interpreted as a record break

From there, investigations could be done.

ynpandey · 2021-11-04T22:59:43Z

@afogarty85 Take this multiline.csv file as an example.

from azureml.core import Workspace, Dataset, Datastore
ws = Workspace.from_config()
dstore = Datastore.get_default(ws)
path = [(dstore, 'data/multiline.csv')]
dset1 = Dataset.Tabular.from_delimited_files(path) # By default support_multi_line=False
df1 = dset1.to_pandas_dataframe()
df1.shape

Output:

(6, 3)

Now if we execute the following code:

dset2 = Dataset.Tabular.from_delimited_files(path, support_multi_line=True)
df2 = dset2.to_pandas_dataframe()
df2.shape

Output:

(2, 3)

As you can see, the dataframe has shape (6, 3) without multi-line support and (2, 3) with it. Setting support_multi_line=True parses the file correctly and gives a dataframe of shape (2, 3).
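
The multiline.csv attachment is not reproduced in this thread. As a stand-in, here is a minimal sketch that uses only Python's standard library, not azureml. The file contents and column names are hypothetical, chosen so that the row counts match the shapes above. It compares treating every line break as a record break (roughly what the documentation describes for support_multi_line=False) with quote-aware parsing.

```python
import csv
import io

# Hypothetical stand-in for multiline.csv: a header plus two records,
# each with a quoted "notes" field that spans three physical lines.
rows = [
    ["id", "name", "notes"],
    ["1", "alpha", "line one\nline two\nline three"],
    ["2", "beta", "first\nsecond\nthird"],
]

# Serialize to CSV text; csv.writer quotes fields that contain line breaks.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()

# Naive view: every line break ends a record (header dropped).
naive_records = text.splitlines()[1:]

# Quote-aware view: line breaks inside quotes stay in the field (header dropped).
parsed_records = list(csv.reader(io.StringIO(text)))[1:]

print(len(naive_records))   # 6
print(len(parsed_records))  # 2
```

The naive count is inflated in the same way as the dataframe in the original report: each quoted line break adds one extra "record".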

afogarty85 · 2021-11-08T20:29:36Z

Thanks for the input -- this definitely appears to be the problem.

Are you aware of anything I can do to speed up training and loading of files?

I am in a situation where I need to specify support_multi_line=True, otherwise the shapes are messed up. The consequence of support_multi_line=True is that it takes AML approximately 5 minutes to load a dataframe of shape (31733, 58).

support_multi_line=False returns my dataframe in seconds, just with 70k observations instead of what it should be.

ynpandey · 2021-11-08T22:14:57Z

@afogarty85 I am happy that you were able to read the data in the correct format.

Regarding the time difference that you are seeing, processing tabular files with multi-line data is slower because data has to be read line-by-line and multiple CPU cores cannot be used to ingest the data in parallel. This is the reason behind slower processing when we set support_multi_line=True.
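
To illustrate why quoted line breaks get in the way of parallel ingestion, here is a rough sketch using Python's standard csv module. It is not how the Azure ML reader is implemented, and the data and helper names (build_csv, parse, split_at_newline) are hypothetical. It cuts a file at a line break, as a parallel reader might when handing chunks to separate workers, and parses each chunk independently.

```python
import csv
import io


def build_csv(n_records):
    """Build CSV text whose third field contains a quoted line break."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for i in range(n_records):
        writer.writerow([str(i), f"name{i}", f"first line {i}\nsecond line {i}"])
    return buffer.getvalue()


def parse(text):
    """Parse CSV text with a quote-aware reader."""
    return list(csv.reader(io.StringIO(text)))


def split_at_newline(text, offset):
    """Cut text just after the first line break at or after offset."""
    cut = text.index("\n", offset) + 1
    return text[:cut], text[cut:]


text = build_csv(4)
whole = parse(text)

# Choose a cut point that happens to fall inside a quoted field.
offset = text.index("first line 1")
left, right = split_at_newline(text, offset)

# Each "worker" parses its chunk with no knowledge of the other chunk.
chunked = parse(left) + parse(right)

print(len(whole))    # 4
print(len(chunked))  # 5
```

A worker cannot tell whether a given line break ends a record without knowing whether it sits inside quotes, and that depends on everything before it in the file. That is the general reason quote-aware parsing tends to be sequential.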

afogarty85 · 2021-11-08T22:20:16Z

Thanks @ynpandey !

This makes sense for sure, but this almost certainly has to be an issue. I cannot imagine a scenario where Xeon processors, using a single core, take 5 minutes to load 30k rows at 50 columns. It should take seconds; it's a ~30 MB file.

luigiw · 2022-10-21T00:06:53Z

Closing legacy issue.

@ynpandey can you share a link to the new v2 mltable package?

rakshith91 added the CXP Attention label Oct 27, 2021

ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Oct 27, 2021

rakshith91 added the ML-CoreUI AreaPath label Oct 27, 2021

ghost added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Oct 27, 2021

rakshith91 added the Machine Learning label Oct 27, 2021

SaurabhSharma-MSFT self-assigned this Oct 28, 2021

SaurabhSharma-MSFT added Service Attention Workflow: This issue is responsible by Azure service team. and removed CXP Attention labels Nov 4, 2021

SaurabhSharma-MSFT removed their assignment Nov 4, 2021

lmazuel assigned bandsina Jan 25, 2022

luigiw closed this as completed Oct 21, 2022

github-actions bot locked and limited conversation to collaborators Apr 11, 2023
