Dataset.Tabular is NOT loading the specified file on storage #21419
Thank you for reporting the issue. Someone from the team will take a look asap |
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github. |
@afogarty85 Does the CSV file that you are using contain multiline values? From our documentation: |
Thanks for the help! By doing something like:
I get another error: If I set:
The data populates, but again it is inflated. |
@afogarty85 Is there any way for you to share the data file with us if it does not contain any sensitive information? This may help us in reproducing the problem at our end. |
Unfortunately no, could you perhaps generate dummy data in a pandas dataframe that would highlight this: (?)
From there, investigations could be done. |
@afogarty85 Take this multiline.csv file as an example.
from azureml.core import Workspace, Dataset, Datastore
ws = Workspace.from_config()
dstore = Datastore.get_default(ws)
path = [(dstore, 'data/multiline.csv')]
dset1 = Dataset.Tabular.from_delimited_files(path) # By default support_multi_line=False
df1 = dset1.to_pandas_dataframe()
df1.shape
Output: (6, 3)
Now if we execute the following code:
dset2 = Dataset.Tabular.from_delimited_files(path, support_multi_line=True)
df2 = dset2.to_pandas_dataframe()
df2.shape
Output: (2, 3)
As you can see, the shape of the dataframe without multiline support is (6, 3), while with multiline support it is (2, 3). Setting |
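To see the same inflation without a workspace, here is a minimal sketch that uses only Python's standard `csv` module. The file contents are hypothetical (the real multiline.csv is not shown in this thread), and the code only mimics the effect of `support_multi_line`; it is not how the Azure ML reader is implemented.

```python
import csv
import io

# Hypothetical dummy data: a header plus 2 records whose "text" field
# contains embedded newlines (3 physical lines per record).
rows = [
    ["id", "text", "label"],
    [1, "line one\nline two\nline three", "a"],
    [2, "alpha\nbeta\ngamma", "b"],
]

# Serialize to CSV text; the writer quotes fields that contain newlines.
buf = io.StringIO(newline="")
csv.writer(buf, lineterminator="\n").writerows(rows)
csv_text = buf.getvalue()

# Line-based view: every physical line after the header counts as a row.
naive_rows = csv_text.splitlines()[1:]

# Quote-aware view: quoted newlines stay inside their field.
quoted_rows = list(csv.reader(io.StringIO(csv_text, newline="")))[1:]

print(len(naive_rows), len(quoted_rows))
```

The line-based count is inflated in the same way as the (6, 3) versus (2, 3) shapes above: each embedded newline is mistaken for a record boundary.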
Thanks for the input -- this definitely appears to be the problem. Are you aware of anything I can do to speed up training and loading of files? I am in a situation where I need to specify
|
@afogarty85 I am happy that you were able to read the data in the correct format. Regarding the time difference that you are seeing, processing tabular files with multi-line data is slower because the data has to be read line by line, so multiple CPU cores cannot be used to ingest it in parallel. This is the reason behind slower processing when we set |
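The reason parallel ingestion is hard here can be illustrated with a small sketch, again using only Python's standard `csv` module and hypothetical data. It assumes a naive strategy that cuts the file at a newline and parses each piece independently; this illustrates the general problem and does not describe the Azure ML engine's internals.

```python
import csv
import io


def parse(text):
    """Parse CSV text with the quote-aware standard-library reader."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical file: record 1 has a newline inside a quoted field.
csv_text = (
    'id,text\n'
    '1,"first line\nsecond line"\n'
    '2,"plain"\n'
)

# Sequential parse: one reader sees the whole file and tracks quote state.
sequential = parse(csv_text)

# Naive "parallel" split: cut at the first newline after offset 10, which
# happens to be the newline inside the quoted field, then parse each chunk
# on its own as independent workers would.
cut = csv_text.index("\n", 10) + 1
chunks = [csv_text[:cut], csv_text[cut:]]
chunked = [row for chunk in chunks for row in parse(chunk)]

print(len(sequential), len(chunked))
```

A worker that starts in the middle of the file cannot tell whether a given newline ends a record or sits inside a quoted field without scanning from the start, which is why the chunked result differs from the sequential one.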
Thanks @ynpandey! This makes sense for sure, but this almost certainly has to be an issue. I cannot imagine a scenario where Xeon processors, using a single core, take 5 minutes to load 30k rows at 50 columns. It should take seconds -- it's a ~30 MB file. |
Closing legacy issue. @ynpandey can you help to share a link to the new v2 mltable package? |
Describe the bug
When loading a Tabular dataset, AML does not read the file that is actually on storage.
To Reproduce
Steps to reproduce the behavior:
Following this guide on how to use these files:
https://github.com/Azure/MachineLearningNotebooks/blob/122df6e84622136690801685b183af5a04d77dec/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-showcasing-dataset-and-pipelineparameter.ipynb
Expected behavior
I expected the file on storage to load. If I delete the file, AML rightly says that the file has disappeared. If I upload the right file, shape (320, 142), AML will continue to load the one shaped (1406, 142).