[FEA] Support read_fwf functionality in cudf #15924
Comments
Hello @a-hirota, thank you for your request. I believe this reader is something that we can support by combining cudf APIs today. Would you please let me know if this works for you?
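A minimal sketch of what "combining cudf APIs" could look like, assuming each record has already been read as one string per row. This is not necessarily the snippet proposed in the comment above; the helper name `slice_fixed_width` and its parameters are hypothetical. It relies only on the pandas-style `.str.slice(start, stop)` and `.str.strip()` accessors, which both cudf and pandas string Series expose.

```python
def slice_fixed_width(lines, colspecs, names):
    """Split a string Series of whole records into one Series per field.

    Hypothetical helper: `lines` is assumed to expose the pandas-style
    `.str.slice(start, stop)` accessor (as cudf and pandas string Series do).
    `colspecs` holds half-open (start, stop) character offsets, one per name.
    """
    if len(colspecs) != len(names):
        raise ValueError("colspecs and names must have the same length")
    columns = {}
    for name, (start, stop) in zip(names, colspecs):
        # One slice per field, then strip the padding whitespace.
        columns[name] = lines.str.slice(start, stop).str.strip()
    return columns
```

With cudf, `lines` might come from reading the file as one string per record (for example with `cudf.read_text` and a newline delimiter; treat the exact call as an assumption to check against your cudf version), and the returned dict can be passed to `cudf.DataFrame`. Note that the loop issues one slice per column, so the cost grows with the number of fields.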
Hello @GregoryKimball, thank you for the prompt response! I ran some experiments and confirmed that the approach works as expected. However, because each column needs its own string slice, the slicing step is somewhat slower than on the CPU, particularly for a dataset of around 1 million records with up to 2000 columns (roughly 1/50th of our usual daily processing volume). Overall GPU processing time, including read time, is better than the CPU's, but the speedup is not significant:
< String Slicing Time >
I believe that accepting the colspecs at read time, as read_fwf does, would eliminate the need to re-slice the Series into columns after reading. This optimization could lead to a significant speedup over the CPU. (Although not included in my example usage, specifying dtypes might also be beneficial.)
Additionally, legacy systems tend to run lightweight, mostly rule-based computations, so the majority (80-90%) of processing time is spent on I/O.
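To make the request concrete, here is a small pure-Python sketch of the semantics being asked for: colspecs (and optionally dtypes) supplied up front, so each record is cut into typed fields in a single pass. The function `parse_fwf`, its parameters, and the sample records are all invented for illustration; this is not cudf code and not the data from the issue.

```python
def parse_fwf(text, colspecs, names, dtypes=None):
    """Parse fixed-width text into a dict of column lists.

    Illustrative only: `colspecs` are half-open (start, stop) offsets and
    `dtypes` maps a column name to a converter such as int (default: str).
    """
    dtypes = dtypes or {}
    columns = {name: [] for name in names}
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        for name, (start, stop) in zip(names, colspecs):
            field = line[start:stop].strip()
            columns[name].append(dtypes.get(name, str)(field))
    return columns


# Made-up records: 5-char id, 10-char name, 4-char quantity.
sample = "\n".join([
    "00001" + "Alice".ljust(10) + "0023",
    "00002" + "Bob".ljust(10) + "0105",
])
table = parse_fwf(
    sample,
    colspecs=[(0, 5), (5, 15), (15, 19)],
    names=["id", "name", "qty"],
    dtypes={"id": int, "qty": int},
)
```

Here the column boundaries and types are known before any data is touched, which is the property that would let a reader avoid a separate post-read slicing step.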
Improves performance of wide strings (avg > 64 bytes) when using `cudf::strings::slice_strings`. Addresses some concerns from issue #15924 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #16574
The change in #16574 improves the slice time significantly, from:
to
Missing Pandas Feature Request
Support for pandas.read_fwf.
Profiler Output
N/A
Additional context
Background:
In the legacy enterprise space, COBOL is in continuous use, and the reality is that a complete overhaul of legacy systems is difficult to achieve at this time. If the processing of legacy systems can be made to run on GPUs, this could bring significant change to this area. Since COBOL deals with fixed-width flat files, support for fixed-width files could be a first step in addressing this need.
Code Example:
For instance, consider the following example data:
Expected output:
Supplement: