Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate "Rereading" pass in fread? #1843

Closed
st-pasha opened this issue May 4, 2019 · 1 comment · Fixed by #2558
Closed

Eliminate "Rereading" pass in fread? #1843

st-pasha opened this issue May 4, 2019 · 1 comment · Fixed by #2558
Assignees
Labels
fread Issues related to parsing any input files via fread function

Comments

@st-pasha
Copy link
Contributor

st-pasha commented May 4, 2019

Current design of fread's algorithm follows the following rough steps:

  1. In the preliminary stage, fread selects several chunks within the file and parses them detecting the types of all fields. Parsed data is then discarded.
  2. Given the detected types, fread enters the main "Reading" stage, scanning through the entire file and saving the data into an output Frame.
  3. If during stage (2) any column's type was found to be incorrect, that column is marked as "type-bumped". If at the end of stage (2) there are any type-bumped columns, the entire file needs to be re-scanned: fread goes into the "Rereading" stage.

This algorithm was designed based on the assumption that type-detection in stage 1 would be reliable, that stage (3) would occur only rarely, and consequently its impact would be minimal. In practice, however, we see that the "rereading" pass occurs way too often, essentially doubling the dataset parse time. Moreover, the larger the input file (especially when it's > 100GB), the more likely that it would have to be re-read, which means we are very likely to double the already quite long read time.

In addition to the obvious time cost of the re-reading pass, it affects us in more subtle ways too. In particular, it prevents us from designing a single-pass algorithm, which means fread cannot consume data from a stream. Instead, we have to dump the content of the stream into a temporary file, fread that file, and then delete it. These are all overheads that could have been avoided if we knew that the data needs to be read only once.

So, can the "re-reading" pass be eliminated? It can, but there are certain costs associated with such redesign.

Remember that the need to "re-read" a column appears whenever we encounter a value that cannot be understood within the current column's type. For example, if the column is integer 2, 7, 11, 0, ..., and then out of a sudden we see a value 3.4 in it. Then we say "oh well, looks like we'll need to reread this entire column" (and in practice it means rereading the entire file). But we could have approached this differently. We could have said "let's convert all the values we've read so far into doubles, and continue reading this column as type double". In practice it's slightly more complicated because of multiple threads having to maintain a consistent view of what the column's type is; but this is doable.

Cases like this are "easy", in the sense that they involve conversions that are unambiguous. There are, however, certain cases when the conversions are ambiguous, in particular when converting values into strings. For example, consider column containing values 0, +0, -0, 000. These would be detected as integers 0,0,0,0, as they should. However, if we later find that the column has to be converted to string, then the problem arises: the integer values that we have will be converted into four strings "0": the information that those zeros originally all looked differently as strings is lost.

It is because of these uncertainties, the "rereading" pass was deemed absolute necessity. Without it, the important invariant could be violated: reading data with auto-detected types could produce different results than reading data with explicitly specified types (but same as were auto-detected).

So the question really is -- is this invariant truly important? Are we willing to sacrifice speed (by a factor of 2 or even more), and sometimes disk usage (by a factor of 2), in order to satisfy it? Or should this choice be deferred to the user?

@st-pasha st-pasha added question fread Issues related to parsing any input files via fread function labels May 4, 2019
@pseudotensor
Copy link

pseudotensor commented May 4, 2019

Seems easiest solution, for allowing streaming, is:

  1. To have a pre-defined hierarchy of types, from simple to complex. One gradually increases the complexity of any given batch if encountering any new inconsistent types. Then one relies upon the user to be able to handle this situation, since afterall it is their data that caused it. You shouldn't require total consistency, just per batch updated.

  2. Like boolean-column selector should not select rows where value is NA #1, but one should also be allowed to have an option to only have types set per batch, without any use of prior type knowledge. Some may find this a useful mode, which of course works best if the batch is big enough.

  3. Another option is for the user to pre-define the types. Then there is no question about what the user intends.

All these options sound easy and could be chosen by user.

@st-pasha st-pasha mentioned this issue Jan 4, 2020
27 tasks
st-pasha added a commit that referenced this issue Aug 6, 2020
Since the very early days, the design of fread was such that if any "type-bumps" happen in the middle of a file, then the corresponding column(s) will be marked as "require re-reading", and then at the end we would re-parse the entire file using the new column type. This approach has obvious drawbacks:
- the amount of time necessary to read the file almost doubles (and type bumps happen more often than you'd think);
- the logic behind type-bumping is very error-prone;
- it is impossible to read a stream-like input, where the data cannot be arbitrarily rewound;

The new approach for handling type-bumping is the following:
- When a type-bump occurs while reading a chunk, we enter the ordered section and temporarily suspend execution of other threads;
- While all other threads are paused, we:
  - "archive" the column which was type-bumped;
  - update the global types array with the new column parse types;
- After that, the parallel execution resumes from the start of the type-bumped chunk;
- (the process above may occur multiple times with different columns or different chunks);
- In the end, when the final frame is constructed, the columns that were comprised of multiple chunks are rbound together with automatic type promotion.

Closes #1843
Closes #1446
@st-pasha st-pasha self-assigned this Sep 18, 2020
@st-pasha st-pasha removed the question label Sep 18, 2020
@st-pasha st-pasha added this to the Release 0.11.0 milestone Sep 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fread Issues related to parsing any input files via fread function
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants