Different group_by results after joins with merge_by_chunk_id
#358
Replies: 3 comments
-
That's not good. I will see if I can reproduce it. Can you tell me the types of col1, col2, etc.? How many unique values are in each?
-
The problem was that in previous steps I had done some joins with the option merge_by_chunk_id = TRUE, and the results of those joins differed from the ones produced with plain data frames. I still cannot tell these two ways of joining the data apart: why would we set merge_by_chunk_id to TRUE when the generated data isn't what we expect? Anyway, I set merge_by_chunk_id to FALSE and the issue seemed to be resolved, but at the end I got an error about the stack size limit!
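For anyone hitting the same thing, here is a sketch of the difference, based on my understanding of the disk.frame join API (a.df, b.df, and the key id are placeholder names, not from the thread): with merge_by_chunk_id = TRUE, chunk i of the left table is joined only against chunk i of the right table, which is only correct when both disk.frames are sharded identically on the join key; with FALSE, disk.frame does the extra work needed to match rows across chunks.

```r
library(disk.frame)
library(dplyr)
setup_disk.frame()

# Placeholder data: same ids, but stored in opposite chunk order,
# so chunk 1 of a.df and chunk 1 of b.df share no ids.
a.df <- as.disk.frame(data.frame(id = 1:100,  x = runif(100)), nchunks = 4)
b.df <- as.disk.frame(data.frame(id = 100:1, y = runif(100)), nchunks = 4)

# merge_by_chunk_id = TRUE joins chunk 1 with chunk 1, chunk 2 with chunk 2, ...
# Matches that live in a different chunk are silently missed, so here every
# y comes back NA even though every id has a match overall.
fast <- left_join(a.df, b.df, by = "id", merge_by_chunk_id = TRUE)

# Sharding both sides on the join key first makes chunk-by-chunk joining valid,
# because equal ids are guaranteed to land in the same chunk number.
a2 <- rechunk(a.df, nchunks = 4, shardby = "id")
b2 <- rechunk(b.df, nchunks = 4, shardby = "id")
safe <- left_join(a2, b2, by = "id", merge_by_chunk_id = TRUE)
```

This is a sketch under my assumptions about rechunk()/shardby; check the disk.frame documentation for the exact semantics in your version.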
-
I see. A proper article on how joins work is long overdue.
-
I have these lines of code, which produce different results with and without disk.frame.
a.df -> the disk.frame, with 2735110 rows
the group_by line:
After execution the result has 2735110 rows,
but the same line run on a data frame (or at least on collect(a.df)) returns a different number of rows: 273511.
I cannot and should not collect a.df here, because it will be too big in the future.
Any suggestions or advice on this?
Thanks in advance
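A sketch of what I suspect is happening, using placeholder columns g and v (not the real data): disk.frame's chunk-wise verbs aggregate within each chunk, so a group whose rows span several chunks produces one output row per chunk it touches, which would inflate the row count exactly as described. The standard workaround is a two-stage aggregation, which never collects the raw data.

```r
library(disk.frame)
library(dplyr)
setup_disk.frame()

# Placeholder data: every group value appears in every chunk.
d    <- data.frame(g = rep(1:10, times = 100), v = runif(1000))
d.df <- as.disk.frame(d, nchunks = 4)

# chunk_group_by summarises WITHIN each chunk: a group present in all
# 4 chunks yields 4 rows, not 1 - hence more rows than expected.
per_chunk <- d.df %>%
  chunk_group_by(g) %>%
  chunk_summarize(total = sum(v)) %>%
  collect()

# Stage 2: re-aggregate the (much smaller) per-chunk results in memory.
# sum-of-sums is safe; non-decomposable statistics (e.g. median) are not.
correct <- per_chunk %>%
  group_by(g) %>%
  summarize(total = sum(total))
nrow(correct)  # 10, one row per group

# Alternative: shard the disk.frame by the grouping key so each group
# lives in exactly one chunk; the per-chunk result is then already final.
```

Only the small per-chunk summary is ever collected, so this should scale even as a.df grows. The exact verb names (chunk_group_by, chunk_summarize) are my assumption about the disk.frame version in use; newer releases may handle plain group_by across chunks automatically.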