Different group_by results after joins with merge_by_chunk_id
#358
Replies: 3 comments
-
That's not good. I will see if I can reproduce it. Can you tell me the types of col1, col2, etc.? How many unique values are in each?
-
The problem was that in previous steps I had done some joins with the option merge_by_chunk_id = TRUE, and the results of those joins differed from the ones produced with plain data frames. I still cannot tell these two ways of joining the data apart: why would we set merge_by_chunk_id to TRUE when the generated data isn't what we expect? Anyway, I set merge_by_chunk_id to FALSE and the issue seemed to be resolved, but at the end I got an error about the stack size limit!
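For anyone hitting the same thing, here is a sketch of the difference, based on my understanding of the disk.frame join API (a.df, b.df, and the key id are placeholder names, not from the thread): with merge_by_chunk_id = TRUE, chunk i of the left table is joined only against chunk i of the right table, which is only correct when both disk.frames are sharded identically on the join key; with FALSE, disk.frame does the extra work needed to match rows across chunks.

```r
library(disk.frame)
library(dplyr)
setup_disk.frame()

# Placeholder data: same ids, but stored in opposite chunk order,
# so chunk 1 of a.df and chunk 1 of b.df share no ids.
a.df <- as.disk.frame(data.frame(id = 1:100,  x = runif(100)), nchunks = 4)
b.df <- as.disk.frame(data.frame(id = 100:1, y = runif(100)), nchunks = 4)

# merge_by_chunk_id = TRUE joins chunk 1 with chunk 1, chunk 2 with chunk 2, ...
# Matches that live in a different chunk are silently missed, so here every
# y comes back NA even though every id has a match overall.
fast <- left_join(a.df, b.df, by = "id", merge_by_chunk_id = TRUE)

# Sharding both sides on the join key first makes chunk-by-chunk joining valid,
# because equal ids are guaranteed to land in the same chunk number.
a2 <- rechunk(a.df, nchunks = 4, shardby = "id")
b2 <- rechunk(b.df, nchunks = 4, shardby = "id")
safe <- left_join(a2, b2, by = "id", merge_by_chunk_id = TRUE)
```

This is a sketch under my assumptions about rechunk()/shardby; check the disk.frame documentation for the exact semantics in your version.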
-
I see. A proper article on how joins work is long overdue.
-
I have these lines of code, which produce different results with and without disk.frame.
a.df -> the disk.frame, with 2735110 rows
the group_by line:
After execution the result has 2735110 rows,
but the same line run on a data frame (or at least on collect(a.df)) returns a different number of rows: 273511.
I cannot and should not collect a.df here, because it will be too big in the future.
Any suggestions or advice on this?
Thanks in advance
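A sketch of what I suspect is happening, using placeholder columns g and v (not the real data): disk.frame's chunk-wise verbs aggregate within each chunk, so a group whose rows span several chunks produces one output row per chunk it touches, which would inflate the row count exactly as described. The standard workaround is a two-stage aggregation, which never collects the raw data.

```r
library(disk.frame)
library(dplyr)
setup_disk.frame()

# Placeholder data: every group value appears in every chunk.
d    <- data.frame(g = rep(1:10, times = 100), v = runif(1000))
d.df <- as.disk.frame(d, nchunks = 4)

# chunk_group_by summarises WITHIN each chunk: a group present in all
# 4 chunks yields 4 rows, not 1 - hence more rows than expected.
per_chunk <- d.df %>%
  chunk_group_by(g) %>%
  chunk_summarize(total = sum(v)) %>%
  collect()

# Stage 2: re-aggregate the (much smaller) per-chunk results in memory.
# sum-of-sums is safe; non-decomposable statistics (e.g. median) are not.
correct <- per_chunk %>%
  group_by(g) %>%
  summarize(total = sum(total))
nrow(correct)  # 10, one row per group

# Alternative: shard the disk.frame by the grouping key so each group
# lives in exactly one chunk; the per-chunk result is then already final.
```

Only the small per-chunk summary is ever collected, so this should scale even as a.df grows. The exact verb names (chunk_group_by, chunk_summarize) are my assumption about the disk.frame version in use; newer releases may handle plain group_by across chunks automatically.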