Implement slice(), *_join(), and other dplyr methods for tbl_svy. #65

Open
krivit opened this issue Jan 15, 2020 · 8 comments · Fixed by #120

Comments

@krivit
Contributor

krivit commented Jan 15, 2020

There is a filter() method for tbl_svy, but there isn't a slice() method, or any of the *_join() methods, as far as I can tell. Would it be possible to implement them? Thanks in advance!

@krivit krivit changed the title Implement slice() method for tbl_svy. Implement slice(), *_join(), and other dplyr methods for tbl_svy. Jan 15, 2020
@gergness
Owner

slice isn't in because it kind of messes with the database-backed surveys, though I can't remember exactly how, and it seems like it should be possible.

The *_join() family of functions (and bind_rows() for that matter) makes me nervous because they can have implications for the weights and design that aren't knowable just from the data itself. I recommend you perform these data manipulations before setting the survey design, while the data is still in traditional data.frames.
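A minimal sketch of the "join first, declare the design afterwards" workflow recommended above. The tables, column names, and weights are made up for illustration; only `left_join()`, `as_survey_design()`, and `survey_mean()` are real dplyr/srvyr functions.

```r
library(dplyr)
library(srvyr)

# Hypothetical respondent-level sample with strata, weights, and a region key.
respondents <- data.frame(
  id      = 1:6,
  region  = c("N", "N", "S", "S", "E", "E"),
  stratum = c("a", "a", "a", "b", "b", "b"),
  wt      = c(10, 10, 20, 20, 15, 15),
  income  = c(31, 45, 28, 52, 39, 47)
)

# Hypothetical region-level lookup table, one row per region.
regions <- data.frame(
  region = c("N", "S", "E"),
  urban  = c(TRUE, FALSE, TRUE)
)

# Join while the data are still a plain data.frame ...
joined <- left_join(respondents, regions, by = "region")
stopifnot(nrow(joined) == nrow(respondents))  # the lookup must not add rows

# ... and only then declare the survey design.
svy <- as_survey_design(joined, strata = stratum, weights = wt)

res <- svy %>%
  group_by(urban) %>%
  summarise(mean_income = survey_mean(income))
res
```

Because the join happens before the design is declared, the strata and weights passed to `as_survey_design()` are stated explicitly for the joined table rather than inferred.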

@krivit
Contributor Author

krivit commented Jan 15, 2020

I agree that joins can mess with the design, but the package already handles indexing that duplicates rows (e.g., x[rep(1:2,each=2),]) intelligently by treating them as a cluster sample; I would think that joins would work along similar lines.

@tzoltak
Contributor

tzoltak commented Apr 9, 2020

Another approach to joins would be to check whether join adds or duplicates any rows and simply not to allow join in such cases.
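A rough sketch of that check on plain data frames. `strict_left_join()` is a hypothetical helper, not a function in srvyr or dplyr; a `tbl_svy` method would presumably run the same comparison on the design's underlying data.

```r
library(dplyr)

# Hypothetical helper: a left join that refuses to change the number of rows
# of `x`. A left join never drops rows, so a changed count means that some
# key in `x` matched more than one row in `y`.
strict_left_join <- function(x, y, by) {
  out <- left_join(x, y, by = by)
  if (nrow(out) != nrow(x)) {
    stop(
      "Join changed the number of rows from ", nrow(x), " to ", nrow(out),
      "; check that the key is unique in `y`.",
      call. = FALSE
    )
  }
  out
}

x     <- data.frame(id = 1:3, key = c("a", "b", "c"))
y_ok  <- data.frame(key = c("a", "b", "c"), value = c(1, 2, 3))
y_dup <- data.frame(key = c("a", "a", "b"), value = c(1, 2, 3))

# Unique keys in `y`: the join goes through with the row count unchanged.
ok <- strict_left_join(x, y_ok, by = "key")

# Duplicated key "a" in `y`: the helper stops instead of silently adding a row.
msg <- tryCatch(
  strict_left_join(x, y_dup, by = "key"),
  error = function(e) conditionMessage(e)
)
msg
```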

@bschneidr
Contributor

bschneidr commented May 12, 2021

I think it would make sense to add filtering joins (anti_join() and semi_join()), since those won't accidentally alter the survey design. I've opened pull request #120 to implement them, if you think that's a good idea.

I'm ambivalent about whether it's worth adding left_join() since I'm also nervous about altering users' survey designs in ways they might not expect or whose ramifications they won't appreciate. But I think it'd be totally fine if we went with @tzoltak's suggestion for left_join(), inner_join(), and full_join(), and throw informative errors when rows are added or duplicated.
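For illustration, a sketch of how filtering joins on a `tbl_svy` might be used, assuming a srvyr version that includes the `semi_join()` and `anti_join()` methods and that they accept a plain data frame as `y`. The `keep` table is hypothetical; `apistrat` is the example dataset shipped with the survey package.

```r
library(dplyr)
library(srvyr)

data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Hypothetical lookup: the school codes (`cds`) we want to keep.
keep <- data.frame(cds = head(apistrat$cds, 50))

# Filtering joins only select rows of the survey; they add no columns
# and cannot duplicate rows.
kept    <- semi_join(dstrata, keep, by = "cds")
dropped <- anti_join(dstrata, keep, by = "cds")

kept
```

Since these joins behave like a `filter()` on the survey, the result is a subset of the original design rather than a new one.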

@gergness
Owner

Yeah, agree on filtering joins, I just don't see them as super useful without the other joins.

For mutating joins, I meant to investigate this comment from krivit, but never did (and likely won't have time for a while):

the package already handles indexing that duplicates rows (e.g., x[rep(1:2,each=2),]) intelligently by treating them as a cluster sample; I would think that joins would work along similar lines.

If this is usually the right thing to do (my ignorance of the math behind surveys is really coming out here), I can imagine a warning instead of an error when a join creates duplicates. I think we also would need a warning for mutating joins when both x & y are surveys to let the user know that only the weights from x are kept.

Anyone have real world examples where they wanted to do this (preferably with sharable data so they're full reprexes, but I'm also just trying to wrap my head around it, so it's okay if not)?

@tzoltak
Contributor

tzoltak commented May 12, 2021

Well, it may make sense to create clusters from duplicated rows, but whether it actually does depends heavily on one's workflow; there's no way the package can check this. Personally, I'm rather devoted to the idea of being explicit about the survey design and not modifying it on the fly (in an operation that doesn't look like it modifies the design), but that's a matter of personal preference, and if srvyr already handles this when selecting rows, it makes sense for it to behave analogously when performing joins. Nevertheless, I think there should be a warning, or at least a note, in such a situation: in my experience, duplication of rows in joins often comes from mistakenly assuming that (combinations of) values of the key variable(s) are unique when they were in fact unintentionally duplicated at an earlier stage of a complex data transformation.

@krivit
Contributor Author

krivit commented May 12, 2021

@gergness, the specific case I am dealing with is something called egocentric network data. For example, I might ask each survey respondent about their own demographic information (age, sex, race/ethnicity, etc.) and put it in Table x, and ask about the demographics of each of their close friends and put those in Table y. (Let's assume that no one is nominated twice.)

Since I selected my respondents using some kind of a sampling design, I might create a srvyr object out of x and set up that design. Now, suppose that I want to analyse the association between a person's demographics and those of their close friends. When I inner join x to y, I would expect the result to be a table with the same number of rows as y and with its design being a cluster sample within x's design, since that's what they become.
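A small sketch of that scenario, done the explicit way: join the plain tables first, then declare the respondents as clusters. All data and column names are invented, and the assumption that each friend simply inherits the respondent's weight is an illustrative choice, not something srvyr or egor prescribes.

```r
library(dplyr)
library(srvyr)

# Hypothetical Table x: sampled respondents (egos) with sampling weights.
egos <- data.frame(
  ego_id  = c(1, 2, 3, 4),
  ego_sex = c("F", "M", "F", "M"),
  wt      = c(100, 150, 120, 130)
)

# Hypothetical Table y: close friends (alters), several per respondent.
alters <- data.frame(
  ego_id       = c(1, 1, 2, 3, 3, 3, 4, 4),
  alter_female = c(1, 0, 0, 1, 1, 0, 1, 0)
)

# One row per friend, carrying the respondent's attributes and weight.
dyads <- inner_join(egos, alters, by = "ego_id")
stopifnot(nrow(dyads) == nrow(alters))

# Assumption: friends are nested within sampled respondents, so the
# respondent is the cluster (primary sampling unit).
dyad_svy <- as_survey_design(dyads, ids = ego_id, weights = wt)

# Share of female friends by respondent's sex.
res <- dyad_svy %>%
  group_by(ego_sex) %>%
  summarise(p_alter_female = survey_mean(alter_female))
res
```

A join method on `tbl_svy` that treated duplicated rows as clusters would need to arrive at an equivalent design automatically.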

You can see those examples in the egor package. Right now, we use a kludge described here.

@tzoltak, my preference would be to emulate the behaviour of survey as much as possible:

library(survey)
data(mtcars)
(carsvy <- svydesign(~1, data=mtcars))
#> Warning in svydesign.default(~1, data = mtcars): No weights or probabilities
#> supplied, assuming equal probability
#> Independent Sampling design (with replacement)
#> svydesign(~1, data = mtcars)
carsvy[rep(1:2, each=2)]
#> 1 - level Cluster Sampling design (with replacement)
#> With (2) clusters.
#> svydesign(~1, data = mtcars)

Created on 2021-05-12 by the reprex package (v2.0.0)

@gergness gergness reopened this May 23, 2021
@gergness
Owner

Oops, didn't mean to close. Filtering joins are available now though.
