feature: collection's path iteration functions #154

Closed
wants to merge 1 commit into from


@RP-pl RP-pl commented Apr 12, 2024

No description provided.

@Suor
Owner

Suor commented Apr 16, 2024

What's the purpose or use case for this?

@RP-pl
Author

RP-pl commented Apr 19, 2024

When working with really complicated data formats, it may be easier to filter all possible paths than to select the values directly.
For example, if we have data formatted in the following way:

{
   "key1": [1,2,3,4],
   "key2": [11,22],
   "key3": 5,
   "key4": [6]
}

If we wanted to select only the second element of each array, it would be far more practical to filter the paths.

@Suor
Owner

Suor commented Apr 21, 2024

Wouldn't tree_nodes() or tree_leaves() do the job for you?

@RP-pl
Author

RP-pl commented Apr 24, 2024

When using tree_leaves, how would you know that in the example above we are getting the second element from the last sequence? Using tree_nodes would make that possible, but the code would be really ugly.

The difference between tree_leaves and get_end_paths is that tree_leaves returns only the values, while get_end_paths returns the path to each value. In my opinion, these functions are fundamentally different in concept.
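To make the distinction concrete, here is a minimal sketch of a path-yielding traversal. The helper name `leaf_paths` and its behaviour are assumptions for illustration only; it is neither funcy's API nor the implementation proposed in this PR.

```python
def leaf_paths(node, path=()):
    """Yield (path, value) pairs for every non-container leaf.

    Hypothetical helper: descends into dicts by key and lists by index.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_paths(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from leaf_paths(child, path + (index,))
    else:
        yield path, node


data = {"key1": [1, 2, 3, 4], "key2": [11, 22], "key3": 5, "key4": [6]}

pairs = list(leaf_paths(data))

# A values-only traversal loses the position of each value.
values_only = [value for _, value in pairs]

# With paths, "the second element of each array" is a filter on the path.
second_elements = [value for path, value in pairs
                   if len(path) == 2 and path[1] == 1]
```

Because the sketch yields the value together with its path, no second lookup is needed to recover the value.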

@Suor
Owner

Suor commented Apr 25, 2024

from funcy import tree_leaves, is_mapping, lcat


data = {
    "key1": [1,2,3,4],
    "key2": [11,22],
    "key3": 5,
    "key4": [6],
}

def every_2nd(data):
    leaves = tree_leaves(data, follow=is_mapping, children=lambda x: x.values())
    return lcat(l[1::2] for l in leaves if isinstance(l, list))

Like this. Or you need to be more precise about what you are trying to achieve.

I really don't understand how paths will help you though. Do you plan to use get_in() with each path to get a value? Because that sounds inefficient.
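The efficiency concern can be sketched as follows. This uses only the standard library: `lookup` is a hypothetical stand-in for a get_in()-style lookup (it assumes every path exists and has no default handling), and the paths are written out by hand.

```python
from functools import reduce
from operator import getitem


def lookup(data, path):
    """Follow `path` from the root, one indexing step per path component."""
    return reduce(getitem, path, data)


data = {"key1": [1, 2, 3, 4], "key2": [11, 22], "key3": 5, "key4": [6]}
paths = [("key1", 1), ("key2", 1)]

values = [lookup(data, path) for path in paths]

# Every lookup restarts at the root, so the indexing work is the sum of the
# path lengths, on top of the traversal that produced the paths.
indexing_steps = sum(len(path) for path in paths)
```

For a handful of shallow paths this cost is negligible; it grows with both the number of selected paths and their depth.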

@RP-pl
Author

RP-pl commented Apr 30, 2024

The code you provided returns every other element of each array (the 2nd, 4th, and so on), which is not exactly what I am trying to achieve with this example (it is deliberately simple, for demonstration purposes). What I am aiming for is to retrieve only the second element of each array embedded in a dictionary. You are right that I would use get_in to retrieve the values, and you are also right that this is inefficient. However, it may still be the cleanest way to handle much more complicated data structures. Notice that if the arrays were nested more deeply in dictionaries, the code to retrieve the second element of each array using tree_leaves would look really messy. The inefficiency would also not be a problem when the data we are looking for is sparse (e.g., out of 10,000 records, only 10 have a second element in the array).

@Suor
Owner

Suor commented Apr 30, 2024

I still don't understand what you are trying to achieve or why tree_leaves() would be ugly; it doesn't depend in any way on the nesting of dicts.

Maybe you need some tree_transform()? Not enough info to say.

@RP-pl
Author

RP-pl commented May 2, 2024

Maybe a more complex example would help.
For data given as:

data = {
    "key11": {
        "key1": [
            {
                "key111": [1, 2, 3, 4]
            },
            {
                "key222": [11, 22, 33, 44]
            },
            {
                "key333": [111, 222, 3333, 4444]
            },
            {
                "key444": [1111, 2222, 3333, 4444]
            }],
        "key2": [
            {
                "key111": [12, 23, 34, 45]
            },
            {
                "key222": [112, 223, 334, 445]
            },
            {
                "key333": [1112, 2223, 33334, 44445]
            },
            {
                "key444": [11112, 22223, 33334, 44445]
            }],
    },
    "key22": {
        "key1": 5,
        "key2": [6, [1, 2, 3, 4]],
    }
}

Suppose we want to get every second element of the most deeply nested lists (let's stick to that version, as in your last code snippet).
Then the code would look like this (if you have a cleaner idea, please let me know):

# get_end_paths is the function proposed in this PR
from funcy import ltree_leaves, lcat, is_mapping, get_end_paths, get_in

def every_2nd(data):
    prev = lcat([leaf for leaf in ltree_leaves(data, follow=is_mapping, children=lambda x: x.values()) if isinstance(leaf, list)])
    all_leaves = [ltree_leaves(leaf, follow=is_mapping, children=lambda x: x.values()) for leaf in prev if isinstance(leaf, dict)]
    return lcat([it[1::2] for it in lcat(l for l in all_leaves if isinstance(l, list))])

def every_2nd_paths(data):
    return [get_in(data, path) for path in filter(lambda x: len(x) == 5 and (x[4] + 1) % 2 == 0, get_end_paths(data))]

print(every_2nd(data))  # [2, 4, 22, 44, 222, 4444, 2222, 4444, 23, 45, 223, 445, 2223, 44445, 22223, 44445]
print(every_2nd_paths(data))  # [2, 4, 22, 44, 222, 4444, 2222, 4444, 23, 45, 223, 445, 2223, 44445, 22223, 44445]

I believe every_2nd_paths looks cleaner.

@Suor
Owner

Suor commented May 2, 2024

I would probably go with something like this:

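from funcy import tree_leaves, lfilter, isa, cat

# Descend into dicts and into non-empty lists of dicts; lists of numbers stay leaves.
follow = lambda c: isinstance(c, dict) or isinstance(c, list) and bool(c) and isinstance(c[0], dict)
lists = tree_leaves(data, follow, children=lambda x: x.values() if isinstance(x, dict) else x)
nums = lfilter(isa(int), cat(l[1::2] for l in lists if isinstance(l, list)))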

At least it won't contain utter magic like len(x) == 5 and (x[4]+1) %2 == 0 :)
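For readers decoding that filter, here is a sketch of what it encodes, written as a named predicate. The names `INNERMOST_DEPTH` and `is_every_second_element` are hypothetical, and the path layout is an assumption based on the example data in this thread (outer key, inner key, index of a dict, key of a number list, index of an element).

```python
# Assumed path layout for the example data, e.g. ("key11", "key1", 0, "key111", 1):
# outer key, inner key, dict index, list key, element index.
INNERMOST_DEPTH = 5


def is_every_second_element(path):
    """True for paths ending at the 2nd, 4th, ... element of an innermost list."""
    if len(path) != INNERMOST_DEPTH:
        return False
    element_index = path[-1]
    # (index + 1) % 2 == 0 in the original filter is the same as an odd
    # zero-based index.
    return element_index % 2 == 1


examples = [
    ("key11", "key1", 0, "key111", 0),
    ("key11", "key1", 0, "key111", 1),
    ("key22", "key2", 1, 3),  # only four components, so it is rejected
]
selected = [path for path in examples if is_every_second_element(path)]
```

Naming the condition makes it readable, but the predicate still hard-codes the depth of the data, which is the coupling being objected to here.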

But I would seriously look into structuring my data more properly; it should not require such elaborate effort to extract any semantically meaningful part of the data.

As for traversing dict/list structures, there might be an easier way than tree_leaves() with a custom follow/children, but iterating by paths is not it. It is not only inefficient but also won't work without your magic filter, which knows way too much about seemingly chaotic data.

@RP-pl RP-pl closed this May 2, 2024