Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add basename for loaders #4363

Merged
merged 8 commits into from
Mar 30, 2023
Merged

Conversation

jisuoqing
Copy link
Member

The handler.name are hard-coded with a string in dataset loaders, and assigning values to ds.basename will pop out an AttributeError: can't set attribute 'basename' (which is allowed by previous yt versions, though).

Here an additional parameter basename for dataset loaders is proposed, so users can load multiple datasets with different base names and analyze/visualize them in a workflow (e.g., saving plots with different filenames without worrying about overwriting each other.)

@jisuoqing jisuoqing added the proposal Proposals for enhancements, changes, etc label Mar 3, 2023
@jisuoqing jisuoqing added enhancement Making something better and removed proposal Proposals for enhancements, changes, etc labels Mar 3, 2023
Copy link
Member

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use actual default values instead of None. I've made explicit suggestions in one case but it should be generalised

yt/loaders.py Outdated Show resolved Hide resolved
yt/loaders.py Outdated Show resolved Hide resolved
jisuoqing and others added 3 commits March 4, 2023 00:58
Co-authored-by: Clément Robert <cr52@protonmail.com>
Co-authored-by: Clément Robert <cr52@protonmail.com>
@jisuoqing
Copy link
Member Author

Thanks! Learnt a nicer way.

Copy link
Contributor

@chrishavlin chrishavlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! Left a comment on fixing the type check failures.

yt/loaders.py Outdated Show resolved Hide resolved
@neutrinoceros
Copy link
Member

Had a couple minutes to think about this patch, and while I don't want to oppose it, I want to raise a couple important issues with the design.

assigning values to ds.basename will pop out an AttributeError: can't set attribute 'basename' (which is allowed by previous yt versions, though).

Setting ds.basename is disallowed in yt 4.1 because it breaks the consistency that ds.basename, ds.directory, ds.fullpath and ds.backup_filename are supposed to have (they all relate to ds.filename). This is why they are now implemented as properties.
The fundamental (private) attribute from which all these properties are derived is ds._input_filename, so shouldn't this is the attribute that loaders should be setting in order to control the output image prefixes ? (this question isn't rhetorical)

Furthermore, wouldn't image_prefix make more sense than basename as a parameter ? I think it conveys the intention much better. If not, it means I didn't get what it was meant for.

The image filename generating code is already difficult to maintain, and I'll fully admit that I didn't grasp the full picture in 20min of exploration. I think it's important that we try our best not to make the code even harder to reason about.
In short, we need to have a very clear understanding of how setting handler.name results in setting an image prefix, and simplify the code if we can rather than making it more complex.

@jisuoqing
Copy link
Member Author

Thanks for your comments! Sorry for my late response since I'm traveling these days. I agree to make the logic simpler rather than more complex, while it might be better to make the name customizable than hard-coded? I raise this question when I have to read a series of dataset (say, with filename snapshot_000, snapshot_001, etc.), load the content with some loader (say, load_uniform_grid) and perform analysis with parallelism, but it sometime causes trouble when all loaded data comes with the same base name (say, UniformGrid), including the image prefix issue mentioned before. Therefore, it might be more natural if the base name from the loaders can be customized into snapshot_000, snapshot_001 etc. Not sure whether there's a way to achieve this for now.
I think this PR might be helpful, but would not insist. I total understand that ds.basename should not be changed.

@neutrinoceros
Copy link
Member

Not sure whether there's a way to achieve this for now.

It is certainly achievable if you construct the filename outside yt and pass it to the save method instead of relying on the default (which I agree are not ideal).

@jisuoqing
Copy link
Member Author

It is certainly achievable if you construct the filename outside yt and pass it to the save method instead of relying on the default (which I agree are not ideal).

Let me double-confirm: I used to wrap up a data loader into a custom function ds = load_dataset(fn) in which the function load_uniform_grid is called. Is it possible to let the returned ds have the basename of fn (so any yt functionality can deal with this ds without any issue)?

@neutrinoceros
Copy link
Member

not yet. I agree that something in the spirit of your patch is needed to support that. I would feel much more confortable with discussing this if anyone could explain how this line

handler.name = "UniformGridData" # type: ignore [attr-defined]

ends up affecting image file prefixes.

@chrishavlin
Copy link
Contributor

chrishavlin commented Mar 22, 2023

I would feel much more confortable with discussing this if anyone could explain how this line <....>
ends up affecting image file prefixes.

For all the stream dataset types, stream_handler gets passed to the StreamDataset.__init__() where it gets stored as an attribute (StreamDataset.stream_handler).

The filename property of a StreamDataset is:

    @property
    def filename(self):
        return self.stream_handler.name

and I don't see anywhere else the stream_handler.name gets used.

So in terms of how that relates to the image file prefixes, it should be similar to any other dataset type. When a name is not supplied to PlotContainer.save(), it sets:

        if name is None:
            name = str(self.ds)

which returns the dataset basename. i.e., in the base Dataset we have:

    def __str__(self):
        return self.basename

and

    @property
    def basename(self):
        return os.path.basename(self.filename)

So in effect, stream_handler.name is the dataset "filename" and was likely just called name because it's an in-memory dataset and IMO, making the stream_handler.name configurable like this PR does is reasonable. It brings this name attribute more in line with how the filename is used as an identifier for the dataset. I definitely would make use of this functionality!

Slight confusing aside... but it's worth noting that during the StreamDataset initialization, it passes down a different "name" down to the base Dataset.__init__ that is a unique hash to store as the _input_filename of the dataset. But since the StreamDataset overrides the filename property to point to the stream_handler.name, I don't think that the hashed _input_filename ever actually gets used.

@neutrinoceros
Copy link
Member

Thanks @chrishavlin for the detailed walkthrough. I missed that StreamHandler overrides the filename property, it's much clearer now.
So it seems to me there is plenty room to simplify this internal mess of implementation details, but I'm confident now that this PR can be implemented without worrying about it. The one thing that should be properly discussed for the present patch, since it exposes a new keyword argument, is what we should name it (because we won't be able to change it afterward).

As I already pointed, I'm personally -1 on basename and I would rather go for something like image_prefix. What do you guys think ?

@chrishavlin
Copy link
Contributor

Ya, I agree -1 for basename, for all the reasons you point out. I'm also not so sure about image_prefix since it's a little more general than that (e.g., print(ds) is not obviously related to an image prefix). How about just dataset_name with an explanation in the docstring that this is used in place of a filename for in-memory stream datasets?

@matthewturk
Copy link
Member

matthewturk commented Mar 22, 2023 via email

@neutrinoceros
Copy link
Member

How about just dataset_name with an explanation in the docstring that this is used in place of a filename for in-memory stream datasets?

Sounds good to me

@chrishavlin
Copy link
Contributor

Is it just images we prefix with this? Do we use it for sidecar files, ever?

So from what I can tell, in general, the ds.filename attribute ends up being used in different ways (sometimes differing by frontend) and in some cases, yes, it is ultimately used to write to or read from sidecar files:

ds.parameter_file is a direct alias to ds.filename and that gets used all over the frontends and is often passed down to index objects where it's stored as an index_filename. And that's used, e.g., by the ParticleIndex._initialize_index to check for existing .ewah files:

        # Load Morton index from file if provided
        fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"

or to store sidecar kdtree files for SPHDataset.

storage_filename will be derived from ds.basename if it's not set and needed for Index._initialize_data_storage().

I'm not sure how much of that is relevant to the Stream frontend in particular, but I think that yes, the stream_handler.name would be used for any sidecar files (if they're used at all).

All that to say, yes, the stream_handler.name will be used for more than just image prefixing so that's a good reason to use a more general variable name like dataset_name as the new keyword argument here.

@chrishavlin
Copy link
Contributor

So I think we're in agreement here, ya?

use dataset_name as the keyword argument instead of basename and adjust the docstring entry to something like:

    dataset_name: string, optional
        Optional string used to assign a name to the dataset. Stream datasets will use
        this value in place of a filename (in image prefixing, etc.). 

Sound good, @neutrinoceros and @matthewturk ?

@jisuoqing do you mind making that change? (Maybe wait a bit to give @neutrinoceros and/or @matthewturk a chance to reply to this comment)

@jisuoqing
Copy link
Member Author

Thanks! Happy to make the change after hearing back from others.

@neutrinoceros
Copy link
Member

Sounds good to me

@matthewturk
Copy link
Member

Let's do it!

@matthewturk matthewturk enabled auto-merge March 29, 2023 15:27
@chrishavlin chrishavlin disabled auto-merge March 29, 2023 21:54
@jisuoqing jisuoqing added the docs label Mar 30, 2023
Copy link
Member

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you ! LGTM

Copy link
Contributor

@chrishavlin chrishavlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @jisuoqing !

@chrishavlin chrishavlin merged commit 55b9c50 into yt-project:main Mar 30, 2023
@jisuoqing jisuoqing deleted the loader_basename branch March 31, 2023 02:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs enhancement Making something better
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants