Conversation

@schlunma (Contributor) commented Jul 16, 2025

Description

This PR expands Dataset.from_files so that it works properly with derived variables. In addition, a new attribute Dataset.input_datasets is available which returns the datasets necessary for derivation (or simply the dataset itself if no derivation is required). This can also be used within the derive preprocessor function.

This PR is the second step to make Dataset.load work with derived variables.

Example

dataset_template = Dataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
    derive=True,
    force_derivation=True,
)

datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets")  # Found 36 datasets

dataset = datasets[0]
dataset.files  # []

for d in dataset.input_datasets:
    print(d["short_name"])
    print(d.files)

# rlut
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlut/gn/v20200623/rlut_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de', 'esgf3.dkrz.de']]
# rlutcs
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlutcs/gn/v20200623/rlutcs_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de']]

Related to #2769.


@valeriupredoi (Contributor) left a comment:

another superb PR, Manu 🍻 - I've really got nothing to raise here, but I'd defer a quick look-see to @bouweandela too; just one minor comment related to one test from me 🍻

@schlunma (Contributor, Author) commented:

@LisaBock This is the PR I talked about earlier. This is necessary to be able to properly use the derive preprocessor with public API functions. It would be very nice to have this in v2.13.0.

@jlenh (Contributor) commented Aug 18, 2025

> @LisaBock This is the PR I talked about earlier. This is necessary to be able to properly use the derive preprocessor with public API functions. It would be very nice to have this in v2.13.0.

Would this be ready to be merged @LisaBock ? Or another review from @bouweandela ?

@bouweandela (Member) commented:

I would be happy to do a review, but I will need some time as this is 1300 lines of new/changed code.

@jlenh jlenh modified the milestones: v2.13.0, v2.14.0 Aug 21, 2025
@bouweandela (Member) left a comment:

Hi @schlunma, I heard you're back at work, so I made a start with reviewing this.

I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?

If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.

Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?
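For illustration, such a helper could expose the derivation requirements directly. The sketch below is hypothetical: the registry, the `get_required` name, and the returned facet dicts are stand-ins for the actual `esmvalcore.preprocessor._derive` internals, not the real API.

```python
# Hypothetical sketch of a user-facing "get_required"-style helper.
# The registry content and the function shape are illustrative only.

DERIVATION_REGISTRY: dict[str, list[dict]] = {
    # derived short_name -> facets of the required input variables
    "lwcre": [{"short_name": "rlut"}, {"short_name": "rlutcs"}],
}


def get_required(short_name: str) -> list[dict]:
    """Return the input variables required to obtain ``short_name``.

    For non-derived variables, the variable itself is the only input.
    """
    if short_name in DERIVATION_REGISTRY:
        return DERIVATION_REGISTRY[short_name]
    return [{"short_name": short_name}]


print(get_required("lwcre"))  # [{'short_name': 'rlut'}, {'short_name': 'rlutcs'}]
print(get_required("tas"))    # [{'short_name': 'tas'}]
```

Such a function would answer "what do I need?" without touching the Dataset class, but it would not by itself solve the wildcard-expansion problem the PR addresses.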

    return input_datasets

    @property
    def input_datasets(self) -> list[Dataset]:
@bouweandela (Member):

Can we rename this to derived_from or something similar?

@schlunma (Contributor, Author):
Renamed to required_datasets in ff0cdd5. Would that be okay for you? I think derived_from might be misleading for non-derived variables.

    return not copy.files


def _get_input_datasets(dataset: Dataset) -> list[Dataset]:
@bouweandela (Member):
Is this function still needed now that the dataset provides these as an attribute?

@schlunma (Contributor, Author):
Yes. This function removes optional required datasets that do not exist prior to loading them. This can/will be moved to Dataset.load() (see #2769).
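The filtering step described here can be sketched in isolation. Everything below is illustrative: the function name, the `(facets, files)` representation, and the `optional` key are stand-ins, not the actual ESMValCore code.

```python
# Illustrative sketch: drop optional required datasets that have no files,
# while keeping mandatory ones so missing data surfaces as an error later.


def drop_missing_optional(
    input_datasets: list[tuple[dict, list]],
) -> list[tuple[dict, list]]:
    """Keep a (facets, files) pair unless it is optional and has no files."""
    return [
        (facets, files)
        for facets, files in input_datasets
        if files or not facets.get("optional", False)
    ]


inputs = [
    ({"short_name": "rlut"}, ["rlut_Amon_TaiESM1.nc"]),
    ({"short_name": "rsdt", "optional": True}, []),  # optional and absent: dropped
]
print(drop_missing_optional(inputs))
```

The asymmetry is deliberate: a mandatory input with no files is kept so that the later loading step can fail loudly instead of silently deriving from incomplete inputs.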

    return input_datasets


def _representative_datasets(dataset: Dataset) -> list[Dataset]:
@bouweandela (Member):
This function seems no longer needed either.

@schlunma (Contributor, Author):
Again, this can be removed once Dataset.load() works properly for derived variables.

Comment on lines 311 to 343
all_datasets: list[list[tuple[dict, Dataset]]] = []
for input_dataset in self._get_input_datasets():
    all_datasets.append([])
    for expanded_ds in self._get_available_datasets(
        input_dataset,
    ):
        updated_facets = {}
        for key, value in self.facets.items():
            if _isglob(value):
                if key in expanded_ds.facets and not _isglob(
                    expanded_ds[key],
                ):
                    updated_facets[key] = expanded_ds.facets[key]
        new_ds = self.copy()
        new_ds.facets.update(updated_facets)
        new_ds.supplementaries = self.supplementaries

        all_datasets[-1].append((updated_facets, new_ds))

# Only consider those datasets that contain all input variables
# necessary for derivation
for updated_facets, new_ds in all_datasets[0]:
    other_facets = [[d[0] for d in ds] for ds in all_datasets[1:]]
    if all(updated_facets in facets for facets in other_facets):
        yield new_ds
    else:
        logger.debug(
            "Not all necessary input variables to derive '%s' are "
            "available for %s with facets %s",
            self["short_name"],
            new_ds.summary(shorten=True),
            updated_facets,
        )
@bouweandela (Member):

This code is difficult to understand. I believe that what it intends to do is yield a new dataset if the globs can be expanded in a similar way for all input datasets that are required to derive the dataset, did I get that right?

If yes, it could probably be simplified by bailing out as soon as you find an unexpanded glob pattern that was expanded for another dataset. Or did you intend to have all glob patterns expanded? I have some concerns about how reliable it is too. What happens if some facets are different from one input dataset to another, e.g. institute or version?

@schlunma (Contributor, Author) commented Jan 14, 2026:

I tried to simplify this in d8f5d08 and 0962489. I think it should be robust. If there's any mismatch in the facets at all (apart from the variable names and other obvious ones), those are not considered.
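The core matching idea can be sketched stand-alone: a combination of expanded facets is kept only if it appears for every required input variable. The names below are illustrative, not the actual Dataset code.

```python
# Illustrative sketch of the facet-matching step: keep only those glob
# expansions that are available for every required input variable.


def common_expansions(expansions_per_input: list[list[dict]]) -> list[dict]:
    """Return facet dicts present in the expansion list of every input."""
    first, *rest = expansions_per_input
    return [facets for facets in first if all(facets in other for other in rest)]


# rlut is available for two models, rlutcs only for one, so only the
# model providing both inputs survives.
rlut = [{"dataset": "TaiESM1"}, {"dataset": "CanESM5"}]
rlutcs = [{"dataset": "TaiESM1"}]
print(common_expansions([rlut, rlutcs]))  # [{'dataset': 'TaiESM1'}]
```

This also shows why facets that legitimately differ between inputs (such as version) must be excluded from the comparison before matching, as discussed above.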

def _get_all_available_datasets(self) -> Iterator[Dataset]:  # noqa: C901
    """Yield datasets based on the available files.

    This function requires that self.facets['mip'] is not a glob pattern.
@bouweandela (Member):

Is this still the case?

@schlunma (Contributor, Author):
I honestly don't know; I just copy-pasted this from the existing code. Since the parent function calling this method makes sure that mip does not contain wildcards, it is probably still the case.

@schlunma (Contributor, Author) commented Jan 7, 2026

Thanks for reviewing @bouweandela!

> I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?
>
> If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.

Very good question. I think by providing the method Dataset.load we are already mixing defining and processing input data. To me, the derivation of a variable is not really a preprocessor function like the countless others we have, simply because it requires multiple CMOR input variables. In that sense, it really is closer to the definition of a variable rather than preprocessing. But this opinion is fully subjective of course.

> Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?

Yes, I did. However, I found that it was not really practical. The main advantage of this PR is the ability to load derived variables with wildcards (i.e., data for which not all required variables are available are skipped). Including this logic in the preprocessor module in some kind of get_required function seemed conceptually wrong (finding data is not preprocessing) and overly complicated. Thus, I opted to include it in the Dataset class. Again, this is a very subjective decision.

I think this easy access to derived variables loaded from arbitrary datasets would be a great feature of ESMValTool. AFAIK, no other package provides that.


Labels: variable derivation (Related to variable derivation functions)

5 participants