Contributor

@mo-dkrz mo-dkrz commented Jan 6, 2026

This PR introduces xarray-prism, which provides the prism engine for opening all kinds of climate data, both remote and on POSIX file systems.

How to install:

pip install xarray-prism

Examples for remote data:

import xarray

nscc_data="https://thredds.ucar.edu/thredds/ncss/grid/grib/NCEP/GFS/Global_0p5deg_ana/GFS_Global_0p5deg_ana_20260104_1200.grib2?var=Temperature_altitude_above_msl&north=90.000&west=-180.000&east=180.000&south=-90.000&horizStride=1&time_start=2026-01-04T12:00:00Z&time_end=2026-01-04T12:00:00Z&&&accept=netcdf3"
xarray.open_dataset(nscc_data, engine="prism")

grib_remote="https://thredds.ucar.edu/thredds/fileServer/grib/NCEP/GFS/Global_0p25deg_ana/GFS_Global_0p25deg_ana_20260105_0600.grib2"
xarray.open_dataset(grib_remote, engine="prism")

nc3_remote="https://icdc.cen.uni-hamburg.de/thredds/fileServer/ftpthredds/ar5_sea_level_rise/gia_mean.nc"
xarray.open_dataset(nc3_remote, engine="prism")

opendap_data="https://icdc.cen.uni-hamburg.de/thredds/dodsC/ftpthredds/ar5_sea_level_rise/gia_mean.nc"
xarray.open_dataset(opendap_data, engine="prism")

tif_remote="https://github.com/mommermi/geotiff_sample/raw/refs/heads/master/sample.tif"
xarray.open_dataset(tif_remote, engine="prism")

@antarcticrainforest antarcticrainforest left a comment

Looks really good.

A couple of general comments.

  • There are some formats that aren't covered yet. They are some unusual JSON reference files, but I'd have to investigate them myself.
  • I think the name freva-xarray, or the engine name freva, is too loaded. How about changing it?

How about xarray-prism / prism, xarray-switchboard / switchboard, or xarray-gateway / gateway? Just some thoughts.

README.md Outdated
Comment on lines 7 to 10
> If you deal with a data that `freva` engine is not able to open that, please
> report the data [here](https://github.com/freva-org/freva-xarray/issues/new)
> to let us improve this engine to be able to be versitile and work with all
> sort of climate data.

Suggested change
> If you deal with a data that `freva` engine is not able to open that, please
> report the data [here](https://github.com/freva-org/freva-xarray/issues/new)
> to let us improve this engine to be able to be versitile and work with all
> sort of climate data.
> If you encounter with a data formats that `freva` engine is not able to open, please
> files an issue report [here](https://github.com/freva-org/freva-xarray/issues/new).
> This helps us to improve the engine enabling users work with different kinds of climate data.

extra_lines = 0
if show_progress:
    fmt = "GRIB" if engine == "cfgrib" else "NetCDF3"
    print(f"[warning] Remote {fmt} requires full file download")

warnings.warn?
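
A minimal sketch of what that could look like, using the standard library's warnings module instead of print. The helper name warn_full_download is hypothetical and not part of the PR; only the message text and the engine check come from the snippet above.

```python
import warnings


def warn_full_download(engine: str) -> str:
    """Warn that a remote file has to be downloaded in full (hypothetical helper)."""
    fmt = "GRIB" if engine == "cfgrib" else "NetCDF3"
    message = f"Remote {fmt} requires full file download"
    # stacklevel=2 attributes the warning to the caller rather than to this helper
    warnings.warn(message, UserWarning, stacklevel=2)
    return message
```

Unlike print, this lets users silence or escalate the message with the usual warnings filters.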

) -> xr.Dataset:
    """Xarray Generic function: Open dataset with
    automatic format detection."""
    if not isinstance(filename_or_obj, (str, Path)):

os.PathLike? This also covers fsspec stuff.
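
A small sketch of the idea, assuming the check only needs to accept strings and anything implementing __fspath__. RemoteLocation and as_path_string are hypothetical names for illustration; pathlib.Path already implements os.PathLike, so the tuple below accepts everything (str, Path) did.

```python
import os
from pathlib import Path


class RemoteLocation:
    """Hypothetical path-like wrapper standing in for any object with __fspath__."""

    def __init__(self, location: str) -> None:
        self._location = location

    def __fspath__(self) -> str:
        return self._location


def as_path_string(filename_or_obj: object) -> str:
    """Accept str or any os.PathLike object and return its string form."""
    if not isinstance(filename_or_obj, (str, os.PathLike)):
        raise TypeError(f"Unsupported input type: {type(filename_or_obj).__name__}")
    # os.fspath returns str unchanged and calls __fspath__ on path-like objects
    return os.fspath(filename_or_obj)
```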

lines_printed = 0

if is_remote:
    sys.stdout.write("[info] Detecting format...")

logging?
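
For illustration, a minimal sketch of the same message routed through the logging module. The logger name "xarray_prism" and the function announce_detection are assumptions, not taken from the PR.

```python
import logging

# The logger name is an assumption; a package would normally use its own module name
logger = logging.getLogger("xarray_prism")


def announce_detection(is_remote: bool) -> None:
    """Log the format-detection message instead of writing to stdout."""
    if is_remote:
        logger.info("Detecting format...")
```

With this approach the application decides where messages go and at which level they are shown, for example via logging.basicConfig(level=logging.INFO).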

self._render()

def _render(self) -> None:
    mb = self._current / 1024 / 1024

Suggested change
mb = self._current / 1024 / 1024
mb = self._current / 1024**2

@antarcticrainforest

@mannreis could you take a look at this and advertise if needed. This seems to be really nice!!!

@mannreis mannreis left a comment

Really good work, Mo! I left my two cents. My only concern is about caching the remote data, which may never be an issue.

pct = min(self._current / self._total, 1.0)
filled = int(self.width * pct)
bar = "█" * filled + "░" * (self.width - filled)
total_mb = self._total / 1024 / 1024

Based on Martin's suggestion:

Suggested change
total_mb = self._total / 1024 / 1024
total_mb = self._total / 1024**2
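
Put together, the rendering logic with a single named constant could be sketched as below. render_bar and MIB are hypothetical names, the output format is an assumption, and the sketch assumes total is greater than zero.

```python
MIB = 1024**2  # bytes per mebibyte, same value as 1024 * 1024


def render_bar(current: int, total: int, width: int = 20) -> str:
    """Return a text progress bar followed by the current and total size in MB."""
    pct = min(current / total, 1.0)  # clamp in case more bytes arrive than announced
    filled = int(width * pct)
    bar = "█" * filled + "░" * (width - filled)
    return f"{bar} {current / MIB:.1f}/{total / MIB:.1f} MB"
```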

return None


def _detect_from_magic_bytes(header: bytes, lower_path: str) -> str:

Since type hints are already in use, would something like this be an improvement? Or does returning "unknown" defeat the purpose?

Suggested change
def _detect_from_magic_bytes(header: bytes, lower_path: str) -> str:
Engine = Literal["cfgrib", "scipy", "h5netcdf", "rasterio", "unknown"]
def _detect_from_magic_bytes(header: bytes, lower_path: str) -> Engine:

I guess it would also have to be propagated up the call stack, which would make it inconvenient.
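
To make the idea concrete, here is a self-contained sketch of a Literal return type on a magic-byte check. The function name detect_engine and the format-to-engine mapping are assumptions for illustration and not the PR's actual implementation; only the byte signatures (GRIB, classic NetCDF, HDF5, TIFF) are standard.

```python
from typing import Literal

Engine = Literal["cfgrib", "scipy", "h5netcdf", "rasterio", "unknown"]


def detect_engine(header: bytes) -> Engine:
    """Map the first bytes of a file to an engine name (illustrative mapping)."""
    if header.startswith(b"GRIB"):
        return "cfgrib"
    # Classic NetCDF: "CDF" followed by version byte 1 or 2
    if header.startswith((b"CDF\x01", b"CDF\x02")):
        return "scipy"
    # HDF5 signature, which NetCDF4 files share
    if header.startswith(b"\x89HDF\r\n\x1a\n"):
        return "h5netcdf"
    # TIFF: little-endian "II*\0" or big-endian "MM\0*"
    if header.startswith((b"II*\x00", b"MM\x00*")):
        return "rasterio"
    return "unknown"
```

A static type checker would then flag a typo such as comparing the result with "cfgirb", but only in callers that are annotated with Engine as well.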


# GRIB: cache locally
if engine == "cfgrib":
    local_path = _cache_remote_file(

Caching is good, but I'm not sure whether it is better to make users enable it explicitly or to enable it by default and allow them to disable it.
Regardless, I would not cache in /tmp for these applications, which I believe is the default on many systems.

Contributor Author

@mo-dkrz mo-dkrz Jan 7, 2026

Thanks for the review. Regarding caching, I have already thought quite a bit about this. For this purpose we added the FREVA_XARRAY_CACHE environment variable so that users can change the cache location. Users can also set it through the storage_options argument of xarray's open_dataset function. In addition, when a user opens a remote file, progress logging tells them that the data is being stored in /tmp and shows the download progress. This means users can cancel the download while watching the logs.

Also, the main reason for making FREVA_XARRAY_CACHE adjustable was our own data-loader. We wanted to point the cache to /scratch on deployment so that we don't run into storage limits.

Apart from that, we don't have many other options for opening GRIB and NetCDF3 data. Because of the structure of those formats, we need to download the full file before opening it. So in short, /tmp was the safest option we could come up with. If you have a more secure suggestion, please let us know.
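
As a rough sketch of how such a lookup can work: the function below resolves the cache directory from an explicit argument, then from FREVA_XARRAY_CACHE, then from the system temporary directory. The function name resolve_cache_dir and the precedence order are assumptions for illustration, not necessarily what the package does.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_cache_dir(explicit: Optional[str] = None) -> Path:
    """Pick a cache directory: explicit argument, then env variable, then temp dir."""
    location = (
        explicit
        or os.environ.get("FREVA_XARRAY_CACHE")
        or tempfile.gettempdir()  # usually /tmp on Linux
    )
    cache_dir = Path(location).expanduser()
    # Create the directory up front so later downloads do not fail on a missing path
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

On a deployment, exporting FREVA_XARRAY_CACHE=/scratch/... before starting the process would then redirect all cached downloads away from /tmp.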

mannreis commented Jan 7, 2026

  • I think the name freva-xarray, or the engine name freva, is too loaded. How about changing it?

How about xarray-prism / prism, xarray-switchboard / switchboard, or xarray-gateway / gateway? Just some thoughts.

At first I didn't mind the name, but after seeing the structure of xarray backends, I see Martin's point. Unless I misunderstood something, this could even be contributed upstream to xarray, in which case freva would not be involved.

@mo-dkrz mo-dkrz closed this Jan 7, 2026
@mo-dkrz mo-dkrz changed the title from "Introduce freva-xarray" to "Introduce xarray-prism" Jan 8, 2026
@mo-dkrz mo-dkrz reopened this Jan 8, 2026
Contributor Author

mo-dkrz commented Jan 8, 2026

@antarcticrainforest thanks for the review.

Regarding this point:

There are some formats that aren't covered yet. It's some weird json reference files but I'd have to investigate myself.

Could you please share a link to, or an example of, this data so that I can dig into it and make it compatible with the engine?
