miranda.io package

IO Utilities module.

Submodules

miranda.io._input module

miranda.io._input.discover_data(input_files: str | Path | list[str | Path] | GeneratorType, suffix: str = 'nc', recurse: bool = True) → list[Path] | GeneratorType

Discover data files with a given suffix from a file, a folder, or a generator of paths.

Parameters:
  • input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file or folder, a list of such paths, or a generator of paths.

  • suffix (str) – File-ending suffix to search for. Default: “nc”.

  • recurse (bool) – Whether to recurse through folders or not. Default: True.

Returns:

list of pathlib.Path or GeneratorType of pathlib.Path

Warning

Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.
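The discovery behaviour described above can be approximated with the standard library alone. The sketch below is not miranda's implementation: discover_files_sketch is a hypothetical helper that assumes a single folder as input, and it skips anything nested inside a “.zarr” store to mirror the warning.

```python
from pathlib import Path


def discover_files_sketch(folder, suffix="nc", recurse=True):
    """Hypothetical stand-in for suffix-based discovery (not miranda's code)."""
    folder = Path(folder)
    pattern = f"*.{suffix}"
    # rglob walks sub-folders; glob only looks at the top level.
    candidates = folder.rglob(pattern) if recurse else folder.glob(pattern)
    # Assumption: never descend into a ".zarr" store, which is a folder on disk.
    found = [
        p
        for p in candidates
        if not any(
            parent.suffix == ".zarr" for parent in p.relative_to(folder).parents
        )
    ]
    return sorted(found)
```

Unlike the documented function, this sketch always returns a sorted list and never a generator.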

miranda.io._output module

IO Output Operations module.

miranda.io._output.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) → None

Concatenate and rechunk zarr files.

Parameters:
  • freq (str)

  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • overwrite (bool)

  • **dask_kwargs

Returns:

None

miranda.io._output.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) → None

Merge and rechunk zarr files.

Parameters:
  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • project (str, optional)

  • target_chunks (dict[str, int], optional)

  • variables (Sequence of str, optional)

  • freq (str, optional)

  • suffix ({“nc”, “zarr”})

  • overwrite (bool)

Returns:

None

miranda.io._output.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, output_name: str | None = None, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) → dict[str, Path]

Write an xarray object to NetCDF or Zarr with an appropriate chunking regime.

Parameters:
  • ds (xr.DataArray or xr.Dataset) – Dataset or DataArray.

  • output_path (str or os.PathLike) – Output folder path.

  • output_format ({“netcdf”, “zarr”}) – Output data container type.

  • output_name (str, optional) – Output file name.

  • chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.

  • overwrite (bool) – Whether to remove existing files or fail if files already exist.

  • compute (bool) – If True, files are written as soon as the function is called. If False, a dask.delayed.Delayed object is returned that can be computed later. Default: True.

Returns:

dict[str, Path]

miranda.io._output.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)

Write datasets from a dictionary of Miranda-formatted datasets.

Parameters:
  • dataset_dict (dict[str, xr.Dataset or None])

  • output_folder (str or os.PathLike)

  • temp_folder (str or os.PathLike)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

  • chunks (dict[str, int])

  • **dask_kwargs

Returns:

None

miranda.io._rechunk module

miranda.io._rechunk.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'stations': {'1hr': {'default': {'station': 50, 'time': '5 years'}}, 'day': {'default': {'station': 200, 'time': '10 years'}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) → dict[str, int]

Fetch data chunking configuration.

Parameters:
  • priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).

  • freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.

  • dims (sequence of str) – The dimension names that will be used for chunking.

  • default_config (dict) – The dictionary to use for determining the chunking configuration.

Returns:

dict[str, int]
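To make the shape of default_config concrete, here is a minimal sketch of a lookup over a subset of the values shown in the signature. The selection rule (take the first grid entry whose dimension names all appear in dims) is an assumption made for illustration, and lookup_chunks_sketch is a hypothetical name, not part of miranda.

```python
# Subset of the default configuration shown in the signature above.
CHUNK_CONFIG = {
    "time": {
        "day": {
            "default": {"lat": 50, "lon": 50, "time": "4 years"},
            "rotated": {"rlat": 50, "rlon": 50, "time": "4 years"},
        },
    },
    "files": {
        "day": {
            "default": {"lat": 125, "lon": 125, "time": "1 year"},
            "rotated": {"rlat": 125, "rlon": 125, "time": "1 year"},
        },
    },
}


def lookup_chunks_sketch(priority, freq, dims, config=CHUNK_CONFIG):
    """Hypothetical lookup: pick the grid entry whose keys all appear in dims."""
    for chunks in config[priority][freq].values():
        # Assumed rule: every chunked dimension must exist in the dataset.
        if set(chunks).issubset(dims):
            return dict(chunks)
    raise KeyError(f"No chunk layout for dims {sorted(dims)}")


regular = lookup_chunks_sketch("time", "day", ["time", "lat", "lon"])
rotated = lookup_chunks_sketch("files", "day", ("rlat", "rlon", "time"))
```

Note that entries such as “4 years” are still strings at this stage and need to be translated into integer chunk sizes before writing.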

miranda.io._rechunk.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) → dict[str, int]

Prepare the chunks to be used to write Dataset.

This includes translating the time chunks, making sure chunks are not too small, and removing -1.

Parameters:
  • ds (xr.Dataset) – Dataset that we want to write with the chunks.

  • chunks (dict) – Desired chunks in human-readable format (e.g. “4 years” or -1).

Returns:

dict – Chunks in a format that is ready to be used to write to disk.

miranda.io._rechunk.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) → None

Rechunk datasets for better loading/reading performance.

Parameters:
  • input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.

  • output_folder (str or os.PathLike) – Target folder.

  • project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.

  • time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.

  • chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.

  • target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.

  • variables (Sequence[str], optional) – If no variables are set, an attempt is made to process all variables supported for the given project name.

  • suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.

  • output_format ({“netcdf”, “zarr”}) – Output data container type. Default: “netcdf”.

  • overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.

Returns:

None

Warning

Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).
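The facet order in the warning above can be illustrated with a short parser. This is a sketch under stated assumptions: parse_facets_sketch is a hypothetical helper, the facet values in the example file name are placeholders, and variable names containing underscores would not be handled by this simple split.

```python
from pathlib import Path

# Leading facets, in the order given by the file naming convention.
FACETS = ("variable", "time_step", "institute", "project")


def parse_facets_sketch(path):
    """Split a file name into the leading facets of the naming convention."""
    parts = Path(path).stem.split("_")
    # The fifth element must be the literal "reanalysis" marker.
    if len(parts) < len(FACETS) + 1 or parts[len(FACETS)] != "reanalysis":
        raise ValueError(f"Unexpected file name layout: {path}")
    return dict(zip(FACETS, parts))


# Placeholder institute and project names, used only for illustration.
facets = parse_facets_sketch("tas_1hr_myinst_myproject_reanalysis_2000.nc")
```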

miranda.io._rechunk.translate_time_chunk(chunks: dict, calendar: str, timesize: int) → dict

Translate chunk specification for time into a number.

Notes

  • -1 translates to timesize.

  • ‘Nyear’ translates to N times the number of days in a year of the given calendar.
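The two translation rules can be sketched as follows. This is an illustration rather than miranda's code: translate_time_chunk_sketch is a hypothetical name, the calendar lengths in DAYS_PER_YEAR are assumed values, and the accepted spellings (“4 years”, “1year”) are a guess at the input format.

```python
import re

# Assumed calendar lengths in days; miranda's own values may differ.
DAYS_PER_YEAR = {"standard": 365.25, "noleap": 365, "360_day": 360}


def translate_time_chunk_sketch(chunks, calendar, timesize):
    """Hypothetical translation of a time chunk into a number of days."""
    out = dict(chunks)  # Work on a copy so the input is left untouched.
    time_chunk = out.get("time")
    if time_chunk == -1:
        # -1 means "one chunk spanning the whole time dimension".
        out["time"] = timesize
    elif isinstance(time_chunk, str):
        match = re.fullmatch(r"(\d+)\s*years?", time_chunk.strip())
        if match is None:
            raise ValueError(f"Cannot parse time chunk: {time_chunk!r}")
        # N years becomes N times the number of days in one calendar year.
        out["time"] = int(int(match.group(1)) * DAYS_PER_YEAR[calendar])
    return out


four_years = translate_time_chunk_sketch({"time": "4 years", "lat": 50}, "noleap", 1000)
whole_axis = translate_time_chunk_sketch({"time": -1}, "noleap", 1000)
```

With the assumed “noleap” length, four_years["time"] is 1460 and whole_axis["time"] is 1000.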

miranda.io.utils module

IO Utilities module.

miranda.io.utils.delayed_write(ds: xr.Dataset, outfile: str | os.PathLike, output_format: str, overwrite: bool, target_chunks: dict | None = None, **kwargs: Any) → dask.delayed.Delayed

Stage a Dataset writing job using dask.delayed objects.

Parameters:
  • ds (xr.Dataset) – The Dataset to be written.

  • outfile (str or os.PathLike) – The output file.

  • output_format ({“netcdf”, “zarr”}) – The output format.

  • overwrite (bool) – Whether to overwrite existing files.

  • target_chunks (dict) – The target chunks for the output file.

  • **kwargs (Any) – Additional keyword arguments.

Returns:

dask.delayed.Delayed – The delayed write job.

miranda.io.utils.get_chunks_on_disk(file: str | PathLike[str] | Path) → dict[str, int]

Determine the chunks on disk for a given NetCDF or Zarr file.

Parameters:

file (str or os.PathLike or Path) – File to be examined. Supports NetCDF and Zarr.

Returns:

dict – The chunks on disk.

miranda.io.utils.get_global_attrs(file_or_dataset: str | PathLike[str] | Dataset) → dict[str, str | int]

Collect global attributes from NetCDF, Zarr, or Dataset object.

Parameters:

file_or_dataset (str or os.PathLike or xr.Dataset) – The file or dataset to be examined.

Returns:

dict – The global attributes.

miranda.io.utils.get_time_attrs(file_or_dataset: str | PathLike[str] | Dataset) → tuple[str, int]

Determine attributes related to time dimensions.

Parameters:

file_or_dataset (str or os.PathLike or xr.Dataset) – The file or dataset to be examined.

Returns:

tuple[str, int] – The calendar and the size of the time dimension.

miranda.io.utils.name_output_file(ds_or_dict: Dataset | dict[str, str], output_format: str, data_vars: str | None = None) → str

Name an output file based on facets within a Dataset or a dictionary.

Parameters:
  • ds_or_dict (xr.Dataset or dict) – A miranda-converted Dataset or a dictionary containing the appropriate facets.

  • output_format ({“netcdf”, “zarr”}) – Output filetype to be used for generating filename suffix.

  • data_vars (str, optional) – If using a Dataset, the name of the data variable to be used for naming the file.

Returns:

str – The formatted filename.

Notes

If using a dictionary, the following keys must be set: “variable”, “frequency”, “institution”, “time_start”, “time_end”.
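Before passing a dictionary, it can help to confirm that the required keys are present. The helper below is a hypothetical convenience, not part of miranda, and it only checks for the keys listed in the note above. It makes no assumption about how the file name itself is assembled.

```python
# Keys required when naming a file from a dictionary of facets.
REQUIRED_FACETS = ("variable", "frequency", "institution", "time_start", "time_end")


def missing_facets(facets):
    """Return the required keys that are absent from a facet dictionary."""
    return [key for key in REQUIRED_FACETS if key not in facets]


incomplete = missing_facets({"variable": "tas", "frequency": "day"})
```

An empty list means the dictionary carries every required facet.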

miranda.io.utils.sort_variables(files: list[str | PathLike[str] | Path], variables: Sequence[str] | None) → dict[str, list[Path]]

Sort all variables within supplied files for treatment.

Parameters:
  • files (list of str or os.PathLike or Path) – The files to be sorted.

  • variables (sequence of str, optional) – The variables to be sorted. If not provided, all variables will be grouped.

Returns:

dict[str, list[Path]] – Files sorted by variables.
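The shape of the returned mapping can be illustrated with a small grouping routine. This sketch assumes that the variable name is the text before the first underscore of each file name. That rule and the name group_by_variable_sketch are assumptions for illustration and may not match how miranda identifies variables.

```python
from collections import defaultdict
from pathlib import Path


def group_by_variable_sketch(files, variables=None):
    """Hypothetical grouping keyed on the text before the first underscore."""
    grouped = defaultdict(list)
    for file in map(Path, files):
        variable = file.name.split("_", 1)[0]
        # With no filter, every variable found is kept (assumed behaviour).
        if variables is None or variable in variables:
            grouped[variable].append(file)
    return dict(grouped)


files = ["tas_day_a.nc", "pr_day_a.nc", "tas_day_b.nc"]
everything = group_by_variable_sketch(files)
only_pr = group_by_variable_sketch(files, variables=["pr"])
```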