miranda.io package#

IO Utilities module.

miranda.io.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) None[source]#

Concatenate and rechunk zarr files.

Parameters:
  • freq (str)

  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • overwrite (bool)

  • **dask_kwargs

Returns:

None

miranda.io.discover_data(input_files: str | PathLike | list[str | PathLike] | generator, suffix: str = 'nc', recurse: bool = True) list[Path] | generator[source]#

Discover data.

Parameters:
  • input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file, a folder, or a generator of paths.

  • suffix (str) – File-ending suffix to search for. Default: “nc”.

  • recurse (bool) – Whether to recurse through folders or not. Default: True.

Returns:

list of pathlib.Path or GeneratorType of pathlib.Path

Warning

Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.
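The zarr exclusion described in the warning can be sketched in plain `pathlib` terms. `discover_sketch` below is a hypothetical illustration, not the library's implementation:

```python
from pathlib import Path
import tempfile

def discover_sketch(root, suffix="nc", recurse=True):
    """Collect files ending in `suffix`, skipping the insides of .zarr stores."""
    pattern = f"*.{suffix}"
    found = root.rglob(pattern) if recurse else root.glob(pattern)
    # A zarr store is a directory tree handled as one dataset, so any hit
    # found inside a "*.zarr" directory is discarded rather than listed.
    return sorted(p for p in found if not any(q.suffix == ".zarr" for q in p.parents))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "sub" / "tas_day.nc").touch()
    (root / "store.zarr" / "0").mkdir(parents=True)
    (root / "store.zarr" / "0" / "chunk.nc").touch()
    names = [p.name for p in discover_sketch(root)]

print(names)  # ['tas_day.nc']
```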

miranda.io.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) dict[str, int][source]#

Fetch a chunk configuration according to the given priority, frequency, and dimensions.

Parameters:
  • priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).

  • freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.

  • dims (sequence of str) – The dimension names that will be used for chunking.

  • default_config (dict) – The dictionary to use for determining the chunking configuration.

Returns:

dict[str, int]
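The lookup this performs can be sketched with a trimmed copy of the `default_config` shown in the signature. The grid-detection rule (choosing "rotated" when `rlat` is present) is an assumption for illustration; the real function may also validate inputs:

```python
# Trimmed excerpt of the default_config structure from the signature above.
DEFAULT_CONFIG = {
    "time": {
        "day": {
            "default": {"lat": 50, "lon": 50, "time": "4 years"},
            "rotated": {"rlat": 50, "rlon": 50, "time": "4 years"},
        },
    },
}

def fetch_chunk_config_sketch(priority, freq, dims, config=DEFAULT_CONFIG):
    # Rotated-pole grids carry rlat/rlon dimensions instead of lat/lon.
    grid = "rotated" if "rlat" in dims else "default"
    chunks = config[priority][freq][grid]
    # Keep only entries for dimensions actually present in the dataset.
    return {d: s for d, s in chunks.items() if d in dims}

cfg = fetch_chunk_config_sketch("time", "day", ("rlat", "rlon", "time"))
print(cfg)  # {'rlat': 50, 'rlon': 50, 'time': '4 years'}
```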

miranda.io.find_filepaths(source: str | Path | generator | list[Path | str], recursive: bool = True, file_suffixes: str | list[str] | None = None, **_) list[Path][source]#

Find all available filepaths at a given source.

Parameters:
  • source (str, Path, GeneratorType, or list[str or Path])

  • recursive (bool)

  • file_suffixes (str or list of str, optional)

Returns:

list of pathlib.Path

miranda.io.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) None[source]#

Merge and rechunk zarr files.

Parameters:
  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • project (str, optional)

  • target_chunks (dict[str, int], optional)

  • variables (Sequence of str, optional)

  • freq (str, optional)

  • suffix ({“nc”, “zarr”})

  • overwrite (bool)

Returns:

None

miranda.io.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) dict[str, int][source]#

Prepare the chunks to be used to write Dataset.

This includes translating the time chunks, making sure chunks are not too small, and removing -1.

Parameters:
  • ds (xr.Dataset) – Dataset that we want to write with the chunks.

  • chunks (dict) – Desired chunks in human-readable format (with “4 years” and -1).

Returns:

dict – Chunks in a format that is ready to be used to write to disk.
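The "-1 removal" and "not too small" steps can be sketched as follows. The `min_chunk` threshold here is illustrative, not the library's actual value, and year translation is handled separately by `translate_time_chunk`:

```python
def prepare_chunks_sketch(dim_sizes, chunks, min_chunk=10):
    """Resolve -1 specs against real dimension sizes and avoid tiny chunks."""
    out = {}
    for dim, spec in chunks.items():
        size = dim_sizes[dim]
        if spec == -1:
            spec = size          # -1 means one chunk spanning the whole dimension
        spec = min(spec, size)   # a chunk can never exceed the dimension itself
        if spec < min_chunk:
            spec = size          # avoid pathologically small chunks
        out[dim] = spec
    return out

resolved = prepare_chunks_sketch({"time": 1000, "lat": 5}, {"time": -1, "lat": 3})
print(resolved)  # {'time': 1000, 'lat': 5}
```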

miranda.io.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) None[source]#

Rechunk datasets for better loading/reading performance.

Warning

Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).

Parameters:
  • input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.

  • output_folder (str or os.PathLike) – Target folder.

  • project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.

  • time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.

  • chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.

  • target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.

  • variables (Sequence[str], optional) – If no variables set, will attempt to process all variables supported based on project name.

  • suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.

  • output_format ({“netcdf”, “zarr”}) – Default: “netcdf”.

  • overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.

Returns:

None
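The facet order required by the warning above can be illustrated with a small filename parser. This helper is hypothetical and not part of miranda:

```python
def parse_facets_sketch(filename):
    """Split a NetCDF filename into the leading facets the Warning requires:
    {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc
    """
    stem = filename.rsplit(".", 1)[0]      # drop the ".nc" suffix
    parts = stem.split("_")
    variable, time_step, institute, project = parts[:4]
    return {"variable": variable, "time_step": time_step,
            "institute": institute, "project": project}

facets = parse_facets_sketch("tas_day_ecmwf_era5_reanalysis_19790101-20201231.nc")
print(facets["variable"], facets["time_step"])  # tas day
```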

miranda.io.translate_time_chunk(chunks: dict, calendar: str, timesize: int) dict[source]#

Translate chunk specification for time into a number.

Notes

-1 translates to timesize. “N years” translates to N times the number of days in a year of the given calendar.

miranda.io.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) dict[str, Path][source]#

Write an xarray object to NetCDF or Zarr with an appropriate chunking regime.

Parameters:
  • ds (xr.DataArray or xr.Dataset) – Dataset or DataArray to be written.

  • output_path (str or os.PathLike) – Output folder path.

  • output_format ({“netcdf”, “zarr”}) – Output data container type.

  • chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.

  • overwrite (bool) – Whether to remove existing files or fail if files already exist.

  • compute (bool) – If True, the write is performed immediately. If False, a dask.delayed.Delayed object is returned, to be computed later. Default: True.

Returns:

dict[str, Path]

miranda.io.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)[source]#

Write a dictionary of Miranda-formatted datasets to disk.

Parameters:
  • dataset_dict (dict[str, xr.Dataset or None])

  • output_folder (str or os.PathLike)

  • temp_folder (str or os.PathLike)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

  • chunks (dict[str, int])

  • **dask_kwargs

Returns:

None

Submodules#

miranda.io._input module#

miranda.io._input.discover_data(input_files: str | PathLike | list[str | PathLike] | generator, suffix: str = 'nc', recurse: bool = True) list[Path] | generator[source]#

Discover data.

Parameters:
  • input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file, a folder, or a generator of paths.

  • suffix (str) – File-ending suffix to search for. Default: “nc”.

  • recurse (bool) – Whether to recurse through folders or not. Default: True.

Returns:

list of pathlib.Path or GeneratorType of pathlib.Path

Warning

Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.

miranda.io._input.find_filepaths(source: str | Path | generator | list[Path | str], recursive: bool = True, file_suffixes: str | list[str] | None = None, **_) list[Path][source]#

Find all available filepaths at a given source.

Parameters:
  • source (str, Path, GeneratorType, or list[str or Path])

  • recursive (bool)

  • file_suffixes (str or list of str, optional)

Returns:

list of pathlib.Path

miranda.io._output module#

IO Output Operations module.

miranda.io._output.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) None[source]#

Concatenate and rechunk zarr files.

Parameters:
  • freq (str)

  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • overwrite (bool)

  • **dask_kwargs

Returns:

None

miranda.io._output.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) None[source]#

Merge and rechunk zarr files.

Parameters:
  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • project (str, optional)

  • target_chunks (dict[str, int], optional)

  • variables (Sequence of str, optional)

  • freq (str, optional)

  • suffix ({“nc”, “zarr”})

  • overwrite (bool)

Returns:

None

miranda.io._output.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) dict[str, Path][source]#

Write an xarray object to NetCDF or Zarr with an appropriate chunking regime.

Parameters:
  • ds (xr.DataArray or xr.Dataset) – Dataset or DataArray to be written.

  • output_path (str or os.PathLike) – Output folder path.

  • output_format ({“netcdf”, “zarr”}) – Output data container type.

  • chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.

  • overwrite (bool) – Whether to remove existing files or fail if files already exist.

  • compute (bool) – If True, the write is performed immediately. If False, a dask.delayed.Delayed object is returned, to be computed later. Default: True.

Returns:

dict[str, Path]

miranda.io._output.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)[source]#

Write a dictionary of Miranda-formatted datasets to disk.

Parameters:
  • dataset_dict (dict[str, xr.Dataset or None])

  • output_folder (str or os.PathLike)

  • temp_folder (str or os.PathLike)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

  • chunks (dict[str, int])

  • **dask_kwargs

Returns:

None

miranda.io._rechunk module#

miranda.io._rechunk.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) dict[str, int][source]#

Fetch a chunk configuration according to the given priority, frequency, and dimensions.

Parameters:
  • priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).

  • freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.

  • dims (sequence of str) – The dimension names that will be used for chunking.

  • default_config (dict) – The dictionary to use for determining the chunking configuration.

Returns:

dict[str, int]

miranda.io._rechunk.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) dict[str, int][source]#

Prepare the chunks to be used to write Dataset.

This includes translating the time chunks, making sure chunks are not too small, and removing -1.

Parameters:
  • ds (xr.Dataset) – Dataset that we want to write with the chunks.

  • chunks (dict) – Desired chunks in human-readable format (with “4 years” and -1).

Returns:

dict – Chunks in a format that is ready to be used to write to disk.

miranda.io._rechunk.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) None[source]#

Rechunk datasets for better loading/reading performance.

Warning

Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).

Parameters:
  • input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.

  • output_folder (str or os.PathLike) – Target folder.

  • project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.

  • time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.

  • chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.

  • target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.

  • variables (Sequence[str], optional) – If no variables set, will attempt to process all variables supported based on project name.

  • suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.

  • output_format ({“netcdf”, “zarr”}) – Default: “netcdf”.

  • overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.

Returns:

None

miranda.io._rechunk.translate_time_chunk(chunks: dict, calendar: str, timesize: int) dict[source]#

Translate chunk specification for time into a number.

Notes

-1 translates to timesize. “N years” translates to N times the number of days in a year of the given calendar.

miranda.io.utils module#

IO Utilities module.

miranda.io.utils.creation_date(path_to_file: str | PathLike) float | date[source]#

Return the date that a file was created, falling back to when it was last modified if unable to determine.

See https://stackoverflow.com/a/39501288/1709587 for explanation.

Parameters:

path_to_file (str or os.PathLike)

Returns:

float or date
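The fallback described above follows the linked Stack Overflow answer; a simplified sketch:

```python
import os
import tempfile

def creation_date_sketch(path):
    """Best-effort creation time, falling back to modification time."""
    stat = os.stat(path)
    # macOS and some BSDs expose a true birth time; Linux generally does not,
    # so we fall back to the last-modified timestamp there.
    return getattr(stat, "st_birthtime", stat.st_mtime)

with tempfile.NamedTemporaryFile() as f:
    ts = creation_date_sketch(f.name)

print(ts > 0)  # True
```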

miranda.io.utils.delayed_write(ds: Dataset, outfile: str | PathLike, output_format: str, overwrite: bool, target_chunks: dict | None = None) delayed[source]#

Stage a Dataset writing job using dask.delayed objects.

Parameters:
  • ds (xr.Dataset)

  • outfile (str or os.PathLike)

  • target_chunks (dict)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

Returns:

dask.delayed.delayed

miranda.io.utils.get_chunks_on_disk(file: PathLike | str) dict[source]#

Determine the chunks on disk for a given NetCDF or Zarr file.

Parameters:

file (str or os.PathLike) – File to be examined. Supports NetCDF and Zarr.

Returns:

dict

miranda.io.utils.get_global_attrs(file_or_dataset: str | PathLike | Dataset) dict[str, str | int][source]#

Collect global attributes from a NetCDF file, Zarr store, or xarray Dataset object.

miranda.io.utils.get_time_attrs(file_or_dataset: str | os.PathLike | xr.Dataset)[source]#

Determine attributes related to time dimensions.

miranda.io.utils.name_output_file(ds_or_dict: Dataset | dict[str, str], output_format: str) str[source]#

Name an output file based on facets within a Dataset or a dictionary.

Parameters:
  • ds_or_dict (xr.Dataset or dict) – A miranda-converted Dataset or a dictionary containing the appropriate facets.

  • output_format ({“netcdf”, “zarr”}) – Output filetype to be used for generating filename suffix.

Returns:

str

Notes

If using a dictionary, the following keys must be set: “variable”, “frequency”, “institution”, “time_start”, “time_end”.
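Assuming a filename pattern built from those facets (the exact layout miranda produces may differ), the idea looks like this hypothetical sketch:

```python
def name_output_sketch(facets, output_format):
    """Hypothetical facet-based naming pattern; the real layout may differ."""
    suffix = {"netcdf": "nc", "zarr": "zarr"}[output_format]
    return (f"{facets['variable']}_{facets['frequency']}_{facets['institution']}_"
            f"{facets['time_start']}-{facets['time_end']}.{suffix}")

name = name_output_sketch(
    {"variable": "tas", "frequency": "day", "institution": "ECMWF",
     "time_start": "19790101", "time_end": "20201231"},
    "zarr",
)
print(name)  # tas_day_ECMWF_19790101-20201231.zarr
```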

miranda.io.utils.sort_variables(files: list[Path], variables: Sequence[str]) dict[str, list[Path]][source]#

Sort all variables within supplied files for treatment.

Parameters:
  • files (list of Path)

  • variables (sequence of str)

Returns:

dict[str, list[Path]]
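A sketch of the grouping, assuming filenames lead with the variable facet as described in the rechunk_files warning. This helper is hypothetical, not the library's implementation:

```python
from pathlib import Path

def sort_variables_sketch(files, variables):
    """Group paths by the variable facet leading each filename."""
    groups = {v: [] for v in variables}
    for f in files:
        var = f.name.split("_", 1)[0]   # "tas_day_a.nc" -> "tas"
        if var in groups:
            groups[var].append(f)
    return groups

files = [Path("tas_day_a.nc"), Path("pr_day_a.nc"), Path("tas_day_b.nc")]
grouped = sort_variables_sketch(files, ["tas", "pr"])
print({v: [p.name for p in ps] for v, ps in grouped.items()})
```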