miranda.io package#

IO Utilities module.

miranda.io.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) None[source]#

Concatenate and rechunk zarr files.

Parameters:
  • freq (str)

  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • overwrite (bool)

  • **dask_kwargs

Returns:

None

miranda.io.discover_data(input_files: str | PathLike | list[str | PathLike] | generator, suffix: str = 'nc', recurse: bool = True) list[Path] | generator[source]#

Discover data.

Parameters:
  • input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file, a folder, or a generator of paths.

  • suffix (str) – File-ending suffix to search for. Default: “nc”.

  • recurse (bool) – Whether to recurse through folders or not. Default: True.

Returns:

list of pathlib.Path or GeneratorType of pathlib.Path

Warning

Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.
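The zarr exclusion described in the warning can be sketched in plain `pathlib` terms. `discover_sketch` below is a hypothetical illustration, not the library's implementation:

```python
from pathlib import Path
import tempfile

def discover_sketch(root, suffix="nc", recurse=True):
    """Collect files ending in `suffix`, skipping the insides of .zarr stores."""
    pattern = f"*.{suffix}"
    found = root.rglob(pattern) if recurse else root.glob(pattern)
    # A zarr store is a directory tree handled as one dataset, so any hit
    # found inside a "*.zarr" directory is discarded rather than listed.
    return sorted(p for p in found if not any(q.suffix == ".zarr" for q in p.parents))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "sub" / "tas_day.nc").touch()
    (root / "store.zarr" / "0").mkdir(parents=True)
    (root / "store.zarr" / "0" / "chunk.nc").touch()
    names = [p.name for p in discover_sketch(root)]

print(names)  # ['tas_day.nc']
```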

miranda.io.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) dict[str, int][source]#

Fetch a chunk configuration according to the given priority, frequency, and dimensions.

Parameters:
  • priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).

  • freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.

  • dims (sequence of str) – The dimension names that will be used for chunking.

  • default_config (dict) – The dictionary to use for determining the chunking configuration.

Returns:

dict[str, int]
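The lookup this performs can be sketched with a trimmed copy of the `default_config` shown in the signature. The grid-detection rule (choosing "rotated" when `rlat` is present) is an assumption for illustration; the real function may also validate inputs:

```python
# Trimmed excerpt of the default_config structure from the signature above.
DEFAULT_CONFIG = {
    "time": {
        "day": {
            "default": {"lat": 50, "lon": 50, "time": "4 years"},
            "rotated": {"rlat": 50, "rlon": 50, "time": "4 years"},
        },
    },
}

def fetch_chunk_config_sketch(priority, freq, dims, config=DEFAULT_CONFIG):
    # Rotated-pole grids carry rlat/rlon dimensions instead of lat/lon.
    grid = "rotated" if "rlat" in dims else "default"
    chunks = config[priority][freq][grid]
    # Keep only entries for dimensions actually present in the dataset.
    return {d: s for d, s in chunks.items() if d in dims}

cfg = fetch_chunk_config_sketch("time", "day", ("rlat", "rlon", "time"))
print(cfg)  # {'rlat': 50, 'rlon': 50, 'time': '4 years'}
```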

miranda.io.find_filepaths(source: str | Path | generator | list[Path | str], recursive: bool = True, file_suffixes: str | list[str] | None = None, **_) list[Path][source]#

Find all available filepaths at a given source.

Parameters:
  • source (str, Path, GeneratorType, or list[str or Path])

  • recursive (bool)

  • file_suffixes (str or list of str, optional)

Returns:

list of pathlib.Path

miranda.io.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) None[source]#

Merge and rechunk zarr files.

Parameters:
  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • project (str, optional)

  • target_chunks (dict[str, int], optional)

  • variables (Sequence of str, optional)

  • freq (str, optional)

  • suffix ({“nc”, “zarr”})

  • overwrite (bool)

Returns:

None

miranda.io.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) dict[str, int][source]#

Prepare the chunks to be used to write Dataset.

This includes translating the time chunks, making sure chunks are not too small, and removing -1.

Parameters:
  • ds (xr.Dataset) – Dataset that we want to write with the chunks.

  • chunks (dict) – Desired chunks in human-readable format (with “4 years” and -1).

Returns:

dict – Chunks in a format that is ready to be used to write to disk.
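The "-1 removal" and "not too small" steps can be sketched as follows. The `min_chunk` threshold here is illustrative, not the library's actual value, and year translation is handled separately by `translate_time_chunk`:

```python
def prepare_chunks_sketch(dim_sizes, chunks, min_chunk=10):
    """Resolve -1 specs against real dimension sizes and avoid tiny chunks."""
    out = {}
    for dim, spec in chunks.items():
        size = dim_sizes[dim]
        if spec == -1:
            spec = size          # -1 means one chunk spanning the whole dimension
        spec = min(spec, size)   # a chunk can never exceed the dimension itself
        if spec < min_chunk:
            spec = size          # avoid pathologically small chunks
        out[dim] = spec
    return out

resolved = prepare_chunks_sketch({"time": 1000, "lat": 5}, {"time": -1, "lat": 3})
print(resolved)  # {'time': 1000, 'lat': 5}
```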

miranda.io.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) None[source]#

Rechunk datasets for better loading/reading performance.

Warning

Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).

Parameters:
  • input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.

  • output_folder (str or os.PathLike) – Target folder.

  • project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.

  • time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.

  • chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.

  • target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.

  • variables (Sequence[str], optional) – If no variables set, will attempt to process all variables supported based on project name.

  • suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.

  • output_format ({“netcdf”, “zarr”}) – Default: “netcdf”.

  • overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.

Returns:

None
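The facet order required by the warning above can be illustrated with a small filename parser. This helper is hypothetical and not part of miranda:

```python
def parse_facets_sketch(filename):
    """Split a NetCDF filename into the leading facets the Warning requires:
    {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc
    """
    stem = filename.rsplit(".", 1)[0]      # drop the ".nc" suffix
    parts = stem.split("_")
    variable, time_step, institute, project = parts[:4]
    return {"variable": variable, "time_step": time_step,
            "institute": institute, "project": project}

facets = parse_facets_sketch("tas_day_ecmwf_era5_reanalysis_19790101-20201231.nc")
print(facets["variable"], facets["time_step"])  # tas day
```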

miranda.io.translate_time_chunk(chunks: dict, calendar: str, timesize: int) dict[source]#

Translate chunk specification for time into a number.

Notes

-1 translates to timesize. “N years” translates to N times the number of days in a year of the given calendar.

miranda.io.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) dict[str, Path][source]#

Write an xarray object to NetCDF or Zarr with an appropriate chunking regime.

Parameters:
  • ds (xr.DataArray or xr.Dataset) – Dataset or DataArray to be written.

  • output_path (str or os.PathLike) – Output folder path.

  • output_format ({“netcdf”, “zarr”}) – Output data container type.

  • chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.

  • overwrite (bool) – Whether to remove existing files or fail if files already exist.

  • compute (bool) – If True, the write is performed immediately. If False, a dask.delayed.Delayed object is returned, to be computed later. Default: True.

Returns:

dict[str, Path]

miranda.io.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)[source]#

Write a dictionary of Miranda-formatted datasets to disk.

Parameters:
  • dataset_dict (dict[str, xr.Dataset or None])

  • output_folder (str or os.PathLike)

  • temp_folder (str or os.PathLike)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

  • chunks (dict[str, int])

  • **dask_kwargs

Returns:

None

Submodules#

miranda.io._input module#

miranda.io._input.discover_data(input_files: str | PathLike | list[str | PathLike] | generator, suffix: str = 'nc', recurse: bool = True) list[Path] | generator[source]#

Discover data.

Parameters:
  • input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file, a folder, or a generator of paths.

  • suffix (str) – File-ending suffix to search for. Default: “nc”.

  • recurse (bool) – Whether to recurse through folders or not. Default: True.

Returns:

list of pathlib.Path or GeneratorType of pathlib.Path

Warning

Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.

miranda.io._input.find_filepaths(source: str | Path | generator | list[Path | str], recursive: bool = True, file_suffixes: str | list[str] | None = None, **_) list[Path][source]#

Find all available filepaths at a given source.

Parameters:
  • source (str, Path, GeneratorType, or list[str or Path])

  • recursive (bool)

  • file_suffixes (str or list of str, optional)

Returns:

list of pathlib.Path

miranda.io._output module#

IO Output Operations module.

miranda.io._output.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) None[source]#

Concatenate and rechunk zarr files.

Parameters:
  • freq (str)

  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • overwrite (bool)

  • **dask_kwargs

Returns:

None

miranda.io._output.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) None[source]#

Merge and rechunk zarr files.

Parameters:
  • input_folder (str or os.PathLike)

  • output_folder (str or os.PathLike)

  • project (str, optional)

  • target_chunks (dict[str, int], optional)

  • variables (Sequence of str, optional)

  • freq (str, optional)

  • suffix ({“nc”, “zarr”})

  • overwrite (bool)

Returns:

None

miranda.io._output.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) dict[str, Path][source]#

Write an xarray object to NetCDF or Zarr with an appropriate chunking regime.

Parameters:
  • ds (xr.DataArray or xr.Dataset) – Dataset or DataArray to be written.

  • output_path (str or os.PathLike) – Output folder path.

  • output_format ({“netcdf”, “zarr”}) – Output data container type.

  • chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.

  • overwrite (bool) – Whether to remove existing files or fail if files already exist.

  • compute (bool) – If True, the write is performed immediately. If False, a dask.delayed.Delayed object is returned, to be computed later. Default: True.

Returns:

dict[str, Path]

miranda.io._output.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)[source]#

Write a dictionary of Miranda-formatted datasets to disk.

Parameters:
  • dataset_dict (dict[str, xr.Dataset or None])

  • output_folder (str or os.PathLike)

  • temp_folder (str or os.PathLike)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

  • chunks (dict[str, int])

  • **dask_kwargs

Returns:

None

miranda.io._rechunk module#

miranda.io._rechunk.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) dict[str, int][source]#

Fetch a chunk configuration according to the given priority, frequency, and dimensions.

Parameters:
  • priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).

  • freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.

  • dims (sequence of str) – The dimension names that will be used for chunking.

  • default_config (dict) – The dictionary to use for determining the chunking configuration.

Returns:

dict[str, int]

miranda.io._rechunk.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) dict[str, int][source]#

Prepare the chunks to be used to write Dataset.

This includes translating the time chunks, making sure chunks are not too small, and removing -1.

Parameters:
  • ds (xr.Dataset) – Dataset that we want to write with the chunks.

  • chunks (dict) – Desired chunks in human-readable format (with “4 years” and -1).

Returns:

dict – Chunks in a format that is ready to be used to write to disk.

miranda.io._rechunk.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) None[source]#

Rechunk datasets for better loading/reading performance.

Warning

Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).

Parameters:
  • input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.

  • output_folder (str or os.PathLike) – Target folder.

  • project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.

  • time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.

  • chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.

  • target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.

  • variables (Sequence[str], optional) – If no variables set, will attempt to process all variables supported based on project name.

  • suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.

  • output_format ({“netcdf”, “zarr”}) – Default: “netcdf”.

  • overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.

Returns:

None

miranda.io._rechunk.translate_time_chunk(chunks: dict, calendar: str, timesize: int) dict[source]#

Translate chunk specification for time into a number.

Notes

-1 translates to timesize. “N years” translates to N times the number of days in a year of the given calendar.

miranda.io.utils module#

IO Utilities module.

miranda.io.utils.creation_date(path_to_file: str | PathLike) float | date[source]#

Return the date that a file was created, falling back to when it was last modified if unable to determine.

See https://stackoverflow.com/a/39501288/1709587 for explanation.

Parameters:

path_to_file (str or os.PathLike)

Returns:

float or date
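The fallback described above follows the linked Stack Overflow answer; a simplified sketch:

```python
import os
import tempfile

def creation_date_sketch(path):
    """Best-effort creation time, falling back to modification time."""
    stat = os.stat(path)
    # macOS and some BSDs expose a true birth time; Linux generally does not,
    # so we fall back to the last-modified timestamp there.
    return getattr(stat, "st_birthtime", stat.st_mtime)

with tempfile.NamedTemporaryFile() as f:
    ts = creation_date_sketch(f.name)

print(ts > 0)  # True
```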

miranda.io.utils.delayed_write(ds: Dataset, outfile: str | PathLike, output_format: str, overwrite: bool, target_chunks: dict | None = None) delayed[source]#

Stage a Dataset writing job using dask.delayed objects.

Parameters:
  • ds (xr.Dataset)

  • outfile (str or os.PathLike)

  • target_chunks (dict)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

Returns:

dask.delayed.delayed

miranda.io.utils.get_chunks_on_disk(file: PathLike | str) dict[source]#

Determine the chunks on disk for a given NetCDF or Zarr file.

Parameters:

file (str or os.PathLike) – File to be examined. Supports NetCDF and Zarr.

Returns:

dict

miranda.io.utils.get_global_attrs(file_or_dataset: str | PathLike | Dataset) dict[str, str | int][source]#

Collect global attributes from a NetCDF file, Zarr store, or xarray Dataset object.

miranda.io.utils.get_time_attrs(file_or_dataset: str | os.PathLike | xr.Dataset)[source]#

Determine attributes related to time dimensions.

miranda.io.utils.name_output_file(ds_or_dict: Dataset | dict[str, str], output_format: str) str[source]#

Name an output file based on facets within a Dataset or a dictionary.

Parameters:
  • ds_or_dict (xr.Dataset or dict) – A miranda-converted Dataset or a dictionary containing the appropriate facets.

  • output_format ({“netcdf”, “zarr”}) – Output filetype to be used for generating filename suffix.

Returns:

str

Notes

If using a dictionary, the following keys must be set: “variable”, “frequency”, “institution”, “time_start”, “time_end”.
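Assuming a filename pattern built from those facets (the exact layout miranda produces may differ), the idea looks like this hypothetical sketch:

```python
def name_output_sketch(facets, output_format):
    """Hypothetical facet-based naming pattern; the real layout may differ."""
    suffix = {"netcdf": "nc", "zarr": "zarr"}[output_format]
    return (f"{facets['variable']}_{facets['frequency']}_{facets['institution']}_"
            f"{facets['time_start']}-{facets['time_end']}.{suffix}")

name = name_output_sketch(
    {"variable": "tas", "frequency": "day", "institution": "ECMWF",
     "time_start": "19790101", "time_end": "20201231"},
    "zarr",
)
print(name)  # tas_day_ECMWF_19790101-20201231.zarr
```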

miranda.io.utils.sort_variables(files: list[Path], variables: Sequence[str]) dict[str, list[Path]][source]#

Sort all variables within supplied files for treatment.

Parameters:
  • files (list of Path)

  • variables (sequence of str)

Returns:

dict[str, list[Path]]
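A sketch of the grouping, assuming filenames lead with the variable facet as described in the rechunk_files warning. This helper is hypothetical, not the library's implementation:

```python
from pathlib import Path

def sort_variables_sketch(files, variables):
    """Group paths by the variable facet leading each filename."""
    groups = {v: [] for v in variables}
    for f in files:
        var = f.name.split("_", 1)[0]   # "tas_day_a.nc" -> "tas"
        if var in groups:
            groups[var].append(f)
    return groups

files = [Path("tas_day_a.nc"), Path("pr_day_a.nc"), Path("tas_day_b.nc")]
grouped = sort_variables_sketch(files, ["tas", "pr"])
print({v: [p.name for p in ps] for v, ps in grouped.items()})
```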