miranda.io package
IO Utilities module.
Submodules
miranda.io._input module
- miranda.io._input.discover_data(input_files: str | Path | list[str | Path] | GeneratorType, suffix: str = 'nc', recurse: bool = True) → list[Path] | GeneratorType
Discover data.
- Parameters:
input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file, a folder, or a generator of paths.
suffix (str) – File-ending suffix to search for. Default: “nc”.
recurse (bool) – Whether to recurse through folders or not. Default: True.
- Returns:
list of pathlib.Path or GeneratorType of pathlib.Path
Warning
Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.
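The discovery behaviour described above can be approximated with pathlib alone. The following is a minimal sketch, not miranda's implementation: the helper name `find_files` is hypothetical, it always returns a sorted list (whereas discover_data may return an unexpanded generator), and it does not reproduce the special handling of “.zarr” stores.

```python
from pathlib import Path
import tempfile


def find_files(root, suffix="nc", recurse=True):
    """Hypothetical stand-in for discover_data: list files ending in ``suffix``."""
    root = Path(root)
    if root.is_file():
        return [root]
    pattern = f"*.{suffix}"
    # rglob descends into sub-folders; glob stays at the top level.
    matches = root.rglob(pattern) if recurse else root.glob(pattern)
    return sorted(matches)


# Build a throwaway folder tree to demonstrate both modes.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    (base / "a.nc").touch()
    (base / "sub" / "b.nc").touch()
    (base / "notes.txt").touch()

    recursive_names = [p.name for p in find_files(base)]
    flat_names = [p.name for p in find_files(base, recurse=False)]
```

With `recurse=False` only the top-level file is found; the file in the sub-folder appears only in the recursive search.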
miranda.io._output module
IO Output Operations module.
- miranda.io._output.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) → None
Concatenate and rechunk zarr files.
- Parameters:
freq (str)
input_folder (str or os.PathLike)
output_folder (str or os.PathLike)
overwrite (bool)
**dask_kwargs
- Returns:
None
- miranda.io._output.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) → None
Merge and rechunk zarr files.
- Parameters:
input_folder (str or os.PathLike)
output_folder (str or os.PathLike)
project (str, optional)
target_chunks (dict[str, int], optional)
variables (Sequence of str, optional)
freq (str, optional)
suffix ({“nc”, “zarr”})
overwrite (bool)
- Returns:
None
- miranda.io._output.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, output_name: str | None = None, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) → dict[str, Path]
Write an xarray object to NetCDF or Zarr with an appropriate chunking regime.
- Parameters:
ds (xr.DataArray or xr.Dataset) – Dataset or DataArray.
output_path (str or os.PathLike) – Output folder path.
output_format ({“netcdf”, “zarr”}) – Output data container type.
output_name (str, optional) – Output file name.
chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.
overwrite (bool) – Whether to remove existing files or fail if files already exist.
compute (bool) – If True, files are written as soon as the function is called. If False, a dask.Delayed object is returned that can be computed later. Default: True.
- Returns:
dict[str, Path]
- miranda.io._output.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)
Write datasets from a Miranda-formatted dataset dictionary.
- Parameters:
dataset_dict (dict[str, xr.Dataset or None])
output_folder (str or os.PathLike)
temp_folder (str or os.PathLike)
output_format ({“netcdf”, “zarr”})
overwrite (bool)
chunks (dict[str, int])
**dask_kwargs
- Returns:
None
miranda.io._rechunk module
- miranda.io._rechunk.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'stations': {'1hr': {'default': {'station': 50, 'time': '5 years'}}, 'day': {'default': {'station': 200, 'time': '10 years'}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) → dict[str, int]
Fetch data chunking configuration.
- Parameters:
priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).
freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.
dims (sequence of str) – The dimension names that will be used for chunking.
default_config (dict) – The dictionary to use for determining the chunking configuration.
- Returns:
dict[str, int]
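The default configuration is nested by priority, then frequency, then grid type. The sketch below shows one plausible way to walk that structure. It is an illustration under stated assumptions rather than miranda's code: the name `lookup_chunks` is hypothetical, the configuration is an excerpt of the default shown in the signature, and the rule for choosing “rotated” (the presence of both rlat and rlon in dims) is an assumption.

```python
# Excerpt of the default configuration shown in the signature above.
CHUNK_CONFIG = {
    "files": {
        "day": {
            "default": {"lat": 125, "lon": 125, "time": "1 year"},
            "rotated": {"rlat": 125, "rlon": 125, "time": "1 year"},
        },
    },
    "time": {
        "day": {
            "default": {"lat": 50, "lon": 50, "time": "4 years"},
            "rotated": {"rlat": 50, "rlon": 50, "time": "4 years"},
        },
    },
}


def lookup_chunks(priority, freq, dims, config=CHUNK_CONFIG):
    """Hypothetical lookup: pick the entry matching priority, freq, and grid type."""
    try:
        by_grid = config[priority][freq]
    except KeyError as err:
        raise ValueError(f"No chunk configuration for {priority!r}/{freq!r}") from err
    # Assumption: rotated-pole grids are recognised by their rlat/rlon dimensions.
    grid = "rotated" if {"rlat", "rlon"} <= set(dims) else "default"
    # Return a copy so callers cannot mutate the shared configuration.
    return dict(by_grid[grid])


regular = lookup_chunks("time", "day", ["time", "lat", "lon"])
rotated = lookup_chunks("files", "day", ["time", "rlat", "rlon"])
```

Note that the time entries in this excerpt are human-readable strings such as “4 years”, so a further translation step is needed before they can be used as integer chunk sizes.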
- miranda.io._rechunk.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) → dict[str, int]
Prepare the chunks to be used to write Dataset.
This includes translating the time chunks, making sure chunks are not too small, and removing -1.
- Parameters:
ds (xr.Dataset) – Dataset that we want to write with the chunks.
chunks (dict) – Desired chunks in human-readable format (for example, “4 years” or -1).
- Returns:
dict – Chunks in a format that is ready to be used to write to disk.
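To make the translation step concrete, here is a minimal sketch that works on plain dictionaries instead of an xarray Dataset. Everything in it is an assumption made for illustration: the helper name `translate_chunks`, the `steps_per_year` argument, the reading of -1 as “one chunk spanning the whole dimension”, and the capping of chunks at the dimension length. The check that chunks are not too small is omitted.

```python
def translate_chunks(chunks, dim_sizes, steps_per_year):
    """Hypothetical helper: turn human-readable chunks into integer chunk sizes."""
    resolved = {}
    for dim, size in chunks.items():
        if isinstance(size, str):
            # Assumption: strings look like "4 years" or "1 year".
            count, _unit = size.split()
            size = int(count) * steps_per_year
        if size == -1:
            # Assumption: -1 means "one chunk spanning the whole dimension".
            size = dim_sizes[dim]
        # Never ask for a chunk larger than the dimension itself.
        resolved[dim] = min(size, dim_sizes[dim])
    return resolved


# Daily data on a small grid: 365 time steps per year (leap days ignored).
daily = translate_chunks(
    {"time": "4 years", "lat": 50, "lon": -1},
    dim_sizes={"time": 14610, "lat": 40, "lon": 360},
    steps_per_year=365,
)
```

Here “4 years” becomes 1460 time steps, the oversized lat chunk is capped at 40, and -1 expands to the full lon length of 360.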
- miranda.io._rechunk.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) → None
Rechunks dataset for better loading/reading performance.
- Parameters:
input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.
output_folder (str or os.PathLike) – Target folder.
project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.
time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.
chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.
target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.
variables (Sequence[str], optional) – If no variables set, will attempt to process all variables supported based on project name.
suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.
output_format ({“netcdf”, “zarr”}) – Default: “netcdf”.
overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.
- Returns:
None
Warning
Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).
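The facet order required by the warning above can be checked before rechunking. The following sketch parses a file name into its leading facets. The helper name `parse_facets` and the sample file name are hypothetical, and the approach assumes that no facet contains an underscore.

```python
from pathlib import Path

FACETS = ("variable", "time_step", "institute", "project")


def parse_facets(filename):
    """Split a file name into the leading facets named in the warning above."""
    parts = Path(filename).stem.split("_")
    # The fifth token must be the literal "reanalysis" marker.
    if len(parts) < 5 or parts[4] != "reanalysis":
        raise ValueError(f"Unexpected file name layout: {filename}")
    return dict(zip(FACETS, parts[:4]))


# Hypothetical file name that follows the documented order.
facets = parse_facets("tas_day_ecmwf_era5_reanalysis_1990.nc")
```

A file name that does not follow the layout raises a ValueError instead of being silently mis-parsed.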
miranda.io.utils module
IO Utilities module.
- miranda.io.utils.delayed_write(ds: xr.Dataset, outfile: str | os.PathLike, output_format: str, overwrite: bool, target_chunks: dict | None = None, **kwargs: Any) → dask.delayed.Delayed
Stage a Dataset writing job using dask.delayed objects.
- Parameters:
ds (xr.Dataset) – The Dataset to be written.
outfile (str or os.PathLike) – The output file.
output_format ({“netcdf”, “zarr”}) – The output format.
overwrite (bool) – Whether to overwrite existing files.
target_chunks (dict) – The target chunks for the output file.
**kwargs (Any) – Additional keyword arguments.
- Returns:
dask.delayed.Delayed – The delayed write job.
- miranda.io.utils.get_chunks_on_disk(file: str | PathLike[str] | Path) → dict[str, int]
Determine the chunks on disk for a given NetCDF or Zarr file.
- Parameters:
file (str or os.PathLike or Path) – File to be examined. Supports NetCDF and Zarr.
- Returns:
dict – The chunks on disk.
- miranda.io.utils.get_global_attrs(file_or_dataset: str | PathLike[str] | Dataset) → dict[str, str | int]
Collect global attributes from NetCDF, Zarr, or Dataset object.
- Parameters:
file_or_dataset (str or os.PathLike or xr.Dataset) – The file or dataset to be examined.
- Returns:
dict – The global attributes.
- miranda.io.utils.get_time_attrs(file_or_dataset: str | PathLike[str] | Dataset) → tuple[str, int]
Determine attributes related to time dimensions.
- Parameters:
file_or_dataset (str or os.PathLike or xr.Dataset) – The file or dataset to be examined.
- Returns:
tuple – The calendar and time.
- miranda.io.utils.name_output_file(ds_or_dict: Dataset | dict[str, str], output_format: str, data_vars: str | None = None) → str
Name an output file based on facets within a Dataset or a dictionary.
- Parameters:
ds_or_dict (xr.Dataset or dict) – A miranda-converted Dataset or a dictionary containing the appropriate facets.
output_format ({“netcdf”, “zarr”}) – Output filetype to be used for generating filename suffix.
data_vars (str, optional) – If using a Dataset, the name of the data variable to be used for naming the file.
- Returns:
str – The formatted filename.
Notes
If using a dictionary, the following keys must be set: “variable”, “frequency”, “institution”, “time_start”, “time_end”.
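Because a dictionary input must carry all five keys, it can be useful to validate it before asking for a file name. The sketch below does only that check. The helper name `missing_facets` and the sample values are hypothetical, and the sketch makes no claim about how name_output_file formats the final name.

```python
REQUIRED_FACETS = ("variable", "frequency", "institution", "time_start", "time_end")


def missing_facets(facets):
    """Return the required keys that are absent from ``facets``, in documented order."""
    return [key for key in REQUIRED_FACETS if key not in facets]


# Hypothetical facet values, for illustration only.
complete = {
    "variable": "tas",
    "frequency": "day",
    "institution": "ECMWF",
    "time_start": "1990",
    "time_end": "1999",
}
partial = {"variable": "tas", "frequency": "day"}

complete_missing = missing_facets(complete)
partial_missing = missing_facets(partial)
```

An empty list means the dictionary has every key listed in the note.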
- miranda.io.utils.sort_variables(files: list[str | PathLike[str] | Path], variables: Sequence[str] | None) → dict[str, list[Path]]
Sort all variables within supplied files for treatment.
- Parameters:
files (list of str or os.PathLike or Path) – The files to be sorted.
variables (sequence of str, optional) – The variables to be sorted. If not provided, all variables will be grouped.
- Returns:
dict[str, list[Path]] – Files sorted by variables.
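The shape of the returned mapping can be illustrated with a small stand-alone sketch. It is not miranda's implementation: the helper name `group_by_variable` is hypothetical, and it assumes that the variable name is the first underscore-separated token of each file name.

```python
from collections import defaultdict
from pathlib import Path


def group_by_variable(files, variables=None):
    """Hypothetical grouping: map each variable name to the files that start with it."""
    grouped = defaultdict(list)
    for file in map(Path, files):
        # Assumption: the variable is the first underscore-separated token.
        variable = file.name.split("_")[0]
        if variables is None or variable in variables:
            grouped[variable].append(file)
    return dict(grouped)


# Hypothetical file names, for illustration only.
files = ["tas_day_1990.nc", "pr_day_1990.nc", "tas_day_1991.nc"]
everything = group_by_variable(files)
only_pr = group_by_variable(files, variables=["pr"])
```

Passing `variables` restricts the result to the requested names; omitting it keeps every variable found.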