miranda.io package#
IO Utilities module.
- miranda.io.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) None [source]#
Concatenate and rechunk zarr files.
- Parameters:
freq (str)
input_folder (str or os.PathLike)
output_folder (str or os.PathLike)
overwrite (bool)
**dask_kwargs
- Returns:
None
- miranda.io.discover_data(input_files: str | PathLike | list[str | PathLike] | generator, suffix: str = 'nc', recurse: bool = True) list[Path] | generator [source]#
Discover data.
- Parameters:
input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file, a folder, or a generator of paths.
suffix (str) – File-ending suffix to search for. Default: “nc”.
recurse (bool) – Whether to recurse through folders or not. Default: True.
- Returns:
list of pathlib.Path or GeneratorType of pathlib.Path
Warning
Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.
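The discovery behaviour described above can be sketched with plain `pathlib` globbing. This is an illustrative re-implementation, not miranda's internals, and `discover_data_sketch` is a hypothetical name:

```python
# A minimal sketch of the discovery behaviour described above, using plain
# pathlib globbing. Names here are illustrative, not miranda's internals.
from pathlib import Path


def discover_data_sketch(input_files, suffix: str = "nc", recurse: bool = True):
    input_files = Path(input_files)
    if input_files.is_file():
        return [input_files]
    pattern = f"*.{suffix}"
    # Mirror the warning above: never recurse through ".zarr" stores.
    if recurse and suffix != "zarr":
        return sorted(input_files.rglob(pattern))
    return sorted(input_files.glob(pattern))
```

Note that the real function may return a generator rather than a sorted list, per its signature.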
- miranda.io.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) dict[str, int] [source]#
Fetch a chunk configuration for a given priority, frequency, and set of dimensions.
- Parameters:
priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).
freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.
dims (sequence of str) – The dimension names that will be used for chunking.
default_config (dict) – The dictionary to use for determining the chunking configuration.
- Returns:
dict[str, int]
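The lookup this performs can be sketched as follows. The rule that rotated-pole dimensions (rlat/rlon) select the “rotated” entry is an assumption inferred from the shape of default_config above, and `fetch_chunk_config_sketch` is an illustrative name:

```python
def fetch_chunk_config_sketch(priority: str, freq: str, dims, config: dict) -> dict:
    # Assumed selection rule: use the "rotated" entry when rotated-pole
    # dimensions (rlat/rlon) are present, otherwise the "default" entry.
    kind = "rotated" if "rlat" in dims else "default"
    return config[priority][freq][kind]


# Trimmed copy of the default configuration shown in the signature above.
CONFIG = {
    "time": {
        "day": {
            "default": {"lat": 50, "lon": 50, "time": "4 years"},
            "rotated": {"rlat": 50, "rlon": 50, "time": "4 years"},
        }
    }
}
```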
- miranda.io.find_filepaths(source: str | Path | generator | list[Path | str], recursive: bool = True, file_suffixes: str | list[str] | None = None, **_) list[Path] [source]#
Find all available filepaths at a given source.
- Parameters:
source (str, Path, GeneratorType, or list[str or Path])
recursive (bool)
file_suffixes (str or list of str, optional)
- Returns:
list of pathlib.Path
- miranda.io.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) None [source]#
Merge and rechunk zarr files.
- Parameters:
input_folder (str or os.PathLike)
output_folder (str or os.PathLike)
project (str, optional)
target_chunks (dict[str, int], optional)
variables (Sequence of str, optional)
freq (str, optional)
suffix ({“nc”, “zarr”})
overwrite (bool)
- Returns:
None
- miranda.io.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) dict[str, int] [source]#
Prepare the chunks to be used to write Dataset.
This includes translating the time chunks, making sure chunks are not too small, and removing -1.
- Parameters:
ds (xr.Dataset) – Dataset that we want to write with the chunks.
chunks (dict) – Desired chunks in human-readable format (with “4 years” and -1).
- Returns:
dict – Chunks in a format that is ready to be used to write to disk.
- miranda.io.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) None [source]#
Rechunk datasets for better loading/reading performance.
Warning
Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).
- Parameters:
input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.
output_folder (str or os.PathLike) – Target folder.
project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.
time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.
chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.
target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.
variables (Sequence[str], optional) – If not set, will attempt to process all supported variables based on the project name.
suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.
output_format ({“netcdf”, “zarr”}) – Default: “netcdf”.
overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.
- Returns:
None
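The facet-ordered filename convention in the warning above can be illustrated with a hypothetical parser (not part of miranda's API):

```python
def parse_facets_sketch(filename: str) -> dict:
    # Split a name like "tas_day_ecmwf_era5_reanalysis_1990.nc" into its
    # leading facets, per the ordering documented in the warning above.
    variable, time_step, institute, project = filename.split("_")[:4]
    return {
        "variable": variable,
        "time_step": time_step,
        "institute": institute,
        "project": project,
    }
```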
- miranda.io.translate_time_chunk(chunks: dict, calendar: str, timesize: int) dict [source]#
Translate chunk specification for time into a number.
Notes
-1 translates to timesize. “N years” translates to N times the number of days in a year of the given calendar.
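The rule in the Notes can be re-implemented as a sketch. This is hypothetical code, not miranda's actual function, and the per-calendar year lengths are assumptions:

```python
# Hypothetical re-implementation of the translation rule in the Notes above;
# the real function may differ in detail. Calendar lengths are assumptions.
DAYS_PER_YEAR = {"noleap": 365, "360_day": 360, "all_leap": 366}


def translate_time_chunk_sketch(chunks: dict, calendar: str, timesize: int) -> dict:
    out = {}
    for dim, size in chunks.items():
        if dim == "time" and size == -1:
            size = timesize
        elif dim == "time" and isinstance(size, str):
            # "N years" -> N * number of days in a year of `calendar`.
            n = int(size.split()[0])
            size = n * DAYS_PER_YEAR[calendar]
        out[dim] = size
    return out
```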
- miranda.io.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) dict[str, Path] [source]#
Write xarray object to NetCDF or Zarr with appropriate chunking regime.
- Parameters:
ds (xr.DataArray or xr.Dataset) – Dataset or DataArray to be written.
output_path (str or os.PathLike) – Output folder path.
output_format ({“netcdf”, “zarr”}) – Output data container type.
chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.
overwrite (bool) – Whether to remove existing files or fail if files already exist.
compute (bool) – If True, the file is written immediately. If False, a dask.Delayed object that can be computed later is returned. Default: True.
- Returns:
dict[str, Path]
- miranda.io.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)[source]#
Write a dictionary of Miranda-formatted datasets to disk.
- Parameters:
dataset_dict (dict[str, xr.Dataset or None])
output_folder (str or os.PathLike)
temp_folder (str or os.PathLike)
output_format ({“netcdf”, “zarr”})
overwrite (bool)
chunks (dict[str, int])
**dask_kwargs
- Returns:
None
Submodules#
miranda.io._input module#
- miranda.io._input.discover_data(input_files: str | PathLike | list[str | PathLike] | generator, suffix: str = 'nc', recurse: bool = True) list[Path] | generator [source]#
Discover data.
- Parameters:
input_files (str, pathlib.Path, list of str or Path, or GeneratorType) – Path or string to a file, a folder, or a generator of paths.
suffix (str) – File-ending suffix to search for. Default: “nc”.
recurse (bool) – Whether to recurse through folders or not. Default: True.
- Returns:
list of pathlib.Path or GeneratorType of pathlib.Path
Warning
Recursion through “.zarr” files is explicitly disabled. Recursive globs and generators will not be expanded/sorted.
- miranda.io._input.find_filepaths(source: str | Path | generator | list[Path | str], recursive: bool = True, file_suffixes: str | list[str] | None = None, **_) list[Path] [source]#
Find all available filepaths at a given source.
- Parameters:
source (str, Path, GeneratorType, or list[str or Path])
recursive (bool)
file_suffixes (str or list of str, optional)
- Returns:
list of pathlib.Path
miranda.io._output module#
IO Output Operations module.
- miranda.io._output.concat_rechunk_zarr(freq: str, input_folder: str | PathLike, output_folder: str | PathLike, overwrite: bool = False, **dask_kwargs) None [source]#
Concatenate and rechunk zarr files.
- Parameters:
freq (str)
input_folder (str or os.PathLike)
output_folder (str or os.PathLike)
overwrite (bool)
**dask_kwargs
- Returns:
None
- miranda.io._output.merge_rechunk_zarrs(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, freq: str | None = None, suffix: str = 'zarr', overwrite: bool = False) None [source]#
Merge and rechunk zarr files.
- Parameters:
input_folder (str or os.PathLike)
output_folder (str or os.PathLike)
project (str, optional)
target_chunks (dict[str, int], optional)
variables (Sequence of str, optional)
freq (str, optional)
suffix ({“nc”, “zarr”})
overwrite (bool)
- Returns:
None
- miranda.io._output.write_dataset(ds: DataArray | Dataset, output_path: str | PathLike, output_format: str, chunks: dict | None = None, overwrite: bool = False, compute: bool = True) dict[str, Path] [source]#
Write xarray object to NetCDF or Zarr with appropriate chunking regime.
- Parameters:
ds (xr.DataArray or xr.Dataset) – Dataset or DataArray to be written.
output_path (str or os.PathLike) – Output folder path.
output_format ({“netcdf”, “zarr”}) – Output data container type.
chunks (dict, optional) – Chunking layout to be written to new files. If None, chunking will be left to the relevant backend engine.
overwrite (bool) – Whether to remove existing files or fail if files already exist.
compute (bool) – If True, the file is written immediately. If False, a dask.Delayed object that can be computed later is returned. Default: True.
- Returns:
dict[str, Path]
- miranda.io._output.write_dataset_dict(dataset_dict: dict[str, Dataset | None], output_folder: str | PathLike, temp_folder: str | PathLike, *, output_format: str = 'zarr', overwrite: bool = False, chunks: dict[str, int], **dask_kwargs)[source]#
Write a dictionary of Miranda-formatted datasets to disk.
- Parameters:
dataset_dict (dict[str, xr.Dataset or None])
output_folder (str or os.PathLike)
temp_folder (str or os.PathLike)
output_format ({“netcdf”, “zarr”})
overwrite (bool)
chunks (dict[str, int])
**dask_kwargs
- Returns:
None
miranda.io._rechunk module#
- miranda.io._rechunk.fetch_chunk_config(priority: str, freq: str, dims: Sequence[str] | dict[str, int] | Frozen | tuple[Hashable], default_config: dict = {'files': {'1hr': {'default': {'lat': 250, 'lon': 250, 'time': 168}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 168}}, 'day': {'default': {'lat': 125, 'lon': 125, 'time': '1 year'}, 'rotated': {'rlat': 125, 'rlon': 125, 'time': '1 year'}}, 'month': {'default': {'lat': 500, 'lon': 500, 'time': 120}, 'rotated': {'rlat': 500, 'rlon': 500, 'time': 120}}}, 'time': {'1hr': {'default': {'lat': 50, 'lon': 50, 'time': 1440}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': 1440}}, 'day': {'default': {'lat': 50, 'lon': 50, 'time': '4 years'}, 'rotated': {'rlat': 50, 'rlon': 50, 'time': '4 years'}}, 'month': {'default': {'lat': 250, 'lon': 250, 'time': 240}, 'rotated': {'rlat': 250, 'rlon': 250, 'time': 240}}}}) dict[str, int] [source]#
Fetch a chunk configuration for a given priority, frequency, and set of dimensions.
- Parameters:
priority ({“time”, “files”}) – Specifies whether the chunking regime should prioritize file granularity (“files”) or time series (“time”).
freq ({“1hr”, “day”, “month”}) – The time frequency of the input data.
dims (sequence of str) – The dimension names that will be used for chunking.
default_config (dict) – The dictionary to use for determining the chunking configuration.
- Returns:
dict[str, int]
- miranda.io._rechunk.prepare_chunks_for_ds(ds: Dataset, chunks: dict[str, str | int]) dict[str, int] [source]#
Prepare the chunks to be used to write Dataset.
This includes translating the time chunks, making sure chunks are not too small, and removing -1.
- Parameters:
ds (xr.Dataset) – Dataset that we want to write with the chunks.
chunks (dict) – Desired chunks in human-readable format (with “4 years” and -1).
- Returns:
dict – Chunks in a format that is ready to be used to write to disk.
- miranda.io._rechunk.rechunk_files(input_folder: str | PathLike, output_folder: str | PathLike, project: str | None = None, time_step: str | None = None, chunking_priority: str = 'auto', target_chunks: dict[str, int] | None = None, variables: Sequence[str] | None = None, suffix: str = 'nc', output_format: str = 'netcdf', overwrite: bool = False) None [source]#
Rechunk datasets for better loading/reading performance.
Warning
Globbing assumes that target datasets to be rechunked have been saved in NetCDF format. File naming requires the following order of facets: {variable}_{time_step}_{institute}_{project}_reanalysis_*.nc. Chunking dimensions are assumed to be CF-Compliant (lat, lon, rlat, rlon, time).
- Parameters:
input_folder (str or os.PathLike) – Folder to be examined. Performs globbing.
output_folder (str or os.PathLike) – Target folder.
project (str, optional) – Supported projects. Used for determining chunk dictionary. Superseded if target_chunks is set.
time_step ({“1hr”, “day”}, optional) – Time step of the input data. Parsed from dataset attrs if not set. Superseded if target_chunks is set.
chunking_priority ({“time”, “files”, “auto”}) – The chunking regime to use. Default: “auto”.
target_chunks (dict, optional) – Must include “time”, optionally “lat” and “lon”, depending on dataset structure.
variables (Sequence[str], optional) – If not set, will attempt to process all supported variables based on the project name.
suffix ({“nc”, “zarr”}) – Suffix used to identify data files. Default: “nc”.
output_format ({“netcdf”, “zarr”}) – Default: “netcdf”.
overwrite (bool) – Will overwrite files. For zarr, existing folders will be removed before writing.
- Returns:
None
miranda.io.utils module#
IO Utilities module.
- miranda.io.utils.creation_date(path_to_file: str | PathLike) float | date [source]#
Return the date that a file was created, falling back to when it was last modified if unable to determine.
See https://stackoverflow.com/a/39501288/1709587 for explanation.
- Parameters:
path_to_file (str or os.PathLike)
- Returns:
float or date
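The approach from the linked Stack Overflow answer can be sketched as follows; `creation_date_sketch` is an illustrative name, not miranda's implementation:

```python
import os


def creation_date_sketch(path_to_file) -> float:
    # Per the Stack Overflow answer linked above: some platforms (e.g. macOS)
    # expose the true creation time as st_birthtime; where it is unavailable
    # (e.g. Linux), fall back to the last-modification time.
    stat = os.stat(path_to_file)
    return getattr(stat, "st_birthtime", stat.st_mtime)
```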
- miranda.io.utils.delayed_write(ds: Dataset, outfile: str | PathLike, output_format: str, overwrite: bool, target_chunks: dict | None = None) delayed [source]#
Stage a Dataset writing job using dask.delayed objects.
- Parameters:
ds (xr.Dataset)
outfile (str or os.PathLike)
output_format ({“netcdf”, “zarr”})
overwrite (bool)
target_chunks (dict, optional)
- Returns:
dask.delayed.delayed
- miranda.io.utils.get_chunks_on_disk(file: PathLike | str) dict [source]#
Determine the chunks on disk for a given NetCDF or Zarr file.
- Parameters:
file (str or os.PathLike) – File to be examined. Supports NetCDF and Zarr.
- Returns:
dict
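For the Zarr case, on-disk chunks can be recovered from each array's `.zarray` metadata (a Zarr v2 convention); the NetCDF case would instead require a NetCDF library. This is a hypothetical sketch, not miranda's implementation:

```python
import json
from pathlib import Path


def zarr_chunks_sketch(store) -> dict:
    # Each array in a Zarr v2 store carries a ".zarray" JSON file whose
    # "chunks" key records the on-disk chunk shape.
    chunks = {}
    for zarray in sorted(Path(store).glob("*/.zarray")):
        meta = json.loads(zarray.read_text())
        chunks[zarray.parent.name] = meta["chunks"]
    return chunks
```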
- miranda.io.utils.get_global_attrs(file_or_dataset: str | PathLike | Dataset) dict[str, str | int] [source]#
Collect global attributes from NetCDF, Zarr, or Dataset object.
- miranda.io.utils.get_time_attrs(file_or_dataset: str | os.PathLike | xr.Dataset)[source]#
Determine attributes related to time dimensions.
- miranda.io.utils.name_output_file(ds_or_dict: Dataset | dict[str, str], output_format: str) str [source]#
Name an output file based on facets within a Dataset or a dictionary.
- Parameters:
ds_or_dict (xr.Dataset or dict) – A miranda-converted Dataset or a dictionary containing the appropriate facets.
output_format ({“netcdf”, “zarr”}) – Output filetype to be used for generating filename suffix.
- Returns:
str
Notes
If using a dictionary, the following keys must be set: “variable”, “frequency”, “institution”, “time_start”, and “time_end”.
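The required facet keys might be assembled roughly as follows; the exact ordering and separators miranda uses are not documented here, so this sketch is purely illustrative:

```python
def name_output_file_sketch(facets: dict, output_format: str) -> str:
    # Hypothetical assembly of the required facet keys into a filename;
    # the real function's ordering and separators may differ.
    suffix = {"netcdf": ".nc", "zarr": ".zarr"}[output_format]
    return (
        f"{facets['variable']}_{facets['frequency']}_{facets['institution']}"
        f"_{facets['time_start']}-{facets['time_end']}{suffix}"
    )
```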