miranda.convert package

Data Conversion module.

Submodules

miranda.convert._aggregation module

Aggregation module.

miranda.convert._aggregation.aggregate(ds: Dataset, freq: str = 'day') dict[str, Dataset][source]

Aggregate a dataset to a specified frequency.

Parameters:
  • ds (xarray.Dataset)

  • freq (str)

Returns:

dict[str, xarray.Dataset]

miranda.convert._aggregation.aggregations_possible(ds: Dataset, freq: str = 'day') dict[str, set[str]][source]

Determine which aggregations are possible based on variables within a dataset.

Parameters:
  • ds (xarray.Dataset) – The dataset.

  • freq (str) – The target frequency. Used to determine whether averages are possible; the source notes that this parameter may not be strictly necessary.

Returns:

dict – Mapping of variable names to a set of possible operations (e.g., max, mean, min).

Notes

The function checks first for continuous time periods in the dataset and then determines which variables are present and which operations can be performed on them.

If the dataset has variables that can be aggregated, such as temperature, humidity, and wind speed, then the following operations are possible:
  • For temperature: max, mean, min

  • For humidity: max, mean, min

  • For wind speed: max, mean

For fluxes (e.g., precipitation, evaporation), only the mean operation is available.

If a variable is not present in the dataset but can be derived from others (e.g., tas from tasmax and tasmin), then the following operations are possible for the derived temperature variables: max, mean, min.
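
The rules in the notes above can be pictured as a lookup table plus one derivation check. The following is a minimal sketch under stated assumptions, not miranda's implementation: the function name sketch_aggregations_possible, the RULES table, and the CF-style variable names (tas, hurs, sfcWind, pr, evspsbl) are all hypothetical, and the check for continuous time periods is left out.

```python
# Hypothetical rule table: variable names are CF-style assumptions,
# not values read from miranda.
RULES = {
    "tas": {"max", "mean", "min"},   # temperature
    "hurs": {"max", "mean", "min"},  # humidity
    "sfcWind": {"max", "mean"},      # wind speed
    "pr": {"mean"},                  # flux: mean only
    "evspsbl": {"mean"},             # flux: mean only
}


def sketch_aggregations_possible(variables):
    """Map each aggregable variable name to its set of allowed operations."""
    variables = set(variables)
    possible = {name: set(ops) for name, ops in RULES.items() if name in variables}
    # tas is absent but derivable when both daily extremes are present.
    if "tas" not in variables and {"tasmax", "tasmin"} <= variables:
        possible["tas"] = {"max", "mean", "min"}
    return possible


result = sketch_aggregations_possible({"tasmax", "tasmin", "pr"})
```

Here result holds an entry for the derived tas with all three operations and an entry for pr with the mean only.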

miranda.convert._data_corrections module

miranda.convert._data_corrections.dataset_conversion(input_files: str | PathLike | Sequence[str | PathLike] | Iterator[PathLike] | Dataset, project: str, domain: str | None = None, mask: Dataset | DataArray | None = None, mask_cutoff: float | bool = False, regrid: bool = False, add_version_hashes: bool = True, preprocess: Callable | str | None = 'auto', **xr_kwargs) Dataset[source]

Convert an existing Xarray-compatible dataset to another format with variable corrections applied.

Parameters:
  • input_files (str or os.PathLike or Sequence[str or os.PathLike] or Iterator[os.PathLike] or xr.Dataset) – Files or objects to be converted. If given a list or generator, the files are opened with xarray.open_mfdataset() and concatenated.

  • project ({“cordex”, “cmip5”, “cmip6”, “ets-grnch”, “isimip-ft”, “pcic-candcs-u6”, “converted”}) – Project name for decoding/handling purposes.

  • domain ({“global”, “nam”, “can”, “qc”, “mtl”}, optional) – Domain to perform subsetting for. Default: None.

  • mask (xr.Dataset or xr.DataArray, optional) – DataArray, or Dataset with a single data variable, containing the mask.

  • mask_cutoff (float or bool) – If mask is supplied, the threshold above which to mask. Default: False.

  • regrid (bool) – Whether to perform regridding with xesmf. Default: False.

  • add_version_hashes (bool) – If True, version name and sha256sum of source file(s) will be added as a field among the global attributes.

  • preprocess (callable or str, optional) – Preprocessing function(s) to run over each Dataset. If “auto”, runs preprocessing fixes based on the fields supplied in the metadata definition. If a callable, runs the function over the Dataset (single file) or passes it as the preprocess argument when opening a multi-file dataset. Default: “auto”.

  • **xr_kwargs – Arguments passed directly to xarray.

Returns:

xr.Dataset

miranda.convert._data_corrections.dataset_corrections(ds: Dataset, project: str) Dataset[source]

Convert variables to CF-compliant format.

miranda.convert._data_corrections.dims_conversion(d: Dataset, p: str, m: dict) Dataset[source]

Rename dimensions to their CF equivalents.

Parameters:
  • d (xarray.Dataset) – Dataset with dimensions to be updated.

  • p (str) – Dataset project name.

  • m (dict) – Metadata definition dictionary for project and variable(s).

Returns:

xarray.Dataset

miranda.convert._data_corrections.load_json_data_mappings(project: str) dict[str, Any][source]

Load JSON mappings for supported dataset conversions.

Parameters:

project (str)

Returns:

dict[str, Any]

miranda.convert._data_corrections.metadata_conversion(d: Dataset, p: str, m: dict) Dataset[source]

Update xarray dataset and data_vars with project-specific metadata fields.

Parameters:
  • d (xarray.Dataset) – Dataset with metadata to be updated.

  • p (str) – Dataset project name.

  • m (dict) – Metadata definition dictionary for project and variable(s).

Returns:

xarray.Dataset

miranda.convert._data_corrections.threshold_mask(ds: Dataset | DataArray, *, mask: Dataset | DataArray, mask_cutoff: float | bool = False) Dataset | DataArray[source]

Land-Sea mask operations.

Parameters:
  • ds (xr.Dataset or xr.DataArray)

  • mask (xr.Dataset or xr.DataArray)

  • mask_cutoff (float or bool)

Returns:

xr.Dataset or xr.DataArray

miranda.convert._data_corrections.variable_conversion(d: Dataset, p: str, m: dict) Dataset[source]

Add variable metadata and remove nonstandard entries.

Parameters:
  • d (xarray.Dataset) – Dataset with variable(s) to be updated.

  • p (str) – Dataset project name.

  • m (dict) – Metadata definition dictionary for project and variable(s).

Returns:

xarray.Dataset

miranda.convert._data_definitions module

miranda.convert._data_definitions.gather_agcfsr(path: str | PathLike) dict[str, list[Path]][source]

Gather agCFSR source data.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._data_definitions.gather_agmerra(path: str | PathLike) dict[str, list[Path]][source]

Gather agMERRA source data.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._data_definitions.gather_eccc_rdrs(name: str, path: str | PathLike, suffix: str, key: str) dict[str, dict[str, list[Path]]][source]

Gather RDRS processed source data.

Parameters:
  • name (str) – The variable to gather.

  • path (str or os.PathLike) – The location of the source data.

  • suffix (str) – The filename suffix.

  • key ({“raw”, “cf”}) – Indicating which variable name dictionary to search for.

Returns:

dict[str, dict[str, list[pathlib.Path]]]

miranda.convert._data_definitions.gather_ecmwf(project: str, path: str | PathLike, back_extension: bool = False, monthly_means: bool = False) dict[str, list[Path]][source]

Gather ECMWF source data.

Parameters:
  • project ({“era5-single-levels”, “era5-pressure-levels”, “era5-land”})

  • path (str or os.PathLike)

  • back_extension (bool)

  • monthly_means (bool)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._data_definitions.gather_emdna(path: str | PathLike) dict[str, list[Path]][source]

Gather raw EMDNA files for preprocessing.

Put all files with the same member together.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._data_definitions.gather_grnch(path: str | PathLike) dict[str, list[Path]][source]

Gather raw ETS-GRNCH files for preprocessing.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, dict[str, list[pathlib.Path]]] or None

miranda.convert._data_definitions.gather_nex(path: str | PathLike) dict[str, list[Path]][source]

Gather raw NEX files for preprocessing.

Put all files that should be contained in one dataset in one entry of the dictionary.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._data_definitions.gather_nrcan_gridded_obs(path: str | PathLike) dict[str, list[Path]][source]

Gather NRCan Gridded Observations source data.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._data_definitions.gather_raw_rdrs_by_years(path: str | PathLike, project: str) dict[str, dict[str, list[Path]]][source]

Gather raw RDRS files for preprocessing.

Parameters:
  • path (str or os.PathLike)

  • project (str)

Returns:

dict[str, dict[str, list[pathlib.Path]]]
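
To make the nested return type concrete, here is a minimal sketch of grouping files by year. It is not miranda's implementation: the function name sketch_gather_by_years, the project name "example-project", and the assumption that each filename starts with a four-digit year are all hypothetical.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def sketch_gather_by_years(path, project):
    """Return {project: {year: [files]}} for the *.nc files found in `path`."""
    by_year = defaultdict(list)
    for file in sorted(Path(path).glob("*.nc")):
        year = file.stem[:4]  # assumption: filenames begin with YYYY
        by_year[year].append(file)
    return {project: dict(by_year)}


# Build a throwaway folder with three placeholder files to exercise the sketch.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1980010112.nc", "1980010212.nc", "1981010112.nc"):
        (Path(tmp) / name).touch()
    gathered = sketch_gather_by_years(tmp, "example-project")
    counts = {year: len(files) for year, files in gathered["example-project"].items()}
```

The outer key identifies the project and each inner key identifies one year, so a caller can process one year of files at a time.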

miranda.convert._data_definitions.gather_sc_earth(path: str | PathLike) dict[str, list[Path]][source]

Gather SC-Earth source data.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._data_definitions.gather_wfdei_gem_capa(path: str | PathLike) dict[str, list[Path]][source]

Gather WFDEI-GEM-CaPa source data.

Parameters:

path (str or os.PathLike)

Returns:

dict[str, list[pathlib.Path]]

miranda.convert._reconstruction module

miranda.convert._reconstruction.reanalysis_processing(data: dict[str, list[str | PathLike]], output_folder: str | PathLike, variables: Sequence[str], aggregate: str | bool = False, domains: str | list[str] = '_DEFAULT', start: str | None = None, end: str | None = None, target_chunks: dict | None = None, output_format: str = 'netcdf', overwrite: bool = False, engine: str = 'h5netcdf', n_workers: int = 4, **dask_kwargs) None[source]

Reanalysis processing.

Parameters:
  • data (dict[str, list[str]])

  • output_folder (str or os.PathLike)

  • variables (Sequence[str])

  • aggregate ({“day”, False})

  • domains ({“QC”, “CAN”, “AMNO”, “NAM”, “GLOBAL”})

  • start (str, optional)

  • end (str, optional)

  • target_chunks (dict, optional)

  • output_format ({“netcdf”, “zarr”})

  • overwrite (bool)

  • engine ({“netcdf4”, “h5netcdf”})

  • n_workers (int)

Returns:

None

miranda.convert.deh module

DEH Hydrograph Conversion module.

miranda.convert.deh.open_txt(path: str | Path, cf_table: dict | None = {'flag': {'comment': 'See DEH technical information for details.', 'long_name': 'data flag'}, 'q': {'long_name': 'River discharge', 'units': 'm3 s-1'}}) Dataset[source]

Extract daily DEH hydrograph data and convert to an xr.Dataset with CF-Convention attributes.

Parameters:
  • path (str or Path) – The path to the file.

  • cf_table (dict, optional) – The CF table dictionary.

Returns:

xr.Dataset – The CF-compliant dataset.

miranda.convert.eccc_canswe module

Environment and Climate Change Canada Data Conversion module.

miranda.convert.eccc_canswe.convert_canswe(file: str | Path, output: str | Path)[source]

Convert the CanSWE netCDF files to production-ready CF-compliant netCDFs.

Parameters:
  • file (str or Path) – The path to the CanSWE netCDF file.

  • output (str or Path) – The output directory.

miranda.convert.eccc_rdrs module

Environment and Climate Change Canada RDRS conversion tools.

miranda.convert.eccc_rdrs.convert_rdrs(project: str, input_folder: str | PathLike[str], output_folder: str | PathLike[str], output_format: str = 'zarr', working_folder: str | PathLike[str] | None = None, overwrite: bool = False, year_start: int | None = None, year_end: int | None = None, cfvariable_list: list | None = None, **dask_kwargs: dict[str, Any]) None[source]

Convert RDRS dataset.

Parameters:
  • project (str) – The project name.

  • input_folder (str or os.PathLike) – The input folder.

  • output_folder (str or os.PathLike) – The output folder.

  • output_format ({“netcdf”, “zarr”}) – The output format.

  • working_folder (str or os.PathLike, optional) – The working folder.

  • overwrite (bool) – Whether to overwrite existing files. Default: False.

  • year_start (int, optional) – The start year. If not provided, the minimum year in the dataset will be used.

  • year_end (int, optional) – The end year. If not provided, the maximum year in the dataset will be used.

  • cfvariable_list (list, optional) – The CF variable list.

  • **dask_kwargs (dict) – Additional keyword arguments passed to the Dask scheduler.
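
The fallback described for year_start and year_end can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the function name sketch_resolve_years and the idea of passing the available years as a plain collection of integers are hypothetical.

```python
def sketch_resolve_years(available_years, year_start=None, year_end=None):
    """Return the sorted years to process, defaulting to the full available range."""
    # A missing bound falls back to the minimum or maximum year in the data.
    start = min(available_years) if year_start is None else year_start
    end = max(available_years) if year_end is None else year_end
    if start > end:
        raise ValueError(f"year_start ({start}) is after year_end ({end})")
    return [year for year in sorted(available_years) if start <= year <= end]


years = sketch_resolve_years({1980, 1981, 1982, 1983}, year_end=1982)
```

With only year_end given, the start of the range comes from the earliest year in the data.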

miranda.convert.eccc_rdrs.rdrs_to_daily(project: str, input_folder: str | PathLike, output_folder: str | PathLike, working_folder: str | PathLike | None = None, overwrite: bool = False, output_format: str = 'zarr', year_start: int | None = None, year_end: int | None = None, process_variables: list[str] | None = None, **dask_kwargs: dict[str, Any]) None[source]

Write out RDRS files to daily-timestep files.

Parameters:
  • project (str) – The project name.

  • input_folder (str or os.PathLike) – The input folder.

  • output_folder (str or os.PathLike) – The output folder.

  • working_folder (str or os.PathLike, optional) – The working folder.

  • overwrite (bool) – Whether to overwrite existing files. Default: False.

  • output_format ({“netcdf”, “zarr”}) – The output format.

  • year_start (int, optional) – The start year. If not provided, the minimum year in the dataset will be used.

  • year_end (int, optional) – The end year. If not provided, the maximum year in the dataset will be used.

  • process_variables (list of str, optional) – The variables to process. If not provided, all variables will be processed.

  • **dask_kwargs (dict) – Additional keyword arguments passed to the Dask scheduler.

miranda.convert.hq module

Hydro Quebec Weather Station Data Conversion module.

miranda.convert.hq.open_csv(path: str | PathLike[str], cf_table: dict[str, Any] | None = {'hurs': {'cell_methods': 'time: point', 'comment': 'The relative humidity with respect to liquid water for T> 0 C, and with respect to ice for T<0 C.', 'frequency': '1h', 'long_name': 'Near-Surface Relative Humidity', 'out_name': 'hurs', 'standard_name': 'relative_humidity', 'type': 'real', 'units': '%'}, 'prlp': {'cell_methods': 'time: mean', 'comment': 'At surface; includes precipitation of all forms of water in the liquid phase.', 'frequency': 'day', 'long_name': 'Rainfall Flux', 'out_name': 'prlp', 'standard_name': 'rainfall_flux', 'type': 'real', 'units': 'kg m-2 s-1'}, 'prsn': {'cell_methods': 'time: mean', 'comment': 'At surface; includes precipitation of all forms of water in the solid phase.', 'frequency': 'day', 'long_name': 'Snowfall Flux', 'out_name': 'prsn', 'standard_name': 'snowfall_flux', 'type': 'real', 'units': 'kg m-2 s-1'}, 'sfcWind': {'cell_methods': 'time: point', 'comment': 'Near-surface (usually, 10 meters) wind speed.', 'frequency': '1h', 'long_name': 'Near-Surface Wind Speed', 'out_name': 'sfcWind', 'standard_name': 'wind_speed', 'type': 'real', 'units': 'm s-1'}, 'sfcWindAz': {'cell_methods': 'time: point', 'comment': 'Near-surface (usually, 10 meters) direction from which wind originates.', 'frequency': '1h', 'long_name': 'Near-Surface Wind Direction', 'out_name': 'sfcWindAz', 'standard_name': 'wind_direction', 'type': 'real', 'units': 'degree'}, 'snd': {'cell_methods': 'time: point', 'comment': 'The thickness of snow.', 'frequency': '1h', 'long_name': 'Snow Depth', 'out_name': 'snd', 'standard_name': 'surface_snow_thickness', 'type': 'real', 'units': 'm'}, 'tasmax_1h': {'cell_methods': 'time: maximum', 'comment': 'Maximum near-surface (usually, 2 meter) air temperature.', 'frequency': '1h', 'long_name': 'Hourly Maximum Near-Surface Air Temperature', 'out_name': 'tasmax', 'standard_name': 'air_temperature', 'type': 'real', 'units': 'K'}, 
'tasmax_day': {'cell_methods': 'time: maximum', 'comment': 'Maximum near-surface (usually, 2 meter) air temperature.', 'frequency': 'day', 'long_name': 'Daily Maximum Near-Surface Air Temperature', 'out_name': 'tasmax', 'standard_name': 'air_temperature', 'type': 'real', 'units': 'K'}, 'tasmin_1h': {'cell_methods': 'time: minimum', 'comment': 'Minimum near-surface (usually, 2 meter) air temperature.', 'frequency': '1h', 'long_name': 'Hourly Minimum Near-Surface Air Temperature', 'out_name': 'tasmin', 'standard_name': 'air_temperature', 'type': 'real', 'units': 'K'}, 'tasmin_day': {'cell_methods': 'time: minimum', 'comment': 'Minimum near-surface (usually, 2 meter) air temperature.', 'frequency': 'day', 'long_name': 'Daily Minimum Near-Surface Air Temperature', 'out_name': 'tasmin', 'standard_name': 'air_temperature', 'type': 'real', 'units': 'K'}}) DataArray[source]

Extract daily HQ meteorological data and convert to an xr.DataArray with CF-Convention attributes.

Parameters:
  • path (os.PathLike or str) – The path to the file.

  • cf_table (dict, optional) – The CF table dictionary.

Returns:

xr.DataArray – The CF-compliant xarray DataArray.

miranda.convert.melcc module

MELCC (Québec) Weather Stations data conversion module.

miranda.convert.melcc.concat(files: Sequence[str | PathLike[str]], output_folder: str | PathLike[str], overwrite: bool = True) Path[source]

Concatenate converted weather station files.

Parameters:
  • files (Sequence of str or os.PathLike) – The files to concatenate.

  • output_folder (str or os.PathLike) – The output folder.

  • overwrite (bool) – Whether to overwrite existing files. Default: True.

Returns:

Path – The output path.

miranda.convert.melcc.convert_mdb(database: str | Path, stations: Dataset, definitions: Dataset, output: str | Path, overwrite: bool = True) dict[tuple[str, str], Path][source]

Convert Microsoft databases of MELCCFP observation data to xarray objects.

Parameters:
  • database (str or Path) – The database file.

  • stations (xr.Dataset) – The station list.

  • definitions (xr.Dataset) – The variable definitions.

  • output (str or Path) – The output folder.

  • overwrite (bool) – Whether to overwrite existing files. Default: True.

Returns:

dict[tuple[str, str], Path] – The converted files.

miranda.convert.melcc.convert_melcc_obs(metafile: str | Path, folder: str | Path, output: str | Path | None = None, overwrite: bool = True) dict[tuple[str, str], Path][source]

Convert MELCCFP observation data to xarray data objects, returning paths.

Parameters:
  • metafile (str or Path) – The metadata file.

  • folder (str or Path) – The folder containing the MDB files.

  • output (str or Path, optional) – The output folder. Default: None.

  • overwrite (bool) – Whether to overwrite existing files. Default: True.

Returns:

dict[tuple[str, str], Path] – The converted files.

miranda.convert.melcc.convert_snow_table(file: str | PathLike[str] | Path, output: str | PathLike[str] | Path) None[source]

Convert snow data given through an Excel file.

This private data is not included in the MDB files.

Parameters:
  • file (str or os.PathLike or Path) – The Excel file with sheets: “Stations”, “Périodes standards”, and “Données”.

  • output (str or os.PathLike or Path) – Folder where to put the netCDF files (one for each of snd, sd and snw).

miranda.convert.melcc.list_tables(db_file: str | PathLike[str]) list[str][source]

List the tables of an MDB file.

Parameters:

db_file (str or os.PathLike) – The database file.

Returns:

list of str – The list of tables.

miranda.convert.melcc.parse_var_code(vcode: str) dict[str, Any][source]

Parse variable code to generate metadata.

Parameters:

vcode (str) – The variable code.

Returns:

dict[str, Any] – The metadata dictionary.

miranda.convert.melcc.read_definitions(db_file: str) DataFrame[source]

Read variable definition file using mdbtools.

Parameters:

db_file (str) – The database file.

Returns:

pandas.DataFrame – The variable definitions.

miranda.convert.melcc.read_stations(db_file: str | PathLike) DataFrame[source]

Read station file using mdbtools.

Parameters:

db_file (str or os.PathLike) – The database file.

Returns:

pandas.DataFrame – A Pandas DataFrame with the station information.

miranda.convert.melcc.read_table(db_file: str | PathLike[str], table: str | PathLike) Dataset[source]

Read a table from an MDB file into an xarray object.

Parameters:
  • db_file (str or os.PathLike) – The database file.

  • table (str or os.PathLike) – The table to read.

Returns:

xarray.Dataset – An xarray Dataset with the table data.

miranda.convert.nrcanmet module

NRCANmet (ANUSPLIN) interpolated station data conversion module.

miranda.convert.nrcanmet.convert_nrcanmet(infile: str | Path, engine: str = 'h5netcdf') Dataset[source]

Convert the NRCanMET netCDF files to production-ready CF-compliant netCDFs.

Parameters:
  • infile (str or Path) – The path to the NRCanMET netCDF files.

  • engine (str) – The engine to use for file read operations. Default: “h5netcdf”.

Returns:

xr.Dataset – NRCanmet xarray dataset.

miranda.convert.stationdata module

Module to convert station data to Zarr format.

miranda.convert.stationdata.convert_statdata_bychunks(project: str, working_folder: str | PathLike[str] | None = None, cfvariable_list: list | None = None, start_year: int | None = None, end_year: int | None = None, lon_bnds: list[float] | None = None, lat_bnds: list[float] | None = None, n_workers: int = 4, n_stations: int = 100, update_from_raw: bool = False, zarr_format: int = 2) None[source]

Convert GHCN or CanHomT station data to Zarr format.

Requires GIS libraries (geopandas).

Parameters:
  • project (str) – Project name.

  • working_folder (str or os.PathLike[str], optional) – The working folder. The default (None) is to use the current working directory.

  • cfvariable_list (list, optional) – List of CF variable names.

  • start_year (int, optional) – Start year.

  • end_year (int, optional) – End year.

  • lon_bnds (list of float, optional) – Longitude boundaries.

  • lat_bnds (list of float, optional) – Latitude boundaries.

  • n_workers (int) – Number of workers to use. Default is 4.

  • n_stations (int) – Number of stations to process. Default is 100.

  • update_from_raw (bool) – Whether to update from raw data.

  • zarr_format (int) – Zarr format version (2 or 3). Default is 2.

miranda.convert.utils module

Conversion Utilities submodule.

miranda.convert.utils.date_parser(date: str, *, end_of_period: bool = False, output_type: str = 'str', strftime_format: str = '%Y-%m-%d') str | Timestamp | NaTType[source]

Parse datetime objects from a string representation of a date or both a start and end date.

Parameters:
  • date (str) – Date to be converted.

  • end_of_period (bool) – If True, the date will be the end of month or year depending on what’s most appropriate.

  • output_type ({“datetime”, “str”}) – Desired returned object type.

  • strftime_format (str) – If output_type == 'str', this sets the strftime format.

Returns:

pd.Timestamp or str or pd.NaT – Parsed date.

Notes

Adapted from code written by Gabriel Rondeau-Genesse (@RondeauG).
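
The effect of end_of_period can be illustrated with the standard library alone. The sketch below is not the library's parser: the function name sketch_date_parser is hypothetical, it accepts only “YYYY”, “YYYY-MM”, and “YYYY-MM-DD” strings, and it always returns a string.

```python
import calendar
from datetime import date


def sketch_date_parser(text, end_of_period=False, strftime_format="%Y-%m-%d"):
    """Expand a partial date to the first or last day of its year or month."""
    parts = [int(part) for part in text.split("-")]
    year = parts[0]
    if len(parts) == 1:
        # Year only: first or last day of the year.
        month, day = (12, 31) if end_of_period else (1, 1)
    elif len(parts) == 2:
        # Year and month: first or last day of that month.
        month = parts[1]
        day = calendar.monthrange(year, month)[1] if end_of_period else 1
    else:
        # Full date: nothing to expand.
        month, day = parts[1], parts[2]
    return date(year, month, day).strftime(strftime_format)


print(sketch_date_parser("2000", end_of_period=True))     # 2000-12-31
print(sketch_date_parser("2000-02", end_of_period=True))  # 2000-02-29
print(sketch_date_parser("2000-02"))                      # 2000-02-01
```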

miranda.convert.utils.find_version_hash(file: str | PathLike[str]) dict[str, Any][source]

Check for an existing version hash file and, if one cannot be found, generate one from file.

Parameters:

file (str or os.PathLike) – The file to check.

Returns:

dict – The version and hash.
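
Generating a hash for a file can be sketched with hashlib. This is a minimal illustration, not the library's code: the function name sketch_version_hash, the dictionary keys "version" and "sha256sum", and the caller-supplied version string are assumptions, and the lookup of an existing hash file is left out.

```python
import hashlib
import tempfile
from pathlib import Path


def sketch_version_hash(file, version="unknown"):
    """Return a dict holding a version label and the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with Path(file).open("rb") as handle:
        # Read in 1 MiB chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {"version": version, "sha256sum": digest.hexdigest()}


# Exercise the sketch on a small throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.nc"
    sample.write_bytes(b"example")
    info = sketch_version_hash(sample, version="v1")
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte netCDF files.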

miranda.convert.utils.get_station_meta(project: str, lon_bnds: list[float] | None = None, lat_bnds: list[float] | None = None) DataFrame[source]

Get GHCN or CanHomT station metadata.

Parameters:
  • project (str) – Project name.

  • lon_bnds (list of float, optional) – Longitude boundaries.

  • lat_bnds (list of float, optional) – Latitude boundaries.

Returns:

pd.DataFrame – Station metadata.
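
One plausible reading of lon_bnds and lat_bnds is an inclusive bounding box. The sketch below filters plain dictionaries on that assumption; the function name sketch_filter_stations, the record keys, and the sample coordinates are hypothetical, and the real function returns a pandas DataFrame rather than a list.

```python
def sketch_filter_stations(stations, lon_bnds=None, lat_bnds=None):
    """Keep the stations whose coordinates fall inside the optional bounds."""

    def inside(value, bounds):
        # A missing bound means no restriction on that axis.
        return bounds is None or min(bounds) <= value <= max(bounds)

    return [
        station
        for station in stations
        if inside(station["lon"], lon_bnds) and inside(station["lat"], lat_bnds)
    ]


stations = [
    {"station_id": "A", "lon": -73.6, "lat": 45.5},
    {"station_id": "B", "lon": -123.1, "lat": 49.3},
    {"station_id": "C", "lon": -71.2, "lat": 46.8},
]
selected = sketch_filter_stations(stations, lon_bnds=[-80.0, -60.0], lat_bnds=[44.0, 48.0])
selected_ids = [station["station_id"] for station in selected]
```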