heiplanet_data.preprocess module⚓︎

Classes:

  • GridConfig

    Configuration for grid specification for resampling.

  • ResolutionConfig

    Configuration for resolution resampling.

Functions:

Attributes:

CRS module-attribute ⚓︎

CRS = 4326

T module-attribute ⚓︎

T = TypeVar('T', bound=Union[float64, DataArray])

warn_positive_resolution module-attribute ⚓︎

warn_positive_resolution = 'New resolution must be a positive number.'

GridConfig dataclass ⚓︎

GridConfig(expected_longitude_max_xarray=float64(179.75), new_min_lat=None, new_max_lat=None, new_min_lon=None, new_max_lon=None, new_lat_size=None, new_lon_size=None, gridtype='lonlat')

Configuration for grid specification for resampling.

Attributes:

  • expected_longitude_max_xarray (float64) –

    Expected maximum longitude. Default is np.float64(179.75). This is used to adjust the grid after resampling with xarray, e.g. to align with population data.

  • new_min_lat (float | None) –

    Minimum latitude of the new grid. Default is None. This is used for resampling with xESMF and CDO.

  • new_max_lat (float | None) –

    Maximum latitude of the new grid. Default is None. This is used for resampling with xESMF.

  • new_min_lon (float | None) –

    Minimum longitude of the new grid. Default is None. This is used for resampling with xESMF and CDO.

  • new_max_lon (float | None) –

    Maximum longitude of the new grid. Default is None. This is used for resampling with xESMF.

  • new_lat_size (int | None) –

    Size of latitude of the new grid. Default is None. This is used for resampling with CDO.

  • new_lon_size (int | None) –

    Size of longitude of the new grid. Default is None. This is used for resampling with CDO.

  • gridtype (Literal['gaussian', 'lonlat', 'curvilinear', 'unstructured']) –

    Type of the grid. Default is "lonlat". This is used for resampling with CDO.

expected_longitude_max_xarray class-attribute instance-attribute ⚓︎

expected_longitude_max_xarray = float64(179.75)

gridtype class-attribute instance-attribute ⚓︎

gridtype = 'lonlat'

new_lat_size class-attribute instance-attribute ⚓︎

new_lat_size = None

new_lon_size class-attribute instance-attribute ⚓︎

new_lon_size = None

new_max_lat class-attribute instance-attribute ⚓︎

new_max_lat = None

new_max_lon class-attribute instance-attribute ⚓︎

new_max_lon = None

new_min_lat class-attribute instance-attribute ⚓︎

new_min_lat = None

new_min_lon class-attribute instance-attribute ⚓︎

new_min_lon = None

ResolutionConfig dataclass ⚓︎

ResolutionConfig(new_resolution=0.5, lat_name='latitude', lon_name='longitude', downsample_lib='xesmf', agg_funcs=None, upsample_method_map=None)

Configuration for resolution resampling.

Attributes:

  • new_resolution (float) –

    New resolution in degrees. Default is 0.5.

  • lat_name (str) –

    Name of the latitude coordinate. Default is "latitude".

  • lon_name (str) –

    Name of the longitude coordinate. Default is "longitude".

  • downsample_lib (Literal['xarray', 'xesmf', 'cdo']) –

    Library to use for downsampling. Options are "xarray", "xesmf", or "cdo". Default is "xesmf".

  • agg_funcs (Dict[str, str] | None) –

    Aggregation functions for each variable. If None, the default aggregation of the corresponding library is used. Default is None.

  • upsample_method_map (Dict[str, str] | None) –

    Mapping of variable names to interpolation methods. If None, linear interpolation is used. Default is None.
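
The documented fields and defaults can be pictured as a plain dataclass. The sketch below is a stand-in written for illustration only: the name ResolutionConfigSketch is hypothetical, and the real class may perform validation that is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, Literal, Optional


@dataclass
class ResolutionConfigSketch:
    """Illustrative stand-in mirroring the documented fields; not the library class."""

    new_resolution: float = 0.5
    lat_name: str = "latitude"
    lon_name: str = "longitude"
    downsample_lib: Literal["xarray", "xesmf", "cdo"] = "xesmf"
    agg_funcs: Optional[Dict[str, str]] = None
    upsample_method_map: Optional[Dict[str, str]] = None


# Override only the fields that differ from the defaults.
config = ResolutionConfigSketch(
    new_resolution=1.0,
    downsample_lib="xarray",
    agg_funcs={"t2m": "mean", "tp": "sum"},
)
print(config)
```

Fields that are not passed keep the defaults listed above, so a configuration only needs to spell out what changes.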

agg_funcs class-attribute instance-attribute ⚓︎

agg_funcs = None

downsample_lib class-attribute instance-attribute ⚓︎

downsample_lib = 'xesmf'

lat_name class-attribute instance-attribute ⚓︎

lat_name = 'latitude'

lon_name class-attribute instance-attribute ⚓︎

lon_name = 'longitude'

new_resolution class-attribute instance-attribute ⚓︎

new_resolution = 0.5

upsample_method_map class-attribute instance-attribute ⚓︎

upsample_method_map = None

adjust_longitude_360_to_180 ⚓︎

adjust_longitude_360_to_180(dataset, limited_area=False, lon_name='longitude')

Adjust longitude from 0-360 to -180-180.

Parameters:

  • dataset (Dataset) –

    Dataset with longitude in 0-360 range.

  • limited_area (bool, default: False ) –

    Flag indicating if the dataset is a limited area. Default is False.

  • lon_name (str, default: 'longitude' ) –

    Name of the longitude variable in the dataset. Default is "longitude".

Returns:

  • Dataset

    xr.Dataset: Dataset with longitude adjusted to -180-180 range.

aggregate_data_by_nuts ⚓︎

aggregate_data_by_nuts(netcdf_files, nuts_file, normalize_time=True, output_dir=None)

Aggregate data from one or more NetCDF files by NUTS regions, data variable names, and time. The aggregated data is saved to a NetCDF file with the coordinates "NUTS_ID" and "time"; its data variables are the aggregated variables.

Parameters:

  • netcdf_files (dict[str, tuple[Path, dict | None]]) –

    Dictionary of NetCDF files. Keys are dataset names and values are tuples of (file path, agg_dict). The agg_dict can contain aggregation options for each data variable. For example, {"t2m": "mean", "tp": "sum"}. If agg_dict is None, default aggregation (i.e. mean) is used. NetCDF files must contain "latitude", "longitude", and "time" coordinates.

  • nuts_file (Path) –

    Path to the NUTS regions shapefile. The shapefile has columns such as "NUTS_ID" and "geometry".

  • normalize_time (bool, default: True ) –

    If True, normalize time to the beginning of the day, e.g. 2025-10-01T12:00:00 becomes 2025-10-01T00:00:00. Default is True.

  • output_dir (Path | None, default: None ) –

    Directory to save the aggregated NetCDF file. If None, the output file is saved in the same directory as the NUTS file. Default is None.

Returns:

  • Path ( Path ) –

    Path to the aggregated NetCDF file.
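
The shape of the netcdf_files argument is easiest to see written out. In the sketch below the dataset names and file paths are hypothetical placeholders; only the structure (name mapped to a tuple of path and optional aggregation dictionary) follows the parameter description above.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple

# Hypothetical dataset names and paths; replace them with real NetCDF files.
netcdf_files: Dict[str, Tuple[Path, Optional[dict]]] = {
    # Per-variable aggregation: mean temperature, summed precipitation.
    "era5": (Path("data/era5_2025.nc"), {"t2m": "mean", "tp": "sum"}),
    # None falls back to the default aggregation (mean).
    "isimip": (Path("data/isimip_2025.nc"), None),
}

for name, (path, agg_dict) in netcdf_files.items():
    label = agg_dict if agg_dict is not None else "default (mean)"
    print(name, path.name, label)
```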

align_lon_lat_with_popu_data ⚓︎

align_lon_lat_with_popu_data(dataset, expected_longitude_max=float64(179.75), lat_name='latitude', lon_name='longitude')

Align longitude and latitude coordinates with population data of the same resolution. This function is specifically designed to ensure that the longitude and latitude coordinates in the dataset match the expected values used in population data, which are:

  • Longitude: -179.75 to 179.75, 720 points

  • Latitude: 89.75 to -89.75, 360 points

Parameters:

  • dataset (Dataset) –

    Dataset with longitude and latitude coordinates.

  • expected_longitude_max (float64, default: float64(179.75) ) –

    Expected maximum longitude after adjustment. Default is np.float64(179.75).

  • lat_name (str, default: 'latitude' ) –

    Name of the latitude coordinate. Default is "latitude".

  • lon_name (str, default: 'longitude' ) –

    Name of the longitude coordinate. Default is "longitude".

Returns:

  • Dataset

    xr.Dataset: Dataset with adjusted longitude and latitude coordinates.
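
The expected population grid can be reproduced with a few lines of plain Python. This is only a sketch of the target coordinates at 0.5 degrees; it does not show how the function itself adjusts a dataset.

```python
# Half-degree cell centres matching the documented population grid.
resolution = 0.5
longitudes = [-179.75 + resolution * i for i in range(720)]
latitudes = [89.75 - resolution * j for j in range(360)]  # north to south

print(longitudes[0], longitudes[-1], len(longitudes))
print(latitudes[0], latitudes[-1], len(latitudes))
```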

check_agg_funcs ⚓︎

check_agg_funcs(agg_funcs, valid_agg_funcs)

Check if aggregation functions are valid.

Parameters:

  • agg_funcs (Dict[str, str]) –

    Aggregation functions for each variable.

  • valid_agg_funcs (set) –

    Set of valid aggregation function names.

Raises:

  • ValueError

    If any aggregation function is not valid or agg_funcs is not a dictionary.
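
A check of this kind can be sketched as follows. The function name check_agg_funcs_sketch and the error messages are hypothetical, and the sketch assumes that the dictionary values (the function names) are what gets validated.

```python
from typing import Dict, Set


def check_agg_funcs_sketch(agg_funcs: Dict[str, str], valid_agg_funcs: Set[str]) -> None:
    """Raise ValueError for a non-dict input or an unknown function name."""
    if not isinstance(agg_funcs, dict):
        raise ValueError("agg_funcs must be a dictionary.")
    # Collect every variable whose function name is not in the valid set.
    invalid = {var: func for var, func in agg_funcs.items() if func not in valid_agg_funcs}
    if invalid:
        raise ValueError(f"Invalid aggregation functions: {invalid}")


valid = {"mean", "sum", "max", "min"}
check_agg_funcs_sketch({"t2m": "mean", "tp": "sum"}, valid)  # passes silently

try:
    check_agg_funcs_sketch({"t2m": "median"}, valid)
except ValueError as err:
    print(err)
```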

check_downsample_condition ⚓︎

check_downsample_condition(dataset, new_resolution, lat_name='latitude', lon_name='longitude', agg_funcs=None)

Check if downsampling conditions are met.

Parameters:

  • dataset (Dataset) –

    Dataset to check downsampling conditions.

  • new_resolution (float) –

    Desired new resolution in degrees.

  • lat_name (str, default: 'latitude' ) –

    Name of the latitude coordinate. Default is "latitude".

  • lon_name (str, default: 'longitude' ) –

    Name of the longitude coordinate. Default is "longitude".

  • agg_funcs (Dict[str, str] | None, default: None ) –

    Aggregation functions for each variable.

Raises:

  • ValueError

    If coordinate names are incorrect, new resolution is non-positive, new resolution is not greater than old resolution, or agg_funcs is not None and not a dictionary.

Returns:

  • float ( float ) –

    Old resolution in degrees.
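
The resolution conditions can be illustrated on a bare list of coordinate values. The sketch below assumes a regular grid and derives the old resolution from the first two points; the helper name old_resolution_sketch is hypothetical, and the coordinate-name and agg_funcs checks are left out.

```python
from typing import Sequence


def old_resolution_sketch(coords: Sequence[float], new_resolution: float) -> float:
    """Return the current grid spacing after checking the downsampling conditions."""
    if new_resolution <= 0:
        raise ValueError("New resolution must be a positive number.")
    # Assumes a regular grid, so the first two points define the spacing.
    old_resolution = abs(coords[1] - coords[0])
    if new_resolution <= old_resolution:
        raise ValueError("New resolution must be greater than the old resolution.")
    return old_resolution


latitudes = [89.875, 89.625, 89.375, 89.125]  # 0.25-degree grid
print(old_resolution_sketch(latitudes, new_resolution=0.5))
```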

convert_360_to_180 ⚓︎

convert_360_to_180(longitude)

Convert longitude from 0-360 to -180-180.

Parameters:

  • longitude (T) –

    Longitude in 0-360 range.

Returns:

  • T ( T ) –

    Longitude in -180-180 range.
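
A common way to perform this conversion is a shifted modulo. The sketch below uses that formula on plain floats; whether the library uses exactly this expression, and how it treats the 180 degree meridian, is an assumption.

```python
def to_180(longitude: float) -> float:
    """Map a longitude from [0, 360) to [-180, 180)."""
    return (longitude + 180.0) % 360.0 - 180.0


for lon in (0.0, 90.0, 180.0, 270.0, 359.5):
    print(lon, "->", to_180(lon))
```

With this formula 180 maps to -180, so the result lies in the half-open interval from -180 (included) to 180 (excluded).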

convert_m_to_mm ⚓︎

convert_m_to_mm(precipitation)

Convert precipitation from meters to millimeters.

Parameters:

  • precipitation (T) –

    Precipitation in meters.

Returns:

  • T ( T ) –

    Precipitation in millimeters.

convert_m_to_mm_with_attributes ⚓︎

convert_m_to_mm_with_attributes(dataset, inplace=False, var_name='tp')

Convert precipitation from meters to millimeters and keep attributes.

Parameters:

  • dataset (Dataset) –

    Dataset containing precipitation in meters.

  • inplace (bool, default: False ) –

    If True, modify the original dataset. If False, return a new dataset. Default is False.

  • var_name (str, default: 'tp' ) –

    Name of the precipitation variable in the dataset. Default is "tp".

Returns:

  • Dataset

    xr.Dataset: Dataset with precipitation converted to millimeters.
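
The inplace flag and the attribute handling can be illustrated without xarray. In the sketch below a plain dictionary stands in for a data variable; the helper name and the update of the units attribute are assumptions made for illustration.

```python
import copy


def m_to_mm_keep_attrs(variable: dict, inplace: bool = False) -> dict:
    """Scale values from metres to millimetres while keeping the attributes."""
    # Work on a deep copy unless the caller asked for an in-place update.
    target = variable if inplace else copy.deepcopy(variable)
    target["values"] = [value * 1000.0 for value in target["values"]]
    # Assumption: the unit label is updated; all other attributes are untouched.
    target["attrs"]["units"] = "mm"
    return target


tp = {"values": [0.0, 0.5, 2.0], "attrs": {"units": "m", "long_name": "Total precipitation"}}
converted = m_to_mm_keep_attrs(tp)
print(converted["values"], converted["attrs"])
print(tp["attrs"]["units"])  # the original is unchanged because inplace=False
```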

convert_to_celsius ⚓︎

convert_to_celsius(temperature_kelvin)

Convert temperature from Kelvin to Celsius.

Parameters:

  • temperature_kelvin (T) –

    Temperature in Kelvin, accessed through the t2m variable in the dataset.

Returns:

  • T ( T ) –

    Temperature in Celsius.
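
The conversion itself is a constant offset of 273.15. A minimal sketch on plain floats (the helper name is hypothetical):

```python
KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in Kelvin


def kelvin_to_celsius(temperature_kelvin: float) -> float:
    """Subtract the fixed offset between the Kelvin and Celsius scales."""
    return temperature_kelvin - KELVIN_OFFSET


print(kelvin_to_celsius(273.15))
print(round(kelvin_to_celsius(300.0), 2))
```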

convert_to_celsius_with_attributes ⚓︎

convert_to_celsius_with_attributes(dataset, inplace=False, var_name='t2m')

Convert temperature from Kelvin to Celsius and keep attributes.

Parameters:

  • dataset (Dataset) –

    Dataset containing temperature in Kelvin.

  • inplace (bool, default: False ) –

    If True, modify the original dataset. If False, return a new dataset. Default is False.

  • var_name (str, default: 't2m' ) –

    Name of the temperature variable in the dataset. Default is "t2m".

Returns:

  • Dataset

    xr.Dataset: Dataset with temperature converted to Celsius.

downsample_resolution_with_cdo ⚓︎

downsample_resolution_with_cdo(dataset, new_resolution=0.5, new_min_lat=None, new_lat_size=None, new_min_lon=None, new_lon_size=None, lat_name='latitude', lon_name='longitude', agg_funcs=None, gridtype='lonlat')

Downsample the resolution of a dataset using CDO.

Parameters:

  • dataset (Dataset) –

    Dataset to change resolution.

  • new_resolution (float, default: 0.5 ) –

    New resolution in degrees. Default is 0.5.

  • new_min_lat (float, default: None ) –

    Minimum latitude of the new grid. Default is None.

  • new_lat_size (int, default: None ) –

    Size of latitude of the new grid. Default is None.

  • new_min_lon (float, default: None ) –

    Minimum longitude of the new grid. Default is None.

  • new_lon_size (int, default: None ) –

    Size of longitude of the new grid. Default is None.

  • lat_name (str, default: 'latitude' ) –

    Name of the latitude coordinate. Default is "latitude".

  • lon_name (str, default: 'longitude' ) –

    Name of the longitude coordinate. Default is "longitude".

  • agg_funcs (Dict[str, str] | None, default: None ) –

    Aggregation functions for each variable. If None, the default aggregation is used, i.e. bil (bilinear). Default is None. Possible functions are:

      • nn (nearest neighbor)
      • bil (bilinear)
      • bic (bicubic)
      • con (conservative)
      • con2 (conservative 2nd order)

  • gridtype (Literal['gaussian', 'lonlat', 'curvilinear', 'unstructured'], default: 'lonlat' ) –

    Type of the grid. Default is "lonlat".

Returns:

  • Dataset

    xr.Dataset: Dataset with changed resolution.
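
CDO remapping operators take a target grid, which can be given as a grid description file. The sketch below shows how the grid parameters above could map onto such a file for a regular lonlat grid. The helper name and its default values are hypothetical, and whether the function builds its target grid this way is an assumption.

```python
def cdo_grid_description(
    new_resolution: float = 0.5,
    new_min_lat: float = -89.75,
    new_lat_size: int = 360,
    new_min_lon: float = -179.75,
    new_lon_size: int = 720,
    gridtype: str = "lonlat",
) -> str:
    """Build the text of a CDO grid description for a regular grid."""
    lines = [
        f"gridtype = {gridtype}",
        f"xsize = {new_lon_size}",   # number of longitude points
        f"ysize = {new_lat_size}",   # number of latitude points
        f"xfirst = {new_min_lon}",   # first longitude
        f"xinc = {new_resolution}",  # longitude step in degrees
        f"yfirst = {new_min_lat}",   # first latitude
        f"yinc = {new_resolution}",  # latitude step in degrees
    ]
    return "\n".join(lines)


print(cdo_grid_description())
```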

downsample_resolution_with_xarray ⚓︎

downsample_resolution_with_xarray(dataset, new_resolution=0.5, lat_name='latitude', lon_name='longitude', agg_funcs=None)

Downsample the resolution of a dataset.

Parameters:

  • dataset (Dataset) –

    Dataset to change resolution.

  • new_resolution (float, default: 0.5 ) –

    New resolution in degrees. Default is 0.5.

  • lat_name (str, default: 'latitude' ) –

    Name of the latitude coordinate. Default is "latitude".

  • lon_name (str, default: 'longitude' ) –

    Name of the longitude coordinate. Default is "longitude".

  • agg_funcs (Dict[str, str] | None, default: None ) –

    Aggregation functions for each variable. If None, the default aggregation (i.e. mean) is used. Default is None. Possible functions are:

      • mean
      • sum
      • max
      • min

Returns:

  • Dataset

    xr.Dataset: Dataset with changed resolution.
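
Downsampling with an aggregation function amounts to grouping neighbouring cells into blocks and reducing each block. The one-dimensional sketch below illustrates the idea in plain Python; it is not the library's implementation, and the helper name coarsen_1d is hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": mean,
    "sum": sum,
    "max": max,
    "min": min,
}


def coarsen_1d(
    values: Sequence[float],
    old_resolution: float,
    new_resolution: float,
    agg: str = "mean",
) -> List[float]:
    """Aggregate consecutive cells into blocks of new/old resolution cells."""
    factor = round(new_resolution / old_resolution)
    func = AGGREGATORS[agg]
    # Trailing cells that do not fill a whole block are dropped in this sketch.
    n_blocks = len(values) // factor
    return [func(values[i * factor:(i + 1) * factor]) for i in range(n_blocks)]


tp = [1.0, 3.0, 2.0, 4.0]  # four 0.25-degree cells
print(coarsen_1d(tp, 0.25, 0.5, "mean"))
print(coarsen_1d(tp, 0.25, 0.5, "sum"))
```

The choice of function matters: a mean suits intensive quantities such as temperature, while a sum preserves totals such as accumulated precipitation.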

downsample_resolution_with_xesmf ⚓︎

downsample_resolution_with_xesmf(dataset, new_resolution=0.5, new_min_lat=None, new_max_lat=None, new_min_lon=None, new_max_lon=None, lat_name='latitude', lon_name='longitude', agg_funcs=None)

Downsample the resolution of a dataset using xESMF. Ref: https://xesmf.readthedocs.io/en/stable/notebooks/Rectilinear_grid.html

Parameters:

  • dataset (Dataset) –

    Dataset to change resolution.

  • new_resolution (float, default: 0.5 ) –

    New resolution in degrees. Default is 0.5.

  • new_min_lat (float, default: None ) –

    Minimum latitude of the new grid. Default is None.

  • new_max_lat (float, default: None ) –

    Maximum latitude of the new grid. Default is None.

  • new_min_lon (float, default: None ) –

    Minimum longitude of the new grid. Default is None.

  • new_max_lon (float, default: None ) –

    Maximum longitude of the new grid. Default is None.

  • lat_name (str, default: 'latitude' ) –

    Name of the latitude coordinate. Default is "latitude".

  • lon_name (str, default: 'longitude' ) –

    Name of the longitude coordinate. Default is "longitude".

  • agg_funcs (Dict[str, str] | None, default: None ) –

    Aggregation functions for each variable. If None, the default aggregation is used, i.e. bilinear for all variables. Possible functions are:

      • bilinear
      • conservative (needs grid corner information)
      • conservative_normed (needs grid corner information)
      • patch
      • nearest_s2d
      • nearest_d2s

Returns:

  • Dataset

    xr.Dataset: Dataset with changed resolution.

preprocess_data_file ⚓︎

preprocess_data_file(netcdf_file, source='era5', settings='default', new_settings=None, unique_tag=None)

Preprocess the dataset based on the provided settings. If settings is "default", the default settings of the source are used. The settings and the preprocessed file are saved in a directory determined by the settings file and the unique tag.

Parameters:

  • netcdf_file (Path) –

    Path to the NetCDF file to preprocess.

  • source (Literal['era5', 'isimip'], default: 'era5' ) –

    Source of the data. Defaults to "era5".

  • settings (Path | str, default: 'default' ) –

    Path to the settings file or "default" for default settings.

  • new_settings (Dict[str, Any] | None, default: None ) –

    Additional settings to overwrite defaults. Defaults to None.

  • unique_tag (str | None, default: None ) –

    Unique tag to append to the output file name and settings file. Defaults to None.

Returns:

  • Tuple[Dataset, str]

    Tuple[xr.Dataset, str]: Preprocessed dataset and the name of the preprocessed file.

rename_coords ⚓︎

rename_coords(dataset, coords_mapping)

Rename coordinates in the dataset based on a mapping.

Parameters:

  • dataset (Dataset) –

    Dataset with coordinates to rename.

  • coords_mapping (dict) –

    Mapping of old coordinate names to new names.

Returns:

  • Dataset

    xr.Dataset: A new dataset with renamed coordinates.

resample_resolution ⚓︎

resample_resolution(dataset, resolution_config=ResolutionConfig(), grid_config=GridConfig())

Resample the grid of a dataset to a new resolution.

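Parameters:

  • dataset (Dataset) –

    Dataset to resample.

  • resolution_config (ResolutionConfig, default: ResolutionConfig() ) –

    Configuration for resolution resampling.

  • grid_config (GridConfig, default: GridConfig() ) –

    Configuration for grid specification for resampling.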

Returns:

  • Dataset

    xr.Dataset: Resampled dataset with changed resolution.

shift_time ⚓︎

shift_time(dataset, offset=-1, time_unit='D', var_name='time')

Shift the time coordinate of a dataset by a specified timedelta. The dataset is overwritten with the shifted time values.

Parameters:

  • dataset (Dataset) –

    Dataset to shift.

  • offset (int, default: -1 ) –

    Amount to shift the time coordinate. Default is -1.

  • time_unit (Literal['W', 'D', 'h', 'm', 's', 'ms', 'ns'], default: 'D' ) –

    Time unit for the shift. Default is "D".

  • var_name (str, default: 'time' ) –

    Name of the time variable in the dataset. Default is "time".
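
The effect of offset and time_unit can be sketched with the standard library. The helper below is hypothetical and returns a new list, whereas the documented function overwrites the dataset; it also assumes that "m" means minutes and leaves out "ns", which datetime.timedelta cannot represent.

```python
from datetime import datetime, timedelta
from typing import List

# timedelta keyword for each supported unit ("ns" is omitted, see above).
UNIT_TO_KWARG = {
    "W": "weeks",
    "D": "days",
    "h": "hours",
    "m": "minutes",
    "s": "seconds",
    "ms": "milliseconds",
}


def shift_times(times: List[datetime], offset: int = -1, time_unit: str = "D") -> List[datetime]:
    """Return the timestamps shifted by offset units."""
    delta = timedelta(**{UNIT_TO_KWARG[time_unit]: offset})
    return [t + delta for t in times]


times = [datetime(2025, 10, 2), datetime(2025, 10, 3)]
print(shift_times(times))  # defaults move each timestamp one day back
```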

truncate_data_by_time ⚓︎

truncate_data_by_time(dataset, start_date, end_date=None, var_name='time')

Truncate data from a specific start date to an end date. Both dates are inclusive.

Parameters:

  • dataset (Dataset) –

    Dataset to truncate.

  • start_date (Union[str, datetime64]) –

    Start date for truncation. Format as "YYYY-MM-DD" or as a numpy datetime64 object.

  • end_date (Union[str, datetime64, None], default: None ) –

    End date for truncation. Format as "YYYY-MM-DD" or as a numpy datetime64 object. If None, truncate until the last date in the dataset. Default is None.

  • var_name (str, default: 'time' ) –

    Name of the time variable in the dataset. Default is "time".

Returns:

  • Dataset

    xr.Dataset: Dataset truncated from the specified start date.
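
The inclusive bounds and the open-ended default can be illustrated on a plain list of dates. This is a sketch with a hypothetical helper name, not the library's implementation.

```python
from datetime import date
from typing import List, Optional


def truncate_dates(dates: List[date], start_date: str, end_date: Optional[str] = None) -> List[date]:
    """Keep the dates between start_date and end_date, both inclusive."""
    start = date.fromisoformat(start_date)
    # Without an end date, keep everything up to the last available date.
    end = date.fromisoformat(end_date) if end_date is not None else max(dates)
    return [d for d in dates if start <= d <= end]


dates = [date(2025, 1, day) for day in range(1, 6)]
print(truncate_dates(dates, "2025-01-02", "2025-01-04"))
print(truncate_dates(dates, "2025-01-04"))
```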

upsample_resolution ⚓︎

upsample_resolution(dataset, new_resolution=0.1, lat_name='latitude', lon_name='longitude', method_map=None)

Upsample the resolution of a dataset using xarray.interp.

Parameters:

  • dataset (Dataset) –

    Dataset to change resolution.

  • new_resolution (float, default: 0.1 ) –

    New resolution in degrees. Default is 0.1.

  • lat_name (str, default: 'latitude' ) –

    Name of the latitude coordinate. Default is "latitude".

  • lon_name (str, default: 'longitude' ) –

    Name of the longitude coordinate. Default is "longitude".

  • method_map (Dict[str, str] | None, default: None ) –

    Mapping of variable names to interpolation methods. If None, linear interpolation is used. Default is None.

Returns:

  • Dataset

    xr.Dataset: Dataset with changed resolution.
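
Linear interpolation onto a finer grid, the default method here, can be sketched in one dimension with plain Python. The helper name upsample_linear is hypothetical, and the sketch assumes ascending, regularly spaced coordinates.

```python
from typing import List, Sequence, Tuple


def upsample_linear(
    coords: Sequence[float],
    values: Sequence[float],
    new_resolution: float,
) -> Tuple[List[float], List[float]]:
    """Linearly interpolate values onto a finer, regularly spaced coordinate."""
    n_new = round((coords[-1] - coords[0]) / new_resolution) + 1
    new_coords = [coords[0] + i * new_resolution for i in range(n_new)]
    new_values = []
    j = 0  # index of the left neighbour in the original grid
    for x in new_coords:
        while j < len(coords) - 2 and x > coords[j + 1]:
            j += 1
        x0, x1 = coords[j], coords[j + 1]
        weight = (x - x0) / (x1 - x0)
        new_values.append(values[j] + weight * (values[j + 1] - values[j]))
    return new_coords, new_values


lon = [0.0, 0.5, 1.0]
t2m = [10.0, 12.0, 11.0]
new_lon, new_t2m = upsample_linear(lon, t2m, new_resolution=0.25)
print(new_lon)
print(new_t2m)
```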