Tutorial C: Aggregate the data¶
heiplanet-data Python package: aggregation of the processed data into NUTS or other regions
Authors: Scientific Software Center
Date: October 2025
Version: 1.0
Overview¶
This tutorial demonstrates how to postprocess data with heiplanet-data. You will learn how to:
- Specify the regional averaging for heiplanet-data: work with NUTS regions to create aggregated maps
- Data operations: aggregate the data on regional grids
- Data visualization: create plots to verify the processed data
We will postprocess the data by aggregating it over NUTS regions. These can range from NUTS0 down to NUTS3, depending on the granularity at which you would like to extract values from the aggregated data.
We will start by importing the necessary libraries.
# if running on google colab
# flake8-noqa-cell
if "google.colab" in str(get_ipython()):
    # install packages
    %pip install git+https://github.com/ssciwr/heiplanet-data.git -qqq
from pathlib import Path
import time
import pooch
from heiplanet_data import preprocess
import xarray as xr
from matplotlib import pyplot as plt
import geopandas as gpd
import matplotlib as mpl
# change to your own data folder, if needed
data_root = Path("data/")
data_folder = data_root / "in"
processed_folder = Path.cwd() / "processed"
# replace the below lines with your processed data filenames
era5_pfname = "era5_data_2016-2017_allm_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim_tutorial_B.nc"
isimip_pfname = "population_histsoc_30arcmin_annual_1901_2021_unicoords_2016-2017_ts20251014-092525_hssc08.nc"
jmodel_fname = "output_JModel_global_025deg_trim_ts20251014-092703_hssc08.nc"
# download the NUTS shapefile
filename = "NUTS_RG_20M_2024_4326.shp.zip"
url = "https://heibox.uni-heidelberg.de/f/e95125307161470a8906/?dl=1"
filehash = "246a5e1a3901380f194879cff5a65bbac86c9c906a7871027a3a918dbe6c7e46"
try:
    file = pooch.retrieve(
        url=url,
        known_hash=filehash,
        fname=filename,
        path=data_folder,
    )
except Exception as e:
    print(f"Error fetching data: {e}")
    raise RuntimeError(f"Failed to fetch data from {url}") from e
print(f"Data fetched and saved to {file}")
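The known_hash passed to pooch.retrieve above is a SHA-256 digest (pooch interprets a bare hex string as SHA-256 by default). If you host your own data files, you can compute such a digest with the standard library; a minimal sketch:

```python
import hashlib


def sha256_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The returned hex string can then be passed as the `known_hash` argument, so pooch can verify the download.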
1. NUTS specification¶
Eurostat defines the NUTS regions; the definitions and the corresponding shapefiles can be downloaded from the Eurostat website.
For downloading, please choose:
- NUTS year: the latest available year
- File format: SHP
- Geometry type: Polygons (RG)
- Scale: 20M
- CRS: EPSG:4326
Inside the zip folder, there are five different shapefiles, which are all required to display and extract the NUTS regions data.
shape data folder
|____.shp file: geometry data (e.g. polygons)
|____.shx file: index for geometry data
|____.dbf file: attribute data for each NUTS region (e.g. NUTS name, NUTS ID)
|____.prj file: information on CRS
|____.cpg file: character encoding data
These NUTS definition files cover Europe only. If a country does not have a given NUTS level $x \in \{1, 2, 3\}$, the corresponding data for that level is excluded from the shapefiles. You do not need to extract the zip folder.
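Since the zip folder does not need to be extracted, you can check that all five component files are present directly with the standard library; a minimal sketch (the grouping by extension is just for illustration):

```python
import zipfile


def list_shapefile_parts(zip_path: str) -> dict:
    """Group the members of a zipped shapefile by file extension."""
    parts: dict = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            ext = name.rsplit(".", 1)[-1].lower()
            parts.setdefault(ext, []).append(name)
    return parts
```

Running this on the downloaded `NUTS_RG_20M_2024_4326.shp.zip` should show the `shp`, `shx`, `dbf`, `prj` and `cpg` members listed above.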
NUTS_ID explanation¶
- Structure of NUTS_ID: `<country><level>`
  - country: 2 letters, representing the name of a country, e.g. DE
  - level: 0 to 3 additional letters or digits, signifying the level of the NUTS region
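Given this structure, the NUTS level can be read off as the number of characters following the two-letter country code. A minimal helper illustrating this (not part of heiplanet-data):

```python
def nuts_level(nuts_id: str) -> int:
    """Return the NUTS level encoded in a NUTS_ID string.

    A bare two-letter country code (e.g. "DE") is level 0; each
    additional character raises the level by one, up to level 3.
    """
    if not 2 <= len(nuts_id) <= 5:
        raise ValueError(f"Unexpected NUTS_ID: {nuts_id!r}")
    return len(nuts_id) - 2


# e.g. nuts_level("DE") -> 0, nuts_level("DE11") -> 2
```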
First, we need to tell heiplanet-data where to find the geometric shapes over which we will aggregate. In principle, any shape folder with a content structure similar to the NUTS shape folder should work.
# NUTS shapefile
nuts_file = data_folder / "NUTS_RG_20M_2024_4326.shp.zip"
Next, we specify which files will be aggregated. All files are aggregated in a single pass, which is more efficient than processing them one by one.
# specify which files will be aggregated by NUTS regions
# key is name of the dataset, value is a tuple of (file path, aggregation mapping dict.)
non_nuts_data = {
    "era5": (processed_folder / era5_pfname, None),
    "popu": (processed_folder / isimip_pfname, None),
    "jmodel": (processed_folder / jmodel_fname, None),
}
2. Aggregate the data over NUTS regions¶
To aggregate the data, we now need to pass the NUTS file path and the non_nuts_data dictionary to the function for the processing. Here, the keys represent dataset names (used to form the resulting file name), and the values are tuples containing the file path and the aggregation mapping.
By default, the aggregation mapping is set to None, which means the mean function will be applied to all data variables during aggregation.
An example of aggregation mapping dictionary is:
{
    "t2m": "mean",
    "tp": "sum"
}
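To illustrate what such a mapping does, here is a small pandas sketch with toy data (heiplanet-data performs the actual aggregation internally): each variable is reduced over the grid cells of a region with the function named in the mapping.

```python
import pandas as pd

# toy data: two grid cells per region (illustrative values only)
df = pd.DataFrame(
    {
        "NUTS_ID": ["DE1", "DE1", "FR1", "FR1"],
        "t2m": [10.0, 14.0, 8.0, 12.0],  # temperature: to be averaged
        "tp": [1.0, 2.0, 0.5, 1.5],      # precipitation: to be summed
    }
)
agg_map = {"t2m": "mean", "tp": "sum"}

# apply a different reduction per variable, one row per region
result = df.groupby("NUTS_ID").agg(agg_map)
# result.loc["DE1"] -> t2m 12.0, tp 3.0
```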
The resulting file name would be:
`<NUTS_shapefile_name>_agg_<nc_dataset_names>_<min_yyyy-mm>-<max_yyyy-mm>.nc`
# aggregate data by NUTS regions
t0 = time.time()
aggregated_file = preprocess.aggregate_data_by_nuts(
    non_nuts_data, nuts_file, normalize_time=True, output_dir=processed_folder
)
t1 = time.time()
print(f"Aggregation completed in {t1 - t0:.2f} seconds")
3. Plot the aggregated data¶
# read the netcdf data into xarray
nuts_grid = xr.open_dataset(aggregated_file)
# convert the xarray DataArray to pandas DataFrame
# to be able to merge with the GeoDataFrame
nuts_grid = nuts_grid.to_dataframe().reset_index()
nuts_grid.head(5)
# read the NUTS shapefile
nuts_shapefile = data_folder / "NUTS_RG_20M_2024_4326.shp.zip"
NUTS_shapes = gpd.read_file(nuts_shapefile)
# merge the shapes data with the grid values
data_nuts = NUTS_shapes.merge(nuts_grid, on="NUTS_ID")
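GeoDataFrame.merge follows the pandas merge semantics: rows are joined on matching NUTS_ID values, and with the default inner join, regions without a matching data row are dropped. A small pandas sketch with toy data (geometries replaced by placeholder strings):

```python
import pandas as pd

# toy stand-ins for the shapefile table and the aggregated values
shapes = pd.DataFrame(
    {"NUTS_ID": ["DE1", "FR1", "ES1"], "geometry": ["g1", "g2", "g3"]}
)
values = pd.DataFrame({"NUTS_ID": ["DE1", "FR1"], "t2m": [12.0, 10.0]})

# inner join on NUTS_ID: ES1 has no value row, so it is dropped
merged = shapes.merge(values, on="NUTS_ID")
```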
# Create plots for selected times
selected_times = ["2016-01-01", "2016-07-01"]
# Create figure with subplots
fig, axes = plt.subplots(3, 2, figsize=(14, 12))
fig.suptitle("NUTS aggregated data plots", fontsize=16, y=0.94)
# Plot the aggregated data for the selected times
for i, time_val in enumerate(selected_times):
    # Filter data for current timestamp
    current_data = data_nuts[data_nuts["time"] == time_val]
    # Use GeoDataFrame.plot to draw polygons colored by the 't2m' column
    current_data.plot(
        column="t2m",
        ax=axes[0, i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[0, i].set_title(f"t2m - {time_val[:7]}", pad=15)
    axes[0, i].set_xlim(-10, 35)
    axes[0, i].set_ylim(34, 72)
    current_data.plot(
        column="tp",
        ax=axes[1, i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[1, i].set_title(f"precipitation - {time_val[:7]}", pad=15)
    axes[1, i].set_xlim(-10, 35)
    axes[1, i].set_ylim(34, 72)
    current_data.plot(
        column="R0",
        ax=axes[2, i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[2, i].set_title(f"model output - {time_val[:7]}", pad=15)
    axes[2, i].set_xlim(-10, 35)
    axes[2, i].set_ylim(34, 72)
# Adjust layout with more spacing
plt.tight_layout(
    rect=[0, 0.05, 0.85, 0.94]
)  # Adjusted to leave more space for colorbars
plt.subplots_adjust(hspace=0.25, wspace=0.15)
# Create colorbars using ScalarMappable so they work with GeoDataFrame plots
mappable1 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable1.set_array(data_nuts["t2m"].values)
cbar1 = fig.colorbar(
    mappable1, ax=axes[0, :], orientation="vertical", pad=0.02, shrink=0.8
)
cbar1.set_label("Temperature (°C)", fontsize=10)
mappable2 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable2.set_array(data_nuts["tp"].values)
cbar2 = fig.colorbar(
    mappable2, ax=axes[1, :], orientation="vertical", pad=0.02, shrink=0.8
)
cbar2.set_label("Precipitation (mm)", fontsize=10)
mappable3 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable3.set_array(data_nuts["R0"].values)
cbar3 = fig.colorbar(
    mappable3, ax=axes[2, :], orientation="vertical", pad=0.02, shrink=0.8
)
cbar3.set_label("Model output (R0)", fontsize=10)
plt.show()
# also plot the population data
# note that population is not available for every time step
selected_times = ["2016-01-01", "2017-01-01"]
# Create figure with subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Plot the population data for the selected times
for i, time_val in enumerate(selected_times):
    # Filter data for current timestamp
    current_data_pop = data_nuts[data_nuts["time"] == time_val]
    current_data_pop.plot(
        column="total-population",
        ax=axes[i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[i].set_title(f"total population - {time_val[:7]}", pad=15)
    axes[i].set_xlim(-10, 35)
    axes[i].set_ylim(34, 72)
# Create a colorbar using ScalarMappable so it works with GeoDataFrame plots
mappable1 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable1.set_array(data_nuts["total-population"].values)
cbar1 = fig.colorbar(
    mappable1, ax=axes[:], orientation="vertical", pad=0.02, shrink=0.8
)
cbar1.set_label("Total population", fontsize=10)
plt.show()