Tutorial C: Aggregate the data¶
heiplanet-data Python package: aggregation of the processed data into NUTS or other regions
Authors: Scientific Software Center
Date: October 2025
Version: 1.0
Overview¶
This tutorial demonstrates how to postprocess data with heiplanet-data. You will learn how to:
- Specify the regional averaging for heiplanet-data: work with NUTS regions to create aggregated maps
- Data operations: aggregate the data on regional grids
- Data visualization: create plots to verify the processed data
We will postprocess the data by aggregating it over NUTS regions. These can range from NUTS0 down to NUTS3, depending on the granularity at which you would like to extract values from the aggregated data.
We will start by importing the necessary libraries.
# if running on google colab
# flake8-noqa-cell
if "google.colab" in str(get_ipython()):
    # install packages
    %pip install git+https://github.com/ssciwr/heiplanet-data.git -qqq
from pathlib import Path
import time
import pooch
from heiplanet_data import preprocess
import xarray as xr
from matplotlib import pyplot as plt
import geopandas as gpd
import matplotlib as mpl
# change to your own data folder, if needed
data_root = Path("data/")
data_folder = data_root / "in"
processed_folder = Path.cwd() / "processed"
# replace the below lines with your processed data filenames
era5_pfname = "era5_data_2016-2017_allm_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim_tutorial_B.nc"
isimip_pfname = "population_histsoc_30arcmin_annual_1901_2021_unicoords_2016-2017_ts20251014-092525_hssc08.nc"
jmodel_fname = "output_JModel_global_025deg_trim_ts20251014-092703_hssc08.nc"
# download the NUTS shapefile
filename = "NUTS_RG_20M_2024_4326.shp.zip"
url = "https://heibox.uni-heidelberg.de/f/e95125307161470a8906/?dl=1"
filehash = "246a5e1a3901380f194879cff5a65bbac86c9c906a7871027a3a918dbe6c7e46"
try:
    file = pooch.retrieve(
        url=url,
        known_hash=filehash,
        fname=filename,
        path=data_folder,
    )
except Exception as e:
    print(f"Error fetching data: {e}")
    raise RuntimeError(f"Failed to fetch data from {url}") from e
print(f"Data fetched and saved to {file}")
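The known_hash passed to pooch.retrieve above is a SHA-256 digest (pooch interprets a bare hex string as SHA-256 by default). If you host your own data files, you can compute such a digest with the standard library; a minimal sketch:

```python
import hashlib


def sha256_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The returned hex string can then be passed as the `known_hash` argument, so pooch can verify the download.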
1. NUTS specification¶
Eurostat defines the NUTS regions; the definitions and the corresponding shapefiles can be downloaded from the Eurostat website.
For downloading, please choose:
- NUTS year: the latest available year
- File format: SHP
- Geometry type: Polygons (RG)
- Scale: 20M
- CRS: EPSG:4326
Inside the zip folder, there are five different shapefiles, which are all required to display and extract the NUTS regions data.
shape data folder
|____.shp file: geometry data (e.g. polygons)
|____.shx file: index for geometry data
|____.dbf file: attribute data for each NUTS region (e.g. NUTS name, NUTS ID)
|____.prj file: information on CRS
|____.cpg file: character encoding data
These NUTS definition files cover Europe only. If a country does not have a given NUTS level $x \in \{1, 2, 3\}$, the corresponding data for that level is excluded from the shapefiles. You do not need to extract the zip folder.
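Since the zip folder does not need to be extracted, you can check that all five component files are present directly with the standard library; a minimal sketch (the grouping by extension is just for illustration):

```python
import zipfile


def list_shapefile_parts(zip_path: str) -> dict:
    """Group the members of a zipped shapefile by file extension."""
    parts: dict = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            ext = name.rsplit(".", 1)[-1].lower()
            parts.setdefault(ext, []).append(name)
    return parts
```

Running this on the downloaded `NUTS_RG_20M_2024_4326.shp.zip` should show the `shp`, `shx`, `dbf`, `prj` and `cpg` members listed above.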
NUTS_ID explanation¶
- Structure of NUTS_ID: `<country><level>`
  - country: 2 letters, representing the name of a country, e.g. DE
  - level: 0 to 3 additional letters or digits, signifying the level of the NUTS region
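Given this structure, the NUTS level can be read off as the number of characters following the two-letter country code. A minimal helper illustrating this (not part of heiplanet-data):

```python
def nuts_level(nuts_id: str) -> int:
    """Return the NUTS level encoded in a NUTS_ID string.

    A bare two-letter country code (e.g. "DE") is level 0; each
    additional character raises the level by one, up to level 3.
    """
    if not 2 <= len(nuts_id) <= 5:
        raise ValueError(f"Unexpected NUTS_ID: {nuts_id!r}")
    return len(nuts_id) - 2


# e.g. nuts_level("DE") -> 0, nuts_level("DE11") -> 2
```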
First, we need to tell heiplanet-data where to find the geometric shapes over which we will aggregate. In principle, any shape folder with a content structure similar to the NUTS shape folder should work.
# NUTS shapefile
nuts_file = data_folder / "NUTS_RG_20M_2024_4326.shp.zip"
Next, we specify which files will be aggregated. All files are aggregated in a single pass, which is more efficient than processing them one by one.
# specify which files will be aggregated by NUTS regions
# key is name of the dataset, value is a tuple of (file path, aggregation mapping dict.)
non_nuts_data = {
    "era5": (processed_folder / era5_pfname, None),
    "popu": (processed_folder / isimip_pfname, None),
    "jmodel": (processed_folder / jmodel_fname, None),
}
2. Aggregate the data over NUTS regions¶
To aggregate the data, we now need to pass the NUTS file path and the non_nuts_data dictionary to the function for the processing. Here, the keys represent dataset names (used to form the resulting file name), and the values are tuples containing the file path and the aggregation mapping.
By default, the aggregation mapping is set to None, which means the mean function will be applied to all data variables during aggregation.
An example of aggregation mapping dictionary is:
{
    "t2m": "mean",
    "tp": "sum"
}
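To illustrate what such a mapping does, here is a small pandas sketch with toy data (heiplanet-data performs the actual aggregation internally): each variable is reduced over the grid cells of a region with the function named in the mapping.

```python
import pandas as pd

# toy data: two grid cells per region (illustrative values only)
df = pd.DataFrame(
    {
        "NUTS_ID": ["DE1", "DE1", "FR1", "FR1"],
        "t2m": [10.0, 14.0, 8.0, 12.0],  # temperature: to be averaged
        "tp": [1.0, 2.0, 0.5, 1.5],      # precipitation: to be summed
    }
)
agg_map = {"t2m": "mean", "tp": "sum"}

# apply a different reduction per variable, one row per region
result = df.groupby("NUTS_ID").agg(agg_map)
# result.loc["DE1"] -> t2m 12.0, tp 3.0
```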
The resulting file name would be:
`<NUTS_shapefile_name>_agg_<nc_dataset_names>_<min_yyyy-mm>-<max_yyyy-mm>.nc`
# aggregate data by NUTS regions
t0 = time.time()
aggregated_file = preprocess.aggregate_data_by_nuts(
    non_nuts_data, nuts_file, normalize_time=True, output_dir=processed_folder
)
t1 = time.time()
print(f"Aggregation completed in {t1 - t0:.2f} seconds")
3. Plot the aggregated data¶
# read the netcdf data into xarray
nuts_grid = xr.open_dataset(aggregated_file)
# convert the xarray DataArray to pandas DataFrame
# to be able to merge with the GeoDataFrame
nuts_grid = nuts_grid.to_dataframe().reset_index()
nuts_grid.head(5)
# read the NUTS shapefile
nuts_shapefile = data_folder / "NUTS_RG_20M_2024_4326.shp.zip"
NUTS_shapes = gpd.read_file(nuts_shapefile)
# merge the shapes data with the grid values
data_nuts = NUTS_shapes.merge(nuts_grid, on="NUTS_ID")
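GeoDataFrame.merge follows the pandas merge semantics: rows are joined on matching NUTS_ID values, and with the default inner join, regions without a matching data row are dropped. A small pandas sketch with toy data (geometries replaced by placeholder strings):

```python
import pandas as pd

# toy stand-ins for the shapefile table and the aggregated values
shapes = pd.DataFrame(
    {"NUTS_ID": ["DE1", "FR1", "ES1"], "geometry": ["g1", "g2", "g3"]}
)
values = pd.DataFrame({"NUTS_ID": ["DE1", "FR1"], "t2m": [12.0, 10.0]})

# inner join on NUTS_ID: ES1 has no value row, so it is dropped
merged = shapes.merge(values, on="NUTS_ID")
```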
# Create plots for selected times
selected_times = ["2016-01-01", "2016-07-01"]
# Create figure with subplots
fig, axes = plt.subplots(3, 2, figsize=(14, 12))
fig.suptitle("NUTS aggregated data plots", fontsize=16, y=0.94)
# Plot the aggregated data for the selected times
for i, time_val in enumerate(selected_times):
    # Filter data for current timestamp
    current_data = data_nuts[data_nuts["time"] == time_val]
    # Use GeoDataFrame.plot to draw polygons colored by the 't2m' column
    current_data.plot(
        column="t2m",
        ax=axes[0, i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[0, i].set_title(f"t2m - {time_val[:7]}", pad=15)
    axes[0, i].set_xlim(-10, 35)
    axes[0, i].set_ylim(34, 72)
    current_data.plot(
        column="tp",
        ax=axes[1, i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[1, i].set_title(f"precipitation - {time_val[:7]}", pad=15)
    axes[1, i].set_xlim(-10, 35)
    axes[1, i].set_ylim(34, 72)
    current_data.plot(
        column="R0",
        ax=axes[2, i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[2, i].set_title(f"model output - {time_val[:7]}", pad=15)
    axes[2, i].set_xlim(-10, 35)
    axes[2, i].set_ylim(34, 72)
# Adjust layout with more spacing
plt.tight_layout(
    rect=[0, 0.05, 0.85, 0.94]
)  # Adjusted to leave more space for colorbars
plt.subplots_adjust(hspace=0.25, wspace=0.15)
# Create colorbars using ScalarMappable so they work with GeoDataFrame plots
mappable1 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable1.set_array(data_nuts["t2m"].values)
cbar1 = fig.colorbar(
    mappable1, ax=axes[0, :], orientation="vertical", pad=0.02, shrink=0.8
)
cbar1.set_label("Temperature (°C)", fontsize=10)
mappable2 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable2.set_array(data_nuts["tp"].values)
cbar2 = fig.colorbar(
    mappable2, ax=axes[1, :], orientation="vertical", pad=0.02, shrink=0.8
)
cbar2.set_label("Precipitation (mm)", fontsize=10)
mappable3 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable3.set_array(data_nuts["R0"].values)
cbar3 = fig.colorbar(
    mappable3, ax=axes[2, :], orientation="vertical", pad=0.02, shrink=0.8
)
cbar3.set_label("Model output (R0)", fontsize=10)
plt.show()
# also plot the population data
# note that population is not available for every time step
selected_times = ["2016-01-01", "2017-01-01"]
# Create figure with subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Plot the population data for the selected times
for i, time_val in enumerate(selected_times):
    # Filter data for current timestamp
    current_data_pop = data_nuts[data_nuts["time"] == time_val]
    current_data_pop.plot(
        column="total-population",
        ax=axes[i],
        cmap="coolwarm",
        legend=False,
        missing_kwds={"color": "lightgrey"},
    )
    axes[i].set_title(f"total population - {time_val[:7]}", pad=15)
    axes[i].set_xlim(-10, 35)
    axes[i].set_ylim(34, 72)
# Create a colorbar using ScalarMappable so it works with GeoDataFrame plots
mappable1 = mpl.cm.ScalarMappable(cmap="coolwarm")
mappable1.set_array(data_nuts["total-population"].values)
cbar1 = fig.colorbar(
    mappable1, ax=axes[:], orientation="vertical", pad=0.02, shrink=0.8
)
cbar1.set_label("Total population", fontsize=10)
plt.show()