Tutorial A: Download Data from Copernicus and ISIMIP¶
heiplanet-data Python package - data download and visualization of the raw data
Authors: Scientific Software Center
Date: October 2025
Version: 1.0
Overview¶
This tutorial demonstrates how to download data files through the Copernicus and ISIMIP APIs and inspect the data using xarray. You will learn about:
- Data Sources and APIs: using the Copernicus Climate Data Store (CDS) and ISIMIP APIs to download data
- Xarray: working with multi-dimensional scientific datasets
- Data Visualization: creating plots to verify data integrity
1. Downloading the data from the resources¶
Let's start by importing the necessary libraries.
# if running on google colab
# flake8-noqa-cell
if "google.colab" in str(get_ipython()):
    # install packages
    %pip install git+https://github.com/ssciwr/heiplanet-data.git -qqq
# Import required libraries
from heiplanet_data import inout # Our custom module for data I/O operations
from pathlib import Path # For cross-platform file path handling
from matplotlib import pyplot as plt # For creating plots and visualizations
import xarray as xr # For working with labeled multi-dimensional arrays
from isimip_client.client import ISIMIPClient # For downloading ISIMIP data
We now need to set up a folder structure in which to save the downloaded data.
# Set up data directories
# We use pathlib.Path for cross-platform compatibility
data_root = Path("data/") # Navigate to the data directory from docs/source/notebooks/
data_folder = data_root / "in" # Raw input data goes in the 'in' subfolder
print(f"Data root directory: {data_root.absolute()}")
print(f"Input data directory: {data_folder.absolute()}")
print(f"Directory exists: {data_folder.exists()}")
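The cell above only reports whether the folders exist. If they do not exist yet, you can create the whole tree with `Path.mkdir` — a minimal sketch, assuming the same `data/in` layout as above:

```python
from pathlib import Path

data_folder = Path("data/") / "in"
# create the folder tree if it is missing; no error if it already exists
data_folder.mkdir(parents=True, exist_ok=True)
print(f"Directory exists: {data_folder.exists()}")
```

`parents=True` creates intermediate directories (here `data/`) as needed, and `exist_ok=True` makes the call idempotent, so it is safe to re-run the cell.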
Download ERA5-Land data¶
ERA5-Land is a reanalysis dataset providing a comprehensive record of land variables from 1950 to present. It's produced by the European Centre for Medium-Range Weather Forecasts (ECMWF).
What is reanalysis data?¶
Reanalysis combines:
- Observations: From weather stations, satellites, radiosondes, etc.
- Numerical weather models: Physics-based atmospheric models
- Data assimilation: Mathematical techniques to optimally combine observations with model forecasts
This creates a spatially and temporally consistent dataset that's invaluable for climate research.
Key ERA5-Land variables we'll download¶
- 2m temperature (t2m): Air temperature at 2 meters above ground
- Total precipitation (tp): Accumulated precipitation
This tutorial uses the CDS ERA5-Land monthly dataset. Please set up the CDS API as outlined below and take note of the naming convention used for the downloaded files.
Set up CDS API¶
To use the CDS API for downloading data, you first need to create an account on CDS to obtain your personal access token.
Create a .cdsapirc file containing your personal access token by following these instructions.
General API requests¶
- Select the target dataset, e.g. ERA5-Land monthly averaged data from 1950 to present
- Go to the `Download` tab of the dataset and select the data variables, time range, geographical area, etc. that you want to download
- At the end of the page, click on `Show API request code` and take note of the following information:
  - `dataset`: name of the dataset
  - `request`: a dictionary summarizing your download request
- Replace the values of `dataset` and `request` in the cell below accordingly
# replace dataset and request with your own values
dataset = "reanalysis-era5-land-monthly-means"
request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": ["2016", "2017"],
    "month": [
        "01", "02", "03", "04", "05", "06",
        "07", "08", "09", "10", "11", "12",
    ],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
Understanding the request parameters¶
- `dataset`: the specific ERA5-Land dataset identifier from CDS
- `product_type`: type of product (monthly averaged reanalysis)
- `variable`: the climate variables we want (temperature and precipitation)
- `year` and `month`: time range for our data (2016-2017, all months)
- `time`: specific time of day (00:00 for monthly averages)
- `data_format`: output format (NetCDF is standard for scientific data)
- `download_format`: whether to compress files (unarchived = no compression)
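If you only need a spatial subset, the request can additionally carry an `area` key; the CDS convention is [North, West, South, East] in degrees. The request below is a hypothetical, reduced variant for illustration (one variable, one month, roughly Europe):

```python
# minimal request sketch with a bounding box (hypothetical example values)
request_europe = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature"],
    "year": ["2016"],
    "month": ["01"],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
    "area": [70, -10, 35, 40],  # North, West, South, East
}
```

Restricting the area keeps files small, and the naming convention described later records the subsetting via the `area` part of the filename.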
# Generate a descriptive filename for our downloaded data
data_format = request.get("data_format")
# The inout.get_filename() function creates standardized filenames
# that include key metadata about the dataset
era5_fname = inout.get_filename(
    ds_name=dataset,  # Dataset identifier
    data_format=data_format,  # File format (netcdf)
    years=request["year"],  # Years included (2016-2017)
    months=request["month"],  # Months included (all 12)
    has_area=bool("area" in request),  # Whether spatial subsetting was used
    base_name="era5_data",  # Base prefix for the filename
    variables=request["variable"],  # Variables included (t2m, tp)
)
# Create the full file path
era5_fpath = data_folder / era5_fname
print(f"Generated filename: {era5_fname}")
print(f"Full file path: {era5_fpath}")
# Download the ERA5 data (this may take several minutes)
# We first check if the file already exists to avoid unnecessary downloads
if not era5_fpath.exists():
    print("Downloading ERA5 data from Copernicus Climate Data Store...")
    print("This may take several minutes depending on file size and server load...")
    # The inout.download_data() function handles the CDS API authentication and download
    inout.download_data(era5_fpath, dataset, request)
    print(f"Download complete! File saved to: {era5_fpath}")
else:
    print(f"ERA5 data already exists at {era5_fpath}")
    print("Skipping download to save time.")
Special download for total precipitation data from ERA5-Land Hourly dataset¶
Some models may require total precipitation data downloaded from the dataset ERA5-Land hourly data from 1950 to present.
Due to the nature of this dataset, the value at 00:00 is the total precipitation of the previous day (see here).
To get the correct precipitation values from 01.01.2016 to 31.12.2017, we need to download the data from 02.01.2016 to 01.01.2018. The current CDS request API does not support specifying a continuous date range that does not share the same days for each month and the same months for each year.
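The one-day shift can be illustrated with a tiny synthetic dataset (hypothetical values, not real ERA5 data): timestamps are downloaded from 2016-01-02 onward, then the time coordinate is shifted back one day so each value is labeled with the day the precipitation actually fell on.

```python
import pandas as pd
import xarray as xr

# accumulated precipitation reported at 00:00 of the *next* day
times = pd.date_range("2016-01-02", periods=3, freq="D")
ds = xr.Dataset(
    {"tp": ("valid_time", [0.1, 0.2, 0.3])},
    coords={"valid_time": times},
)
# shift the time coordinate back by one day
ds = ds.assign_coords(valid_time=ds.valid_time - pd.Timedelta(days=1))
print(ds.valid_time.values[0])  # the first label is now 2016-01-01
```

This is the same adjustment the helper function below applies after downloading the shifted date range.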
We implemented a special function for this case.
def download_total_precipitation_from_hourly_era5_land(
    start_date: str,
    end_date: str,
    area: List[float] | None = None,
    out_dir: Path = Path("."),
    base_name: str = "era5_data",
    data_format: str = "netcdf",
    ds_name: str = "reanalysis-era5-land",
    coord_name: str = "valid_time",
    var_name: str = "total_precipitation",
    clean_tmp_files: bool = False,
) -> str: ...
Input for this function includes:
- `start_date` and `end_date` in the format "YYYY-MM-DD"
- `area` indicates the area to download; `None` means the entire globe
- `out_dir`: output directory to store the downloaded file
- `base_name`: base string used to name the output file. The file name is described in Naming convention - Special case
- `data_format`: can be `netcdf` or `grib`
- `ds_name`, `coord_name`, and `var_name` represent the dataset name, coordinate name, and data variable name in the dataset. Please only change these values when CDS changes the corresponding names.
- `clean_tmp_files` can be set to `False` to retain the downloaded temporary files, which store data for smaller sub-ranges derived from the overall date range. For example, the range `2016-01-01` to `2017-12-31` would be split into sub-ranges `2016-01-02` to `2016-12-31`, `2017-01-01` to `2017-12-31`, and `2018-01-01` to `2018-01-01`, because the timestamps are shifted one day forward.
The function handles time shifting, downloads the data, adjusts the time coordinate back to the target range, and returns the output file path.
# download total precipitation data from ERA5-Land Hourly dataset
# from 2016-01-01 to 2017-12-31
start_time = "2016-01-01"
end_time = "2017-12-31"
tp_era5_hourly_file = inout.download_total_precipitation_from_hourly_era5_land(
    start_date=start_time,
    end_date=end_time,
    area=None,
    out_dir=data_folder,
    base_name="era5_data",
    data_format="netcdf",
    ds_name="reanalysis-era5-land",
    coord_name="valid_time",
    var_name="total_precipitation",
    clean_tmp_files=False,  # keep temporary files for checking
)
tp_era5_hourly_file
tp_era5_hourly_ds = xr.open_dataset(tp_era5_hourly_file)
tp_era5_hourly_ds["valid_time"]
Naming convention¶
The filenames of the downloaded netCDF files follow this structure:
{base_name}_{year_str}_{month_str}_{day_str}_{time_str}_{var_str}_{ds_type}_{area_str}_raw.{ext}
- `base_name` is `"era5_data"`
- For lists of numbers, i.e. years/months/days/times, the rules below are applied:
  - If the values are continuous, the string representation is a concatenation of the `min` and `max` values, separated by `-`
  - Otherwise, the string is a join of all values, separated by `_`
  - However, if there are more than 5 values, we only keep the first 5 and replace the rest by `"_etc"`
  - If the values are empty (e.g. no days or times in the download request), their string representation and the corresponding separator (i.e. `"_"`) are omitted from the file name.
- `year_str` is the string representation of the list of years using the rules above.
- Similarly for `month_str`. However, if the download requests all 12 months, `month_str` would be `"allm"`.
- `day_str` and `time_str` follow the same pattern, assuming that a month has at most 31 days (`"alld"`) and a day has at most 24 hours (`"allt"`).
  - Special case: if data is downloaded at time `00:00` per day only, `time_str` would be `"midnight"` (e.g. precipitation data for the P-model)
- For `var_str`, each variable has an abbreviation derived from the first letter of each word in the variable name (e.g. `tp` for total precipitation).
  - All abbreviations are then concatenated by `_`
  - If this concatenated string is longer than 30 characters, we only keep the first 2 characters and replace the rest by `"_etc"`
- As for `ds_type`:
  - If the file was downloaded from a monthly dataset, `ds_type` is set to `"monthly"`. This means the data is recorded only on the first day of each month.
  - For other datasets, when data is downloaded only at midnight (`time_str` = `"midnight"`), `ds_type` is `"daily"`, meaning one data record per day of each month.
  - `ds_type` would be an empty string in other cases, i.e. multiple data records for each day of a month.
- For `area_str`, if the downloaded data covers only an area of the grid (instead of the whole map), `area_str` would be `"area"`.
- If the part before `"_raw"` is longer than 100 characters, only the first 100 characters are kept and the rest is replaced by `"_etc"`
- `"_raw"` is added at the end to indicate that the file is raw data
- The extension `ext` of the file can be `.nc` or `.grib`
- If any of these fields (from `year_str` to `area_str`) are missing from the download request, the corresponding string and the preceding `_` are removed from the file name.
Special case¶
For total precipitation data downloaded from the dataset ERA5-Land hourly data from 1950 to present, the file name is structured as:
{base_name}_{start_date}-{end_date}_{time_str}_{var_str}_{ds_type}_{area_str}_raw.{ext}
In this case, `time_str` is `"midnight"` and `ds_type` is `"daily"`.
Download ISIMIP data (population data)¶
ISIMIP (Inter-Sectoral Impact Model Intercomparison Project) provides a framework for comparing and improving impact models across different sectors.
About population data¶
- Source: Historical population data covering 1901 to 2021
- Resolution: 30 arc-minutes (approximately 50km at the equator)
- Format: NetCDF with population counts per grid cell
- Units: Number of people per grid cell
- Use case: Essential for calculating population exposure to climate hazards
Download ISIMIP data manually¶
To download population data manually, please perform the following steps:
- go to the ISIMIP website
- search for `population` in the search bar
- choose simulation round `ISIMIP3a`
- click `Input Data` -> `Direct human forcing` -> `Population data` -> `histsoc`
- choose `population_histsoc_30arcmin_annual`
- download the file `population_histsoc_30arcmin_annual_1901_2021.nc`
Download ISIMIP data using ISIMIP's API¶
# initialize ISIMIP client
client = ISIMIPClient()
# search for population data
response = client.datasets(
    path="ISIMIP3a/InputData/socioeconomic/pop/histsoc/population"
)  # this path mirrors the one on ISIMIP's website
# use `ds_info` as loop variable to avoid shadowing the `dataset` string above
for ds_info in response["results"]:
    print("Dataset found: {}".format(ds_info["path"]))
# download the population data file covering 1901-2021
for ds_info in response["results"]:
    for file in ds_info["files"]:
        if "1901_2021" in file["name"]:
            isimip_fpath = data_folder / file["name"]
            if isimip_fpath.exists():
                print(f"Population data file already exists: {file['name']}")
            else:
                print(f"Downloading population data file: {file['name']}")
                client.download(file["file_url"], path=data_folder)
                print("Download complete!")
            break  # exit the file loop after the first match
2. Inspect the data using xarray¶
Xarray is a powerful Python library for working with labeled multi-dimensional arrays. It's particularly well-suited for scientific datasets like climate data because it:
- Labels dimensions: Instead of working with raw indices, you can use meaningful names like 'time', 'latitude', 'longitude'
- Handles metadata: Stores attributes, coordinate information, and units alongside your data
- Integrates with pandas: Provides similar functionality for N-dimensional data as pandas does for 2D data
- Works with NetCDF: Native support for the NetCDF format commonly used in climate science
- Enables easy operations: Broadcasting, grouping, resampling, and mathematical operations across dimensions
Key Xarray concepts¶
- Dataset: A dictionary-like container of data variables with shared coordinates
- DataArray: A labeled N-dimensional array (similar to a pandas Series but for N dimensions)
- Coordinates: Arrays that provide labels for each dimension
- Attributes: Metadata stored as key-value pairs
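These concepts can be seen in a toy Dataset built from scratch. The values below are synthetic and only for illustration:

```python
import numpy as np
import xarray as xr

# a Dataset with one data variable, two labeled dimensions, and attributes
toy = xr.Dataset(
    data_vars={
        "t2m": (("time", "latitude"), np.array([[280.0, 281.0], [282.0, 283.0]])),
    },
    coords={
        "time": np.array(["2016-01-01", "2016-02-01"], dtype="datetime64[ns]"),
        "latitude": [52.5, 53.0],
    },
    attrs={"description": "toy example"},
)
# select a DataArray by coordinate label rather than integer index
print(toy.t2m.sel(latitude=52.5).values)  # [280. 282.]
```

The ERA5 and ISIMIP files we open next have exactly this structure, just with more variables, more coordinates, and real data.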
# load netCDF files
ds_era5 = xr.open_dataset(era5_fpath)
ds_isimip = xr.open_dataset(isimip_fpath)
Explore the xarray dataset structure¶
Let's examine our datasets to understand their structure. Xarray provides excellent methods for inspecting data.
# Examine the ERA5 dataset structure
print("=== ERA5 DATASET OVERVIEW ===")
print(ds_era5)
print("\n" + "=" * 50)
# Explore specific data variables in the ERA5 dataset
print("=== ERA5 DATA VARIABLES ===")
for var_name, var in ds_era5.data_vars.items():
    print(f"\n📊 Variable: {var_name}")
    print(f" Shape: {var.shape}")
    print(f" Dimensions: {var.dims}")
    print(f" Data type: {var.dtype}")
    if hasattr(var, "long_name"):
        print(f" Description: {var.long_name}")
    if hasattr(var, "units"):
        print(f" Units: {var.units}")
# Examine coordinates in the ERA5 dataset
print("=== ERA5 COORDINATES ===")
for coord_name, coord in ds_era5.coords.items():
    print(f"\n🌐 Coordinate: {coord_name}")
    print(f" Shape: {coord.shape}")
    print(f" Range: {coord.min().values} to {coord.max().values}")
    if coord_name == "valid_time":
        print(f" First date: {coord.values[0]}")
        print(f" Last date: {coord.values[-1]}")
        print(f" Total time steps: {len(coord)}")
    elif coord_name in ["latitude", "longitude"]:
        print(f" Resolution: ~{abs(coord[1].values - coord[0].values):.3f} degrees")
Understanding xarray data selection¶
Xarray provides powerful methods for selecting and indexing data. Let's explore some common operations:
# Demonstrate xarray data selection methods
print("=== XARRAY DATA SELECTION EXAMPLES ===")
# 1. Select data by coordinate values (not indices!)
print("\n1. Select temperature data for January 2016:")
jan_2016_temp = ds_era5.t2m.sel(valid_time="2016-01")
print(f" Shape: {jan_2016_temp.shape}")
print(f" Selected time: {jan_2016_temp.valid_time.values}")
# 2. Select a specific geographic location
print("\n2. Select data for a specific location (52.5°N, 13.4°E):")
location_data = ds_era5.sel(latitude=52.5, longitude=13.4, method="nearest")
print(f" Shape: {location_data.t2m.shape}")
print(
f" Actual coordinates: lat={location_data.latitude.values:.2f}, lon={location_data.longitude.values:.2f}"
)
# 3. Select a geographic region
print("\n3. Select data for an area:")
area_data = ds_era5.sel(latitude=slice(70, 35), longitude=slice(0, 40))
print(f" Shape: {area_data.t2m.shape}")
print(
f" Lat range: {area_data.latitude.min().values:.1f} to {area_data.latitude.max().values:.1f}"
)
print(
f" Lon range: {area_data.longitude.min().values:.1f} to {area_data.longitude.max().values:.1f}"
)
# Demonstrate common xarray operations
print("=== COMMON XARRAY OPERATIONS ===")
# 1. Statistical operations across dimensions
print("\n1. Calculate statistics across time dimension:")
temp_mean = ds_era5.t2m.mean(dim="valid_time")
temp_std = ds_era5.t2m.std(dim="valid_time")
print(f" Mean temperature shape: {temp_mean.shape} (time dimension removed)")
print(f" Global mean temperature: {temp_mean.mean().values:.2f} K")
print(f" Standard deviation shape: {temp_std.shape} (time dimension removed)")
print(f" Global standard deviation: {temp_std.mean().values:.2f} K")
# 2. GroupBy operations (seasonal means)
print("\n2. Calculate seasonal means using groupby:")
seasonal_temp = ds_era5.t2m.groupby("valid_time.season").mean()
print(f" Seasonal data shape: {seasonal_temp.shape}")
print(f" Seasons available: {seasonal_temp.season.values}")
# 3. Mathematical operations
print("\n3. Convert temperature from Kelvin to Celsius:")
temp_celsius = ds_era5.t2m - 273.15 # Simple arithmetic on the entire array
print(
f" Original range: {ds_era5.t2m.min().values:.1f} to {ds_era5.t2m.max().values:.1f} K"
)
print(
f" Converted range: {temp_celsius.min().values:.1f} to {temp_celsius.max().values:.1f} °C"
)
Plot the datasets¶
# plot the cartesian grid data of t2m and tp for 2016-2017, all months
ds_era5.t2m.plot.pcolormesh(
col="valid_time", col_wrap=4, cmap="coolwarm", robust=True, figsize=(15, 10)
)
plt.savefig("era5_2016_2017_plots.png", dpi=300)
plt.show()
# Plot precipitation data
ds_era5.tp.plot.pcolormesh(
col="valid_time",
col_wrap=4,
cmap="Blues", # Use a sequential colormap for precipitation
robust=True,
figsize=(15, 10),
add_colorbar=True,
)
plt.suptitle(
"ERA5 Total Precipitation (mm) - 2016-2017 Monthly Data", fontsize=16, y=1.02
)
plt.savefig("era5_2016_2017_plots_tp.png", dpi=300, bbox_inches="tight")
plt.show()
print("✓ Precipitation plot saved as 'era5_2016_2017_plots_tp.png'")
# Plot population data for specific years
print("Creating population plots...")
# Select population data for 2016 and 2017
pop_2016_2017 = ds_isimip.sel(time=slice("2016", "2017"))
# Create population plots
pop_2016_2017["total-population"].plot.pcolormesh(
col="time",
col_wrap=2,
cmap="viridis", # Use viridis colormap for population
robust=True,
figsize=(12, 5),
add_colorbar=True,
)
plt.suptitle("Total population", fontsize=16, y=1.02)
plt.savefig("population_2016_2017_plots.png", dpi=300, bbox_inches="tight")
plt.show()
print("✓ Population plot saved as 'population_2016_2017_plots.png'")