Preparing data files

Preparing data files according to the data flowchart

In [ ]:

from onehealth_db import inout
from onehealth_db import preprocess, utils
from pathlib import Path
import time

In [ ]:

# change to your own data folder, if needed
data_folder = Path("../../../data/in/")
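If the folder may not exist yet on your machine, it can be created before anything is written to it. The following is a minimal sketch using only the standard library; `ensure_folder` is a hypothetical helper introduced here for illustration and is not part of onehealth_db.

```python
from pathlib import Path


def ensure_folder(folder: Path) -> Path:
    """Create the folder (and any missing parents) and return its absolute path."""
    # exist_ok=True makes the call safe to repeat when the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    return folder.resolve()
```

It could be used as `data_folder = ensure_folder(Path("../../../data/in/"))`, which also makes the resolved location visible when the relative path is ambiguous.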

Download ERA5-Land data

To download ERA5-Land data using the Climate Data Store (CDS) API:

1. Select the target dataset, e.g. ERA5-Land monthly averaged data from 1950 to present.
2. Go to the Download tab of the dataset and select the data variables, time range, geographical area, etc. that you want to download.
3. At the end of the page, click Show API request code and note the following:
   - dataset: the name of the dataset
   - request: a dictionary that summarizes your download request
4. Replace the values of dataset and request in the cell below accordingly.

In [ ]:

# replace dataset and request with your own values
dataset = "reanalysis-era5-land-monthly-means"
request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": ["2020", "2021", "2022", "2023", "2024", "2025"],
    "month": [
        "01",
        "02",
        "03",
        "04",
        "05",
        "06",
        "07",
        "08",
        "09",
        "10",
        "11",
        "12",
    ],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
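For longer time ranges, typing out every year and month by hand becomes error-prone. As a sketch, the same dictionary can be generated programmatically. `build_request` is a hypothetical helper, not part of onehealth_db or the CDS API, and the keys are simply copied from the request above; other datasets may expect different keys, so the code shown by Show API request code remains the reference.

```python
def build_request(first_year, last_year, variables):
    """Build a request dict shaped like the one above, covering all twelve months."""
    return {
        "product_type": ["monthly_averaged_reanalysis"],
        "variable": list(variables),
        # years and months are zero-padded strings, as in the hand-written request
        "year": [str(year) for year in range(first_year, last_year + 1)],
        "month": [f"{month:02d}" for month in range(1, 13)],
        "time": ["00:00"],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }


request = build_request(2020, 2025, ["2m_temperature", "total_precipitation"])
```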

In [ ]:

data_format = request.get("data_format")

# file name of the downloaded data
file_name = inout.get_filename(
    ds_name=dataset,
    data_format=data_format,
    years=request["year"],
    months=request["month"],
    has_area="area" in request,
    base_name="era5_data",
    variable=request["variable"],
)
output_file = data_folder / file_name

In [ ]:

# download data
if not output_file.exists():
    print("Downloading data...")
    inout.download_data(output_file, dataset, request)
else:
    print("Data already exists at {}".format(output_file))
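The check-then-download pattern above can be wrapped in a small function when several files need the same treatment. This is only a sketch: `fetch_if_missing` is a hypothetical helper, and the download step is passed in as a callable so that the sketch makes no assumption about how `inout.download_data` works internally.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(target: Path, download: Callable[[Path], None]) -> bool:
    """Call download(target) only when target does not exist.

    Returns True if the download ran, False if the file was already present.
    """
    if target.exists():
        return False
    # make sure the destination folder exists before the download writes to it
    target.parent.mkdir(parents=True, exist_ok=True)
    download(target)
    return True
```

With the names used in this notebook, a call could look like `fetch_if_missing(output_file, lambda p: inout.download_data(p, dataset, request))`.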

Load settings

First, load the default settings, which configure the preprocessing steps.

In [ ]:

settings = utils.get_settings(
    setting_path="default",
    new_settings={},
    updated_setting_dir=None,
    save_updated_settings=False,
)

TBU: more details about the default settings will be provided...

Preprocess data

Preprocess ERA5-Land data

In [ ]:

# disable truncation of dates
settings["truncate_date"] = False

print("Preprocessing ERA5-Land data...")
t0 = time.time()
preprocessed_dataset = preprocess.preprocess_data_file(
    netcdf_file=output_file,
    settings=settings,
)
t_preprocess = time.time()
print("Preprocessing completed in {:.2f} seconds.".format(t_preprocess - t0))
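The timing code above is repeated for each preprocessing step in this notebook. One possible way to avoid the repetition is a small context manager; this is a sketch using only the standard library, and `timed` is a hypothetical name. It uses `time.perf_counter()`, which is a monotonic clock intended for measuring intervals, instead of `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # runs even if the enclosed block raises, so the duration is always reported
        elapsed = time.perf_counter() - start
        print(f"{label} completed in {elapsed:.2f} seconds.")
```

The preprocessing call would then sit inside `with timed("Preprocessing"):`, with the call itself unchanged.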

The preprocessed dataset is also saved as a .nc file in the same folder, under the name era5_data_2020_2025_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim

The file naming convention is described in Data Lake & Database.
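The tags adjlon, celsius and mm in the file name, together with setting keys used in this notebook such as adjust_longitude, convert_kelvin_to_celsius and convert_m_to_mm_precipitation, point to a few simple conversions. The sketch below shows the arithmetic on plain numbers. It is not the library's implementation, the function names are hypothetical, and the longitude direction (from the 0..360 convention to -180..180) is an assumption.

```python
KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin


def kelvin_to_celsius(temp_kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return temp_kelvin - KELVIN_OFFSET


def metres_to_millimetres(depth_m):
    """Convert a precipitation depth from metres to millimetres."""
    return depth_m * 1000.0


def wrap_longitude(lon_deg):
    """Map a longitude from the 0..360 convention to -180..180 (assumed direction)."""
    return ((lon_deg + 180.0) % 360.0) - 180.0
```

For example, `wrap_longitude(270.0)` gives `-90.0`, while longitudes below 180 are left unchanged.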

Preprocess population data

Instructions for downloading the population data (i.e. ISIMIP data) are given in Data Lake & Database.

In [ ]:

popu_file = data_folder / "population_histsoc_30arcmin_annual_1901_2021.nc"

In [ ]:

settings["truncate_date"] = True
# disable unnecessary preprocessing steps
settings["adjust_longitude"] = False
settings["convert_kelvin_to_celsius"] = False
settings["convert_m_to_mm_precipitation"] = False
settings["resample_grid"] = False

print("Preprocessing population data...")
t0 = time.time()
preprocessed_popu = preprocess.preprocess_data_file(
    netcdf_file=popu_file,
    settings=settings,
)
t_popu = time.time()
print("Preprocessing population data completed in {:.2f} seconds.".format(t_popu - t0))

The preprocessed dataset is also saved in a .nc file under the same folder.
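Because the same settings object is modified in place for ERA5-Land and then again for the population data, the result depends on the order in which the cells are run. One way to keep the two configurations separate is to derive a copy per dataset. This is a sketch that assumes settings behaves like a plain dictionary; `with_overrides` is a hypothetical helper, and the values in `base` are illustrative stand-ins, not the library's actual defaults.

```python
import copy


def with_overrides(base_settings, **overrides):
    """Return a deep copy of base_settings with the given keys replaced."""
    merged = copy.deepcopy(base_settings)
    merged.update(overrides)
    return merged


# illustrative stand-in for the loaded settings (real defaults may differ)
base = {"truncate_date": False, "adjust_longitude": True, "resample_grid": True}

popu_settings = with_overrides(
    base,
    truncate_date=True,
    adjust_longitude=False,
    resample_grid=False,
)
```

The original dictionary stays untouched, so the ERA5-Land and population steps can be rerun in any order.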