Preparing data files
Preparing data files according to the data flowchart
In [ ]:
from onehealth_db import inout
from onehealth_db import preprocess, utils
from pathlib import Path
import time
In [ ]:
# change to your own data folder, if needed
data_folder = Path("../../../data/in/")
Download ERA5-Land data
To download ERA5-Land data using the CDS API:
- Select the target dataset, e.g. ERA5-Land monthly averaged data from 1950 to present
- Go to the Download tab of the dataset and select the data variables, time range, geographical area, etc. that you want to download
- At the end of the page, click on Show API request code and take note of the following information:
  - dataset: the name of the dataset
  - request: a dictionary that summarizes your download request
- Replace the values of dataset and request in the cell below accordingly
In [ ]:
# replace dataset and request with your own values
dataset = "reanalysis-era5-land-monthly-means"
request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": ["2020", "2021", "2022", "2023", "2024", "2025"],
    "month": [
        "01",
        "02",
        "03",
        "04",
        "05",
        "06",
        "07",
        "08",
        "09",
        "10",
        "11",
        "12",
    ],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
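The year and month lists in the request can also be generated rather than typed out. The following is a minimal sketch, assuming the same dataset and request keys as in the cell above; `start_year` and `end_year` are illustrative names that are not part of the CDS request itself.

```python
# Build the "year" and "month" lists programmatically.
# start_year and end_year are illustrative helper names.
start_year, end_year = 2020, 2025

# CDS requests expect years and zero-padded months as strings
years = [str(year) for year in range(start_year, end_year + 1)]
months = [f"{month:02d}" for month in range(1, 13)]

request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": years,
    "month": months,
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
```

This keeps the request identical to the hand-written one while making the time range easier to change.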
In [ ]:
data_format = request.get("data_format")
# file name of the downloaded data
file_name = inout.get_filename(
    ds_name=dataset,
    data_format=data_format,
    years=request["year"],
    months=request["month"],
    has_area="area" in request,
    base_name="era5_data",
    variable=request["variable"],
)
output_file = data_folder / file_name
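The `has_area` argument above depends on whether the request contains an "area" key. The sketch below shows how a regional request might differ from a global one. It assumes the CDS bounding-box order [north, west, south, east] in degrees; the coordinates are illustrative, and the request is a shortened stand-in, so please compare against the code produced by Show API request code for your own selection.

```python
# Shortened stand-in for the request defined earlier (illustrative values).
request = {
    "variable": ["2m_temperature"],
    "year": ["2024"],
    "month": ["01"],
}

# Copy the request so the global one stays untouched, then add a bounding box.
# Assumed order: [north, west, south, east]; the numbers are only an example.
regional_request = {**request, "area": [55.0, 5.0, 47.0, 16.0]}

# Same membership check as the one passed to has_area above
has_area_global = "area" in request
has_area_regional = "area" in regional_request
print(has_area_global, has_area_regional)
```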
In [ ]:
# download data
if not output_file.exists():
    print("Downloading data...")
    inout.download_data(output_file, dataset, request)
else:
    print(f"Data already exists at {output_file}")
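The download-only-if-missing pattern above can be wrapped in a small reusable function. This is a sketch, not part of onehealth_db: `download_if_missing` and its `fetch` argument are hypothetical names, and the function only assumes that `fetch` writes the file to the given path.

```python
from pathlib import Path
from typing import Callable


def download_if_missing(output_file: Path, fetch: Callable[[Path], None]) -> bool:
    """Call fetch(output_file) only when the file is absent.

    Returns True if fetch was called, False if the file already existed.
    """
    if output_file.exists():
        return False
    # make sure the target folder exists before writing into it
    output_file.parent.mkdir(parents=True, exist_ok=True)
    fetch(output_file)
    return True
```

With the names used in this notebook, a call could look like `download_if_missing(output_file, lambda target: inout.download_data(target, dataset, request))`.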
Load settings
First, load the default settings, which configure the preprocessing steps.
In [ ]:
settings = utils.get_settings(
    setting_path="default",
    new_settings={},
    updated_setting_dir=None,
    save_updated_settings=False,
)
TBU: more details about the default settings will be provided...
Preprocess data
Preprocess ERA5-Land data
In [ ]:
# disable truncation of dates
settings["truncate_date"] = False
print("Preprocessing ERA5-Land data...")
t0 = time.time()
preprocessed_dataset = preprocess.preprocess_data_file(
    netcdf_file=output_file,
    settings=settings,
)
t_preprocess = time.time()
print("Preprocessing completed in {:.2f} seconds.".format(t_preprocess - t0))
The preprocessed dataset is also saved as a .nc file in the same folder, under the name era5_data_2020_2025_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.
Details on the file naming convention can be found in Data Lake & Database.
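The suffixes adjlon, celsius and mm in the file name correspond to the settings adjust_longitude, convert_kelvin_to_celsius and convert_m_to_mm_precipitation. The sketch below illustrates the arithmetic these steps are assumed to perform on single values. The function names are hypothetical, and the actual implementation in onehealth_db may differ (for example, it likely operates on whole arrays).

```python
KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in Kelvin


def kelvin_to_celsius(t_kelvin: float) -> float:
    """Convert a temperature from Kelvin to degrees Celsius."""
    return t_kelvin - KELVIN_OFFSET


def metres_to_millimetres(depth_m: float) -> float:
    """Convert a precipitation depth from metres to millimetres."""
    return depth_m * 1000.0


def wrap_longitude(lon_deg: float) -> float:
    """Map a longitude from the 0..360 range to the -180..180 range."""
    return (lon_deg + 180.0) % 360.0 - 180.0


print(kelvin_to_celsius(293.15))
print(metres_to_millimetres(0.002))
print(wrap_longitude(190.0))
```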
Preprocess population data
Instructions for downloading population data (i.e. ISIMIP data) are presented in Data Lake & Database
In [ ]:
popu_file = data_folder / "population_histsoc_30arcmin_annual_1901_2021.nc"
In [ ]:
settings["truncate_date"] = True
# disable unnecessary preprocessing steps
settings["adjust_longitude"] = False
settings["convert_kelvin_to_celsius"] = False
settings["convert_m_to_mm_precipitation"] = False
settings["resample_grid"] = False
print("Preprocessing population data...")
t0 = time.time()
preprocessed_popu = preprocess.preprocess_data_file(
    netcdf_file=popu_file,
    settings=settings,
)
t_popu = time.time()
print("Preprocessing population data completed in {:.2f} seconds.".format(t_popu - t0))
The preprocessed dataset is also saved as a .nc file in the same folder.
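Because the population cell edits `settings` in place, the ERA5-Land settings are overwritten for any later run in the same session. One way to avoid this is to work on a copy. The sketch below assumes the settings object behaves like a plain dictionary (as the item assignments above suggest); the stand-in values and the name `popu_settings` are illustrative.

```python
import copy

# Stand-in for the dictionary returned by utils.get_settings.
# Keys are taken from the cells above; the values are illustrative.
settings = {
    "truncate_date": False,
    "adjust_longitude": True,
    "convert_kelvin_to_celsius": True,
    "convert_m_to_mm_precipitation": True,
    "resample_grid": True,
}

# Overrides that only apply to the population data
popu_overrides = {
    "truncate_date": True,
    "adjust_longitude": False,
    "convert_kelvin_to_celsius": False,
    "convert_m_to_mm_precipitation": False,
    "resample_grid": False,
}

# Deep copy so nested values are not shared with the original settings
popu_settings = copy.deepcopy(settings)
popu_settings.update(popu_overrides)

print(settings["adjust_longitude"], popu_settings["adjust_longitude"])
```

`popu_settings` could then be passed as the `settings` argument when preprocessing the population file, leaving the original dictionary ready for further ERA5-Land runs.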