Skip to content

Data Lake & Database Backend⚓︎

Data Lake Architecture⚓︎

The Heiplanet platform implements a medallion architecture data lake with three distinct layers for data processing and refinement:

Bronze Layer (Raw Data)⚓︎

The bronze layer stores raw, unprocessed data exactly as downloaded from external sources. This includes: - Original data files in their native formats (NetCDF, Shapefiles, etc.) - Complete metadata and provenance information - Data integrity verification through file hashing

Files are stored in .data_heiplanet_db/bronze/ directory.

Silver Layer (Cleaned Data)⚓︎

The silver layer contains validated and cleaned data that has undergone initial processing: - Data format standardization - Quality checks and validation - Coordinate system transformations where needed - Initial data filtering and cleaning

Files are stored in .data_heiplanet_db/silver/ directory.

Gold Layer (Analysis-Ready Data)⚓︎

The gold layer provides fully processed, analysis-ready datasets optimized for ingestion into the PostgreSQL/PostGIS database: - Aggregated and summarized data products - NUTS region-level statistics - Optimized data formats for database insertion - Final quality assurance and metadata enrichment

Files are stored in .data_heiplanet_db/gold/ directory.

This multi-tiered approach ensures data lineage, reproducibility, and the ability to reprocess data from any stage if requirements change.

Data Ingestion and Processing⚓︎

Data flows through the pipeline in the following stages:

  1. Download: Raw data files are fetched from configured sources (URLs or local files)
  2. Validation: File integrity is verified using SHA-256 checksums
  3. Bronze Storage: Original files are preserved unchanged
  4. Processing: Data is cleaned, transformed, and validated
  5. Silver Storage: Cleaned data is stored in standardized formats
  6. Aggregation: Data is aggregated to analysis-ready products
  7. Gold Storage: Final datasets ready for database insertion
  8. Database Ingestion: Data is loaded into PostgreSQL/PostGIS tables

Configuration Files⚓︎

The data pipeline is controlled by YAML configuration files that specify which datasets to download and process. Multiple pre-configured options are available based on deployment size and data requirements:

  • Small: Minimal dataset for testing (single month)
  • Medium: Seasonal dataset for production (3 months)
  • Large: Complete annual dataset for comprehensive analysis (12 months)
  • Historical: Extended multi-decade dataset for trend analysis (45 years)

For detailed information on these configurations, see the Deployment Configuration Options.

Data Model⚓︎

The database schema is designed to efficiently store and query spatiotemporal disease surveillance data. Downloaded and processed data is organized in the database as follows:

onehealth_data_model.jpg

Key Database Tables⚓︎

  • NUTS Definitions: European geographic boundaries at multiple administrative levels
  • Climate Variables: Temperature and environmental data gridded globally
  • R0 Values: Disease transmission suitability metrics
  • Time-series data at multiple temporal resolutions
  • Spatial data at both grid and regional aggregations
  • Metadata: Data provenance, timestamps, and quality indicators

Spatial Data Features⚓︎

The PostGIS extension enables advanced geospatial queries: - Spatial Indexing: Fast geographic searches and intersections - Coordinate Transformations: Support for multiple coordinate reference systems - Geometry Operations: Area calculations, buffer zones, spatial joins - Raster Support: Efficient storage and querying of gridded data

Running the PostgreSQL Database with Docker⚓︎

Quick Start⚓︎

Start the database service using Docker Compose:

docker compose up -d db

The -d flag runs the service in detached mode (background).

Accessing the Database⚓︎

Connect to the running database:

# Using psql command-line client
docker exec -it <db-container-name> psql -U <username> -d <database-name>

# Or expose the port and connect externally (requires port mapping in docker-compose.yaml)
psql -h localhost -p 5432 -U <username> -d <database-name>

Database Management⚓︎

# View database logs
docker logs <db-container-name>

# Stop the database
docker compose stop db

# Restart the database
docker compose restart db

# Remove database and volumes (deletes all data)
docker compose down -v

For complete deployment instructions including data loading and API setup, see the Deployment Guide.