AQAI Database Schema¶
This document provides an overview of the PostgreSQL table schemas used in the AQAI system. Each schema is described in detail with example tables.
Cohorts Table¶
The cohorts table contains all raw data for the specified pollutant in the location and time range required.
Schema:
locationId (integer): Unique identifier for the location.
location (text): Name of the location.
parameter (text): Type of pollutant.
value (float): Measured value of the pollutant.
date (json): Date and time of the measurement in UTC and local time.
unit (text): Unit of measurement.
coordinates (json): Coordinates of the measurement location.
country (text): Country of the measurement location.
city (text): City of the measurement location.
isMobile (boolean): Indicates if the measurement was taken from a mobile source.
isAnalysis (boolean): Indicates if the measurement is an analysis.
entity (text): Entity responsible for the measurement.
sensorType (text): Type of sensor used.
train_validation_set (text): Indicates if the data is for training or validation.
cohort (text): Cohort identifier.
cohort_type (text): Type of cohort.
timestamp_utc (timestamp): UTC timestamp of the measurement.
timestamp_local (timestamp): Local timestamp of the measurement.
y (float): Latitude of the measurement location.
x (float): Longitude of the measurement location.
Example:
locationId |
location |
parameter |
value |
date |
unit |
coordinates |
country |
city |
isMobile |
isAnalysis |
entity |
sensorType |
train_validation_set |
cohort |
cohort_type |
timestamp_utc |
timestamp_local |
y |
x |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23 |
Amgalan |
pm25 |
181 |
{“utc”: “2019-03-13T22:30:00+00:00”, “local”: “2019-03-14T06:30:00+08:00”} |
µg/m³ |
{“latitude”: 47.913429, “longitude”: 106.99790699999998} |
MN |
NULL |
f |
NULL |
Governmental Organization |
reference grade |
0 |
0_2015-09-01T00:00:00Z_2023-11-30T00:00:00Z |
training |
2019-03-13T22:30:00.000000Z |
2019-03-14T06:30:00+0800 |
47.913429 |
106.997907 |
Features Table¶
The features table gathers all satellite variables spatially and temporally co-located with the pollutant observation.
Schema:
sensor_datetime (timestamp): Date and time of the sensor reading.
sensor_longitude (float): Longitude of the sensor reading.
sensor_latitude (float): Latitude of the sensor reading.
location_id (text): Identifier for the location.
Optical_Depth_047 (float): Optical depth at 0.47 microns.
B4 (float): Band 4 reflectance.
B3 (float): Band 3 reflectance.
B2 (float): Band 2 reflectance.
avg_rad (float): Average radiance.
temperature_2m_above_ground (float): Temperature 2 meters above ground.
relative_humidity_2m_above_ground (float): Relative humidity 2 meters above ground.
total_precipitation_surface (float): Total precipitation at surface level.
total_cloud_cover_entire_atmosphere (float): Total cloud cover for the entire atmosphere.
u_component_of_wind_10m_above_ground (float): U component of wind at 10 meters above ground.
v_component_of_wind_10m_above_ground (float): V component of wind at 10 meters above ground.
discrete_classification (integer): Discrete classification value.
Example:
sensor_datetime |
sensor_longitude |
sensor_latitude |
location_id |
Optical_Depth_047 |
B4 |
B3 |
B2 |
avg_rad |
temperature_2m_above_ground |
relative_humidity_2m_above_ground |
total_precipitation_surface |
total_cloud_cover_entire_atmosphere |
u_component_of_wind_10m_above_ground |
v_component_of_wind_10m_above_ground |
discrete_classification |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-03-11 20:30:00 |
106.997907 |
47.913429 |
Amgalan |
117.976936723832 |
NULL |
NULL |
NULL |
25.4099998474121 |
-0.12999572753904 |
33.2400001525879 |
0.0300000004470348 |
99.8 |
-0.547999973595142 |
-0.589999985694885 |
30 |
Model Metadata Table¶
The model_metadata table stores all outputs of the machine learning pipeline. It includes transactions and codes data from the processed schema joined on the hash of the diagnosis note. This serves as the complete training and validation set, which is then split into temporal splits for cross validation.
Schema:
model_id (text): Unique identifier for the model.
model_set (text): Set identifier for the model.
features (json): Features used in the model.
labels (json): Labels used in the model.
hyperparameters (text): Hyperparameters used for the model.
run_date (timestamp): Date and time when the model was run.
timestamp (timestamp): Timestamp of the record.
Example:
model_id |
model_set |
features |
labels |
hyperparameters |
run_date |
timestamp |
---|---|---|---|---|---|---|
randomforestregressor_n_estimators500_max_depth10_8 |
randomforestregressor_n_estimators500_max_depth10 |
{Optical_Depth_047, SR_B4, SR_B3, SR_B2, avg_rad, temperature_2m_above_ground, relative_humidity_2m_above_ground, precipitable_water_entire_atmosphere, total_cloud_cover_entire_atmosphere, u_component_of_wind_10m_above_ground, v_component_of_wind_10m_above_ground} |
{value} |
n_estimators500_max_depth10 |
2024-05-31 14:48:41.217907 |
2024-05-31 14:48:41.217907 |
Results Table¶
The results table is used for model comparison and selection using Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE).
Schema:
model_id (text): Unique identifier for the model.
run_date (timestamp): Date and time when the model was run.
cohort (text): Cohort identifier.
mse (numeric): Mean Squared Error of the model.
mape (numeric): Mean Absolute Percentage Error of the model.
Example:
model_id |
run_date |
cohort |
mse |
mape |
---|---|---|---|---|
randomforestregressor_n_estimators500_max_depth10_5 |
NULL |
5 |
2069.5940127034537 |
1846990659123641.2 |