GLOBE Curation AI

This project strengthens the scientific value of the NASA GLOBE water transparency dataset by applying AI-driven validation and satellite-based metadata enrichment. The dataset, built from more than three decades of crowdsourced measurements, remains underutilized because of inconsistencies caused by protocol deviations, human error, and limited quality control. The objective is to prepare this data for rigorous scientific and modeling applications.

The approach combines two complementary machine learning efforts. The first is a data cleaning and validation framework using Bayesian hierarchical models alongside Random Forest models to identify anomalous, censored, or inaccurate measurements. The Bayesian approach demonstrated strong performance, revealing geographic site and cluster effects as the dominant factors explaining variability across observations.
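To illustrate why site effects matter in a hierarchical model, the following is a minimal sketch of partial pooling under a simple normal-normal model with known variances. It is not the project's model: the function names, the variance values `sigma2` and `tau2`, and the residual threshold are all assumptions chosen for illustration, and a real Bayesian hierarchical model would estimate the variances and handle censoring explicitly.

```python
from math import sqrt
from statistics import mean


def partial_pool(site_obs, sigma2, tau2):
    """Posterior mean of each site's mean under a normal-normal model.

    site_obs: dict mapping a site id to a list of measurements
    sigma2:   assumed within-site (observation) variance
    tau2:     assumed between-site variance
    """
    all_values = [v for obs in site_obs.values() for v in obs]
    grand_mean = mean(all_values)  # stand-in for the population-level mean
    pooled = {}
    for site, obs in site_obs.items():
        n = len(obs)
        # The weight on the site's own mean grows with its number of observations.
        w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
        pooled[site] = w * mean(obs) + (1.0 - w) * grand_mean
    return pooled


def flag_outliers(site_obs, pooled, sigma2, z_threshold=3.0):
    """Return (site, value) pairs whose standardized residual exceeds the threshold."""
    flagged = []
    for site, obs in site_obs.items():
        for value in obs:
            z = abs(value - pooled[site]) / sqrt(sigma2)
            if z > z_threshold:
                flagged.append((site, value))
    return flagged


# Hypothetical transparency readings in metres, keyed by made-up site ids.
readings = {"A": [2.0, 2.0, 2.0, 2.0], "B": [6.0]}
site_means = partial_pool(readings, sigma2=1.0, tau2=1.0)
suspects = flag_outliers(readings, site_means, sigma2=1.0, z_threshold=1.5)
```

A site with many observations keeps an estimate close to its own mean, while a sparsely observed site is pulled toward the overall mean, so a single unusual reading at a new site is judged against what is typical elsewhere.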

The second effort focuses on metadata feature engineering to support and enhance the validation models. Independent contextual information is integrated from Sentinel-2 and MODIS Aqua satellite imagery, together with GEBCO bathymetry data. From these sources, key features are derived, including the Diffuse Attenuation Coefficient, water body classification, and distance from the water’s edge. These features are fed into an ML-based anomaly detection pipeline that assigns quality scores to individual observations.
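As a rough illustration of how a satellite-derived feature could inform a quality score, the sketch below compares a reported transparency depth with the depth implied by a diffuse attenuation coefficient, using the empirical Poole-Atkins relation (depth is approximately 1.7 divided by the attenuation coefficient). This is an assumption made for the example, not the project's pipeline: the constant varies by water type, a satellite coefficient at a single wavelength is not strictly comparable, and the function names, the tolerance, and the bathymetry check are hypothetical.

```python
import math

# Empirical constant from the Poole-Atkins relation; published values vary
# with water type, so treat this as an illustrative assumption.
POOLE_ATKINS_CONSTANT = 1.7


def expected_depth_m(kd_per_m):
    """Transparency depth (m) implied by a diffuse attenuation coefficient (1/m)."""
    return POOLE_ATKINS_CONSTANT / kd_per_m


def quality_score(reported_depth_m, kd_per_m, water_depth_m=None, log_tolerance=0.5):
    """Score in [0, 1]: 1 means the report matches the satellite-implied depth.

    water_depth_m is an optional bathymetry-derived depth (hypothetical input);
    a reported depth greater than the water depth is treated as impossible.
    """
    if reported_depth_m <= 0 or kd_per_m <= 0:
        return 0.0
    if water_depth_m is not None and reported_depth_m > water_depth_m:
        return 0.0
    # Compare on a log scale so that over- and under-estimates by the same
    # factor are penalized equally.
    log_ratio = abs(math.log(reported_depth_m / expected_depth_m(kd_per_m)))
    return math.exp(-((log_ratio / log_tolerance) ** 2))


# Hypothetical observations: (reported depth in m, Kd in 1/m, water depth in m).
consistent = quality_score(3.4, 0.5, water_depth_m=20.0)
doubled = quality_score(6.8, 0.5, water_depth_m=20.0)
too_deep = quality_score(10.0, 0.5, water_depth_m=5.0)
```

In a learned pipeline the same inputs would more likely be passed to the anomaly detector as features than combined by a fixed formula; the fixed formula is used here only to keep the example short.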

The outcome is a continuously validated, curated dataset that enables reliable scientific analysis and provides high-quality input for hydrological and environmental models.

  • Organization
    Harvard University and NASA
  • Profession
    AI system improving crowdsourced water transparency data quality