SDG corpora
active
Reproducible ontology-grounded corpora for data-governance models.
SDG corpora is the independently versioned home for the Signals Data Governance datasets used across the weathership stack. Each commit is a reproducible convergence snapshot of the whole derivation chain: the 540-template ontology catalog produces a SKOS vocabulary of 548 concepts, which in turn produces a deterministic relational footprint (DDL plus cross-family foreign keys), which is then populated by generated textbook documents and the corresponding relational rows.
Consumed via git submodule by three downstream projects:
- Aegir — training and evaluation of the hierarchical sequence model.
- Atelier — independent classification against the shared SKOS vocabulary.
- signals — governance enforcement against the populated relational footprint.
Tagged releases package the SKOS vocabulary, the ontology, and the populated DDL tables, but omit the per-column reference codes that identify which template each column was generated from. Atelier pins a release and classifies blind, from values and vocabulary only; the held-back reference codes serve as the scoring key. The result is an independent, pre-training measure of classification efficacy on the corpus, and a clean baseline against which to measure Aegir’s downstream lift.
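As an illustration of how blind predictions might be scored against the held-back reference codes, here is a minimal sketch. It is not Atelier's actual scoring code. It assumes the reference codes can be resolved to one expected concept per column, and every name in it (`score_blind_classification`, the column names, the concept identifiers) is hypothetical.

```python
from typing import Dict


def score_blind_classification(
    predictions: Dict[str, str],
    reference: Dict[str, str],
) -> float:
    """Return the fraction of reference columns whose predicted concept matches the key."""
    if not reference:
        raise ValueError("reference key is empty")
    # A column that is missing from the predictions counts as a miss.
    correct = sum(
        1
        for column, concept in reference.items()
        if predictions.get(column) == concept
    )
    return correct / len(reference)


# Hypothetical scoring key: column name -> expected concept (assumed format).
reference = {
    "customers.email": "concept:EmailAddress",
    "customers.dob": "concept:DateOfBirth",
    "orders.total": "concept:MonetaryAmount",
    "orders.placed_at": "concept:Timestamp",
}

# Hypothetical blind predictions: one wrong label, one column left unclassified.
predictions = {
    "customers.email": "concept:EmailAddress",
    "customers.dob": "concept:Timestamp",
    "orders.total": "concept:MonetaryAmount",
}

accuracy = score_blind_classification(predictions, reference)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.50
```

Plain accuracy is used here only for simplicity. Because the vocabulary is a SKOS hierarchy, a real scorer might instead give partial credit for predictions that land on a broader or narrower concept than the expected one.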
Released under the Apache 2.0 license.