Aegir
Hierarchical sequence modeling with dynamic chunking.
Aegir is a hierarchical sequence model for semantic column annotation and cross-table data-element discovery. It addresses an enterprise data governance problem — warehouse columns whose meanings have drifted past recognition — by reading each column as a sequence and predicting what the column represents.
The model performs three tasks:
- Column type annotation. Each column is classified into a semantic category drawn from the governance taxonomy.
- Column property annotation. For each pair of columns, the model predicts whether a relationship exists and, if so, what kind.
- Data element discovery. Columns across different tables that refer to the same real-world entity are grouped together. These groups are the structural artifact that lets governance, privacy, and integration workflows treat the warehouse as a coherent surface.
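To make the third task concrete, here is a minimal sketch of how pairwise "same entity" decisions could be turned into data-element groups. It is an illustration only, not Aegir's method: the `ColumnRef` type, the `group_data_elements` function, and the use of union-find are all assumptions, and the table and column names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRef:
    """Hypothetical identifier for one warehouse column."""
    table: str
    column: str


def group_data_elements(columns, same_entity_pairs):
    """Group columns into data elements with union-find.

    columns: list of ColumnRef
    same_entity_pairs: iterable of (ColumnRef, ColumnRef) pairs that some
        upstream step (assumed here) judged to refer to the same entity.
    Returns a list of groups, each a list of ColumnRef.
    """
    parent = {c: c for c in columns}

    def find(c):
        # Walk to the root, shortening the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in same_entity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for c in columns:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())


# Illustrative input: two columns matched across tables, one unmatched.
cols = [
    ColumnRef("orders", "cust_id"),
    ColumnRef("crm_accounts", "customer_number"),
    ColumnRef("orders", "ship_date"),
]
groups = group_data_elements(cols, [(cols[0], cols[1])])
```

Because grouping is transitive, a column matched to any member of a group joins that group. A real system would likely want a confidence threshold before accepting a pair, since a single false match merges two data elements.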
The architecture bridges pattern-based detection methods (regex, constraint, fingerprint) with learned sequence models. Aegir does not replace evidence-fusion pipelines that already exist in enterprise governance stacks; it integrates with them, providing a learned signal alongside the rule-based ones.
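As a rough picture of what "a learned signal alongside the rule-based ones" could look like, the sketch below fuses per-label scores from several detectors with a weighted average. It assumes each detector emits scores in [0, 1]. The detector names, the weights, and the `fuse_evidence` function are hypothetical and do not describe Aegir's actual interface or any particular governance stack.

```python
def fuse_evidence(signals, weights):
    """Weighted average of per-label scores from several detectors.

    signals: {detector_name: {label: score in [0, 1]}}
    weights: {detector_name: non-negative weight}
    A label that a detector does not mention counts as score 0 for it.
    """
    weight_sum = sum(weights[name] for name in signals)
    if weight_sum <= 0:
        raise ValueError("at least one detector needs a positive weight")

    totals = {}
    for name, scores in signals.items():
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + weights[name] * score
    return {label: total / weight_sum for label, total in totals.items()}


# Illustrative scores for one column; all values are made up.
signals = {
    "regex": {"email": 1.0},
    "fingerprint": {"email": 0.6, "username": 0.4},
    "learned_model": {"email": 0.8, "username": 0.2},
}
weights = {"regex": 1.0, "fingerprint": 1.0, "learned_model": 2.0}

fused = fuse_evidence(signals, weights)
best_label = max(fused, key=fused.get)
```

In this arrangement the learned model is one more voter rather than a replacement: removing it from `signals` leaves the rule-based result intact.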
The source repository is private; published documentation, technical design notes, and benchmarks are available at the docs link below.