ModernBERT-NHSVM: calibrated multiclass SVM over ModernBERT embeddings

Hierarchy-aware multiclass SVM with a Crammer–Singer joint margin over ModernBERT mean-pool embeddings, post-hoc temperature-calibrated.


Motivation

The Atelier classification pipeline fuses up to six evidence channels under Dempster–Shafer theory. Channel independence is load-bearing for the fusion (see docs/src/architecture/dst-evidence-independence.md): two channels that share an encoder also share their failure modes, which violates the source-independence assumption Dempster’s rule requires. The SVM channel exists to supply margin-driven, hierarchy-aware evidence architecturally distinct from the retrieval (ColBERT MaxSim) and gradient-boosted (CatBoost) channels.

A native hierarchical formulation matters because the target taxonomies are deep (the v0.5.1 UAT line spans 287 nodes) and non-leaf nodes are first-class prediction targets — a flat one-vs-rest classifier discards the structural prior the taxonomy carries.

Formulation

The natural Kronecker expansion phi(x, y) = sqrt(alpha_y) * (x ⊗ e_y) from Choi et al. 2015 (Eq. 5) works for sparse text features — a few hundred nonzero dimensions per row — but catastrophically degrades on dense pretrained embeddings where every dimension is nonzero with similar magnitude. On the 1149-row reference, the same expansion reaches 98.93% top-1 fit-on-train with TF-IDF features versus 4.26% with ModernBERT mean-pool features — a structural mismatch, not a tuning gap (src/atelier/classify/factorized_nhsvm.py:5-9).

The factorized form learns one weight vector W_n in R^d per hierarchy node and computes the path score directly,

gamma(x, y) = sum_{n in A_y} alpha_n * (W_n^T x),

where A_y is the root-to-y ancestor set, without ever materializing the Kronecker product. Operationally the model is a single (n_nodes, d) linear layer with the alpha-weighted path indicator M_alpha = path_indicator @ diag(alpha) (a matrix product, so column n of the indicator is scaled by alpha_n) stored in a frozen buffer, so the forward pass is two matmuls:

node_scores = X @ W.T; path_scores = node_scores @ M_alpha.T.
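
The two matmuls can be checked on a toy tree. The sketch below is plain Python rather than the project's tensor code, and it is illustrative only: the five-node hierarchy, the alpha values, W and X are made up, and A_y is assumed to include both the root and y itself.

```python
# Toy hierarchy (illustrative): parent[n] is the parent of node n, -1 marks the root.
parent = [-1, 0, 0, 1, 1]              # 0 -> {1, 2}, 1 -> {3, 4}
alpha = [1.0, 0.5, 0.5, 0.25, 0.25]    # per-node weights alpha_n (made-up values)


def ancestors(y):
    """Root-to-y node set A_y; assumed here to include both the root and y."""
    nodes = set()
    while y != -1:
        nodes.add(y)
        y = parent[y]
    return nodes


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


n_nodes = len(parent)
# M_alpha[y][n] = alpha_n if n lies on the root-to-y path, else 0.
M_alpha = [[alpha[n] if n in ancestors(y) else 0.0 for n in range(n_nodes)]
           for y in range(n_nodes)]

W = [[1, 0], [0, 1], [1, 1], [2, -1], [-1, 2]]   # (n_nodes, d) with d = 2
X = [[1, 2], [3, -1]]                            # (batch, d)

node_scores = matmul(X, transpose(W))                  # (batch, n_nodes)
path_scores = matmul(node_scores, transpose(M_alpha))  # (batch, n_nodes)


def gamma(x, y):
    """Direct definition: sum over n in A_y of alpha_n * (W_n^T x)."""
    return sum(alpha[n] * sum(w * v for w, v in zip(W[n], x)) for n in ancestors(y))
```

Each entry path_scores[i][y] agrees with gamma(X[i], y), which is the point of the factorization: the per-node scores are computed once and shared by every path that passes through the node.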

Training optimizes the structured-SVM margin under the Crammer–Singer joint multiclass objective (Crammer & Singer 2001) with a tree-distance loss delta(y, y') = sqrt(sum alpha_n over symmetric-difference of ancestor sets), via AdamW. Implementation in src/atelier/classify/factorized_nhsvm.py.
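
A minimal sketch of the loss for a single example follows, in plain Python. It assumes margin rescaling (the tree distance is added to the competing score), which the text above does not specify, and it reuses a made-up five-node tree; the function names are hypothetical and not taken from factorized_nhsvm.py.

```python
import math

# Toy hierarchy (illustrative): parent[n] is the parent of node n, -1 marks the root.
parent = [-1, 0, 0, 1, 1]
alpha = [1.0, 0.5, 0.5, 0.25, 0.25]    # made-up per-node weights


def ancestors(y):
    """Root-to-y node set A_y; assumed to include both the root and y."""
    nodes = set()
    while y != -1:
        nodes.add(y)
        y = parent[y]
    return nodes


def tree_distance(y, y_prime):
    """delta(y, y') = sqrt of the alpha mass on the symmetric difference of A_y and A_y'."""
    diff = ancestors(y) ^ ancestors(y_prime)
    return math.sqrt(sum(alpha[n] for n in diff))


def structured_hinge(path_scores, y_true):
    """Joint (Crammer-Singer style) hinge for one example, margin-rescaled (assumption).

    The candidate y == y_true contributes exactly 0, so the max is never negative.
    """
    return max(
        tree_distance(y_true, y) + path_scores[y] - path_scores[y_true]
        for y in range(len(path_scores))
    )


scores = [1.0, 2.0, 2.5, 2.0, 2.75]        # path scores gamma(x, y) for one example
loss = structured_hinge(scores, y_true=3)  # the worst violator here is node 2
```

Because delta grows with the alpha mass separating two paths, a confusion between siblings (nodes 3 and 4) needs a smaller margin than a confusion across subtrees (nodes 3 and 2).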

Calibration

The structured hinge objective is margin-driven, not probability-calibrated. Under a 287-way softmax, hinge-trained logits flatten — top-1 probabilities collapse to the 0.02-0.04 range even when the margin to second-best is large. A single scalar softmax temperature T is fitted against a held-out reference slice, minimizing NLL via grid search over (0.05, ..., 2.00) followed by scalar refinement (src/atelier/optimize/svm/calibrate.py, following Guo et al. 2017). The fitted T is persisted in the adapter manifest and applied at inference.
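
The fitting procedure can be sketched in plain Python as below. The 0.05 grid step, the golden-section refinement and the function names are assumptions made for illustration; the text above only specifies the grid endpoints and "scalar refinement", and calibrate.py may differ.

```python
import math


def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [z / T for z in row]
        m = max(scaled)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[y]
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=2.00, step=0.05, iters=60):
    """Coarse grid search over [lo, hi], then golden-section refinement (assumed method)."""
    n_grid = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n_grid)]
    t0 = min(grid, key=lambda t: nll(logits, labels, t))
    # Refine inside the bracket formed by the best grid point's neighbours.
    a, b = max(lo, t0 - step), min(hi, t0 + step)
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits, labels, c) < nll(logits, labels, d):
            b = d
        else:
            a = c
    return (a + b) / 2


# Toy held-out slice: flat, hinge-like logits; three of four top-1 picks are right.
logits = [[0.2, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.2], [0.2, 0.0, 0.0]]
labels = [0, 1, 2, 1]
T_fit = fit_temperature(logits, labels)
```

On this toy slice the fitted T is well below 1, which sharpens the flat logits: the top-1 probability rises from roughly 0.38 at T = 1 to roughly 0.75, matching the 75% accuracy of the slice. The NLL is unimodal in T, so a bracketed one-dimensional search is sufficient.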

Implementation status

The factorized ModernBERT-NHSVM head is the canonical and default SVM evidence source for the DST pipeline. The encoder is answerdotai/ModernBERT-base, mean-pooled, 768-dim, loaded once per (model_id, device) and cached process-wide (src/atelier/optimize/svm/encoder.py). Trained heads are content-addressed and promoted via the nhsvm_head_registry table; the runtime NHSVMHeadAdapter wraps the head with encoder identity and training metadata, and _ensure_registered_svm_head (src/atelier/classify/pipeline.py:418) installs it via ml_inference.install_svm so the orchestrator’s predict_svm resolves to the registered head at classify-time. At fusion time nhsvm_to_mass applies hierarchy-distance reweighting to the calibrated per-node scores before constructing the DST mass function — see Dempster–Shafer evidence fusion.
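
The "loaded once per (model_id, device)" behaviour and the mean pooling can be illustrated with a small sketch. Everything here is hypothetical: get_encoder and mean_pool are not names from encoder.py, the loader returns a stand-in dictionary instead of a real model, and masking out padding tokens before averaging is an assumption.

```python
from functools import lru_cache

load_calls = []  # records loader invocations, only to make the caching visible


@lru_cache(maxsize=None)
def get_encoder(model_id: str, device: str):
    """Return one encoder object per (model_id, device) for the life of the process."""
    load_calls.append((model_id, device))
    # Stand-in for the expensive model load; a real loader would build the model here.
    return {"model_id": model_id, "device": device, "dim": 768}


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(col) / len(kept) for col in zip(*kept)]


enc_a = get_encoder("answerdotai/ModernBERT-base", "cpu")
enc_b = get_encoder("answerdotai/ModernBERT-base", "cpu")   # served from the cache
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

Keying the cache on both arguments means a second device gets its own copy while repeated calls for the same pair reuse the first load.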

A legacy TF-IDF + LinearSVC path survives in src/atelier/classify/svm_classifier.py, reachable only via the deprecated per_vocab_legacy / auto source modes. The deprecation is explicit: _ensure_per_vocab_svm emits a runtime warning naming the registered NHSVM head as the intended source, the classifier docstring carries a LEGACY marker, and the file is slated for deletion once a promoted head is guaranteed in every environment. Pre-roll-forward _nhsvm.pkl bundles from the prior one-vs-rest era fail loudly on load via an nhsvm_variant tag (SVMClassifier._NHSVM_BUNDLE_VERSION = "crammer_singer.v1").
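
A fail-loud variant check of this kind might look like the sketch below. The function name, the dictionary-shaped bundle and the choice of ValueError are assumptions for illustration; only the nhsvm_variant key and the "crammer_singer.v1" value come from the text above.

```python
EXPECTED_VARIANT = "crammer_singer.v1"  # value quoted in the text above


def check_bundle(bundle: dict) -> dict:
    """Reject bundles whose variant tag is missing or from an older training scheme."""
    variant = bundle.get("nhsvm_variant")
    if variant != EXPECTED_VARIANT:
        raise ValueError(
            f"unsupported NHSVM bundle variant {variant!r}; "
            f"expected {EXPECTED_VARIANT!r}, so the bundle must be retrained"
        )
    return bundle


current = check_bundle({"nhsvm_variant": "crammer_singer.v1", "weights": []})

try:
    check_bundle({"weights": []})  # a pre-roll-forward bundle carries no tag
    rejected = False
except ValueError:
    rejected = True
```

Raising at load time, rather than falling back silently, keeps a stale one-vs-rest model from feeding scores into the fusion step under the wrong assumptions.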

References

Crammer, K. & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 2:265-292.
Choi, J., Chung, J., & Hewitt, J. (2015). Normalized Hierarchical Multi-label SVM. arXiv:1508.02479. Eq. 5 (Kronecker expansion), Eq. 7 (directional alpha constraints), Eq. 9 (tree-distance reweight).
Guo, C. et al. (2017). On calibration of modern neural networks. ICML.
Warner, B. et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv:2412.13663.