
ML Governance

Production ML lifecycle: versioned model registry, PSI drift monitoring, canary deployment, and hybrid global+tenant models

ML Lifecycle Overview

QuickSecure operates a full production ML lifecycle. Every model is versioned, signed, canary-validated, drift-monitored, and auditable.

  1. Train: Labeled TP/FP/FN/TN data from the fleet. Feature vectors persisted during Shadow mode.
  2. Validate: Canary deployment to a subset of endpoints. FP rate compared against the active model.
  3. Promote: Canary passes thresholds. The model is promoted to Stable and the signed artifact distributed.
  4. Monitor: PSI drift scoring, confusion matrix, latency P95. Auto-retrain on threshold breach.

Versioned Model Registry

Every ONNX model artifact is tracked in the model registry with full metadata:

| Field | Description |
| --- | --- |
| Version | Semantic version string (e.g., 2.4.1) |
| ModelName | Human-readable model identifier |
| Status | Training, Canary, Stable, Rolled Back |
| IsSigned | SHA-256 signature verification status |
| Precision / Recall / F1 | Validation metrics at training time |
| FalsePositiveRate | Measured FP rate from production data |
| TrainingTrigger | What triggered training: Drift, FP Rate, Schedule, Manual |
| ParentVersion | Previous model this was trained from (lineage) |
| TenantId | null = global shared; non-null = tenant-dedicated |
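A registry entry with these fields can be sketched as a plain data class. The field names come from the table above; the class itself and the example values are illustrative, not the actual registry schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelVersionRecord:
    """Sketch of a registry row; field names mirror the table above."""
    Version: str                  # semantic version, e.g. "2.4.1"
    ModelName: str                # human-readable identifier
    Status: str                   # "Training" | "Canary" | "Stable" | "Rolled Back"
    IsSigned: bool                # SHA-256 signature verified
    Precision: float              # validation metrics at training time
    Recall: float
    F1: float
    FalsePositiveRate: float      # measured FP rate from production data
    TrainingTrigger: str          # "Drift" | "FP Rate" | "Schedule" | "Manual"
    ParentVersion: Optional[str]  # lineage; None for a root model
    TenantId: Optional[str]       # None = global shared, set = tenant-dedicated

# Example values are hypothetical.
record = ModelVersionRecord(
    Version="2.4.1", ModelName="process-anomaly", Status="Stable",
    IsSigned=True, Precision=0.97, Recall=0.93, F1=0.95,
    FalsePositiveRate=0.012, TrainingTrigger="Drift",
    ParentVersion="2.4.0", TenantId=None,
)
```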

Model Download API

```shell
# Get active model info (tenant-aware: dedicated model preferred, global fallback)
curl -X GET https://corxor.com/api/ml/model/latest \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "X-Tenant-Id: your-tenant-id"

# Download model artifact
curl -X GET "https://corxor.com/api/ml/model/download?version=2.4.1&type=onnx" \
  -H "X-Api-Key: YOUR_API_KEY" -o model.onnx
```

Drift Monitoring (PSI)

Population Stability Index (PSI) measures distribution shift between the training feature distribution and production inference distribution.

| PSI Range | Severity | Action |
| --- | --- | --- |
| < 0.10 | None | No action required |
| 0.10 – 0.25 | Low | Monitoring increased |
| 0.25 – 0.50 | Medium | Alert generated, retraining queued |
| > 0.50 | High | Automatic retraining triggered |
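PSI is computed per feature bin as the sum of (actual − expected) × ln(actual / expected). A minimal sketch of the score and the severity tiers above (bin handling and the epsilon guard are assumptions, not QuickSecure's exact implementation):

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned distributions.
    expected = training-time bin proportions, actual = production proportions.
    eps guards against empty bins."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        score += (a - e) * math.log(a / e)
    return score

def severity(score: float) -> str:
    """Map a PSI score to the action tiers from the table above."""
    if score < 0.10:
        return "None"
    if score < 0.25:
        return "Low"
    if score < 0.50:
        return "Medium"
    return "High"
```

Identical distributions score 0; the further the production distribution drifts from the training baseline, the higher the score.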

Dual-Output Drift Workers

Drift is computed both globally (all tenants combined) and per-tenant (isolated). This enables:

  • Global model health visibility across the entire fleet
  • Per-tenant drift detection for dedicated model training eligibility
  • ML_TENANT_TRAINING_ELIGIBLE audit event when a tenant has ≥500 events + elevated drift/FP rate

Drift snapshots are uploaded by endpoints every 6 hours with local confidence distribution statistics. The server computes PSI against the baseline.
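A drift snapshot might carry a binned confidence histogram the server can compare against its baseline. The payload shape and field names below are assumptions for illustration, not the actual wire format:

```python
import json
import statistics

def confidence_histogram(values: list[float], bins: int = 10) -> list[int]:
    """Bin [0, 1] confidence scores into equal-width bucket counts."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

confidences = [0.91, 0.88, 0.97, 0.42, 0.83]  # local inference confidences

# Hypothetical 6-hour snapshot payload.
snapshot = {
    "endpointId": "ep-001",
    "tenantId": "your-tenant-id",
    "windowHours": 6,
    "eventCount": len(confidences),
    "confidenceMean": round(statistics.mean(confidences), 4),
    "confidenceHistogram": confidence_histogram(confidences),
}
payload = json.dumps(snapshot)
```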

Canary Deployment

New model versions are deployed to a configurable subset of endpoints before fleet-wide promotion.

Canary Lifecycle

  • Traffic Split: A configurable percentage of endpoints receives the canary model
  • Monitoring: Every 30 minutes, MLCanaryMonitorWorker compares the canary FP rate against the active model's
  • Auto-Rollback: Triggered if the canary FP rate exceeds the absolute threshold (5%) or is more than 2x the active model's rate
  • Promotion: After sufficient events with healthy metrics, the canary is promoted to Stable

Per-Model Scope

Canary monitoring runs per-model: both global canaries and tenant-dedicated canaries are monitored independently. A tenant canary rollback does not affect the global model.

Minimum Events: At least 50 canary inference events are required before an auto-rollback decision is made. This prevents premature rollback on low-traffic tenants.
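The rollback rules above can be sketched as a single decision function. The 5% absolute threshold, 2x ratio, and 50-event floor come from this section; the function itself is an illustrative sketch of MLCanaryMonitorWorker's logic, not its actual code:

```python
def should_rollback(canary_fp: float, active_fp: float, canary_events: int,
                    abs_threshold: float = 0.05, ratio: float = 2.0,
                    min_events: int = 50) -> bool:
    """Auto-rollback decision: require at least min_events canary inferences,
    then roll back if the canary FP rate exceeds the absolute threshold or
    is more than ratio times the active model's rate."""
    if canary_events < min_events:
        return False  # not enough evidence yet; protects low-traffic tenants
    return canary_fp > abs_threshold or canary_fp > ratio * active_fp
```

Note that with too few events the function never rolls back, even on a bad FP rate, which is exactly the low-traffic protection described above.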

Hybrid Model Registry

The model registry supports both global shared models and tenant-dedicated models.

Resolution Order

  1. If tenant context is present and a tenant-dedicated model exists with Status = Stable or Canary → use tenant model
  2. Otherwise → fall back to global shared model
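The resolution order can be sketched as follows; registry rows are plain dicts here for illustration, and the function name is hypothetical:

```python
from typing import Optional

def resolve_model(registry: list[dict], tenant_id: Optional[str]) -> dict:
    """Tenant-aware resolution: prefer a dedicated model in Stable or Canary
    status, otherwise fall back to the global (TenantId = None) Stable model."""
    if tenant_id is not None:
        for m in registry:
            if m["TenantId"] == tenant_id and m["Status"] in ("Stable", "Canary"):
                return m
    return next(m for m in registry
                if m["TenantId"] is None and m["Status"] == "Stable")

registry = [
    {"Version": "2.4.1", "TenantId": None, "Status": "Stable"},
    {"Version": "1.2.0", "TenantId": "tenant-abc", "Status": "Canary"},
]
```

A request with tenant context "tenant-abc" resolves to the dedicated 1.2.0 canary; any other tenant, or no tenant context, falls back to the global 2.4.1 model.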

When Does a Tenant Get a Dedicated Model?

The MLDriftWorker automatically flags a tenant as training-eligible when:

  • ≥ 500 ML events in the last 30 days
  • Elevated FP rate or PSI drift score for that tenant

When both conditions hold, an ML_TENANT_TRAINING_ELIGIBLE audit event is generated.

Dedicated model training is then triggered by the ML operations team or automated pipeline.
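The eligibility check can be sketched as below. The 500-event floor is from this section; the 5% FP and 0.25 PSI cutoffs are assumptions for illustration (0.25 matches the "Medium" drift tier above):

```python
def training_eligible(event_count_30d: int, fp_rate: float, psi_score: float,
                      min_events: int = 500, fp_threshold: float = 0.05,
                      psi_threshold: float = 0.25) -> bool:
    """MLDriftWorker eligibility sketch: enough recent ML events plus an
    elevated FP rate or PSI drift score for the tenant."""
    if event_count_30d < min_events:
        return False
    return fp_rate >= fp_threshold or psi_score >= psi_threshold
```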

| ModelVersion.TenantId | Meaning | Served To |
| --- | --- | --- |
| null | Global shared model | All endpoints without a dedicated model |
| tenant-abc | Tenant-dedicated model | Only endpoints belonging to that tenant |

Zero breaking changes: Existing endpoints without tenant context continue to receive the global model. The hybrid registry is fully backward-compatible.

Background Workers

The ML governance cloud runs four background workers:

| Worker | Interval | Purpose |
| --- | --- | --- |
| MLMetricsAggregationWorker | 1 hour | Dual output: global + per-tenant confusion matrix, latency, decision counts |
| MLDriftWorker | 6 hours | PSI computation from drift snapshots; per-tenant training eligibility detection |
| MLCanaryMonitorWorker | 30 min | Monitors all canary models (global + tenant); auto-rollback on FP threshold breach |
| TenantBackfillWorker | Startup | One-time: populates TenantId on legacy ML rows via Endpoint → License chain |