Validating Actuarial Input Schemas with Pydantic

Validating actuarial input schemas with Pydantic means turning a written actuarial data dictionary into an executable contract that rejects a malformed policy record at the ingestion boundary, before it can distort a reserve. This is the structural layer of the two-stage Schema Validation with Pydantic & Great Expectations gate: Pydantic proves every field is the right type and within absolute bounds, so the statistical layer that follows only ever sees clean input. This guide is a focused build recipe for that structural contract — mapping each column the projection engine consumes to a typed field, encoding the actuarial invariants a plain type system cannot express, and converting every rejection into an audit-grade trace an examiner will accept.

Scoping the Problem

Actuarial ingestion defects are quiet. A face_amount column silently coerced from string to float, a lapse assumption arriving as a percentage where the engine expects a decimal, or a claim date that lands after a policy termination date will all survive a naive pd.read_csv and only surface as an unexplained movement in the reserve roll-forward — often after the filing has left the building. The technique here is narrow and deterministic: model exactly one input record with Pydantic schema enforcement so that a record is either well-formed against the contract or rejected with a precise, machine-readable reason. Distributional drift — a batch that is individually valid but statistically shifted — is a different problem handled downstream by Dynamic Threshold Tuning for Assumption Drift; this page deliberately stops at the structural boundary.

Minimal Working Contract

The whole technique fits in one model plus one factory function. The model below encodes a life-insurance in-force record; strict typing disables silent coercion, Field constraints enforce absolute bounds drawn from the data dictionary, and validator hooks carry the cross-field actuarial rules.

from datetime import date
from decimal import Decimal
from enum import Enum
from pydantic import BaseModel, Field, field_validator, model_validator, ValidationError


class PremiumMode(str, Enum):
    annual = "annual"
    semiannual = "semiannual"
    quarterly = "quarterly"
    monthly = "monthly"


class PolicyRecord(BaseModel):
    model_config = {"strict": True, "extra": "forbid"}

    policy_id: str = Field(min_length=1, max_length=32)
    valuation_date: date
    issue_date: date
    issue_age: int = Field(ge=0, le=120)
    face_amount: Decimal = Field(gt=0, le=Decimal("100_000_000"))
    mortality_table: str
    lapse_rate: Decimal = Field(ge=Decimal("0"), le=Decimal("1"), decimal_places=6)
    premium_mode: PremiumMode

    @field_validator("mortality_table")
    @classmethod
    def known_table(cls, v: str) -> str:
        approved = {"2017_CSO_ANB", "2017_CSO_ALB", "VBT_2015"}
        if v not in approved:
            raise ValueError(f"mortality_table {v!r} not in approved set")
        return v

    @model_validator(mode="after")
    def issue_precedes_valuation(self) -> "PolicyRecord":
        if self.issue_date > self.valuation_date:
            raise ValueError("issue_date must not follow valuation_date")
        if self.issue_age > 120 - (self.valuation_date.year - self.issue_date.year):
            raise ValueError("attained age exceeds terminal age of mortality basis")
        return self

Block-by-Block Rationale

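model_config = {"strict": True, "extra": "forbid"}. Strict mode is the single most important line for financial data. Without it, Pydantic v2 will happily turn the string "250000" into an int and "1.0" into a float, masking exactly the source-system export defects you are trying to catch. Strict mode cuts both ways: when a Python dict is validated, date, Decimal, and Enum fields must arrive as instances of those types, so text from a CSV extract has to be parsed explicitly before it reaches model_validate, or be validated as JSON with model_validate_json, where strict mode still accepts ISO date strings because JSON has no date type. extra="forbid" makes an unexpected column a hard error rather than a silently dropped field, so a renamed upstream header surfaces immediately instead of nulling a value the engine needs.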
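As a rough illustration of what strict mode changes, the sketch below validates the same raw row against a lax and a strict model. LaxRow, StrictRow, and face_units are hypothetical names used only for this example, not part of the contract above.

```python
from datetime import date

from pydantic import BaseModel, ConfigDict, ValidationError


class LaxRow(BaseModel):  # hypothetical: default (lax) validation
    face_units: int
    issue_date: date


class StrictRow(BaseModel):  # hypothetical: same fields, strict validation
    model_config = ConfigDict(strict=True)

    face_units: int
    issue_date: date


raw = {"face_units": "250000", "issue_date": "2010-01-01"}

# Lax mode silently coerces both strings.
lax = LaxRow.model_validate(raw)

# Strict mode rejects both strings when the input is a Python dict.
try:
    StrictRow.model_validate(raw)
    strict_error_fields = []
except ValidationError as exc:
    strict_error_fields = sorted(err["loc"][0] for err in exc.errors())

# Typed Python values pass, and so does JSON text with an ISO date string.
typed = StrictRow.model_validate(
    {"face_units": 250000, "issue_date": date(2010, 1, 1)}
)
from_json = StrictRow.model_validate_json(
    '{"face_units": 250000, "issue_date": "2010-01-01"}'
)
```

The practical consequence is that a strict contract needs an explicit, reviewable parsing step (or JSON input) in front of it; it will not parse text on your behalf.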

face_amount: Decimal rather than float. Cash-flow projection sums millions of policy-level amounts; binary floating point accumulates rounding error that a regulator can see in a reconciliation. Modelling money as Python’s Decimal with an explicit upper bound keeps the arithmetic exact and rejects a fat-fingered face amount above 100 million that is almost certainly a units error.
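The rounding point can be seen with standard-library arithmetic alone. This is a toy sum of ten amounts of 0.10, not a reserve calculation:

```python
from decimal import Decimal

# Ten amounts of 0.10 summed in binary floating point.
float_total = sum([0.1] * 10)

# The same ten amounts summed as exact decimals.
decimal_total = sum([Decimal("0.1")] * 10, Decimal("0"))

print(float_total == 1.0)               # False
print(decimal_total == Decimal("1.0"))  # True
```

The float discrepancy is tiny per operation, but it depends on summation order, which is what makes reconciliations between two runs disagree in the last digits.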

lapse_rate: Decimal with decimal_places=6. Modelling the lapse assumption as a bounded decimal, not a float, keeps assumption precision reproducible across runs and rejects a rate accidentally supplied as a whole percentage (6 instead of 0.06), which the le=Decimal("1") bound catches at the boundary. The bound only catches percentages above 1: a rate of 0.5 that was meant as 0.5% still passes, so the unit convention needs to be fixed in the data dictionary as well. The same rate feeds the Policy Lapse & Surrender Assumption Engines downstream, so a clean contract here protects every consumer of the field.
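A minimal sketch of how those constraints behave on their own, using a hypothetical single-field model (LapseAssumption and rejection_types are illustrative names):

```python
from decimal import Decimal

from pydantic import BaseModel, Field, ValidationError


class LapseAssumption(BaseModel):  # hypothetical single-field model
    lapse_rate: Decimal = Field(
        ge=Decimal("0"), le=Decimal("1"), decimal_places=6
    )


def rejection_types(value: Decimal) -> list[str]:
    """Return the Pydantic error types raised for a candidate rate."""
    try:
        LapseAssumption(lapse_rate=value)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


ok = rejection_types(Decimal("0.06"))               # within bounds
as_percent = rejection_types(Decimal("6"))          # whole percent
too_precise = rejection_types(Decimal("0.0612345")) # seven decimal places
ambiguous = rejection_types(Decimal("0.5"))         # 50%, or 0.5% mis-keyed?
```

The last case passes, which is the limit of what an absolute bound can do: a plausibility check on the size of the rate belongs in the statistical layer.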

premium_mode: PremiumMode. A closed Enum turns a free-text category into a fixed domain. Any value outside the four permitted modes fails validation, which is both safer and more self-documenting than a string with a comment.
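The sketch below isolates the enum field to show the closed domain in action. ModeOnly and is_rejected are hypothetical names. Under strict mode a Python input must already be a PremiumMode member, while JSON text is matched against the member values.

```python
from enum import Enum

from pydantic import BaseModel, ConfigDict, ValidationError


class PremiumMode(str, Enum):
    annual = "annual"
    semiannual = "semiannual"
    quarterly = "quarterly"
    monthly = "monthly"


class ModeOnly(BaseModel):  # hypothetical single-field model
    model_config = ConfigDict(strict=True)

    premium_mode: PremiumMode


def is_rejected(value: object) -> bool:
    """True when the value fails validation as a Python input."""
    try:
        ModeOnly.model_validate({"premium_mode": value})
    except ValidationError:
        return True
    return False


member = ModeOnly.model_validate({"premium_mode": PremiumMode.monthly})
from_json = ModeOnly.model_validate_json('{"premium_mode": "annual"}')
unknown_rejected = is_rejected("weekly")      # outside the domain
raw_string_rejected = is_rejected("annual")   # right value, wrong type (strict)
```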

@field_validator("mortality_table"). Field validators run on a single value. Here it enforces that the table label is one of the approved bases. It is the deterministic sibling of the drift and monotonicity checks in Mortality & Morbidity Rate Validation. Restricting the basis at ingestion prevents a projection from silently running on a misspelled, retired, or otherwise unapproved table.

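@model_validator(mode="after"). Model validators run after every field is individually valid, so they can safely compare fields. issue_precedes_valuation encodes two actuarial invariants a type system cannot: an issue date may not follow the valuation date, and attained age at valuation may not exceed the terminal age of the mortality basis, taken here as 120. The attained-age check approximates policy duration by the difference in calendar years, which overstates completed years by one when the valuation date falls before the policy anniversary, so it errs on the side of rejecting. Cross-field logic belongs beside the schema, not in a separate spreadsheet of business rules.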
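If the calendar-year approximation in the attained-age check is too coarse for a given block of business, duration can be counted in completed policy years instead. The helper below is a hypothetical sketch using only the standard library; completed_years is not part of the contract above.

```python
from datetime import date


def completed_years(start: date, end: date) -> int:
    """Whole years elapsed from start to end (anniversary-based)."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet been reached in end's year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


issue = date(2010, 7, 1)
valuation = date(2026, 6, 30)

calendar_diff = valuation.year - issue.year   # 16
exact = completed_years(issue, valuation)     # 15
```

Whether attained age should advance on the policy anniversary or on another convention depends on the age basis of the mortality table, so treat the choice as an assumption to confirm against the data dictionary.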

Turning a Rejection into an Audit Trace

A contract is only useful if a single bad row does not abort the batch and every rejection is explained. The factory function below validates an untrusted extract row by row, accumulating typed records into a clean frame and routing failures to a quarantine ledger keyed by policy_id and the exact rule that rejected them.

import pandas as pd


def validate_extract(rows: list[dict]) -> tuple[pd.DataFrame, list[dict]]:
    accepted, quarantined = [], []
    for raw in rows:
        try:
            record = PolicyRecord.model_validate(raw)
            accepted.append(record.model_dump())
        except ValidationError as exc:
            quarantined.append({
                "policy_id": raw.get("policy_id", "UNKNOWN"),
                "errors": exc.errors(include_url=False),
            })
    return pd.DataFrame(accepted), quarantined

exc.errors(include_url=False) is the load-bearing detail: it returns a structured list naming the field, the rejected value, and the failing constraint — the machine-readable evidence a model-risk reviewer expects when asking why a record was excluded from a valuation. The accepted frame is a clean DataFrame ready for the vectorized handling in Pandas & NumPy for Actuarial Data Pipelines, and the quarantine ledger becomes an input to the Actuarial Audit Trail Architecture that reconstructs lineage from raw extract to filed number.
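One way to make that evidence easy to store and query is to flatten each error into one ledger row. The sketch below is an assumption about what such a ledger might look like: MiniRecord is a hypothetical stand-in for PolicyRecord, and the column names are illustrative.

```python
import json

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class MiniRecord(BaseModel):  # hypothetical stand-in for PolicyRecord
    model_config = ConfigDict(strict=True)

    policy_id: str
    issue_age: int = Field(ge=0, le=120)


def ledger_rows(policy_id: str, exc: ValidationError) -> list[dict]:
    """Flatten a ValidationError into one JSON-friendly row per failed rule."""
    rows = []
    for err in exc.errors(include_url=False):
        rows.append({
            "policy_id": policy_id,
            # loc is empty for model-level validators, so label those rows.
            "field": ".".join(str(part) for part in err["loc"]) or "<record>",
            "rule": err["type"],
            "message": err["msg"],
            "rejected_value": repr(err["input"]),
        })
    return rows


ledger: list[dict] = []
try:
    MiniRecord.model_validate({"policy_id": "P-001", "issue_age": 130})
except ValidationError as exc:
    ledger = ledger_rows("P-001", exc)

print(json.dumps(ledger, indent=2))
```

Picking out plain string fields keeps the rows serialisable even when an error's context carries a Python exception object, as it can for errors raised inside custom validators.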

Edge Cases and Production Hardening

Mixed types sneaking back in through Union types. A field typed int | str weakens the contract even under strict mode. Pydantic does not coerce between the members, but it accepts whichever member the input already matches, so the string "250000" passes as a str instead of being rejected and the column ends up holding mixed types. Keep financial fields single-typed; where an upstream system genuinely sends a value two ways, normalise it in an explicit @field_validator(mode="before") that you can point an auditor at, rather than delegating the choice to union resolution.

The circuit breaker for a poisoned extract. Row-by-row quarantine is the right behaviour for a handful of bad records, but a bad source export or an upstream schema migration can fail most of the batch. Guard against silently proceeding on a partial population:

def gate_extract(rows: list[dict], reject_ratio: float = 0.05) -> pd.DataFrame:
    frame, quarantined = validate_extract(rows)
    if rows and len(quarantined) / len(rows) > reject_ratio:
        raise RuntimeError(
            f"structural reject ratio {len(quarantined) / len(rows):.1%} "
            f"exceeds {reject_ratio:.0%}; halting batch for review"
        )
    return frame

When more than a configured fraction of rows fail the contract, the extract itself is suspect and the batch should abort loudly rather than hand a half-populated cohort to a projection engine.

Per-row Python cost at scale. model_validate in a Python loop is fine for tens of thousands of rows but becomes the bottleneck at millions. Rather than weaken the contract, move validation off the critical path with the orchestration patterns in Async Batch Processing for Large Models, and profile before the validated cohort ever reaches a Stochastic Scenario Generation Framework so no Monte Carlo run consumes unvalidated input.
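A first step that does not weaken the contract is to validate in bounded chunks, so memory stays flat and each chunk becomes an independent unit of work that an orchestrator could schedule. This is a minimal sketch under that assumption: MiniRecord is a hypothetical stand-in for PolicyRecord, and iter_chunks and validate_chunk are illustrative helpers, not part of any library.

```python
from itertools import islice
from typing import Iterable, Iterator

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class MiniRecord(BaseModel):  # hypothetical stand-in for PolicyRecord
    model_config = ConfigDict(strict=True, extra="forbid")

    policy_id: str
    issue_age: int = Field(ge=0, le=120)


def iter_chunks(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


def validate_chunk(chunk: list[dict]) -> tuple[list[MiniRecord], list[dict]]:
    """Validate one chunk, splitting it into accepted and quarantined rows."""
    accepted, quarantined = [], []
    for raw in chunk:
        try:
            accepted.append(MiniRecord.model_validate(raw))
        except ValidationError as exc:
            quarantined.append({
                "policy_id": raw.get("policy_id", "UNKNOWN"),
                "errors": exc.errors(include_url=False),
            })
    return accepted, quarantined


rows = [{"policy_id": f"P-{i}", "issue_age": 40 + i} for i in range(4)]
rows.append({"policy_id": "P-bad", "issue_age": 130})

chunk_sizes, accepted_total, quarantined_total = [], 0, 0
for chunk in iter_chunks(rows, size=2):
    good, bad = validate_chunk(chunk)
    chunk_sizes.append(len(chunk))
    accepted_total += len(good)
    quarantined_total += len(bad)
```

Chunking alone does not make validation faster; it only bounds memory and creates units that can be distributed, so any speed-up should be measured rather than assumed.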

Prove the Contract with Unit Tests

The schema is itself model input, so it should be tested like model code — a red test is the cheapest possible place to catch a loosened bound.

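from datetime import date
from decimal import Decimal

import pytest


def valid_payload(**overrides):
    # Strict mode validates Python input by type, so the payload carries
    # date, Decimal, and PremiumMode instances rather than strings. With
    # string values every test would fail on typing, not on the rule under test.
    payload = {
        "policy_id": "P-001",
        "valuation_date": date(2026, 6, 30),
        "issue_date": date(2010, 1, 1),
        "issue_age": 30,
        "face_amount": Decimal("250000"),
        "mortality_table": "2017_CSO_ANB",
        "lapse_rate": Decimal("0.06"),
        "premium_mode": PremiumMode.annual,
    }
    payload.update(overrides)
    return payload


def test_valid_record_accepted():
    # Baseline: proves the rejections below come from the overridden field.
    assert PolicyRecord.model_validate(valid_payload()).issue_age == 30


def test_future_issue_date_rejected():
    with pytest.raises(ValidationError, match="issue_date must not follow"):
        PolicyRecord.model_validate(valid_payload(issue_date=date(2026, 12, 1)))


def test_lapse_rate_supplied_as_percent_rejected():
    with pytest.raises(ValidationError) as excinfo:
        # whole percent, not a decimal
        PolicyRecord.model_validate(valid_payload(lapse_rate=Decimal("6")))
    assert excinfo.value.errors()[0]["loc"] == ("lapse_rate",)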

Compliance Note

Under NAIC VM-20, a principle-based reserve must rest on data whose provenance and quality can be demonstrated to an examiner, and the Federal Reserve’s SR 11-7 holds model inputs to the same validation rigour as model code. An executable Pydantic contract satisfies both in a way a manual reconciliation cannot: the schema is version-controlled, every rejection carries a structured reason, and the quarantine ledger is the documented evidence that unfit records never reached the valuation. It is the enforcement point that lets the broader NAIC VM-20 Compliance Frameworks and the governance discipline of the Assumption Validation & Rule Engine Design subsystem rest on a defensible data foundation. That is the difference between “the load ran without an exception” and a contract you can hand to a regulator.

Related Guides

Schema Validation with Pydantic & Great Expectations: the two-layer gate this structural contract plugs into.
Pandas & NumPy for Actuarial Data Pipelines: vectorized handling for the validated frame this contract emits.
Actuarial Audit Trail Architecture: how the quarantine ledger becomes examiner-ready lineage.
OSFI Model Risk Management Guidelines: the parallel Canadian expectations for validated model inputs.
