Async Batch Processing for Large Models

Async batch processing is the execution layer that lets an actuarial valuation engine project millions of policy records across thousands of stochastic scenarios inside a fixed statutory reporting window without exhausting memory, blocking on I/O, or losing the deterministic ordering that examiners require. A synchronous run that loads an entire in-force block into memory and iterates policy-by-policy stalls the moment a database commit, an object-store read, or a scenario-file fetch blocks the interpreter; under the CTE-based stochastic reserve of NAIC VM-20 Section 7, where a single valuation may evaluate 1,000 economic paths against every policy, that stall is the difference between clearing a quarter-end filing deadline and missing it. This guide shows how to structure the run as bounded, checkpointed, non-blocking batches that scale horizontally while preserving the bitwise reproducibility and audit trail that regulatory review demands. It sits within the broader Actuarial Model Ingestion & Testing Workflows reference architecture as the production-execution phase.

The Problem: Throughput Under a Hard Deadline

The computational shape of a principle-based valuation is unforgiving. The deterministic reserve is a single projection per policy, but the stochastic reserve under VM-20 Section 7 is the Conditional Tail Expectation at the 70% level of the scenario-level reserves, so every policy must be projected across the full economic scenario set before the reserve even exists. For a scenario reserve set $\{R_1, R_2, \dots, R_N\}$ with order statistics $R_{(1)} \le R_{(2)} \le \dots \le R_{(N)}$ and $\alpha = 0.70$, the stochastic reserve is the average of the worst $N - \lceil \alpha N \rceil$ outcomes:

$$\text{CTE}_{70} \;=\; \frac{1}{N - \lceil \alpha N \rceil}\;\sum_{i=\lceil \alpha N \rceil + 1}^{N} R_{(i)}$$

With a one-million-policy block and 1,000 scenarios, that is a billion projection cells per valuation cycle. A naive synchronous loop is doomed on two fronts at once. First, memory: materializing every scenario path for every policy simultaneously overruns any container. Second, latency: the workload is a blend of CPU-bound projection arithmetic and I/O-bound work — reading calibrated scenario files, committing intermediate reserves, and calling schema-validation services — and a synchronous interpreter simply idles during every I/O wait. Async batch processing solves the second problem directly (I/O waits yield the event loop to other work) and the first structurally (the block is streamed and processed in bounded chunks rather than loaded whole). The result is a pipeline that keeps compute saturated while respecting a memory ceiling, which is exactly what a quarter-end or year-end filing calendar requires.
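
To make the tail average concrete, here is a small worked sketch of the CTE formula using only the standard library. The ten reserve values are invented for illustration, and the function name `cte` is not part of the executor.

```python
import math


def cte(scenario_reserves: list[float], alpha: float = 0.70) -> float:
    """Average of the order statistics above index ceil(alpha * N)."""
    ordered = sorted(scenario_reserves)          # R_(1) <= ... <= R_(N)
    cutoff = math.ceil(alpha * len(ordered))     # ceil(alpha * N)
    tail = ordered[cutoff:]                      # R_(cutoff+1) .. R_(N), 1-indexed
    if not tail:
        raise ValueError("too few scenarios for a non-empty tail")
    return sum(tail) / len(tail)


# Ten illustrative scenario reserves, deliberately unsorted.
reserves = [4.0, 9.0, 1.0, 7.0, 10.0, 2.0, 8.0, 3.0, 6.0, 5.0]
cte_70 = cte(reserves)  # the tail is the three largest values: 8, 9, 10
print(cte_70)
```

With N = 10 the cutoff is 7, so the tail holds the three largest reserves and the result is their mean, 9.0. Very small scenario sets leave the tail empty, which is why the sketch raises rather than returning NaN.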

Architecture of the Batch Executor

The executor is a bounded producer-consumer system. A producer streams validated policy chunks off disk or object storage; a fixed pool of consumer coroutines projects each chunk; and an aggregation stage collects results in deterministic order and emits both the validated output and the audit record. Three primitives from Python’s standard library carry the design, described in the asyncio event loop reference: asyncio.Semaphore caps how many projections run at once so memory stays bounded, asyncio.to_thread (or a ProcessPoolExecutor) offloads the CPU-bound projection kernel off the event loop, and asyncio.gather aggregates coroutine results while preserving submission order. Ordering matters for more than tidiness — a reproducible audit trail depends on the aggregated reserve vector being assembled in a canonical, seed-stable sequence so that the same inputs always hash to the same output.
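
The interplay of the three primitives can be seen in a toy sketch, shown here under the assumption that a sleeping function stands in for the projection kernel. The names `slow_square` and `bounded_square` are illustrative only.

```python
import asyncio
import time


def slow_square(x: int) -> int:
    """Stand-in for a CPU-bound kernel; later submissions finish sooner."""
    time.sleep(0.01 * (5 - x))
    return x * x


async def bounded_square(x: int, semaphore: asyncio.Semaphore) -> int:
    async with semaphore:                       # at most two kernels run at once
        return await asyncio.to_thread(slow_square, x)


async def main() -> list[int]:
    semaphore = asyncio.Semaphore(2)
    # gather returns results in submission order, not completion order.
    return await asyncio.gather(*(bounded_square(x, semaphore) for x in range(5)))


ordered = asyncio.run(main())
print(ordered)
```

Even though the work items complete out of order, the gathered list follows the order in which the coroutines were submitted, which is the property the aggregation stage relies on.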

The chunks themselves come from the ingestion phase already typed and range-checked. This executor never re-validates structure; it trusts the contract enforced upstream by Schema Validation with Pydantic & Great Expectations and instead concentrates on scheduling, fault tolerance, and result integrity. The projection kernel it invokes is the vectorized reserve routine built with Pandas & NumPy for Actuarial Data Pipelines, and the scenario set it fans policies across is produced by Stochastic Scenario Generation Frameworks. Async batching is the connective tissue that runs those components at portfolio scale.

Prerequisites

Before implementing the executor, the following should be in place:

Python 3.11+, for asyncio.TaskGroup and improved structured-concurrency semantics.
Core packages: asyncio and concurrent.futures (standard library), numpy and pandas for the projection kernel, pyarrow for Parquet checkpoints, and tenacity for declarative retry logic.
A validated data contract: policy chunks must already satisfy the ingestion schema — policy_id, issue_date, valuation_date, face_amount, mortality_table, and lapse_rate are typed and range-checked before they reach this layer.
A calibrated scenario set with a recorded seed and factor covariance, so every batch draws from the same reproducible economic paths.
Conceptual grounding in the parent Actuarial Model Ingestion & Testing Workflows pipeline, and awareness of how downstream drift monitoring in the Assumption Validation & Rule Engine Design domain consumes the metrics this executor emits.
A regulatory frame: the audit and reproducibility expectations of VM-20 Section 7, OSFI E-23 Principle 4 (independent validation and reproducibility), and the Federal Reserve’s SR 11-7 model governance baseline.

Core Implementation

The canonical pattern separates three concerns: a semaphore-bounded worker that offloads CPU work off the loop, a streaming producer that yields chunks lazily, and a TaskGroup-based orchestrator that fans out workers and aggregates their results in order. Every worker checkpoints its output to Parquet before returning, so a crash never forces reprocessing of already-projected chunks.

import asyncio
import hashlib
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
from tenacity import retry, stop_after_attempt, wait_exponential_jitter


@dataclass(frozen=True)
class BatchResult:
    chunk_index: int
    scenario_reserves: np.ndarray  # shape (n_policies, n_scenarios)
    payload_hash: str


def project_chunk(
    policies: pd.DataFrame,
    scenarios: np.ndarray,
    seed: int,
) -> np.ndarray:
    """CPU-bound vectorized projection kernel — runs off the event loop."""
    rng = np.random.default_rng(seed)  # isolated per-chunk generator; this simplified kernel draws nothing from it yet
    face = policies["face_amount"].to_numpy()[:, None]
    # Broadcast scenario discount paths across the policy cohort in one pass.
    discounted = face * scenarios  # (n_policies, n_scenarios)
    lapse = policies["lapse_rate"].to_numpy()[:, None]
    return discounted * (1.0 - lapse)


@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=0.5, max=30),
    reraise=True,
)
async def run_worker(
    chunk_index: int,
    policies: pd.DataFrame,
    scenarios: np.ndarray,
    seed: int,
    semaphore: asyncio.Semaphore,
    checkpoint_dir: Path,
) -> BatchResult:
    """One bounded, checkpointed, retry-wrapped batch."""
    async with semaphore:  # cap concurrent projections -> bounded memory
        reserves = await asyncio.to_thread(project_chunk, policies, scenarios, seed)

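        # Deterministic hash of inputs+outputs for the audit trail.
        # hash_pandas_object yields stable bytes even for string (object-dtype) IDs,
        # whose raw .tobytes() would expose memory pointers rather than values.
        digest = hashlib.sha256()
        id_hashes = pd.util.hash_pandas_object(policies["policy_id"], index=False)
        digest.update(id_hashes.to_numpy().tobytes())
        digest.update(np.ascontiguousarray(reserves).tobytes())
        payload_hash = digest.hexdigest()

        # Checkpoint before returning so a later crash never reprojects this chunk.
        # String column names keep the Parquet write portable across pandas versions,
        # and the blocking write runs in a thread so the event loop stays free.
        out_path = checkpoint_dir / f"chunk_{chunk_index:06d}.parquet"
        frame = pd.DataFrame(reserves, columns=[f"s{j}" for j in range(reserves.shape[1])])
        await asyncio.to_thread(frame.to_parquet, out_path, index=False)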

        return BatchResult(chunk_index, reserves, payload_hash)


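async def stream_chunks(inforce_path: Path, chunk_rows: int):
    """Lazily yield validated policy chunks so the block is never loaded whole."""
    import pyarrow.parquet as pq  # pyarrow is already a listed prerequisite

    # iter_batches reads at most chunk_rows records at a time instead of the whole file.
    # Batches can be smaller than chunk_rows, for example at the end of the file.
    batches = pq.ParquetFile(inforce_path).iter_batches(batch_size=chunk_rows)
    for chunk_index, batch in enumerate(batches):
        yield chunk_index, batch.to_pandas()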


async def execute_valuation(
    inforce_path: Path,
    scenarios: np.ndarray,
    base_seed: int,
    checkpoint_dir: Path,
    max_concurrency: int = 8,
    chunk_rows: int = 500_000,
) -> float:
    """Fan out bounded workers, aggregate in order, and return CTE-70."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    semaphore = asyncio.Semaphore(max_concurrency)
    results: list[BatchResult] = []

    async with asyncio.TaskGroup() as tg:
        tasks = []
        async for chunk_index, chunk in stream_chunks(inforce_path, chunk_rows):
            # Seed isolation: derive a per-chunk seed so paths stay reproducible.
            chunk_seed = base_seed + chunk_index
            tasks.append(
                tg.create_task(
                    run_worker(
                        chunk_index, chunk, scenarios, chunk_seed,
                        semaphore, checkpoint_dir,
                    )
                )
            )
        # gather returns BatchResult objects in submission order; the sort makes the
        # canonical chunk order explicit for the audit vector.
        results = sorted(await asyncio.gather(*tasks), key=lambda r: r.chunk_index)

    # Portfolio scenario reserves, then CTE-70 per VM-20 Section 7.
    scenario_reserves = np.concatenate([r.scenario_reserves for r in results], axis=0)
    return conditional_tail_expectation(scenario_reserves, alpha=0.70)


def conditional_tail_expectation(scenario_reserves: np.ndarray, alpha: float) -> float:
    """CTE-alpha of the aggregate scenario reserve distribution."""
    aggregate = scenario_reserves.sum(axis=0)      # total reserve per scenario
    ordered = np.sort(aggregate)                    # ascending
    cutoff = int(np.ceil(alpha * ordered.size))
    tail = ordered[cutoff:]
    return float(tail.mean())

The design keeps the event loop free of arithmetic: project_chunk is synchronous NumPy that runs inside asyncio.to_thread, while the coroutines handle only scheduling, checkpoint I/O, and aggregation. The per-chunk seed derivation (base_seed + chunk_index) guarantees that re-running the valuation regenerates identical paths — the reproducibility property regulators reconcile against. For a focused, runnable walkthrough of the async machinery in isolation, see Implementing Asyncio for High-Volume Actuarial Batch Jobs.
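
The executor as shown writes a checkpoint per chunk but does not itself skip chunks that already have one. A minimal sketch of the resume step follows, assuming the `chunk_{index:06d}.parquet` naming used by the worker and assuming a checkpoint file exists only once its write has completed. The helper `pending_chunk_indices` is hypothetical.

```python
import tempfile
from pathlib import Path


def pending_chunk_indices(checkpoint_dir: Path, n_chunks: int) -> list[int]:
    """Return chunk indices that have no checkpoint file yet."""
    done = {
        int(path.stem.removeprefix("chunk_"))
        for path in checkpoint_dir.glob("chunk_*.parquet")
    }
    return [i for i in range(n_chunks) if i not in done]


# Simulate a run that crashed after finishing chunks 0 and 2 of 4.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_dir = Path(tmp)
    for finished in (0, 2):
        (checkpoint_dir / f"chunk_{finished:06d}.parquet").touch()
    todo = pending_chunk_indices(checkpoint_dir, n_chunks=4)

print(todo)
```

A restarted run would then schedule workers only for the indices in `todo` and read the remaining reserves back from the existing checkpoints before aggregating.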

Configuration and Tuning

Two parameters govern the throughput-versus-memory trade-off, and both must be tuned to the container, not guessed. max_concurrency bounds how many projection matrices are in flight at once; chunk_rows sets how large each matrix is. Peak resident memory per worker is roughly chunk_rows × n_scenarios × 8 bytes for a float64 reserve matrix, so the ceiling for in-flight projections is that product multiplied by max_concurrency. Intermediate arrays inside the kernel can briefly add to this, and execute_valuation as shown also keeps each chunk's returned matrix until aggregation, so size the two parameters together against available RAM with headroom for those, the OS, and the Arrow buffers.

import os

def tune_executor(available_ram_gb: float, n_scenarios: int) -> dict:
    """Derive concurrency and chunk size from the container's resources."""
    vcpus = os.cpu_count() or 4
    # Leave 25% headroom; reserve matrices dominate the footprint.
    usable_bytes = available_ram_gb * 1e9 * 0.75
    bytes_per_row = n_scenarios * 8            # float64 scenario vector per policy

    # One worker per vCPU is a safe start for a to_thread/NumPy kernel.
    max_concurrency = max(2, min(vcpus, 16))
    # Divide the memory budget across the concurrent workers.
    chunk_rows = int(usable_bytes / (bytes_per_row * max_concurrency))
    # Clamp to a sane operational band.
    chunk_rows = max(100_000, min(chunk_rows, 2_000_000))

    return {"max_concurrency": max_concurrency, "chunk_rows": chunk_rows}
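
As a worked check of the memory arithmetic, the sketch below plugs in the defaults from execute_valuation (chunk_rows of 500,000 and max_concurrency of 8) with an assumed 1,000 scenarios. The helper name `peak_reserve_bytes` is illustrative.

```python
def peak_reserve_bytes(chunk_rows: int, n_scenarios: int, max_concurrency: int) -> int:
    """In-flight float64 reserve matrices: rows x scenarios x 8 bytes x workers."""
    bytes_per_matrix = chunk_rows * n_scenarios * 8
    return bytes_per_matrix * max_concurrency


per_worker = peak_reserve_bytes(500_000, 1_000, max_concurrency=1)
total = peak_reserve_bytes(500_000, 1_000, max_concurrency=8)
print(f"{per_worker / 1e9:.0f} GB per worker, {total / 1e9:.0f} GB across 8 workers")
```

Under these assumed numbers the in-flight matrices alone need about 32 GB, which fits inside 75% of a 64 GB container but not a 32 GB one. That is the situation in which the tuning function would shrink chunk_rows.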

Three tuning notes matter in production. First, if the projection kernel releases the GIL (large vectorized NumPy operations generally do), asyncio.to_thread is enough; if it spends time in Python-level loops, switch to a ProcessPoolExecutor so the CPU cores are genuinely parallel. Second, the retry wait must carry jitter, which is why the worker uses wait_exponential_jitter rather than a plain exponential wait: synchronized retries after a shared data-lake blip create a thundering herd that can re-trigger the outage. Third, keep chunk_rows a stable, recorded configuration value: changing it between the dress-rehearsal run and the filed run changes the checkpoint layout and complicates reconciliation, even though the CTE result is invariant to chunking.
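
The effect of jitter can be illustrated without any retry library. The sketch below uses a generic "full jitter" scheme, which is an assumption for illustration and not tenacity's internal formula. The `initial` and `cap` values mirror the worker's decorator arguments.

```python
import random


def backoff_delays(attempts: int, rng: random.Random,
                   initial: float = 0.5, cap: float = 30.0) -> list[float]:
    """Full-jitter backoff: uniform draw below an exponentially growing ceiling."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, initial * 2 ** attempt)
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Two workers that failed at the same instant, each with its own generator.
worker_a = backoff_delays(4, random.Random(1))
worker_b = backoff_delays(4, random.Random(2))
print(worker_a)
print(worker_b)
```

Without the random draw both workers would wait 0.5, 1, 2, and 4 seconds and hit the recovering service together; with it, their retry times spread out below the same ceilings.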

Step-by-Step Walkthrough

Stream, do not load. Use stream_chunks to yield validated policy slices lazily. The full in-force block is never resident; only the active chunks are.
Gate concurrency with a semaphore. Acquire asyncio.Semaphore(max_concurrency) inside each worker so the number of live projection matrices — and therefore peak memory — is capped regardless of how many chunks exist.
Offload the arithmetic. Wrap the NumPy kernel in asyncio.to_thread (or submit to a ProcessPoolExecutor) so projection never blocks the event loop and I/O waits stay responsive.
Isolate the seed per chunk. Derive chunk_seed = base_seed + chunk_index and instantiate numpy.random.default_rng(chunk_seed) inside the kernel so paths are reproducible and independent of scheduling order.
Checkpoint before returning. Write each chunk’s reserves to Parquet inside the worker so a mid-run crash resumes from the last completed chunk rather than restarting.
Wrap workers in bounded retries. Apply tenacity with stop_after_attempt and wait_exponential_jitter so transient database or object-store failures self-heal without duplicating results.
Aggregate in canonical order. Sort gather results by chunk_index before concatenation so the audit reserve vector is seed-stable and re-hashes identically on a re-run.
Compute and record the reserve. Reduce the ordered scenario reserves to CTE_70, then emit the value alongside the run manifest (seeds, config, per-chunk hashes) for the filing package.
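
The run manifest mentioned in the final step could look like the sketch below. The field names, the `build_manifest` helper, and the sample hashes are all hypothetical; the executor shown earlier returns per-chunk hashes but does not write a manifest itself.

```python
import hashlib
import json


def build_manifest(base_seed: int, config: dict, chunk_hashes: dict[int, str]) -> dict:
    """Assemble a run manifest with a digest over the ordered per-chunk hashes."""
    ordered = [chunk_hashes[i] for i in sorted(chunk_hashes)]  # canonical chunk order
    run_digest = hashlib.sha256("".join(ordered).encode("ascii")).hexdigest()
    return {
        "base_seed": base_seed,
        "config": config,
        "chunk_hashes": ordered,
        "run_digest": run_digest,
    }


# Hypothetical per-chunk hashes, supplied in completion order rather than chunk order.
hashes = {
    2: hashlib.sha256(b"chunk-2").hexdigest(),
    0: hashlib.sha256(b"chunk-0").hexdigest(),
    1: hashlib.sha256(b"chunk-1").hexdigest(),
}
config = {"max_concurrency": 8, "chunk_rows": 500_000}
manifest = build_manifest(20260703, config, hashes)
manifest_json = json.dumps(manifest, sort_keys=True, indent=2)
```

Because the hashes are ordered by chunk index before the run digest is taken, the manifest is the same whatever order the workers happened to finish in.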

Validation and Testing

Correctness for a batch executor means three things must hold: the async result must equal the synchronous result exactly, the run must be reproducible bit-for-bit, and the audit trail must be complete. Test all three explicitly.

import asyncio
import numpy as np
import pytest


def test_async_matches_synchronous(sample_inforce, sample_scenarios, tmp_path):
    """Concurrency must not change the number: async == synchronous."""
    async_cte = asyncio.run(execute_valuation(
        sample_inforce, sample_scenarios, base_seed=20260703,
        checkpoint_dir=tmp_path / "async", max_concurrency=8, chunk_rows=50_000,
    ))
    serial_cte = asyncio.run(execute_valuation(
        sample_inforce, sample_scenarios, base_seed=20260703,
        checkpoint_dir=tmp_path / "serial", max_concurrency=1, chunk_rows=1_000_000,
    ))
    assert async_cte == pytest.approx(serial_cte, rel=1e-12)


def test_run_is_reproducible(sample_inforce, sample_scenarios, tmp_path):
    """Same seed -> identical reserve, regardless of scheduling order."""
    kwargs = dict(inforce_path=sample_inforce, scenarios=sample_scenarios,
                  base_seed=42, max_concurrency=4, chunk_rows=25_000)
    first = asyncio.run(execute_valuation(checkpoint_dir=tmp_path / "a", **kwargs))
    second = asyncio.run(execute_valuation(checkpoint_dir=tmp_path / "b", **kwargs))
    assert first == second


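def test_every_chunk_is_hashed(sample_inforce, sample_scenarios, tmp_path):
    """Audit completeness: one immutable hash per projected chunk."""
    run_dir = tmp_path / "run"
    asyncio.run(execute_valuation(
        sample_inforce, sample_scenarios, base_seed=42,
        checkpoint_dir=run_dir, max_concurrency=4, chunk_rows=25_000,
    ))
    checkpoints = sorted(run_dir.glob("chunk_*.parquet"))
    assert checkpoints, "the run should write at least one checkpoint"
    # load_manifest_hashes is a placeholder helper: execute_valuation as shown does
    # not persist a manifest, so supply one that reads back the per-chunk hashes.
    hashes = load_manifest_hashes(run_dir)
    assert len(hashes) == len(checkpoints)
    assert all(len(h) == 64 for h in hashes)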

Beyond unit tests, wire a Great Expectations checkpoint over the aggregated reserve DataFrame to assert distributional sanity — no nulls in the reserve vector, values within a plausible band of the prior valuation, and monotonic non-negativity where the product requires it. The drift of those aggregate metrics across successive cycles is then monitored downstream using the Population Stability Index thresholds described in Dynamic Threshold Tuning for Assumption Drift, closing the loop between production execution and assumption governance.
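
The three distributional checks can be expressed in plain Python before they are ported to an expectation suite. This sketch is not the Great Expectations API; the function name and the 25% tolerance band are arbitrary assumptions for illustration.

```python
import math


def reserve_sanity_failures(reserves: list[float], prior_total: float,
                            tolerance: float = 0.25) -> list[str]:
    """Return human-readable failures; an empty list means all checks passed."""
    failures = []
    if any(r is None or math.isnan(r) for r in reserves):
        failures.append("null or NaN reserve present")
        return failures  # later checks are meaningless with missing values
    if any(r < 0 for r in reserves):
        failures.append("negative reserve present")
    total = sum(reserves)
    if abs(total - prior_total) > tolerance * abs(prior_total):
        failures.append("total outside tolerance band of prior valuation")
    return failures


clean = reserve_sanity_failures([100.0, 250.0, 75.0], prior_total=400.0)
broken = reserve_sanity_failures([100.0, -5.0, 900.0], prior_total=400.0)
print(clean, broken)
```

Returning a list of failures rather than raising on the first one lets a run report every violated check in a single pass.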

Failure Modes and Gotchas

Blocking the event loop. The single most common async bug is calling synchronous, CPU-heavy or blocking-I/O code directly in a coroutine. A NumPy projection run inline — without asyncio.to_thread — freezes every other coroutine until it finishes, collapsing concurrency to serial. Keep the loop for scheduling and await points only; push arithmetic and blocking calls to threads or processes.
Unbounded fan-out. Creating a task per chunk without a semaphore admits every projection matrix into memory at once and triggers an out-of-memory kill mid-run. The semaphore is not optional; it is the memory ceiling.
Non-deterministic ordering. asyncio.gather returns results in submission order, but completion order is arbitrary — appending results as workers finish scrambles the reserve vector and breaks the audit hash. Always sort by chunk_index before concatenation.
Seed leakage across chunks. Sharing one global RNG across coroutines makes the drawn paths depend on scheduling, so two runs of the same inputs diverge. Instantiate an isolated default_rng per chunk from a derived seed.
Silent retry duplication. Retrying a non-idempotent worker that already committed a partial checkpoint can double-count a chunk. Make the checkpoint write idempotent (deterministic filename keyed on chunk_index) so a retry overwrites rather than appends.
Coroutine and DataFrame leaks. Long-running valuations accumulate memory if chunk DataFrames or futures are held past use. Let workers return only the arrays needed for aggregation, drop references promptly, and profile with tracemalloc on representative loads before trusting a production sizing.
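
The idempotent-checkpoint advice above can be sketched with a write-then-rename pattern. Raw bytes stand in for the Parquet payload here, and the helper `write_checkpoint` is hypothetical; the filename scheme follows the worker's.

```python
import os
import tempfile
from pathlib import Path


def write_checkpoint(checkpoint_dir: Path, chunk_index: int, payload: bytes) -> Path:
    """Write to a temp file, then atomically rename onto the deterministic name."""
    final_path = checkpoint_dir / f"chunk_{chunk_index:06d}.parquet"
    tmp_path = final_path.parent / (final_path.name + ".tmp")
    tmp_path.write_bytes(payload)
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    return final_path


with tempfile.TemporaryDirectory() as tmp:
    checkpoint_dir = Path(tmp)
    write_checkpoint(checkpoint_dir, 7, b"first attempt")
    write_checkpoint(checkpoint_dir, 7, b"retry")  # overwrites, never appends
    files = sorted(p.name for p in checkpoint_dir.iterdir())
    content = (checkpoint_dir / "chunk_000007.parquet").read_bytes()

print(files, content)
```

A retried worker leaves exactly one file per chunk, and a reader never observes a half-written checkpoint under the final name.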

Frequently Asked Questions

Does running projections concurrently change the reserve?

No. Concurrency is purely a scheduling optimization. With isolated per-chunk seeds and order-stable aggregation, the async run must produce a reserve identical to a single-threaded run — the equality is worth asserting in a test, because any divergence signals a seed-leak or ordering bug rather than a rounding effect.
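
A small sketch shows why per-chunk seeds make the result independent of scheduling. The standard library's `random.Random` stands in for `numpy.random.default_rng` here, and the function names are illustrative.

```python
import random


def chunk_draws(base_seed: int, chunk_index: int, n: int = 3) -> list[float]:
    """Draw from a generator owned by one chunk, seeded from base_seed + chunk_index."""
    rng = random.Random(base_seed + chunk_index)
    return [rng.random() for _ in range(n)]


def run(order: list[int], base_seed: int = 42) -> dict[int, list[float]]:
    """Process chunks in the given order and key the results by chunk index."""
    return {i: chunk_draws(base_seed, i) for i in order}


def run_shared(order: list[int], base_seed: int = 42) -> dict[int, list[float]]:
    """Anti-pattern: one generator shared by every chunk."""
    rng = random.Random(base_seed)
    return {i: [rng.random() for _ in range(3)] for i in order}


forward = run([0, 1, 2, 3])
shuffled = run([2, 0, 3, 1])  # a different scheduling order
```

With isolated generators each chunk sees the same draws in both orderings, whereas the shared generator hands chunk 0 different numbers depending on when it happens to run.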

Threads or processes for the projection kernel?

Use asyncio.to_thread when the kernel is dominated by GIL-releasing NumPy work, which covers most vectorized reserve projections. Switch to a ProcessPoolExecutor only when the hot path spends real time in Python-level loops, where the GIL would otherwise serialize the workers.

How should chunk size be chosen?

Derive it from the container: peak memory per worker is approximately chunk_rows × n_scenarios × 8 bytes, multiplied by max_concurrency. Size the pair to sit under 75% of available RAM, clamp chunk_rows to an operational band (100k–2M rows), and then keep the chosen value stable and recorded so runs reconcile cleanly.

What makes an async batch run examiner-ready?

A recorded base seed, a pinned configuration (concurrency and chunk size), a per-chunk input/output hash, and checkpoints that let any figure be regenerated bit-for-bit. Together these satisfy the reproducibility and independent-validation expectations of VM-20 Section 7, OSFI E-23 Principle 4, and SR 11-7.

Related Guides

Implementing Asyncio for High-Volume Actuarial Batch Jobs: the minimal async pattern, line by line
Pandas & NumPy for Actuarial Data Pipelines: the vectorized projection kernel these batches invoke
Stochastic Scenario Generation Frameworks: seeded, correlated scenario sets for CTE reserves
Schema Validation with Pydantic & Great Expectations: the ingestion contract that feeds validated chunks
Dynamic Threshold Tuning for Assumption Drift: how the metrics this executor emits are monitored
