Why use Cholesky decomposition instead of sampling each factor independently?

Economic and behavioral drivers are cross-correlated. Independent univariate sampling ignores that dependence and artificially inflates diversification benefit, biasing the Conditional Tail Expectation reserve downward. Cholesky decomposition factors the target correlation matrix into a lower-triangular L and left-multiplies a block of independent standard normals so the output carries exactly the intended covariance structure.

Why prefer numpy.random.default_rng over np.random.seed?

default_rng returns a PCG64 Generator whose entire draw sequence is a pure function of the integer seed, with no reliance on hidden global state. That explicit, isolated seeding is what lets a stochastic reserve regenerate bit-for-bit on re-derivation, which the legacy global np.random.seed cannot reliably guarantee across a multi-worker run.

How do I generate a scenario tensor too large to fit in RAM?

Write to a numpy.memmap and generate in chunks. Draw and correlate one chunk of paths at a time and assign it straight into the memory-mapped slice, so only chunk_size paths are ever resident. Peak transient memory is roughly 2 × chunk_size × n_steps × n_factors × 8 bytes: one float64 buffer for the independent draws and one for the correlated result.

What if my correlation matrix is not positive definite?

Estimates built from unequal history or expert overlays are often not positive definite and will make scipy.linalg.cholesky raise LinAlgError. Repair the matrix to a nearby valid one by clipping negative eigenvalues to a small floor and re-normalizing the diagonal to 1.0, and record that a repair occurred so the adjustment is visible in model validation.

Generating Monte Carlo Scenarios with NumPy and SciPy

Generating Monte Carlo scenarios with NumPy and SciPy means turning a calibrated covariance matrix and a recorded integer seed into a large block of correlated economic paths that a projection engine can consume — and that an examiner can regenerate bit-for-bit. This page isolates the sampling mechanics from the surrounding subsystem: how numpy.random.default_rng and scipy.linalg.cholesky combine to map independent standard normals into a correlated multivariate draw, and how to materialize a (10_000, 600, 12) scenario tensor for a 50-year monthly run without exhausting RAM. It is the focused build guide behind the broader Stochastic Scenario Generation Frameworks, which wires this sampler into a full seeded, drift-monitored engine; here the goal is to get the correlated draw exactly right and reproducible.

The Problem in One Paragraph

The difficulty is not drawing random numbers; it is drawing the right ones, reproducibly, at filing scale. Economic and behavioral drivers (interest-rate term structures, equity returns, credit spreads, and policyholder lapse) are cross-correlated, and independent univariate sampling silently inflates diversification benefit, biasing the Conditional Tail Expectation reserve downward and drawing a finding in model validation. The industry-standard fix is Cholesky decomposition: factor the target correlation matrix into a lower-triangular L, then left-multiply a block of independent standard normals so the output carries exactly the intended covariance structure. Two constraints ride alongside the mathematics. First, determinism: the same seed and parameters must regenerate the same paths, so the PCG64 generator must be seeded explicitly and never touched by wall-clock time or thread scheduling. Second, memory: a 50-year monthly tensor across ten thousand paths and twelve factors is already about 576 MB in float64 and grows linearly with path count, so at filing scale it is written in chunks to a disk-backed array instead of being held in RAM at once. The snippet below satisfies all three requirements: correct correlation, reproducible seed, and a bounded memory footprint.

Minimal Working Example

The whole technique fits in one function. It validates the correlation matrix, factors it once, then loops over policy-path chunks — drawing standard normals, correlating them with a single einsum, applying the marginal mean and standard deviation, and writing each chunk straight into a numpy.memmap so peak memory stays flat regardless of path count.

import logging

import numpy as np
from scipy.linalg import cholesky

logger = logging.getLogger(__name__)


def generate_correlated_scenarios(
    n_paths: int,
    n_steps: int,
    n_factors: int,
    corr_matrix: np.ndarray,
    marginal_means: np.ndarray,
    marginal_stds: np.ndarray,
    seed: int,
    output_path: str = "stochastic_scenarios.dat",
    chunk_size: int = 2_000,
) -> np.memmap:
    """Generate correlated Monte Carlo scenarios via Cholesky decomposition,
    streamed to a memory-mapped array for large-scale actuarial simulation."""
    # Pre-flight validation — a malformed covariance input corrupts every path.
    if not np.allclose(corr_matrix, corr_matrix.T, atol=1e-8):
        raise ValueError("Correlation matrix must be symmetric.")
    eigenvalues = np.linalg.eigvalsh(corr_matrix)
    if not np.all(eigenvalues > 0):
        raise ValueError("Correlation matrix must be positive definite.")

    rng = np.random.default_rng(seed)          # PCG64, explicitly seeded
    L = cholesky(corr_matrix, lower=True)      # factored once, reused per chunk

    shape = (n_paths, n_steps, n_factors)
    scenarios = np.memmap(output_path, dtype=np.float64, mode="w+", shape=shape)

    for start_idx in range(0, n_paths, chunk_size):
        end_idx = min(start_idx + chunk_size, n_paths)
        current_chunk = end_idx - start_idx

        # Independent standard normals for this chunk only.
        Z = rng.standard_normal((current_chunk, n_steps, n_factors))

        # Correlate across the factor axis: Z @ L.T maps uncorrelated -> correlated.
        correlated_chunk = np.einsum("ijk,lk->ijl", Z, L)

        # Apply marginal scaling: mean + std * correlated_normal.
        scenarios[start_idx:end_idx] = (
            marginal_means[np.newaxis, np.newaxis, :]
            + correlated_chunk * marginal_stds[np.newaxis, np.newaxis, :]
        )
        logger.info("Processed paths %d-%d of %d.", start_idx, end_idx, n_paths)

    scenarios.flush()
    return scenarios

Call it with a validated correlation matrix and per-factor moments and it writes a fully correlated scenario tensor to disk, never holding more than chunk_size paths in memory at any instant.

Block-by-Block Explanation

Pre-flight validation: reject a bad covariance before it corrupts the run. Cholesky decomposition is only defined for a symmetric positive-definite matrix, so the function checks both properties before drawing a single normal. np.linalg.eigvalsh, the symmetric eigenvalue solver, is used deliberately in place of the general np.linalg.eigvals: correlation matrices are real symmetric, so eigvalsh is both faster and numerically more stable, and it returns real eigenvalues whose positivity is the exact definiteness test. A matrix that is merely positive semi-definite (a zero eigenvalue, common when two factors are perfectly collinear or the estimate is rank-deficient) is meant to fail here instead of silently producing a degenerate factor; because a true zero can be computed as a tiny positive or negative number in floating point, comparing against a small positive tolerance is the safer form of the test. In practice the covariance input should already have cleared Schema Validation with Pydantic & Great Expectations at the ingestion boundary; this check is the last-line defense inside the numerical kernel itself.
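A minimal sketch of the definiteness test in isolation, assuming a hypothetical helper name is_positive_definite and a hypothetical tolerance tol (neither appears in the function above). It shows the check accepting a valid matrix and rejecting both an indefinite matrix and a perfectly collinear one:

```python
import numpy as np


def is_positive_definite(corr: np.ndarray, tol: float = 1e-10) -> bool:
    """Hypothetical helper: symmetric and every eigenvalue above tol."""
    if not np.allclose(corr, corr.T, atol=1e-8):
        return False
    # eigvalsh assumes a symmetric input and returns real eigenvalues, ascending.
    return bool(np.linalg.eigvalsh(corr)[0] > tol)


valid = np.array([[1.0, 0.5], [0.5, 1.0]])
# Pairwise-inconsistent overlay: eigenvalues are 1.9, 1.9 and -0.8.
indefinite = np.array([[1.0, 0.9, 0.9], [0.9, 1.0, -0.9], [0.9, -0.9, 1.0]])
# Two perfectly collinear factors: eigenvalues are 0 and 2 (semi-definite).
collinear = np.array([[1.0, 1.0], [1.0, 1.0]])

for name, matrix in [("valid", valid), ("indefinite", indefinite), ("collinear", collinear)]:
    print(name, is_positive_definite(matrix))
```

The tolerance is an illustrative choice; a strict comparison against zero can pass or fail a singular matrix depending on rounding.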

Seeding and factoring: the two decisions that make the run reproducible. np.random.default_rng(seed) constructs a Generator backed by PCG64 whose entire draw sequence is a pure function of the integer seed, with no global state and no dependence on import order or time. This is the modern replacement for the legacy np.random.seed global, and it is what lets a run hash identically on re-derivation. The Cholesky factor L is computed once, outside the loop, because it is invariant across chunks; recomputing it per chunk would waste cycles and risk subtle drift if the input were ever mutated mid-run.
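The reproducibility claim can be checked directly. The sketch below uses a hypothetical helper draw and arbitrary example seeds; it confirms that two generators built from the same integer seed yield identical arrays and that the default bit generator is PCG64:

```python
import numpy as np


def draw(seed: int, shape=(4, 3)) -> np.ndarray:
    """Hypothetical helper: a fresh generator per call, no global state."""
    return np.random.default_rng(seed).standard_normal(shape)


a = draw(20240131)
b = draw(20240131)   # same seed -> same draws
c = draw(20240132)   # different seed -> different draws

same = np.array_equal(a, b)
different = not np.array_equal(a, c)
bit_generator_name = type(np.random.default_rng(0).bit_generator).__name__

print(same, different, bit_generator_name)
```

NumPy does not promise that Generator methods produce the same stream across library versions, so recording the NumPy version next to the seed is a reasonable precaution for re-derivation.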

The correlation step: one einsum across the factor axis. Each chunk of independent normals Z has shape (chunk, n_steps, n_factors). The transformation Z @ L.T, expressed as np.einsum("ijk,lk->ijl", Z, L), contracts only the factor index k, leaving the path (i) and step (j) axes untouched, so every timestep of every path is correlated in a single vectorized pass. The mathematics is standard: if a column vector z has identity covariance, then L·z has covariance L·Lᵀ, which by construction equals the target correlation matrix; Z @ L.T applies the same map to each row of Z. Keeping this as einsum instead of a Python loop is what makes the sampler fast enough for filing volumes; the broader vectorization patterns are covered in Pandas & NumPy for Actuarial Data Pipelines.
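As a sanity check on the contraction, the following sketch uses an illustrative 3-factor correlation matrix (the values are made up for the example) and confirms that the einsum matches Z @ L.T and that the sample correlation of the output approaches the target:

```python
import numpy as np
from scipy.linalg import cholesky

# Illustrative target correlation; positive definite by construction.
corr = np.array([
    [1.0, 0.6, -0.3],
    [0.6, 1.0, 0.2],
    [-0.3, 0.2, 1.0],
])
L = cholesky(corr, lower=True)

rng = np.random.default_rng(7)
Z = rng.standard_normal((2_000, 50, 3))        # (paths, steps, factors)
X = np.einsum("ijk,lk->ijl", Z, L)             # contracts the factor axis only

same_as_matmul = np.allclose(X, Z @ L.T)
factor_recovers_target = np.allclose(L @ L.T, corr)

# Pool every (path, step) pair and compare the sample correlation to the target.
empirical = np.corrcoef(X.reshape(-1, 3), rowvar=False)
max_abs_error = float(np.max(np.abs(empirical - corr)))
print(same_as_matmul, factor_recovers_target, max_abs_error)
```

With 100,000 pooled draws the sampling error in each correlation is on the order of a few thousandths, so the residual gap is statistical noise and not a defect in the transform.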

Marginal scaling and streamed write: correlation first, moments second. The correlated standard normals are shifted and scaled to each factor’s target mean and standard deviation via broadcasting over the newly inserted path and step axes. The result is assigned directly into the memmap slice, so the chunk is flushed toward disk instead of accumulated in a growing list. Because only the active chunk’s Z and correlated_chunk are ever resident, peak memory is governed by roughly 2 × chunk_size × n_steps × n_factors × 8 bytes (one float64 buffer each), a constant the operator sets, independent of the total path count. The final scenarios.flush() writes any pending changes to the backing file before the array is handed downstream.
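A downstream consumer has to reopen the file, and a raw memmap file carries no header, so the dtype and shape must travel with it. The sketch below writes a tiny array in chunks to a temporary file (the path and sizes are illustrative) and reopens it read-only:

```python
import os
import tempfile

import numpy as np

shape = (6, 4, 2)                                   # tiny stand-in for (paths, steps, factors)
path = os.path.join(tempfile.mkdtemp(), "scenarios_demo.dat")

writer = np.memmap(path, dtype=np.float64, mode="w+", shape=shape)
for start in range(0, shape[0], 2):                 # chunked write, two paths at a time
    writer[start:start + 2] = float(start)
writer.flush()                                      # push pending changes to the file
del writer                                          # release the writable mapping

# The file is raw bytes: dtype and shape must be supplied again on reopen.
reader = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
file_bytes = os.path.getsize(path)
expected_bytes = int(np.prod(shape)) * 8

print(file_bytes, expected_bytes, reader[4, 0, 0])
```

Opening with mode="r" keeps the projection engine from modifying the scenario set it is consuming.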

Edge Cases and Production Hardening

Three failure modes account for nearly every scenario-generation bug that reaches production. Each has a concrete fix.

1. A non-positive-definite correlation matrix. Correlation estimates built from unequal history windows, pairwise-deleted missing data, or expert overlays are frequently not positive definite, and scipy.linalg.cholesky will raise LinAlgError on them. Do not silently drop the offending factor. Instead, repair the matrix to a nearby valid one by clipping negative eigenvalues to a small floor and re-normalizing the diagonal, then record that a repair occurred so the adjustment is visible in review. Clip-and-rescale is a pragmatic approximation, not the exact nearest correlation matrix, so the size of the adjustment should be reviewed.

def nearest_positive_definite(corr: np.ndarray, floor: float = 1e-8) -> np.ndarray:
    """Clip negative eigenvalues and re-normalize to a valid correlation matrix."""
    vals, vecs = np.linalg.eigh(corr)          # symmetric eigendecomposition
    vals_clipped = np.clip(vals, floor, None)  # remove non-positive eigenvalues
    repaired = vecs @ np.diag(vals_clipped) @ vecs.T
    d = np.sqrt(np.diag(repaired))             # rescale so the diagonal is 1.0
    return repaired / np.outer(d, d)
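The function above repairs the matrix but does not itself record that it did so. One way to cover the "record that a repair occurred" step is sketched below, assuming a hypothetical wrapper repair_if_needed that repeats the clip-and-rescale logic so the block runs standalone, returns a flag, and logs the size of the change:

```python
import logging

import numpy as np
from scipy.linalg import cholesky

logger = logging.getLogger(__name__)


def repair_if_needed(corr: np.ndarray, floor: float = 1e-8):
    """Hypothetical wrapper: return (matrix, repaired_flag), logging any repair."""
    vals, vecs = np.linalg.eigh(corr)
    if vals.min() > floor:
        return corr, False                        # already valid: leave untouched
    clipped = np.clip(vals, floor, None)          # lift non-positive eigenvalues
    repaired = vecs @ np.diag(clipped) @ vecs.T
    d = np.sqrt(np.diag(repaired))
    repaired = repaired / np.outer(d, d)          # restore the unit diagonal
    repaired = (repaired + repaired.T) / 2        # remove rounding asymmetry
    logger.warning(
        "Correlation matrix repaired: min eigenvalue %.3e, max abs change %.3e",
        float(vals.min()), float(np.max(np.abs(repaired - corr))),
    )
    return repaired, True


# Illustrative indefinite input (eigenvalues 1.9, 1.9, -0.8).
bad = np.array([[1.0, 0.9, 0.9], [0.9, 1.0, -0.9], [0.9, -0.9, 1.0]])
fixed, was_repaired = repair_if_needed(bad)
L_fixed = cholesky(fixed, lower=True)             # now factors without LinAlgError

good = np.array([[1.0, 0.5], [0.5, 1.0]])
unchanged, good_was_repaired = repair_if_needed(good)
```

The returned flag and the logged maximum change give a reviewer something concrete to inspect; where they are persisted is left to the surrounding audit design.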

2. Seed non-determinism under parallel generation. When path generation is fanned across workers — as in Async Batch Processing for Large Models — sharing one generator makes the drawn paths depend on which worker runs first, so two runs of identical inputs diverge. The fix is numpy.random.SeedSequence, which spawns statistically independent child streams that are a pure function of the base seed and the chunk index, so a chunk’s paths are identical whether it runs first, last, or alone.

def child_generator(base_seed: int, chunk_index: int) -> np.random.Generator:
    """Independent, reproducible PCG64 stream per chunk — order-invariant."""
    child = np.random.SeedSequence(base_seed).spawn(chunk_index + 1)[chunk_index]
    return np.random.default_rng(child)
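A short check of the order-invariance claim, offered as a sketch. It builds each child directly with SeedSequence's spawn_key argument, which addresses the same child that spawn() would return at that index without spawning all the earlier ones; the name child_generator_by_key, the helper draw_chunk, and the seed 42 are illustrative:

```python
import numpy as np


def child_generator_by_key(base_seed: int, chunk_index: int) -> np.random.Generator:
    """Hypothetical variant: address the child stream by its spawn key."""
    child = np.random.SeedSequence(base_seed, spawn_key=(chunk_index,))
    return np.random.default_rng(child)


def draw_chunk(base_seed: int, chunk_index: int, shape=(2, 3)) -> np.ndarray:
    return child_generator_by_key(base_seed, chunk_index).standard_normal(shape)


# Generate the same four chunks in opposite orders, as two worker schedules might.
forward = {i: draw_chunk(42, i) for i in range(4)}
backward = {i: draw_chunk(42, i) for i in reversed(range(4))}
order_invariant = all(np.array_equal(forward[i], backward[i]) for i in range(4))

# Same stream as the spawn-based construction for the same index.
spawned_child = np.random.SeedSequence(42).spawn(4)[3]
matches_spawn = np.array_equal(
    np.random.default_rng(spawned_child).standard_normal((2, 3)), forward[3]
)
streams_differ = not np.array_equal(forward[0], forward[1])
print(order_invariant, matches_spawn, streams_differ)
```

Per-chunk streams produce different numbers from a single generator consumed sequentially, so a scenario set generated one way cannot be regenerated the other way; the choice belongs in the run manifest.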

3. Silent memory blow-up from an oversized chunk. The memmap caps resident output, but the transient Z and correlated_chunk for one chunk are ordinary in-RAM arrays. Set chunk_size too high and a single iteration can still trigger OS swapping and blow the filing window. Size it against the container (peak transient memory is roughly 2 × chunk_size × n_steps × n_factors × 8 bytes, one float64 buffer for Z and one for correlated_chunk) and profile a representative load with tracemalloc before trusting it. A (2_000, 600, 12) chunk is 115.2 MB (about 110 MiB) per buffer; scale chunk_size down, not n_steps, when memory is tight, since the step and factor axes are fixed by the valuation.
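The sizing arithmetic can be turned into a small helper. This is a sketch under stated assumptions: max_chunk_size is a hypothetical name, and n_buffers defaults to 3 to leave headroom for the temporary array that the scaling expression creates on top of Z and correlated_chunk. The tracemalloc portion measures a deliberately small trial chunk:

```python
import tracemalloc

import numpy as np


def max_chunk_size(memory_budget_bytes: int, n_steps: int, n_factors: int,
                   n_buffers: int = 3, itemsize: int = 8) -> int:
    """Hypothetical helper: largest chunk whose transient buffers fit the budget."""
    bytes_per_path = n_steps * n_factors * itemsize * n_buffers
    return max(1, memory_budget_bytes // bytes_per_path)


# One float64 buffer for a (2_000, 600, 12) chunk.
buffer_bytes = 2_000 * 600 * 12 * 8
# Largest chunk for a 1 GiB budget on the 600-step, 12-factor tensor.
chunk_for_1gib = max_chunk_size(2**30, n_steps=600, n_factors=12)

# Measure a small trial chunk; NumPy reports its array allocations to tracemalloc.
tracemalloc.start()
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 600, 12))
X = np.einsum("ijk,lk->ijl", Z, np.eye(12))      # identity factor, size check only
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(buffer_bytes, chunk_for_1gib, peak_bytes)
```

Scaling the measured peak of a trial chunk up to the intended chunk_size gives a first estimate to compare against the container limit.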

Compliance Note

Reproducibility here is a regulatory obligation, not an engineering nicety. The stochastic reserve under NAIC VM-20 Section 5 is the CTE 70 of the greatest present value of accumulated deficiency across the scenario set, and it is defensible only if an examiner can regenerate that exact set from a recorded seed and a versioned parameter manifest. OSFI’s E-23 Principle 4 and the Federal Reserve’s SR 11-7 impose the same standard: independent, reproducible validation as a condition of model use. That means persisting, alongside the reserve, the integer base seed, a SHA-256 hash of the correlation matrix, and the marginal-parameter versions into the immutable record described in Actuarial Audit Trail Architecture. With those captured, drift between a baseline scenario set and a re-run, measured with a Kolmogorov-Smirnov statistic or Wasserstein distance, tells an examiner whether a change in reserve volatility stems from a legitimate recalibration or an unintended parameter change, which is exactly the distinction a model governance committee is required to make.
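A minimal sketch of the manifest and drift check described above. The function names, manifest keys, and the version label "marginals-v1" are hypothetical; the hash canonicalizes the matrix to contiguous little-endian float64 bytes, which is one reasonable convention and not a prescribed one:

```python
import hashlib
import json

import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance


def correlation_hash(corr: np.ndarray) -> str:
    """SHA-256 over a canonical byte layout (C-contiguous, little-endian float64)."""
    canonical = np.ascontiguousarray(corr, dtype="<f8")
    return hashlib.sha256(canonical.tobytes()).hexdigest()


def build_manifest(seed: int, corr: np.ndarray, param_version: str) -> dict:
    """Hypothetical manifest layout; the keys are illustrative."""
    return {
        "base_seed": seed,
        "corr_sha256": correlation_hash(corr),
        "corr_shape": list(corr.shape),
        "marginal_param_version": param_version,
        "numpy_version": np.__version__,
    }


def drift(baseline: np.ndarray, rerun: np.ndarray) -> dict:
    """Two-sample KS statistic and Wasserstein distance for one factor's draws."""
    return {
        "ks_statistic": float(ks_2samp(baseline, rerun).statistic),
        "wasserstein": float(wasserstein_distance(baseline, rerun)),
    }


corr = np.array([[1.0, 0.5], [0.5, 1.0]])
manifest_json = json.dumps(build_manifest(12345, corr, "marginals-v1"), sort_keys=True)

baseline = np.random.default_rng(1).standard_normal(5_000)
identical = drift(baseline, baseline.copy())      # faithful re-run: no drift
shifted = drift(baseline, baseline + 0.25)        # simulated parameter change
```

An exact regeneration yields zero on both measures, so any non-zero value is a prompt to compare manifests before attributing a reserve change to recalibration.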

Related Guides

Stochastic Scenario Generation Frameworks: the full seeded, drift-monitored engine this sampler feeds
Pandas & NumPy for Actuarial Data Pipelines: the vectorization and memory-layout patterns behind the einsum correlation step
Schema Validation with Pydantic & Great Expectations: the ingestion contract that validates the covariance input before it reaches this kernel
Economic Scenario Mapping & Yield Curve Alignment: how the correlation and volatility inputs are calibrated upstream
NAIC VM-20 Compliance Frameworks: the reserve rules the CTE-70 reduction ultimately serves
