Skip to content

ADR-015: Snapshot Serialization and Schema Versioning

Date: 2026-03-18 Status: Accepted Decision maker: Nikolay Petrov


Context

abicheck dump produces a JSON snapshot file (.abi.json) that captures the complete ABI surface of a library. These snapshots serve multiple purposes:

  • Offline comparison: abicheck compare old.abi.json new.abi.json without needing the original binaries or headers
  • Baseline storage: Check snapshots into version control as ABI baselines
  • Cross-mode comparison: A DWARF-derived snapshot can be compared against a castxml-derived snapshot (ADR-003)
  • CI caching: Generate once, compare many times

The snapshot format is a user-facing contract. Changes to the format can break stored baselines and downstream tooling.


Decision

1. AbiSnapshot as the canonical interchange model

All pipeline stages — dumper, checker, reporter — operate on the same AbiSnapshot dataclass. Serialization converts this dataclass to/from JSON.

@dataclass
class AbiSnapshot:
    library: str
    version: str
    functions: list[Function]
    variables: list[Variable]
    types: list[RecordType]
    enums: list[EnumType]
    typedefs: dict[str, str]        # default: {}
    constants: dict[str, str]       # default: {} (populated from header #defines)
    elf: ElfMetadata | None
    pe: PeMetadata | None
    macho: MachoMetadata | None
    dwarf: DwarfMetadata | None
    dwarf_advanced: AdvancedDwarfMetadata | None
    platform: str | None         # "elf" | "pe" | "macho"
    language_profile: str | None # "c" | "cpp" | "sycl"
    elf_only_mode: bool
    dependency_info: DependencyInfo | None

2. Integer schema versioning

SCHEMA_VERSION: int = 3

Version history:

Version Change PR
1 Initial format (no schema_version field)
2 schema_version field added PR #89
3 pe and macho metadata fields added (multi-format support)

Integer versioning was chosen over semver because:

  • Snapshot format changes are always backward-incompatible (new fields change the meaning of existing data)
  • There is no concept of "minor" or "patch" format changes — either the schema is compatible or it isn't
  • Monotonic integers are simpler to compare (if version < 3: migrate(...))

3. Backward compatibility rules

Reading old snapshots: Snapshots without a schema_version field are treated as v1. The deserializer handles missing fields by using dataclass defaults (empty lists, None values).

Reading future snapshots: If schema_version > SCHEMA_VERSION, emit a warning suggesting the user upgrade abicheck. The deserializer attempts to read the snapshot anyway — forward compatibility is best-effort.

Writing: Always writes the current SCHEMA_VERSION. There is no option to write in an older format.

4. Serialization mechanics

Serialization (snapshot_to_dict()): 1. dataclasses.asdict() converts the snapshot tree to a plain dict 2. _sets_to_lists() recursively converts sets to sorted lists (JSON has no set type) 3. Enum values are converted to their string representation 4. Internal cache fields (_func_by_mangled, _var_by_mangled, _type_by_name) are reset to None before serialization 5. schema_version is embedded at the top level

Deserialization (snapshot_from_dict()): 1. Inspect schema_version (default to 1 if absent) 2. Reconstruct typed objects: Function, Variable, RecordType, EnumType, etc. 3. Reconstruct enum instances (SymbolBinding, SymbolType, Visibility, etc.) from string values 4. Platform-specific metadata reconstructed via _elf_from_dict(), _pe_from_dict(), _macho_from_dict(), _dwarf_from_dict(), _dwarf_advanced_from_dict()

5. JSON determinism

To ensure reproducible snapshots (important for diffing baselines in version control):

  • Sets are converted to sorted lists
  • Dict keys are naturally ordered by json.dumps(sort_keys=True)
  • Floating-point values are avoided in the schema

6. Cross-mode snapshot equivalence

A snapshot produced from DWARF data (--dwarf-only) and a snapshot produced from castxml headers produce the same JSON schema. The checker.compare() function treats them identically. This enables:

# Generate snapshots from different sources
abicheck dump lib.so --dwarf-only > dwarf.abi.json
abicheck dump lib.so -H include/  > ast.abi.json

# Cross-compare works
abicheck compare dwarf.abi.json ast.abi.json

Fields that only one source can populate differ in their empty representation based on their type:

  • Optional fields (T | None, e.g., elf, pe, macho, dwarf_advanced): null in JSON when the source doesn't provide them
  • Collection fields (dict / list, e.g., constants, typedefs, functions): empty {} or [] when the source doesn't populate them

For example, a DWARF-only snapshot has constants: {} (empty dict — no header parsing to extract #define values), pe: null (wrong platform), and dwarf_advanced populated with DWARF-specific data. Consumers should handle both null and empty-collection cases.


Consequences

Positive

  • Offline comparison without original binaries or headers
  • Baselines can be checked into version control
  • Cross-mode comparison (DWARF vs castxml) works transparently
  • Deterministic JSON enables meaningful diffs of snapshot files
  • Simple integer versioning avoids semver complexity

Negative

  • Schema version bumps break stored baselines (users must regenerate)
  • Forward compatibility is best-effort — new fields may be silently ignored
  • dataclasses.asdict() with post-processing is slower than custom serialization (acceptable for file sizes in practice)
  • No compression — snapshots for large libraries can be several MB

References

  • abicheck/serialization.pySCHEMA_VERSION, snapshot_to_dict(), snapshot_from_dict()
  • abicheck/model.pyAbiSnapshot dataclass
  • ADR-003 — Data source architecture (DWARF vs castxml snapshot equivalence)