ADR 0002: Use SHA-256 for Request Fingerprinting

Status

Accepted

Date

2026-01-05

Context

The replay system needs to match incoming requests against recorded interactions. This requires a stable, deterministic fingerprinting mechanism that uniquely identifies requests based on their protocol, action, target, headers, and body content. The fingerprint must be consistent across different processes and Python sessions.

Decision

Use SHA-256 hashing with canonical JSON serialization for request fingerprinting. The canonical representation is generated by serializing the list of request fields [protocol, action, target, headers, body_hex] using json.dumps with separators=(",", ":") and sort_keys=True, which ensures determinism and prevents delimiter collisions. Field order is fixed by the list itself; sort_keys=True only orders the keys of any JSON objects (dicts) nested inside it. Header ordering is preserved (no normalization).
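The decision can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual implementation: the function name `fingerprint`, its signature, and the assumption that headers arrive as an ordered sequence of (name, value) pairs are all hypothetical.

```python
import hashlib
import json


def fingerprint(protocol, action, target, headers, body):
    """Return a 64-character hex SHA-256 digest for a request (illustrative sketch)."""
    # Assumption: headers is an ordered sequence of (name, value) pairs,
    # so their order is carried into the serialized form unchanged.
    fields = [
        protocol,
        action,
        target,
        [[name, value] for name, value in headers],
        body.hex(),  # hex-encode the raw bytes so the body is JSON-safe
    ]
    canonical = json.dumps(fields, separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


fp = fingerprint("http", "GET", "/users", [("Accept", "application/json")], b"")
print(len(fp))  # 64
```

Because the input to the hash is a JSON text with fixed separators, the same request fields should yield the same digest in any process.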

Rationale

Stability: Hash-based fingerprints are deterministic and consistent across processes
Collision resistance: SHA-256 combined with structure-preserving JSON serialization prevents accidental collisions (e.g., when fields contain the delimiter)
Standard library: Available in Python's hashlib and json without external dependencies
Canonical ordering: A fixed field order and fixed JSON separators ensure identical requests produce identical fingerprints in any environment, without normalizing request data
Binary-safe: Hex-encoded body in JSON handles arbitrary binary data safely
Compact: Fixed-length 64-character hex digest is memory-efficient
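The delimiter-collision point can be illustrated by comparing a naive delimiter-joined key with the JSON-based one. The helper names `naive_key` and `json_key` and the "|" delimiter are hypothetical, chosen only for this sketch.

```python
import hashlib
import json


def naive_key(fields):
    # Hypothetical naive scheme: join fields with a "|" delimiter.
    return hashlib.sha256("|".join(fields).encode("utf-8")).hexdigest()


def json_key(fields):
    # Structure-preserving scheme: JSON quotes each field separately.
    canonical = json.dumps(fields, separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = ["http", "GET", "/a|b"]
b = ["http", "GET|/a", "b"]

print(naive_key(a) == naive_key(b))  # True: both join to "http|GET|/a|b"
print(json_key(a) == json_key(b))    # False: the JSON texts differ
```

The two requests differ, yet the joined strings are identical, so any hash of the joined form collides. The JSON form keeps field boundaries explicit.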

Implications

Positive Implications

Fingerprints are stable across application restarts and different machines
Average-case O(1) lookup using fingerprints as dictionary keys
No false positives from hash collisions in realistic scenarios
Header order affects matching (no normalization): requests that differ only in header order receive different fingerprints
Works with any protocol (protocol-agnostic design)
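A small sketch of what these implications look like in practice, assuming a hypothetical in-memory `index` dictionary and the same illustrative `fingerprint` helper shape used for the Decision (headers as ordered (name, value) pairs):

```python
import hashlib
import json


def fingerprint(protocol, action, target, headers, body):
    # Assumed canonical form: list of fields, headers kept in given order.
    fields = [protocol, action, target, [list(h) for h in headers], body.hex()]
    canonical = json.dumps(fields, separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical in-memory index: fingerprint -> recorded response.
index = {}
recorded_headers = [("Accept", "text/plain"), ("X-Trace", "1")]
index[fingerprint("http", "GET", "/ping", recorded_headers, b"")] = "pong"

same = fingerprint("http", "GET", "/ping", recorded_headers, b"")
reordered = fingerprint("http", "GET", "/ping", recorded_headers[::-1], b"")

print(index.get(same))       # pong
print(index.get(reordered))  # None: header order changed the fingerprint
```

The second lookup misses because the headers are serialized in the order given, which is the behavior the no-normalization choice implies.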

Concerns

Hash computation costs O(n) in the size of the request (mitigation: acceptable for typical request sizes)
Changing canonical format breaks compatibility with existing cassettes (mitigation: version 0 is in-memory only)
Cryptographic hashing may be overkill for this use case (mitigation: no measurable performance impact)

Alternatives

Direct Request Comparison (`__eq__`)

Using direct object equality comparison to find matching requests.

Pros: Simple implementation, no hashing overhead
Cons: O(n) lookup time in the number of recorded interactions, cannot use requests as dictionary keys without a stable hash
Reason for rejection: Poor performance for large cassettes; O(n) scan per lookup vs average-case O(1) with a hash-based index
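For comparison, the rejected approach amounts to a linear scan like the one below. The `Request` dataclass and `find_by_equality` function are hypothetical names used only to illustrate the cost; the field names follow the list in the Decision section.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request shape. A plain (non-frozen) dataclass defines
    # __eq__ but is unhashable, so it cannot serve as a dictionary key.
    protocol: str
    action: str
    target: str
    headers: tuple
    body: bytes


def find_by_equality(interactions, incoming):
    # Linear scan: compare the incoming request with each recorded one in turn.
    for recorded, response in interactions:
        if recorded == incoming:
            return response
    return None


interactions = [
    (Request("http", "GET", f"/items/{i}", (), b""), f"response-{i}")
    for i in range(1000)
]
incoming = Request("http", "GET", "/items/999", (), b"")
print(find_by_equality(interactions, incoming))  # response-999
```

In this sketch the match is the last entry, so the scan performs 1000 comparisons before returning, whereas a fingerprint index would need a single dictionary lookup.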

MD5 Hashing

Using MD5 hash algorithm for fingerprinting.

Pros: Faster than SHA-256, sufficient for non-cryptographic use
Cons: Considered cryptographically broken, community perception of MD5 weakness could undermine trust
Reason for rejection: Reputational risk outweighs marginal performance gains

Non-cryptographic Hashes (xxHash, MurmurHash)

Using fast non-cryptographic hash algorithms like xxHash or MurmurHash.

Pros: Faster than SHA-256, designed for hash table use
Cons: Requires external dependency, not in standard library, adds maintenance burden
Reason for rejection: Performance difference is negligible for this use case; standard library preference

Tuple-based Keys

Using tuples of request fields directly as dictionary keys.

Pros: No separate fingerprint computation, direct comparison
Cons: Large memory footprint for storing full request data as keys
Reason for rejection: Memory inefficient; stable hashing provides compact keys
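The memory argument can be made concrete with a rough comparison. This is a sketch under assumptions: the 1 MB body, the header values, and the variable names are invented for illustration, and `sys.getsizeof` reports only the size of the object it is given.

```python
import hashlib
import json
import sys

body = b"\x00" * 1_000_000  # a 1 MB request body
headers = (("Content-Type", "application/octet-stream"),)

# Tuple key: holds a reference to every field, including the full body.
tuple_key = ("http", "POST", "/upload", headers, body)

# Fingerprint key: fixed-length digest, independent of body size.
canonical = json.dumps(
    ["http", "POST", "/upload", [list(h) for h in headers], body.hex()],
    separators=(",", ":"),
    sort_keys=True,
)
digest_key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(len(digest_key))  # 64
print(sys.getsizeof(tuple_key[4]) > sys.getsizeof(digest_key))  # True
```

The canonical string is itself large, but it is only needed while the digest is computed; the key that stays in the index is the 64-character digest.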

Future Direction

This decision should be revisited if:

Performance profiling shows hashing is a bottleneck (consider faster non-cryptographic hashes)
Large request bodies cause excessive memory pressure (consider streaming hash calculation without full in-memory loading)
Cassette persistence is added and backward compatibility becomes critical (use versioning scheme)
Protocol-specific fingerprinting is needed (extend with pluggable fingerprint strategies)
Matching rules need to be customized (make fingerprinting + matching strategy injectable, including hash choice)
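One possible shape for an injectable fingerprint strategy, including the hash choice, is sketched below. Nothing here is part of the current design: `make_fingerprinter`, `Cassette`, and their methods are hypothetical names, and the canonical form is the assumed one from the Decision section.

```python
import hashlib
import json
from typing import Callable, Dict, Optional, Sequence, Tuple

Headers = Sequence[Tuple[str, str]]
Fingerprinter = Callable[[str, str, str, Headers, bytes], str]


def make_fingerprinter(hash_name: str = "sha256") -> Fingerprinter:
    """Build a fingerprint function for any algorithm hashlib.new() accepts."""

    def fingerprint(protocol, action, target, headers, body):
        fields = [protocol, action, target, [list(h) for h in headers], body.hex()]
        canonical = json.dumps(fields, separators=(",", ":"), sort_keys=True)
        return hashlib.new(hash_name, canonical.encode("utf-8")).hexdigest()

    return fingerprint


class Cassette:
    """Hypothetical store that receives its fingerprint strategy at construction."""

    def __init__(self, fingerprinter: Optional[Fingerprinter] = None) -> None:
        self._fingerprint = fingerprinter or make_fingerprinter()
        self._index: Dict[str, str] = {}

    def record(self, response: str, *request) -> None:
        self._index[self._fingerprint(*request)] = response

    def replay(self, *request) -> Optional[str]:
        return self._index.get(self._fingerprint(*request))


cassette = Cassette(make_fingerprinter("sha512"))
cassette.record("pong", "http", "GET", "/ping", [], b"")
print(cassette.replay("http", "GET", "/ping", [], b""))  # pong
```

Passing the strategy in at construction keeps the default (SHA-256) unchanged while letting a caller swap the hash or the whole matching function without touching the store.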