FSDSS-536 (May 2026)
2. Background & Scope
- System Affected: Real‑Time Transaction Auditing Service (RT‑TAS) – micro‑service responsible for persisting every financial transaction event to the immutable audit store (Cassandra + S3 archive).
- Environment: Production cluster (K8s‑1.28, 6‑node Kafka 3.3, Cassandra 4.1).
- Change History Prior to Incident:
5. Root‑Cause Analysis (RCA)
- A Schema Registry update introduced a new Avro field that required a corresponding change in the consumer's deserialization logic.
- During the rollout of RT‑TAS v3.2.5, a configuration file (application-rt-tas.yml) was inadvertently overridden by the new schema‑registry Helm chart, setting enable.auto.commit: false and auto.commit.interval.ms: 0.
- The service relied on auto‑commit for offset persistence; with it disabled, offsets were committed only when the processing loop completed successfully.
- A race condition between the consumer thread and the periodic checkpoint timer caused offsets to be lost on pod restarts. This resulted in duplicate processing and in windows during which messages had been consumed but were never marked as processed.
- There were no validation alerts for offset‑commit failures, so the problem was not detected until consumer‑lag metrics crossed their alerting threshold.
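The report does not say whether the new Avro field was declared with a default value. Under Avro's schema-resolution rules, a field in the reader schema that is absent from the writer's data takes the reader's declared default, and resolution fails if no default exists; this is one common reason a schema change forces a consumer-side deserialization change. The sketch below models only that rule on plain dictionaries. The function resolve_record and the field names (txn_id, amount, channel) are hypothetical, and real Avro libraries also handle types, unions, and aliases, which are ignored here.

```python
# Simplified model of Avro record resolution: fields present in the reader
# schema but missing from the writer's data take the reader's default, or
# resolution fails. Types, unions, and aliases are deliberately ignored.
from typing import Any, Dict, List


def resolve_record(writer_record: Dict[str, Any],
                   reader_fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project a decoded writer record onto the reader schema's field list."""
    resolved: Dict[str, Any] = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]   # field exists in both schemas
        elif "default" in field:
            resolved[name] = field["default"]      # new field: use reader default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Writer fields that the reader schema does not declare are dropped.
    return resolved


if __name__ == "__main__":
    # Hypothetical schemas: the update adds "channel" to the transaction event.
    old_event = {"txn_id": "t-1", "amount": 1250}
    with_default = [{"name": "txn_id"}, {"name": "amount"},
                    {"name": "channel", "default": "unknown"}]
    without_default = [{"name": "txn_id"}, {"name": "amount"}, {"name": "channel"}]

    print(resolve_record(old_event, with_default))
    try:
        resolve_record(old_event, without_default)
    except ValueError as err:
        print(err)
```

Run against an event written before the update, the first call fills channel with "unknown" and the second raises because the hypothetical new field has no default.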
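A pre-deploy check on the merged configuration is one way the overridden settings could have been caught before rollout. The sketch below assumes the rendered application-rt-tas.yml has already been parsed and flattened into a dictionary keyed by Kafka property name. The function validate_consumer_config and its pass/fail policy are hypothetical; enable.auto.commit and auto.commit.interval.ms are standard Kafka consumer properties, with Kafka defaults of true and 5000 ms respectively.

```python
# Hypothetical pre-deploy guard: reject a rendered consumer config that
# disables the auto-commit behavior the service is described as relying on.
from typing import Any, List, Mapping


def validate_consumer_config(config: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems found in a flattened consumer config."""
    problems: List[str] = []

    # Kafka defaults enable.auto.commit to true; parsed YAML may hold a bool or a string.
    auto_commit = str(config.get("enable.auto.commit", "true")).strip().lower()
    if auto_commit != "true":
        problems.append("enable.auto.commit is not true; offsets rely on auto-commit")

    # Kafka's default is 5000 ms; a non-positive value is treated as an override.
    interval = int(config.get("auto.commit.interval.ms", 5000))
    if interval <= 0:
        problems.append(f"auto.commit.interval.ms is {interval}; expected a positive value")

    return problems


if __name__ == "__main__":
    # The effective values described in this report.
    incident = {"enable.auto.commit": False, "auto.commit.interval.ms": 0}
    for problem in validate_consumer_config(incident):
        print(problem)
```

For the values described above, the function reports two problems; an empty dictionary, which falls back to the Kafka defaults, reports none.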
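The effect of losing uncommitted offsets on restart can be illustrated with a simplified in-memory model: a single partition, a consumer that writes each message to the audit store and commits the offset only after a whole batch completes, and a simulated pod kill partway through a batch. This does not reproduce the race with the checkpoint timer itself, only its consequence. Partition, consume, find_duplicates, and the txn-N message IDs are hypothetical names used for illustration.

```python
# Simplified model: one partition, manual commit after each full batch,
# and a simulated pod kill before the in-flight batch is committed.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    messages: List[str]
    committed_offset: int = 0  # where the consumer group resumes after a restart


def consume(partition: Partition, audit_store: List[str],
            batch_size: int, crash_after: Optional[int] = None) -> None:
    """Process from the committed offset; crash_after simulates a pod kill
    after that many audit writes in this run, before the batch is committed."""
    position = partition.committed_offset
    handled = 0
    while position < len(partition.messages):
        batch = partition.messages[position:position + batch_size]
        for msg in batch:
            audit_store.append(msg)  # side effect (audit write) happens first
            handled += 1
            if crash_after is not None and handled == crash_after:
                return  # killed: this batch's offset is never committed
        position += len(batch)
        partition.committed_offset = position  # commit only after the batch completes


def find_duplicates(audit_store: List[str]) -> List[str]:
    """Return every write that repeats an earlier one, in order."""
    seen, dupes = set(), []
    for msg in audit_store:
        if msg in seen:
            dupes.append(msg)
        seen.add(msg)
    return dupes


if __name__ == "__main__":
    partition = Partition(messages=[f"txn-{i}" for i in range(6)])
    audit_store: List[str] = []
    consume(partition, audit_store, batch_size=3, crash_after=5)  # dies mid-batch
    consume(partition, audit_store, batch_size=3)                  # restarted pod
    print(find_duplicates(audit_store))  # ['txn-3', 'txn-4']
```

With six messages, a batch size of 3, and a kill after the fifth write, the restarted consumer resumes from committed offset 3 and writes txn-3 and txn-4 a second time.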
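A commit-staleness check is one way to surface offset-commit failures directly rather than waiting for lag to cross a threshold. The sketch assumes periodic samples of each partition's consumed position and committed offset are already being collected. OffsetSample, commit_stalled, and the thresholds in the usage lines are hypothetical, not the RT‑TAS monitoring configuration.

```python
# Hypothetical alert rule: fire when the committed offset stops moving
# while the consumer keeps reading, i.e. commits are silently failing.
from dataclasses import dataclass
from typing import List


@dataclass
class OffsetSample:
    timestamp_s: float     # sample time in seconds
    consumed_offset: int   # consumer's current position in the partition
    committed_offset: int  # last offset committed for the consumer group


def commit_stalled(samples: List[OffsetSample], max_stall_s: float,
                   min_uncommitted: int) -> bool:
    """True if the committed offset has not moved for max_stall_s while at
    least min_uncommitted consumed messages remain uncommitted."""
    if len(samples) < 2:
        return False
    latest = samples[-1]
    # Walk back over the trailing run of samples with the same committed offset.
    stall_start = latest.timestamp_s
    for sample in reversed(samples):
        if sample.committed_offset != latest.committed_offset:
            break
        stall_start = sample.timestamp_s
    stalled_for = latest.timestamp_s - stall_start
    uncommitted = latest.consumed_offset - latest.committed_offset
    return stalled_for >= max_stall_s and uncommitted >= min_uncommitted


if __name__ == "__main__":
    # Consumption keeps advancing while commits are stuck at offset 10.
    stuck = [OffsetSample(0, 10, 10), OffsetSample(30, 50, 10), OffsetSample(60, 90, 10)]
    print(commit_stalled(stuck, max_stall_s=60, min_uncommitted=50))  # True
```

Requiring a minimum number of uncommitted messages keeps an idle topic, where neither offset moves, from triggering the alert.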
Prepared by the Platform Engineering Team – FSDSS
Document ID: FSDSS‑536‑REPORT‑2026‑04‑17