Advanced Resources - Snapshot Data Integrity Check - 《Longhorn v1.4.1 Documentation》


Longhorn is capable of hashing snapshot disk files and periodically checking their integrity.

Longhorn supports volume snapshots and stores the snapshot disk files on the local disk. Previously, however, snapshots had no checksums, so their data integrity could not be checked. As a result, when data was corrupted, for example by bit rot in the underlying storage, there was no way to detect the corruption and repair the replicas. With this feature, Longhorn hashes snapshot disk files and periodically checks their integrity. When a snapshot disk file in one replica is corrupted, Longhorn automatically starts the rebuilding process to fix it.
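The idea of hashing snapshot files and comparing them across replicas can be sketched as follows. This is a minimal illustration, not Longhorn's implementation: the hash algorithm (SHA-256), the majority-vote rule, and the names `file_checksum` and `find_corrupted_replicas` are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def file_checksum(path, chunk_size=1 << 20):
    """Hash a file in chunks so large snapshot files need not fit in memory.

    SHA-256 is an assumption for this sketch; the hash Longhorn uses
    is not stated in this document.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted_replicas(checksums):
    """Given {replica_name: checksum}, return replicas that disagree with the majority.

    Majority voting is a hypothetical rule chosen for illustration.
    """
    if not checksums:
        return []
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return sorted(name for name, value in checksums.items() if value != majority)
```

In this sketch, a replica whose checksum differs from the others would be the candidate for rebuilding.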

Global Settings

snapshot-data-integrity

This setting allows users to enable or disable snapshot hashing and data integrity checking. Available options are:
- disabled: Disable snapshot disk file hashing and data integrity checking.
- fast-check: Enable snapshot disk file hashing and fast data integrity checking. Longhorn only hashes a snapshot disk file if it has not been hashed yet or its modification time has changed. In this mode, filesystem-unaware corruption cannot be detected, but the impact on system performance is minimized.
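The fast-check decision described above can be sketched as a small predicate. This is an illustrative sketch only: the `HashRecord` type and the `needs_rehash` function are hypothetical names, and how Longhorn actually stores the previous hash and modification time is not covered here.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class HashRecord:
    """Hypothetical record of the last hash taken for a snapshot disk file."""
    checksum: str
    mtime_ns: int  # modification time of the file when it was hashed


def needs_rehash(path: str, record: Optional[HashRecord]) -> bool:
    """Return True if the file has never been hashed or its mtime has changed."""
    if record is None:
        return True  # not hashed yet
    return os.stat(path).st_mtime_ns != record.mtime_ns
```

Because the check relies on the modification time, corruption that leaves the file's metadata untouched is skipped, which matches the limitation noted for this mode.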
snapshot-data-integrity-immediate-check-after-snapshot-creation

snapshot-data-integrity-cronjob

A schedule defined using the unix-cron string format specifies when Longhorn checks the data integrity of snapshot disk files.
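To make the unix-cron format concrete, the sketch below checks whether a five-field cron string matches a given time. It is a simplified illustration, not the scheduler Longhorn uses: `cron_matches` is a hypothetical helper, it supports only `*`, `*/n`, single numbers, and comma lists, and it requires every field to match (standard cron treats day-of-month and day-of-week specially when both are restricted). The schedule strings shown are examples, not Longhorn defaults.

```python
from datetime import datetime


def _field_matches(field: str, value: int, lowest: int) -> bool:
    """Match one cron field; ranges such as 1-5 are not supported in this sketch."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Steps count from the lowest valid value of the field.
            if (value - lowest) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on the schedule described by `expr`."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day-of-month month day-of-week")
    minute, hour, dom, month, dow = fields
    cron_dow = (when.weekday() + 1) % 7  # cron uses 0 = Sunday
    return (
        _field_matches(minute, when.minute, 0)
        and _field_matches(hour, when.hour, 0)
        and _field_matches(dom, when.day, 1)
        and _field_matches(month, when.month, 1)
        and _field_matches(dow, cron_dow, 0)
    )
```

For example, `"0 2 * * *"` describes 02:00 every day, and `"0 3 * * 0"` describes 03:00 every Sunday; both are typical off-peak choices.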

Per-Volume Settings

Longhorn also supports the per-volume setting by configuring . The value is ignored by default, so data integrity check is determined by the global setting snapshot-data-integrity. supports ignored, disabled, and fast-check. Each volume can have its data integrity check setting customized.
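The way the per-volume value falls back to the global setting can be sketched as below. The function name `effective_integrity_mode` is hypothetical and used only for illustration; it mirrors the rule stated above rather than any Longhorn code.

```python
def effective_integrity_mode(volume_setting: str, global_setting: str) -> str:
    """Resolve the mode that applies to a volume.

    A per-volume value of "ignored" defers to the global
    snapshot-data-integrity setting; any other value overrides it.
    """
    if volume_setting == "ignored":
        return global_setting
    return volume_setting


# A volume left at the default follows the global setting,
# while an explicit per-volume value takes precedence.
follows_global = effective_integrity_mode("ignored", "fast-check")
overridden = effective_integrity_mode("disabled", "fast-check")
```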

Performance Impact

To detect data corruption, checksums of snapshot disk files must be calculated. These calculations consume storage and computation resources, so storage performance is negatively impacted. To quantify the impact, we benchmarked storage performance while checksumming disk files. Read IOPS, bandwidth, and latency are all negatively impacted.

Environment
- Host: AWS EC2 c5d.2xlarge
- CPU: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
- Memory: 16 GB
- Kubernetes: v1.24.4+rke2r1
- Disk: 200 GiB NVMe SSD as the instance store
  - 100 GiB snapshot with full random data
- Disk: 200 GiB throughput optimized HDD (st1)
  - 30 GiB snapshot with full random data

The feature helps detect data corruption in the snapshot disk files of volumes. However, the checksum calculation negatively impacts storage performance. To reduce the impact, the recommendation is:

- Schedule checksumming and checking of snapshot disk files for off-peak hours using the global setting snapshot-data-integrity-cronjob.