Detection
If the database instance goes into read-only mode, the following foreground operations will return the error status on all subsequent calls:
- DB::Put, DB::Delete, DB::SingleDelete, DB::DeleteRange, DB::Merge
- DB::IngestExternalFile
- DB::CompactFiles
- DB::Flush
The severity of the returned status is one of the following:
- Status::Severity::kSoftError - Errors of this severity do not prevent writes to the DB, but they do mean that the DB is in a degraded mode. Background compactions and flushes may not be able to run in a timely manner.
- Status::Severity::kHardError - The DB is in read-only mode, but it can be transitioned back to read-write mode once the cause of the error has been addressed.
- Status::Severity::kFatalError - The DB is in read-only mode. The only way to recover is to close the DB, remedy the underlying cause of the error, and then re-open the DB.
- Status::Severity::kUnrecoverableError - This is the highest severity and indicates corruption in the database. It may be possible to close and re-open the DB, but the contents of the database may no longer be correct.
In addition to the above, a notification callback, EventListener::OnBackgroundError, will be called as soon as the background error is encountered.
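As an illustration of how an application might react to the severity levels listed above, here is a minimal, self-contained sketch. It does not use the RocksDB headers: the Severity enum is a local stand-in that mirrors the names in Status::Severity, and the Action enum and ActionFor function are hypothetical names introduced only for this example.

```cpp
// Local stand-in mirroring the severity names described above. A real
// application would read the severity from the returned rocksdb::Status.
enum class Severity {
  kNoError,
  kSoftError,
  kHardError,
  kFatalError,
  kUnrecoverableError,
};

// Hypothetical application-level reactions, one per severity level.
enum class Action {
  kContinue,           // no error: nothing to do
  kContinueDegraded,   // writes still work; watch flush/compaction backlog
  kFixCauseAndResume,  // read-only until the cause is fixed and the DB resumed
  kCloseAndReopen,     // read-only; close, fix the cause, then re-open
  kVerifyContents,     // possible corruption; contents may be incorrect
};

// Maps a severity to the reaction suggested by the descriptions above.
Action ActionFor(Severity severity) {
  switch (severity) {
    case Severity::kNoError:
      return Action::kContinue;
    case Severity::kSoftError:
      return Action::kContinueDegraded;
    case Severity::kHardError:
      return Action::kFixCauseAndResume;
    case Severity::kFatalError:
      return Action::kCloseAndReopen;
    case Severity::kUnrecoverableError:
      return Action::kVerifyContents;
  }
  // Defensive default for out-of-range values: assume the worst case.
  return Action::kVerifyContents;
}
```

The same switch could be placed inside an OnBackgroundError handler or around the status returned by a failed write, depending on where the application prefers to centralize its error policy.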
Recovery
- Call DB::Resume() to manually resume the DB and put it in read-write mode. This function will flush memtables for all the column families, clear the error, purge any obsolete files, and restart background flush and compaction operations.
- Automatic recovery from background errors. This is done by polling the system to ensure the underlying error condition is resolved, and then following the same steps as DB::Resume() to restore write capability. The notification callbacks EventListener::OnErrorRecoveryBegin and EventListener::OnErrorRecoveryCompleted are called at the start and end of the recovery process, respectively, to inform the user of the status. The retry behavior can be controlled by setting max_bgerror_resume_count and bgerror_resume_retry_interval in DBOptions.
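The poll-and-retry behavior described above can be pictured with a simplified model. The sketch below is not RocksDB's implementation and does not use its headers: RetryPolicy, RecoveryHooks, and TryAutoRecover are hypothetical names, the two policy fields only stand in for max_bgerror_resume_count and bgerror_resume_retry_interval, and expressing the interval in microseconds is an assumption of this example.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-ins for the two DBOptions settings named above.
struct RetryPolicy {
  int max_resume_count;                      // models max_bgerror_resume_count
  std::chrono::microseconds retry_interval;  // models bgerror_resume_retry_interval
};

// Hypothetical hooks playing the role of the begin/completed notifications.
struct RecoveryHooks {
  std::function<void()> on_begin;
  std::function<void(bool recovered)> on_completed;
};

// Polls `condition_resolved` up to policy.max_resume_count times, sleeping
// between attempts. Once the condition has cleared, it attempts `resume`
// (the stand-in for the DB::Resume() steps) and stops at the first success.
// Returns true if write capability was restored.
bool TryAutoRecover(const RetryPolicy& policy,
                    const std::function<bool()>& condition_resolved,
                    const std::function<bool()>& resume,
                    const RecoveryHooks& hooks) {
  if (hooks.on_begin) hooks.on_begin();
  bool recovered = false;
  for (int attempt = 0; attempt < policy.max_resume_count; ++attempt) {
    if (condition_resolved() && resume()) {
      recovered = true;
      break;
    }
    // Wait before the next poll, but not after the final attempt.
    if (attempt + 1 < policy.max_resume_count) {
      std::this_thread::sleep_for(policy.retry_interval);
    }
  }
  if (hooks.on_completed) hooks.on_completed(recovered);
  return recovered;
}
```

In this model, a larger retry count or a longer interval gives a transient condition more time to clear before the attempt is abandoned, at the cost of the DB staying in its error state for longer.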
Auto Recovery Situations
At present, automatic recovery is supported in the following scenarios:
- ENOSPC error from the filesystem
- IO errors reported by the FileSystem as retryable (typically transient errors such as network outages). When the WAL is not in use, the database will continue to buffer writes in the memtable (i.e., the database remains in read-write mode). Writes may eventually stall once memtables accumulate.
- Errors during WAL sync. Recovery is done only if 2PC is not in use.