ArangoDB Server Compaction Options (MMFiles)

ArangoDB first writes documents to the WAL (write-ahead log). Once they have been sealed in the WAL file, the collector may copy them into a per-collection journal file.

Once journal files fill up, they’re sealed to become data files.

A single collection may therefore have documents in the WAL files, in its journal file, and in an arbitrary number of data files.

If a collection is loaded, each of these files is opened (and thus uses a file handle) and is mmap’ed. Since file handles and memory-mapped files are scarce resources, their number should be kept low.

When you update or remove documents that live in data files (or did so while the file was still the journal file), the old documents are marked as ‘dead’ with a deletion marker.

Over time the number of dead documents may rise. They keep occupying the resources mentioned above, and the disk space they use should be given back to the system. Thus several data files can be combined into one, omitting the dead documents.

Combining several of these data files into one is called compaction. The compaction process reads the alive documents from the original data files and writes them into a new data file.

Once that is done, the memory mappings of the old data files are released and the files are erased.
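The merge step described above can be pictured with a small toy model. This is only an illustrative sketch, not ArangoDB's implementation: the `Marker` class and the `compact` function are hypothetical names, and data files are modeled as plain in-memory lists rather than memory-mapped files.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    """Hypothetical stand-in for one document entry in a data file."""
    key: str
    payload: str
    dead: bool = False


def compact(data_files):
    """Merge several data files (oldest first) into one, omitting dead documents."""
    merged = []
    for data_file in data_files:
        for marker in data_file:
            if not marker.dead:
                merged.append(marker)
    return merged


# Two old data files: "a" was updated (old version is dead), "c" was removed.
old_files = [
    [Marker("a", "v1", dead=True), Marker("b", "v1")],
    [Marker("a", "v2"), Marker("c", "v1", dead=True)],
]
new_file = compact(old_files)
print([m.key for m in new_file])  # ['b', 'a']
```

In this model the two source lists would be discarded after the merge, which corresponds to releasing the mappings and erasing the old files.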

Since compaction locks the collection and also uses I/O resources, the conditions under which the system runs compaction jobs, and how many of them it runs, are carefully configurable.

ArangoDB spawns one compactor thread per database. The settings below vary in scope.

The activity control parameters alter the behavior in terms of scan / execution frequency of the compaction.

Sleep interval between two compaction runs (in seconds):

Scope: Database.

Minimum sleep time between two compaction runs (in seconds): --compaction.min-interval

After a compaction has actually been executed for a collection, we wait for this time before compacting that collection again. This gives any user load that may have piled up in the meantime a chance to be worked off.

Scope: collection.
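The per-collection waiting rule can be sketched as follows. This is a minimal illustration under assumptions, not the server's code: the `CompactionThrottle` class and its methods are hypothetical, and timestamps are passed in explicitly (in seconds) to keep the example deterministic.

```python
class CompactionThrottle:
    """Hypothetical helper that enforces a minimum pause per collection."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds, as in --compaction.min-interval
        self.last_run = {}                # collection name -> time of last compaction

    def may_compact(self, collection, now):
        """True if the collection was never compacted or the pause has elapsed."""
        last = self.last_run.get(collection)
        return last is None or now - last >= self.min_interval

    def record_run(self, collection, now):
        """Remember that a compaction was actually executed at time `now`."""
        self.last_run[collection] = now


throttle = CompactionThrottle(min_interval=10)  # 10 s is an arbitrary example value
throttle.record_run("users", now=100)
print(throttle.may_compact("users", now=105))   # False: only 5 s have passed
print(throttle.may_compact("users", now=110))   # True: the pause has elapsed
print(throttle.may_compact("orders", now=105))  # True: never compacted so far
```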

These parameters control which data files are taken into account for a compaction run. You can specify several criteria, each of which may be sufficient on its own.

The scan over the data files belonging to one collection is executed from oldest data file to newest; if files qualify for a compaction they may be merged with newer files (containing younger documents).

Scope: Collection level, some are influenced by collection settings.

Minimal file size threshold original data files have to be below for a compaction: --compaction.min-small-data-file-size

This threshold controls the total size below which a data file is always taken into account for compaction.

Minimum unused count of documents in a datafile:

Data files will often contain dead documents. This parameter specifies the highest acceptable count of dead documents; once it is exceeded, the data file qualifies for compaction.

How many bytes of the source data file are allowed to be unused at most: --compaction.dead-size-threshold

How many percent of the source data file should be unused at least: --compaction.dead-size-percent-threshold

Since the size of documents may vary, this threshold works on the share of the file size taken up by dead documents. Thus, if you have many huge dead documents, this threshold kicks in earlier.

To give an example with numbers: if the data file contains 800 kbytes of alive and 400 kbytes of dead documents, the share of the dead documents is 400 / (800 + 400) ≈ 33 %.

If this value is higher than the specified threshold, the data file will be compacted.
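The percentage check can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `dead_share` and `exceeds_percent_threshold` are hypothetical helper names, and the threshold values in the usage lines are arbitrary examples, not ArangoDB defaults.

```python
def dead_share(alive_bytes, dead_bytes):
    """Fraction of a data file's size that is occupied by dead documents."""
    total = alive_bytes + dead_bytes
    return dead_bytes / total if total else 0.0


def exceeds_percent_threshold(alive_bytes, dead_bytes, threshold):
    """True if the dead share is higher than the threshold (given as a fraction)."""
    return dead_share(alive_bytes, dead_bytes) > threshold


# The example from the text: 800 kbytes alive, 400 kbytes dead.
share = dead_share(800 * 1024, 400 * 1024)
print(f"{share:.1%}")  # 33.3%

# Arbitrary example thresholds, expressed as fractions.
print(exceeds_percent_threshold(800 * 1024, 400 * 1024, 0.10))  # True
print(exceeds_percent_threshold(800 * 1024, 400 * 1024, 0.50))  # False
```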

Once data files of a collection qualify for a compaction run, these parameters control how many data files are merged into one. A single source data file may also be compacted into one smaller target data file.

Scope: Collection level, some are influenced by collection settings.

Maximum number of files to merge to one file: --compaction.dest-max-files

How many data files (at most) we may merge into one resulting data file during one compaction run.
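One plausible reading of the oldest-to-newest scan combined with this file limit is sketched below. It is an assumption-laden illustration rather than the actual selection algorithm: `pick_batch` is a hypothetical function, each file is reduced to a name plus a precomputed "qualifies" flag, and the limit of 3 is an arbitrary example value.

```python
def pick_batch(files, max_files):
    """Select files for one compaction run.

    `files` is a list of (name, qualifies) pairs ordered oldest to newest.
    The batch starts at the first qualifying file and then takes the newer
    files that follow it, up to `max_files` entries in total.
    """
    batch = []
    for name, qualifies in files:
        if not batch and not qualifies:
            continue  # skip older files until one qualifies
        batch.append(name)
        if len(batch) == max_files:
            break
    return batch


files = [
    ("df-1", False),
    ("df-2", True),
    ("df-3", False),
    ("df-4", True),
    ("df-5", False),
]
print(pick_batch(files, max_files=3))  # ['df-2', 'df-3', 'df-4']
```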

How large the resulting file may be in comparison to the collection’s database.maximal-journal-size setting:

In ArangoDB you can configure a default journal file size globally and override it on a per-collection level. This value controls the size of compacted data files relative to the configured journal file size of the collection in question.

A factor of 3 means that the compacted file may be at most 3 times as large as the collection’s maximum journal file size.

In addition to the factor above, an absolute maximum file size in bytes may be specified. It overrules all previous parameters.
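How the factor and the absolute limit interact can be sketched numerically. The snippet assumes that "overrules" means the absolute limit acts as a hard upper bound on the factor-based size; that reading, the function name `max_result_file_size`, and all sizes used are illustrative assumptions, not values taken from ArangoDB.

```python
MIB = 1024 * 1024


def max_result_file_size(journal_size, factor, absolute_max):
    """Largest allowed size of a compacted file, in bytes.

    Assumes the absolute limit caps the factor-based limit.
    """
    return min(factor * journal_size, absolute_max)


# Example: a 32 MiB journal size and a factor of 3 allow up to 96 MiB ...
print(max_result_file_size(32 * MIB, 3, 128 * MIB) // MIB)  # 96
# ... unless the absolute limit is lower, in which case it wins.
print(max_result_file_size(32 * MIB, 3, 64 * MIB) // MIB)   # 64
```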