Arangoimport Details
Import files are expected to be UTF-8 encoded without byte order mark (BOM). Other encodings are not supported, but may not raise warnings or errors.
In case of CSV/TSV files, BOMs will become part of the first column’s name (possibly mangled), so be sure the files have none.
Arangoimport can also be used to import data into an existing edge collection. The import data must, for each edge to import, contain at least the _from and _to attributes. These indicate which other two documents the edge should connect. It is necessary that these attributes are set for all records, and point to valid document IDs in existing collections.
Example
Note: The edge collection must already exist when the import is started. Using the --create-collection flag will not work because arangoimport will always try to create a regular document collection if the target collection does not exist.
Attributes whose names start with an underscore are treated in a special way by ArangoDB:
- the optional _key attribute contains the document’s key. If specified, the value must be formally valid (e.g. must be a string and conform to the naming conventions). Additionally, the key value must be unique within the collection the import is run for.
- _from: when importing into an edge collection, this attribute contains the id of one of the documents connected by the edge. The value of _from must be a syntactically valid document id and the referred collection must exist.
- _to: when importing into an edge collection, this attribute contains the id of the other document connected by the edge. The value of _to must be a syntactically valid document id and the referred collection must exist.
- _rev: this attribute contains the revision number of a document. However, the revision numbers are managed by ArangoDB and cannot be specified on import. Thus any value in this attribute is ignored on import.
If you import values into _key, you should make sure they are valid and unique.
When importing data into an edge collection, you should make sure that all import documents can _from and _to and that their values point to existing documents.
To avoid specifying complete document ids (consisting of collection names and document keys) for _from and _to values, there are the options --from-collection-prefix and --to-collection-prefix. If specified, these values will be automatically prepended to each value in _from (or _to resp.). This allows specifying only document keys inside _from and/or _to.
Example
By default, arangoimport will try to insert all documents from the import file into the specified collection. In case the import file contains documents that are already present in the target collection (matching is done via the _key attributes), then a default arangoimport run will not import these documents and complain about unique key constraint violations.
However, arangoimport can be used to update or replace existing documents in case they already exist in the target collection. It provides the command-line option --on-duplicate to control the behavior in case a document is already present in the database.
The default value of --on-duplicate is error. This means that when the import file contains a document that is present in the target collection already, then trying to re-insert a document with the same _key value is considered an error, and the document in the database will not be modified.
Other possible values for --on-duplicate are:
replace: each document present in the import file that is also present in the target collection already will be replace by arangoimport. replace will replace the existing document entirely, resulting in a document with only the attributes specified in the import file.
The values of system attributes _id, _key, _rev, _from and _to cannot be updated or replaced in existing documents.
ignore: each document present in the import file that is also present in the target collection already will be ignored and not modified in the target collection.
When --on-duplicate is set to either update or replace, arangoimport will return the number of documents updated/replaced in the updated return value. When set to another value, the value of updated will always be zero. When --on-duplicate is set to ignore, arangoimport will return the number of ignored documents in the ignored return value. When set to another value, ignored will always be zero.
An arangoimport import run will print out the final results on the command line. It will show the
- number of documents created (created)
- number of documents updated/replaced (updated/replaced, only non-zero if --on-duplicate was set to update or replace, see below)
- number of ignored documents (only non-zero if --on-duplicate was set to ignore).
Example
For CSV and TSV imports, the total number of input file lines read will also be printed (lines read).
arangoimport will also print out details about warnings and errors that happened on the server-side (if any).
Arangoimport has an optional automatic pacing algorithm that can limit how fast data is sent to the ArangoDB servers. This pacing algorithm exists to prevent the import operation from failing due to slow responses.
Google Compute and other VM providers limit the throughput of disk devices. Similarly, other users’ processes on the shared VMs can limit the available throughput of the disk devices.
The automatic pacing algorithm adjusts the transmit block size dynamically based upon the actual throughput of the server over the last few seconds. Automatic pacing intentionally may not use the full throughput of a disk device. An unlimited (really fast) disk device might not need pacing. Raising the number of threads via the command line to any value of greater than 2 will increase the total throughput used.
Automatic pacing frees the user from adjusting the throughput used to match available resources. It is disabled by default, and can be enabled by invoking arangoimport with the parameter.
When enabling the pacing, the initial chunk size is 8MB per second. This may be too high or too low, depending on the available disk throughput of the target system. To start off with a different chunk size, one can adjust the value of the parameter.
The pacing algorithm is turned on by default up to version 3.7.10 and turned off by default in version 3.7.11 and higher.