Migrating to Amazon DocumentDB
To migrate to Amazon DocumentDB, the two primary tools that most customers use are the AWS Database Migration Service (AWS DMS) and command line utilities like mongodump and mongorestore. As a best practice, for either of these options, we recommend that you create indexes in Amazon DocumentDB before beginning your migration, because doing so can reduce the overall migration time. To do this, you can use the .
AWS Database Migration Service (AWS DMS) is a cloud service that makes it easy to migrate relational databases and non-relational databases to Amazon DocumentDB. You can use AWS DMS to migrate your data to Amazon DocumentDB from databases hosted on-premises or on EC2. With AWS DMS, you can perform one-time migrations, or you can replicate ongoing changes to keep sources and targets in sync.
To help with the cost of migrations, you can use AWS DMS free for six months per instance when migrating to Amazon DocumentDB. For more information, see Free DMS.
Command Line Utilities
Common utilities for migrating data to and from Amazon DocumentDB include mongodump, mongorestore, mongoexport, and mongoimport. Typically, mongodump and mongorestore are the most efficient utilities because they dump and restore data from your databases in a binary format. This is generally the most performant option and yields a smaller data size compared to logical exports. mongoexport and mongoimport are useful if you want to export and import data in a logical format like JSON or CSV, because the data is human readable, but they are generally slower than mongodump/mongorestore and yield a larger data size.
The section below will discuss when it is best to use AWS DMS and command line utilities based on your use case and requirements.
Discovery
For each of your MongoDB deployments, you should identify and record two sets of data: Architecture Details and Operational Characteristics. This information will help you choose the appropriate migration approach and cluster sizing.
Architecture Details
Name
Choose a unique name for tracking this deployment.
Version
Record the version of MongoDB that your deployment is running. To find the version, connect to a replica set member with the mongo shell and run the db.version() operation.
Type
Record whether your deployment is a standalone mongo instance, a replica set, or a sharded cluster.
Members
Record the hostnames, addresses, and ports of each cluster, replica set, or standalone member.
For a clustered deployment, you can find shard members by connecting to a mongo host with the mongo shell and running the sh.status() operation.
For a replica set, you can obtain the members by connecting to a replica set member with the mongo shell and running the rs.status() operation.
Oplog sizes
For replica sets or sharded clusters, record the size of the oplog for each replica set member. To find a member’s oplog size, connect to the replica set member with the mongo shell and run the rs.printReplicationInfo() operation.
Replica set member priorities
For replica sets or sharded clusters, record the priority for each replica set member. To find the replica set member priorities, connect to a replica set member with the mongo shell and run the rs.conf() operation. The priority is shown as the value of the priority key.
TLS/SSL usage
Record whether Transport Layer Security (TLS)/Secure Sockets Layer (SSL) is used on each node for encryption in transit.
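As an illustration of recording member priorities, the following minimal Python sketch extracts a host-to-priority mapping from a document shaped like the output of rs.conf(). The function name member_priorities and the sample hostnames are hypothetical. The sketch assumes each member entry has a host field and an optional priority field, and it treats a missing priority as 1, which is the MongoDB default.

```python
def member_priorities(repl_config):
    """Return {host: priority} from a document shaped like rs.conf() output."""
    return {
        member["host"]: member.get("priority", 1)  # assumption: default priority is 1
        for member in repl_config.get("members", [])
    }


# Hypothetical sample resembling rs.conf() output for a three-member replica set.
sample_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo-a.example.internal:27017", "priority": 2},
        {"_id": 1, "host": "mongo-b.example.internal:27017"},
        {"_id": 2, "host": "mongo-c.example.internal:27017", "priority": 0, "hidden": True},
    ],
}

for host, priority in member_priorities(sample_config).items():
    print(host, priority)
```

A member with priority 0, such as the hidden member in the sample, can never become primary, which is worth noting in your discovery record.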
Operational Characteristics
Database statistics
For each database, record the following information:
Data size
Collection count
To find the database statistics, connect to your database with the mongo shell and run the command .
Collection statistics
For each collection, record the following information:
Namespace
Data size
Index count
Whether the collection is capped
Index statistics
For each collection, record the following index information:
Namespace
ID
Size
Keys
TTL
Sparse
Background
To find the index information, connect to your database with the mongo shell and run the command db.collection.getIndexes().
Opcounters
This information helps you understand your current MongoDB workload patterns (read-heavy, write-heavy, or balanced). It also provides guidance on your initial Amazon DocumentDB instance selection.
The following are the key pieces of information to collect over the monitoring period (in counts/sec):
Queries
Inserts
Updates
You can obtain this information by graphing the output of the db.serverStatus() command over time. You can also use the mongostat tool to obtain instantaneous values for these statistics. However, with this option you run the risk of planning your migration on usage periods other than your peak load.
Network statistics
This information helps you understand your current MongoDB workload patterns (read-heavy, write-heavy, or balanced). It also provides guidance on your initial Amazon DocumentDB instance selection.
The following are the key pieces of information to collect over the monitoring period (in counts/sec):
Connections
Network bytes in
Network bytes out
You can get this information by graphing the output of the db.serverStatus() command over time. You can also use the mongostat tool to obtain instantaneous values for these statistics. However, with this option you run the risk of planning your migration on usage periods other than your peak load.
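Because the opcounters and network byte values reported by db.serverStatus() are cumulative, graphing them over time means converting pairs of samples into per-second rates. The following Python sketch shows one way to do that. It assumes the samples are dictionaries with the opcounters, network, and connections sections of the serverStatus output. The function name counter_rates and the sample numbers are hypothetical.

```python
def counter_rates(earlier, later, elapsed_seconds):
    """Compute per-second rates between two db.serverStatus()-shaped samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    # Cumulative operation counters: queries, inserts, updates.
    for op in ("query", "insert", "update"):
        delta = later["opcounters"][op] - earlier["opcounters"][op]
        rates[op + "_per_sec"] = delta / elapsed_seconds
    # Cumulative network byte counters.
    for field in ("bytesIn", "bytesOut"):
        delta = later["network"][field] - earlier["network"][field]
        rates[field + "_per_sec"] = delta / elapsed_seconds
    # Connections are a point-in-time gauge, so report the latest value as-is.
    rates["connections_current"] = later["connections"]["current"]
    return rates


# Hypothetical samples taken 60 seconds apart.
earlier = {
    "opcounters": {"query": 1000, "insert": 500, "update": 200},
    "network": {"bytesIn": 1_000_000, "bytesOut": 5_000_000},
    "connections": {"current": 40},
}
later = {
    "opcounters": {"query": 4000, "insert": 1100, "update": 500},
    "network": {"bytesIn": 7_000_000, "bytesOut": 17_000_000},
    "connections": {"current": 55},
}

print(counter_rates(earlier, later, 60))
```

Collecting samples across a full business cycle, rather than a single interval, helps ensure that the recorded rates include your peak load. Note that these counters reset when the mongod process restarts, so a negative delta indicates a restart between samples.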
Planning: Amazon DocumentDB Cluster Requirements
Successful migration requires that you carefully consider both your Amazon DocumentDB cluster’s configuration and how applications will access your cluster. Consider each of the following dimensions when determining your cluster requirements:
Availability
Amazon DocumentDB provides high availability through the deployment of replica instances, which can be promoted to a primary instance in a process known as failover. By deploying replica instances to different Availability Zones, you can achieve higher levels of availability.
The following table provides guidelines for Amazon DocumentDB deployment configurations to meet specific availability goals.
Overall system reliability must consider all components, not just the database. For best practices and recommendations for meeting overall system reliability needs, see the AWS Well-Architected Reliability Pillar Whitepaper.
Performance
Amazon DocumentDB instances allow you to read from and write to your cluster’s storage volume. Cluster instances come in a number of types, with varying amounts of memory and vCPU, which affect your cluster’s read and write performance. Using the information you gathered in the discovery phase, choose an instance type that can support your workload performance requirements. For a list of supported instance types, see .
When choosing an instance type for your Amazon DocumentDB cluster, consider the following aspects of your workload’s performance requirements:
vCPUs: Architectures that require higher connection counts might benefit from instances with more vCPUs.
Memory: When possible, keeping your working dataset in memory provides maximum performance. A starting guideline is to reserve a third of your instance’s memory for the Amazon DocumentDB engine, leaving two-thirds for your working dataset.
Connections: The minimum optimal connection count is eight connections per Amazon DocumentDB instance vCPU. Although the Amazon DocumentDB instance connection limit is much higher, performance benefits of additional connections decline above eight connections per vCPU.
Network: Workloads with a large number of clients or connections should consider the aggregate network performance required for inserted and retrieved data. Bulk operations can make more efficient use of network resources.
Insert Performance: Single document inserts are generally the slowest way to insert data into Amazon DocumentDB. Bulk insert operations can be dramatically faster than single inserts.
Read Performance: Reads from working memory are always faster than reads returned from the storage volume. Therefore, optimizing your instance memory size to retain your working set in memory is ideal.
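The memory and connection guidelines in the list above reduce to simple arithmetic. The following Python sketch applies them to a candidate instance size. The function name sizing_guidelines and the example instance shape (4 vCPUs, 32 GiB) are hypothetical, and the results are starting points for testing rather than limits.

```python
def sizing_guidelines(vcpus, memory_gib):
    """Apply the starting guidelines: 8 connections per vCPU, 2/3 of memory for data."""
    return {
        # Eight connections per vCPU is the suggested starting connection count.
        "connections": 8 * vcpus,
        # Roughly one third of memory is reserved for the engine.
        "engine_reserved_gib": memory_gib / 3,
        # The remaining two thirds are available for the working dataset.
        "working_set_gib": memory_gib * 2 / 3,
    }


def working_set_fits(working_set_gib, vcpus, memory_gib):
    """Return True if the measured working set fits in the memory left for data."""
    return working_set_gib <= sizing_guidelines(vcpus, memory_gib)["working_set_gib"]


# Hypothetical candidate instance with 4 vCPUs and 32 GiB of memory.
print(sizing_guidelines(4, 32))
print(working_set_fits(18, 4, 32))
```

If the working set measured during discovery does not fit, that suggests evaluating a larger instance type before testing.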
In addition to serving reads from your primary instance, Amazon DocumentDB clusters are automatically configured as replica sets. You can then route read-only queries to read replicas by setting read preference in your MongoDB driver. You can scale read traffic by adding replicas, reducing the overall load on the primary instance.
It is possible to deploy Amazon DocumentDB replicas of different instance types in the same cluster. An example use case might be to stand up a replica with a larger instance type to serve temporary analytics traffic. If you deploy a mixed set of instance types, be sure to configure the failover priority for each instance. This helps ensure that a failover event always promotes a replica of sufficient size to handle your write load.
Recovery
Amazon DocumentDB continuously backs up your data as it is written. It provides point-in-time recovery (PITR) capabilities within a configurable period of 1–35 days, known as the backup retention period. The default backup retention period is one day. Amazon DocumentDB also automatically creates daily snapshots of your storage volume, which are also retained for the configured backup retention period.
If you want to retain snapshots beyond the backup retention period, you can also initiate manual snapshots at any time using the AWS Management Console and AWS Command Line Interface (AWS CLI). For more information, see Backing Up and Restoring in Amazon DocumentDB.
Consider the following as you plan your migration:
Decide if you require manual snapshots, and if so, at what interval.
There are three primary approaches for migrating your data to Amazon DocumentDB.
Note
Although you can create indexes at any time in Amazon DocumentDB, it is faster overall to create your indexes before importing large datasets. As a best practice, we recommend that for each of the approaches below, you first create your indexes in Amazon DocumentDB before performing the migration. To do this, you can use the .
Offline
The offline approach uses the mongodump and mongorestore tools to migrate your data from your source MongoDB deployment to your Amazon DocumentDB cluster. The offline method is the simplest migration approach, but it also incurs the most downtime for your cluster.
The basic process for offline migration is as follows:
Quiesce writes to your MongoDB source.
Dump collection data and indexes from the source MongoDB deployment.
Restore indexes to the Amazon DocumentDB cluster.
Restore collection data to the Amazon DocumentDB cluster.
Change your application endpoint to write to the Amazon DocumentDB cluster.
Online
The online approach uses AWS Database Migration Service (AWS DMS). It performs a full load of data from your source MongoDB deployment to your Amazon DocumentDB cluster. It then switches to change data capture (CDC) mode to replicate changes. The online approach minimizes downtime for your cluster, but it is the slowest of the three methods.
The basic process for online migration is as follows:
Your application uses the source DB normally.
Optionally, pre-create indexes in the Amazon DocumentDB cluster.
Create an AWS DMS task to perform a full load, and then enable CDC from the source MongoDB deployment to the Amazon DocumentDB cluster.
After the AWS DMS task has completed a full load and is replicating changes to the Amazon DocumentDB cluster, switch the application’s endpoint to the Amazon DocumentDB cluster.
For more information about using AWS DMS to migrate, see and the related Tutorial in the AWS Database Migration Service User Guide.
Hybrid
The hybrid approach uses the mongodump and mongorestore tools to migrate your data from your source MongoDB deployment to your Amazon DocumentDB cluster. It then uses AWS DMS in CDC mode to replicate changes. The hybrid approach balances migration speed and downtime, but it is the most complex of the three approaches.
The basic process for hybrid migration is as follows:
Your application uses the source MongoDB deployment normally.
Dump collection data and indexes from the source MongoDB deployment.
Restore indexes to the Amazon DocumentDB cluster.
Restore collection data to the Amazon DocumentDB cluster.
Create an AWS DMS task to enable CDC from the source MongoDB deployment to the Amazon DocumentDB cluster.
Important
An AWS DMS task can currently migrate only a single database. If your MongoDB source has a large number of databases, you might need to automate the migration task creation, or consider using the offline method.
Regardless of the migration approach that you choose, it’s most efficient to pre-create indexes in your Amazon DocumentDB cluster before migrating your data. This is because Amazon DocumentDB indexes inserted data in parallel, but creating an index on existing data is a single-threaded operation.
Because AWS DMS does not migrate indexes (only your data), there is no extra step required to avoid creating indexes a second time.
Migration Sources
If your MongoDB source is a standalone mongo process and you want to use the online or hybrid migration approaches, first convert your standalone mongo to a replica set so that the oplog is created to use as a CDC source.
If you are migrating from a MongoDB replica set or sharded cluster, consider creating a chained or hidden secondary for each replica set or shard to use as your migration source. Performing data dumps can force working set data out of memory and impact performance on production instances. You can reduce this risk by migrating from a node not serving production data.
Migration Source Versions
If your source MongoDB database version is different from the compatibility version of your destination Amazon DocumentDB cluster, you might need to take other preparation steps to ensure a successful migration. The two most common requirements encountered are the need to upgrade the source MongoDB installation to a supported version for migration (MongoDB version 3.0 or greater), and upgrading your application drivers to support the target Amazon DocumentDB version.
If your migration has either of these requirements, include steps in your migration plan to perform the upgrades and to test any driver changes.
Migration Connectivity
You can migrate to Amazon DocumentDB from a source MongoDB deployment running in your data center or from a MongoDB deployment running on an Amazon EC2 instance. Migrating from MongoDB running on EC2 is straightforward, and only requires that you correctly configure your security groups and subnets.
Migrating from an on-premises database requires connectivity between your MongoDB deployment and your virtual private cloud (VPC). You can accomplish this through a virtual private network (VPN) connection, or by using the AWS Direct Connect service. Although you can migrate over the internet to your VPC, this connection method is the least desirable from a security standpoint.
The following diagram illustrates a migration to Amazon DocumentDB from an on-premises source via a VPN connection.
The following represents a migration to Amazon DocumentDB from an on-premises source using AWS Direct Connect.
Online and hybrid migration approaches require the use of an AWS DMS instance, which must run on Amazon EC2 in an Amazon VPC. All approaches require a migration server to run mongodump and mongorestore. It is generally easier to run the migration server on an Amazon EC2 instance in the VPC where your Amazon DocumentDB cluster is launched because it dramatically simplifies connectivity to your Amazon DocumentDB cluster.
The following are goals of pre-migration testing:
Verify that your chosen approach achieves your desired migration outcome.
Verify that your instance type and read preference choices meet your application performance requirements.
Verify your application’s behavior during failover.
Migration Plan Testing Considerations
Consider the following when testing your Amazon DocumentDB migration plan.
Restoring Indexes
By default, mongorestore creates indexes for dumped collections, but it creates them after the data is restored. It is faster overall to create indexes in Amazon DocumentDB before data is restored to the cluster. This is because the indexing operations are parallelized during the data load.
If you choose to pre-create your indexes, you can skip the index creation step when restoring data with mongorestore by supplying the --noIndexRestore option.
Dumping Data
The mongodump tool is the preferred method of dumping data from your source MongoDB deployment. Depending on the resources available on your migration instance, you might be able to speed up your mongodump by increasing the number of collections dumped in parallel from the default of 4 using the --numParallelCollections option.
Restoring Data
The mongorestore tool is the preferred method for restoring dumped data to your Amazon DocumentDB instance. You can improve restore performance by increasing the number of workers for each collection during restore with the --numInsertionWorkersPerCollection option. One worker per vCPU on your Amazon DocumentDB cluster primary instance is a good place to start.
Amazon DocumentDB does not currently support the mongorestore --oplogReplay option.
By default, mongorestore skips insert errors and continues the restore process. Insert errors can occur if you are restoring unsupported data to your Amazon DocumentDB instance, for example, a document that contains keys or values with null strings. If you prefer to have the mongorestore operation fail entirely if any restore error is encountered, use the --stopOnError option.
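The dump and restore options discussed above can be combined as in the following Python sketch, which only builds and prints the command lines and does not run them. The helper names, hostnames, and directory path are hypothetical. Authentication and TLS options are omitted and would need to be added for a real deployment. The sketch assumes Python 3.8 or later for shlex.join.

```python
import shlex


def mongodump_args(host, out_dir, parallel_collections=4):
    """Build a mongodump command line that dumps several collections in parallel."""
    return [
        "mongodump",
        "--host", host,
        "--out", out_dir,
        "--numParallelCollections", str(parallel_collections),
    ]


def mongorestore_args(host, dump_dir, workers_per_collection,
                      skip_indexes=True, stop_on_error=False):
    """Build a mongorestore command line for a cluster with pre-created indexes."""
    args = [
        "mongorestore",
        "--host", host,
        "--numInsertionWorkersPerCollection", str(workers_per_collection),
    ]
    if skip_indexes:
        # Indexes were created ahead of time, so skip index creation on restore.
        args.append("--noIndexRestore")
    if stop_on_error:
        # Fail the whole restore on the first error instead of skipping it.
        args.append("--stopOnError")
    args.append(dump_dir)
    return args


# Hypothetical hosts and paths; 4 workers assumes a primary instance with 4 vCPUs.
dump = mongodump_args("mongo-source.example.internal:27017", "/data/dump", 8)
restore = mongorestore_args("docdb-cluster.example.internal:27017", "/data/dump", 4,
                            stop_on_error=True)
print(shlex.join(dump))
print(shlex.join(restore))
```

Building the argument list separately from running it makes it easy to review the exact options during a test run before pointing the tools at production data.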
Oplog Sizing
The MongoDB operations log (oplog) is a capped collection that contains all data modifications to your database. You can view the size of the oplog and the time range it contains by running the db.printReplicationInfo() operation on a replica set or shard member.
If you are using the online or hybrid approaches, ensure that the oplog on each replica set or shard is large enough to contain all changes made during the entire duration of the data migration process (whether via mongodump or an AWS DMS task full load), plus a reasonable buffer. For more information, see Check the Size of the Oplog in the MongoDB documentation. Determine the minimum required oplog size by recording the elapsed time taken by the first test run of your mongodump or mongorestore process or AWS DMS full load task.
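The oplog check described above can be expressed as a comparison between the time range that the oplog currently holds and the measured migration time plus a buffer. The following Python sketch illustrates this. The function names are hypothetical, and the 50 percent default buffer is an assumption chosen for illustration, not a recommendation from this guide.

```python
def required_oplog_hours(migration_hours, buffer_fraction=0.5):
    """Return the oplog time range needed: measured migration time plus a buffer."""
    if migration_hours < 0 or buffer_fraction < 0:
        raise ValueError("inputs must be non-negative")
    return migration_hours * (1 + buffer_fraction)


def oplog_is_sufficient(oplog_window_hours, migration_hours, buffer_fraction=0.5):
    """Compare the oplog time range reported for a member with the required range."""
    return oplog_window_hours >= required_oplog_hours(migration_hours, buffer_fraction)


# Hypothetical numbers: the first test run took 6 hours end to end.
print(required_oplog_hours(6))
print(oplog_is_sufficient(8, 6))
print(oplog_is_sufficient(12, 6))
```

Run the check for every replica set or shard, because a heavily written shard can have a much shorter oplog time range than the others for the same oplog size.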
AWS Database Migration Service Configuration
The AWS Database Migration Service User Guide covers the components and steps required to migrate your MongoDB source data to your Amazon DocumentDB cluster. The following is the basic process for using AWS DMS to perform an online or hybrid migration:
To perform a migration using AWS DMS
Create a MongoDB source endpoint. For more information, see Using MongoDB as a Source for AWS DMS.
Create an Amazon DocumentDB target endpoint. For more information, see .
Create at least one AWS DMS replication instance. For more information, see Working with an AWS DMS Replication Instance.
Create at least one AWS DMS replication task. For more information, see .
For an online migration, your migration task uses the migration type Migrate existing data and replicate ongoing changes.
For a hybrid migration, your migration task uses the migration type Replicate data changes only. You can choose the CDC start time to align with your dump time from your mongodump operation. The MongoDB oplog is idempotent, so reapplying changes that are already present in the dump is safe. To avoid missing changes, it’s a good idea to leave a few minutes’ worth of overlap between your mongodump finish time and your CDC start time.
Migrating from a Sharded Cluster
The process for migrating data from a sharded cluster to your Amazon DocumentDB instance is essentially that of several replica set migrations in parallel. A key consideration when testing a sharded cluster migration is that some shards might be more heavily used than others. This situation leads to varying elapsed times for data migration. Ensure that you evaluate each shard’s oplog requirements when planning and testing.
The following are some configuration issues to consider when migrating a sharded cluster:
Before running mongodump or starting an AWS DMS migration task, you must disable the sharded cluster balancer and wait for any in-process migrations to complete. For more information, see Disable the Balancer in the MongoDB documentation.
If you are using AWS DMS to replicate data, run the cleanupOrphaned command on each shard before running the migration tasks. If you don’t run this command, the tasks might fail because of duplicate document IDs. Note that this command might affect performance. For more information, see cleanupOrphaned in the MongoDB documentation.
If you are using the mongodump tool to dump data, you should run one mongodump process per shard. The most time-efficient approach might require multiple migration servers to maximize your dump performance.
If you are using AWS Database Migration Service to replicate data, you must create a source endpoint for each shard. Also run at least one migration task for each shard that you are migrating. The most time-efficient approach might require multiple replication instances to maximize your migration performance.
Performance Testing
After you successfully migrate your data to your test Amazon DocumentDB cluster, execute your test workload against the cluster. Verify through Amazon CloudWatch metrics that your performance meets or exceeds your MongoDB source deployment’s current throughput.
Verify the following key Amazon DocumentDB metrics:
Network throughput
Write throughput
Read throughput
Replica lag
For more information, see Monitoring Amazon DocumentDB.
Failover Testing
Verify that your application’s behavior during an Amazon DocumentDB failover event meets your availability requirements. To initiate a manual failover of an Amazon DocumentDB cluster on the console, on the Clusters page, choose the Failover action on the Actions menu.
You can also initiate a failover by executing the failover-db-cluster operation from the AWS CLI. For more information, see failover-db-cluster in the Amazon DocumentDB section of the AWS CLI reference.
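For scripted failover tests, the AWS CLI call can be assembled as in the following Python sketch, which builds and prints the command without running it. The helper name failover_command and the cluster and instance identifiers are hypothetical. The sketch assumes Python 3.8 or later for shlex.join and an AWS CLI that is already configured with credentials and a Region.

```python
import shlex


def failover_command(cluster_id, target_instance_id=None):
    """Build the AWS CLI command that forces a failover of a DocumentDB cluster."""
    args = [
        "aws", "docdb", "failover-db-cluster",
        "--db-cluster-identifier", cluster_id,
    ]
    if target_instance_id:
        # Optionally name the replica that should be promoted to primary.
        args += ["--target-db-instance-identifier", target_instance_id]
    return args


# Hypothetical identifiers for a test cluster.
print(shlex.join(failover_command("test-docdb-cluster")))
print(shlex.join(failover_command("test-docdb-cluster", "test-docdb-replica-2")))
```

Naming a target instance is useful when the cluster mixes instance types and you want to confirm how the application behaves when a specific replica is promoted.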