Migrating to Amazon DocumentDB

    To migrate to Amazon DocumentDB, most customers use one of two primary tools: the AWS Database Migration Service (AWS DMS) or command line utilities like mongodump and mongorestore. As a best practice, for either option, we recommend that you create indexes in Amazon DocumentDB before beginning your migration, because doing so can reduce the overall migration time. To do this, you can use the .

    AWS Database Migration Service (AWS DMS) is a cloud service that makes it easy to migrate relational databases and non-relational databases to Amazon DocumentDB. You can use AWS DMS to migrate your data to Amazon DocumentDB from databases hosted on-premises or on EC2. With AWS DMS, you can perform one-time migrations, or you can replicate ongoing changes to keep sources and targets in sync.

    To help with the cost of migrations, you can use AWS DMS free for six months per instance when migrating to Amazon DocumentDB. For more information, see Free DMS.

    Command Line Utilities

    Common utilities for migrating data to and from Amazon DocumentDB include mongodump, mongorestore, mongoexport, and mongoimport. Typically, mongodump and mongorestore are the most efficient utilities because they dump and restore data in a binary format, which is generally faster and yields a smaller data size than a logical export. mongoexport and mongoimport are useful if you want to export and import data in a human-readable logical format like JSON or CSV, but they are generally slower than mongodump and mongorestore and yield a larger data size.

    The section below will discuss when it is best to use AWS DMS and command line utilities based on your use case and requirements.

    Discovery

    For each of your MongoDB deployments, you should identify and record two sets of data: Architecture Details and Operational Characteristics. This information will help you choose the appropriate migration approach and cluster sizing.

    Architecture Details

    • Name

      Choose a unique name for tracking this deployment.

    • Version

      Record the version of MongoDB that your deployment is running. To find the version, connect to a replica set member with the mongo shell and run the db.version() operation.

    • Type

      Record whether your deployment is a standalone mongod instance, a replica set, or a sharded cluster.

    • Members

      Record the hostnames, addresses, and ports of each cluster, replica set, or standalone member.

      For a sharded cluster, you can find shard members by connecting to a mongos host with the mongo shell and running the sh.status() operation.

      For a replica set, you can obtain the members by connecting to a replica set member with the mongo shell and running the rs.status() operation.

    • Oplog sizes

      For replica sets or sharded clusters, record the size of the oplog for each replica set member. To find a member’s oplog size, connect to the replica set member with the mongo shell and run the rs.printReplicationInfo() operation.

    • Replica set member priorities

      For replica sets or sharded clusters, record the priority for each replica set member. To find the replica set member priorities, connect to a replica set member with the mongo shell and run the rs.conf() operation. The priority is shown as the value of the priority key.

    • TLS/SSL usage

      Record whether Transport Layer Security (TLS)/Secure Sockets Layer (SSL) is used on each node for encryption in transit.
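    One way to keep the architecture details above consistent across deployments is to record them in a small structure. The following is a minimal Python sketch, not part of any AWS or MongoDB tooling. The class name, field names, and sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeploymentRecord:
    """One row of the discovery inventory for a source MongoDB deployment."""
    name: str                 # unique tracking name that you choose
    version: str              # value reported by db.version()
    deployment_type: str      # "standalone", "replica_set", or "sharded_cluster"
    members: List[str] = field(default_factory=list)               # "host:port" strings
    oplog_size_mb: Dict[str, float] = field(default_factory=dict)  # member -> oplog size
    priorities: Dict[str, float] = field(default_factory=dict)     # member -> rs.conf() priority
    tls_enabled: Dict[str, bool] = field(default_factory=dict)     # member -> TLS/SSL in use

    def missing_details(self) -> List[str]:
        """Return the members that still lack oplog, priority, or TLS details."""
        if self.deployment_type == "standalone":
            # A standalone has no oplog or priority, so only TLS usage applies.
            return [m for m in self.members if m not in self.tls_enabled]
        return [
            m for m in self.members
            if m not in self.oplog_size_mb
            or m not in self.priorities
            or m not in self.tls_enabled
        ]


# Hypothetical replica set with one member whose oplog size is not yet recorded.
record = DeploymentRecord(
    name="orders-prod",
    version="4.0.28",
    deployment_type="replica_set",
    members=["db1.example.com:27017", "db2.example.com:27017"],
    oplog_size_mb={"db1.example.com:27017": 51200.0},
    priorities={"db1.example.com:27017": 2, "db2.example.com:27017": 1},
    tls_enabled={"db1.example.com:27017": True, "db2.example.com:27017": True},
)
print(record.missing_details())
```

    A check like missing_details() makes it easy to see which members still need a visit with the mongo shell before planning continues.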

    Operational Characteristics

    • Database statistics

      For each database, record the following information:

      • Data size

      • Collection count

      To find the database statistics, connect to your database with the mongo shell and run the command .

    • Collection statistics

      For each collection, record the following information:

      • Namespace

      • Data size

      • Index count

      • Whether the collection is capped

    • Index statistics

      For each collection, record the following index information:

      • Namespace

      • ID

      • Size

      • Keys

      • TTL

      • Sparse

      • Background

      To find the index information, connect to your database with the mongo shell and run the command db.collection.getIndexes().

    • Opcounters

      This information helps you understand your current MongoDB workload patterns (read-heavy, write-heavy, or balanced). It also provides guidance on your initial Amazon DocumentDB instance selection.

      The following are the key pieces of information to collect over the monitoring period (in counts/sec):

      • Queries

      • Inserts

      • Updates

      You can obtain this information by graphing the output of the db.serverStatus() command over time. You can also use the mongostat tool to obtain instantaneous values for these statistics. However, with this option you run the risk of planning your migration on usage periods other than your peak load.

    • Network statistics

      This information helps you understand the connection and network load that your current MongoDB deployment handles. It also provides guidance on your initial Amazon DocumentDB instance selection.

      The following are the key pieces of information to collect over the monitoring period (in counts/sec):

      • Connections

      • Network bytes in

      • Network bytes out

      You can get this information by graphing the output of the db.serverStatus() command over time. You can also use the mongostat tool to obtain instantaneous values for these statistics. However, with this option you run the risk of planning your migration on usage periods other than your peak load.
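    The counters in db.serverStatus() are cumulative, so per-second rates come from the difference between two samples. The following Python sketch shows that arithmetic on two snapshots that are already loaded as dictionaries. The function names, the sample values, and the 2x threshold used to label a workload are assumptions for illustration, not AWS guidance.

```python
from typing import Dict


def per_second_rates(earlier: Dict, later: Dict, elapsed_seconds: float) -> Dict[str, float]:
    """Turn two serverStatus() snapshots into per-second rates."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for op in ("query", "insert", "update"):
        delta = later["opcounters"][op] - earlier["opcounters"][op]
        rates[f"{op}_per_sec"] = delta / elapsed_seconds
    for key in ("bytesIn", "bytesOut"):
        delta = later["network"][key] - earlier["network"][key]
        rates[f"{key}_per_sec"] = delta / elapsed_seconds
    # Connections are a gauge rather than a counter, so keep the latest value.
    rates["connections_current"] = later["connections"]["current"]
    return rates


def workload_pattern(rates: Dict[str, float], ratio: float = 2.0) -> str:
    """Label the workload; the ratio threshold is an illustrative assumption."""
    reads = rates["query_per_sec"]
    writes = rates["insert_per_sec"] + rates["update_per_sec"]
    if reads > ratio * writes:
        return "read-heavy"
    if writes > ratio * reads:
        return "write-heavy"
    return "balanced"


# Hypothetical snapshots taken 60 seconds apart.
earlier = {"opcounters": {"query": 1000, "insert": 200, "update": 100},
           "network": {"bytesIn": 0, "bytesOut": 0},
           "connections": {"current": 40}}
later = {"opcounters": {"query": 7000, "insert": 500, "update": 400},
         "network": {"bytesIn": 600000, "bytesOut": 3000000},
         "connections": {"current": 55}}

rates = per_second_rates(earlier, later, elapsed_seconds=60)
print(rates["query_per_sec"], workload_pattern(rates))
```

    Sampling repeatedly across a period that includes your peak load, rather than once, avoids the risk described above of sizing for an off-peak period. Counters reset when a server restarts, so a negative delta means the pair of samples should be discarded.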

    Planning: Amazon DocumentDB Cluster Requirements

    Successful migration requires that you carefully consider both your Amazon DocumentDB cluster’s configuration and how applications will access your cluster. Consider each of the following dimensions when determining your cluster requirements:

    • Availability

      Amazon DocumentDB provides high availability through the deployment of replica instances, which can be promoted to a primary instance in a process known as failover. By deploying replica instances to different Availability Zones, you can achieve higher levels of availability.

      The following table provides guidelines for Amazon DocumentDB deployment configurations to meet specific availability goals.

      Overall system reliability must consider all components, not just the database. For best practices and recommendations for meeting overall system reliability needs, see the AWS Well-Architected Reliability Pillar Whitepaper.

    • Performance

      Amazon DocumentDB instances allow you to read from and write to your cluster’s storage volume. Cluster instances come in a number of types, with varying amounts of memory and vCPU, which affect your cluster’s read and write performance. Using the information you gathered in the discovery phase, choose an instance type that can support your workload performance requirements. For a list of supported instance types, see .

      When choosing an instance type for your Amazon DocumentDB cluster, consider the following aspects of your workload’s performance requirements:

      • vCPUs—Architectures that require higher connection counts might benefit from instances with more vCPUs.

      • Memory—When possible, keeping your working dataset in memory provides maximum performance. A starting guideline is to reserve a third of your instance’s memory for the Amazon DocumentDB engine, leaving two-thirds for your working dataset.

      • Connections—The minimum optimal connection count is eight connections per Amazon DocumentDB instance vCPU. Although the Amazon DocumentDB instance connection limit is much higher, performance benefits of additional connections decline above eight connections per vCPU.

      • Network—Workloads with a large number of clients or connections should consider the aggregate network performance required for inserted and retrieved data. Bulk operations can make more efficient use of network resources.

      • Insert Performance—Single document inserts are generally the slowest way to insert data into Amazon DocumentDB. Bulk insert operations can be dramatically faster than single inserts.

      • Read Performance—Reads from working memory are always faster than reads returned from the storage volume. Therefore, optimizing your instance memory size to retain your working set in memory is ideal.

      Amazon DocumentDB clusters are automatically configured as replica sets. In addition to serving reads from your primary instance, you can route read-only queries to read replicas by setting the read preference in your MongoDB driver. Adding replicas lets you scale read traffic and reduces the overall load on the primary instance.

      It is possible to deploy Amazon DocumentDB replicas of different instance types in the same cluster. An example use case might be to stand up a replica with a larger instance type to serve temporary analytics traffic. If you deploy a mixed set of instance types, be sure to configure the failover priority for each instance. This helps ensure that a failover event always promotes a replica of sufficient size to handle your write load.

    • Recovery

      Amazon DocumentDB continuously backs up your data as it is written. It provides point-in-time recovery (PITR) capabilities within a configurable period of 1–35 days, known as the backup retention period. The default backup retention period is one day. Amazon DocumentDB also automatically creates daily snapshots of your storage volume, which are also retained for the configured backup retention period.

      If you want to retain snapshots beyond the backup retention period, you can also initiate manual snapshots at any time using the AWS Management Console or the AWS Command Line Interface (AWS CLI). For more information, see Backing Up and Restoring in Amazon DocumentDB.

      Consider the following as you plan your migration:

      • Decide if you require manual snapshots, and if so, at what interval.
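    The memory and connection guidelines in the Performance item above reduce to simple arithmetic. The following Python sketch applies them to a candidate instance size. The function name and the sample vCPU, memory, and working set figures are hypothetical, and the result is only a starting point for testing.

```python
def sizing_guidelines(vcpus: int, memory_gib: float, working_set_gib: float) -> dict:
    """Apply the starting guidelines: 8 connections per vCPU, two-thirds of memory for data."""
    working_set_budget = memory_gib * 2 / 3   # one third is reserved for the engine
    return {
        "target_connections": 8 * vcpus,
        "working_set_budget_gib": round(working_set_budget, 2),
        "working_set_fits": working_set_gib <= working_set_budget,
    }


# Hypothetical candidates for a 20 GiB working set.
small = sizing_guidelines(vcpus=2, memory_gib=16, working_set_gib=20)
large = sizing_guidelines(vcpus=4, memory_gib=32, working_set_gib=20)
print(small)
print(large)
```

    In this example the 16 GiB candidate leaves about 10.67 GiB for data, so the working set would spill to the storage volume, while the 32 GiB candidate keeps it in memory.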

    There are three primary approaches for migrating your data to Amazon DocumentDB.

    Note

    Although you can create indexes at any time in Amazon DocumentDB, it is faster overall to create your indexes before importing large datasets. As a best practice, we recommend that for each of the approaches below, you first create your indexes in Amazon DocumentDB before performing the migration. To do this, you can use the .

    Offline

    The offline approach uses the mongodump and mongorestore tools to migrate your data from your source MongoDB deployment to your Amazon DocumentDB cluster. The offline method is the simplest migration approach, but it also incurs the most downtime for your cluster.

    The basic process for offline migration is as follows:

    1. Quiesce writes to your MongoDB source.

    2. Dump collection data and indexes from the source MongoDB deployment.

    3. Restore indexes to the Amazon DocumentDB cluster.

    4. Restore collection data to the Amazon DocumentDB cluster.

    5. Change your application endpoint to write to the Amazon DocumentDB cluster.

    Online

    The online approach uses AWS Database Migration Service (AWS DMS). It performs a full load of data from your source MongoDB deployment to your Amazon DocumentDB cluster. It then switches to change data capture (CDC) mode to replicate changes. The online approach minimizes downtime for your cluster, but it is the slowest of the three methods.

    The basic process for online migration is as follows:

    1. Your application uses the source DB normally.

    2. Optionally, pre-create indexes in the Amazon DocumentDB cluster.

    3. Create an AWS DMS task to perform a full load, and then enable CDC from the source MongoDB deployment to the Amazon DocumentDB cluster.

    4. After the AWS DMS task has completed a full load and is replicating changes to the Amazon DocumentDB, switch the application’s endpoint to the Amazon DocumentDB cluster.

    
    Diagram: Online approach to migrating to Amazon DocumentDB

    For more information about using AWS DMS to migrate, see and the related Tutorial in the AWS Database Migration Service User Guide.

    Hybrid

    The hybrid approach uses the mongodump and mongorestore tools to migrate your data from your source MongoDB deployment to your Amazon DocumentDB cluster. It then uses AWS DMS in CDC mode to replicate changes. The hybrid approach balances migration speed and downtime, but it is the most complex of the three approaches.

    The basic process for hybrid migration is as follows:

    1. Your application uses the source MongoDB deployment normally.

    2. Dump collection data and indexes from the source MongoDB deployment.

    3. Restore indexes to the Amazon DocumentDB cluster.

    4. Restore collection data to the Amazon DocumentDB cluster.

    5. Create an AWS DMS task to enable CDC from the source MongoDB deployment to the Amazon DocumentDB cluster.

    Important

    An AWS DMS task can currently migrate only a single database. If your MongoDB source has a large number of databases, you might need to automate the migration task creation, or consider using the offline method.

    Regardless of the migration approach that you choose, it’s most efficient to pre-create indexes in your Amazon DocumentDB cluster before migrating your data. This is because Amazon DocumentDB indexes inserted data in parallel, but creating an index on existing data is a single-threaded operation.

    Because AWS DMS does not migrate indexes (only your data), there is no extra step required to avoid creating indexes a second time.
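    If you pre-create indexes yourself, one approach is to read the index definitions from the source with db.collection.getIndexes() and turn them into creation specs for the target. The following Python sketch does only the conversion step on sample data. The function name, the list of carried options, and the sample definitions are assumptions for illustration. Confirm that each index type is supported by Amazon DocumentDB before creating it.

```python
from typing import Any, Dict, List, Tuple

# Options copied from the source definition; "background" is dropped on purpose
# because the target collection is still empty when indexes are pre-created.
CARRIED_OPTIONS = ("name", "unique", "sparse", "expireAfterSeconds")

IndexSpec = Tuple[List[Tuple[str, Any]], Dict[str, Any]]


def index_specs(get_indexes_output: List[Dict[str, Any]]) -> List[IndexSpec]:
    """Convert getIndexes() documents into (keys, options) pairs."""
    specs = []
    for doc in get_indexes_output:
        if doc["name"] == "_id_":
            continue  # the default _id index is created with the collection
        keys = list(doc["key"].items())
        options = {opt: doc[opt] for opt in CARRIED_OPTIONS if opt in doc}
        specs.append((keys, options))
    return specs


# Hypothetical getIndexes() output for one collection.
source_indexes = [
    {"v": 2, "key": {"_id": 1}, "name": "_id_"},
    {"v": 2, "key": {"customerId": 1, "createdAt": -1}, "name": "cust_created", "background": True},
    {"v": 2, "key": {"expiresAt": 1}, "name": "ttl_expires", "expireAfterSeconds": 3600},
]

specs = index_specs(source_indexes)
for keys, options in specs:
    print(keys, options)
```

    With a driver such as PyMongo, each pair could then be applied with collection.create_index(keys, **options) against the empty target collection.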

    Migration Sources

    If your MongoDB source is a standalone mongod process and you want to use the online or hybrid migration approaches, first convert your standalone mongod to a replica set so that the oplog is created to use as a CDC source.

    If you are migrating from a MongoDB replica set or sharded cluster, consider creating a chained or hidden secondary for each replica set or shard to use as your migration source. Performing data dumps can force working set data out of memory and impact performance on production instances. You can reduce this risk by migrating from a node not serving production data.

    Migration Source Versions

    If your source MongoDB database version is different from the compatibility version of your destination Amazon DocumentDB cluster, you might need to take other preparation steps to ensure a successful migration. The two most common requirements are upgrading the source MongoDB installation to a version supported for migration (MongoDB version 3.0 or greater) and upgrading your application drivers to support the target Amazon DocumentDB version.

    If your migration has either of these requirements, include steps in your migration plan to perform the upgrades and to test any driver changes.

    Migration Connectivity

    You can migrate to Amazon DocumentDB from a source MongoDB deployment running in your data center or from a MongoDB deployment running on an Amazon EC2 instance. Migrating from MongoDB running on EC2 is straightforward, and only requires that you correctly configure your security groups and subnets.

    
    Diagram: Migrating to Amazon DocumentDB from an Amazon EC2 source

    Migrating from an on-premises database requires connectivity between your MongoDB deployment and your virtual private cloud (VPC). You can accomplish this through a virtual private network (VPN) connection, or by using the AWS Direct Connect service. Although you can migrate over the internet to your VPC, this connection method is the least desirable from a security standpoint.

    The following diagram illustrates a migration to Amazon DocumentDB from an on-premises source via a VPN connection.

    The following represents a migration to Amazon DocumentDB from an on-premises source using AWS Direct Connect.

    
    Diagram: Migrating to Amazon DocumentDB from an on-premises source (AWS Direct Connect)

    Online and hybrid migration approaches require the use of an AWS DMS instance, which must run on Amazon EC2 in an Amazon VPC. All approaches require a migration server to run mongodump and mongorestore. It is generally easier to run the migration server on an Amazon EC2 instance in the VPC where your Amazon DocumentDB cluster is launched because it dramatically simplifies connectivity to your Amazon DocumentDB cluster.

    The following are goals of pre-migration testing:

    • Verify that your chosen approach achieves your desired migration outcome.

    • Verify that your instance type and read preference choices meet your application performance requirements.

    • Verify your application’s behavior during failover.

    Migration Plan Testing Considerations

    Consider the following when testing your Amazon DocumentDB migration plan.

    Restoring Indexes

    By default, mongorestore creates indexes for dumped collections, but it creates them after the data is restored. It is faster overall to create indexes in Amazon DocumentDB before data is restored to the cluster. This is because the indexing operations are parallelized during the data load.

    If you choose to pre-create your indexes, you can skip the index creation step when restoring data with mongorestore by supplying the --noIndexRestore option.

    Dumping Data

    The mongodump tool is the preferred method of dumping data from your source MongoDB deployment. Depending on the resources available on your migration instance, you might be able to speed up your mongodump by increasing the number of collections dumped in parallel from the default of 4 using the --numParallelCollections option.
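    As an illustration, the following Python sketch assembles a mongodump command line with a raised --numParallelCollections value. The helper name, host, port, and output directory are placeholders, and authentication and TLS options are left out. Treat it as a starting point to adapt rather than a complete invocation.

```python
import shlex
from typing import List


def mongodump_command(host: str, port: int, out_dir: str,
                      parallel_collections: int = 4) -> List[str]:
    """Build a mongodump argument list; 4 is the tool's default parallelism."""
    if parallel_collections < 1:
        raise ValueError("parallel_collections must be at least 1")
    return [
        "mongodump",
        "--host", host,
        "--port", str(port),
        "--out", out_dir,
        f"--numParallelCollections={parallel_collections}",
    ]


# Placeholder source member and dump directory.
dump_cmd = mongodump_command("db2.example.com", 27017, "/data/dump", parallel_collections=8)
print(shlex.join(dump_cmd))
```

    The list form can be passed directly to subprocess.run(dump_cmd, check=True) on the migration server, which avoids shell quoting problems.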

    Restoring Data

    The mongorestore tool is the preferred method for restoring dumped data to your Amazon DocumentDB instance. You can improve restore performance by increasing the number of workers for each collection during restore with the --numInsertionWorkersPerCollection option. One worker per vCPU on your Amazon DocumentDB cluster primary instance is a good place to start.

    Amazon DocumentDB does not currently support the tool’s --oplogReplay option.

    By default, mongorestore skips insert errors and continues the restore process. This can occur if you are restoring unsupported data to your Amazon DocumentDB instance. For example, it can happen if you have a document that contains keys or values with null strings. If you prefer to have the mongorestore operation fail entirely if any restore error is encountered, use the --stopOnError option.
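    The restore options discussed in this section can be combined in one command. The following Python sketch builds a mongorestore argument list that starts with one insertion worker per vCPU and optionally adds --noIndexRestore and --stopOnError. The helper name, endpoint, and paths are placeholders, and authentication and TLS options are omitted.

```python
import shlex
from typing import List


def mongorestore_command(host: str, port: int, dump_dir: str, primary_vcpus: int,
                         indexes_precreated: bool = True,
                         stop_on_error: bool = False) -> List[str]:
    """Build a mongorestore argument list for an Amazon DocumentDB target."""
    cmd = [
        "mongorestore",
        "--host", host,
        "--port", str(port),
        "--dir", dump_dir,
        # Starting point from the text: one worker per vCPU on the primary.
        f"--numInsertionWorkersPerCollection={primary_vcpus}",
    ]
    if indexes_precreated:
        cmd.append("--noIndexRestore")   # indexes already exist on the target
    if stop_on_error:
        cmd.append("--stopOnError")      # fail instead of skipping insert errors
    return cmd


# Placeholder cluster endpoint and dump directory.
restore_cmd = mongorestore_command(
    "sample-cluster.example.com", 27017, "/data/dump",
    primary_vcpus=4, stop_on_error=True,
)
print(shlex.join(restore_cmd))
```

    Running a test restore with --stopOnError first is one way to surface unsupported data early, before a long restore silently skips documents.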

    Oplog Sizing

    The MongoDB operations log (oplog) is a capped collection that contains all data modifications to your database. You can view the size of the oplog and the time range it contains by running the db.printReplicationInfo() operation on a replica set or shard member.

    If you are using the online or hybrid approaches, ensure that the oplog on each replica set or shard is large enough to contain all changes made during the entire duration of the data migration process (whether via mongodump or an AWS DMS task full load), plus a reasonable buffer. For more information, see Check the Size of the Oplog in the MongoDB documentation. Determine the minimum required oplog size by recording the elapsed time taken by the first test run of your mongodump or mongorestore process or AWS DMS full load task.
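    The oplog check can be worked through as arithmetic. The following Python sketch compares the oplog time window reported by db.printReplicationInfo() with the elapsed time of a test migration plus a buffer, and scales the oplog size if the window is too short. The function names, the 50 percent buffer, and the assumption of a steady write rate are illustrative choices, not recommendations from the source.

```python
def required_oplog_hours(migration_hours: float, buffer_fraction: float = 0.5) -> float:
    """Hours of changes the oplog must hold: migration time plus a buffer."""
    return migration_hours * (1 + buffer_fraction)


def oplog_is_sufficient(oplog_window_hours: float, migration_hours: float,
                        buffer_fraction: float = 0.5) -> bool:
    """True if the current oplog window covers the migration plus the buffer."""
    return oplog_window_hours >= required_oplog_hours(migration_hours, buffer_fraction)


def required_oplog_size_mb(current_size_mb: float, current_window_hours: float,
                           needed_hours: float) -> float:
    """Scale the oplog size linearly, assuming a steady write rate."""
    return current_size_mb * needed_hours / current_window_hours


# Hypothetical figures: a 6-hour test run against a 10 GB oplog that spans 4 hours.
needed_hours = required_oplog_hours(migration_hours=6)
sufficient = oplog_is_sufficient(oplog_window_hours=4, migration_hours=6)
target_size_mb = required_oplog_size_mb(10240, 4, needed_hours)
print(needed_hours, sufficient, target_size_mb)
```

    Because write rates vary, measure the oplog window during your busiest period so that the estimate errs on the large side.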

    AWS Database Migration Service Configuration

    The AWS Database Migration Service User Guide covers the components and steps required to migrate your MongoDB source data to your Amazon DocumentDB cluster. The following is the basic process for using AWS DMS to perform an online or hybrid migration:

    To perform a migration using AWS DMS

    1. Create a MongoDB source endpoint. For more information, see Using MongoDB as a Source for AWS DMS.

    2. Create an Amazon DocumentDB target endpoint. For more information, see .

    3. Create at least one AWS DMS replication instance. For more information, see Working with an AWS DMS Replication Instance.

    4. Create at least one AWS DMS replication task. For more information, see .

      For an online migration, your migration task uses the migration type Migrate existing data and replicate ongoing changes.

      For a hybrid migration, your migration task uses the migration type Replicate data changes only. You can choose the CDC start time to align with your dump time from your mongodump operation. Because the MongoDB oplog is idempotent, changes that are applied twice do no harm, so to avoid missing changes it’s a good idea to leave a few minutes’ worth of overlap between your mongodump finish time and your CDC start time.
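    If you automate task creation, the choice of migration type and CDC start time can be captured in code. The following Python sketch builds the keyword arguments for one replication task per database. It does not call AWS. The helper name, the ARNs, the task naming scheme, and the 5-minute overlap are assumptions for illustration. The API values full-load-and-cdc and cdc correspond to the two console migration types named above.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

MIGRATION_TYPES = {"online": "full-load-and-cdc", "hybrid": "cdc"}


def replication_task_args(approach: str, database: str, source_arn: str,
                          target_arn: str, instance_arn: str,
                          dump_reference_time: Optional[datetime] = None,
                          overlap_minutes: int = 5) -> Dict[str, Any]:
    """Build arguments for one AWS DMS task that migrates a single database."""
    # Selection rule: include every collection in the one database.
    mappings = {"rules": [{
        "rule-type": "selection", "rule-id": "1", "rule-name": "1",
        "object-locator": {"schema-name": database, "table-name": "%"},
        "rule-action": "include",
    }]}
    args: Dict[str, Any] = {
        # Hypothetical naming scheme; identifiers allow letters, digits, and hyphens.
        "ReplicationTaskIdentifier": f"{approach}-{database}".replace("_", "-"),
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": MIGRATION_TYPES[approach],
        "TableMappings": json.dumps(mappings),
    }
    if approach == "hybrid":
        if dump_reference_time is None:
            raise ValueError("hybrid tasks need the mongodump reference time")
        # Start CDC slightly early; replaying overlapping oplog entries is safe.
        args["CdcStartTime"] = dump_reference_time - timedelta(minutes=overlap_minutes)
    return args


# Placeholder ARNs and a hypothetical dump time.
dump_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
hybrid_args = replication_task_args("hybrid", "orders", "arn:src", "arn:tgt", "arn:inst",
                                    dump_reference_time=dump_time)
online_args = replication_task_args("online", "orders", "arn:src", "arn:tgt", "arn:inst")
print(hybrid_args["MigrationType"], hybrid_args["CdcStartTime"].isoformat())
```

    With boto3, a dictionary like this could be passed as boto3.client("dms").create_replication_task(**hybrid_args). Looping over a list of database names yields one task per database.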

    Migrating from a Sharded Cluster

    The process for migrating data from a sharded cluster to your Amazon DocumentDB instance is essentially that of several replica set migrations in parallel. A key consideration when testing a sharded cluster migration is that some shards might be more heavily used than others. This situation leads to varying elapsed times for data migration. Ensure that you evaluate each shard’s oplog requirements when planning and testing.

    The following are some configuration issues to consider when migrating a sharded cluster:

    • Before running mongodump or starting an AWS DMS migration task, you must disable the sharded cluster balancer and wait for any in-process migrations to complete. For more information, see Disable the Balancer in the MongoDB documentation.

    • If you are using AWS DMS to replicate data, run the cleanupOrphaned command on each shard before running the migration tasks. If you don’t run this command, the tasks might fail because of duplicate document IDs. Note that this command might affect performance. For more information, see cleanupOrphaned in the MongoDB documentation.

    • If you are using the mongodump tool to dump data, you should run one mongodump process per shard. The most time-efficient approach might require multiple migration servers to maximize your dump performance.

    • If you are using AWS Database Migration Service to replicate data, you must create a source endpoint for each shard. Also run at least one migration task for each shard that you are migrating. The most time-efficient approach might require multiple replication instances to maximize your migration performance.
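    The per-shard items above amount to fanning one migration out into several. The following Python sketch turns shard entries, in the "replicaSet/host:port,host:port" form that sh.status() reports, into a per-shard plan with one dump directory, one source endpoint, and one task each. The function name, the naming scheme, and the sample hosts are hypothetical.

```python
from typing import Dict, List


def per_shard_plan(shard_docs: List[Dict[str, str]], dump_root: str) -> List[Dict]:
    """Plan one mongodump run, one DMS source endpoint, and one DMS task per shard."""
    plan = []
    for doc in shard_docs:
        shard_id = doc["_id"]
        # The host field looks like "rs1/hostA:27018,hostB:27018".
        replica_set, _, hosts = doc["host"].partition("/")
        plan.append({
            "shard": shard_id,
            "replica_set": replica_set,
            "members": hosts.split(","),
            "dump_dir": f"{dump_root}/{shard_id}",
            "dms_source_endpoint": f"src-{shard_id}",   # hypothetical identifier
            "dms_task": f"task-{shard_id}",             # hypothetical identifier
        })
    return plan


# Hypothetical two-shard cluster.
shards = [
    {"_id": "shard01", "host": "rs1/s1a.example.com:27018,s1b.example.com:27018"},
    {"_id": "shard02", "host": "rs2/s2a.example.com:27018"},
]

plan = per_shard_plan(shards, dump_root="/data/dump")
for entry in plan:
    print(entry["shard"], entry["members"], entry["dump_dir"])
```

    Keeping a separate dump directory per shard lets the dumps run on different migration servers without overwriting one another.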

    Performance Testing

    After you successfully migrate your data to your test Amazon DocumentDB cluster, execute your test workload against the cluster. Verify through Amazon CloudWatch metrics that your performance meets or exceeds your MongoDB source deployment’s current throughput.

    Verify the following key Amazon DocumentDB metrics:

    • Network throughput

    • Write throughput

    • Read throughput

    • Replica lag

    For more information, see Monitoring Amazon DocumentDB.
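    Once you have collected the metrics, the comparison against your source baseline is simple to automate. The following Python sketch flags any throughput metric that falls below the baseline and any replica lag above a limit you choose. The dictionary keys, the sample values, and the 100 ms lag limit are hypothetical and are not CloudWatch metric names.

```python
from typing import Dict, List

THROUGHPUT_METRICS = ("network_throughput", "write_throughput", "read_throughput")


def performance_shortfalls(baseline: Dict[str, float], measured: Dict[str, float],
                           max_replica_lag_ms: float) -> List[str]:
    """Return the metrics for which the test cluster misses the target."""
    problems = []
    for metric in THROUGHPUT_METRICS:
        # Throughput should meet or exceed the source deployment's figure.
        if measured[metric] < baseline[metric]:
            problems.append(metric)
    # Replica lag has no source baseline here; compare it with a chosen limit.
    if measured["replica_lag_ms"] > max_replica_lag_ms:
        problems.append("replica_lag_ms")
    return problems


# Hypothetical figures in bytes per second, plus replica lag in milliseconds.
baseline = {"network_throughput": 40e6, "write_throughput": 5e6, "read_throughput": 30e6}
measured = {"network_throughput": 52e6, "write_throughput": 4e6,
            "read_throughput": 33e6, "replica_lag_ms": 20}

shortfalls = performance_shortfalls(baseline, measured, max_replica_lag_ms=100)
print(shortfalls)
```

    In this example only write throughput misses the baseline, which would point toward revisiting the primary instance type or the use of bulk inserts.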

    Failover Testing

    Verify that your application’s behavior during an Amazon DocumentDB failover event meets your availability requirements. To initiate a manual failover of an Amazon DocumentDB cluster on the console, on the Clusters page, choose the Failover action on the Actions menu.

    You can also initiate a failover by executing the failover-db-cluster operation from the AWS CLI. For more information, see failover-db-cluster in the Amazon DocumentDB section of the AWS CLI reference.
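    To judge whether failover behavior meets your availability requirements, it helps to measure how long the application could not reach the cluster. The following Python sketch computes the longest outage from a series of timestamped probe results, for example one test write per second during the failover. The function name and the sample probe data are hypothetical.

```python
from typing import List, Tuple


def longest_outage_seconds(probes: List[Tuple[float, bool]]) -> float:
    """Longest failed stretch in time-ordered (timestamp_seconds, succeeded) probes."""
    longest = 0.0
    outage_start = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp                 # first failure of an outage
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None                      # service has recovered
    if outage_start is not None:
        # Still failing at the last probe: count the outage up to that point.
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


# Hypothetical probes, one per second, taken while a failover was triggered.
probes = [(0, True), (1, True), (2, False), (3, False), (4, False), (5, True), (6, True)]
outage = longest_outage_seconds(probes)
print(outage)
```

    The measured outage is bounded by the probe interval, so probe more often than the shortest downtime you care about.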

    See the following topics in the AWS Database Migration Service User Guide: