Frequently Asked Questions
Is PyMongo fork-safe?
PyMongo is not fork-safe. Care must be taken when using instances of MongoClient with fork(). Specifically, instances of MongoClient must not be copied from a parent process to a child process. Instead, the parent process and each child process must create their own instances of MongoClient. Instances of MongoClient copied from the parent process have a high probability of deadlock in the child process due to the inherent incompatibilities between fork(), threads, and locks described below. PyMongo will attempt to issue a warning if there is a chance of this deadlock occurring.
MongoClient spawns multiple threads to run background tasks such as monitoring connected servers. These threads share state that is protected by instances of threading.Lock, which are themselves not fork-safe. The driver is therefore subject to the same limitations as any other multithreaded code that uses threading.Lock (and mutexes in general). One of these limitations is that the locks become useless after fork(). During the fork, all locks are copied over to the child process in the same state as they were in the parent: if they were locked, the copied locks are also locked. The child created by fork() only has one thread, so any locks that were taken out by other threads in the parent will never be released in the child. The next time the child process attempts to acquire one of these locks, deadlock occurs.
For a long but interesting read about the problems of Python locks in multithreaded contexts with fork(), see http://bugs.python.org/issue6721.
How does connection pooling work in PyMongo?
Every MongoClient instance has a built-in connection pool per server in your MongoDB topology. These pools open sockets on demand to support the number of concurrent MongoDB operations that your multi-threaded application requires. There is no thread-affinity for sockets.
The size of each connection pool is capped at maxPoolSize, which defaults to 100. If there are maxPoolSize connections to a server and all are in use, the next request to that server will wait until one of the connections becomes available.
The client instance opens one additional socket per server in your MongoDB topology for monitoring the server’s state.
For example, a client connected to a 3-node replica set opens 3 monitoring sockets. It also opens as many sockets as needed to support a multi-threaded application’s concurrent operations on each server, up to maxPoolSize. With a maxPoolSize of 100, if the application only uses the primary (the default), then only the primary connection pool grows and the total connections is at most 103. If the application uses a ReadPreference to query the secondaries, their pools also grow and the total connections can reach 303.
It is possible to set the minimum number of concurrent connections to each server with minPoolSize, which defaults to 0. The connection pool will be initialized with this number of sockets. If sockets are closed due to any network errors, causing the total number of sockets (both in use and idle) to drop below the minimum, more sockets are opened until the minimum is reached.
The maximum number of milliseconds that a connection can remain idle in the pool before being removed and replaced can be set with maxIdleTimeMS, which defaults to None (no limit).
The default configuration for a MongoClient works for most applications:
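client = MongoClient(host, port)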
Create this client once for each process, and reuse it for all operations. It is a common mistake to create a new client for each request, which is very inefficient.
To support extremely high numbers of concurrent MongoDB operations within one process, increase maxPoolSize:
client = MongoClient(host, port, maxPoolSize=200)
… or make it unbounded:
client = MongoClient(host, port, maxPoolSize=None)
Once the pool reaches its maximum size, additional threads have to wait for sockets to become available. PyMongo does not limit the number of threads that can wait for sockets to become available and it is the application’s responsibility to limit the size of its thread pool to bound queuing during a load spike. Threads are allowed to wait for any length of time unless waitQueueTimeoutMS is defined:
client = MongoClient(host, port, waitQueueTimeoutMS=100)
A thread that waits more than 100ms (in this example) for a socket raises ConnectionFailure. Use this option if it is more important to bound the duration of operations during a load spike than it is to complete every operation.
When MongoClient.close() is called by any thread, all idle sockets are closed, and all sockets that are in use will be closed as they are returned to the pool.
Does PyMongo support Python 3?
PyMongo supports CPython 3.4+ and PyPy3.5+. See the Python 3 FAQ for details.
Does PyMongo support asynchronous frameworks like Gevent, asyncio, Tornado, or Twisted?
PyMongo fully supports Gevent.
To use MongoDB with asyncio or Tornado, see the Motor project.
For Twisted, see TxMongo. Its stated mission is to keep feature parity with PyMongo.
Why does PyMongo add an _id field to all of my documents?
When a document is inserted to MongoDB using insert_one(), insert_many(), or bulk_write(), and that document does not include an _id field, PyMongo automatically adds one for you, set to an instance of ObjectId. For example:
>>> my_doc = {'x': 1}
>>> collection.insert_one(my_doc)
<pymongo.results.InsertOneResult object at 0x7f3fc25bd640>
>>> my_doc
{'x': 1, '_id': ObjectId('560db337fba522189f171720')}
Users often discover this behavior when a call to insert_many() with a list of references to a single document raises BulkWriteError. Several Python idioms lead to this pitfall:
>>> doc = {}
>>> collection.insert_many(doc for _ in range(10))
Traceback (most recent call last):
...
pymongo.errors.BulkWriteError: batch op errors occurred
>>> doc
{'_id': ObjectId('560f171cfba52279f0b0da0c')}
>>> docs = [{}]
>>> collection.insert_many(docs * 10)
Traceback (most recent call last):
...
pymongo.errors.BulkWriteError: batch op errors occurred
>>> docs
[{'_id': ObjectId('560f1933fba52279f0b0da0e')}]
PyMongo adds an _id field in this manner for a few reasons:
- All MongoDB documents are required to have an _id field.
- If PyMongo were to insert a document without an _id, MongoDB would add one itself, but it would not report the value back to PyMongo.
- Copying the document to insert before adding the _id field would be prohibitively expensive for most high write volume applications.
If you don’t want PyMongo to add an _id to your documents, insert only documents that already have an _id field, added by your application.
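For example (a minimal sketch; the key value is arbitrary):
collection.insert_one({'_id': 'my-custom-key', 'x': 1})  # _id supplied by the application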
Key order in subdocuments – why does my query work in the shell but not PyMongo?
The key-value pairs in a BSON document can have any order (except that _id is always first). The mongo shell preserves key order when reading and writing data. Observe that “b” comes before “a” when we create the document and when it is displayed:
> // mongo shell.
> db.collection.insert( { "_id" : 1, "subdocument" : { "b" : 1, "a" : 1 } } )
WriteResult({ "nInserted" : 1 })
> db.collection.find()
{ "_id" : 1, "subdocument" : { "b" : 1, "a" : 1 } }
PyMongo represents BSON documents as Python dicts by default, and the order of keys in dicts is not defined. That is, a dict declared with the “a” key first is the same, to Python, as one with “b” first:
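>>> {'a': 1.0, 'b': 1.0} == {'b': 1.0, 'a': 1.0}
True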
Therefore, Python dicts are not guaranteed to show keys in the order they are stored in BSON. Here, “a” is shown before “b”:
>>> print(collection.find_one())
{u'_id': 1.0, u'subdocument': {u'a': 1.0, u'b': 1.0}}
To preserve order when reading BSON, use the SON class, which is a dict that remembers its key order. First, get a handle to the collection, configured to use SON instead of dict:
>>> from bson import CodecOptions, SON
>>> opts = CodecOptions(document_class=SON)
>>> opts
CodecOptions(document_class=<class 'bson.son.SON'>,
tz_aware=False,
uuid_representation=UuidRepresentation.PYTHON_LEGACY,
unicode_decode_error_handler='strict',
tzinfo=None, type_registry=TypeRegistry(type_codecs=[],
fallback_encoder=None))
>>> collection_son = collection.with_options(codec_options=opts)
Now, documents and subdocuments in query results are represented with SON objects:
>>> print(collection_son.find_one())
SON([(u'_id', 1.0), (u'subdocument', SON([(u'b', 1.0), (u'a', 1.0)]))])
The subdocument’s actual storage layout is now visible: “b” is before “a”.
Because a dict’s key order is not defined, you cannot predict how it will be serialized to BSON. But MongoDB considers subdocuments equal only if their keys have the same order. So if you use a dict to query on a subdocument it may not match:
>>> collection.find_one({'subdocument': {'a': 1.0, 'b': 1.0}}) is None
True
Swapping the key order in your query makes no difference:
>>> collection.find_one({'subdocument': {'b': 1.0, 'a': 1.0}}) is None
True
… because, as we saw above, Python considers the two dicts the same.
There are two solutions. First, you can match the subdocument field-by-field:
>>> collection.find_one({'subdocument.a': 1.0,
... 'subdocument.b': 1.0})
{u'_id': 1.0, u'subdocument': {u'a': 1.0, u'b': 1.0}}
The query matches any subdocument with an “a” of 1.0 and a “b” of 1.0, regardless of the order you specify them in Python or the order they are stored in BSON. Additionally, this query now matches subdocuments with additional keys besides “a” and “b”, whereas the previous query required an exact match.
The second solution is to use a SON to specify the key order:
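>>> query = SON([('subdocument', SON([('b', 1.0), ('a', 1.0)]))])
>>> collection.find_one(query)
{u'_id': 1.0, u'subdocument': {u'b': 1.0, u'a': 1.0}}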
The key order you use when you create a SON is preserved when it is serialized to BSON and used as a query. Thus you can create a subdocument that exactly matches the subdocument in the collection.
See also
MongoDB Manual entry on subdocument matching.
What does CursorNotFound cursor id not valid at server mean?
Cursors in MongoDB can timeout on the server if they’ve been open for a long time without any operations being performed on them. This can lead to a CursorNotFound exception being raised when attempting to iterate the cursor.
MongoDB doesn’t support custom timeouts for cursors, but cursor timeouts can be turned off entirely. Pass no_cursor_timeout=True to find().
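A minimal sketch (assuming a collection handle named collection; process_doc is a hypothetical per-document handler). Close the cursor explicitly, since the server will no longer time it out on its own:
cursor = collection.find({}, no_cursor_timeout=True)
try:
    for doc in cursor:
        process_doc(doc)  # hypothetical handler
finally:
    cursor.close()  # the server won't reap this cursor automatically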
How can I store decimal.Decimal instances?
PyMongo >= 3.4 supports the Decimal128 BSON type introduced in MongoDB 3.4. See Decimal128 for more information.
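For example, a minimal sketch (assuming a collection handle named collection):
from decimal import Decimal
from bson.decimal128 import Decimal128

collection.insert_one({'price': Decimal128(Decimal('9.99'))})
price = collection.find_one()['price'].to_decimal()  # back to a decimal.Decimal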
MongoDB <= 3.2 only supports IEEE 754 floating points - the same as the Python float type. The only way PyMongo could store Decimal instances to these versions of MongoDB would be to convert them to this standard, so you’d really only be storing floats anyway - we force users to do this conversion explicitly so that they are aware that it is happening.
I’m saving 9.99 but when I query my document contains 9.9900000000000002 - what’s going on here?
The database representation is 9.99 as an IEEE floating point (which is common to MongoDB and Python as well as most other modern languages). The problem is that 9.99 cannot be represented exactly with a double precision floating point - this is true in some versions of Python as well:
>>> 9.99
9.9900000000000002
The result that you get when you save 9.99 with PyMongo is exactly the same as the result you’d get saving it with the JavaScript shell or any of the other languages (and as the data you’re working with when you type 9.99 into a Python program).
Can you add attribute style access for documents?
This request has come up a number of times, but we have decided not to implement it, for a few reasons:
- It would pollute the attribute namespace for documents, so it could lead to subtle bugs / confusing errors when using a key with the same name as a dictionary method.
- The only reason we even use SON objects instead of regular dictionaries is to maintain key ordering, since the server requires this for certain operations. So we’re hesitant to needlessly complicate SON (at some point it’s hypothetically possible we might want to revert back to using dictionaries alone, without breaking backwards compatibility for everyone).
- It’s easy (and Pythonic) for new users to deal with documents, since they behave just like dictionaries. If we start changing their behavior it adds a barrier to entry for new users - another class to learn.
What is the correct way to handle time zones with PyMongo?
See Datetimes and Timezones for examples on how to handle datetime.datetime objects correctly.
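For example, one common configuration is to pass tz_aware=True (a real MongoClient option) so that PyMongo returns timezone-aware datetime.datetime instances:
client = MongoClient(tz_aware=True)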
How can I save a datetime.date instance?
PyMongo doesn’t support saving datetime.date instances, since there is no BSON type for dates without times. Rather than having the driver enforce a convention for converting datetime.date instances to datetime.datetime instances for you, any conversion should be performed in your client code.
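A minimal sketch of such a conversion in application code (assuming a collection handle named collection):
from datetime import date, datetime

d = date.today()
dt = datetime(d.year, d.month, d.day)  # midnight on that date
collection.insert_one({'when': dt})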
When I query for a document by ObjectId in my web application I get no result
It’s common in web applications to encode documents’ ObjectIds in URLs, like:
"/posts/50b3bda58a02fb9a84d8991e"
Your web framework will pass the ObjectId portion of the URL to your request handler as a string, so it must be converted to ObjectId before it is passed to find_one(). It is a common mistake to forget to do this conversion. Here’s how to do it correctly in Flask (other web frameworks are similar):
from pymongo import MongoClient
from bson.objectid import ObjectId
from flask import Flask, render_template
client = MongoClient()
app = Flask(__name__)
@app.route("/posts/<_id>")
def show_post(_id):
# NOTE!: converting _id from string to ObjectId before passing to find_one
post = client.db.posts.find_one({'_id': ObjectId(_id)})
return render_template('post.html', post=post)
if __name__ == "__main__":
app.run()
Does PyMongo work with Django?
Django is a popular Python web framework. Django includes an ORM, django.db. Currently, there’s no official MongoDB backend for Django.
Django MongoDB Engine is an unofficial MongoDB backend that supports Django aggregations, (atomic) updates, embedded objects, Map/Reduce and GridFS. It allows you to use most of Django’s built-in features, including the ORM, admin, authentication, site and session frameworks and caching.
However, it’s easy to use MongoDB (and PyMongo) from Django without using a Django backend. Certain features of Django that require django.db (admin, authentication and sessions) will not work using just MongoDB, but most of what Django provides can still be used.
One project which should make working with MongoDB and Django easier is mango. Mango is a set of MongoDB backends for Django sessions and authentication (bypassing django.db entirely).
Does PyMongo work with mod_wsgi?
Yes. See the configuration guide for PyMongo and mod_wsgi.
Does PyMongo work with PythonAnywhere?
No. PyMongo creates Python threads, which PythonAnywhere does not support.
How can I use something like Python’s json module to encode my documents to JSON?
json_util is PyMongo’s built-in, flexible tool for using Python’s json module with BSON documents and MongoDB Extended JSON. The json module won’t work out of the box with all documents from PyMongo as PyMongo supports some special types (like ObjectId and DBRef) that are not supported in JSON.
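For example, a minimal sketch using the dumps() and loads() helpers from bson.json_util:
from bson import ObjectId
from bson.json_util import dumps, loads

json_str = dumps({'_id': ObjectId()})  # encodes the ObjectId as Extended JSON
doc = loads(json_str)                  # decodes back to a dict containing an ObjectId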
python-bsonjs is a fast BSON to MongoDB Extended JSON converter built on top of libbson. python-bsonjs does not depend on PyMongo and can offer a nice performance improvement over json_util. python-bsonjs works best with PyMongo when using RawBSONDocument.
Why do I get OverflowError decoding dates stored by another language’s driver?
PyMongo decodes BSON datetime values to instances of Python’s datetime.datetime. Instances of datetime.datetime are limited to years between datetime.MINYEAR (usually 1) and datetime.MAXYEAR (usually 9999). Some MongoDB drivers (e.g. the PHP driver) can store BSON datetimes with year values far outside those supported by datetime.datetime.
There are a few ways to work around this issue. One option is to filter out documents with values outside of the range supported by datetime.datetime:
>>> from datetime import datetime
>>> coll = client.test.dates
>>> cur = coll.find({'dt': {'$gte': datetime.min, '$lte': datetime.max}})
Another option, assuming you don’t need the datetime field, is to filter out just that field:
>>> cur = coll.find({}, projection={'dt': False})
Using PyMongo with Multiprocessing
On Unix systems the multiprocessing module spawns processes using fork(). Care must be taken when using instances of MongoClient with fork(). Specifically, instances of MongoClient must not be copied from a parent process to a child process. Instead, the parent process and each child process must create their own instances of MongoClient. For example:
import multiprocessing
import pymongo

# Each process creates its own instance of MongoClient.
def func():
    db = pymongo.MongoClient().mydb
    # Do something with db.

proc = multiprocessing.Process(target=func)
proc.start()
Never do this:
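client = pymongo.MongoClient()

# Each child process attempts to copy a global MongoClient
# created in the parent process!
def func():
    db = client.mydb
    # Do something with db.

proc = multiprocessing.Process(target=func)
proc.start()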
Instances of MongoClient copied from the parent process have a high probability of deadlock in the child process due to inherent incompatibilities between fork(), threads, and locks. PyMongo will attempt to issue a warning if there is a chance of this deadlock occurring.