Handling UUID Data

    Native uuid.UUID objects can also be used as part of MongoDB queries:

        document = collection.find_one({'uuid': uuid_obj})
        assert document['uuid'] == uuid_obj

    The above examples illustrate the simplest use case: one where the UUID is generated by, and used in, the same application. The situation can be significantly more complex when a MongoDB deployment contains UUIDs created by other drivers, because the Java and C# drivers have historically encoded UUIDs using a byte-order different from the one used by PyMongo. Applications that require interoperability across these drivers must specify the appropriate UuidRepresentation.

    In the following sections, we describe how drivers have historically differed in their encoding of UUIDs, and how applications can use the UuidRepresentation configuration option to maintain cross-language compatibility.

    Attention

    New applications that do not share a MongoDB deployment with any other application and that have never stored UUIDs in MongoDB should use the standard UUID representation for cross-language compatibility. See the discussion of configuring a UUID representation below for details on how to configure the UuidRepresentation.

    Legacy Handling of UUID Data

    Historically, MongoDB drivers have used different byte-orderings when serializing UUID types to BSON Binary. Consider, for instance, a UUID with the following canonical textual representation:

        00112233-4455-6677-8899-aabbccddeeff

    This UUID would historically be serialized by the Python driver as:

        00112233-4455-6677-8899-aabbccddeeff

    The same UUID would historically be serialized by the C# driver as:

        33221100-5544-7766-8899-aabbccddeeff

    Finally, the same UUID would historically be serialized by the Java driver as:

        77665544-3322-1100-ffee-ddccbbaa9988
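
    The three layouts above can be reproduced with Python's standard uuid module alone. The following is a minimal sketch for illustration, not PyMongo's implementation; the helper function names are hypothetical.

```python
import uuid


def python_legacy_bytes(u: uuid.UUID) -> bytes:
    # Big-endian order, identical to the canonical textual form.
    return u.bytes


def csharp_legacy_bytes(u: uuid.UUID) -> bytes:
    # The first three fields (4, 2 and 2 bytes) are stored little-endian;
    # the remaining 8 bytes are unchanged.
    return u.bytes_le


def java_legacy_bytes(u: uuid.UUID) -> bytes:
    # Each 8-byte half is reversed independently.
    b = u.bytes
    return b[:8][::-1] + b[8:][::-1]


u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
layouts = {
    "python": python_legacy_bytes(u),
    "csharp": csharp_legacy_bytes(u),
    "java": java_legacy_bytes(u),
}
for name, raw in layouts.items():
    # Print each layout in the same dashed form used above.
    print(name, uuid.UUID(bytes=raw))
```

    Reading the stored bytes back with uuid.UUID(bytes=...) shows what a reader that applies no reordering would see for each layout.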

    Note

    For in-depth information about the byte-order historically used by different drivers, see the Handling of Native UUID Types Specification.

    This difference in the byte-order of UUIDs encoded by different drivers can result in highly unintuitive behavior in some scenarios. We detail two such scenarios in the next sections.

    Scenario 1: Applications Share a MongoDB Deployment

    Consider the following situation:

    • Application C written in C# generates a UUID and uses it as the _id of a document that it proceeds to insert into the uuid_test collection of the example_db database. Let’s assume that the canonical textual representation of the generated UUID is:

          00112233-4455-6677-8899-aabbccddeeff

    • Application P written in Python attempts to find the document written by application C in the following manner:

          from uuid import UUID
          collection = client.example_db.uuid_test
          result = collection.find_one({'_id': UUID('00112233-4455-6677-8899-aabbccddeeff')})

      In this instance, result will never be the document that was inserted by application C in the previous step. This is because of the different byte-order used by the C# driver for representing UUIDs as BSON Binary. The following query, on the other hand, will successfully find this document:

          result = collection.find_one({'_id': UUID('33221100-5544-7766-8899-aabbccddeeff')})

    This example demonstrates how the differing byte-order used by different drivers can hamper interoperability. To work around this problem, users should configure their MongoClient with the appropriate UuidRepresentation. In this case, the client in application P can be configured to use the CSHARP_LEGACY representation to avoid the unintuitive behavior (see the discussion of configuring a UUID representation below).
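
    The mismatch can be demonstrated without a database. In the sketch below, a plain dictionary keyed by the stored bytes stands in for the uuid_test collection; this stand-in and the variable names are assumptions made for illustration only.

```python
import uuid

# Hypothetical stand-in for the collection: stored _id bytes -> document.
stored = {}

generated = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# Application C (C# legacy layout) stores the _id with its first three
# fields in little-endian order.
stored[generated.bytes_le] = {"written_by": "C"}

# Application P encodes the same UUID in big-endian order, so the
# lookup compares different bytes and finds nothing.
miss = stored.get(generated.bytes)

# The byte-swapped UUID happens to encode to the bytes that C stored.
swapped = uuid.UUID("33221100-5544-7766-8899-aabbccddeeff")
hit = stored.get(swapped.bytes)

print(miss, hit)
```

    The lookup with the byte-swapped UUID succeeds only because its big-endian bytes coincide with the little-endian bytes written by application C, which is why configuring the representation is preferable to rewriting queries by hand.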

    Scenario 2: Round-Tripping UUIDs

    In the following examples, we see how using a misconfigured UuidRepresentation can cause an application to inadvertently change the Binary subtype and, in some cases, the bytes of the field itself when round-tripping documents containing UUIDs.

    Consider the following situation:

        from bson.codec_options import CodecOptions
        from bson.binary import Binary, UuidRepresentation
        from uuid import uuid4
        # Assumes `client` is a connected MongoClient
        # Using UuidRepresentation.PYTHON_LEGACY stores a Binary subtype-3 UUID
        python_opts = CodecOptions(uuid_representation=UuidRepresentation.PYTHON_LEGACY)
        input_uuid = uuid4()
        collection = client.testdb.get_collection('test', codec_options=python_opts)
        collection.insert_one({'_id': 'foo', 'uuid': input_uuid})
        assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)})['_id'] == 'foo'
        # Retrieving this document using UuidRepresentation.STANDARD returns a native UUID
        std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
        std_collection = client.testdb.get_collection('test', codec_options=std_opts)
        doc = std_collection.find_one({'_id': 'foo'})
        assert doc['uuid'] == input_uuid
        # Round-tripping the retrieved document silently changes the Binary subtype to 4
        std_collection.replace_one({'_id': 'foo'}, doc)
        assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)}) is None
        round_tripped_doc = collection.find_one({'uuid': Binary(input_uuid.bytes, 4)})
        assert doc == round_tripped_doc

    In this example, round-tripping the document using the incorrect UuidRepresentation (STANDARD instead of PYTHON_LEGACY) changes the Binary subtype as a side effect. Note that this can also happen when the situation is reversed, i.e. when the original document is written using the STANDARD representation and then round-tripped using the PYTHON_LEGACY representation.

    In the next example, we see the consequences of incorrectly using a representation that modifies byte-order (here JAVA_LEGACY) when round-tripping documents:

        from bson.codec_options import CodecOptions
        from bson.binary import Binary, UuidRepresentation
        from uuid import uuid4
        # Using UuidRepresentation.STANDARD stores a Binary subtype-4 UUID
        std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
        input_uuid = uuid4()
        collection = client.testdb.get_collection('test', codec_options=std_opts)
        collection.insert_one({'_id': 'baz', 'uuid': input_uuid})
        assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)})['_id'] == 'baz'
        # Retrieving this document using UuidRepresentation.JAVA_LEGACY returns a native UUID
        # without modifying the UUID byte-order
        java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)
        java_collection = client.testdb.get_collection('test', codec_options=java_opts)
        doc = java_collection.find_one({'_id': 'baz'})
        assert doc['uuid'] == input_uuid
        # Round-tripping the retrieved document silently changes the Binary bytes and subtype
        java_collection.replace_one({'_id': 'baz'}, doc)
        assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)}) is None
        assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)}) is None
        round_tripped_doc = collection.find_one({'_id': 'baz'})
        assert round_tripped_doc['uuid'] == Binary(input_uuid.bytes, 3).as_uuid(UuidRepresentation.JAVA_LEGACY)

    In this case, using the incorrect UuidRepresentation (JAVA_LEGACY instead of STANDARD) changes both the bytes and the subtype as a side effect. This happens whenever a representation that manipulates byte-order (CSHARP_LEGACY or JAVA_LEGACY) is incorrectly used to round-trip UUIDs written with STANDARD. When the situation is reversed, i.e. when the original document is written using CSHARP_LEGACY or JAVA_LEGACY and then round-tripped using STANDARD, only the Binary subtype is changed.
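
    The byte-level effect of the Java round-trip can be traced with the standard library alone. The sketch below models a stored field as a (subtype, bytes) tuple and follows the PyMongo<4 decoding behavior shown in the example above; the tuple model and the helper name are assumptions for illustration, not PyMongo internals.

```python
import uuid


def java_reorder(b: bytes) -> bytes:
    # Reverse each 8-byte half, as the legacy Java layout does.
    return b[:8][::-1] + b[8:][::-1]


original = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# Written with STANDARD: subtype 4, bytes in canonical order.
stored = (4, original.bytes)

# Read with JAVA_LEGACY: the subtype 4 bytes are taken as-is,
# so the decoded UUID still equals the original.
decoded = uuid.UUID(bytes=stored[1])

# Written back with JAVA_LEGACY: subtype 3, Java byte-order.
stored = (3, java_reorder(decoded.bytes))

# A reader that takes the subtype 3 bytes verbatim now sees a different UUID.
seen_after_round_trip = uuid.UUID(bytes=stored[1])
print(stored[0], seen_after_round_trip)
```

    Both the subtype and the bytes differ from what was originally written, even though the application never changed the UUID value it held in memory.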

    Note

    Starting in PyMongo 4.0, these issues will be resolved: the STANDARD representation will decode Binary subtype 3 fields as Binary objects of subtype 3 (instead of uuid.UUID), and each of the LEGACY_* representations will decode Binary subtype 4 fields to Binary objects of subtype 4 (instead of uuid.UUID).

    Configuring a UUID Representation

    Users can work around the problems described above by configuring their applications with the appropriate UuidRepresentation. Configuring the representation modifies PyMongo’s behavior while encoding uuid.UUID objects to BSON and decoding Binary subtype 3 and 4 fields from BSON.

    Applications can set the UUID representation in one of the following ways:

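    1. At the MongoClient level using the uuidRepresentation URI option, e.g.:

          client = MongoClient("mongodb://a:27107/?uuidRepresentation=javaLegacy")

      Valid values are unspecified, standard, pythonLegacy, javaLegacy, and csharpLegacy, which select the UNSPECIFIED, STANDARD, PYTHON_LEGACY, JAVA_LEGACY, and CSHARP_LEGACY representations respectively.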

    2. At the MongoClient level using the uuidRepresentation kwarg option, e.g.:

          from bson.binary import UuidRepresentation
          client = MongoClient(uuidRepresentation=UuidRepresentation.PYTHON_LEGACY)

    3. At the Database or Collection level by supplying a suitable CodecOptions instance, e.g.:

          from bson.binary import UuidRepresentation
          from bson.codec_options import CodecOptions
          csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)
          java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)
          # Get database/collection from client with csharpLegacy UUID representation
          csharp_database = client.get_database('csharp_db', codec_options=csharp_opts)
          csharp_collection = client.testdb.get_collection('csharp_coll', codec_options=csharp_opts)
          # Get database/collection from existing database/collection with javaLegacy UUID representation
          java_database = csharp_database.with_options(codec_options=java_opts)
          java_collection = csharp_collection.with_options(codec_options=java_opts)
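
    Supported UUID Representations

    UUID Representation | Default?                | Encode uuid.UUID to                          | Decode Binary subtype 4 to                             | Decode Binary subtype 3 to
    --------------------+-------------------------+----------------------------------------------+--------------------------------------------------------+-------------------------------------------------------
    PYTHON_LEGACY       | Yes, in PyMongo>=2.9,<4 | Binary subtype 3 with standard byte-order    | uuid.UUID in PyMongo<4; Binary subtype 4 in PyMongo>=4 | uuid.UUID
    JAVA_LEGACY         | No                      | Binary subtype 3 with Java legacy byte-order | uuid.UUID in PyMongo<4; Binary subtype 4 in PyMongo>=4 | uuid.UUID
    CSHARP_LEGACY       | No                      | Binary subtype 3 with C# legacy byte-order   | uuid.UUID in PyMongo<4; Binary subtype 4 in PyMongo>=4 | uuid.UUID
    STANDARD            | No                      | Binary subtype 4                             | uuid.UUID                                              | uuid.UUID in PyMongo<4; Binary subtype 3 in PyMongo>=4
    UNSPECIFIED         | Yes, in PyMongo>=4      | Raise ValueError                             | Binary subtype 4                                       | uuid.UUID in PyMongo<4; Binary subtype 3 in PyMongo>=4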

    We now detail the behavior and use-case for each supported UUID representation.

    PYTHON_LEGACY

    Attention

    This UUID representation should be used when reading UUIDs generated by existing applications that use the Python driver but don’t explicitly set a UUID representation.

    Attention

    PYTHON_LEGACY has been the default UUID representation since PyMongo 2.9.

    The PYTHON_LEGACY representation corresponds to the legacy representation of UUIDs used by PyMongo. This representation conforms with RFC 4122 Section 4.1.2.

    The following example illustrates the use of this representation. PYTHON_LEGACY encodes native uuid.UUID objects to Binary subtype 3 objects, preserving the same byte-order as UUID.bytes:

        from uuid import uuid4
        from bson.binary import Binary
        uuid_2 = uuid4()
        collection.insert_one({'uuid': uuid_2})  # collection configured with PYTHON_LEGACY
        document = collection.find_one({'uuid': Binary(uuid_2.bytes, subtype=3)})
        assert document['uuid'] == uuid_2

    JAVA_LEGACY

    Attention

    This UUID representation should be used when reading UUIDs written to MongoDB by legacy applications (i.e. applications that don’t use the STANDARD representation) using the Java driver.

    The JAVA_LEGACY representation corresponds to the legacy representation of UUIDs used by the MongoDB Java Driver.

    The JAVA_LEGACY representation reverses the order of bytes 0-7, and bytes 8-15.

    As an example, consider the same UUID described in Legacy Handling of UUID Data. Let us assume that an application used the Java driver without an explicitly specified UUID representation to insert the example UUID 00112233-4455-6677-8899-aabbccddeeff into MongoDB. If we try to read this value using PyMongo with no UUID representation specified, we end up with an entirely different UUID:

        UUID('77665544-3322-1100-ffee-ddccbbaa9988')

    However, if we explicitly set the representation to JAVA_LEGACY, we get the correct result:

        UUID('00112233-4455-6677-8899-aabbccddeeff')

    PyMongo uses the specified UUID representation to reorder the BSON bytes and load them correctly. JAVA_LEGACY encodes native uuid.UUID objects to Binary subtype 3 objects, while performing the same byte-reordering as the legacy Java driver’s UUID to BSON encoder.
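
    The reordering applied on the read path can be sketched with the standard library. This is a minimal illustration of the half-reversal described above, assuming the stored bytes are exactly those the legacy Java driver would have written; the helper name is hypothetical and this is not PyMongo's own code.

```python
import uuid


def java_legacy_swap(b: bytes) -> bytes:
    # Reverse bytes 0-7 and bytes 8-15 independently.
    return b[:8][::-1] + b[8:][::-1]


# Bytes as stored by the legacy Java layout for the example UUID.
stored = bytes.fromhex("7766554433221100ffeeddccbbaa9988")

# Interpreting the stored bytes verbatim yields the wrong UUID.
naive = uuid.UUID(bytes=stored)

# Undoing the half-reversal recovers the value the Java application wrote.
fixed = uuid.UUID(bytes=java_legacy_swap(stored))

print(naive)
print(fixed)
```

    Because reversing each half twice restores the input, the same helper serves for both encoding and decoding.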

    CSHARP_LEGACY

    Attention

    This UUID representation should be used when reading UUIDs written to MongoDB by legacy applications (i.e. applications that don’t use the STANDARD representation) using the C# driver.

    The CSHARP_LEGACY representation corresponds to the legacy representation of UUIDs used by the MongoDB C# Driver.

    Note

    The CSHARP_LEGACY representation reverses the order of bytes 0-3, bytes 4-5, and bytes 6-7.

    As an example, consider the same UUID described in Legacy Handling of UUID Data. Let us assume that an application used the C# driver without an explicitly specified UUID representation to insert the example UUID 00112233-4455-6677-8899-aabbccddeeff into MongoDB. If we try to read this value using PyMongo with no UUID representation specified, we end up with an entirely different UUID:

        UUID('33221100-5544-7766-8899-aabbccddeeff')

    However, if we explicitly set the representation to CSHARP_LEGACY, we get the correct result:

        UUID('00112233-4455-6677-8899-aabbccddeeff')

    PyMongo uses the specified UUID representation to reorder the BSON bytes and load them correctly. CSHARP_LEGACY encodes native uuid.UUID objects to Binary subtype 3 objects, while performing the same byte-reordering as the legacy C# driver’s UUID to BSON encoder.
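
    The C# legacy layout matches what Python's uuid module calls bytes_le, so the decoding step can be sketched without any driver code. The snippet assumes the stored bytes are exactly those the legacy C# driver would have written for the example UUID.

```python
import uuid

# Bytes as stored by the legacy C# layout for the example UUID.
stored = bytes.fromhex("33221100554477668899aabbccddeeff")

# Interpreting the stored bytes as big-endian yields the wrong UUID.
naive = uuid.UUID(bytes=stored)

# Interpreting them as little-endian fields (bytes 0-3, 4-5 and 6-7
# reversed) recovers the value the C# application wrote.
fixed = uuid.UUID(bytes_le=stored)

print(naive)
print(fixed)
```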

    STANDARD

    Attention

    This UUID representation should be used by new applications that have never stored UUIDs in MongoDB.

    The STANDARD representation enables cross-language compatibility by ensuring the same byte-ordering when encoding UUIDs from all drivers. UUIDs written by a driver with this representation configured will be handled correctly by every other driver, provided it is also configured with the STANDARD representation.

    STANDARD encodes native uuid.UUID objects to Binary subtype 4 objects.

    UNSPECIFIED

    Attention

    Starting in PyMongo 4.0, UNSPECIFIED will be the default UUID representation used by PyMongo.

    The UNSPECIFIED representation prevents the incorrect interpretation of UUID bytes by stopping short of automatically converting UUID fields in BSON to native UUID types. Loading a UUID when using this representation returns a Binary object instead. If required, users can coerce the decoded Binary objects into native UUIDs using the as_uuid() method and specifying the appropriate representation format. The following example shows what this might look like for a UUID stored by the C# driver:

        from bson.codec_options import CodecOptions
        from bson.binary import Binary, UuidRepresentation
        from uuid import uuid4
        # Using UuidRepresentation.CSHARP_LEGACY
        csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)
        # Store a legacy C#-formatted UUID
        input_uuid = uuid4()
        collection = client.testdb.get_collection('test', codec_options=csharp_opts)
        collection.insert_one({'_id': 'foo', 'uuid': input_uuid})
        # Using UuidRepresentation.UNSPECIFIED
        unspec_opts = CodecOptions(uuid_representation=UuidRepresentation.UNSPECIFIED)
        unspec_collection = client.testdb.get_collection('test', codec_options=unspec_opts)
        # UUID fields are decoded as Binary when UuidRepresentation.UNSPECIFIED is configured
        document = unspec_collection.find_one({'_id': 'foo'})
        decoded_field = document['uuid']
        assert isinstance(decoded_field, Binary)
        # Binary.as_uuid() can be used to coerce the decoded value to a native UUID
        decoded_uuid = decoded_field.as_uuid(UuidRepresentation.CSHARP_LEGACY)
        assert decoded_uuid == input_uuid

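    Native uuid.UUID objects cannot be encoded directly to Binary when the UUID representation is UNSPECIFIED; attempting to do so raises an exception. Instead, applications using UNSPECIFIED must explicitly convert a native UUID to Binary using the Binary.from_uuid() method before storing it:

        explicit_binary = Binary.from_uuid(uuid4(), UuidRepresentation.PYTHON_LEGACY)
        unspec_collection.insert_one({'_id': 'bar', 'uuid': explicit_binary})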