Apache Beam Google Dataflow Pipeline Engine

    The Cloud Dataflow Runner and service are suitable for large-scale continuous jobs and provide:

    • A fully managed service

    • Autoscaling of the number of workers throughout the lifetime of the job

    • Dynamic work re-balancing
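
    For orientation, here is a minimal Java sketch of a Beam pipeline configured for the Dataflow runner; the project ID, region, and bucket are placeholders, not values from this document.

        import org.apache.beam.runners.dataflow.DataflowRunner;
        import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
        import org.apache.beam.sdk.Pipeline;
        import org.apache.beam.sdk.options.PipelineOptionsFactory;

        public class DataflowPipelineSketch {
          public static void main(String[] args) {
            DataflowPipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
            options.setRunner(DataflowRunner.class);
            // Placeholder values: substitute your own project, region, and bucket.
            options.setProject("my-gcp-project");
            options.setRegion("us-central1");
            options.setTempLocation("gs://my-hop-dataflow-staging/tmp");

            Pipeline pipeline = Pipeline.create(options);
            // ... add transforms here; the Dataflow service autoscales the workers ...
            pipeline.run();
          }
        }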

    Check the Google Dataflow docs for more information.

    INFO: this configuration checklist was copied from the Apache Beam documentation.

    To use the Google Cloud Dataflow runtime configuration, you must complete the setup in the Before you begin section of the Cloud Dataflow quickstart for your chosen language.

    • Enable billing for your project.

    • Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource Manager. You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.

    • Authenticate with Google Cloud Platform.

    • Create a Cloud Storage bucket.
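
    Once these steps are done, a small Java sketch using the google-cloud-storage client can double as a check that billing, credentials, and the Cloud Storage API are in place; the bucket name below is a placeholder (bucket names must be globally unique).

        import com.google.cloud.storage.Bucket;
        import com.google.cloud.storage.BucketInfo;
        import com.google.cloud.storage.Storage;
        import com.google.cloud.storage.StorageOptions;

        public class CreateStagingBucket {
          public static void main(String[] args) {
            // Uses Application Default Credentials (see Environment Settings below).
            Storage storage = StorageOptions.getDefaultInstance().getService();
            // Placeholder name: Cloud Storage bucket names are globally unique.
            Bucket bucket = storage.create(BucketInfo.of("my-hop-dataflow-staging"));
            System.out.println("Created bucket: " + bucket.getName());
          }
        }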

    Environment Settings

    The GOOGLE_APPLICATION_CREDENTIALS environment variable needs to be set locally, pointing to the JSON key file of the service account used to authenticate with Google Cloud.
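
    As a quick sanity check that the credentials are picked up, the following minimal Java sketch resolves Application Default Credentials, which read GOOGLE_APPLICATION_CREDENTIALS when it is set:

        import com.google.auth.oauth2.GoogleCredentials;

        import java.io.IOException;

        public class CheckCredentials {
          public static void main(String[] args) throws IOException {
            // Resolves credentials from GOOGLE_APPLICATION_CREDENTIALS if set,
            // otherwise falls back to other Application Default Credentials sources.
            GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
            System.out.println("Loaded credentials: " + credentials);
          }
        }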

    To allow encrypted (TLS) network connections to services such as Kafka and Neo4j Aura, certain older security algorithms are disabled on Dataflow. This is done by setting the jdk.tls.disabledAlgorithms security property to a value listing the algorithms to disable.
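
    For illustration, a JVM-wide security property like this can be set programmatically via java.security.Security; the algorithm list below is an example of commonly disabled weak algorithms, not the exact value used on Dataflow.

        import java.security.Security;

        public class TlsAlgorithmSettings {
          public static void main(String[] args) {
            // Read the current JVM-wide value of the property.
            String current = Security.getProperty("jdk.tls.disabledAlgorithms");
            System.out.println("Currently disabled: " + current);

            // Illustrative value only: a list of weak algorithms to disable.
            Security.setProperty("jdk.tls.disabledAlgorithms",
                "SSLv3, RC4, DES, MD5withRSA, DH keySize < 1024");
          }
        }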