Broker

    Broker provides services through an RPC service port. It is a stateless JVM process that encapsulates POSIX-like file operations, such as open, pread, and pwrite, for reading and writing remote storage. Because the Broker records no other information, the connection, file, and permission information for the remote storage must be passed to the Broker process as parameters of each RPC call so that it can read and write files correctly.

    Broker only acts as a data channel and does not participate in any computation, so it uses little memory. Usually one or more Broker processes are deployed in a Doris system. Brokers of the same type form a group and share a **Broker name**.

    Broker’s position in the Doris system architecture is as follows:

    This document mainly introduces the parameters that Broker needs when accessing different remote storages, such as connection information, authorization information, and so on.

    Different types of brokers support different storage systems.

    1. Community HDFS

      • Support simple authentication access
      • Support kerberos authentication access
      • Support HDFS HA mode access
    2. Baidu HDFS / AFS (not supported by open source version)

      • Support UGI simple authentication access
    3. Baidu Object Storage BOS (not supported by open source version)

      • Support AK / SK authentication access
    1. Broker Load

      The Broker Load function reads the file data on the remote storage through the Broker process and imports it into Doris. Examples are as follows:

      LOAD LABEL example_db.example_label  -- placeholder database and label names
      (
          DATA INFILE("bos://my_bucket/input/file")
          INTO TABLE `my_table`
      )
      WITH BROKER "broker_name"
      (
          "bos_endpoint" = "http://bj.bcebos.com",
          "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
          "bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyy"
      )

      WITH BROKER and the Property Map that follows it provide the Broker-related information.

    2. Export

      EXPORT TABLE testTbl
      TO "hdfs://hdfs_host:port/a/b/c"
      WITH BROKER "broker_name"
      (
          "username" = "xxx",
          "password" = "yyy"
      );

      WITH BROKER and the Property Map that follows it provide the Broker-related information.

    3. Create Repository

      When users need the backup and restore function, they must first create a “repository” with the CREATE REPOSITORY command; the Broker and related information are recorded in the repository metadata. Subsequent backup and restore operations use the Broker to back up data to this repository, or to read data from it and restore it to Doris. Examples are as follows:

      CREATE REPOSITORY `bos_repo`
      WITH BROKER `broker_name`
      ON LOCATION "bos://doris_backup"
      PROPERTIES
      (
          "bos_endpoint" = "http://gz.bcebos.com",
          "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
          "bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyy"
      );

      WITH BROKER and the Property Map that follows it provide the Broker-related information.

    Broker information includes two parts: **Broker name** and **authentication information**. The general syntax is as follows:
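
    A minimal sketch of that general form, inferred from the examples elsewhere in this document. The property keys and values are placeholders; the actual keys depend on the storage system and authentication method described below.

    ```sql
    -- Broker name: selects the group of Broker processes to use.
    WITH BROKER "broker_name"
    (
        -- Authentication information: key-value pairs passed to the Broker.
        "property_key_1" = "property_value_1",
        "property_key_2" = "property_value_2"
    )
    ```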

    Usually the user specifies an existing Broker name through the WITH BROKER "broker_name" clause of the operation command. The Broker name is the name the user assigns when adding a Broker process with the ALTER SYSTEM ADD BROKER command. One name usually corresponds to one or more Broker processes, and Doris selects an available Broker process based on the name. You can use the SHOW BROKER command to view the Brokers that currently exist in the cluster.
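
    As an illustration of the two commands just mentioned, the following sketch registers two Broker processes under one name and then lists them. The host names are placeholders, and port 8000 is an assumption; use the RPC port your Broker processes actually listen on.

    ```sql
    -- Register two Broker processes under the same Broker name.
    -- Host names and port are placeholders.
    ALTER SYSTEM ADD BROKER broker_name "broker_host1:8000", "broker_host2:8000";

    -- List the Brokers currently known to the cluster.
    SHOW BROKER;
    ```

    Registering several processes under one name lets Doris choose any available one when a statement refers to that name.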

    Note: Broker Name is just a user-defined name and does not represent the type of Broker.

    Authentication Information

    Different Broker types and access methods require different authentication information. It is usually provided as key-value pairs in the Property Map after WITH BROKER "broker_name".

    Community HDFS

    1. Simple Authentication

      Simple authentication means that Hadoop sets hadoop.security.authentication to simple.

      HDFS is accessed as a system user. Alternatively, set the HADOOP_USER_NAME environment variable in the environment in which the Broker is started.

      (
          "username" = "user",
          "password" = ""
      );

      Just leave the password blank.

    2. Kerberos Authentication

      This authentication method requires the following information:

      • hadoop.security.authentication: Set the authentication method to kerberos.
      • kerberos_principal: The Kerberos principal.
      • kerberos_keytab: The path to the Kerberos keytab file. It must be an absolute path to a file on the server where the Broker process runs, and the file must be readable by the Broker process.
      • kerberos_keytab_content: The base64-encoded content of the Kerberos keytab file. Provide either this property or kerberos_keytab.

      Examples are as follows:

      (
          "hadoop.security.authentication" = "kerberos",
          "kerberos_principal" = "doris@YOUR.COM",
          "kerberos_keytab" = "/home/doris/my.keytab"
      )

      (
          "hadoop.security.authentication" = "kerberos",
          "kerberos_principal" = "doris@YOUR.COM",
          "kerberos_keytab_content" = "<base64-encoded keytab content>"
      )

      If Kerberos authentication is used, the krb5.conf file is required when deploying the Broker process. The krb5.conf file contains the Kerberos configuration information. Normally it is installed in the /etc directory; you can override the default location by setting the KRB5_CONFIG environment variable. An example of the contents of the krb5.conf file is as follows:

    3. HDFS HA Mode

      This configuration is used to access HDFS clusters deployed in HA mode.

      • dfs.nameservices: The name of the HDFS nameservice, user-defined, such as "dfs.nameservices" = "my_ha".
      • dfs.ha.namenodes.xxx: User-defined namenode names, separated by commas, where xxx is the name set in dfs.nameservices, such as "dfs.ha.namenodes.my_ha" = "my_nn".
      • dfs.namenode.rpc-address.xxx.nn: The RPC address of the namenode, where nn is a namenode name configured in dfs.ha.namenodes.xxx, such as "dfs.namenode.rpc-address.my_ha.my_nn" = "host:port".
      • dfs.client.failover.proxy.provider: The provider the client uses to connect to the namenode. The default is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

      Examples are as follows:

      (
          "dfs.nameservices" = "my_ha",
          "dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
          "dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
          "dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
          "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      )

      HA mode can be combined with either of the two authentication methods above. For example, to access an HA HDFS cluster with simple authentication:

      (
          "username" = "user",
          "password" = "passwd",
          "dfs.nameservices" = "my_ha",
          "dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
          "dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
          "dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
          "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      )
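
      The same combination should work with Kerberos authentication. The sketch below simply merges the Kerberos properties with the HA properties, reusing the placeholder values from the earlier examples; it is an illustration of the combination, not a tested configuration.

      ```sql
      (
          -- Kerberos authentication properties
          "hadoop.security.authentication" = "kerberos",
          "kerberos_principal" = "doris@YOUR.COM",
          "kerberos_keytab" = "/home/doris/my.keytab",
          -- HDFS HA properties
          "dfs.nameservices" = "my_ha",
          "dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
          "dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
          "dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
          "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      )
      ```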

      The configuration for accessing the HDFS cluster can also be written to the hdfs-site.xml file. In that case, users who read data from the HDFS cluster through the Broker process only need to provide the cluster file path and the authentication information.

    Baidu Object Storage BOS

    (Open source version is not supported)

    1. Access via AK / SK

      • AK/SK: Access Key and Secret Key. You can check the user’s AK / SK in Baidu Cloud Security Certification Center.
      • Region Endpoint: Endpoint of the BOS region.
      • For the regions supported by BOS and their corresponding Endpoints, see [Get access domain name](https://cloud.baidu.com/doc/BOS/s/Ck1rk80hn#%E8%8E%B7%E5%8F%96%E8%AE%BF%E9%97%AE%E5%9F%9F%E5%90%8D)

      Examples are as follows:

      (
          "bos_endpoint" = "http://bj.bcebos.com",
          "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
          "bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyy"
      )

    Baidu HDFS/AFS

    (Open source version is not supported)

    Here, the user name and password are the Hadoop UGI configuration.