RM design

    2 Product Design

    RM maintains the available resource information reported by the engine manager, processes the resource application submitted by the engine, and records the actual resource usage information after the successful application.

    1. Engine Manager, EM for short: a microservice that processes requests to start an engine. As a resource provider, EM is responsible for registering and unregistering resources with RM. At the same time, EM, as the manager of the engine, is responsible for applying for resources from RM instead of the engine. For each EM instance, there is a corresponding resource record in the RM, which contains information such as the total resources and protection resources it provides, and dynamically updates the used resources.

    2. Engine, also known as application: a microservice that executes user operations. At the same time, as the actual user of resources, the engine is responsible for reporting the actual used resources and releasing resources to the RM. Each Engine has a corresponding resource record in the RM: during the startup process, it is reflected as a locked resource; during the running process, it is reflected as a used resource; after being terminated, the resource record is subsequently deleted.

    As shown in the figure above, all resource classes implement a top-level Resource interface, which defines the calculation and comparison methods that all resource classes need to support, and performs corresponding mathematical operator overloading, so that resources can be Directly calculated and compared like numbers.

    The currently supported resource types are shown in the following table. All resources have corresponding json serialization and deserialization methods, which can be stored in json format and transmitted across the network:

    Resource TypeDescription
    MemoryResourceMemoryResource
    CPUResourceCPU Resource
    LoadResourceHave both memory and CPU resources
    YarnResourceYarn queue resources (queue, queue memory, queue CPU, number of queue instances)
    LoadInstanceResourceServer resources (memory, CPU, number of instances)
    DriverAndYarnResourceDriver and actuator resources (both server resources and Yarn queue resources)
    SpecialResourceOther custom resources
    1. When the EM holding the resource starts, it will call the register interface through RPC and pass in the resource in json format for resource registration. The parameters that need to be provided to the register interface are as follows:

      1) Total resources: the total number of resources that the EM can provide.

      2) Protect resources: When the remaining resources are less than this resource, no further resources are allowed to be allocated.

      3) Resource type: such as LoadResource, DriverAndYarnResource and other type names.

      4) EM name: The EM name that provides resources such as sparkEngineManager.

      5) EM instance: machine name plus port name.

    2. After the RM receives the resource registration request, it adds a new record to the table linkis_module_resource_meta_data, the content of which is consistent with the parameter information of the interface.

    3. When the EM holding the resource is closed, it will call the unregister interface through RPC and pass in its own EM instance information as a parameter to offline the resource.

    4. After receiving the resource offline request, the RM finds the row corresponding to the EM instance information in the linkis_module_resource_meta_data table and deletes it; at the same time, finds all the rows corresponding to the EM instance in the linkis_user_resource_meta_data table and deletes it.

    4 Resource allocation and recycling

    1. Receive user’s resource application.

      a) RM provides requestResource interface to EM to report resource application, this interface accepts EM instance, user, Creator and Resource object as parameters. requestResource accepts an optional time parameter. When the processing event exceeds the limit of the time parameter, the resource request will be automatically processed as a failure.

    2. Lock the resource for the request that successfully applied for the resource. After confirming that the resources are sufficient, lock the resources in advance for the application and generate a unique identifier.

      a) In order to ensure the correctness in the concurrent scenario, two locks need to be added before the lock operation (the specific implementation of the lock mechanism is described in another chapter): EM lock and user lock.

       i. EM lock. After the lock is obtained, other resource operations for the EM will not be allowed.

       ii. User lock. After the lock is obtained, other resource operations of the user will not be allowed.

      b) After the two locks are successfully obtained, the judgment will be repeated again to determine whether the resources are sufficient, and if it is still sufficient, continue with the subsequent steps.

      c) Generate a UUID for the resource application, and insert a user resource record in the linkis_user_resource_meta_data table (pre_used_resource is the number of resources requested, and used_resource is null).

      d) Update the corresponding EM resource record fields (locked_resource, left_resource) in the linkis_module_resource_meta_data table.

      e) Submit a timed task. If the task is not cancelled, the two steps c and d will be rolled back after a fixed time, and the UUID will be invalidated so that the locked resources that are not actually used will not be occupied indefinitely.

      f) Return the UUID to the resource applicant.

      g) No matter what happens in the above steps, release the two locks obtained in a at the end.

    3. Receive the actual used resources reported by the user.

      a) Provide resourceInited interface, accept UUID, user name, EM instance information, actual use of Resource object and engine instance information as parameters.

      b) After receiving the reported information, obtain the EM lock and user lock.

      c) According to the UUID query to find the corresponding locked resource record, update pre_used_resource to null, and fill in used_resource with the resource actually used.

      d) Update the corresponding module resource record (restore locked_resource, add used_resource).

      e) Abnormal situation: If the corresponding UUID cannot be found, it is considered that the lock on the resource has been lost, and an exception message is returned.

    4. Receive a request from the user to release resources.

      a) Provide resourceReleased interface, accept UUID, username, EM instance as parameters.

      c) Query the corresponding user resource record according to UUID, and delete the row.

      d) Update the corresponding module resource record (clean up used_resource, restore left_resource).

    The lock is realized through the linkis_resource_lock table, and the unique constraint mechanism of the database itself is used to ensure that the data is not preempted.

    1. EM lock: for the global lock on an instance of an EM operation.

      a) Obtain the lock:

       i. Check whether there is a record where the user is null and the application and instance fields are corresponding values. If there is, it means that the lock has been acquired by other instances, and polling is waiting.

       ii. When there is no corresponding record, insert a record, if the insertion is successful, it means that the lock is successfully obtained; if the insertion encounters a UniqueConstraint error, the record polling and waiting until timeout.

      b) Release the lock:

       i. Delete the record that you own.

    2. User lock: lock the operation of a certain EM for a certain user.

      a) Obtain the lock:

       i. Check whether there is a record with the user, application and instance fields as corresponding values. If so, it means that the lock has been acquired by other instances, and wait for polling.

       ii. When there is no corresponding record, insert a record, if the insertion is successful, it means that the lock is successfully obtained; if the insertion fails, the record polling waits until timeout.

      b) Release the lock:

       i. Delete the records held by yourself.

    6 RM Client

    In the form of a jar package, the client is provided to resource users and resource providers, including the following:

    1. All resources Type of Java class (a subclass of Resource class), and the corresponding json serialization method.

    2. The Java class (subclass of ResultResource class) of all resource allocation results, and the corresponding json serialization method.

    3. The encapsulated RM interface (resource registration, offline, application, available resources and resource release requests).

      After calling the client’s interface, the client will generate the corresponding RPC command and pass it to a microservice of RM for processing through the Sender. After RM is processed, the result is also returned to the client via RPC.

      Because RM is a key underlying service, in order to prevent the resource allocation of all services from being affected by an abnormality of an RM instance, it is necessary to ensure that multiple RM instances are in service at the same time, and to ensure that a request is received by which instance Processing can ensure the consistency of the results.

       When a user requests the service of RM, he must request it through the forwarding of the gateway service, and cannot directly request a fixed RM instance. Through the service registration and discovery mechanism, the gateway service identifies the RM instance that normally provides the service, and then forwards the RPC request to one of the instances. This ensures that all requests will be processed by the RM instance in the normal state.