User Defined Function

    UDFs can meet two types of analysis requirements: UDF and UDAF. In this document, "UDF" refers to both.

    1. UDF: a user-defined function that operates on a single row and outputs a single-row result. When a UDF is used in a query, each row of data eventually appears in the result set. Typical UDFs are string operations such as concat().
    2. UDAF: a user-defined aggregate function that operates on multiple rows and outputs a single-row result. When a UDAF is used in a query, each group of rows produced by grouping is reduced to a single value in the result set. A typical UDAF is the aggregate operation sum(). UDAFs are generally used together with GROUP BY.
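
    The difference between the two kinds of function can be illustrated outside Doris with a minimal C++ sketch. All names here (Row, apply_scalar, aggregate_sum, add_one) are hypothetical and are not part of the Doris UDF framework; the sketch only shows that a UDF-style function yields one output per input row, while a UDAF-style function yields one output per group.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One input row: a grouping key and an integer value (hypothetical layout).
using Row = std::pair<std::string, int>;

// UDF-style evaluation: fn is applied to each row independently,
// so the output has exactly as many entries as the input.
std::vector<int> apply_scalar(const std::vector<Row>& rows, int (*fn)(int)) {
    std::vector<int> out;
    out.reserve(rows.size());
    for (const Row& row : rows) {
        out.push_back(fn(row.second));
    }
    return out;
}

// UDAF-style evaluation: the rows of each group are folded into a
// single value, so the output has one entry per distinct key.
std::map<std::string, int> aggregate_sum(const std::vector<Row>& rows) {
    std::map<std::string, int> out;
    for (const Row& row : rows) {
        out[row.first] += row.second;
    }
    return out;
}

// A trivial scalar function used as the per-row operation.
int add_one(int v) { return v + 1; }
```

    For three input rows spread over two groups, apply_scalar returns three values and aggregate_sum returns two.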

    This document mainly describes how to write a UDF and how to use it in Doris.

    If you use UDFs to extend Doris’ function analysis and want to contribute them back to the Doris community for other users, please see the document Contribute UDF.

    Before using a UDF, users need to write it under Doris’ UDF framework. A simple UDF demo is available for reference.

    Writing a UDF function requires the following steps.

    Create the corresponding header file and CPP file, and implement the required logic in the CPP file. The format of the implementation function in the CPP file corresponds to the UDF as described in the subsections below.

    Users can put their own source code in a folder. Taking udf_sample as an example, the directory structure is as follows:

    Non-variable parameters

    For UDFs with non-variable parameters, the correspondence between the two is straightforward. For example, the UDF INT MyADD(INT, INT) corresponds to IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2).

    1. AddUdf can be any name, as long as the same name is specified when creating the UDF.
    2. The first parameter of the implementation function is always FunctionContext*. Through this structure the implementer can obtain query-related information and request memory to use. For the specific interface, refer to the definition in udf/udf.h.
    3. From the second parameter onward, the parameters of the implementation function must correspond one-to-one with the UDF parameters; for example, IntVal corresponds to the INT type. All of these parameters must be passed by const reference.
    4. The return type must correspond to the return type of the UDF.
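
    The rules above can be sketched as follows. This is a minimal sketch, not the framework's code: FunctionContext and IntVal here are simplified stand-ins written for this example (an is_null flag plus a value), on the assumption that the real types in udf/udf.h expose at least these members. A real UDF would include udf/udf.h instead of defining them.

```cpp
#include <cstdint>

// Simplified stand-ins for the types declared in udf/udf.h
// (assumption: the real definitions carry more than is shown here).
namespace doris_udf {

class FunctionContext {};

struct IntVal {
    bool is_null;
    int32_t val;

    IntVal() : is_null(false), val(0) {}
    explicit IntVal(int32_t v) : is_null(false), val(v) {}

    // Builds a value representing SQL NULL.
    static IntVal null() {
        IntVal v;
        v.is_null = true;
        return v;
    }
};

// Implementation function for the UDF INT MyADD(INT, INT):
// FunctionContext* comes first, then one const reference per UDF
// argument, and the return type matches the UDF's return type.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
    (void)context;  // unused in this sketch
    if (arg1.is_null || arg2.is_null) {
        return IntVal::null();
    }
    return IntVal(arg1.val + arg2.val);
}

}  // namespace doris_udf
```

    Returning NULL when either argument is NULL is a design choice of this sketch that mirrors the usual SQL convention for arithmetic.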

    Variable parameters

    For variable parameters, refer to the following example: the UDF String md5sum(String, ...) corresponds to the implementation function StringVal md5sumUdf(FunctionContext* ctx, int num_args, const StringVal* args).

    1. md5sumUdf can also be any name, as long as the same name is specified when creating the UDF.
    2. The first parameter is the same as for non-variable parameter functions: a FunctionContext*.
    3. The variable parameter part consists of two parameters. The first is an integer giving the number of arguments that follow; the second is an array holding those arguments.
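
    The variable-parameter shape described above can be sketched with integers instead of strings, which avoids the memory handling a StringVal result would need. SumAllUdf and the UDF it would back, INT my_sum(INT, ...), are hypothetical names for this example, and FunctionContext and IntVal are simplified stand-ins for the types in udf/udf.h.

```cpp
#include <cstdint>

// Simplified stand-ins for the types declared in udf/udf.h (assumption).
namespace doris_udf {

class FunctionContext {};

struct IntVal {
    bool is_null = false;
    int32_t val = 0;
};

// Hypothetical implementation function for INT my_sum(INT, ...):
// after FunctionContext*, num_args gives the number of variable
// arguments and args points to an array holding them.
IntVal SumAllUdf(FunctionContext* ctx, int num_args, const IntVal* args) {
    (void)ctx;  // unused in this sketch
    IntVal result;
    for (int i = 0; i < num_args; ++i) {
        if (args[i].is_null) {
            // Any NULL argument makes the whole result NULL.
            result.is_null = true;
            result.val = 0;
            return result;
        }
        result.val += args[i].val;
    }
    return result;
}

}  // namespace doris_udf
```

    The function reads exactly num_args entries, so the caller is responsible for passing an array of at least that length.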

    Type correspondence

    Since the UDF implementation relies on Doris’ UDF framework, the first step in compiling a UDF is to compile Doris itself, that is, the UDF framework.

    Running sh build.sh in the root directory of Doris generates the UDF framework's headers and static library in output/udf/:

    ├── output
    │   └── udf
    │       ├── include
    │       │   ├── uda_test_harness.h
    │       │   └── udf.h
    │       └── lib
    │           └── libDorisUdf.a

    1. Prepare the CMakeLists.txt used to compile the UDF

      CMakeLists.txt declares how the UDF functions are compiled. It is stored in the source code folder, at the same level as the user code. Here, taking udf_samples as an example, the directory structure is as follows:

      • Explicitly declare the reference to libDorisUdf.a
      • Declare the location of the udf.h header file

      Take udf_sample as an example:

      # Include udf
      include_directories(thirdparty/include)

      # Set all libraries
      add_library(udf STATIC IMPORTED)
      set_target_properties(udf PROPERTIES IMPORTED_LOCATION thirdparty/lib/libDorisUdf.a)

      # where to put generated libraries
      set(LIBRARY_OUTPUT_PATH "${BUILD_DIR}/src/udf_samples")

      # where to put generated binaries
      set(EXECUTABLE_OUTPUT_PATH "${BUILD_DIR}/src/udf_samples")

      # UDF sample library
      add_library(udfsample SHARED udf_sample.cpp)
      target_link_libraries(udfsample
          udf
          -static-libstdc++
          -static-libgcc
      )

      # UDAF sample library (the target must be declared before it is linked)
      add_library(udasample SHARED uda_sample.cpp)
      target_link_libraries(udasample
          udf
          -static-libstdc++
          -static-libgcc
      )

    The complete directory structure after all files are prepared is as follows:

    ├── thirdparty
    │   ├── include
    │   │   └── udf.h
    │   └── lib
    │       └── libDorisUdf.a
    └── udf_samples
        ├── CMakeLists.txt
        ├── uda_sample.cpp
        ├── uda_sample.h
        ├── udf_sample.cpp
        └── udf_sample.h

    Once the above files are prepared, you can compile the UDF directly.

    Create a build folder under the udf_samples folder to store the compilation output.

    Run the command cmake ../ in the build folder to generate a Makefile, and execute make to generate the corresponding dynamic library.

    After compilation completes, the UDF dynamic link library is generated under build/src/. Taking udf_samples as an example, the directory structure is as follows:

    ├── thirdparty
    └── udf_samples
        └── build
            └── src
                └── udf_samples
                    ├── libudasample.so
                    └── libudfsample.so

    After following the above steps, you can get the UDF dynamic library (that is, the .so file in the compilation result). You need to put this dynamic library in a location that can be accessed through the HTTP protocol.

    Then log in to the Doris system and create the UDF in the mysql-client using the CREATE FUNCTION syntax. You need the ADMIN privilege to complete this operation. Once the statement succeeds, the UDF exists in the Doris system.

    CREATE [AGGREGATE] FUNCTION
        name ([argtype][,...])
        [RETURNS] rettype
        PROPERTIES (["key"="value"][,...])

    Description:

    1. symbol in PROPERTIES specifies the symbol of the entry function to execute. This parameter must be set. You can obtain the symbol with the nm command; for example, _ZN9doris_udf6AddUdfEPNS_15FunctionContextERKNS_6IntValES4_, obtained by running nm libudfsample.so | grep AddUdf, is the corresponding symbol.
    2. object_file in PROPERTIES specifies the location from which the corresponding dynamic library can be downloaded. This parameter must be set.
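
    To check which function a symbol reported by nm refers to, the symbol can be demangled. The sketch below assumes a GCC or Clang toolchain, where <cxxabi.h> provides abi::__cxa_demangle; the helper name demangle is hypothetical and is not part of Doris.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Returns the human-readable form of an Itanium C++ ABI symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result
    // with malloc, so it must be released with free.
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

    Passing the symbol from item 1 to demangle should yield a signature beginning with doris_udf::AddUdf(doris_udf::FunctionContext*, which matches the implementation function's parameter list. The value supplied to CREATE FUNCTION is still the mangled form.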

    For details, please refer to CREATE FUNCTION.

    Users must have the SELECT permission on the corresponding database to use a UDF/UDAF.

    When a UDF is no longer needed, it can be deleted; refer to DROP FUNCTION.