Spark Engine Usage

If you want to use the spark EngineConn on your server, you need to ensure that the following environment variables have been set correctly and that the user who started the EngineConn has these environment variables.

It is strongly recommended that you check these environment variables of the executing user before executing spark tasks.

Table 1-1 Environmental configuration list

In theory, Linkis1.0 supports all versions of spark2.x and above. Spark 2.4.3 is the default supported version. If you want to use your spark version, such as spark2.1.0, you only need to modify the version of the plug-in spark and then compile it. Specifically, you can find the linkis-engineplugin-spark module, change the \<spark.version> tag to 2.1.0, and then compile this module separately.

2.2 spark engineConn deployment and loading

If you have already compiled your spark EngineConn plug-in has been compiled, then you need to put the new plug-in to the specified location to load, you can refer to the following article for details

Linkis1.0 is done through tags, so we need to insert data in our database, the way of inserting is shown below.

EngineConnPlugin Installation > 2.2 Configuration modification of management console (optional)

Preparation for operation, queue setting

Figure 3-1 Queue settings

You can also add the queue value in the StartUpMap of the submission parameter:

Linkis provides a client method to call Spark tasks. The call method is through the SDK provided by LinkisClient. We provide java and scala two ways to call, the specific usage can refer to JAVA SDK Manual. If you use Hive, you only need to make the following changes:

3.2 How to use Linkis-cli

After Linkis 1.0, you can submit tasks through cli. We only need to specify the corresponding EngineConn and CodeType tag types. The use of Spark is as follows:

The specific usage can refer to Linkis CLI Manual.

The use of is the simplest. You can directly enter Scriptis and create a new sql, scala or pyspark script for execution.

The sql method is the simplest. You can create a new sql script and write and execute it. When it is executed, the progress will be displayed. If the user does not have a spark EngineConn at the beginning, the execution of sql will start a spark session (it may take some time here), After the SparkSession is initialized, you can start to execute sql.

Figure 3-2 Screenshot of the execution effect of sparksql

For spark-scala tasks, we have initialized sqlContext and other variables, and users can directly use this sqlContext to execute sql.

Figure 3-3 Execution effect diagram of spark-scala

Similarly, in the way of pyspark, we have also initialized the SparkSession, and users can directly use spark.sql to execute SQL.

Figure 3-4 pyspark execution mode

In addition to the above EngineConn configuration, users can also make custom settings, such as the number of spark session executors and the memory of the executors. These parameters are for users to set their own spark parameters more freely, and other spark parameters can also be modified, such as the python version of pyspark.