DB4AI-Query:模型训练和推断

    表 1 DB4AI语法及关键字

    使用指导

    1. 本版本支持的算法概述。

      当前版本的DB4AI支持基于SGD算子的逻辑回归(目前支持二分类任务)、线性回归和支持向量机算法(分类任务),以及基于K-Means算子的Kmeans聚类算法。

    2. 模型训练语法说明。

      • CREATE MODEL

        使用“CREATE MODEL”语句可以进行模型的创建和训练。模型训练SQL语句,现有一个数据集为kmeans_2d,该表的数据内容如下:

        该表的字段position的数据类型为 double precision[].

      • 以Kmeans为例,训练一个模型。从kmeans_2d训练集中指定position为特征列,使用kmeans算法,创建并保存模型point_kmeans。

        1. NOTICE: Hyperparameter max_iterations takes value DEFAULT (10)
        2. NOTICE: Hyperparameter num_centroids takes value 3
        3. NOTICE: Hyperparameter tolerance takes value DEFAULT (0.000010)
        4. NOTICE: Hyperparameter batch_size takes value DEFAULT (10)
        5. NOTICE: Hyperparameter num_features takes value DEFAULT (2)
        6. NOTICE: Hyperparameter distance_function takes value DEFAULT (L2_Squared)
        7. NOTICE: Hyperparameter seeding_function takes value DEFAULT (Random++)
        8. NOTICE: Hyperparameter verbose takes value DEFAULT (0)
        9. NOTICE: Hyperparameter seed takes value DEFAULT (0)
        10. MODEL CREATED. PROCESSED 1

        上述命令中:

        • “CREATE MODEL”语句用于模型的训练和保存。
        • USING关键字指定算法名称。
        • FEATURES用于指定训练模模型的特征,需根据训练数据表的列名添加。
        • TARGET指定模型的训练目标,它可以是训练所需数据表的列名,也可以是一个表达式,例如: price > 10000。
        • WITH用于指定训练模型时的超参数。当超参未被用户进行设置的时候,框架会使用默认数值。

          针对不同的算子,框架支持不同的超参组合:

          表 2 算子支持的超参

          当前各个超参数设置的默认值和取值范围如下:

          表 3 超参的默认值以及取值范围

      • 模型保存成功,则返回创建成功信息:

        1. MODEL CREATED. PROCESSED x
    3. 查看模型信息。

      当训练完成后模型会被存储到系统表gs_model_warehouse中。系统表gs_model_warehouse可以查看到关于模型本身和训练过程的相关信息。

      用户可以通过查看系统表的方式查看模型,例如查看模型名为“point_kmeans”的SQL语句如下:

      1. openGauss=# select * from gs_model_warehouse where modelname='point_kmeans';
      2. -[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3. modelname | point_kmeans
      4. modelowner | 10
      5. createtime | 2021-04-30 17:30:39.59044
      6. processedtuples | 20
      7. discardedtuples | 0
      8. pre_process_time | 6.2001e-05
      9. exec_time | .000185272
      10. iterations | 5
      11. outputtype | 23
      12. query | CREATE MODEL point_kmeans USING kmeans FEATURES position FROM kmeans_2d WITH num_centroids=3;
      13. modeldata |
      14. weight |
      15. hyperparametersnames | {max_iterations,num_centroids,tolerance,batch_size,num_features,distance_function,seeding_function,verbose,seed}
      16. hyperparametersoids | {23,23,701,23,23,1043,1043,23,23}
      17. coefnames | {original_num_centroids,actual_num_centroids,dimension,distance_function_id,seed,coordinates}
      18. coefvalues | {3,3,2,2,572368998,"(77.282589,23.724434)(74.421616,73.239455)(18.551682,76.320914)"}
      19. coefoids |
      20. trainingscoresname |
      21. trainingscoresvalue |
      22. modeldescribe | {"id:1,objective_function:542.851169,avg_distance_to_centroid:108.570234,min_distance_to_centroid:1.027078,max_distance_to_centroid:297.210108,std_dev_distance_to_centroid:105.053257,cluster_size:5","id:2,objective_function:5825.982139,avg_distance_to_centroid:529.634740,min_distance_to_centroid:100.270449,max_distance_to_centroid:990.300588,std_dev_distance_to_centroid:285.915094,cluster_size:11","id:3,objective_function:220.792591,avg_distance_to_centroid:55.198148,min_distance_to_centroid:4.216111,max_distance_to_centroid:102.117204,std_dev_distance_to_centroid:39.319118,cluster_size:4"}
    4. 利用已存在的模型做推断任务。

      使用“SELECT”和“PREDICT BY”关键字利用已有模型完成推断任务。

      查询语法:SELECT…PREDICT BY…(FEATURES…)…FROM…;

      针对相同的推断任务,同一个模型的结果是稳定的。且基于相同的超参数和训练集训练的模型也具有稳定性,同时AI模型训练存在随机成分(每个batch的数据分布、随机梯度下降),所以不同的模型间的计算表现、结果允许存在小的差别。

    5. 查看执行计划。

      使用explain语句可对“CREATE MODEL”和“PREDICT BY”的模型训练或预测过程中的执行计划进行分析。Explain关键字后可直接拼接CREATE MODEL/ PREDICT BY语句(子句),也可接可选的参数,支持的参数如下:

      表 4 EXPLAIN支持的参数

      示例:

      1. openGauss=# Explain CREATE MODEL patient_logisitic_regression USING logistic_regression FEATURES second_attack, treatment TARGET trait_anxiety > 50 FROM patients WITH batch_size=10, learning_rate = 0.05;
      2. NOTICE: Hyperparameter batch_size takes value 10
      3. NOTICE: Hyperparameter decay takes value DEFAULT (0.950000)
      4. NOTICE: Hyperparameter learning_rate takes value 0.050000
      5. NOTICE: Hyperparameter max_iterations takes value DEFAULT (100)
      6. NOTICE: Hyperparameter max_seconds takes value DEFAULT (0)
      7. NOTICE: Hyperparameter optimizer takes value DEFAULT (gd)
      8. NOTICE: Hyperparameter tolerance takes value DEFAULT (0.000500)
      9. NOTICE: Hyperparameter seed takes value DEFAULT (0)
      10. NOTICE: Hyperparameter verbose takes value DEFAULT (FALSE)
      11. NOTICE: GD shuffle cache size 212369
      12. QUERY PLAN
      13. -------------------------------------------------------------------
      14. Gradient Descent (cost=0.00..0.00 rows=0 width=0)
      15. -> Seq Scan on patients (cost=0.00..32.20 rows=1776 width=12)
      16. (2 rows)
    6. 异常场景。

      • 训练阶段。

        • 场景一:当超参数的设置超出取值范围,模型训练失败,返回ERROR,并提示错误,例如:

          1. openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety FROM patients WITH optimizer='aa';
          2. NOTICE: Hyperparameter batch_size takes value DEFAULT (1000)
          3. NOTICE: Hyperparameter decay takes value DEFAULT (0.950000)
          4. NOTICE: Hyperparameter max_seconds takes value DEFAULT (0)
          5. NOTICE: Hyperparameter optimizer takes value aa
          6. ERROR: Invalid hyperparameter value for optimizer. Valid values are: gd, ngd. (default is gd)
        • 场景二:当模型名称已存在,模型保存失败,返回ERROR,并提示错误原因:

          1. openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety FROM patients;
          2. NOTICE: Hyperparameter batch_size takes value DEFAULT (1000)
          3. NOTICE: Hyperparameter decay takes value DEFAULT (0.950000)
          4. NOTICE: Hyperparameter learning_rate takes value DEFAULT (0.800000)
          5. NOTICE: Hyperparameter max_iterations takes value DEFAULT (100)
          6. NOTICE: Hyperparameter max_seconds takes value DEFAULT (0)
          7. NOTICE: Hyperparameter optimizer takes value DEFAULT (gd)
          8. NOTICE: Hyperparameter tolerance takes value DEFAULT (0.000500)
          9. NOTICE: Hyperparameter seed takes value DEFAULT (0)
          10. NOTICE: Hyperparameter verbose takes value DEFAULT (FALSE)
          11. NOTICE: GD shuffle cache size 5502
          12. ERROR: The model name "patient_linear_regression" already exists in gs_model_warehouse.
        • 场景三:FEATURE或者TARGETS列是*,返回ERROR,并提示错误原因:

        • 场景四:对于无监督学习方法使用TARGET关键字,或者在监督学习方法中不适用TARGET关键字,均会返回ERROR,并提示错误原因:

          1. openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment FROM patients;
          2. ERROR: Supervised ML algorithms require TARGET clause
          3. -----------------------------------------------------------------------------------------------------------------------------
          4. CREATE MODEL patient_linear_regression USING linear_regression TARGET trait_anxiety FROM patients; ERROR: Supervised ML algorithms require FEATURES clause
        • 场景五:当GUC参数statement_timeout设置了时长,训练超时执行的语句将被终止:执行CREATE MODEL语句。训练集的大小、训练轮数(iteration)、提前终止条件(tolerance、max_seconds)、并行线程数(nthread)等参数都会影响训练时长。当时长超过数据库限制,语句被终止模型训练失败。

      • 推断阶段。

        • 场景六:当模型名在系统表中查找不到,数据库会报ERROR:

          1. openGauss=# select id, PREDICT BY patient_logistic_regression (FEATURES second_attack,treatment) FROM patients;
          2. ERROR: There is no model called "patient_logistic_regression".
        • 场景七:当做推断任务FEATURES的数据维度和数据类型与训练集存在不一致,将报ERROR,并提示错误原因,例如:

          1. openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES second_attack) FROM patients;
          2. ERROR: Invalid number of features for prediction, provided 1, expected 2
          3. CONTEXT: referenced column: patient_linear_regression_pred
          4. -------------------------------------------------------------------------------------------------------------------------------------
          5. openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES 1,second_attack,treatment) FROM patients;
          6. ERROR: Invalid number of features for prediction, provided 3, expected 2