TiSpark 快速上手

    为了让大家快速体验 TiSpark,通过 TiUP 安装的 TiDB 集群中默认已集成 Spark 和 TiSpark jar 包。

    • Spark 默认部署在 TiDB 实例部署目录下 spark 目录中
    • TiSpark jar 包默认部署在 Spark 部署目录 jars 文件夹下:
    • TiSpark 示例数据和导入脚本可点击 下载。

    Oracle JDK 官方下载页面下载 JDK 1.8 当前最新版,本示例中下载的版本为 jdk-8u141-linux-x64.tar.gz

    解压并根据您的 JDK 部署目录设置环境变量,编辑 ~/.bashrc 文件,比如:

    1. export JAVA_HOME=/home/pingcap/jdk1.8.0_144 &&
    2. export PATH=$JAVA_HOME/bin:$PATH

    验证 JDK 有效性:

    1. java -version
    1. java version "1.8.0_144"
    2. Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
    3. Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

    导入样例数据

    1. wget http://download.pingcap.org/tispark-sample-data.tar.gz && \
    2. tar -zxvf tispark-sample-data.tar.gz && \
    3. cd tispark-sample-data

    修改 sample_data.sh 中 TiDB 登录信息,比如:

    1. mysql --local-infile=1 -h 192.168.0.2 -P 4000 -u root < dss.ddl

    执行脚本

    登录 TiDB 并验证数据包含 TPCH_001 库及以下表:

    1. mysql -uroot -P4000 -h192.168.0.2
    1. show databases;
    1. +--------------------+
    2. | Database |
    3. +--------------------+
    4. | INFORMATION_SCHEMA |
    5. | PERFORMANCE_SCHEMA |
    6. | mysql |
    7. | test |
    8. 5 rows in set (0.00 sec)
    1. use TPCH_001;
    1. Reading table information for completion of table and column names
    2. You can turn off this feature to get a quicker startup with -A
    3. Database changed
    1. +--------------------+
    2. | Tables_in_TPCH_001 |
    3. +--------------------+
    4. | CUSTOMER |
    5. | LINEITEM |
    6. | NATION |
    7. | ORDERS |
    8. | PART |
    9. | PARTSUPP |
    10. | REGION |
    11. | SUPPLIER |
    12. +--------------------+
    13. 8 rows in set (0.00 sec)
    1. cd spark &&
    2. bin/spark-shell

    然后像使用原生 Spark 一样查询 TiDB 表:

    1. scala> spark.sql("select count(*) from lineitem").show

    结果为

    1. |count(1)|
    2. +--------+
    3. +--------+

    下面执行另一个复杂一点的 Spark SQL:

    1. scala> spark.sql(
    2. """select
    3. | l_returnflag,
    4. | l_linestatus,
    5. | sum(l_quantity) as sum_qty,
    6. | sum(l_extendedprice) as sum_base_price,
    7. | sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    8. | sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    9. | avg(l_quantity) as avg_qty,
    10. | avg(l_extendedprice) as avg_price,
    11. | avg(l_discount) as avg_disc,
    12. | count(*) as count_order
    13. |from
    14. | lineitem
    15. |where
    16. | l_shipdate <= date '1998-12-01' - interval '90' day
    17. |group by
    18. | l_returnflag,
    19. | l_linestatus
    20. |order by
    21. | l_returnflag,
    22. """.stripMargin).show

    结果为:

    更多样例请参考 pingcap/tispark-test