Running Trino with Alluxio

    Trino is an open source distributed SQL query engine for running interactive analytic queries on data at a large scale. This guide describes how to run queries against Trino with Alluxio as a distributed caching layer, for any data storage systems that Alluxio supports (AWS S3, HDFS, Azure Blob Store, NFS, and more). Alluxio allows Trino to access data regardless of the data source and transparently cache frequently accessed data (e.g., tables commonly used) into Alluxio distributed storage. Co-locating Alluxio workers with Trino workers improves data locality and reduces the I/O access latency when other storage systems are remote or the network is slow or congested.

    Currently, there are 2 ways to enable Trino to interact with Alluxio:

    • Trino interacts directly with the Hive Metastore (with table definitions updated to use Alluxio paths)

    The primary benefits for using Trino with the Alluxio Catalog Service are

    • Simpler deployments of Alluxio with Trino (no modifications to the Hive Metastore)
    • Enabling schema-aware optimizations (transformations like coalescing and file conversions).

    Currently, the catalog service is limited to read-only workloads.

    For more details and instructions on how to use the Alluxio Catalog Service with Trino, please visit the .

    The rest of this page discusses the alternative approach of Trino directly interacting with the Hive Metastore, while IO access is performed through Alluxio.

    • Setup Java for Java 11, at least version 11.0.7, 64-bit,as required by Trino
    • Setup Python version 2.6.x, 2.7.x, or 3.x, as required by Trino
    • Deploy Trino. This guide is tested with .
    • Alluxio has been set up and is running.
    • Make sure that the Alluxio client jar is available. This Alluxio client jar file can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar in the tarball downloaded from Alluxio .
    • Make sure that Hive Metastore is running to serve metadata information of Hive tables.

    Trino gets the database and table metadata information (including file system locations) from the Hive Metastore, via Trino’s Hive connector. Here is a example Trino configuration file ${Trino_HOME}/etc/catalog/hive.properties, for a catalog using the Hive connector, where the metastore is located on localhost.

    Distribute the Alluxio client jar to all Trino servers

    In order for Trino to be able to communicate with the Alluxio servers, the Alluxio client jar must be in the classpath of Trino servers. Put the Alluxio client jar /<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar into the directory ${Trino_HOME}/plugin/hive-hadoop2/ (this directory may differ across versions) on all Trino servers. Restart the Trino workers and coordinator:

    1. $ ${Trino_HOME}/bin/launcher restart

    Create a Hive table on Alluxio

    Here is an example to create an internal table in Hive backed by files in Alluxio. You can download a data file (e.g., ml-100k.zip) from http://grouplens.org/datasets/movielens/. Unzip this file and upload the file u.user into /ml-100k/ in Alluxio:

    1. $ ./bin/alluxio fs mkdir /ml-100k
    2. $ ./bin/alluxio fs copyFromLocal /path/to/ml-100k/u.user alluxio:///ml-100k

    Create an external Hive table pointing to the Alluxio file location.

    1. hive> CREATE TABLE u_user (
    2. userid INT,
    3. age INT,
    4. gender CHAR(1),
    5. zipcode STRING)
    6. ROW FORMAT DELIMITED
    7. FIELDS TERMINATED BY '|'
    8. STORED AS TEXTFILE

    You can see the directory and files that Hive creates by viewing the Alluxio WebUI at http://master_hostname:19999

    Ensure your Hive Metastore service is running. Hive Metastore listens on port 9083 by default. If it is not running, execute the following command to start the metastore:

    Start Trino server

    Start your Trino server. Trino server runs on port 8080 by default (configurable with http-server.http.port in ${Trino_HOME}/etc/config.properties ):

    1. $ ${Trino_HOME}/bin/launcher run

    Query tables using Trino

    Follow to download the trino-cli-<Trino_VERSION>-executable.jar, rename it to trino, and make it executable with chmod +x (sometimes the executable trino exists in ${trino_HOME}/bin/trino and you can use it directly).

    Run a single query (replace localhost:8080 with your actual Trino server hostname and port):

    1. $ ./trino --server localhost:8080 --execute "use default; select * from u_user limit 10;" \
    2. --catalog hive --debug

    To configure additional Alluxio properties, you can append the conf path (i.e. ${ALLUXIO_HOME}/conf) containing alluxio-site.properties to Trino’s JVM config at etc/jvm.config under Trino folder. The advantage of this approach is to have all the Alluxio properties set within the same file of alluxio-site.properties.

    1. ...
    2. -Xbootclasspath/a:<path-to-alluxio-conf>

    Example: connect to Alluxio with HA

    If the Alluxio HA cluster uses internal leader election, set the Alluxio cluster property appropriately in the alluxio-site.properties file which is on the classpath.

    1. alluxio.master.rpc.addresses=master_hostname_1:19998,master_hostname_2:19998,master_hostname_3:19998

    Alternatively you can add the property to the Hadoop core-site.xml configuration which is contained by hive.config.resources.

    1. <configuration>
    2. <property>
    3. <name>alluxio.master.rpc.addresses</name>
    4. <value>master_hostname_1:19998,master_hostname_2:19998,master_hostname_3:19998</value>
    5. </property>

    For information about how to connect to Alluxio HA cluster using Zookeeper-based leader election, please refer to HA mode client configuration parameters.

    Example: change default Alluxio write type

    For example, change alluxio.user.file.writetype.default from default ASYNC_THROUGH to CACHE_THROUGH.

    One can specify the property in alluxio-site.properties and distribute this file to the classpath of each Trino node:

    1. alluxio.user.file.writetype.default=CACHE_THROUGH

    Alternatively, modify conf/hive-site.xml to include:

    Increase parallelism

    Trino’s Hive connector uses the config hive.max-split-size to control the parallelism of the query. For Alluxio 1.6 or earlier, it is recommended to set this size no less than Alluxio’s block size to avoid the read contention within the same block. For later Alluxio versions, this is no longer an issue because of Alluxio’s async caching abilities.

    Avoid Trino timeouts when reading large files

    It is recommended to increase to a bigger value (e.g 10min) to avoid a timeout failure when reading large files from remote workers.