Alluxio Cache Service
Alluxio File System serves the Presto Hive Connector as an independent distributed caching file system on top of HDFS or object stores such as AWS S3, GCP, and Azure blob store. Users can inspect cache usage and control the cache explicitly through a file system interface. For example, one can preload all files in an Alluxio directory to warm the cache for Presto queries, and set a TTL (time-to-live) on cached data to reclaim cache capacity.
Alluxio Structured Data Service interacts with Presto through both a catalog and a caching file system based on Option 1. This option provides additional benefits on top of Option 1: seamless access to existing Hive tables without modifying table locations in Hive Metastore, and further performance optimization by consolidating many small files or transforming the formats of input files.
Presto Hive connector can connect to AlluxioFileSystem as a Hadoop-compatible file system, on top of other persistent storage systems.
First, configure Presto to use the Hive connector.
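As a minimal sketch of this step, the following writes a Hive catalog file for Presto. The location etc/catalog/hive.properties and the connector.name and hive.metastore.uri properties follow the usual Presto Hive connector layout. The metastore address thrift://localhost:9083 and the scratch-directory fallback for PRESTO_HOME are assumptions for illustration, so adjust them to your deployment.

```shell
# Assumption: PRESTO_HOME points at the Presto installation. A scratch
# directory is used as a fallback so the sketch can be tried safely.
PRESTO_HOME="${PRESTO_HOME:-$(mktemp -d)}"
CATALOG_FILE="${PRESTO_HOME}/etc/catalog/hive.properties"

mkdir -p "$(dirname "${CATALOG_FILE}")"

# Assumption: the Hive Metastore listens on localhost:9083.
# Note: this overwrites an existing hive.properties at that location.
cat > "${CATALOG_FILE}" <<'EOF'
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
EOF

echo "Wrote ${CATALOG_FILE}"
```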
Second, ensure the Alluxio client jar is already in ${PRESTO_HOME}/plugin/hive-hadoop2/ on all Presto servers. If this is not the case, download Alluxio, extract the tarball to ${ALLUXIO_HOME}, and copy the Alluxio client jar ${ALLUXIO_HOME}/client/alluxio-<VERSION>-client.jar into this directory. Then restart the Presto service:
$ ${PRESTO_HOME}/bin/launcher restart
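The check-and-copy part of this step can be sketched as a small shell function to run on each Presto server before the restart. The function name ensure_alluxio_client_jar is hypothetical. It relies only on the directory layout described above and assumes the jar file name matches alluxio-*-client.jar.

```shell
# Copy the Alluxio client jar into the Hive plugin directory unless a client
# jar is already there. Arguments: <alluxio_home> <plugin_dir>.
ensure_alluxio_client_jar() {
  local alluxio_home="$1"
  local plugin_dir="$2"
  local jar

  # If any alluxio-*-client.jar is already present, there is nothing to do.
  for jar in "${plugin_dir}"/alluxio-*-client.jar; do
    if [ -e "${jar}" ]; then
      echo "already present: ${jar}"
      return 0
    fi
  done

  # Otherwise copy the first client jar found under the Alluxio installation.
  for jar in "${alluxio_home}"/client/alluxio-*-client.jar; do
    if [ -e "${jar}" ]; then
      mkdir -p "${plugin_dir}"
      cp "${jar}" "${plugin_dir}/"
      echo "copied: ${jar}"
      return 0
    fi
  done

  echo "no Alluxio client jar found under ${alluxio_home}/client" >&2
  return 1
}

# Usage on each Presto server (uncomment once both variables are set):
# ensure_alluxio_client_jar "${ALLUXIO_HOME}" "${PRESTO_HOME}/plugin/hive-hadoop2"
```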
Third, configure Hive Metastore to connect to Alluxio File System when serving Presto. Edit ${HIVE_HOME}/conf/hive-env.sh to include the Alluxio client jar on the Hive classpath:
export HIVE_AUX_JARS_PATH=${ALLUXIO_HOME}/client/alluxio-<VERSION>-client.jar
After completing the basic configuration, Presto should be able to access Alluxio File System through tables whose locations point to alluxio:// addresses. Refer to the Hive Connector documentation to learn how to configure Alluxio file system in Presto. Here is a simple example that starts a local Alluxio cluster and mounts an example S3 location read-only:
$ bin/alluxio-start.sh local -f
$ bin/alluxio fs mount --readonly /example \
s3://apc999/presto-tutorial/example-reason/
Start a Presto CLI connecting to the server started in the previous step.
Download the Presto CLI executable jar, rename it to presto, make it executable with chmod +x, then run it:
$ ./presto --server localhost:8080 --catalog hive --debug
presto> use default;
Create a new table based on the file mounted in Alluxio:
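A minimal sketch of such a statement, run non-interactively through the Presto CLI's --execute option. The column names come from the query output in this example. The column types, the PARQUET format, and the Alluxio master address localhost:19998 (Alluxio's default master RPC port) are assumptions, so adjust them to match the mounted files.

```shell
# Assumptions: column types, file format, and master address are illustrative.
CREATE_REASON_SQL="
CREATE TABLE reason (
  r_reason_sk integer,
  r_reason_id varchar,
  r_reason_desc varchar
) WITH (
  external_location = 'alluxio://localhost:19998/example',
  format = 'PARQUET'
)"

if [ -x ./presto ]; then
  # Run the statement against the hive catalog, default schema.
  ./presto --server localhost:8080 --catalog hive --schema default \
    --execute "${CREATE_REASON_SQL}"
else
  # Without the CLI in the current directory, print the statement so it can
  # be pasted at the presto:default> prompt (terminated with a semicolon).
  printf '%s;\n' "${CREATE_REASON_SQL}"
fi
```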
Scan the newly created table on Alluxio:
presto:default> SELECT * FROM reason LIMIT 3;
 r_reason_sk | r_reason_id      | r_reason_desc
-------------+------------------+---------------------------------
           1 | AAAAAAAABAAAAAAA | Package was damaged
           4 | AAAAAAAAEAAAAAAA | Not the product that was ordred
           5 | AAAAAAAAFAAAAAAA | Parts missing
With Alluxio File System, this approach supports the following features:
Read/write Types and Data Policies: Users can customize read and write modes for Presto when reading from and writing to Alluxio. For example, users can tell Presto to skip caching data when reading from certain locations to avoid cache thrashing, or set TTLs on files in given locations using alluxio fs setTtl.
Check Working Set: Users can verify which files are cached to understand and optimize Presto performance. For example, users can check the output of the Alluxio command line, or browse the corresponding files in the Alluxio WebUI.
Check Resource Utilization: System admins can monitor how much of the cache capacity on each node is used and plan resources accordingly.
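These features map onto a few Alluxio shell commands. The sketch below is illustrative only: the path /data/presto-tables is hypothetical, the one-day TTL and the free action are arbitrary choices, and the commands (fs load, fs setTtl, fs ls, fsadmin report capacity) are assumed to be those of the Alluxio 2.x command line, so check them against the version you run.

```shell
# Assumption: ALLUXIO_HOME points at an Alluxio 2.x installation.
ALLUXIO_BIN="${ALLUXIO_HOME:-/opt/alluxio}/bin/alluxio"

# Hypothetical Alluxio directory holding data that Presto queries.
CACHE_PATH="/data/presto-tables"

# TTL of one day, expressed in milliseconds.
TTL_MS=$((24 * 60 * 60 * 1000))

if [ -x "${ALLUXIO_BIN}" ]; then
  # Warm the cache by loading every file under the directory into Alluxio.
  "${ALLUXIO_BIN}" fs load "${CACHE_PATH}"

  # Free cached copies (the files themselves are kept) once the TTL expires.
  "${ALLUXIO_BIN}" fs setTtl --action free "${CACHE_PATH}" "${TTL_MS}"

  # Working set: the listing shows what percentage of each file is in Alluxio.
  "${ALLUXIO_BIN}" fs ls -R "${CACHE_PATH}"

  # Resource utilization: used and total cache capacity per worker.
  "${ALLUXIO_BIN}" fsadmin report capacity
else
  echo "alluxio not found at ${ALLUXIO_BIN}; skipping" >&2
fi
```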
In addition to caching data as a file system, Alluxio can further provide data abstracted as tables via the Alluxio Structured Data Service. The Alluxio catalog is the main component responsible for managing structured data metadata and caching that information from the underlying table metastore (such as Hive Metastore). After an existing table metastore is attached to the Alluxio catalog, the catalog caches the table metadata from the underlying metastore and serves it to Presto. When Presto accesses the Alluxio catalog for table metadata, the catalog automatically uses the Alluxio locations of the files, which removes the need to modify any locations in the existing Hive Metastore. Therefore, when Presto is using the Alluxio catalog, the table metadata is cached in the catalog, and the file contents are cached with Alluxio’s file system caching.
For example, a user can attach an existing Hive Metastore to the Alluxio catalog:
$ ./bin/alluxio table attachdb hive thrift://METASTORE_HOSTNAME:9083 hive_db_name
Then configure a Presto catalog to connect to the Alluxio catalog:
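A minimal sketch of such a catalog file, based on the Presto Hive connector's Alluxio metastore settings (hive.metastore=alluxio and hive.metastore.alluxio.master.address). The catalog file name catalog_alluxio.properties and the scratch-directory fallback for PRESTO_HOME are assumptions for illustration, and HOSTNAME:PORT is a placeholder for the Alluxio master address that must be filled in.

```shell
# Assumption: PRESTO_HOME points at the Presto installation. A scratch
# directory is used as a fallback so the sketch can be tried safely.
PRESTO_HOME="${PRESTO_HOME:-$(mktemp -d)}"
ALLUXIO_CATALOG_FILE="${PRESTO_HOME}/etc/catalog/catalog_alluxio.properties"

mkdir -p "$(dirname "${ALLUXIO_CATALOG_FILE}")"

# HOSTNAME:PORT is a placeholder; replace it with the Alluxio master address.
cat > "${ALLUXIO_CATALOG_FILE}" <<'EOF'
connector.name=hive-hadoop2
hive.metastore=alluxio
hive.metastore.alluxio.master.address=HOSTNAME:PORT
EOF

echo "Wrote ${ALLUXIO_CATALOG_FILE}"
```

After Presto is restarted, tables in the attached database should be reachable through this catalog, with table metadata served by the Alluxio catalog rather than directly by Hive Metastore.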