Running Deep Learning Frameworks on Alluxio
With the age of growing datasets and increased computing power, deep learning has become a popular technique for AI. Deep learning models continue to improve their performance across a variety of domains, with access to more and more data, and the processing power to train larger neural networks. This rise of deep learning advances the state-of-the-art for AI, but also exposes some challenges for the access to data and storage systems. In this page, we further describe the storage challenges for deep learning workloads and show how Alluxio can help to solve them.
Deep learning has become popular for machine learning because of the availability of large amounts of data, where more data typically leads to better performance. However, there is no guarantee that all the training data are available to deep learning frameworks (Tensorflow, Caffe, torch). For example, deep learning frameworks have integrations with some existing storage systems, but not all storage integrations are available. Therefore, a subset of the data may be inaccessible for training, resulting in lower performance and effectiveness.
There is a variety of storage options for users, with distributed storage systems (HDFS, ceph) and cloud storage (AWS S3, Azure Blob Store, Google Cloud Storage) becoming popular. Practitioners who normally interact with a local file system are generally unfamiliar with these distributed and remote storage systems. It can be difficult to properly configure and use new and different tools for each storage system. This makes accessing data from diverse systems difficult for deep learning.
While there are several data management related issues with deep learning, Alluxio can help with the challenge of accessing data. Alluxio in its simplest form is a virtual file system which transparently connects to existing storage systems and presents them as a single system to users. Using Alluxio’s unified namespace, many storage technologies can be mounted into Alluxio, including cloud storage like S3, Azure, and GCS. Because Alluxio can already integrate with storage systems, deep learning frameworks only need to interact with Alluxio to be able to access data from any connected storage. This opens the door for training to be performed on all data from any data source, which can lead to better model performance.
Alluxio also includes a FUSE interface for a convenient and familiar use experience. With , an Alluxio instance can be mounted to the local file system, so interacting with Alluxio is as simple as interacting with local files and directories. This enables users to continue to use familiar tools and paradigms to interact with their data. Since Alluxio can connect to multiple disparate storages, data from any storage can be accessed like a local file or directory.
We use Tensorflow as an example deep learning framework in this page to show how Alluxio can help data access and management. We run Tensorflow benchmarks on Alluxio Fuse as described in Alluxio Tensorflow docs.
After mounting the under storage once, data in various under storages becomes immediately available through Alluxio and can be transparently accessed to the benchmark without any modification to either Tensorflow or the benchmark scripts. This greatly simplifies the application development, which otherwise would need the integration of each particular storage system as well as the configurations of the credentials.
Alluxio also brings performance benefits. The benchmark evaluates the throughput of the training model from the input training images in units of images/sec. The training involves three stages of utilizing different resources:
- Image Processing (CPU): Decode image records into images, preprocess, and organize into mini-batches.