Accessing OpenStack Swift from Spark
Spark’s support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
same URI formats as in Hadoop. You can specify a path in Swift as input through a
URI of the form
swift://container.PROVIDER/path. You will also need to set your
Swift security credentials, through
core-site.xml or via
The current Swift driver requires Swift to use the Keystone authentication method, or
its Rackspace-specific predecessor.
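As a minimal sketch of the URI form described above, the following Python helper assembles a `swift://container.PROVIDER/path` string. The function name `swift_uri` and the sample container and path are hypothetical; the only rule taken from this document is that PROVIDER is an alphanumeric name.

```python
import re


def swift_uri(container: str, provider: str, path: str = "") -> str:
    """Build a URI of the form swift://container.PROVIDER/path."""
    if not container:
        raise ValueError("container must not be empty")
    # PROVIDER can be any alphanumeric name (see the configuration section).
    if not re.fullmatch(r"[A-Za-z0-9]+", provider):
        raise ValueError(f"provider must be alphanumeric, got {provider!r}")
    # Drop leading slashes so the result has exactly one slash after the host part.
    return f"swift://{container}.{provider}/{path.lstrip('/')}"


# Hypothetical container "logs" and object path, for illustration only.
print(swift_uri("logs", "SparkTest", "2015/01/events.txt"))
# prints: swift://logs.SparkTest/2015/01/events.txt
```

The resulting string can be used wherever Spark accepts a Hadoop path, for example as the argument to `SparkContext.textFile`.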
Configuring Swift for Better Data Locality
Although not mandatory, configuring the Swift proxy server with
list_endpoints is recommended for better data locality. More information is
The Spark application should include the
hadoop-openstack dependency, which can
be added by including the
hadoop-cloud module for the specific version of Spark used.
For example, for Maven support, add the following to the
core-site.xml and place it inside Spark’s
The main parameters to configure are the authentication parameters
required by Keystone.
The following table lists the Keystone parameters and whether each one is mandatory.
PROVIDER can be
any (alphanumeric) name.
||Keystone Authentication URL||Mandatory|
||Keystone endpoints prefix||Optional|
||Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints||Mandatory|
For example, assume
PROVIDER=SparkTest and Keystone contains user
tester with password
defined for tenant
core-site.xml should include:
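A file of this shape could be generated with a short script. The sketch below is only illustrative: the helper `core_site_xml` is hypothetical, the property suffixes other than `password` (`auth.url`, `tenant`, `username`, `public`) are assumed to follow the same `fs.swift.service.PROVIDER.*` naming pattern, and the URL, tenant and password values are placeholders rather than the values of the example above.

```python
import xml.etree.ElementTree as ET


def core_site_xml(provider: str, settings: dict) -> str:
    """Render a Hadoop <configuration> document for one Swift provider."""
    root = ET.Element("configuration")
    for suffix, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = f"fs.swift.service.{provider}.{suffix}"
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# All values below are placeholders; substitute those of your Keystone setup.
xml_text = core_site_xml("SparkTest", {
    "auth.url": "http://keystone.example.com:5000/v2.0/tokens",
    "tenant": "TENANT_PLACEHOLDER",
    "username": "tester",
    "password": "PASSWORD_PLACEHOLDER",
    "public": "true",
})
print(xml_text)
```

Generating the file from a dictionary keeps the provider name in one place, so renaming PROVIDER does not require editing every property by hand.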
fs.swift.service.PROVIDER.password contains sensitive information and keeping them in
core-site.xml is not always a good approach.
We suggest keeping those parameters in
core-site.xml for testing purposes when running Spark
For job submissions they should be provided via
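One possible way to keep credentials out of core-site.xml at submission time, sketched here under stated assumptions, relies on Spark copying any property prefixed with `spark.hadoop.` into the Hadoop configuration. The helper `swift_submit_conf` and the environment variable names are hypothetical, and the `tenant` and `username` suffixes are assumed to follow the same pattern as `fs.swift.service.PROVIDER.password`.

```python
import os


def swift_submit_conf(provider: str, env=os.environ) -> dict:
    """Map credentials from environment variables to spark.hadoop.* properties."""
    # Hypothetical variable names; pick whatever your deployment uses.
    sources = {
        "tenant": "SWIFT_TENANT",
        "username": "SWIFT_USERNAME",
        "password": "SWIFT_PASSWORD",
    }
    conf = {}
    for suffix, var in sources.items():
        if var not in env:
            raise KeyError(f"environment variable {var} is not set")
        conf[f"spark.hadoop.fs.swift.service.{provider}.{suffix}"] = env[var]
    return conf


# Demonstration with a fake environment instead of real secrets.
fake_env = {"SWIFT_TENANT": "t", "SWIFT_USERNAME": "tester", "SWIFT_PASSWORD": "secret"}
for key, value in swift_submit_conf("SparkTest", fake_env).items():
    print(key, "=", value)
```

Each resulting key and value pair can then be set on a `SparkConf` with `set(key, value)` before the context is created.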