Importing Data into Cloudera Data Science Workbench

To work with Cloudera Data Science Workbench, you must import data from local files, Apache HBase, Apache Kudu, Apache Impala, Apache Hive, or other external databases and data stores such as Amazon S3.

Uploading Data From Your Computer

If you want to create a new project around one or more data files on your computer, select the Local option when creating the project.

To add data files from your computer to an existing project, click Upload in the Project Overview page.

Accessing Data from HDFS

There are many ways to access HDFS data from R, Python, and Scala libraries. For an example of reading data into a Spark 2 program, see Example: Reading Data from HDFS (Wordcount).
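As an illustration, a minimal PySpark sketch along the following lines reads a CSV file from HDFS into a DataFrame. This assumes a Spark 2 session is available to the engine, and the HDFS path is a hypothetical placeholder.

Python

# A minimal sketch, assuming Spark 2 is configured for this session.
# The HDFS path below is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-example").getOrCreate()
df = spark.read.csv("/user/cdsw/sample.csv", header=True, inferSchema=True)
df.show(5)
spark.stop()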

You can also use HDFS command line tools from the terminal or by executing system commands in your code. See the documentation at HDFS CLI.
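For example, a Python session in the workbench can invoke the HDFS CLI directly; the paths below are hypothetical placeholders.

Python

# List a directory in HDFS and copy a file into the project filesystem.
# Both paths are hypothetical placeholders.
!hdfs dfs -ls /user/cdsw
!hdfs dfs -copyToLocal /user/cdsw/sample.csv /home/cdsw/sample.csv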

Using Impyla to Access Data from Impala

Impyla is a Python client that can be used to execute queries on Impala tables. It communicates with Impala using the same standard Impala protocol as the ODBC/JDBC drivers. For more details on Impyla, refer to the Impyla project documentation on GitHub and this Cloudera blog post.

To begin, click Open Workbench, launch a Python session, and use the workbench prompt to install the following dependencies in your Cloudera Data Science Workbench project. These are required so that Impyla can work with Hive and Kerberos.

Python 2

!pip install thrift==0.9.3
!pip install thrift_sasl
!pip install impyla
Then, create a new project file and use the following sample code to establish a connection to an Impala daemon on the CDH cluster, and run the SHOW TABLES query on Impala's default database.

Python 2

#impyla_example.py

import os

# Specify IMPALA_HOST as an environment variable in your project settings
IMPALA_HOST = os.getenv('IMPALA_HOST', '<FQDN_Impala_daemon_host>')

import pandas
from impala.dbapi import connect
from impala.util import as_pandas

# Connect to Impala using Impyla
#
# * If you have not already established your Kerberos credentials in CDSW, do so before running this script.
# * Remove auth_mechanism and use_ssl parameters on non-secure clusters.
conn = connect(host=IMPALA_HOST,
               port=21050,
               auth_mechanism='GSSAPI',
               use_ssl=True)

# Get the available tables
cursor = conn.cursor()
cursor.execute('SHOW TABLES')

# Pretty output using Pandas
tables = as_pandas(cursor)
tables
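Once the connection is established, you can run any other query the same way and load the result into a pandas DataFrame. The table name below is a hypothetical placeholder.

Python 2

# Query a hypothetical table and load the result into pandas.
cursor.execute('SELECT * FROM sample_table LIMIT 100')
df = as_pandas(cursor)
df.head()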

Accessing Data in Amazon S3 Buckets

Every language in Cloudera Data Science Workbench has libraries available for uploading to and downloading from Amazon S3.

To work with S3:

  1. Add your Amazon Web Services access keys to your project's environment variables as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  2. Pick your favorite language from the code samples below. Each one downloads the R 'Old Faithful' dataset from S3.

R

library("devtools") 
install_github("armstrtw/AWS.tools") 

Sys.setenv("AWSACCESSKEY"=Sys.getenv("AWS_ACCESS_KEY_ID")) 
Sys.setenv("AWSSECRETKEY"=Sys.getenv("AWS_SECRET_ACCESS_KEY")) 

library("AWS.tools") 

s3.get("s3://sense-files/faithful.csv")

Python

# Install Boto to the project
!pip install boto

# Create the Boto S3 connection object.
from boto.s3.connection import S3Connection
aws_connection = S3Connection()
        
# Download the dataset to file 'faithful.csv'.
bucket = aws_connection.get_bucket('sense-files')
key = bucket.get_key('faithful.csv')
key.get_contents_to_filename('/home/cdsw/faithful.csv')
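Uploads follow the same pattern. The sketch below assumes you have write access to a hypothetical bucket named my-bucket.

Python

# Upload the downloaded file to a hypothetical bucket you can write to.
out_bucket = aws_connection.get_bucket('my-bucket')
out_key = out_bucket.new_key('faithful-copy.csv')
out_key.set_contents_from_filename('/home/cdsw/faithful.csv')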

Accessing External SQL Databases

Every language in Cloudera Data Science Workbench has multiple client libraries available for SQL databases.

If your database is behind a firewall or on a secure server, you can connect to it by creating an SSH tunnel to the server, then connecting to the database on localhost.
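For example, you could open the tunnel from a workbench session before connecting. This is only a sketch: it assumes key-based SSH access from the engine, and the host names are hypothetical placeholders.

Python

# Forward local port 5432 to a hypothetical PostgreSQL server reachable from a
# hypothetical SSH gateway, then connect to localhost:5432 as shown below.
!ssh -f -N -L 5432:db-server.example.com:5432 cdswuser@gateway.example.com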

If the database is password-protected, consider storing the password in an environment variable to avoid displaying it in your code or in consoles. The examples below show how to retrieve the password from an environment variable and use it to connect.

Accessing Data From R

R

# db.r lets you make direct SQL queries. 
install.packages("db.r") 
library("db.r") 
db <- DB(username="cdswuser", hostname="localhost", port=5432, dbname="test_db", dbtype="postgres", password=Sys.getenv("POSTGRESQL_PASSWORD")) 
db$query("select user_id, user_name from users") 

# dplyr lets you program the same way with local data frames and remote SQL databases. 
install.packages("dplyr") 
library("dplyr") 
db <- src_postgres(dbname="test_db", host="localhost", port=5432, user="cdswuser", password=Sys.getenv("POSTGRESQL_PASSWORD")) 
flights_table <- tbl(db, "flights") 
select(flights_table, year:day, dep_delay, arr_delay) 

Accessing Data From Python

You can access data using pyodbc or SQLAlchemy.

Python

# pyodbc lets you make direct SQL queries.
# Install the pyodbc package to the project. Using it also requires an ODBC
# driver manager and the PostgreSQL ODBC driver to be installed on the host.
!pip install pyodbc

import os
import pyodbc

# See http://www.connectionstrings.com/ for information on how to construct ODBC connection strings.
db = pyodbc.connect("DRIVER={PostgreSQL Unicode};SERVER=localhost;PORT=5432;DATABASE=test_db;USER=cdswuser;OPTION=3;PASSWORD=%s" % os.environ["POSTGRESQL_PASSWORD"])
cursor = db.cursor()
cursor.execute("select user_id, user_name from users")

# sqlalchemy is an object relational mapper (ORM) that lets you make database queries in a more Pythonic way.
!pip install sqlalchemy psycopg2
import os

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
db = create_engine("postgresql://cdswuser:%s@localhost:5432/test_db" % os.environ["POSTGRESQL_PASSWORD"])

# Map the users table to a Python class.
Base = declarative_base()
class User(Base):
    __tablename__ = 'users'
    user_id = Column(Integer, primary_key=True)
    user_name = Column(String)

# Create a session factory bound to the engine, then open a session and query.
Session = sessionmaker(bind=db)
session = Session()
user = session.query(User).filter_by(user_name='ed').first()
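If you only need query results as a data frame rather than mapped objects, pandas can read through the same SQLAlchemy engine. This sketch reuses the users table and the engine created above.

Python

# Load query results into a pandas DataFrame using the engine created above.
import pandas as pd
users_df = pd.read_sql("select user_id, user_name from users", db)
users_df.head()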