Importing Data into Cloudera Data Science Workbench

To work with Cloudera Data Science Workbench, you must import data from local files, Apache HBase, Apache Kudu, Apache Impala, Apache Hive, or other external databases and data stores such as Amazon S3.

Uploading Data From Your Computer

If you want to create a new project around one or more data files on your computer, select the Local option when creating the project.

To add data files from your computer to an existing project, click Upload in the Project Overview page.

Accessing Data from HDFS

There are many ways to access HDFS data from R, Python, and Scala libraries. For example, see Example: Reading Data from HDFS (Wordcount) for an example of reading data into a Spark 2 program.
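
As an illustration, a minimal sketch of reading a file from HDFS with PySpark might look like the following. The application name and the path /tmp/fruits.txt are placeholders; substitute an HDFS path that exists on your cluster.

# Read an HDFS file with PySpark (assumes a Spark 2 session;
# '/tmp/fruits.txt' is a placeholder HDFS path).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-example").getOrCreate()

# Load the file as an RDD of lines and print the first few.
lines = spark.sparkContext.textFile("/tmp/fruits.txt")
print(lines.take(5))

spark.stop()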

You can also use HDFS command-line tools from the terminal or by executing system commands in your code. See the HDFS CLI documentation for details.
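
For example, from a Python session you can call the hdfs CLI through system commands; the paths below are placeholders you would replace with your own.

# List an HDFS directory and copy a local file into it
# (both paths are placeholders; adjust them for your cluster).
!hdfs dfs -ls /tmp
!hdfs dfs -put data/faithful.csv /tmp/faithful.csv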

Accessing Data in Amazon S3 Buckets

Every language in Cloudera Data Science Workbench has libraries available for uploading to and downloading from Amazon S3.

To work with S3:

  1. Add your Amazon Web Services access keys to your project's environment variables as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  2. Use the code sample below, which uses Python and the Boto library to download the R 'Old Faithful' dataset from S3.

# Install Boto in the project environment.
!pip install boto

# Create the Boto S3 connection object; it reads the AWS keys from the
# environment variables set in step 1.
from boto.s3.connection import S3Connection
aws_connection = S3Connection()

# Download the dataset to the file 'faithful.csv'.
bucket = aws_connection.get_bucket('sense-files')
key = bucket.get_key('faithful.csv')
key.get_contents_to_filename('/home/cdsw/faithful.csv')
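
Uploading works in much the same way. The sketch below writes the downloaded file back to S3 with Boto, reusing the aws_connection object from the sample above; the bucket name and key are hypothetical placeholders, and you need write permission on the bucket you choose.

# Upload a local file to S3 (bucket name and key are placeholders).
from boto.s3.key import Key
out_bucket = aws_connection.get_bucket('my-output-bucket')
out_key = Key(out_bucket)
out_key.key = 'results/faithful-copy.csv'
out_key.set_contents_from_filename('/home/cdsw/faithful.csv')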

Accessing External SQL Databases

Every language in Cloudera Data Science Workbench has multiple client libraries available for SQL databases.

If your database is behind a firewall or on a secure server, you can connect to it by creating an SSH tunnel to the server, then connecting to the database on localhost.
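
For example, one way to open such a tunnel from Python is with the third-party sshtunnel package. The host name, user, and ports below are assumptions you would replace with your own, and the SSH connection still needs a key or password configured.

# A minimal sketch of an SSH tunnel to a remote PostgreSQL server
# (host, user, and ports are placeholders).
!pip install sshtunnel
from sshtunnel import SSHTunnelForwarder

tunnel = SSHTunnelForwarder(
    "db-gateway.example.com",                  # SSH server that can reach the database
    ssh_username="cdswuser",
    remote_bind_address=("127.0.0.1", 5432),   # database address as seen from the SSH server
    local_bind_address=("127.0.0.1", 5432),    # forwarded port on localhost
)
tunnel.start()
# ... connect to the database on localhost:5432 as shown in the examples below ...
tunnel.stop()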

If the database is password-protected, consider storing the password in an environment variable to avoid displaying it in your code or in consoles. The examples below show how to retrieve the password from an environment variable and use it to connect.

Accessing Data From R


# db.r lets you make direct SQL queries.
library("db.r")
db <- DB(username="cdswuser", hostname="localhost", port=5432, dbname="test_db", dbtype="postgres", password=Sys.getenv("POSTGRESQL_PASSWORD"))
db$query("select user_id, user_name from users")

# dplyr lets you program the same way with local data frames and remote SQL databases.
library("dplyr")
db <- src_postgres(dbname="test_db", host="localhost", port=5432, user="cdswuser", password=Sys.getenv("POSTGRESQL_PASSWORD"))
flights_table <- tbl(db, "flights")
select(flights_table, year:day, dep_delay, arr_delay)

Accessing Data From Python

You can access data using pyodbc or SQLAlchemy.


# pyodbc lets you make direct SQL queries.
!pip install pyodbc
import os
import pyodbc

# See your ODBC driver's documentation for information on how to construct
# connection strings.
db = pyodbc.connect("DRIVER={PostgreSQL Unicode};SERVER=localhost;PORT=5432;DATABASE=test_db;USER=cdswuser;OPTION=3;PASSWORD=%s" % os.environ["POSTGRESQL_PASSWORD"])
cursor = db.cursor()
cursor.execute("select user_id, user_name from users")

# sqlalchemy is an object-relational mapper that lets you make database queries in a more Pythonic way.
!pip install sqlalchemy
import os
import sqlalchemy
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine

# The postgresql:// URL below also requires a PostgreSQL driver such as psycopg2.
db = create_engine("postgresql://cdswuser:%s@localhost:5432/test_db" % os.environ["POSTGRESQL_PASSWORD"])

# sessionmaker returns a session factory; call it to open a session.
Session = sessionmaker(bind=db)
session = Session()

# 'User' is an ORM-mapped class for the 'users' table, defined elsewhere in your code.
user = session.query(User).filter_by(name='ed').first()