Data Visualization

Each language on Cloudera Data Science Workbench has a visualization system that you can use to create plots, including rich HTML visualizations.

Simple Plots

To create a simple plot, run a console in your favorite language and paste in the following code sample:

R

# A standard R plot 
plot(rnorm(1000)) 

# A ggplot2 plot 
library("ggplot2") 
qplot(hp, mpg, data=mtcars, color=am,
      facets=gear~cyl, size=I(3),
      xlab="Horsepower", ylab="Miles per Gallon")

Python 2

import matplotlib.pyplot as plt
import random
plt.plot([random.normalvariate(0,1) for i in xrange(1000)])

Some libraries, such as matplotlib, display a new plot after each command runs. When you run a series of plotting commands, you therefore see an incomplete plot after every intermediate command, and the finished plot appears only after the last one. If this is not the desired behavior, an easy workaround is to put all the plotting commands in one Python function.
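
The workaround can be sketched as follows. This is a minimal Python 3 illustration rather than part of the original sample: the function names make_samples and plot_samples are hypothetical, and matplotlib is assumed to be installed in the engine. It is imported inside the function only so that the whole plotting sequence lives in one place.

```python
import random


def make_samples(n=1000, seed=None):
    """Return n draws from a standard normal distribution."""
    rng = random.Random(seed)
    return [rng.normalvariate(0, 1) for _ in range(n)]


def plot_samples(samples):
    """Run every plotting command in a single call, so the console
    executes them as one step instead of one command at a time."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    plt.plot(samples)
    plt.title("Normal random variates")
    plt.xlabel("Draw")
    plt.ylabel("Value")
```

In a session, calling plot_samples(make_samples()) should then produce the finished plot without the intermediate ones.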

With the Python 3 engine, matplotlib plots don't return the image of the plot, just the code. To work around this issue, insert %matplotlib inline at the start of your scripts. For example:

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([1,2,3])

Saved Images

You can also display images, using a command in the following format:

R

library("cdsw") 

download.file("https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png", "/cdn/Minard.png") 
image("Minard.png")

Python 2

import urllib
from IPython.display import Image
urllib.urlretrieve("http://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png", "Minard.png")

Image(filename="Minard.png")
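
In Python 3, urlretrieve lives in the urllib.request module instead of urllib. The following is a hedged Python 3 sketch of the same steps; the helper names fetch_image and show_image are hypothetical, and IPython is imported lazily so that the download helper can be tried on its own.

```python
import urllib.request


def fetch_image(url, filename):
    """Download url to filename and return the local path."""
    path, _headers = urllib.request.urlretrieve(url, filename)
    return path


def show_image(filename):
    """Return an IPython Image object for the console to render."""
    from IPython.display import Image  # same class the Python 2 sample uses

    return Image(filename=filename)
```

In a session, show_image(fetch_image("https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png", "Minard.png")) would mirror the Python 2 sample.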

HTML Visualizations

Your code can generate and display HTML. To create an HTML widget, paste in the following:

R

library("cdsw") 
html('<svg><circle cx="50" cy="50" r="50" fill="red" /></svg>')

Python 2

from IPython.display import HTML
HTML('<svg><circle cx="50" cy="50" r="50" fill="red" /></svg>')
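
Because the markup is an ordinary string, it can be generated from data before it is displayed. The following Python 3 sketch builds a small SVG bar chart; the function name svg_bars and its parameters are hypothetical and not part of any Cloudera API.

```python
def svg_bars(values, bar_width=20, gap=5, height=100, fill="red"):
    """Return an SVG string with one bar per value.

    values must be a non-empty sequence of non-negative numbers.
    """
    peak = max(values) or 1  # avoid dividing by zero when every value is 0
    rects = []
    for i, value in enumerate(values):
        bar_height = height * value / peak
        x = i * (bar_width + gap)
        y = height - bar_height  # the SVG y axis points down
        rects.append(
            '<rect x="{}" y="{:.1f}" width="{}" height="{:.1f}" fill="{}" />'.format(
                x, y, bar_width, bar_height, fill
            )
        )
    width = len(values) * (bar_width + gap)
    return '<svg width="{}" height="{}">{}</svg>'.format(width, height, "".join(rects))
```

In a session, passing the result to HTML from IPython.display, for example HTML(svg_bars([3, 7, 5])), should render the chart the same way as the circle above.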

Scala

Cloudera Data Science Workbench allows you to build visualization libraries for Scala using jvm-repr. The following example demonstrates how to register a custom HTML representation with the "text/html" mimetype in Cloudera Data Science Workbench. This output will render as HTML in your workbench session.

import jupyter.{Displayer, Displayers}
import scala.collection.JavaConverters._

// HTML representation
case class HTML(html: String)

//Register a displayer to render html 
Displayers.register(classOf[HTML],
  new Displayer[HTML] {
    override def display(html: HTML): java.util.Map[String, String] = {
      Map(
        "text/html" -> html.html
      ).asJava
    }
  })

val helloHTML = HTML("<h1> <em> Hello World </em> </h1>")
  
display(helloHTML)

IFrame Visualizations

Most visualizations require more than basic HTML. Embedding HTML directly in your console also risks conflicts between different parts of your code. The most flexible way to embed a web resource is using an IFrame:

R

library("cdsw")
iframe(src="https://www.youtube.com/embed/8pHzROP1D-w", width="854px", height="510px")

Python 2

from IPython.display import HTML
HTML('<iframe width="854" height="510" src="https://www.youtube.com/embed/8pHzROP1D-w"></iframe>')
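
When the source URL or the dimensions come from variables, it may be safer to build the tag with a helper that escapes attribute values. A minimal Python 3 sketch, using only the standard library; the function name iframe_markup is hypothetical.

```python
from html import escape


def iframe_markup(src, width=854, height=510):
    """Return an iframe tag for src with the given pixel dimensions."""
    return '<iframe width="{}" height="{}" src="{}"></iframe>'.format(
        int(width), int(height), escape(src, quote=True)
    )
```

The returned string can be passed to HTML from IPython.display in the same way as the literal tag above.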

You can generate HTML files within your console and display them in IFrames using the /cdn folder. The /cdn folder persists and serves static assets generated by your engine runs. For instance, you can embed a full HTML file with IFrames.

R

library("cdsw") 
f <- file("/cdn/index.html") 
html.content <- paste("<p>Here is a normal random variate:", rnorm(1), "</p>") 
writeLines(c(html.content), f) 
close(f) 
iframe("index.html")

Python 2

from IPython.display import HTML
import random

html_content  = "<p>Here is a normal random variate: %f </p>" % random.normalvariate(0,1)

# Close the file before displaying it so the content is flushed to disk
with open("/cdn/index.html", "w") as f:
    f.write(html_content)
HTML('<iframe src="index.html"></iframe>')
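
The same pattern can be wrapped in a small helper. The following Python 3 sketch is an illustration only: the name write_cdn_page and its cdn_dir parameter are hypothetical, and the parameter exists so that the function can be tried outside an engine, where /cdn is not available.

```python
import os


def write_cdn_page(html_content, name="index.html", cdn_dir="/cdn"):
    """Write html_content to cdn_dir/name and return iframe markup for it."""
    path = os.path.join(cdn_dir, name)
    with open(path, "w") as f:  # the with block closes the file when done
        f.write(html_content)
    # The src is relative, as in the samples above.
    return '<iframe src="{}"></iframe>'.format(name)
```

In a session, HTML(write_cdn_page(html_content)) would write the page and display it in one step.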

Cloudera Data Science Workbench uses this feature to support many rich plotting libraries such as htmlwidgets, Bokeh, and Plotly.

Grid Displays

Cloudera Data Science Workbench supports native grid displays of DataFrames across several languages.

Python 3

To display pandas DataFrames as grids, enable the table schema option once in each session:

import pandas as pd
pd.options.display.html.table_schema = True
pd.DataFrame(data=[range(1,100)])

For PySpark DataFrames, use pandas and run df.toPandas() on a PySpark DataFrame. This will bring the DataFrame into local memory as a pandas DataFrame.
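
Because toPandas() loads every row into the memory of the session, it can help to cap the number of rows first. A minimal sketch, assuming a PySpark DataFrame as input; the helper name to_local and the default of 1000 rows are hypothetical choices.

```python
def to_local(spark_df, max_rows=1000):
    """Return at most max_rows rows of spark_df as a pandas DataFrame."""
    # limit() is evaluated by Spark, so only max_rows rows are collected.
    return spark_df.limit(max_rows).toPandas()
```

With the pandas option above enabled, evaluating to_local(df) in a session should display the result as a grid.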

R

In R, DataFrames will display as grids by default. For example, to view the Iris data set, you would just use:

iris

Similar to PySpark, bringing Sparklyr data into local memory with as.data.frame will output a grid display.

sparkly_df %>% as.data.frame

Scala

Calling the display() function on an existing DataFrame triggers a collect, much like df.show().

val df = sc.parallelize(1 to 100).toDF()
display(df)

Documenting Your Analysis

Cloudera Data Science Workbench supports Markdown documentation of your code written in comments. This allows you to generate reports directly from valid Python and R code that runs anywhere, even outside Cloudera Data Science Workbench. To add documentation to your analysis, create comments in Markdown format:

R

# Heading
# -------
#
# This documentation is **important.**
#
# Inline math: $e^x$
#
# Display math: $$y = \Sigma x + \epsilon$$

print("Now the code!")

Python

# Heading
# -------
#
# This documentation is **important.**
#
# Inline math: $e^x$
#
# Display math: $$y = \Sigma x + \epsilon$$

print("Now the code!")