# The Data Science Pipeline: Visualization

Data visualization is a way to leverage your visual cortex to gain insight into data. Because vision is such a rich and well-developed interface between the human mind and the external world, visualization is a critical tool for understanding and communicating data ideas.

The standard graphics library in Python is Matplotlib, but here we will use a newer package called *Plotly*. Plotly offers a number of material advantages relative to Matplotlib, including figures that support interactions like mouseovers and animations.

If you use Plotly in a Jupyter notebook, the figures will automatically display in an interactive form, so we recommend following along in a Jupyter notebook in a separate tab. On this page, however, we will use the function `show` defined in the cell below to display the figures as static images.

    from datagymnasia import show
    print("Success!")

## Scatter plot

We can visualize the relationship between two columns of numerical data by associating them with the horizontal and vertical axes of the Cartesian plane and drawing a point in the figure for each observation. This is called a **scatter plot**. In Plotly Express, scatter plots are created using the `px.scatter` function. The columns to associate with the two axes are identified by name using the keyword arguments `x` and `y`.

    import plotly.express as px
    import pydataset

    iris = pydataset.data('iris')
    show(px.scatter(iris, x='Sepal.Width', y='Sepal.Length'))

An **aesthetic** is any visual property of a plot object. For example, horizontal position is an aesthetic, since we can visually distinguish objects based on their horizontal position in a graph. We call horizontal position the `x` aesthetic. Similarly, the `y` aesthetic represents vertical position.

We say that the `x='Sepal.Width'` argument *maps* the `'Sepal.Width'` variable to the `x` aesthetic. We can map other variables to other aesthetics with further keyword arguments, like `color` and `symbol`:

    show(px.scatter(iris, x='Sepal.Width', y='Sepal.Length',
                    color='Species', symbol='Species'))

Note that we mapped the same categorical variable (`'Species'`) to both the `color` and `symbol` aesthetics.

**Exercise**

Create a new data frame by appending a new column called "area" which is computed as the product of petal length and petal width. Map this new column to the `size` aesthetic (keeping `x`, `y`, and `color` the same as above). Which species of flowers has the smallest petal area?

*Solution.* We use the `assign` method to add the suggested column, and we include an additional keyword argument to map the new column to the `size` aesthetic. The setosa flowers have the smallest petal areas.

    iris_area = iris.assign(area=iris['Petal.Length'] * iris['Petal.Width'])
    show(px.scatter(iris_area, x='Sepal.Width', y='Sepal.Length',
                    color='Species', size='area'))

### Faceting

Rather than distinguishing species by color, we could also show them on three separate plots. This is called **faceting**. In Plotly Express, variables can be faceted using the `facet_row` and `facet_col` arguments.

    show(px.scatter(iris, x='Sepal.Width', y='Sepal.Length', facet_col='Species'))

## Line plots

A point is not the only geometric object we can use to represent data. A *line* might be more suitable if we want to help guide the eye from one data point to the next. Points and lines are examples of plot **geometries**. Geometries are tied to Plotly Express functions: `px.scatter` uses the point geometry, and `px.line` uses the line geometry.

Let's make a line plot using the *Gapminder* data set, which records life expectancy and per-capita GDP for 142 countries.

    import plotly.express as px

    gapminder = px.data.gapminder()
    usa = gapminder.query('country == "United States"')
    show(px.line(usa, x="year", y="lifeExp"))

The `line_group` argument allows us to group the data by country so we can plot multiple lines. Let's also map the `'continent'` variable to the `color` aesthetic.

    show(px.line(gapminder, x="year", y="lifeExp",
                 line_group="country", color="continent"))

**Exercise**

Although Plotly Express is designed primarily for data analysis, it can be used for mathematical graphs as well. Use `px.line` to graph the function $\exp$ over the interval $[0, 5]$.

Hint: begin by making a new data frame with appropriate columns. You might find `np.linspace` useful.

*Solution.* We use `np.linspace` to define an array of $x$-values, and we exponentiate it to make an array of $y$-values. We package these together into a data frame and plot it with `px.line` as usual:

    import numpy as np
    import pandas as pd

    x = np.linspace(0, 5, 100)
    y = np.exp(x)
    df = pd.DataFrame({'x': x, 'exp(x)': y})
    show(px.line(df, x='x', y='exp(x)'))

## Bar plots

Another common plot geometry is the *bar*. Suppose we want to know the average petal width for flowers with a given petal length. We can group by petal length and aggregate with the `mean` function to obtain the desired data, and then visualize it with a bar graph:

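    # Select the numeric column before averaging: 'Species' holds strings,
    # and recent versions of pandas raise an error when asked for their mean.
    mean_widths = iris.groupby('Petal.Length')[['Petal.Width']].mean().reset_index()
    show(px.bar(mean_widths, x='Petal.Length', y='Petal.Width'))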

We use `reset_index` because we want to be able to access the index column of the data frame (which contains the petal lengths), and the index is not directly accessible from Plotly Express. Resetting makes the index a normal column and replaces it with consecutive integers starting from 0.
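The effect of resetting the index is easiest to see on a tiny example. The sketch below uses a hypothetical three-row data frame (the names `df`, `length`, and `width` are made up for illustration) to show where the grouping variable lives before and after the reset.

```python
import pandas as pd

# Hypothetical toy data: two flowers of length 1 and one of length 2.
df = pd.DataFrame({"length": [1, 1, 2], "width": [1.0, 3.0, 5.0]})

# After grouping, the grouping variable becomes the index, not a column.
grouped = df.groupby("length").mean()
print(grouped.index.name)     # the index is named after the grouping column
print(list(grouped.columns))  # only 'width' remains as a column

# Resetting turns the index back into an ordinary column
# and replaces it with consecutive integers starting from 0.
flat = grouped.reset_index()
print(list(flat.columns))
print(list(flat.index))
```

After the reset, `'length'` can be passed by name to the `x` argument like any other column.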

Perhaps the most common use of the bar geometry is to make **histograms**. A histogram is a bar plot obtained by *binning* observations into intervals based on the values of a particular variable and plotting the intervals on the horizontal axis and the bin counts on the vertical axis.

Here's an example of a histogram in Plotly Express.

    show(px.histogram(iris, x='Sepal.Width', nbins=30))

We can control the number of bins with the `nbins` argument.
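The binning step itself is simple enough to write out by hand. The following is a minimal sketch that assumes equal-width bins spanning the range of the data; the function name `bin_counts` and the sample values are hypothetical, and Plotly chooses its own bin edges, so this is not necessarily the rule it applies.

```python
def bin_counts(values, nbins):
    """Count how many values fall into each of nbins equal-width intervals.

    Assumes the values are not all identical, so the bin width is positive.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        i = min(int((v - lo) / width), nbins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(nbins + 1)]
    return edges, counts

# Hypothetical sample of widths.
widths = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]
edges, counts = bin_counts(widths, nbins=4)
print(edges)
print(counts)
```

The bin edges go on the horizontal axis and the counts give the bar heights.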

**Exercise**

Does it make sense to map a categorical variable to the `color` aesthetic for a histogram? Try changing the command below to map the species column to `color`.

    show(px.histogram(iris, x='Sepal.Width', nbins=30))

*Solution.* Yes, we can split each bar into multiple colors to visualize the contribution to each bar from each category. This works in Plotly Express:

    show(px.histogram(iris, x='Sepal.Width', nbins=30, color='Species'))

## Density plots

Closely related to the histogram is a one-dimensional *density plot*. A density plot approximates the distribution of a variable in a smooth way, rather than using the discrete bins of a histogram.

Unfortunately, Plotly Express doesn't have direct support for one-dimensional density plots, so we'll use a Plotly module called the *figure factory*:

    import plotly.figure_factory as ff
    show(ff.create_distplot([iris['Sepal.Width']], ['Sepal.Width']))

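The figure factory takes two lists as arguments: one contains the values to use to estimate the density, and the other represents the names of the groups (in this case, we're just using one group). You'll see that the plot produced by this function contains three components: a histogram, a smooth density curve, and a row of tick marks showing the individual observations (this last component is called a **rug plot**).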
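One common way to obtain a smooth density curve is a Gaussian kernel density estimate, which places a small bell curve on each observation and averages them. The sketch below is a hand-rolled illustration under that assumption; the function name `gaussian_kde`, the sample values, and the bandwidth of 0.3 are hypothetical choices, not the settings the figure factory uses.

```python
import math

def gaussian_kde(sample, x, bandwidth):
    """Estimate the density at x by averaging Gaussian bumps,
    one centred on each observation in the sample."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
    )

# Hypothetical sample of widths.
sample = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]

# Evaluate the estimate on a grid from 0.00 to 6.00 in steps of 0.01.
step = 0.01
grid = [i * step for i in range(601)]
density = [gaussian_kde(sample, g, bandwidth=0.3) for g in grid]

# A density should integrate to about 1 (checked with a Riemann sum).
area = sum(density) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual observations more closely.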

If a categorical variable is mapped to the `x` aesthetic, the point geometry fails to make good use of plot space because all of the points will lie on a limited number of vertical lines. The box and violin geometries are better suited to this situation:

    show(px.box(iris, x='Species', y='Petal.Width'))

    show(px.violin(iris, x='Species', y='Petal.Width'))

The box plot represents the distribution of the `y` variable using five numbers: the min, first quartile, median, third quartile, and max. Alternatively, the min and max are sometimes replaced with upper and lower *fences*, and observations which lie outside are considered outliers and depicted with points. The plot creator has discretion regarding how to calculate fence cutoffs, but one common choice for the upper fence formula is $Q_3 + 1.5\,\mathrm{IQR}$, where $Q_3$ is the third quartile and $\mathrm{IQR}$ is the interquartile range (the difference between the third and first quartiles).
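The numbers behind a box plot can be computed directly. The sketch below assumes the common rule that places the fences 1.5 interquartile ranges beyond the quartiles; the function name `box_summary` and the sample data are hypothetical, and because there are several conventions for computing quartiles, a plotting library may report slightly different values.

```python
from statistics import quantiles

def box_summary(values, k=1.5):
    """Return the quartiles, fences, and outliers for a box plot,
    assuming fences lie k interquartile ranges beyond the quartiles."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical sample with one unusually large observation.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_summary(data)
print(summary)
```

Here the value 30 lies above the upper fence, so it would be drawn as a separate point rather than being included in the whisker.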

A violin plot is similar to a box plot, except that rather than a box, a small density plot is drawn for each group.

In this section we introduced several of the main tools in a data scientist's visualization toolkit, but you will learn many others. Check out the cheatsheet for ggplot2 to see a much longer list of geometries, aesthetics, and statistical transformations.