Exploring Data

Visualizations

One of the most rewarding and useful things you can do to understand your data is to visualize it in a pictorial format. Visualizing your data allows you to interact with it, analyze it in a straightforward way, and identify new patterns, making otherwise complex data more accessible and understandable. The way our brains process visuals like shapes, colors, and lengths makes looking at charts and graphs more intuitive for us than poring over spreadsheets. Author and illustrator Felicita Sala demonstrates this masterfully:

Food Visualization #1

A similar example comes from Picture Cook: See, Make, Eat by Katie Shelly (Ulysses Press, 2013):

Food Visualization #2

Besides making your mouth water, these visualizations convey, in a single image, the ingredients needed, their amounts, and the order in which they're introduced into the recipe. Effective visualization keys into your ability to understand graphical information using passive brain power. If you're going to spend hours, days, and weeks doing analysis of your data, you owe it to yourself not to fall flat on your face when it comes to its presentation.

MatPlotLib

If, like me, you aren't an award-winning photographer or master graphics artist, MatPlotLib is here to help. MatPlotLib is a Python data visualization tool that supports 2D and 3D rendering, animation, UI design, event handling, and more. It only requires that you pass in your data and some display parameters; it then takes care of all the rasterization implementation details. For the most part, you will be interacting with MatPlotLib's Pyplot functionality through the .plot namespace of a Pandas series or dataframe. Pyplot is a collection of command-style methods that essentially make MatPlotLib's charting methods feel like MATLAB.

Histograms

Histograms are one of the Seven Basic Tools of Quality, graphical techniques that have been identified as being most helpful for troubleshooting issues. Histograms help you understand the distribution of a feature in your dataset. They accomplish this by simultaneously answering two questions: where in your feature's domain your records are located, and how many records exist there. These two questions are also answered by the .unique() and .value_counts() methods discussed in the feature wrangling section, but histograms answer them graphically. Be sure to take note of this in the exploring section of your course map!
NOTE: To follow the instructions below, download and extract the following zip file.
Since the wheat dataset was used to explore histograms in the video lecture, and since you're going to get intimately familiar with that dataset in the labs, in this reading we'll keep it fresh by using a different dataset about student grades, conveniently downloaded to your /DAT210x/Datasets/students.data.
Recall from the Features Primer section that there are two types of features: continuous and categorical. Histograms count records per category, so they are only really meaningful with categorical data. If you have a continuous feature, it must first be binned, or discretized, which transforms the continuous feature into a categorical one by grouping similar values together. To accomplish this, the entire range of values is divided into a series of intervals that are usually consecutive, equal in length, and non-overlapping. These intervals become the categories. Then, a count of how many values fall into each interval serves as the categorical bin count.
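The binning step described above can be sketched by hand with Pandas. This is only an illustration: the grades are made-up numbers (not the students dataset), and the choice of four equal-width intervals over 0 to 20 is an assumption made for the example.

```python
import pandas as pd

# Made-up grades on a 0-20 scale, purely for illustration.
grades = pd.Series([0, 5, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 16, 18, 20])

# Divide the range 0..20 into four consecutive, equal-width,
# non-overlapping intervals. include_lowest=True lets the first
# interval contain the value 0.
binned = pd.cut(grades, bins=[0, 5, 10, 15, 20], include_lowest=True)

# Count how many values fall into each interval, listed in interval order.
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

Each interval now acts as a category, and bin_counts holds the bar heights a histogram of these values would draw.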
To render a histogram with MatPlotLib through Pandas, call the .plot.hist() method on either a dataframe or series. The .plot.hist() method fully handles the discretization of your continuous features for you behind the scenes as needed!

import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ggplot') # Look Pretty

# Load the student grades, using the first column as the index
student_dataset = pd.read_csv("/Datasets/students.data", index_col=0)

my_series = student_dataset.G3
my_dataframe = student_dataset[['G3', 'G2', 'G1']]

# alpha=0.5 makes the bars semi-transparent so overlapping features stay visible
my_series.plot.hist(alpha=0.5)
my_dataframe.plot.hist(alpha=0.5)
plt.show()

Histogram Plots

If your interest lies in how your records are distributed proportionally rather than in raw frequency counts, set the named parameter density=True (older MatPlotLib releases called this parameter normed, which has since been removed). This normalizes the histogram so that the areas of the bars sum to 1. MatPlotLib's online API documentation exposes many other features and optional parameters that can be used with histograms, such as cumulative and histtype. Be sure to experiment with them in the exercises so that you can use histograms effectively as needed.
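As a rough sketch of these optional parameters, the snippet below calls MatPlotLib's hist directly on made-up grades (not the students dataset). It uses density=True, which is the spelling in current MatPlotLib releases (the older normed=True was removed in version 3.1), and the bin count of 4 is an arbitrary choice for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots on screen
import matplotlib.pyplot as plt
import numpy as np

# Made-up grades on a 0-20 scale, purely for illustration.
grades = np.array([0, 5, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 16, 18, 20])

fig, (ax_density, ax_cumulative) = plt.subplots(1, 2, figsize=(8, 3))

# density=True rescales the bars so that their total area equals 1.
heights, edges, _ = ax_density.hist(grades, bins=4, density=True, alpha=0.5)
total_area = float(np.sum(heights * np.diff(edges)))

# cumulative=True makes each bar include all the bars to its left;
# histtype='step' draws only the outline instead of filled bars.
running, _, _ = ax_cumulative.hist(grades, bins=4, cumulative=True, histtype="step")
```

The same keyword arguments can be passed through Pandas, for example my_series.plot.hist(density=True), since Pandas forwards them to MatPlotLib.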
Knowing how a feature is distributed throughout your dataset is useful, as some machine learning models expect the provided data to be normally (Gaussian) distributed, and only work when it is! For such models, if exploring your data with a histogram shows your data is skewed, all hope isn't lost. There are transformation techniques that will correct for this, and you'll learn about them in the Transforming and Munging module.

2D Scatter Plots

Similar to histograms, scatter plots are also one of the Seven Basic Tools of Quality. Let's get them added to your arsenal, starting with the 2D variant.
2D scatter plots are used to visually inspect whether a correlation exists between the charted features. Both axes of a 2D scatter plot represent a distinct, numeric feature. They don't have to be continuous, but they must at least be ordinal, since each record in your dataset is plotted as a point whose location along the axes corresponds to its feature values. Without ordering, the position of the points would have no meaning.
It is possible that either a negative or positive correlation exists between the charted features, or alternatively, none at all. The correlation type can be assessed through the overall diagonal trending of the plotted points.
Positive and negative correlations may further display a linear or non-linear relationship. If a straight line can be drawn through your scatter plot and most of the points seem to stick close to it, then it can be said with a certain level of confidence that there is a linear relationship between the plotted features. Similarly, if a curve can be drawn through the points, there is likely a non-linear relationship. If neither a curve nor a line adequately fits the overall shape of the plotted points, chances are there is neither a correlation nor a relationship between the features, or at least not enough information at present to determine one.
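A visual impression of positive or negative correlation can also be cross-checked with a number. The sketch below uses Pandas' Series.corr, which computes the Pearson correlation coefficient by default. That coefficient is not introduced in this reading, so treat it as an optional aside; the data and column names are made up for illustration.

```python
import pandas as pd

# Made-up records, purely for illustration.
toy = pd.DataFrame({
    "first":    [5, 8, 10, 12, 14, 16, 18],
    "final":    [6, 8, 11, 12, 15, 15, 19],  # tends to rise with "first"
    "absences": [9, 7, 6, 5, 3, 2, 0],       # tends to fall as "first" rises
})

# Pearson's r ranges from -1 (perfect negative linear relationship)
# to +1 (perfect positive linear relationship); values near 0 suggest
# no linear relationship.
r_positive = toy["first"].corr(toy["final"])
r_negative = toy["first"].corr(toy["absences"])
print(r_positive, r_negative)
```

Note that this coefficient only measures linear association, so a clearly curved pattern in a scatter plot can still produce a value near zero.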
Begin with the code to pull up the student performance dataset, then simply call .plot.scatter on your dataset:

import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ggplot') # Look Pretty

student_dataset = pd.read_csv("/Datasets/students.data", index_col=0)

# x and y name the dataframe columns to chart against each other
student_dataset.plot.scatter(x='G1', y='G3')
plt.show()

Scatter Plot

This is your basic 2D scatter plot. Notice you have to call .scatter on a dataframe rather than a series, since two features are needed rather than just one. You also have to specify which features within the dataset you want graphed. You'll be using scatter plots so frequently in your data analysis that you should also know how to create them directly from MatPlotLib, in addition to knowing how to graph them from Pandas. This is because many Pandas methods actually return regular NumPy NDArrays, rather than fully qualified Pandas dataframes.
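A minimal sketch of the direct MatPlotLib route, assuming the two features have already ended up as plain NumPy arrays. The values are made up, and the labels G1 and G3 are borrowed from the students dataset only to name the axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots on screen
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for two numeric features held as NumPy arrays.
first_grade = np.array([5, 8, 10, 12, 14, 16, 18])
final_grade = np.array([6, 8, 11, 12, 15, 15, 19])

fig, ax = plt.subplots()

# Axes.scatter takes the x values and y values as separate sequences.
points = ax.scatter(first_grade, final_grade, c="r", marker="o")

# Plain arrays carry no column names, so the axes are labeled by hand.
ax.set_xlabel("G1")
ax.set_ylabel("G3")
```

With an interactive backend, calling plt.show() afterwards displays the figure.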
This plot shows us that there certainly seems to be a positive correlation and linear relationship between a student's first and final exam scores—except for those few students at the bottom of the barrel! None of them did great on their first exam, and they all completely bombed their finals; if only there were a way to chart more than two variables simultaneously, perhaps you could verify if some other variable were at play in causing the students to fail...

3D Scatter Plots

To follow up from the last section, there surely is a way to visualize the relationship between three variables simultaneously. That way is through 3D scatter plots. Unfortunately, the Pyplot member of Pandas dataframes doesn't natively support the ability to generate 3D plots... so for the sake of your visualization repertoire, you're going to learn how to make them directly with MatPlotLib.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # Registers the '3d' projection on older MatPlotLib versions

import pandas as pd

plt.style.use('ggplot') # Look Pretty

student_dataset = pd.read_csv("/Datasets/students.data", index_col=0)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Labels follow the argument order of ax.scatter below: x=G1, y=G3, z=Dalc
ax.set_xlabel('First Grade')
ax.set_ylabel('Final Grade')
ax.set_zlabel('Daily Alcohol')

ax.scatter(student_dataset.G1, student_dataset.G3, student_dataset['Dalc'], c='r', marker='.')
plt.show()

3D Scatter Plot

This plot communicates a few things to you:
It's still easy to see the positive, linear relationship between the first grade and the final grade. Generally the higher the first grade, the higher the final grade.
A large number of the students in this study consume alcohol on a daily basis.
The more alcohol a student consumes daily, the worse their final score is on average.
Daily alcohol consumption doesn't really seem to be the feature that contributed to those select students who bombed their finals, scoring 0.
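Readings like these can be cross-checked numerically with a group-by. The sketch below shows the idea on made-up rows that only borrow the column names Dalc and G3; the numbers it prints say nothing about the real students dataset.

```python
import pandas as pd

# Made-up rows, NOT the real students data; only the column names are borrowed.
toy = pd.DataFrame({
    "Dalc": [1, 1, 1, 2, 2, 3, 3, 4, 5],
    "G3":   [15, 12, 0, 13, 11, 10, 0, 9, 8],
})

# Average final grade at each level of daily alcohol consumption.
mean_final_by_dalc = toy.groupby("Dalc")["G3"].mean()

# How the students who scored 0 are spread across consumption levels.
zero_scores_by_dalc = toy[toy["G3"] == 0].groupby("Dalc").size()

print(mean_final_by_dalc)
print(zero_scores_by_dalc)
```

Running the same two group-bys on student_dataset would show whether the averages and the zero scores line up with what the 3D plot suggests.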
