DataFrame.head(n=5)
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
Parameters
n : int, default 5
Number of rows to select.
Returns
same type as caller
The first n rows of the caller object.
See also
DataFrame.tail
Returns the last n rows.
Examples
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra
Viewing the first 5 lines:
>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
Viewing the first n lines (three in this case):
>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon
For negative values of n:
>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
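The positional behaviour described above can be checked directly. The following is a minimal sketch on a small hypothetical frame (the data is illustrative only); it compares head against plain positional slicing and shows that asking for more rows than exist simply returns everything.

```python
import pandas as pd

# Hypothetical five-row frame, used only for illustration.
df = pd.DataFrame({"animal": ["alligator", "bee", "falcon", "lion", "monkey"]})

first_two = df.head(2)          # rows at positions 0 and 1
all_but_last_two = df.head(-2)  # rows at positions 0, 1 and 2
too_many = df.head(10)          # n exceeds the length, so all 5 rows come back

print(len(first_two), len(all_but_last_two), len(too_many))
```

Because selection is by position, the result does not depend on the index labels.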
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters
percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored for Series. Here are the options:
‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional
A black list of data types to omit from the result. Ignored for Series. Here are the options:
A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
None (default) : The result will exclude nothing.
datetime_is_numeric : bool, default False
Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.
New in version 1.1.0.
Returns
Series or DataFrame
Summary statistics of the Series or DataFrame provided.
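As an illustration of the percentiles parameter, here is a minimal sketch on a hypothetical Series of the integers 1 to 10. It requests the 10th and 90th percentiles in place of the default quartiles; the labels in the resulting index are the percentages written as strings.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Ask for the 10th and 90th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.1, 0.9])

print(summary)
```

The requested values appear under the labels '10%' and '90%', and the default '25%' and '75%' rows are no longer present.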
See also
DataFrame.count
Count number of non-NA/null observations.
DataFrame.max
Maximum of the values in the object.
DataFrame.min
Minimum of the values in the object.
DataFrame.mean
Mean of the values.
DataFrame.std
Standard deviation of the observations.
DataFrame.select_dtypes
Subset of a DataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the top result will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
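The difference between the numeric and object summaries described in these notes can be seen by inspecting the index of each result. This is a minimal sketch on two small hypothetical Series; it only prints the row labels that describe produces for each kind of data.

```python
import pandas as pd

# Hypothetical inputs: one numeric Series and one Series of strings.
numeric = pd.Series([1, 2, 3]).describe()
text = pd.Series(["a", "a", "b", "c"]).describe()

print(list(numeric.index))  # count, mean, std, min, percentiles, max
print(list(text.index))     # count, unique, top, freq
```

In the string case, top is 'a' and freq is 2 because 'a' is the only value that occurs twice.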
Examples
Describing a numeric Series.
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
Describing a categorical Series.
>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
Describing a timestamp Series.
>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
Describing a DataFrame. By default only numeric fields are returned.
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Including only string columns in a DataFrame description.
>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1
Including only categorical columns from a DataFrame description.
>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1
Excluding numeric columns from a DataFrame description.
>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
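The include and exclude parameters also accept strings in the style of select_dtypes, as noted in the parameter descriptions. The sketch below uses a hypothetical three-column frame (the column names grade, price and label are illustrative, not from the examples above) and the string 'number' as a stand-in for numpy.number.

```python
import pandas as pd

# Hypothetical frame with one categorical, one numeric and one string column.
df = pd.DataFrame({
    "grade": pd.Categorical(["s", "m", "l"]),
    "price": [1.5, 2.5, 3.5],
    "label": ["a", "b", "c"],
})

numeric_only = df.describe(include=["number"])  # keeps only 'price'
non_numeric = df.describe(exclude=["number"])   # keeps 'grade' and 'label'
everything = df.describe(include="all")         # keeps every column

print(list(numeric_only.columns))
print(list(non_numeric.columns))
print(list(everything.columns))
```

With include='all', statistics that do not apply to a column's type are filled with NaN, as in the earlier example output.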
DataFrame.plot(*args, **kwargs)
Make plots of Series or DataFrame.
Uses the backend specified by the option plotting.backend. By default, matplotlib is used.
Parameters
data
The object for which the method is called.
x
Only used if data is a DataFrame.
y
Allows plotting of one column versus another. Only used if data is a DataFrame.
kind
The kind of plot to produce:
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot
‘hexbin’ : hexbin plot.
ax
An axes of the current figure.
subplots
Make separate subplots for each column.
sharex
In case subplots=True, share x axis and set some x axis labels to invisible. Defaults to True if ax is None, otherwise False if an ax is passed in. Be aware that passing in both an ax and sharex=True will alter all x axis labels for all axes in a figure.
sharey
In case subplots=True, share y axis and set some y axis labels to invisible.
layout
(rows, columns) for the layout of subplots.
figsize
Size of a figure object.
use_index
Use index as ticks for x axis.
title
Title to use for the plot. If a string is passed, print the string at the top of the figure. If a list is passed and subplots is True, print each item in the list above the corresponding subplot.
grid
Axis grid lines.
legend
Place legend on axis subplots.
style
The matplotlib line style per column.
logx
Use log scaling or symlog scaling on x axis.
Changed in version 0.25.0.
logy
Use log scaling or symlog scaling on y axis.
Changed in version 0.25.0.
loglog
Use log scaling or symlog scaling on both x and y axes.
Changed in version 0.25.0.
xticks
Values to use for the xticks.
yticks
Values to use for the yticks.
xlim
Set the x limits of the current axes.
ylim
Set the y limits of the current axes.
xlabel
Name to use for the xlabel on x-axis. Default uses index name as xlabel.
New in version 1.1.0.
ylabel
Name to use for the ylabel on y-axis. Default will show no ylabel.
New in version 1.1.0.
rot
Rotation for ticks (xticks for vertical, yticks for horizontal plots).
fontsize
Font size for xticks and yticks.
colormap
Colormap to select colors from. If string, load colormap with that name from matplotlib.
colorbar
If True, plot colorbar (only relevant for ‘scatter’ and ‘hexbin’ plots).
position
Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center).
table
If True, draw a table using the data in the DataFrame and the data will be transposed to meet matplotlib’s default layout. If a Series or DataFrame is passed, use passed data to draw a table.
yerr
See Plotting with Error Bars for detail.
xerr
Equivalent to yerr.
stacked
If True, create stacked plot.
sort_columns
Sort column names to determine plot ordering.
secondary_y
Whether to plot on the secondary y-axis; if a list/tuple, which columns to plot on the secondary y-axis.
mark_right
When using a secondary_y axis, automatically mark the column labels with “(right)” in the legend.
include_bool
If True, boolean values can be plotted.
backend
Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
**kwargs
Options to pass to matplotlib plotting method.
Returns
matplotlib.axes.Axes or numpy.ndarray of them
If the backend is not the default matplotlib one, the return value will be the object returned by the backend.
Notes
See matplotlib documentation online for more on this subject.
If kind = ‘bar’ or ‘barh’, you can specify relative alignments for bar plot layout with the position keyword. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center).
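To tie the kind, position and subplots options to the documented return value, here is a minimal sketch that assumes the default matplotlib backend is installed. The frame and the title are hypothetical, and the non-interactive 'Agg' matplotlib backend is selected only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import matplotlib.axes
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})

# A single plot returns one Axes object; position=0.5 centers the bars.
ax = df.plot(kind="bar", position=0.5, title="bar plot")

# With subplots=True, one Axes per column comes back in a NumPy array.
axes = df.plot(kind="line", subplots=True, sharex=True)

print(type(ax).__name__, axes.shape)
```

With a different plotting backend, the returned objects would be whatever that backend produces, so the isinstance checks implied here apply to matplotlib only.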