Descriptive Statistics Using Python

Descriptive Statistics

After collecting data, most psychology researchers summarise it in one way or another. In this tutorial we will learn how to do descriptive statistics in Python. Python, being a general-purpose programming language, offers many ways to carry out descriptive statistics. Pandas makes data manipulation and summary statistics quite similar to how you would do them in R. I find the data frame in R very intuitive to use, and Pandas offers a DataFrame object similar to R's. In addition, many psychology researchers already have experience with R.

Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy and SciPy. We start by using Pandas to obtain summary statistics and some variance measures. After that we continue with measures of central tendency (e.g., mean and median) using Pandas and NumPy. Pandas and NumPy have no dedicated functions for the harmonic, geometric, and trimmed means, so we use SciPy for those. Towards the end we learn how to get some measures of variability (e.g., variance using Pandas).

import numpy as np
import pandas as pd  # pd.Series is used further down
from pandas import DataFrame as df
from scipy.stats import trim_mean, kurtosis
from scipy.stats import mode, gmean, hmean
 

Simulate response time data

In experimental psychology, response time is often the dependent variable. I simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data also has two independent variables (IVs): “iv1” has 2 levels and “iv2” has 3 levels. The data are simulated at the same time as the dataframe is created, and the first descriptive statistics are obtained using the method describe.
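The simulation code itself is not shown here, so the following is only a minimal sketch of what such a data set could look like. The seed, the number of trials per cell, the level names, and the normal distribution parameters are assumptions made for illustration; only the column names rt, iv1, and iv2 and the variable names data and grouped_data follow the text.

```python
import numpy as np
import pandas as pd

# Hypothetical design: 2 levels of iv1 x 3 levels of iv2, 50 trials per cell.
rng = np.random.default_rng(42)
n_per_cell = 50
iv1_levels = ["a", "b"]
iv2_levels = ["x", "y", "z"]

rows = []
for i, lvl1 in enumerate(iv1_levels):
    for j, lvl2 in enumerate(iv2_levels):
        # Assumed response times in ms; the cell mean shifts with the condition.
        rt = rng.normal(loc=500 + 50 * i + 25 * j, scale=60, size=n_per_cell)
        rows.append(pd.DataFrame({"rt": rt, "iv1": lvl1, "iv2": lvl2}))

data = pd.concat(rows, ignore_index=True)

# Summary statistics for the numeric column, via the describe method.
print(data.describe())

# Grouping by both independent variables gives one group per design cell.
grouped_data = data.groupby(["iv1", "iv2"])
print(grouped_data.size())
```

Building the frame cell by cell keeps the design explicit; any other way of producing a long-format table with one row per trial would work just as well.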

Descriptive statistics using Pandas

data.describe()
 

Calling this method makes Pandas output summary statistics. The output is a table, as shown below.

[Figure: Output table of data.describe()]

Typically, a researcher is interested in the descriptive statistics for each level of the IVs. Therefore, I group the data by these. Using describe on the grouped data aggregates the data for each level of each IV. As can be seen from the output, it is somewhat hard to read. Note that the method unstack is used to get the mean, standard deviation (std), etc. as columns, which makes the output somewhat easier to read.
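The grouping step is described but not shown, so here is a hedged sketch of how it might look. The small data frame is a made-up stand-in for the simulated response times, and the grouping on both IVs is an assumption based on the description.

```python
import pandas as pd

# Small hypothetical data set standing in for the simulated response times.
data = pd.DataFrame({
    "rt":  [510, 530, 495, 560, 580, 545, 620, 600, 640, 590, 610, 575],
    "iv1": ["a"] * 6 + ["b"] * 6,
    "iv2": ["x", "x", "y", "y", "z", "z"] * 2,
})

grouped_data = data.groupby(["iv1", "iv2"])
summary = grouped_data["rt"].describe()

# Older pandas versions returned a long Series here, which is why unstack()
# was needed; recent versions already return one column per statistic.
if isinstance(summary, pd.Series):
    summary = summary.unstack()

print(summary)
```

Each row of the result corresponds to one combination of iv1 and iv2, and each column to one statistic (count, mean, std, and so on).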

[Figure: Output from describe on the grouped data]

Central tendency

Often we want to know something about the “average” or “middle” of our data. Using Pandas and NumPy, the two most commonly used measures of central tendency can be obtained: the mean and the median. The mode can also be obtained using Pandas, but for the mode and the trimmed mean I will use functions from SciPy.

Mean

There are at least two ways of doing this using our grouped data. First, Pandas has the method mean:

grouped_data['rt'].mean().reset_index()
 

The method aggregate, in combination with NumPy's mean, can also be used.

Both approaches give the same output, but the aggregate method has some advantages that I will explain later.
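The aggregate variant is not shown in the text, so the following sketch puts the two approaches side by side. The data frame is a hypothetical stand-in, and the aggregate call is written by analogy with the median example in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the simulated data (two trials per cell).
data = pd.DataFrame({
    "rt":  [510, 530, 495, 560, 580, 545, 620, 600, 640, 590, 610, 575],
    "iv1": ["a"] * 6 + ["b"] * 6,
    "iv2": ["x", "x", "y", "y", "z", "z"] * 2,
})
grouped_data = data.groupby(["iv1", "iv2"])

# Approach 1: the mean method of the grouped object.
mean_method = grouped_data["rt"].mean().reset_index()

# Approach 2: aggregate with NumPy's mean. Recent pandas versions may emit
# a warning here and suggest passing the string "mean" instead.
mean_aggregate = grouped_data["rt"].aggregate(np.mean).reset_index()

print(mean_method)
print(np.allclose(mean_method["rt"], mean_aggregate["rt"]))
```

The aggregate form becomes more useful once several functions are passed at the same time, since it returns them together in one table.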

[Figure: Output of mean and aggregate using NumPy (mean)]

 

Geometric & Harmonic mean

Sometimes the geometric or harmonic mean is of interest. These two descriptives can be obtained using the method apply with the functions gmean and hmean (from SciPy) as arguments. That is, neither Pandas nor NumPy has a dedicated method for calculating geometric and harmonic means.

Geometric
grouped_data['rt'].apply(gmean, axis=None).reset_index()
 
Harmonic
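No code is shown for the harmonic mean. By analogy with the geometric mean it would presumably use hmean in the same way; the sketch below makes that assumption and uses a tiny hypothetical data set (grouped on a single factor for brevity) chosen so that the results are easy to check by hand.

```python
import pandas as pd
from scipy.stats import gmean, hmean

# Hypothetical response times, two per group, picked for round results.
data = pd.DataFrame({
    "rt":  [400, 900, 500, 500, 200, 800],
    "iv1": ["a", "a", "b", "b", "c", "c"],
})
grouped_data = data.groupby("iv1")

# Geometric mean per group: the n-th root of the product of the values.
geometric = grouped_data["rt"].apply(gmean, axis=None).reset_index()

# Harmonic mean per group: n divided by the sum of the reciprocals.
harmonic = grouped_data["rt"].apply(hmean, axis=None).reset_index()

print(geometric)
print(harmonic)
```

For group a the geometric mean is the square root of 400 * 900, and for group c the harmonic mean is 2 / (1/200 + 1/800); both are lower than the arithmetic mean, which is why they are sometimes preferred for skewed data.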

Trimmed mean

Trimmed means are used at times. Neither Pandas nor NumPy seems to have a method for obtaining the trimmed mean. However, we can use the function trim_mean from SciPy. By using apply on our grouped data, we can call trim_mean with an argument that removes the 10 % largest and the 10 % smallest values before the mean is calculated.

trimmed_mean = grouped_data['rt'].apply(trim_mean, .1)
trimmed_mean.reset_index()
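To make the effect of the trimming argument concrete, here is a small worked sketch on a single, made-up set of ten response times with one very fast and one very slow trial. The numbers are hypothetical and only serve to show what trim_mean does with a proportion of 0.1.

```python
import numpy as np
from scipy.stats import trim_mean

# Ten hypothetical response times (ms) with one outlier at each end.
rt = np.array([250, 480, 490, 500, 505, 510, 515, 520, 530, 1900])

plain = rt.mean()

# With 10 values and a proportion of 0.1, one value is cut from each end
# of the sorted data before the mean is taken.
trimmed = trim_mean(rt, 0.1)

print(plain, trimmed)
```

The ordinary mean is pulled upwards by the slow trial, whereas the trimmed mean stays close to the bulk of the data.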
 

Output for the means above (trimmed, harmonic, and geometric):

[Figure: Trimmed mean]

[Figure: Harmonic mean]

[Figure: Geometric mean]

Median

As with the mean, there are at least two ways of obtaining the median; here aggregate is combined with NumPy's median:

grouped_data['rt'].aggregate(np.median).reset_index()
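Only the aggregate variant is shown. The other way is presumably the median method of the grouped object, by analogy with mean; the sketch below assumes that and compares the two on a small hypothetical data set.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three response times per level of a single factor.
data = pd.DataFrame({
    "rt":  [510, 530, 900, 600, 620, 640],
    "iv1": ["a", "a", "a", "b", "b", "b"],
})
grouped_data = data.groupby("iv1")

# Way 1: the median method of the grouped object.
median_method = grouped_data["rt"].median().reset_index()

# Way 2: aggregate with NumPy's median (recent pandas versions may warn
# and suggest the string "median" instead).
median_aggregate = grouped_data["rt"].aggregate(np.median).reset_index()

print(median_method)
print(median_aggregate)
```

Note how the slow trial of 900 ms in group a does not move the median, which stays at the middle value.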
 
[Figure: Output of aggregate using NumPy (median)]

Mode

There is a method (i.e., pandas.DataFrame.mode()) for getting the mode of a DataFrame object. However, it cannot be used on the grouped data, so I will use mode from SciPy.
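The SciPy call is not shown, so this is a hedged sketch of one way to apply scipy.stats.mode to each group. The helper name scipy_mode and the data are hypothetical; the response times are assumed to be rounded to 10 ms so that repeated values exist at all.

```python
import numpy as np
import pandas as pd
from scipy.stats import mode

# Hypothetical response times rounded to 10 ms, so that values repeat.
data = pd.DataFrame({
    "rt":  [500, 510, 510, 520, 600, 600, 610, 620],
    "iv1": ["a"] * 4 + ["b"] * 4,
})
grouped_data = data.groupby("iv1")


def scipy_mode(values):
    """Return the most frequent value of one group (hypothetical helper)."""
    result = mode(values.to_numpy())
    # Depending on the SciPy version, result.mode is a scalar or a
    # length-1 array, so normalise it before taking the value.
    return np.atleast_1d(result.mode)[0]


modes = grouped_data["rt"].apply(scipy_mode).reset_index()
print(modes)
```

With continuous, unrounded response times every value is typically unique, in which case the mode is not an informative summary.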

Most of the time I would probably want to see all measures of central tendency at the same time. Luckily, aggregate enables us to use many NumPy and SciPy functions. In the example below the median, standard deviation (std), and mean are all in the same output. Note that we have to add the trimmed means afterwards.

# Median, standard deviation and mean per group in one table
descr = grouped_data['rt'].aggregate([np.median, np.std, np.mean]).reset_index()

# Add the trimmed means computed earlier as an extra column
descr['trimmed_mean'] = pd.Series(trimmed_mean.values, index=descr.index)
descr 
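Since aggregate accepts arbitrary functions, the SciPy means can be included in the same call. The following is a sketch under assumptions: it uses pandas named aggregation (keyword arguments give the output column names), string names for the built-in statistics, and a tiny hypothetical data set.

```python
import pandas as pd
from scipy.stats import gmean, hmean, trim_mean

# Hypothetical data, two response times per level of a single factor.
data = pd.DataFrame({
    "rt":  [400, 900, 500, 500],
    "iv1": ["a", "a", "b", "b"],
})
grouped_data = data.groupby("iv1")

# Named aggregation: each keyword becomes a column in the result.
descr = grouped_data["rt"].aggregate(
    median="median",
    std="std",
    mean="mean",
    gmean=gmean,
    hmean=hmean,
).reset_index()

# trim_mean needs an extra argument, so it is computed with apply and
# added afterwards; both results are ordered by the same group keys.
trimmed_mean = grouped_data["rt"].apply(trim_mean, 0.1)
descr["trimmed_mean"] = trimmed_mean.values

print(descr)
```

Using the string names "median", "std", and "mean" lets pandas pick its own implementations; note that pandas computes the standard deviation with n - 1 in the denominator, whereas NumPy's np.std called on its own divides by n.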
[Figure: Output of aggregate using some of the methods]

Measures of variability

Central tendency (e.g., the mean and median) is not the only type of summary statistic that we want to calculate. When analysing data we also want a measure of its variability.

Standard deviation
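No code is shown for the standard deviation. A minimal sketch, assuming the std method of the grouped object is what is intended, with hypothetical data:

```python
import pandas as pd

# Hypothetical data: three response times per level of a single factor.
data = pd.DataFrame({
    "rt":  [500, 520, 540, 600, 600, 600],
    "iv1": ["a", "a", "a", "b", "b", "b"],
})
grouped_data = data.groupby("iv1")

# Sample standard deviation (denominator n - 1), the pandas default.
std_sample = grouped_data["rt"].std().reset_index()

# Population standard deviation (denominator n), via ddof=0.
std_population = grouped_data["rt"].std(ddof=0).reset_index()

print(std_sample)
print(std_population)
```

The ddof argument matters mostly for small groups; with many trials per cell the two versions are nearly identical.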

Interquartile range

Note that unstack() is used here as well, so that the quantiles become columns and the output is easier to read.

grouped_data['rt'].quantile([.25, .5, .75]).unstack()
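The call above returns the quartiles rather than the interquartile range itself. As a sketch, the range can be derived by subtracting the first quartile from the third; the column name iqr and the data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: five response times per level of a single factor.
data = pd.DataFrame({
    "rt":  [500, 510, 520, 530, 540, 600, 620, 640, 660, 680],
    "iv1": ["a"] * 5 + ["b"] * 5,
})
grouped_data = data.groupby("iv1")

# One row per group, one column per requested quantile (0.25, 0.5, 0.75).
quartiles = grouped_data["rt"].quantile([.25, .5, .75]).unstack()

# Interquartile range: third quartile minus first quartile.
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]

print(quartiles)
```

After unstack() the column labels are the quantile values themselves, which is why they are accessed with 0.25 and 0.75.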
 
[Figure: IQR]

Variance
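The variance code is not shown either. A minimal sketch, assuming the var method of the grouped object, with hypothetical data:

```python
import pandas as pd

# Hypothetical data: three response times per level of a single factor.
data = pd.DataFrame({
    "rt":  [500, 520, 540, 600, 610, 620],
    "iv1": ["a", "a", "a", "b", "b", "b"],
})
grouped_data = data.groupby("iv1")

# Sample variance per group (denominator n - 1, the pandas default).
variance = grouped_data["rt"].var().reset_index()

print(variance)
```

For group a the squared deviations from the mean of 520 sum to 800, and dividing by n - 1 = 2 gives the reported variance.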

[Figure: Variance]

That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy make these calculations almost as easy as doing them in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can pass in other methods or functions to obtain other types of descriptives.

Translated from: https://www.pybloggers.com/2016/02/descriptive-statistics-using-python/
