How to Get Started with Any Kind of Data? - Part 1

Start with Data

My data science journey began with a student job in the Advanced Analytics department of one of the biggest automotive manufacturers in Germany. I was naïve and still doing my master's.

I was excited about this job because my specialization at the time was Digitalization, and I wanted to get the hang of how it really works. I had studied programming too, but not Python. My colleagues were all really smart: PhDs, mathematicians and physicists. Their understanding of analytics was way beyond what I could gain by merely reading books!

For the first few days, the variety of projects, tasks and analyses bewildered me. But you know what was more bewildering? Questions like: What is analytics? Why do it? What are all these files with so much data? What do all those numbers in the results say? What does an analytics project look like? What do they mean when they say they are analyzing data?

Overwhelming!

I spent days understanding analytics and the job itself. I gorged on various books and online courses that taught Python, statistics, data science, etc. Gradually, I developed an understanding of the subjects and successfully completed my thesis in the same department too.

I have explained the data analytics recipe for you below. I hope you can use it as a guide even if your ingredients change with the application.

The most important step in making any project successful is a clear start. No matter how big or small your project is, if you do not have the ingredients in the required form and the right tools, even a master chef's recipe will not guarantee a delicious meal in the end.

Let’s start with the ingredients before starting with the preparation.

Ingredients:

1. The Problem

Did you ever get irrelevant results after searching for something on Google? What do you do then? You rephrase and refine the keywords and search again. Similarly, having the ‘why’ of your analysis clear at the beginning helps you interpret your results better.

After you get all the data that you need, the next step is to understand and define the problem statement. The pain points of the business case need to be addressed here. It is imperative that your aim aligns with the business strategy of your company so that the analysis proves fruitful to the stakeholders.

Consider the store location example described under ‘The Data’. As the result of your analysis, you will get a score assigned to each prospective location. Suppose the strategy of your management is to finance the project only when the new location yields more than $100,000 profit in a new city with a minimum population of 5,000. You then have clear criteria to narrow down the analysis results in line with the vision of your company.
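
As a rough illustration of how such criteria narrow down scored locations, here is a minimal Python sketch. The city names, scores, profit figures and field names are made up for illustration; only the two thresholds come from the example above.

```python
# Hypothetical candidate locations; all names and numbers are illustrative only.
candidates = [
    {"city": "A", "score": 0.91, "expected_profit": 150_000, "population": 12_000},
    {"city": "B", "score": 0.88, "expected_profit": 90_000, "population": 30_000},
    {"city": "C", "score": 0.80, "expected_profit": 120_000, "population": 4_000},
    {"city": "D", "score": 0.75, "expected_profit": 110_000, "population": 5_000},
]

MIN_PROFIT = 100_000    # profit must exceed this amount
MIN_POPULATION = 5_000  # city must have at least this many inhabitants


def meets_strategy(location):
    """Return True if a location satisfies both management criteria."""
    return (
        location["expected_profit"] > MIN_PROFIT
        and location["population"] >= MIN_POPULATION
    )


# Keep only the qualifying locations, best score first.
shortlist = sorted(
    (loc for loc in candidates if meets_strategy(loc)),
    key=lambda loc: loc["score"],
    reverse=True,
)
print([loc["city"] for loc in shortlist])
```

Stating the thresholds as named constants keeps the business rule visible and easy to change when the strategy changes.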

2. The Data

For any kind of data analysis, having the data is indispensable. Data can be acquired from various relevant sources, so it may come in diverse types and formats. Your job is to cut and crush it according to its type so that it is usable for your recipe.

In a tabular representation of data, each column is a data field and each row is a record. Each record may be labelled uniquely with an ID.
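
To make the terms concrete, here is a small sketch using Python's built-in csv module. The table contents and the column name store_id are invented for illustration.

```python
import csv
import io

# A tiny, made-up table: the header row names the fields (columns),
# every following line is one record (row).
raw = """store_id,city,yearly_sales
S001,Munich,250000
S002,Stuttgart,180000
S003,Berlin,310000
"""

reader = csv.DictReader(io.StringIO(raw))
fields = reader.fieldnames   # the data fields, i.e. column names
records = list(reader)       # one dict per record

# Index the records by their unique ID for direct lookup.
by_id = {row["store_id"]: row for row in records}

print(fields)
print(by_id["S002"]["city"])
```

Note that csv reads every value as text, so numeric fields such as yearly_sales still need converting before any calculation.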

For example, to predict the next location for opening a new store, you may have to use yearly sales data, sales data for existing store locations, population density of the locations, total number of households, census data and land area. If your company sells pet products, you need the number of households with pets. If your company sells children's products, you need the number of households with children under 15.

The most common types of input files are .csv (comma-separated values), .xlsx (Excel workbook) and .txt (plain text). Importing an Excel file generally consumes more memory, whereas CSV files are usually faster to load and consume less memory.
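
Since several input files usually have to end up in one table, the sketch below reads two CSV sources and combines them on a shared key. It is only an illustration under assumptions: the column names, the key city and all figures are invented, and in-memory strings stand in for real files.

```python
import csv
import io

# Two made-up sources that share the key column "city".
sales_csv = """city,yearly_sales
Munich,250000
Berlin,310000
"""
census_csv = """city,population,households
Munich,1500000,800000
Berlin,3600000,2000000
Hamburg,1800000,1000000
"""


def read_table(text, key):
    """Read CSV text into a dict that maps the key column to its row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


sales = read_table(sales_csv, "city")
census = read_table(census_csv, "city")

# Inner join: keep only cities that appear in both sources.
blended = [{**census[city], **sales[city]} for city in sales if city in census]

# Write the blended table back out as a single CSV.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["city", "population", "households", "yearly_sales"]
)
writer.writeheader()
writer.writerows(blended)
print(out.getvalue())
```

An inner join silently drops Hamburg here because it has no sales row; whether that is the right behavior depends on the analysis.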

Regardless of the file type, you have to clean each of the files and then blend all of it into one file to do the analysis. You can check out more about this here:

3. The Software

The software used for the analysis can be selected depending on the kind of results you want and on your knowledge of programming languages like Python or R. Those who prefer not to program may simply use any modular analytics software. In such a tool, you just drag and drop the required functions and you are good to go, with beautifully structured results and presentations.

Popular ‘No Code’ analytics software includes:

  • Tableau: Data Visualization and Reporting
  • DataRobot: Automated Machine Learning Platform
  • RapidMiner: Useful for the entire life cycle from prediction to deployment
  • Alteryx: Advanced Analytics Platform
  • MLBase: Open Source
  • TriFacta: Free

For these, you simply need to go to their site, create an account and download the software (some may only offer trial versions for a limited period).

After that, just upload your data file and run the analysis. You may well have your results by the time you finish reading this article.

Popular IDEs for statistical computing:

  • PyCharm (Python)
  • Spyder (Anaconda Python distribution)
  • RStudio (R)

You can also directly start your data analytics projects online, without downloading or installing anything!

  • Google Colab
  • Microsoft Azure Notebooks

Preparation:

Different types of data come in different formats. Data from disparate sources usually requires cleansing, enriching and proper consolidation into one usable form before it can be used downstream. The technical terms generally used are data cleaning, feature selection, data transforms, feature engineering and dimensionality reduction.

Data cleaning and preparation is the most time-consuming task in the entire analysis process.

The first thing to do with any file is to check that the given path is correct and that the file opens without errors. Load the data in the software of your choice. Now, look inside.
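
In Python, that first check might look like the sketch below. The helper name load_csv and the file stores.csv are hypothetical; a temporary file is created only so the example can run on its own.

```python
import csv
import tempfile
from pathlib import Path


def load_csv(path):
    """Check that the path points to a file, then load it as a list of records."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Demo with a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "stores.csv"
    sample.write_text("store_id,city\nS001,Munich\n", encoding="utf-8")
    rows = load_csv(sample)

# Look inside: print the first few records.
print(rows[:5])
```

Failing early with a clear message is cheaper than discovering a wrong path halfway through an analysis.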

An example of looking at the data is the field summary tool in Alteryx that provides a summary of data for all fields. The summary is shown below:

Image by Priyanka Mane, from Alteryx Software

Analyze and interpret the data using statistical tools (e.g. finding correlations, trends, outliers, etc.). However, the data might have missing values, typing errors or heterogeneous date formats; these must first be identified and fixed for better results.
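
For readers working in Python rather than Alteryx, a per-field summary can be approximated with the standard library. This is only a sketch of the idea, not a reproduction of the Alteryx tool; the records and the function name summarize are invented for illustration.

```python
import statistics

# Made-up records as they might come out of a CSV reader (all values are text).
records = [
    {"city": "Munich", "population": "1500000", "yearly_sales": "250000"},
    {"city": "Berlin", "population": "", "yearly_sales": "310000"},
    {"city": "Hamburg", "population": "1800000", "yearly_sales": "n/a"},
]


def summarize(records, field):
    """Count missing and non-numeric entries and describe the numeric ones."""
    values = [row[field].strip() for row in records]
    present = [value for value in values if value != ""]
    numeric = []
    for value in present:
        try:
            numeric.append(float(value))
        except ValueError:
            pass  # not a number, e.g. a typing error or a text label
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "non_numeric": len(present) - len(numeric),
    }
    if numeric:
        summary.update(
            min=min(numeric), max=max(numeric), mean=statistics.mean(numeric)
        )
    return summary


for field in ("city", "population", "yearly_sales"):
    print(field, summarize(records, field))
```

A summary like this surfaces empty cells and stray text such as "n/a" before they distort any statistics.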

Image by Priyanka Mane, from Alteryx Software

· Variables

Categorical variables are variables that can take values or labels belonging to a fixed number of categories. Gender, recorded here with two categories (male and female), is a nominal categorical variable: the categories have no intrinsic ordering. An ordinal variable has a clear ordering. Temperature recorded as low, medium or high is an ordinal categorical variable with three ordered categories. Such variables are encoded using different techniques for easier analysis.
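
Two common encodings, sketched in plain Python under the assumption that one-hot encoding is used for the nominal variable and an integer mapping for the ordinal one (the article itself does not prescribe a technique):

```python
# Nominal variable: no order, so each category gets its own 0/1 column.
genders = ["female", "male", "female"]
categories = sorted(set(genders))  # fixed column order: ['female', 'male']
one_hot = [[1 if g == c else 0 for c in categories] for g in genders]

# Ordinal variable: the order matters, so map categories to ranked integers.
temperature_order = {"low": 0, "medium": 1, "high": 2}
temperatures = ["medium", "low", "high"]
ordinal = [temperature_order[t] for t in temperatures]

print(categories, one_hot)
print(ordinal)
```

Using ranked integers for a nominal variable would imply an order that does not exist, which is why the two cases are treated differently.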

Quantitative variables represent measurements and counts. They are of two types: continuous (may take any value within an interval) and discrete (countable).

Types of Data: Numerical and Categorical

The link below gives an overview of the methods to encode the variables.

You may have to deal with the following challenges while preparing the data:

  • Null Values / Missing data

Null values often show up in the data as NaN (“Not-a-Number”), a special floating-point value that marks an undefined or missing number. In Python, use math.isnan() to check whether a float is NaN; numpy.isnan() and pandas.isna() do the same for arrays and DataFrames. In Alteryx, the values are shown as [Null] after running the workflow. They can be filtered using IsNull in the Formula tool. Additionally, the Summarize tool helps you count nulls.
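
A minimal Python sketch of detecting missing entries, assuming they appear either as None or as a float NaN (the helper name is_missing is made up for this example):

```python
import math

values = [3.5, float("nan"), None, 7.0]


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


missing_flags = [is_missing(v) for v in values]
print(missing_flags)

# NaN never compares equal to anything, not even to itself,
# so an equality test cannot be used to find it.
nan = float("nan")
print(nan == nan)
```

The isinstance guard is there because math.isnan() raises a TypeError for values such as None or strings.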

When no data value is stored for an observation in the dataset, it is termed missing data or a missing value in statistics. Rubin described three mechanisms for the occurrence of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

The process of assigning substitute values to missing values is called imputation. If a small portion (up to 5%) of the data is missing, the values can be imputed using simple methods like the mean, median or mode, which use the other values in the same column. More principled methods include multiple imputation (MI), full information maximum likelihood (FIML) and expectation-maximization (EM).
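
The simple column-wise strategies can be sketched as follows. The function name impute and the sample ages are illustrative, and missing values are assumed to be stored as None.

```python
import statistics

# Map each strategy name to the function that computes the fill value.
STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def impute(column, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [value for value in column if value is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if value is None else value for value in column]


ages = [20, None, 40, 60, 40, 50]
for name in STRATEGIES:
    print(name, impute(ages, name))
```

The median and mode are less sensitive to extreme values than the mean, which matters when the column has not yet been checked for outliers.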

If more than 10% of a field's values are missing, it is advisable to drop the field, since it may introduce statistical bias into the results.
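
Applied in code, this rule of thumb could look like the sketch below. The records, field names and helper missing_share are invented for illustration; the 10% threshold is the one stated above.

```python
MAX_MISSING_SHARE = 0.10  # drop a field when more than 10% of its values are missing

# Ten made-up records; missing values are stored as None.
records = [{"sales": 100 + i, "population": 1000 * i} for i in range(10)]
records[3]["sales"] = None
records[1]["population"] = None
records[7]["population"] = None


def missing_share(records, field):
    """Fraction of records in which the field has no value."""
    return sum(1 for row in records if row.get(field) is None) / len(records)


fields = ["sales", "population"]
shares = {field: missing_share(records, field) for field in fields}

# Keep only fields at or below the threshold.
kept_fields = [field for field in fields if shares[field] <= MAX_MISSING_SHARE]
cleaned = [{field: row[field] for field in kept_fields} for row in records]

print(shares)
print(kept_fields)
```

Here sales is missing in exactly 10% of the records and is kept, while population at 20% is dropped.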

In Alteryx, the Field Summary tool shows the percentage of missing records for each data field.

Follow this link for steps and formulae to deal with missing data in Excel.

  • Heterogeneous Data

Fields like age, currency and date have a huge potential for errors due to non-uniform formats. For example, age can be written as 30 years, 30.2 Years or simply 30. The thousands separator and decimal separator for currency vary across countries and regions. Make sure that these columns have a uniform format.
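
One way to bring such values into a uniform numeric format is sketched below. The helper names parse_age and parse_amount are made up, and the currency helper assumes you already know which separators a given source uses.

```python
import re


def parse_age(text):
    """Extract the first number from inputs like '30 years', '30.2 Years' or 30."""
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else None


def parse_amount(text, thousands=",", decimal="."):
    """Convert a formatted amount to float, given the separators the source uses."""
    cleaned = text.strip().replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


print([parse_age(a) for a in ["30 years", "30.2 Years", 30, "unknown"]])
print(parse_amount("1,234.56"))                               # US-style formatting
print(parse_amount("1.234,56", thousands=".", decimal=","))  # German-style formatting
```

Guessing the separators from the data alone is unreliable, since a value like 1,234 is ambiguous, so the convention is passed in explicitly.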

Image by Micha Sager from Pixabay

  • Outliers

Once the dataset is cleaned, it is time to run another pre-processing step over it. Outliers are unusual values in the dataset that may cause statistical errors in your calculations. They lie abnormally far from the other values in a dataset and can severely distort your output values.

Image from Statistics by Jim

Scatterplots help immensely when you need to instantly identify outliers in your data. Simply visualize the relationship between each predictor variable and the target variable using plots.
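
A plot is the visual route. As a numeric complement that this article does not itself describe, a common rule of thumb flags values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule to made-up numbers.

```python
import statistics

# Made-up measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 95]

# Quartiles via the standard library (default "exclusive" method).
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule of thumb.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme observation still needs a look at the data and its context.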

Here are some links to help you with the terms and methods to deal with outliers:

That was all for part 1. Check out part 2 for the analysis and presentation phases of a data science project. Stay tuned!

Translated from: https://towardsdatascience.com/how-to-get-started-with-any-kind-of-data-part-1-c1746c66bc2d
