【Machine Learning Basics】Classifying Iris Flowers - the Hello World of Machine Learning

1 Project Overview

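【Background】
Suppose an amateur botanist wants to identify the species of some iris flowers she has found. For each flower she has collected a few measurements: the length and width of the petals and the length and width of the sepals, all in centimeters. She also has measurements of other irises that a botanical expert has already identified as belonging to one of three species: setosa, versicolor, or virginica. For these expert-labeled measurements, she can be certain which species each iris belongs to.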

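【Goal】Build a machine learning model that learns from the measurements of the irises whose species is known, so that it can predict the species of a new iris.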

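【Analysis】This is a supervised learning problem, because the correct species of the training irises is already known. It is also a classification problem, because the goal is to predict one of a fixed set of species.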
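The supervised classification setup described above can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the text has not chosen a model, so the k-nearest neighbors classifier, the default train/test split, and random_state=0 are illustrative choices rather than the author's method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()

# Hold out part of the labeled data so the model is scored on irises it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Assumption: k-nearest neighbors is used purely as an example classifier.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Fraction of held-out irises whose species is predicted correctly.
test_accuracy = knn.score(X_test, y_test)
print('Test set accuracy: {:.2f}'.format(test_accuracy))
```

Scoring on held-out data matters here because the goal is to predict new irises, not to reproduce labels the model has already seen.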

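【Terminology】

  • Class: one of the possible outputs (here, the different species of iris)
  • Label: the expected output for a single data point
  • Sample: an individual item in the dataset (here, one flower)
  • Feature: a measured property of a sample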
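As a rough illustration, the four terms above can be mapped onto the iris data shipped with scikit-learn. The variable names classes, features, first_sample and first_label are introduced here for illustration only.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Classes: the possible outputs, i.e. the three species names.
classes = [str(name) for name in iris_dataset['target_names']]

# Features: the properties measured for every sample.
features = iris_dataset['feature_names']

# Sample: one individual flower, stored as one row of measurements.
first_sample = iris_dataset['data'][0]

# Label: the expected output for that sample, encoded as an integer class index.
first_label = iris_dataset['target'][0]

print('Classes: {}'.format(classes))
print('Features: {}'.format(features))
print('First sample: {}'.format(first_sample))
print('First label: {} ({})'.format(first_label, classes[first_label]))
```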

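【Note】from … import … places names directly in the current namespace, which can cause namespace pollution (one imported name silently replacing another), so it is best not to overuse it.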
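A small sketch of the kind of name clash meant above, using only the standard library. The math and cmath modules are chosen simply because both define sqrt; they are not part of the iris project.

```python
import math
from math import sqrt
from cmath import sqrt  # rebinds the name sqrt; the math version is now shadowed

# The bare name now refers to cmath.sqrt, which returns a complex number.
shadowed = sqrt(4)

# The module-qualified call is unambiguous and still returns a float.
qualified = math.sqrt(4)

print(type(shadowed).__name__, shadowed)
print(type(qualified).__name__, qualified)
```

Importing the module and calling math.sqrt keeps the origin of every name visible at the call site.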

1.1 A First Look at the Data

【Keywords】Bunch object; load_iris

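from sklearn.datasets import load_iris
# load_iris() returns a Bunch object, which behaves much like a dictionary
iris_dataset = load_iris()
# List the keys under which the data and its metadata are stored
print('Keys of iris dataset: \n{}'.format(iris_dataset.keys()))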
Keys of iris dataset: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
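Beyond listing the keys, a short sketch like the following can show how the Bunch object is laid out. It relies on Bunch allowing attribute access as well as key access; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# A Bunch can be read like a dictionary or through attributes;
# both forms return the same underlying array.
same_array = iris_dataset.data is iris_dataset['data']

# One row per flower, one column per measured feature.
data_shape = iris_dataset['data'].shape

# One integer label per flower.
target_shape = iris_dataset['target'].shape

print('Attribute and key access agree: {}'.format(same_array))
print('Shape of data: {}'.format(data_shape))
print('Shape of target: {}'.format(target_shape))
```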

【DESCR】The value stored under this key is a short description of the dataset.

print(iris_dataset['DESCR']+'\n')
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%[email protected])
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, 
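The class distribution and summary statistics quoted in the description can be checked against the arrays themselves. This is a minimal sketch using NumPy; whether the SD column was computed as a sample or a population standard deviation is not stated, so ddof=1 below is an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset['data']
y = iris_dataset['target']

# Number of samples in each of the three classes.
class_counts = np.bincount(y)

# Per-feature statistics, computed column by column.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_mean = X.mean(axis=0)
col_sd = X.std(axis=0, ddof=1)  # assumption: sample standard deviation

print('Samples per class: {}'.format(class_counts))
for name, lo, hi, mean, sd in zip(
        iris_dataset['feature_names'], col_min, col_max, col_mean, col_sd):
    print('{:<20} min={:.1f} max={:.1f} mean={:.2f} sd={:.2f}'.format(
        name, lo, hi, mean, sd))
```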
