Python normalization
Min-max scaling with MinMaxScaler
Standardization & scaling with the median and interquartile range
Normalizing with the Manhattan and Euclidean norms
4.1 Rescaling a Feature
Use scikit-learn's MinMaxScaler to rescale a feature array
# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
feature = np.array([
    [-500.5],
    [-100.1],
    [0],
    [100.1],
    [900.9]
])
feature

# Create scaler and rescale the feature to the range 0-1
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaled_feature = minmax_scaler.fit_transform(feature)
scaled_feature
array([[0. ],
[0.28571429],
[0.35714286],
[0.42857143],
[1. ]])
Discussion
Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specifically, min-max calculates:
$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $x$ is the feature vector, $x_i$ is an individual element of feature $x$, and $x'_i$ is the rescaled element.
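As a minimal sketch (plain NumPy rather than scikit-learn, and assuming the `feature` array defined above), the same calculation can be done by hand and should reproduce the MinMaxScaler output:

# Manual min-max scaling: subtract the minimum, divide by the range
(feature - feature.min()) / (feature.max() - feature.min())  # matches scaled_feature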
Standardization & scaling with the median and interquartile range
4.2 Standardizing a Feature
scikit-learn's StandardScaler transforms a feature to have a mean of 0 and a standard deviation of 1.
# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
feature = np.array([
    [-1000.1],
    [-200.2],
    [500.5],
    [600.6],
    [9000.9]
])

# Create scaler and transform the feature
scaler = preprocessing.StandardScaler()
standardized = scaler.fit_transform(feature)
standardized
array([[-0.76058269],
[-0.54177196],
[-0.35009716],
[-0.32271504],
[ 1.97516685]])
Discussion
A common alternative to min-max scaling is rescaling features to be approximately standard normally distributed. To achieve this, we use standardization to transform the data such that it has a mean, $\bar{x}$, of 0 and a standard deviation, $\sigma$, of 1. Specifically, each element in the feature is transformed so that:
$$x'_i = \frac{x_i - \bar{x}}{\sigma}$$
where $x'_i$ is our standardized form of $x_i$. The transformed feature represents the number of standard deviations the original value is away from the feature's mean value (also called a z-score in statistics).
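As a quick sketch of the same calculation done by hand with plain NumPy (assuming the `feature` array defined above; note that StandardScaler divides by the population standard deviation, which is also NumPy's default):

# Manual standardization (z-scores): subtract the mean, divide by the standard deviation
(feature - feature.mean()) / feature.std()  # matches the StandardScaler output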
Standardization is a common go-to scaling method for machine learning preprocessing, and in my experience it is used more often than min-max scaling. However, it depends on the learning algorithm. For example, principal component analysis often works better with standardization, while min-max scaling is often recommended for neural networks. As a general rule, I'd recommend defaulting to standardization unless you have a specific reason to use an alternative.
We can see the effect of standardization by looking at the mean and standard deviation of our solution's output:
print("Mean {}".format(round(standardized.mean())))
print("Standard Deviation: {}".format(standardized.std()))
Mean 0.0
Standard Deviation: 1.0
If our data has significant outliers, they can negatively impact our standardization by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range. In scikit-learn, we do this using the RobustScaler transformer:
# Create robust scaler and transform the feature
robust_scaler = preprocessing.RobustScaler()
robust_scaler.fit_transform(feature)
array([[-1.87387612],
[-0.875 ],
[ 0. ],
[ 0.125 ],
[10.61488511]])
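As a rough sketch of what RobustScaler is doing under the hood (assuming its default quantile range of the 25th to 75th percentile), the same result can be reproduced by hand:

# Manual robust scaling: subtract the median, divide by the interquartile range
iqr = np.percentile(feature, 75) - np.percentile(feature, 25)
(feature - np.median(feature)) / iqr  # matches the RobustScaler output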
Normalizing with the Manhattan and Euclidean norms
4.3 Normalizing Observations
Use scikit-learn's Normalizer to rescale the feature values of each observation to have unit norm (a total length of 1):
# Load libraries
import numpy as np
from sklearn.preprocessing import Normalizer

# Create feature matrix
features = np.array([
    [0.5, 0.5],
    [1.1, 3.4],
    [1.5, 20.2],
    [1.63, 34.4],
    [10.9, 3.3]
])

# Create normalizer and transform the feature matrix
normalizer = Normalizer(norm="l2")
normalizer.transform(features)
array([[0.70710678, 0.70710678],
[0.30782029, 0.95144452],
[0.07405353, 0.99725427],
[0.04733062, 0.99887928],
[0.95709822, 0.28976368]])
Discussion
Many rescaling methods operate on features; however, we can also rescale across individual observations. Normalizer rescales the values of each individual observation to have unit norm (a total length of 1). This type of rescaling is often used when we have many equivalent features (e.g., in text classification, where every word or n-word group is a feature).
Normalizer provides three norm options with Euclidean norm (often called L2) being the default:
$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$$
where $x$ is an individual observation and $x_n$ is that observation's value for the $n$th feature.
Alternatively, we can specify Manhattan norm (L1):
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$
Intuitively, the L2 norm can be thought of as the distance between two points in New York for a bird (i.e., a straight line), while the L1 norm can be thought of as the distance for a human walking on the street (walk north one block, east one block, north one block, east one block, etc.), which is why it is called the "Manhattan norm" or "Taxicab norm".
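As a minimal hand-rolled sketch with plain NumPy (assuming the `features` matrix defined above), dividing each row by its own L2 norm reproduces Normalizer's default behavior:

# Manual L2 normalization: divide each observation (row) by its Euclidean length
l2_norms = np.linalg.norm(features, axis=1, keepdims=True)
features / l2_norms  # matches Normalizer(norm="l2")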
Practically, notice that norm='l1' rescales an observation's values so they sum to 1, which can sometimes be a desirable quality:
features_l1_norm = Normalizer(norm="l1").transform(features)
features_l1_norm
print("Sum of the first observation's values: {}".format(features_l1_norm[0,0] + features_l1_norm[0,1]))
Sum of the first observation's values: 1.0
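The third norm option Normalizer accepts is the max norm (norm="max"), which, as a brief sketch, divides each observation's values by that observation's largest absolute value:

# Rescale each observation by its maximum absolute value
Normalizer(norm="max").transform(features)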