Official tutorial, extracted for personal use.
import pandas as pd
# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)
# print a summary of the data in Melbourne data
melbourne_data.describe()
count: shows how many rows have non-missing values.
std: the standard deviation, which measures how numerically spread out the values are.
import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()
# shows the top few rows
X.head()
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X, y)
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)
# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, val_predictions))
Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting.
The max_leaf_nodes argument provides a very sensible way to control overfitting vs. underfitting. The more leaves we allow the model to make, the more we move from the underfitting area to the overfitting area.
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size.
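A minimal sketch of that final step (assuming the get_mae helper and the candidate leaf counts from above are still in scope):
# pick whichever candidate leaf count gave the lowest validation MAE
best_tree_size = min([5, 50, 500, 5000],
                     key=lambda leaves: get_mae(leaves, train_X, val_X, train_y, val_y))
# refit on all of X and y, keeping that tree size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)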
1.data-loading
import pandas as pd
# Load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
2.build a random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
# simply drop columns with missing values
data_without_missing_values = original_data.dropna(axis=1)
# drop the same columns in both a training dataset and a test dataset.
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
# make copy to avoid changing original data (when Imputing)
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
# make new columns indicating what will be imputed
cols_with_missing = (col for col in X_train.columns
                     if X_train[col].isnull().any())
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
Categorical data is data that takes only a limited number of values.
Here we’ll show the most popular method for encoding categorical variables.
One hot encoding is the most widespread approach, and it works very well unless your categorical variable takes on a large number of values (more than 15 different values).
One hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data.
Pandas assigns a data type (called a dtype) to each column.
Object indicates a column has text (there are other things it could theoretically be, but that’s unimportant for our purposes). It’s most common to one-hot encode these “object” columns, since they can’t be plugged directly into most models. Pandas offers a convenient function called get_dummies to get one-hot encodings.
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
Ensure the test data is encoded in the same manner as the training data with the align command:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left',
                                                                    axis=1)
The argument join='left' specifies that we will do the equivalent of SQL’s left join. That means, if there are ever columns that show up in one dataset and not the other, we will keep exactly the columns from our training data.
The argument join='inner' would do what SQL databases call an inner join, keeping only the columns showing up in both datasets.
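For comparison, a small sketch of the same align call with join='inner', keeping only the shared dummy columns:
inner_train, inner_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='inner',
                                                                    axis=1)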
1.Pipelines: Scikit-learn offers a class for one-hot encoding and this can be added to a Pipeline. Unfortunately, it doesn’t handle text or object values.
2.Applications To Text for Deep Learning: Keras and TensorFlow have functionality for one-hot encoding, which is useful for working with text.
3.Categoricals with Many Values: Scikit-learn’s FeatureHasher uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.
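A minimal FeatureHasher sketch (the 'Suburb' column and the 32-column width are only illustrative choices, not part of the tutorial):
from sklearn.feature_extraction import FeatureHasher
# hash each category into a fixed number of columns instead of one column per distinct value
hasher = FeatureHasher(n_features=32, input_type='string')
hashed_features = hasher.transform([[s] for s in melbourne_data['Suburb'].astype(str)])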
XGBoost models dominate many Kaggle competitions.
XGBoost models require more knowledge and model tuning than techniques like Random Forest.
1.data pre-loading.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
data = pd.read_csv('../input/train.csv')
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
2.build and fit a model
from xgboost import XGBRegressor
my_model = XGBRegressor()
# Add verbose=False to avoid printing out updates with each cycle
my_model.fit(train_X, train_y, verbose=False)
3.evaluate a model and make predictions
# make predictions
predictions = my_model.predict(test_X)
from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))
XGBoost has a few parameters that can dramatically affect your model’s accuracy and training speed.
1.n_estimators specifies how many times to go through the modeling cycle.
In the underfitting vs overfitting graph, n_estimators moves you further to the right.
You can experiment with your dataset to find the ideal. Typical values range from 100-1000, though this depends a lot on the learning rate discussed below.
2.early_stopping_rounds offers a way to automatically find the ideal n_estimators value.
Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren’t at the hard stop for n_estimators. It’s smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.
Since random chance sometimes causes a single round where validation scores don’t improve, you need to specify a number for how many rounds of straight deterioration to allow before stopping. early_stopping_rounds = 5 is a reasonable value. Thus we stop after 5 straight rounds of deteriorating validation scores.
my_model = XGBRegressor(n_estimators=1000)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
             eval_set=[(test_X, test_y)], verbose=False)
3.learning_rate: In general, a small learning rate (and large number of estimators) will yield more accurate XGBoost models, though it will also take the model longer to train since it does more iterations through the cycle.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
             eval_set=[(test_X, test_y)], verbose=False)
4.n_jobs: On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It’s common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won’t help.
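For example (a small sketch; 4 is only a placeholder for your machine's core count):
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
             eval_set=[(test_X, test_y)], verbose=False)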
Partial dependence plots show how each variable or predictor affects the model’s predictions.
The partial dependence plot is calculated only after the model has been fit.
After the model is fit, we use it to predict the price of a house, but we change the distance variable before making each prediction. We first predict the price for that house setting distance to 4. We then predict its price setting distance to 5, then again for 6, and so on.
We repeat that mental experiment with multiple houses, and we plot the average predicted price on the vertical axis.
Some negative numbers mean the prices would have been less than the actual average price for that distance.
In the left graph, we see house prices fall as we get further from the central business district, though there seems to be a nice suburb about 16 kilometers out where home prices are higher than in many nearer and farther suburbs.
The right graph shows the impact of building area, which is interpreted similarly. A larger building area means higher prices.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
# get_some_data is defined in hidden cell above.
X, y = get_some_data()
# scikit-learn originally implemented partial dependence plots only for Gradient Boosting models
# this was due to an implementation detail, and a future release will support all model types.
my_model = GradientBoostingRegressor()
# fit the model as usual
my_model.fit(X, y)
# Here we make the plot
my_plots = plot_partial_dependence(my_model,
                                   features=[0, 2],  # column numbers of plots we want to show
                                   X=X,  # raw predictor data
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'],  # labels on graphs
                                   grid_resolution=10)  # number of values to plot on x axis
There is a function called partial_dependence to get the raw data making up this plot.
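A sketch of that call, using the same (since-deprecated) module imported above:
# raw values behind the first plot: averaged predictions plus the grid of feature values
pdp, axes = partial_dependence(my_model, [0], X=X, grid_resolution=10)
print(pdp)
print(axes)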
Pipelines are a simple way to keep your data processing and modeling code organized.
Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
# fit and predict using this pipeline as a fused whole
my_pipeline.fit(train_X, train_y)
predictions = my_pipeline.predict(test_X)
Most scikit-learn objects are either transformers or models.
Transformers are for pre-processing before modeling.
Models are used to make predictions.
Your pipeline must start with transformer steps and end with a model.
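So a pipeline with several transformer steps and one final model is fine. A minimal sketch (StandardScaler is only here to illustrate a second transformer; the tutorial itself doesn't use it):
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# two transformers (imputer, scaler) followed by one model
my_pipeline = make_pipeline(SimpleImputer(), StandardScaler(), RandomForestRegressor())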
Machine learning is an iterative process.
You will face choices about which predictive variables to use, what types of models to use, what arguments to supply those models, etc. We make these choices in a data-driven way by measuring the model quality of various alternatives.
The Shortcoming of Train-Test Split: we can only get a large test set by removing data from our training data, and smaller training datasets mean worse models.
In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality. For example, we could have 5 folds or experiments. We divide the data into 5 pieces, each being 20% of the full dataset.
Trade-offs Between Cross-Validation and Train-Test Split: Cross-validation gives a more accurate measure of model quality; however, it can take more time to run.
If your dataset is smaller, you should run cross-validation.
A simple train-test split is sufficient for larger datasets. It will run faster, and you may have enough data that there’s little need to re-use some of it for holdout.
1.read the data
import pandas as pd
data = pd.read_csv('../input/melb_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price
2.specify a pipeline (It can be very difficult to do cross-validation properly if you aren’t using pipelines)
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
3.cross-validation scores
from sklearn.model_selection import cross_val_score
scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
print(scores)
print('Mean Absolute Error %.2f' %(-1 * scores.mean()))
Specifically, leakage causes a model to look accurate until you start making decisions with the model, and then the model becomes very inaccurate. There are two main types of leakage: Leaky Predictors and Leaky Validation Strategies.
Leaky Predictors: this occurs when your predictors include data that will not be available at the time you make predictions.
For example, imagine predicting got_pneumonia from data that also contains a took_antibiotic_medicine column. People take antibiotic medicines after getting pneumonia in order to recover, so the raw data shows a strong relationship between those columns. But took_antibiotic_medicine is frequently changed after the value for got_pneumonia is determined. This is target leakage.
The model would see that anyone who has a value of False for took_antibiotic_medicine didn’t have pneumonia. Validation data comes from the same source, so the pattern will repeat itself in validation, and the model will have great validation (or cross-validation) scores. But the model will be very inaccurate when subsequently deployed in the real world.
To prevent this type of data leakage, any variable updated (or created) after the target value is realized should be excluded. Because when we use this model to make new predictions, that data won’t be available to the model.
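In the pneumonia example, that means dropping the leaky column before fitting. A sketch with hypothetical names (patient_data is not part of the tutorial's dataset):
# took_antibiotic_medicine is set after the target is known, so it must be excluded
y_patients = patient_data.got_pneumonia
X_patients = patient_data.drop(['got_pneumonia', 'took_antibiotic_medicine'], axis=1)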
Leaky predictors frequently have high statistical correlations to the target.
So two tactics to keep in mind:
1.To screen for possible leaky predictors, look for columns that are statistically correlated to your target (basic data comparisons; see the sketch after this list).
2.If you build a model and find it extremely accurate, you likely have a leakage problem.
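A sketch of the screening in tactic 1, assuming a DataFrame X of predictors and a target y:
# correlation of each numeric predictor with the target; anything suspiciously close to 1 deserves scrutiny
correlations = X.select_dtypes(include=['number']).corrwith(y)
print(correlations.abs().sort_values(ascending=False).head(10))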
A much different type of leak occurs when you aren’t careful to distinguish training data from validation data.
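A typical example is fitting preprocessing on the full dataset before splitting. A sketch of the safer pattern, reusing the pipeline idea from above:
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
# leaky: calling SimpleImputer().fit_transform(X) before the split lets validation rows influence the imputation
# safer: split first, then let the pipeline fit its preprocessing on the training rows only
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
my_pipeline.fit(train_X, train_y)
predictions = my_pipeline.predict(val_X)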