Kaggle Intermediate ML Part One

RandomForestRegressor


What it is:

  • Ensemble method: It combines multiple decision trees to create a more robust and accurate model than a single tree.
  • Regression task: It's specifically designed to predict continuous numerical values (e.g., house prices, sales figures, temperature).
  • Bagging technique: Each tree is trained on a bootstrap sample of the rows and, at each split, considers only a random subset of the features, which decorrelates the trees and reduces overfitting.

How it works:

  1. Tree building:

    • For each tree in the forest:
      • Randomly sample the training rows with replacement (bootstrapping).
      • At each node, randomly select a subset of features to consider for splitting.
      • Split the node based on the feature that best separates the data.
      • Continue splitting until a stopping criterion is met (e.g., maximum depth).
  2. Prediction:

    • To make a prediction for a new data point, feed it through each tree in the forest.
    • Each tree provides its own prediction.
    • Average the predictions from all trees to get the final prediction.
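A minimal sketch of this averaging behavior in scikit-learn (the synthetic data and parameter values are illustrative): every fitted tree is exposed in estimators_, and averaging their individual predictions reproduces the forest's output.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Illustrative synthetic regression data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Each fitted tree makes its own prediction...
per_tree_preds = np.stack([tree.predict(X[:3]) for tree in forest.estimators_])

# ...and the forest's prediction is simply their average
print(per_tree_preds.mean(axis=0))
print(forest.predict(X[:3]))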

Key advantages:

  • Accuracy: Often outperforms single decision trees and simpler baselines such as linear regression, with relatively little tuning.
  • Robustness to outliers and noise: Less sensitive to individual data points.
  • Handles non-linear relationships: Able to capture complex patterns in data.
  • Built-in feature importance: Assesses the relative importance of features (a short sketch follows this list).
  • Missing values: Some implementations can work with incomplete data directly, but in practice missing values are usually dropped or imputed before training, as covered in the Missing Values section below.
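As a quick sketch of the built-in feature importances mentioned above (the data and column names here are made up for readability):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative data with invented column names
X, y = make_regression(n_samples=300, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["rooms", "distance", "land_size", "year_built"])

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Relative importance of each feature (the values sum to 1)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))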

Expert insights:

  • Tuning hyperparameters: Carefully adjust parameters like the number of trees, maximum depth, and number of features considered per split to optimize performance (a tuning sketch follows this list).
  • Interpretability: While less interpretable than individual decision trees, techniques like feature importance can provide insights.
  • Computational cost: Training can be time-consuming for large datasets.
  • Consider alternatives: For certain tasks, linear regression or gradient boosting machines might be more suitable.
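A hedged sketch of hyperparameter tuning with GridSearchCV; the synthetic data and the grid values are only examples, and reasonable ranges depend on your dataset.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=8, noise=5, random_state=0)

# Example grid over the hyperparameters mentioned above
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "max_features": [1.0, "sqrt"],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean absolute error of the best combination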

Decision Trees


What are Decision Trees?

  • Structure: Imagine an inverted tree, with the root node at the top and leaf nodes at the bottom. Each node represents a question about a feature in your data, and the branches leading away from it represent the possible answers to that question.
  • Classification vs. Regression: Decision trees can be used for either classification (predicting discrete categories) or regression (predicting continuous values). For classification, the leaf nodes would contain class labels (e.g., spam/not spam), while for regression, they would contain predicted values (e.g., house price).
  • Decision-Making Process: To make a prediction, the model starts at the root node and asks the question associated with that node. Based on the answer, it follows the corresponding branch to the next node, asking another question. This process continues until it reaches a leaf node, which contains the final prediction.

How Do Decision Trees Work?

  1. Data Preparation: Divide your data into two sets: a training set for building the model and a testing set for evaluating its performance. Ensure your features are well-suited for decision trees (e.g., numerical or categorical with a small number of values).
  2. Recursive Splitting:
    • Start with the entire training set at the root node.
    • Choose the feature (and split point) that best divides the data into subsets that are more homogeneous with respect to the target variable. For classification this is typically measured with an impurity measure like Gini impurity or information gain; for regression, with variance (or MSE) reduction.
    • Create branches for each possible value of the chosen feature, leading to child nodes.
    • Repeat this process recursively for each child node, using the remaining features to find the best split, until a stopping criterion is met (e.g., maximum depth, minimum leaf size).
  3. Prediction: When presented with a new data point, the model traverses the tree by asking questions at each node until it reaches a leaf node. The prediction is the value or class label associated with that leaf node.
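A minimal regression-tree sketch in scikit-learn tying these steps together; the synthetic data and the max_depth value are illustrative.

from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Step 1: illustrative data and a train/validation split
X, y = make_regression(n_samples=400, n_features=3, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Step 2: recursive splitting; max_depth acts as the stopping criterion here
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Step 3: each validation row is routed to a leaf and receives that leaf's value
preds = tree.predict(X_valid)
print(mean_absolute_error(y_valid, preds))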

Why Use Decision Trees?

  • Interpretability: One of the key strengths of decision trees is their interpretability. By following the branches, you can easily understand the logic behind the model's predictions, which is especially valuable when you need to explain or justify those predictions (a sketch of printing a tree's rules follows this list).
  • No Feature Scaling: Many machine learning models require feature scaling, but decision trees generally work well without it, since splits depend only on thresholds on individual features.
  • Robustness to Missing Data: Decision trees can handle missing data by imputing values or using surrogate splits.
  • Efficiency: They can be relatively fast to train, especially for smaller datasets.
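Because the model is just a sequence of threshold questions, scikit-learn can print the learned rules directly. A small sketch with made-up feature names:

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text
import pandas as pd

# Tiny illustrative problem with invented feature names
X, y = make_regression(n_samples=200, n_features=2, random_state=0)
X = pd.DataFrame(X, columns=["rooms", "distance"])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# The learned model printed as human-readable if/else rules
print(export_text(tree, feature_names=list(X.columns)))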

Potential Drawbacks:

  • Overfitting: If not carefully controlled, decision trees can become overly complex and learn the training data too well, leading to poor performance on unseen data (overfitting). Pruning techniques (e.g., cost-complexity pruning) can help address this; a sketch follows this list.
  • Bias: Like any model, decision trees can be biased, reflecting biases in the training data.
  • Performance: For complex problems, decision trees might not achieve the same level of accuracy as other models like ensemble methods.
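To illustrate the pruning idea, here is a hedged sketch using scikit-learn's cost-complexity pruning (ccp_alpha); the synthetic data and the particular alpha chosen are purely illustrative.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=15, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# An unpruned tree tends to memorize the training data
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths from the cost-complexity pruning path
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary midpoint, for illustration

pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

# R^2 on held-out data, before vs. after pruning
print(full_tree.score(X_valid, y_valid), pruned_tree.score(X_valid, y_valid))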


Missing Values

In machine learning, missing values (also known as missing data) refer to the absence of data points in a dataset. This can occur due to various reasons, such as:

  • Data entry errors or omissions: Data might not be collected or recorded correctly.
  • Technical issues: Faulty sensors, equipment malfunctions, or software errors can lead to data loss.
  • Privacy concerns: Sensitive information might be anonymized or removed.
  • Inherent limitations: Measurements might not be feasible for certain values (e.g., income for unemployed individuals).

Missing values are a common challenge in ML, potentially affecting model performance and leading to inaccurate or biased results. Here's why:

Impact of Missing Values:

  • Reduced sample size: Dropping rows or columns with missing values can decrease the available data for training, leading to less robust models.
  • Biased estimations: If missing values are not handled carefully, algorithms might misinterpret patterns in the data, resulting in biased predictions.
  • Misleading interpretations: Ignoring missing values or using simple imputation techniques can mask underlying relationships and hinder proper interpretation of results.

Effective Approaches to Handle Missing Values:

The optimal approach depends on the characteristics of your data, the type of missingness (MCAR: missing completely at random, MAR: missing at random, MNAR: missing not at random), and the objectives of your analysis. Here are some common strategies:

  1. Deletion:
    • Simple: Remove rows or columns with missing values.
    • Efficient: Reduces computational cost.
    • Drawbacks: Can significantly reduce the data size and can introduce bias if the missingness is not random.
  2. Imputation:
    • Estimate missing values based on available data.
    • Mean/median/mode imputation: Replace missing values with the average, median, or most frequent value, respectively. Simple but might create unrealistic estimates.
    • K-Nearest Neighbors (KNN): Impute values based on similar instances in the data. More sophisticated but computationally expensive (see the sketch after this list).
    • Model-based imputation: Use statistical models (e.g., regression) to predict missing values. Can capture complex relationships but requires careful model selection.
  3. Machine Learning with missing data handling:
    • Use algorithms explicitly designed for missing data:
      • Missing value indicators: Introduce binary features to represent missingness (demonstrated as Approach 3 in the Coding section below).
      • Multiple imputation: Create multiple imputed datasets, train models on each, and aggregate results.
      • Missingness-robust estimators: Train models that are less sensitive to missing values.
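A small sketch of the KNN-based strategy from the list above using scikit-learn's KNNImputer; the tiny DataFrame and the n_neighbors value are illustrative.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Tiny illustrative frame with one missing entry
df = pd.DataFrame({
    "rooms": [2, 3, 3, 4],
    "land_size": [150.0, np.nan, 210.0, 300.0],
})

# Each missing value is filled in from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)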

Additional Considerations:

  • Understand the reasons for missingness: This can help guide the choice of imputation strategy.
  • Investigate patterns: Check if missingness is related to other variables to avoid introducing bias.
  • Evaluate the impact of different approaches: Compare model performance and robustness with different techniques.
  • Document your choices: Explain how you handled missing values for transparency and reproducibility.
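Before choosing a strategy, it helps to see how much is actually missing and where. A small sketch, assuming a training DataFrame named X_train such as the one built in the Coding section below:

# Number of missing values per column, largest first
missing_by_column = X_train.isnull().sum()
print(missing_by_column[missing_by_column > 0].sort_values(ascending=False))

# Share of rows with at least one missing value
print(X_train.isnull().any(axis=1).mean())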

Coding

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('/kaggle/input/melb-data/melb_data.csv')

# Select target
y = data.Price

# To keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

####################### SimpleImputer ######################
from sklearn.impute import SimpleImputer

# Imputation (mean imputation by default)
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
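A possible Approach 3, sketching the missing-value indicator idea from the handling list above (what the Kaggle course presents as an extension to imputation): impute as before, but first add a boolean column per incomplete feature recording which rows were originally missing. It continues the script above.

####################### Extension to Imputation ######################
from sklearn.impute import SimpleImputer

# Make copies to avoid changing the original data
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Add a boolean column per incomplete feature marking which rows were missing
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Impute as in Approach 2
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))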

See

Kaggle: Your Machine Learning and Data Science Community

https://williamkoehrsen.medium.com/random-forest-simple-explanation-377895a60d2d

https://bard.google.com/chat/cc50bd9880314190
