DS Wannabe之5-AM Project: DS 30day int prep day3

Q1. How do you treat heteroscedasticity in regression?

Heteroscedasticity means an unequal spread (scatter) in a distribution. In regression analysis, we generally talk about heteroscedasticity in the context of the error term. Heteroscedasticity is a systematic change in the spread of the residuals or errors over the range of measured values. It is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a random population that has a constant variance.

In regression analysis, heteroscedasticity means that the spread of the residuals (the differences between observed values and the model's predictions) is not constant. It often appears when a dataset spans a wide range between its smallest and largest values. There can be many causes of heteroscedasticity; a common explanation is that the error variance changes proportionally with some factor.

Heteroscedasticity can broadly be divided into two categories:

  • Pure heteroscedasticity: this refers to cases where we specify the correct model and still observe non-constant variance in the residual plots.
  • Impure heteroscedasticity: this refers to cases where we specify the model incorrectly, and that misspecification causes the non-constant variance. When an important variable is left out of a model, its effect is absorbed into the error term. If the effect of the omitted variable varies throughout the observed range of the data, it can produce the telltale signs of heteroscedasticity in the residual plots.

How to Fix Heteroscedasticity

Redefining the variables:

If your model is a cross-sectional model that includes large differences between the sizes of the observations, you can find different ways to specify the model that reduces the impact of the size differential. To do this, change the model from using the raw measure to using rates and per capita values. Of course, this type of model answers a slightly different kind of question. You’ll need to determine whether this approach is suitable for both your data and what you need to learn.


Weighted regression:

It is a method that assigns each data point a weight based on the variance of its fitted value. The idea is to give small weights to observations associated with higher variances to shrink their squared residuals. Weighted regression minimizes the sum of the weighted squared residuals. When you use the correct weights, heteroscedasticity is replaced by homoscedasticity.


In practice, weighted regression can be used to address heteroscedasticity so that the model treats all data points as having the same variance, satisfying the homoscedasticity assumption of ordinary least squares. It is especially useful when the error variance changes as an explanatory variable changes.

The key to weighted regression is determining the weight for each observation. In some cases the weights can be set from prior knowledge or external information; in other cases they need to be estimated from a preliminary analysis of the data, for example by examining the residual plots to identify how the variance changes and setting the weights accordingly.

By giving smaller weights to observations with larger variance, weighted regression helps ensure that the model is not unduly influenced by points with large deviations, improving its overall stability and predictive accuracy.
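As an illustration, a minimal weighted least squares sketch, assuming statsmodels is available and that the weights are taken to be inversely proportional to the (assumed) error variance; the data and the variance model are synthetic:

import numpy as np
import statsmodels.api as sm

# Toy data where the error spread grows with x (heteroscedastic)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 100)
y = 2 + 3 * x + rng.normal(scale=0.5 * x)   # noise standard deviation grows with x

X = sm.add_constant(x)

# OLS ignores the changing variance
ols_fit = sm.OLS(y, X).fit()

# WLS: weight each observation by 1 / variance (here variance assumed proportional to x^2)
weights = 1.0 / (x ** 2)
wls_fit = sm.WLS(y, X, weights=weights).fit()

print(ols_fit.params, wls_fit.params)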


Q2. What is multicollinearity, and how do you treat it?

Multicollinearity means that the independent variables are highly correlated with each other. In regression analysis, an important assumption is that the regression model should not suffer from multicollinearity.
If two explanatory variables are highly correlated, it is hard to tell which one affects the dependent variable. Say Y is regressed against X1 and X2, where X1 and X2 are highly correlated. Then the effect of X1 on Y is hard to distinguish from the effect of X2 on Y, because any increase in X1 tends to be associated with an increase in X2.

Another way to look at the multicollinearity problem: the individual t-test p-values can be misleading. A p-value can be high, suggesting that a variable is not important, even when the variable actually is important.


Correcting Multicollinearity:

1) Remove one of the highly correlated independent variables from the model. If you have two or more factors with a high VIF, remove one from the model.
2) Principal Component Analysis (PCA): it reduces the set of interdependent variables to a smaller set of uncorrelated components. Instead of using the highly correlated variables, use components in the model that have an eigenvalue greater than 1.
3) Run PROC VARCLUS and choose the variable that has the minimum (1 − R²) ratio within a cluster.
4) Ridge Regression: a technique for analyzing multiple regression data that suffer from multicollinearity.
5) If you include an interaction term (the product of two independent variables), you can also reduce multicollinearity by "centering" the variables. Centering means subtracting the mean from the values of the independent variables before creating the products.


When is multicollinearity not a problem? Multicollinearity is not always an issue, particularly in the following situations:

  1. Prediction as the goal: if the goal is to predict Y from a set of X variables, multicollinearity is usually not a problem. Even when the independent variables are highly correlated with each other, the overall model can still predict accurately, and the overall R² (or adjusted R²) quantifies how well the model predicts Y. The model's predictive power is not harmed, but the interpretation of which individual variable has the larger effect on Y becomes unreliable.

  2. Multiple dummy (binary) variables: multicollinearity is also not a concern when a categorical variable with three or more levels is represented by a set of dummy variables. When categorical data are converted into a series of dummy variables, the dummies are naturally perfectly collinear (because they sum to 1). In this situation the multicollinearity is expected, does not affect the model's predictive ability, but may affect the interpretation of the individual coefficients.

Q3. What is market basket analysis? How would you do it in Python?

Market basket analysis is the study of items that are purchased or grouped in a single transaction or multiple, sequential transactions. Understanding the relationships and the strength of those relationships is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons, etc.

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Market basket analysis in Python

Market basket analysis in Python is typically done with the apriori algorithm and association_rules from the mlxtend library. The main steps are:

  1. Data preparation: convert the transaction data into a format suitable for association-rule mining, usually a matrix whose rows represent transactions and whose columns represent items, with a cell equal to 1 if the item appears in the transaction and 0 otherwise.

  2. Apply the Apriori algorithm: use apriori to find frequent itemsets, i.e., combinations of items that appear together frequently in the transaction data.

  3. Generate association rules: from the frequent itemsets, use association_rules to derive rules that help identify which other items customers are likely to buy when they have bought certain items.

  4. Analyze and apply the rules: based on these rules, retailers can optimize product placement, design marketing strategies, and provide personalized recommendations.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Suppose transactions is a list of transactions, each of which is a list of items
transactions = [['milk', 'bread'],
                ['bread', 'diapers', 'beer'],
                ['milk', 'diapers', 'beer', 'orange juice'],
                ['bread', 'milk', 'diapers', 'beer'],
                ['bread', 'milk', 'diapers', 'cola']]

# One-hot encode the transactions with TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Apply the apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Generate association rules (the 0.7 confidence threshold is an illustrative choice)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules)

Q4. What is Association Analysis? Where is it used?

Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction.

The technique of association rules is widely used for retail basket analysis. It can also be used for classification by using rules with class labels on the right-hand side. It is even used for outlier detection with rules indicating infrequent/abnormal association.

Association analysis also helps us to identify cross-selling opportunities, for example, we can use the rules resulting from the analysis to place associated products together in a catalog, in the supermarket, or the Webshop, or apply them when targeting a marketing campaign for product B at customers who have already purchased product A.

Association rules are given in the form below:


A => B [Support, Confidence]. The part before => is referred to as the antecedent (the "if" part) and the part after => is referred to as the consequent (the "then" part).
Here A and B are sets of items in the transaction data, and A and B are disjoint sets. For example: Computer => Anti-virus Software [Support = 20%, Confidence = 60%]


For example, the rule "Computer => Anti-virus Software [Support = 20%, Confidence = 60%]" means:

  1. 20% of all transactions contain both a computer and anti-virus software.
  2. Of the customers who bought a computer, 60% also bought anti-virus software.

Here is another example of an association rule. Suppose there are 100 customers:

  1. 10 of them bought milk, 8 bought butter, and 6 bought both.
  2. From the rule "bought milk => bought butter" we get:
    • Support = P(milk & butter) = 6/100 = 0.06, meaning 6% of all transactions contain both milk and butter.
    • Confidence = Support / P(milk) = 0.06/0.10 = 0.6, meaning 60% of the customers who bought milk also bought butter.
    • Lift = Confidence / P(butter) = 0.6/0.08 = 7.5, meaning milk and butter are bought together 7.5 times as often as would be expected if the two purchases were independent.

Lift is an important metric because it shows how strongly one item boosts the sale of another. A lift greater than 1 means there is a positive association between the two items: the sale of one has a positive effect on the sale of the other.

Q5. What is KNN Classifier ?

KNN stands for the K-Nearest Neighbours algorithm. It can be used for both classification and regression. It is considered one of the simplest machine learning algorithms and is also known as a lazy learning algorithm: it does not build a generalized model during training, but defers all of the work to the testing phase, where the actual classification or regression happens, which makes testing costly in terms of time and computation. KNN is also called instance-based or memory-based learning because it has to keep all of the training data.



In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is assigned to the class of that single nearest neighbor.

How KNN works:

  1. Choose the value of K: K is a user-defined constant giving the number of nearest neighbours to consider.
  2. Compute distances: for an unclassified sample, compute its distance to every sample in the training set. Commonly used distance measures include the Euclidean and Manhattan distances.
  3. Find the K nearest neighbours: select the K training samples with the smallest distances.
  4. Vote or average: for classification, the class of the new sample is decided by a majority vote among the K nearest neighbours; for regression, the prediction is usually the average of the neighbours' output values.

The advantages of KNN are that it is simple, intuitive, and easy to implement. Its drawbacks are the heavy computation and storage requirements on large datasets and its sensitivity to features on different scales, which usually calls for feature scaling. Choosing a suitable value of K is also critical to its performance.


In k-NN regression, the output is the property value of the object, computed as the average of the values of its k nearest neighbours. A brief explanation of the commonly used distance measures:

  1. Euclidean distance: the most common distance measure, which can be understood as the straight-line distance between two points, like measuring with a ruler in two dimensions. Formula: d(x, y) = sqrt( Σ_i (x_i − y_i)² ), where x and y are two points and i runs over the dimensions.

  2. Manhattan distance: also called city-block distance, because it is the distance you would walk along the edges of a regular grid of city blocks. Formula: d(x, y) = Σ_i |x_i − y_i|.

  3. Minkowski distance: the general form of the Euclidean and Manhattan distances. Formula: d(x, y) = ( Σ_i |x_i − y_i|^q )^(1/q). With q = 2 it is the Euclidean distance; with q = 1 it is the Manhattan distance.

  4. Hamming distance: the Euclidean, Manhattan, and Minkowski distances are common measures for continuous variables; however, they may no longer be appropriate when dealing with categorical variables.

    The Hamming distance measures the difference between two equal-length strings as the number of positions at which the corresponding symbols differ. In the context of categorical variables, it can be understood as the number of attributes on which two observations disagree. For example, if two observations take the values (A, B, C, D, E) and (A, B, X, D, Z) on five categorical attributes, the Hamming distance between them is 2, because the values of two attributes do not match.

    The Hamming distance is particularly suitable for datasets consisting only of binary ("yes"/"no", 0/1) features, since it simply counts the features on which two observations disagree.

    One caveat when using the Hamming distance is that it assumes all attributes are equally important. If some categorical variables are more important than others, a weighted distance measure may be needed to reflect the differing importance of the features more accurately.


How to choose the value of K: K is a hyperparameter that has to be chosen at model-building time.

A small number of neighbours gives the most flexible fit, with low bias but high variance, while a large number of neighbours gives a smoother decision boundary, which means lower variance but higher bias.

Choose an odd K when the number of classes is even; the most commonly used values are said to be 3 and 5.
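A minimal sketch of fitting a KNN classifier and tuning K with cross-validation in scikit-learn; the dataset, the train/test split, and the candidate K values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features first, since KNN is distance-based
knn_pipe = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier())])

# Search over odd values of K
param_grid = {'knn__n_neighbors': [3, 5, 7, 9, 11]}
grid = GridSearchCV(knn_pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.score(X_test, y_test))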

Q6. What is Pipeline in sklearn ?

A pipeline chains several steps together once the initial exploration is done. For example, some code transforms features (normalizing numeric values, turning text into vectors, filling in missing data); these are transformers. Other code predicts a variable by fitting an algorithm such as a random forest or a support vector machine; these are estimators. A Pipeline chains all of these together so they can be applied to the training data as one block.
Example of a pipeline that imputes data with the most frequent value of each column and then fits a decision tree classifier.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer   # SimpleImputer replaces the older, removed Imputer class
from sklearn.tree import DecisionTreeClassifier

steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('clf', DecisionTreeClassifier())]
pipeline = Pipeline(steps)
clf = pipeline.fit(X_train, y_train)

In sklearn (scikit-learn), a Pipeline is a tool for chaining multiple data-processing and model-training steps into a single workflow. The steps can include data preprocessing (normalization, text vectorization, missing-value imputation, and so on), feature selection, and fitting an algorithm for training and prediction. Each step is either a transformer or an estimator.

The benefits of using a pipeline include:

  1. Simpler code: wrapping the preprocessing and model-training steps together simplifies the implementation of the machine-learning workflow.
  2. Avoiding data leakage: preprocessing steps such as feature scaling or normalization are fitted independently on each cross-validation training fold rather than on the whole dataset, which prevents data leakage.
  3. Convenient evaluation and tuning: the whole workflow can be cross-validated and its hyperparameters tuned as a single unit, instead of tuning each step separately.

A simple pipeline might consist of the following steps:

  1. Standardize the numeric features with StandardScaler.
  2. Reduce the dimensionality with PCA.
  3. Classify with LogisticRegression.

Instead of fitting to one model, it can be looped over several models to find the best one.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
classifiers = [KNeighborsClassifier(5), RandomForestClassifier(), GradientBoostingClassifier()]
for clf in classifiers:
    steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')), ('clf', clf)]
    pipeline = Pipeline(steps).fit(X_train, y_train)

I also learned the pipeline itself can be used as an estimator and passed to cross-validation or grid search.

The pipeline can then be evaluated with cross-validation in sklearn as follows:

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

seed = 42
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)   # random_state only applies when shuffle=True
results = cross_val_score(pipeline, X_train, y_train, cv=kfold)
print(results.mean())

In this example, the Pipeline object pipeline wraps the preprocessing and model-training steps together, making the whole process from raw data to final prediction more efficient and consistent.

Q7. What is Principal Component Analysis(PCA), and why we do?

The main idea of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables that are correlated with each other, either lightly or heavily, while retaining as much of the variation present in the dataset as possible. This is done by transforming the variables into a new set of variables, known as the principal components (PCs), which are orthogonal and ordered so that the retained variation decreases as we move down the order. The first principal component therefore retains the maximum variation present in the original variables. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.

Reasons for doing PCA include:

  1. Dimensionality reduction: in a dataset with many features, not all of them are useful; PCA helps identify the most important directions of variation.
  2. Visualization: by reducing the data to two or three dimensions, PCA makes it easier to visualize high-dimensional data during exploratory data analysis.
  3. Noise removal: by keeping only the most important components, PCA can help remove noise from the data.
  4. Better algorithm performance: reducing the dimensionality lowers the demand on computational resources and speeds up downstream algorithms. A minimal code sketch follows this list.
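A minimal sketch of PCA with scikit-learn, assuming a numeric feature matrix; the dataset and the number of components are illustrative:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Standardize first so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Keep the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the original variance retained by each component
print(pca.explained_variance_ratio_)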

An example of using an sklearn pipeline for data preprocessing and model training is shown below:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Define the pipeline steps
steps = [
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('clf', DecisionTreeClassifier())
]

# Create the pipeline
pipeline = Pipeline(steps)

# Evaluate the pipeline with k-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
results = cross_val_score(pipeline, X_train, y_train, cv=kfold)

print("Mean cross-validation score:", results.mean())

Q8. What is t-SNE? - not familiar with this one yet

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality-reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two or more dimensions suitable for human observation. With the help of t-SNE, you may have to plot far fewer exploratory data analysis plots the next time you work with high-dimensional data. t-SNE is particularly well suited to visualization tasks because it preserves the local structure of the high-dimensional points in the low-dimensional space, so that similar points remain close together after the reduction.

Key characteristics of t-SNE:

  • Non-linear dimensionality reduction: unlike linear methods such as PCA, t-SNE can capture complex non-linear structure in the data.
  • Preservation of local structure: t-SNE reduces dimensionality while preserving the local similarities between data points, which makes it particularly good at revealing clusters or groups in the data.
  • Parameter choice: t-SNE has several key parameters, notably the perplexity and the learning rate, and their values have a large influence on the final visualization.

The rough workflow for visualizing high-dimensional data with t-SNE is:

  1. Choose suitable parameters: the perplexity is usually set between 5 and 50 and the learning rate between 10 and 1000; the best choices depend on the data, and several settings may need to be tried to obtain the best visualization.
  2. Run the t-SNE algorithm: reduce the high-dimensional data, usually to a two- or three-dimensional space.
  3. Visualize: plot the reduced data as a scatter plot, observe the distribution of the points, and look for possible patterns or clusters.

t-SNE is a powerful tool, especially well suited to exploratory data analysis and the visualization of high-dimensional data. With t-SNE you can draw fewer exploratory plots when working with high-dimensional data and still identify patterns and structure quickly and intuitively.
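A minimal t-SNE sketch with scikit-learn; the dataset, the perplexity, and the plotting choices are illustrative:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# Reduce to 2 dimensions; perplexity is a key knob (typically 5 to 50)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap='tab10')
plt.title('t-SNE projection of the digits dataset')
plt.show()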

Q9. VIF (Variance Inflation Factor), Weight of Evidence & Information Value. Why and when to use them?

The Variance Inflation Factor (VIF) is a measure of the severity of multicollinearity. Multicollinearity occurs when one or more explanatory variables in a model are highly correlated with each other, which can make the estimated regression coefficients inaccurate or unstable.

Why and when to use VIF:

  • Diagnosing multicollinearity: VIF is an index of how much the variance of an estimated regression coefficient (the square of its standard error) is inflated because of collinearity. A VIF above 5 is usually taken as a sign of a multicollinearity problem that needs further attention.
  • Improving model accuracy and stability: by identifying and handling variables with high VIF values, the precision and stability of the model can be improved, making its predictions more reliable.

Understanding the Variance Inflation Factor:

  • If a predictor variable has a VIF of 5, the variance of its coefficient is 5 times as large as it would be if that predictor were uncorrelated with the other predictors. Equivalently, the standard error of its coefficient is √5 ≈ 2.23 times as large as it would be if the predictor were uncorrelated with the others.

How to calculate VIF:

  • For the j-th predictor, VIF_j = 1 / (1 − R²_j), where R²_j is the coefficient of determination from regressing X_j on all of the other independent variables (a regression that does not involve the dependent variable Y).
  • If VIF > 5, there is a problem with multicollinearity.
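As an illustration, a minimal sketch for computing VIFs with statsmodels, assuming X is a pandas DataFrame of numeric predictors:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is assumed to be a DataFrame whose columns are the predictors
X_const = sm.add_constant(X)   # add the intercept column

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns
)
print(vif.drop('const'))   # the VIF of the intercept is not meaningful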

How to deal with high VIF values:

  • Remove variables: consider dropping a variable with a high VIF from the model, especially if it is not theoretically important.
  • Combine variables: where possible, merge correlated variables into a single variable.
  • Use ridge regression: ridge regression handles multicollinearity by introducing a regularization term that shrinks the coefficients.


Weight of evidence (WOE) and information value (IV) are simple, yet powerful techniques to perform variable transformation and selection.
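Briefly, for a predictor binned against a binary target, one commonly used convention (for example in credit scoring) is:

WOE_i = ln( (% of non-events in bin i) / (% of events in bin i) )
IV = Σ_i ( % of non-events in bin i − % of events in bin i ) × WOE_i

WOE is mainly used to transform a variable (often before logistic regression), while IV is used to rank and pre-select variables; rules of thumb often cited are that an IV below 0.02 indicates a variable that is not predictive, 0.02 to 0.1 weak, 0.1 to 0.3 medium, and 0.3 to 0.5 strong predictive power.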


Q10: How to evaluate that data does not have any outliers ?

In statistics, outliers are data points that do not belong to a certain population. An outlier is an abnormal observation that lies far away from the other values, an observation that diverges from otherwise well-structured data.
Detection:

Method 1 — Standard Deviation: In statistics, if a data distribution is approximately normal, then about 68% of the data values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.

Therefore, any data point that lies more than three standard deviations from the mean is very likely to be anomalous, i.e., an outlier.

Method 2 — Boxplots: Box plots are a graphical depiction of numerical data through their quantiles. They are a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution; any data points that appear above or below the whiskers can be considered outliers or anomalous.

Method 3 - Violin Plots: Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data, a box or marker indicating the interquartile range, and possibly all sample points if the number of samples is not too high.


Method 4 - Scatter Plots: A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for, typically, two variables of a dataset. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

The points which are very far away from the general spread of data and have a very few neighbors are considered to be outliers.
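A minimal sketch of the standard-deviation and IQR (boxplot) rules in pandas/NumPy; the data are an illustrative toy series:

import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])   # toy data; 95 sits far from the rest

# Rule 1: points more than 3 standard deviations from the mean
# (with so few points the 3-sigma rule may flag nothing; it works better on larger samples)
z = (s - s.mean()) / s.std()
outliers_std = s[np.abs(z) > 3]

# Rule 2: points outside the boxplot whiskers (1.5 * IQR beyond the quartiles)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers_iqr = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(outliers_std)
print(outliers_iqr)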

Q11: What you do if there are outliers?

Handling outliers is an important part of data preprocessing, because outliers can distort the results of statistical analyses and hurt model performance. Common ways of dealing with them include:

1. Drop the outliers

If there are only a few outliers and their presence strongly distorts the analysis, they can simply be removed. This is simple and direct, but should be used with care, because deleting data may mean losing information.

2. Assign a new value

If an outlier appears to be caused by a data-entry error or a similar problem, it can be corrected or replaced with a reasonable new value. This can be done in several ways, for example using the median, the mean, or a value predicted from other variables.

3. Group them and analyze separately

If the outliers are a small fraction of the data but still numerous, deleting them outright may throw away important information. In that case they can be split into a separate group and analyzed on their own. This keeps them from affecting the main analysis while still allowing exploration of the patterns or causes behind them.

Other approaches:

  • Transform the data: transformations such as the logarithm or square root can sometimes reduce the influence of outliers.
  • Use robust statistics: some statistical methods and models are insensitive to outliers, for example the median and the IQR (interquartile range).
  • Outlier-detection algorithms: dedicated methods such as boxplots, z-scores, or DBSCAN can be used to detect and handle outliers.

When handling outliers, it is important to first understand where they come from and what effect they may have on the analysis or the model. Before deleting or correcting them, do a thorough exploratory data analysis (EDA) and take domain knowledge and business context into account. A small sketch of the dropping and capping options follows.
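A minimal sketch of two of the options above, dropping and capping (winsorizing) at the IQR bounds; the Series s is the toy data from the previous sketch:

# IQR bounds as before
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop the outliers
s_dropped = s[(s >= lower) & (s <= upper)]

# Option 2: cap (winsorize) the outliers at the bounds instead of removing them
s_capped = s.clip(lower=lower, upper=upper)

# Option 3: replace the outliers with the median of the non-outlier values
s_imputed = s.where((s >= lower) & (s <= upper), s_dropped.median())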

Q12: What are the encoding techniques you have applied with Examples ?

In many real data-science tasks, datasets contain categorical variables, which are usually stored as text values. Because machine learning is based on mathematical equations, leaving categorical variables as raw text causes problems. Taking a dataset containing fruit names and their weights as an example, some common encoding techniques for categorical variables are:

Label Encoding

In label encoding, each category is mapped to a number (label). The labels chosen for the categories have no relationship to each other, so after encoding, categories that are related or close to each other lose that information. For example, for a fruit variable containing "apple", "banana", and "orange", label encoding might map them to 1, 2, and 3. The method is simple and direct, but it is not appropriate when the model would read an order or importance into the labels that the categories do not actually have.

One-Hot Encoding

In this method, each category is mapped to a vector of 1s and 0s, where 1 indicates the presence of the feature and 0 its absence. The number of vectors (columns) depends on how many categories we want to keep. For high-cardinality features (many unique values), this method produces a large number of columns and can slow down learning significantly. Continuing the example, one-hot encoding would represent "apple", "banana", and "orange" as [1, 0, 0], [0, 1, 0], and [0, 0, 1]. The method suits categorical features with no ordinal relationship, but it increases the dimensionality of the data.
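A minimal sketch of both techniques with pandas and scikit-learn; the fruit column is the illustrative example used above:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange', 'banana'],
                   'weight': [150, 120, 180, 110]})

# Label encoding: one integer per category
df['fruit_label'] = LabelEncoder().fit_transform(df['fruit'])

# One-hot encoding: one 0/1 column per category
df_onehot = pd.get_dummies(df, columns=['fruit'])

print(df)
print(df_onehot)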

Q13: Tradeoff between bias and variances, the relationship between them.

Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance). The prediction error for any machine learning algorithm can be broken down into three parts:

  • Bias Error

  • Variance Error

  • Irreducible Error

The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.

Bias: Bias means that the model favors one result more than the others. Bias is the simplifying assumptions made by a model to make the target function easier to learn. The model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to a high error in training and test data.

Variance: Variance is the amount that the estimate of the target function will change if different training data was used. The model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

So, the end goal is to come up with a model that balances bias and variance. This is called the bias-variance trade-off. To build a good model, we need to find a balance between bias and variance that minimizes the total error.
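In equation form, for squared-error loss the expected test error at a point decomposes as:

Expected test error = Bias² + Variance + Irreducible error

which is why pushing one of the first two terms down typically pushes the other up.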

Q14: What is the difference between Type 1 and Type 2 error and severity of the error?

Type I Error

A Type I error is often referred to as a "false positive" and is the incorrect rejection of a true null hypothesis in favor of the alternative.
Consider an HIV test as an example. The null hypothesis describes the natural state of things, or the absence of the tested effect or phenomenon, i.e., that the patient is HIV negative. The alternative hypothesis states that the patient is HIV positive. Many medical tests take the disease they are testing for as the alternative hypothesis and the absence of that disease as the null hypothesis. A Type I error would thus occur when the patient does not have the virus, but the test says they do. In other words, the test incorrectly rejects the true null hypothesis that the patient is HIV negative.

  • Also called a "false positive".
  • It occurs when a true null hypothesis is incorrectly rejected (the null hypothesis is actually correct, but the test result wrongly supports the alternative).
  • In the HIV example, a Type I error means the patient does not have the virus, but the test result says they do.
  • How serious a Type I error is depends on the context. For HIV testing, a false positive may cause unnecessary anxiety and further testing, but it can ultimately be corrected by follow-up tests.

Type II Error

A Type II error is the inverse of a Type I error and is the false acceptance of a null hypothesis that is not true, i.e., a false negative. A Type II error would entail the test telling the patient they are free of HIV when they are not.

Considering this HIV example, which error type do you think is more acceptable? In other words, would you rather have a test that was more prone to Type I or Types II error? With HIV, the momentary stress of a false positive is likely better than feeling relieved at a false negative and then failing to take steps to treat the disease. Pregnancy tests, blood tests, and any diagnostic tool that has serious consequences for the health of a patient are usually overly sensitive for this reason – they should err on the side of a false positive.

But in most fields of science, Type II errors are seen as less serious than Type I errors. With the Type II error, a chance to reject the null hypothesis was lost, and no conclusion is inferred from a non-rejected null. But the Type I error is more serious because you have wrongly rejected the null hypothesis and ultimately made a claim that is not true. In science, finding a phenomenon where there is none is more egregious than failing to find a phenomenon where there is.

  • Also called a "false negative".
  • It occurs when a false null hypothesis is incorrectly accepted (the alternative hypothesis is actually correct, but the test result wrongly supports the null).
  • In the HIV example, a Type II error means the patient is actually infected, but the test result says they are not.
  • The severity of a Type II error also depends on the situation. In HIV testing, a false negative may cause the patient to miss the chance of early treatment, which can be more serious than the consequences of a Type I error.

Choosing between the error types

  • Different fields tolerate Type I and Type II errors differently. In medical testing (such as HIV tests), people usually lean toward avoiding Type II errors, because missing a diagnosis can have severe consequences.
  • In most fields of science, Type I errors are considered more serious than Type II errors, because wrongly rejecting the null hypothesis can lead to false findings that mislead later research and policy decisions.

In short, for both Type I and Type II errors, the specific research context and the required accuracy of the results should be weighed together, and the statistical methods and experimental design should be chosen so as to minimize these errors as far as possible.

Q15: What is binomial distribution and polynomial distribution?

Binomial Distribution: A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes (the prefix “bi” means two, or twice). For example, a coin toss has only two possible outcomes: heads or tails, and taking a test could have two possible outcomes: pass or fail.

Binomial distribution

  • The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (success or failure) and the probability of success is the same in every trial.
  • For example, the number of heads in 10 coin tosses follows a binomial distribution; each toss coming up heads counts as a "success", and each toss has exactly two possible outcomes.
  • The probability mass function (PMF) of the binomial distribution is P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), for k = 0, 1, ..., n.

Multinomial Distribution: "Multi" means many. In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of the counts of each side when rolling a k-sided die n times. For n independent trials, each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

Multinomial distribution

  • The multinomial distribution generalizes the binomial distribution: for a fixed number of independent trials, it describes the probability distribution of the counts of each possible outcome, where each trial has more than two possible outcomes and each outcome occurs with a fixed probability.
  • For example, the counts of the faces when a six-sided die is rolled 10 times follow a multinomial distribution; each roll has six possible outcomes.
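A small sketch with scipy.stats illustrating both distributions; the parameter values are illustrative:

from scipy.stats import binom, multinomial

# Binomial: probability of exactly 6 heads in 10 fair coin tosses
p_six_heads = binom.pmf(k=6, n=10, p=0.5)

# Multinomial: probability of the face counts [2, 2, 2, 2, 1, 1]
# when a fair six-sided die is rolled 10 times
p_counts = multinomial.pmf([2, 2, 2, 2, 1, 1], n=10, p=[1/6] * 6)

print(p_six_heads, p_counts)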

Q16: What is the Mean Median Mode standard deviation for the sample and population?

Mean: The mean is an important concept in statistics. The arithmetic mean is also called the average. It is obtained by summing two or more numbers/variables and then dividing the sum by the count of numbers/variables.

Mode: The mode is another way of describing the "average". The mode is the number that occurs most frequently in a group of numbers. Some series may have no mode at all; some may have two modes, which is called a bimodal series.
In statistics, the three most common "averages" are the mean, the median, and the mode.

Median: The median is also a way of finding the centre of a group of data points; it is the middle number of a set. There are two possibilities: the group can contain an odd or an even number of values.
If the count is odd, arrange the numbers from smallest to largest; the median is the one sitting exactly in the middle, with an equal number of values on either side. If the count is even, arrange the numbers in order, take the two middle numbers, add them, and divide by 2; that is the median of the set.

Standard Deviation (Sigma): The standard deviation is a measure of how spread out the data are.
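Since the question distinguishes the sample from the population, the usual formulas are:

Population: mean μ = (1/N) Σ x_i, standard deviation σ = sqrt( (1/N) Σ (x_i − μ)² )
Sample: mean x̄ = (1/n) Σ x_i, standard deviation s = sqrt( (1/(n − 1)) Σ (x_i − x̄)² )

where the n − 1 in the sample formula (Bessel's correction) makes s² an unbiased estimate of the population variance.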

Q17: What is Mean Absolute Error ?

What is Absolute Error? Absolute error is the amount of error in a measurement: the difference between the measured value and the "true" value. For example, if a scale reads 90 pounds but you know your true weight is 89 pounds, the scale has an absolute error of 90 lbs − 89 lbs = 1 lb.

This can be caused by the scale not measuring the exact amount you are trying to measure. For example, the scale may be accurate to the nearest pound: if you weigh 89.6 lbs, the scale may round up and report 90 lbs. In this case the absolute error is 90 lbs − 89.6 lbs = 0.4 lbs.

Mean Absolute Error: The Mean Absolute Error (MAE) is the average of all of the absolute errors. The formula is: MAE = (1/n) Σ |x_i − x|.

Here n is the number of errors, Σ is the summation symbol ("add them all up"), and |x_i − x| are the absolute errors. The formula may look a little daunting, but the steps are easy:
find all of the absolute errors |x_i − x|, add them all up, and divide by the number of errors. For example, if you had 10 measurements, divide by 10.
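A one-line check in NumPy and scikit-learn; the arrays are illustrative:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([89.0, 70.5, 102.0])
y_pred = np.array([90.0, 70.0, 100.0])

mae_manual = np.mean(np.abs(y_true - y_pred))        # (1 + 0.5 + 2) / 3 = 1.1666...
mae_sklearn = mean_absolute_error(y_true, y_pred)    # same value

print(mae_manual, mae_sklearn)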

Q18: What is the difference between long data and wide data?

There are many different ways that you can present the same dataset to the world. Let's take a look at one of the most important and fundamental distinctions, whether a dataset is wide or long.
The difference between wide and long datasets boils down to whether we prefer to have more columns in our dataset or more rows.

Wide Data A dataset that emphasizes putting additional data about a single subject in columns is called a wide dataset because, as we add more columns, the dataset becomes wider.

Long Data Similarly, a dataset that emphasizes including additional data about a subject in rows is called a long dataset because, as we add more rows, the dataset becomes longer. It's important to point out that there's nothing inherently good or bad about wide or long data.
In the world of data wrangling, we sometimes need to make a long dataset wider, and we sometimes need to make a wide dataset longer. However, it is true that, as a general rule, data scientists who embrace the concept of tidy data usually prefer longer datasets over wider ones.
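A small sketch of converting between the two shapes with pandas; the column names are illustrative:

import pandas as pd

# Wide: one row per student, one column per subject
wide = pd.DataFrame({'student': ['Ann', 'Bob'],
                     'math': [90, 75],
                     'english': [85, 80]})

# Wide -> long: one row per (student, subject) pair
long = wide.melt(id_vars='student', var_name='subject', value_name='score')

# Long -> wide again
wide_again = long.pivot(index='student', columns='subject', values='score').reset_index()

print(long)
print(wide_again)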

Q19: What are the data normalization method you have applied, and why?

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. For machine learning, every dataset does not require normalization. It is required only when features have different ranges. In simple words, when multiple attributes are there, but attributes have values on different scales, this may lead to poor data models while performing data mining operations. So they are normalized to bring all the attributes on the same scale, usually something between (0,1).

It is not always a good idea to normalize the data, since we might lose information about the maximum and minimum values; whether it helps depends on the situation.

For example, ML algorithms such as Linear Regression or Support Vector Machines typically converge faster on normalized data. But on algorithms like K-means or K Nearest Neighbours, normalization could be a good choice or a bad depending on the use case since the distance between the points plays a key role here.


Types of Normalization:

1. Min-Max Normalization

This method normalizes the data by rescaling it into the [0, 1] interval: v' = (v − min_A) / (max_A − min_A) for each value v of attribute A.

2. Z-Score Normalization

This method normalizes using the mean (μ) and standard deviation (σ) of the original data, so that the normalized data have a mean of 0 and a standard deviation of 1.


Here v' and v are the new and old values of each entry in the data, and σ_A and Ā are the standard deviation and mean of attribute A. Standardization (or z-score normalization) rescales the features so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean; the standard scores (also called z-scores) of the samples are calculated as z = (x − μ) / σ.

3. Decimal Scaling

This method normalizes by moving the decimal point of the values; the number of positions moved depends on the maximum absolute value of the attribute (v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1).
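A minimal sketch of the first two normalization methods with scikit-learn; the data are illustrative:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0]])

# Min-max normalization: rescale into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)      # [[0.], [0.5], [1.]]

# Z-score standardization: mean 0, standard deviation 1
X_standard = StandardScaler().fit_transform(X)  # [[-1.2247], [0.], [1.2247]]

print(X_minmax.ravel(), X_standard.ravel())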

Q20: What is the difference between normalization and Standardization with example?

Feature scaling is an important step in machine learning; the two common methods are normalization and standardization.

Normalization

  • Normalization usually means rescaling values into the [0, 1] range.
  • Its purpose is to bring features with different scales onto the same scale, so that differences in value ranges do not have an adverse effect on model training.

Standardization

  • Standardization usually means rescaling the data so that it has a mean of 0 and a standard deviation of 1.
  • Its purpose is to make the feature values closer to a normal distribution, which is important for many machine-learning algorithms, especially those that assume normally distributed data.

Choosing between normalization and standardization

  • Normalization is typically used when values need to be mapped directly into a specific range, particularly for neural networks or when the data do not follow a Gaussian distribution.
  • Standardization is more common when the data follow a Gaussian distribution, or when the algorithm (such as support vector machines, linear regression, or logistic regression) assumes that they do.

In practice, the choice between normalization and standardization depends on the characteristics of the data and the machine-learning algorithm being used; sometimes the two are even combined to achieve the best preprocessing result. A small worked example follows.
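A worked example on the illustrative values 10, 20, 30:

  • Min-max normalization: (10, 20, 30) → (0, 0.5, 1), using x' = (x − min) / (max − min).
  • Standardization: the mean is 20 and the (population) standard deviation is about 8.16, so (10, 20, 30) → (−1.22, 0, 1.22), using z = (x − μ) / σ.

The normalized values are forced into [0, 1]; the standardized values are centred at 0 and are not bounded.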
