---
title: k-Nearest Neighbors (KNN)
tags: machinelearning
abbrlink: 29139
mathjax: true
date: 2025-01-13 17:20:59
---
## **k-Nearest Neighbors (KNN)**
Assign the current sample to the class that is most common among the **k** samples nearest to it.
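
The rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the helper name `knn_predict` and the toy data are made up for the example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([4.9, 5.0]), k=3))  # 1
```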
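#### **Distance formulas (2D)**
For two points $x = (x_1, x_2)$ and $y = (y_1, y_2)$:
- Euclidean distance
$$
d = \sqrt{(x_1-y_1)^2 + (x_2 - y_2)^2}
$$
- Manhattan distance
$$
d = |x_1 - y_1| + |x_2 - y_2|
$$
- Chebyshev distance
$$
d = \max\left(|x_1 - y_1|, |x_2 - y_2|\right)
$$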
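
As a quick check of the three distances, here is a small NumPy sketch for two points `x` and `y`, each with two coordinates. The point values are chosen arbitrarily for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

diff = np.abs(x - y)                     # per-coordinate differences: [3., 4.]
euclidean = np.sqrt((diff ** 2).sum())   # square root of the sum of squares
manhattan = diff.sum()                   # sum of absolute differences
chebyshev = diff.max()                   # largest absolute difference
print(euclidean, manhattan, chebyshev)   # 5.0 7.0 4.0
```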
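#### Choosing k
| k | Effect |
| --- | ------------------ |
| Larger | The model tends to underfit; accuracy is more stable but may be lower |
| Smaller | The model tends to overfit; accuracy fluctuates more because predictions are sensitive to noise |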
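
One way to see the effect of k is to compare training and test accuracy for several values. The sketch below uses synthetic data from `make_classification` with some label noise. The dataset, its parameters, and the list of k values are assumptions made for illustration, so the exact numbers will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10% label noise (flip_y) so that overfitting is visible
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

scores = {}
for k in [1, 5, 15, 51, 151]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # (training accuracy, test accuracy) for this k
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k:>3}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With k=1 every training sample is its own nearest neighbor, so training accuracy is perfect even though the labels contain noise. A large gap between training and test accuracy is the usual sign of overfitting.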
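### Feature preprocessing
> The process of applying transformation functions to feature data so that it is better suited to the model.
- Normalization
Rescales the data to a given interval (by default $[0, 1]$):
$$ x' = \frac{x- x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} $$
To scale to an arbitrary interval $[a, b]$, use:
$$ x' = a + \frac{(x - x_{\text{min}}) \cdot (b - a)}{x_{\text{max}} - x_{\text{min}}} $$
where $[a, b]$ is the target interval.
Normalization is sensitive to outliers because $x_{\text{min}}$ and $x_{\text{max}}$ are set by the most extreme values; when outliers are present, standardization is usually the more reasonable choice.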
``` python
from sklearn.preprocessing import MinMaxScaler  # normalization
```
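- Standardization
Rescales the data to have mean 0 and standard deviation 1 (the shape of the distribution itself is not changed):
$$ z = \frac{x - \mu}{\sigma} $$
where $z$ is the standardized value, $x$ is the original value, $\mu$ is the mean of the data, and $\sigma$ is the standard deviation of the data.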
``` python
from sklearn.preprocessing import StandardScaler  # standardization
```
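
The two scalers can be compared against the formulas directly. The sketch below uses a single made-up feature column containing one outlier; `feature_range` is the `MinMaxScaler` parameter that selects the target interval $[a, b]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature column with a single outlier (values made up for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max normalization to the default [0, 1] range and to a custom [a, b] = [-1, 1]
x01 = MinMaxScaler().fit_transform(X)
x_ab = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Standardization: z = (x - mean) / std, using the population standard deviation
z = StandardScaler().fit_transform(X)

# The same results computed directly from the formulas
manual_01 = (X - X.min()) / (X.max() - X.min())
manual_z = (X - X.mean()) / X.std()

print(x01.ravel())
print(x_ab.ravel())
print(z.ravel())
```

In the normalized output the four ordinary values are squeezed into roughly the first 3% of the $[0, 1]$ range, because the outlier alone determines the maximum.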
### KNN implementation
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# 1 Load the dataset
iris = load_iris()
# print(iris.feature_names)
iris_data = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_data['target'] = iris.target


def iris_plot(data, col1, col2):
    """Scatter plot of two feature columns, colored by class."""
    sns.lmplot(x=col1, y=col2, data=data, hue="target", fit_reg=False)
    plt.show()


# 2 Visualize the dataset
# iris_plot(iris_data, 'Sepal_Width', 'Petal_Length')
# 3 Split the dataset
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=44)
# print("Training features:\n", X_train)
# print("Training targets:\n", y_train)
# print("Test features:\n", X_test)
# print("Test targets:\n", y_test)
# 4 Standardization (fit on the training set only, then apply the same transform to the test set)
transfer = StandardScaler()
X_train = transfer.fit_transform(X_train)
X_test = transfer.transform(X_test)
# print("Standardized X_train\n", X_train)
# print("Standardized X_test\n", X_test)
# 5 Machine learning: KNN
# 5.1 Instantiate the estimator
estimator = KNeighborsClassifier(n_neighbors=9)
# 5.2 Train
estimator.fit(X_train, y_train)
# 6 Evaluate the model
y_pred = estimator.predict(X_test)
print("Predictions:\n", y_pred)
print("Prediction equals true label:\n", y_pred == y_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nKNN accuracy: {accuracy:.4f}")
```
![](/img/machinelearning/knn-01.png)
### Cross-validation and grid search
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# 1 Load the dataset
iris = load_iris()
iris_data = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_data['target'] = iris.target
# 3 Split the dataset (no random_state is set, so the split differs between runs)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
# 4 Standardization
transfer = StandardScaler()
X_train = transfer.fit_transform(X_train)
X_test = transfer.transform(X_test)
# 5 Machine learning: KNN
# 5.1 Instantiate the estimator
# n_neighbors is not set here; the grid search below tries each candidate value
estimator = KNeighborsClassifier()
# 5.2 Model tuning: cross-validation and grid search
estimator = GridSearchCV(estimator, param_grid={"n_neighbors": [1, 3, 5, 7]}, cv=5)  # 5-fold
# 5.3 Train
estimator.fit(X_train, y_train)
# 6 Evaluate the model
y_pred = estimator.predict(X_test)
print("Predictions:\n", y_pred)
print("Prediction equals true label:\n", y_pred == y_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nKNN accuracy: {accuracy:.4f}")
# Cross-validation results
print(f"Best cross-validation score: {estimator.best_score_}")
print(f"Best estimator: {estimator.best_estimator_}")
print(f"Full cross-validation results: {estimator.cv_results_}")
```
![](/img/machinelearning/cros-valid.png)
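
To make the mechanism more concrete, the sketch below reproduces the search by hand: for each candidate k it computes the mean 5-fold cross-validation accuracy with `cross_val_score`, then compares the result with `GridSearchCV`. The fixed `random_state` and the variable names are assumptions added only to make the run reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=44)  # fixed seed only for reproducibility
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

candidates = [1, 3, 5, 7]

# Manual version: mean 5-fold cross-validation accuracy for each candidate k
scores = {}
for k in candidates:
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    scores[k] = fold_scores.mean()
best_k = max(candidates, key=lambda k: scores[k])  # first k with the highest mean score

# The same search done by GridSearchCV
search = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": candidates}, cv=5)
search.fit(X_train, y_train)

print("manual best k:", best_k, "mean CV accuracy:", round(scores[best_k], 4))
print("GridSearchCV :", search.best_params_, round(search.best_score_, 4))
```

Both approaches score each candidate only on the training portion; the held-out test set is used once at the end to evaluate the chosen model.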
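### Basic steps of machine learning
- Obtain the dataset
- Basic dataset processing
  - Remove duplicates, drop or fill missing values, and similar operations
  - Determine the feature values and the target values
  - Split the dataset
- Feature engineering (feature preprocessing, standardization, etc.)
- Machine learning (model training)
- Model evaluation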
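
The basic processing steps (removing duplicates, handling missing values, separating features from the target, and splitting) might look like the following with pandas. The table `raw`, its column names, and the choice to fill missing features with the column mean are all hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw table: one duplicated row and two missing values
raw = pd.DataFrame({
    "height": [1.0, 1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "width":  [0.5, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5],
    "label":  [0, 0, 0, 1, 1, np.nan, 1],
})

clean = raw.drop_duplicates()                              # remove the repeated row
clean = clean.dropna(subset=["label"])                     # a row without a target cannot be used
clean = clean.fillna({"height": clean["height"].mean()})   # fill the missing feature value

X = clean[["height", "width"]]      # feature values
y = clean["label"].astype(int)      # target values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(clean)
```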
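### Data splitting methods
- Hold-out
The training/test split should keep the data distribution as consistent as possible between the two parts, so that the split itself does not introduce additional bias into the final result.
A single hold-out estimate is often not stable or reliable, so the usual practice is to repeat the random split several times, evaluate each time, and report the average as the hold-out result.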
``` python
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np

X = np.array([
    [1, 2, 3, 4],
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
    [61, 62, 63, 64],
    [71, 72, 73, 74]
])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])
folder = KFold(n_splits=4)             # consecutive folds, ignores the labels
sfolder = StratifiedKFold(n_splits=4)  # each fold keeps the class proportions of y
print("KFOLD:")
for train, test in folder.split(X, y):
    print(f"train:{train},test:{test}")
print("SKFOLD:")
for train, test in sfolder.split(X, y):
    print(f"train:{train},test:{test}")
```
![](/img/machinelearning/kfold-skfold.png)
- Bootstrap
- Cross-validation
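
The repeated hold-out procedure described under the hold-out method can be sketched as follows. The number of repetitions, the use of `stratify=y` to keep class proportions consistent, and `n_neighbors=5` are illustrative choices rather than anything prescribed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(10):  # 10 repetitions is an arbitrary choice
    # stratify=y keeps the class proportions the same in both parts
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    scaler = StandardScaler().fit(X_train)  # fit the scaler on the training part only
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(scaler.transform(X_train), y_train)
    accuracies.append(model.score(scaler.transform(X_test), y_test))

print(f"mean accuracy over {len(accuracies)} splits: {np.mean(accuracies):.4f}")
print(f"standard deviation: {np.std(accuracies):.4f}")
```

The standard deviation gives a rough sense of how much a single hold-out estimate can vary from one random split to the next.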