---
title: Linear Regression
tags: linear-regression
categories: machinelearning
mathjax: true
abbrlink: 52662
date: 2025-01-19 16:46:51
---

### Introduction to Linear Regression

> Linear regression predicts a continuous target variable (the dependent variable) that is assumed to have a linear relationship with one or more features (the independent variables).

Hypothesis function:

$$y = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$$

- $y$ is the target variable (dependent variable), the value we want to predict.
- $x_1, x_2, \dots, x_n$ are the feature variables (independent variables), the input values.
- $w_1, w_2, \dots, w_n$ are the weights (coefficients) and $b$ is the bias (intercept); these are the parameters the model learns.

### Loss Function

To find the best linear model, we optimize the model parameters by minimizing a loss function. In linear regression, the usual choice is the **mean squared error (MSE)**:

$$J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2$$

- $N$ is the number of samples.
- $y_i$ is the true value of the $i$-th sample.
- $f_\theta(x_i)$ is the model's prediction for the $i$-th sample, where $\theta$ stands for the model parameters (the weights and the bias).

The factor $\frac{1}{2}$ is a convention that simplifies the gradient; it does not change where the minimum lies.

### Optimizing Linear Regression

- Gradient descent

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the dataset
housing = fetch_california_housing()

# 2. Prepare the dataset
# 2.1 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.25)

# 3. Feature engineering
# 3.1 Standardization
transfer = StandardScaler()
X_train = transfer.fit_transform(X_train)
X_test = transfer.transform(X_test)  # use transform(), not fit_transform(): reuse the training-set statistics

# 4. Machine learning: gradient descent
estimator = SGDRegressor(max_iter=1000, eta0=0.01)
estimator.fit(X_train, y_train)
print(f"SGD model intercept: {estimator.intercept_}")
print(f"SGD model coefficients: {estimator.coef_}")

# 5. Model evaluation
y_pred = estimator.predict(X_test)
print(f"SGD model predictions: {y_pred}")
mse = mean_squared_error(y_test, y_pred)
print(f"SGD model mean squared error: {mse}")
```

- Normal equation

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Load the dataset
housing = fetch_california_housing()

# 2. Prepare the dataset
# 2.1 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.25)

# 3. Feature engineering
# 3.1 Standardization
transfer = StandardScaler()
X_train = transfer.fit_transform(X_train)
X_test = transfer.transform(X_test)  # use transform(), not fit_transform(): reuse the training-set statistics

# 4. Machine learning: linear regression
estimator = LinearRegression()
estimator.fit(X_train, y_train)
print(f"Model intercept: {estimator.intercept_}")
print(f"Model coefficients: {estimator.coef_}")

# 5. Model evaluation
y_pred = estimator.predict(X_test)
print(f"Model predictions: {y_pred}")
mse = mean_squared_error(y_test, y_pred)
print(f"Model mean squared error: {mse}")
```

- Ridge regression

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error

# 1. Load the dataset
housing = fetch_california_housing()

# 2. Prepare the dataset
# 2.1 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.25)

# 3. Feature engineering
# 3.1 Standardization
transfer = StandardScaler()
X_train = transfer.fit_transform(X_train)
X_test = transfer.transform(X_test)  # use transform(), not fit_transform(): reuse the training-set statistics

# 4. Machine learning: ridge regression, searching over candidate alpha values
# estimator = Ridge(alpha=1.0)
estimator = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100])
estimator.fit(X_train, y_train)
print(f"Ridge model intercept: {estimator.intercept_}")
print(f"Ridge model coefficients: {estimator.coef_}")

# Inspect the best alpha
print(f"Best alpha: {estimator.alpha_}")

# 5. Model evaluation
y_pred = estimator.predict(X_test)
print(f"Ridge model predictions: {y_pred}")
mse = mean_squared_error(y_test, y_pred)
print(f"Ridge model mean squared error: {mse}")
```

![](/img/machinelearning/linear.png)

![](/img/machinelearning/fitting.png)

### Saving and Loading a Model

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
import joblib

# Fixed seed so that save_model() and load_model() build the same split and
# the same scaler; without it, load_model() would evaluate on samples the
# saved model was trained on.
RANDOM_STATE = 42


def save_model():
    # 1. Load the dataset
    housing = fetch_california_housing()

    # 2. Prepare the dataset
    # 2.1 Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        housing.data, housing.target, test_size=0.25, random_state=RANDOM_STATE
    )

    # 3. Feature engineering
    # 3.1 Standardization
    transfer = StandardScaler()
    X_train = transfer.fit_transform(X_train)
    X_test = transfer.transform(X_test)  # use transform(), not fit_transform()

    # 4. Machine learning: ridge regression, searching over candidate alpha values
    estimator = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100])
    estimator.fit(X_train, y_train)
    print(f"Ridge model intercept: {estimator.intercept_}")
    print(f"Ridge model coefficients: {estimator.coef_}")

    # Save the model
    joblib.dump(estimator, 'ridge_model.pkl')

    # Inspect the best alpha
    print(f"Best alpha: {estimator.alpha_}")

    # 5. Model evaluation
    y_pred = estimator.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Ridge model mean squared error: {mse}")


def load_model():
    # 1. Load the dataset
    housing = fetch_california_housing()

    # 2. Prepare the dataset
    # 2.1 Split into training and test sets (same seed as in save_model)
    X_train, X_test, y_train, y_test = train_test_split(
        housing.data, housing.target, test_size=0.25, random_state=RANDOM_STATE
    )

    # 3. Feature engineering
    # 3.1 Standardization
    transfer = StandardScaler()
    X_train = transfer.fit_transform(X_train)
    X_test = transfer.transform(X_test)  # use transform(), not fit_transform()

    # Load the model
    estimator = joblib.load('ridge_model.pkl')
    print(f"Ridge model intercept: {estimator.intercept_}")
    print(f"Ridge model coefficients: {estimator.coef_}")

    # Inspect the best alpha
    print(f"Best alpha: {estimator.alpha_}")

    # 5. Model evaluation
    y_pred = estimator.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Ridge model predictions: {y_pred}")
    print(f"Ridge model mean squared error: {mse}")


print("Training and saving the model:")
save_model()
print("Loading the model:")
load_model()
```
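
A saved estimator only gives sensible predictions when new data is scaled exactly as the training data was. One way to guarantee that is to persist the scaler and the regressor together as a single scikit-learn pipeline. The following is a minimal sketch of that idea, not the approach used above. It assumes synthetic data from `make_regression` as a stand-in for the housing dataset so that it runs offline. The helper names `train_and_save` and `load_and_evaluate` and the file name `ridge_pipeline.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_and_save(path):
    """Fit scaler + RidgeCV as one pipeline, save it, and return the test data and MSE."""
    # Synthetic stand-in data (assumption): 500 samples, 8 features.
    X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    # The pipeline fits the scaler on the training data only and applies
    # the same scaling automatically inside predict().
    model = make_pipeline(
        StandardScaler(),
        RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100]),
    )
    model.fit(X_train, y_train)

    # A single file now holds both the fitted scaler and the fitted regressor.
    joblib.dump(model, path)

    mse = mean_squared_error(y_test, model.predict(X_test))
    return X_test, y_test, mse


def load_and_evaluate(path, X_test, y_test):
    """Load the saved pipeline and score it on raw (unscaled) test features."""
    model = joblib.load(path)
    return mean_squared_error(y_test, model.predict(X_test))


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "ridge_pipeline.pkl")
    X_test, y_test, mse_before = train_and_save(model_path)
    mse_after = load_and_evaluate(model_path, X_test, y_test)

print(f"MSE before saving: {mse_before}")
print(f"MSE after loading: {mse_after}")
```

Because the scaler travels with the model, the loading side needs no access to the training data and does not have to refit a `StandardScaler`; it passes raw features straight to `predict()`.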