[scikit-learn Machine Learning] 6. Logistic Regression

2024-11-09 Source: 个人技术集锦

These are study notes.

Logistic regression is commonly used for classification tasks.

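1. Binary classification with logistic regression

Definition: let $X$ be a continuous random variable. $X$ follows a logistic distribution if it has the following distribution function and density function: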

$F(x) = P(X \leq x) = \frac{1}{1+e^{-(x-\mu)/\gamma}}$

$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma (1+e^{-(x-\mu)/\gamma})^2}$

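In logistic regression, an instance is predicted as the positive class when its predicted probability is >= the threshold, and as the negative class otherwise.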
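The thresholding rule can be sketched in a few lines of NumPy. This is only an illustration: the function names `logistic_cdf` and `predict_label`, the default threshold of 0.5, and the sample scores are assumptions made for the example, not part of the original notes.

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Logistic distribution function with location mu and scale gamma."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def predict_label(prob, threshold=0.5):
    """Return 1 (positive class) where prob >= threshold, else 0."""
    return (np.asarray(prob) >= threshold).astype(int)

# Hypothetical linear scores for three instances
scores = np.array([-2.0, 0.0, 3.0])
probs = logistic_cdf(scores)                      # values in (0, 1)
labels = predict_label(probs)                     # default threshold 0.5
strict_labels = predict_label(probs, threshold=0.99)

print(probs)
print(labels)
print(strict_labels)
```

Raising the threshold makes the classifier more reluctant to predict the positive class, which typically trades recall for precision.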

2. Spam filtering

Extract TF-IDF features from the messages and classify them with logistic regression.
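Before running on the real data set, it can help to see what `TfidfVectorizer` produces on a tiny corpus. The three messages below are made up for illustration; the vectorizer is used with its default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not taken from the SMS data set
messages = ["free prize call now", "call me later", "free free prize"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)   # sparse matrix: one row per message

vocab = sorted(vectorizer.vocabulary_)   # the learned terms
row_norms = np.sqrt(X.multiply(X).sum(axis=1))

print(vocab)
print(X.shape)
print(X.toarray().round(2))
```

Each column corresponds to one term of the learned vocabulary, and with the default `norm='l2'` every row is scaled to unit length.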

import pandas as pd
data = pd.read_csv("SMSSpamCollection", delimiter='\t',header=None)
data

data[data[0]=='ham'][0].count() # 4825 ham (legitimate) messages
data[data[0]=='spam'][0].count() # 747 spam messages

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = data[1].values
y = data[0].values
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
y = lb.fit_transform(y)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print("Predicted: %s, message: %s, actual: %s" %(pred_i,X_test_raw[i],y_test[i]))

Predicted: 0, message: Aww that's the first time u said u missed me without asking if I missed u first. You DO love me! :), actual: [0]
Predicted: 0, message: Poor girl can't go one day lmao, actual: [0]
Predicted: 0, message: Also remember the beads don't come off. Ever., actual: [0]
Predicted: 0, message: I see the letter B on my car, actual: [0]
Predicted: 0, message: My love ! How come it took you so long to leave for Zaher's? I got your words on ym and was happy to see them but was sad you had left. I miss you, actual: [0]

2.1 Performance metrics

Confusion matrix

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
# Store the result under a new name so the confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, pred)
plt.matshow(cm)
plt.title("Confusion matrix")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.colorbar()
plt.show()

2.2 Accuracy

scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))

Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318

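Accuracy is not a very suitable performance metric here: it cannot tell the two kinds of error apart, that is, whether a positive was predicted as negative (false negative) or a negative was predicted as positive (false positive).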
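A small made-up example shows the limitation: the two prediction vectors below have the same accuracy, yet one makes only false negatives and the other only false positives. The labels are invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth: 4 positives (1) and 6 negatives (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # misses two positives
pred_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # flags two negatives

acc_a = accuracy_score(y_true, pred_a)
acc_b = accuracy_score(y_true, pred_b)

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm_a = confusion_matrix(y_true, pred_a)
cm_b = confusion_matrix(y_true, pred_b)

print(acc_a, acc_b)
print(cm_a)
print(cm_b)
```

For spam filtering the difference matters: a false positive hides a legitimate message, while a false negative only lets a spam message through.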

2.3 Precision and recall

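Looking at precision or recall in isolation is not meaningful. For example, a classifier that labels every message as spam reaches a recall of 1 while its precision is poor.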

from sklearn.metrics import precision_score, recall_score, f1_score
precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)

Precision: 0.9852941176470589
Almost every message predicted as spam really is spam.

Recall: 0.6979166666666666
About 30% of the spam messages were predicted as non-spam.

2.4 F1 score

The F1 score balances the precision and recall above: it is their harmonic mean.

f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)
# F1 score: 0.8170731707317074
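As a quick check, the F1 score can be recomputed by hand from the precision and recall printed earlier, using the harmonic-mean formula F1 = 2PR / (P + R). The variable names are chosen for this sketch only.

```python
# Precision and recall as printed in the output above
precision = 0.9852941176470589
recall = 0.6979166666666666

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

# Arithmetic mean, shown only for comparison
arithmetic_mean = (precision + recall) / 2

print('F1 (manual): %s' % f1_manual)
print('Arithmetic mean: %s' % arithmetic_mean)
```

The harmonic mean sits closer to the smaller of the two values, so a low recall pulls F1 down more than an arithmetic mean would.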

2.5 ROC and AUC

The closer a classifier's AUC (area under the ROC curve) is to 1, the better; a random classifier has an AUC of 0.5.

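from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Score with the predicted probability of the positive class (spam) instead of
# the hard 0/1 labels, so the curve is traced over many thresholds
pred_proba = classifier.predict_proba(X_test)[:, 1]
false_positive_rate, recall, thresholds = roc_curve(y_test, pred_proba)
# Store the result under a new name so the roc_auc_score function is not shadowed
roc_auc = roc_auc_score(y_test, pred_proba)

plt.title('Receiver operating characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()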
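One way to read the AUC is as the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The sketch below checks this on four made-up scores; the data and the pairwise loop are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and classifier scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Pairwise interpretation: count positives ranked above negatives (ties count 0.5)
pos = scores[y_true == 1]
neg = scores[y_true == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
pairwise_auc = wins / (len(pos) * len(neg))

print(auc, pairwise_auc)
```

Here three of the four pairs are ordered correctly, so both computations give 0.75.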

3. Tuning parameters with grid search

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score


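pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75), # <step name>__<parameter name>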
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == "__main__":
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)
    
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
        
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))

Best score: 0.985
Best parameters set:
	clf__C: 10
	clf__penalty: 'l2'
	vect__max_df: 0.5
	vect__max_features: 5000
	vect__ngram_range: (1, 2)
	vect__stop_words: None
	vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231

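After tuning the parameters, recall improved (about 0.86 here, compared with about 0.70 for the untuned model above).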
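Grid search is exhaustive, so its cost grows with the product of the option counts. A rough sketch of the size of the search above, using `ParameterGrid` from scikit-learn to enumerate the same values (written as lists here) with `cv=3`:

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the grid above
parameters = {
    'vect__max_df': [0.25, 0.5, 0.75],
    'vect__stop_words': ['english', None],
    'vect__max_features': [2500, 5000, None],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__use_idf': [True, False],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.01, 0.1, 1, 10],
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)       # 3 * 2 * 3 * 2 * 2 * 2 * 4
n_fits = n_candidates * 3      # one fit per candidate per fold (cv=3)

print(n_candidates, n_fits)
print(list(grid)[0])           # one concrete parameter combination
```

Adding one more option to any parameter multiplies the total number of fits, which is why `n_jobs=-1` is used to run them in parallel.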

4. Multi-class classification

Predicting the sentiment of movie reviews

data = pd.read_csv("./chapter5_movie_train.csv",header=0,delimiter='\t')
data

data['Sentiment'].describe()

count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64

On average, the sentiment is fairly neutral (the mean is about 2.06 on a scale from 0 to 4).

data["Sentiment"].value_counts()/data["Sentiment"].count()

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64

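About 51% of the examples carry the neutral label (2), so always predicting class 2 would already reach about 51% accuracy.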

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

Best score: 0.619
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False

Performance metrics

predictions = grid_search.predict(X_test)

print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))

Accuracy: 0.6292323465333846
Confusion Matrix:
[[ 1013  1742   682   106    11]
 [  794  5914  6275   637    49]
 [  196  3207 32397  3686   222]
 [   28   488  6513  8131  1299]
 [    1    59   548  2388  1644]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.29      0.36      3554
           1       0.52      0.43      0.47     13669
           2       0.70      0.82      0.75     39708
           3       0.54      0.49      0.52     16459
           4       0.51      0.35      0.42      4640

    accuracy                           0.63     78030
   macro avg       0.55      0.48      0.50     78030
weighted avg       0.61      0.63      0.62     78030
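The per-class figures in the report can be derived directly from the confusion matrix: recall divides each diagonal entry by its row sum (all instances of the true class), and precision divides it by its column sum (all instances predicted as that class). A short NumPy sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [1013, 1742,   682,  106,   11],
    [ 794, 5914,  6275,  637,   49],
    [ 196, 3207, 32397, 3686,  222],
    [  28,  488,  6513, 8131, 1299],
    [   1,   59,   548, 2388, 1644],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # per true class
precision = true_positives / cm.sum(axis=0)   # per predicted class
accuracy = true_positives.sum() / cm.sum()

print('Recall:   ', recall.round(2))
print('Precision:', precision.round(2))
print('Accuracy: ', accuracy)
```

Most of the mass away from the diagonal lies in neighbouring classes (for example, 6275 instances of class 1 predicted as class 2), so the model mainly confuses adjacent sentiment levels.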

5. Multi-label classification

An instance can be assigned multiple labels.

Problem transformation:

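Convert each set of labels that occurs on an instance (say L1 and L2) into a single combined class (L1 and L2), and so on. Drawback: this produces a large number of classes, and the model can only predict the combinations present in the training data, so many possible combinations may not be covered.
Train one binary classifier per label (is this instance L1? is it L2?). Drawback: this ignores the relationships between labels.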
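The two transformations can be sketched on a handful of made-up label sets. The first part maps every distinct label combination to one class id using a plain dictionary (a hand-rolled illustration, not a library API); the second uses scikit-learn's `MultiLabelBinarizer` to build one binary target column per label.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets for four instances
label_sets = [("L1", "L2"), ("L1",), ("L2",), ("L1", "L2")]

# Approach 1: each distinct combination of labels becomes its own class
combo_to_class = {}
combined_y = []
for labels in label_sets:
    key = tuple(sorted(labels))
    combo_to_class.setdefault(key, len(combo_to_class))
    combined_y.append(combo_to_class[key])

# Approach 2: one binary column per label, each usable as a separate target
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(label_sets)

print(combined_y)          # class id per instance
print(list(mlb.classes_))  # column order of the indicator matrix
print(indicator)
```

With k labels the first approach can need up to 2^k classes, while the second always trains exactly k binary classifiers.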

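5.1 Performance metrics for multi-label classification

Hamming loss: the average fraction of incorrectly predicted labels; 0 is best.
Jaccard similarity coefficient: the size of the intersection of the predicted and true labels divided by the size of their union; 1 is best.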

from sklearn.metrics import hamming_loss, jaccard_score
# help(jaccard_score)

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]])))

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]])))

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]]),average=None))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]]),average=None))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]]),average=None))

0.0
0.25
0.5
[1. 1.]
[0.5 1. ]
[0. 1.]
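Both metrics are simple enough to recompute by hand. The sketch below reuses the last pair of arrays from above and compares a NumPy computation with the scikit-learn results. Note that `average=None` scores each label (column) separately, whereas `average='samples'` scores each instance (row) and averages the results.

```python
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score

y_true = np.array([[0, 1], [1, 1]])
y_pred = np.array([[1, 1], [0, 1]])

# Hamming loss: fraction of label slots that differ (2 of 4 here)
manual_hamming = np.mean(y_true != y_pred)

# Jaccard per label (column-wise), matching average=None
inter_label = np.logical_and(y_true, y_pred).sum(axis=0)
union_label = np.logical_or(y_true, y_pred).sum(axis=0)
manual_per_label = inter_label / union_label

# Jaccard per instance (row-wise), then averaged, matching average='samples'
inter_sample = np.logical_and(y_true, y_pred).sum(axis=1)
union_sample = np.logical_or(y_true, y_pred).sum(axis=1)
manual_samples = np.mean(inter_sample / union_sample)

print(manual_hamming, hamming_loss(y_true, y_pred))
print(manual_per_label, jaccard_score(y_true, y_pred, average=None))
print(manual_samples, jaccard_score(y_true, y_pred, average='samples'))
```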