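I am using sklearn's MLPClassifier in Python to build a neural network for a classification task. I want to plot accuracy against the number of epochs, to see how many epochs I need to reach a given accuracy. The only way I have found to do this is to call partial_fit() in a loop. Here is the code that does it: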
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
scaler = StandardScaler()
scaler.fit(df_train_sample)
X_train = scaler.transform(df_train_sample)
scaler.fit(df_val)
X_val = scaler.transform(df_val)
pca = PCA(pca_frac)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_val = pca.transform(X_val)
n_classes = np.unique(labels_train_sample)
n_train_sample = len(df_train_sample)
scores_train = []
scores_val = []
epoch = 0
while epoch < max_iter:
    # Reshuffle the training rows at the start of every epoch
    random_perm = np.random.permutation(n_train_sample)
    mini_batch_index = 0
    while True:
        # Train on one mini-batch at a time
        indices = random_perm[mini_batch_index:mini_batch_index + batch_size]
        mlpc.partial_fit(X_train[indices], labels_train_sample[indices], classes=n_classes)
        mini_batch_index += batch_size
        if mini_batch_index >= n_train_sample:
            break
    # Record accuracy on both sets once per epoch
    scores_train.append(mlpc.score(X_train, labels_train_sample))
    scores_val.append(mlpc.score(X_val, labels_val))
    epoch += 1
fig, ax = plt.subplots()
ax.plot(np.arange(1, max_iter + 1), scores_train, label = "Train")
ax.plot(np.arange(1, max_iter + 1), scores_val, label = "Validation")
Here, max_iter is the number of epochs and mlpc is the classifier, defined as follows:
seed = 123
hidden_layers = [30, 15]
activation = "relu"
learning_rate = 5e-4
beta_1 = 0.99
epsilon = 1e-4
batch_size = 200
max_iter = 200
tol = 1e-4
warm_start = True
shuffle = True
mlpc = MLPClassifier(
    hidden_layer_sizes = hidden_layers,
    activation = activation,
    batch_size = batch_size,
    learning_rate_init = learning_rate,
    beta_1 = beta_1,
    epsilon = epsilon,
    warm_start = warm_start,
    shuffle = shuffle,
    max_iter = max_iter,
    tol = tol,
    random_state = seed
)
For completeness, here is how df_train_sample and labels_train_sample are built from the original DataFrames:
df_train_sample = df_train.sample(N, replace = False).reset_index(drop = True)
labels_train_sample = labels_train[df_train_sample.index].reset_index(drop = True)
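where N is the number of rows to sample. df_val and labels_val are the validation data, read directly from a .csv file without modification. Note that the labels are booleans.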
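The problem is that if I train the model by calling mlpc.fit(), the accuracy on the sampled dataset is about 82%, whereas the code I posted reaches only 65%. This is the plot: [image not included]
Searching online, I found that shuffling the data can help, but as you can see the data is already shuffled at every epoch. Why does this happen? Is there another, more direct way to build this plot?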
I found the problem. The issue was not partial_fit(), but the way I built the sampled DataFrame:
df_train_sample = df_train.sample(N, replace = False).reset_index(drop = True)
labels_train_sample = labels_train[df_train_sample.index].reset_index(drop = True)
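In this part, I reset the index of df_train_sample when building it, but then used that index to select the labels from labels_train. After the reset the index is simply 0 to N-1, so the lookup returned the first N labels of labels_train rather than the labels of the sampled rows, and the features and labels no longer matched. Had I not reset the index (which is what I did in an earlier version), this would have worked.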
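The misalignment is easy to reproduce on toy data. The following is a minimal sketch, not the original dataset: the column name `x`, the even/odd labelling rule, and the `random_state` values are assumptions chosen so that a wrong label is easy to spot.

```python
import pandas as pd

# Toy data (assumption): the label is True exactly when x is even,
# so any mismatch between a row and its label is easy to detect.
df_train = pd.DataFrame({"x": range(100)})
labels_train = df_train["x"] % 2 == 0

N = 20

# Buggy pattern: after reset_index the index is 0..N-1, so the lookup
# returns the first N labels instead of the labels of the sampled rows.
sample_bad = df_train.sample(N, replace=False, random_state=0).reset_index(drop=True)
labels_bad = labels_train[sample_bad.index].reset_index(drop=True)

# Fixed pattern: remember the original index before resetting it.
sample = df_train.sample(N, replace=False, random_state=0)
train_index = sample.index
sample_ok = sample.reset_index(drop=True)
labels_ok = labels_train[train_index].reset_index(drop=True)

# Compare each set of labels with what the sampled rows should have.
expected = sample_ok["x"] % 2 == 0
bad_matches = int((labels_bad == expected).sum())
ok_matches = int((labels_ok == expected).sum())
print(f"buggy: {bad_matches}/{N} labels match, fixed: {ok_matches}/{N} labels match")
```

With the buggy pattern the network is effectively trained on labels that are unrelated to the sampled rows, which would explain a training accuracy well below what fit() reaches on correctly aligned data.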
The solution is simply to store the index before resetting it, like this:
df_train_sample = df_train.sample(N, replace = False)
train_index = df_train_sample.index
df_train_sample = df_train_sample.reset_index(drop = True)
labels_train_sample = labels_train[train_index].reset_index(drop = True)
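As for a more direct way to build the accuracy curve, one option is to call partial_fit() once per epoch on the whole training set instead of slicing mini-batches by hand. The scikit-learn documentation describes partial_fit() as a single iteration over the data it is given; as far as I understand, the estimator then splits that data into mini-batches of batch_size and reshuffles when shuffle=True, but that is worth checking against the scikit-learn version in use. The sketch below uses synthetic data from make_classification and an illustrative n_epochs, both of which are assumptions rather than the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): replace with the real
# X_train, X_val and label arrays.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

n_epochs = 20  # illustrative value, not the max_iter used above

mlpc = MLPClassifier(
    hidden_layer_sizes=[30, 15],
    activation="relu",
    batch_size=200,
    learning_rate_init=5e-4,
    shuffle=True,
    random_state=123,
)
classes = np.unique(y_train)

scores_train, scores_val = [], []
for epoch in range(n_epochs):
    # One call performs one pass over the data that is passed in.
    mlpc.partial_fit(X_train, y_train, classes=classes)
    # Record accuracy on both sets after each pass.
    scores_train.append(mlpc.score(X_train, y_train))
    scores_val.append(mlpc.score(X_val, y_val))

print(f"final train accuracy: {scores_train[-1]:.3f}")
print(f"final validation accuracy: {scores_val[-1]:.3f}")
```

The two score lists can then be plotted with the same ax.plot calls as in the question.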