cs231n - assignment1 - Neural Net Gradient Derivation


Author: 杜客. Link: https://www.zhihu.com/question/37686401/answer/101494842. Source: Zhihu. Copyright belongs to the author; for commercial reposting please contact the author for authorization, and for non-commercial reposting please credit the source.

Implementing a Neural Network

In this exercise we will develop a neural network with fully-connected layers to perform classification, and test it out on the CIFAR-10 dataset.


You may want to first review the earlier softmax gradient derivation. From here on, the gradients are derived in matrix form and propagated backward layer by layer, which keeps the bookkeeping much cleaner.


First, recall the network structure: an input layer (D), a fully-connected layer with ReLU (H), and a softmax output (C). The network input is $X_{[N \times D]}$ and the ground-truth labels are $y_{[N \times 1]}$.


Network parameters: $W_{1[D \times H]}$, $b_{1[1 \times H]}$, $W_{2[H \times C]}$, $b_{2[1 \times C]}$.


Forward propagation:


$FC1\_out = X \cdot W_1 + b_1$ --- (1)


$H\_out = maximum(0, FC1\_out)$ --- (2)


$FC2\_out = H\_out \cdot W_2 + b_2$ --- (3)

$final\_output = softmax(FC2\_out)$ --- (4)
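
To make the shapes concrete, here is a minimal numpy sketch of equations (1)-(4) on random toy data. The toy sizes N, D, H, C and the variable names are illustrative only; the full implementation appears in neural_net.py below.

import numpy as np

# toy shapes matching the notation above: N samples, D inputs, H hidden units, C classes
N, D, H, C = 5, 4, 10, 3
rng = np.random.RandomState(0)
X = rng.randn(N, D)
W1, b1 = 1e-2 * rng.randn(D, H), np.zeros(H)
W2, b2 = 1e-2 * rng.randn(H, C), np.zeros(C)

FC1_out = X.dot(W1) + b1                      # (1) first affine layer, shape (N, H)
H_out = np.maximum(0, FC1_out)                # (2) ReLU,               shape (N, H)
FC2_out = H_out.dot(W2) + b2                  # (3) class scores,       shape (N, C)
# (4) softmax, shifted by the row max for numerical stability
exp_scores = np.exp(FC2_out - FC2_out.max(axis=1, keepdims=True))
final_output = exp_scores / exp_scores.sum(axis=1, keepdims=True)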

Backpropagation:

$\frac{\partial L}{\partial FC2\_out} = final\_output_{[N \times C]} - MaskMat_{[N \times C]}$ --- (5)

For MaskMat, see the earlier softmax derivation: it is the one-hot indicator matrix of the ground-truth labels, with MaskMat[i, y[i]] = 1 and zeros elsewhere.
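
Continuing the toy example above, MaskMat is just the one-hot encoding of the labels (a sketch; the label vector y here is random toy data):

y = rng.randint(C, size=N)                  # toy ground-truth labels in [0, C)
MaskMat = np.zeros((N, C))
MaskMat[np.arange(N), y] = 1.0              # row i has a 1 in column y[i], zeros elsewhere
dFC2_out = (final_output - MaskMat) / N     # equation (5); the extra 1/N appears because
                                            # the data loss is averaged over the batch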

$\frac{\partial L}{\partial W_2} = \frac{\partial FC2\_out}{\partial W_2} \frac{\partial L}{\partial FC2\_out} = H\_out^T \cdot \frac{\partial L}{\partial FC2\_out}$ --- (6)

$\frac{\partial L}{\partial b_2} = \frac{\partial FC2\_out}{\partial b_2} \frac{\partial L}{\partial FC2\_out} = [1...1]_{[1 \times N]} \cdot \frac{\partial L}{\partial FC2\_out}$ --- (7)

$\frac{\partial L}{\partial H\_out} = \frac{\partial L}{\partial FC2\_out} \frac{\partial FC2\_out}{\partial H\_out} = \frac{\partial L}{\partial FC2\_out} \cdot W_2^T$, and backpropagating through the ReLU: $\frac{\partial L}{\partial FC1\_out} = \frac{\partial L}{\partial H\_out} \odot \mathbb{1}[FC1\_out > 0]$, i.e. the gradient is zeroed wherever the ReLU input was non-positive --- (8)

$\frac{\partial L}{\partial W_1} = \frac{\partial FC1\_out}{\partial W_1} \cdot \frac{\partial L}{\partial FC1\_out} = X^T \cdot \frac{\partial L}{\partial FC1\_out}$ --- (9)

$\frac{\partial L}{\partial b_1} = \frac{\partial FC1\_out}{\partial b_1} \cdot \frac{\partial L}{\partial FC1\_out} = [1...1]_{[1 \times N]} \cdot \frac{\partial L}{\partial FC1\_out}$ --- (10)
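
Chained together on the toy example above, equations (6)-(10) are only a few numpy lines. This sketch starts from dFC2_out computed under equation (5); the L2 regularization terms are omitted here and handled in the full code below.

dW2 = H_out.T.dot(dFC2_out)                  # (6)  shape (H, C)
db2 = np.ones((1, N)).dot(dFC2_out).ravel()  # (7)  equivalently dFC2_out.sum(axis=0)
dH_out = dFC2_out.dot(W2.T)                  # (8)  shape (N, H)
dFC1_out = dH_out * (FC1_out > 0)            # (8)  zero the gradient at inactive ReLU units
dW1 = X.T.dot(dFC1_out)                      # (9)  shape (D, H)
db1 = np.ones((1, N)).dot(dFC1_out).ravel()  # (10) equivalently dFC1_out.sum(axis=0)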


# neural_net.py
import numpy as np
import matplotlib.pyplot as plt


class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network. The net has an input dimension
    of N, a hidden layer dimension of H, and performs classification over C
    classes. We train the network with a softmax loss function and L2
    regularization on the weight matrices. The network uses a ReLU nonlinearity
    after the first fully connected layer.

    In other words, the network has the following architecture:

    input - fully connected layer - ReLU - fully connected layer - softmax

    The outputs of the second fully-connected layer are the scores for each class.
    """

    def __init__(self, input_size, hidden_size, output_size, std=1e-4):
        """
        Initialize the model. Weights are initialized to small random values and
        biases are initialized to zero. Weights and biases are stored in the
        variable self.params, which is a dictionary with the following keys:

        W1: First layer weights; has shape (D, H)
        b1: First layer biases; has shape (H,)
        W2: Second layer weights; has shape (H, C)
        b2: Second layer biases; has shape (C,)

        Inputs:
        - input_size: The dimension D of the input data.
        - hidden_size: The number of neurons H in the hidden layer.
        - output_size: The number of classes C.
        """
        self.params = {}
        self.params['W1'] = std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def loss(self, X, y=None, reg=0.0):
        """
        Compute the loss and gradients for a two layer fully connected neural
        network.

        Inputs:
        - X: Input data of shape (N, D). Each X[i] is a training sample.
        - y: Vector of training labels. y[i] is the label for X[i], and each
          y[i] is an integer in the range 0 <= y[i] < C. This parameter is
          optional; if it is not passed then we only return scores, and if it
          is passed then we instead return the loss and gradients.
        - reg: Regularization strength.

        Returns:
        If y is None, return a matrix scores of shape (N, C) where scores[i, c]
        is the score for class c on input X[i].

        If y is not None, instead return a tuple of:
        - loss: Loss (data loss and regularization loss) for this batch of
          training samples.
        - grads: Dictionary mapping parameter names to gradients of those
          parameters with respect to the loss function; has the same keys as
          self.params.
        """
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        N, D = X.shape

        # Compute the forward pass
        scores = None
        #######################################################################
        # TODO: Perform the forward pass, computing the class scores for the  #
        # input. Store the result in the scores variable, which should be an  #
        # array of shape (N, C).                                              #
        #######################################################################
        # evaluate class scores, [N x C]
        hidden_layer = np.maximum(0, np.dot(X, W1) + b1)  # ReLU activation
        scores = np.dot(hidden_layer, W2) + b2
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################

        # If the targets are not given then jump out, we're done
        if y is None:
            return scores

        # Compute the loss
        loss = None
        #######################################################################
        # TODO: Finish the forward pass, and compute the loss. This should    #
        # include both the data loss and L2 regularization for W1 and W2.     #
        # Store the result in the variable loss, which should be a scalar.    #
        # Use the Softmax classifier loss. So that your results match ours,   #
        # multiply the regularization loss by 0.5                             #
        #######################################################################
        # compute the class probabilities; shift by the row max for stability
        exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # [N x C]
        correct_logprobs = -np.log(probs[range(N), y])
        data_loss = np.sum(correct_logprobs) / N
        reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
        loss = data_loss + reg_loss
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################

        # Backward pass: compute gradients
        grads = {}
        #######################################################################
        # TODO: Compute the backward pass, computing the derivatives of the   #
        # weights and biases. Store the results in the grads dictionary. For  #
        # example, grads['W1'] should store the gradient on W1, and be a      #
        # matrix of same size                                                 #
        #######################################################################
        # compute the gradient on scores
        dscores = probs
        dscores[range(N), y] -= 1
        dscores /= N

        # backpropagate the gradient to the parameters
        # first backprop into parameters W2 and b2
        dW2 = np.dot(hidden_layer.T, dscores)
        db2 = np.sum(dscores, axis=0, keepdims=False)
        # next backprop into hidden layer
        dhidden = np.dot(dscores, W2.T)
        # backprop the ReLU non-linearity
        dhidden[hidden_layer <= 0] = 0
        # finally into W, b
        dW1 = np.dot(X.T, dhidden)
        db1 = np.sum(dhidden, axis=0, keepdims=False)

        # add regularization gradient contribution
        dW2 += reg * W2
        dW1 += reg * W1

        grads['W1'] = dW1
        grads['W2'] = dW2
        grads['b1'] = db1
        grads['b2'] = db2
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################

        return loss, grads

    def train(self, X, y, X_val, y_val,
              learning_rate=1e-3, learning_rate_decay=0.95,
              reg=1e-5, num_iters=100,
              batch_size=200, verbose=False):
        """
        Train this neural network using stochastic gradient descent.

        Inputs:
        - X: A numpy array of shape (N, D) giving training data.
        - y: A numpy array of shape (N,) giving training labels; y[i] = c means
          that X[i] has label c, where 0 <= c < C.
        - X_val: A numpy array of shape (N_val, D) giving validation data.
        - y_val: A numpy array of shape (N_val,) giving validation labels.
        - learning_rate: Scalar giving learning rate for optimization.
        - learning_rate_decay: Scalar giving factor used to decay the learning
          rate after each epoch.
        - reg: Scalar giving regularization strength.
        - num_iters: Number of steps to take when optimizing.
        - batch_size: Number of training examples to use per step.
        - verbose: boolean; if true print progress during optimization.
        """
        num_train = X.shape[0]
        iterations_per_epoch = max(num_train / batch_size, 1)

        # Use SGD to optimize the parameters in self.params
        loss_history = []
        train_acc_history = []
        val_acc_history = []

        for it in xrange(num_iters):
            X_batch = None
            y_batch = None
            ###################################################################
            # TODO: Create a random minibatch of training data and labels,    #
            # storing them in X_batch and y_batch respectively.               #
            ###################################################################
            sample_index = np.random.choice(num_train, batch_size, replace=True)
            X_batch = X[sample_index, :]
            y_batch = y[sample_index]
            ###################################################################
            #                        END OF YOUR CODE                         #
            ###################################################################

            # Compute loss and gradients using the current minibatch
            loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
            loss_history.append(loss)

            ###################################################################
            # TODO: Use the gradients in the grads dictionary to update the   #
            # parameters of the network (stored in the dictionary             #
            # self.params) using stochastic gradient descent. You'll need to  #
            # use the gradients stored in the grads dictionary defined above. #
            ###################################################################
            dW1 = grads['W1']
            dW2 = grads['W2']
            db1 = grads['b1']
            db2 = grads['b2']
            self.params['W1'] -= learning_rate * dW1
            self.params['W2'] -= learning_rate * dW2
            self.params['b1'] -= learning_rate * db1
            self.params['b2'] -= learning_rate * db2
            ###################################################################
            #                        END OF YOUR CODE                         #
            ###################################################################

            if verbose and it % 100 == 0:
                print 'iteration %d / %d: loss %f' % (it, num_iters, loss)

            # Every epoch, check train and val accuracy and decay learning rate.
            if it % iterations_per_epoch == 0:
                # Check accuracy
                train_acc = (self.predict(X_batch) == y_batch).mean()
                val_acc = (self.predict(X_val) == y_val).mean()
                train_acc_history.append(train_acc)
                val_acc_history.append(val_acc)

                # Decay learning rate
                learning_rate *= learning_rate_decay

        return {
            'loss_history': loss_history,
            'train_acc_history': train_acc_history,
            'val_acc_history': val_acc_history,
        }

    def predict(self, X):
        """
        Use the trained weights of this two-layer network to predict labels for
        data points. For each data point we predict scores for each of the C
        classes, and assign each data point to the class with the highest score.

        Inputs:
        - X: A numpy array of shape (N, D) giving N D-dimensional data points
          to classify.

        Returns:
        - y_pred: A numpy array of shape (N,) giving predicted labels for each
          of the elements of X. For all i, y_pred[i] = c means that X[i] is
          predicted to have class c, where 0 <= c < C.
        """
        y_pred = None
        #######################################################################
        # TODO: Implement this function; it should be VERY simple!            #
        #######################################################################
        hidden_lay = np.maximum(0, np.dot(X, self.params['W1']) + self.params['b1'])
        # include b2 so the predicted scores match those computed in loss()
        scores = np.dot(hidden_lay, self.params['W2']) + self.params['b2']
        y_pred = np.argmax(scores, axis=1)
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################

        return y_pred
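
As a sanity check, the analytic gradients returned by loss() can be compared against centered-difference numerical gradients on a tiny random model. The sketch below assumes the class above is saved as neural_net.py; the helper num_grad and the toy sizes are illustrative, not part of the assignment code.

import numpy as np
from neural_net import TwoLayerNet

# tiny model and data so the check runs in a fraction of a second
np.random.seed(0)
net = TwoLayerNet(input_size=4, hidden_size=10, output_size=3, std=1e-1)
X = np.random.randn(5, 4)
y = np.random.randint(3, size=5)

loss, grads = net.loss(X, y, reg=0.1)

def num_grad(f, w, h=1e-5):
    # centered-difference numerical gradient of the scalar function f w.r.t. array w
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = w[ix]
        w[ix] = old + h
        fp = f()
        w[ix] = old - h
        fm = f()
        w[ix] = old
        grad[ix] = (fp - fm) / (2.0 * h)
        it.iternext()
    return grad

for name in ['W1', 'b1', 'W2', 'b2']:
    g_num = num_grad(lambda: net.loss(X, y, reg=0.1)[0], net.params[name])
    rel_err = np.max(np.abs(g_num - grads[name]) /
                     np.maximum(1e-8, np.abs(g_num) + np.abs(grads[name])))
    print('%s max relative error: %e' % (name, rel_err))  # expect very small values, ~1e-8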

Tune your hyperparameters

What's wrong? Looking at the visualizations above, we see that the loss is decreasing more or less linearly, which seems to suggest that the learning rate may be too low. Moreover, there is no gap between the training and validation accuracy, suggesting that the model we used has low capacity and that we should increase its size. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.

Tuning. Tuning the hyperparameters and developing intuition for how they affect the final performance is a large part of using Neural Networks, so we want you to get a lot of practice. Below, you should experiment with different values of the various hyperparameters, including hidden layer size, learning rate, number of training epochs, and regularization strength. You might also consider tuning the learning rate decay, but you should be able to get good performance using the default value.

Approximate results. You should aim to achieve a classification accuracy of greater than 48% on the validation set. Our best network gets over 52% on the validation set.

Experiment: Your goal in this exercise is to get as good a result on CIFAR-10 as you can with a fully-connected Neural Network. For every 1% above 52% on the Test set we will award you with one extra bonus point. Feel free to implement your own techniques (e.g. PCA to reduce dimensionality, adding dropout, or adding features to the solver, etc.).
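
If you try the PCA suggestion, a minimal numpy sketch might look like the following. The number of components k is an illustrative guess, and the snippet is not part of the provided assignment code; X_train, X_val and X_test are the flattened CIFAR-10 arrays used in the notebook below.

# PCA on the raw pixels: principal directions of the centered training data
X_mean = X_train.mean(axis=0)
Xc = X_train - X_mean
cov = Xc.T.dot(Xc) / Xc.shape[0]        # (D, D) covariance matrix, D = 32*32*3
U, S, _ = np.linalg.svd(cov)            # columns of U are the principal directions
k = 200                                 # number of components to keep (a guess to tune)
X_train_pca = Xc.dot(U[:, :k])
X_val_pca = (X_val - X_mean).dot(U[:, :k])
X_test_pca = (X_test - X_mean).dot(U[:, :k])
# train the same TwoLayerNet with input_size=k on the *_pca arrays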

# two_layer_net.ipynb
best_net = None    # store the best model into this
best_stats = None
#################################################################################
# TODO: Tune hyperparameters using the validation set. Store your best trained  #
# model in best_net.                                                            #
#                                                                               #
# To help debug your network, it may help to use visualizations similar to the  #
# ones we used above; these visualizations will have significant qualitative    #
# differences from the ones we saw above for the poorly tuned network.          #
#                                                                               #
# Tweaking hyperparameters by hand can be fun, but you might find it useful to  #
# write code to sweep through possible combinations of hyperparameters          #
# automatically like we did on the previous exercises.                          #
#################################################################################
input_size = 32 * 32 * 3
hidden_size = 300
num_classes = 10

results = {}
best_val = -1
learning_rates = [1e-3, 1.2e-3, 1.4e-3, 1.6e-3, 1.8e-3]
regularization_strengths = [1e-4, 1e-3, 1e-2]
params = [(x, y) for x in learning_rates for y in regularization_strengths]

for lrate, regular in params:
    net = TwoLayerNet(input_size, hidden_size, num_classes)
    # Train the network
    stats = net.train(X_train, y_train, X_val, y_val,
                      num_iters=1600, batch_size=400,
                      learning_rate=lrate, learning_rate_decay=0.90,
                      reg=regular, verbose=False)
    # Predict on the training and validation sets
    accuracy_train = (net.predict(X_train) == y_train).mean()
    accuracy_val = (net.predict(X_val) == y_val).mean()
    results[(lrate, regular)] = (accuracy_train, accuracy_val)
    if best_val < accuracy_val:
        best_val = accuracy_val
        best_net = net
        best_stats = stats

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (
        lr, reg, train_accuracy, val_accuracy)

print 'best validation accuracy achieved during cross-validation: %f' % best_val

# Plot the loss function and train / validation accuracies
plt.subplot(2, 1, 1)
plt.plot(best_stats['loss_history'])
plt.title('Loss history')
plt.xlabel('Iteration')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(best_stats['train_acc_history'], label='train', color='r')
plt.plot(best_stats['val_acc_history'], label='val', color='g')
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.show()
#################################################################################
#                               END OF YOUR CODE                                #
#################################################################################

lr 1.000000e-03 reg 1.000000e-04 train accuracy: 0.541551 val accuracy: 0.499000

lr 1.000000e-03 reg 1.000000e-03 train accuracy: 0.541694 val accuracy: 0.511000

lr 1.000000e-03 reg 1.000000e-02 train accuracy: 0.540898 val accuracy: 0.490000

lr 1.200000e-03 reg 1.000000e-04 train accuracy: 0.562041 val accuracy: 0.528000

lr 1.200000e-03 reg 1.000000e-03 train accuracy: 0.563653 val accuracy: 0.507000

lr 1.200000e-03 reg 1.000000e-02 train accuracy: 0.564184 val accuracy: 0.512000

lr 1.400000e-03 reg 1.000000e-04 train accuracy: 0.580857 val accuracy: 0.532000

lr 1.400000e-03 reg 1.000000e-03 train accuracy: 0.580857 val accuracy: 0.513000

lr 1.400000e-03 reg 1.000000e-02 train accuracy: 0.575245 val accuracy: 0.534000

lr 1.600000e-03 reg 1.000000e-04 train accuracy: 0.593347 val accuracy: 0.529000

lr 1.600000e-03 reg 1.000000e-03 train accuracy: 0.594857 val accuracy: 0.548000

lr 1.600000e-03 reg 1.000000e-02 train accuracy: 0.593878 val accuracy: 0.551000

lr 1.800000e-03 reg 1.000000e-04 train accuracy: 0.605306 val accuracy: 0.537000

lr 1.800000e-03 reg 1.000000e-03 train accuracy: 0.610000 val accuracy: 0.533000

lr 1.800000e-03 reg 1.000000e-02 train accuracy: 0.603204 val accuracy: 0.546000

best validation accuracy achieved during cross-validation: 0.551000

Test accuracy: 0.542

[Figure: loss history and train / validation classification accuracy curves for the best network]
