Dropout
Forward pass
```python
import numpy as np


def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.

    NOTE: Please implement **inverted** dropout, not the vanilla version of
    dropout. See http://cs231n.github.io/neural-networks-2/#reg for more details.

    NOTE 2: Keep in mind that p is the probability of **keeping** a neuron
    output; this might be contrary to some sources, where it is referred to as
    the probability of dropping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # Note: despite the docstring, this implementation (and the test
        # output below) treats p as the probability of *dropping* a unit,
        # so the keep probability is 1 - p.
        keep_prob = 1 - p
        mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
        out = mask * x
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        out = x
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache
```

First, `np.random.rand(*x.shape)` generates a matrix of uniform random samples in [0, 1) with the same shape as the input x (the post-activation scores). Comparing it against the keep probability `keep_prob` produces a boolean array that serves as the dropout mask. In the vanilla formulation, dropping part of the activations at training time (`out = mask * x`) lowers the expected value of the layer's output, so at test time the output would have to be scaled by `keep_prob` to keep the train- and test-time distributions consistent. With the inverted dropout trick shown above, we instead divide by `keep_prob` directly at training time, so at test time no extra work is needed: the data simply passes through the dropout layer unchanged.
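The statistics below come from a quick sanity check along these lines. This is a minimal sketch, not the exact notebook cell: the 500x500 input shape and the +10 offset are assumptions chosen so the means are easy to compare.

```python
import numpy as np

np.random.seed(231)
x = np.random.randn(500, 500) + 10  # large input with positive mean (assumed shape/offset)

for p in [0.25, 0.4, 0.7]:
    out_train, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})

    print('Running tests with p =', p)
    print('Mean of input:', x.mean())
    print('Mean of train-time output:', out_train.mean())  # ~ x.mean(), thanks to the 1/keep_prob scaling
    print('Mean of test-time output:', out_test.mean())    # identical to the input
    print('Fraction of train-time output set to zero:', (out_train == 0).mean())  # ~ p (drop probability)
    print('Fraction of test-time output set to zero:', (out_test == 0).mean())
```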
```
Running tests with p = 0.25
Mean of input: 10.000207878477502
Mean of train-time output: 9.998198947788465
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.250168
Fraction of test-time output set to zero: 0.0

Running tests with p = 0.4
Mean of input: 10.000207878477502
Mean of train-time output: 9.976910758765856
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.401368
Fraction of test-time output set to zero: 0.0

Running tests with p = 0.7
Mean of input: 10.000207878477502
Mean of train-time output: 9.98254739313744
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.700496
Fraction of test-time output set to zero: 0.0
```

Backward pass
```python
def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        # Backpropagate through the same mask: the gradients of the dropped
        # units are zeroed, and the kept ones carry the 1/keep_prob scaling.
        dx = mask * dout
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx
```
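The relative error reported below comes from a numeric gradient check. A self-contained sketch of such a check with plain centered differences might look like the following; the input shape, p = 0.2, and the seed value are assumptions. Passing a seed is essential here, because it makes the forward pass redraw the same mask on every call.

```python
import numpy as np

np.random.seed(231)
x = np.random.randn(10, 10) + 10                           # assumed test input
dout = np.random.randn(*x.shape)                           # upstream gradient
dropout_param = {'mode': 'train', 'p': 0.2, 'seed': 123}   # seed makes the mask reproducible

out, cache = dropout_forward(x, dropout_param)
dx = dropout_backward(dout, cache)

# Centered finite differences of f(x) = sum(dout * dropout_forward(x)).
h = 1e-5
dx_num = np.zeros_like(x)
it = np.nditer(x, flags=['multi_index'])
while not it.finished:
    ix = it.multi_index
    old = x[ix]
    x[ix] = old + h
    pos = dropout_forward(x, dropout_param)[0]
    x[ix] = old - h
    neg = dropout_forward(x, dropout_param)[0]
    x[ix] = old
    dx_num[ix] = np.sum((pos - neg) * dout) / (2 * h)
    it.iternext()

rel_error = np.max(np.abs(dx - dx_num) / np.maximum(1e-8, np.abs(dx) + np.abs(dx_num)))
print('dx relative error:', rel_error)
```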
dx relative error: 5.445612718272284e-11
A fully-connected network with dropout:
```python
class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block
    is repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1
          then the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid
          values are "batchnorm", "layernorm", or None for no normalization
          (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers.
          This will make the dropout layers deterministic so we can gradient
          check the model.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # Initialize the parameters of every hidden layer.
        in_dim = input_dim  # D
        for i, h_dim in enumerate(hidden_dims):  # (0, H1), (1, H2), ...
            self.params['W%d' % (i + 1,)] = weight_scale * np.random.randn(in_dim, h_dim)
            self.params['b%d' % (i + 1,)] = np.zeros((h_dim,))
            if self.normalization == 'batchnorm':
                self.params['gamma%d' % (i + 1,)] = np.ones((h_dim,))   # scales start at 1
                self.params['beta%d' % (i + 1,)] = np.zeros((h_dim,))   # shifts start at 0
            in_dim = h_dim  # this layer's output size becomes the next layer's input size
        # Initialize the parameters of the output layer.
        self.params['W%d' % (self.num_layers,)] = weight_scale * np.random.randn(in_dim, num_classes)
        self.params['b%d' % (self.num_layers,)] = np.zeros((num_classes,))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When dropout is enabled, we pass the same dropout parameter dictionary,
        # self.dropout_param, to every dropout layer so that each layer knows the
        # drop probability p and the current mode (train / test).
        self.dropout_param = {}  # dropout parameter dictionary
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # When batch normalization is enabled, we keep a list of BN parameter
        # dictionaries, self.bn_params, used to track the running mean and
        # standard deviation of each layer: self.bn_params[0] is used in the
        # forward pass of the first BN layer, self.bn_params[1] in the second,
        # and so on.
        self.bn_params = []  # BN parameter dictionaries
        if self.normalization == 'batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization == 'layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization == 'batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        fc_mix_cache = {}       # caches for each hidden layer's forward pass
        if self.use_dropout:    # if dropout is enabled, keep a separate cache dict for it
            dp_cache = {}

        # Loop over the hidden layers starting from the first, passing the
        # activations forward and saving each layer's cache.
        out = X
        for i in range(self.num_layers - 1):  # iterate over the hidden layers
            w, b = self.params['W%d' % (i + 1,)], self.params['b%d' % (i + 1,)]
            if self.normalization == 'batchnorm':
                gamma = self.params['gamma%d' % (i + 1,)]
                beta = self.params['beta%d' % (i + 1,)]
                out, fc_mix_cache[i] = affine_bn_relu_forward(out, w, b, gamma, beta, self.bn_params[i])
            else:
                out, fc_mix_cache[i] = affine_relu_forward(out, w, b)
            if self.use_dropout:
                out, dp_cache[i] = dropout_forward(out, self.dropout_param)

        # The final affine output layer.
        w = self.params['W%d' % (self.num_layers,)]
        b = self.params['b%d' % (self.num_layers,)]
        out, out_cache = affine_forward(out, w, b)
        scores = out
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dout = softmax_loss(scores, y)
        loss += 0.5 * self.reg * np.sum(self.params['W%d' % (self.num_layers,)] ** 2)

        # Backpropagate through the output layer and store its gradients in grads.
        dout, dw, db = affine_backward(dout, out_cache)
        grads['W%d' % (self.num_layers,)] = dw + self.reg * self.params['W%d' % (self.num_layers,)]
        grads['b%d' % (self.num_layers,)] = db

        # Backpropagate through every hidden layer, filling in grads and
        # accumulating each layer's regularization term into the loss.
        for i in range(self.num_layers - 1):
            ri = self.num_layers - 2 - i  # index of the current hidden layer, iterating backwards from the last
            loss += 0.5 * self.reg * np.sum(self.params['W%d' % (ri + 1,)] ** 2)  # add this layer's L2 term
            if self.use_dropout:
                dout = dropout_backward(dout, dp_cache[ri])
            if self.normalization == 'batchnorm':
                dout, dw, db, dgamma, dbeta = affine_bn_relu_backward(dout, fc_mix_cache[ri])
                grads['gamma%d' % (ri + 1,)] = dgamma
                grads['beta%d' % (ri + 1,)] = dbeta
            else:
                dout, dw, db = affine_relu_backward(dout, fc_mix_cache[ri])
            grads['W%d' % (ri + 1,)] = dw + self.reg * self.params['W%d' % (ri + 1,)]
            grads['b%d' % (ri + 1,)] = db
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
```
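The class above relies on the convenience pair affine_bn_relu_forward / affine_bn_relu_backward, which is not shown in this post. A minimal sketch, assuming the assignment's affine, batchnorm, and ReLU layer functions, could look like this; the return tuple matches how the class unpacks it.

```python
def affine_bn_relu_forward(x, w, b, gamma, beta, bn_param):
    """Convenience layer: affine -> batch norm -> ReLU."""
    a, fc_cache = affine_forward(x, w, b)
    a_bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    out, relu_cache = relu_forward(a_bn)
    cache = (fc_cache, bn_cache, relu_cache)
    return out, cache


def affine_bn_relu_backward(dout, cache):
    """Backward pass for the affine -> batch norm -> ReLU convenience layer."""
    fc_cache, bn_cache, relu_cache = cache
    da_bn = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward(da_bn, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db, dgamma, dbeta
```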
```
Running check with dropout = 1
Initial loss: 2.3004790897684924
W1 relative error: 1.48e-07
W2 relative error: 2.21e-05
W3 relative error: 3.53e-07
b1 relative error: 5.38e-09
b2 relative error: 2.09e-09
b3 relative error: 5.80e-11

Running check with dropout = 0.75
Initial loss: 2.2924325088330475
W1 relative error: 2.74e-08
W2 relative error: 2.98e-09
W3 relative error: 4.29e-09
b1 relative error: 7.78e-10
b2 relative error: 3.36e-10
b3 relative error: 1.65e-10

Running check with dropout = 0.5
Initial loss: 2.3042759220785896
W1 relative error: 3.11e-07
W2 relative error: 1.84e-08
W3 relative error: 5.35e-08
b1 relative error: 5.37e-09
b2 relative error: 2.99e-09
b3 relative error: 1.13e-10
```

Dropout as a form of regularization
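The training logs below come from training two identical small networks on a small subset of CIFAR-10, one without dropout (dropout = 1) and one with dropout (dropout = 0.25), using the assignment's Solver class. A minimal sketch of that comparison follows; the `data` dictionary is assumed to hold the loaded CIFAR-10 splits, and the subset size, hidden sizes, and optimizer settings are assumptions inferred from the logged epoch/iteration counts.

```python
# Train two identical nets, one with dropout and one without, on a small
# subset of the training data so that overfitting is easy to provoke.
np.random.seed(231)
num_train = 500  # assumed subset size (25 epochs x 5 iterations/epoch = 125 iterations)
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

solvers = {}
for dropout in [1, 0.25]:  # dropout=1 means "no dropout" in this FullyConnectedNet
    print(dropout)
    model = FullyConnectedNet([500, 500], dropout=dropout)
    solver = Solver(model, small_data,
                    num_epochs=25, batch_size=100,
                    update_rule='adam',
                    optim_config={'learning_rate': 5e-4},  # assumed learning rate
                    verbose=True, print_every=100)
    solver.train()
    solvers[dropout] = solver
```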
```
dropout = 1
(Iteration 1 / 125) loss: 7.856643
(Epoch 0 / 25) train acc: 0.260000; val_acc: 0.184000
(Epoch 1 / 25) train acc: 0.404000; val_acc: 0.259000
(Epoch 2 / 25) train acc: 0.468000; val_acc: 0.248000
(Epoch 3 / 25) train acc: 0.526000; val_acc: 0.247000
(Epoch 4 / 25) train acc: 0.646000; val_acc: 0.273000
(Epoch 5 / 25) train acc: 0.686000; val_acc: 0.257000
(Epoch 6 / 25) train acc: 0.690000; val_acc: 0.260000
(Epoch 7 / 25) train acc: 0.758000; val_acc: 0.255000
(Epoch 8 / 25) train acc: 0.832000; val_acc: 0.264000
(Epoch 9 / 25) train acc: 0.856000; val_acc: 0.268000
(Epoch 10 / 25) train acc: 0.914000; val_acc: 0.289000
(Epoch 11 / 25) train acc: 0.922000; val_acc: 0.293000
(Epoch 12 / 25) train acc: 0.948000; val_acc: 0.307000
(Epoch 13 / 25) train acc: 0.960000; val_acc: 0.313000
(Epoch 14 / 25) train acc: 0.972000; val_acc: 0.311000
(Epoch 15 / 25) train acc: 0.964000; val_acc: 0.309000
(Epoch 16 / 25) train acc: 0.966000; val_acc: 0.295000
(Epoch 17 / 25) train acc: 0.984000; val_acc: 0.306000
(Epoch 18 / 25) train acc: 0.988000; val_acc: 0.332000
(Epoch 19 / 25) train acc: 0.996000; val_acc: 0.318000
(Epoch 20 / 25) train acc: 0.992000; val_acc: 0.313000
(Iteration 101 / 125) loss: 0.000961
(Epoch 21 / 25) train acc: 0.996000; val_acc: 0.311000
(Epoch 22 / 25) train acc: 0.994000; val_acc: 0.304000
(Epoch 23 / 25) train acc: 0.998000; val_acc: 0.308000
(Epoch 24 / 25) train acc: 1.000000; val_acc: 0.316000
(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.320000

dropout = 0.25
(Iteration 1 / 125) loss: 11.299055
(Epoch 0 / 25) train acc: 0.234000; val_acc: 0.187000
(Epoch 1 / 25) train acc: 0.382000; val_acc: 0.228000
(Epoch 2 / 25) train acc: 0.490000; val_acc: 0.247000
(Epoch 3 / 25) train acc: 0.534000; val_acc: 0.228000
(Epoch 4 / 25) train acc: 0.648000; val_acc: 0.298000
(Epoch 5 / 25) train acc: 0.676000; val_acc: 0.316000
(Epoch 6 / 25) train acc: 0.752000; val_acc: 0.285000
(Epoch 7 / 25) train acc: 0.774000; val_acc: 0.252000
(Epoch 8 / 25) train acc: 0.818000; val_acc: 0.288000
(Epoch 9 / 25) train acc: 0.844000; val_acc: 0.326000
(Epoch 10 / 25) train acc: 0.864000; val_acc: 0.311000
(Epoch 11 / 25) train acc: 0.920000; val_acc: 0.293000
(Epoch 12 / 25) train acc: 0.922000; val_acc: 0.282000
(Epoch 13 / 25) train acc: 0.960000; val_acc: 0.303000
(Epoch 14 / 25) train acc: 0.966000; val_acc: 0.290000
(Epoch 15 / 25) train acc: 0.948000; val_acc: 0.277000
(Epoch 16 / 25) train acc: 0.970000; val_acc: 0.324000
(Epoch 17 / 25) train acc: 0.950000; val_acc: 0.295000
(Epoch 18 / 25) train acc: 0.970000; val_acc: 0.316000
(Epoch 19 / 25) train acc: 0.972000; val_acc: 0.296000
(Epoch 20 / 25) train acc: 0.990000; val_acc: 0.293000
(Iteration 101 / 125) loss: 0.556808
(Epoch 21 / 25) train acc: 0.990000; val_acc: 0.303000
(Epoch 22 / 25) train acc: 0.990000; val_acc: 0.306000
(Epoch 23 / 25) train acc: 0.992000; val_acc: 0.301000
(Epoch 24 / 25) train acc: 0.994000; val_acc: 0.303000
(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.289000
```

Honestly, it is hard to tell much from this plot: the two runs reach almost the same training accuracy, and their validation accuracies are also very close.
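The plot referred to above is the usual comparison of the two solvers' accuracy histories. A minimal sketch of how it can be produced, assuming the `solvers` dict from the training sketch and the Solver's `train_acc_history` / `val_acc_history` attributes:

```python
import matplotlib.pyplot as plt

# Top panel: training accuracy per epoch for both runs.
plt.subplot(2, 1, 1)
for dropout, solver in solvers.items():
    plt.plot(solver.train_acc_history, 'o-', label='%.2f dropout' % dropout)
plt.title('Train accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')

# Bottom panel: validation accuracy per epoch for both runs.
plt.subplot(2, 1, 2)
for dropout, solver in solvers.items():
    plt.plot(solver.val_acc_history, 'o-', label='%.2f dropout' % dropout)
plt.title('Val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')

plt.gcf().set_size_inches(15, 15)
plt.show()
```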