Neural Networks
Neurons
A neuron is essentially the same as a perceptron; the difference is that when we say *neuron*, the activation function is usually chosen to be the $\text{Sigmoid}$ or $\tanh$ function.
The $\text{Sigmoid}$ function is defined as:

$$\text{Sigmoid}(x)=\frac{1}{1+e^{-x}}$$
Then the output $y$ is:

$$y=\frac{1}{1+e^{-\omega^T\cdot x}}$$
Let $y=\text{Sigmoid}(x)$; then $y'=y(1-y)$.
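This identity is convenient because the derivative can be computed from the activation's output alone. A minimal numeric check (the test point 0.5 is arbitrary):

```python
import math

def sigmoid(x):
    # logistic function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # uses the identity y' = y * (1 - y)
    y = sigmoid(x)
    return y * (1.0 - y)

# compare against a central finite difference at x = 0.5
eps = 1e-6
numeric = (sigmoid(0.5 + eps) - sigmoid(0.5 - eps)) / (2 * eps)
```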
Neural Networks
A neural network is simply a set of neurons connected according to certain rules.
The figure on the left shows a **fully connected** neural network.
Neurons are arranged in *layers*. The leftmost layer is the *input layer*, which receives the input data; the rightmost layer is the *output layer*, which produces the output. The layers between them are *hidden layers*, so called because they are invisible from the outside.
Neurons within the same layer are not connected to each other.
Each neuron in layer N is connected to *all* neurons in layer N-1, and the outputs of layer N-1 are the inputs of layer N.
Every connection has a *weight*.
Computing the Network's Output
Take the computation of node 4 as an example:
$$a_4=\text{sigmoid}(\omega^Tx)=\text{sigmoid}(\omega_{41}x_1+\omega_{42}x_2+\omega_{43}x_3+\omega_{4b})$$
where $\omega_{4b}$ is the bias term of node 4, not drawn in the figure. When numbering a weight $\omega_{ji}$, we write the index $j$ of the target node first and the index $i$ of the source node second.
Once the outputs of all nodes have been computed, we obtain, for the input vector $\vec x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}$, the network's output vector $\vec y=\begin{bmatrix}y_1\\y_2\end{bmatrix}$.
The dimension of the input vector equals the number of input-layer neurons, and the dimension of the output vector equals the number of output-layer neurons.
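The per-node computation above can be sketched in a few lines of plain Python. The weights and inputs below are made-up illustration values, not taken from the figure:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_output(weights, inputs, bias):
    # weighted sum of upstream outputs plus the bias term, squashed by sigmoid
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(net)

# hypothetical values standing in for w41, w42, w43 and the bias w4b of node 4
x = [0.5, 0.3, 0.2]
a4 = node_output([0.1, 0.2, 0.3], x, 0.1)
```

Repeating this for every node, layer by layer from input to output, yields the full forward pass.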
Training the Network
Parameters that are set by hand, such as the connection pattern, the number of layers, and the number of nodes per layer, are called hyperparameters.
Deriving the Backpropagation Algorithm
For the objective function

$$E_d\equiv\frac12\sum_{i\in\text{outputs}}(t_i-y_i)^2$$
we again optimize with stochastic gradient descent:

$$\omega_{ji}\leftarrow\omega_{ji}-\eta\frac{\partial E_d}{\partial \omega_{ji}}$$
Let $net_j$ denote the weighted input of node $j$, i.e.

$$net_j=\vec{\omega_j}\cdot\vec{x_j}=\sum_i\omega_{ji}x_{ji}$$
We still need $\frac{\partial E_d}{\partial \omega_{ji}}$, and
$$\begin{aligned}\frac{\partial E_d}{\partial\omega_{ji}}&=\frac{\partial E_d}{\partial net_j}\frac{\partial net_j}{\partial \omega_{ji}}\\&=\frac{\partial E_d}{\partial net_j}\frac{\partial\sum_i\omega_{ji}x_{ji}}{\partial\omega_{ji}}\\&=\frac{\partial E_d}{\partial net_j}x_{ji}\end{aligned}$$
so the problem reduces to computing $\frac{\partial E_d}{\partial net_j}$.
Training the Output-Layer Weights
For the output layer, $E_d$ is a function of $y_j$, and $y_j$ is a function of $net_j$, so
$$\frac{\partial E_d}{\partial net_j}=\frac{\partial E_d}{\partial y_j}\frac{\partial y_j}{\partial net_j}$$
where
$$\begin{aligned}\frac{\partial E_d}{\partial y_j}&=\frac{\partial}{\partial y_j}\frac12\sum_{i\in \text{outputs}}(t_i-y_i)^2\\&=-(t_j-y_j)\end{aligned}$$
$$\frac{\partial y_j}{\partial net_j}=y_j(1-y_j)$$
Substituting, we get:
$$\frac{\partial E_d}{\partial net_j}=-(t_j-y_j)y_j(1-y_j)$$
Define $\delta_j=-\frac{\partial E_d}{\partial net_j}$, the *error term* of a node. Substituting gives:
$$\delta_j=(t_j-y_j)y_j(1-y_j)$$
which yields the update rule for $\omega_{ji}$:
$$\begin{aligned}\omega _{ji}&\leftarrow\omega_{ji}-\eta\frac{\partial E_d}{\partial \omega_{ji}}\\&=\omega_{ji}+\eta\delta_jx_{ji}\end{aligned}$$
Training the Hidden-Layer Weights
For a hidden layer, $E_d$ is a function of the weighted inputs $net_k$ of the downstream nodes, and each $net_k$ is in turn a function of the hidden node's weighted input $net_j$ (through its output $a_j$), so
$$\begin{aligned}\frac{\partial E_d}{\partial net_j}&=\sum_{k\in\text{downstream}(j)}\frac{\partial E_d}{\partial net_k}\frac{\partial net_k}{\partial net_j}\\&=\sum_{k\in\text{downstream}(j)}-\delta_k\frac{\partial net_k}{\partial a_j}\frac{\partial a_j}{\partial net_j}\\&=\sum_{k\in\text{downstream}(j)}-\delta_k\omega_{kj}\frac{\partial a_j}{\partial net_j}\\&=\sum_{k\in\text{downstream}(j)}-\delta_k\omega_{kj}a_j(1-a_j)\\&=-a_j(1-a_j)\sum_{k\in\text{downstream}(j)}\delta_k\omega_{kj}\end{aligned}$$
Since $\delta_j=-\frac{\partial E_d}{\partial net_j}$, substituting gives:
$$\delta_j=a_j(1-a_j)\sum_{k\in\text{downstream}(j)}\delta_k\omega_{kj}$$
The Backpropagation Algorithm
We have now derived how to compute $\delta_i$ for both the output layer and the hidden layers. The backpropagation procedure, using the network in the figure above as an example, runs as follows:
First, compute the outputs $a_4,a_5,a_6,a_7$ of hidden-layer nodes 4, 5, 6, 7 from the inputs $x_1,x_2,x_3$ and the activation function $\text{sigmoid}(x)$.
From $a_4,a_5,a_6,a_7$, compute the outputs $y_1,y_2$ of output-layer nodes 8 and 9.
From $y_1,y_2$ and the labels $t_1,t_2$, compute the error terms $\delta_8,\delta_9$ of output-layer nodes 8 and 9.
For an output-layer node $i$:
$$\delta_i=y_i(1-y_i)(t_i-y_i)$$
From $\delta_8,\delta_9$ and the hidden-to-output weights $\omega$, compute the error terms $\delta_4,\delta_5,\delta_6,\delta_7$ of hidden-layer nodes 4, 5, 6, 7.
For a hidden-layer node $i$:
$$\delta_i=a_i(1-a_i)\sum_{k\in\text{outputs}}\delta_k\omega_{ki}$$
Finally, update the weight on every connection:
$$\omega_{ji}\leftarrow\omega_{ji}+\eta\delta_jx_{ji}$$
The input to a bias term is always 1, so the bias weight $\omega_{jb}$ is updated as:

$$\omega_{jb}\leftarrow\omega_{jb}+\eta\delta_j$$
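The whole procedure can be traced on a toy 2-2-1 network. The weights below are arbitrary illustration values, and bias terms are omitted for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = [1.0, 0.5]                    # inputs
w_h = [[0.1, 0.2], [0.3, 0.4]]    # hidden-layer weights w_ji (made up)
w_o = [0.5, 0.6]                  # output-layer weights (made up)
t = 1.0                           # target label
eta = 0.1                         # learning rate

# forward pass: hidden activations, then the single output
a = [sigmoid(sum(wji * xi for wji, xi in zip(row, x))) for row in w_h]
y = sigmoid(sum(wk * ak for wk, ak in zip(w_o, a)))

# error terms: output layer first, then backpropagated to the hidden layer
delta_o = y * (1 - y) * (t - y)
delta_h = [a_j * (1 - a_j) * delta_o * w_kj for a_j, w_kj in zip(a, w_o)]

# weight updates: w_ji <- w_ji + eta * delta_j * x_ji
w_o = [wk + eta * delta_o * ak for wk, ak in zip(w_o, a)]
w_h = [[wji + eta * dj * xi for wji, xi in zip(row, x)]
       for row, dj in zip(w_h, delta_h)]

# one more forward pass: the squared error should shrink after the update
a2 = [sigmoid(sum(wji * xi for wji, xi in zip(row, x))) for row in w_h]
y2 = sigmoid(sum(wk * ak for wk, ak in zip(w_o, a2)))
```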
Implementing the Neural Network
```python
import math
import random
from functools import reduce


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


class Node(object):
    def __init__(self, layer_index, node_index):
        self.layer_index = layer_index
        self.node_index = node_index
        self.downstream = []
        self.upstream = []
        self.output = 0.0
        self.delta = 0.0

    def set_output(self, output):
        self.output = output

    def append_downstream_connection(self, conn):
        self.downstream.append(conn)

    def append_upstream_connection(self, conn):
        self.upstream.append(conn)

    def calc_output(self):
        output = reduce(lambda ret, conn: ret + conn.upstream_node.output * conn.weight,
                        self.upstream, 0.0)
        self.output = sigmoid(output)

    def calc_hidden_layer_delta(self):
        downstream_delta = reduce(
            lambda ret, conn: ret + conn.downstream_node.delta * conn.weight,
            self.downstream, 0.0)
        self.delta = self.output * (1 - self.output) * downstream_delta

    def calc_output_layer_delta(self, label):
        self.delta = self.output * (1 - self.output) * (label - self.output)

    def __str__(self):
        node_str = '%u-%u: output:%f, delta:%f' % (
            self.layer_index, self.node_index, self.output, self.delta)
        downstream_str = reduce(lambda ret, conn: ret + '\n\t' + str(conn), self.downstream, '')
        upstream_str = reduce(lambda ret, conn: ret + '\n\t' + str(conn), self.upstream, '')
        return node_str + '\n\tdownstream:' + downstream_str + '\n\tupstream:' + upstream_str


class ConstNode(object):
    """Bias node: its output is always 1."""
    def __init__(self, layer_index, node_index):
        self.layer_index = layer_index
        self.node_index = node_index
        self.downstream = []
        self.output = 1.0
        self.delta = 0.0

    def append_downstream_connection(self, conn):
        self.downstream.append(conn)

    def calc_hidden_layer_delta(self):
        downstream_delta = reduce(
            lambda ret, conn: ret + conn.downstream_node.delta * conn.weight,
            self.downstream, 0.0)
        self.delta = self.output * (1 - self.output) * downstream_delta

    def __str__(self):
        node_str = '%u-%u: output:1' % (self.layer_index, self.node_index)
        downstream_str = reduce(lambda ret, conn: ret + '\n\t' + str(conn), self.downstream, '')
        return node_str + '\n\tdownstream:' + downstream_str


class Layer(object):
    def __init__(self, layer_index, node_count):
        self.layer_index = layer_index
        self.nodes = []
        for i in range(node_count):
            self.nodes.append(Node(layer_index, i))
        self.nodes.append(ConstNode(layer_index, node_count))

    def set_output(self, data):
        for i in range(len(data)):
            self.nodes[i].set_output(data[i])

    def calc_output(self):
        for node in self.nodes[:-1]:  # skip the bias node
            node.calc_output()

    def dump(self):
        for node in self.nodes:
            print(node)


class Connection(object):
    def __init__(self, upstream_node, downstream_node):
        self.upstream_node = upstream_node
        self.downstream_node = downstream_node
        self.weight = random.uniform(-0.1, 0.1)
        self.gradient = 0.0

    def calc_gradient(self):
        self.gradient = self.downstream_node.delta * self.upstream_node.output

    def get_gradient(self):
        return self.gradient

    def update_weight(self, rate):
        self.calc_gradient()
        self.weight += rate * self.gradient

    def __str__(self):
        return '%u-%u -> %u-%u: weight:%f, gradient:%f' % (
            self.upstream_node.layer_index, self.upstream_node.node_index,
            self.downstream_node.layer_index, self.downstream_node.node_index,
            self.weight, self.gradient)


class Connections(object):
    def __init__(self):
        self.connections = []

    def add_connection(self, connection):
        self.connections.append(connection)

    def dump(self):
        for conn in self.connections:
            print(conn)


class Network(object):
    def __init__(self, layers):
        self.connections = Connections()
        self.layers = []
        layer_count = len(layers)
        for i in range(layer_count):
            self.layers.append(Layer(i, layers[i]))
        for layer in range(layer_count - 1):
            connections = [Connection(upstream_node, downstream_node)
                           for upstream_node in self.layers[layer].nodes
                           for downstream_node in self.layers[layer + 1].nodes[:-1]]
            for conn in connections:
                self.connections.add_connection(conn)
                conn.upstream_node.append_downstream_connection(conn)
                conn.downstream_node.append_upstream_connection(conn)

    def train(self, labels, data_set, rate, iteration):
        for i in range(iteration):
            for d in range(len(data_set)):
                self.train_one_sample(labels[d], data_set[d], rate)

    def train_one_sample(self, label, sample, rate):
        self.predict(sample)
        self.calc_delta(label)
        self.update_weight(rate)

    def calc_delta(self, label):
        output_nodes = self.layers[-1].nodes
        for i in range(len(label)):
            output_nodes[i].calc_output_layer_delta(label[i])
        for layer in self.layers[-2::-1]:
            for node in layer.nodes:
                node.calc_hidden_layer_delta()

    def update_weight(self, rate):
        for layer in self.layers[:-1]:
            for node in layer.nodes:
                for conn in node.downstream:
                    conn.update_weight(rate)

    def calc_gradient(self):
        # iterating downstream connections of all but the last layer
        # covers every connection exactly once
        for layer in self.layers[:-1]:
            for node in layer.nodes:
                for conn in node.downstream:
                    conn.calc_gradient()

    def get_gradient(self, label, sample):
        self.predict(sample)
        self.calc_delta(label)
        self.calc_gradient()

    def predict(self, sample):
        self.layers[0].set_output(sample)
        for i in range(1, len(self.layers)):
            self.layers[i].calc_output()
        return list(map(lambda node: node.output, self.layers[-1].nodes[:-1]))

    def dump(self):
        for layer in self.layers:
            layer.dump()
```
Gradient Checking
How can we check that the gradient computation is correct?
For $\frac{\partial E_d}{\partial\omega_{ji}}$,
$$\frac{\partial E_d(\omega_{ji})}{\partial\omega_{ji}}=\lim_{\epsilon\to0}\frac{E_d(\omega_{ji}+\epsilon)-E_d(\omega_{ji}-\epsilon)}{2\epsilon}$$
When $\epsilon$ is a small number such as $10^{-4}$,
$$\frac{\partial E_d(\omega_{ji})}{\partial\omega_{ji}}\approx\frac{E_d(\omega_{ji}+\epsilon)-E_d(\omega_{ji}-\epsilon)}{2\epsilon}$$
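The approximation itself is easy to sanity-check on a function whose derivative is known; the quadratic below is just an example:

```python
def central_difference(f, w, eps=1e-4):
    # two-sided difference quotient approximating f'(w)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# d/dw of w^2 is 2w, so at w = 3 we expect a value close to 6
approx = central_difference(lambda w: w * w, 3.0)
```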
```python
from functools import reduce


def gradient_check(network, sample_feature, sample_label):
    # squared-error loss over the output vector
    network_error = lambda vec1, vec2: \
        0.5 * reduce(lambda a, b: a + b,
                     map(lambda v: (v[0] - v[1]) * (v[0] - v[1]),
                         zip(vec1, vec2)))
    # compute gradients via backprop (note the argument order: label, then sample)
    network.get_gradient(sample_label, sample_feature)
    for conn in network.connections.connections:
        actual_gradient = conn.get_gradient()
        epsilon = 0.0001
        conn.weight += epsilon
        error1 = network_error(network.predict(sample_feature), sample_label)
        conn.weight -= 2 * epsilon
        error2 = network_error(network.predict(sample_feature), sample_label)
        expected_gradient = (error2 - error1) / (2 * epsilon)
        conn.weight += epsilon  # restore the original weight
        print('expected gradient: \t%f\nactual gradient: \t%f' % (
            expected_gradient, actual_gradient))
```
Neural Networks in Practice: Handwritten Digit Recognition
Choosing the Hyperparameters
The number of input-layer nodes is fixed: each MNIST training image is $28\times28$ pixels, i.e. $784$ pixels in total, so the input layer has $784$ nodes, one per pixel. The number of output-layer nodes is also fixed: a digit can only be one of 0–9, so this is a $10$-way classification and we use $10$ output nodes. The class corresponding to the node with the largest output value is the model's prediction.
The number of hidden-layer nodes is harder to choose. A few rules of thumb follow.
With $n$ the number of input-layer nodes, $l$ the number of output-layer nodes, and $\alpha$ a constant between 1 and 10:
$$\begin{aligned}m&=\sqrt{n+l}+\alpha\\m&=\log_2n\\m&=\sqrt{nl}\end{aligned}$$
Here we set the number of hidden-layer nodes to $300$.
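For reference, the three rules of thumb can be evaluated for MNIST's sizes (the helper name below is ours, and $\alpha=5$ is an arbitrary choice within 1–10):

```python
import math

def hidden_size_heuristics(n, l, alpha=5):
    # the three rules of thumb above
    return (math.sqrt(n + l) + alpha,  # m = sqrt(n + l) + alpha
            math.log2(n),              # m = log2(n)
            math.sqrt(n * l))          # m = sqrt(n * l)

# MNIST: n = 784 inputs, l = 10 outputs
m1, m2, m3 = hidden_size_heuristics(784, 10)
```

These formulas suggest anywhere from roughly 10 to roughly 90 hidden nodes, so they are only loose guidance; 300 is a deliberate over-provision.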
Training and Evaluating the Model
MNIST has 10,000 test samples. We first train the network on the 60,000 training samples, then evaluate it on the test samples. The *error rate* is:
$$\text{error rate}=\frac{\text{number of wrongly predicted samples}}{\text{total number of samples}}$$
We evaluate the accuracy once every 10 training epochs and stop training when the accuracy starts to fall, a sign of overfitting.
Code
First, put the MNIST data into a form the network can consume. Each $28\times28$ image is flattened, row-major, into a 784-dimensional vector. Each label is a value 0–9, which we convert into a 10-dimensional *one-hot*-style vector: if the label is $n$, the $n$-th component is set to 0.9 and all other components to 0.1.
```python
import struct
from datetime import datetime

from bp import *


class Loader(object):
    def __init__(self, path, count):
        self.path = path
        self.count = count

    def get_file_content(self):
        with open(self.path, 'rb') as f:
            return f.read()

    def to_int(self, byte):
        # in Python 3, indexing a bytes object already yields an int
        return byte if isinstance(byte, int) else struct.unpack('B', byte)[0]


class ImageLoader(Loader):
    def get_picture(self, content, index):
        start = index * 28 * 28 + 16  # the image file has a 16-byte header
        picture = []
        for i in range(28):
            picture.append([])
            for j in range(28):
                picture[i].append(
                    self.to_int(content[start + i * 28 + j]))
        return picture

    def get_one_sample(self, picture):
        sample = []
        for i in range(28):
            for j in range(28):
                sample.append(picture[i][j])
        return sample

    def load(self):
        content = self.get_file_content()
        data_set = []
        for index in range(self.count):
            data_set.append(
                self.get_one_sample(
                    self.get_picture(content, index)))
        return data_set


class LabelLoader(Loader):
    def load(self):
        content = self.get_file_content()
        labels = []
        for index in range(self.count):
            labels.append(self.norm(content[index + 8]))  # 8-byte header
        return labels

    def norm(self, label):
        # one-hot-style encoding: 0.9 for the target digit, 0.1 elsewhere
        label_vec = []
        label_value = self.to_int(label)
        for i in range(10):
            if i == label_value:
                label_vec.append(0.9)
            else:
                label_vec.append(0.1)
        return label_vec


def get_training_data_set():
    image_loader = ImageLoader('train-images-idx3-ubyte', 60000)
    label_loader = LabelLoader('train-labels-idx1-ubyte', 60000)
    return image_loader.load(), label_loader.load()


def get_test_data_set():
    image_loader = ImageLoader('t10k-images-idx3-ubyte', 10000)
    label_loader = LabelLoader('t10k-labels-idx1-ubyte', 10000)
    return image_loader.load(), label_loader.load()


def get_result(vec):
    max_value_index = 0
    max_value = 0
    for i in range(len(vec)):
        if vec[i] > max_value:
            max_value = vec[i]
            max_value_index = i
    return max_value_index


def evaluate(network, test_data_set, test_labels):
    error = 0
    total = len(test_data_set)
    for i in range(total):
        label = get_result(test_labels[i])
        predict = get_result(network.predict(test_data_set[i]))
        if label != predict:
            error += 1
    return float(error) / float(total)


def now():
    return datetime.now().strftime('%c')


def train_and_evaluate():
    last_error_ratio = 1.0
    epoch = 0
    train_data_set, train_labels = get_training_data_set()
    test_data_set, test_labels = get_test_data_set()
    network = Network([784, 300, 10])
    while True:
        epoch += 1
        network.train(train_labels, train_data_set, 0.3, 1)
        print('%s epoch %d finished' % (now(), epoch))
        if epoch % 10 == 0:
            error_ratio = evaluate(network, test_data_set, test_labels)
            print('%s after epoch %d, error ratio is %f' % (now(), epoch, error_ratio))
            if error_ratio > last_error_ratio:
                break
            else:
                last_error_ratio = error_ratio


if __name__ == '__main__':
    train_and_evaluate()
```
Vectorized Programming
Below we re-implement the fully connected network in a vectorized style, expressing every computation as vector operations.
The forward pass:
$$\vec a=\sigma(W\cdot\vec x)$$
The backward pass, in vector form:
$$\vec\delta=\vec y(1-\vec y)(\vec t-\vec y)$$

$$\vec{\delta^{(l)}}=\vec{a^{(l)}}(1-\vec{a^{(l)}})W^T\vec{\delta^{(l+1)}}$$
where $\delta^{(l)}$ is the error-term vector of layer $l$, $W^T$ is the transpose of the matrix $W$, and the products involving $\vec{a^{(l)}}(1-\vec{a^{(l)}})$ are elementwise.
We also need vectorized expressions for the updates of the weight matrix $W$ and the bias vector $\vec b$:
$$W\leftarrow W+\eta\vec\delta\vec{x}^T$$

$$\vec b\leftarrow\vec b+\eta\vec\delta$$
Code
```python
import numpy as np


class FullConnectedLayer(object):
    def __init__(self, input_size, output_size, activator):
        '''
        input_size: dimension of this layer's input vector
        output_size: dimension of this layer's output vector
        activator: activation function
        '''
        self.input_size = input_size
        self.output_size = output_size
        self.activator = activator
        self.W = np.random.uniform(-0.1, 0.1, (output_size, input_size))
        self.b = np.zeros((output_size, 1))
        self.output = np.zeros((output_size, 1))

    def forward(self, input_array):
        '''Forward pass; input_array must have dimension input_size.'''
        self.input = input_array
        self.output = self.activator.forward(
            np.dot(self.W, input_array) + self.b)

    def backward(self, delta_array):
        '''Backward pass: compute this layer's delta and the gradients of W and b.

        delta_array: the error-term vector passed back from the next layer
        '''
        self.delta = self.activator.backward(self.input) * np.dot(
            self.W.T, delta_array)
        self.W_grad = np.dot(delta_array, self.input.T)
        self.b_grad = delta_array

    def update(self, learning_rate):
        '''Gradient-descent weight update (W_grad stores the negative gradient).'''
        self.W += learning_rate * self.W_grad
        self.b += learning_rate * self.b_grad

    def dump(self):
        print('W: %s\nb: %s' % (self.W, self.b))


class SigmoidActivator(object):
    def forward(self, weighted_input):
        '''Forward computation of the sigmoid activation.'''
        return 1.0 / (1.0 + np.exp(-weighted_input))

    def backward(self, output):
        '''Backward computation: y' = y * (1 - y).'''
        return output * (1.0 - output)


class Network(object):
    def __init__(self, layers):
        '''layers: number of nodes in each layer'''
        self.layers = []
        for i in range(len(layers) - 1):
            self.layers.append(
                FullConnectedLayer(
                    layers[i], layers[i + 1],
                    SigmoidActivator()))

    def predict(self, sample):
        '''Run the network forward on an input sample.'''
        output = sample
        for layer in self.layers:
            layer.forward(output)
            output = layer.output
        return output

    def train(self, labels, data_set, rate, epoch):
        '''Train for a number of epochs at learning rate `rate`.'''
        for i in range(epoch):
            for d in range(len(data_set)):
                self.train_one_sample(labels[d], data_set[d], rate)

    def train_one_sample(self, label, sample, rate):
        '''Train the network on a single sample.'''
        self.predict(sample)
        self.calc_gradient(label)
        self.update_weight(rate)

    def calc_gradient(self, label):
        '''Backpropagate the error terms and compute all gradients.'''
        delta = self.layers[-1].activator.backward(
            self.layers[-1].output) * (label - self.layers[-1].output)
        for layer in self.layers[::-1]:
            layer.backward(delta)
            delta = layer.delta
        return delta

    def update_weight(self, rate):
        '''Update the weights of every layer.'''
        for layer in self.layers:
            layer.update(rate)

    def dump(self):
        for layer in self.layers:
            layer.dump()

    def loss(self, output, label):
        '''Squared-error loss.'''
        return 0.5 * ((label - output) * (label - output)).sum()

    def gradient_check(self, sample_feature, sample_label):
        '''Gradient check against the central-difference approximation.'''
        self.predict(sample_feature)
        self.calc_gradient(sample_label)
        epsilon = 1e-4
        for fc in self.layers:
            for i in range(fc.W.shape[0]):
                for j in range(fc.W.shape[1]):
                    fc.W[i, j] += epsilon
                    output = self.predict(sample_feature)
                    err1 = self.loss(sample_label, output)
                    fc.W[i, j] -= 2 * epsilon
                    output = self.predict(sample_feature)
                    err2 = self.loss(sample_label, output)
                    # W_grad stores the NEGATIVE gradient (the update is
                    # W += rate * W_grad), so compare against (err2 - err1)
                    expect_grad = (err2 - err1) / (2 * epsilon)
                    fc.W[i, j] += epsilon  # restore the original weight
                    print('weights(%d,%d): expected - actual %.4e - %.4e' % (
                        i, j, expect_grad, fc.W_grad[i, j]))
```