
Summary 2: Neural Networks

A neural network built from logistic (sigmoid) units for multi-class classification; the running example in this article is a three-layer network.

Layer 1 has 400 units, layer 2 has 25, and layer 3 has 10. The dataset contains 5000 examples.
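
With this architecture, each weight matrix $\Theta^{\left(l\right)}$ has dimension $s_{l+1}\times\left(s_{l}+1\right)$, where the extra column corresponds to the bias unit:

$$ \Theta^{\left(1\right)}: 25\times 401,\qquad \Theta^{\left(2\right)}: 10\times 26 $$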

1. Hypothesis

$$ h_{\Theta}\left(x\right) = a^{\left(L\right)}=g\left(\Theta^{\left(L-1\right)}a^{\left(L-1\right)}\right) $$
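
For the three-layer network used here ($L=3$), with a bias unit prepended to $x$ and to $a^{\left(2\right)}$, the hypothesis unrolls to:

$$ h_{\Theta}\left(x\right)=a^{\left(3\right)}=g\left(\Theta^{\left(2\right)}a^{\left(2\right)}\right)=g\left(\Theta^{\left(2\right)}\,g\left(\Theta^{\left(1\right)}x\right)\right) $$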

2. Cost Function

$$ J\left( \Theta \right) =-\dfrac {1} {m}\left[ \sum _{i=1}^{m}\sum _{k=1}^{K}y_{k}^{\left( i\right)}\log \left( h_{\Theta}\left(x^{\left( i\right)}\right)\right)_{k}+\left( 1-y_{k}^{\left( i\right)}\right) \log \left( 1-\left( h_{\Theta}\left( x^{\left( i\right)}\right) \right) _{k}\right) \right]\\ +\dfrac {\lambda } {2m}\sum _{l=1}^{L-1}\sum _{i=1}^{s_{l}}\sum _{j=1}^{s_{l+1}}\left(\Theta _{ji}^{\left( l\right)}\right)^{2} $$
$s_{l}$ is the number of units in layer $l$, not counting the bias unit; because the inner sum starts at $i=1$, the bias weights ($i=0$, i.e. the first column of $\Theta^{\left(l\right)}$) are not regularized.

$s_{l+1}$ is the number of units in layer $l+1$, likewise not counting the bias unit.

% Convert the labels y (values 1..num_labels) into one-hot rows,
% e.g. y(i) = 5 -> Y(i, :) = [0 0 0 0 1 0 0 0 0 0]
Y = zeros(m, num_labels);
for i = 1 : m,
    Y(i, y(i)) = 1;
end

a1 = [ones(m, 1) X]; % prepend the a0 = 1 bias column, 5000x401
z2 = a1 * Theta1'; % 5000x25
a2 = [ones(m, 1) sigmoid(z2)]; % prepend the a0 = 1 bias column, 5000x26
z3 = a2 * Theta2'; % 5000x10
a3 = sigmoid(z3); % 5000x10, one hypothesis row per example

% Copies of the weight matrices with the bias column zeroed,
% so the bias weights are excluded from regularization.
tempTheta1 = Theta1;
tempTheta1(:, 1) = 0;
tempTheta2 = Theta2;
tempTheta2(:, 1) = 0;

% Unregularized cross-entropy cost plus the regularization term:
J = -1 / m * sum(sum(Y .* log(a3) + (1 - Y) .* log(1 - a3))) + lambda / (2 * m) * sum([tempTheta1(:); tempTheta2(:)] .^ 2);
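
As a usage sketch (not part of the cost computation): each row of a3 is $h_{\Theta}\left(x^{\left(i\right)}\right)$, so the predicted class is the index of the largest entry in that row.

[~, p] = max(a3, [], 2); % p is 5000x1, predicted labels in 1..10
mean(double(p == y)) % fraction of correctly classified training examples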

3. Gradient Descent

The gradients are computed with backpropagation:

$$ \begin {split} &\Delta_{i,j}^{\left(l\right)}:=0 \qquad \left(for\ all\ l,i,j\right)\\ &for\ t=1\ to\ m:\\ &\qquad a^{\left(1\right)}=x^{\left(t\right)}\\ &\qquad for\ l=1\ to\ L-1:\\ &\qquad\qquad z^{\left(l+1\right)}=\Theta^{\left(l\right)}a^{\left(l\right)}\\ &\qquad\qquad a^{\left(l+1\right)}=g\left(z^{\left(l+1\right)}\right)\\ &\qquad \delta^{\left(L\right)}=a^{\left(L\right)}-y^{\left(t\right)}\\ &\qquad for\ l=L-1\ down\ to\ 2:\\ &\qquad\qquad \delta^{\left(l\right)}=\left(\left(\Theta^{\left(l\right)}\right)^{T}\delta^{\left(l+1\right)}\right).*g'\left(z^{\left(l\right)}\right)=\left(\left(\Theta^{\left(l\right)}\right)^{T}\delta^{\left(l+1\right)}\right).*a^{\left(l\right)}.*\left(1-a^{\left(l\right)}\right)\\ &\qquad \Delta_{i,j}^{\left(l\right)}:=\Delta_{i,j}^{\left(l\right)}+a_{j}^{\left(l\right)}\delta_{i}^{\left(l+1\right)}\qquad i.e.\ \Delta^{\left(l\right)}:=\Delta^{\left(l\right)}+\delta^{\left(l+1\right)}\left(a^{\left(l\right)}\right)^{T}\\ \\ &D_{i,j}^{\left(l\right)}:=\dfrac {1} {m}\left(\Delta_{i,j}^{\left(l\right)}+\lambda\Theta_{i,j}^{\left(l\right)}\right) \qquad if\ j \neq0\\ &D_{i,j}^{\left(l\right)}:=\dfrac {1} {m}\Delta_{i,j}^{\left(l\right)} \qquad if \ j=0 \end {split} $$
delta3 = a3 - Y; % output-layer error, 5000x10
delta2 = delta3 * Theta2 .* a2 .* (1 - a2); % 5000x26; the bias column is dropped below

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

Delta1 = Delta1 + delta2(:, 2 : end)' * a1; % drop the bias-unit error column; 25x401
Delta2 = Delta2 + delta3' * a2; % 10x26

Delta1 = 1 / m * (Delta1 + lambda * tempTheta1); % 25x401, bias column unregularized
Delta2 = 1 / m * (Delta2 + lambda * tempTheta2); % 10x26

Theta1_grad = Delta1;
Theta2_grad = Delta2;

grad = [Theta1_grad(:); Theta2_grad(:)]; % unroll the gradients into a single vector

4. Miscellaneous

4.1 Unrolling Parameters
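
Advanced optimization routines (such as the fminunc sketch at the end of this article) work with a single parameter vector, so the weight matrices are unrolled into one vector and reshaped back when needed: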

size(Theta1); % 25x401
size(Theta2); % 10x26
ThetaVec = [Theta1(:); Theta2(:)];
DVec = [D1(:); D2(:)];
Theta1 = reshape(ThetaVec(1 : 25*401), 25, 401);
Theta2 = reshape(ThetaVec(25*401 + 1 : 25*401 + 10*26), 10, 26);

4.2 Gradient Checking

epsilon = 1e-4;
for i = 1 : n,
    thetaPlus = theta;
    thetaPlus(i) = thetaPlus(i) + epsilon;
    thetaMinus = theta;
    thetaMinus(i) = thetaMinus(i) - epsilon;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
% Check that DVec ≈ gradApprox; the relative difference
% norm(DVec - gradApprox) / norm(DVec + gradApprox) should be tiny (on the order of 1e-9).

Gradient checking is very slow, so after verifying the gradients once it must be turned off!!!

4.3 Random Initialization

If all weights start at the same value, every unit in a layer computes the same function (symmetry is never broken), so each $\Theta_{i,j}^{\left(l\right)}$ is instead initialized randomly in $[-\epsilon, \epsilon]$. This $\epsilon$ is unrelated to the $\epsilon$ used for gradient checking above.

function W = randInitializeWeights(L_in, L_out)
% A common heuristic is epsilon_init = sqrt(6) / sqrt(L_in + L_out), about 0.12 here.
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * (2 * epsilon_init) - epsilon_init;
end
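
A usage sketch for the 400-25-10 network above (variable names illustrative):

initial_Theta1 = randInitializeWeights(400, 25); % 25x401
initial_Theta2 = randInitializeWeights(25, 10); % 10x26
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)]; % unrolled as in 4.1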

4.4 The g'(z) Function

function g = sigmoidGradient(z)
% For the sigmoid, g'(z) = g(z) .* (1 - g(z)).
g = sigmoid(z) .* (1 - sigmoid(z));
end
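
For completeness, the sigmoid function that the code above assumes (the standard definition):

function g = sigmoid(z)
g = 1 ./ (1 + exp(-z)); % element-wise logistic function
end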

4.5 Performance

  • The input x has dimension n; the output y has dimension K (the number of classes).

  • By default, use a single hidden layer; when using multiple hidden layers, each layer usually has the same number of units.

  • More hidden units generally give better accuracy but higher computational cost.

  • The number of hidden units is usually at least n (in image processing, where n is large, it is usually less than n).

4.6 Steps

  • Randomly initialize the weights
  • Use forward propagation to compute $a^{\left(1\right)}\sim a^{\left(L\right)}$
  • Implement the cost function $J\left( \Theta \right)$
  • Use backpropagation to compute $\delta^{\left(L\right)}\sim \delta^{\left(2\right)}$
  • Then compute the corresponding $\Delta^{\left(l\right)}$ and $D^{\left(l\right)}$
  • $D^{\left(l\right)}$ corresponds to $\dfrac {\partial} {\partial \Theta_{i,j}^{\left(l\right)}}J\left( \Theta \right)$
  • Use gradient checking to verify $D^{\left(l\right)}$, then turn gradient checking off
  • Use an optimization algorithm to find the $\Theta$ that minimizes $J\left( \Theta \right)$ (see the sketch below)

Note that this computes a local minimum, since $J\left( \Theta \right)$ is non-convex for neural networks!!!
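
A minimal sketch of that final optimization step using Octave's fminunc, assuming the cost and gradient code above is wrapped in a function nnCostFunction(params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda) returning [J, grad] (the wrapper name and signature are illustrative):

options = optimset('MaxIter', 50, 'GradObj', 'on'); % 'on': we supply the gradient
costFunc = @(p) nnCostFunction(p, 400, 25, 10, X, y, lambda);
[nn_params, cost] = fminunc(costFunc, initial_nn_params, options);
Theta1 = reshape(nn_params(1 : 25*401), 25, 401); % reshape back, as in 4.1
Theta2 = reshape(nn_params(25*401 + 1 : end), 10, 26);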