8. Solving the Vanishing Gradient Problem

2018. 1. 18. 15:28

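Approaches to Solving the Vanishing Gradient Problem

Output-layer activation function: in a classification model the output layer must produce probabilities, so the sigmoid function or the softmax function is generally used.

Hidden-layer activation function: use a function that outputs a small value for a small input and a large value for a large input.

Using the sigmoid function as the hidden-layer activation causes the vanishing gradient problem, so the problem can be addressed by switching to a different activation function.

To replace the sigmoid function, we need to look for a function whose shape is similar to the sigmoid but whose gradient does not vanish.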
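To make the problem concrete, the following minimal sketch multiplies the largest possible sigmoid derivative, $\sigma'(0)=0.25$, once per hidden layer. It only bounds the activation-derivative factor that appears in backpropagation; the weights are ignored, which is a simplifying assumption, and the helper name `max_gradient_factor` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def max_gradient_factor(n_layers):
    # Upper bound on the product of sigmoid derivatives across
    # n_layers hidden layers (weights ignored, illustrative only).
    return sigmoid_grad(0.0) ** n_layers

for n in (1, 2, 4, 8):
    print(n, max_gradient_factor(n))
```

Even in this best case the factor shrinks geometrically with depth, which is why a replacement activation function is worth looking for.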

Hyperbolic tangent function $\tanh(x)$

The function $\tanh(x)$ is defined by equation $(8.1)$, and Figure $8.1$ shows its graph.

$$\begin{align} \tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}\tag{8.1}\end{align}$$

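Figure 8.1 Graph of $\tanh(x)$

The graph in Figure $8.1$ resembles the graph of the sigmoid function $\sigma(x)$, but the range of the function values differs.

Sigmoid function: for inputs $-\infty<x<\infty$, the range is $0<\sigma(x)<1$
Hyperbolic tangent function: for inputs $-\infty<x<\infty$, the range is $-1<\tanh(x)<1$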
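As a quick numerical check of the two ranges, the sketch below evaluates both functions on a grid. The grid bounds are arbitrary choices for illustration. It also checks the standard identity $\tanh(x) = 2\sigma(2x) - 1$, which shows that $\tanh$ is a shifted and rescaled sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)  # arbitrary grid for illustration
s = sigmoid(x)
t = np.tanh(x)

print('sigmoid range on grid:', s.min(), s.max())
print('tanh range on grid   :', t.min(), t.max())

# Largest deviation from the identity tanh(x) = 2*sigmoid(2x) - 1
identity_gap = np.max(np.abs(t - (2.0 * sigmoid(2.0 * x) - 1.0)))
print('identity gap:', identity_gap)
```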

To compute gradients when the hyperbolic tangent is used as the activation function, we need the derivative of $\tanh(x)$, so let us derive it. First, $\tanh(x)$ can be rewritten as in equation $(8.2)$.

$$\begin{align} \tanh(x) &= \frac{e^x-e^{-x}}{e^x+e^{-x}} \\ &= \frac{e^x-e^{-x}}{e^x+e^{-x}}\times 1\\&= \frac{e^x-e^{-x}}{e^x+e^{-x}}\times \frac{e^{-x}}{e^{-x}} \\&= \frac{(e^x-e^{-x})e^{-x}}{(e^x+e^{-x})e^{-x}}\\&= \frac{e^xe^{-x}-e^{-x}e^{-x}}{e^xe^{-x}+e^{-x}e^{-x}}\\&= \frac{e^{x-x}-e^{-x-x}}{e^{x-x}+e^{-x-x}}\\&= \frac{e^{0}-e^{-2x}}{e^{0}+e^{-2x}}\\&= \frac{1-e^{-2x}}{1+e^{-2x}}\tag{8.2}\end{align}$$

Now $\tanh'(x)$ can be obtained as in equation $(8.3)$.

\begin{align} \tanh'(x) &= \frac{(1-e^{-2x})'(1+e^{-2x}) - (1-e^{-2x})(1+e^{-2x})'}{(1+e^{-2x})^2} \\&= \frac{2e^{-2x}(1+e^{-2x}) - (1-e^{-2x})(-2e^{-2x})}{(1+e^{-2x})^2}\\&= \frac{2e^{-2x}(1+\not{e^{-2x}} + 1 - \not{e^{-2x}})}{(1+e^{-2x})^2}\\&= \frac{4e^{-2x}}{(1+e^{-2x})^2}\\&= \frac{(1+2e^{-2x}+e^{-4x})-(1-2e^{-2x}+e^{-4x})}{(1+e^{-2x})^2}\\&= \frac{(1+e^{-2x})^2-(1-e^{-2x})^2}{(1+e^{-2x})^2}\\&= \frac{\not{(1+e^{-2x})^2}}{\not{(1+e^{-2x})^2}}-\frac{(1-e^{-2x})^2}{(1+e^{-2x})^2}\\&= 1-\Bigg(\frac{1-e^{-2x}}{1+e^{-2x}}\Bigg)^2\\&= 1-\tanh^2(x)\\&= (1+\tanh(x))(1-\tanh(x))\tag{8.3}\end{align}

Figure 8.2 shows the graphs of $\tanh'(x)$ and $\sigma'(x)$. The maximum of $\sigma'(x)$ is $\sigma'(0)=0.25$, whereas the maximum of $\tanh'(x)$ is $\tanh'(0)=1$, so the gradient is less likely to vanish than with the sigmoid function.

Figure 8.2 Graphs of $\tanh'(x)$ and $\sigma'(x)$
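The closed form $\tanh'(x) = 1 - \tanh^2(x)$ from $(8.3)$ can be checked against a central finite difference, as in the sketch below. The step size `h` and the grid are arbitrary choices, and the helper names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # Closed form from equation (8.3)
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-4.0, 4.0, 81)
h = 1e-5  # finite-difference step (arbitrary small value)
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2.0 * h)
max_err = np.max(np.abs(numeric - tanh_grad(x)))

print('max |numeric - closed form|:', max_err)
print('tanh\'(0) =', tanh_grad(0.0), ' sigmoid\'(0) =', sigmoid_grad(0.0))
```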

Implementing the hyperbolic tangent function in code

TensorFlow

tf.nn.tanh()

Keras

Activation('tanh')

Modeling MNIST with four hidden layers

Implemented in Keras, the accuracy rises to $93.2\%$.

from time import time 
import numpy as np 
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.optimizers import SGD 
from sklearn import datasets 
from sklearn.model_selection import train_test_split 

start = time() 
MNIST = datasets.fetch_mldata('MNIST original', data_home='.') 

n = len(MNIST.data) 
N = 10000 
indices = np.random.permutation(range(n))[:N] 

X = MNIST.data[indices] 
Y = MNIST.target[indices] 
Y = np.eye(10)[Y.astype(int)] 

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8) 

# Model setup
n_in = len(X[0]) 
n_hidden = 200 
n_out = len(Y[0]) 

model = Sequential() 
model.add(Dense(input_dim=n_in, units=n_hidden)) 
model.add(Activation('tanh')) 
model.add(Dense(units=n_hidden)) 
model.add(Activation('tanh')) 
model.add(Dense(units=n_hidden)) 
model.add(Activation('tanh')) 
model.add(Dense(units=n_hidden)) 
model.add(Activation('tanh')) 
model.add(Dense(units=n_out)) 
model.add(Activation('softmax')) 
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01), metrics=['accuracy']) 

# Train the model
epochs = 1000 
batch_size = 100 

model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size) 

print(f'\nElapsed training time : {(time() - start)//60} min {(time() - start) % 60:.6} s\n') 

# Measure accuracy 
start = time() 
loss_and_metrics = model.evaluate(x_test, y_test)  # returns [loss, accuracy] 

print(f'\nLoss : {loss_and_metrics[0]:.6}') 
print(f'Accuracy : {loss_and_metrics[1] * 100:.3}%') 
print(f'Elapsed test time : {time() - start} s')

Elapsed training time : 4.0 min 37.9497 s

  32/2000 [..............................] - ETA: 0s 
2000/2000 [==============================] - 0s 32us/step 

Loss : 0.396467 
Accuracy : 93.2% 
Elapsed test time : 0.06404519081115723 s

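ReLU (Rectified Linear Unit) function

The $\tanh(x)$ function is attractive because its gradient is less likely to vanish. With high-dimensional data, however, the values fed into the activation can become large in magnitude, where $\tanh'(x)$ is close to $0$, so the gradient can vanish again.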
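The saturation can be seen numerically. The sketch below evaluates $\tanh'(x) = 1 - \tanh^2(x)$ at a few hypothetical pre-activation values; the specific values are chosen only for illustration.

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

# Hypothetical pre-activation values of increasing magnitude
pre_activations = np.array([0.0, 1.0, 3.0, 5.0, 10.0])
grads = tanh_grad(pre_activations)

for p, g in zip(pre_activations, grads):
    print(f"x = {p:5.1f}  tanh'(x) = {g:.3e}")
```

The derivative falls off roughly like $4e^{-2x}$ for large $x$, so a unit whose input has grown large passes back almost no gradient.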

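The more complex the data, the more often it is high-dimensional. An activation function that avoids this problem is the ReLU function, also called the ramp function or the rectified linear function. It is defined by equation $(8.4)$, and its graph is shown in Figure $8.3$.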

$$\begin{align}f(x) = \max(0,x) \tag{8.4}\end{align}$$

Figure 8.3 Graph of $f(x)=\max(0,x)$

import numpy as np 
import matplotlib.pyplot as plt 

def relu(x): 
    if x >= 0: 
        return x 
    else: 
        return 0 
        
fig = plt.figure() 
ax = fig.add_subplot(111) 
major_y_ticks = np.arange(-1, 2, 0.2) 
ax.set_yticks(major_y_ticks) 
major_x_ticks = np.arange(-1, 1.3, 0.5) 
ax.set_xticks(major_x_ticks) 
ax.grid(which='major', linestyle='--') 

x = np.linspace(-1, 1, 100) 
y = np.array([relu(v) for v in x])  # apply the scalar relu to each sample point
ax.plot(x, y, label='ReLU') 
ax.legend() 
plt.show()

Unlike the sigmoid and hyperbolic tangent functions, the ReLU function has no curved part, and differentiating it gives a step function, as in equation $(8.5)$.

\begin{align}f'(x) =\left\{ \begin{array}{lc} 1 & \quad x > 0\\ 0 & \quad x \leqq 0\end{array}\right. \tag{8.5}\end{align}

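The derivative of ReLU returns $1$ no matter how large $x$ becomes, so the gradient does not vanish, and training is faster than with the sigmoid or hyperbolic tangent function.

In addition, ReLU and its derivative are simple expressions with no exponential function, so they can be computed quickly.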
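A vectorized NumPy version of ReLU and of the step-function derivative in $(8.5)$ might look like the sketch below. The name `relu_grad` is illustrative, and the value of the derivative at $x=0$ follows the convention of $(8.5)$.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x), equation (8.4)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Step function from equation (8.5): 1 for x > 0, 0 for x <= 0
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 100.0])
print(relu(x))
print(relu_grad(x))
```

Neither function calls `exp`, which is where the speed advantage over sigmoid and tanh comes from.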

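The drawback is that both the function value and the gradient are $0$ when $x\leqq 0$, so in a network that uses ReLU, a neuron that fails to activate tends to stay inactive for the rest of training.

If the learning rate is set to a large value, the first backpropagation step can push a neuron's value too low, after which the neuron is as good as absent from the model, so care is needed.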
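The effect can be illustrated with a single ReLU neuron. The sketch below assumes non-negative inputs and weights that a hypothetical oversized update has pushed negative; the data, weights, and the stand-in upstream gradient are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 5))  # non-negative inputs (assumed)
w = np.full(5, -1.0)  # weights after a hypothetical oversized update
b = -0.5

p = X @ w + b                 # pre-activation: negative for every sample
h = np.maximum(0.0, p)        # ReLU output: all zeros

upstream = np.ones_like(p)    # stand-in for dE/dh from the next layer
grad_p = upstream * (p > 0)   # ReLU derivative is 0 wherever p <= 0
grad_w = X.T @ grad_p         # gradient with respect to the weights
grad_b = grad_p.sum()         # gradient with respect to the bias

print('active samples:', int((p > 0).sum()))
print('grad_w:', grad_w, 'grad_b:', grad_b)
```

Because every gradient is zero, gradient descent cannot move `w` or `b` back, and the neuron stays inactive.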

Implementing the ReLU function in code

TensorFlow

tf.nn.relu()

Keras

Activation('relu')

Modeling MNIST with four hidden layers

Implemented in Keras, the accuracy rises to $94.1\%$.

from time import time 
import numpy as np 
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.optimizers import SGD 
from sklearn import datasets 
from sklearn.model_selection import train_test_split 

start = time() 

MNIST = datasets.fetch_mldata('MNIST original', data_home='.') 

n = len(MNIST.data) 
N = 10000 
indices = np.random.permutation(range(n))[:N] 

X = MNIST.data[indices] 
Y = MNIST.target[indices] 
Y = np.eye(10)[Y.astype(int)] 

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8) 

# Model setup
n_in = len(X[0]) 
n_hidden = 200 
n_out = len(Y[0]) 

model = Sequential() 
model.add(Dense(input_dim=n_in, units=n_hidden)) 
model.add(Activation('relu')) 
model.add(Dense(units=n_hidden)) 
model.add(Activation('relu')) 
model.add(Dense(units=n_hidden)) 
model.add(Activation('relu')) 
model.add(Dense(units=n_hidden)) 
model.add(Activation('relu')) 
model.add(Dense(units=n_out)) 
model.add(Activation('softmax')) 
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01), metrics=['accuracy']) 

# Train the model
epochs = 1000 
batch_size = 100 
model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size) 

print(f'\nElapsed training time : {(time() - start)//60} min {(time() - start) % 60:.6} s\n') 

# Measure accuracy 
start = time() 
loss_and_metrics = model.evaluate(x_test, y_test)  # returns [loss, accuracy] 

print(f'\nLoss : {loss_and_metrics[0]:.6}') 
print(f'Accuracy : {loss_and_metrics[1] * 100:.3}%') 
print(f'Elapsed test time : {time() - start} s')

Elapsed training time : 4.0 min 32.813 s

  32/2000 [..............................] - ETA: 0s 
2000/2000 [==============================] - 0s 31us/step 

Loss : 0.534027 
Accuracy : 94.1% 
Elapsed test time : 0.06304502487182617 s

LeakyReLU function

The LeakyReLU function, also called LReLU, is an improved version of ReLU. It is defined by equation $(8.6)$, where $\alpha$ is a small constant such as $0.01$.

$$\begin{align} f(x)=\max(\alpha x, x)\tag{8.6}\end{align}$$

The graph of LReLU is shown in Figure $8.4$. Unlike ReLU, it keeps a small slope ($\alpha$) even when $x<0$. Differentiating the LReLU function gives equation $(8.7)$.

$$\begin{align}f'(x) =\left\{ \begin{array}{lc} 1 & x > 0\\ \alpha & x \leqq 0\end{array}\right. \tag{8.7}\end{align}$$

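Figure 8.4 Graph of $f(x)=\max(\alpha x,x)$

With ReLU, the gradient disappears when $x\leqq 0$, which can make training unstable. With LReLU, learning continues even when $x\leqq 0$, so it might be expected to be more effective than ReLU. In practice, however, it helps in some cases and not in others, and it is not yet understood when the benefit appears.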
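A NumPy sketch of $(8.6)$ and $(8.7)$ shows that the gradient stays non-zero on the negative side. The name `lrelu_grad` is illustrative, and the sample inputs are arbitrary.

```python
import numpy as np

def lrelu(x, alpha=0.01):
    # Equation (8.6): max(alpha * x, x), valid for 0 <= alpha < 1
    return np.maximum(alpha * x, x)

def lrelu_grad(x, alpha=0.01):
    # Equation (8.7): 1 for x > 0, alpha for x <= 0
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(lrelu(x))
print(lrelu_grad(x))
```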

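Implementing the LReLU function in code

The source treats TensorFlow as not yet providing an API for LReLU, so the function is defined by hand as follows. Newer TensorFlow releases also include `tf.nn.leaky_relu`.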

def lrelu(x, alpha=0.01): 
    if x >= 0: 
        return x 
    else: 
        return alpha * x

def lrelu(x, alpha=0.01): 
    return tf.maximum(alpha * x, x)

In Keras, add the following import.

from keras.layers.advanced_activations import LeakyReLU

Modeling MNIST with four hidden layers

Implemented in TensorFlow, the accuracy is $93.7\%$.

import numpy as np 
import tensorflow as tf 
from sklearn import datasets 
from sklearn.model_selection import train_test_split 
from sklearn.utils import shuffle 

mnist = datasets.fetch_mldata('MNIST original', data_home='.') 

n = len(mnist.data) 
N = 10000 
train_size = 0.8 
indices = np.random.permutation(range(n))[:N] 

X = mnist.data[indices] 
y = mnist.target[indices] 
Y = np.eye(10)[y.astype(int)] 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=train_size) 

n_in = len(X[0]) 
n_hidden = 200 
n_out = len(Y[0]) 

def lrelu(x, alpha=0.01): 
    return tf.maximum(alpha * x, x) 
    
x = tf.placeholder(tf.float32, shape=[None, n_in]) 
t = tf.placeholder(tf.float32, shape=[None, n_out]) 

W0 = tf.Variable(tf.truncated_normal([n_in, n_hidden], stddev=0.01)) 
b0 = tf.Variable(tf.zeros([n_hidden])) 
h0 = lrelu(tf.matmul(x, W0) + b0) 

W1 = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], stddev=0.01)) 
b1 = tf.Variable(tf.zeros([n_hidden]))
h1 = lrelu(tf.matmul(h0, W1) + b1) 

W2 = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], stddev=0.01)) 
b2 = tf.Variable(tf.zeros([n_hidden])) 
h2 = lrelu(tf.matmul(h1, W2) + b2) 

W3 = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], stddev=0.01)) 
b3 = tf.Variable(tf.zeros([n_hidden])) 
h3 = lrelu(tf.matmul(h2, W3) + b3)

W4 = tf.Variable(tf.truncated_normal([n_hidden, n_out], stddev=0.01)) 
b4 = tf.Variable(tf.zeros([n_out])) 

y = tf.nn.softmax(tf.matmul(h3, W4) + b4) 

cross_entropy = tf.reduce_mean(-tf.reduce_sum(t * tf.log(y), axis=1)) 

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy) 
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(t, 1)) 
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 

epochs = 50 
batch_size = 200 
init = tf.global_variables_initializer() 
sess = tf.Session() 
sess.run(init) 
n_batches = (int)(N * train_size) // batch_size 

for epoch in range(epochs): 
    X_, Y_ = shuffle(X_train, Y_train) 
    
    for i in range(n_batches): 
        start = i * batch_size 
        end = start + batch_size 
        sess.run(train_step, feed_dict={ x: X_[start:end], t: Y_[start:end] }) 
        
    loss = cross_entropy.eval(session=sess, feed_dict={ x: X_, t: Y_ }) 
    acc = accuracy.eval(session=sess, feed_dict={ x: X_, t: Y_ }) 
    print(f'epoch: {epoch}, loss: {loss}, accuracy: {acc}') 


accuracy_rate = accuracy.eval(session=sess, feed_dict={ x: X_test, t: Y_test }) 
print(f'accuracy: {accuracy_rate * 100:.4} %')

epoch: 47, loss: 0.03343997895717621, accuracy: 0.9942499995231628 
epoch: 48, loss: 0.02821164019405842, accuracy: 0.9956250190734863 
epoch: 49, loss: 0.036636706441640854, accuracy: 0.9912499785423279 

accuracy: 93.7 %

Implemented in Keras, the accuracy is $92.9\%$.

from time import time 
import numpy as np 
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.optimizers import SGD 
from sklearn import datasets 
from sklearn.model_selection import train_test_split 
from keras.layers.advanced_activations import LeakyReLU 

start = time() 
MNIST = datasets.fetch_mldata('MNIST original', data_home='.') 

n = len(MNIST.data) 
N = 10000 
indices = np.random.permutation(range(n))[:N] 

X = MNIST.data[indices] 
Y = MNIST.target[indices] 
Y = np.eye(10)[Y.astype(int)] 

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8) 

# Model setup
n_in = len(X[0]) 
n_hidden = 200 
n_out = len(Y[0]) 

alpha = 0.01 

model = Sequential() 
model.add(Dense(input_dim=n_in, units=n_hidden)) 
model.add(LeakyReLU(alpha=alpha)) 
model.add(Dense(units=n_hidden)) 
model.add(LeakyReLU(alpha=alpha)) 
model.add(Dense(units=n_hidden)) 
model.add(LeakyReLU(alpha=alpha)) 
model.add(Dense(units=n_hidden)) 
model.add(LeakyReLU(alpha=alpha)) 
model.add(Dense(units=n_out)) 
model.add(Activation('softmax')) 
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01), metrics=['accuracy']) 

# Train the model
epochs = 20 
batch_size = 200 

model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size) 

print(f'\nElapsed training time : {(time() - start)//60} min {(time() - start) % 60:.6} s\n') 

# Measure accuracy 
start = time() 
loss_and_metrics = model.evaluate(x_test, y_test)  # returns [loss, accuracy] 

print(f'\nLoss : {loss_and_metrics[0]:.6}') 
print(f'Accuracy : {loss_and_metrics[1] * 100:.3}%') 
print(f'Elapsed test time : {time() - start} s')

  32/2000 [..............................] - ETA: 0s 
1984/2000 [============================>.] - ETA: 0s 
2000/2000 [==============================] - 0s 33us/step 

Elapsed training time : 0.0 min 0.0660467 s 

Loss : 0.473356 
Accuracy : 92.9% 
Elapsed test time : 0.06604671478271484 s

Parametric ReLU function

In the LeakyReLU function the slope $\alpha$ for $x<0$ is fixed. The Parametric ReLU function, also called PReLU, optimizes the slope $\alpha$ through learning as well.

For the pre-activation values (a vector) $\mathbf{p}=(p_1, p_2, \ldots, p_J)$, the PReLU function is defined by equation $(8.8)$. Here, instead of a constant (scalar) $\alpha$, a vector $\boldsymbol{\alpha}=(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_J)$ is given.

\begin{align} f(p_j) = \left\{\begin{array}{lc} p_j & \quad p_j > 0 \\ \alpha_jp_j & \quad p_j \leqq 0 \end{array}\right.\tag{8.8}\end{align}

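This vector is (one of) the parameters to be optimized, so, just as when optimizing the weights and biases, we need the gradient of the error function $E$ with respect to $\alpha_j$. The gradient can be written as equation $(8.9)$.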

\begin{align} \frac{\partial E}{\partial \alpha_j} = \sum_{p_j}\frac{\partial E}{\partial f(p_j)} \frac{\partial f(p_j)}{\partial \alpha_j}\tag{8.9}\end{align}

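Of the two factors on the right-hand side, $\frac{\partial E}{\partial f(p_j)}$ is the error term backpropagated from the layer ahead (the next layer in forward propagation), so it is already available. The remaining factor $\frac{\partial f(p_j)}{\partial \alpha_j}$ follows from equation $(8.8)$: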

\begin{align} \frac{\partial f(p_j)}{\partial \alpha_j} = \left\{ \begin{array}{ll}0 & \quad p_j > 0 \\ p_j &\quad p_j \leqq 0 \end{array}\right.\tag{8.10}\end{align}

Since the gradient can be computed as in equation $(8.10)$, the parameter can be optimized with gradient descent.
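The gradient in $(8.9)$ and $(8.10)$ can be sketched in NumPy and compared with a finite difference. The sketch assumes that the sum in $(8.9)$ runs over the samples in a batch, and it uses a toy error $E = \sum f(p_j)$ so that $\partial E/\partial f(p_j) = 1$; both are assumptions made for illustration, and the name `prelu_grad_alpha` is hypothetical.

```python
import numpy as np

def prelu(p, alpha):
    # Equation (8.8): p_j for p_j > 0, alpha_j * p_j otherwise
    return np.where(p > 0, p, alpha * p)

def prelu_grad_alpha(p, upstream):
    # Equations (8.9) and (8.10): sum over samples of
    # dE/df(p_j) * df(p_j)/dalpha_j, where df/dalpha_j is 0 or p_j
    return np.sum(upstream * np.where(p > 0, 0.0, p), axis=0)

rng = np.random.default_rng(0)
p = rng.normal(size=(8, 3))        # batch of 8 samples, J = 3 units
alpha = np.array([0.1, 0.2, 0.3])

upstream = np.ones_like(p)         # dE/df for the toy error E = sum(f)
analytic = prelu_grad_alpha(p, upstream)

# Central finite difference of E with respect to each alpha_j
eps = 1e-6
numeric = np.zeros_like(alpha)
for j in range(alpha.size):
    step = np.zeros_like(alpha)
    step[j] = eps
    numeric[j] = (prelu(p, alpha + step).sum()
                  - prelu(p, alpha - step).sum()) / (2.0 * eps)

print('analytic:', analytic)
print('numeric :', numeric)
```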

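Implementing the PReLU function in code

The source treats TensorFlow as not yet providing an API for PReLU, so the function is defined by hand.
Equation $(8.8)$ can be rewritten as equation $(8.11)$, so the implementation uses equation $(8.11)$.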

\begin{align} f(p_j) = \max(0, p_j) + \alpha_j\min(0, p_j)\tag{8.11}\end{align}

def prelu(x, alpha): 
    return tf.maximum(tf.zeros(tf.shape(x)), x) + alpha * tf.minimum(tf.zeros(tf.shape(x)), x)
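That the max/min form $(8.11)$ agrees with the piecewise definition $(8.8)$ can be checked in NumPy. The sample values below are arbitrary, and the two function names are illustrative.

```python
import numpy as np

def prelu_piecewise(p, alpha):
    # Equation (8.8)
    return np.where(p > 0, p, alpha * p)

def prelu_maxmin(p, alpha):
    # Equation (8.11): max(0, p) + alpha * min(0, p)
    return np.maximum(0.0, p) + alpha * np.minimum(0.0, p)

p = np.array([[-3.0, -1.0, 0.0],
              [0.5, 2.0, -0.25]])
alpha = np.array([0.1, 0.25, 0.5])  # one slope per unit, broadcast over rows

print(prelu_piecewise(p, alpha))
print(prelu_maxmin(p, alpha))
```

The max/min form avoids an explicit branch, which is convenient when only element-wise tensor operations are available.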

In Keras, add the following import.

from keras.layers.advanced_activations import PReLU

Modeling MNIST with four hidden layers

Implemented in TensorFlow, the accuracy is $92.55\%$.

import numpy as np 
import tensorflow as tf 
from sklearn import datasets 
from sklearn.model_selection import train_test_split 
from sklearn.utils import shuffle 

mnist = datasets.fetch_mldata('MNIST original', data_home='.') 

n = len(mnist.data) 
N = 10000 
train_size = 0.8 
indices = np.random.permutation(range(n))[:N] 

X = mnist.data[indices] 
y = mnist.target[indices] 
Y = np.eye(10)[y.astype(int)] 

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=train_size) 

n_in = len(X[0]) 
n_hidden = 200 
n_out = len(Y[0]) 

def prelu(x, alpha): 
    return tf.maximum(tf.zeros(tf.shape(x)), x) + alpha * tf.minimum(tf.zeros(tf.shape(x)), x) 
    
x = tf.placeholder(tf.float32, shape=[None, n_in]) 
t = tf.placeholder(tf.float32, shape=[None, n_out])
    
W0 = tf.Variable(tf.truncated_normal([n_in, n_hidden], stddev=0.01)) 
b0 = tf.Variable(tf.zeros([n_hidden])) 
alpha0 = tf.Variable(tf.zeros([n_hidden])) 
h0 = prelu(tf.matmul(x, W0) + b0, alpha0) 
    
W1 = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], stddev=0.01)) 
b1 = tf.Variable(tf.zeros([n_hidden])) 
alpha1 = tf.Variable(tf.zeros([n_hidden])) 
h1 = prelu(tf.matmul(h0, W1) + b1, alpha1) 
   
W2 = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], stddev=0.01)) 
b2 = tf.Variable(tf.zeros([n_hidden])) 
alpha2 = tf.Variable(tf.zeros([n_hidden])) 
h2 = prelu(tf.matmul(h1, W2) + b2, alpha2) 
    
W3 = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], stddev=0.01)) 
b3 = tf.Variable(tf.zeros([n_hidden])) 
alpha3 = tf.Variable(tf.zeros([n_hidden])) 
h3 = prelu(tf.matmul(h2, W3) + b3, alpha3) 
    
W4 = tf.Variable(tf.truncated_normal([n_hidden, n_out], stddev=0.01)) 
b4 = tf.Variable(tf.zeros([n_out])) 
y = tf.nn.softmax(tf.matmul(h3, W4) + b4) 
   
cross_entropy = tf.reduce_mean(-tf.reduce_sum(t * tf.log(y), axis=1)) 
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy) 
    
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(t, 1)) 
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 
    
epochs = 50 
batch_size = 200 
init = tf.global_variables_initializer() 
    
sess = tf.Session() 
sess.run(init) 

n_batches = (int)(N * train_size) // batch_size 
    
for epoch in range(epochs): 
    X_, Y_ = shuffle(X_train, Y_train) 
    
    for i in range(n_batches): 
        start = i * batch_size 
        end = start + batch_size 
        sess.run(train_step, feed_dict={ x: X_[start:end], t: Y_[start:end] }) 
            
    loss = cross_entropy.eval(session=sess, feed_dict={ x: X_, t: Y_ }) 
    acc = accuracy.eval(session=sess, feed_dict={ x: X_, t: Y_ }) 
    print(f'epoch: {epoch}, loss: {loss}, accuracy: {acc}') 
            
accuracy_rate = accuracy.eval(session=sess, feed_dict={ x: X_test, t: Y_test }) 
print(f'accuracy: {accuracy_rate * 100:.4} %')

epoch: 47, loss: 0.027591068297624588, accuracy: 0.9975000023841858 
epoch: 48, loss: 0.326477587223053, accuracy: 0.9257500171661377 
epoch: 49, loss: 0.048840634524822235, accuracy: 0.9894999861717224 

accuracy: 92.55 %

Implemented in Keras, the accuracy is $90.6\%$.

from time import time 
import numpy as np 
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.optimizers import SGD 
from sklearn import datasets 
from sklearn.model_selection import train_test_split 
from keras.layers.advanced_activations import PReLU 

start = time() 
MNIST = datasets.fetch_mldata('MNIST original', data_home='.') 

n = len(MNIST.data)
N = 10000 
indices = np.random.permutation(range(n))[:N] 

X = MNIST.data[indices] 
Y = MNIST.target[indices] 
Y = np.eye(10)[Y.astype(int)] 

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8) 

# Model setup 

n_in = len(X[0]) 
n_hidden = 200 
n_out = len(Y[0]) 
alpha = 0.01 

model = Sequential() 
model.add(Dense(n_hidden, input_dim=n_in)) 
model.add(PReLU()) 
model.add(Dense(n_hidden)) 
model.add(PReLU()) 
model.add(Dense(n_hidden)) 
model.add(PReLU()) 
model.add(Dense(n_hidden)) 
model.add(PReLU()) 
model.add(Dense(n_out)) 
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01), metrics=['accuracy']) 

# Train the model 

epochs = 20 
batch_size = 200 

model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size) 

print(f'\nElapsed training time : {(time() - start)//60} min {(time() - start) % 60:.6} s\n') 

# Measure accuracy 
start = time() 
loss_and_metrics = model.evaluate(x_test, y_test)  # returns [loss, accuracy] 
print(f'\nLoss : {loss_and_metrics[0]:.6}') 
print(f'Accuracy : {loss_and_metrics[1] * 100:.3}%') 
print(f'Elapsed test time : {time() - start} s')

Elapsed training time : 0.0 min 7.06802 s 

  32/2000 [..............................] - ETA: 1s 
1376/2000 [===================>..........] - ETA: 0s 
2000/2000 [==============================] - 0s 48us/step

Loss : 0.607181 
Accuracy : 90.6% 
Elapsed test time : 0.09506750106811523 s

Other activation functions

The Randomized ReLU (RReLU) function draws the slope for negative inputs at random during training and uses the average slope at test time.
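A minimal NumPy sketch of this train/test behavior is shown below. The sampling bounds `lower` and `upper`, the uniform distribution, and the function signature are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def rrelu(x, lower=0.125, upper=1.0 / 3.0, training=True, rng=None):
    # lower/upper are assumed bounds for the random negative-side slope
    if training:
        rng = np.random.default_rng() if rng is None else rng
        # One random slope per element while training
        slope = rng.uniform(lower, upper, size=np.shape(x))
    else:
        # Fixed average slope at test time
        slope = (lower + upper) / 2.0
    return np.where(x > 0, x, slope * x)

x = np.array([-1.0, 2.0])
out_train = rrelu(x, training=True, rng=np.random.default_rng(0))
out_test = rrelu(x, training=False)
print(out_train, out_test)
```

Positive inputs pass through unchanged in both modes; only the negative side is randomized.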

The Exponential Linear Units (ELU) function is defined by equation $(8.12)$.

\begin{align}f(x) = \left\{ \begin{array}{ll} x & \quad x>0\\ e^x-1 & \quad x \leqq 0 \end{array}\right. \tag{8.12}\end{align}
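Equation $(8.12)$ and its derivative can be sketched in NumPy as follows. The derivative $e^x$ for $x \leqq 0$ is obtained by differentiating $(8.12)$ directly; the name `elu_grad` is illustrative, and clamping the argument of the exponential is an implementation detail added here to avoid overflow warnings in the unused branch.

```python
import numpy as np

def elu(x):
    # Equation (8.12): x for x > 0, exp(x) - 1 for x <= 0
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def elu_grad(x):
    # Derivative of (8.12): 1 for x > 0, exp(x) for x <= 0
    return np.where(x > 0, 1.0, np.exp(np.minimum(x, 0.0)))

x = np.array([-2.0, 0.0, 3.0])
print(elu(x))
print(elu_grad(x))
```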

Wrap-up

When unsure which activation function to use, starting with ReLU or LReLU often gives satisfactory results.

Source: 정석으로 배우는 딥러닝

License: Attribution, NonCommercial, ShareAlike
