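Implementing a Binary Heart Disease Classifier with SimpleRNN

This post walks through building a SimpleRNN-based binary classifier on a heart disease dataset with 302 samples and 13 features. The goal is to predict the presence or absence of heart disease as a binary label (0 or 1).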

Workflow overview

GPU setup → data split → feature standardization → model construction → compilation → training → visual evaluation

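1. Data split

df.iloc[:, :-1] selects every column except the last as the features X, and df.iloc[:, -1] selects the last column as the label y. train_test_split with test_size=0.1 then holds out about 10% of the rows for testing (roughly 272 training and 30 test samples; scikit-learn rounds the test count up, so the exact sizes can differ by one).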
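To see where the split sizes come from, here is a small sketch that mirrors the rounding train_test_split applies, as far as I understand it, when test_size is a fraction and train_size is left unset: the test count is rounded up and the training set gets the remainder. The helper name split_sizes is made up for this illustration and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: mirrors scikit-learn's rule of rounding the test count up
    and giving the remaining rows to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(302, 0.1)
print(n_train, n_test)
```

Under this rule the counts quoted for a 90/10 split can be off by one, so it is worth checking X_train.shape and X_test.shape on the real data rather than relying on the percentages.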

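2. Feature standardization

StandardScaler transforms each feature column to mean 0 and standard deviation 1, removing scale differences between features. fit_transform is applied only to the training data, where the mean and standard deviation are learned. The test data goes through transform only, reusing those learned parameters, which prevents data leakage.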
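The fit-on-train, transform-on-test pattern can be sketched by hand with NumPy. This is only an illustration of the idea, not StandardScaler itself: the function name and the random toy data are assumptions made for the example.

```python
import numpy as np

def standardize_with_train_stats(train, test):
    """Scale both arrays using statistics computed from the training set only."""
    mean = train.mean(axis=0)
    # Population standard deviation (ddof=0), the same convention StandardScaler uses.
    # Note: a constant column would give std == 0; this sketch does not handle that.
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Toy data standing in for the real features (values are random, not from heart.csv).
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(8, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(4, 3))

train_s, test_s = standardize_with_train_stats(train, test)
print(train_s.mean(axis=0))  # numerically zero for every column
print(test_s.mean(axis=0))   # generally not zero, and that is expected
```

The test columns do not come out with mean exactly 0, because their statistics were never looked at. That is the point: nothing about the test set influences the preprocessing.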

The data is then reshaped to (number of samples, 13, 1). An RNN layer expects 3-D input of shape (batch_size, time_steps, features); here the 13 features are treated as 13 time steps with one feature per step.
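A tiny NumPy example makes the reshape concrete. The array below is a stand-in with two fake samples, not the real dataset.

```python
import numpy as np

# Two fake samples with 13 features each, filled with 0..25 so positions are easy to track.
X = np.arange(2 * 13, dtype=np.float32).reshape(2, 13)

# Add a trailing axis: each of the 13 features becomes one time step holding one value.
X_seq = X.reshape(X.shape[0], X.shape[1], 1)

print(X.shape)      # (2, 13)
print(X_seq.shape)  # (2, 13, 1)
```

The values are unchanged; feature j of sample i simply moves from X[i, j] to X_seq[i, j, 0].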

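3. Model construction

The Sequential API stacks three layers in order:

SimpleRNN(200) - input shape (13, 1), ReLU activation, outputs the 200-dimensional final hidden state
Dense(100, relu) - fully connected hidden layer for further feature extraction
Dense(1, sigmoid) - output layer; the sigmoid squashes the output into the range 0 to 1 (probability of the positive class)

The model has 60,601 trainable parameters in total.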
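The total can be checked by hand. A SimpleRNN layer has an input kernel, a recurrent kernel, and a bias; a Dense layer has a kernel and a bias. The helper functions below are written just for this check.

```python
def simple_rnn_params(units, input_dim):
    # input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return input_dim * units + units * units + units

def dense_params(inputs, units):
    # kernel (inputs x units) + bias (units)
    return inputs * units + units

rnn = simple_rnn_params(units=200, input_dim=1)  # 40,400
hidden = dense_params(inputs=200, units=100)     # 20,100
output = dense_params(inputs=100, units=1)       # 101

total = rnn + hidden + output
print(total)  # 60601
```

Most of the parameters sit in the 200 x 200 recurrent kernel, which is a lot of capacity for a dataset of about 300 rows.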

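4. Model compilation

Three things are specified:

Loss function: binary_crossentropy (the standard loss for binary classification)
Optimizer: Adam (learning rate 0.0001, lower than the default of 0.001, for more stable training)
Metric: accuracy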
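To get a feel for what binary cross-entropy measures, here is a plain-Python version of the formula, averaged over samples. It is a sketch for intuition rather than the Keras implementation, and the eps clipping value is an assumption chosen only to avoid log(0).

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Same labels, opposite predictions.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident correct predictions cost about 0.105 per sample here, while confident wrong ones cost about 2.303, so the loss punishes being sure and wrong far more than being unsure.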

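5. Model training

The model trains for 100 epochs with a batch size of 128, using the test data as the validation set. Training and validation loss and accuracy are printed after every epoch, which makes it possible to watch for overfitting.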
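With a training set this small, a batch size of 128 means very few weight updates per epoch. The quick calculation below assumes about 272 training samples, as stated earlier; the helper name is made up for the example.

```python
import math

def updates_per_epoch(n_train, batch_size):
    # The last batch may be smaller than batch_size, so round up.
    return math.ceil(n_train / batch_size)

steps = updates_per_epoch(n_train=272, batch_size=128)
total_updates = steps * 100  # 100 epochs

print(steps, total_updates)  # 3 300
```

Three updates per epoch combined with a learning rate of 0.0001 is one plausible reason the curves may move slowly at first; a smaller batch size would give more updates per epoch if that turns out to be a problem.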

6. Visual evaluation

Two line charts are drawn: (left) training vs. validation accuracy, and (right) training vs. validation loss. If the two curves diverge widely, the model is likely overfitting.
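The same check can be done numerically instead of by eye. The sketch below takes a dictionary shaped like Keras's history.history and reports the epoch with the lowest validation loss and the final gap between validation and training loss. The function name and the numbers in fake_history are invented purely for illustration.

```python
def summarize_history(history):
    """Return (best_epoch, final_gap) from a dict with 'loss' and 'val_loss' lists."""
    val_loss = history["val_loss"]
    # 0-based index of the epoch where validation loss was lowest.
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    # Positive gap: validation loss ends higher than training loss.
    final_gap = val_loss[-1] - history["loss"][-1]
    return best_epoch, final_gap

# Made-up curves: training loss keeps falling while validation loss turns back up.
fake_history = {
    "loss": [0.70, 0.55, 0.45, 0.38, 0.30],
    "val_loss": [0.69, 0.58, 0.52, 0.54, 0.60],
}

best_epoch, final_gap = summarize_history(fake_history)
print(best_epoch, round(final_gap, 2))  # 2 0.3
```

With a trained model this would be called as summarize_history(history.history). A best epoch well before the last one, together with a growing gap, is the numeric counterpart of the two curves pulling apart.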

Code implementation

# 1. GPU setup
import tensorflow as tf
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    gpu0 = gpus[0]
    tf.config.experimental.set_memory_growth(gpu0, True)
    tf.config.set_visible_devices([gpu0], "GPU")
gpus

# 2. Load the data
import pandas as pd
import numpy as np
df = pd.read_csv("heart.csv")
df

# 3. Check for missing values
df.isnull().sum()

# 4. Preprocess the data
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

features = df.iloc[:, :-1]
target = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.1, random_state=1
)
X_train.shape, y_train.shape

# Standardize the features and reshape to (samples, time_steps, features)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

# Build the model (tensorflow is already imported as tf above)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN

model = Sequential()
model.add(SimpleRNN(200, input_shape=(13, 1), activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(
    loss='binary_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy']
)

# Train the model, using the test set as validation data
num_epochs = 100

history = model.fit(
    X_train, y_train,
    epochs=num_epochs,
    batch_size=128,
    validation_data=(X_test, y_test),
    verbose=1
)

# Plot the training curves
import matplotlib.pyplot as plt

train_acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
train_loss = history.history['loss']
val_loss = history.history['val_loss']

epoch_range = range(num_epochs)

plt.figure(figsize=(14, 4))

plt.subplot(1, 2, 1)
plt.plot(epoch_range, train_acc, label='Training Accuracy')
plt.plot(epoch_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epoch_range, train_loss, label='Training Loss')
plt.plot(epoch_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

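Glossary

loss (loss function): the measure of how wrong the predictions are

optimizer: the method used to update the weights based on the loss

metrics (evaluation metrics): performance measures reported to the user during training; they do not drive the weight updates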

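Notes

The 13 clinical features in this dataset (age, blood pressure, and so on) have no natural temporal order. RNNs are generally suited to sequential data such as time series (for example, stock price prediction), so on this task a fully connected network (MLP) may perform as well or better.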

Tags: TensorFlow, SimpleRNN, Heart Disease Prediction, Binary Classification, StandardScaler

Posted on June 13 at 18:48

괴물 클럽