Introduction to Machine Learning in Python with scikit-learn

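Introducing scikit-learn

scikit-learn is a core Python library for machine learning that provides a wide range of learning algorithms and preprocessing tools. It is one of the standard tools in data science projects, and its consistent API and thorough documentation make it popular with beginners and experts alike.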

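Installation and Environment Setup

scikit-learn depends on Python, NumPy, and SciPy, and the minimum versions depend on the scikit-learn release. The following minimums apply only to old releases (around 0.18 and 0.19):

Python 3.3 or later
NumPy 1.6.1 or later
SciPy 0.9 or later

Current 1.x releases require considerably newer versions of all three, so check the installation guide for the release you plan to use. pip and conda resolve these dependencies automatically.

The installation commands are as follows: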

pip install -U scikit-learn
# or, if you use conda
conda install scikit-learn
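
Because the required Python, NumPy, and SciPy versions depend on the scikit-learn release, it can help to confirm which version was actually installed. The following is a minimal check; the variable name installed_version is only illustrative.

```python
import sklearn

# Print the installed scikit-learn version to confirm the import works.
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)

# show_versions() also lists the Python, NumPy, SciPy, and other dependency versions.
sklearn.show_versions()
```

If the import fails, the package was most likely installed into a different Python environment than the one running the script.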

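Main Types of Machine Learning Tasks

scikit-learn supports the following main kinds of analysis:

Classification: predicting labels (e.g., spam detection). Logistic regression, decision trees, support vector machines (SVM), k-nearest neighbors (KNN)
Regression: predicting continuous values (e.g., house price estimation). Linear regression, decision tree regression
Clustering: unsupervised grouping of samples. K-Means, DBSCAN, hierarchical clustering
Dimensionality Reduction: compressing data into fewer features. PCA, LDA
Ensemble Methods: combining multiple models. Random Forest, Gradient Boosting, AdaBoost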
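
To make these task types concrete, here is a rough sketch that runs one representative estimator for four of them on the Iris data. The choice of estimators, the regression target (petal width), and all variable names are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Classification: predict the species label from all four features.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
class_pred = clf.predict(X[:3])

# Regression: predict petal width (column 3) from the other three features.
reg = LinearRegression().fit(X[:, :3], X[:, 3])
width_pred = reg.predict(X[:3, :3])

# Clustering: group the samples without using the labels at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_[:3]

# Dimensionality reduction: project four features down to two components.
X_2d = PCA(n_components=2).fit_transform(X)

print(class_pred, width_pred.round(2), cluster_ids, X_2d.shape)
```

Note that the cluster ids produced by KMeans are arbitrary integers and do not correspond to the species labels.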

Basic Workflow

Most scikit-learn projects follow the sequence below:

Load the data
Preprocess and engineer features
Train the model (fit)
Evaluate performance (score)
Predict on new data (predict)
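
The five steps above can be sketched end to end as follows. This is a minimal example under assumed choices: the wine dataset, a StandardScaler plus k-nearest-neighbors pipeline, and a 25% test split are illustrative, not prescribed by the workflow itself.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the data and hold out a test split.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Preprocessing: standardize features inside a pipeline so the scaler
#    is fitted on the training split only.
workflow = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 3. Train.
workflow.fit(X_train, y_train)

# 4. Evaluate: score() returns the mean accuracy for classifiers.
test_accuracy = workflow.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

# 5. Predict on new data (the first two held-out rows stand in for it here).
new_predictions = workflow.predict(X_test[:2])
print("Predictions:", new_predictions)
```

Wrapping the scaler and the model in one pipeline keeps the test data out of the preprocessing step, which avoids an optimistic accuracy estimate.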

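Using Built-in Datasets

scikit-learn ships with built-in datasets that are useful for practice. Here we look at a representative one, the Iris dataset.

Structure of the Iris Dataset

The following code shows the components of the dataset: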

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.keys())

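Output (the exact keys depend on the scikit-learn version; recent releases also include 'frame' and 'data_module'):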

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Inspecting the Data

n_samples, n_features = iris.data.shape
print(f"Samples: {n_samples}, features: {n_features}")
print("Feature names:", iris.feature_names)
print("Label names:", iris.target_names)
print("First five rows:\n", iris.data[:5])

Output:

Samples: 150, features: 4
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Label names: ['setosa' 'versicolor' 'virginica']
First five rows:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

Converting to a pandas DataFrame

The data can be loaded into pandas for analysis:

import pandas as pd

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]
print(df.head())
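
Once the data is in a DataFrame, ordinary pandas operations apply. As a small illustration, the sketch below rebuilds the same DataFrame and summarizes it per species; the names species_counts and species_means are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Number of samples per species (the Iris dataset has 50 of each).
species_counts = df["species"].value_counts()
print(species_counts)

# Mean of each feature within each species.
species_means = df.groupby("species").mean()
print(species_means.round(2))
```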

Visualization Example

Seaborn can be used to visualize the relationships between features:

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='species', palette='Set1')
plt.show()

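Ways to Load Datasets

scikit-learn provides three dataset loading mechanisms:

load_*: small datasets bundled with the library (e.g., load_digits(), load_wine(); load_boston() was removed in scikit-learn 1.2)
fetch_*: larger datasets downloaded from the internet (e.g., fetch_california_housing())
make_*: synthetic data generated for learning purposes (e.g., make_classification(), make_blobs())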
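
The difference between the three families is easiest to see side by side. The sketch below runs one load_* and one make_* function; the fetch_* call is left commented out because it needs network access on first use. The parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits, make_blobs

# load_*: bundled with the library, no download needed.
digits = load_digits()
print(digits.data.shape)  # (1797, 64)

# make_*: synthetic data generated on the fly; random_state makes it reproducible.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
print(X_blobs.shape, y_blobs.shape)  # (300, 2) (300,)

# fetch_*: downloaded on first use and cached locally (needs network access),
# so it is left commented out here.
# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()
```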

Example: Generating Synthetic Classification Data

from sklearn.datasets import make_classification

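# n_informative=3 is needed: the default of 2 cannot hold 3 classes x 2 clusters each (ValueError).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=42)
print(X.shape, y.shape)  # (1000, 4) (1000,)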

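Core API Concept: The Estimator

Most objects in scikit-learn follow the estimator pattern. Every estimator implements fit, and depending on its role it also provides predict, transform, or both:

fit(X, y): trains the model (y is omitted for unsupervised estimators)
predict(X): returns predictions for new inputs
transform(X): transforms data (used by preprocessors)

Common estimator types:

Transformers: StandardScaler, MinMaxScaler, PCA
Predictors: LogisticRegression, KMeans, RandomForestClassifier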
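
The training example later in this section shows a predictor, so here is a complementary sketch of the transformer side of the interface. It assumes the Iris data and uses StandardScaler followed by PCA purely as an illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# fit() learns the per-feature mean and standard deviation; transform() applies them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Learned parameters are stored in attributes ending with an underscore.
print("Learned means:", scaler.mean_.round(2))
print("Scaled column means close to 0:", np.allclose(X_scaled.mean(axis=0), 0.0))
print("Scaled column stds close to 1:", np.allclose(X_scaled.std(axis=0), 1.0))

# fit_transform() is a shortcut for fit() followed by transform() on the same data.
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X_reduced.shape)  # (150, 2)
```

In practice, a transformer should be fitted on the training data only and then applied to the test data with transform().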

Example: Training a Simple Model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Create and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
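
A single 80/20 split leaves only 30 test samples, so the accuracy above can shift noticeably with a different random_state. One common way to get a steadier estimate is k-fold cross-validation. The sketch below assumes 5 folds and reuses the same model settings; it is an optional extension, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on four folds, score on the fifth, and repeat.
model = LogisticRegression(max_iter=200)
cv_scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

cross_val_score clones and refits the model for every fold, so the model object passed in is left unfitted.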

Tags: scikit-learn machine learning python data preprocessing classification

Posted on May 23 at 07:36

괴물 클럽