Machine Learning

[Logistic Regression] Explanation and Hands-On Practice

Suda_777 2021. 5. 12. 22:44

In [5]:

import numpy as np
from sklearn.datasets import make_classification
import statsmodels.api as sm  # public API namespace; exposes sm.tools and sm.datasets
from statsmodels.discrete.discrete_model import Logit
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

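1. Theory

1.1. When is Logistic Regression used?

The independent variables are continuous (numeric) data
- Range of the independent variables: $-\infty$ to $\infty$
The dependent variable is binary (data that can be expressed as 0 or 1)
- Examples: male/female, 0/1, success/failure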

1.2. Formula

Sigmoid function

$y = \frac{1}{1+e^{-x}}$
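
As a quick numerical check of the formula above, here is a minimal sketch of the sigmoid in NumPy. The helper name `sigmoid` and the sample inputs are illustrative assumptions, not part of the original notebook. In logistic regression the input x is the linear predictor, that is, the intercept plus the weighted sum of the independent variables.

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

# Illustrative inputs: a large negative value, zero, and a large positive value
xs = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(xs)
print(probs)  # rises from near 0, through exactly 0.5 at x = 0, to near 1
```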

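1.3. Characteristics

The output of the sigmoid always lies between 0 and 1, so it can be read as the probability that the dependent variable equals 1. A 0/1 class label is obtained by comparing that probability with a threshold.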
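
Turning a predicted probability into a 0/1 label can be sketched as follows. The probabilities are made-up values, and the 0.5 cut-off is an assumption (a common default, not something the original post specifies).

```python
import numpy as np

# Hypothetical predicted probabilities, as a fitted model might return them
probs = np.array([0.03, 0.48, 0.50, 0.97])

threshold = 0.5  # assumed cut-off; other values may suit other problems
labels = (probs >= threshold).astype(int)
print(labels)  # [0 0 1 1]
```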

2. Hands-On Practice

2.1. Using Logit

In [3]:

X0, y = make_classification(n_features=1, n_redundant=0, n_informative=1,
                            n_clusters_per_class=1, random_state=4) # generate synthetic data
X = sm.tools.tools.add_constant(X0)
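
`Logit` does not add an intercept on its own, which is why the cell above calls `add_constant`. With its default arguments that call prepends a column of ones. A rough NumPy equivalent, using a hypothetical three-row array instead of the generated data, might look like this:

```python
import numpy as np

# Hypothetical single-feature design matrix with three observations
X0_demo = np.array([[0.5], [-1.2], [2.0]])

# Prepend a column of ones so the model can estimate an intercept
X_demo = np.column_stack([np.ones(len(X0_demo)), X0_demo])
print(X_demo.shape)  # (3, 2)
```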

In [9]:

# If an error reports that the chisqprob function is missing, run the following line first.
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
logit_mod = Logit(y, X)
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                          Logit   Df Residuals:                       98
Method:                           MLE   Df Model:                            1
Date:                Fri, 31 May 2019   Pseudo R-squ.:                  0.7679
Time:                        14:38:08   Log-Likelihood:                -16.084
converged:                       True   LL-Null:                       -69.295
                                        LLR p-value:                 5.963e-25
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2515      0.477      0.527      0.598      -0.683       1.186
x1             4.2382      0.902      4.699      0.000       2.470       6.006
==============================================================================
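
Several numbers in the summary above can be reproduced by hand, which may help when reading it. The sketch below copies the rounded values from the table, so the results match only to the printed precision. The helper name `predict_proba` is an illustrative assumption. The p-value line relies on the fact that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x/2)).

```python
import math

# Values copied from the summary table above (rounded as printed)
const, coef_x1, se_x1 = 0.2515, 4.2382, 0.902
llf, llnull = -16.084, -69.295

def predict_proba(x):
    """Probability that y = 1 for a given x under the fitted coefficients."""
    return 1.0 / (1.0 + math.exp(-(const + coef_x1 * x)))

z_x1 = coef_x1 / se_x1            # Wald z statistic, about 4.70
odds_ratio = math.exp(coef_x1)    # multiplicative change in the odds per unit of x1
boundary = -const / coef_x1       # x at which the predicted probability is 0.5
pseudo_r2 = 1.0 - llf / llnull    # McFadden's pseudo R-squared, about 0.768
llr = 2.0 * (llf - llnull)        # likelihood-ratio statistic against the null model
llr_pvalue = math.erfc(math.sqrt(llr / 2.0))  # chi-square survival function, 1 df

print(round(z_x1, 3), round(pseudo_r2, 4), llr_pvalue)
print(boundary, predict_proba(boundary))
```

The large coefficient on x1 and the decision boundary close to x = 0 are consistent with the steep S-shaped curve drawn in the next cell.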

In [26]:

xx = np.linspace(-3, 3, 100)
mu = logit_res.predict(sm.tools.tools.add_constant(xx))
plt.plot(xx, mu, lw=5, alpha=0.5)
plt.scatter(X0, y, label="y", marker='o', s=100)
plt.scatter(X0, logit_res.predict(X), label=r"$\hat{y}$", marker='x', c=y,
            s=200, lw=2, alpha=0.5, cmap=mpl.cm.jet)
plt.xlim(-3, 3)
plt.xlabel("x")
plt.ylabel(r"$\mu$")
plt.title(r"$\hat{y} = \mu(x)$")
plt.legend()
plt.show()

2.2. Using Logit.from_formula

1) Loading the data

The following dataset contains medical school admissions data for US students.

In [3]:

data_med = sm.datasets.get_rdataset("MedGPA", package="Stat2Data")
df_med = data_med.data
df_med.tail()

Out[3]:

	Accept	Acceptance	Sex	BCPM	GPA	VR	PS	WS	BS	MCAT	Apps
50	D	0	M	2.41	2.72	8	8	8.0	8	32	7
51	D	0	M	3.51	3.56	11	8	6.0	9	34	6
52	A	1	F	3.43	3.48	7	10	7.0	10	34	14
53	D	0	M	2.61	2.80	7	5	NaN	6	18	6
54	D	0	M	3.36	3.44	11	11	8.0	9	39	1
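
Row 53 in the output above has a missing WS value. The formula interface drops rows with missing values by default, which would explain why the model summary in step 3 reports 54 observations. The sketch below only illustrates that row-wise filtering with NumPy on the five WS values shown; it is not what statsmodels runs internally. In pandas, `df_med.dropna(subset=["WS"])` would make the same filtering explicit.

```python
import numpy as np

# WS values of the five rows shown by df_med.tail() above
ws_tail = np.array([8.0, 6.0, 7.0, np.nan, 8.0])

keep = ~np.isnan(ws_tail)  # True where WS is present
print(keep.sum(), "of", len(ws_tail), "rows kept")
```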

2) Inspecting the data

In [6]:

sns.stripplot(x="GPA", y="Acceptance", data=df_med,
              jitter=True, orient='h', order=[1, 0])
plt.grid(True)
plt.show()

3) Building the model

In [7]:

model_med = Logit.from_formula("Acceptance ~ Sex + BCPM + GPA + VR + PS + WS + BS + Apps", df_med)
result_med = model_med.fit()
print(result_med.summary())

Optimization terminated successfully.
         Current function value: 0.280736
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:             Acceptance   No. Observations:                   54
Model:                          Logit   Df Residuals:                       45
Method:                           MLE   Df Model:                            8
Date:                Sat, 01 Jun 2019   Pseudo R-squ.:                  0.5913
Time:                        20:12:20   Log-Likelihood:                -15.160
converged:                       True   LL-Null:                       -37.096
                                        LLR p-value:                 6.014e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -46.6414     15.600     -2.990      0.003     -77.216     -16.067
Sex[T.M]      -2.2835      1.429     -1.597      0.110      -5.085       0.518
BCPM          -6.1633      6.963     -0.885      0.376     -19.811       7.484
GPA           12.3973      8.611      1.440      0.150      -4.479      29.274
VR             0.0790      0.311      0.254      0.799      -0.530       0.688
PS             1.1673      0.539      2.164      0.030       0.110       2.225
WS            -0.7784      0.396     -1.968      0.049      -1.554      -0.003
BS             1.9184      0.682      2.814      0.005       0.582       3.255
Apps           0.0512      0.147      0.348      0.728      -0.237       0.340
==============================================================================
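
One way to read the table above is to convert coefficients into odds ratios with exp(coef). The sketch below copies the coefficients and p-values from the summary and keeps the predictors whose p-value is below 0.05; that significance level is an assumption, not a choice stated in the original post. The row `Sex[T.M]` is the dummy variable the formula interface creates for the categorical Sex column, with F as the reference level.

```python
import math

# (coefficient, p-value) pairs copied from the summary table above
summary_rows = {
    "Sex[T.M]": (-2.2835, 0.110),
    "BCPM": (-6.1633, 0.376),
    "GPA": (12.3973, 0.150),
    "VR": (0.0790, 0.799),
    "PS": (1.1673, 0.030),
    "WS": (-0.7784, 0.049),
    "BS": (1.9184, 0.005),
    "Apps": (0.0512, 0.728),
}

alpha = 0.05  # assumed significance level
significant = {name: math.exp(coef)
               for name, (coef, p) in summary_rows.items() if p < alpha}

for name, odds_ratio in significant.items():
    print(f"{name}: odds ratio = {odds_ratio:.2f}")

# McFadden's pseudo R-squared from the two log-likelihoods, about 0.591
llf, llnull = -15.160, -37.096
pseudo_r2 = 1.0 - llf / llnull
print(round(pseudo_r2, 4))
```

An odds ratio above 1 (PS, BS) means the odds of acceptance rise with the score, holding the other variables fixed, while a value below 1 (WS) means they fall.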

License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)
