[Big Data Analysis Engineer (빅데이터분석기사) Practical Exam] Memorization Notes

[Big Data Analysis Engineer Practical Exam] Memorization Notes - Type 3

Suda_777 2024. 6. 8. 23:43

1. Exam Overview

Type 3
Number of questions: 2 (15 points each, 30 points total)
Topic: statistical hypothesis testing

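2. Theory to Memorize

2.1. Basic Theory

Setting up hypotheses
1. Null hypothesis: $H_0$. If the test statistic falls in the rejection region, reject the null hypothesis and accept the alternative hypothesis.
2. Alternative hypothesis: $ H_1 $
3. A small p-value means the observed statistic is very extreme under $H_0$, so the null hypothesis is rejected.
Significance level: $ \alpha $
1. 1%, 0.01, two-sided test (2.576)
2. 5%, 0.05, two-sided test (1.96), one-sided test (1.645)
3. 10%, 0.1
Types of error
1. Type 1 error: rejecting the null hypothesis when it is true
2. Type 2 error: failing to reject the null hypothesis when it is false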
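
The critical values listed above do not have to be memorized blindly; they can be recovered from the standard normal distribution. The sketch below assumes a z-based test, and the helper names `two_sided_critical` and `one_sided_critical` are made up for illustration.

```python
from scipy.stats import norm


def two_sided_critical(alpha):
    """Critical value when alpha is split evenly across both tails."""
    return norm.ppf(1 - alpha / 2)


def one_sided_critical(alpha):
    """Critical value when all of alpha sits in the upper tail."""
    return norm.ppf(1 - alpha)


for alpha in (0.01, 0.05, 0.10):
    print(
        f"alpha={alpha}: two-sided {two_sided_critical(alpha):.3f}, "
        f"one-sided {one_sided_critical(alpha):.3f}"
    )
```

For a t-test the same idea applies with `scipy.stats.t.ppf` and the appropriate degrees of freedom, which gives slightly larger critical values for small samples.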

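2.2. Types of Tests

2.2.1. Test for a 'Single-Sample Mean'

One sample Z-Test: normally distributed population, large sample (usable only when the population variance is known)
One sample T-Test: small sample (usable when the population variance is unknown)
Null hypothesis: $ \mu_0 = \mu $
Alternative hypothesis: $ \mu_0 \neq \mu $

2.2.2. 'Difference Between the Means of Two Independent Samples'

Independent Two-sample T-Test
Null hypothesis: $ \mu_1 = \mu_2 $
Alternative hypothesis: $ \mu_1 \neq \mu_2 $

2.2.3. Test for the 'Mean Difference of Paired Samples'

Paired T-Test
Null hypothesis: $ \mu_D = 0 $
Alternative hypothesis: $ \mu_D \neq 0 $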
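
The hypothesis $ \mu_D = 0 $ says that the paired test is really a one-sample test on the differences. A minimal sketch with synthetic data (the values are illustrative, not taken from any exam problem) shows that `ttest_rel` and `ttest_1samp` on the differences agree:

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Synthetic paired measurements (illustrative assumption).
rng = np.random.default_rng(0)
before = rng.normal(loc=60, scale=5, size=30)
after = before + rng.normal(loc=-1.5, scale=1.5, size=30)

# D = after - before; H0: mean(D) = 0
diff = after - before

t_paired, p_paired = ttest_rel(after, before)
t_single, p_single = ttest_1samp(diff, popmean=0)

print(f"paired:     t={t_paired:.4f}, p={p_paired:.4f}")
print(f"one-sample: t={t_single:.4f}, p={p_single:.4f}")
```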

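2.2.4. Test for the 'Population Variance of a Single Sample'

$ \chi^2 $ Test for a single Variance (chi-square variance test)
Null hypothesis: $ \sigma^2_0 = \sigma^2 $
Alternative hypothesis: $ \sigma^2_0 \neq \sigma^2 $

2.2.5. Hypothesis Test for the 'Ratio of Two Population Variances'

$ F $-Test for Equality of Variance (F-test)
Null hypothesis: $ \sigma^2_1 = \sigma^2_2 $
Alternative hypothesis: $ \sigma^2_1 \neq \sigma^2_2 $

2.2.6. Test of 'Independence'

$ \chi^2 $ Test of Independence (chi-square test of independence)
Null hypothesis: the two variables are independent.
Alternative hypothesis: the two variables are not independent.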

3. Practice

3.1. Shapiro-Wilk Test

Tests whether the data follow a normal distribution.

import numpy as np
from scipy.stats import shapiro

# Generate data
np.random.seed(42)
data = np.random.normal(loc=0, scale=1, size=40)  # normally distributed data

# Run the Shapiro-Wilk test
statistic, p_value = shapiro(data)

# Print the results
print(f"Shapiro-Wilk Statistic: {statistic}")
print(f"P-Value: {p_value}")

# Hypothesis test decision
if p_value < 0.05:
    print("Null hypothesis rejected: the data does not follow a normal distribution.")
else:
    print("Fail to reject the null hypothesis: no evidence that the data departs from a normal distribution.")

3.2. One sample Z-Test

$$ Z = \frac{\bar{X} - \mu} {\sigma / \sqrt{n}} $$

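scipy does not provide a dedicated function for this test.
Use it when both the population mean and the population variance are known.
Use it when the sample size is 30 or more.
norm.cdf : returns the cumulative probability up to x under the normal distribution. (x -> probability)
norm.ppf : returns the value x of the normal distribution that corresponds to a cumulative probability. (probability -> x)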
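
Since `cdf` and `ppf` are easy to mix up, a quick round trip on the standard normal distribution makes the direction of each call concrete. `norm.pdf` is included only for contrast: it returns the height of the density curve, which is not a probability.

```python
from scipy.stats import norm

x = 1.96

prob = norm.cdf(x)       # x -> cumulative probability P(Z <= x)
x_back = norm.ppf(prob)  # cumulative probability -> x (inverse of cdf)
density = norm.pdf(x)    # height of the density curve at x

print(f"cdf({x}) = {prob:.4f}")
print(f"ppf({prob:.4f}) = {x_back:.4f}")
print(f"pdf({x}) = {density:.4f}")
```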

import numpy as np
import pandas as pd
from scipy.stats import norm

# Generate sample data
np.random.seed(42)  # fix the seed for reproducibility
data = {
    'A': np.random.normal(loc=100, scale=1, size=30)  # 30 samples from a normal distribution with mean 100, standard deviation 1
}
df = pd.DataFrame(data)

# Population mean and variance
population_mean = 100
population_variance = 1

# Sample size
n = len(df)

# Sample mean
mean = df['A'].mean()

# Significance level
alpha = 0.05

# Compute the z-score
z_score = (mean - population_mean) / (np.sqrt(population_variance) / np.sqrt(n))

# Compute the two-sided p-value
p_value = 2 * (1 - norm.cdf(abs(z_score)))

# Print the results
print(f"Sample Mean: {mean}")
print(f"P-Value: {p_value}")
print(f"Z-Score: {z_score}")

# Hypothesis test decision
if p_value < alpha:
    print("Null hypothesis rejected: The sample mean is statistically significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the sample mean and the population mean.")

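Using statsmodels (note that ztest estimates the standard deviation from the sample instead of taking a known population variance)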

import numpy as np
from statsmodels.stats.weightstats import ztest

# Generate data
np.random.seed(42)
data = np.random.normal(loc=100, scale=10, size=30)

# Population mean
population_mean = 100

# Run the Z-test
z_score, p_value = ztest(data, value=population_mean)

print(f"Z-Score: {z_score}")
print(f"P-Value: {p_value}")

3.3. One sample T-Test

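Use it when the population variance is unknown.
Use scipy.stats.ttest_1samp instead of scipy.stats.norm.
alternative: 'two-sided' for a two-sided test, 'less' or 'greater' for a one-sided test
popmean: the population mean assumed under the null hypothesis (this value must be given)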
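
To see what the `alternative` argument changes, the sketch below runs the same sample through all three settings. The data are synthetic and the hypothesized mean of 100 is an arbitrary choice for illustration. The t-statistic stays the same; only the tail used for the p-value differs.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic sample (illustrative assumption).
rng = np.random.default_rng(1)
sample = rng.normal(loc=101, scale=2, size=30)

two = ttest_1samp(sample, popmean=100, alternative='two-sided')
greater = ttest_1samp(sample, popmean=100, alternative='greater')  # H1: mean > 100
less = ttest_1samp(sample, popmean=100, alternative='less')        # H1: mean < 100

print(f"t-statistic:       {two.statistic:.4f}")
print(f"two-sided p-value: {two.pvalue:.4f}")
print(f"'greater' p-value: {greater.pvalue:.4f}")
print(f"'less' p-value:    {less.pvalue:.4f}")
```

The two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them, so picking the wrong direction can turn a significant result into a p-value close to 1.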

import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# Generate sample data
np.random.seed(42)
data = {
    'A': np.random.normal(loc=100, scale=1, size=30)  # 30 samples from a normal distribution with mean 100, standard deviation 1
}
df = pd.DataFrame(data)

# Population mean
population_mean = 100

# Set the significance level
alpha = 0.05

# Run the t-test
t_statistic, p_value = ttest_1samp(df['A'], popmean=population_mean)

# Print the results
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")

# Hypothesis test decision
if p_value < alpha:
    print("Null hypothesis rejected: the sample mean is statistically different from the population mean.")
else:
    print("Fail to reject the null hypothesis: no significant difference between the sample mean and the population mean.")

3.4. Independent Two-sample T-Test

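Compares the means of two groups; only the two samples are needed.
Uses the scipy.stats.ttest_ind function.
equal_var=True : uses Student's t-test (assumes equal variances)
equal_var=False: uses Welch's t-test (does not assume equal variances)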
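
One common way to choose `equal_var` is to run a variance-equality test first. The sketch below uses `scipy.stats.levene` for that step; this workflow is an assumption added here rather than something the notes above prescribe, and the data are synthetic.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Synthetic groups with clearly different spreads (illustrative assumption).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=5, size=50)
group_b = rng.normal(loc=103, scale=15, size=50)

# Levene's test, H0: the two groups have equal variances.
lev_stat, lev_p = levene(group_a, group_b)

# Keep the equal-variance assumption only if Levene's test does not reject it.
equal_var = bool(lev_p >= 0.05)

t_stat, p_value = ttest_ind(group_a, group_b, equal_var=equal_var)

print(f"Levene p-value: {lev_p:.4f} -> equal_var={equal_var}")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")
```

When in doubt, `equal_var=False` is the safer default, since Welch's test stays valid when the variances happen to be equal.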

import numpy as np
from scipy.stats import ttest_ind

# Generate data
np.random.seed(42)
data1 = np.random.normal(loc=100, scale=10, size=50)  # mean 100, standard deviation 10, 50 samples
data2 = np.random.normal(loc=110, scale=10, size=50)  # mean 110, standard deviation 10, 50 samples

# Run the two-sample t-test (Welch's version)
t_statistic, p_value = ttest_ind(data1, data2, equal_var=False)

# Print the results
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")

# Hypothesis test decision
if p_value < 0.05:
    print("Null hypothesis rejected: there is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: no significant difference between the two groups.")

3.5. Mann-Whitney U Test

Use it when the Independent Two-sample T-Test cannot be used. It is a nonparametric method.
Use it when the data do not follow a normal distribution or contain outliers.
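
The decision described above can be sketched as a small helper that checks normality with Shapiro-Wilk and then picks the test. The function name `compare_two_groups`, the 0.05 threshold, and the choice of Welch's t-test for the normal case are all assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu, shapiro, ttest_ind


def compare_two_groups(x, y, alpha=0.05):
    """Pick a two-sample test from Shapiro-Wilk results.

    Returns (test_name, p_value).
    """
    normal_x = shapiro(x).pvalue >= alpha
    normal_y = shapiro(y).pvalue >= alpha
    if normal_x and normal_y:
        # Both groups look normal: use Welch's t-test.
        return "welch_t", ttest_ind(x, y, equal_var=False).pvalue
    # At least one group departs from normality: use the rank-based test.
    return "mann_whitney_u", mannwhitneyu(x, y, alternative='two-sided').pvalue


# Skewed synthetic data (illustrative assumption).
rng = np.random.default_rng(7)
skewed_a = rng.exponential(scale=1.0, size=100)
skewed_b = rng.exponential(scale=1.5, size=100)

test_name, p_value = compare_two_groups(skewed_a, skewed_b)
print(f"Selected test: {test_name}, P-Value: {p_value:.4f}")
```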

import numpy as np
from scipy.stats import mannwhitneyu

# Generate data
np.random.seed(42)
group1 = np.random.normal(50, 10, 100)
group2 = np.random.normal(55, 10, 100)

# Run the Mann-Whitney U test
u_statistic, p_value = mannwhitneyu(group1, group2, alternative='two-sided')

# Print the results
print(f"U-Statistic: {u_statistic}")
print(f"P-Value: {p_value}")

# Hypothesis test decision
if p_value < 0.05:
    print("Null hypothesis rejected: there is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: no significant difference between the two groups.")

3.6. Paired T-Test

import numpy as np
from scipy.stats import ttest_rel

# Generate data
np.random.seed(42)
before = np.random.normal(loc=60, scale=5, size=30)  # data before the treatment, mean 60
after = before + np.random.normal(loc=-1.5, scale=1.5, size=30)  # data after the treatment, slightly lower on average

# Run the paired t-test (one-sided, H1: mean(after - before) < 0)
# The first argument is the one tested as "less", so after must come first.
t_statistic, p_value = ttest_rel(after, before, alternative='less')

# Print the results
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")

# Hypothesis test decision
if p_value < 0.05:
    print("Null hypothesis rejected: There is a significant decrease from before to after.")
else:
    print("Fail to reject the null hypothesis: There is no significant decrease from before to after.")

3.7. One-Sample Chi-Square Test ( $ \chi^2 $-Test)

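The example uses a one-sided (upper-tail) test, i.e. the alternative is that the population variance is greater than the hypothesized value.
- The chi-square statistic is never negative, but a lower-tail or two-sided test is still possible when the alternative hypothesis calls for it.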

import numpy as np
from scipy.stats import chi2

# Example data
data = np.array([2.9, 3.0, 2.5, 3.2, 3.0, 2.7, 3.1, 2.8])

# Given population variance
population_variance = 0.25

n = len(data)

sample_variance = np.var(data, ddof=1)  # ddof=1 gives the sample variance

chi_square_stat = (n-1) * sample_variance / population_variance

# n-1 degrees of freedom; cdf turns the statistic into a probability; one-sided (upper-tail) test
p_value = 1 - chi2.cdf(chi_square_stat, n - 1)

print(p_value)

# Hypothesis test decision
if p_value < 0.05:
    print("Null hypothesis rejected: the population variance is significantly greater than the hypothesized value.")
else:
    print("Fail to reject the null hypothesis: no evidence that the population variance is greater than the hypothesized value.")
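
The upper-tail p-value only detects a variance that is larger than the hypothesized one. For the two-sided alternative $ \sigma^2_0 \neq \sigma^2 $, one common convention (an assumption here, other conventions exist) is to double the smaller tail probability. A sketch on the same data:

```python
import numpy as np
from scipy.stats import chi2

data = np.array([2.9, 3.0, 2.5, 3.2, 3.0, 2.7, 3.1, 2.8])
hypothesized_variance = 0.25

n = len(data)
stat = (n - 1) * np.var(data, ddof=1) / hypothesized_variance

lower_tail = chi2.cdf(stat, df=n - 1)  # P(chi2 <= stat): evidence of a smaller variance
upper_tail = chi2.sf(stat, df=n - 1)   # P(chi2 >= stat): evidence of a larger variance

# Two-sided p-value: double the smaller tail, capped at 1.
p_two_sided = min(1.0, 2 * min(lower_tail, upper_tail))

print(f"Chi-Square Statistic: {stat:.4f}")
print(f"Lower-tail p: {lower_tail:.4f}, upper-tail p: {upper_tail:.4f}")
print(f"Two-sided p: {p_two_sided:.4f}")
```

For this data the sample variance is well below 0.25, so the lower tail is the small one and the upper-tail test alone would miss the difference.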

3.8. F-Test for Equality of Variances

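The example uses a one-sided test of whether the variance of group1 is greater than that of group2 (a two-sided test may be appropriate depending on the situation).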

import numpy as np
from scipy.stats import f

# Generate example data
np.random.seed(42)  # fix the seed for reproducibility
group1 = np.random.normal(20, 5, 30)
group2 = np.random.normal(22, 10, 30)

var1 = np.var(group1, ddof=1)
var2 = np.var(group2, ddof=1)

# Compute the F statistic
f_stat = var1/var2

# Degrees of freedom
df1 = len(group1) - 1
df2 = len(group2) - 1

# One-sided (upper-tail) p-value, H1: var1 > var2
p_value = 1 - f.cdf(f_stat, df1, df2)

print(p_value)

# Hypothesis test decision
if p_value < 0.05:
    print("Null hypothesis rejected: the variance of group1 is significantly greater than that of group2.")
else:
    print("Fail to reject the null hypothesis: no evidence that the variance of group1 is greater than that of group2.")
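
For the two-sided alternative $ \sigma^2_1 \neq \sigma^2_2 $, a common approach is to double the smaller tail probability of the F statistic. The helper below is a sketch under that assumption; the name `f_test_two_sided` is hypothetical and scipy does not ship such a function.

```python
import numpy as np
from scipy.stats import f


def f_test_two_sided(x, y):
    """Return (F statistic, two-sided p-value) for H0: var(x) == var(y)."""
    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    df1, df2 = len(x) - 1, len(y) - 1
    p_lower = f.cdf(f_stat, df1, df2)
    p_upper = f.sf(f_stat, df1, df2)
    # Double the smaller tail, capped at 1.
    return f_stat, min(1.0, 2 * min(p_lower, p_upper))


# Synthetic groups (illustrative assumption).
rng = np.random.default_rng(42)
group1 = rng.normal(20, 5, 30)
group2 = rng.normal(22, 10, 30)

f_stat, p_value = f_test_two_sided(group1, group2)
f_stat_swapped, p_value_swapped = f_test_two_sided(group2, group1)

print(f"F-Statistic: {f_stat:.4f}, two-sided P-Value: {p_value:.4f}")
print(f"Swapped order: F={f_stat_swapped:.4f}, P-Value: {p_value_swapped:.4f}")
```

Unlike the one-sided version, the two-sided p-value does not depend on which group is placed in the numerator.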

3.9. Chi-Square Test of Independence

Evaluates whether two categorical variables are independent.

import numpy as np
from scipy.stats import chi2_contingency

# Generate example data (contingency table)
# For example, a contingency table of observed counts for two variables A and B
#        B1  B2
#   A1  [[10, 20],
#   A2   [20, 40]]

observed = np.array([[10, 20], [20, 40]])

# Run the chi-square test of independence
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-Square Statistic: {chi2_stat}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

# Hypothesis test decision
if p_value < 0.05:
    print("Null hypothesis rejected: the two variables are not independent.")
else:
    print("Fail to reject the null hypothesis: no evidence that the two variables are associated.")
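
When the data come as one row per observation rather than as a ready-made table, the contingency table can be built with `pd.crosstab` first. The column names `gender` and `choice` and the counts below are invented for illustration. `correction=False` turns off the Yates continuity correction that `chi2_contingency` applies to 2x2 tables by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# One row per observation (hypothetical columns and values).
df = pd.DataFrame({
    "gender": ["M"] * 30 + ["F"] * 30,
    "choice": ["yes"] * 20 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 20,
})

# Count each (gender, choice) combination.
table = pd.crosstab(df["gender"], df["choice"])
print(table)

chi2_stat, p_value, dof, expected = chi2_contingency(table.to_numpy(), correction=False)

print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
```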

License: Attribution, NonCommercial, NoDerivatives
