
2. Statistical Methods

Minwoo 2019. 12. 27.

01. Statistics and Machine Learning

1) Data preparation

*Outlier detection

*Missing value imputation

*Data scaling/sampling

*Variable Encoding

↓

The tasks above can be handled by examining data distributions and statistical visualizations (graphs, charts).
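The preparation steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended pipeline: the data values are made up, and the choices of median imputation, the 1.5 * IQR outlier rule, and standardization are assumptions picked for brevity.

```python
import statistics

# Hypothetical feature column; None marks a missing value.
raw = [12.0, 11.5, None, 12.3, 11.9, 45.0, 12.1, None, 11.8]

# 1) Missing value imputation: fill gaps with the median of the observed values.
observed = [x for x in raw if x is not None]
median = statistics.median(observed)
imputed = [median if x is None else x for x in raw]

# 2) Outlier detection: flag points more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if x < low or x > high]
cleaned = [x for x in imputed if low <= x <= high]

# 3) Scaling: standardize to zero mean and unit standard deviation.
mean = statistics.mean(cleaned)
std = statistics.pstdev(cleaned)
scaled = [(x - mean) / std for x in cleaned]

print('median used for imputation:', median)
print('outliers:', outliers)
```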

2) Model evaluation

*Data sampling/resampling (e.g., k-fold cross-validation)

*Experimental Design
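As a rough sketch of the resampling idea, k-fold cross-validation splits the data into k folds and uses each fold once as the test set while training on the rest. The helper name `k_fold_indices` and its parameters are hypothetical, written only for this illustration.

```python
import random

def k_fold_indices(n_samples, k, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled indices so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print('train=%s test=%s' % (sorted(train_idx), sorted(test_idx)))
```

Every sample appears in exactly one test fold, so each observation is used for evaluation once and for training k-1 times.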

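3) Model selection

*Checking and quantifying the differences between results helps when choosing the final model among candidates or when choosing a model's configuration.

*Statistical hypothesis tests are set up to check these differences.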

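4) Model presentation

Statistical methods can be used when presenting the final results.

*Summarize the model's average expected performance.

*Quantify and present the expected variability of that performance.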
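A small sketch of such a summary follows. The scores are invented for illustration, and reporting the sample standard deviation together with the standard error of the mean is just one reasonable convention.

```python
import statistics

# Hypothetical accuracy scores from repeated evaluation (e.g. 10 folds).
scores = [0.86, 0.88, 0.91, 0.87, 0.89, 0.90, 0.85, 0.88, 0.92, 0.84]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)   # sample standard deviation
sem = std_score / len(scores) ** 0.5   # standard error of the mean

print('accuracy: %.3f (std %.3f, standard error %.3f)'
      % (mean_score, std_score, sem))
```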

5) Prediction

*The final model makes predictions on new data. The variability of those predictions is also estimated, i.e. how far they may differ from the actual outcomes.

02. Introduction to Statistics.

1) Descriptive Statistics:

Interpreting observations from a statistical point of view yields useful information about them.

↓

Summarizing the data

2) Inferential Statistics:

Quantify properties of a domain or population from sample data.

↓

Drawing conclusions from the data

03. Gaussian Distribution

Gaussian Distribution = normal distribution = bell-shaped distribution

The Gaussian distribution has two parameters.

*Mean: the location of the peak of the bell.

*Variance: how far the observations are spread out from the mean.
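To see the two parameters at work, the sketch below draws a sample from a Gaussian and recovers its mean and variance. The values `mu = 50` and `sigma = 5` are arbitrary assumptions, and the roughly 68% of observations within one standard deviation is a well-known property of the distribution.

```python
import random
import statistics

rng = random.Random(1)
mu, sigma = 50.0, 5.0  # hypothetical mean and standard deviation

# Draw a large sample from the Gaussian distribution.
sample = [rng.gauss(mu, sigma) for _ in range(10000)]

sample_mean = statistics.mean(sample)
sample_var = statistics.variance(sample)  # should be close to sigma ** 2

# Fraction of observations within one standard deviation of the mean
# (about 68% is expected for a Gaussian).
within_one_sd = sum(abs(x - mu) <= sigma for x in sample) / len(sample)

print('mean=%.3f variance=%.3f' % (sample_mean, sample_var))
print('within 1 std: %.3f' % within_one_sd)
```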

04. Correlation (Correlation Coefficient)

*Correlation between two variables

-Positive Correlation

-Neutral Correlation

-Negative Correlation

A positive or negative correlation means that there is some relationship between the two variables.
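The three cases can be illustrated with Pearson's correlation coefficient, computed here by hand. Pearson's coefficient is an assumption (the notes do not name a specific coefficient), and the data are made up so that each case is easy to recognize.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5, 6]
positive = [2, 4, 5, 4, 7, 8]   # tends to rise with x
negative = [9, 7, 6, 6, 3, 1]   # tends to fall as x rises
neutral = [3, 5, 4, 4, 5, 3]    # no linear trend with x

for name, y in [('positive', positive), ('negative', negative), ('neutral', neutral)]:
    print('%s: r=%.3f' % (name, pearson(x, y)))
```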

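*Multicollinearity

-A problem in regression where the independent variables are strongly correlated with each other.

-High correlation between independent variables (|corr| >= 0.9, up to 1) → they are not independent, i.e. the underlying assumption is violated.

-Remedies: remove some of the highly correlated independent variables, transform the variables, or use PCA, whose components are uncorrelated (diagonal covariance matrix), to remove the collinearity.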
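A simple screen for multicollinearity is to compute pairwise correlations between the independent variables and flag pairs above a threshold. The sketch below assumes a threshold of 0.9 and invented feature names; it shows only the first remedy (dropping one variable of each flagged pair), not PCA.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical independent variables: the two size columns measure the
# same quantity in different units, so they are almost perfectly correlated.
features = {
    'size_m2': [50, 65, 80, 95, 120, 150],
    'size_ft2': [538, 700, 861, 1023, 1292, 1615],
    'age_years': [30, 5, 18, 42, 11, 25],
}

threshold = 0.9
collinear_pairs = [
    (a, b) for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) >= threshold
]

# Remedy sketch: drop the second variable of every flagged pair.
to_drop = {b for _, b in collinear_pairs}
reduced = {name: col for name, col in features.items() if name not in to_drop}

print('collinear pairs:', collinear_pairs)
print('kept variables:', list(reduced))
```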

05. Hypothesis Tests.

-Hypothesis 0 (H0), the null hypothesis:

The assumption of the test holds and fails to be rejected.

-Hypothesis 1 (H1), the alternative hypothesis:

The assumption of the test does not hold and is rejected at some level of significance.

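*The result of a hypothesis test is interpreted using the p-value.

-p-value: P(data at least as extreme as observed | null hypothesis is true)

*Example of a hypothesis test

-t-test: compares the means of two independent samples → assumes both samples are Gaussian with the same variance.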
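The sketch below computes Student's t statistic with a pooled variance, matching the equal-variance assumption above. Because the standard library has no t-distribution, the p-value is approximated with a permutation test instead of the usual t-distribution lookup; that substitution, the function names, and the synthetic data are all assumptions made for this illustration.

```python
import math
import random
import statistics

def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples, equal variances assumed."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))

def permutation_p_value(a, b, n_permutations=2000, seed=1):
    """Two-sided p-value: share of label shuffles with |t| >= the observed |t|."""
    rng = random.Random(seed)
    observed = abs(pooled_t_statistic(a, b))
    combined = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(combined)
        t = pooled_t_statistic(combined[:len(a)], combined[len(a):])
        if abs(t) >= observed:
            count += 1
    # Add-one correction so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)

rng = random.Random(1)
data1 = [rng.gauss(50, 5) for _ in range(100)]
data2 = [rng.gauss(55, 5) for _ in range(100)]  # mean shifted by 5

t_stat = pooled_t_statistic(data1, data2)
p = permutation_p_value(data1, data2)
print('t=%.3f, p=%.4f' % (t_stat, p))
```

The permutation count directly mirrors the definition of the p-value: it estimates how often data at least as extreme as the observed data would arise if the null hypothesis (no difference between the groups) were true.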

06. Estimation of Statistics.

※ Effect Size. Methods for quantifying the size of an effect given a treatment or intervention.

※ Interval Estimation. Methods for quantifying the amount of uncertainty in a value.

※ Meta-Analysis. Methods for quantifying the findings across multiple similar studies.

↓

Of these, the most useful is 'Interval Estimation'.

*Main types of intervals:

※Tolerance Interval: The bounds or coverage of a proportion of a distribution with a specific level of confidence.

※Confidence Interval: The bounds on the estimate of a population parameter.

※Prediction Interval: The bounds on a single observation.

The example below demonstrates the proportion_confint() function from statsmodels in a hypothetical case where a model made 88 correct predictions on a dataset with 100 instances and we are interested in the 95% confidence interval (provided to the function as a significance level of 0.05).

# calculate the confidence interval

from statsmodels.stats.proportion import proportion_confint

# calculate the interval

lower, upper = proportion_confint(88, 100, 0.05)

print('lower=%.3f, upper=%.3f' % (lower, upper))
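For intuition, the same interval can be worked out by hand with the normal approximation to the binomial, p ± z * sqrt(p * (1 - p) / n). As far as I know this is the default method of proportion_confint(), but treat that as an assumption; the helper name `normal_approx_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def normal_approx_interval(successes, n, alpha=0.05):
    """Confidence interval for a proportion via the normal approximation."""
    p_hat = successes / n
    # Critical value of the standard normal, about 1.96 for alpha = 0.05.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lower, upper = normal_approx_interval(88, 100, 0.05)
print('lower=%.3f, upper=%.3f' % (lower, upper))
# prints: lower=0.816, upper=0.944
```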

07. Nonparametric Statistics.

Q. What should we do when the data is not Gaussian?

Let's look at nonparametric statistical methods, which need no information about the distribution

(distribution-free methods)

-Data → rank format (rank statistics)

*Procedure

1. Sort the data.

2. Assign a rank to each data point.
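The two steps can be sketched as follows. Giving tied values the average of their ranks is an assumption (the notes do not say how ties are handled), though it is a common convention in rank statistics; the function name `rank_data` is hypothetical.

```python
def rank_data(values):
    """Return ranks starting at 1; tied values share the average of their ranks."""
    # Step 1: sort the positions by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    # Step 2: walk through the sorted order and assign ranks.
    while i < len(order):
        j = i
        # Extend j to cover a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

data = [3.1, 1.2, 5.8, 1.2, 4.4]
print(rank_data(data))
# prints: [3.0, 1.5, 5.0, 1.5, 4.0]
```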

*Example of a nonparametric statistical hypothesis test

-Mann-Whitney U test: checks for a difference between two independent samples

→ similar to Student's t-test, but does not assume a Gaussian distribution

# example of the mann-whitney u test

from numpy.random import seed

from numpy.random import rand

from scipy.stats import mannwhitneyu

# seed the random number generator

seed(1)

# generate two independent samples

data1 = 50 + (rand(100) * 10)

data2 = 51 + (rand(100) * 10)

# compare samples

stat, p = mannwhitneyu(data1, data2, alternative='two-sided')

print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret

alpha = 0.05

if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

# The example above calculates the test statistic and interprets the p-value.
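To show what the test statistic measures, the sketch below computes the U statistic by direct pair counting on a tiny made-up dataset. The function name `mann_whitney_u` is hypothetical, and which of the two U values a library reports as "the" statistic can differ between implementations and versions.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U1, U2); U1 counts pairs where a sample1 value exceeds a sample2 value."""
    u1 = 0.0
    for a in sample1:
        for b in sample2:
            if a > b:
                u1 += 1.0
            elif a == b:
                u1 += 0.5  # ties count as half
    # The two statistics always sum to len(sample1) * len(sample2).
    u2 = len(sample1) * len(sample2) - u1
    return u1, u2

a = [1.1, 2.3, 3.0, 4.8]
b = [2.9, 5.2, 6.1, 7.4]
u1, u2 = mann_whitney_u(a, b)
print('U1=%.1f, U2=%.1f' % (u1, u2))
# prints: U1=2.0, U2=14.0
```

A U1 far from half of len(sample1) * len(sample2) (here 8) suggests that one sample tends to take larger values than the other.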
