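An outlier is a value that deviates substantially from the mean and is clearly distinguishable from the other observations. For accurate modeling, outliers need to be handled, for example by deleting them or by Winsorizing.
1) Deletion
Deleting values requires a criterion. For example, any value more than 2 sigma (two standard deviations) away from the mean can be treated as an outlier and deleted.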
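import FinanceDataReader as fdr
if __name__ == '__main__':
    # Download AAPL daily data from 2020-10-01 onward
    aapl = fdr.DataReader('AAPL', '2020-10-01')
    # Keep only the daily rate-of-change column
    aapl = aapl[['Change']]
    # Keep rows within mean +/- 2 * std; rows outside become NaN and are dropped
    print(aapl[(aapl > (aapl[['Change']].mean() - 2 * aapl[['Change']].std())) &
               (aapl < (aapl[['Change']].mean() + 2 * aapl[['Change']].std()))].dropna())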
Output
Change
Date
2020-10-01 0.0085
2020-10-02 -0.0323
2020-10-05 0.0308
2020-10-06 -0.0287
2020-10-07 0.0170
2020-10-08 -0.0010
2020-10-09 0.0174
2020-10-13 -0.0265
2020-10-14 0.0007
2020-10-15 -0.0040
2020-10-16 -0.0140
2020-10-19 -0.0255
2020-10-20 0.0132
2020-10-21 -0.0054
2020-10-22 -0.0096
2020-10-23 -0.0061
2020-10-26 0.0001
2020-10-27 0.0135
2020-10-29 0.0371
2020-11-02 -0.0008
2020-11-03 0.0154
2020-11-04 0.0408
2020-11-05 0.0355
2020-11-06 -0.0029
2020-11-09 -0.0200
2020-11-10 -0.0030
2020-11-11 0.0304
2020-11-12 -0.0023
2020-11-13 0.0004
2020-11-16 0.0087
2020-11-17 -0.0076
2020-11-18 -0.0114
2020-11-19 0.0052
2020-11-20 -0.0110
2020-11-23 -0.0297
2020-11-24 0.0116
2020-11-25 0.0075
2020-11-27 0.0048
2020-11-30 0.0211
2020-12-01 0.0308
2020-12-02 0.0029
2020-12-03 -0.0011
2020-12-04 -0.0056
2020-12-07 0.0123
2020-12-08 0.0051
2020-12-09 -0.0209
2020-12-10 0.0120
2020-12-11 -0.0067
2020-12-14 -0.0051
2020-12-16 -0.0005
To see the outliers that were removed, invert the condition:
print(aapl[(aapl < (aapl[['Change']].mean() - 2 * aapl[['Change']].std())) |
           (aapl > (aapl[['Change']].mean() + 2 * aapl[['Change']].std()))].dropna())
Output
Change
Date
2020-10-12 0.0635
2020-10-28 -0.0463
2020-10-30 -0.0560
2020-12-15 0.0501
We can see that the values more than 2 sigma away from the mean (outside the range -0.042898 to 0.047061) were removed.
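The 2-sigma rule above can also be wrapped in a small helper that returns the kept rows and the removed rows together. The following is only a minimal sketch: the function name `split_by_sigma` and the sample values are made up for illustration and are not real market data.

```python
import pandas as pd


def split_by_sigma(values: pd.Series, k: float = 2.0):
    """Split a series into (inliers, outliers, bounds) using mean +/- k * std.

    Hypothetical helper for illustration only.
    """
    mean = values.mean()
    std = values.std()  # pandas defaults to the sample standard deviation (ddof=1)
    lower, upper = mean - k * std, mean + k * std
    # Strict inequalities, matching the filter used in the article
    is_inlier = (values > lower) & (values < upper)
    return values[is_inlier], values[~is_inlier], (lower, upper)


# Made-up daily changes; 0.12 is deliberately far from the rest
returns = pd.Series([0.01, -0.02, 0.015, 0.0, -0.01, 0.005, 0.12, -0.005, 0.02, -0.015])
kept, removed, bounds = split_by_sigma(returns)
print(bounds)
print(removed)
```

Note that a single extreme value inflates both the mean and the standard deviation, so the bounds themselves shift toward the outlier; this is one reason the threshold (here `k = 2`) is a judgment call rather than a fixed rule.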
2) Winsorizing
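Winsorizing replaces a specified fraction of the most extreme values with the nearest less extreme value, instead of deleting them, so the number of rows stays the same. SciPy provides this as a function.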
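from scipy.stats.mstats import winsorize
import pandas as pd
# Replace the lowest 5% and the highest 5% of values with the nearest remaining value
winsorized = winsorize(aapl, limits=[0.05, 0.05])
print(pd.DataFrame(winsorized, index=aapl.index, columns=aapl.columns))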
Output
Change
Date
2020-10-01 0.0085
2020-10-02 -0.0323
2020-10-05 0.0308
2020-10-06 -0.0287
2020-10-07 0.0170
2020-10-08 -0.0010
2020-10-09 0.0174
2020-10-12 0.0408
2020-10-13 -0.0265
2020-10-14 0.0007
2020-10-15 -0.0040
2020-10-16 -0.0140
2020-10-19 -0.0255
2020-10-20 0.0132
2020-10-21 -0.0054
2020-10-22 -0.0096
2020-10-23 -0.0061
2020-10-26 0.0001
2020-10-27 0.0135
2020-10-28 -0.0323
2020-10-29 0.0371
2020-10-30 -0.0323
2020-11-02 -0.0008
2020-11-03 0.0154
2020-11-04 0.0408
2020-11-05 0.0355
2020-11-06 -0.0029
2020-11-09 -0.0200
2020-11-10 -0.0030
2020-11-11 0.0304
2020-11-12 -0.0023
2020-11-13 0.0004
2020-11-16 0.0087
2020-11-17 -0.0076
2020-11-18 -0.0114
2020-11-19 0.0052
2020-11-20 -0.0110
2020-11-23 -0.0297
2020-11-24 0.0116
2020-11-25 0.0075
2020-11-27 0.0048
2020-11-30 0.0211
2020-12-01 0.0308
2020-12-02 0.0029
2020-12-03 -0.0011
2020-12-04 -0.0056
2020-12-07 0.0123
2020-12-08 0.0051
2020-12-09 -0.0209
2020-12-10 0.0120
2020-12-11 -0.0067
2020-12-14 -0.0051
2020-12-15 0.0408
2020-12-16 -0.0005
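We can see that the values in the extreme 5% at both ends were replaced: with 54 rows, that is the two smallest and the two largest values, which became -0.0323 and 0.0408 respectively.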
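To make the replacement step concrete, here is a rough NumPy-only sketch of what Winsorizing does to a one-dimensional array. The function `winsorize_sketch` and the sample data are hypothetical. It assumes the number of values replaced on each side is the limit times the array length, truncated to an integer, which is how the default settings of `scipy.stats.mstats.winsorize` appear to behave; it ignores NaN handling and the other options of the real function.

```python
import numpy as np


def winsorize_sketch(values, lower=0.05, upper=0.05):
    """Illustrative Winsorizing for a 1-D array (hypothetical helper)."""
    arr = np.asarray(values, dtype=float).copy()
    n = arr.size
    order = np.argsort(arr)      # indices from smallest to largest value
    k_low = int(lower * n)       # how many values to replace at the low end
    k_high = int(upper * n)      # how many values to replace at the high end
    if k_low > 0:
        # Replace the k_low smallest values with the next smallest one
        arr[order[:k_low]] = arr[order[k_low]]
    if k_high > 0:
        # Replace the k_high largest values with the next largest one
        arr[order[n - k_high:]] = arr[order[n - k_high - 1]]
    return arr


# Made-up data with one extreme value at each end
data = [3, -40, 5, 7, 1, 100, 2, 8, 4, 6]
result = winsorize_sketch(data, lower=0.1, upper=0.1)
print(result)
```

With 10 values and limits of 0.1, one value is replaced on each side: -40 becomes 1 and 100 becomes 8, while every other value and the original ordering are left untouched.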