[Kaggle Course] Pipelines + How to make and submit CSV in Kaggle


WakaraNai 2020. 10. 21. 15:03

Pipeline

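A pipeline bundles the data preprocessing steps (and, optionally, the model) so that they run as a single unit.

It makes it easy to handle data that contains both columns with missing values and categorical columns.

Benefits of a Pipeline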

1. Cleaner Code

Modifying the data by hand at every preprocessing step can get messy. With a pipeline, you do not have to apply each preprocessing step separately to the train data and the validation data.

2. Fewer Bugs

You are less likely to forget or skip a preprocessing step, or to apply one incorrectly.

3. Easier to Productionize

Turning a prototype into a production-ready model is genuinely hard. This post cannot cover all of those difficulties, but a pipeline is one thing that helps.

4. More Options for Model Validation

See the cross-validation post for an example.
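
As a minimal sketch of this point, a pipeline can be passed directly to scikit-learn's `cross_val_score`, so the preprocessing is re-fit on the training part of every fold. The toy data and the column name `x` are made up for illustration and are not from the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one numerical feature with a few missing values.
rng = np.random.RandomState(0)
X = pd.DataFrame({'x': rng.rand(50)})
y = 3 * X['x'] + rng.normal(scale=0.1, size=50)
X.iloc[::10, 0] = np.nan  # knock out every 10th value

cv_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negative MAE, so flip the sign to get the MAE per fold.
scores = -1 * cross_val_score(cv_pipeline, X, y, cv=5,
                              scoring='neg_mean_absolute_error')
print('MAE per fold:', scores)
print('Mean MAE:', scores.mean())
```

Because the imputer lives inside the pipeline, each fold computes its fill value from that fold's training rows only.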

Example

1st. Define Preprocessing Steps

Just as a Pipeline bundles preprocessing and modeling steps together, the "ColumnTransformer" class bundles different preprocessing steps together.

Numerical data: fill in (impute) missing values -> SimpleImputer
Categorical data: impute missing values, then apply one-hot encoding -> SimpleImputer + OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
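
The code above assumes that `numerical_cols` and `categorical_cols` already exist. Below is a minimal sketch of one way to build such lists by dtype and of what the preprocessor returns. The toy DataFrame, its column names, and the `toy_` variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in both kinds of column.
toy = pd.DataFrame({
    'area': [50.0, np.nan, 80.0, 65.0],
    'rooms': [2, 3, 4, 3],
    'zone': pd.Series(['A', 'B', np.nan, 'A'], dtype=object),
})

# Pick columns by dtype (an assumption; an explicit list works just as well).
toy_numerical_cols = toy.select_dtypes(include='number').columns.tolist()
toy_categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='constant'), toy_numerical_cols),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), toy_categorical_cols),
])

transformed = toy_preprocessor.fit_transform(toy)
# 2 numerical columns + 2 one-hot columns (zones A and B)
print(transformed.shape)
```

Selecting by dtype is only one option. The important part is that every column you want to keep appears in exactly one of the two lists.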

2nd. Define the Model (with RandomForestRegressor)

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
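# random_state only fixes the random seed so that results are reproducible.
# Changing it (e.g. to 1) shifts the MAE slightly, but it is not a real tuning knob.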

- n_estimators: the number of trees in the forest
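
To make the two parameters concrete, here is a small sketch on synthetic data (made up for illustration): `n_estimators` sets how many trees are fitted, and `random_state` fixes the random seed, so the same seed on the same data reproduces the same predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, purely illustrative.
rng = np.random.RandomState(42)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

forest_a = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)
forest_b = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# One fitted tree per estimator.
n_trees = len(forest_a.estimators_)
print('Number of trees:', n_trees)

# Same seed and same data give identical predictions.
same_predictions = np.allclose(forest_a.predict(X_demo), forest_b.predict(X_demo))
print('Reproducible with the same seed:', same_predictions)
```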

3rd. Create and Evaluate the Pipeline

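Finally, we use the "Pipeline" class to bundle preprocessing and modeling into one object!

Note the following points:

With the pipeline, preprocessing the train data and fitting the model takes a single line of code. (Without a pipeline, imputation, one-hot encoding, and model training each have to be done as separate steps, which gets especially messy when you must handle numerical and categorical variables at the same time.)
With the pipeline, you can supply the unprocessed features (columns) in X_valid directly to predict(). The pipeline automatically preprocesses the features before generating predictions. (Without a pipeline, remember that you must preprocess the validation data yourself before predicting!)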

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

Pass the preprocessor and the model to Pipeline() to bundle them.
Fit the resulting pipeline on X_train and y_train.
Then call predict() on the pipeline with X_valid.
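
To see what the pipeline saves you, the sketch below runs the same steps by hand and then through a pipeline, and checks that both give the same predictions. The toy data, the column names `a` and `b`, and the use of a plain mean imputer are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy split; column names are invented for this sketch.
rng = np.random.RandomState(0)
X_all = pd.DataFrame({'a': rng.rand(60), 'b': rng.rand(60)})
y_all = 2 * X_all['a'] - X_all['b']
X_all.iloc[::7, 0] = np.nan  # introduce some missing values
X_tr, X_va = X_all.iloc[:40], X_all.iloc[40:]
y_tr = y_all.iloc[:40]

# Without a pipeline: fit the imputer on train, then transform both sets by hand.
imputer = SimpleImputer(strategy='mean')
X_tr_imputed = imputer.fit_transform(X_tr)
X_va_imputed = imputer.transform(X_va)  # easy to forget this step
manual_model = RandomForestRegressor(n_estimators=50, random_state=0)
manual_model.fit(X_tr_imputed, y_tr)
manual_preds = manual_model.predict(X_va_imputed)

# With a pipeline: the same steps run inside fit() and predict().
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipe.fit(X_tr, y_tr)
pipe_preds = pipe.predict(X_va)

print('Same predictions:', np.allclose(manual_preds, pipe_preds))
```

The results match, but the manual version needs you to remember the transform call for every new dataset, which is exactly the kind of step the pipeline makes hard to skip.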

4th. Generate test predictions

The model is trained on the train data.

For the submission, however, the predictions come from

feeding the test data, not the train data, into the trained model!

preds_test = my_pipeline.predict(X_test)

5th. Submit your results: save the results to a CSV file

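import pandas as pd

# Save test predictions to file
# (using X_test.index as the Id assumes the data was loaded with the Id column as its index)
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)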
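
Before uploading, it can help to read the file back and confirm that it has the expected columns and one row per test sample. A minimal sketch, using made-up IDs and predictions in place of `X_test.index` and `preds_test`:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test.index and preds_test.
test_ids = pd.Index([1461, 1462, 1463], name='Id')
fake_preds = np.array([120000.0, 155000.5, 98000.0])

submission = pd.DataFrame({'Id': test_ids, 'SalePrice': fake_preds})

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'submission.csv')
    submission.to_csv(csv_path, index=False)
    check = pd.read_csv(csv_path)

print(check.columns.tolist())
print(len(check))
```

Passing `index=False` keeps pandas from writing its own row index as an extra unnamed column.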

License: Attribution, NonCommercial, ShareAlike (CC BY-NC-SA)
