
[Kaggle Courses] Basic Data Exploration - Ex.MelbourneHomePrice

WakaraNai 2020. 9. 22. 01:19

Predicting New House Prices in Melbourne

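Let's build a model that predicts a house's Price from ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']. (The column names 'Lattitude' and 'Longtitude' are spelled this way in the dataset itself.)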

In [6]:
import pandas as pd  # pandas provides the DataFrame, a table structure similar to a SQL table

melbourne_file_path = r"C:\Users\32mou\Desktop\melb_data.csv\melb_data.csv"
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.describe()
# Checking for missing values is important: compare the count row across columns
Out[6]:
  Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 13580.000000 1.358000e+04 13580.000000 13580.000000 13580.000000 13580.000000 13518.000000 13580.000000 7130.000000 8205.000000 13580.000000 13580.000000 13580.000000
mean 2.937997 1.075684e+06 10.137776 3105.301915 2.914728 1.534242 1.610075 558.416127 151.967650 1964.684217 -37.809203 144.995216 7454.417378
std 0.955748 6.393107e+05 5.868725 90.676964 0.965921 0.691712 0.962634 3990.669241 541.014538 37.273762 0.079260 0.103916 4378.581772
min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1196.000000 -38.182550 144.431810 249.000000
25% 2.000000 6.500000e+05 6.100000 3044.000000 2.000000 1.000000 1.000000 177.000000 93.000000 1940.000000 -37.856822 144.929600 4380.000000
50% 3.000000 9.030000e+05 9.200000 3084.000000 3.000000 1.000000 2.000000 440.000000 126.000000 1970.000000 -37.802355 145.000100 6555.000000
75% 3.000000 1.330000e+06 13.000000 3148.000000 3.000000 2.000000 2.000000 651.000000 174.000000 1999.000000 -37.756400 145.058305 10331.000000
max 10.000000 9.000000e+06 48.100000 3977.000000 20.000000 8.000000 10.000000 433014.000000 44515.000000 2018.000000 -37.408530 145.526350 21650.000000
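
The count row above already hints at missing data: Car, BuildingArea, and YearBuilt have fewer than 13580 entries. As a minimal sketch, missing values can also be counted per column directly. The small DataFrame below is hypothetical (its values are made up, not taken from melb_data.csv), so that the example runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for melbourne_data; values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 0.0],
    "BuildingArea": [79.0, np.nan, np.nan, 210.0],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the real data, the same call would be melbourne_data.isnull().sum().
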
In [7]:
melbourne_data.columns
Out[7]:
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
In [9]:
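melbourne_data.dropna(axis=0)  # returns a new DataFrame without rows that contain missing values; melbourne_data itself is unchanged because the result is not assigned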
Out[9]:
  Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
1 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 4/02/2016 2.5 3067.0 ... 1.0 0.0 156.0 79.00 1900.0 Yarra -37.80790 144.99340 Northern Metropolitan 4019.0
2 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 4/03/2017 2.5 3067.0 ... 2.0 0.0 134.0 150.00 1900.0 Yarra -37.80930 144.99440 Northern Metropolitan 4019.0
4 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 4/06/2016 2.5 3067.0 ... 1.0 2.0 120.0 142.00 2014.0 Yarra -37.80720 144.99410 Northern Metropolitan 4019.0
6 Abbotsford 124 Yarra St 3 h 1876000.0 S Nelson 7/05/2016 2.5 3067.0 ... 2.0 0.0 245.0 210.00 1910.0 Yarra -37.80240 144.99930 Northern Metropolitan 4019.0
7 Abbotsford 98 Charles St 2 h 1636000.0 S Nelson 8/10/2016 2.5 3067.0 ... 1.0 2.0 256.0 107.00 1890.0 Yarra -37.80600 144.99540 Northern Metropolitan 4019.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12205 Whittlesea 30 Sherwin St 3 h 601000.0 S Ray 29/07/2017 35.5 3757.0 ... 2.0 1.0 972.0 149.00 1996.0 Whittlesea -37.51232 145.13282 Northern Victoria 2170.0
12206 Williamstown 75 Cecil St 3 h 1050000.0 VB Williams 29/07/2017 6.8 3016.0 ... 1.0 0.0 179.0 115.00 1890.0 Hobsons Bay -37.86558 144.90474 Western Metropolitan 6380.0
12207 Williamstown 2/29 Dover Rd 1 u 385000.0 SP Williams 29/07/2017 6.8 3016.0 ... 1.0 1.0 0.0 35.64 1967.0 Hobsons Bay -37.85588 144.89936 Western Metropolitan 6380.0
12209 Windsor 201/152 Peel St 2 u 560000.0 PI hockingstuart 29/07/2017 4.6 3181.0 ... 1.0 1.0 0.0 61.60 2012.0 Stonnington -37.85581 144.99025 Southern Metropolitan 4380.0
12212 Yarraville 54 Pentland Pde 6 h 2450000.0 VB Village 29/07/2017 6.3 3013.0 ... 3.0 2.0 1087.0 388.50 1920.0 Maribyrnong -37.81038 144.89389 Western Metropolitan 6543.0

6196 rows × 21 columns
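
Note that dropna returns a new DataFrame and leaves the original untouched, which is why the later cells still report 13580 rows. A small sketch of this behavior on hypothetical, made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
})

# dropna(axis=0) returns a new DataFrame without the rows that contain NaN.
cleaned = toy.dropna(axis=0)

print(len(toy))      # the original still has every row
print(len(cleaned))  # only the complete rows remain
```

To keep the cleaned data, the result has to be assigned to a variable, as with cleaned here.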

 

Selecting the Prediction Target

Dot-Notation

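Select the prediction target (y) with dot notation: dataframe.column_name. This pulls a single column out of the DataFrame.

Features == columns: the features are the columns that are fed into the model as inputs.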

In [10]:
y = melbourne_data.Price
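
As a small illustration, dot notation and bracket notation select the same column. The DataFrame below is a hypothetical stand-in with made-up rows:

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({"Rooms": [2, 3], "Price": [1035000.0, 1465000.0]})

y_dot = toy.Price          # dot notation
y_bracket = toy["Price"]   # bracket notation, which also works for names containing spaces

# Both forms return the same pandas Series.
print(y_dot.equals(y_bracket))
```

Dot notation is shorter, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute.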
 

Choosing "Features"

In [17]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()
Out[17]:
               Rooms       Bathroom       Landsize      Lattitude     Longtitude
count   13580.000000   13580.000000   13580.000000   13580.000000   13580.000000
mean        2.937997       1.534242     558.416127     -37.809203     144.995216
std         0.955748       0.691712    3990.669241       0.079260       0.103916
min         1.000000       0.000000       0.000000     -38.182550     144.431810
25%         2.000000       1.000000     177.000000     -37.856822     144.929600
50%         3.000000       1.000000     440.000000     -37.802355     145.000100
75%         3.000000       2.000000     651.000000     -37.756400     145.058305
max        10.000000       8.000000  433014.000000     -37.408530     145.526350
In [16]:
X.head()
Out[16]:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
 

Building Your Model

scikit-learn (sklearn)

The steps to building and using a model:

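  1. Define: What type of model will it be? A decision tree, or something else? The parameters for the chosen model type are also specified at this step.

  2. Fit: Capture patterns from the data. This is the most important step.

  3. Predict

  4. Evaluate: Check how accurate the model's predictions are.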
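
The four steps above can be sketched end to end on a tiny example. The data below is hypothetical and made up, and mean_absolute_error is one possible choice of evaluation metric, assumed here for illustration; the post itself does not name a metric:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; values are made up for illustration.
toy_X = pd.DataFrame({
    "Rooms": [1, 2, 3, 4],
    "Landsize": [0.0, 150.0, 300.0, 600.0],
})
toy_y = pd.Series([400000.0, 700000.0, 1000000.0, 1500000.0])

# 1. Define: choose the model type and its parameters.
model = DecisionTreeRegressor(random_state=1)

# 2. Fit: learn patterns from the features and the target.
model.fit(toy_X, toy_y)

# 3. Predict: here, on the same rows the model was fitted on.
predictions = model.predict(toy_X)

# 4. Evaluate: average absolute gap between actual and predicted prices.
mae = mean_absolute_error(toy_y, predictions)
print(mae)
```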

In [18]:
from sklearn.tree import DecisionTreeRegressor

# Define the model
# Specify random_state so that every run produces the same result
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit the model
melbourne_model.fit(X, y)
Out[18]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
 

Before predicting on all of the data, let's use just the first few rows of the training data above to check that predict works.

In [19]:
print("Making predictions for the following 5 houses:")
print(X.head())
 
Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
 

Compare the actual values with the predicted values to see how well the model predicted.

  • Actual values: y.head()
  • Predicted values: melbourne_model.predict(X.head())
In [20]:
print("The predictions are")
print(melbourne_model.predict(X.head()))
 
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]
In [21]:
y.head()
Out[21]:
0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64
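
The predictions match the actual prices exactly because these five houses were part of the data the model was fitted on, so this check says little about how the model would do on new houses. A rough sketch of a fairer check, assuming a held-out validation split via train_test_split and mean_absolute_error as the metric (neither appears in the post above), on hypothetical synthetic data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: price grows with rooms and land size, plus noise.
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=200)
landsize = rng.uniform(0, 1000, size=200)
X_toy = np.column_stack([rooms, landsize])
y_toy = 300000 * rooms + 500 * landsize + rng.normal(0, 50000, size=200)

# Hold out part of the data that the model never sees during fitting.
train_X, val_X, train_y, val_y = train_test_split(X_toy, y_toy, random_state=1)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on the rows used for fitting versus error on unseen rows.
train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(train_mae, val_mae)
```

An unrestricted decision tree can memorize its training rows, so the training error is close to zero while the validation error is noticeably larger.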