[Kaggle Course] Distributions (histogram + density plots[KDE])

WakaraNai 2020. 11. 16. 22:52

Load and examine the data

In this tutorial we'll work with a dataset of 150 different flowers,

50 each from three species of iris.

Each row corresponds to a single flower.

There are four measurements per flower: sepal length and width, and petal length and width.

# Path of the file to read
iris_filepath = "../input/iris.csv"

# Read the file into a variable iris_data
iris_data = pd.read_csv(iris_filepath, index_col="Id")

# Print the first 5 rows of the data
iris_data.head()

Histograms

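Let's create a histogram to see how petal length varies across the iris flowers.

a= : selects the column to plot
kde=False : removes the smooth curve that would otherwise be drawn over the bars. Leaving it out produces a slightly different plot, so we always set it when creating a histogram.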

# Histogram 
sns.distplot(a=iris_data['Petal Length (cm)'], kde=False)
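
To make concrete what the bars of a histogram represent, here is a minimal pure-Python sketch of the binning step. The petal lengths and bin edges are made-up illustrative values, and histogram_counts is a hypothetical helper, not something seaborn exposes; seaborn chooses its bins automatically.

```python
# Illustrative petal lengths in cm (made-up values, not read from iris.csv).
petal_lengths = [1.4, 1.3, 1.5, 1.7, 4.7, 4.5, 4.9, 5.1, 5.9, 6.0]

def histogram_counts(values, bin_edges):
    """Count how many values fall into each [left, right) bin.

    The last bin also includes its right edge, so the maximum is not dropped.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            if left <= v < right or (i == last and v == right):
                counts[i] += 1
                break
    return counts

bin_edges = [1, 2, 3, 4, 5, 6]  # five bins, each 1 cm wide
counts = histogram_counts(petal_lengths, bin_edges)

# Text rendering of the bars: one '#' per flower in the bin.
for i, c in enumerate(counts):
    print(f"{bin_edges[i]}-{bin_edges[i + 1]} cm | {'#' * c}")
```

Each bar height is simply the number of flowers whose petal length lands in that bin, which is why the y-axis of a histogram drawn with kde=False reads as a count.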

Density plots (KDE)

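KDE: a kernel density estimate plot. You can think of it as a smoothed histogram.

shade=True : colors the area under the curve (newer seaborn versions call this parameter fill)
data= : selects the column to plot, just as a= does for the histogram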

# KDE plot 
sns.kdeplot(data=iris_data['Petal Length (cm)'], shade=True)
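
The "smoothed histogram" idea can be sketched by hand: a Gaussian KDE places one small bell curve on every observation and averages them. The values and the bandwidth below are assumptions chosen for illustration; seaborn picks its own bandwidth, so this is not a reproduction of what sns.kdeplot computes.

```python
import math

# Illustrative petal lengths in cm (made-up values, not read from iris.csv).
petal_lengths = [1.3, 1.4, 1.5, 1.7, 4.5, 4.7, 4.9, 5.1, 5.9, 6.0]

def gaussian_kde(values, bandwidth):
    """Return a function x -> estimated density at x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of one Gaussian bump per observation, scaled to integrate to 1.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

density = gaussian_kde(petal_lengths, bandwidth=0.4)  # bandwidth chosen by hand

# Riemann sum over a wide grid (-3.0 to 12.0) to check the total area.
step = 0.05
grid = [i * step for i in range(-60, 241)]
area = sum(density(x) for x in grid) * step

print("density at 1.5 cm:", density(1.5))
print("density at 3.0 cm:", density(3.0))
print("approximate area under the curve:", area)
```

The curve is high where observations cluster and close to zero in the gap between the clusters, and the area under it is approximately 1, which is why the y-axis of a KDE plot shows densities rather than counts.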

2D KDE plots

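KDE plots are not limited to a single column. We can also create a two-dimensional (2D) KDE plot from two columns.

In the resulting plot, the color-coding shows how likely we are to see different combinations of sepal width and petal length;

darker regions are more likely.

The curve at the top of the figure is a KDE plot for the data on the x-axis (here, iris_data['Petal Length (cm)'])
The curve on the right of the figure is a KDE plot for the data on the y-axis (here, iris_data['Sepal Width (cm)'])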

# 2D KDE plot
sns.jointplot(x=iris_data['Petal Length (cm)'], y=iris_data['Sepal Width (cm)'], kind="kde")

Color-coded plots

Next, we'll draw plots that help us understand the differences between the species.

To do this, the dataset is split into three separate files, one per species.

# Paths of the files to read
iris_set_filepath = "../input/iris_setosa.csv"
iris_ver_filepath = "../input/iris_versicolor.csv"
iris_vir_filepath = "../input/iris_virginica.csv"

# Read the files into variables 
iris_set_data = pd.read_csv(iris_set_filepath, index_col="Id")
iris_ver_data = pd.read_csv(iris_ver_filepath, index_col="Id")
iris_vir_data = pd.read_csv(iris_vir_filepath, index_col="Id")

# Print the first 5 rows of the Iris versicolor data
iris_ver_data.head()
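
If only the combined file were available, a similar split could be done in pandas instead of loading three files. The sketch below is hedged: it uses a tiny inline stand-in for iris.csv with illustrative rows, and it assumes the combined file has a "Species" column holding names such as "Iris-setosa".

```python
import io
import pandas as pd

# Tiny stand-in for iris.csv: illustrative rows only, with an assumed "Species" column.
raw = io.StringIO(
    "Id,Petal Length (cm),Species\n"
    "1,1.4,Iris-setosa\n"
    "2,1.3,Iris-setosa\n"
    "51,4.7,Iris-versicolor\n"
    "101,6.0,Iris-virginica\n"
)
iris_data = pd.read_csv(raw, index_col="Id")

# One DataFrame per species, keyed by the species name.
by_species = {name: group for name, group in iris_data.groupby("Species")}

iris_set_data = by_species["Iris-setosa"]
print(iris_set_data)
```

Each value in by_species plays the same role as one of the per-species DataFrames loaded above.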

In the code block below, we draw a separate histogram for each species, three in total.

We use label= to set how each histogram appears in the legend.

Here, the legend does not show up on the plot automatically.

To force it to appear, we must call plt.legend().

# Histograms for each species
sns.distplot(a=iris_set_data['Petal Length (cm)'], label="Iris-setosa", kde=False)
sns.distplot(a=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor", kde=False)
sns.distplot(a=iris_vir_data['Petal Length (cm)'], label="Iris-virginica", kde=False)

# Add title
plt.title("Histogram of Petal Lengths, by Species")

# Force legend to appear
plt.legend()

Let's also draw a KDE plot for each species.

Setting shade=True fills in the area under each curve.

# KDE plots for each species
sns.kdeplot(data=iris_set_data['Petal Length (cm)'], label="Iris-setosa", shade=True)
sns.kdeplot(data=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor", shade=True)
sns.kdeplot(data=iris_vir_data['Petal Length (cm)'], label="Iris-virginica", shade=True)

# Add title
plt.title("Distribution of Petal Lengths, by Species")

One interesting pattern in the plots above is that

the plants seem to fall into one of two groups: Iris versicolor and Iris virginica have similar petal lengths,

while Iris setosa sits in a category all by itself.

In fact, according to this dataset, we could classify an iris plant as Iris setosa

(as opposed to Iris versicolor or Iris virginica)

just by looking at its petal length.

If an iris flower's petal length is less than 2 cm, we can be fairly confident that it is Iris setosa.
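
The rule of thumb read off the KDE plot can be written as a one-line threshold check. This is only a sketch of that observation: guess_species_group is a hypothetical helper, and the 2 cm cut-off comes from the plot, not from a fitted model.

```python
def guess_species_group(petal_length_cm, threshold_cm=2.0):
    """Rule of thumb from the KDE plot: short petals suggest Iris setosa."""
    if petal_length_cm < threshold_cm:
        return "Iris-setosa"
    # Petal length alone does not separate the other two species well.
    return "Iris-versicolor or Iris-virginica"

for length in [1.4, 4.7, 6.0]:  # illustrative petal lengths in cm
    print(length, "->", guess_species_group(length))
```

Telling Iris versicolor from Iris virginica would need more than this single measurement, because their petal-length curves overlap.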

Exercise

Scenario

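You'll work with a dataset of information collected from microscope images of breast cancer tumors.

Each tumor is labeled as either benign or malignant.

To learn how this kind of data is used to build intelligent algorithms that classify tumors in medical settings, watch the video below (1 min 45 s).

www.youtube.com/watch?v=9Mz84cwVmS0

Notes on the video: when a tumor is malignant, researchers are working on identifying which kind of malignant tumor it is. To that end, they want to collect microscope images of tumors in large numbers and turn them into data.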

Setup

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

# Set up code checking
import os
if not os.path.exists("../input/cancer_b.csv"):
    os.symlink("../input/data-for-datavis/cancer_b.csv", "../input/cancer_b.csv")
    os.symlink("../input/data-for-datavis/cancer_m.csv", "../input/cancer_m.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_to_coder.ex5 import *
print("Setup Complete")

1. Load the data

# Paths of the files to read
cancer_b_filepath = "../input/cancer_b.csv"
cancer_m_filepath = "../input/cancer_m.csv"

# Fill in the line below to read the (benign) file into a variable cancer_b_data
cancer_b_data = pd.read_csv(cancer_b_filepath, index_col='Id')

# Fill in the line below to read the (malignant) file into a variable cancer_m_data
cancer_m_data = pd.read_csv(cancer_m_filepath, index_col='Id')

2. Review the data

benign data - marked B in the 'diagnosis' column

malignant data - marked M in the 'diagnosis' column

Each dataset has 31 columns, and the columns line up between the two datasets.

Apart from diagnosis, the other 30 columns hold different measurements.

The questions are as follows.

# Fill in the line below: In the first five rows of the data for benign tumors, what is the
# largest value for 'Perimeter (mean)'?
max_perim = max(cancer_b_data['Perimeter (mean)'].iloc[:5])
#87.46

# Fill in the line below: What is the value for 'Radius (mean)' for the tumor with Id 842517?
mean_radius = cancer_m_data['Radius (mean)'].loc[842517]
#20.57
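
The two answers rely on two different ways of selecting rows: .iloc works by position, while .loc works by index label (here, the Id). The mini-table below is hypothetical; its Ids and values are invented purely to show the difference and are unrelated to the cancer data.

```python
import pandas as pd

# Hypothetical mini-table: the Ids and values are made up for illustration.
demo = pd.DataFrame(
    {"Perimeter (mean)": [80.0, 95.5, 70.2, 60.1, 88.8, 120.0]},
    index=pd.Index([101, 102, 103, 104, 105, 106], name="Id"),
)

# .iloc selects by position: the first five rows, whatever their Ids are.
first_five_max = demo["Perimeter (mean)"].iloc[:5].max()

# .loc selects by index label: the row whose Id is 106.
value_for_id = demo["Perimeter (mean)"].loc[106]

print(first_five_max, value_for_id)
```

Calling .max() on the slice gives the same result as wrapping it in Python's built-in max(), as the exercise answer does.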

3. Investigating differences

Part A

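Let's use two histograms to compare the distribution of 'Area (mean)' for benign and malignant tumors across the two datasets.

'Area (mean)' reflects the size (area) of the tumor.

With kde=False, the smooth curve disappears,
as a comparison with the default plot shows.

The most important difference is that the axis values become easier to read:
without the KDE curve, the y-axis shows raw counts rather than densities.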

Part B

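Suppose a researcher wants to know how the 'Area (mean)' column can help distinguish benign from malignant tumors.

Based on the histograms above,

do malignant tumors have higher or lower values for 'Area (mean)' than benign tumors, on average? Which tumor type seems to have a larger range of potential values?
- On average, malignant tumors have higher values for 'Area (mean)'.
- Benign tumors mostly have areas of roughly 250 to 600; a larger value suggests the tumor is probably malignant, especially in the range of roughly 700 to 1300. So the answer to the second question is: malignant tumors have a larger range of potential values.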

4. A very useful column

Part A

Let's draw two KDE plots showing the distribution of the 'Radius (worst)' column for the two tumor types.

'Radius (worst)' represents the tumor's largest radius measurement.

# KDE plots for benign and malignant tumors
sns.kdeplot(data=cancer_b_data['Radius (worst)'], shade=True) # Your code here (benign tumors)
sns.kdeplot(data=cancer_m_data['Radius (worst)'], shade=True) # Your code here (malignant tumors)

In the legend, both curves are labeled with the column name.

To fix this,

add label='Benign' and label='Malignant'

to the respective sns.kdeplot() calls.

Blue is benign,

orange is malignant.

Part B

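Suppose a hospital has recently started using an algorithm that diagnoses tumors with high accuracy.

Given a tumor with a 'Radius (worst)' value of 25, do you think the algorithm is more likely to classify it as benign or malignant?

>>> The algorithm is more likely to classify it as malignant.

>>> That is because, at a value of 25, the malignant curve is higher than the benign curve.

>>> An accurate algorithm is likely to base its diagnosis on patterns like this one.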
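
The "which curve is higher at 25" reasoning can be sketched numerically by evaluating a simple Gaussian KDE for each group at that value. Everything here is an assumption for illustration: the radius values are invented (not taken from cancer_b.csv or cancer_m.csv), the bandwidth is picked by hand, and kde_at is a hypothetical helper.

```python
from statistics import NormalDist

# Synthetic 'Radius (worst)' values, invented for illustration (not the cancer data).
benign = [11.5, 12.8, 13.4, 14.1, 15.0, 16.2]
malignant = [17.5, 19.8, 21.0, 23.5, 25.5, 28.0]

def kde_at(values, x, bandwidth=1.5):
    """Gaussian KDE evaluated at x: the average of one bell curve per observation."""
    return sum(NormalDist(mu=v, sigma=bandwidth).pdf(x) for v in values) / len(values)

x = 25
b, m = kde_at(benign, x), kde_at(malignant, x)
more_likely = "malignant" if m > b else "benign"

print("benign density at 25:", b)
print("malignant density at 25:", m)
print("more likely:", more_likely)
```

Each curve is normalized separately, so this comparison ignores how many tumors of each type there are overall; it mirrors the visual argument rather than a full diagnostic model.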

License: Attribution, NonCommercial, ShareAlike
