Machine Learning/[Kaggle Course] SQL (Intro + Advanced)

Nested and Repeated Data

WakaraNai 2021. 1. 15. 00:20

728x90

Nested data

pets table의 ID와 toys table의 Pet_ID를 이용해 각 장난감의 주인이 어느 반려동물의 것인지 연결지으러 합니다.

pets_and toys_ table 속 Toy column을 바로, nested column으로 두려 합니다.

이 곳에는 toys table의 'name'(Toy_name)과 'type' column(Toy_Type)이 엮인 정보가 저장 됩니다.

이를 위해 어떤 column을 선택해야할 지 미리 정한 뒤, query에 적어줍니다.

Nested column은 STRUCT (or RECORD) type을 가집니다.

아래의 table의 스키마(=해당 table의 구조)에 이 형식이 반영되어 있습니다.

Repeated data

한 동물이 여러 개의 장난감을 보유할 때 이 방식을 사용합니다.

이를 위해 또다른 datatype을 적용하여 새로운 column을 만듭니다.

REPEATED datatype은 하나의 row에 여러 값을 담을 수 있도록 해줍니다. (Toys column)

이 column의 각각의 entry는 배열이거나, 동일한 datatype을 가진 값들이 정렬된 리스트입니다.

예로, Moon의 이름을 가진 row의 Toys column 자리에는 [Frisbee, Bone, Rope] 배열이 들어 있습니다.

반복되는 데이터에 대해서 쿼리문을 작성해야 할 때 column의 이름을 적어야 합니다.

그 column 이름에는 UNNEST() 함수 속 반복된 데이터가 포함되어 있죠.

이는 반복되는 데이터를 하나의 칸을 차지하도록, 즉 평평하게 만들기 위해 필수적입니다.

오른쪽 표 그림처럼 변형하려면 말이죠.

Nested and Repeated data

만약 반려동물이 여러개의 장난감을 가진다면, 각각의 장난감의 종류와 이름을 추적해야 한다면 어떻게 해야할까요?

이럴 땐 Toy column을 만듭니다. 중첩되고 반복되는 형태로요. (nested and repeated)

"more_pets_and_toys" table 속에는 Toys에 Name과 Type을 둘 다 가지고 있습니다.

각각 Toys.Name, Toys.Type이란 배열로 표현할 수 있죠.

예시 하나를 봅시다.

Toys column이 반복된 후 UNNEST 함수로 data를 일렬로, 평평하게 만들어주었습니다.

바뀐 column을 t라는 이름으로 부르기로 하였습니다.

그렇기에 Name은 t.Name, Type은 t.Type으로 명시할 수 있습니다.

Example

www.kaggle.com/bigquery/google-analytics-sample

위의 데이터를 이용해보려 합니다.

브랜드 상품을 파는 구글 Merchandise store의 고객들의 행동을 추적한 정보가 있습니다.

ga_sessions_20170801 table에서 5줄 정도 출력해봅시다.

보시다시피 여기엔 수많은 nested data column(field)가 있습니다.

이는 딕셔너리 datatype으로 접근할 수 있습니다.

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "google_analytics_sample" dataset
dataset_ref = client.dataset("google_analytics_sample", project="bigquery-public-data")

# Construct a reference to the "ga_sessions_20170801" table
table_ref = dataset_ref.table("ga_sessions_20170801")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

먼저 totals 와 device column을 봅시다.

print("SCHEMA field for the 'totals' column:\n")
print(table.schema[5])

print("\nSCHEMA field for the 'device' column:\n")
print(table.schema[7])

device column 속 "browser"(브라우저 종류) field와 "transactions"(거래량) field를 보려합니다.

(device.browser, device.transactions)

SELECT device.browser AS device_browser,
            SUM(totals.transactions) as total_transactions
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
        GROUP BY device_browser
        ORDER BY total_transactions DESC

# Query to count the number of transactions per browser
query = """
        ~~ 위의 코드 블럭 ~~
        """

# Run the query, and return a pandas DataFrame
result = client.query(query).result().to_dataframe()
result.head()

device와 total column을 STRUCT를 이용하여 저장하니, JOINs 쓰지 않아도 되었습니다.

이는 수행 속도 증가면에서 나으며 어떤 key값으로 JOIN을 해야할 지 고민하지 않아도 됩니다.

그리고 table은 정확히 필요한 정보만 가지게 되고요.

이번에는 hits column을 생성하여 nested and repeated data를 넣으려 합니다.

"hits"는 중첩되고 반복된 STRUCT 입니다.
"hitNumber", "page", "type"은 nested 형태로 hits column에 들어가며,
"pagePath"는 nested 형태로 "page" field에 들어갑니다.

SELECT hits.page.pagePath as path,
            COUNT(hits.page.pagePath) as counts
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, 
            UNNEST(hits) as hits
        WHERE hits.type="PAGE" and hits.hitNumber=1
        GROUP BY path
        ORDER BY counts DESC

# Query to count the number of transactions per browser
query = """
        ~~ 위의 코드 블럭 ~~
        """

# Run the query, and return a pandas DataFrame
result = client.query(query).result().to_dataframe()
result.head()

0번째에 /home이 있는 것으로보아

홈페이지 첫 화면에 가장 많은 유저들이 접속하네요.

Exercise

# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql_advanced.ex3 import *
print("Setup Complete")

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "github_repos" dataset
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "sample_commits" table
table_ref = dataset_ref.table("sample_commits")

# API request - fetch the table
sample_commits_table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(sample_commits_table, max_results=5).to_dataframe()

1. 깃허브에서 2016년에 가장 많이 커밋한 사람은?

sample_commits 이란 테이블은, 각 커밋이 하나의 행으로 되어있는 테이블이다.

커밋을 작성한 사람과 커밋이 작성된 날짜는 committer column의 name과 date child field를 확인하자

결과 테이블에 포함되어야 하는 column 정리

committer_name column : 2016년도에 작성된 커밋의 유저 이름이 담겨있음
num_commits column : 테이블 속 전체 사람의 수를 표시

# Print information on all the columns in the table
sample_commits_table.schema

SELECT committer.name AS committer_name,
	COUNT(*) AS num_commits
FROM `bigquery-public-data.github_repos.sample_commits`
WHERE committer.date>="2016-01-01" and committer.date<"2017-01-01"
GROUP BY committer_name
ORDER BY num_commits DESC

2. 가장 유명한 프로그래밍 언어는?

# Construct a reference to the "languages" table
table_ref = dataset_ref.table("languages")

# API request - fetch the table
languages_table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(languages_table, max_results=5).to_dataframe()

languages table를 살펴보자

repo_name column : 레포지토리의 이름
language column 속 name field : 해당 레포지토리에 있는 프로그래밍 언어들 내역
language column 속 bytes field : 파일 크기

만약 language table에서 추출된 아주 작은 sample_languages table에 접근할 수 있다고 가정해보자.

이 테이블은 오직 3개의 행만을 가지고 있다. (아래 사진처럼)

그럼 그 아래 사진 속 파란 상자 안의 쿼리를 작성하여 적용 후에 행의 개수를 세어보면 6개가 됩니다.

돌아와서 대부분의 레포지토리에서 사용된 프로그래밍 언어들을 알려주는 쿼리문을 작성해봅시다.

language_name : 프로그래밍 언어의 이름
num_repos : 해당 프로그래밍 언어를 사용한 레포지토리의 수

SELECT l.name AS language_name,
    COUNT(*) AS num_repos
FROM `bigquery-public-data.github_repos.languages`,
    UNNEST(language) AS l
GROUP BY language_name
ORDER BY num_repos DESC

3. 가장 많은 언어를 사용한 레포지토리에서 사용된 언어는?

이 질문을 위해 'polyrabbit/polyglot' 레포지토리에 집중해봅시다.

이 레포지토리에 사용된 언어를 각각 하나의 행으로 가진 테이블을 구해봅시다

name column : 프로그래밍 언어의 이름
bytes column : 그 프로그래밍 언어의 총 bytes 수

bytes column으로 정렬하여 레포지토리에서 좀 더 많은 용량을 차지하는 언어를 상단에 뜨도록 해주세요.

SELECT l.name, l.bytes
FROM `bigquery-public-data.github_repos.languages`,
    UNNEST(language) AS l
WHERE repo_name = "polyrabbit/polyglot"
ORDER BY bytes DESC

728x90

저작자표시 비영리 동일조건 (새창열림)

'Machine Learning > [Kaggle Course] SQL (Intro + Advanced)' 카테고리의 다른 글

효과적인 쿼리문 작성법 (0)	2021.03.26
How to Define Analytic/window Functions (0)	2021.01.15
JOINs and UNIONs (0)	2021.01.14
JOIN (0)	2020.12.09
AS & WITH (0)	2020.12.07

현재글Nested and Repeated Data

250x250

YBMIT, COSPro, Intro to Machine Learning, Intermediate Machine Learning, 백준, data visualization, 파이썬, 데이터분석, 머신러닝, cos pro, datascience, Python, kaggle, 알고리즘, 2급, cos pro 1급, course, cos, 너비우선탐색, Intro to DeepLearning,

Today :
Yesterday :

일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

WakaraNai

Nested and Repeated Data

Nested data

Repeated data

Nested and Repeated data

Example

Exercise

1. 깃허브에서 2016년에 가장 많이 커밋한 사람은?

2. 가장 유명한 프로그래밍 언어는?

3. 가장 많은 언어를 사용한 레포지토리에서 사용된 언어는?

'Machine Learning > [Kaggle Course] SQL (Intro + Advanced)' 카테고리의 다른 글

'Machine Learning/[Kaggle Course] SQL (Intro + Advanced)'의 다른글

티스토리툴바

Nested and Repeated Data

Nested data

Repeated data

Nested and Repeated data

Example

Exercise

1. 깃허브에서 2016년에 가장 많이 커밋한 사람은?

2. 가장 유명한 프로그래밍 언어는?

3. 가장 많은 언어를 사용한 레포지토리에서 사용된 언어는?

'Machine Learning > [Kaggle Course] SQL (Intro + Advanced)' 카테고리의 다른 글

'Machine Learning/[Kaggle Course] SQL (Intro + Advanced)'의 다른글

관련글

티스토리툴바