ORDER BY

Machine Learning/[Kaggle Course] SQL (Intro + Advanced)

ORDER BY

WakaraNai 2020. 12. 5. 23:35

728x90

앞의 내용 정리

SELECT - table에서 특정 column 가져오기

WHERE - SELECT한 column 중 특정 조건에 해당하는 row로 범위를 축소하여 가져오기

COUNT() - one of aggregate functions

-> GROUP BY -> 여러 개의 rows를 하나의 group으로

이번 튜토리얼에서 볼 datset은 다음과 가틋ㅂ니다.

ORDER BY

보통 query의 가장 마지막 절에 쓰게 됩니다.

그 query의 결과를 정렬도 해주죠.

내림차순으로 하고 싶다면 DESC를 적어주세요

DATE, DATETIME format

BigQuery에서 DATE 관련 함수 정리 사이트

DATE ( + time = DATETIME)

EXTRACT

특정 날짜(년or월or일)에 대해서만 보고 싶을 때 사용

위의 table에서 Date column에서 'Day(일)'만 추출하여 'Day'란 이름의 column과 Name column, 총 2개의 column 추출

DAY 뿐만 아니라 WEEK도 추출 가능. (4~5)*12 == 1~53 사이 정수

Ex. Which day of the week has the most fatal motor accidents?

US Traffic Fatality Records database를 이용하여 적어도 한 사람 이상이 사망한 교통 사고에 대한 정보를 다뤄봅시다.

'accident_2015' table을 조사해볼 것입니다.

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "nhtsa_traffic_fatalities" dataset
dataset_ref = client.dataset("nhtsa_traffic_fatalities", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "accident_2015" table
table_ref = dataset_ref.table("accident_2015")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "accident_2015" table
client.list_rows(table, max_results=5).to_dataframe()

'accident_2015' table 설명

- 'consecutive_number' column : 각 사건별 고유한 ID 번호

- 'timestamp_of_crash' column: 사건 날짜 - DATETIME format

아래 코드 블럭 속 query 설명

-> EXTRACT으로, 'timestamp_of_crash' column에서 요일 추출 (as 'day_of_week'이라는 query로)

-> 각 요일별 사건 수를 추출하기 위해, 'consecutive_number' column을 COUNT.

-> 이후 GROUP BY으로, 추출한 요일을 수합.

-> ORDER BY로 정렬 -> 그럼 가장 사건이 많았던 날부터 내림차순(DESC)으로 출력됨

# Query to find out the number of accidents for each day of the week
query = """
        SELECT COUNT(consecutive_number) AS num_accidents, 
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
query_job = client.query(query, job_config=safe_config)

# API request ->
# run the query, and convert the results to a pandas DataFrame
accidents_by_day = query_job.to_dataframe()

# Print the DataFrame
accidents_by_day

'num_accidents' column에는 교통 사고가 많았던 요일부터 내림차순으로 그 사고 수를 담아서 출력

'day_of_week' column 속 숫자를 실제 날짜로 변환하려면

BigQuery Documentation 속 DAYOFWEEK 함수를 사용하면 됩니다.

1은 Sunday, 7은 Saturday를 의미합니다.

결과를 보시면, 2015년 미국에서는

토요일과 일요일에 가장 많이 사고가 일어났으며, 화요일에 가장 적게 일어났습니다.

Exercise

이젠 새로운 datset이 필요할 때 SELECT query를 여러 개를 실행시켜 필요한 data를 추출해야한다는 것은 알겠죠?

실습에선 World Bank의 교육용 data를 만져봅니다.

먼저 'world_bank_intl_education' dataset으로부터 'international_education' table 생성.

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "world_bank_intl_education" dataset
dataset_ref = client.dataset("world_bank_intl_education", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "international_education" table
table_ref = dataset_ref.table("international_education")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "international_education" table
client.list_rows(table, max_results=5).to_dataframe()

'international_education' table 설명

- 'indicator_code' column : 해당 row에 어떤 type의 data가 있는지 서술됨

->"SE.XPD.TOTL.GD.ZS"란 Government expenditure on education as % as GDP(%)"의 줄임말 (교육 쪽 정부 지출%)

1) Government expenditure on education _ 정부가 교육에 들인 비용

어느 국가가 GDP의 상당한 부분을 교육에 지출했을까요?

"SE.XPD.TOTL.GD.ZS"를 통해 알아볼 텐데요,

2010~2017년 사이 dataset에서 각 국가별 'value' column의 평균 값을 반환하는 query를 작성해봅시다.

이를 위해서 필요한 것들은 다음과 같습니다.

country code가 아닌 country name이 결과에 있어야합니다. 각 국가별 각 row마다요.
평균에 대한 aggregate func은 AVG() 입니다. 'avg_ed_spending_pct'라는 column을 생성하여 평균을 구하세요.
GDP 중 교육에 많이 지출하는 국가부터 내림차순으로 정렬하세요.

# Your code goes here
country_spend_pct_query = """
                          SELECT country_name,AVG(value) AS avg_ed_spending_pct
                          FROM `bigquery-public-data.world_bank_intl_education.international_education`
                          WHERE 2010 <= year AND year <= 2017 AND indicator_code ='SE.XPD.TOTL.GD.ZS'
                          GROUP BY country_name
                          ORDER BY avg_ed_spending_pct DESC
                          """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
country_spend_pct_query_job = client.query(country_spend_pct_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
country_spending_results = country_spend_pct_query_job.to_dataframe()

# View top few rows of results
print(country_spending_results.head())

2) Identify interesting codes to explore _ 조금 더 흥미로운 정보를 얻어보자

dataset에 있는 1000개 이상의 코드를 다 보는 것은 시간 낭비.

그런데 대다수의 코드들은 극소수의 국가에서만 이용 가능합니다. 국가별로 데이터를 보호 및 접근이 달라서요.

그로 인하여 그 외의 코드에 대한 옵션(추가조건)을 검색할 때,

여러 국가에서 기록된 코드로 인해 건드리고 싶었던 data를 만질 수 없을 수도 있죠.

이제 2016년에 최소 175 rows를 가진 모든 코드의 indicator name과 indicator code를 select하는 query를 작성해볼 겁니다. 필요 사항은 다음과 같습니다.

각 indicator code마다 하나의 row는 필수로 갖기
결과로 도출된 column의 이름은 'indicator_code', 'indicator_name', 'num_rows' 꼭 쓰기
일단 175개 또는 그 이상의 rows를 raw database에서 select하기
결과 DataFrame에 'indicator_name'과 'indicator_code'를 넣기 위해선, 둘 다 선택하는 SELECT 명령문 필요(COUNT()도 필요). 두 그룹을 하나로 묶어주는 GROUP BY 절에 꼭 쓰는 거 잊지 말기
가장 빈번하게 나오는 것부터 적게 나오는 순으로 정렬하기

# Your code goes here
code_count_query = """
SELECT indicator_code, indicator_name, COUNT(1) AS num_rows
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE year = 2016
GROUP BY indicator_code, indicator_name
HAVING COUNT(1) >= 175
ORDER BY COUNT(1) DESC
"""

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
code_count_query_job = client.query(code_count_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
code_count_results = code_count_query_job.to_dataframe()

# View top few rows of results
print(code_count_results.head())

+) having vs where

wakaranaiyo.tistory.com/40

728x90

저작자표시 비영리 동일조건 (새창열림)