# 04# SQL for Data Science

### [Big Query:](https://mydatasciencejourney.ml/02-sql-for-data-science-8efc97c64253?source=your_stories_page----------------------------------------) Order By

### Order By

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833641813/iCKQg4-f2.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833644376/mGueFagb6p.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833645722/p3G1UGy6X.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833647067/mTFqwHPvy.png)

### Example: Which day of the week has the most fatal motor accidents?

Let’s use the US Traffic Fatality Records database, which contains information on traffic accidents in the US where at least one person died.

We’ll investigate the `accident_2015` table. Here is a view of the first few rows. (*We have hidden the corresponding code. To take a peek, click on the "Code" button below.*)

### Dates

Next, we’ll talk about dates, because they come up very frequently in real-world databases. There are two ways that dates can be stored in BigQuery: as a **DATE** or as a **DATETIME**.

The **DATE** format has the year first, then the month, and then the day. It looks like this:

```
YYYY-[M]M-[D]D
```

*   `YYYY`: Four-digit year
*   `[M]M`: One or two digit month
*   `[D]D`: One or two digit day

So `2019-01-10` is interpreted as January 10, 2019.

The **DATETIME** format is like the date format … but with time added at the end

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833648465/VqfTzBpAx.jpeg)

Photo by [Matt Chesin](https://unsplash.com/@mchesin?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

[**04#SQL-Data-Science**  
*Explore and run machine learning code with Kaggle Notebooks | Using data from No attached data sources*www.kaggle.com](https://www.kaggle.com/viannaandresouza/04-sql-data-science "https://www.kaggle.com/viannaandresouza/04-sql-data-science")[](https://www.kaggle.com/viannaandresouza/04-sql-data-science)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833649909/xGQcL1xWk.png)

Let’s use the table to determine how the number of accidents varies with the day of the week. Since:

*   the `consecutive_number` column contains a unique ID for each accident, and
*   the `timestamp_of_crash` column contains the date of the accident in DATETIME format,

we can:

*   **EXTRACT** the day of the week (as `day_of_week` in the query below) from the `timestamp_of_crash` column, and
*   **GROUP BY** the day of the week, before we **COUNT** the `consecutive_number` column to determine the number of accidents for each day of the week.

Then we sort the table with an **ORDER BY** clause, so the days with the most accidents are returned first.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833651318/s388qdoCG.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833652703/i745cP0km.png)

### International Educator Indicator

[**Exercise: Order By**  
*Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources*www.kaggle.com](https://www.kaggle.com/viannaandresouza/exercise-order-by "https://www.kaggle.com/viannaandresouza/exercise-order-by")[](https://www.kaggle.com/viannaandresouza/exercise-order-by)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833654205/jiB0KZ8nK.jpeg)

Photo by [NeONBRAND](https://unsplash.com/@neonbrand?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

The value in the `indicator_code` column describes what type of data is shown in a given row.

One interesting indicator code is `SE.XPD.TOTL.GD.ZS`, which corresponds to "Government expenditure on education as % of GDP (%)".

### 1) Government expenditure on education

Which countries spend the largest fraction of GDP on education?

To answer this question, consider only the rows in the dataset corresponding to indicator code `SE.XPD.TOTL.GD.ZS`, and write a query that returns the average value in the `value` column for each country in the dataset between the years 2010-2017 (including 2010 and 2017 in the average).

Requirements:

*   Your results should have the country name rather than the country code. You will have one row for each country.
*   The aggregate function for average is **AVG()**. Use the name `avg_ed_spending_pct` for the column created by this aggregation.
*   Order the results so the countries that spend the largest fraction of GDP on education show up first.

### Introduction

You’ve built up your SQL skills enough that the remaining hands-on exercises will use different datasets than you see in the explanations. If you need to get to know a new dataset, you can run a couple of **SELECT** queries to extract and review the data you need.

The next exercises are also more challenging than what you’ve done so far. Don’t worry, you are ready for it!

Run the code in the following cell to get everything set up:

[**Exercise: Order By**  
*Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources*www.kaggle.com](https://www.kaggle.com/viannaandresouza/exercise-order-by "https://www.kaggle.com/viannaandresouza/exercise-order-by")[](https://www.kaggle.com/viannaandresouza/exercise-order-by)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833655557/P_S5rZ4i8.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833656966/6mPWfjdCw.png)

### 2) Identify interesting codes to explore

The last question started by telling you to focus on rows with the code `SE.XPD.TOTL.GD.ZS`. But how would you find more interesting indicator codes to explore?

There are 1000s of codes in the dataset, so it would be time consuming to review them all. But many codes are available for only a few countries. When browsing the options for different codes, you might restrict yourself to codes that are reported by many countries.

Write a query below that selects the indicator code and indicator name for all codes with at least 175 rows in the year 2016.

Requirements:

*   You should have one row for each indicator code.
*   The columns in your results should be called `indicator_code`, `indicator_name`, and `num_rows`.
*   Only select codes with 175 or more rows in the raw database (exactly 175 rows would be included).
*   To get both the `indicator_code` and `indicator_name` in your resulting DataFrame, you need to include both in your **SELECT** statement (in addition to a **COUNT()** aggregation). This requires you to include both in your **GROUP BY** clause.
*   Order from results most frequent to least frequent.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662833658481/JnRgfLF-9.png)
