# 06# SQL for Data Science

Dataset: Github Repository & StackOverFlow

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834478205/O3ov8zHPS.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834479615/EUBB57Hrl.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834483128/DeNQPcwpx.png)

### Example: How many files are covered by each type of software license?

GitHub is the most popular place to collaborate on software projects. A GitHub **repository** (or **repo**) is a collection of files associated with a specific project.

Most repos on GitHub are shared under a specific legal license, which determines the legal restrictions on how they are used. For our example, we’re going to look at how many different files have been released under each license.

We’ll work with two tables in the database. The first table is the `licenses` table, which provides the name of each GitHub repo (in the `repo_name` column) and its corresponding license. Here's a view of the first five rows.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834484562/3zlFoAWzZ.jpeg)

Photo by [Christina @ wocintechchat.com](https://unsplash.com/@wocintechchat?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834486001/ZkPfeiz17.png)

The second table is the `sample_files` table, which provides, among other information, the GitHub repo that each file belongs to (in the `repo_name` column). The first several rows of this table are printed below.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834487497/NxrqoXfUm.png)

Next, we write a query that uses information in both tables to determine how many files are released in each license.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834489030/_XPtpYTJj.png)

It’s a big query, and so we’ll investigate each piece separately.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834490390/zhiGiO-x8.png)

We’ll begin with the **JOIN** (highlighted in blue above). This specifies the sources of data and how to join them. We use **ON** to specify that we combine the tables by matching the values in the `repo_name` columns in the tables.

Next, we’ll talk about **SELECT** and **GROUP BY** (highlighted in yellow). The **GROUP BY** breaks the data into a different group for each license, before we **COUNT** the number of rows in the `sample_files` table that corresponds to each license. (Remember that you can count the number of rows with `COUNT(1)`.)

Finally, the **ORDER BY** (highlighted in purple) sorts the results so that licenses with more files appear first.

It was a big query, but it gave us a nice table summarizing how many files have been committed under each license:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834491860/AwE78Jn1u.png)

[https://www.kaggle.com/viannaandresouza/06-sql-data-science](https://www.kaggle.com/viannaandresouza/06-sql-data-science)

### Dataset StackOverflow

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834493260/9dDCtEZFm.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834494591/ZyFFVJEB2.jpeg)

Photo by [Christopher Gower](https://unsplash.com/@cgower?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

[**Exercise: Joining Data**  
*Explore and run machine learning code with Kaggle Notebooks | Using data from Stack Overflow Data*www.kaggle.com](https://www.kaggle.com/viannaandresouza/exercise-joining-data "https://www.kaggle.com/viannaandresouza/exercise-joining-data")[](https://www.kaggle.com/viannaandresouza/exercise-joining-data)

### Introduction

[Stack Overflow](https://stackoverflow.com/) is a widely beloved question and answer site for technical questions. You’ll probably use it yourself as you keep using SQL (or any programming language).

Their data is publicly available. What cool things do you think it would be useful for?

Here’s one idea: You could set up a service that identifies the Stack Overflow users who have demonstrated expertise with a specific technology by answering related questions about it, so someone could hire those experts for in-depth help.

In this exercise, you’ll write the SQL queries that might serve as the foundation for this type of service.

As usual, run the following cell to set up our feedback system before moving on.

### 1) Explore the data

Before writing queries or **JOIN** clauses, you’ll want to see what tables are available.

*Hint*: Tab completion is helpful whenever you can’t remember a command. Type `client.` and then hit the tab key. Don't forget the period before hitting tab.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834495940/kyANWathf.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834497385/z3x9IPXkJ.png)

### 2) Review relevant tables

If you are interested in people who answer questions on a given topic, the `posts_answers` table is a natural place to look. Run the following cell, and look at the output.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834498827/aA9yzb_oE.png)

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834500289/QnLMqARjdt.png)

### 3) Selecting the right questions[¶](https://kkb-production.jupyter-proxy.kaggle.net/static/dist/jupyterlab/jupyterlab-index-8dc1985859509657ef66.html?session=eyJhbGciOiJub25lIiwidHlwIjoiSldUIn0.eyJpc3MiOiJrYWdnbGUiLCJhdWQiOiJjbGllbnQiLCJzdWIiOiJ2aWFubmFhbmRyZXNvdXphIiwibmJ0IjoiMjAyMS0xMS0yMVQxNzozODoxNS4zODcyNTkyWiIsImlhdCI6IjIwMjEtMTEtMjFUMTc6Mzg6MTUuMzg3MjU5MloiLCJqdGkiOiJmOGY4MzU5OC1jMTI3LTQxNDQtYWViOC0wOTM0OTM0N2UwZDMiLCJleHAiOiIyMDIxLTEyLTIxVDE3OjM4OjE1LjM4NzI1OTJaIiwidWlkIjoxOTA3MzIwLCJkaXNwbGF5TmFtZSI6IkFuZHJlIFZpYW5uYSIsImVtYWlsIjoidmlhbm5hYW5kcmVzb3V6YUBnbWFpbC5jb20iLCJ0aWVyIjoiQ29udHJpYnV0b3IiLCJ2ZXJpZmllZCI6dHJ1ZSwicHJvZmlsZVVybCI6Ii92aWFubmFhbmRyZXNvdXphIiwidGh1bWJuYWlsVXJsIjoiaHR0cHM6Ly9zdG9yYWdlLmdvb2dsZWFwaXMuY29tL2thZ2dsZS1hdmF0YXJzL3RodW1ibmFpbHMvMTkwNzMyMC1ncC5qcGciLCJmZiI6WyJHaXRIdWJQcml2YXRlQWNjZXNzIiwiRG9ja2VyTW9kYWxTZWxlY3RvciIsIkdjbG91ZEtlcm5lbEludGVnIiwiS2VybmVsRWRpdG9yQ29yZ2lNb2RlIiwiVHB1VW51c2VkTnVkZ2UiLCJDYWlwRXhwb3J0IiwiQ2FpcE51ZGdlIiwiS2VybmVsc0ZpcmViYXNlTG9uZ1BvbGxpbmciLCJLZXJuZWxzUHJldmVudFN0b3BwZWRUb1N0YXJ0aW5nVHJhbnNpdGlvbiIsIktlcm5lbHNQb2xsUXVvdGEiLCJLZXJuZWxzUXVvdGFNb2RhbHMiLCJEYXRhc2V0c0RhdGFFeHBsb3JlclYzVHJlZUxlZnQiLCJBdmF0YXJQcm9maWxlUHJldmlldyIsIkRhdGFzZXRzRGF0YUV4cGxvcmVyVjNDaGVja0ZvclVwZGF0ZXMiLCJEYXRhc2V0c0RhdGFFeHBsb3JlclYzQ2hlY2tGb3JVcGRhdGVzSW5CYWNrZ3JvdW5kIiwiS2VybmVsc1N0YWNrT3ZlcmZsb3dTZWFyY2giLCJLZXJuZWxzTWF0ZXJpYWxMaXN0aW5nIiwiS2VybmVsc0VtcHR5U3RhdGUiLCJEYXRhc2V0c01hdGVyaWFsRGV0YWlsIiwiRGF0YXNldHNNYXRlcmlhbExpc3RDb21wb25lbnQiLCJDb21wZXRpdGlvbkRhdGFzZXRzIiwiRGlzY3Vzc2lvbnNVcHZvdGVTcGFtV2FybmluZyIsIlRhZ3NMZWFybkFuZERpc2N1c3Npb25zVUkiLCJLZXJuZWxzU3VibWl0RnJvbUVkaXRvciIsIk5vUmVsb2FkRXhwZXJpbWVudCIsIk5vdGVib29rc0xhbmRpbmdQYWdlIiwiRGF0YXNldHNGcm9tR2NzIiwiVFBVQ29tbWl0U2NoZWR1bGluZyIsIkVtcGxveWVySW5mb051ZGdlcyIsIkVtYWlsU2lnbnVwTnVkZ2VzIiwiS01MZWFybkRldGFpbCIsIkZyb250ZW5kQ29uc29sZUVycm9yUmVwb3J0aW5nIiwiS2VybmVsVmlld2VySGlkZUZha2VFeGl0TG9nVGltZSIsIktlcm5lbFZpZXdlclZlcnNpb25EaWFsb2dXaXRoUGFyZW50Rm9yayIsIkRhdGFzZXRMYW5kaW5nUGFnZVJvdGF0aW5nU2hlbHZlcyIsIkxvd2VyRGF0YXNldEhlYWRlckltYWdlTWluUmVzIiwiTmV3RGlzY3Vzc2lvbnNMYW5kaW5nIiwiRGlzY3Vzc2lvbkxpc3RpbmdJbXByb3ZlbWVudHMiLCJTY2hlZHVsZWROb3RlYm9va3MiLCJTY2hlZHVsZWROb3RlYm9va3NUcmlnZ2VyIiwiVGFnUGFnZXNEZXByZWNhdGUiLCJGaWx0ZXJGb3J1bUltYWdlcyIsIlBob25lVmVyaWZ5Rm9yQ29tbWVudHMiLCJQaG9uZVZlcmlmeUZvck5ld1RvcGljIiwiTmF2Q3JlYXRlQnV0dG9uIiwiTmV3TmF2QmVoYXZpb3IiXSwiZmZkIjp7Iktlcm5lbEVkaXRvckF1dG9zYXZlVGhyb3R0bGVNcyI6IjMwMDAwIiwiRnJvbnRlbmRFcnJvclJlcG9ydGluZ1NhbXBsZVJhdGUiOiIwLjEwIiwiRW1lcmdlbmN5QWxlcnRCYW5uZXIiOiJ7XCJiYW5uZXJzXCI6IFsgeyBcInVyaVBhdGhSZWdleFwiOiBcIl4oL2MvLip8L2NvbXBldGl0aW9ucy8_KVwiLCAgXCJtZXNzYWdlSHRtbFwiOiAgICAgICAgIFwiV2UgYXJlIGhhdmluZyBkZWdyYWRlZCBwZXJmb3JtYW5jZSBvbiBjb21wZXRpdGlvbnMuIFdlIGFyZSB3b3JraW5nIG9uIGl0XCIsICAgICAgIFwiYmFubmVySWRcIjogXCIyMDIxLTExLTAxLWNvbXBzLWRlZ3JhZGVkXCIgfSBdIH0ifSwicGlkIjoia2FnZ2xlLTE2MTYwNyIsInN2YyI6IndlYi1mZSIsInNkYWsiOiJBSXphU3lBNGVOcVVkUlJza0pzQ1pXVnotcUw2NTVYYTVKRU1yZUUiLCJibGQiOiI5MjY4NjY2NjI4MTU1Nzc1ZGFlN2ZlZWUyNzUzY2E1ZDdhZDQ3MjdiIn0.#3%29-Selecting-the-right-questions)

A lot of this data is text.

We’ll explore one last technique in this course which you can apply to this text.

A **WHERE** clause can limit your results to rows with certain text using the **LIKE** feature. For example, to select just the third row of the `pets` table from the tutorial, we could use the query in the picture below.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834501640/GiP3oK2js.png)

You can also use `%` as a "wildcard" for any number of characters. So you can also get the third row with:

```
query = """  
        SELECT *   
        FROM `bigquery-public-data.pet_records.pets`   
        WHERE Name LIKE '%ipl%'  
        """
```

Try this yourself. Write a query that selects the `id`, `title` and `owner_user_id` columns from the `posts_questions` table.

*   Restrict the results to rows that contain the word “bigquery” in the `tags` column.
*   Include rows where there is other text in addition to the word “bigquery” (e.g., if a row has a tag “bigquery-sql”, your results should include that too).

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834503027/8JaiS7hD3.png)

### 4) Your first join

Now that you have a query to select questions on any given topic (in this case, you chose “bigquery”), you can find the answers to those questions with a **JOIN**.

Write a query that returns the `id`, `body` and `owner_user_id` columns from the `posts_answers` table for answers to "bigquery"-related questions.

*   You should have one row in your results for each answer to a question that has “bigquery” in the tags.
*   Remember you can get the tags for a question from the `tags` column in the `posts_questions` table.

Here’s a reminder of what a **JOIN** looked like in the tutorial:

```
query = """  
        SELECT p.Name AS Pet_Name, o.Name AS Owner_Name  
        FROM `bigquery-public-data.pet_records.pets` as p  
        INNER JOIN `bigquery-public-data.pet_records.owners` as o   
            ON p.ID = o.Pet_ID  
        """
```

It may be useful to scroll up and review the first several rows of the `posts_answers` and `posts_questions` tables.

Hint: Do an **INNER JOIN** between `bigquery-public-data.stackoverflow.posts_questions` and `bigquery-public-data.stackoverflow.posts_answers`.

Give `post_questions` an alias of `q`, and use `a` as an alias for `posts_answers`. The **ON** part of your join is `q.id = a.parent_id`.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834504362/mu0PfANzv.png)

### 5) Answer the question

You have the merge you need. But you want a list of users who have answered many questions… which requires more work beyond your previous result.

Write a new query that has a single row for each user who answered at least one question with a tag that includes the string “bigquery”. Your results should have two columns:

*   `user_id` - contains the `owner_user_id` column from the `posts_answers` table
*   `number_of_answers` - contains the number of answers the user has written to "bigquery"-related questions

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834505719/RH_hkpwqO.png)

### 6) Building a more generally useful service

How could you convert what you’ve done to a general function a website could call on the backend to get experts on any topic?

Think about it and then check the solution below.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1662834507085/Q0U8VdNay.png)

def expert\_finder(topic, client):  
    '''  
    Returns a DataFrame with the user IDs who have written Stack Overflow answers on a topic.

Inputs:  
        topic: A string with the topic of interest  
        client: A Client object that specifies the connection to the Stack Overflow dataset

Outputs:  
        results: A DataFrame with columns for user\_id and number\_of\_answers. Follows similar logic to bigquery\_experts\_results shown above.  
    '''  
    my\_query = """  
               SELECT a.owner\_user\_id AS user\_id, COUNT(1) AS number\_of\_answers  
               FROM \`bigquery-public-data.stackoverflow.posts\_questions\` AS q  
               INNER JOIN \`bigquery-public-data.stackoverflow.posts\_answers\` AS a  
                   ON q.id = a.parent\_Id  
               WHERE q.tags like '%{topic}%'  
               GROUP BY a.owner\_user\_id  
               """

\# Set up the query (a real service would have good error handling for   
    # queries that scan too much data)  
    safe\_config = bigquery.QueryJobConfig(maximum\_bytes\_billed=10\*\*10)        
    my\_query\_job = client.query(my\_query, job\_config=safe\_config)

\# API request - run the query, and return a pandas DataFrame  
    results = my\_query\_job.to\_dataframe()

return results

[**Exercise: Joining Data**  
*Explore and run machine learning code with Kaggle Notebooks | Using data from Stack Overflow Data*www.kaggle.com](https://www.kaggle.com/viannaandresouza/exercise-joining-data "https://www.kaggle.com/viannaandresouza/exercise-joining-data")[](https://www.kaggle.com/viannaandresouza/exercise-joining-data)