/ Data Science

SQL for data analysis.

As a data scientist, you deal with a lot of data. For small datasets, maybe you just store this information in a CSV file and load it into Pandas. However, this isn't really a scalable solution and won't do too well if you're constantly updating and inserting new data. You need a database; somewhere to hold your information that allows for easy inserting, updating, and reading. There are two main approaches to storing data in a database: relational databases store data as a row in tables with a pre-defined structure, while non-relational databases store data independently as a document. This post will focus on accessing data from relational databases.

Relational database tables are very similar to Pandas dataframes, each row corresponds with a certain object and columns are used to store different features of that object. For example, a website might have a table in their database that stores its users' information (username, first name, last name, etc.).

id first_name last_name username password
1 Jeremy Jordan password password
2 John Snow winteriscoming braceyourselves
3 Anderson Paak yeslawd breezylovejoy

Note: Websites should never store your password in plain text.

When we want to get information from a relational database, we do so by querying (asking) for it. Specifically, we use a structured query language (SQL) to speak with the database and access data. A schema defines how information is stored within the table, essentially describing what type of information is stored in each column.

SQL has the ability to search, filter, and combine data that might be stored across multiple tables. In this post, I'll discuss the common SQL commands used for data analysis.

Quick links for reference

Basic SQL Commands
  • SELECT/FROM: specifies the columns and table of interest
  • LIMIT: caps the query results
  • ORDER BY: specify the order results are returned in
  • WHERE: filters the data using a comparison or logical operator
    • LIKE: used to match on similar values instead of strictly identical values
    • BETWEEN: only select rows in the specified range
    • IN: specify a list of values to include
    • AND: select rows which satisfy two specified conditions
    • OR: select rows which satisfy one of the specified conditions
    • NOT: select rows which do not satisfy the specified condition
    • IS NULL: select rows with null values in specified column
Intermediate SQL Commands

Basic SQL Commands

There are two commands which are absolutely required for every SQL query, what data you want (SELECT) and where that data is stored (FROM). With the SELECT command, you can specify the columns of a table that you'd like to retrieve while FROM specifies what table you're interested in retrieving data from. The "*" symbol is shorthand for "all", so if you wanted to grab all of the users' fields from the example table above you'd perform a query like SELECT * FROM users.

The AS keyword will create an alias for a column in the data retrieved.

SELECT first_name AS "First Name" 
FROM users

You can also combine values from different columns (within the same row).

SELECT unread_count + read_count AS total_count 
FROM emails 

To limit the size of your query results, you can use the command LIMIT, followed by the sample size, at the end of your query.

FROM users 

You can use the command ORDER BY to specify the column to sort the returned results. The default order is ascending, but you can add the keyword DESC after specifying the column to return a descending order of results. You can also specify multiple columns to order by.

FROM users 
ORDER BY year, month DESC

If you'd like to filter your results, you can specify which results you'd like using the WHERE command, followed by the column to filter on and a comparison or logical expression. For example, you could query the database for records where the user's age is greater than 21.

FROM users 
WHERE age > 21
Logical operators

If you want to filter a collection of similar records, you can specify the similarity using LIKE (or ILIKE to ignore case in strings for some flavors of SQL). The % symbol can be used as a wildcard where it could represent any character or set of characters. To limit the wildcard to only one character, you can use the _ symbol.

FROM users 
WHERE name LIKE 'Bill%' -- returns for Bill, Billy, Billie, etc.

To access data from a specific time period, a certain age group, or any other range you might be interested in, you can use the BETWEEN command to specify a range of results to filter by. Both upper and lower bounds are included in the search results.

FROM users 

For even more specific ranges, you can specify every value to include using the IN command.

FROM users 
WHERE age IN (20, 22, 24, 26)

AND and OR are used to combine multiple expressions when specifying a filter on the data, where AND only returns values which meet all specifications while OR will return values where at least one of the specifications is met.

NOT can be used to negate any expression.

IS NULL will check to see whether or not the data in a specified column is null. Typically it is used in combination with the NOT keyword to only access records which do not contain null values.

FROM users 

Intermediate SQL Commands

Aggregation functions

COUNT(column) is used to return the number of rows containing non-null values in a given column. You can return the total number of rows by either counting on the index column or using *.

-- returns total number of users
FROM users 

-- returns number of users where last_name is not blank
SELECT COUNT(last_name) 
FROM users 

SUM(column) provides a summation of all numeric values in a specified column.

MIN(column) provides the minimum value within a column. For columns containing text, 'A' is considered the lowest value. MAX(column) provides the maximum value within a column. For columns containing text, 'Z' is considered the highest value.

AVG(column) calculates the arithmetic mean for numeric columns, ignoring null values.

So far, we've only used the aggregation functions on the entire dataset. However, it is possible to segment the data into groups and aggregate each group independently. Suppose you have a table that has the daily stock price of a company for the past few years, and you'd like to look at the monthly stock price average over this time period. You could accomplish this by grouping the records by year and month (GROUP BY), and then aggregate the individual records in each group to find the average (AVG).

SELECT year,
       AVG(daily_open) AS monthly_avg
FROM google_stock_price
GROUP BY year, month
ORDER BY month, year

Whereas the previously discussed WHERE command can be used to filter individual records, HAVING is used to filter groups. For example, you could query the stock prices database to report the monthly average only for the months which had high price fluctuations. The same set of comparison and logical operators can be used for filtering groups. Aggregator functions (in this case, MAX and MIN) operate individually for each group.

SELECT year,
       AVG(daily_open) AS monthly_avg
FROM google_stock_price
GROUP BY year, month
HAVING MAX(daily_open) - MIN(daily_open) > 50
ORDER BY month, year

If your table has repeated values across records, you can use DISTINCT to return a set of unique values, or unique combinations of values in the case of using DISTINCT for multiple columns. You can also place DISTINCT inside an aggregator function to perform aggregations only on the unique set of values.

-- example using DISTINCT for multiple columns
FROM google_stock_price

-- example using DISTINCT within an aggregator function (suppose there are multiple stock price records per day)
SELECT year, 
       COUNT(DISTINCT day) AS days_in_month
FROM google_stock_price

SQL is capable of performing "if/then" logic during queries using CASE. The general structure is to provide instructions on what to do if the case is true (CASE --> THEN), what to do if the case is not true (ELSE --> END or just END). For multiple "if" statements, you can specify what to do for each case (WHEN --> THEN).

Case statements may be used in both SELECT and GROUP BY statements.

SELECT first_name,
       CASE WHEN age < 18 
       THEN 'child'
       WHEN age >= 18 
       THEN 'adult'
       ELSE NULL 
       END AS legal_age_classification
FROM users

-- count the number of records in each age bracket
SELECT CASE WHEN age >= 7 AND age < 22 THEN 'gen_z'
            WHEN age >= 22 AND age < 38 THEN 'gen_y'
            WHEN age >= 38 AND age < 53 THEN 'gen_x'
            WHEN age >= 53 THEN 'baby_boomer'
            ELSE 'unassigned' 
            END AS age_group,
            COUNT(1) AS count 
FROM users
GROUP BY age_group

-- reorient the age bracket count horizontally (like a pivot table)
SELECT COUNT(CASE WHEN age >= 7 AND age < 22 THEN 1 ELSE NULL END) AS gen_z_count,
       COUNT(CASE WHEN age >= 22 AND age < 38 THEN 1 ELSE NULL END) AS gen_y_count,
       COUNT(CASE WHEN age >= 38 AND age < 53 THEN 1 ELSE NULL END) AS gen_x_count,
       COUNT(CASE WHEN age >= 53 THEN 1 ELSE NULL END) AS baby_boomer_count
FROM users
Combining data from different sources

Data joins are used to combine related information stored across multiple tables. A foreign key is used to describe how a record in one table relates to information in another table.

There are many different ways to combine information across two tables, as discussed below. These approaches define how we'll treat records which don't have a connection relating it to records in the other table. To understand how the following joins discussed work, check out this interactive demo.

Note: If both tables have the same column name, you need to create alias for the columns in the results. This is accomplished with SELECT column AS alias.

In order to combine data from multiple tables, there must be a common field relating records in one table to records in the other. If this is the case, we can combine the records by using JOIN to specify the table containing related information and ON to provide the relationship between the two tables. This is basically saying, when a record in the original table matches a record in the join table, combine the two records. You can also provide additional logic in the ON statement to further control what records are joined. Lastly, you can also specify multiple foreign keys to be matched on; if there are two columns that are common across both tables, joining on both of these keys can sometimes increase accuracy of your query.

FROM users
JOIN emails
ON users.id = emails.user_id

Pro tip: You can also create an alias for your tables by including the alias after the table name.

FROM really_long_name_for_table_of_users users -- here, users is an alias
JOIN emails
ON users.id = emails.user_id AND users.join_date > '20150214' 
-- example of additional logic used in the ON statement

By default, SQL performs an inner join. For this approach, only results which are related across the two tables are returned (excluding all other records). You can also explicitly define an inner join with INNER JOIN instead of JOIN, although the two commands are synonymous.

All joins returns rows that are matched via the join criteria, however, an inner join excludes all unmatched rows. It is possible, however, to include records even if they weren't matched with any records in the other table - this is known as an outer join. The unmatched records will contain null values for values which would have normally been retrieved from the joined table.

A left join (LEFT JOIN) returns all unmatched records in the original table (specified via FROM), in addition to the matched records. The unmatched records will contain null values for the fields that were retrieved from the second table (specified via JOIN).

A right join (RIGHT JOIN) does the opposite, returning all unmatched records in the second table (specified via JOIN), in addition to the matched records. The unmatched records will contain null values for the fields that could not be located in the original table (specified via FROM).

A full outer join (FULL JOIN) returns all records, matched and unmatched, from both tables.

You can combine multiple queries, stacking the results of one on top of the other, by creating a UNION. By default, UNION will not repeat any identical rows in the results. If you want the full addition of all results across queries, you must specify that with UNION ALL. In order for two queries to be compatible for a union, they must have the same number of columns and each column should hold the same data type across both queries.

FROM table1


FROM table2

Order of operations

When building an SQL query, it's important to keep in mind the order that commands are executed. People often say that lexical order and logical order of SQL commands differ; what they mean is that the way we would form a query in plain English (lexical) doesn't always match the order of information (logical) a computer needs to perform the query.

  1. FROM
  2. ON
  3. WHERE
  5. Aggregation functions (COUNT, SUM, MIN/MAX, AVG)
  10. ORDER BY
  11. LIMIT, TOP