As a data scientist, you deal with a lot of data. For small datasets, maybe you just store this information in a CSV file and load it into Pandas. However, this isn't really a scalable solution and won't do too well if you're constantly updating and inserting new data. You need a database; somewhere to hold your information that allows for easy inserting, updating, and reading. There are two main approaches to storing data in a database: relational databases store data as a row in tables with a pre-defined structure, while non-relational databases store data independently as a document. This post will focus on accessing data from relational databases.
Relational database tables are very similar to Pandas dataframes, each row corresponds with a certain object and columns are used to store different features of that object. For example, a website might have a table in their database that stores its users' information (username, first name, last name, etc.).
Note: Websites should never store your password in plain text.
When we want to get information from a relational database, we do so by querying (asking) for it. Specifically, we use a structured query language (SQL) to speak with the database and access data. A schema defines how information is stored within the table, essentially describing what type of information is stored in each column.
SQL has the ability to search, filter, and combine data that might be stored across multiple tables. In this post, I'll discuss the common SQL commands used for data analysis.
Quick links for reference
Basic SQL Commands
SELECT/FROM: specifies the columns and table of interest
LIMIT: caps the query results
ORDER BY: specify the order results are returned in
WHERE: filters the data using a comparison or logical operator
LIKE: used to match on similar values instead of strictly identical values
BETWEEN: only select rows in the specified range
IN: specify a list of values to include
AND: select rows which satisfy two specified conditions
OR: select rows which satisfy one of the specified conditions
NOT: select rows which do not satisfy the specified condition
IS NULL: select rows with null values in specified column
Intermediate SQL Commands
- Aggregation functions (column based)
GROUP BY: segments records into groups
HAVING: used to filter groups using a comparison or logical operator
DISTINCT: returns only unique values, no duplicates
CASE (WHEN/THEN/ELSE/END): if/then logic for SQL
- Data joins
UNION (ALL): combine results from multiple queries
- SQL commands order of operation
- SQL data types (external link)
Basic SQL Commands
There are two commands which are absolutely required for every SQL query, what data you want (
SELECT) and where that data is stored (
FROM). With the
SELECT command, you can specify the columns of a table that you'd like to retrieve while
FROM specifies what table you're interested in retrieving data from. The "*" symbol is shorthand for "all", so if you wanted to grab all of the users' fields from the example table above you'd perform a query like
SELECT * FROM users.
AS keyword will create an alias for a column in the data retrieved.
SELECT first_name AS "First Name" FROM users
You can also combine values from different columns (within the same row).
SELECT unread_count + read_count AS total_count FROM emails
SELECT * FROM users LIMIT 100
You can use the command
ORDER BY to specify the column to sort the returned results. The default order is ascending, but you can add the keyword
DESC after specifying the column to return a descending order of results. You can also specify multiple columns to order by.
SELECT * FROM users ORDER BY year, month DESC
If you'd like to filter your results, you can specify which results you'd like using the
WHERE command, followed by the column to filter on and a comparison or logical expression. For example, you could query the database for records where the user's age is greater than 21.
SELECT * FROM users WHERE age > 21
If you want to filter a collection of similar records, you can specify the similarity using
ILIKE to ignore case in strings for some flavors of SQL). The
% symbol can be used as a wildcard where it could represent any character or set of characters. To limit the wildcard to only one character, you can use the
SELECT * FROM users WHERE name LIKE 'Bill%' -- returns for Bill, Billy, Billie, etc.
To access data from a specific time period, a certain age group, or any other range you might be interested in, you can use the
BETWEEN command to specify a range of results to filter by. Both upper and lower bounds are included in the search results.
SELECT * FROM users WHERE age BETWEEN 20 AND 30
SELECT * FROM users WHERE age IN (20, 22, 24, 26)
OR are used to combine multiple expressions when specifying a filter on the data, where
AND only returns values which meet all specifications while
OR will return values where at least one of the specifications is met.
SELECT * FROM users WHERE age IS NOT NULL
Intermediate SQL Commands
-- returns total number of users SELECT COUNT(*) FROM users -- returns number of users where last_name is not blank SELECT COUNT(last_name) FROM users
MIN(column) provides the minimum value within a column. For columns containing text, 'A' is considered the lowest value.
MAX(column) provides the maximum value within a column. For columns containing text, 'Z' is considered the highest value.
So far, we've only used the aggregation functions on the entire dataset. However, it is possible to segment the data into groups and aggregate each group independently. Suppose you have a table that has the daily stock price of a company for the past few years, and you'd like to look at the monthly stock price average over this time period. You could accomplish this by grouping the records by year and month (
GROUP BY), and then aggregate the individual records in each group to find the average (
SELECT year, month, AVG(daily_open) AS monthly_avg FROM google_stock_price GROUP BY year, month ORDER BY month, year
Whereas the previously discussed
WHERE command can be used to filter individual records,
HAVING is used to filter groups. For example, you could query the stock prices database to report the monthly average only for the months which had high price fluctuations. The same set of comparison and logical operators can be used for filtering groups. Aggregator functions (in this case,
MIN) operate individually for each group.
SELECT year, month, AVG(daily_open) AS monthly_avg FROM google_stock_price GROUP BY year, month HAVING MAX(daily_open) - MIN(daily_open) > 50 ORDER BY month, year
If your table has repeated values across records, you can use
DISTINCT to return a set of unique values, or unique combinations of values in the case of using
DISTINCT for multiple columns. You can also place
DISTINCT inside an aggregator function to perform aggregations only on the unique set of values.
-- example using DISTINCT for multiple columns SELECT DISTINCT year, month FROM google_stock_price -- example using DISTINCT within an aggregator function (suppose there are multiple stock price records per day) SELECT year, month, COUNT(DISTINCT day) AS days_in_month FROM google_stock_price
SQL is capable of performing "if/then" logic during queries using
CASE. The general structure is to provide instructions on what to do if the case is true (
THEN), what to do if the case is not true (
END or just
END). For multiple "if" statements, you can specify what to do for each case (
Case statements may be used in both
GROUP BY statements.
SELECT first_name, last_name, CASE WHEN age < 18 THEN 'child' WHEN age >= 18 THEN 'adult' ELSE NULL END AS legal_age_classification FROM users -- count the number of records in each age bracket SELECT CASE WHEN age >= 7 AND age < 22 THEN 'gen_z' WHEN age >= 22 AND age < 38 THEN 'gen_y' WHEN age >= 38 AND age < 53 THEN 'gen_x' WHEN age >= 53 THEN 'baby_boomer' ELSE 'unassigned' END AS age_group, COUNT(1) AS count FROM users GROUP BY age_group -- reorient the age bracket count horizontally (like a pivot table) SELECT COUNT(CASE WHEN age >= 7 AND age < 22 THEN 1 ELSE NULL END) AS gen_z_count, COUNT(CASE WHEN age >= 22 AND age < 38 THEN 1 ELSE NULL END) AS gen_y_count, COUNT(CASE WHEN age >= 38 AND age < 53 THEN 1 ELSE NULL END) AS gen_x_count, COUNT(CASE WHEN age >= 53 THEN 1 ELSE NULL END) AS baby_boomer_count FROM users
Combining data from different sources
Data joins are used to combine related information stored across multiple tables. A foreign key is used to describe how a record in one table relates to information in another table.
There are many different ways to combine information across two tables, as discussed below. These approaches define how we'll treat records which don't have a connection relating it to records in the other table. To understand how the following joins discussed work, check out this interactive demo.
Note: If both tables have the same column name, you need to create alias for the columns in the results. This is accomplished with
SELECT column AS alias.
In order to combine data from multiple tables, there must be a common field relating records in one table to records in the other. If this is the case, we can combine the records by using
JOIN to specify the table containing related information and
ON to provide the relationship between the two tables. This is basically saying, when a record in the original table matches a record in the join table, combine the two records. You can also provide additional logic in the
ON statement to further control what records are joined. Lastly, you can also specify multiple foreign keys to be matched on; if there are two columns that are common across both tables, joining on both of these keys can sometimes increase accuracy of your query.
SELECT * FROM users JOIN emails ON users.id = emails.user_id
Pro tip: You can also create an alias for your tables by including the alias after the table name.
SELECT * FROM really_long_name_for_table_of_users users -- here, users is an alias JOIN emails ON users.id = emails.user_id AND users.join_date > '20150214' -- example of additional logic used in the ON statement
By default, SQL performs an inner join. For this approach, only results which are related across the two tables are returned (excluding all other records). You can also explicitly define an inner join with
INNER JOIN instead of
JOIN, although the two commands are synonymous.
All joins returns rows that are matched via the join criteria, however, an inner join excludes all unmatched rows. It is possible, however, to include records even if they weren't matched with any records in the other table - this is known as an outer join. The unmatched records will contain null values for values which would have normally been retrieved from the joined table.
A left join (
LEFT JOIN) returns all unmatched records in the original table (specified via
FROM), in addition to the matched records. The unmatched records will contain null values for the fields that were retrieved from the second table (specified via
A right join (
RIGHT JOIN) does the opposite, returning all unmatched records in the second table (specified via
JOIN), in addition to the matched records. The unmatched records will contain null values for the fields that could not be located in the original table (specified via
You can combine multiple queries, stacking the results of one on top of the other, by creating a
UNION. By default,
UNION will not repeat any identical rows in the results. If you want the full addition of all results across queries, you must specify that with
UNION ALL. In order for two queries to be compatible for a union, they must have the same number of columns and each column should hold the same data type across both queries.
SELECT * FROM table1 UNION ALL SELECT * FROM table2
Order of operations
When building an SQL query, it's important to keep in mind the order that commands are executed. People often say that lexical order and logical order of SQL commands differ; what they mean is that the way we would form a query in plain English (lexical) doesn't always match the order of information (logical) a computer needs to perform the query.
- GROUP BY
- Aggregation functions (COUNT, SUM, MIN/MAX, AVG)
- UNION, INTERSECT, EXCEPT
- ORDER BY
- LIMIT, TOP