17 Intro to Databases and SQL
17.1 Relational databases
A relational database is a type of database that stores data in tables, where each table represents an entity (such as a customer, product, or order) and each row represents a record of that entity. The data is organized into columns, which represent attributes or properties of the entity. The key concept behind a relational database is the idea of relationships between tables. Tables can be related to each other through the use of keys, which are columns in a table that uniquely identify each row in that table. A key in one table can be used to create a relationship with another table, where the key in one table is used as a reference to the corresponding row in another table.
For example, below is a diagram showing the relationship between three tables of IMDb TV show data – similar to what was prepared for you in your earlier project. These diagrams are called ER (Entity-Relationship) diagrams and are a visual representation of the relationships between entities (such as TV shows, episodes and ratings.) in a database.
ERD of a subset of IMDb data
SQLite is a popular relational database management system (RDBMS). It is widely used in various applications, including web browsers, operating systems, mobile devices, and more. SQLite is open-source software and is free to use and distribute. SQLite is a lightweight, serverless, self-contained, and transactional RDBMS that is widely used in applications that need to store data locally.
Other popular RDBMS include MySQL, PostgreSQL, Oracle, and Microsoft SQL.
17.2 Introduction to SQL using R
SQL (Structured Query Language) is a programming language used to manage and manipulate data stored in relational databases. While SQL is its own language R (and other popular languages such as Python) provides several packages to connect and interact with databases using SQL.
The basics of using SQL is very similar to the main verbs in dplyr. In fact, there is a package called dbplyr in R that extends dplyr functionality to databases. Although the dplyr (and dbplyr) verbs can often solve a problem that you are likely to come across, learning SQL can be a valuable skill in industry.
For example, here are a few common SQL commands and their dplyr equivalences.
| SQL | dplyr |
|---|---|
| SELECT | select |
| FROM | from |
| WHERE | filter |
| GROUP BY | group_by |
| HAVING | filter |
| ORDER BY | arrange |
| JOIN | left_join, right_join, inner_join, full_join |
| UNION | bind_rows |
17.2.1 Commonly used R packages to assist in analyzing databases
DBI: This is the standard interface for communicating with databases inR. It provides a consistent set of functions for connecting to and interacting with databases, regardless of the specific database management system being used.RSQLite: This is anRpackage that allows you to create and manipulate SQLite databases directly fromR. RSQLite is a DBI-compatible interface which means you primarily use functions defined in the DBI package, so you should always start by loading DBI, not RSQLite.dplyranddbplyr: Can be used for fairly straightforward queries as it auto-translatesdplyrcommands intoSQL.sqldf:sqldfis anRpackage for runningSQLstatements onRdata frames, optimized for convenience.
17.3 Examples
The file imdb_tv.sqlite is an SQLite database that contains information on the very popular TV shows that started since 2015.
- Load the appropriate packages to connect to and analyze this database.
library(DBI) # we will then use the RSQLite package only for connecting to our db
library(dplyr)
library(dbplyr) # we will see how to use dplyr verbs to make some simple queries- Use
dbConnect()to establish a connection to the imdb_tv.sqlite database.
imdb_db <- dbConnect(RSQLite::SQLite(),
dbname = "data/imdb_tv.sqlite")- A first step would be to check which tables are available in the database. Then, for each table we can check the fields (i.e., variables) stored within each.
# use DBI command
dbListTables(conn = imdb_db)# list field for one table
dbListFields(conn = imdb_db, name = "show_info")
dbListFields(conn = imdb_db, name = "show_ratings")
# quick trick to print out the Fields for each table
n_tables = length( dbListTables(conn = imdb_db) )
for(i in 1:n_tables){
tbl_name <- dbListTables(conn = imdb_db)[i]
print(paste("Table: ",tbl_name))
print( dbListFields(conn = imdb_db, tbl_name) )
cat(paste0("\n\n"))
}- Load an entire table in R (as a data frame). IMPORTANT NOTE - in many cases SQL databases contain a huge amount of data. You should always be aware of how large your dataset is before attempting to load it in to your R environment.
Using DBI
query =
"
SELECT * FROM episode_rating
"
# to execute the query
res = dbSendQuery(conn = imdb_db, statement = query)
res
# now we need to fetch the result
episode_rating_df = dbFetch(res)
# returns a data.frameUsing dplyr
# returns a tibble
episode_rating_df2 <-
tbl(src = imdb_db,
"episode_rating"
) %>%
collect() # similar to dbFetch()- Simple queries
- Determine the number of rows in each table
Using DBI:
query = "SELECT COUNT(*) FROM episode_rating"res <- dbSendQuery(imdb_db, query)
dbFetch(res)Using dplyr
tbl(src = imdb_db, "episode_rating") %>%
# count()
summarize(n = n()) - Finding the max number of votes in an episode
DBI
# double check the field names
dbListFields(imdb_db, "episode_rating")
query = "SELECT MAX(numVotes) FROM episode_rating"
res <- dbSendQuery(imdb_db, query)
dbFetch(res)dplyr
imdb_db %>%
tbl("episode_rating") %>%
summarize(maxVote = max(numVotes))- Keep only the episodes with averageRatings below 5. Return only the parentTconst, rating, votes, and episode name.
DBI
dbListFields(imdb_db, "episode_rating")
query = "SELECT parentTconst, averageRating, numVotes, Episode_Name FROM episode_rating WHERE averageRating < 5"
res = dbSendQuery(imdb_db, query)
episode_subset <- dbFetch(res)
episode_subsetdplyr
episode_subset2 <-
imdb_db %>%
tbl("episode_rating") %>%
filter(averageRating < 5) %>%
select(parentTconst, averageRating,
numVotes, Episode_Name) %>%
collect()
episode_subset2- Queries on multiple tables.
This data is far more interesting when you combine the information from multiple tables. In particular, attaching the show’s name and overall rating to the episode information.
The keys we will need to link these tables to each other are stored in the parentTconst field. This value is the identify of the show. (The tconst field in the episode_rating table is the identifer for the specifical episode.)
- Join the show information (show_info table) to the show ratings table.
DBI:
query =
"
SELECT *
FROM show_ratings
LEFT JOIN show_info
ON show_ratings.parentTconst = show_info.parentTconst
"
res = dbSendQuery(imdb_db, query)
show_df <- dbFetch(res)dplyr
dbListTables(imdb_db)
# grab necessary tables (but don't collect them)
show_ratings_db <- imdb_db %>% tbl("show_ratings")
show_info_db <- imdb_db %>% tbl("show_info")
show_df2 <- left_join(x = show_ratings_db,
y = show_info_db,
by = "parentTconst") %>%
collect()- Now join all this information to the episode information!
DBI
query =
"
SELECT *
FROM episode_rating
LEFT JOIN show_ratings
ON episode_rating.parentTconst = show_ratings.parentTconst
LEFT JOIN show_info
on show_ratings.parentTconst = show_info.parentTconst
"
res = dbSendQuery(imdb_db, query)
imdb_df <- dbFetch(res)dplyr
show_ratings_db <- imdb_db %>% tbl("show_ratings")
show_info_db <- imdb_db %>% tbl("show_info")
episode_rating_db <- imdb_db %>% tbl("episode_rating")
imdb_df2 <- left_join(x = show_ratings_db,
y = show_info_db,
by = "parentTconst") %>%
left_join(y = episode_rating_db,
by = "parentTconst") %>%
collect()- (Just for fun) For each show, calculate the difference between the show’s overall rating and the mean of the averageRatings of each episode. Sort the result.
imdb_df_cleaned <-
imdb_df2 %>%
rename(episodeRating = averageRating.y,
showRating = averageRating.x,
episode_numVotes = numVotes.y,
show_numVotes = numVotes.x,
)
imdb_df_cleaned %>%
# you take over from here
group_by(Show_Name) %>%
summarise(
rating_diff = mean(episodeRating, na.rm=TRUE) - mean(showRating, na.rm = TRUE)
) %>%
tidyr::drop_na(rating_diff) %>%
arrange(desc(rating_diff))- Save the joined tables as a csv file for later.
readr::write_csv(imdb_df_cleaned,
"imdb_popular_tvshows.csv"
)- Always a good practice to use
dbDisconnect()when you are finished to disconnect to the database.
dbDisconnect(imdb_db)