Remaking GraphTV
There is a little piece of software on the Internet that I very much liked.
It’s called GraphTV, and it allows you to view the IMDb ratings for any TV show, with seasonal trends.
However, it is not working at the moment, and I’ve been frustrated by the fact that it is sometimes a “hit or miss”: only one result spawns from your search, and it can be difficult to pinpoint a specific show between homonyms (think about all those British TV shows with US remakes…).
So I decided to remake it, and, to make the most of it, do a tutorial on how I did that.
TL;DR: I remade GraphTV, you can access it here, and the code is freely available on my git repo
Source of data
First of all, we need to find a source of data.
The original GraphTV used IMDb, so I set out to find an IMBb API.
Turns out it doesn’t exist, but there are other solutions:
- the Open Movie Database API, but it is limited (no possibility to find episodes from a show)
- Rotten Tomatoes seems to have an API, but it needs to be requested through a Business Proposal Form
- The Movie Database API, which works well (you can find a previous version of my take on GraphTV that used this API here), but the ratings are much more limited than those one can find on IMDb.
- The TV Db, but the ratings are even more limited than those on TMDb.
- but IMBd also provides daily snapshots of its databases, which can be found here!
It looks like we’re going to do this locally after all.
Getting the data
There are three files we are interest in on https://datasets.imdbws.com/:
- title.basics.tsv: titles of shows and episodes, as well as the release year
- title.episode.tsv: join table between shows and episodes, with season and episode numbers
- title.ratings.tsv: ratings and number of votes for every show/episode
To get these files we can use cURL
in the command line and pipe the result through gzip
to extract the data.
Also, for title.basics.tsv, we only need the tconst (id), titleType, primaryTitle and startYear columns, so we will use cut
to filter those columns out and save some space, as well as speed up the reading of the files afterwards.
Finally, the titles are not quoted, and might contain quotes, so we’ll escape those using sed
:
curl https://datasets.imdbws.com/title.basics.tsv.gz | gzip -d | cut -f 1-3,6 | sed 's/"/\\"/g' > basics.tsv
curl https://datasets.imdbws.com/title.ratings.tsv.gz | gzip -d > ratings.tsv
curl https://datasets.imdbws.com/title.episode.tsv.gz | gzip -d > episodes.tsv
We’ve got all the data, neatly formatted as TSV, with a download script that can be executed whenever we want to refresh the local database (remember those files are generated only once every 24h, so no need to scrape them every minute)
Pre-processing the data
Now that we’ve got everything required, let’s get started!
To load the data in R, we’ll use the read_
family of functions of readr, which is included in the tidyverse:
library(tidyverse)
read_tsv("basics.tsv") -> basics
read_tsv("episodes.tsv") -> episodes
read_tsv("ratings.tsv") -> ratings
However, we quickly see that loading this data takes a bit of time (~10s on my machine), and we’ll need to join the tables together to have usable data. This is unacceptable as a loading time in the scope of a web application, and since we have the data locally and updated every 24 hours, we can do the heavy lifting once and for all every time we scrape the data!
So we’ll put all the code in this section in a separate script, and save the result as an RDS file for the webapp to read at startup.
First, let’s extract the series:
basics %>%
filter(titleType == "tvSeries") %>%
select(-titleType) ->
tvseries
And the episodes:
basics %>%
filter(titleType == "tvEpisode") %>%
select(-titleType, -startYear) ->
tvepisodes
We need to join the series and episodes, using the corresponding table in title.episode.tsv.
tvseries$tconst
will correspond to episodes$parentTconst
, and tvepisodes$tconst
will correspond to episodes$tconst
.
tvseries %>%
inner_join(episodes, by = c("tconst" = "parentTconst")) %>%
Episodes names and series names are all given by the primaryTitle variable, so a bit of renaming is in order to avoid collisions when joining with the episodes names.
Series with no seasons or episodes have \\N
for seasonNumber and episodeNumber, so we can get rid of those too, preferably before joining to reduce the amount of rows to join together.
select(id = tconst, seriesTitle = primaryTitle, eptconst = tconst.y, season = seasonNumber, episode = episodeNumber, startYear) %>%
filter(season != "\\N") %>%
inner_join(tvepisodes, by = c("eptconst" = "tconst")) %>%
Next, we need to join in the ratings:
inner_join(ratings, by = c("eptconst" = "tconst")) %>%
And finally, do a last bit of renaming and selecting variables, set the numerical variables type and sort the table.
Put together:
tvseries %>%
inner_join(episodes, by = c("tconst" = "parentTconst")) %>%
select(id = tconst, seriesTitle = primaryTitle, eptconst = tconst.y, season = seasonNumber, episode = episodeNumber, startYear) %>%
filter(season != "\\N") %>%
inner_join(tvepisodes, by = c("eptconst" = "tconst")) %>%
inner_join(ratings, by = c("eptconst" = "tconst")) %>%
select(-eptconst, episodeTitle = primaryTitle) %>%
mutate_at(vars(season, episode, averageRating, numVotes), as.numeric) %>%
arrange(seriesTitle, season, episode) ->
imdb
We used inner_join
because we want to keep only series with episodes (and episodes belonging to series), as well as episodes with ratings (IMDb sometimes creates whole seasons in advance, and since we want to graph ratings for aired shows, we don’t need those extra episodes)
We can now save this prepared database as an RDS file:
saveRDS(imdb, "imbd.rds")
This preprocessing step can be executed right after the scraping in the script we made earlier.
On my server, the whole process (including downloading the files, and I’m on a slow DSL connection) takes less than 3 minutes. Every 24 hours, to transform about 400MB of raw data into a tiny 4.1MB RDS file, I say that’s not bad!
Graphing
We’re getting to the interesting part: the actual graphing of the ratings!
As I wanted to have interactivity, I decided to use plotly for the graphing.
I used to use it through its wrapper around ggplots, ggplotly
, but this time I wanted better control over the plots so I used the native functions in R.
Let’s first prototype the graph in a static version using ggplot2.
What we want is a plot of every episode rating, with linear trends for every season.
First, we need to isolate one series, let’s use an obscure one I’m sure nobody knows, Game of Thrones (id = tt0944947).
imdb %>%
filter(id == "tt0944947") %>%
Then what we’ll need is
- a
geom_point
withepisode
for x andaverageRating
for y - a
geom_smooth(method = "lm")
for the trends facet_grid
withseason
for columns
ggplot() +
aes(x = episode, y = averageRating) +
geom_point() +
geom_smooth(method = "lm") +
facet_grid(~season)
Let’s translate that into plotly lingo.
plotly doesn’t do facets, but does subplots, so we’ll need to create subplots then assemble them.
plotly doesn’t have geom_, but trace.
plotly doesn’t do smooth, but we can add a lines trace for an lm
fit.
Here we go, for one subplot (x being the database filter on one season from one series)
fit <- lm(averageRating ~ episode, data = x)
x %>%
plot_ly() %>%
add_trace(data = x, x = ~episode, y = ~averageRating, mode = "markers") %>%
add_trace(data = x, x = ~episode, y = fitted(fit), mode = "lines")
We can make it better by customizing the hover text (the text that will be displayed when hovering the mouse over a point), in the plot_ly
call:
plot_ly(hoverinfo = "text",
text = ~str_c("S", season, "E", episode, " - ", episodeTitle, "<br>",
"Rating: ", averageRating, " (", numVotes, " votes)<br>"))
Also, we can customize the layout for each subplot:
layout(xaxis = list(tick0 = 0, dtick = 1),
yaxis = list(title = "Rating"))
Now we’ll wrap this all up in an anonymous function, that will be executed on every season of a series using by
, and pass the resulting list of subplots to… subplot
, with arguments to share the Y axis, arrange the plots on one row and set a minimal margin between subplots.
Finally, we’ll get rid of the legend.
imdb %>%
filter(id == "tt0944947") %>%
by(.$season, function(x){
fit <- lm(averageRating ~ episode, data = x)
x %>%
plot_ly(hoverinfo = "text",
text = ~str_c("S", season, "E", episode, " - ", episodeTitle, "<br>",
"Rating: ", averageRating, " (", numVotes, " votes)<br>")) %>%
add_trace(data = x, x = ~episode, y = ~averageRating, mode = "markers") %>%
add_trace(data = x, x = ~episode, y = fitted(fit), mode = "lines") %>%
layout(xaxis = list(tick0 = 0, dtick = 1),
yaxis = list(title = "Rating"))
}) %>%
subplot(shareY = T, nrows = 1, margin = 0) %>%
hide_legend
Wrapping it up in a shiny app
As good TV series addicts, we’re obviously not interested in just Game of Thrones.
We want to be able to graph any show on IMDb, and preferably have autocompletion to make it easier to find a show.
First things first, let’s load the data we acquired before:
readRDS("imdb.rds") -> imdb
Then we’ll create the series list, adding the year of release for disambiguation:
imdb %>%
distinct(id, .keep_all = T) %>%
mutate(title = str_c(seriesTitle, " (", startYear, ")")) %>%
select(id, title) -> shows
shows$id %>%
setNames(shows$title) -> shows
Creating that list as a named vector makes it better for selectizeInput
in shiny: the names is what will be displayed to the user, but the values will be returned by the selector.
The shiny UI is very simple, with an inputPanel
to hold the selector, preset with our obscure show, and an output for the plotly graph:
ui <- fluidPage(
inputPanel(width = "20%", height = "100%",
h1("GraphTV"),
selectizeInput("show", "Search show", shows, selected = "tt0944947")
),
plotlyOutput("show_graph", width = "100%", height = "800px")
)
The shiny server function retrieves the show id from the selector and updates the graph with the new results:
server <- function(input, output, session)
{
output$show_graph <- renderPlotly({
req(input$show)
imdb %>%
filter(id == input$show) %>%
by(.$season, function(x){
fit <- lm(averageRating ~ episode, data = x)
x %>%
plot_ly(hoverinfo = "text",
text = ~str_c("S", season, "E", episode, " - ", episodeTitle, "<br>",
"Rating: ", averageRating, " (", numVotes, " votes)<br>")) %>%
add_trace(data = x, x = ~episode, y = ~averageRating, mode = "markers") %>%
add_trace(data = x, x = ~episode, y = fitted(fit), mode = "lines") %>%
layout(xaxis = list(tick0 = 0, dtick = 1),
yaxis = list(title = "Rating"))
}) %>%
subplot(shareY = T, nrows = 1, margin = 0) %>%
hide_legend
})
}
Et voilà !
We just have to run it with shinyApp(ui, server)
.
The resulting app is available on my shiny server, enjoy!