There is a little piece of software on the Internet that I very much liked.
It’s called GraphTV, and it allows you to view the IMDb ratings for any TV show, with seasonal trends.

However, it is not working at the moment, and I’ve been frustrated by the fact that it is sometimes a “hit or miss”: only one result spawns from your search, and it can be difficult to pinpoint a specific show between homonyms (think about all those British TV shows with US remakes…).
So I decided to remake it, and, to make the most of it, do a tutorial on how I did that.

TL;DR: I remade GraphTV, you can access it here, and the code is freely available on my git repo

Source of data

First of all, we need to find a source of data.

The original GraphTV used IMDb, so I set out to find an IMBb API.
Turns out it doesn’t exist, but there are other solutions:

the Open Movie Database API, but it is limited (no possibility to find episodes from a show)
Rotten Tomatoes seems to have an API, but it needs to be requested through a Business Proposal Form
The Movie Database API, which works well (you can find a previous version of my take on GraphTV that used this API here), but the ratings are much more limited than those one can find on IMDb.
The TV Db, but the ratings are even more limited than those on TMDb.
but IMBd also provides daily snapshots of its databases, which can be found here!

It looks like we’re going to do this locally after all.

Getting the data

There are three files we are interest in on https://datasets.imdbws.com/:

title.basics.tsv: titles of shows and episodes, as well as the release year
title.episode.tsv: join table between shows and episodes, with season and episode numbers
title.ratings.tsv: ratings and number of votes for every show/episode

To get these files we can use cURL in the command line and pipe the result through gzip to extract the data.
Also, for title.basics.tsv, we only need the tconst (id), titleType, primaryTitle and startYear columns, so we will use cut to filter those columns out and save some space, as well as speed up the reading of the files afterwards.
Finally, the titles are not quoted, and might contain quotes, so we’ll escape those using sed:

curl https://datasets.imdbws.com/title.basics.tsv.gz | gzip -d | cut -f 1-3,6 | sed 's/"/\\"/g' > basics.tsv
curl https://datasets.imdbws.com/title.ratings.tsv.gz | gzip -d > ratings.tsv
curl https://datasets.imdbws.com/title.episode.tsv.gz | gzip -d > episodes.tsv

We’ve got all the data, neatly formatted as TSV, with a download script that can be executed whenever we want to refresh the local database (remember those files are generated only once every 24h, so no need to scrape them every minute)

Pre-processing the data

Now that we’ve got everything required, let’s get started!

To load the data in R, we’ll use the read_ family of functions of readr, which is included in the tidyverse:

library(tidyverse)

read_tsv("basics.tsv") -> basics
read_tsv("episodes.tsv") -> episodes
read_tsv("ratings.tsv") -> ratings

However, we quickly see that loading this data takes a bit of time (~10s on my machine), and we’ll need to join the tables together to have usable data. This is unacceptable as a loading time in the scope of a web application, and since we have the data locally and updated every 24 hours, we can do the heavy lifting once and for all every time we scrape the data!

So we’ll put all the code in this section in a separate script, and save the result as an RDS file for the webapp to read at startup.

First, let’s extract the series:

basics %>%
  filter(titleType == "tvSeries") %>%
  select(-titleType) ->
tvseries

And the episodes:

basics %>%
  filter(titleType == "tvEpisode") %>%
  select(-titleType, -startYear) ->
tvepisodes

We need to join the series and episodes, using the corresponding table in title.episode.tsv.
tvseries$tconst will correspond to episodes$parentTconst, and tvepisodes$tconst will correspond to episodes$tconst.

tvseries %>%
  inner_join(episodes, by = c("tconst" = "parentTconst")) %>%

Episodes names and series names are all given by the primaryTitle variable, so a bit of renaming is in order to avoid collisions when joining with the episodes names.
Series with no seasons or episodes have \\N for seasonNumber and episodeNumber, so we can get rid of those too, preferably before joining to reduce the amount of rows to join together.

  select(id = tconst, seriesTitle = primaryTitle, eptconst = tconst.y, season = seasonNumber, episode = episodeNumber, startYear) %>%
  filter(season != "\\N") %>%
  inner_join(tvepisodes, by = c("eptconst" = "tconst")) %>%

Next, we need to join in the ratings:

  inner_join(ratings, by = c("eptconst" = "tconst")) %>%

And finally, do a last bit of renaming and selecting variables, set the numerical variables type and sort the table.

Put together:

tvseries %>%
  inner_join(episodes, by = c("tconst" = "parentTconst")) %>%
  select(id = tconst, seriesTitle = primaryTitle, eptconst = tconst.y, season = seasonNumber, episode = episodeNumber, startYear) %>%
  filter(season != "\\N") %>%
  inner_join(tvepisodes, by = c("eptconst" = "tconst")) %>%
  inner_join(ratings, by = c("eptconst" = "tconst")) %>%
  select(-eptconst, episodeTitle = primaryTitle) %>%
  mutate_at(vars(season, episode, averageRating, numVotes), as.numeric) %>%
  arrange(seriesTitle, season, episode) ->
imdb

We used inner_join because we want to keep only series with episodes (and episodes belonging to series), as well as episodes with ratings (IMDb sometimes creates whole seasons in advance, and since we want to graph ratings for aired shows, we don’t need those extra episodes)

We can now save this prepared database as an RDS file:

saveRDS(imdb, "imbd.rds")

This preprocessing step can be executed right after the scraping in the script we made earlier.
On my server, the whole process (including downloading the files, and I’m on a slow DSL connection) takes less than 3 minutes. Every 24 hours, to transform about 400MB of raw data into a tiny 4.1MB RDS file, I say that’s not bad!

Graphing

We’re getting to the interesting part: the actual graphing of the ratings!

As I wanted to have interactivity, I decided to use plotly for the graphing.
I used to use it through its wrapper around ggplots, ggplotly, but this time I wanted better control over the plots so I used the native functions in R.

Let’s first prototype the graph in a static version using ggplot2.
What we want is a plot of every episode rating, with linear trends for every season.

First, we need to isolate one series, let’s use an obscure one I’m sure nobody knows, Game of Thrones (id = tt0944947).

imdb %>%
  filter(id == "tt0944947") %>%

Then what we’ll need is

a geom_point with episode for x and averageRating for y
a geom_smooth(method = "lm") for the trends
facet_grid with season for columns

ggplot() +
  aes(x = episode, y = averageRating) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_grid(~season)

Let’s translate that into plotly lingo.
plotly doesn’t do facets, but does subplots, so we’ll need to create subplots then assemble them.
plotly doesn’t have geom_, but trace.
plotly doesn’t do smooth, but we can add a lines trace for an lm fit.

Here we go, for one subplot (x being the database filter on one season from one series)

fit <- lm(averageRating ~ episode, data = x)
x %>%
 plot_ly() %>%
 add_trace(data = x, x = ~episode, y = ~averageRating, mode = "markers") %>%
 add_trace(data = x, x = ~episode, y = fitted(fit), mode = "lines")

We can make it better by customizing the hover text (the text that will be displayed when hovering the mouse over a point), in the plot_ly call:

plot_ly(hoverinfo = "text",
        text = ~str_c("S", season, "E", episode, " - ", episodeTitle, "<br>",
                      "Rating: ", averageRating, " (", numVotes, " votes)<br>"))

Also, we can customize the layout for each subplot:

layout(xaxis = list(tick0 = 0, dtick = 1),
      yaxis = list(title = "Rating"))

Now we’ll wrap this all up in an anonymous function, that will be executed on every season of a series using by, and pass the resulting list of subplots to… subplot, with arguments to share the Y axis, arrange the plots on one row and set a minimal margin between subplots.
Finally, we’ll get rid of the legend.

imdb %>%
  filter(id == "tt0944947") %>%
  by(.$season, function(x){
       fit <- lm(averageRating ~ episode, data = x)
       x %>%
         plot_ly(hoverinfo = "text",
                 text = ~str_c("S", season, "E", episode, " - ", episodeTitle, "<br>",
                               "Rating: ", averageRating, " (", numVotes, " votes)<br>")) %>%
       add_trace(data = x, x = ~episode, y = ~averageRating, mode = "markers") %>%
       add_trace(data = x, x = ~episode, y = fitted(fit), mode = "lines") %>%
       layout(xaxis = list(tick0 = 0, dtick = 1),
              yaxis = list(title = "Rating"))
      }) %>%
subplot(shareY = T, nrows = 1, margin = 0) %>%
hide_legend

Wrapping it up in a shiny app

As good TV series addicts, we’re obviously not interested in just Game of Thrones.

We want to be able to graph any show on IMDb, and preferably have autocompletion to make it easier to find a show.

First things first, let’s load the data we acquired before:

readRDS("imdb.rds") -> imdb

Then we’ll create the series list, adding the year of release for disambiguation:

imdb %>%
  distinct(id, .keep_all = T) %>%
  mutate(title = str_c(seriesTitle, " (", startYear, ")")) %>%
  select(id, title) -> shows

shows$id %>%
  setNames(shows$title) -> shows

Creating that list as a named vector makes it better for selectizeInput in shiny: the names is what will be displayed to the user, but the values will be returned by the selector.

The shiny UI is very simple, with an inputPanel to hold the selector, preset with our obscure show, and an output for the plotly graph:

ui <- fluidPage(
  inputPanel(width = "20%", height = "100%",
             h1("GraphTV"),
             selectizeInput("show", "Search show", shows, selected = "tt0944947")
             ),
  plotlyOutput("show_graph", width = "100%", height = "800px")
)

The shiny server function retrieves the show id from the selector and updates the graph with the new results:

server <- function(input, output, session)
{
  output$show_graph <- renderPlotly({
    req(input$show)

    imdb %>%
      filter(id == input$show) %>%
      by(.$season, function(x){
           fit <- lm(averageRating ~ episode, data = x)
           x %>%
             plot_ly(hoverinfo = "text",
                     text = ~str_c("S", season, "E", episode, " - ", episodeTitle, "<br>",
                                   "Rating: ", averageRating, " (", numVotes, " votes)<br>")) %>%
           add_trace(data = x, x = ~episode, y = ~averageRating, mode = "markers") %>%
           add_trace(data = x, x = ~episode, y = fitted(fit), mode = "lines") %>%
           layout(xaxis = list(tick0 = 0, dtick = 1),
                  yaxis = list(title = "Rating"))
          }) %>%
    subplot(shareY = T, nrows = 1, margin = 0) %>%
    hide_legend
  })
}

Et voilà !
We just have to run it with shinyApp(ui, server).

The resulting app is available on my shiny server, enjoy!