Chapter 3 Chapter 3 Data transformation

We are working around three dataset in total, that is, titles.basics, title.crew, title.ratings. Therefore, it was very necessary to transformation the individual overall datasets by merging different datasets, adding new columns or removing the ones that were not needed or that had any NULL values.

Below, we describe the transformations that we incorporated which were necessary to accomplish the results of different questions.

3.2 Transformations to study the distributions of genre



Our second objective question studies how different genres vary in terms of votes, title type, and ratings. We would like to know which genres tend to recieve the most attention and how the type of work has an affect on ratings even within the most popular genre. As the dataset is very large, we decided to subset the dataset based on works done by experience directors. We have defined experience as those directors who have completed at least 20 works. By subsetting this way, we are also looking at top works within the dataset. The limitation with subsetting in this manner is that we may be losing data on works created by directors who are experienced but do not show up in the dataset at least 20 times.



Our transformations are as follows: - First we filtered out all of the titles that were not created by experienced directors. The remaining titles were grouped by title type, and genre. A new column was added for weighted ratings (ratings weighted by number of votes) and total number of votes. Any potential rows with missing values were dropped. - To see how different genres compare with each other, we first obtained all unique genres. Many of the rows in the genre column had comma separated values. However, there were some rows that had singular genre names. It can be noted that a list of all singular genre names represents the unique genre names. For each unique genre name, we obtained the total number of times the genre appears (within both singular and comma separated row values) and the average rating of all of the works within that genre. We also grouped this dataset based on title type to see what the distribution across title types looks like. - Due to the large size of the data, it is difficult to view trends within each genre. Therefore, we decided to specifically look at the most popular genre. After visualizing the comparison of different genres (seen in Chapter 5.2), we concluded that the most popular genre is “Drama”. Therefore, we subset our previous dataset of works by experienced directors by works that have “Drama” as at least one of the genre types. This was again grouped as earlier (grouped by title type, and genre, with the addition of total votes and weighted averages).

3.3 Transformations to study the performance of different director-writer pairs


The question basically addresses two problems
- The first part studies the top 10 director - writer pairs based on the number of times they worked together. Further, these pairs are used to study the distribution of average ratings of different title types they have produced.
- The second part studies that out of all the director writer pairs having average rating of 10, what is the distribution of the num of Votes for those particular pairs faceted by titleType.



For addressing the first question mentioned above we need to perform the following transformations:
- First, we filtered out the rows from the basics_ratings_crew dataset that had NA for writers and directors. We wanted to study the different director writer pairs so it was important that the dataset had no NULL values for either of these variables.
- Furthermore, for making the pair we used the paste function for concatenating the director column with writer column to form new column named director_writer_pair. This merging was required as our main aim revolves around studying the director-writer pair and this new columns will ease the upcoming visualization process.
- The total unique director writer pairs were approximately 600K. Visualizing all of them is not feasible and the results that they would give would also not be intuitive. Therefore, we decided to pick the top 10 pairs based on the number of times they have worked together. So we calculated the count of all the pairs and then selected the top 10.
- Eventually, we filtered the original dataset with the top 10 director writer pairs.



For addressing the second question mentioned above we need to perform the following transformations:
- To begin with, we used the dataset created for the above question which contained the director writer pairs. Furthermore, we grouped the data by “director_writer_pair”, “titleType”, and “genres” and then summarized the data by calculating the total number of Votes received by the pair for different title types.
- Further, we summarized the same grouped data for calculating the weighted rating which is calculated by taking the sum of average rating \(\times\) number of Votes divided by total number of Votes.
- Next, we joined both the grouped data to get the weighted rating as well as total number of Votes for a particular director-writer pair.
- Finally, we filtered the data by weighted rating in order to get only the pairs having weighted rating equal to 10. We then merged this filtered data with the data having the count of number of times a particular pair worked together.
- The final dataset had total of 783 pairs who had average rating of 10 so to further filter the data we studied the distribution of the number of times the pair worked together and found that out of 783, 692 pairs worked with each other only once or twice and therefore we further narrowed the data to only consider the pairs who have worked more than 5 times with each other.

## # A tibble: 603,287 × 4
## # Groups:   director_writer_pair, directors [603,287]
##    director_writer_pair            directors writers             weightedRating
##    <chr>                           <chr>     <chr>                        <dbl>
##  1 nm0000005 - nm0000005           nm0000005 nm0000005                     8.06
##  2 nm0000005 - nm0000005,nm0279027 nm0000005 nm0000005,nm0279027           6.4 
##  3 nm0000005 - nm0000005,nm0340471 nm0000005 nm0000005,nm0340471           7.5 
##  4 nm0000005 - nm0000005,nm0410694 nm0000005 nm0000005,nm0410694           7.5 
##  5 nm0000005 - nm0000005,nm0529508 nm0000005 nm0000005,nm0529508           6.6 
##  6 nm0000005 - nm0000005,nm0845496 nm0000005 nm0000005,nm0845496           6.4 
##  7 nm0000005 - nm0000005,nm1749058 nm0000005 nm0000005,nm1749058           7   
##  8 nm0000005 - nm0051923,nm0000005 nm0000005 nm0051923,nm0000005           7.1 
##  9 nm0000005 - nm0074790           nm0000005 nm0074790                     6.2 
## 10 nm0000005 - nm0091112,nm0649636 nm0000005 nm0091112,nm0649636           5.6 
## # … with 603,277 more rows

3.3.1 Interactive Plot

## # A tibble: 4,137 × 4
## # Groups:   titleType [10]
##    titleType genres                     weightedRating totalVotes
##    <chr>     <chr>                               <dbl>      <int>
##  1 movie     Action                               5.35     124954
##  2 movie     Action,Adult                         4.69        286
##  3 movie     Action,Adult,Adventure               6.6           8
##  4 movie     Action,Adult,Crime                   5.37        241
##  5 movie     Action,Adult,Drama                   4.66       1196
##  6 movie     Action,Adventure                     7.51    3527216
##  7 movie     Action,Adventure,Animation           7.23    1342592
##  8 movie     Action,Adventure,Biography           7.07     684209
##  9 movie     Action,Adventure,Comedy              6.60    8364728
## 10 movie     Action,Adventure,Crime               6.32    1809553
## # … with 4,127 more rows

The above transformation subsets our data by genre and saves each subset as its own file.