Chapter 3 Chapter 3 Data transformation
We are working around three dataset in total, that is, titles.basics, title.crew, title.ratings. Therefore, it was very necessary to transformation the individual overall datasets by merging different datasets, adding new columns or removing the ones that were not needed or that had any NULL values.
Below, we describe the transformations that we incorporated which were necessary to accomplish the results of different questions.
3.1 Transformations to study distribution trends in Adult v/s Non-Adult titles
This question basically addresses two objectives.
- The first part studies the trend of number of adult title releases with the change of time.
- The second part studies the proportion of votes distribution of adult versus non-adult titles over the given period of time.
For this question, we primarily use the title.basics and title.ratings dataset. Therefore, we will initially prepare a new auxiliary dataframe which contains transformed version of these datasets.
3.1.1 Adult title releases trends
In this part of the first objective question we wish to observe that with progressing years, did the film industry started making more adults, or less adults, or was the number of releases constant. The following steps show a detailed explanation on how we transformed the data step by step:
- Among all the columns available in the title.basics dataset, we are only interested in using the ‘startYear’ and ‘isAdult’ columns to achieve the first part of the first objective question. Thus, we dropped the other available columns in the data which were unnecessary for our exploration, likewise, ‘primaryTitle’, ‘originalTitle’, ‘endYear’, ‘runtimeMinutes’ and formed a new auxiliary dataframe named ‘basics_ratings’.
- Further, as mentioned earlier, we need the necessary information(‘startYear’ and ‘isAdult’) from title.basics, which basically is the basics_ratings dataframe, and title.ratings data combined in one auxiliary dataframe. So, we merge the title.ratings dataset with the basics_ratings dataframe and use to address both first and second part of the first objective question.
-Moreover, we analyzed that there were missing values in the ‘isAdult’ and ‘startYear’ columns. As we are going to explore that with the progress of year does the number of adult movies released had any trend or not, so we need to remove the start years values that have NULL. Also, we are summing up the total number of adult titles released per year, therefore, it is necessary to delete the rows that have NA values for isAdult. Thus, we dropped the rows which either contained NULL values for startYear or NULL values for isAdult.
- After cleaning the data by adding, removing and merging the columns, the next step is to group similar start years together and calculate the total number of adult titles that were released in that particular year. For this grouping operation to execute, we converted the datatype of both ‘startYear’ and ‘isAdult’ columns to numeric. Once the datatype was transformed in the desired format, we grouped similar years together based on two different conditions:
-Grouped same years and calculated the sum of the isAdult column which gave the total number of adult titles released in a particular year. We assigned this to a new temporary dataframe.
-Grouped same years and counted the number of rows of the isAdult column which gave the total number of titles released in a particular in a particular year irrespective of adult or non-adult. We assigned this to another new temporary dataframe.
Finally, we merged the above two temporary dataframes into one by the grouped start years.
- As the last step, we calculated the proportion of the adult titles released per start year. This percent would then be used to plot it against their respective start years and study the trend.
3.1.2 Ratings distribution of non-adult vs adult titles
In this part of the first objective question, we are basically studying that did the non-adult titles released more percentages of votes, or less percentages of votes, or same as compared to the votes received by the adult titles in particular year. The following steps explains the transformations that we followed:
- First, we filtered out the rows from the basics_ratings dataset that had information about the non-adult titles and stored them in a new temporary auxiliary dataframe. Similarily, we filtered out the rows from the basics_ratings dataset that had information about the adult titles and stored them in another new temporary auxiliary dataframe.
- Further, we merged the above two temporary dataframes together into one based on similar start years and calculated the sum of the number of votes that were received by non adult and adult titles respectively. These two sums were basically created as two different new columns.
- Next we merged the above two columns based on start year to form a new transformed dataframe. Moreover, while merging there was a case when there would be some years which were not present in non adult titles or vice versa. So, performed a detailed transformation that included the start years from both the columns and not just the common years. In this situation, the rows that would result in NA values for the number of votes were transformed to get replaced by 0.
- Finally, we calculated a new column ‘non-adult’ in the merged dataframe above that contained the proportion of titles that were non-adult type for the particular year. Similarly, we created another new column in the merged dataframe above that contained the proportion of titles that were adult type for that particular year. These two columns would be used to plot against the start year and study that among the total votes received per year, how much percentage of votes were for non-adult titles and how much were for adult.
3.2 Transformations to study the distributions of genre
Our second objective question studies how different genres vary in terms of votes, title type, and ratings. We would like to know which genres tend to recieve the most attention and how the type of work has an affect on ratings even within the most popular genre. As the dataset is very large, we decided to subset the dataset based on works done by experience directors. We have defined experience as those directors who have completed at least 20 works. By subsetting this way, we are also looking at top works within the dataset. The limitation with subsetting in this manner is that we may be losing data on works created by directors who are experienced but do not show up in the dataset at least 20 times.
Our transformations are as follows:
- First we filtered out all of the titles that were not created by experienced directors. The remaining titles were grouped by title type, and genre. A new column was added for weighted ratings (ratings weighted by number of votes) and total number of votes. Any potential rows with missing values were dropped.
- To see how different genres compare with each other, we first obtained all unique genres. Many of the rows in the genre column had comma separated values. However, there were some rows that had singular genre names. It can be noted that a list of all singular genre names represents the unique genre names. For each unique genre name, we obtained the total number of times the genre appears (within both singular and comma separated row values) and the average rating of all of the works within that genre. We also grouped this dataset based on title type to see what the distribution across title types looks like.
- Due to the large size of the data, it is difficult to view trends within each genre. Therefore, we decided to specifically look at the most popular genre. After visualizing the comparison of different genres (seen in Chapter 5.2), we concluded that the most popular genre is “Drama”. Therefore, we subset our previous dataset of works by experienced directors by works that have “Drama” as at least one of the genre types. This was again grouped as earlier (grouped by title type, and genre, with the addition of total votes and weighted averages).
3.3 Transformations to study the performance of different director-writer pairs
The question basically addresses two problems
- The first part studies the top 10 director - writer pairs based on the number of times they worked together. Further, these pairs are used to study the distribution of average ratings of different title types they have produced.
- The second part studies that out of all the director writer pairs having average rating of 10, what is the distribution of the num of Votes for those particular pairs faceted by titleType.
For addressing the first question mentioned above we need to perform the following transformations:
- First, we filtered out the rows from the basics_ratings_crew dataset that had NA for writers and directors. We wanted to study the different director writer pairs so it was important that the dataset had no NULL values for either of these variables.
- Furthermore, for making the pair we used the paste function for concatenating the director column with writer column to form new column named director_writer_pair. This merging was required as our main aim revolves around studying the director-writer pair and this new columns will ease the upcoming visualization process.
- The total unique director writer pairs were approximately 600K. Visualizing all of them is not feasible and the results that they would give would also not be intuitive. Therefore, we decided to pick the top 10 pairs based on the number of times they have worked together. So we calculated the count of all the pairs and then selected the top 10.
- Eventually, we filtered the original dataset with the top 10 director writer pairs.
For addressing the second question mentioned above we need to perform the following transformations:
- To begin with, we used the dataset created for the above question which contained the director writer pairs. Furthermore, we grouped the data by “director_writer_pair”, “titleType”, and “genres” and then summarized the data by calculating the total number of Votes received by the pair for different title types.
- Further, we summarized the same grouped data for calculating the weighted rating which is calculated by taking the sum of average rating \(\times\) number of Votes divided by total number of Votes.
- Next, we joined both the grouped data to get the weighted rating as well as total number of Votes for a particular director-writer pair.
- Finally, we filtered the data by weighted rating in order to get only the pairs having weighted rating equal to 10. We then merged this filtered data with the data having the count of number of times a particular pair worked together.
- The final dataset had total of 783 pairs who had average rating of 10 so to further filter the data we studied the distribution of the number of times the pair worked together and found that out of 783, 692 pairs worked with each other only once or twice and therefore we further narrowed the data to only consider the pairs who have worked more than 5 times with each other.
## # A tibble: 603,287 × 4
## # Groups: director_writer_pair, directors [603,287]
## director_writer_pair directors writers weightedRating
## <chr> <chr> <chr> <dbl>
## 1 nm0000005 - nm0000005 nm0000005 nm0000005 8.06
## 2 nm0000005 - nm0000005,nm0279027 nm0000005 nm0000005,nm0279027 6.4
## 3 nm0000005 - nm0000005,nm0340471 nm0000005 nm0000005,nm0340471 7.5
## 4 nm0000005 - nm0000005,nm0410694 nm0000005 nm0000005,nm0410694 7.5
## 5 nm0000005 - nm0000005,nm0529508 nm0000005 nm0000005,nm0529508 6.6
## 6 nm0000005 - nm0000005,nm0845496 nm0000005 nm0000005,nm0845496 6.4
## 7 nm0000005 - nm0000005,nm1749058 nm0000005 nm0000005,nm1749058 7
## 8 nm0000005 - nm0051923,nm0000005 nm0000005 nm0051923,nm0000005 7.1
## 9 nm0000005 - nm0074790 nm0000005 nm0074790 6.2
## 10 nm0000005 - nm0091112,nm0649636 nm0000005 nm0091112,nm0649636 5.6
## # … with 603,277 more rows
3.3.1 Interactive Plot
## # A tibble: 4,137 × 4
## # Groups: titleType [10]
## titleType genres weightedRating totalVotes
## <chr> <chr> <dbl> <int>
## 1 movie Action 5.35 124954
## 2 movie Action,Adult 4.69 286
## 3 movie Action,Adult,Adventure 6.6 8
## 4 movie Action,Adult,Crime 5.37 241
## 5 movie Action,Adult,Drama 4.66 1196
## 6 movie Action,Adventure 7.51 3527216
## 7 movie Action,Adventure,Animation 7.23 1342592
## 8 movie Action,Adventure,Biography 7.07 684209
## 9 movie Action,Adventure,Comedy 6.60 8364728
## 10 movie Action,Adventure,Crime 6.32 1809553
## # … with 4,127 more rows
The above transformation subsets our data by genre and saves each subset as its own file.