Chapter 2 Data sources
We use the IMDb Datasets for our final group project. These datasets are updated daily. They are also quite dense and contain both categorical and continuous variables.
Of the datasets available at the link above, we collected and used three for our project: title.basics.tsv.gz, title.crew.tsv.gz, and title.ratings.tsv.gz. We also merged and manipulated these datasets as needed to address our objectives. Each team member was responsible for collecting the data: we downloaded the files from that link and unzipped them for use in this project.
We chose these three datasets because we believed they contain the information relevant to the questions we want to investigate. Each dataset is described in detail below:
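The download-and-unzip step can be scripted rather than done by hand. The following is a minimal sketch, not the procedure we actually followed: the base URL is an assumption (the link itself is not reproduced in this chapter), and `gunzip_file()` and `fetch_imdb()` are hypothetical helper names. The `sources` directory matches the paths used when reading the files later in this chapter.

```r
# Assumed base URL for the IMDb dataset files; adjust if the source differs.
base_url <- "https://datasets.imdbws.com"
files <- c("title.basics.tsv.gz", "title.crew.tsv.gz", "title.ratings.tsv.gz")

# Decompress a .gz file to a plain file using base R connections only.
# (Hypothetical helper; by default the output path drops the ".gz" suffix.)
gunzip_file <- function(gz_path, out_path = sub("\\.gz$", "", gz_path)) {
  con_in <- gzfile(gz_path, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    chunk <- readBin(con_in, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, con_out)
  }
  invisible(out_path)
}

# Download each archive into `dest_dir` and unzip it next to the archive.
# (Hypothetical helper; requires network access when called.)
fetch_imdb <- function(dest_dir = "sources") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (f in files) {
    gz_path <- file.path(dest_dir, f)
    download.file(paste(base_url, f, sep = "/"), gz_path, mode = "wb")
    gunzip_file(gz_path)
  }
}
```

Calling `fetch_imdb()` once would leave the three `.tsv` files in `sources/`, assuming the URL above is correct.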
2.1 Basics
title.basics.tsv.gz
This dataset provides basic information about each title. Its columns are listed below:
1. tconst (string) - alphanumeric unique identifier of the title
2. titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
3. primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
4. originalTitle (string) - original title, in the original language
5. isAdult (boolean) - 0: non-adult title; 1: adult title
6. startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
7. endYear (YYYY) - TV Series end year. ‘\N’ for all other title types
8. runtimeMinutes - primary runtime of the title, in minutes
9. genres (string array) - includes up to three genres associated with the title
# Read the unzipped TSV file; IMDb encodes missing values as \N
basics <- read.csv("sources/title.basics.tsv", sep = '\t', header = TRUE, fill = TRUE, na.strings = "NA")
# Recode the \N placeholders as NA
basics[basics == "\\N"] <- NA
head(basics, 5)
## tconst titleType primaryTitle originalTitle isAdult
## 1 tt0000001 short Carmencita Carmencita 0
## 2 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0
## 3 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0
## 4 tt0000004 short Un bon bock Un bon bock 0
## 5 tt0000005 short Blacksmith Scene Blacksmith Scene 0
## startYear endYear runtimeMinutes genres
## 1 1894 <NA> 1 Documentary,Short
## 2 1892 <NA> 5 Animation,Short
## 3 1892 <NA> 4 Animation,Comedy,Romance
## 4 1892 <NA> 12 Animation,Short
## 5 1893 <NA> 1 Comedy,Short
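As an alternative to recoding `\N` after loading, the placeholder can be mapped to `NA` at read time. The sketch below is illustrative only and uses a small temporary file with made-up rows rather than the real data. Setting `quote = ""` is an assumption about what is wanted here: with the default quoting, an unbalanced quotation mark in a title may cause several lines to be read as one field.

```r
# Write a tiny stand-in for an IMDb TSV file (toy rows, not real data).
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "tconst\tprimaryTitle\tendYear",
  "tt9000001\tAn \"Unbalanced title\t\\N",
  "tt9000002\tPlain title\t1999"
), tmp)

toy <- read.csv(
  tmp,
  sep = "\t",
  header = TRUE,
  na.strings = "\\N",        # treat the \N placeholder as missing
  quote = "",                # treat quote characters as ordinary text
  stringsAsFactors = FALSE
)

str(toy)
```

With this approach, columns such as endYear can be parsed as numbers directly, because the placeholder never enters the column as a string.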
The columns we expect to use from this dataset are titleType, isAdult, startYear, and genres. This information will help us answer all of the questions that we have formulated.
All of the columns above are of type character; we will convert them to more suitable data types as needed to answer our questions. The basics dataset has 8486592 rows (roughly 8.5 million) and 9 columns.
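A type conversion of this kind could look like the sketch below. It is one possible approach under stated assumptions, not necessarily the conversion we apply later: the target types (factor, logical, integer, list of genres) are our guess at sensible choices, and `basics_toy` is a two-row stand-in built from the values shown in the output above.

```r
# Toy rows in the same shape as the basics table (all columns as character).
basics_toy <- data.frame(
  tconst = c("tt0000001", "tt0000003"),
  titleType = c("short", "short"),
  isAdult = c("0", "0"),
  startYear = c("1894", "1892"),
  endYear = c(NA_character_, NA_character_),
  runtimeMinutes = c("1", "4"),
  genres = c("Documentary,Short", "Animation,Comedy,Romance"),
  stringsAsFactors = FALSE
)

# Convert each character column to a type that suits its content.
basics_typed <- transform(
  basics_toy,
  titleType = factor(titleType),
  isAdult = isAdult == "1",
  startYear = as.integer(startYear),
  endYear = as.integer(endYear),
  runtimeMinutes = as.integer(runtimeMinutes)
)

# Split the comma-separated genres string into a list column,
# so each title carries a character vector of up to three genres.
basics_typed$genres <- strsplit(basics_typed$genres, ",", fixed = TRUE)

str(basics_typed)
```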
2.2 Crew
title.crew.tsv.gz
This dataset lists the directors and writers for each title. Its columns are described below:
1. tconst (string) - alphanumeric unique identifier of the title
2. directors (array of nconsts) - director(s) of the given title
3. writers (array of nconsts) - writer(s) of the given title
# Read the unzipped TSV file; IMDb encodes missing values as \N
crew <- read.csv("sources/title.crew.tsv", sep = '\t', header = TRUE, fill = TRUE, na.strings = "NA")
# Recode the \N placeholders as NA
crew[crew == "\\N"] <- NA
head(crew, 5)
## tconst directors writers
## 1 tt0000001 nm0005690 <NA>
## 2 tt0000002 nm0721526 <NA>
## 3 tt0000003 nm0721526 <NA>
## 4 tt0000004 nm0721526 <NA>
## 5 tt0000005 nm0005690 <NA>
We will use all of the columns in this dataset to address question 3 of our project.
The default data type of the columns above is character. The crew dataset has 8486594 rows (roughly 8.5 million) and 3 columns.
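Because directors and writers are stored as comma-separated lists of nconsts in a single character field, analysis per person usually needs one row per title and person. The sketch below shows one way to do that for directors in base R. The identifiers in `crew_toy` are made up for illustration, and dropping titles with no listed director is an assumption, not a decision taken in this chapter.

```r
# Toy crew rows (made-up identifiers); the second title has two directors.
crew_toy <- data.frame(
  tconst = c("tt9000001", "tt9000002", "tt9000003"),
  directors = c("nm9000001", "nm9000001,nm9000002", NA),
  stringsAsFactors = FALSE
)

# Keep only titles that list at least one director.
has_director <- !is.na(crew_toy$directors)
director_list <- strsplit(crew_toy$directors[has_director], ",", fixed = TRUE)

# One row per (title, director) pair.
crew_long <- data.frame(
  tconst = rep(crew_toy$tconst[has_director], lengths(director_list)),
  director = unlist(director_list),
  stringsAsFactors = FALSE
)

crew_long
```

The same pattern applies to the writers column.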
2.3 Ratings
title.ratings.tsv.gz
This dataset gives the average rating and the number of votes for each title. We may need to normalize these ratings, since the average rating of a title with very few votes can be heavily skewed. The columns contained in this dataset are as follows:
1. tconst (string) - alphanumeric unique identifier of the title
2. averageRating - weighted average of all the individual user ratings
3. numVotes - number of votes the title has received
# Read the unzipped TSV file; IMDb encodes missing values as \N
ratings <- read.csv("sources/title.ratings.tsv", sep = '\t', header = TRUE, fill = TRUE, na.strings = "NA")
# Recode the \N placeholders as NA
ratings[ratings == "\\N"] <- NA
head(ratings, 5)
## tconst averageRating numVotes
## 1 tt0000001 5.7 1842
## 2 tt0000002 6.0 237
## 3 tt0000003 6.5 1604
## 4 tt0000004 6.0 154
## 5 tt0000005 6.2 2423
We expect to use all of the columns in this dataset to answer all three questions of our project. The variable types of tconst, averageRating, and numVotes are character, double, and integer, respectively. The ratings dataset has 1191702 rows (roughly 1.2 million) and 3 columns.
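One common way to reduce the influence of titles with few votes is to shrink each rating toward the overall mean, with less shrinkage as the vote count grows. The sketch below illustrates that idea on the five rows shown above. It is only one candidate for the normalization mentioned earlier, not a method we have committed to, and both the function name `weighted_rating()` and the `min_votes` value of 500 are arbitrary assumptions.

```r
# The five rows printed above, used here as a small example.
ratings_toy <- data.frame(
  tconst = c("tt0000001", "tt0000002", "tt0000003", "tt0000004", "tt0000005"),
  averageRating = c(5.7, 6.0, 6.5, 6.0, 6.2),
  numVotes = c(1842L, 237L, 1604L, 154L, 2423L)
)

# Shrink each rating toward the overall mean rating.
# w is close to 1 for titles with many votes and close to 0 for few votes.
weighted_rating <- function(rating, votes, min_votes = 500) {
  overall_mean <- mean(rating)
  w <- votes / (votes + min_votes)
  w * rating + (1 - w) * overall_mean
}

ratings_toy$weightedRating <- weighted_rating(
  ratings_toy$averageRating, ratings_toy$numVotes
)

ratings_toy
```

A simpler alternative would be to drop titles below a minimum vote count; the shrinkage approach keeps every title but trusts sparsely rated ones less.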