Chapter 2 Data sources

We use the IMDb Datasets for our final group project. These datasets are updated daily. They are also large and contain both categorical and continuous variables.

Of all the datasets that IMDb makes available, we collected and used three for our project: title.basics.tsv.gz, title.crew.tsv.gz, and title.ratings.tsv.gz. We also merged and manipulated these datasets as needed to address our objectives. Each team member was responsible for collecting the data: we downloaded the files from the IMDb Datasets page and unzipped them for use in this project.
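As a minimal sketch of the merge step mentioned above: all three files share the tconst key, so they can be joined on it. The join type (left joins that keep every title from the basics table) is an assumption, as the text does not state it, and the `_toy` data frames are small made-up stand-ins rather than the real files.

```r
# Toy stand-ins for the three IMDb tables (hypothetical values for illustration).
basics_toy <- data.frame(
  tconst    = c("tt0000001", "tt0000002", "tt0000003"),
  titleType = c("short", "short", "movie"),
  stringsAsFactors = FALSE
)
crew_toy <- data.frame(
  tconst    = c("tt0000001", "tt0000002", "tt0000003"),
  directors = c("nm0005690", "nm0721526", NA),
  stringsAsFactors = FALSE
)
ratings_toy <- data.frame(
  tconst        = c("tt0000001", "tt0000003"),
  averageRating = c(5.7, 6.5),
  numVotes      = c(1842L, 1604L),
  stringsAsFactors = FALSE
)

# Left joins on the shared key keep every title from basics (assumed join type);
# titles without a rating get NA in the rating columns.
merged_toy <- merge(basics_toy, crew_toy, by = "tconst", all.x = TRUE)
merged_toy <- merge(merged_toy, ratings_toy, by = "tconst", all.x = TRUE)

print(merged_toy)
```

Using all.x = TRUE matters here because the ratings file covers far fewer titles than the other two; an inner join would silently drop every unrated title.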

We chose these three datasets because they contain the information relevant to the questions we want to investigate. Each dataset is described in detail below:

2.1 Basics

title.basics.tsv.gz: This dataset provides basic information about each title. Its columns are:
1. tconst (string) - alphanumeric unique identifier of the title
2. titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
3. primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
4. originalTitle (string) - original title, in the original language
5. isAdult (boolean) - 0: non-adult title; 1: adult title
6. startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
7. endYear (YYYY) - TV Series end year. '\N' for all other title types
8. runtimeMinutes - primary runtime of the title, in minutes
9. genres (string array) - includes up to three genres associated with the title

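# Read the tab-separated file; fill = TRUE pads rows that have too few fields
basics <- read.csv("sources/title.basics.tsv", sep = '\t', header = TRUE, fill = TRUE, na.strings = "NA")
# IMDb marks missing values with the literal string \N; recode it as NA
basics[basics == "\\N"] <- NA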

head(basics, 5)

##      tconst titleType           primaryTitle          originalTitle isAdult
## 1 tt0000001     short             Carmencita             Carmencita       0
## 2 tt0000002     short Le clown et ses chiens Le clown et ses chiens       0
## 3 tt0000003     short         Pauvre Pierrot         Pauvre Pierrot       0
## 4 tt0000004     short            Un bon bock            Un bon bock       0
## 5 tt0000005     short       Blacksmith Scene       Blacksmith Scene       0
##   startYear endYear runtimeMinutes                   genres
## 1      1894    <NA>              1        Documentary,Short
## 2      1892    <NA>              5          Animation,Short
## 3      1892    <NA>              4 Animation,Comedy,Romance
## 4      1892    <NA>             12          Animation,Short
## 5      1893    <NA>              1             Comedy,Short

The columns we expect to use from this dataset are titleType, isAdult, startYear, and genres. This information will help us answer all of the objective questions that we have formulated.

All of the columns above are read in as character; we will convert them to more suitable data types to help answer our questions. The basics dataset has 8,486,592 rows (about 8.5 million) and 9 columns.
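The type conversion described above might look like the following sketch. The target types (factor, logical, integer, and a list column for genres) are assumptions about what is "relevant", and `basics_toy` and `genre_list` are hypothetical names used only for illustration.

```r
# Toy rows mimicking title.basics with every column read as character.
basics_toy <- data.frame(
  tconst    = c("tt0000001", "tt0000002"),
  titleType = c("short", "short"),
  isAdult   = c("0", "0"),
  startYear = c("1894", NA),
  genres    = c("Documentary,Short", "Animation,Short"),
  stringsAsFactors = FALSE
)

basics_toy$titleType <- factor(basics_toy$titleType)      # categorical variable
basics_toy$isAdult   <- basics_toy$isAdult == "1"         # logical flag
basics_toy$startYear <- as.integer(basics_toy$startYear)  # missing years stay NA

# Split the comma-separated genres into a list column, one vector per title.
basics_toy$genre_list <- strsplit(basics_toy$genres, ",", fixed = TRUE)

str(basics_toy)
```

Keeping the original genres string alongside the list column makes it easy to check the split against the raw value.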

2.2 Crew

title.crew.tsv.gz: This dataset gives the directors and writers of each title. Its columns are:
1. tconst (string) - alphanumeric unique identifier of the title
2. directors (array of nconsts) - director(s) of the given title
3. writers (array of nconsts) - writer(s) of the given title

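# Read the tab-separated file; fill = TRUE pads rows that have too few fields
crew <- read.csv("sources/title.crew.tsv", sep = '\t', header = TRUE, fill = TRUE, na.strings = "NA")
# IMDb marks missing values with the literal string \N; recode it as NA
crew[crew == "\\N"] <- NA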

head(crew, 5)

##      tconst directors writers
## 1 tt0000001 nm0005690    <NA>
## 2 tt0000002 nm0721526    <NA>
## 3 tt0000003 nm0721526    <NA>
## 4 tt0000004 nm0721526    <NA>
## 5 tt0000005 nm0005690    <NA>

We will use all of the columns in this dataset to address question 3 of our project.

By default, all of the columns above are read in as character. The crew dataset has 8,486,594 rows (about 8.5 million) and 3 columns.
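Because the directors and writers columns hold comma-separated lists of nconst identifiers in a single string, they usually need to be split before any per-person analysis. The sketch below shows one way to reshape directors into one row per title and director; `crew_toy`, `crew_long`, and the second identifier on the second row are made up for illustration.

```r
# Toy crew table; the second title is given two directors for illustration.
crew_toy <- data.frame(
  tconst    = c("tt0000001", "tt0000002", "tt0000003"),
  directors = c("nm0005690", "nm0721526,nm0000001", NA),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of nconsts.
# A missing value yields a single NA, so the title is not lost.
director_list <- strsplit(crew_toy$directors, ",", fixed = TRUE)

# One row per (title, director) pair.
crew_long <- data.frame(
  tconst   = rep(crew_toy$tconst, lengths(director_list)),
  director = unlist(director_list),
  stringsAsFactors = FALSE
)

print(crew_long)
```

The same pattern applies to the writers column.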

2.3 Ratings

title.ratings.tsv.gz: This dataset gives the average rating and the number of votes for each title. We may need to normalize the ratings, as an average can be heavily skewed when a title has too few votes. The columns in this dataset are:
1. tconst (string) - alphanumeric unique identifier of the title
2. averageRating - weighted average of all the individual user ratings
3. numVotes - number of votes the title has received

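# Read the tab-separated file; fill = TRUE pads rows that have too few fields
ratings <- read.csv("sources/title.ratings.tsv", sep = '\t', header = TRUE, fill = TRUE, na.strings = "NA")
# IMDb marks missing values with the literal string \N; recode it as NA
ratings[ratings == "\\N"] <- NA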

head(ratings, 5)

##      tconst averageRating numVotes
## 1 tt0000001           5.7     1842
## 2 tt0000002           6.0      237
## 3 tt0000003           6.5     1604
## 4 tt0000004           6.0      154
## 5 tt0000005           6.2     2423

We expect to use all of the columns in this dataset to answer all three questions of our project. The column types are character, double, and integer for tconst, averageRating, and numVotes, respectively. The ratings dataset has 1,191,702 rows (about 1.2 million) and 3 columns.
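The concern about ratings based on too few votes could be handled in more than one way, and the text does not commit to a method. The sketch below illustrates two common options under stated assumptions: a minimum-vote filter, and shrinking each rating toward the overall mean. The threshold `min_votes` and the column name `shrunkRating` are hypothetical; the four rows reuse the first values shown in the output above.

```r
ratings_toy <- data.frame(
  tconst        = c("tt0000001", "tt0000002", "tt0000003", "tt0000004"),
  averageRating = c(5.7, 6.0, 6.5, 6.0),
  numVotes      = c(1842L, 237L, 1604L, 154L),
  stringsAsFactors = FALSE
)

min_votes <- 500  # hypothetical threshold, to be tuned on the real data

# Option 1: drop titles with fewer votes than the threshold.
ratings_filtered <- ratings_toy[ratings_toy$numVotes >= min_votes, ]

# Option 2: shrink each rating toward the vote-weighted overall mean.
# Titles with few votes are pulled more strongly toward that mean.
overall_mean <- weighted.mean(ratings_toy$averageRating, ratings_toy$numVotes)
v <- ratings_toy$numVotes
ratings_toy$shrunkRating <-
  (v * ratings_toy$averageRating + min_votes * overall_mean) / (v + min_votes)

print(ratings_filtered)
print(ratings_toy)
```

Filtering is simpler but discards titles, whereas shrinkage keeps every title and only dampens the influence of sparsely rated ones.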