OTT Platform Data Analysis: Choosing the Right OTT Service

In 2019, the OTT market was estimated at $85.16 billion, and it is expected to reach $194.20 billion by 2025. During the coronavirus pandemic, many countries introduced social distancing measures that forced theaters to limit audiences or shut down and encouraged people to stay at home, accelerating the growth of OTT platform subscriptions. We therefore thought this was the right time to analyze the major OTT platforms and offer useful data to people who cannot decide which platform fits them best.


Three groups of datasets were used for this study: content listings for Netflix, Amazon Prime, and Disney+; IMDb genre classifications and ratings; and MovieLens ratings and genome tags. The content listing for each platform served as the base for the analysis. However, because the listings were collected from different sources, they were not consistent in their last-update times or column types. To simplify the analysis and make the results more dependable, we had to clean the data into a common format and combine it with other datasets that contained reliable information about ratings, age limits, and genres.

Among these were the IMDb and MovieLens datasets. Although each platform's data included IMDb ratings and genres, they were labeled under different categorizations and updated at different times, so a separate source was needed for ratings and genres. On top of that, we used MovieLens data to strengthen our analysis. Since we found no pre-existing studies to corroborate our findings, we decided to run a similar analysis on two different datasets. MovieLens has its own ratings and tag data, which made it a suitable comparison for IMDb ratings and genres. We assumed that obtaining similar results from two sources would support the validity of the analysis.


Netflix

The Netflix content dataset includes both movies and TV shows that were available on Netflix as of 2019. It has 7788 rows, including the header, and 12 columns: title, show_id, type, cast, country, director, release_year, rating, duration, date_added, listed_in, and description. Of these, only title, type, release_year, rating, and duration were used.

The Netflix original movies dataset has 525 rows, including the header, and 48 columns, but we used only the title and release date in our study.

Amazon Prime Video

The Amazon Prime Video dataset has 8128 rows and 7 columns: title, IMDb rating, language, run time, release year, maturity rating, and plot. We used only 4 of the 7 columns.

To distinguish original movies within the dataset, we extracted a list of original movies from Wikipedia. It has 52 rows and 3 columns: title, release date, and notes.
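Flagging originals against such a list can be sketched as below. This is only an illustration under assumptions: the placeholder titles, the column name `is_original`, and the helper `normalize_title` are hypothetical, and the article does not state how titles were actually matched.

```python
import pandas as pd

# Toy stand-ins for a platform listing and a Wikipedia list of originals.
# All titles here are placeholders, not rows from the real datasets.
catalog = pd.DataFrame({
    "title": ["Alpha", "Beta", "  the gamma  story "],
    "release_year": [2019, 2010, 2017],
})
originals = pd.DataFrame({"title": ["Alpha", "The Gamma Story"]})


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(title.lower().split())


# Build a set of normalized original titles, then flag matching catalog rows.
original_keys = set(originals["title"].map(normalize_title))
catalog["is_original"] = catalog["title"].map(normalize_title).isin(original_keys)

print(catalog)
```

Exact matching on a normalized title is the simplest option. It will miss titles that are spelled differently in the two sources, so a manual check of the unmatched rows is advisable.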


Disney+

The Disney+ dataset covers both TV series and films, including entries that are not relevant to this study. It has 19 columns and 992 rows, with columns such as title, type, director, plot, and genre; however, only title, type, imdb_id, released_at, and rated were used for our purposes.

We also used a list of original films extracted from Wikipedia. The data had 56 rows, with title, release date, genre, and runtime as columns. However, the majority of these films were released after 2020 and so could not be used in this study.


IMDb

From IMDb, we used two datasets: title.basics.tsv.gz and title.ratings.tsv.gz.

The title.basics.tsv.gz file contains columns such as title, IMDb ID, start year, genre, and runtime; however, we used only the title, IMDb ID, and genre. The title served as the key for merging the platform data with IMDb. We kept the IMDb ID because, used as a key, it made merging the rating data with the platform data easy.

The title.ratings.tsv.gz file was used to retrieve the rating data. It contains the IMDb ID, average rating, and number of votes, of which we used the first two columns.
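The two-stage merge described above might look roughly like the following pandas sketch. The column names `tconst`, `primaryTitle`, `genres`, `averageRating`, and `numVotes` follow the public IMDb files; the small in-memory frames and their values are made up so the example runs without downloading anything.

```python
import pandas as pd

# In practice the IMDb files could be loaded along these lines:
#   basics = pd.read_csv("title.basics.tsv.gz", sep="\t", na_values="\\N")
# Tiny placeholder frames are used here instead.
platform = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "type": ["Movie", "Movie", "Movie"],
})
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "primaryTitle": ["Alpha", "Beta"],
    "genres": ["Comedy,Drama", "Action"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [7.1, 6.4],
    "numVotes": [1200, 300],
})

# Step 1: attach the IMDb ID and genres, using the title as the merge key.
with_ids = platform.merge(
    basics, left_on="title", right_on="primaryTitle", how="left"
).drop(columns="primaryTitle")

# Step 2: attach the average rating, using the IMDb ID as the merge key.
merged = with_ids.merge(
    ratings[["tconst", "averageRating"]], on="tconst", how="left"
)

print(merged)
```

A left join keeps every platform title, so rows without an IMDb match (here "Gamma") remain visible with missing values rather than disappearing silently.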


MovieLens

MovieLens is a website run by GroupLens, which independently collects ratings and tag data from its users. Conveniently, links.csv maps IMDb IDs to MovieLens IDs, which made it straightforward to combine MovieLens data with our existing data frame. Unlike the IMDb data, which already provides aggregated average ratings and classified genres for each film, MovieLens provides individual ratings and tags, which we had to aggregate ourselves.
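A minimal sketch of that linking and aggregation step follows. It assumes the usual MovieLens column names (`movieId`, `imdbId`, `userId`, `rating`) and that the IMDb ID in the links file is stored as a bare number, which has to be zero-padded and prefixed with `tt` to match IMDb's identifier. The output names `ml_rating` and `ml_votes` are hypothetical, and the values are illustrative.

```python
import pandas as pd

# Placeholder rows shaped like the MovieLens links and ratings files.
links = pd.DataFrame({"movieId": [1, 2], "imdbId": [114709, 113497]})
ml_ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1],
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Convert the numeric IMDb ID into the "tt0114709" form used by IMDb.
links["tconst"] = "tt" + links["imdbId"].astype(str).str.zfill(7)

# Aggregate individual user ratings into one average per movie.
ml_avg = ml_ratings.groupby("movieId", as_index=False).agg(
    ml_rating=("rating", "mean"),
    ml_votes=("rating", "count"),
)

# Key the MovieLens averages by IMDb ID so they can join the platform data.
ml = links.merge(ml_avg, on="movieId", how="inner")[
    ["tconst", "ml_rating", "ml_votes"]
]

print(ml)
```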

MovieLens has a total of 1128 tags, and every film has a relevance score between 0 and 1 for every tag. The closer the score is to 1, the more relevant the tag is to the movie. As there are many tag and movie pairs, we kept only the pairs that scored more than 0.8, a threshold we settled on through trial and error.
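The relevance filter can be sketched as follows, assuming the tag genome column names `movieId`, `tagId`, `relevance`, and `tag`. The scores below are made up, and `THRESHOLD` is simply a name chosen for the 0.8 cut-off mentioned above.

```python
import pandas as pd

# Placeholder rows shaped like the tag genome scores and tag names.
genome_scores = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "tagId": [10, 11, 12, 10, 12],
    "relevance": [0.95, 0.40, 0.81, 0.80, 0.99],
})
genome_tags = pd.DataFrame({
    "tagId": [10, 11, 12],
    "tag": ["animation", "action", "toys"],
})

THRESHOLD = 0.8  # pairs must score strictly more than this to be kept

# Keep only strongly relevant (movie, tag) pairs and attach readable tag names.
relevant = genome_scores[genome_scores["relevance"] > THRESHOLD].merge(
    genome_tags, on="tagId", how="left"
)

print(relevant)
```

Because the comparison is strict, a pair scoring exactly 0.8 is dropped. Whether the original analysis treated the boundary this way is not stated.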


The experiment consisted of two main steps: pre-processing and analysis. Pre-processing was a vital step because the data came from various sources. Problems such as disambiguating movies with similar titles and reconciling different age rating systems had to be resolved before the analysis could start.
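Both pre-processing problems can be illustrated with a short sketch. The mapping in `RATING_MAP` is a hypothetical example of folding film ratings into the TV scale; the article does not give the mapping it used. Pairing the title with the release year is likewise just one plausible way to separate same-titled films.

```python
import pandas as pd

# Hypothetical mapping onto a single scale (an assumption, not the authors' table).
RATING_MAP = {
    "G": "TV-G", "TV-G": "TV-G",
    "PG": "TV-PG", "TV-PG": "TV-PG",
    "PG-13": "TV-14", "TV-14": "TV-14",
    "R": "TV-MA", "TV-MA": "TV-MA",
}

# Placeholder rows: two different films share the title "Alpha".
titles = pd.DataFrame({
    "title": ["Alpha", "Alpha", "Beta"],
    "release_year": [1994, 2019, 2015],
    "rating": ["PG", "R", "tv-14"],
})

# Unify the rating labels; labels missing from the map become NaN for review.
titles["rating_unified"] = titles["rating"].str.upper().map(RATING_MAP)

# Disambiguate same-titled films by combining normalized title and year.
titles["merge_key"] = (
    titles["title"].str.lower().str.strip()
    + "_"
    + titles["release_year"].astype(str)
)

print(titles)
```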

After pre-processing the source datasets, we explored how the platforms concentrate on content aimed at particular audiences, which differ in age and genre preferences.
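One way to compare platforms along these two dimensions is sketched below. The rows are invented purely to make the code runnable and do not reflect the study's results. The column names `platform`, `rating_unified`, and `genres` are assumptions, with `genres` holding a comma-separated string as in the IMDb data.

```python
import pandas as pd

# Made-up combined catalog, one row per title.
catalog = pd.DataFrame({
    "platform": ["Netflix", "Netflix", "Amazon Prime", "Disney+"],
    "rating_unified": ["TV-MA", "TV-14", "TV-PG", "TV-G"],
    "genres": ["Comedy,Drama", "Action", "Comedy", "Animation,Family"],
})

# Share of each maturity rating within each platform (each row sums to 1).
rating_share = pd.crosstab(
    catalog["platform"], catalog["rating_unified"], normalize="index"
)

# Split multi-genre strings so a title counts once for each of its genres.
genre_counts = (
    catalog.assign(genre=catalog["genres"].str.split(","))
    .explode("genre")
    .groupby(["platform", "genre"])
    .size()
    .unstack(fill_value=0)
)

print(rating_share)
print(genre_counts)
```

Normalizing by row makes platforms with very different catalog sizes comparable, which matters when one platform has far fewer titles than the others.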

MovieLens Tag Genome Analysis


Netflix

The analysis of the MovieLens data for Netflix (shown in Figures 4 and 5) produced findings consistent with the IMDb genre analysis. Tags such as action, drama, and comedy were among the most frequent tags for both original and non-original movies. Properties that set Netflix apart from the other platforms, such as “good soundtrack” or “visually-appealing”, were also present. The highest-rated tags for original and non-original movies differed, with “olympics”, “conspiracy”, and “russia” as the top three for original movies and “east germany”, “berlin”, and “entirely dialogue” as the top three for non-original movies. Overall, the results on highly rated tags and tag counts for both original and non-original movies gave additional insight into the types of content that Netflix offered.
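The two per-tag measures discussed here, how often a tag occurs and the average rating of the movies carrying it, could be computed along the following lines. The data is invented, and the column names `is_original`, `ml_rating`, `n_movies`, and `mean_rating` are hypothetical.

```python
import pandas as pd

# Made-up (movie, tag) pairs that passed the relevance filter.
movie_tags = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt3", "tt4"],
    "tag": ["comedy", "drama", "comedy", "action", "comedy"],
})
# Made-up per-movie information: original flag and average MovieLens rating.
movie_info = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "is_original": [True, False, False, False],
    "ml_rating": [4.0, 3.0, 3.5, 4.0],
})

tagged = movie_tags.merge(movie_info, on="tconst", how="inner")

# For each group (original or not) and tag: movie count and mean rating.
summary = (
    tagged.groupby(["is_original", "tag"])
    .agg(n_movies=("tconst", "nunique"), mean_rating=("ml_rating", "mean"))
    .reset_index()
)

# Rank tags within each group by frequency and by mean rating.
most_frequent = summary.sort_values(
    ["is_original", "n_movies"], ascending=[True, False]
)
highest_rated = summary.sort_values(
    ["is_original", "mean_rating"], ascending=[True, False]
)

print(most_frequent)
print(highest_rated)
```

With few movies per tag, mean ratings are noisy, so a minimum `n_movies` cut-off may be worth applying before reading much into the top-rated tags.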

Amazon Prime

The Amazon Prime dataset contained a total of 52 original movies, but the MovieLens data included only about 6 of them. Although limited by the small number of original movies, the analysis found that Amazon Prime original movies had tags associated with the genres “comedy” and “drama”. The tags with the highest average ratings showed a comparable trend. Amazon Prime non-original movies had tags associated with “action” first and “comedy” second. Their highest-rated tags were associated with themes that action movies might have, such as “compassionate”, “freedom”, and “scifi cult”. However, some were unrelated to the trends we found throughout the research. Overall, the MovieLens analysis gave results similar to those of the genre analysis.


Disney+

For Disney+, only non-original films were analyzed, as there were only 3 original films that had been released by 2019 and were available in MovieLens. The tags that appeared most often for Disney+ films include “animation”, “family”, and “disney animated feature”. In terms of ratings per tag, topics generally associated with animation, such as “pixar animation”, “superheros”, and “toys”, recorded the highest ratings. Although many films were missing from the MovieLens data, the key findings were consistent with those from the IMDb data.


Age rating, genre, and tag genomes are important factors in choosing a subscription. Through our research, we identified distinct characteristics of each OTT platform. From the age rating analysis, we found that Netflix had a notably large share of TV-MA films compared with the other platforms. Amazon Prime had a nearly even distribution of films across maturity ratings.

Disney+ had no movies rated TV-MA and had only those rated TV-G or TV-PG. These results suggest which platform to subscribe to depending on the age rating of the films users would like to see more of. From the genre analysis, we found that Amazon Prime and Netflix had similar distributions: both had the most content in comedy, drama, and action. Nevertheless, Netflix had the most diverse content across all genres. Although Disney+ had far less content than the other two, it was strongest in adventure, family, and animation films. The genome-tag analysis let us check the validity of our analysis: our findings from the MovieLens analysis were largely in line with the results of the genre analysis.

Amazon and Netflix showed similar trends, with tags associated with comedy, drama, and action, whereas Disney+ tags focused more on animated films. However, because of the small size of the original-film datasets, partly because the data was restricted to films released by 2019, we think further analysis that includes more recent movies is required to give a more precise picture.

If you have any questions, you can contact 3i Data Scraping anytime or ask for a free quote for any web scraping service requirements.

Originally published at


