Data Analytics: Influences of Gross Film Revenue Across Three Decades


Data Analytics: Influences of Gross Film Revenue & Opportunity Analysis

December 6, 2017

Todd Benschneider, Austin Deno, Leigh Harris, Sarah Lassiter, Lisa Velesko

Table of Contents

Problem Significance:
Data Source & Preparation:
Variable Selection:
Preliminary Analysis:
Models:
Insights:

Problem Significance:

Several societal trends can be mined from consumer spending patterns in the film industry, especially from comparisons across film genres, whose rising and falling popularity tracks shifts in popular fiction. Films, more so than television, literature, or music, correlate closely with emerging trends: they respond to consumer tastes in fiction and fantasy, and they reflect the psyche of a generation and its ever-shifting emotional underpinnings. Filmmakers have become remarkably proficient at catering to the emotional voids that drive the fiction market, and that responsiveness shows clearly in the ever-changing mix of successful films. Through the unspoken demand for clearly defined types of storylines, these quickly produced films reveal a meaningful cross-section of a society’s unfulfilled drives and highlight what its members yearn for in their own lives.

In addition to the rising popularity of certain scripts, downward-trending genres carry valuable economic signals: a declining genre suggests that the underlying drive it once served has since been fulfilled through social change. Marketing professionals are wise to take note of the peak and decline of each passing trend, as those peaks and valleys capture, at a macro level, the overall mood of a culture.

In the entertainment and media industries, consumer spending on different types of fiction offers great insight into long-term patterns of emotional and economic wants. That insight is as useful to producers of consumer goods as it is to providers of entertainment. It is imperative for businesses to be at the forefront of any trend.

Our data set summarizes three decades of consumer spending on film, which potentially reveals early predictors of future spending behavior. By forecasting trends from these film revenue data, a business can be at the forefront of meeting changing consumer tastes, whether that firm creates new movie plots, automobiles, or widgets. With insight into the deepest desires of the society around it, a business can tailor its marketing message to align its product with a representative cross-section of consumers’ visions, not of who they are, but of who they want to be. Few other data sources reveal how audiences identify with fictional characters as well as film plots do, and with this three-decade dataset we expect to gauge the tipping points of long-term trends and witness the rebounds that those tipping points predicted.

Our team viewed the movie revenue data from the perspective of a movie merchandiser, evaluating which unreleased movies in production would provide our firm with the highest return on investment for movie-themed posters, toys, clothing, and related merchandise. The highest-budget films command the highest royalty percentages and also require the greatest undiversified commitment of our manufacturing lines to individual movie projects. Because of the risk and profitability factors associated with marketing the high-budget prospects, our team instead drilled down into the data looking for more cost-effective prospects. Films that maximize the return on investment allow our firm to hold a more diversified portfolio of projects with more promising cash flows.

With this goal in mind, we chose to use regression models to dig deeper into the categorical data in the set, hoping to find other actionable predictors that could be valuable on a shorter timeline. We evaluated the given variables in search of the most significant predictors of cinematic success, in order to gauge how much confidence to place in future investments.

Data Source & Preparation:

The data set was originally gathered from IMDb and sourced directly from Kaggle. It covers 6,820 movies from 1986 to 2016 and includes details such as budget, gross revenue, production company, country of origin, director, primary genre, movie name, motion picture rating, release date, runtime, IMDb user score, lead star, IMDb user votes, writer, and release year.

Not all movies included budget information. Those movies were removed, because our analysis depends on complete data points, especially with regard to budget. We also examined profit and return on investment, each derived from gross and budget, independently.
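
The preparation step described above can be sketched in R as follows. This is a minimal illustration on a made-up four-row data frame, not the team's actual script: the column names (name, budget, gross) and the assumption that a missing budget shows up as either NA or 0 are ours.

```r
# Toy stand-in for the Kaggle extract; column names are assumed, values are invented.
movies <- data.frame(
  name   = c("A", "B", "C", "D"),
  budget = c(10e6, NA, 0, 40e6),
  gross  = c(25e6, 5e6, 8e6, 30e6)
)

# Keep only rows with a usable budget
# (assumption: missing budgets appear as NA or as 0).
d <- subset(movies, !is.na(budget) & budget > 0)

# Derived measures: profit in dollars, ROI as a ratio of the budget.
d$profit <- d$gross - d$budget
d$roi    <- d$profit / d$budget   # 1.5 means a 150% return

print(d)
```

ROI is kept as a ratio here, so it has to be multiplied by 100 to be read as a percentage.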

Tableau and Excel were first used to identify the largest values of each variable, which suggested our first level of filtering. R was then used to plot the data with histograms, box plots, and scatter plots in order to consider outliers; to check multicollinearity and direct correlations; and to run regression models with numeric and qualitative predictors, with and without interaction terms. We judged goodness of fit by R-squared and adjusted R-squared, along with the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Charts in Tableau were generated to visually verify the interaction effects. Together, Tableau, Excel, and R were used to determine the strongest correlations, interactions, and numeric and qualitative predictors among the variables.

Variable Selection:

Response Variable: In our effort to uncover the driving forces behind blockbuster films, we questioned what causes box office achievement. There are far too many flops in show-business; artistic potential is drowned out, consumer trends are completely misinterpreted, and lucrative investments are wasted. We must review success in cinema and provide a supportive study to investors in major motion pictures to appease the masses and create a stable platform for performers, thereby providing a concrete analysis of how gross revenue is determined. We therefore selected “Gross”, defined by our IMDb source as “gross revenue at the box office” as our response variable for all data modeling in this study.

Predictor Variables:
To evaluate the best variables to test against our response variable, we created a correlation table (below) to test the relationships among the quantitative variables. We focused on which variables could have a strong effect on gross. The motive for tracking down the most determinant variables is that the investor can later factor them into the decision to support a film.

Correlation Chart
           Budget     Gross      Runtime    Score      Votes
Budget     1          0.680033   0.313064   0.073579   0.451467
Gross      0.680033   1          0.253273   0.229552   0.642904
Runtime    0.313064   0.253273   1          0.417031   0.359817
Score      0.073579   0.229552   0.417031   1          0.470648
Votes      0.451467   0.642904   0.359817   0.470648   1

To no surprise, the strongest correlation was between gross revenue and budget, at .68003256. This correlation suggests that a higher budget will most likely fund a movie that generates more revenue. Because we believe budget is the heaviest deciding factor in funding the crucial elements of a financially successful film, we regard it as our primary predictor variable, against which our other qualitative and quantitative variables will be matched.
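
A correlation table of this kind can be produced with base R's cor(). The sketch below uses randomly generated stand-in data, so its numbers will not match the chart; the lowercase column names are assumptions about how the data frame is laid out.

```r
# Synthetic stand-in data; only the column layout mirrors the report.
set.seed(1)
n <- 50
budget <- runif(n, 1e6, 1e8)
d <- data.frame(
  budget  = budget,
  gross   = 1.2 * budget + rnorm(n, sd = 2e7),
  runtime = round(rnorm(n, 105, 15)),
  score   = round(runif(n, 3, 9), 1),
  votes   = round(runif(n, 1e3, 5e5))
)

# Pearson correlations among the quantitative variables only.
num_cols <- c("budget", "gross", "runtime", "score", "votes")
corr <- cor(d[, num_cols], use = "complete.obs")

print(round(corr, 3))
```

The matrix is symmetric with ones on the diagonal, which is why each coefficient appears twice in the chart.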


The second highest correlation was found between “gross” revenue and “votes” (that is, the number of IMDb user votes behind each movie’s 1-to-10 score) at .642904. We can justify this correlation two-fold. First, more “votes” logically means more tickets were purchased to watch the movie in theaters. Second, a high number of votes can drive consumer demand, influencing movie-goers who have not yet viewed the film to either watch or avoid it depending on how positive the reviews were. While the first explanation applies only after the fact of viewership, the second has the potential to boost viewership, which would make this variable causal. However, since we cannot determine whether “votes” were causal or coincidental, and since the standard error in a simple regression with gross is very large, we decided not to use it as a primary predictor variable in our study. Because scores come from the same votes, we likewise deemed “score” an unacceptable variable in our models: we cannot control the scores that reviewers give.


“Runtime”, the final quantitative variable, refers to the length of the film in minutes. It has a relatively weak positive correlation with “gross” at .2532733, so we must consider what this logically means. The correlation indicates that as runtime increases, gross revenue also increases. We know that this relationship has a limit, because if movies ran for countless hours, we could not logically expect their popularity to rise accordingly. In support, a simple regression also shows that, as with the “votes” variable, the “runtime” standard error of 52262 is unacceptably high.


As for qualitative data, we opted to use both primary “genre” and motion picture “rating” as major predictors of gross, as supported by their high multiple R-squared values. We determined these were likely predictors of movie success based on consumer taste.


Finally, we decided not to use production company, country of origin, director, movie name, release date, or release year. These variables are too diverse to classify accurately, since their values are spread so thinly across the data, and they are largely outside the control of film investors.

Preliminary Analysis:

Following our variable selection, we began looking at patterns surrounding the relationships between revenue and movie genre and motion picture rating. It’s important for investors to stay current on consumer trends in order to predict where the big money will be made in the film industry.

Hypothesis Testing:


Hypothesis 1:

Since the Action genre and PG-13 rating have the highest gross revenue of all movies, it is logical to assume that these types will also generate the highest return on an investor’s funding once the production hits theaters. We have supporting evidence for this, because budget accounts for over 47% of the prediction of a high-grossing movie.

H0: Action genre and PG-13 rating have the highest return on investment, and an Action PG-13 rated movie will generate the most dollars per dollar invested.
Ha: Action genre and PG-13 rating do not have the highest return on investment.


After noticing a strong relationship between gross and motion picture genre, we separated the genres to see which classifications raked in the most at the box office. We found that the movies with the highest gross revenue were Action, with a combined total of over $708 million. By seemingly no coincidence, we also noticed that Action movies had a higher total budget than all other genres. Since budget has a strong linear correlation with gross, we can assume that Action will produce a higher return on investment than any other genre.

We similarly compared motion picture ratings to gross revenue and found that PG-13, R, and PG, in that order, generated the most revenue over the 30-year period. We then looked at the gross revenue and budget within each rating.

Hypothesis 2: Since popular actors have a strong influence over consumer taste, we can assume that star power has a significant effect on gross revenue. Since a high budget is needed to hire them, we can also assume that as budget increases, more coveted actors can be cast, resulting in a very popular, high-grossing film.


H0: Movies with budgets in the upper quartile (above the third quartile) will have a significant relationship between star and gross.
Ha: Movies with budgets in the upper quartile (above the third quartile) will have no relationship between star and gross.


Star: We attempted to identify the correlation between stars and gross revenue by using Tableau to explore the total number of movies in which each star had been the lead and the sum of the gross revenue for those movies. We believed that particular stars would impact both the budget and the gross revenue. A star’s frequency of appearance could also build their popularity and consequently generate more box office revenue as consumer demand to see that star increased. In the regression model, specific stars, such as Chris Pratt(1), Daisy Ridley(1), Ellen DeGeneres(1), Felicity Jones(2), Heather Donahue(1), Jennifer Lawrence(8), Louis C.K.(1), Neel Sethi(1), Paige O’Hara(1), Quinton Aaron(1), Sam Neill(3), Sam Worthington(4), Scott Weinger(1), and Taylor Kitsch(1), had a significant influence, as interacted with budget, in predicting gross revenue. The number next to each star is the number of films in which they are listed as the star. Because all but Jennifer Lawrence are listed as the star in fewer than five films, and most in only one, we concluded that additional factors were driving these results, such as the co-star, whether the movie already had a cult following, whether it was a book first, etc. We did run a sample test using Jennifer Lawrence and Will Smith and noted that, at least for these two stars, there was a positive correlation between gross revenue and budget, as depicted in the scatterplot below.
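
The film counts next to each star matter because a star with a single film cannot support a reliable star effect. A hedged sketch of how the upper-quartile subset and the per-star counts might be computed in R is shown below; the data frame, star names, and dollar figures are invented for illustration.

```r
# Invented example data; "Star A" etc. are placeholders, not stars from the data set.
d <- data.frame(
  star   = c("Star A", "Star A", "Star B", "Star C",
             "Star C", "Star C", "Star D", "Star E"),
  budget = c(10, 80, 90, 20, 30, 100, 5, 60) * 1e6,
  gross  = c(15, 200, 150, 30, 20, 400, 2, 90) * 1e6
)

# Keep only films whose budget is at or above the third quartile.
cutoff <- quantile(d$budget, 0.75)
top <- d[d$budget >= cutoff, ]

# Films per star and total gross per star within that subset.
films <- table(top$star)
total <- tapply(top$gross, top$star, sum)
by_star <- data.frame(
  star        = names(total),
  films       = as.integer(films[names(total)]),
  total_gross = as.numeric(total)
)

print(by_star)
```

In this toy subset every remaining star has exactly one film, which mirrors the concern raised above: a significant coefficient for such a star describes one movie rather than the star.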



Models:

model1 <- lm(gross ~ budget, data = d)

The correlation chart was a basic look at the relationship between gross revenue at the box office and film budget. We then tested our prediction that budget drives gross by running a simple regression. With a multiple R-squared value of .4624, this model shows that 46.24% of the variation in gross revenue can be explained by the budget. Budget also has a very low p-value (below 2e-16), indicating that it is a significant factor in predicting a high gross. A higher-budget movie has greater potential to purchase the necessary artists, talent, and advertising to create a higher-grossing product.

model2 <- lm(gross ~ budget + as.factor(genre), data = d)

Using as.factor for genre, we built a second model that explains how a movie’s budget and genre affect its revenue. This model had a slightly higher adjusted R-squared, at .4691. It also shows that, of all the genres, the most significant were Action, Adventure, Animation, Comedy, and Horror. This indicates that these five genres have the greatest impact on a film’s revenue once its budget is known. Without knowing the budget, however, Comedy, Drama, and Horror have the most significant impact on gross revenue.
We know, though, that correlation does not translate to causation. We therefore tempered our analysis with a linear regression model, placing “Gross” as the response variable and “Budget” alongside “Genre” as a factor. We used budget as a control because we wanted to know how dollars invested in a movie, and more specifically in a movie genre, would be returned. To our surprise, Action was not the most significant factor; Animation was, as confirmed by a lower p-value and a higher coefficient. In fact, the regression estimated that, holding budget equal (even at a hypothetical budget of $0), an Animation movie would produce $22.2M more in revenue than an Action movie. This was an astonishing and valuable discovery. We noted that Action, Adventure, Animation, Comedy, and Horror all had significant influences.
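
To show how a genre coefficient is read against a reference genre, the sketch below fits the same kind of model to noise-free synthetic data. The offsets are invented (the 22.2 only echoes the figure quoted above), amounts are in millions of dollars, and none of this reproduces the report's fit.

```r
# Synthetic, noise-free data so the fitted coefficients are easy to read.
d <- data.frame(
  genre  = rep(c("Action", "Animation", "Horror"), each = 4),
  budget = rep(c(10, 20, 30, 40), times = 3)
)

# Illustrative genre offsets in $M; these are assumptions, not estimates.
shift <- c(Action = 0, Animation = 22.2, Horror = 5)
d$gross <- 5 + 2 * d$budget + unname(shift[d$genre])

# as.factor() orders levels alphabetically, so Action is the reference level.
fit <- lm(gross ~ budget + as.factor(genre), data = d)

print(round(coef(fit), 2))
```

Because Action is the reference level, the Animation coefficient is the gap between Animation and Action at any shared budget, not only at a budget of $0.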

model3 <- lm(gross ~ budget + as.factor(genre) + budget:as.factor(genre), data = d)

For our third model, we added the interaction effect between budget and genre to the budget and genre terms. This model was slightly better, with an adjusted R-squared of .4696. It showed that the effect of budget on gross revenue varies slightly by genre. Budget is more significant for the Action, Comedy, Drama, and Horror genres.

model4 <- lm(gross ~ budget + as.factor(rating) + budget:as.factor(rating), data = d)

For our fourth model, we looked at gross revenue with the interaction between budget and rating. This helped us narrow our data to find the most significant rating for gross revenue as budget increases. This model had an adjusted R-squared of .4736. Of all the ratings, R and G were the most statistically significant.


Looking at just the adjusted R-squared and the AIC/BIC, the fourth model was the best predictor of gross revenue. However, the rating-to-budget interaction was only slightly better than the genre-to-budget interaction. Both our third and fourth models narrowed down our data because they took into consideration the genre and rating with respect to the budget of the film. These two qualitative variables were the most significant in predicting gross revenue beyond the movie budget alone. When we combined both interactions in one model, PG-13 with Horror was the strongest and only significant interaction. That combined model had a slightly higher R-squared but also higher AIC and BIC, which prompted us to return to the previous model and generate the chart below to illustrate our findings.
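
A side-by-side comparison of this kind can be tabulated in R with summary(), AIC(), and BIC(). The sketch below runs on simulated data in which a rating interaction is deliberately built in, so it only demonstrates the mechanics; the data, names, and effect sizes are all assumptions.

```r
# Simulated data with a built-in budget-by-rating effect (illustrative only).
set.seed(42)
n <- 200
d <- data.frame(
  budget = runif(n, 1, 100),
  genre  = sample(c("Action", "Comedy", "Horror"), n, replace = TRUE),
  rating = sample(c("PG-13", "R"), n, replace = TRUE)
)
slope <- ifelse(d$rating == "R", 1.1, 1.5)
d$gross <- 10 + slope * d$budget + rnorm(n, sd = 20)

# Four candidate models mirroring the structure of Models 1 to 4.
models <- list(
  budget_only     = lm(gross ~ budget, data = d),
  budget_genre    = lm(gross ~ budget + as.factor(genre), data = d),
  budget_x_genre  = lm(gross ~ budget * as.factor(genre), data = d),
  budget_x_rating = lm(gross ~ budget * as.factor(rating), data = d)
)

# Higher adjusted R-squared is better; lower AIC and BIC are better.
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)

print(round(comparison, 3))
```

Plain R-squared can only rise as terms are added to a nested model, which is why adjusted R-squared, AIC, and BIC, all of which penalize extra terms, are the fairer yardsticks here.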


Confidence Interval Testing:


With the information we gathered from the regression models, we now have an in-depth look at the effect of budget on genre and rating as they relate to gross revenue. However, these findings contradict our earlier hypotheses. To examine our original assumptions, we performed confidence interval testing.


First, we subsetted the data by creating a new dataframe with only Action genre movies rated PG-13. Then we created another variable, ROI, by applying the ROI formula to the budget and gross columns. A summary of the data showed that the mean ROI for PG-13 Action movies was .1666255, or about 16.7%, which seems reasonable. If an investor were to invest $100,000, they could expect an average gross return of about $116,700 after the movie hits theaters. With a sample size of 468, we used the normal distribution with 97.5% confidence to determine that the ROI on this type of movie would fall between .0899811 and .4232491. This is a fairly large range, but we can say with confidence that the upper bound of the expected return on investment is 42.32%.
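
The subsetting and interval steps can be sketched in R as below. This is one standard approach, a normal-approximation interval for the mean ROI, and not necessarily the exact procedure the team ran. The data are simulated, so the numbers will differ from those reported.

```r
# Simulated stand-in data; genre/rating labels follow the report, values do not.
set.seed(7)
n_all <- 300
d <- data.frame(
  genre  = sample(c("Action", "Horror"), n_all, replace = TRUE),
  rating = sample(c("PG-13", "R"), n_all, replace = TRUE),
  budget = runif(n_all, 5e6, 1.5e8)
)
d$gross <- d$budget * rlnorm(n_all, meanlog = 0, sdlog = 0.6)

# Subset to one genre/rating segment and compute ROI as a ratio.
seg <- subset(d, genre == "Action" & rating == "PG-13")
seg$roi <- (seg$gross - seg$budget) / seg$budget

# Two-sided normal-approximation interval for the mean ROI.
conf_level <- 0.975
n  <- nrow(seg)
m  <- mean(seg$roi)
se <- sd(seg$roi) / sqrt(n)
z  <- qnorm(1 - (1 - conf_level) / 2)
ci <- c(lower = m - z * se, upper = m + z * se)

print(c(n = n, mean = m, ci))
```

An interval built this way describes the average ROI of the segment, so individual films can still land well outside it.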


Given the strong significance and high coefficient strength in our regression models, we used the same confidence interval testing on R-rated Horror films to test our first null hypothesis. We applied the same subsetting technique to obtain a dataframe of only R-rated Horror movies, a set of 173 movies. After removing two extreme outliers, the mean ROI was 2.6610, or 266.1%. The testing gave us 97.5% confidence that the expected ROI falls between 113.89% and 646.1%.


In conclusion, at 97.5% confidence, R-rated Horror movies have an upper bound of 646.1% ROI, compared to a maximum of 42.32% for PG-13 Action movies.

This result also makes practical sense. The data suggest that horror movies can be made with relatively low budgets and yield much higher profits. Movies like Paranormal Activity and The Blair Witch Project (the two outliers we removed before confidence interval testing) are prime examples of this phenomenon. The Blair Witch Project cost only around $15,000 to make but earned $107,918,810 in box office revenue, a return of roughly 7,193 times its budget (an ROI of about 719,000%). This data will allow us to make the most informed decision when considering investing or merchandising.



Insights:

In analyzing the data, we found that budget had the strongest significance and correlation to gross revenue. Neither genre nor rating, as interacted with budget, influenced gross revenue more than budget itself, but both were highly significant subfactors. Ratings of “R” and “G”, along with the genres of Action, Comedy, Drama, and Horror, had the highest significance when interacted with budget to predict gross revenue, as depicted in the charts above.

Because score and votes come after the fact, an investor or merchandising company looking to predict which movies will gross the highest revenue, and consequently have the potential to yield the highest returns on related products, should look specifically to an “R” or “G” rated movie in the Action, Comedy, Drama, or Horror genre. This can be demonstrated by the movie “The Hangover,” which led to a major economic impact in Las Vegas.

In conclusion, while we have familiarized ourselves with the tools and theories of data mining for business applications, the most important lesson we have learned is to view data insights with cautious skepticism. We are confident that our regression analysis was accurate and that our data source appeared reliable; however, few of us are prepared to wager our professional reputations by advising a CEO to allocate millions of dollars of investor capital on the basis of the actionable insights that we are recommending. In actual practice, we would recommend finding alternate sources of similar data sets to verify these conclusions. In addition to our newfound perspective on the practical value of data mining, we are now prepared to temper future data-sourced predictions with a managerial “P-Value”, named the “Group 6 N-Value”, to represent common sense and intuition. We therefore recommend that when proposed data sets lead us down a path of assumptions based on statistically significant p-values and high adjusted R-squared values, but contradict our own personal “N-Values”, we should first pursue additional data sets and alternate models to demonstrate, without doubt, that those statistical results are indeed replicable and justifiable in the abstract science of strategic management and consumer behavior.