Behind the Silver Screen: Examining Gender Inequality in the Film Industry

An analysis of the Smurfette principle and popularity regarding gender in the movie industry.



The Smurfette principle is a term coined by Katha Pollitt to describe the phenomenon in which media, particularly children's media, tends to feature a disproportionate number of male characters compared to female characters. The term is derived from the popular cartoon series "The Smurfs," in which the only female character is Smurfette. This pattern is not limited to children's media; it also extends to other forms of entertainment, including film.

The underrepresentation of women in film has been a topic of discussion and debate for decades. Research has shown that there is a significant gender imbalance in terms of both the number of female characters and the roles they play on screen. Women are often relegated to secondary or tertiary roles and are less likely to be portrayed as complex, multi-dimensional characters.

This gender imbalance has real-world consequences. Studies have shown that media representation can shape attitudes and beliefs, and the lack of representation of women in film can contribute to the marginalization and undervaluing of women in society. It is important to examine the ways in which women are depicted in film and the potential impacts of this representation on societal attitudes towards women.

In this project, we study the Smurfette principle in the film industry. Specifically, we examine how often female characters appear in popular films and how they are represented. We use data on films and their casts, and we analyze the content of the film summaries, in order to gain a better understanding of the ways in which women are represented in the film industry.

Completeness of the dataset

Before we begin our analysis, it is important to understand the completeness of the data we are working with. To do this, we can examine the percentage of valid values in each column of the CMU Movie Summary Corpus. The following graph shows the percentage of valid values for each column in the dataset:
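As a minimal sketch of how such a completeness check could be computed, the snippet below counts the non-missing values per column. The records and column names are hypothetical and only illustrate the idea; they are not the actual schema of the CMU Movie Summary Corpus, and this is not necessarily how the figures in this story were produced.

```python
def percent_valid(rows, columns):
    """Return the percentage of non-missing values for each column."""
    missing = {None, ""}  # values treated as "not valid" (an assumption)
    total = len(rows)
    return {
        col: 100.0 * sum(row.get(col) not in missing for row in rows) / total
        for col in columns
    }


# Hypothetical records; the column names are illustrative only.
characters = [
    {"actor_name": "A", "actor_gender": "F", "actor_height": 1.70},
    {"actor_name": "B", "actor_gender": "M", "actor_height": None},
    {"actor_name": "C", "actor_gender": None, "actor_height": None},
    {"actor_name": "D", "actor_gender": "M", "actor_height": 1.82},
]

completeness = percent_valid(
    characters, ["actor_name", "actor_gender", "actor_height"]
)
for column, pct in completeness.items():
    print(f"{column}: {pct:.2f}% valid")
```

If the data were loaded into a pandas DataFrame instead, the same quantity could be obtained with `df.notna().mean() * 100`.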

As we can see in the graph, the actor gender column has a high percentage of valid values (89.87%), which means that the gender information is complete enough to support our analysis. This is important because we are specifically interested in examining the representation of female characters in popular films. We use the gender of the actor as a proxy for the gender of the character, which lets us identify the gender of most characters and study how male and female characters are depicted.

Additionally, each movie has its plot summary, which means that we can rely on the information provided to analyze the content of the films. By examining the plot summaries, we can understand the roles and characteristics of the characters in each movie and how they are portrayed. This will be useful for our analysis of the representation of female characters in the film industry. The plot summaries data will provide valuable insights into the ways in which female characters are depicted on screen and how this representation compares to that of male characters.

What does the dataset look like when comparing males and females? We have a total of 450,669 different characters, of which 60% are male and 30% are female (the gender of the remaining ones is not specified). There are a total of 96,476 actors and actresses, of which 63% are men and 37% are women. So, the imbalance is visible from the very beginning of the data exploration. Let’s dig deeper into it!

Actor and Actress Popularity Over Time

To further understand the representation of male and female actors in popular films, we also analyze the popularity of actors and actresses over time. To do this, we examine the number of Wikipedia pageviews for each actor or actress. The Wikipedia API provides this information only for recent years, so it would not be fair to compare actors who were active many years ago with currently active actors. Therefore, we decided to plot the number of views as a function of the time elapsed since each performer's last movie. This allows us to see whether there are any trends in the popularity of actors and actresses and whether the gender of the performer may be influencing their popularity.

It looks like the distribution might follow a power law for both genders. It also seems like there might be more outliers of famous men than famous women. From the visualization of this graph we can hypothesize about these, but a more detailed analysis is needed to see if the difference is statistically significant.

The following graph shows the same information as before, but in log-log axes. With this visualization a power law should be a straight line.

Wow! The distribution is really sparse! In the previous graph it looked like there were many more men well above the rest than in the case of women. Now we see that the distribution for men is more spread out at both ends, among the most famous and among the least famous people. So, we wonder whether there is really any factor that makes male actors systematically more popular than female ones. As power laws appear as straight lines in these axes, a linear regression (with ordinary least squares) is useful for this case. We model the log of the pageviews as a linear combination of the log of the time and the gender. As expected from the graph, the R-squared is very low (0.069), but the p-values for both variables are very close to zero. This means that both input variables have a statistically significant association with the popularity of the actors, even though the model explains little of the variance. The coefficient associated with gender is -0.3378, which means that, for the same elapsed time, the number of pageviews of women is on average exp(-0.3378) ≈ 0.71 times that of men. So, even if we cannot infer it directly from the graph, gender appears to play a role in the popularity of the actors.
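To make the interpretation of the gender coefficient concrete, here is a small self-contained sketch of an ordinary least squares fit on the log-log model. The toy data are hypothetical and noise-free: the intercept (10.0) and the time slope (-0.5) are made up for illustration, and only the gender coefficient (-0.3378) is taken from the regression described above. We also assume that gender is encoded as 1 for women and 0 for men. The fit uses the normal equations in plain Python; it is not the code used for the analysis.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical generating model: log(views) = 10.0 - 0.5*log(years) - 0.3378*is_female
true_intercept, true_slope, true_gender = 10.0, -0.5, -0.3378
samples = []
for years in (1, 2, 5, 10, 20, 40):
    for is_female in (0, 1):
        log_v = true_intercept + true_slope * math.log(years) + true_gender * is_female
        samples.append((years, is_female, math.exp(log_v)))

# Design matrix columns: constant, log(time since last movie), gender indicator.
design = [[1.0, math.log(years), float(f)] for years, f, _ in samples]
log_views = [math.log(v) for _, _, v in samples]
intercept, time_coef, gender_coef = fit_ols(design, log_views)

# A coefficient on a 0/1 indicator in a log model is a multiplicative factor.
gender_ratio = math.exp(gender_coef)
print(f"gender coefficient: {gender_coef:.4f}")
print(f"pageview ratio women/men: {gender_ratio:.2f}")
```

Because the target is the logarithm of the pageviews, the ratio obtained this way compares geometric means rather than arithmetic means of the pageviews.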

The Smurfette Principle

As described in the background, the Smurfette principle describes an under-representation of women in movies. We took our own spin on this and applied it to the movie summaries. In our analysis, we extracted the character names from the cast list and looked at how often they were mentioned in the summaries. On top of that, we analyzed the distance, in words, at which a male character is mentioned from a female character, and vice versa.
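The following sketch illustrates one possible way to count mentions and male/female co-occurrences within a given word distance. It is a simplified illustration under several assumptions, not our actual pipeline: names are single lowercase tokens, matching is exact, and the summary and gender mapping are made up.

```python
import re


def mention_positions(summary, names):
    """Map each character name to the word positions where it appears."""
    words = re.findall(r"[a-z']+", summary.lower())
    wanted = {name.lower() for name in names}
    positions = {}
    for index, word in enumerate(words):
        if word in wanted:
            positions.setdefault(word, []).append(index)
    return positions


def cooccurrences_within(positions, genders, max_distance):
    """Count male/female mention pairs at most max_distance words apart."""
    male = [p for n, ps in positions.items() if genders[n] == "M" for p in ps]
    female = [p for n, ps in positions.items() if genders[n] == "F" for p in ps]
    return sum(1 for m in male for f in female if abs(m - f) <= max_distance)


# Hypothetical summary and cast genders, for illustration only.
summary = "Anna meets Tom in Paris. Tom leaves, and Anna follows Tom to Rome."
genders = {"anna": "F", "tom": "M"}

positions = mention_positions(summary, genders)
male_mentions = sum(len(ps) for n, ps in positions.items() if genders[n] == "M")
female_mentions = sum(len(ps) for n, ps in positions.items() if genders[n] == "F")
pairs_within_2 = cooccurrences_within(positions, genders, max_distance=2)
pairs_within_3 = cooccurrences_within(positions, genders, max_distance=3)

print(male_mentions, female_mentions, pairs_within_2, pairs_within_3)
```

Increasing `max_distance` step by step gives the kind of co-occurrence curve discussed in this section; real summaries would also need handling for multi-word names and nicknames.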

As one might expect, we can observe the prevalence of men in the summaries, with a frequency more than double that of women. Apart from this, the point that stands out the most is how the co-occurrences for the two genders evolve together as we look at a greater distance between the words. This relation seems to be collinear (see discussion at the end), despite the male representation being higher. Even when we analyze whole sentences, the ratio is still similar, with mentions of women without a man dropping significantly.

Are there certain movie genres that are more or less discriminatory?

One could intuitively assume that some movie genres favor more gender-diverse casts, whereas others might be more inclined to be discriminatory. We wanted to verify this assumption and, at the same time, see whether this difference is reflected in the movie summaries.

As expected, the genres that stand out are in line with the idea people might have. A surprising point, however, is the difference between the cast and the summaries, with much more extreme values in the summaries. In the cast, women are represented the most in Adult films, at about 50%, and the least in Combat Films, at about 10% at most, whereas in the summaries, they represent about 83%.
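A per-genre share of women in the cast can be computed along the following lines. This is a minimal sketch with hypothetical records and field names (`gender`, `genres`); cast members with an unspecified gender are ignored, which is an assumption on our side, and the numbers below are toy values, not results.

```python
from collections import defaultdict


def female_share_by_genre(cast_rows):
    """Percentage of female cast members per genre, ignoring unknown genders."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for row in cast_rows:
        if row["gender"] in ("F", "M"):
            # A movie can belong to several genres; count the role in each one.
            for genre in row["genres"]:
                counts[genre][row["gender"]] += 1
    return {g: 100.0 * c["F"] / (c["F"] + c["M"]) for g, c in counts.items()}


# Hypothetical cast records, one per role.
cast = [
    {"gender": "M", "genres": ["Combat Films", "Drama"]},
    {"gender": "M", "genres": ["Combat Films"]},
    {"gender": "M", "genres": ["Combat Films"]},
    {"gender": "F", "genres": ["Combat Films", "Drama"]},
    {"gender": "F", "genres": ["Drama"]},
    {"gender": None, "genres": ["Drama"]},
]

shares = female_share_by_genre(cast)
for genre, share in sorted(shares.items()):
    print(f"{genre}: {share:.1f}% women")
```

The same function can be applied to summary mentions instead of cast entries, which makes the two representations directly comparable genre by genre.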

Has it always been like this?

Given that all this data is scattered over more than 70 years, it would be interesting to see whether there is some trend in the evolution.

A keen eye might notice an upward trend, although it is not very pronounced. This trend, however, seems to be shared by the representation in the cast and the representation in the summaries. The years displayed are only those for which there was enough data to estimate the trend reliably. Given the recent movements to empower women, the trend shown here might change radically for recent years and those to come.


Based on our analysis, it is clear that there is a significant gender imbalance in the film industry. Female characters are underrepresented in terms of both the number of characters and the roles they play on screen. This gender imbalance has real-world consequences, as media representation can shape attitudes and beliefs and contribute to the marginalization and undervaluing of women in society. Our analysis of the Smurfette principle and the popularity of male and female actors over time highlights the need for greater representation and diversity in the film industry. It is important for the industry to recognize and address this issue in order to create a more inclusive and balanced representation of women on screen.