Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2013 Jun 24;8(6):e66443. doi: 10.1371/journal.pone.0066443

IMDB Network Revisited: Unveiling Fractal and Modular Properties from a Typical Small-World Network

Lazaros K Gallos 1, Fabricio Q Potiguar 2, José S Andrade Jr 3, Hernan A Makse 1,*
Editor: Matjaz Perc4
PMCID: PMC3691214  PMID: 23826098

Abstract

We study a subset of the movie collaboration network, http://www.imdb.com, where only adult movies are included. We show that there are many benefits in using such a network, which can serve as a prototype for studying social interactions. We find that the strength of links, i.e., how many times two actors have collaborated with each other, is an important factor that can significantly influence the network topology. We see that when we link all actors in the same movie with each other, the network becomes small-world, lacking a proper modular structure. On the other hand, by imposing a threshold on the minimum number of links two actors should have to be in our studied subset, the network topology becomes naturally fractal. This occurs due to a large number of meaningless links, namely, links connecting actors that did not actually interact. We focus our analysis on the fractal and modular properties of this resulting network, and show that the renormalization group analysis can characterize the self-similar structure of these networks.

Introduction

The study of real-life systems as complex networks [1], [2] has allowed us to probe into large-scale properties of databases that relate to a wide range of disciplines. Among them, social networks [3][6] play a special role because we can follow the trails that are left behind social interactions (especially in their online form). Moreover, understanding the structure and the mechanisms behind the resulting web of those interactions poses an extremely challenging problem with important implications for everyday life, such as disease spreading patterns [7], [8]. In recent years, the development of databases [9][12] that depict such interactions has resulted in many networks that model different aspects of social systems. In fact, social networks can also be classified as collaboration networks because people belonging to them are usually indirectly connected through their common collaboration entity, be it scientific collaboration [13], company board membership [14] or movie acting [15]. Usually, these networks are bipartite: there are two types of nodes and links run only among unlike vertices. However, they can also be taken as a one-mode projection of the bipartite structure, connecting nodes of the same type whether they share a node in the bipartite structure [16].

The majority of social networks has a small-world character [15], [17], substantiated by a small average short-path length and a large average clustering coefficient. At the same time, many social networks have been shown to be fractal [18], [19], i.e., they present a self-similar character under different length scales. These properties can be seen by the optimal covering of a network with boxes of maximum diameter less than Inline graphic [20], and a subsequent power-law scaling of the required minimum number of boxes Inline graphic, where Inline graphic is the fractal dimension of the network. Fractality in networks [21][25] is not a trivial property and in many cases it can be masked under the addition of shortcuts or long-range links on top of a purely fractal structure.

Recently, it was calculated analytically the small-world to large-world, a self-similar structure with a few shortcuts, transition using renormalization group (RG) theory [26]. The basic idea of the RG method is to compute the difference between the average degree of the renormalized network, Inline graphic, and the average degree of the original network (the network without short-cuts), Inline graphic, for different values of Inline graphic. The resulting relation is

graphic file with name pone.0066443.e007.jpg (1)

where Inline graphic is the average number of nodes in a box of diameter Inline graphic (the average ‘mass’ of the box). The exponent Inline graphic characterizes the extent of the small-world effect. The Inline graphic case corresponds to a large number of shortcuts, so that successive renormalization steps lead to an increase of the average degree by bringing the network closer to a fully-connected graph (small-world). When Inline graphic, only short-range shortcuts exist, so that the distance between nodes remains significant during successive steps and the network fractality is preserved (large-world).

In this work, we present an example of how to unmask fractality in a small-world network and elaborate on the methods that highlight the fractal characteristics. Our primal SW network is the well known IMDB movie co-appearance network [9], [15], [26][28], where two actors are connected if they have participated in the same movie. This environment is very dynamic and actors participate in new movies continuously, creating many links and rendering the network a typical example of a small-world prototype. On the other hand, the construction method of the IMDB dataset (each movie leads to a fully connected subgraph and actors tend to specialize in one particular genre) seems to naturally lead to a fractal and modular structure.

In order to observe this transition more clearly, we focus our analysis on a smaller subset of the IMDB database, by taking into account movies that have been labeled by IMDB as adult, and considering only the actors that have participated in them. Except for the smaller and more easily amenable size, there are many advantages in this choice: a) First, this network is a largely isolated subset of the original actors collaboration network, since there are very few links that connect actors in this genre with actors in other genres, so that it can safely be studied as a separate network on its own. b) Practically all these movies have been produced during the last few decades, so that the network is more focused in time, which reduces the inherent noise arising from connections between very distant actors. There are a number of examples of actors that had long lived careers and are certainly connected to other actors that did not live in the same time period.

Only taking the adult subset of IMDB is not enough to observe fractality, since both sets are typical small-worlds. This observation is achieved when we consider the interaction strength between actors, Inline graphic. This strength is defined as the number of times two actors were cast together in a movie. To consider this network with this strength is equivalent to take the one-mode approach to the bipartite structure with weighted links, whose value is given by Inline graphic. More specifically, by imposing a minimum threshold value for Inline graphic, we can unveil the fractality that was hidden below many links that were taken into account in the original network. In other words, our network is no longer small-world, but it is self-similar. This behavior is reminiscent of the transition calculated in the RG analysis of [26]. Given that the self-similarity can be seen already at Inline graphic, the full connection mode considers a great number of links that hide fractality and make the network look like a small-world case. Another way to look at this fact is that connecting all actors in a movie is equivalent to put in equal footing strong, long lived collaborations (like that between Katherine Hepburn and Spencer Tracy, one of the most famous couples in film history, which were cast in Inline graphic movies together) and weak ones (like those of the 1956 movie ‘Around the world in eighty days’ in which 1297 distinct actors are credited, which generates Inline graphic links). Clearly, the first ones, which are more specialized, are responsible for the modular character of the network.

Results

The time evolution of the movie industry is shown in figure 1. We associate each actor with a unique year, which is the average year of all the movies at which she/he participated. In this way, we can approximate the time that an actor was more active, although obviously she/he may have starred for decades. The number of movies and the number of actors increase monotonically with time as expected, both in the IMDB dataset and in the adult IMDB dataset.

Figure 1. Number of actors per year and number of movies per year for the entire IMDB database (lines) and for the adult IMDB database (symbols).

Figure 1

In the plot the adult IMDB points have been multiplied by a factor of 5 for clarity. Inset: the same numbers on a logarithmic y-scale.

The adult movie industry has a much more recent history, since movies started appearing in bulk only after circa 1990, and the increase in their production rate in these years is much higher than for all movies. The network representation, thus, between adult actors is more robust since it includes a total of only 20–30 years, compared to more than 100 of the typical IMDB database.

Initially we connected all the actors that participated in the same adult movie. The resulting largest connected component comprises 44719 actors/actresses, out of a total of 39397 movies. The average degree is Inline graphic, which is quite high but is still significantly smaller than the corresponding average for the entire IMDB network (where Inline graphic). Both the number of reported actors per movie and the probability that two actors participated in Inline graphic common movies decay as power laws for both networks, with similar exponents (figure 2). This points to the inhomogeneous character of the network, both in terms of diverging number of actors in a movie and in terms of how many times two actors collaborated with each other in their careers, a number which can also vary significantly from 1 to 350 times.

Figure 2. Comparison of the IMDB network to the adult IMDB network.

Figure 2

a) Probability distribution that a number Inline graphic of actors participated in a movie (as reported in IMDB). b) Distribution of the links weights, or equivalently, the probability that two actors have participated in Inline graphic movies together.

Naturally, the resulting collaboration network is weighted. Every link can be considered as carrying a different strength, depending on the number of movies in which two actors have participated. For example, the strongest link appears between two actors who have co-starred in 350 movies and obviously this connection is more important than between two actors with just one common appearance. Thus, we filter our network by imposing a threshold Inline graphic, so that two actors are connected only if they have collaborated in at least Inline graphic movies, with a larger Inline graphic value denoting a stronger link. Snapshots of the largest connected component of networks with varying Inline graphic values are shown in figure 3. It is clear from the plots that for Inline graphic there are many fully-connected modules and some of them appear largely isolated, since very few of the actors in that movie participated in any other movies. However, as we increase Inline graphic the network tends towards a more tree-like structure with less loops.

Figure 3. Snapshots of the ‘adult IMDB’ network, where two actors are connected if they have co-starred in at least a) Inline graphic, b) Inline graphic, c) Inline graphic, or d) Inline graphic movies, respectively.

Figure 3

Only the largest connected component is shown, and the corresponding network sizes are Inline graphic, 14444, 5315, and 2100 actors, respectively.

All of these networks exhibit a scale-free degree distribution, where asymptotically the number of links Inline graphic scales as a power law, i.e. Inline graphic, where Inline graphic is the degree exponent with a value around Inline graphic (figure 4). The unweighted network, though, behaves differently as compared to the other networks at small Inline graphic. When Inline graphic, there are very few nodes with 1 or 2 connections. This percentage increases with Inline graphic until we reach a maximum at Inline graphic, and afterwards decays as a power-law. In contrast, the decrease in the degree distribution is monotonic for any Inline graphic and the distribution remains almost invariant, independently of the Inline graphic value. The shifted peak for Inline graphic is a result of connecting all the actors in a movie with each other, which makes it rare for an actor to be related with only a small number of other actors. In contrast, the shape of the distribution follows a more typical pattern when we retain stronger connections only, even for Inline graphic.

Figure 4. Degree distribution of the adult IMDB network, for different connection strengths, Inline graphic, 2, 4, or 8 movies, respectively.

Figure 4

The solid line has a slope, the exponent of the power law degree distribution, Inline graphic (see text).

The process of increasing Inline graphic corresponds to diluting the network through the removal of a large number of links. As a result, we expect an increase of the network diameter (defined as the largest distance between all possible shortest paths) since we destroy many of the existing paths. However, at the same time many nodes are removed from the largest cluster as well, and the dependence of the diameter on Inline graphic cannot be easily predicted. In figure 5 we can see that the interplay between removing links and nodes results in a more complicated behaviors. As we increase Inline graphic, longer distances appear in the network as a result of removing more links than nodes. At Inline graphic, though, the maximum diameter drops significantly, because now the network size has also decreased a lot. The main peak of the distribution remains constant at distances Inline graphic between Inline graphic and Inline graphic. The tail of the distribution, though, becomes initially wider with increasing Inline graphic, but after Inline graphic the tail shifts back towards smaller values.

Figure 5. Probability distribution of the shortest paths in networks with varying weight threshold Inline graphic.

Figure 5

Inset: Maximum diameter in the network as a function of Inline graphic.

The IMDB network has been shown to possess fractal characteristics and we have found the same property for the adult IMDB network, as well. We used the box-covering technique [20] to perform this analysis. The number of boxes Inline graphic that are needed to optimally cover the network scales with the maximum box diameter Inline graphic as a power law (figure 6). The scaling is much more prominent for values of Inline graphic, indicating that connecting all the actors in a movie (Inline graphic) results in an over-connected network which may alter the apparent behavior of many measured properties. The scaling is much smoother for the presented values of Inline graphic or 4, with a fractal exponent Inline graphic.

Figure 6. Scaling of the number of boxes Inline graphic as a function of the maximum box diameter Inline graphic, for the adult IMDB network and for different Inline graphic values.

Figure 6

Except for Inline graphic, the other networks have a fractal dimension, shown by the dashed line with slope, Inline graphic.

Similarly, we have measured the degree of modularity in these networks, and how this modularity scales with the box size Inline graphic (figure 7). The process is as follows: First, we fix the value of Inline graphic and we cover the network with the minimum possible number of boxes, where every two nodes in the box are in a distance smaller than Inline graphic. These boxes can then be considered as modules, in the sense that all nodes in a box have to be close to each other. The level of modularity remains then to be shown, and following [24], [25], we can define the quantity

graphic file with name pone.0066443.e074.jpg (2)

Figure 7. Scaling of the modularity Inline graphic as a function of the maximum box diameter Inline graphic, for the adult IMDB network and for different Inline graphic values.

Figure 7

All networks scale with Inline graphic, with a slope close to Inline graphic, as shown by the dashed line with the same slope value.

where Inline graphic and Inline graphic represent the number of links that start in a given box Inline graphic and end either within or outside Inline graphic, respectively. Large values of Inline graphic correspond, thus, to a higher degree of modularity. The meaning of Inline graphic, as defined above, does not correspond directly to other measures of modularity in the literature, designed to detect modules in a given network, independently of the module size. Here, the value of Inline graphic varies with Inline graphic, so that we can detect the dependence of modularity on different length scales, or equivalently how the modules themselves are organized into larger modules that enhance or decrease modularity, and so on. This information can be detected more reliably through the modularity exponent Inline graphic, which we have defined in terms of the following power law relation:

graphic file with name pone.0066443.e089.jpg (3)

The value of Inline graphic represents the borderline case that separates modular (Inline graphic) from random non-modular (Inline graphic) networks. For example, in a lattice structure the value of Inline graphic is exactly equal to Inline graphic, while a randomly rewired network has Inline graphic.

It is not difficult to understand how these two values arise. Let us, first, consider a regular network optimally tiled with boxes with large Inline graphic. The number of links that terminate outside a box originate only in nodes positioned along the boxes’ surface, given the regularity of the network, i.e, there are no long range links. The number of surface links is proportional to the surface of the box, Inline graphic, where Inline graphic is the spatial dimension of the box. On the other hand, links that terminate inside this box are clearly related to the number of bulk nodes, i.e., nodes that are not at the surface of the box. The number of bulk nodes is proportional to the boxes’ volume, Inline graphic. Finally, the ratio of these two quantities yields Inline graphic.

This continuum argument cannot be used for a randomly rewired network because there may be links that connect nodes which are not first neighbors in the original regular lattice. As a consequence of this fact, a link that ends outside a box can start at any node within this box. Following the same reasoning, we conclude that a link that ends within a box can be originated in any node in this box. As we increase Inline graphic, new nodes will be added, whose links are random as well. Therefore, a box with a distinct size bears a link configuration which is not substantially different from a box with a smaller Inline graphic. We conclude, then, that the modularity, i.e., the number of links that terminate within or outside a box, is not affected by the boxes’ size. In other words, Inline graphic.

For the adult IMDB network we find an approximate exponent of Inline graphic for all different values of Inline graphic, which is an indication of a highly modular network.

The standard method of calculating fractality through the scaling of Inline graphic vs Inline graphic may sometimes yield ambiguous results, due to the relatively short range of Inline graphic. The renormalization group (RG) analysis [26], though, can provide a more reliable picture, and can highlight the self-similarity of a network under continuous scale transformations. The RG analysis is particularly useful in revealing long-range links (shortcuts) in the structure, since under renormalization these links will either persist or disappear.

In the case of the adult IMDB network, our finding of fractality and modularity is a strong indication that there are not many such long-range links in the network. This is an important concept that can be shown to determine the efficiency in navigability or the efficient information transfer in these networks [25], [29], [30]. In order to employ the RG analysis for the study of shortcuts, we again use the box-covering technique [20].

This renormalization analysis can provide very accurate information on the structure of self-similar objects. Snapshots of the resulting networks at different Inline graphic values are shown in figure 8.

Figure 8. Networks resulting from the box-covering procedure at different Inline graphic values.

Figure 8

Top row: Adult IMDB network at Inline graphic. Bottom row: Adult IMDB network at Inline graphic.

In the case of the adult IMDB network, we find that the exponent Inline graphic, see equation (1), is always negative (figure 9). This shows that nodes in the network tend to remain connected within their close neighborhood, and there are very few links that span different areas of the network, i.e., under successive renormalization steps, shortcuts in the network tend to disappear and we are left with a fractal structure. In this figure, we see that as we increase Inline graphic, the curves become independent on this quantity. This implies that for a high enough link strength, it has the same average distribution of links.

Figure 9. Variation of the average degree in the IMDB network for different Inline graphic values, as a function of the average box mass Inline graphic.

Figure 9

Methods

The box-covering technique [19] consists in covering the network with boxes of size Inline graphic. This means that within a box, there will be only vertices that are separated from each other by, at most, Inline graphic links. These boxes then become the super-nodes, which are connected with each other if there is at least one connection between nodes within these boxes. This yields a network representation at a different scale, which can in turn be covered, again, with boxes of size Inline graphic. This new network is renormalized again, and the process repeats with progressively coarser covering.

Conclusions

We have studied a subset of the IMDB network, the adult IMDB, as an one-mode projection of the original bipartite IMDB network. Therefore, we were able to remove many pathologies of the full IMDB, such as actors with long-spanning careers, and have taken a network more focused in time.

We have also imposed a minimum link strength threshold, Inline graphic, such that our network is composed of vertices that have at least this amount of links between them. We showed that the adult IMDB is a small-world network with an exponent for the distribution of links Inline graphic, if Inline graphic. We have seen that this procedure removed a large number of links between actors. These links changed the distribution of links in a way that it was more likely to find an actor with Inline graphic than one with Inline graphic. Since the average degree is larger for the network at Inline graphic, it is easy to cross the whole network in a smaller number of steps than to cross the same network built at Inline graphic. The unweighted network is very similar to a typical small-world case, while those at larger Inline graphic are closer to a fractal. In conclusion, we see that the parameter Inline graphic controls a transition from a small-world to a fractal network, in much the same way as was shown in [26].

We confirmed these conclusions with measurements of the modularity and with a renormalization group analysis of the network. We obtained that the modularity scales with the largest box diameter with an exponent of Inline graphic, which renders it a highly modular network. In addition to an efficient community separation, we were able to show that this is not a typical small-world network. This finding was based on a renormalization group analysis, which we have shown to be a robust method for detecting fractality. This analysis revealed that the difference between the average number of nodes in the renormalized and the original network scale with Inline graphic, a signature for large-world behavior. The highly modular fractal characteristics found here are expected to naturally emerge in many social networks, but in general they are hard to accurately detect in real-world networks. Ultimately, we feel that our weighting procedure constitutes a reliable way of unveiling fractality out of a small-world collaboration network, since the information lost by choosing larger Inline graphic values actually hides the fractal network.

Funding Statement

The authors acknowledge support from Brazilian agencies CNPq, CAPES, and FUNCAP, the FUNCAP/Cnpq Pronex grant, and the National Institute of Science and Technology for Complex Systems in Brazil for financial support. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74: 47–97. [Google Scholar]
  • 2. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU (2006) Complex networks: Structure and dynamics. Physics Reports 424: 175–308. [Google Scholar]
  • 3.Fararo T, Sunshine MH (1964) A Study of a Biased Friendship Net. Syracuse, NY, USA: Syracuse University Press.
  • 4. Bernard HR, Killworth DD, Evans MJ, McCarty C, Shelley GA (1988) Studying social-relations cross-culturally. Ethnology 27: 155–179. [Google Scholar]
  • 5. Vanraan AFJ (1990) Fractal dimension of co-citations. Nature 347: 626. [Google Scholar]
  • 6.Wasserman S, Faust K (1994) Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.
  • 7. Sattenspiel L, Simon CP (1988) The spread and persistence of infectious diseases in structured populations. Mathematical Biosciences 90: 341–366. [Google Scholar]
  • 8. Kretzschmar M, Morris M (1996) Measures of concurrency in networks and the spread of infectious disease. Mathematical Biosciences 133: 165–195. [DOI] [PubMed] [Google Scholar]
  • 9.Internet movie database (2009) Available: http://www.imdb.com.Accessed 2012 Dec 20.
  • 10.Airport council international annual worldwide airports traffic reports (2012) Available: http://www.aci.aero.Accessed 2012 Dec 20.
  • 11.Internet mapping project (2012) Available: http://www.lumeta.com/company/research.html.Accessed 2012 Dec 20.
  • 12.Database of interacting proteins (2012) Available: http://dip.doe-mbi.ucla.edu.Accessed 2012 Dec 20.
  • 13. Newman MEJ (2001) The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences 98: 404–409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Davis GF, Greve HR (1997) Corporate elite networks and governance changes in the 1980s. Am J Sociol 103: 1–37. [Google Scholar]
  • 15. Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393: 440–442. [DOI] [PubMed] [Google Scholar]
  • 16. Newman MEJ, Strogatz SH, Watts DJ (2001) Random graphs with arbitrary degree distributions and their applications. Phys Rev E 64: 026118. [DOI] [PubMed] [Google Scholar]
  • 17.Watts DJ (1999) Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton: Princeton University Press.
  • 18. Song C, Havlin S, Makse HA (2005) Self-similarity of complex networks. Nature 433: 392–395. [DOI] [PubMed] [Google Scholar]
  • 19. Song C, Havlin S, Makse HA (2006) Origins of fractality in the growth of complex networks. Nat Phys 2: 275–281. [Google Scholar]
  • 20. Song C, Gallos LK, Havlin S, Makse HA (2007) How to calculate the fractal dimension of a complex network: the box covering algorithm. Journal of Statistical Mechanics: Theory and Experiment 2007: P03006. [Google Scholar]
  • 21. Hartwell LH, Hopfield JJ, Leibler S, Murray AW (1999) From molecular to modular cell biology. Nature 402: C47–C52. [DOI] [PubMed] [Google Scholar]
  • 22. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabsi AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297: 1551–1555. [DOI] [PubMed] [Google Scholar]
  • 23. Gallos LK, Song C, Havlin S, Makse HA (2007) Scaling theory of transport in complex biological networks. Proceedings of the National Academy of Sciences 104: 7746–7751. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Galvo V, Miranda JGV, Andrade RFS, Andrade JS, Gallos LK, et al. (2010) Modularity map of the network of human cell differentiation. Proceedings of the National Academy of Sciences 107: 5750–5755. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Gallos LK, Makse HA, Sigman M (2012) A small world of weak ties provides optimal global integration of self-similar modules in functional brain networks. Proceedings of the National Academy of Sciences 109: 2825–2830. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Rozenfeld HD, Song C, Makse HA (2010) Small-world to fractal transition in complex networks: A renormalization group approach. Phys Rev Lett 104: 025701. [DOI] [PubMed] [Google Scholar]
  • 27. Barabsi AL, Albert R (1999) Emergence of scaling in random networks. Science 286: 509–512. [DOI] [PubMed] [Google Scholar]
  • 28. Amaral LAN, Scala A, Barthelemy M, Stanley HE (2000) Classes of small-world networks. Proceedings of the National Academy of Sciences 97: 11149–11152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Kleinberg JM (2000) Navigation in a small world. Nature 406: 845–845. [DOI] [PubMed] [Google Scholar]
  • 30. Li G, Reis SDS, Moreira AA, Havlin S, Stanley HE, et al. (2010) Towards design principles for optimal transport networks. Phys Rev Lett 104: 018701. [DOI] [PubMed] [Google Scholar]

Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES