Heliyon. 2023 Oct 25;9(11):e21211. doi: 10.1016/j.heliyon.2023.e21211

Decomposition-integration-based prediction study on the development trend of film industry

Yuan Ni a,b, Siyuan Li a
PMCID: PMC10632694  PMID: 37954322

Abstract

Movies both generate income and spread culture, so predicting the growth of the film industry has attracted considerable interest. Given the volatility of the industry's total box office revenue and the small size of the available samples, this article adopts the decomposition-integration idea to build an EEMD-PSO-LSSVM model for predicting movie box office revenue. The historical box office data are first decomposed into several components using ensemble empirical mode decomposition (EEMD); the resulting sequences are then predicted separately using a least squares support vector machine (LSSVM) tuned by particle swarm optimization (PSO); finally, the predictions for the individual sequences are combined. The experimental results demonstrate that the decomposition-integration technique effectively captures the fluctuation characteristics of quarterly movie box office revenue. Compared with other models, the proposed model has clear advantages for predicting box office revenue time series that are nonlinear, non-stationary, and small in sample size.

Keywords: Film industry, Situation prediction, Decomposition-integration, EEMD

1. Introduction

Movies are an essential part of everyday entertainment, and the film industry, which generates billions of dollars in annual income, is one of the fastest-growing revenue-generating sectors worldwide. The film industry can also play a significant role in spreading national culture because of its strong reach. In China, for instance, the international distribution of movies like "The Wandering Earth" has not only produced strong financial results but also actively promoted the export of regional culture. In most countries and regions, box office receipts are the primary source of income for the film industry. These receipts indicate consumer demand for films and the size of the local film market, and they are a key metric for assessing the health of the industry. An accurate projection of the industry's box office earnings will therefore not only significantly lower investment risk for the sector but also help government agencies develop cultural economy and international cultural trade policies.

Numerous types of models, including linear regression, random forest, support vector machine, decision tree, neural network, etc., have been extensively employed in the research on the topic of predicting movie box office. However, there are a number of challenging issues in the current approaches that have not been effectively resolved, which may have a detrimental impact on the industry's projection of total box office receipts. In contrast to box office prediction research for a single movie, it is challenging to obtain large-scale data sources for the box office revenue of the movie industry, and machine learning prediction methods are susceptible to overfitting issues for non-large-sample data. The movie industry, as a complex ecosystem, has obvious cyclical and trending characteristics, and its fluctuations are brought on by a range of external conditions. However, the effect of ecological units on the cyclical fluctuations in the film business has not been adequately studied. Macroeconomic considerations are the main emphasis of most forecast indicators for the film business, and the delayed release of completely structured statistics makes forecasting difficult.

This study develops an index system for predicting box office receipts based on the idea of ecological situation, examines the industry's own fluctuation characteristics, and weighs the impact of numerous "ecological environment" elements. On this basis, the Internet search index is added as a distinct contextual component to strengthen the predictive capability of the box office revenue prediction model. Drawing on the decomposition-integration idea, this paper also decomposes the box office series into signals, reconstructs them into trend and cycle sequences, predicts each sequence separately with machine learning algorithms, and then sums the outputs to obtain a high-accuracy prediction. The approach presented in this paper exhibits strong prediction performance for box office revenue time series data, which aids a deeper understanding of how the decomposition-integration algorithm applies to prediction problems in the film industry and offers theoretical support for rational and scientific prediction of small-sample time series data. It also gives stakeholders a theoretical framework for regulating the growth of the film industry, avoiding industry risks, and creating industry policies, and it has some reference value for forecasting the development trend of other cultural and creative industries.

2. Review of literature

Box office revenue forecasting has been a popular research problem in the film field, and current forecasting models fall into three categories: statistical models, probabilistic models, and machine learning models. Linear regression is the most frequently used algorithm among statistical models. In the earliest study, Litman [1] used regression as the basic model to predict the box office revenue obtained by distributors. Chintagunta, Gopinath and Venkataraman [2] constructed five linear regression prediction models based on word-of-mouth information and movie characteristics. Although accuracy improved, their models were applied only to the opening-day box office of a movie rather than the total box office, which limited their application. Because these methods assume a linear relationship between the input factors and the box office, they are not good at time series modeling problems with complex characteristics such as nonlinearity and non-stationarity. Probabilistic models were introduced as an alternative to linear regression. Based on queuing theory, this idea was developed into the BOXMOD-I prediction model [3]; however, the model is not generalizable because it was studied using data from only 19 movies, and probabilistic models in general have not been sufficiently validated because experiments covered only a small number of movies. Unlike these modeling methods, artificial intelligence methods represented by machine learning mimic the human thought process on computers; when used for time series prediction they do not rely on strict data requirements, and they offer better robustness and higher prediction accuracy than traditional methods. Typical machine learning models include back-propagation (BP) neural networks, artificial neural networks (ANN), and support vector machines (SVM).
The earliest machine learning study of movie box office was the model of Sharda and Delen [4], which used a multilayer BP neural network to classify box office performance. Abel et al. [5] applied eight different machine learning algorithms to blog data to predict movie box office revenue and music album sales, and the experimental results showed that the machine learning algorithms outperformed linear regression models.

Machine learning models are generally better at predicting highly complex data; however, they are prone to overfitting on small-sample data. Support vector machine prediction models are effective for small-sample data [6,7], but they have many parameters that must be set subjectively, and these parameters affect the model's results. A popular and widely used remedy is to determine these parameters with bio-inspired intelligent optimization algorithms. Particle swarm optimization (PSO) algorithms have been found to give better prediction accuracy while improving the running time of support vector-based models [8,9]. However, this class of methods is sensitive to parameters and model settings, so it is difficult to find a single model that achieves optimal predictions in all cases [10]. To compensate for the shortcomings of single machine learning models, combined-model approaches have achieved great improvements. A typical approach is the TEI@I complex system research methodology proposed by Wang et al. [11], which adopts the idea of "decomposition-integration": first decompose the complex system, then study each part of the decomposition, and finally integrate the parts. A decomposition algorithm preprocesses the original time series and yields several different subseries. Since each subseries has different properties, each decomposed subseries is modeled and predicted independently, and the predicted values of the subseries are finally summed to form an overall prediction with higher accuracy [12].
Representative time series pre-decomposition techniques include wavelet decomposition (WD) [13] and empirical mode decomposition (EMD) [14]. The EMD algorithm decomposes the data based on the time-scale characteristics of the signal itself at different frequencies without relying on a basis function, which is fundamentally different from wavelet decomposition and other methods, and it is better at decomposing nonlinear and non-stationary data. To overcome EMD's tendency toward mode mixing, Wu and Huang [15] proposed the EEMD algorithm, which adds a white noise sequence of very small amplitude to the original time series and averages over many repetitions so that the noise cancels out, effectively avoiding mode mixing. Therefore, this paper draws on the TEI@I theoretical idea to combine the ensemble empirical mode decomposition method with the least squares support vector machine model to construct the box office revenue prediction model, so that the advantages of the two complement each other.

In addition to the continuous improvement on the prediction model, the effective use of relevant indicators is also helpful to improve the prediction accuracy of the model. In traditional research on film industry analysis, the influencing factors usually include several dimensions such as infrastructure situation, market demand level, related industry support, and government policy support. According to the economic characteristics of the film industry, Xia Niya et al. [16] investigated the effects of nominal GDP, movie attendance, the number of screens, the frequency of screenings, and other factors on box office revenue. Yue Xian [17] used the diamond model to evaluate and analyze the Korean movie industry from five perspectives, including production factors, demand conditions, related industries, market strategies, and government support. Most of the above studies are based on traditional statistical data, and for the prediction problem, relying only on traditional indicators to predict the changing trends of the development situation, the impact of unexpected events within the industry cannot be adequately captured. In recent years, unexpected events have occurred frequently, and Internet search behavior reflects people's concern to a certain extent. Applying Internet search data as a new data source in forecasting research can make up for the statistical defects of traditional data and enhance the accuracy of forecasting models. In terms of movie box office prediction, Dai Debao et al. [18] linked web search with movie box office, and the integrated model was able to reduce the prediction error. Li Peizhi et al. [19] used the Baidu index to measure the influencing factors and applied it to the movie box office prediction model, and the results showed that the model had strong prediction ability. Thus, it can be seen that the research on prediction based on web search has also become a worthy application direction, and has achieved better prediction results.

In summary, the decomposition integration idea has achieved excellent results in time series forecasting research, which can solve the model fusion problem efficiently. For today's abrupt change scenarios, the network search data can make up for the shortage of traditional data and play a more significant role in improving the forecasting ability. Therefore, this paper attempts to explore a prediction method that combines the two, and proposes a decomposed integrated prediction model that considers mutation factors. Web search data is introduced into the forecasting framework, and the useful information in the web search data is fully utilized while mining the characteristics of box office revenue sequential data. Firstly, based on the decomposition of box office revenue data using the EEMD method, the decomposed subsequences are reconstructed into trend and period terms, and suitable forecasting methods are selected according to the characteristics of different components, and finally the prediction values are integrated to obtain more accurate forecasting results.

3. Materials and methods

3.1. EEMD-PSO-LSSVM decomposition integrated prediction model

This paper draws on the TEI@I complex system methodology to construct a combined EEMD-PSO-LSSVM model for predicting the development trend of the movie industry; the detailed modeling flow is shown in Fig. 1. First, based on the ecological niche posture theory, we construct a prediction index system for the development posture of the movie industry. At the same time, we use the ensemble empirical mode decomposition algorithm to decompose the initial movie box office sequence into relatively stable IMF sequences of different frequencies and a residual term (RES). The decomposed components are then reconstructed into a frequency (period) sequence and a trend sequence. The frequency sequence is affected by multiple external factors and fluctuates unstably, so it is predicted using the constructed prediction index system combined with the PSO-LSSVM model. The trend sequence clearly follows its own path, so it is predicted from its historical data. Finally, the predicted values of the sequences are integrated to form the final movie box office prediction.

Fig. 1. Flow chart of EEMD-PSO-LSSVM decomposition integrated prediction model.

3.2. Construction of prediction index system based on ecological niche theory

Grinnell first put forward the ecological niche, a crucial concept in modern ecology [20], and in recent years ecological niche theory has also been extensively employed to assess and analyze the growth of businesses and industries. Zhu Chunquan was the first to put forth the principle of ecological niche posture [21]. After years of study by numerous scholars, this theory is generally understood at three levels: the "state" factor, the "potential" factor, and the interface between the two. The movie business can be seen as an ecosystem. Summarizing the available research, the "state" refers to the current situation of the film business, which primarily involves the economy and the availability of resources. The "potential" describes the prospects for the film industry's future development, taking into account both internal industry factors and external influences such as governmental financial or administrative inputs. The interface between "state" and "potential" represents the industry's size.

Based on the above research, this study divides the situation prediction index system for China's film business into three levels: the "state" factor, the "potential" factor, and the intersection of "state" and "potential". It then proposes three connotation layers, "Resource economy Eco-location", "Industrial environment Eco-location", and "Industry Scale Eco-location", on which the prediction index system of film industry development is built, as shown in Table 1.

  • (1)

    The "state" factor, or the Resource economy Eco-location, comprises three ecological components: infrastructure, human resources, and economic condition. This article uses the number of screens, GDP, and per capita expenditure on education, culture, and entertainment to measure the infrastructure and economic condition that form the foundation for the development of the film industry. Human resources are an important factor of production, and the long-term growth of the film industry cannot be separated from high-intensity mental labor, so this paper uses both the total population and the average number of students enrolled in higher education per 100,000 population to reflect human resources.

  • (2)

    Because the "state" and "potential" factors are not entirely independent of one another, this study builds an interface between the two, the Industry Scale Eco-location, to describe the effects of associated supporting industries on the film industry. In this paper, the number of TV series released serves as a proxy for the TV series business, which is closely tied to the film industry.

  • (3)

    The "potential" factor, or the Industrial environment Eco-location, covers financial input and the impact of emergencies. The quality of the film industry's development is heavily influenced by the strength of investment, which mostly depends on national public financial spending. Because of the high level of uncertainty in modern society, emergencies can have unexpected impacts on the growth of the sector. The Internet search index reflects the public's attention to such emergencies and thus indicates how much of an impact they are having.

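Table 1. Film industry situation forecast index system.

| Posture | Connotation | Representation | Indicator |
|---|---|---|---|
| The "state" factor | Resource economy Eco-location | Infrastructure | Number of screens (pieces) |
| The "state" factor | Resource economy Eco-location | Human resource | Total population (10,000 persons) |
| The "state" factor | Resource economy Eco-location | Human resource | Average number of students enrolled in higher education per 100,000 population (persons) |
| The "state" factor | Resource economy Eco-location | Economic condition | Gross domestic product (billion yuan) |
| The "state" factor | Resource economy Eco-location | Economic condition | Per capita expenditure on education, culture and entertainment of residents (yuan) |
| The intersection of "state" and "potential" | Industry Scale Eco-location | Related industries | Number of TV dramas distributed (parts) |
| The "potential" factor | Industrial environment Eco-location | Financial input | National public expenditure on culture, sports and media (billion yuan) |
| The "potential" factor | Industrial environment Eco-location | Impact of emergencies | Baidu index of public health emergencies |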

3.3. Ensemble empirical modal decomposition (EEMD)

The Empirical Mode Decomposition (EMD) algorithm is an adaptive time-frequency local decomposition method proposed by Huang [22], a Chinese-American scientist. It is one of the important tools for dealing with non-stationary and nonlinear signals: based on the characteristics of the given signal itself, it adaptively extracts the corresponding basis functions to decompose the signal into a number of more stable intrinsic mode functions (IMFs) and a residual term. Unlike traditional decomposition algorithms, EMD does not need to assume signal linearity or stationarity in advance and can be applied directly, so it is widely used. Its algorithm flow is shown in Fig. 2.

Fig. 2. Flowchart of EMD algorithm.

However, the EMD method is prone to mode mixing, which mainly takes two forms: first, a single IMF component contains signals of widely differing scales; second, signals of very similar scales appear in different IMF components. To address this, Wu and Huang [23] proposed the Ensemble Empirical Mode Decomposition (EEMD) algorithm, which effectively compensates for the defects of EMD by superimposing Gaussian white noise. Its decomposition steps are as follows.

Step1

Add white noise $n_i(t)$ following the standard normal distribution to the original time series $x(t)$ to produce a new time series:

$$x_i(t) = x(t) + n_i(t) \tag{1}$$

Step2

Decompose the newly generated noisy sequence by EMD to obtain the IMFs $C_{i,j}(t)$ and the residual term $r_i(t)$:

$$x_i(t) = \sum_{j=1}^{J} C_{i,j}(t) + r_i(t) \tag{2}$$

Step3

Repeat Step1 and Step2 $n$ times, adding a new white noise signal to the original signal each time, and average the results to obtain the final decomposition sequence:

$$C_j(t) = \frac{1}{n} \sum_{i=1}^{n} C_{i,j}(t) \tag{3}$$
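
The three steps above can be sketched in a few lines of Python. This is only an illustration of the noise-assisted ensemble averaging in Eqs. (1)-(3), not the implementation used in the study: `toy_decompose` is a hypothetical moving-average stand-in for real EMD sifting (which needs envelope interpolation), and the names `eemd_average`, `n_trials`, and `noise_std` are assumptions introduced here. The sketch also assumes that every trial returns the same number of components, which a real EMD does not guarantee.

```python
import random

def toy_decompose(x, window=3):
    """Hypothetical stand-in for EMD sifting (NOT a real EMD):
    splits x into one fast component and a moving-average residual."""
    n = len(x)
    half = window // 2
    residual = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        residual.append(sum(x[lo:hi]) / (hi - lo))
    fast = [xi - ri for xi, ri in zip(x, residual)]
    return [fast], residual  # (components C_j, residual r)

def eemd_average(x, decompose, n_trials=200, noise_std=0.1, seed=0):
    """Eqs. (1)-(3): add white noise, decompose, average over trials."""
    rng = random.Random(seed)
    comp_sum, res_sum = None, [0.0] * len(x)
    for _ in range(n_trials):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]  # Eq. (1)
        comps, res = decompose(noisy)                         # Eq. (2)
        if comp_sum is None:
            comp_sum = [[0.0] * len(x) for _ in comps]
        for j, c in enumerate(comps):
            for t, v in enumerate(c):
                comp_sum[j][t] += v
        for t, v in enumerate(res):
            res_sum[t] += v
    comps_avg = [[v / n_trials for v in c] for c in comp_sum]  # Eq. (3)
    res_avg = [v / n_trials for v in res_sum]
    return comps_avg, res_avg

# Synthetic 32-point series (illustrative only, not the box office data).
signal = [0.5 * t + (t % 4) for t in range(32)]
imfs, res = eemd_average(signal, toy_decompose)
recon = [sum(c[t] for c in imfs) + res[t] for t in range(len(signal))]
print(max(abs(r - s) for r, s in zip(recon, signal)))  # only averaged noise remains
```

Because the added noise has zero mean, the averaged components plus the averaged residual reproduce the original series up to a small remainder that shrinks as `n_trials` grows, which is the cancellation effect that EEMD relies on.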

3.4. Particle swarm optimization for least squares support vector machine (PSO-LSSVM)

  • (1)

    Least squares support vector machine prediction model (LSSVM)

Least Squares Support Vector Machine (LSSVM), a development of the Support Vector Machine (SVM), can handle small samples, nonlinearity, and other difficulties while retaining high generalization capability and being simple to train [24]. By replacing the inequality constraints of the classic SVM with equality constraints, the LSSVM converts the quadratic programming problem into a system of linear equations, which can speed up the solution and improve convergence accuracy [25]. The precise steps are as follows:

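Each sequence $x$ decomposed by EEMD is mapped into the high-dimensional feature space $H$ by $\varphi: \mathbb{R}^n \to H$:

$$f(x) = \langle \omega, \varphi(x) \rangle + b \tag{4}$$

Using the principle of structural risk minimization, the LSSVM optimization problem is as follows:

$$\min J(\omega, e) = \frac{1}{2}\|\omega\|^2 + \frac{1}{2} C \sum_{i=1}^{l} e_i^2 \quad \text{s.t.} \quad \omega^T \varphi(x_i) + b + e_i = y_i, \quad i = 1, \ldots, l \tag{5}$$

where $e_i$ is the error variable; $C$ is the penalty factor, a constant with $C > 0$; and $\omega$ is the weight vector, $\omega \in \mathbb{R}^n$. The Lagrangian is established to find the optimal solution:

$$L(\omega, b, e, \alpha) = \frac{1}{2}\|\omega\|^2 + \frac{1}{2} C \sum_{i=1}^{l} e_i^2 - \sum_{i=1}^{l} \alpha_i \left( \omega^T \varphi(x_i) + b + e_i - y_i \right) \tag{6}$$

where $\alpha_i$ ($i = 1, \ldots, l$) are the Lagrange multipliers. According to the optimality conditions, after eliminating $\omega$ and $e_i$, the prediction model is:

$$f(x) = \sum_{i=1}^{l} \alpha_i k(x, x_i) + b \tag{7}$$

where $k(x_i, x_j)$ is the RBF kernel function.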
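
To make the link between Eqs. (5)-(7) concrete, the following pure-Python sketch solves the linear system that results from the optimality conditions. Setting the derivatives of the Lagrangian to zero gives $e_i = \alpha_i / C$ and $\sum_i \alpha_i = 0$, so $b$ and $\alpha$ satisfy a single linear system with the matrix $K + I/C$. The function names, the default values of `C` and `sigma`, and the specific kernel form $\exp(-\|a-b\|^2 / (2\sigma^2))$ are assumptions made for illustration; the paper does not state its exact kernel parameterization.

```python
import math

def rbf(a, b, sigma):
    """RBF kernel; the exp(-||a-b||^2 / (2 sigma^2)) form is an assumption."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Gaussian elimination with partial pivoting for A z = rhs."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def lssvm_fit(X, y, C=100.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
    l = len(y)
    A = [[0.0] + [1.0] * l]
    for i in range(l):
        row = [1.0] + [rbf(X[i], X[j], sigma) for j in range(l)]
        row[i + 1] += 1.0 / C  # e_i = alpha_i / C from the optimality conditions
        A.append(row)
    z = solve(A, [0.0] + list(y))
    return z[0], z[1:]  # b, alpha

def lssvm_predict(x, X, alpha, b, sigma=1.0):
    """Eq. (7): f(x) = sum_i alpha_i k(x, x_i) + b."""
    return sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X)) + b

# Toy data (illustrative only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
b, alpha = lssvm_fit(X, y)
print(lssvm_predict([1.5], X, alpha, b))
```

A larger `C` forces the fitted function closer to the training targets, while a smaller `C` smooths it; this is why the choice of `C` and `sigma` matters for generalization.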

  • (2)

    Particle Swarm Optimization (PSO) algorithm

The predictive performance of the LSSVM model is mostly determined by the penalty factor C and the kernel function parameter σ. Therefore, selecting proper values of C and σ is a crucial step in enhancing the model's capacity for prediction and generalization. In this paper, they are optimized using the particle swarm optimization algorithm in the following steps:

Step1

Initialize the particle swarm parameters. Define the search space, upper and lower limits on particle velocities, the maximum number of algorithm iterations, and the learning factor.

Step2

Evaluate the fitness value of the particles. Compute each particle's fitness value using the fitness function, and save each particle's best position and best fitness value.

Step3

Update the particle velocity and position according to the formula as follows:

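$$v = w v + c_1 \, \mathrm{rand}_1 \, (pbest - p) + c_2 \, \mathrm{rand}_2 \, (gbest - p), \quad p = p + v, \quad \mathrm{rand}_1, \mathrm{rand}_2 \in [0, 1] \tag{8}$$

where $v$ is the particle velocity, $p$ is the particle position, $w$ is the inertia weight, $c_1$ and $c_2$ are the learning factors, $pbest$ is the particle's own best position, and $gbest$ is the best position found by the population.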

Step4

Compute each particle's new fitness value and compare it with the fitness value at its previous best position; if the current value is better, take the current position as the particle's best position.

Step5

Compare the fitness value at each particle's best position with the population's best fitness value, and update the population's best position and fitness value if the particle's is better.

Step6

Check whether the maximum number of iterations has been reached or the result falls within the specified precision range. If so, stop iterating and output the result; otherwise, return to Step 3.
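
The six steps above can be sketched as a small, self-contained routine. This is a minimal illustration under stated assumptions rather than the configuration used in the study: the fitness is treated as an error to be minimized, the values of `w`, `c1`, `c2`, and `v_max` are common defaults chosen here, and the quadratic test function merely stands in for a real fitness such as an LSSVM validation error over (C, σ), which the paper does not specify.

```python
import random

def pso_minimize(fitness, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, v_max=1.0, seed=0):
    """Minimize fitness(position) inside box bounds with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: initialize positions, velocities, and limits.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Step 2: evaluate fitness and store personal and global bests.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):  # Step 6: stop after the maximum number of iterations
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Step 3: velocity and position update, Eq. (8).
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                lo, hi = bounds[d]
                pos[i][d] = max(lo, min(hi, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:        # Step 4: update personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:       # Step 5: update population best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Hypothetical stand-in fitness with its minimum at C = 3, sigma = 0.5.
def toy_error(params):
    C, sigma = params
    return (C - 3.0) ** 2 + (sigma - 0.5) ** 2

best, best_val = pso_minimize(toy_error, [(0.0, 10.0), (0.01, 5.0)])
print(best, best_val)
```

Clamping velocities and positions keeps the search inside the admissible range for the two hyperparameters, which matters because both C and σ must stay positive.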

4. Results

4.1. Data collection and pre-processing

For the empirical analysis, this article gathers and organizes data from the first quarter of 2012 to the fourth quarter of 2019. The first quarter of 2012 was chosen as the starting point because the official quarterly box office revenue data for 2011 had missing values, which would have left the system of prediction indexes incomplete. The widespread Covid-19 outbreak in early 2020 can be viewed as a turning point, after which the movie industry's trend changed significantly and its fluctuations became less frequent. The primary factors influencing industrial development in this second stage have also changed and are now mostly governed by macro policies. Because it is difficult to use the same methodology to study box office revenue in two stages that differ so significantly, the fourth quarter of 2019 is chosen as the end point of data collection.

The total population, the average number of students enrolled in higher education per 100,000 people, the gross domestic product, the per capita expenditure on education, culture, and recreation, and the national public finance expenditure on culture, sports, and media come from the China Statistical Yearbook; the number of screens comes from the China Movie Yearbook; the box office receipts come from the China Movie Market Data Report; and the number of TV drama releases comes from the China TV Drama Release Data Report. Since the average number of students enrolled in higher education per 100,000 people, the number of screens, and the national public finance expenditure on culture, sports, and media follow an annual statistical cycle, this paper uses Eviews10 to convert the annual data for these three indicators into monthly data in order to keep the data frequency consistent. For the Baidu index, the keyword is "public health and safety emergencies" and the time frame is January 1, 2012 to December 31, 2019; the search index is crawled with Python, and the crawled values are summed within each quarter to form a quarterly search index. In total, 32 quarterly observations spanning 2012–2019 are gathered.
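
The aggregation of the crawled search index into quarterly values can be illustrated as follows. This is a sketch under assumptions: the paper does not state the granularity of the crawled index, so dated records (for example daily values) are assumed, and the function name `quarterly_sum` and the sample numbers are hypothetical.

```python
from collections import defaultdict
from datetime import date

def quarterly_sum(records):
    """Sum (date, value) records into {(year, quarter): total}."""
    totals = defaultdict(float)
    for d, v in records:
        quarter = (d.month - 1) // 3 + 1
        totals[(d.year, quarter)] += v
    return dict(sorted(totals.items()))

# Made-up index values for illustration only.
records = [
    (date(2012, 1, 5), 10),
    (date(2012, 3, 31), 5),
    (date(2012, 4, 1), 7),
]
print(quarterly_sum(records))  # {(2012, 1): 15.0, (2012, 2): 7.0}
```

Applied to the full 2012–2019 window, this kind of grouping yields one value per quarter, matching the 32 quarterly observations of the other indicators.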

4.2. Raw data EEMD decomposition

Since movie box office revenue is a nonlinear, non-stationary time series, the original series is decomposed in this stage using ensemble empirical mode decomposition. Fig. 3 shows the original sequence together with the decomposed sequences, four IMF components and one residual term, arranged from high to low frequency. As the figure shows, the decomposed modal components have a simpler and more regular structure, which can help the prediction model's fitting.

Fig. 3. Box office revenue EEMD decomposition results.

Fig. 3 shows the frequency variability in the movie box office data series. Because of the nonlinear and noisy components of the original time series, which may be attributed to seasonal variation in box office receipts, IMF1 is erratic and shows no visible periodicity. In addition, the average periods of IMF1–IMF3 are very short, indicating high-frequency sequences that reflect the impact of outside influences on the growth of the film industry, while IMF4's average period is longer, indicating a low-frequency sequence. The high-frequency and low-frequency series together reflect the frequency features of box office changes, so IMF1 through IMF4 are reconstructed as the frequency series. The residual term is used as the trend sequence because it captures the overall trend of box office receipts well. The results of the reconstruction are displayed in Fig. 4.

Fig. 4. Sequence reconstruction results.
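
The reconstruction described in this section, and the final integration step of the model, amount to element-wise sums. The sketch below shows this with made-up numbers; the function names `reconstruct` and `integrate` are hypothetical, and the values are not the paper's data.

```python
def reconstruct(imfs, residual):
    """Sum all IMFs into the frequency series; the residual is the trend."""
    frequency = [sum(values) for values in zip(*imfs)]
    trend = list(residual)
    return frequency, trend

def integrate(frequency_pred, trend_pred):
    """Final prediction: add the two component predictions point by point."""
    return [f + t for f, t in zip(frequency_pred, trend_pred)]

# Two short illustrative IMFs and a residual (made-up values).
imfs = [[1.0, -2.0, 0.5], [0.2, 0.3, -0.1]]
residual = [10.0, 11.0, 12.0]
frequency, trend = reconstruct(imfs, residual)
print(frequency, trend)
print(integrate(frequency, trend))
```

With the true components as input, `integrate` returns the original series, so any error in the final forecast comes only from the two component predictions.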

4.3. Analysis of prediction results

4.3.1. Analysis of EEMD-PSO-LSSVM forecasting results

For the frequency series, the PSO-LSSVM model is used to make predictions. The frequency series is the dependent variable and the indicator system built in Section 3.2 provides the independent variables; the 24 quarterly observations from Q1 2012 to Q4 2017 form the training set, and the 8 observations from Q1 2018 to Q4 2019 form the test set. The PSO-LSSVM model is also used to predict the trend series from its historical data, with the same training set (Q1 2012 to Q4 2017) and test set (Q1 2018 to Q4 2019). The curve comparing the true values with the predictions of the EEMD-PSO-LSSVM model is shown in Fig. 5.

Fig. 5. EEMD-PSO-LSSVM final forecast results.

As illustrated in Fig. 5, the EEMD-PSO-LSSVM model is effective at tracking changes in the movie box office sequence data, and both the training set data and the test set data produce very accurate prediction results.

4.3.2. Comparison of multi-model prediction effect evaluation

Both the absolute and relative errors of the model should be taken into account when assessing prediction performance. The evaluation indices chosen in this study are the mean absolute error (MAE) and the root mean square error (RMSE).

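$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| y_t - \hat{y}_t \right| \tag{9}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2}{n}} \tag{10}$$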

The root mean square error (RMSE) indicates the dispersion of the error between the predicted and true values; the mean absolute error (MAE) indicates the average size of the error between the predicted and true values. With these two indicators, the prediction effect of the model can be evaluated comprehensively.
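
As a small worked illustration of Eqs. (9) and (10), and of expressing errors relative to a benchmark model as done later in Table 2, consider the following sketch. The function names and all numbers are illustrative assumptions, not values from the study.

```python
import math

def mae(y_true, y_pred):
    """Eq. (9): mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Eq. (10): root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def relative_to_benchmark(errors, benchmark):
    """Divide every model's error by the benchmark model's error."""
    base = errors[benchmark]
    return {name: value / base for name, value in errors.items()}

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mae(y_true, y_pred))   # 2/3: one error of 2 averaged over 3 points
print(rmse(y_true, y_pred))  # sqrt(4/3): the single large error weighs more
print(relative_to_benchmark({"model_a": 2.0, "benchmark": 1.0}, "benchmark"))
```

RMSE exceeds MAE here because squaring emphasizes the one large deviation, which is why the two indicators together describe both the average size and the dispersion of the errors.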

For the empirical study, the original data samples are split into training and testing sets as described above and then fed into a number of single prediction models, including ARIMA, RF, BP, and SVM, and into the combined prediction models EMD-PSO-LSSVM and EEMD-PSO-LSSVM. Although the combination model proposed in this study has the longest computation time and the single BP model the shortest, every model's computation time falls within an acceptable range for research purposes. To show the experimental results more clearly, this study uses the EEMD-PSO-LSSVM model as the benchmark and calculates the MAE and RMSE of the other models relative to it. Table 2 displays the relative error values of the prediction models.

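Table 2. Prediction error values for the models.

| Model category | Model | MAE (relative to EEMD-PSO-LSSVM) | RMSE (relative to EEMD-PSO-LSSVM) | Time (s) |
|---|---|---|---|---|
| Single model | ARIMA | 2.927 | 3.526 | 0.437 |
| Single model | ETS | 3.749 | 4.123 | 0.591 |
| Single model | RF | 1.771 | 1.875 | 0.497 |
| Single model | BP | 1.627 | 1.640 | 0.316 |
| Single model | RNN | 1.740 | 1.956 | 0.563 |
| Single model | SVM | 2.016 | 1.906 | 0.479 |
| Single model | LSSVM | 1.765 | 1.755 | 0.728 |
| Single model | PSO-LSSVM | 1.358 | 1.385 | 2.377 |
| Combined model | EMD-PSO-LSSVM | 1.452 | 1.174 | 4.910 |
| Combined model | EEMD-PSO-LSSVM | 1 | 1 | 5.582 |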

As the experimental results show, the single prediction models have shorter computation times than the combined prediction models. The EEMD-PSO-LSSVM decomposition-integration model has the longest computation time, but it is still within an acceptable range. In terms of accuracy, the EEMD-PSO-LSSVM model outperforms every single prediction model on both MAE and RMSE, and the EMD-PSO-LSSVM model has a lower RMSE than every single model. This shows that the decomposition-integration prediction model has strong applicability and superiority in prediction research on box office time-series data.

In terms of RMSE, the model error decreases progressively from the undecomposed model to EMD decomposition to EEMD decomposition (1.385, 1.174, and 1 for PSO-LSSVM, EMD-PSO-LSSVM, and EEMD-PSO-LSSVM, respectively), although the MAE of EMD-PSO-LSSVM (1.452) is slightly higher than that of PSO-LSSVM (1.358). The EEMD algorithm gives the larger improvement in the accuracy of the box office revenue prediction model, indicating that the choice of decomposition method affects the prediction results. Decomposing the raw data into relatively smooth and simple subsequences also helps to improve the prediction effect of the underlying model. Table 2 shows that the prediction method proposed in this paper yields the best prediction results, which indicates that the constructed index system can more accurately depict the external influences on industry development and that the decomposition-integration combination model can capture the fluctuation characteristics of the original quarterly movie box office data. The method also offers a useful reference for predicting movie box office revenue.

5. Discussion

In a period of recurrent emergencies, it is crucial for the government, enterprises, and investors to understand the industry's development pattern. A detailed grasp of the film industry's swings, particularly the changes in direction, can assist related businesses and investors in making prudent trading decisions, reducing risks, and maximizing rewards. In order for the government to make informed decisions and implement appropriate early warning systems, it is important to have a thorough grasp of the film market's dynamics.

The total box office is used as a barometer for the dynamics of the cinema and television industries in this study. Based on the decomposition-integration concept, we then propose a method for projecting the development trend of the film business. The method is validated on quarterly box office time series data for China covering the last eight years, and three findings are derived.

Finding 1: The prediction model based on "decomposition-integration" clearly outperforms baseline methods such as support vector machines, BP neural networks, and ARIMA in prediction accuracy on box office time series data. First, the decomposition-integration modeling strategy makes full use of the ensemble empirical modal decomposition algorithm to decompose the complex raw data series into simple components, which reduces modeling difficulty and improves prediction accuracy [15,26,27,28]. Second, several scholars have proposed the "component reconstruction" optimization technique in view of the potential for error accumulation during the integration of results [29,30]. In this study, the decomposition results are reconstructed into trend and period series according to the fluctuation properties of the components and are then forecasted with separate models. The results support the effectiveness of reconstruction in optimizing the "decomposition-integration" strategy: the detrimental effect of error accumulation on the prediction outcomes is notably reduced, and the cost of running the model is much lower. This suggests that modal decomposition focuses on data-level noise processing but is insufficient for feature-level information mining. Reconstructing similar components fuses the various data components, better reflects the reality the data represent, and improves the prediction effect at the decision-making level.

Third, the "decomposition-integration" method is commonly applied to large samples of time-series data. Because of limitations on data availability, the data in this study constitute a small sample rather than a large one. The results show that empirical modal decomposition adds a number of components that help compensate for the small sample size, and that the decomposition-integration technique still performs well. Additional small-sample data sets could be used in the future to verify whether the decomposition-integration method is sensitive to the scale of the time-series data. This conclusion therefore enriches "decomposition-integration" prediction theory, time series prediction theory, and small-sample prediction theory. The proposed methodology has some reference value for forecasting the growth trend of animation, TV series, and other film and cultural industries whose time series have similar data characteristics.
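The reconstruct-then-integrate idea described in Finding 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the EEMD-PSO-LSSVM implementation used in the study: the components are synthetic stand-ins for decomposition output, the zero-crossing rule for grouping components is an assumed heuristic, and linear extrapolation and a seasonal-naive forecast replace the PSO-LSSVM predictors. All function names are hypothetical.

```python
import numpy as np


def zero_crossings(x):
    """Count sign changes around the mean (a rough proxy for frequency)."""
    s = np.sign(x - x.mean())
    s = s[s != 0]
    return int(np.sum(s[1:] != s[:-1]))


def reconstruct(components, threshold=2):
    """Sum components into a 'period' series (many crossings) and a 'trend' series (few)."""
    period = np.zeros(len(components[0]), dtype=float)
    trend = np.zeros(len(components[0]), dtype=float)
    for c in components:
        if zero_crossings(c) > threshold:
            period += c
        else:
            trend += c
    return period, trend


def forecast_trend(trend, horizon):
    """Linear extrapolation: an assumed stand-in for the trend predictor."""
    t = np.arange(len(trend))
    slope, intercept = np.polyfit(t, trend, 1)
    future_t = np.arange(len(trend), len(trend) + horizon)
    return slope * future_t + intercept


def forecast_period(period, horizon, season=4):
    """Seasonal-naive forecast: repeat the last full season (4 quarters)."""
    last = period[-season:]
    return np.array([last[i % season] for i in range(horizon)])


def decompose_integrate_forecast(components, horizon):
    """Forecast each reconstructed series separately, then add the forecasts."""
    period, trend = reconstruct(components)
    return forecast_period(period, horizon) + forecast_trend(trend, horizon)


# Synthetic components (NOT the paper's data): a quarterly cycle and a drift.
t = np.arange(32)
seasonal = np.sin(2 * np.pi * t / 4)
drift = 0.5 * t + 10.0
pred = decompose_integrate_forecast([seasonal, drift], horizon=4)
print(pred)
```

Forecasting two reconstructed series instead of every individual component means fewer models are fitted and fewer forecast errors are summed, which is the motivation given above for component reconstruction.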

Finding 2: The cyclical fluctuations in the film industry's development result from a range of environmental factors that exhibit typical ecological features and together constitute a full "ecosystem." Like the development of any other industry, the growth of the film industry involves both trend changes and cyclical changes. Trend changes reflect people's increased demand for cultural consumption, which is inevitable once the regional economy and material consumption reach a certain level. Cyclical changes are more complex and require consideration of the macro-environment, emergencies, and other factors, as well as of the industry itself. Existing research, however, does not distinguish between the two types of change and assumes that all factors affect both, which is not the case. This study therefore treats the development of the movie industry as a superposition of these two independent processes. It holds that cyclical changes are the main source of uncertainty in the industry's development, that the impact of various external environmental factors must be considered, and that small changes in each unit can cause large fluctuations in the industry's situation. Because this process closely resembles the formation of ecological niches by ecological units, this paper uses ecological niche theory to create an indicator system for forecasting cyclical changes in the film industry. The study's findings also demonstrate that the theory is well suited to depicting fluctuating trends in the film industry. The result increases our understanding of the factors contributing to uncertainty in the evolution of the film industry and broadens the application scenario for ecological niche theory.

Finding 3: Constructing a predictive indicator system from both structured and unstructured data has a favorable impact on forecasting the state of the film industry. Because structured data has a temporal lag, it frequently prevents machine learning models from accurately capturing the influence of indicator elements on the predicted variable in time series forecasting. Previous studies have shown that fusing unstructured data sets has a substantial impact on improving model predictions [31]. The online search index can represent how people behave emotionally in particular circumstances, and adding it to the prediction model can increase model precision to varying degrees [19,32]. Accordingly, this study includes the Baidu index, an unstructured data set, to enrich the feature space of the predictive model. The conclusion demonstrates that the fused-data indicator system improves predictive modeling, which is useful for screening and optimizing predictive indicator systems.
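As a rough illustration of how a search index can be fused with structured indicators into one feature space, the Python sketch below standardizes each column and stacks them side by side. The indicator values, the column meanings, and the function names `zscore` and `fuse_features` are hypothetical; the study's actual indicator system and preprocessing are not reproduced here.

```python
import numpy as np


def zscore(x):
    """Standardize each column to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (x - x.mean(axis=0)) / std


def fuse_features(structured, search_index):
    """Column-stack standardized structured indicators with a search-index series."""
    structured = np.asarray(structured, dtype=float)
    search = np.asarray(search_index, dtype=float).reshape(-1, 1)
    if len(structured) != len(search):
        raise ValueError("all inputs must cover the same quarters")
    return np.hstack([zscore(structured), zscore(search)])


# Hypothetical quarterly values (NOT the paper's data):
# two structured indicators per quarter plus one search-index reading.
structured = [[100.0, 3.1], [110.0, 3.0], [125.0, 3.4], [130.0, 3.3]]
search_index = [5200, 6100, 7400, 6900]
X = fuse_features(structured, search_index)
print(X.shape)
```

Standardizing before stacking keeps indicators measured on very different scales from dominating a distance- or kernel-based predictor.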

Further research could address the following areas. First, the original volume of movie box office revenue data is small, so better data collection and preprocessing solutions can be sought. Second, in the component reconstruction step, in addition to theoretically analyzing the fluctuation characteristics of the data, a mathematical model can be applied to assist in classifying the components, with the goal of obtaining a better reconstruction strategy and achieving a better prediction effect.

Data availability statement

Data will be made available on request.

Funding statement

This research was funded by the National Key R&D Program (No. 2021YFF0900200).

CRediT authorship contribution statement

Yuan Ni: Conceptualization, Funding acquisition, Project administration, Supervision, Validation, Writing – review & editing. Siyuan Li: Data curation, Formal analysis, Software, Writing – original draft, Writing – review & editing, Visualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

None.

References

  • 1. Litman B.R. Predicting success of theatrical movies: an empirical study. J. Popular Cult. 1983;16(4):159.
  • 2. Chintagunta P.K., Gopinath S., Venkataraman S. The effects of online user reviews on movie box office performance: accounting for sequential rollout and aggregation across local markets. Market. Sci. 2010;29(5):944–957.
  • 3. Sawhney M.S., Eliashberg J. A parsimonious model for forecasting gross box-office revenues of motion pictures. Market. Sci. 1996;15(2):113–131.
  • 4. Sharda R., Delen D. Predicting box-office success of motion pictures with neural networks. Expert Syst. Appl. 2006;30(2):243–254.
  • 5. Abel F., Diaz-Aviles E., Henze N., Krause D., Siehndel P. 2010 International Conference on Advances in Social Networks Analysis and Mining. IEEE; 2010, August. Analyzing the blogosphere for predicting the success of music and movie products; pp. 276–280.
  • 6. Kim T., Hong J., Kang P. Box office forecasting using machine learning algorithms based on SNS data. Int. J. Forecast. 2015;31(2):364–390.
  • 7. Hur M., Kang P., Cho S. Box-office forecasting based on sentiments of movie reviews and Independent subspace method. Inf. Sci. 2016;372:608–624.
  • 8. Senkal S., Ozgonenel O. 2013 8th International Conference on Electrical and Electronics Engineering (ELECO). IEEE; 2013, November. Performance analysis of artificial and wavelet neural networks for short term wind speed prediction; pp. 196–198.
  • 9. Jiang A.N., Liang B. Nonlinear time series prediction model for dam seepage flow based on PSO-SVM. J. Hydraul. Eng. 2006;37(3):331–335.
  • 10. Wang S., Tang L., Yu L.A. Univariate decompose-ensemble method based milk demand forecasting. J. Syst. Sci. Math. Sci. 2013;33(1):11.
  • 11. Wang S.Y., Yu L.A., Lai K.K. Crude oil price forecasting with TEI@I methodology. J. Syst. Sci. Complex. 2005;18(2):145.
  • 12. Yan Y., Wei X., Hui B., Yang S., Zhang W., Hong Y., Wang S.Y. Method for housing price forecasting based on TEI@I methodology. Syst. Eng.-Theor. Prac. 2007;27(7):1–9.
  • 13. Walczak B., Massart D.L. Noise suppression and signal compression using the wavelet packet transform. Chemometr. Intell. Lab. Syst. 1997;36(2):81–94.
  • 14. Huang N.E., Shen Z., Long S.R., Wu M.C., Shih H.H., Zheng, et al. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. Royal Soc. London. Ser. A: Math. Phys. Eng. Sci. 1998;454(1971):903–995.
  • 15. Wu Z., Huang N.E. A study of the characteristics of white noise using the empirical mode decomposition method. Proc. Royal Soc. London. Ser. A: Math. Phys. Eng. Sci. 2004;460(2046):1597–1611.
  • 16. Xia N.Y., Pu Y.J. Analysis of factors that influence box office income-based on multi-country panel data of film industry ecnomic feature. Econ. Probl. Expl. 2012;(6):136–144.
  • 17. Xian Y. Competitiveness analysis of Korean film and television industry based on Michael Porter diamond model. Front. Art Res. 2022;4(5).
  • 18. Dai D., Chen J. vol. 1952. IOP Publishing; 2021, June. Research on mathematical model of box office forecast through BP neural network and big data technology. (Journal of Physics: Conference Series). No. 4.
  • 19. Li P.Z., Dong Q.L. Box office prediction model based on web search data and machine learning. Oper. Res. Manag. Sci. 2021;30(11):168.
  • 20. Grinnell J. The niche-relationships of the California Thrasher. The Auk. 1917;34(4):427–433.
  • 21. Zhu C.Q. Ecological niche posture theory and expansion hypothesis. J. Ecol. 1997;3:324–332.
  • 22. Huang N.E., Wu Z. A review on Hilbert‐Huang transform: method and its applications to geophysical studies. Rev. Geophys. 2008;46(2).
  • 23. Wu Z., Huang N.E. Ensemble empirical mode decomposition: a noise-assisted data analysis method. Adv. Adapt. Data Anal. 2009;1(1):1–41.
  • 24. Yan W., Shao H., Wang X. Soft sensing modeling based on support vector machine and Bayesian model selection. Comput. Chem. Eng. 2004;28(8):1489–1498.
  • 25. Yu H., Chen Y., Hassan S.G., Li D. Prediction of the temperature in a Chinese solar greenhouse based on LSSVM optimized by improved PSO. Comput. Electron. Agric. 2016;122:94–102.
  • 26. Yu L., Dai W., Tang L. A novel decomposition ensemble model with extended extreme learning machine for crude oil price forecasting. Eng. Appl. Artif. Intell. 2016;47:110–121.
  • 27. Cao J., Li Z., Li J. Financial time series forecasting model based on CEEMDAN and LSTM. Phys. Stat. Mech. Appl. 2019;519:127–139.
  • 28. Freire P.K.D.M.M., Santos C.A.G., da Silva G.B.L. Analysis of the use of discrete wavelet transforms coupled with ANN for short-term streamflow forecasting. Appl. Soft Comput. 2019;80:494–505.
  • 29. Yu L., Wang Z., Tang L. A decomposition–ensemble model with data-characteristic-driven reconstruction for crude oil price forecasting. Appl. Energy. 2015;156:251–267.
  • 30. Wang Y., Liu J., Li R., Suo X., Lu E. Medium and long-term precipitation prediction using wavelet decomposition-prediction-reconstruction model. Water Resour. Manag. 2022;36(3):971–987.
  • 31. Bokelmann B., Lessmann S. Spurious patterns in Google Trends data-An analysis of the effects on tourism demand forecasting in Germany. Tourism Manag. 2019;75:1–12.
  • 32. Kou Y., Ye Q., Zhao F., Wang X. Effects of investor attention on commodity futures markets. Finance Res. Lett. 2018;25:190–195.

Articles from Heliyon are provided here courtesy of Elsevier
