Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2021 Aug 5;16(8):e0255419. doi: 10.1371/journal.pone.0255419

Analysis of big data job requirements based on K-means text clustering in China

Dai Debao 1, Ma Yinxia 1,*, Zhao Min 2
Editor: Bing Xue3
PMCID: PMC8341572  PMID: 34351951

Abstract

This paper aims to understand the characteristics of domestic big data jobs requirements through k-means text clustering, help enterprises, and employees to identify big data talents, and promote the further development of big data-related research. Firstly, the crawler software is used to crawl the recruitment information about "big data" on the zhaopin.com recruitment website. Then, Jieba word segmentation and K-means text clustering are used to cluster big data recruitment positions, and the number of clustering was determined by the average sum of squares within the group. Finally, big data jobs are divided into 10 categories, and the urban distribution, salary level, education requirements, and experience requirements of big data jobs are discussed and analyzed from the perspectives of the overall data set and clustering results, to clarify the characteristics of big data job demands. The analysis results show that the job demands of big data are mainly distributed in first-tier cities and new first-tier cities. Enterprises are more inclined to job seekers with a college degree or bachelor’s degree and more than one year’s relevant experience. There are wage differences among different types of jobs. The higher the position, the higher the requirement for education and experience will be.

1. Introduction

Modern society has entered the era of big data, and data has been generally recognized as the most competitive and valuable asset at present and has become one of the important strategic resources of all countries in the world [1]. Countries around the world have formulated big data strategies to seize the commanding heights of big data development strategies. China attaches great importance to the status and role of big data in promoting economic and social development. In 2014, big data was written into the government work report for the first time. Big data has gradually become a hot topic for governments at all levels. The concepts of open sharing of government data, data circulation and trading, using big data to guarantee and improve people’s livelihood have been deeply rooted in people’s minds. Since then, relevant departments of the state have issued a series of policies to encourage the development of the big data industry. With the great attention and policy support of the state, the big data industry has rapidly emerged and developed, and a large number of big data-related jobs have emerged at the right moment. As a result, the demand for data talents of enterprises has greatly increased. According to the 2016–2020 China Big Data Industry Panorama Survey and Development Trend Forecast Report, in 2013, the demand scale of China’s big data market was 1.12 billion yuan, which was more than twice that of the previous year in 2014 and 2015, and exceeded 10 billion yuan in 2016. It is estimated that the demand scale of China’s big data market is expected to reach 107 billion yuan in 2020. It can be inferred that the size of the big data market will grow exponentially in the next few years. However, the development and rise of big data technology in China have taken a short time, and the cultivation of big data talents has not kept pace with the needs of enterprises. The lack of skills and experience of big data practitioners has led to the widening gap of big data positions, and the phenomenon that big data positions are not in line with the needs has become one of the key obstacles restricting the development of China’s big data industry [2]. According to the Big Data Talent Report released by Datalink Search, there are currently only 460,000 big data talents in China. The gap of big data talents in the next 3–5 years will reach 1.5 million. Among them, data analysis talents are in great demand, but the data supply index of analytical talents is low, only 0.05, which is highly scarce [3]. According to statistics from the Data Analysis Professional Committee of the China Business Federation, China’s basic data talent gap will reach 14 million in the future, and more than 60% of China’s Internet industry recruitment positions are recruiting big data talents [4,5]. So what data positions does the company currently have, and what specific requirements do these positions have for talents? Through the collection and statistical analysis of my country’s big data talent network recruitment information, this paper sorts out the talent requirements, salary, geographical distribution, and job categories of big data-related positions, and then analyzes the demand for big data work in my country. This article analyzes the needs of big data people in depth, provides references for data talent training and job-seekers’ knowledge and ability construction, and helps companies quickly locate talents. It is important for scientifically and effectively to promote big data talent training, clarify big data talent positioning, and promote big data talents. The development of the data industry has important guiding significance.

2. Related studies

With the booming development of the big data industry, the demand for big data jobs has been surging and gradually diversified. However, at present, there are few types of research on big data jobs, most of which use traditional statistical analysis and information measurement methods. Puncheva-Michelotti [6]. studied whether the transmission of CSR awareness in online recruitment information can improve the attractiveness of job seekers through manual statistical analysis of CSR awareness. Zhang and Zheng [7]. used descriptive statistics to analyze gender differences in self-evaluation and salary expectation in Online recruitment information in China. Papoutsoglou [8]. used multivariate statistical analysis to explore the skills and abilities of IT talents from recruitment advertisements, and analyzed the correlation between skills and abilities. Uhm [9]. and Karakatsanis [10]. used O*NET analysis software to analyze the organizational relationship, industry trend, regional trend, and other industry information among recruitment posts, providing market change information for job seekers and recruiters.

To sum up, although traditional statistics and metrology methods have made some achievements in studying the job demands of big data, the above researches have the problem of the small amount of data. At the same time, a large amount of manpower is also needed to analyze the data in the research process, and the research results are easily affected by people’s cognitive level, which makes it impossible to fully excavate the laws hidden beneath the data surface. Compared with traditional statistics and information metrology, the unsupervised learning clustering algorithm is not limited to human subjective factors and sampling data to mine job recruitment information of big data. It can mine and discover massive data and better express the potential content of data [11]. In existing studies, some scholars use the KNN algorithm [1214] and unsupervised machine learning [15,16] to classify online recruitment information according to recruitment posts or job descriptions, to excavate labor market information or talent demand in professional fields. Besides, text mining technology [1719] is also widely used to deeply dig the talent demand information in the job description text, to obtain the skills, technologies, or knowledge fields that the position needs to master. Among them, the semanteme-based subject model [2023] is often used to mine job skills and requirements, extract the subject words of the job description, and then conduct manual induction, analysis, and summary of the subject words to further dig the potential demand information.

Text clustering is one of the common methods in text mining technology, which is widely used in many fields such as topic discovery and hot spot tracking [24]. In the specific implementation process, a variety of algorithms can be applied to text clustering. K-means is one of the classical clustering algorithms in the partitioning method. Due to its high efficiency, it is widely used to cluster large-scale data [25]. At present, there have been a lot of researches based on k-means text clustering. Kanungo [26] and Wu [27] improved the k-means algorithm, enhanced its superiority and operation speed, and made this method more effective in the field of text mining. Many studies have applied the k-means algorithm to document clustering [2830]. By extracting text features, building an overlapping document clustering model, setting term weight, and other methods, the accuracy of document clustering is improved and the error between clustering is reduced. The semantically based K-means algorithm can combine the TF-IDF algorithm [31], LDA model [32], and BTM model [33] to effectively detect and mining hot topics in social networks, make up for the sparse problem of the traditional topic model, and better understand short texts. It can also classify video users in combination with emotional analysis, so as to understand the emotional similarities and differences of certain types of video viewers [34]. It can be seen that the semantic-based K-means clustering can effectively understand the passage, process natural language from all aspects, and dig out its potential information and value.

To sum up, the innovation of this paper is to obtain recruitment information related to big data on Zhaopin.com recruitment website through web crawlers. Based on the traditional K-means clustering algorithm, combining stutter segmentation and regular expression to improve the word effect, the understanding of the model to the short text is improved, and the number of clustering is determined by calculating the sum of squares within each group so that the text clustering effect is better. This model is applied to big data job recruitment information mining, replacing traditional statistics and metrology methods to obtain clustering results, and analyze the results, so as to comprehensively mine the demand characteristics of big data-related jobs in the industry.

3. Methods

3.1 Data collection

Zhaopin.com recruitment website is one of the large recruitment websites in China, which publishes a large number of recruitment information related to big data. Using "big data" as the keyword, this paper uses web crawler software to capture the recruitment information of big data-related positions in all walks of life from five dimensions, such as job title, experience requirement, education requirement, working place, and salary range. The collected data fields and examples are shown in Table 1. The data were collected from 25 cities in total, as shown in Table 2. Among them, first-tier cities are Beijing, Shanghai, Guangzhou, and Shenzhen, 12 new first-tier cities are Chengdu, Hefei, and Hangzhou, and 9 other cities besides first-tier and confidence cities are Dalian, Fuzhou, and Harbin. The job postings were made in three months between August 2020 and October 2020, with 16,551 online job postings.

Table 1. Shows the collected data.

Data dimension Job title Salary range Working place Degree required Experience requirements
The sample Big data architect 20,000 to 30,000 Beijing junior college 5–10 years

Table 2. Name and number of cities in each category.

City category City name City number
First-tier cities Beijing, Shanghai, Guangzhou, Shenzhen 4
New first-tier cities Chengdu, Hefei, Hangzhou, Nanjing, Shenyang, Tianjin, Wuxi, Wuhan, Xi ’an, Changsha, Zhengzhou, Chongqing 12
Others Dalian, Fuzhou, Harbin, Guiyang, Jinan, Ningbo, Xiamen, Shijiazhuang, Changchun 9

3.2 Data procession

To improve the quality of unstructured text data, it is necessary to preprocess the data [35]. In this paper, first of all, the collected data were reprocessed, and 14,496 valid data were obtained after deleting invalid and incomplete data. Then, stammer segmentation was used to divide the job name data into words [36]. Since the data set contains technical terms, this paper added a self-defined dictionary in the pre-processing stage, which contains related terms for big data technology, such as SPARK, SQL, HADOOP, data analysis, data mining, and other professional terms. To ensure the accuracy of word segmentation, regular expression rules are added to match strings. Finally, the stop words list is used to automatically filter out auxiliary words, mood words, and other meaningless words. The flow chart of data preprocessing is shown in Fig 1.

Fig 1. Data preprocessing flow chart.

Fig 1

3.3 Data analysis

K-means text clustering is widely used in many fields of text analysis. This algorithm has high efficiency and can be used to cluster large-scale data. In this paper, k-means text clustering is used to conduct text clustering analysis of the processed job name data. By combining natural language processing and clustering algorithm, the similarity within the cluster is improved and the similarity between clusters is reduced. Big data positions are divided and different big data positions are identified. In the clustering process, the sum of squares within the group after each clustering was calculated to determine the optimal number of clustering clusters. Finally, the text selected 10 clusters, and the sum of squares within the group under this number of clustering was 0.6. The similarity in the groups divided by big data positions was relatively high.

4. Results

There are a large number of big data positions and titles. After cluster analysis, big data positions are divided into 10 categories, and the top 5 positions in the categories are counted by word frequency. However, the number of samples in category 6 is large, so 10 high-frequency positions are selected, as shown in Table 3. There are a large number of big data positions and titles. After cluster analysis, big data positions are divided into 10 categories, and the top 5 positions in the categories are counted by word frequency. However, the number of samples in category 6 is large, so 10 high-frequency positions are selected, as shown in Table 3. As can be seen from Table 3, the correlation and consistency between clustering keywords and high-frequency posts are relatively high. Category 1 mainly development engineer position, category 2 engineers, mainly engaged in data category 3 is the data operating post, category 4 is mainly engaged in basic data entry and management post, category 5 is greater demand for data analysis commissioner, category 6 is the biggest demand of data statistics, data annotation, data mining, such as jobs, category 7 represents the data analyst positions, including assistant, elementary, intermediate and advanced level of other position, category of 8 high frequency is a data analyst and data analysis engineer, class 9 is mainly related to data sales positions, rank higher, such as sales manager, Category 10 is the product manager position related to big data.

Table 3. K-means clustering results and related positions.

Clustering categories Sample size Clustering keywords Relevant position
Cluster 1 2034 development, big data, engineer, data, java, advanced, database, R&D, data warehouse, platform Big data development engineer database Development engineer
Big data engineer
Big data research and development engineer
Data development engineer
Cluster 2 994 engineer, data, data analysis, data warehouse, development, big data, data mining, database, data processing, expert Data engineer
Big data operation and maintenance engineer
Data governance engineer
Data communication engineer
Etl engineer
Cluster 3 534 data operations, specialist, assistant, supervisor, manager, analysis, direction, senior, intern, engineer Data operator
Data operation specialist
Data operation assistant
Data operation manager
Data operation supervisor
Cluster 4 756 data, development, engineer, direction, product, platform, data warehouse, supervisor, database, operation Data entry clerk
Data clerk
Data development
Data administrator
Data researcher
Cluster 5 1465 specialist, data, data analysis, commodities, statistics, data processing, operations, assistants, sales, data labeling Data analysis specialist
Data specialist
Commodity specialist
Data statistics specialist
Data processing specialist
Cluster 6 5122 data, data processing, engineers, assistants, clerks, sales, statisticians, data annotations, architects, interns Data statistician
Data clerk
Data annotator
Project manager
Statistician
Data supervisor
Data warehouse engineer
Big data architect
Data processing engineer
Data mining engineer
Gis data processing engineer
Cluster 7 1248 analyst, data, finance, big data, senior, assistant, direction, operations, intern, market Data analyst
Big data analyst
Financial data analyst
Senior data analyst
Data analyst assistant
Cluster 8 1352 data analysis, engineer, assistant, sales, intern, clerk, finance, direction, operation, supervisor Data analyst
Data analyst
Data analysis engineer
Data analysis assistant
Sales data analyst
Cluster 9 674 manager, sales, data, data analysis, big data, advanced, data center, data management, marketing, operations Sales manager
Data analysis manager
Regional sales manager
Software sales manager
Data management manager
Cluster 10 317 product manager, data, big data, advanced, direction, platform, product, finance, data analysis, data center Data product manager
Product manager
Big data product manager
Senior data product manager
Senior product manager

5. Analysis

To further explore the market demand for big data positions, this paper conducts a multi-dimensional and in-depth analysis from the aspects of job demand, urban distribution, educational background requirements, experience requirements, salary range, and so on.

5.1 The characteristics of big data jobs from the whole data set

On the whole, the demand for big data jobs is mainly distributed in first-tier cities and new first-tier cities. Among first-tier cities, Guangzhou and Shanghai have the highest demand for big data talents; among the new first-tier cities, Nanjing, Hangzhou, Xi ’an and Chengdu have the highest demand for big data jobs; while among other cities, Jinan has the highest demand for big data talents, reaching 6%, as shown in Table 4. Among 14,496 big data positions, 52.39% require a bachelor’s degree and 35.13% require a junior college degree. It can be seen that a bachelor’s degree is the demand subject for big data positions, followed by a junior college degree, which indicates that big data positions have low requirements for an academic degree, and most enterprises have no particularly high requirements for the academic degree of big data positions. Since the population base of master’s degree and doctor’s degree in China is much smaller than that of bachelor’s degree and a junior college degree, the highly educated talents in the field of big data only account for 3.76%. In terms of work experience, 36.09% of the big data jobs are not limited to experience requirements, but 31.44% of the jobs require 1–3 years of work experience, and 23.92% require 3–5 years of work experience. This indicates that most enterprises have experience requirements for big data work, and at least one year of work experience is required, so experienced job seekers are more popular in the field of big data.

Table 4. The proportion of job demand in each city.

City category Percentage of demand City name Proportion of demand per city
First-tier cities 26.43% Shanghai 7.03%
Guangzhou 6.77%
Shenzhen 6.41%
Beijing 6.23%
New first-tier cities 55.43% Xi ’an 6.97%
Nanjing 6.80%
Hangzhou 6.63%
Wuhan 6.16%
Chengdu 6.08%
Tianjin 5.86%
Zhengzhou 4.39%
Changsha 4.03%
Chongqing 3.85%
Shenyang 2.24%
Hefei 1.92%
Wuxi 0.49%
Others 18.14% Jinan 5.75%
Dalian 2.60%
Changchun 2.18%
Guiyang 1.95%
Xiamen 1.79%
Shijiazhuang 1.58%
Fuzhou 1.05%
Ningbo 0.75%
Harbin 0.50%

Can be seen from the Fig 2, all kinds of city, the salary range for the 4 k—8 k accounts for large data-position than most, the more developed city jobs pay less than 4 k is less, as the urban economic level has increased the average salary is on the rise, and more than 8 k proportion pay more and more high, bargaining proportion also shows ascendant trend, suggesting that big data post salary is associated with the city’s economic development level.

Fig 2. Salary structure chart of various cities.

Fig 2

K stands for one thousand RMB, the proportion of different salary stages in different city categories.

5.2 The characteristics of big data jobs from the whole data set

As shown in Fig 3 Category 1 accounts for 14.031% of the job demand of big data, mainly engaged in data development, research and development and other work. Java and database, and other words appear in the clustering keywords, indicating that big data development post requires job seekers to master a development language and database skills. Category 2 is mainly for a data engineer, big data operation and maintenance engineer, etc., with a total of 994 positions and less demand. Data operation is a keyword in category 3, and the demand for big data positions is not high. There are only 756 positions in category 4, which accounts for a small proportion of the demand for big data positions. It is mainly about basic data entry and management, which requires job seekers to master certain database skills. Category 5 has a total of 1,465 positions, accounting for 10.106%, ranking the third in the job demand of big data, mainly engaged in data statistics, processing and analysis, and the relevant big data positions are data specialist, data statistics specialist, etc. Category 6 has the largest demand in the recruitment of enterprises, with a total of 5,122 positions, accounting for 35.334% of all positions, mainly engaged in data processing, statistics, marking, and other work, the mainstream positions are data statistician, data clerk, data marker, project manager, data supervisor and so on. There are many overlapping words in the clustering keywords of category 7 and category 8. From the perspective of related high-frequency positions, the two categories have certain similarities, mainly engaged in data analysis, and the mainstream positions are data analysts, data analysis engineers, data analysis specialists, etc., with a total of 2,600 positions, accounting for 17.936% of the total. However, there are some differences between the two categories in the field of data analysis. The high-frequency positions in category 9 and category 10 are all at the manager level. They are high-end talents in the field of big data with higher ranks and fewer positions. They are mainly engaged in data management manager, product manager, sales manager, and other professions.

Fig 3. The proportion of vacancies in all categories.

Fig 3

Different job categories in big data jobs have certain differences in educational background and experience requirements. Combined with the analysis of Figs 4 and 5, the job requirements of category 10 are more inclined to candidates with a bachelor’s degree and more than 3 years of experience, because the product manager belongs to the company Executives naturally have higher requirements for experience and academic qualifications. Although category 9 mainly recruits manager-level positions, the sales field has lower requirements for academic qualifications than product managers and has a higher acceptance of college degrees, but they are still more inclined to job seekers with more than 3 years of experience. Category 1 has gathered a large number of big data development positions, which have high requirements for experience and academic qualifications. Companies hope to recruit personnel with a bachelor’s degree and more than 3 years of development experience. From the previous analysis, the difference in job fields between categories 7 and 8 is still reflected in academic qualifications and experience. Category 7 involving the financial field has higher requirements for academic qualifications and experience than category 8 in the sales field and the threshold for recruitment Relatively high. Category 6 contains a wide range of job categories. The demand for academic qualifications is relatively evenly distributed, and the experience requirements are relatively low. Many companies do not make special requirements for the experience of job seekers. From Figs 4 and 5, we can see that categories 2, 3, 4, and 5 have gradually reduced requirements for a bachelor’s degree and more than 3 years of experience, and the demand for a college degree and unlimited experience has become higher, which indicates that the more basic the big data work, the lower the threshold for job hunting, such as data entry, data statistics, data labeling, etc. When the job is more difficult and more complex, the higher the requirements for job seekers will be, such as data operation, data analysis specialist, and other positions.

Fig 4. Comparison of requirements for undergraduate and junior college degrees of various positions.

Fig 4

Fig 5. Different proportions of experience requirements for various positions.

Fig 5

The proportion of three experience requirements in different job categories.

The demand ratio and average salary of each job category in cities are shown in Fig 6. It can be seen from the figure that the demand for product manager positions represented by category 10 in first-tier cities is higher than that in the other two cities, and the average salary is the highest. The demand for other types of jobs in new first-tier cities is higher than that in first-tier cities. The average salary of category 9 is lower than that of Category 10. Although the high frequency of the two types of positions is manager level, category 9 is mainly concentrated in new first-tier cities, whose economic development level is not as good as that of first-tier cities, so the average salary is naturally low. Categories 7 and 8 have been analyzed in the previous article. There is a certain similarity between the two types of jobs, which is also reflected in their city distribution. The demand for the two types of jobs in each city is very small, but the average salary of category 7 is higher than that of category 8. Combined with the previous analysis, it can be seen that the salary of the financial field in the big data analysis position is higher than that in the sales field. The main demand of category 6 is concentrated in the new first-tier cities, and its average salary is similar to the average salary of the overall big data jobs in the new first-tier cities. Category 5 has the lowest average salary. The main reason may be that it is engaged in basic data processing and statistical work. Compared with other types of positions, the job is not difficult and does not require high levels of job applicants. Although the demand for category 4 in first-tier cities is lower than that of category 3, its average salary is higher than that of category 3, indicating that data management and research positions are higher than the salary level of data operations. The demand for category 2 is mainly concentrated in the new first-tier cities, and its salary level is relatively high, but it is lower than that of the big data engineer category 1 of development positions, indicating that the demand for development jobs in big data positions is large and the salary level is high.

Fig 6. The proportion of urban demand and the average salary for each job category.

Fig 6

The share of the demand for each job category in each city and the average salary of each job category are calculated based on the RMB income.

6. Conclusion

This article uses web crawlers to obtain a large amount of big data job recruitment information, combined with stuttering word segmentation and regular expressions and uses K-means text clustering to divide big data jobs into 10 different categories and explore the needs of big data jobs. After analyzing the data set and clustering results, the following conclusions are drawn. The demand for big-data-related jobs is concentrated in first-tier cities and new first-tier cities. The education threshold is relatively low, with bachelor’s degree and junior college degree as the major, and the proportion of doctor’s degree and master’s degree is very small. On the one hand, this is because the population base of highly educated people is small; on the other hand, many domestic universities have not started the doctoral and master’s degree programs related to big data. Secondly, from the perspective of the big data industry as a whole, enterprises are more inclined to recruit job seekers with at least one year’s experience. In some basic job categories, the requirements for experience will be reduced, while for high-level management and development positions, there will be higher requirements for experience. The salary level of a big data post is related to its city type and job position. On the one hand, the economic development of a city affects the salary level; on the other hand, the rank and job content of a post determine the salary level of the post. In the clustering results, the demand for manager-level posts is low, the average salary is high, and the requirement for their education and experience is relatively high. Compared with the manager position, the demand for development posts is relatively high, and the salary level is relatively high compared with other categories, but the requirement for experience is high, and there is a certain educational threshold. However, the data processing, analysis, management, research, operation and other posts have relatively low requirements for education and experience, so the salary level is relatively low.

There are few pieces of research on the job demand of big data related to text mining in China. This paper uses text mining to analyze the job demand characteristics of big data from the aspects of the city, salary, education background, and experience, etc., to clarify the job demand status of the big data industry, provide enterprises and job seekers with the information of the big data demand market, and provide support for the citation research of text mining in various fields. However, there are still some deficiencies in this study. Firstly, due to the uneven information release format on recruitment websites, some data will be omitted during the collection process, and many difficulties will be caused in the data pre-processing process, which will have a certain impact on the subsequent clustering effect. Secondly, there are categories with very unclear job definitions in the clustering results. This article does not further divide and dig into these categories, which may omit some other demand characteristics in the big data job requirements. In the future, we will combine more text mining technologies, such as the LDA model, Word2Vec model, etc. to further mine big data job recruitment information, and analyze the technology, skills, and knowledge required under different job categories.

Supporting information

S1 Data. Minimal data new.

(XLSX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The authors received no specific funding for this work.

References

  • 1.Huang Z, Destech Publicat I. Research on the Innovation of E-business Talents Training Mode Under the Background of Big Data. 2018 International Conference on E-Commerce and Contemporary Economic Development. DEStech Transactions on Economics Business and Management2018. p. 48–52.
  • 2.Chen L. Practice on the Sustainable Development of Talent Cultivation Mode in the Context of Big Data. In: Xu Z, Choo KKR, Dehghantanha A, Parizi R, Hammoudeh M, editors. Cyber Security Intelligence and Analytics. Advances in Intelligent Systems and Computing. 9282020. p. 682–91. [Google Scholar]
  • 3.Hilbert M. Big Data for Development: A Review of Promises and Challenges. Development Policy Review. 2016;34(1):135–74. doi: 10.1111/dpr.12142 WOS:000366450100006. [DOI] [Google Scholar]
  • 4.He D, Yu H, Shi Y, Yu S, Jin C. An analysis on the demand of artificial intelligence relatedtalents in China’s medical field: Survey based on recruitment information of two websites. Chinese Journal of Health Policy. 2019;12(7):59–64. CSCD:6595625. [Google Scholar]
  • 5.Lu Y, Cao K. Spatial Analysis of Big Data Industrial Agglomeration and Development in China. Sustainability. 2019;11(6). doi: 10.3390/su11061783 WOS:000464344500015. [DOI] [Google Scholar]
  • 6.Puncheva-Michelotti P, Hudson S, Jin G. Employer branding and CSR communication in online recruitment advertising. Business Horizons. 2018;61(4):643–51. 10.1016/j.bushor.2018.04.003. [DOI] [Google Scholar]
  • 7.Zhang X, Zheng Y. Gender differences in self-view and desired salaries: A study on online recruitment website users in China. Plos One. 2019;14(1). doi: 10.1371/journal.pone.0210072 WOS:000455483000045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Papoutsoglou M, Mittas N, Angelis L. Mining People Analytics from StackOverflow Job Advertisements. Felderer M, Olsson HH, Skavhaug A, editors2017. 108–15 p. doi: 10.1093/nar/gkx448 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Uhm M, Lee G, Jeon B. An analysis of BIM jobs and competencies based on the use of terms in the industry. Automation in Construction. 2017;81:67–98. 10.1016/j.autcon.2017.06.002. [DOI] [Google Scholar]
  • 10.Karakatsanis I, AlKhader W, MacCrory F, Alibasic A, Omar MA, Aung Z, et al. Data mining approach to monitoring the requirements of the job market: A case study. Information Systems. 2017;65:1–6. doi: 10.1016/j.is.2016.10.009 WOS:000393631700001. [DOI] [Google Scholar]
  • 11.Liu R, Ye W, Gao R, Tang M, Wang D. Research on Text Clustering Based on Requirements of Big Data Jobs. Data Analysis and Knowledge Discovery. 2017;1(12):32–40. CSCD:6175364. [Google Scholar]
  • 12.Xiao Q, Zhong X, Zhong C. Application Research of KNN Algorithm Based on Clustering in Big Data Talent Demand Information Classification. International Journal of Pattern Recognition and Artificial Intelligence. 2020;34(6). doi: 10.1142/s0218001420500159 WOS:000537887700003. [DOI] [Google Scholar]
  • 13.Alexander Calvo-Valverde L, Andres Mena-Arias J. Evaluation of different text representation techniques and distance metrics using KNN for documents classification. Tecnologia En Marcha. 2020;33(1):64–79. doi: 10.18845/tm.v33i1.5022 WOS:000518101300005. [DOI] [Google Scholar]
  • 14.Wowczko IA. Skills and Vacancy Analysis with Data Mining. Informatics-Basel. 2015;2(4):31–49. doi: 10.3390/informatics2040031 WOS:000442306300001. [DOI] [Google Scholar]
  • 15.Boselli R, Cesarini M, Mercorio F, Mezzanzanica M. Classifying online Job Advertisements through Machine Learning. Future Generation Computer Systems-the International Journal of Escience. 2018;86:319–28. doi: 10.1016/j.future.2018.03.035 WOS:000437555800025. [DOI] [Google Scholar]
  • 16.Wong T-L, Lam W, Chen B. Mining Employment Market via Text Block Detection and Adaptive Cross-Domain Information Extraction. Sanderson M, Zhai CX, Zobel J, Allan J, Aslam JA, editors2009. 283–90 p. [Google Scholar]
  • 17.Ningrum PK, Pansombut T, Ueranantasun A. Text mining of online job advertisements to identify direct discrimination during job hunting process: A case study in Indonesia. Plos One. 2020;15(6). doi: 10.1371/journal.pone.0233746 WOS:000542035200004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Pejic-Bach M, Bertoncel T, Mesko M, Krstic Z. Text mining of industry 4.0 job advertisements. International Journal of Information Management. 2020;50:416–31. doi: 10.1016/j.ijinfomgt.2019.07.014 WOS:000497989600034. [DOI] [Google Scholar]
  • 19.Debortoli S, Muller O, vom Brocke J. Comparing Business Intelligence and Big Data Skills A Text Mining Study Using Job Advertisements. Business & Information Systems Engineering. 2014;6(5):289–300. doi: 10.1007/s12599-014-0344-2 WOS:000342207000005. [DOI] [Google Scholar]
  • 20.De Mauro A, Greco M, Grimaldi M, Ritala P. Human resources for Big Data professions: A systematic classification of job roles and required skill sets. Information Processing & Management. 2018;54(5):807–17. 10.1016/j.ipm.2017.05.004. [DOI] [Google Scholar]
  • 21.Gurcan F. Extraction of Core Competencies for Big Data: Implications for Competency-Based Engineering Education. International Journal of Engineering Education. 2019;35(4):1110–5. WOS:000473265200011. [Google Scholar]
  • 22.Gurcan F, Cagiltay NE. Big Data Software Engineering: Analysis of Knowledge Domains and Skill Sets Using LDA-Based Topic Modeling. Ieee Access. 2019;7:82541–52. doi: 10.1109/access.2019.2924075 WOS:000475354200001. [DOI] [Google Scholar]
  • 23.Zheng JP, Wen Q, Qiang MS. Understanding Demand for Project Manager Competences in the Construction Industry: Data Mining Approach. Journal of Construction Engineering and Management. 2020;146(8). doi: 10.1061/(asce)co.1943-7862.0001865 WOS:000542675500006. [DOI] [Google Scholar]
  • 24.Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, et al. Top 10 algorithms in data mining. Knowledge and Information Systems. 2008;14(1):1–37. doi: 10.1007/s10115-007-0114-2 WOS:000251795900001. [DOI] [Google Scholar]
  • 25.Xu R, Wunsch D. Survey of clustering algorithms. Ieee Transactions on Neural Networks. 2005;16(3):645–78. doi: 10.1109/TNN.2005.845141 WOS:000228909900013. [DOI] [PubMed] [Google Scholar]
  • 26.Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. An efficient k-means clustering algorithm: Analysis and implementation. Ieee Transactions on Pattern Analysis and Machine Intelligence. 2002;24(7):881–92. doi: 10.1109/tpami.2002.1017616 WOS:000176446100002. [DOI] [Google Scholar]
  • 27.Wu D, Zeng Y, Qu Y-C, Destech Publicat I. Text Document Clustering Based on Density K-means. International Conference on Computer, Mechatronics and Electronic Engineering. DEStech Transactions on Computer Science and Engineering2016.
  • 28.Kaur D, Bajwa JK. Text Document Clustering Based on Neural K-Mean Clustering Technique. In: Singh M, Gupta PK, Tyagi V, Sharma A, Oren T, Grosky W, editors. Advances in Computing and Data Sciences, Icacds 2016. Communications in Computer and Information Science. 7212017. p. 336–44. [Google Scholar]
  • 29.Beltran B, Vilarino D, Martinez-Trinidad JF, Carrasco-Ochoa JA, Pinto D. K-means based method for overlapping document clustering. Journal of Intelligent & Fuzzy Systems. 2020;39(2):2127–35. doi: 10.3233/jifs-179878 WOS:000568501000013. [DOI] [Google Scholar]
  • 30.Buatoom U, Kongprawechnon W, Theeramunkong T. Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints. Symmetry-Basel. 2020;12(6). doi: 10.3390/sym12060967 WOS:000550837500001. [DOI] [Google Scholar]
  • 31.Zhu Z, Liang J, Li D, Yu H, Liu G. Hot Topic Detection Based on a Refined TF-IDF Algorithm. Ieee Access. 2019;7:26996–7007. doi: 10.1109/access.2019.2893980 WOS:000462125100001. [DOI] [Google Scholar]
  • 32.Zhou F, Chen X, Ma J, Li S. A Microblog Hot Topic Mining Method Integrating Tag Semantics. Computer Engineering. 2019;45(10):283–7. CSCD:6591182. [Google Scholar]
  • 33.Li W, Wang Z, Yu Z. Micro-blog Topic Detection Method Integrating BTM Topic Model and K-means Clustering. Computer Science. 2017;44(2):257–61,74. CSCD:5917800. [Google Scholar]
  • 34.Hong Q, Wang S, Zhao Q, Li J, Rao W. Video user group classification based on barrage comments sentiment analysis and clustering algorithms. Computer Engineering and Science. 2018;40(6):1125–39. CSCD:6268097. [Google Scholar]
  • 35.Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, et al. Feature Selection: A Data Perspective. Acm Computing Surveys. 2018;50(6). doi: 10.1145/3136625 WOS:000419881700017. [DOI] [Google Scholar]
  • 36.Wang Y, Xu S, Li C, Ai S, Zhang W, Zhen L, et al. Classification model based on support vector machine for Chinese extremely short text. Application Research of Computers. 2020;37(2):347–50. CSCD:6724475. [Google Scholar]

Decision Letter 0

Bing Xue

13 Apr 2021

PONE-D-21-07023

Analysis of Big Data Job Requirements based on K-means Text Clustering in China

PLOS ONE

Dear Dr. yinxia,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by May 24 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Bing Xue, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

  1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Authors.

Thanks for the opportunity given me to read your Manuscript with Title: "Analysis of Big Data Job Requirements based on K-means Text Clustering in China" I found the paper very interesting.

In your Methodology where data was collected from Zhaopin.com recuitment website by using web crawler software to capture the recruitment information of big data. However you further stated the preprocessed data and 14,496 valid data were obtained after deleting invalid and incomplete data.

My question is can you make all the data available and clear. Also the process how the data was preprocessed.

Thank you for the good job.

Reviewer #2: The introduction section lacks in problem statement and motivation of the study. There is immense need to give arguments in favor of conducting this study. In the conceptual model and hypothesis, it is highly recommended to support the arguments with the support of theory.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Aug 5;16(8):e0255419. doi: 10.1371/journal.pone.0255419.r002

Author response to Decision Letter 0


24 May 2021

Responds to the reviewer’s comments:

Reviewer #1:

1. Response to comment: Make all the data available and clear

Response: Considering the Reviewer’s suggestion, we have added Table 1 on the basis of the original text to show the data we collected, and listed specific examples in 3.1 data collection, so that readers can better understand the content of the data we collected.

2. Response to comment: How the data was preprocessed

Response: We are sorry that we did not show the detailed data preprocessing process. In the revised version, we added Figure 1 to illustrate the specific data processing process in 3.2 data procession.

Special thanks to you for your good comments.

Reviewer #2:

1. Response to comment: The introduction section lacks in problem statement and motivation of the study

Response: We have re-written this part according to the Reviewer’s suggestion. We restate the question and motivation of the study in the introduction.

2. Response to comment: There is immense need to give arguments in favor of conducting this study

Response: It is really true as Reviewer suggested that this article lacks theoretical basis. Therefore, in the new version, we add three references as the theoretical basis and support of the research.

Special thanks to you for your good comments.

We tried our best to improve the manuscript and made some changes in the manuscript. These changes will not influence the content and framework of the paper. And here we did not list the changes but marked in red in revised paper.

We appreciate for Editors/Reviewers’ warm work earnestly, and hope that the correction will meet with approval.

Once again, thank you very much for your comments and suggestions.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Bing Xue

16 Jul 2021

Analysis of Big Data Job Requirements based on K-means Text Clustering in China

PONE-D-21-07023R1

Dear Dr. yinxia,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Bing Xue, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Authors,

Thank you for your great work. I reccommend this paper with Title: Analysis of Big Data Job Requirements based on K-means Text Clustering in China for publication.

Reviewer #2: The manuscript appears to be an interesting study and will be a valuable contribution to the relevant field of study. I wish the authors all the best.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Bing Xue

27 Jul 2021

PONE-D-21-07023R1

Analysis of Big Data Job Requirements based on K-means Text Clustering in China

Dear Dr. Yinxia:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Bing Xue

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Data. Minimal data new.

    (XLSX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS ONE are provided here courtesy of PLOS

    RESOURCES