PLOS ONE. 2022 May 27;17(5):e0268212. doi: 10.1371/journal.pone.0268212

Measuring user interactions with websites: A comparison of two industry standard analytics approaches using data of 86 websites

Bernard J Jansen, Soon-gyo Jung, Joni Salminen
Editor: Hussein Suleman
PMCID: PMC9140287  PMID: 35622858

Abstract

This research compares four standard analytics metrics from Google Analytics with SimilarWeb using one year’s average monthly data for 86 websites from 26 countries and 19 industry verticals. The results show statistically significant differences between the two services for total visits, unique visitors, bounce rates, and average session duration. Using Google Analytics as the baseline, SimilarWeb average values were 19.4% lower for total visits, 38.7% lower for unique visitors, 25.2% higher for bounce rate, and 56.2% higher for session duration. The website rankings between SimilarWeb and Google Analytics for all metrics are significantly correlated, especially for total visits and unique visitors. The accuracy/inaccuracy of the metrics from both services is discussed from the vantage of the data collection methods employed. In the absence of a gold standard, combining the two services is a reasonable approach, with Google Analytics for onsite and SimilarWeb for network metrics. Finally, the differences between SimilarWeb and Google Analytics measures are systematic, so with Google Analytics metrics from a known site, one can reasonably generate the Google Analytics metrics for related sites based on the SimilarWeb values. The implications are that SimilarWeb provides conservative analytics in terms of visits and visitors relative to those of Google Analytics, and both tools can be utilized in a complementary fashion in situations where site analytics is not available for competitive intelligence and benchmarking analysis.

Introduction

Web analytics is the collection, measurement, analysis, and reporting of digital data to enhance insights concerning the behavior of website visitors [1]. Web analytics is a critical component of business intelligence, competitive analysis, website benchmarking, online advertising, online marketing, and digital marketing [2] as business decisions are made based on website traffic measures obtained from website analytics services. Organizations monitor their sites’ incoming and outgoing traffic to identify popular pages, determine user interests, and stay abreast of emerging trends [3]. There are various ways to monitor this traffic, and the gathered data is used for re-structuring sites, highlighting security problems, indicating bandwidth issues, assessing organizational key performance indicators (KPIs), and obtaining societal insights [4].

Approaches to collecting website analytics data can be grouped by the focus of data collection efforts, resulting in the emergence of three general methodologies, namely: (a) user-centric, (b) site-centric, and (c) network-centric. The central traits of each are as follows.

  • User-centric: Web analytics data is gathered via a panel of users, which is tracked by software installed on users’ computers, such as a plugin for a web browser [5–8]. For example, when users install an extension in their browser, they approve in the license agreement that the data on the websites they visit will be processed and analyzed. The primary advantage here is that the user-centric approach does not rely on cookies or tags (i.e., snippets of information placed by a server in a user’s web browser in order to keep track of the user) but on direct observation. An additional advantage is the ability to compare web analytics data across multiple websites. The challenge is recruiting and incentivizing a sufficiently large user panel that is a representative sample of the online population; due to this challenge, only a few companies have recruited sizeable user panels (e.g., Alexa). Another disadvantage may be the issue of privacy, since many users are not willing to share information on every website that they visit, so some users may make efforts to mask their actual online actions from the tracking plugin.

  • Site-centric: Web analytics data is gathered via software on a specific website [9–16]. Most websites use a site-centric approach for analytics data gathering, typically employing cookies and/or tagging pages on the website (e.g., Google Analytics, Adobe Analytics). The primary advantage of this approach lies in counting events and actions (e.g., pages viewed, times accessed), which is relatively straightforward. Another advantage is that users do not need to install specific software beyond the browser. However, there are disadvantages. First, site-centric software focuses on cookies/tags, so these counts may not reflect actual people (i.e., the measures are of the cookies and tags) or people’s actual actions on the website. Instead, site-centric approaches measure the number of cookies dropped or tags fired as proxies for people or interactions. Second, this approach is susceptible to bots (i.e., autonomous programs that pretend to be real users) and other forms of analytics inflation tactics, such as click fraud [17]. Finally, site-centric analytics usually represent just one website and are only accessible to the owner of that website, making the site-centric approach not widely available for business intelligence, marketing, advertising, or other tasks requiring web analytics data from a large number of sites.

  • Network-centric: Web analytics data is gathered by observing and collecting traffic in the network [18, 19]. There are various techniques for network-centric web analytics data gathering, with the most common being data purchased or acquired directly from Internet service providers (ISPs). However, other data gathering methods include leveraging search traffic, search engine rankings, paid search, and backlinks [20, 21]. The main advantage of the network-centric approach is that one can relatively easily collect analytics concerning a large number of websites. Also, the setup is comparatively easy, as neither users nor websites are required to install any software. The major disadvantage is that there is no information about the onsite actions of the users. A second disadvantage is that major ISPs do not freely share their data, so acquiring it can be expensive. However, companies can acquire other network-centric data more affordably (e.g., from SpyFu and SEMRush, two common industry tools for search marketing), albeit requiring substantial computational, programming, and storage resources.

Of course, one can use a combination of these methods [22], but these are the three general approaches, with much academic research leveraging one or more of them [23–26]. See Table 1 for a summary of the advantages, disadvantages, and examples of implementations.

Table 1. Comparison of user, site, and network-centric approaches to web analytics data collection showing advantages, disadvantages, and examples of each approach at the time of the study.

User-centric
  • Advantages: Focus on people; compare across websites, so can be used for business intelligence
  • Disadvantages: Creating a representative user panel is challenging; user computer software must be installed
  • Examples: Alexa, ComScore

Site-centric
  • Advantages: No special user software to install; wide range of analytics for a specific site
  • Disadvantages: Site software must be installed; focus on cookies and tags, not real people; access is limited to the website owner, so cannot be used for business intelligence among multiple sites
  • Examples: Google Analytics, Adobe Analytics, IBM Analytics

Network-centric
  • Advantages: Data collection is straightforward; no special software to install for users or sites; compare across websites, so can be used for business intelligence
  • Disadvantages: Data can be challenging to obtain; limited onsite analytics, generally only between-site data
  • Examples: Hitwise, SEMRush, SpyFu

While site-centric web analytics tools, such as Google Analytics, can provide results for one’s own website, there is often a need to compare with other websites, though Google Analytics does provide some limited benchmarking reports by industry (https://support.google.com/analytics/answer/6086666). Therefore, competitive benchmarking services, such as SimilarWeb, have become essential for web analytics in the business intelligence area [27]. These analysis services provide computational web analytics results for one or more websites, a critically needed capability for competitive research and analysis [28]. These website analytics services allow benchmarking of web analytics measures and metrics among multiple websites. Website analytics services are essential for a variety of reasons, including competitive analysis, advertising, marketing, domain purchasing, programmatic media buying [29–35], and firm acquisition [36], along with the use of website analytics services in academic research [37, 38]. They are also valuable for accessing the external view of one’s own website (i.e., what others who do not have access to site-centric analytics data see). These website analytics services return a variety of metrics depending on the platform. However, there are questions concerning the accuracy and reliability of both types of analytics platforms, affecting billions of dollars in online advertising, firm acquisition, and research. As such, there is a critical need to assess these tools and the validity of the reported metrics.

In this research, we compare web analytics statistics from Google Analytics (the industry-standard website analytics platform at the time of the study) and SimilarWeb (the industry-standard traffic analytics platform at the time of the study) using four core web analytics metrics (i.e., total visits, unique visitors, bounce rate, and average session duration) averaged monthly over 12 months for 86 websites. We select SimilarWeb due to the scope of its data collection: reportedly one billion digital signals and two terabytes of analyzed data daily, more than two hundred data scientists employed, and more than ten thousand daily traffic reports generated, with reporting features as good as or better than those of other services [39] at the time of the study. As such, SimilarWeb represents the state-of-the-art in the online competitive analytics area. We leave the investigation of other services besides Google Analytics and SimilarWeb to future research. We conduct analysis along several fronts, reporting both exploratory and statistical results. We then tease apart the nuanced differences in the metrics and possible sources of error [40] and present the theoretical and the practical implications of this research. The techniques employed by Google Analytics are similar to those employed by other analytics platforms, such as Adobe Analytics, IBM Analytics, and Piwik Analytics. The techniques used by SimilarWeb are similar to those of other website analytics services, such as Alexa, comScore, SEMRush, Ahrefs, and Hitwise, in the employment of user, site, and/or network data collection. So, the results of this research apply to a wide range of analytics tasks, most notably in the website domain, providing an enhanced understanding of the data underlying competitive intelligence and the use of such analytics platforms. Moreover, the metrics reviewed are commonly used in many industries employing online analytics, such as advertising, online content creation, and e-commerce. Therefore, the findings are impactful for several domains.

Review of literature

Web analytics services have been used by researchers for an array of inquiries and topics. These areas include, among others, online gaming [41], social media and multi-channel online marketing [42, 43], online community shopping [44], online purchase predictions [45, 46], online research methods [47], social science [48, 49], and user-generated content on social media [50–54]. These services have also been used in research concerning online interests in specific topics [55–57], including online branding in social media [58, 59], online purchasing [60], and mobile application usage [61]. They have also been used in studies about website trust and privacy [62–66], website design [37, 67–69], and website popularity and ranking [42, 44, 70–77]. These prior studies indicate that analytics tools are widely used in peer-reviewed academic research and relied on for various metrics. However, to our knowledge, none of the prior research studies examined the accuracy of these website analytics services before employing them.

Academic research on this area of analytics evaluation is limited. Lo and Sedhain [78] evaluate six website lists, including the ranked list from Alexa (the only service employed in the study that is still active as of the date of this research). The researchers examined the top 100 websites and compared the rankings among the lists. They concluded that the rankings among the lists differed. This difference is not surprising given that the methodologies used to create the study lists varied in terms of website traffic, number of backlinks, and opinions of human judges. Vaughan and Yang [79] use organizations from the United States (U.S.) and China and collect web traffic data for these sites from Alexa Internet, Google Trends for Websites, and Compete (Alexa is the only service from that study still active as of the date of this research). The researchers did not evaluate the traffic services but instead reported correlations between web traffic data and measures of academic quality for universities. In a ComScore study, Napoli, Lavrakas, and Callegaro [80] present some of the challenges and issues with the user-centric analytics approach, namely that the results often do not align with site-centric measures. The researchers attribute the discrepancies to the sampling of the user panels. Scheitle and fellow researchers [19] examine several websites’ rankings, including Alexa but not SimilarWeb, investigating similarity, stability, representativeness, responsiveness, and benignness in the cybersecurity domain, but they do not report actual analytics numbers. The researchers report that the ranked lists are unstable and open to manipulation. Pochat and colleagues [18] extend this research by introducing a list that is less susceptible to rank manipulation.

While few academic studies have examined analytics services, fewer have evaluated the actual analytics numbers; instead, they focus on the more easily accessible (and usually free) ranked lists. Studies on the performance of SimilarWeb are rarer still, despite its standing and reputation as an industry leader. Scheitle and colleagues [19] attribute this absence to SimilarWeb charging for its service, although the researchers do not investigate this conjecture. Regardless of the reason, the only academic study that we are aware of, as of the date of this research, that explicitly examines traffic numbers including SimilarWeb is Prantl and Prantl [24]. This study compares rankings among Alexa, SimilarWeb, and NetMonitor [80] for a set of websites in the Czech Republic, using NetMonitor as the baseline. The research only reports the traffic comparison between SimilarWeb and NetMonitor. The researchers, unfortunately, provide neither detailed exploratory analysis nor statistical analysis of the analytics comparison. Also, NetMonitor uses a combination of site- and user-centric measures, so it is unclear how the traffic metrics are calculated. The researchers [24] report that SimilarWeb over-reports traffic compared to NetMonitor. They also note that SimilarWeb traffic results are within +/- 30% of the NetMonitor traffic measurements for 49% of the 487 websites.

Several industry reports have also compared site analytics, usually using Google Analytics, with the analytics reported by other services. Some of these reports [81–83] show website analytics services, notably SimilarWeb, reportedly underestimating traffic by as much as 30% to 50%, while other reports [84–88] claim SimilarWeb overestimates traffic, from 11% for large websites to nearly 90% for small ones [84]. SimilarWeb itself states that reported values among analytics services will vary by +/- 20%. However, a trend is that SimilarWeb [85, 86] consistently ranks as the best or one of the best analytics services in the industry [87, 88], as noted by several industry practitioners [30, 32, 33, 89–91]. SimilarWeb consistently outperforms other services [92], with reported performance sometimes better by double-digit percentages [93]. Even when the reported analytics numbers are off, the SimilarWeb results usually correlate with the baseline site traffic trends; the correlation is also positive relative to overall accuracy among sites [93].

Although these industry reports provide insights into the area, there are potential issues with relying on them, including possible questions about data appropriateness, a lack of explicitly defined methods of analysis, and conflicts of interest (as some of these studies are performed by potential competitors of SimilarWeb). Also, some of these studies employ a small number of data points [81, 82, 94], making statistical analysis challenging. Other studies have a short temporal span [83, 88, 93], even though there can be significant traffic fluctuations for sites depending on the time of year, or cover mainly high-traffic websites [83], whose metrics are easier to estimate. Finally, some reports have imprecise metric reporting [83, 84, 92, 95], raising doubts about the results, or report a limited set of metrics [81, 83, 95] not central to analytics insights. Because of these potential issues, there is a critical need for a rigorous academic analysis of website analytics services to supplement these industry reports.

Given the substantial use of analytics services in academic research and their widespread use in the practitioner communities, there is a notable lack of research examining the accuracy of these services. Determining their accuracy is critical, given the extensive reliance on analytics numbers across many domains of research and practice [96]. However, due to the absence of academic studies in the area, several unanswered questions remain, including: How accurate are these analytics services? How do they compare with other analytics methods? Are these analytics tools better (or worse) at measuring specific analytics metrics than other methods? Are the reported metrics valid? These are essential questions that need addressing for critical evaluation of research findings and business decisions that rely on these services. Although the questions are conceptually straightforward, they are surprisingly difficult to evaluate in practice. This difficulty, especially in terms of data collection, may be a compounding factor for the dearth of academic research in the area.

Research questions

Our research objective is to compare and contrast the reported analytics measurements between SimilarWeb and Google Analytics in support of the broader goal of comparing these two approaches for measuring analytics and evaluating their accuracy. To investigate this research objective, we focus on four core web analytics metrics–total visits, unique visitors, bounce rate, and average session duration–which we define in the methods section. Although there is a lengthy list of possible metrics for investigation, these four metrics are central to addressing online behavioral user measurements, including frequency, reach, engagement, and duration, respectively. We acknowledge that there may be some conceptual overlap among these metrics. For example, bounce rates are sessions with an indeterminate duration that may indicate a lack of engagement, but average session duration also provides insights into user engagement. Nevertheless, these four metrics are central to the web analytics analysis of nearly any single website or set of websites; therefore, they are worthy of investigation. In the interest of space and impact of findings, we focus on these four metrics, leaving other metrics for future research.

Given that Google Analytics uses site-centric website data and SimilarWeb employs a triangulation of datasets and techniques, we would reasonably expect values to differ between the two. However, it is currently unknown how much they differ, which is more accurate, or whether the results are correlated. Therefore, because Google Analytics is, at the time of the study, the de facto industry standard for websites, we use Google Analytics measurements as the baseline for this research. Our hypotheses (H) are:

  • H1: SimilarWeb measurements of total visits to websites differ from those reported by Google Analytics.

  • H2: SimilarWeb measurements of unique visitors to websites differ from those reported by Google Analytics.

  • H3: SimilarWeb measurements of bounce rates for websites differ from those reported by Google Analytics.

  • H4: SimilarWeb measurements of average session durations for websites differ from those reported by Google Analytics.

We investigate these hypotheses using the following methodology.

Material and methods

Our data collection platforms are Google Analytics and SimilarWeb. Each service is explained in the following subsections.

Google Analytics

Google Analytics is a site-centric web analytics platform and, at the time of the study, is the most popular site analytics tool in use [97]–that is, it is the market leader. Google Analytics tracks and reports website analytics for a specific website. This tracking by Google Analytics is accomplished via cookies and tags [98]; a tag is a snippet of JavaScript code added to the individual pages. The tags are executed in the JavaScript-enabled browsers of the website visitors. Once executed, the tag sends the visit data to a data server and sets a first-party cookie on cookie-enabled browsers on visitors’ computers. The tag must be on a page on the site for Google Analytics to track the web analytics data for that page.
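
To make the tag mechanism concrete, the sketch below (Python for illustration; real tags are JavaScript snippets served by Google) shows roughly what happens when a tagged page records a pageview: a hit carrying a client identifier, normally stored in the first-party cookie, is sent to a collection endpoint. The endpoint and parameter names follow the publicly documented Universal Analytics Measurement Protocol that was current at the time of the study; the property ID, client ID, and page values are placeholders.

```python
import uuid
import requests  # third-party HTTP library

# Placeholder property ID; a real site would use its own UA-XXXXXXX-X value.
TRACKING_ID = "UA-00000000-1"

def send_pageview(client_id: str, page_path: str, hostname: str) -> int:
    """Send a single pageview hit, roughly as the JavaScript tag does."""
    payload = {
        "v": "1",            # Measurement Protocol version
        "tid": TRACKING_ID,  # property the hit is credited to
        "cid": client_id,    # anonymous client ID, normally read from the _ga cookie
        "t": "pageview",     # hit type
        "dh": hostname,      # document hostname
        "dp": page_path,     # document path
    }
    response = requests.post("https://www.google-analytics.com/collect", data=payload)
    return response.status_code

# The client ID persists in a first-party cookie; clearing cookies or switching
# devices generates a new ID, which is one reason cookie-based "unique visitor"
# counts can overcount real people (see the Discussion).
if __name__ == "__main__":
    cid = str(uuid.uuid4())
    send_pageview(cid, "/products/example-page", "www.example.com")
```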

The data collection, analysis, and reporting algorithms of Google Analytics are proprietary. However, enough is known to validate them as industry standard and state-of-the-art. The techniques of cookies and the general process of tagging are well known, although there may be some nuances in implementation. Google Analytics employs statistical data sampling techniques for some reports [99], so in these cases the values may not result from analysis of the complete data. However, the general overview of the data sampling approach is presented in reasonable detail [29], and the described subsampling is an industry standard methodology [100].

SimilarWeb

SimilarWeb [85, 86] is a service providing web analytics data for one or multiple websites. SimilarWeb uses a mix of user, site, and network-centric data collection approaches to triangulate data [39, 101], reportedly collecting and analyzing billions of data points per day [22]. SimilarWeb’s philosophical approach is that each method has strengths and weaknesses, and the best practice is triangulating multiple algorithms and data sources [39], a respected approach in data collection and analysis.

Regarding the data collection, analysis, and reporting algorithms of SimilarWeb, they are likewise proprietary, but again, enough is known to validate the general implementation as state-of-the-art. The SimilarWeb foundational principle of triangulating user-, site-, and network-centric data collection [39, 101] is academically sound, with triangulation of data and methods used and advocated widely by scholars [5, 102]. The SimilarWeb data collection, analysis, and reporting methodology is outlined in reasonable detail [22], although, like Google Analytics, the proprietary specifics are not provided. However, from the ample documentation that is available [22, 39, 86, 103–105], the general approach is to collect data from three primary sources, which are: (a) a reportedly 400 million worldwide user panel [103] at the time of the study, (b) specific website analytics tracking [39], and (c) ISP and other traffic data [39]. These sources are supplemented with publicly available datasets (e.g., population statistics). These datasets overlap (i.e., the web analytics data from one collection method will also appear in one or both of the other collection methods). With the collected data augmented with publicly available data [39], SimilarWeb uses statistical techniques and ensemble machine learning approaches to generate web analytics results. These analytics can then be compared to the overlapping data to make algorithmic adjustments to the predictions. This is a more complex approach relative to Google Analytics; however, SimilarWeb’s scope of multiple websites also requires a more complicated approach. In sum, the general techniques employed by SimilarWeb are standard methodologies [101, 106, 107], academically sound, and industry standard state-of-the-art.
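
As a toy illustration of this triangulation-and-calibration idea (not SimilarWeb's actual, proprietary algorithm), the sketch below blends a panel-based estimate with a network-level estimate and then scales the blend using the subset of sites where direct site-centric measurements overlap. All numbers, weights, and variable names are invented for illustration.

```python
import numpy as np

# Toy data: panel-extrapolated and network-signal visit estimates for six sites;
# for three of them ("overlap" sites) direct site-centric counts are available.
panel_est   = np.array([1.2e6, 3.4e5, 8.9e6, 5.0e5, 2.1e6, 7.5e4])
network_est = np.array([1.0e6, 4.0e5, 9.5e6, 4.2e5, 1.8e6, 9.0e4])
site_truth  = np.array([1.5e6, np.nan, 1.0e7, np.nan, 2.4e6, np.nan])  # NaN = not observed

# Step 1: blend the two indirect sources (equal weights here, purely illustrative).
blended = 0.5 * panel_est + 0.5 * network_est

# Step 2: calibrate against the overlap sites where direct measurements exist.
overlap = ~np.isnan(site_truth)
scale = np.mean(site_truth[overlap] / blended[overlap])

# Step 3: apply the calibration to every site, including those without direct data.
estimated_visits = scale * blended
print(np.round(estimated_visits))
```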

Data collection procedure

For our analysis, we identify a set of websites for which SimilarWeb provides analytics and whose Google Analytics accounts are connected to SimilarWeb [104, 108], thereby making their Google Analytics values available. If a website has an associated Google Analytics account, the paid version of SimilarWeb offers the option of reporting either the SimilarWeb or the Google Analytics numbers for that website. For this access, the website owner grants SimilarWeb access to the website’s Google Analytics account, so the data pull is direct. We verified this process with a website not employed in the study, encountering no issues with either access or reported data. This feature allows us to compare the SimilarWeb and the Google Analytics numbers for our identified web analytics metrics of total visits, unique visitors, bounce rates, and average session duration.

We employ the Majestic Million [108] to identify our pool of possible websites. The Majestic Million list of websites is Creative Commons licensed and derives from Majestic’s web crawler. The list ranks sites by the number of /24 IPv4 subnets linking to each site, used as a proxy for website popularity. Using this large, open-licensed, and readily available list as the seed listing, we started at the top, submitted the link to the SimilarWeb application program interface (API), and checked whether SimilarWeb provided analytics and whether the website had connected its Google Analytics to the SimilarWeb service. We included a website as a candidate for our research if it had both SimilarWeb and Google Analytics metrics; if not, the website was excluded. We then proceeded to the next website on the list and repeated the submission and verification process.

We continued these steps until we identified 91 websites. There were five websites where Google Analytics and SimilarWeb values differed by orders of magnitude. As there seemed to be no discernible patterns among these five websites upon our examination, we excluded them as outliers and reserved them as candidates for future study. This action left us with 86 websites for analysis. We concluded that this was more than satisfactory for our research, as the number is adequate for statistical analysis [109].

We have decided not to make the specific links publicly available, both to protect the privacy of the companies’ websites and because these web analytics comparisons are a paid business product of SimilarWeb. However, we outline our methodology in detail so that those interested can recreate our research. Also, we provide the web analytics and related data concerning the websites (excluding website name and website link) in S1 File.

Data analysis

We employ paired t-tests for our analysis. The paired t-test compares two means measured on the same sample to determine whether there is a statistically significant difference. As the paired t-test assumes normally distributed data, we conduct the Shapiro-Wilk test for visits, unique visitors, bounce rate, and average session duration for both platforms to test for normality. As expected, the Shapiro-Wilk tests showed a significant departure from normality for all variables. Therefore, we transformed our data toward a normal distribution via the Box-Cox transformation [110] using the log transformation, log(variable). We then conducted the Shapiro-Wilk test again; the effect sizes of non-normality were very small, small, or medium, indicating the magnitude of the difference between the sample and normal distributions. Therefore, the data is sufficiently normalized for our purposes, though some skewness remains, as the log transformation weights the data toward the center of the analytics values, as shown for visits in Fig 1.
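
A minimal sketch of this normalization step, using SciPy's Shapiro-Wilk test and the log transform (the Box-Cox transform with lambda = 0); the visit values below are synthetic stand-ins for the 86 per-site monthly averages.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for 86 per-site average monthly visit counts (heavily right-skewed).
rng = np.random.default_rng(0)
visits = rng.lognormal(mean=15, sigma=1.5, size=86)

# Raw values depart strongly from normality ...
w_raw, p_raw = stats.shapiro(visits)

# ... so apply the log transform (Box-Cox with lambda = 0) and retest.
log_visits = np.log(visits)
w_log, p_log = stats.shapiro(log_visits)

print(f"raw: W = {w_raw:.3f}, p = {p_raw:.4g}")
print(f"log: W = {w_log:.3f}, p = {p_log:.4g}")
```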

Fig 1. Histogram of normalized Google Analytics and SimilarWeb visits data.

Effect sizes are very small and small, respectively, indicating that the difference between the sample distribution and the normal distribution is very small/small.

Despite the existing skewness, previous work shows that a method such as the paired t-test is robust in these cases [111, 112]. The transformation ensured that our statistical approach is valid for the dataset’s distributions. We then execute the paired t-test on four groups to test the differences between the means of total visits, unique visitors, bounce rates, and average session duration on the transformed values.

Further, we employ the Pearson correlation test, which measures the strength of a linear relationship between two variables, using the normalized values for the metrics under evaluation. This correlation analysis informs us how the two analytics services rank the websites relative to each other for a given metric, regardless of the agreement on the absolute values. These analytics services are often employed in site rankings, which is a common task in many competitive intelligence endeavors and used in many industry verticals, so such correlation analysis is insightful for using the two services in various domains.
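
The core comparison then reduces to a paired t-test on the log-transformed per-site values and a Pearson correlation between the two services. A sketch with SciPy, again on synthetic stand-in data (variable names and values are illustrative only):

```python
import numpy as np
from scipy import stats

# Synthetic log-transformed average monthly visits for 86 sites from each service.
rng = np.random.default_rng(1)
log_ga = rng.normal(loc=6.8, scale=0.31, size=86)       # Google Analytics
log_sw = log_ga - 0.16 + rng.normal(0, 0.05, size=86)   # SimilarWeb, systematically lower

# Paired t-test: do the two services report different means for the same sites?
t_stat, p_value = stats.ttest_rel(log_ga, log_sw)

# Pearson correlation: do the services order and scale the sites consistently?
r, p_corr = stats.pearsonr(log_ga, log_sw)

print(f"paired t-test: t(85) = {t_stat:.2f}, p = {p_value:.4g}")
print(f"Pearson correlation: r = {r:.3f}, p = {p_corr:.4g}")
```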

Using the SimilarWeb API, we collect the reported values for total visits, unique visitors, bounce rate, and average session duration for each month over 12 months (September 1, 2019, through August 31, 2020, inclusive) for each of the 86 websites on our list. We then average the monthly values for each metric for each platform to obtain the values that we use in our analysis. We use the monthly average to mitigate any specific monthly fluctuation. For example, some websites have seasonal fluctuations in analytics. Some websites may experience outages during specific months or denial of service attacks. Using the monthly average over 12 months helps mitigate the possible short-term variations.
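
A hedged sketch of this collection-and-averaging step is shown below. The endpoint path, parameters, and response shape reflect our reading of SimilarWeb's public REST API at the time of writing and should be treated as illustrative assumptions rather than a definitive specification; the API key and domain are placeholders.

```python
import requests
import statistics

API_KEY = "YOUR_SIMILARWEB_API_KEY"              # placeholder
BASE = "https://api.similarweb.com/v1/website"   # endpoint path is an assumption

def average_monthly_visits(domain: str) -> float:
    """Fetch 12 months of visit estimates for one domain and return their mean."""
    url = f"{BASE}/{domain}/total-traffic-and-engagement/visits"
    params = {
        "api_key": API_KEY,
        "start_date": "2019-09",   # September 2019 ...
        "end_date": "2020-08",     # ... through August 2020, inclusive
        "country": "world",
        "granularity": "monthly",
        "main_domain_only": "false",
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    monthly = [point["visits"] for point in response.json()["visits"]]
    return statistics.mean(monthly)

# Example (placeholder domain); repeated per metric and per site, for both services.
# print(average_monthly_visits("example.com"))
```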

Our four measures, total visits, unique visitors, bounce rate, and average session duration, are considered core metrics in the domain of web analytics [1, 113, 114]. A metric is typically a number, such as a count or a percentage. However, how these metrics are measured or calculated may vary by platform or service; therefore, it is crucial to understand these differences. Additionally, the conceptual understanding of a metric may differ from what a given method can actually track in implementation. Table 2 presents an overview of these metrics.

Table 2. Comparison of definitions of total visits, unique visitors, bounce rate, and session duration conceptually and for the two analytics platforms: Google Analytics and SimilarWeb.

Total Visits
  • Conceptually: Sum of the times that all people go to a website during a measurement period. A measure of frequency.
  • Practically: Sum of the times at least one page of a website has been loaded into a browser during a measurement period.
  • Google Analytics: Sum of single visits to a website consisting of one or more pageviews during a measurement period. The default visit timeout is 30 minutes, meaning that if there is no activity on the website for more than 30 minutes, a new visit is reported if another interaction occurs.
  • SimilarWeb: Sum of the times at least one page of a website has been loaded into a browser during a measurement period. Subsequent page views are included in the same visit until the user is inactive for more than 30 minutes; if the user becomes active again after 30 minutes, that counts as a new visit. A new visit also starts at midnight.

Unique Visitors
  • Conceptually: Sum of the actual people who have visited a website at least once during a period. A measure of reach.
  • Practically: Sum of the distinct tracking measures (e.g., cookie, tag, or plugin) requesting pages from a website during a given period.
  • Google Analytics: Sum of the unique Google Analytics tracking-code and browser-cookie combinations that visit a website at least once during a measurement period.
  • SimilarWeb: Sum of the computing devices visiting a website within a geographical area during a measurement period.

Bounce Rate
  • Conceptually: A bounced visit is the act of a person immediately leaving a website with no interaction. A measure of engagement.
  • Practically: Ratio of single-page visits to all visits to a website during a given period.
  • Google Analytics: Ratio of single-page visits to all visits to a website during a measurement period. Single-page sessions have an undefined session duration, since there is no subsequent server hit after the first one that would let Analytics calculate the length of the session; bounce sessions are therefore assigned a duration of zero.
  • SimilarWeb: Ratio of single-page visits to all visits to a website within a geographical area during a measurement period.

Average Session Duration
  • Conceptually: The average length of time that visitors are on the website. A measure of duration.
  • Practically: Total duration of all sessions divided by the number of sessions.
  • Google Analytics: Session duration is the period of a group of user interactions with a website, from the first interaction to a period of inactivity; by default, a session ends after 30 minutes of inactivity. Session duration relies on this inactivity timeout to end the session, as there is no server hit when the visitor exits the website.
  • SimilarWeb: Session duration is the period of a group of user interactions with a website, from the first interaction to a period of inactivity; by default, a session ends after 30 minutes of inactivity.
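
To make the shared 30-minute inactivity rule concrete, the sketch below groups a toy stream of timestamped pageviews into sessions and derives total visits, bounce rate, and average session duration in the manner Table 2 describes, with bounced single-page sessions contributing a duration of zero. This is a simplified illustration of the logic, not either vendor's implementation (for instance, it ignores SimilarWeb's midnight session cut-off).

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # shared inactivity rule in Table 2

def sessionize(pageviews):
    """Group (visitor_id, timestamp) pageviews into per-visitor sessions."""
    sessions = []
    by_visitor = {}
    for visitor, ts in sorted(pageviews, key=lambda p: (p[0], p[1])):
        current = by_visitor.get(visitor)
        if current is None or ts - current[-1] > TIMEOUT:
            current = [ts]                 # start a new session
            by_visitor[visitor] = current
            sessions.append(current)
        else:
            current.append(ts)             # continue the existing session
    return sessions

def metrics(sessions):
    bounces = sum(1 for s in sessions if len(s) == 1)
    # Single-page sessions have no exit hit, so their duration is counted as zero.
    durations = [(s[-1] - s[0]).total_seconds() for s in sessions]
    return {
        "total_visits": len(sessions),
        "bounce_rate": bounces / len(sessions),
        "avg_session_duration": sum(durations) / len(sessions),
    }

# Toy log: visitor "a" views two pages five minutes apart, visitor "b" bounces.
log = [
    ("a", datetime(2020, 1, 1, 10, 0)),
    ("a", datetime(2020, 1, 1, 10, 5)),
    ("b", datetime(2020, 1, 1, 11, 0)),
]
print(metrics(sessionize(log)))  # 2 visits, 50% bounce rate, 150-second average
```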

Results

Exploratory results

Our 86 websites represent companies based in 26 countries, as shown in Table 3. We used the country classifications provided by SimilarWeb, and we verified the classifications based on our assessment of the websites and links.

Table 3. Host country of organization for 86 websites in study.

Country No. %
United States 43 50.0%
India 6 7.0%
Russian Federation 6 7.0%
Japan 4 4.7%
United Kingdom 4 4.7%
France 3 3.5%
Israel 3 3.5%
Spain 2 2.3%
One each (Belarus, Belgium, Canada, Chile, China, Cuba, Ecuador, Germany, Madagascar, Malaysia, Nigeria, Taiwan, Turkey, Ukraine, United Arab Emirates) 15 17.4%
Total 86 100.0%

The 86 organizational websites are from the following 19 industry verticals, as shown in Table 4. We used the industry classifications provided by SimilarWeb [115, 116], and we verified the classifications based on our assessment of the websites and company background material provided.

Table 4. Industry vertical of organization for 86 websites in study.

Website Category No. %
News and Media 36 41.9%
Computers Electronics and Technology 10 11.6%
Arts and Entertainment 9 10.5%
Science and Education 5 5.8%
Community and Society 4 4.7%
Finance 4 4.7%
Business and Consumer Services 2 2.3%
E-commerce and Shopping 2 2.3%
Gambling 2 2.3%
Travel and Tourism 2 2.3%
Vehicles 2 2.3%
Health 1 1.2%
Hobbies and Leisure 1 1.2%
Home and Garden 1 1.2%
Jobs and Career 1 1.2%
Law and Government 1 1.2%
Lifestyle/Beauty and Cosmetics 1 1.2%
Lifestyle/Fashion and Apparel 1 1.2%
Sports 1 1.2%
Total 86 100.0%

The types of the 86 organizational websites are shown in Table 5. We used the site type classifications provided by SimilarWeb, and we verified the classification based on our assessment of the website content and features. Content sites are websites that provide content as their primary function. Transactional websites are sites that are primarily selling a product. ‘Other’ refers to those websites that do not fit into the other two categories.

Table 5. Website type for the 86 websites in study.

Site Type No. %
Content 50 58.1%
Other 34 39.5%
Transactional 2 2.3%
Total 86 100.0%

H1: Measurements of total visits differ

A paired t-test was conducted to compare the number of total visits reported by Google Analytics and SimilarWeb. There was a significant difference in the reported number of total visits for Google Analytics (M = 6.82, SD = 0.31) and SimilarWeb (M = 6.66, SD = 0.29); t(85) = 6.43, p < 0.01. These results indicate a difference in the number of total visits between the two approaches. Specifically, our results show that SimilarWeb’s reported number of total visits is statistically lower than the values reported by Google Analytics. Therefore, H1 is fully supported: SimilarWeb’s measurements of total visits to websites differ from those reported by Google Analytics.

The number of total visits for all 86 websites was 1,703.5 million (max = 292.5 million; min = 1,998; med = 7.8 million), as reported by Google Analytics, and 1,060.1 million (max = 140.8 million; min = 4,443; med = 5.9 million), as reported by SimilarWeb. Using the aggregate total visits for all 86 websites with Google Analytics as the baseline, SimilarWeb underestimated total visits by 643 million (19.4%). Using Google Analytics numbers as the baseline for total visits, SimilarWeb overestimated 15 (17.4%) sites and underestimated 66 (76.7%) sites. The two platforms were nearly similar (~+/- 5%) for 5 (5.8%) sites.
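
The per-site classification behind these counts is a simple percentage-difference rule against the Google Analytics baseline; a minimal sketch, assuming paired per-site averages and the ~+/- 5% band used above (the example values are hypothetical):

```python
def classify(ga_value: float, sw_value: float, band: float = 0.05) -> str:
    """Label a site by how SimilarWeb compares with the Google Analytics baseline."""
    diff = (sw_value - ga_value) / ga_value
    if abs(diff) <= band:
        return "similar"
    return "overestimated" if diff > 0 else "underestimated"

# Hypothetical paired per-site averages: (Google Analytics, SimilarWeb).
pairs = [(7.8e6, 5.9e6), (1.2e5, 1.23e5), (4.0e4, 5.1e4)]
counts = {"overestimated": 0, "underestimated": 0, "similar": 0}
for ga, sw in pairs:
    counts[classify(ga, sw)] += 1
print(counts)  # {'overestimated': 1, 'underestimated': 1, 'similar': 1}
```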

Ranking the websites by total visits based on Google Analytics and SimilarWeb, we then conduct a Pearson correlation coefficient test. There was a significant strong positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .954, p < .001.

Graphically, we compare the reported total visits between Google Analytics and SimilarWeb in Fig 2, showing the correlational relationship. As shown in Fig 2, the number of total visits between Google Analytics and SimilarWeb has a strong, positive, linear correlation.

Fig 2. Scatter plot of total visits reported by Google Analytics and SimilarWeb showing strong, positive, linear correlation.


This finding implies that, although the reported total visits values differ between the two platforms, the trend for the set of websites is generally consistent. So, if one is interested in a ranking (e.g., “Where does website X rank within this set of websites based on total visits?”), then SimilarWeb values will generally align with those of Google Analytics for those websites. However, if one is specifically interested in numbers (e.g., “What is the number of total visits to each of N websites?”), then the SimilarWeb total visit numbers will be ~20% below those reported by Google Analytics, on average.

H2: Measurements of unique visitors differ

A paired t-test was conducted to compare the number of unique visitors reported by Google Analytics and SimilarWeb. There was a significant difference in unique visitors between the Google Analytics (M = 6.56, SD = 0.26) and the SimilarWeb (M = 6.31, SD = 0.25) conditions; t(85) = 12.60, p < 0.01. These results indicate a difference in the number of unique visitors between the two approaches. Specifically, our results show that the reported number of unique visitors by SimilarWeb is statistically lower than the values reported by Google Analytics. Therefore, H2 is fully supported: SimilarWeb measurements of unique visitors to websites differ from those reported by Google Analytics.

The total number of unique visitors for all 86 websites was 834.7 million (max = 138.1 million; min = 1,799; med = 4.3 million) reported by Google Analytics and 439.0 million (max = 54.6 million; min = 2,361; med = 2.3 million) reported by SimilarWeb. Using the aggregate unique visitors for all 86 websites with Google Analytics as the baseline, SimilarWeb underestimated unique visitors by 395.6 million (38.7%). Using Google Analytics numbers as the baseline, SimilarWeb overestimated 4 (4.7%) sites and underestimated 82 (95.3%) sites.

Ranking the websites by unique visitors based on Google Analytics and SimilarWeb, we then conduct a Pearson correlation coefficient test. There was a significant strong positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .967, p < .001.

Graphically, we compare the reported unique visitors between Google Analytics and SimilarWeb in Fig 3, showing the correlational relationship. As shown in Fig 3, the number of unique visitors between Google Analytics and SimilarWeb has a strong, positive, linear correlation.

Fig 3. Scatter plot of unique visitors reported by Google Analytics and SimilarWeb showing strong, positive, linear correlation.


This finding indicates that, while the reported values for unique visitors differ between the two platforms, the trend for the set of websites is mostly consistent. So, if one is interested in a ranking (e.g., “Where does website X rank within this set of websites based on unique visitors?”), then SimilarWeb values will generally align with those of Google Analytics for those websites. However, if one is specifically interested in numbers (e.g., “What is the number of unique visitors to each of N websites?”), then the SimilarWeb unique visitor numbers will be ~40% below those reported by Google Analytics, on average.

H3: Measurements of bounce rates differ

A paired t-test was conducted to compare bounce rates reported by Google Analytics and SimilarWeb. There was a significant difference in the bounce rates between the Google Analytics (M = 0.58, SD = 0.03) and the SimilarWeb (M = 0.63, SD = 0.02) conditions; t(85) = -2.96, p < 0.01. Specifically, our results showed that the reported bounce rate by SimilarWeb was significantly higher than that reported by Google Analytics, fully supporting H3: SimilarWeb measurements of bounce rates for websites differ from those reported by Google Analytics.

The average bounce rate for all 86 websites was 56.2% (SS = 20.4%; max = 88.9%; min = 20.4%; med = 59.2%) as reported by Google Analytics and 63.0% (SS = 13.8%; max = 86.0%; min = 28.8%; med = 65.3%) as reported by SimilarWeb. Using Google Analytics as the baseline, the SimilarWeb average bounce rate was 6.8 percentage points higher. Additionally, SimilarWeb reported higher bounce rates for 35 (40.7%) sites and lower bounce rates for 31 (36.0%) sites. The two platforms were nearly similar (~+/- 5 percentage points) for 20 (23.3%) sites.

Ranking the websites by bounce rate based on Google Analytics and SimilarWeb, we then conducted a Pearson correlation coefficient test. There was a significant positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .461, p < .001.

Graphically, this is illustrated in Fig 4, where we compare bounce rates between Google Analytics and SimilarWeb. As shown in Fig 4, the bounce rates between Google Analytics and SimilarWeb have a moderate, positive, linear correlation.

Fig 4. Scatter plot of bounce rate reported by Google Analytics and SimilarWeb showing moderate, positive, linear correlation.


This finding indicates that, although SimilarWeb and Google Analytics report similar bounce rates for more than 20% of the sites, the difference between the values for the other 80% for the two platforms was high. We address the possible reasons for this high discrepancy later in the discussion of results.

H4: Measurements of average session duration differ

A paired t-test was conducted to compare the average session duration reported by Google Analytics and SimilarWeb. There was a significant difference in the average session duration between the Google Analytics (M = 2.15, SD = 0.05) and the SimilarWeb (M = 2.47, SD = 0.71) conditions; t(85) = -8.59, p < 0.01. Specifically, our results showed that the reported average session duration by SimilarWeb was significantly higher than that reported by Google Analytics, fully supporting H4: SimilarWeb measurements of average session duration for websites differ from those reported by Google Analytics.

The average session duration for all 86 websites was 202.91 seconds (SS = 239.71; max = 1,439.51; min = 33.25; med = 119.63) as reported by Google Analytics and 463.51 seconds (SS = 640.99; max = 4,498.08; min = 62.42; med = 267.13) as reported by SimilarWeb. Using Google Analytics as the baseline, SimilarWeb reported a 56.2% higher average session duration. Additionally, SimilarWeb reported higher values for 63 (73.3%) sites and lower values for 9 (10.5%) sites, relative to Google Analytics. The two platforms were nearly similar (~+/- 5%) for 14 (16.3%) sites.

Ranking the websites by average session duration based on Google Analytics and SimilarWeb, we then conduct a Pearson correlation. There was a significant positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .536, p < .001, as shown in Fig 5.

Fig 5. Scatter plot of average session duration reported by Google Analytics and SimilarWeb showing moderate, positive, linear correlation.


This finding indicates that, although SimilarWeb and Google Analytics report similar average session durations for about 16% of the sites, the difference between the values for the other 84% of the sites for the two platforms was generally high. We address the possible reasons for this high discrepancy later in the discussion of results.

Discussion

General discussion

Table 6 summarizes our findings for the 86 websites using average monthly total visits, unique visitors, bounce rate, and average session duration during the 12-month analysis period.

Table 6. Summary of results comparing Google Analytics and SimilarWeb for total visits, unique visitors, bounce rate, and average session duration.

Difference uses Google Analytics as the baseline. Hypothesis support is based on the paired t-test results.

Metric | Google Analytics | SimilarWeb | Difference | Hypothesis
Total Visits | 1,703,584,207 | 1,060,137,189 | 19.4% | Fully supported (the reported values differ)
Unique Visitors | 834,656,530 | 439,016,436 | 38.7% | Fully supported (the reported values differ)
Bounce Rate | 56.2% | 63.0% | 6.8% | Fully supported (the reported values differ)
Average Session Duration | 202.91 s | 463.51 s | 56.2% | Fully supported (the reported values differ)

Number of sites (relative to Google Analytics values) where SimilarWeb numbers were:

Metric | Higher | Lower | Similar (~+/- 5%) | Tendency
Total Visits | 15 (17.4%) | 66 (76.7%) | 5 (5.8%) | SimilarWeb values will generally be lower than Google Analytics
Unique Visitors | 4 (4.7%) | 82 (95.3%) | 0 (0.0%) | SimilarWeb values will generally be lower than Google Analytics
Bounce Rate | 35 (40.7%) | 31 (36.0%) | 20 (23.3%) | SimilarWeb values will generally be higher than Google Analytics
Average Session Duration | 63 (73.3%) | 9 (10.5%) | 14 (16.3%) | SimilarWeb values will generally be higher than Google Analytics

As shown in Table 6, the tests of all four hypotheses are statistically significant, so all four hypotheses are supported. The reported values for total visits, unique visitors, bounce rates, and average session duration for Google Analytics and SimilarWeb differ significantly. The website rankings by each service are significantly correlated, so it seems that these ranked lists can be used for research on analytics, competitive analysis, and analytics calculations for a set of websites, with the caveats highlighted in [18, 19]. These analyses compare the two services’ precision (i.e., how close the measured values are to each other).

However, the underlying question motivating our research remains this: How accurate are the reported metrics from website analytics services (i.e., how close are the reported values to the ‘true’ values)? Regardless of the statistical testing results, this motivational question is more challenging to address. In reality, there is one ‘true’ number of visits, visitors, bounces, and average session duration. However, is it realistic to expect any web analytics service to match reality perfectly? Moreover, what is the reality in terms of web analytics? In our perspective, it is a misconception to view web analytics data collection as “counting.” In most cases, web analytics is not counting; instead, it is “measuring.” It is well known that there will be an error rate (+/- n%) for nearly any measure [117]. No measure or measurement tool is perfect, and web data can be particularly messy.

Although one might lean toward considering metrics reported by Google Analytics as the ‘gold standard’ for website analytics (and justifiably so in many cases), it is also known within the industry that Google Analytics has tracking issues in some cases. Also, a reportedly high percentage of Google Analytics accounts are incorrectly set up [118–121], perhaps skewing the measurements in some cases. There are also contexts where other analytics methods might be more appropriate. Google Analytics relies on one data collection approach: basically, a cookie-and-tagging technique. There are certainly cases (e.g., cleared cookies, incognito browsing) where this method is inaccurate (e.g., for unique visitors). Furthermore, Google Analytics might have different filtering settings, such as whether housekeeping visits from organizational employees are excluded, which would slant the results. Therefore, these concerns raise issues with treating Google Analytics as the ‘gold standard.’

To investigate our motivating research question regarding the accuracy of Google Analytics and SimilarWeb as analytics services, we conduct a deductive analysis using the likelihood of error [122]. We consider, on theoretical grounds, which web analytics approach, Google Analytics or SimilarWeb, would be expected to produce the more accurate measurement for each of our metrics. We discuss our analysis of each metric below.

Bounce rate (engagement)

A high bounce rate is undesirable for many sites. A bounce means that someone comes to a site and leaves without taking any relevant action. For this metric, both Google Analytics [123] and SimilarWeb are conceptually incorrect due to the practical issues of measuring a bounce visit [124]. For a meaningful session measurement, there must be an entry point (where the person came to the site) and an exit point (where the person left the site). If there is no endpoint to the session, both Google Analytics and SimilarWeb count it as a single-page visit and a bounce because there is no exit interaction.

There are many situations where relevant action is taken on a site, but there is no exit point [125]. For example, there can be an e-commerce site where a potential consumer arrives on a product page, reads the content, and takes no other action at that time. Another case is a newspaper site where an audience member comes to the site, scans the headlines, reads the article snippets, but takes no other action, such as clicking [126]. In each of these cases, the visit could last several minutes or longer. However, since there is no exit page (i.e., no second page), Google Analytics and SimilarWeb would count these example visits as bounces.

So, we can reasonably assume both Google Analytics and SimilarWeb are conceptually overcounting bounces. This may be why the values vary substantially between the two services. However, since bounce rate is a site-centric measure, we would expect Google Analytics to be more precise (if not more accurate) than SimilarWeb when measuring bounce rate on a single given site. That said, SimilarWeb’s panel data may help correct this somewhat for a set of sites, which Google Analytics does not measure across. So, if one needs to examine the bounce rate of several websites, Google Analytics cannot be used, since website owners usually do not make their web analytics data available to the public.

In terms of practical implementation, one would expect Google Analytics to be better for an individual site. SimilarWeb might be expected to give reasonable bounce rate numbers for some sites due to its user-centric panel data, and bounce rates are generally high, especially for highly trafficked sites. This reasonableness in results from both Google Analytics and SimilarWeb is borne out in our statistical analysis above, where the two services agreed more closely on bounce rates for the larger-traffic sites (see Fig 4).

Average session duration (duration)

Again, for this metric, both Google Analytics [123] and SimilarWeb are conceptually incorrect due to the practical issues of measuring the end of a session. As with the bounce rate, there is no exit point (i.e., where the person left the site). As there is no endpoint, both Google Analytics and SimilarWeb rely on a temporal timeout measured from the time of the last interaction. So, this most likely under-measures the duration of many sessions.

Again, since average session duration is a site-centric measure, Google Analytics would be expected to be at least more precise (if not more accurate) than SimilarWeb when measuring average session duration on a single given site. Again, SimilarWeb’s panel data may somewhat help correct this for a set of sites for which Google Analytics data is unavailable. So, similar to bounce rate, if one needs to examine the average session duration of several websites, Google Analytics cannot be used, as this data is usually not public. In the end, conceptually, both Google Analytics and SimilarWeb are most likely under-measuring average session duration. In terms of practical implementation, one would expect Google Analytics to be better for an individual site. SimilarWeb might be expected to give reasonable numbers for some sites due to its user-centric panel data.

Total visits (frequency)

This seems like a straightforward site-centric metric at which Google Analytics should excel. Although there is room for some noise in the visits, such as housekeeping visits (i.e., visits from internal company personnel for site maintenance), bot-generated visits [127], purchased traffic, or hacking attacks that might not conceptually meet the definition of a visit, it is difficult to imagine how an analytics service could be better than a site-centric service in this regard. The site- and network-centric data collection employed by website analytics services like SimilarWeb would not mitigate some of the noise mentioned above; however, the user-centric panel data might compensate for some of the noise issues, at least for high-traffic websites and for bot traffic. In general, though, one would expect Google Analytics to be more accurate in measuring visits than SimilarWeb. However, Google Analytics data is generally unavailable for multiple websites, so relying on Google Analytics is not practical in these situations. For these cases, one would need to employ an analytics service, such as SimilarWeb. Based on our analysis above, values for total visits from SimilarWeb would be less than Google Analytics measurements by ~20% on average.

Unique visitors (reach)

Finally, we consider unique visitors. In this case, perhaps surprisingly, one would expect the greater likelihood of error to be with the site-centric measurements, resulting in SimilarWeb measures being more accurate.

Site-centric services, such as Google Analytics, typically rely on a combination of cookies and tags to measure unique visitors. This approach would generally result in an overcount of unique visitors by the service. For example, the expected life cycle of a computer is three to five years [128, 129], meaning a person changing computers would be registered as a new visitor. The market share of browsers has changed considerably over the years [130, 131], meaning that when someone changes browsers, they would be registered as a new visitor. Studies show that 40% of Internet users clear cookies daily, weekly, or monthly [132, 133], and about 3.7% of users disable all cookies [134, 135]. These actions would trigger a new unique visitor count when visiting a website. Some studies point to a much higher rate, with more than 30% of users deleting cookies in a given month [132]. Many people also use the incognito mode on their browsers [136, 137], triggering a new visitor count in Google Analytics [138, 139]. Also, many people have multiple devices (e.g., personal computer, work computer, smartphone, tablet), with about 50% of Americans, for example, using four Internet-enabled devices [140, 141], so each device would be counted as a unique visitor even if the same person is using the multiple devices.

For these reasons, the unique visitor number measured using the cookie approach would likely be an overcount for site-centric metrics. How much of an overcount? Based on the issues just outlined, an overestimate of roughly 20% in monthly unique visitors, rising to perhaps 30% for more extended periods, seems reasonable for Google Analytics. However, more precise measures require an in-depth study and are a task for future research.
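As a back-of-the-envelope illustration of this hypothesized overcount, the sketch below deflates Google Analytics unique-visitor figures by an assumed 20% (monthly) or 30% (longer period) overcount; the rates and example numbers simply restate the rough estimates above and are not validated constants.

```python
# Minimal sketch: deflate a Google Analytics unique-visitor count by the hypothesized
# overcount discussed above (roughly 20% for a month, 30% for longer periods).
# The deflation rates and example figures are rough assumptions, not validated constants.

def deflate_unique_visitors(ga_unique_visitors: float, overcount_rate: float) -> float:
    """Estimate 'true' unique visitors, assuming GA overcounts by overcount_rate."""
    return ga_unique_visitors / (1 + overcount_rate)


if __name__ == "__main__":
    ga_monthly_uniques = 50_000   # hypothetical monthly Google Analytics figure
    ga_yearly_uniques = 350_000   # hypothetical 12-month Google Analytics figure
    print(f"Adjusted monthly estimate: ~{deflate_unique_visitors(ga_monthly_uniques, 0.20):,.0f}")
    print(f"Adjusted yearly estimate:  ~{deflate_unique_visitors(ga_yearly_uniques, 0.30):,.0f}")
```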

For unique visitors, it seems that panel data, such as those that SimilarWeb and other network-centric services use, might be more accurate. However, this might only hold for larger websites; it is not clear that panel data would be accurate for lower-traffic websites, as there is not enough panel traffic to these sites to support reliable statistical estimates. Generally, for unique visitors, Google Analytics would most likely overestimate the number of unique visitors to a website. SimilarWeb might be more accurate for higher-traffic websites due to its user panel data approach but have questionable accuracy (either over- or underestimating) for smaller-traffic websites. This conclusion is borne out by our analysis above, where the difference between Google Analytics and SimilarWeb increased for the smaller websites (see Fig 3).

Theoretical implications

We highlight three theoretical implications of this research, which are:

  • Triangulation of Data, Methods, and Services: There seems, at present, to be no single data collection approach (user, site, or network-centric) or web analytics service (including Google Analytics or SimilarWeb) that would be effective for all metrics, contexts, or business needs. Therefore, a triangulation of services, depending on the data, method of analysis, or need, seems to be the most appropriate approach. It appears reasonable that user-centric approaches can be leveraged for in-depth investigation of user online behaviors, albeit usually with a sample. Site-centric approaches can be leveraged for the investigation of users’ onsite behaviors. Network-centric approaches can be leveraged for in-depth investigation of user intersite behaviors (i.e., navigation between sites).

  • Discrepancies with Implementation: Regarding precision, we have established differences between the two services, and we know the general methodologies and metric calculations. However, the nuances of implementation have not been independently audited as of the date of this study, so, in practice, we cannot say definitively which implementation is best for a given metric. Again, this points to the need for triangulation of methods and highlights the lack of a gold standard for evaluating website analytics services. Regardless of any nuances in implementation, the values between the two services are correlated, and, as discussed above, we can infer the preferred approach using deductive analysis.

  • Discrepancies with Reality: Precision does not imply accuracy for either Google Analytics or SimilarWeb. We have already outlined potential issues with all four of the metrics examined (i.e., total visits, unique visitors, bounce rates, average session duration): the mechanics of how the metrics are implemented are not fully aligned with the conceptual definitions of what they are supposed to measure. This situation calls for both continued research into improved measures and a realization that the reported values (from both Google Analytics and SimilarWeb) are not exact counts and should not necessarily be viewed as ‘truth.’ Rather, they are reported measures with some error rate (+/-).

Practical implications

We highlight three practical implications of the findings, which are:

  • Use of Google Analytics and SimilarWeb: Findings of our research show that, in general, SimilarWeb results for total visits and number of unique visitors will be lower than those reported by Google Analytics, and the correlation between the two platforms is high for these two metrics. So, if one is interested in ranking a set of websites for which one does not have the Google Analytics data, the SimilarWeb metrics are a workable proxy. If one is interested in the actual Google Analytics traffic for a set of websites, one can take the SimilarWeb results and increase them by about 20% for total visits and about 40% for unique visitors, on average. As a caveat, the Google Analytics unique visitor numbers are probably an overcount, and the SimilarWeb values may be more in line with reality. As an easier ‘rule of thumb’, we suggest using a 20% adjustment (i.e., increase the SimilarWeb numbers) for both metrics based on the analysis findings above. The realization that these services can be complementary can improve decision-making that relies on KPIs and metrics from website analytics data.

  • Verification of Analytics for a Single Website: In general, Google Analytics is a site-centric web analytics platform, so it is a reasonable service to use for a single website that one owns and has access to. However, comparing analytics values from Google Analytics to those of SimilarWeb (or other website analytics services) may be worthwhile, as these are the values that outsiders see concerning the website.

  • Estimating Google Analytics Metrics for Multiple Websites: As shown above, the differences between Google Analytics and SimilarWeb metrics are systematic (i.e., they stay relatively constant), notably for total visits and unique visitors. This means that, if you have Google Analytics values for one site, you can apply a similar adjustment to the SimilarWeb values of other websites to obtain analytics numbers reasonably close to those Google Analytics would report. This technique is valuable in competitive analysis situations where you compare multiple sites against a known website and want Google Analytics-equivalent values for all sites. However, SimilarWeb generally provides conservative analytics metrics compared to Google Analytics, meaning that, if one relies solely on this single service, analytics measures may be lower, especially for onsite interactions; decisions using these analytics metrics need to account for this.
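The calibration idea in the last point can be sketched as follows: compute the Google Analytics-to-SimilarWeb ratio for the one site whose Google Analytics data is available, then apply that ratio to competitors’ SimilarWeb figures. The site names, numbers, and function names below are hypothetical, and the approach assumes the ratio is roughly constant across comparable sites, as suggested by the systematic differences reported above.

```python
# Minimal sketch of the calibration approach described above: derive a per-metric
# GA/SimilarWeb ratio from one site with known Google Analytics data, then apply it
# to SimilarWeb figures for competitor sites. All names and numbers are hypothetical.

def calibration_ratio(ga_value: float, similarweb_value: float) -> float:
    """Ratio between Google Analytics and SimilarWeb values for the known site."""
    return ga_value / similarweb_value


def estimate_ga(similarweb_value: float, ratio: float) -> float:
    """Apply the known-site ratio to another site's SimilarWeb value."""
    return similarweb_value * ratio


if __name__ == "__main__":
    # Known site: we own its Google Analytics data.
    known_ga_visits, known_sw_visits = 100_000, 80_000
    ratio = calibration_ratio(known_ga_visits, known_sw_visits)  # ~1.25 in this example

    # Competitor sites: only SimilarWeb data is available.
    competitors_sw_visits = {"rival-one.com": 64_000, "rival-two.com": 150_000}
    for site, sw_visits in competitors_sw_visits.items():
        print(f"{site}: estimated GA-equivalent visits ~{estimate_ga(sw_visits, ratio):,.0f}")
```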

Limitations, future research, and strengths

Limitations and future research

The first limitation concerns data quality. In the absence of ground truth, we primarily measure the precision, not the accuracy, of the two web analytics services. As noted, there are inconsistencies between the two platforms, so the analytics data that decision-makers may perceive as accurate, objective, and correct may not have these qualities, owing to the several potential sources of error outlined above. Future work by web analytics services could provide metric values with confidence intervals, depicting them as ranges rather than exact values. Another limitation is that the source code and specific implementations of these platforms are not available, so the nuances of the implementations cannot be verified. Although it is apparent from the results and from company materials that both platforms use state-of-the-art algorithmic approaches, future research could use open-source analytics platforms, such as Matomo [142], to tease apart some of these metric implementations. An additional limitation is that a large percentage of the sites used in this research are content creation sites based in the U.S.A., which might skew the observed user behavior. Other future research involves replication studies with different sets of websites, other website analytics services, other metrics, and analysis of specific website segments based on type, size, industry vertical, or country (e.g., China being a critical region of interest).

Strengths

There are several strengths of this research. First, we use two popular web analytics services. Second, we employ 86 websites with various attributes, ensuring a robust sample size. Third, we collect data over an extended period of 12 months to mitigate short-term fluctuations in the website analytics measures. Fourth, we report and statistically evaluate four core web analytics metrics: total visits, unique visitors, bounce rates, and average session duration. Fifth, we discuss and offer theoretical and practical implications of our research. To our knowledge, this is one of the first and most extensive academic examinations of these popular web analytics services.

Conclusion

For this research, we compared four analytics metrics from Google Analytics to those from SimilarWeb based on 12 months of data for 86 diverse websites. Findings show statistically significant differences between the two services for total visits, unique visitors, bounce rates, and average session duration. Compared to Google Analytics, SimilarWeb values were ~20% lower for total visits, ~40% lower for unique visitors, ~25% higher for bounce rate, and ~50% higher for average session duration, on average. The rankings of all four metrics are significantly correlated between Google Analytics and SimilarWeb, and the measurement differences between the two services are systematic. The implications are that SimilarWeb provides conservative analytics results relative to Google Analytics, and the two web analytics tools can be used in a complementary fashion in various contexts, especially when one has data for one website and needs analytics data for other websites.

Supporting information

S1 File

(DOCX)

Data Availability

The data underlying the results presented in the study are available from SimilarWeb (https://www.similarweb.com/). The authors had no special access privileges to the data others would not have.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1.Jansen BJ. Understanding User-Web Interactions via Web Analytics. Synthesis Lectures on Information Concepts, Retrieval, and Services. 2009. Jan 1;1(1):1–102. [Google Scholar]
  • 2.Saura JR. Using Data Sciences in Digital Marketing: Framework, methods, and performance metrics. Journal of Innovation & Knowledge. 2021;6(2):92–102. [Google Scholar]
  • 3.Jiang J, He D, Allan J. Searching, browsing, and clicking in a search session: changes in user behavior by task and over time. In: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. New York, NY, USA: Association for Computing Machinery; 2014. p. 607–16. (SIGIR ‘14). [Google Scholar]
  • 4.Kämpf M, Tessenow E, Kenett DY, Kantelhardt JW. The Detection of Emerging Trends Using Wikipedia Traffic Data and Context Networks. PLOS ONE. 2015. Dec 31;10(12):e0141892. doi: 10.1371/journal.pone.0141892 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Choo C, Detlor B, Turnbull D. Information Seeking on the Web: An Integrated Model of Browsing and Searching. First Monday. 2000. Mar 27;5. [Google Scholar]
  • 6.Figueredo de Santana V, Ferreira Silva FE. User Test Logger: An Open Source Browser Plugin for Logging and Reporting Local User Studies. In: Antona M, Stephanidis C, editors. Universal Access in Human-Computer Interaction Theory, Methods and Tools. Cham: Springer International Publishing; 2019. p. 229–43. (Lecture Notes in Computer Science). [Google Scholar]
  • 7.Jansen BJ, McNeese MD. Evaluating the effectiveness of and patterns of interactions with automated searching assistance. Journal of the American Society for Information Science and Technology. 2005;56(14):1480–503. [Google Scholar]
  • 8.Miroglio B, Zeber D, Kaye J, Weiss R. The Effect of Ad Blocking on User Engagement with the Web. In: Proceedings of the 2018 World Wide Web Conference. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee; 2018. p. 813–21. (WWW ‘18).
  • 9.Ahmed H, Tahseen D, Haider W, Asad M, Nand S, Kamran S. Establishing Standard Rules for Choosing Best KPIs for an E-Commerce Business based on Google Analytics and Machine Learning Technique. International Journal of Advanced Computer Science and Applications. 2017. Jan 1;8. [Google Scholar]
  • 10.Gunter U, Önder I. Forecasting city arrivals with Google Analytics. Annals of Tourism Research. 2016. Nov 1;61:199–212. [Google Scholar]
  • 11.He D, Göker A, Harper DJ. Combining evidence for automatic Web session identification. Information Processing & Management. 2002. Sep 1;38(5):727–42. [Google Scholar]
  • 12.Jiang T, Chi Y, Gao H. A clickstream data analysis of Chinese academic library OPAC users’ information behavior. Library & Information Science Research. 2017. Jul 1;39(3):213–23. [Google Scholar]
  • 13.Jiang T, Yang J, Yu C, Sang Y. A Clickstream Data Analysis of the Differences between Visiting Behaviors of Desktop and Mobile Users. Data and Information Management. 2018. Dec 31;2(3):130–40. [Google Scholar]
  • 14.Ortiz-Cordova A, Jansen BJ. Classifying Web Search Queries in Order to Identify High Revenue Generating Customers. Journal of the American Society for Information Sciences and Technology. 2012;63(7):1426–41. [Google Scholar]
  • 15.Vecchione A, Brown D, Allen E, Baschnagel A. Tracking User Behavior with Google Analytics Events on an Academic Library Web Site. Journal of Web Librarianship. 2016. Jul 2;10(3):161–75. [Google Scholar]
  • 16.Wang P, Berry MW, Yang Y. Mining longitudinal web queries: trends and patterns. Journal of the American Society for Information Science and Technology. 2003;54(8):743–58. [Google Scholar]
  • 17.Midha V. The Glitch in On-line Advertising: A Study of Click Fraud in Pay-Per-Click Advertising Programs. International Journal of Electronic Commerce. 2008. Dec 1;13(2):91–112. [Google Scholar]
  • 18.Pochat VL, Van Goethem T, Tajalizadehkhoob S, Korczyński M, Joosen W. Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation. Proceedings 2019 Network and Distributed System Security Symposium. 2019;
  • 19.Scheitle Q, Hohlfeld O, Gamba J, Jelten J, Zimmermann T, Strowes SD, et al. A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists. Proceedings of the Internet Measurement Conference 2018. 2018;478–93. [Google Scholar]
  • 20.Martín-Martín A, Orduna-Malea E, Thelwall M, Delgado López-Cózar E. Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories. Journal of Informetrics. 2018. Nov 1;12(4):1160–77. [Google Scholar]
  • 21.Thelwall M. Interpreting social science link analysis research: A theoretical framework. Journal of the American Society for Information Science and Technology. 2006;57(1):60–8. [Google Scholar]
  • 22.Similarweb. SimilarWeb Data Methodology [Internet]. SimilarWeb Data Methodology. 2022 [cited 2022 Feb 2]. Available from: http://support.similarweb.com/hc/en-us/articles/360001631538
  • 23.Hang H, Bashir A, Faloutsos M, Faloutsos C, Dumitras T. “Infect-me-not”: A user-centric and site-centric study of web-based malware. In: 2016 IFIP Networking Conference (IFIP Networking) and Workshops. 2016. p. 234–42. [Google Scholar]
  • 24.Prantl D, Prantl M. Website traffic measurement and rankings: competitive intelligence tools examination. International Journal of Web Information Systems. 2018;14(4):423–37. [Google Scholar]
  • 25.Shukla A, Gullapuram SS, Katti H, Yadati K, Kankanhalli M, Subramanian R. Evaluating content-centric vs. user-centric ad affect recognition. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction. New York, NY, USA: Association for Computing Machinery; 2017. p. 402–10. (ICMI ‘17).
  • 26.Zheng Z (Eric), Fader P, Padmanabhan B. From Business Intelligence to Competitive Intelligence: Inferring Competitive Measures Using Augmented Site-Centric Data. Information Systems Research. 2011. Nov 3;23(3-part-1):698–720. [Google Scholar]
  • 27.Croll A, Power S. Complete Web Monitoring: Watching your visitors, performance, communities, and competitors. O’Reilly Media, Inc.; 2009. 666 p. [Google Scholar]
  • 28.Nepusz T, Petróczi A, Naughton DP. Network Analytical Tool for Monitoring Global Food Safety Highlights China. PLOS ONE. 2009. Aug 18;4(8):e6680. doi: 10.1371/journal.pone.0006680 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Shawki Amin. 6 Free Analytics Tools to Help You Understand Your Competitor’s Web Traffic [Internet]. InfoTrust. 2013. [cited 2022 Feb 2]. Available from: https://infotrust.com/articles/6-free-analytics-tools-to-help-you-understand-your-competitor-s-web-traffic/ [Google Scholar]
  • 30.Joint Laura. Metrics we measure • PR Resolution—by CoverageBook [Internet]. PR Resolution—by CoverageBook. 2016. [cited 2022 Feb 2]. Available from: https://resolution.coveragebook.com/metrics-we-measure/ [Google Scholar]
  • 31.Macanas Mark. SimilarWeb vs Google Analytics Traffic Data Mismatch, Explained [Internet]. TechPinas: Philippines’ Technology News, Tips and Reviews Blog. [cited 2022 Feb 2]. Available from: https://www.techpinas.com/2018/06/SimilarWeb-vs-Google-Analytics.html [Google Scholar]
  • 32.Novak K. SimilarWeb vs Alexa: Which Traffic Estimator is More Precise? [Internet]. Growtraffic Blog. 2019. [cited 2022 Feb 2]. Available from: https://growtraffic.com/blog/2019/03/similarweb-alexa-which-precise [Google Scholar]
  • 33.Finley Olly. SEMrush vs SimilarWeb: What is the Best Tool for Media Buyers? [Internet]. Blog lemonads. 2020. [cited 2022 Feb 2]. Available from: https://www.lemonads.com/blog/semrush-vs-similarweb-what-is-the-best-tool-for-media-buyers/ [Google Scholar]
  • 34.Hinkis Roy. Traffic and Engagement Metrics and Their Correlation to Google Rankings [Internet]. Moz. [cited 2022 Feb 2]. Available from: https://moz.com/blog/traffic-engagement-metrics-their-correlation-to-google-rankings [Google Scholar]
  • 35.Tyler Horvath. 8 Most Accurate Website Traffic Estimators [Internet]. Ninja Reports. 2020 [cited 2022 Feb 2]. Available from: https://www.ninjareports.com/website-traffic-estimators/
  • 36.Barzilay O. How SimilarWeb Helps Investors Make Decisions About Their Portfolio [Internet]. Forbes. 2017. [cited 2022 Feb 2]. Available from: https://www.forbes.com/sites/omribarzilay/2017/11/09/meet-similarweb-one-of-wall-streets-secret-weapons/ [Google Scholar]
  • 37.Bakaev M, Khvorostov V, Heil S, Gaedke M. Web Intelligence Linked Open Data for Website Design Reuse. In: Cabot J, De Virgilio R, Torlone R, editors. Web Engineering. Cham: Springer International Publishing; 2017. p. 370–7. (Lecture Notes in Computer Science). [Google Scholar]
  • 38.Ng YMM, Taneja H. Mapping User-Centric Internet Geographies: How Similar are Countries in Their Web Use Patterns? Journal of Communication. 2019. Oct 1;69(5):467–89. [Google Scholar]
  • 39.SimilarWeb. Our Data | Similarweb [Internet]. Similarweb. 2022 [cited 2022 Jan 30]. Available from: https://www.similarweb.com/corp/ourdata/
  • 40.West BT, Sakshaug JW, Aurelien GAS. How Big of a Problem is Analytic Error in Secondary Analyses of Survey Data? PLOS ONE. 2016. Jun 29;11(6):e0158120. doi: 10.1371/journal.pone.0158120 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Zhang M. Understanding the relationships between interest in online math games and academic performance. Journal of Computer Assisted Learning. 2015;31(3):254–67. [Google Scholar]
  • 42.Ailawadi KL, Farris PW. Managing Multi- and Omni-Channel Distribution: Metrics and Research Directions. Journal of Retailing. 2017. Mar 1;93(1):120–35. [Google Scholar]
  • 43.Movsisyan SA. Social media marketing strategy of Yerevan brandy company. Annals of Agrarian Science. 2016. Sep 1;14(3):243–8. [Google Scholar]
  • 44.Leitner P, Grechenig T. Scalable Social Software Services: Towards a Shopping Community Model Based on Analyses of Established Web Service Components and Functions. In: 2009 42nd Hawaii International Conference on System Sciences. 2009. p. 1–10. [Google Scholar]
  • 45.Kagan S, Bekkerman R. Predicting Purchase Behavior of Website Audiences. International Journal of Electronic Commerce. 2018. Oct 2;22(4):510–39. [Google Scholar]
  • 46.Hazrati N, Ricci F. Recommender systems effect on the evolution of users’ choices distribution. Information Processing & Management. 2022. Jan 1;59(1):102766. [Google Scholar]
  • 47.Karpf D. Social Science Research Methods in Internet Time. Information, Communication & Society. 2012. Jun 1;15(5):639–61. [Google Scholar]
  • 48.Diaz F, Gamon M, Hofman JM, Kıcıman E, Rothschild D. Online and Social Media Data As an Imperfect Continuous Panel Survey. PLOS ONE. 2016. Jan 5;11(1):e0145406. doi: 10.1371/journal.pone.0145406 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Théro H, Vincent EM. Investigating Facebook’s interventions against accounts that repeatedly share misinformation. Information Processing & Management. 2022. Mar 1;59(2):102804. [Google Scholar]
  • 50.Albarran AB. The Social Media Industries. Routledge; 2013. 274 p. [Google Scholar]
  • 51.Bergh BGV, Lee M, Quilliam ET, Hove T. The multidimensional nature and brand impact of user-generated ad parodies in social media. International Journal of Advertising. 2011. Jan 1;30(1):103–31. [Google Scholar]
  • 52.Blombach A, Dykes N, Evert S, Heinrich P, Kabashi B, Proisl T. A New German Reddit Corpus. In: Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019). 2019. p. 278–9.
  • 53.Blombach A, Dykes N, Heinrich P, Kabashi B, Proisl T. A Corpus of German Reddit Exchanges (GeRedE). In: Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association; 2020. p. 6310–6. [Google Scholar]
  • 54.Nakayama M, Wan Y. A quick bite and instant gratification: A simulated Yelp experiment on consumer review information foraging behavior. Information Processing & Management. 2021. Jan 1;58(1):102391. [Google Scholar]
  • 55.Lund A, Zukerfeld M. Profiting from Open Audiovisual Content. In: Lund A, Zukerfeld M, editors. Corporate Capitalism’s Use of Openness: Profit for Free? Cham: Springer International Publishing; 2020. p. 199–239. (Dynamics of Virtual Work). [Google Scholar]
  • 56.Miller CC, Fox KA. 834: Americans view widely varied blog advice about home birth. American Journal of Obstetrics and Gynecology. 2017. Jan;216(1):S478–9. [Google Scholar]
  • 57.Zhang M. Who are interested in online science simulations? Tracking a trend of digital divide in Internet use. Computers & Education. 2014. Jul 1;76:205–14. [Google Scholar]
  • 58.Lasuin CA, Omar A, Ramayah T. Social media and brand engagement in the age of the customer. In Kuching, Sarawak, MALAYSIA; 2015. p. 138–44. [Google Scholar]
  • 59.Rosenblatt M, Curran T, Treiber J. Building Brands through Social Listening. In 2018. p. 71–4. [Google Scholar]
  • 60.Tangmanee C. Comparisons of Website Visit Behavior between Purchase Outcomes and Product Categories. International Journal of Research in Business and Social Science (2147–4478). 2017. Jul 19;6(4):1–10. [Google Scholar]
  • 61.Kang S. Factors influencing intention of mobile application use. International Journal of Mobile Communications. 2014. Jan 1;12(4):360–79. [Google Scholar]
  • 62.Bansal H, Kohli S. Trust evaluation of websites: a comprehensive study. International Journal of Advanced Intelligence Paradigms. 2019. Jan 1;13(1–2):101–12. [Google Scholar]
  • 63.Gunawan AB. Socialization of Terms of Use and Privacy Policy on Indonesian e-commerce Websites. Journal of Social Science. 2020. Jul 26;1(3):41–5. [Google Scholar]
  • 64.Singal H, Kohli S. Trust necessitated through metrics: estimating the trustworthiness of websites. Procedia Computer Science. 2016;85:133–40. [Google Scholar]
  • 65.Singal H, Kohli S. Mitigating Information Trust: Taking the Edge off Health Websites. International Journal of Technoethics. 2016. Jan 1;7(1):16–33. [Google Scholar]
  • 66.Weissbacher M. These Browser Extensions Spy on 8 Million Users Extended [Internet]. 2016. [cited 2022 Feb 2]. Available from: /paper/These-Browser-Extensions-Spy-on-8-Million-Users-Weissbacher/3fe57d1556158da7fe373fb577ac5cbbc3f1e84b [Google Scholar]
  • 67.Chakrabortty K, Jose E. Relationship Analysis between Website Traffic, Domain Age and Google Indexed Pages of E-commerce Websites. IIM Kozhikode Society & Management Review. 2018. Jul 1;7(2):171–7. [Google Scholar]
  • 68.Król K, Halva J. Measuring Efficiency of Websites of Agrotouristic Farms from Poland and Slovakia. Economic and Regional Studies / Studia Ekonomiczne i Regionalne. 2017. Jun 1;10(2):50–9. [Google Scholar]
  • 69.Vyas C. Evaluating state tourism websites using Search Engine Optimization tools. Tourism Management. 2019. Aug 1;73:64–70. [Google Scholar]
  • 70.Akcan H, Suel T, Brönnimann H. Geographic web usage estimation by monitoring DNS caches. In: Proceedings of the first international workshop on Location and the web. New York, NY, USA: Association for Computing Machinery; 2008. p. 85–92. (LOCWEB ‘08). [Google Scholar]
  • 71.Bates S, Bowers J, Greenstein S, Weinstock J, Xu Y, Zittrain J. Evidence of Decreasing Internet Entropy: The Lack of Redundancy in DNS Resolution by Major Websites and Services [Internet]. National Bureau of Economic Research; 2018. Feb [cited 2022 Feb 2]. (Working Paper Series). Report No.: 24317. Available from: http://www.nber.org/papers/w24317 [Google Scholar]
  • 72.de Carlos P, Araújo N, Fraiz JA. The new intermediaries of tourist distribution: Analysis of online accommodation booking sites. The International Journal of Management Science and Information Technology. 2016;(19):39–58. [Google Scholar]
  • 73.Das DB, Sahoo JS. Social Networking Sites–A Critical Analysis of Its Impact on Personal and Social Life. International Journal of Business and Social Science. 2(14):222–8. [Google Scholar]
  • 74.Marine-Roig E. A Webometric Analysis of Travel Blogs and Review Hosting: The Case of Catalonia. Journal of Travel & Tourism Marketing. 2014. Apr 3;31(3):381–96. [Google Scholar]
  • 75.Smoliarova AS, Gromova TM. News Consumption Among Russian-Speaking Immigrants in Israel from 2006 to 2018. In: Alexandrov DA, Boukhanovsky AV, Chugunov AV, Kabanov Y, Koltsova O, Musabirov I, editors. Digital Transformation and Global Society. Cham: Springer International Publishing; 2019. p. 554–64. (Communications in Computer and Information Science). [Google Scholar]
  • 76.Suksida T, Santiworarak L. A study of website content in webometrics ranking of world university by using similar web tool. In: 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP). 2017. p. 480–3. [Google Scholar]
  • 77.Social Network Sites and Its Popularity. International Journal of Research and Reviews in Computer Science. 2011;2(2):522–6. [Google Scholar]
  • 78.Lo B, Sedhain RS. How Reliable Are Website Rankings? Implications for E-Business Advertising and Internet Search. 2006;(2):233–8. [Google Scholar]
  • 79.Vaughan L, Yang R. Web traffic and organization performance measures: Relationships and data sources examined. Journal of Informetrics. 2013. Jul 1;7(3):699–711. [Google Scholar]
  • 80.Nielsen. NetMonitor [Internet]. Nielsen Admosphere. 2015 [cited 2022 Feb 2]. Available from: https://www.nielsen-admosphere.eu/products-and-services/internet-measurement/netmonitor/
  • 81.Olivo Frank. Is SEMRush Accurate? A Comparison with My Site’s Analytics [Internet]. Sagapixel. [cited 2022 Feb 2]. Available from: https://sagapixel.com/seo/is-semrush-accurate/ [Google Scholar]
  • 82.Mulder S. How to Compare Your Site Metrics to Your Local Competitors [Internet]. 2014. [cited 2022 Feb 2]. Available from: http://digitalservices.npr.org/post/how-compare-your-site-metrics-your-local-competitors [Google Scholar]
  • 83.Diachuk Olha, Loba Pavel, Mirgorodskaya Olga. Comparing accuracy: SEMrush vs SimilarWeb [Internet]. owox. 2020. [cited 2022 Feb 2]. Available from: https://www.owox.com/blog/articles/semrush-vs-similarweb/ [Google Scholar]
  • 84.Pupec Ioana. We analyzed 1787 eCommerce websites with SimilarWeb and Google Analytics and that’s what we learned—Omniconvert Blog [Internet]. ECOMMERCE GROWTH Blog. 2017. [cited 2022 Feb 2]. Available from: https://www.omniconvert.com/blog/we-analyzed-1787-ecommerce-websites-similarweb-google-analytics-thats-we-learned.html [Google Scholar]
  • 85.Husain Osman. Forget the garage–this multi-million dollar company started in a jewelry store [Internet]. Tech in Asia. 2015. [cited 2022 Feb 2]. Available from: https://www.techinasia.com/the-story-of-similarweb [Google Scholar]
  • 86.The SaaS Report. SimilarWeb | The Software Report [Internet]. n.d. [cited 2022 Feb 2]. Available from: https://www.thesoftwarereport.com/top-companies/similarweb/
  • 87.Hardwick J. Find Out How Much Traffic a Website Gets: 3 Ways Compared [Internet]. SEO Blog by Ahrefs. 2018. [cited 2022 Feb 2]. Available from: https://ahrefs.com/blog/website-traffic/ [Google Scholar]
  • 88.Kashuba Margaret. SimilarWeb vs. SEMrush: Which Offers More Accurate Data? | CustomerThink [Internet]. 2020. [cited 2022 Feb 2]. Available from: https://customerthink.com/similarweb-vs-semrush-which-offers-more-accurate-data/ [Google Scholar]
  • 89.Pace Richard D. SimilarWeb: Fuzzier and Warmer than Alexa—Everything PR [Internet]. Everything PR News. 2013. [cited 2022 Feb 2]. Available from: https://everything-pr.com/similarweb-alexa/ [Google Scholar]
  • 90.Times Internet. Insights—competition benchmarking tools in the internet industry | Times Internet [Internet]. 2020 [cited 2022 Feb 2]. Available from: https://timesinternet.in/advertise/marketing/insights/competition-benchmarking-tools-in-the-internet-industry/
  • 91.Sahnoun Yassir. SimilarWeb Review: Know Your Audience, Win Your Market [Internet]. Monitor Backlinks Blog. 2018. [cited 2022 Feb 2]. Available from: https://monitorbacklinks.com/blog/content-marketer/similarweb-review [Google Scholar]
  • 92.Hogan Bruce. How to Analyze Competitor Website Traffic [Internet]. SoftwarePundit. 2020. [cited 2022 Feb 2]. Available from: https://www.softwarepundit.com/seo/competitor-taffic-analysis [Google Scholar]
  • 93.Fishkin Rand. The Traffic Prediction Accuracy of 12 Metrics from Compete, Alexa, SimilarWeb, & More [Internet]. SparkToro. 2015. [cited 2022 Feb 2]. Available from: https://sparktoro.com/blog/traffic-prediction-accuracy-12-metrics-compete-alexa-similarweb/ [Google Scholar]
  • 94.Langridge Patrick. How Accurate Are Website Traffic Estimators? [Internet]. 2016. [cited 2022 Feb 2]. Available from: https://www.screamingfrog.co.uk/how-accurate-are-website-traffic-estimators/ [Google Scholar]
  • 95.Aguiar João. SimilarWeb vs SEMrush—Comparing Website Traffic (2020 Case Study) [Internet]. Mobidea Academy. 2020. [cited 2022 Feb 2]. Available from: https://www.mobidea.com/academy/similarweb-vs-semrush-website-traffic/ [Google Scholar]
  • 96.Siegel DA. The mystique of numbers: belief in quantitative approaches to segmentation and persona development. In: CHI ‘10 Extended Abstracts on Human Factors in Computing Systems. New York, NY, USA: Association for Computing Machinery; 2010. p. 4721–32. (CHI EA ‘10). [Google Scholar]
  • 97.W3Techs. Usage Statistics and Market Share of Google Analytics for Websites, October 2020 [Internet]. 2020 [cited 2022 Feb 2]. Available from: https://w3techs.com/technologies/details/ta-googleanalytics
  • 98.Google. Google Analytics Set up the Analytics global site tag—Analytics Help [Internet]. 2020 [cited 2022 Feb 2]. Available from: https://support.google.com/analytics/answer/1008080?hl=en
  • 99.Google. Google Analytics About data sampling—Analytics Help [Internet]. 2020 [cited 2022 Feb 2]. Available from: https://support.google.com/analytics/answer/2637192?hl=en
  • 100.Weber Jonathan. How Accurate Is Sampling In Google Analytics? | Bounteous [Internet]. 2016. [cited 2022 Feb 2]. Available from: https://www.bounteous.com/insights/2016/03/03/how-accurate-sampling-google-analytics/ [Google Scholar]
  • 101.Salkind N. Triangulation. In: Encyclopedia of Research Design. 2455 Teller Road, Thousand Oaks California 91320 United States: SAGE Publications, Inc.; 2010. [Google Scholar]
  • 102.Jick TD. Mixing Qualitative and Quantitative Methods: Triangulation in Action. Administrative Science Quarterly. 1979;24(4):602–11. [Google Scholar]
  • 103.SimilarWeb. SimilarWeb Marketing Solution: The Most Reliable and Comprehensive Data on Competitor and Market Strategies. 2018.
  • 104.SimilarWeb. What does connecting my Google Analytics account with SimilarWeb mean?–Knowledge Center—SimilarWeb [Internet]. 2022 [cited 2022 Feb 2]. Available from: https://support.similarweb.com/hc/en-us/articles/208420125-What-does-connecting-my-Google-Analytics-account-with-SimilarWeb-mean-
  • 105.Wasserman Yossi. How SimilarWeb analyze hundreds of terabytes of data every month with Amazon Athena and Upsolver [Internet]. Amazon Web Services. 2018. [cited 2022 Feb 2]. Available from: https://aws.amazon.com/blogs/big-data/how-similarweb-analyze-hundreds-of-terabytes-of-data-every-month-with-amazon-athena-and-upsolver/ [Google Scholar]
  • 106.Lin J, Kolcz A. Large-scale machine learning at twitter. In: Proceedings of the 2012 international conference on Management of Data—SIGMOD ‘12. Scottsdale, Arizona, USA: ACM Press; 2012. p. 793.
  • 107.Nie Zaiqing, Kambhampati S, Nambiar U. Effectively mining and using coverage and overlap statistics for data integration. IEEE Transactions on Knowledge and Data Engineering. 2005. May;17(5):638–51. [Google Scholar]
  • 108.SimilarWeb. Connecting Google Analytics and Similarweb [Internet]. Similarweb Knowledge Center. 2022. [cited 2022 Jan 30]. Available from: https://support.similarweb.com/hc/en-us/articles/208420125-Connecting-Google-Analytics-and-Similarweb [Google Scholar]
  • 109.Boos DD, Hughes-Oliver JM. How Large Does n Have to be for Z and t Intervals? The American Statistician. 2000;54(2):121–8. [Google Scholar]
  • 110.Box GEP, Cox DR. An Analysis of Transformations. Journal of the Royal Statistical Society Series B (Methodological). 1964;26(2):211–52. [Google Scholar]
  • 111.Box GEP, Andersen SL. Permutation Theory in the Derivation of Robust Criteria and the Study of Departures from Assumption. Journal of the Royal Statistical Society Series B (Methodological). 1955;17(1):1–34. [Google Scholar]
  • 112.Hull D. Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: Association for Computing Machinery; 1993. p. 329–38. (SIGIR ‘93).
  • 113.Avinash Kaushik. Web Analytics Standards: 26 New Metrics Definitions [Internet]. Occam’s Razor by Avinash Kaushik. 2007 [cited 2022 Feb 2]. Available from: https://www.kaushik.net/avinash/web-analytics-standards-26-new-metrics-definitions/
  • 114.Kaushik A. Web Analytics: An Hour a Day. 1st Edition. Indianapolis, Ind: Sybex; 2007. 480 p. [Google Scholar]
  • 115.SimilarWeb. SimilarWeb Category [Internet]. Knowledge Center—SimilarWeb. 2022 [cited 2022 Feb 2]. Available from: https://support.similarweb.com/hc/en-us/articles/360000810469
  • 116.SimilarWeb. SimilarWeb All Categories [Internet]. SimilarWeb.com. 2022 [cited 2022 Feb 2]. Available from: https://www.similarweb.com/category/
  • 117.Bovbjerg ML. Random Error. In: Foundations of Epidemiology [Internet]. Oregon State University; 2020. [cited 2020 Oct 9]. Available from: https://open.oregonstate.education/epidemiology/chapter/random-error/ [Google Scholar]
  • 118.Ramadan Alex. Common Google Analytics Setup Errors and Omissions [Internet]. UpBuild. 2019. [cited 2022 Feb 2]. Available from: https://www.upbuild.io/blog/common-google-analytics-setup-errors/ [Google Scholar]
  • 119.Gant Amanda. Inaccurate Google Analytics—Why Google Analytics is Wrong and How to Fix It [Internet]. Orbit Media Studios. 2020. [cited 2022 Feb 2]. Available from: https://www.orbitmedia.com/blog/inaccurate-google-analytics-traffic-sources/ [Google Scholar]
  • 120.Bloom Kevin. Why Your Google Analytics Data is Wrong and How To Fix It [Internet]. Hinge Marketing. 2020. [cited 2022 Feb 2]. Available from: https://hingemarketing.com/blog/story/why-your-google-analytics-data-is-wrong-and-how-to-fix-it [Google Scholar]
  • 121.Upton E. 88% of Shopify stores have Google Analytics set up incorrectly [Internet]. Econsultancy. 2018. [cited 2022 Feb 2]. Available from: https://econsultancy.com/shopify-stores-google-analytics-set-up-incorrectly/ [Google Scholar]
  • 122.Williamson K, Burstein F, McKemmish S. Chapter 2—The two major traditions of research. In: Williamson K, Bow A, Burstein F, Darke P, Harvey R, Johanson G, et al., editors. Research Methods for Students, Academics and Professionals (Second Edition). Chandos Publishing; 2002. p. 25–47. (Topics in Australasian Library and Information Studies). [Google Scholar]
  • 123.Google. Google Analytics Bounce rate—Analytics Help [Internet]. 2020 [cited 2022 Feb 2]. Available from: https://support.google.com/analytics/answer/1009409?hl=en
  • 124.Schneider Daniel, Trucks Ruth M. Bounce Rate: What you need to know and how to improve [Internet]. SimilarWeb. 2021. [cited 2022 Feb 2]. Available from: https://www.similarweb.com/corp/blog/bounce-rate/ [Google Scholar]
  • 125.Eraslan S, Yesilada Y, Harper S. “The Best of Both Worlds!”: Integration of Web Page and Eye Tracking Data Driven Approaches for Automatic AOI Detection. ACM Transactions on the Web. 2020;14(1):1:1–1:31. [Google Scholar]
  • 126.Jiang T, Guo Q, Chen S, Yang J. What prompts users to click on news headlines? Evidence from unobtrusive data analysis. Aslib Journal of Information Management. 2019. Jan 1;72(1):49–66. [Google Scholar]
  • 127.Vranica S. A “Crisis” in Online Ads: One-Third of Traffic Is Bogus. Wall Street Journal [Internet]. 2014. Mar 23 [cited 2022 Feb 2]; Available from: https://online.wsj.com/article/SB10001424052702304026304579453253860786362.html [Google Scholar]
  • 128.Acimovic J, Erize F, Hu K, Thomas DJ, Mieghem JAV. Product Life Cycle Data Set: Raw and Cleaned Data of Weekly Orders for Personal Computers. Manufacturing & Service Operations Management. 2018;21(1):171–6. [Google Scholar]
  • 129.Sarokin David. What Is the Life Span of the Average PC? | Small Business—Chron.com [Internet]. 2020. [cited 2022 Feb 2]. Available from: https://smallbusiness.chron.com/life-span-average-pc-69823.html [Google Scholar]
  • 130.Arkko J. The influence of internet architecture on centralised versus distributed internet services. Journal of Cyber Policy. 2020. Jan 2;5(1):30–45. [Google Scholar]
  • 131.Liu Shanhong. Desktop internet browser market share 2015–2020 [Internet]. Statista. [cited 2022 Feb 2]. Available from: https://www.statista.com/statistics/544400/market-share-of-internet-browsers-desktop/ [Google Scholar]
  • 132.Abraham DM, Meierhoefer C, Lipsman A. The Impact of Cookie Deletion on the Accuracy of Site-Server and Ad-Server Metrics: An Empirical Comscore Study. 2007;19. [Google Scholar]
  • 133.Wills CE, Zeljkovic M. A personalized approach to web privacy: awareness, attitudes and actions. Information Management & Computer Security. 2011. Mar 22;19(1):53–73. [Google Scholar]
  • 134.Opentracker. Third-Party Cookies vs First-Party Cookies [Internet]. Opentracker. [cited 2022 Feb 2]. Available from: https://www.opentracker.net/article/third-party-cookies-vs-first-party-cookies-2/
  • 135.WRAL. A study of Internet users’ cookie and javascript settings [Internet]. smorgasbork. 2009 [cited 2022 Feb 2]. Available from: http://www.smorgasbork.com/2009/04/29/a-study-of-internet-users-cookie-and-javascript-settings/
  • 136.Google. Google Browse in private—Computer—Google Chrome Help [Internet]. 2020 [cited 2022 Feb 2]. Available from: https://support.google.com/chrome/answer/95464?co=GENIE.Platform%3DDesktop&hl=en
  • 137.Habib H, Colnago J, Gopalakrishnan V, Pearman S, Thomas J, Acquisti A, et al. Away From Prying Eyes: Analyzing Usage and Understanding of Private Browsing. In: Fourteenth Symposium on Usable Privacy and Security. Baltimore, MD, USA; 2018. p. 18. [Google Scholar]
  • 138.Fettman Eric. A Sweet Treat, But Users Delete: Cookies and Cookie Deletion in Google Analytics [Internet]. Cardinal Path. 2015. [cited 2022 Feb 2]. Available from: https://www.cardinalpath.com/blog/cookies-and-cookie-deletion-in-google-analytics [Google Scholar]
  • 139.Ringvee S. How to Detect and Track Incognito Users with Google Analytics and Google Tag Manager [Internet]. Reflective Data. 2019. [cited 2022 Feb 2]. Available from: https://reflectivedata.com/how-to-detect-track-incognito-users-with-google-analytics-and-google-tag-manager/ [Google Scholar]
  • 140.Pew Research Center. Demographics of Mobile Device Ownership and Adoption in the United States [Internet]. Pew Research Center: Internet, Science & Tech. 2019 [cited 2022 Feb 2]. Available from: https://www.pewresearch.org/internet/fact-sheet/mobile/
  • 141.Spangler Todd. U.S. Households Have Average of 11 Devices. 5G Will Push That Higher—Variety [Internet]. 2019. [cited 2022 Feb 2]. Available from: https://variety.com/2019/digital/news/u-s-households-have-an-average-of-11-connected-devices-and-5g-should-push-that-even-higher-1203431225/ [Google Scholar]
  • 142.Matomo. Matomo—The Google Analytics alternative that protects your data [Internet]. Analytics Platform—Matomo. 2020 [cited 2022 Feb 2]. Available from: https://matomo.org/

Decision Letter 0

Hussein Suleman

13 Jan 2022

PONE-D-21-03616

Measuring website interaction: A comparison of two industry standard analytic approaches using 86 websites

PLOS ONE

Dear Dr. Jansen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers agree that there is clear merit in this work, with some positive statements to this effect. However, some reviewer comments are requests for clarification, and these need to be addressed. In particular, reviewers have commented on the metrics used, their appropriateness, and how they are used and interpreted. There are also numerous comments about the statistical analysis that require a response, clarification, and/or updates in the article. Reviewers also converge on requesting more reflection and discussion of the results and implications of the work.

Please submit your revised manuscript by Feb 26 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Hussein Suleman, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that in order to use the direct billing option the corresponding author must be affiliated with the chosen institute. Please either amend your manuscript to change the affiliation or corresponding author, or email us at plosone@plos.org with a request to remove this option.

3. We note that Figures 1 and 2 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figures 1 and 2 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission. 

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

4.  Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes


2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes


3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes


4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes


5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: General comment: The authors had made a meritorious effort and tried effectively to compare the produced values of two analytics platforms. The literature review that has been used is in quite good relationship with the research problematic. Moreover, an important effort has been made to address the practical contribution of the paper to other researchers or practitioners. However, there are some major issues. First the selection of bounce rate metric is not thoroughly aligned with the meaning of duration (see further justification in the comments 8-9 below). Second, there is some difficult understanding regarding the statistical tests that have been selected and what was finally presented in results (See comments 11-15). Lastly, both theoretical and practical contributions are unfolded in an organized and logical way. But it will be very useful to add even more practical contributions from a competitive intelligence point of view. What this manuscript offers compared to the prior relative research approaches on the field of Web Analytics validity and competitive intelligence strategy? And how the up-to-date theoretical scientific approaches are benefited from this paper? Once more, well done for your effort, and I hope the forthcoming suggestions/comments will help you to optimize the value of this paper.

1. Line 32. we need to be more explicit here. What other uses are available based on the citation 2?

2. Table 1. Line 76-77. In the third column the Ahrefs tool is more a backlink checking tool and not a behavioural analytics platform. Better not to include it and refer some other tool more relevant with the web behavioural analytics and not with the off-site optimization and backlinks building.

3. Lines 81-83. The same thing is conducted also with SEMrush as well. And more specifically, SEMRush provides explicit statistics on a daily basis for competitors through graphs, figures etc. And also, the triangulation perspective is adopted on SEMrush, just like SimilarWeb. So why we choose SimilarWeb compared to the others? In a general sense, it will reinforce furtherly the justification of the paper if we put a clear paragraph or a table referring that compared to the others, we choose SimilarWeb for these reasons (one, two, three, four, six, ten and so on reasons).

4. Line 101. ok this is good! But for what reason? the generalization of results to a wide range of analytics technologies what gives to the practical and research community? For example, greater competitive intelligence strategy? Better WA platforms design and capabilities? Something like that.

5. Line 102. Hmm, there are other metrics within these platforms. Mostly user-centric (average video duration, avg videos watched in channel, different types of engagement with a post, followers/subscribers gain and so on). Hence, the results of this study cannot impact on several other domain, but only between web analytics platforms that estimate only websites traffic. Better not to include this assumption.

6. Line 122-123. this seems to be a little bit general as a sentence about their findings. What do these correlations specifically depict? And actually, I suppose, that the purpose here in our paper is to present prior works that focus on the comparison of web traffic platforms, to find differences and fluctuations among them. Not to compare web traffic stats with organizational performance. So, it needs to be more explicit here.

7. Line 133-134. Please guys, reform this sentence. Personally, I believe that this is a little bit arrogant, and does not express academic ethos. It just like that saying "ok you there Scheitle and colleagues, you don't have money, but we have money, and we can do research and you cannot ;) . Probably it is true, but better to redefine this sentence.

8. Line 179. hmm ok Frequency is related with total visits per a determined time-range, Reach is related with the unique visitors. But duration is related mostly with visit duration and page per visit as metrics. Bounce rate express the immediate abandonment from a website without proceeding to any kind of interaction with the content thus this mean zero duration. Probably we can assume here that bounce rate is related mostly with content usability and representativeness of users search terms with what they retrieved as websites’ content from search engine results. That is, if we do not have a good alignment of search term and content, then we have high bounce rate and vice-versa. Or if we have poor usability, then bounce rate is increased as well. So better change the duration with something else more specific that is aligned in a better way with the bounce rate. In a general sense, the involvement of bounce rate metric and its inclusion under the meaning of measuring duration is one of the main issues within the paper. The metric itself is a little bit vexed and you pointed this out in your argumentation including several related references. In continuation of this comment, I try to help you more with another one comment related with bounce rate included in Table 2.

9. Table 2. Column 3. we mentioned “A bounced visit is the act of a person immediately leaving a website before any interaction can reasonably occur” This point is conflicting with the below one point "measure of duration". Bounce rate is not a measure of duration, so if there is no interaction, there is no duration. And based on Google as you stated below within the table << Bounce rate is single-page sessions divided by all sessions, or the percentage of all sessions on your site in which users viewed only a single page and triggered only a single request to the Analytics server. These single-page sessions have a session duration of 0 seconds since there are no subsequent hits after the first one that would let Analytics calculate the length of the session. >> Probably you take it from here. at: https://support.google.com/analytics/answer/1009409?hl=en#:~:text=Bounce%20rate%20is%20single%2Dpage,request%20to%20the%20Analytics%20server. Therefore, I am afraid that we cannot use Bounce rate within the whole paper. And I do not understand why we do not use pages per session or time spent as metrics for measuring duration. This also measures the depth of exploration.

10. Line 238. Reading the citation (number 107) and the paper itself from the acm, it is a little bit fuzzy how large-scale machine learning on a social media such as twitter is related with the SimilarWeb standard methods as it is mostly a website traffic intelligence tool and not a social media competitive intel platform.

11. Line 244-245. How confident are we that this linking process extracts the specific analytics from Google Analytics without deviations from the original source, namely the GA platform? Ok, up to now we are sure that the GA data provided within the SimilarWeb platform differ from the SimilarWeb data. Very good on that. But are we sure that the GA data within SimilarWeb are the same as the original data extracted from the GA platform for the examined websites? In other words, did we conduct a preliminary comparison, for the same time period, between the Google Analytics data extracted from the two platforms, that is, original GA and GA data as included within SimilarWeb? Or can we ask the admins of these Google Analytics accounts to confirm -even for a small sample of the websites (5 or 10 of the total 86)- that the Google Analytics data provided by the Google Analytics platform are the same as the final Google Analytics data provided by SimilarWeb? This, for sure, will strengthen the trustworthiness of our research sample and also the validity of our methodology.

12. Line 255-256. Well, we do not agree with this assumption, guys. Who says that the rule of thumb is about 30 websites, and not 31 or 29, for descriptives? Better to reinforce it with a citation here. You can retrieve it even from a statistical perspective paper (such as citation 109, which you have already used), or from the prior approaches related to web analytics platform comparisons and their gathered samples compared to ours in this paper. As it is now, it is more an opinion than a documented justification.

13. Line 267-274. 1) Ok, if one searches the literature, we need a normal distribution to execute a paired t-test. Now, based on our implications here in this paragraph, we do not have a normal distribution in the initial dataset. And indeed, after downloading the file from the Supporting Information, we discover very high skewness values within the items. We also obtain a low Shapiro-Wilk value, which was computed for testing the normality of the sample.

1) Therefore, we first need to show that our data are not normally distributed, or in other words that the variables do not follow a normal distribution (alternatively, non-parametric tests such as the Wilcoxon signed-rank test, the Mann-Whitney U test, and the Kruskal-Wallis test could be used).

2) After proving non-normality, we then apply the Box-Cox transformation. And ok this is good lads, as we deployed it. After that, we argue here that we have a normal distribution even though there is a bit of skewness. But what is the normality value of the variables now, after the transformation? This is missing. So here we need to re-run a second test to prove that we transformed our data and are now in the right order, i.e., that we have the required normality to conduct a paired t-test. Therefore, we need to conduct a normality test and state that the results indicate that, after the data has been transformed, we have a normal distribution.

3) After the transformation of the data through Box-Cox, how are the values shaped? How were they transformed? What numbers existed previously, and what do they look like now, after the transformation? It would be useful to provide a small sample (4-5 websites across the three variables) within a table showing how the dataset was before and how it looks now, after the Box-Cox transformation.

4) Thereafter, our method of adopting the paired t-test will be further reinforced by the citations you included (111 and 112).

14. Line 276-277. Hmm, this might be a little bit confusing for the reader. So, we conducted the tests on the transformed data. That is good. But we report the non-transformed values? Why this choice, lads? Why did we conduct the transformation? Probably to make the dataset normally distributed. But we present the non-transformed values? So why conduct the transformation in the first place? And actually, we state here that the non-transformed values give greater clarity. Sorry, guys, for not understanding this choice, but we need to be more explicit for the sake of the forthcoming readers. Thank you.

15. Line 323-325. Oh guys, hold on a second. Here the Spearman coefficient comes out of the sky, without anything being mentioned within the Methodology section about its scope and what it will give to the readers. We have mentioned some things about correlations in the theoretical part, but reading the theory again and again, I cannot understand what this correlation practically gives us. How do we interpret it? That is, why do we correlate them? And why do we use Spearman instead of Pearson? Secondly, Spearman is deployed mostly on non-normally distributed datasets. Have we conducted the Spearman on the non-transformed dataset or on the transformed one? If it is the latter, then it needs Pearson, which is conducted mostly on normal distributions.

In any case, if there is a reason for conducting correlation analysis, then we must:

A) State with clarity why we do this and what it proves in support of the scope of the paper.

B) State clearly to which dataset the correlation analysis has been applied. Is it the non-transformed or the transformed one? If it is the latter, then Pearson is more appropriate.

C) Include scatter plots for all three correlations for the involved metrics. High coefficient values alone say almost nothing to a demanding reader.

16. Regarding Figures 6-8: they need improvement. What do the numbers on the vertical and horizontal axes mean, especially on the horizontal one? Although the comparison through the line is comprehensible, the rest is not. Also, we can minimize the white space (where possible) by limiting the range of the vertical axis.

17. Line 405. We state "that these ranked lists can be used for research and other purposes". Ok, but for what other purposes? This is a little bit general. Better to be more explicit here and point out the other purposes.

18. Line 414. This citation (118) is related with the messy situation in Scientometrics and has nothing to do with the web analytics of websites. Better find something else, or remove it.

19. Line 420. There is no "installed correctly or the same on all the websites". The script is one. If it is installed, then it produces numbers. If it is not, then there are no numbers. Of course, there can be inconsistencies in the connections of GA with Google Ads, Search Console, or other platforms and the metrics they produce. But in the case of the three metrics used here, they are measured properly or indicate zero values if there is a problem in the setup. In addition, if we have doubts about the proper installation of GA, why do we not use Google's Tag Assistant Legacy browser extension in our data collection? This tool identifies errors in analytics installation (check here: https://chrome.google.com/webstore/detail/tag-assistant-legacy-by-g/kejbdjndbnbjgmefkgdddjlbokphdefk?hl=en)

20. Line 433-438. Again, regarding the bounce rate metric. Well, this justification contradicts the aforementioned definition of bounce rate, as can be seen in Table 2. And if we want to consider duration as the third central measurement of web analytics, why do we choose bounce rate, which is at least contentious in many cases in the literature regarding duration validity, and not visit duration (SW) and Avg. Visit Duration (GA) to make a comparison between them? This would eliminate all these doubts about bounce rate validity.

21. Line 525. Regarding the two citations on this line: the first one (119) refers to issues with GA setup errors. However, none of these administrator errors affect the three metrics involved here. For example, if we were involving demographics, then ok, we would have validity problems. But none of the statements of Alex Ramadan affect total visits, unique visitors, and bounce rate. The other link (citation 119) is broken and returns a 404 page.

22. Regarding reference list. Citations 28, 32, 33, 54, 55, 84 are broken or are not working properly.

End of Comments/Suggestions.

Thank you for this opportunity.

Reviewer #2: This is a very well-written manuscript. Very easy to read. The material is well-organized.

The manuscript deals with an important problem area: the accuracy of popular website analytics and traffic estimation services (e.g., Google Analytics and SimilarWeb). The manuscript identifies and addresses a research gap: a lack of academic research and interest in studying web analytics.

To improve, can the authors provide more insight on why there is a lack of attention among academics in currently studying this phenomenon? In the abstract, rather than saying the accuracy of metrics provided by Google Analytics and SimilarWeb will be discussed, provide a short sentence or two that speaks to or describes the accuracy of these metrics. In the paper, provide more insight on what the impact of SimilarWeb providing conservative traffic metrics compared to Google Analytics actually means in terms of practice. Why should we care? Why is it important to know that SimilarWeb and Google Analytics can be used in a complementary fashion when direct website data is not available? How important is this to know? Elaborate more on the implications of this research.

There are 143 references included in this paper. This is great, but over the top. I think the references can be reduced to a more significant subset. This would reduce the paper's word count.

What impact do the study's findings have on user-centric, site-centric, and network-centric approaches to web analytics data collection identified earlier in the paper?

The USA represents half of the 86 websites studied. News and media content represents 42% of the 86 websites. It would be good to further describe how this skewed sample affects the findings and interpretation of results.

Reviewer #3: The authors conducted a comparison between Google Analytics and SimilarWeb based on analytics metrics data. The results provide both theoretical and practical implications. The paper is clearly organized and well written. With some minor improvements this piece is worth publishing, and I have a few specific suggestions.

First, the authors need to justify their selection of total visits, unique visitors, and bounce rates as the three metrics. Why exclude other common metrics such as time on site/page?

Second, all three hypotheses are supported, but how does this help evaluate the accuracy of the two analytics services? I think it is impossible to indicate which one is more accurate given the significant differences between them in terms of the three metrics.

Finally, I suggest that the authors improve their discussion section by providing more insights into the causes for the differences between Google Analytics and SimilarWeb.

A minor problem - I’m confused by the statement “The techniques used by SimilarWeb are similar to the techniques of other traffic services, such as Alexa, comScore, SEMRush, Ahrefs, and Hitwise.” (Page 5, Line 100). While Alexa and comScore are user-centric, SEMRush, Ahrefs, and Hitwise are network-centric. Why “similar”? What are the “techniques”?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Georgios A. Giannakopoulos and Ioannis C. Drivas

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 May 27;17(5):e0268212. doi: 10.1371/journal.pone.0268212.r002

Author response to Decision Letter 0


28 Feb 2022

PONE-D-21-03616: Measuring user interactions with websites: A comparison of two industry standard analytics approaches using data of 86 websites

This response letter contains our replies to the reviewers’ suggestions and comments. We include our comments in italics. We reference locations in the manuscript that specifically address major suggestions. Where appropriate, we provide a snippet from the manuscript. In this version of the manuscript, we also highlight the many significant changes made from the prior submission.

We believe we have addressed both the spirit and the specifics of the reviewers’ comments in the manuscript’s current version.

We thank the reviewers for their many detailed and constructive comments that have greatly improved the research presented in this version of the manuscript.

META-REVIEW

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We thank the editor for this opportunity to revise our manuscript.

With this new round of comments, we believe we have addressed both the spirit and the specifics of the reviewers’ comments in the manuscript’s current version, as outlined below.

Thanks!

The reviewers agree that there is clear merit in this work, with some positive statements to this effect.

We thank the reviewers for their positive comments about the research presented in this manuscript, as we also believe that the research has clear merit.

However, some reviewer comments are requests for clarification, and these need to be addressed. In particular, reviewers have commented on the metrics used, their appropriateness, and how they are used and interpreted. There are also numerous comments about the statistical analysis that require a response, clarification, and/or updates in the article. Reviewers also converge on requesting more reflection on and discussion of the results and implications of the work.

We believe we have addressed both the spirit and the specifics of the reviewers’ comments in the manuscript’s current version, as outlined below.

Again, thanks!

Please submit your revised manuscript by Feb 26 2022 11:59PM.

Thank you for the information. We are submitting the revised manuscript within the required timeframe. Actually, we are early!

We look forward to receiving your revised manuscript.

Again, thanks! We hope that you like it!

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

We thank the reviewers for their positive support of the research presented in this manuscript, as we also believe that the research is technically sound.

________________________________________

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

We thank the reviewers for their positive support of the research presented in this manuscript, as we also believe that the statistical analysis has been performed appropriately and rigorously (R#2 and R#3). Concerning revisions, we believe we have addressed both the spirit and the specifics of the reviewers’ (R#1) suggestions in the manuscript’s current version

________________________________________

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

We thank the reviewers for their positive support of the research presented in this manuscript, as we make all data underlying the findings described in this manuscript fully available.

________________________________________

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

We thank the reviewers for their positive support of the research presented in this manuscript.

-------------

5. Review Comments to the Author

Reviewer #1: General comment: The authors had made a meritorious effort and tried effectively to compare the produced values of two analytics platforms.

We thank the reviewer for the positive comment about the manuscript, which we also believe is a meritorious effort to effectively compare the produced values of two analytics platforms.

The literature review that has been used relates quite well to the research problem.

We thank the reviewer for the positive comment about the manuscript, as we also believe that the literature review used relates well to the research.

Moreover, an important effort has been made to address the practical contribution of the paper to other researchers or practitioners.

We thank the reviewer for the positive comment about the manuscript, as we also believe that an important effort has been made to address the practical contribution of the paper to other researchers or practitioners.

However, there are some major issues. First, the selection of the bounce rate metric is not thoroughly aligned with the meaning of duration (see further justification in comments 8-9 below).

We thank the reviewer for pointing this out. Great catch! Upon reflection, you are correct. We have pivoted from the use of bounce rate as a measure of duration in this version of the manuscript, which we discuss in detail below.

Second, there is some difficulty in understanding the statistical tests that were selected and what was finally presented in the results (see comments 11-15).

We thank the reviewer for the suggestions on better presenting the statistical tests that have been selected and what was finally presented in the results. We address the suggestions in this manuscript version, which we discuss in detail below.

Lastly, both theoretical and practical contributions are unfolded in an organized and logical way. But it would be very useful to add even more practical contributions from a competitive intelligence point of view. What does this manuscript offer compared to prior research approaches in the field of web analytics validity and competitive intelligence strategy? And how do up-to-date theoretical scientific approaches benefit from this paper?

We thank the reviewer for these suggestions on better presenting the practical and theoretical contributions. We address the suggestions in this version of the manuscript, which we discuss in detail below.

Once more, well done for your effort, and I hope the forthcoming suggestions/comments will help you to optimize the value of this paper.

Thanks so much for the ‘well done’! We put a lot of work, spirit, thought, and heart into this research and manuscript, and the same for incorporating and implementing your suggestions! They improved the manuscript!

We thank you so much for the excellent suggestions and comments. Some made us think and pivot our direction. Others helped clarify, and others strengthened the research. We sincerely appreciate the thoroughness and thoughtfulness of the review.

With this round of revisions, we believe we have addressed both the spirit and the specifics of the reviewers’ comments in the manuscript’s current version as outlined below. The manuscript and reporting of the research are better because of your suggestions.

Again, thanks!

1. Line 32. We need to be more explicit here. What other uses are available based on citation 2?

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Page 2, 35: Web analytics is a critical component of business intelligence, competitive analysis, website benchmarking, online advertising, online marketing, and digital marketing (2) as business decisions are made based on website traffic measures obtained from website analytics services.

2. Table 1, Line 76-77. In the third column, the Ahrefs tool is more of a backlink-checking tool than a behavioural analytics platform. Better not to include it and instead refer to some other tool more relevant to web behavioural analytics rather than to off-site optimization and backlink building.

We thank the reviewer for this suggestion which we now address in this version of the manuscript. We removed the mention of Ahrefs.

See Table 1, page 4

3. Lines 81-83. The same thing is done with SEMrush as well. More specifically, SEMrush provides explicit daily statistics for competitors through graphs, figures, etc., and the triangulation perspective is also adopted in SEMrush, just like in SimilarWeb. So why do we choose SimilarWeb over the others? In a general sense, it would further reinforce the justification of the paper if we added a clear paragraph or a table stating that, compared to the others, we chose SimilarWeb for these reasons (one, two, three, and so on).

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Page 5, 96: In this research, we compare web analytics statistics from Google Analytics (the industry-standard website analytics platform at the time of the study) and SimilarWeb (the industry-standard traffic analytics platform at the time of the study) using four core web analytics metrics (i.e., total visits, unique visitors, bounce rate, and average session duration) averaged monthly over 12 months for 86 websites. We select SimilarWeb due to the scope of its data collection, reportedly one billion daily digital signals, two terabytes of daily analyzed data, more than two hundred data scientists employed, and more than ten thousand daily traffic reports generated, with reporting features better than or as good as those of other services (39) at the time of the study. As such, SimilarWeb represents the state of the art in the online competitive analytics area. We leave the investigation of other services besides Google Analytics and SimilarWeb to future research.

5. Line 102. Hmm, there are other metrics within these platforms, mostly user-centric (average video duration, average videos watched in a channel, different types of engagement with a post, follower/subscriber gains, and so on). Hence, the results of this study cannot impact several other domains, but only web analytics platforms that estimate website traffic. Better not to include this assumption.

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Page 6, 112: Moreover, the metrics reviewed are commonly used in many industries employing online analytics, such as advertising, online content creation, and e-commerce. Therefore, the findings are impactful for several domains.

6. Line 122-123. This seems to be a rather general sentence about their findings. What do these correlations specifically depict? And actually, I suppose that the purpose here in our paper is to present prior works that focus on the comparison of web traffic platforms, to find differences and fluctuations among them, not to compare web traffic stats with organizational performance. So, it needs to be more explicit here.

We thank the reviewer for pointing this out, and we now address this clarification in this version of the manuscript.

Page 6, 133: The researchers did not evaluate the traffic services but instead reported correlations between web traffic data and measures of academic quality for universities.

7. Line 133-134. Please guys, rephrase this sentence. Personally, I believe that this is a little bit arrogant and does not express academic ethos. It is just like saying "ok you there Scheitle and colleagues, you don't have money, but we have money, and we can do research and you cannot ;)". It is probably true, but better to reword this sentence.

To clarify, this was implied in the cited paper, not by us. However, we take your point and reword the sentence in this version of the manuscript.

Page 7, 144: Scheitle and colleagues (19) attribute this absence to SimilarWeb charging for its service, although the researchers do not investigate this conjecture.

8. Line 179. Hmm, ok: Frequency is related to total visits over a determined time range, and Reach is related to unique visitors. But duration is related mostly to visit duration and pages per visit as metrics. Bounce rate expresses the immediate abandonment of a website without proceeding to any kind of interaction with the content, and thus means zero duration. Probably we can assume here that bounce rate is related mostly to content usability and to how well users' search terms match what they retrieved as website content from the search engine results. That is, if we do not have a good alignment of search term and content, then we have a high bounce rate, and vice versa. Or, if we have poor usability, then bounce rate increases as well. So, better to replace duration with something more specific that aligns better with bounce rate. In a general sense, the involvement of the bounce rate metric and its inclusion under the meaning of measuring duration is one of the main issues within the paper. The metric itself is a little bit vexed, and you point this out in your argumentation, including several related references. In continuation of this comment, I try to help you more with another comment on bounce rate, related to Table 2.

Again, we thank the reviewer for pointing this out. Really helpful! Upon reflection, you are correct, and we pivoted from the use of bounce rate as a measure of duration in this version of the manuscript.

Page 8, 187: To investigate this research objective, we focus on four core web analytics metrics – total visits, unique visitors, bounce rate, and average session duration – which we define in the methods section. Although there is a lengthy list of possible metrics for investigation, these four metrics are central to addressing online behavioral user measurements, including frequency, reach, engagement, and duration, respectively.

9. Table 2, Column 3. We mention "A bounced visit is the act of a person immediately leaving a website before any interaction can reasonably occur." This point conflicts with the point below it, "measure of duration". Bounce rate is not a measure of duration, so if there is no interaction, there is no duration. And based on Google, as you state below within the table: << Bounce rate is single-page sessions divided by all sessions, or the percentage of all sessions on your site in which users viewed only a single page and triggered only a single request to the Analytics server. These single-page sessions have a session duration of 0 seconds since there are no subsequent hits after the first one that would let Analytics calculate the length of the session. >> Probably you took it from here: https://support.google.com/analytics/answer/1009409?hl=en#:~:text=Bounce%20rate%20is%20single%2Dpage,request%20to%20the%20Analytics%20server. Therefore, I am afraid that we cannot use bounce rate within the whole paper. And I do not understand why we do not use pages per session or time spent as metrics for measuring duration. These also measure the depth of exploration.

Again, we thank the reviewer for pointing this out. Really helpful! We really need to keep the reporting of the bounce rate, as it is a very common metric for website analytics. However, in this version of the manuscript, we re-focus bounce rate as a measure of engagement rather than duration.

Additionally, we have added a new section analyzing average session duration as the duration metric.

See Table 2, page 14
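For concreteness, the following minimal Python sketch (with entirely hypothetical session counts, not data from the study or either platform) illustrates the single-page-session definition of bounce rate quoted above, alongside pages per session as the alternative depth-of-exploration metric the reviewer mentions:

# Hypothetical data: number of pageviews in each of ten sessions on a site.
pageviews_per_session = [1, 4, 1, 2, 7, 1, 3, 1, 1, 5]

# Bounce rate under the single-page-session definition: bounced sessions / all sessions.
bounces = sum(1 for p in pageviews_per_session if p == 1)
bounce_rate = bounces / len(pageviews_per_session)

# Pages per session, an engagement/depth alternative suggested by the reviewer.
pages_per_session = sum(pageviews_per_session) / len(pageviews_per_session)

print(f"Bounce rate: {bounce_rate:.0%}")              # 50%
print(f"Pages per session: {pages_per_session:.1f}")  # 2.6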

10. Line 238. Reading the citation (number 107) and the paper itself from the ACM, it is a little bit fuzzy how large-scale machine learning on a social medium such as Twitter relates to the SimilarWeb standard methods, as SimilarWeb is mostly a website traffic intelligence tool and not a social media competitive intelligence platform.

We thank the reviewer for pointing out the possible irrelevant reference, which we have removed in this version of the manuscript.

Specifically, we removed: Brownlee J. A Tour of Machine Learning Algorithms [Internet]. Machine Learning Mastery. 2019 [cited 2020 Oct 6]. Available from: https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Page 12, 252: In sum, the general techniques employed by SimilarWeb are standard methodologies (101,106,107), academically sound, and industry standard state-of-the-art.

11. Line 244-245. How confident are we that this linking process extracts the specific analytics from Google Analytics without deviations from the original source, namely the GA platform? Ok, up to now we are sure that the GA data provided within the SimilarWeb platform differ from the SimilarWeb data. Very good on that. But are we sure that the GA data within SimilarWeb are the same as the original data extracted from the GA platform for the examined websites? In other words, did we conduct a preliminary comparison, for the same time period, between the Google Analytics data extracted from the two platforms, that is, original GA and GA data as included within SimilarWeb? Or can we ask the admins of these Google Analytics accounts to confirm -even for a small sample of the websites (5 or 10 of the total 86)- that the Google Analytics data provided by the Google Analytics platform are the same as the final Google Analytics data provided by SimilarWeb? This, for sure, will strengthen the trustworthiness of our research sample and also the validity of our methodology.

Concerning the process of acquiring the Google Analytics data, access is granted directly through Google, so the data is pulled straight from the connected Google Analytics account. We now mention this in the manuscript, along with our verification of the data access.

Page 12, 258: For this access, the website owner grants SimilarWeb access to the website’s Google Analytics account, so the data pull is direct. We verified this process with a website not employed in the study, encountering no issues with either access or reported data.

12. Line 255-256. Well, we do not agree with this assumption, guys. Who says that the rule of thumb is about 30 websites, and not 31 or 29, for descriptives? Better to reinforce it with a citation here. You can retrieve it even from a statistical perspective paper (such as citation 109, which you have already used), or from the prior approaches related to web analytics platform comparisons and their gathered samples compared to ours in this paper. As it is now, it is more an opinion than a documented justification.

We thank the reviewer for pointing this out concerning the exact number thirty. We address this comment by removing the offending sentence from this version of the manuscript.

13. Line 267-274. 1) Ok, if one searches the literature, we need a normal distribution to execute a paired t-test. Now, based on our implications here in this paragraph, we do not have a normal distribution in the initial dataset. And indeed, after downloading the file from the Supporting Information, we discover very high skewness values within the items. We also obtain a low Shapiro-Wilk value, which was computed for testing the normality of the sample.

1) Therefore, we first need to show that our data are not normally distributed, or in other words that the variables do not follow a normal distribution (alternatively, non-parametric tests such as the Wilcoxon signed-rank test, the Mann-Whitney U test, and the Kruskal-Wallis test could be used).

2) After proving non-normality, we then apply the Box-Cox transformation. And ok this is good lads, as we deployed it. After that, we argue here that we have a normal distribution even though there is a bit of skewness. But what is the normality value of the variables now, after the transformation? This is missing. So here we need to re-run a second test to prove that we transformed our data and are now in the right order, i.e., that we have the required normality to conduct a paired t-test. Therefore, we need to conduct a normality test and state that the results indicate that, after the data has been transformed, we have a normal distribution.

3) After the transformation of the data through Box-Cox, how are the values shaped? How were they transformed? What numbers existed previously, and what do they look like now, after the transformation? It would be useful to provide a small sample (4-5 websites across the three variables) within a table showing how the dataset was before and how it looks now, after the Box-Cox transformation.

4) Thereafter, our method of adopting the paired t-test will be further reinforced by the citations you included (111 and 112).

The comment from the reviewer ("And ok this is good lads, …") made us smile!

We thank the reviewer for these suggestions, which we now address in this version of the manuscript.

For (1), we conduct the Shapiro-Wilk test for each variable and platform. In the interest of space and the general expectation by most readers that the data will not be normal, we do not include the results in the manuscript. However, we provide them below to show that we did conduct them.

• Google Analytics Visits: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .486, p < .001

• Google Analytics Unique Visitors: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .497, p < .001

• Google Analytics Bounce Rate: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .955, p = .004

• Google Analytics Session Duration: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .586, p < .001

• SimilarWeb Visits: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .584, p < .001

• SimilarWeb Unique Visitors: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .592, p < .001

• SimilarWeb Bounce Rate: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .967, p = .026

• SimilarWeb Session Duration: The Shapiro-Wilk test showed a significant departure from the normality, W(86) = .536, p < .001
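For illustration only (using hypothetical monthly-average values, not the study data), a minimal sketch of how such a Shapiro-Wilk normality check can be run in Python with SciPy:

import numpy as np
from scipy import stats

# Hypothetical monthly-average total visits for a handful of websites (not the study data).
ga_visits = np.array([12_000, 450_000, 3_100, 89_000, 2_700_000, 15_000, 620_000, 8_400])

# Shapiro-Wilk test of the null hypothesis that the sample comes from a normal distribution.
w_stat, p_value = stats.shapiro(ga_visits)
print(f"W = {w_stat:.3f}, p = {p_value:.4f}")

# A small p-value (e.g., p < .05) indicates a significant departure from normality,
# which is expected for heavily right-skewed web traffic counts.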

For (2) and (3), we now include a graph (for total visits) of the transformed data in the manuscript. We conducted the Shapiro-Wilk test for each variable and platform post transformation. In the interest of space, we do not include the full results in the manuscript but do mention the effect sizes. We provide the effect sizes below to show that we did conduct them. We also include in this version of the manuscript, as suggested, a histogram of the distribution for one variable (total visits) as an example for the readers.

• Google Analytics Visits: The observed effect size KS - D is very small, 0.06478. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is very small.

• SimilarWeb Visits: The observed effect size KS - D is small, 0.09128. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is small.

• Google Analytics Unique Visits: The observed effect size KS - D is medium, 0.1065. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is medium.

• SimilarWeb Unique Visits: The observed effect size KS - D is medium, 0.1065. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is medium.

• Google Analytics Bounce Rate: The observed effect size KS - D is small, 0.09476. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is small.

• SimilarWeb Bounce Rate: The observed effect size KS - D is small, 0.08994. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is small.

• Google Analytics Session Duration: The observed effect size KS - D is medium, 0.1066. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is medium.

• SimilarWeb Session Duration: The observed effect size KS - D is small, 0.09481. This indicates that the magnitude of the difference between the sample distribution and the normal distribution is small.

For (4), as the reviewer stated, our position for using the paired t-test is now further supported by the mentioned references. Thanks for the suggestions!

Again, we thank the reviewer for these suggestions concerning providing support for our analysis.

Page 13: We conducted the Shapiro-Wilk test for visits, unique visits, and bounce rate for both platforms. The Shapiro-Wilk tests showed a significant departure from the normality for all variables.

See Figure 3, page 13.
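For illustration only (again with hypothetical values, not the study data), a minimal sketch of the log transformation (the Box-Cox transformation with lambda = 0) and of a Kolmogorov-Smirnov D statistic used as the effect size of the remaining non-normality:

import numpy as np
from scipy import stats

# Hypothetical right-skewed monthly-average visits (not the study data).
visits = np.array([12_000, 450_000, 3_100, 89_000, 2_700_000, 15_000, 620_000, 8_400])

# Log transformation; the Box-Cox family with lambda = 0 is the natural log transform.
log_visits = np.log(visits)

# Kolmogorov-Smirnov D statistic against a normal distribution fitted to the transformed
# data; D serves as an effect size for any remaining non-normality. Note that the p-value
# should be read cautiously when the normal's parameters are estimated from the same data.
d_stat, p_value = stats.kstest(log_visits, "norm",
                               args=(log_visits.mean(), log_visits.std(ddof=1)))
print(f"KS D = {d_stat:.3f}, p = {p_value:.3f}")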

14. Line 276-277. Hmm, this might be a little bit confusing for the reader. So, we conducted the tests on the transformed data. That is good. But we report the non-transformed values? Why this choice, lads? Why did we conduct the transformation? Probably to make the dataset normally distributed. But we present the non-transformed values? So why conduct the transformation in the first place? And actually, we state here that the non-transformed values give greater clarity. Sorry, guys, for not understanding this choice, but we need to be more explicit for the sake of the forthcoming readers. Thank you.

We thought the actual values would have more impact for readers. However, it is a stylistic point. We take your point that, if you wanted the transformed data, then other readers may want the transformed data as well. We now report the transformed data in this version of the manuscript.

Page 13, 281: We employ paired t-tests for our analysis. The paired t-test compares two means from the same population to determine whether or not there is a statistical difference. As the paired t-test is for normally distributed populations, we conduct the Shapiro-Wilk test for visits, unique visits, bounce rate, and average session duration for both platforms to test for normality. As expected, the Shapiro-Wilk tests showed a significant departure from the normality for all variables. Therefore, we transformed our data to a normal distribution via the Box-Cox transformation (110) using the log-transformation function, log(variable). We then again conducted the Shapiro-Wilk test; the effect sizes of non-normality were very small, small, or medium, indicating the magnitude of the difference between the sample and normal distribution. Therefore, the data is successfully normalized for our purposes, though a bit of skewness exists, as the data is weighted toward the center of the analytics numbers using the log transformation, as shown for visits in Figure 3.

Figure 3: Histogram of Normalized Google Analytics and SimilarWeb Visits Data. Effect Sizes Are Very Small and Small, Respectively, Indicating the Difference Between the Sample Distribution and the Normal Distribution Is Very Small/Small

Despite the existing skewness, previous work shows that a method such as the paired t-test is robust in these cases (111,112). The transformation ensured that our statistical approach is valid for the dataset’s distributions. We then execute the paired t-test on four groups to test the differences between the means of total visits, unique visitors, bounce rates, and average session duration on the transformed values.
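For illustration only (hypothetical paired values, not the study data), a minimal sketch of the paired t-test on log-transformed Google Analytics and SimilarWeb values for the same set of websites:

import numpy as np
from scipy import stats

# Hypothetical paired monthly-average total visits reported by the two platforms
# for the same websites (not the study data).
ga_visits = np.array([12_000, 450_000, 3_100, 89_000, 2_700_000, 15_000, 620_000, 8_400])
sw_visits = np.array([ 9_500, 380_000, 2_500, 71_000, 2_200_000, 12_500, 495_000, 6_900])

# Paired t-test on the log-transformed values: each website contributes one GA-SW pair.
t_stat, p_value = stats.ttest_rel(np.log(ga_visits), np.log(sw_visits))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")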

15. Line 323-325. Oh guys, hold on a second. Here the Spearman coefficient comes out of the sky, without anything being mentioned within the Methodology section about its scope and what it will give to the readers. We have mentioned some things about correlations in the theoretical part, but reading the theory again and again, I cannot understand what this correlation practically gives us. How do we interpret it? That is, why do we correlate them? And why do we use Spearman instead of Pearson? Secondly, Spearman is deployed mostly on non-normally distributed datasets. Have we conducted the Spearman on the non-transformed dataset or on the transformed one? If it is the latter, then it needs Pearson, which is conducted mostly on normal distributions.

In any case, if there is a reason for conducting correlation analysis, then we must:

A) State with clarity why we do this and what it proves in support of the scope of the paper.

B) State clearly to which dataset the correlation analysis has been applied. Is it the non-transformed or the transformed one? If it is the latter, then Pearson is more appropriate.

C) Include scatter plots for all three correlations for the involved metrics. High coefficient values alone say almost nothing to a demanding reader.

We thank the reviewer for this suggestion which we now address in this version of the manuscript. Specifically, we now discuss the correlation analysis in the methods section of the manuscript, including the use of the analysis. As we use the normalized versions of the data, we report the results of the Pearson correlations in this version of the manuscript. We highlight several places in the manuscript where we discuss correlations between the two analytics platforms. We address the scatter plots in the next comment of this Response to the Reviewers.

Abstract: The website rankings between SimilarWeb and Google Analytics for all metrics are significantly correlated, especially for total visits and unique visitors.

Page 9, 196: Given that Google Analytics uses site-centric website data and SimilarWeb employs a triangulation of datasets and techniques, we would reasonably expect the values to differ between the two. However, it is currently unknown how much they differ, which is more accurate, or whether the results are correlated. Therefore, because Google Analytics is the de facto industry standard for websites, we use Google Analytics measures as the baseline for this research.

Page 14, 298: Further, we employ the Pearson correlation test, which measures the strength of a linear relationship between two variables, using the normalized values for the metrics under evaluation. This correlation analysis informs us how the two analytics services rank the websites relative to each other for a given metric, regardless of the agreement on the absolute values. These analytics services are often employed in site rankings, which is a common task in many competitive intelligence endeavors and used in many industry verticals, so such correlation analysis is insightful for using the two services in various domains.

Page 17, 349: Ranking the websites by total visits based on Google Analytics and SimilarWeb, we then conduct a Pearson correlation coefficient test. There was a significant strong positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .954, p < .001.

Page 18, 358: This finding implies that, although the reported total visits values differ between the two platforms, the trend for the set of websites is generally consistent. So, if one is interested in a ranking (e.g., “Where does website X rank within this set of websites based on total visits?”), then SimilarWeb values will generally align with those of Google Analytics for those websites. However, if one is specifically interested in numbers (e.g., “What is the number of total visits to each of N websites?”), then the SimilarWeb total visit numbers will be ~20% below those reported by Google Analytics, on average.

Page 19, 377: Ranking the websites by unique visitors based on Google Analytics and SimilarWeb, we then conduct a Pearson correlation coefficient test. There was a significant strong positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .967, p < .001.

Page 20, 404: We then conducted a Pearson correlation coefficient test to rank the websites by bounce rate based on Google Analytics and SimilarWeb. There was a significant positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .461, p < .001.

Page 21, 428: Ranking the websites by average session duration based on Google Analytics and SimilarWeb, we then conduct a Pearson correlation. There was a significant positive association between the ranking of Google Analytics and SimilarWeb, rs(85) = .536, p < .001.

Page 27, 579: Use of Google Analytics and SimilarWeb: Findings of our research show that, in general, SimilarWeb results for total visits and number of unique visitors will generally be lower than those reported by Google Analytics, and the correlation between the two platforms is high for these two metrics. So, if one is interested in ranking a set of websites for which one does not have the Google Analytics data, the SimilarWeb metrics are a workable proxy. If one is interested in the actual Google Analytics traffic for a set of websites, one can use the SimilarWeb results and increase by about 20% for total visits and about 40% for unique visitors, on average. As a caveat, the Google Analytics unique visitor’s numbers are probably an overcount, and the SimilarWeb values may be more in line with reality. As an easier ‘rule of thumb’, we suggest using a 20% adjustment (i.e., increase SimilarWeb numbers) for both metrics based on the analysis findings above. The realization that these services can be complementary can improve decision-making that relies on KPIs and metrics from website analytics data.

Page 28, 594: Estimating Google Analytics Metrics for Multiple Websites: As shown above, the differences between Google Analytics and SimilarWeb metrics for total visits and unique visitors are systematic (i.e., the differences stay relatively constant). This means that, if you have Google Analytics values for one site, you can apply a similar adjustment to the other websites to get analytics numbers reasonably close to those from Google Analytics. This technique is valuable in competitive analysis situations where you compare multiple sites against a known website and want the Google Analytics values for all sites. However, SimilarWeb generally provides conservative analytics metrics compared to Google Analytics, meaning that, if solely relying on this single service, analytics measures may be lower, especially for onsite interactions. So, decisions using these analytics metrics need to include this as a factor.
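For illustration only (hypothetical values, not the study data), a minimal sketch contrasting the Pearson correlation on the log-transformed values, as used in this revision, with the Spearman rank correlation, which answers the "do the two services order the websites similarly?" question directly:

import numpy as np
from scipy import stats

# Hypothetical paired monthly-average total visits (not the study data).
ga_visits = np.array([12_000, 450_000, 3_100, 89_000, 2_700_000, 15_000, 620_000, 8_400])
sw_visits = np.array([ 9_500, 380_000, 2_500, 71_000, 2_200_000, 12_500, 495_000, 6_900])

# Pearson correlation on the normalized (log-transformed) values.
r, p = stats.pearsonr(np.log(ga_visits), np.log(sw_visits))

# Spearman rank correlation on the raw values; it uses ranks only.
rho, p_rho = stats.spearmanr(ga_visits, sw_visits)

print(f"Pearson r = {r:.3f} (p = {p:.4f}), Spearman rho = {rho:.3f} (p = {p_rho:.4f})")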

16. Regarding Figures 6-8: they need improvement. What do the numbers on the vertical and horizontal axes mean, especially on the horizontal one? Although the comparison through the line is comprehensible, the rest is not. Also, we can minimize the white space (where possible) by limiting the range of the vertical axis.

We thank the reviewer for this suggestion which we now address in this version of the manuscript. Given the request above for scatterplots for the correlations, which was the purpose of these graphs, we have replaced the graphs with scatterplots in this version of the manuscript.

See Figure 4, page 18

See Figure 5, page 19

See Figure 6, page 20

See Figure 7, page 21

17. Line 405. We state "that these ranked lists can be used for research and other purposes". Ok, but for what other purposes? This is a little bit general. Better to be more explicit here and point out the other purposes.

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Page 23, 451: The website rankings by each service are significantly correlated, so it seems that these ranked lists can be used for research on analytics, competitive analysis, and analytics calculations for a set of websites, with the caveat highlighted in (18,19). These analyses compare the two services’ precision (i.e., how close measured values are to each other).

18. Line 414. This citation (118) is related with the messy situation in Scientometrics and has nothing to do with the web analytics of websites. Better find something else, or remove it.

We thank the reviewer for this suggestion which we now address in this version of the manuscript. Following the reviewer’s suggestion, we have removed the offending citation.

Page 23, 461: No measure or measurement tool is perfect, and web data can be particularly messy.

19. Line 420. There is no "installed correctly or the same on all the websites". The script is one. If it is installed, then it produces numbers. If it is not, then there are no numbers. Of course, there can be inconsistencies in the connections of GA with Google Ads, Search Console, or other platforms and the metrics they produce. But in the case of the three metrics used here, they are measured properly or indicate zero values if there is a problem in the setup. In addition, if we have doubts about the proper installation of GA, why do we not use Google's Tag Assistant Legacy browser extension in our data collection? This tool identifies errors in analytics installation (check here: https://chrome.google.com/webstore/detail/tag-assistant-legacy-by-g/kejbdjndbnbjgmefkgdddjlbokphdefk?hl=en)

We thank the reviewer for raising this issue. Perhaps “installed correctly” gave the wrong impression, so we have modified the sentence.

Page 23, 469: Furthermore, Google Analytics might have different settings in terms of filtering, such as housekeeping visits from organizational employees that would slant the results.

20. Line 433-438. Again, regarding the bounce rate metric. Well, this justification contradicts the aforementioned definition of bounce rate, as can be seen in Table 2. And if we want to consider duration as the third central measurement of web analytics, why do we choose bounce rate, which is at least contentious in many cases in the literature regarding duration validity, and not visit duration (SW) and Avg. Visit Duration (GA) to make a comparison between them? This would eliminate all these doubts about bounce rate validity.

We thank the reviewer for raising this issue concerning bounce rate, which we have re-categorized from duration to engagement. As for the reviewer's suggestion, using visit duration or average visit duration does not entirely resolve the problem either, as any measure of duration suffers from the 'no exit point' issue; this exit-point issue does not affect only bounce rate.

Additionally, we have added a new section analyzing average session duration as the duration metric.

Page 9, 204: H4: SimilarWeb measures of average session durations for websites differ from those reported by Google Analytics.

See section H4: Measurements of average session duration differ, beginning on page 21, 416

21. Line 525. Regarding the two citations on this line: the first one (119) refers to issues with GA setup errors. However, none of these administrator errors affect the three metrics involved here. For example, if we were involving demographics, then ok, we would have validity problems. But none of the statements of Alex Ramadan affect total visits, unique visitors, and bounce rate. The other link (citation 119) is broken and returns a 404 page.

We have removed the sentence from the paragraph in this version of the manuscript as upon review, it was not central to the paragraph’s main topic.

22. Regarding reference list. Citations 28, 32, 33, 54, 55, 84 are broken or are not working properly.

Thanks for pointing out this issue. Impressed that you checked them all!

We also tested the links. Of the six you mention, four worked fine for us, and two links did not.

We now provide updated functional links for those two references or removed them.

We also verify that every link in the reference listing was functional as of the date that we submitted this manuscript.

Again, thanks!

End of Comments/Suggestions. Thank you for this opportunity.

Hey, thank you for the GREAT comments and suggestions! Made us work, but we believe the suggestions and effort to address these suggestions really improved the manuscript!

Also, the tone of the comments and positive support were really motivating for us to do a good job with the revisions! Again, thanks so much!

-------------

Reviewer #2: This is a very well-written manuscript. Very easy to read. The material is well-organized.

We thank the reviewer for the positive comment about the manuscript, which we also believe is well-written, easy to read, and well-organized material.

The manuscript deals with an important problem area: the accuracy of popular website analytics and traffic estimation services (e.g., Google Analytics and SimilarWeb). The manuscript identifies and addresses a research gap: a lack of academic research and interest in studying web analytics.

We thank the reviewer for the positive comments about the research, which we also believe is an important problem, and the manuscript, which we also believe identifies and addresses a research gap: a lack of academic research and interest in studying web analytics.

To improve, can the authors provide more insight on why there is a lack of attention among academics in currently studying this phenomenon?

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Page 7, 142: While few academic studies have examined analytics services, fewer have evaluated the actual analytics numbers; instead, they focus on the more easily accessible (and usually free) ranked lists. Studies are even rarer still on the performance of SimilarWeb, despite its standing and reputation as an industry leader. Scheitle and colleagues (19) attribute this absence to SimilarWeb charging for its service, although the researchers do not investigate this conjecture.

Page 8, 181: Although the questions are conceptually straightforward, they are surprisingly difficult to execute in practice. This difficulty, especially in terms of data collection, may be a compounding factor for the dearth of academic research in the area.

In the abstract, rather than saying the accuracy of metrics provided by Google Analytics and SimilarWeb will be discussed, provide a short sentence or two that speaks to or describes the accuracy of these metrics.

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Abstract, 25: The accuracy/inaccuracy of the metrics from both services is discussed from the vantage of the data collection methods employed. In the absence of a gold standard, combining the two services is a reasonable approach, with Google Analytics for onsite and SimilarWeb for network metrics.

In the paper, provide more insight on what the impact of SimilarWeb providing conservative traffic metrics compared to Google Analytics actually means in terms of practice. Why should we care? Why is it important to know that SimilarWeb and Google Analytics can be used in a complementary fashion when direct website data is not available? How important is this to know? Elaborate more on the implications of this research.

We thank the reviewer for these suggestions, which we now address in this version of the manuscript.

• Page 27, 576: Use of Google Analytics and SimilarWeb: Findings of our research show that, in general, SimilarWeb results for total visits and number of unique visitors will be lower than those reported by Google Analytics, and the correlation between the two platforms is high for these two metrics. So, if one is interested in ranking a set of websites for which one does not have the Google Analytics data, the SimilarWeb metrics are a workable proxy. If one is interested in the actual Google Analytics traffic for a set of websites, one can use the SimilarWeb results and increase them by about 20% for total visits and about 40% for unique visitors, on average. As a caveat, the Google Analytics unique visitor numbers are probably an overcount, and the SimilarWeb values may be more in line with reality. As an easier ‘rule of thumb’, we suggest a 20% upward adjustment of the SimilarWeb numbers for both metrics, based on the analysis findings above. The realization that these services can be complementary can improve decision-making that relies on KPIs and metrics from website analytics data.

• Page 28, 591: Estimating Google Analytics Metrics for Multiple Websites: As shown above, the differences between Google Analytics and SimilarWeb metrics are systematic (i.e., the differences stay relatively constant), notably for total visits and unique visitors. This means that, if you have Google Analytics values for one site, you can apply a similar adjustment to the other websites to obtain analytics numbers reasonably close to those from Google Analytics. This technique is valuable in competitive analysis situations where you compare multiple sites against a known website and want the Google Analytics values for all sites. However, SimilarWeb generally provides conservative analytics metrics compared to Google Analytics, meaning that, if solely relying on this single service, analytics measures may be lower, especially for onsite interactions. So, decisions using these analytics metrics need to include this as a factor.
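To make the two practical implications above concrete, the following minimal Python sketch shows how the suggested rule-of-thumb adjustment and the known-site calibration could be applied. All figures, site names, and function names are hypothetical illustrations and are not drawn from the study data.

```python
# Illustrative sketch only: hypothetical figures, site names, and function names.

# (1) Rule-of-thumb adjustment: scale a SimilarWeb value upward by roughly 20%
# to approximate the corresponding Google Analytics value.
def rule_of_thumb_estimate(sw_value: float, adjustment: float = 0.20) -> float:
    return sw_value * (1 + adjustment)

sw_total_visits = 1_200_000
print(rule_of_thumb_estimate(sw_total_visits))  # 1,440,000.0 estimated GA total visits

# (2) Known-site calibration: compute the GA-to-SimilarWeb ratio for one site
# with Google Analytics access, then apply that ratio to other sites for which
# only SimilarWeb data is available.
known_ga_visits = 1_500_000   # Google Analytics total visits for the known site
known_sw_visits = 1_250_000   # SimilarWeb total visits for the same site
calibration_ratio = known_ga_visits / known_sw_visits  # systematic difference (1.2)

competitor_sw_visits = {
    "competitor-a.example": 900_000,
    "competitor-b.example": 2_100_000,
}
estimated_ga_visits = {
    site: sw * calibration_ratio for site, sw in competitor_sw_visits.items()
}
print(estimated_ga_visits)  # GA-equivalent estimates for the comparison sites
```

This is only a sketch of the calibration idea under the assumption that the GA/SimilarWeb difference observed for the known site carries over to comparable sites; in practice, the adjustment factor would be revisited per metric and per website segment.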

There are 143 references included in this paper. This is great, but over the top. I think the references can be reduced to a subset of the more significant ones. This would reduce the paper’s word count.

We thank the reviewer for this suggestion. Since this is one of the first research papers in this area, we need to ensure that the literature review is comprehensive, requiring a substantial number of references, along with the many technical references required for a thorough explanation of the two platforms and measures employed.

What impact do the study’s findings have on user-centric, site-centric, and network-centric approaches to web analytics data collection identified earlier in the paper?

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Page 26, 533: Triangulation of Data, Methods, and Services: There seems, at present, to be no single data collection approach (user, site, or network-centric) or web analytics service (including Google Analytics or SimilarWeb) that would be effective for all metrics, contexts, or business needs. Therefore, a triangulation of services, depending on the data, method of analysis, or need, seems to be the most appropriate approach. It appears reasonable that user-centric approaches can be leveraged for in-depth investigation of user online behaviors, albeit usually with a sample. Site-centric approaches can be leveraged for the investigation of users’ onsite behaviors. Network-centric approaches can be leveraged for in-depth investigation of user intersite behaviors.

The USA represents half of the 86 websites studied. News and media content represents 42% of the 86 websites. It would be good to further describe how this skewed sample affects the findings and interpretation of results.

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Page 28, 610: An additional limitation is that a large percentage of the sites used in this research are content creation sites based in the U.S.A., which might skew user behavior. Other future research involves replication studies with different sets of websites, other website analytics services, other metrics, and analysis of specific website segments based on type, size, industry vertical, or country (i.e., China being a critical region of interest).

-------------

Reviewer #3: The authors conducted a comparison between Google Analytics and SimilarWeb based on analytics metrics data. The results provide both theoretical and practical implications. The paper is clearly organized and well written. With some minor improvements this piece is worth publishing, and I have a few specific suggestions.

We thank the reviewer for the accurate summary of the research and for the positive comments about both the research (which we also believe provides theoretical and practical implications) and the manuscript, which we also believe is clearly organized and well written. We also believe the research manuscript is worth publishing.

Concerning the few specific suggestions, we believe we have addressed both the spirit and the specifics of the reviewers’ comments in the manuscript’s current version, as outlined below.

First, the authors need to justify their selection of total visits, unique visitors, and bounce rates as the three metrics. Why exclude other common metrics such as time on site/page?

We thank the reviewer for this suggestion, which we now address in this version of the manuscript. Additionally, we now include a time-on-site (average session duration) analysis.

Page 8, 187: To investigate this research objective, we focus on four core web analytics metrics – total visits, unique visitors, bounce rate, and average session duration – which we define in the methods section. Although there is a lengthy list of possible metrics for investigation, these four metrics are central to addressing online behavioral user measurements, including frequency, reach, engagement, and duration, respectively. We acknowledge that there may be some conceptual overlap among these metrics. For example, bounce rates are sessions with an indeterminate duration, but average session duration also provides insights into user engagement. Nevertheless, these four metrics are central to the web analytics analysis of nearly any single website or set of websites; therefore, they are worthy of investigation. In the interest of space and impact of findings, we focus on these four metrics, leaving other metrics for future research.
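As a concrete illustration of these four definitions, the minimal Python sketch below computes the metrics from raw session data. The session records and field names are invented for the example and do not reflect any Google Analytics or SimilarWeb output format.

```python
# Illustrative sketch only: hypothetical session records; field names are assumptions.
sessions = [
    {"visitor_id": "u1", "pageviews": 3, "duration_sec": 180},
    {"visitor_id": "u1", "pageviews": 1, "duration_sec": 0},   # single-page session (a bounce)
    {"visitor_id": "u2", "pageviews": 5, "duration_sec": 420},
]

total_visits = len(sessions)                                    # frequency
unique_visitors = len({s["visitor_id"] for s in sessions})      # reach
bounce_rate = sum(s["pageviews"] == 1 for s in sessions) / total_visits          # engagement
avg_session_duration = sum(s["duration_sec"] for s in sessions) / total_visits   # duration

print(total_visits, unique_visitors, round(bounce_rate, 2), round(avg_session_duration, 1))
# Expected output: 3 2 0.33 200.0
```

Note that bounced sessions contribute zero duration in this sketch, which mirrors the conceptual overlap mentioned above between bounce rate and average session duration.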

Second, all three hypotheses are supported, but how does this help evaluate the accuracy of the two analytics services? I think it is impossible to indicate which one is more accurate given the significant differences between them in terms of the three metrics.

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

Abstract, 25: The accuracy/inaccuracy of the metrics from both services is discussed from the vantage of the data collection methods employed. In the absence of a gold standard, combining the two services is a reasonable approach, with Google Analytics for onsite and SimilarWeb for network metrics.

Page 9, 196: Given that Google Analytics uses site-centric website data and SimilarWeb employs a triangulation of datasets and techniques, we would reasonably expect values to differ between the two. However, it is currently unknown how much they differ, which is more accurate, or whether the results are correlated. Therefore, because Google Analytics is the de facto industry standard for websites, we use Google Analytics measures as the baseline for this research.

Page 23, 460: Although one might lean toward considering metrics reported by Google Analytics as the ‘gold standard’ for website analytics (and justifiably so in many cases), it is also known within the industry that Google Analytics has tracking issues in some cases. Also, a reportedly high percentage of Google Analytics accounts are incorrectly set up (118–121), skewing the measurements in some cases. There are also contexts where other analytics methods might be more appropriate. Google Analytics relies on one data collection approach: essentially, a cookie and tagging technique. There are certainly cases (e.g., cleared cookies, incognito browsing) when this method is inaccurate (e.g., for unique visitors). Furthermore, Google Analytics might have different filtering settings, such as whether housekeeping visits from organizational employees are excluded, which would slant the results. Therefore, these factors complicate treating Google Analytics as the ‘gold standard.’

See Practical Implications section, pages 27-28

Finally, I suggest that the authors improve their discussion section by providing more insights into the causes for the differences between Google Analytics and SimilarWeb.

We thank the reviewer for this suggestion which we now address in this version of the manuscript.

See Discussion section, pages 22-26

A minor problem: I’m confused by the statement “The techniques used by SimilarWeb are similar to the techniques of other traffic services, such as Alexa, comScore, SEMRush, Ahrefs, and Hitwise.” (Page 5, Line 100). While Alexa and comScore are user-centric, SEMRush, Ahrefs, and Hitwise are network-centric. Why “similar”? What are the “techniques”?

Sorry for this confusion. We have now expanded the sentence with a clarifying phrase in this version of the manuscript.

Page 6, 198: The techniques used by SimilarWeb are similar to the techniques of other website analytics services, such as Alexa, comScore, SEMRush, Ahrefs, and Hitwise, in the employment of user, site, and/or network data collection.

Attachment

Submitted filename: PONE-D-21-03616_Response_to_Reviewers_Submitted.docx

Decision Letter 1

Hussein Suleman

26 Apr 2022

Measuring user interactions with websites: A comparison of two industry standard analytics approaches using data of 86 websites

PONE-D-21-03616R1

Dear Dr. Jansen,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Hussein Suleman, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: A big "Well Done" to the authors.

They have addressed all the suggestions and comments to improve the quality of the paper.

Each suggestion has been taken into consideration, and the points that might confuse future readers have been corrected.

One of the most crucial aspects (the bounce rate involvement) has also been overhauled with clarity and in a well-organized way. This is a tremendous effort on your part. Going one step further, you kept the bounce rate and added one more metric.

I think that the paper now stands sufficiently on its own and constitutes a scientific work that holistically advances the Web Analytics research topic.

Reviewer #2: The authors have adequately addressed all prior concerns that I (Reviewer #2) previously raised. They have also enriched the quality of the manuscript by adequately addressing all the detailed concerns outlined previously by Reviewer #1. The paper is ready for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Prof. Georgios A. Giannakopoulos

Reviewer #2: No

Acceptance letter

Hussein Suleman

13 May 2022

PONE-D-21-03616R1

Measuring user interactions with websites: A comparison of two industry standard analytics approaches using data of 86 websites

Dear Dr. Jansen:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Hussein Suleman

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (DOCX)

    Attachment

    Submitted filename: PONE-D-21-03616_Response_to_Reviewers_Submitted.docx

    Data Availability Statement

    The data underlying the results presented in the study are available from SimilarWeb (https://www.similarweb.com/). The authors had no special access privileges to the data others would not have.

