2021 Nov 4;184:111136. doi: 10.1016/j.jss.2021.111136

Table 6.

Evidence-based lessons learned, recommendations, and implications derived by our study for various stakeholders.

Lesson learned / recommendation / implication | Based on discussions in section number… | Stakeholders: software engineering teams of the apps; decision-makers, behavioral scientists, and public health experts; developers (vendors) of app-review analytics/mining tools; SE researchers. The "x" marks in each row indicate the applicable stakeholder columns.
Users are generally dissatisfied with the apps under study, except for the Scottish app. This is perhaps the clearest and most important message of our study, and it should be investigated by stakeholders. 4.1 x x

Future studies could look into what factors have made the Scottish app different from the others in the pool of apps under study. That could be a research question (RQ) for researchers to study in future works. 4.1 x

Contact-tracing apps should be designed to be as simple as possible to operate (for usability), as we cannot expect lay citizens to review an app's online FAQ pages to configure it properly, especially for a safety-critical, health-related app. 4.2 x

Developers of the apps can and should engage directly with reviews and reply to them, not only gaining insight into the most commonly raised concerns but also answering questions in public view. This can even project a positive "image" of the software engineering team behind the app (in terms of accountability, transparency, responsiveness, and openness to feedback). 4.2 x

Just like those of any other mobile app, user reviews for contact-tracing apps range from short phrases such as "Not working", which are often neither useful nor insightful, to detailed objective reviews, which can be useful for various stakeholders. Thus, if any stakeholder (e.g., the app's development team) wants to benefit qualitatively from the reviews, they need to filter and analyze the "informative" ones. 4.2 x x
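A minimal sketch of such filtering, assuming a simple hypothetical heuristic (minimum review length plus the presence of problem-describing keywords); real analytics tools typically use trained classifiers instead, and the keyword list here is illustrative only:

```python
# Hypothetical heuristic for separating "informative" reviews from short,
# uninformative phrases such as "Not working".
INFORMATIVE_KEYWORDS = {"crash", "battery", "update", "version", "bluetooth",
                        "notification", "error", "install"}

def is_informative(review: str, min_words: int = 8) -> bool:
    """Flag a review as informative if it is long enough and mentions
    at least one problem-related keyword (exact word match)."""
    words = review.lower().split()
    if len(words) < min_words:
        return False  # short phrases like "Not working" are filtered out
    return any(k in words for k in INFORMATIVE_KEYWORDS)

reviews = [
    "Not working",
    "After the last update the app drains my battery overnight",
]
informative = [r for r in reviews if is_informative(r)]
```

In practice a team would tune the word threshold and keyword list against a manually labelled sample of its own reviews.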

A common issue for most apps is high battery usage (drainage). Software engineers should use various heuristics and approaches to minimize battery usage. Mobile phone users are sensitive about battery usage, and any app that consumes a large amount of battery is likely to be uninstalled. 4.3 x

For the German apps, a substantial number of reviews report that the app is not working, and these can be seen as bug reports. Unfortunately, since most users are non-technical, the informative and important components of a bug report (e.g., phone model/version and steps to reproduce the defect) are not included in the review. Thus, it would be nearly impossible for the app's software engineering team to utilize those reviews as bug reports. A recommendation could be that the app itself (e.g., in its "tutorial" screens) give users explicit guidance: if they wish to submit bug reports as reviews, they should include the important components of a bug report (e.g., phone model/version and steps to reproduce). 4.3.1 x

A large number of cross-(mobile)-device issues have been reported for the German and other apps. This denotes inadequate cross-device testing, possibly due to the rush to release the apps to the public. Given the nature of the apps, and since they could be installed on any mobile device model/version by any citizen, the development and testing teams should have taken extra care in cross-device development and testing. There are many sources on this issue, both in the academic literature (Husmann et al., 2016, Nebeling et al., 2015) and in the grey literature, from which development and testing teams can benefit. 4.3.1 x

Certain features of a given app did not work for many users for several days, e.g., the German app's "Risk assessment" functionality. Such a malfunction usually gives users a negative perception of an app, even if its other features work properly. It is thus logical to recommend that app developers not include a feature in a release if they predict, or see from reviews, that it does not work for certain users or at certain times/days. 4.3.1 x

We see a rather trivial issue in the apps: users have to "activate" them multiple times instead of just once. We would have hoped that the apps' test teams had detected and fixed such trivial issues before release. 4.3.2 x

It is important that a given app automatically switch to the home country's language, since some non-English users may find it odd to see a sudden switch from their native language to English in the app's GUI. 4.3.2 x

The development teams of all apps should be proactive in replying to user reviews, filtering informative reviews, and obtaining more information from them (e.g., steps to reproduce the defects/problems), e.g., by direct replies to the reviews in app stores. 4.3.3 x

There seem to be rather trivial usability issues with some of the apps (e.g., the exposure notification errors in the NI app). This raises the question of inadequate usability testing and the possibility that the apps were released in a "rush". 4.3.3 x

Some of the reviews provide insights into software engineering issues of the apps, e.g., that not enough testing has been done on all possible types of QR codes, and that not enough performance (load) testing has been done. 4.3.3 x

For Android phones, the update mechanism of the OS and its components (e.g., APIs) should be "seamless" (automatic), since we cannot expect all users to have the "technical" skills to perform such tasks properly. 4.3.3 x

The apps must be clearly identifiable and findable in app stores to maximize the number of users downloading them. 4.4.3 x

Where possible, some feedback (such as statistics about COVID cases in the region and the number of nearby phone IDs recorded in the past) should be provided as a feature of the app, to reassure users that the app is working and to emphasize the pro-social and individual benefits it provides. 4.4.3 x x

A variety of insightful feature requests have been provided by users of the German app, e.g.: how many encounters there were with other app users (how many people you exchanged keys with); infection numbers and spread at the district level; whether the app can be used without internet. As a form of "iterative" requirements engineering (elicitation) (Iacob and Harrison, 2013, Jha and Mahmoud, 2019, Jha and Mahmoud, 2017, Williams and Mahmoud, 2017, Guzman et al., 2017, Lu and Liang, 2017, Maalej et al., 2019, Nayebi et al., 2017) or "crowd-based" requirements engineering (Groen et al., 2015), the apps' software engineering teams are encouraged to review those feature requests and select a subset to be implemented. 4.5.1 x x

Given the nature of the COVID pandemic, governmental policies and guidelines change regularly, and thus the contact-tracing apps have been regularly updated/adapted to those changes. This relates to the widely discussed issue of changing/unstable software requirements. Thus, SE researchers are encouraged to work on such issues in the context of contact-tracing apps. 4.5.1 x

While AppBot's capability to filter reviews so that only feature requests are shown is useful, we found many example reviews that AppBot incorrectly classified as feature requests. We realize that an NLP/AI-based algorithm performs that classification and will have limited precision, but there is still a need for developers (vendors) of app-review analytics tools, such as AppBot, to improve such algorithms. 4.5.1 x
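The precision limitation mentioned above can be quantified against a manually labelled sample. A minimal sketch, where the predictions and human labels are made up for illustration (not AppBot's real output):

```python
def precision(predicted: list, actual: list) -> float:
    """Fraction of reviews flagged as feature requests that a human
    reviewer agreed are genuine feature requests."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    flagged = sum(1 for p in predicted if p)
    return true_pos / flagged if flagged else 0.0

# predicted: did the tool flag the review as a feature request?
# actual:    did a human reviewer agree? (hypothetical labels)
predicted = [True, True, True, True, False]
actual    = [True, False, True, True, False]
p = precision(predicted, actual)  # 3 of the 4 flagged reviews were genuine
```

Tracking this number on periodic labelled samples would let a tool vendor measure whether classifier updates actually improve precision.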

Many users have cast doubt on the usefulness of the apps, i.e., the apps do not provide most of the "right" and much-needed features that many users are looking for. Thus, applying "crowd-based" requirements engineering (Groen et al., 2015) techniques to these apps is critically needed. 4.5.2 x x

It would be interesting to examine the differences among the apps, and between their two OS versions, at a technical level, e.g., their code bases and software architectures. 4.6.1 x

Sentiment analysis of app reviews can provide more granular output than the "star rating" alone, but there seems to be an inherent negative bias, especially on Android, which should be further investigated in future studies to better understand the phenomenon. A possible future research question (RQ) would be: why is there an inherent negative bias in the Android version of an app compared to the iOS version? 4.6.2 x x x

The semantic-overlap measures between the two OS versions of the apps ranged from 45% to 86%. Possible root causes of low or high similarity should be studied in future works. 4.6.3 x
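One simple way to compute a vocabulary-overlap measure between the review sets of two OS versions is Jaccard similarity over word sets; this is only an illustrative sketch, and the paper's own semantic-overlap measure may be computed differently:

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity between the word sets of two review corpora:
    |A intersect B| / |A union B|."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical concatenated review texts for the two OS versions of one app.
android_reviews = "app keeps crashing after update battery drain"
ios_reviews = "app crashing after update works fine otherwise"
overlap = jaccard(android_reviews, ios_reviews)
```

In practice one would first remove stop words and normalize word forms (stemming or lemmatization) so that the overlap reflects shared topics rather than shared function words.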

There is a moderate correlation between the number of downloads, normalized by population size, and the Trust in Public Institutions Index (TIPI). This seems to indicate that the more trust a country's population, as a whole, has in its government, the higher the ratio of app downloads, and expectedly the higher the use. Behavioral scientists could investigate this issue in more detail. 4.7.1 x
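The kind of analysis described above can be sketched as a Pearson correlation between per-capita downloads and a trust index. The numbers below are made up for illustration and are not the paper's data:

```python
from math import sqrt

def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

downloads_per_capita = [0.12, 0.25, 0.31, 0.18, 0.40]  # hypothetical values
trust_index          = [45.0, 60.0, 70.0, 50.0, 75.0]  # hypothetical TIPI values
r = pearson(downloads_per_capita, trust_index)
```

A Spearman rank correlation would be a more robust choice if the relationship is monotonic but not linear, or if either variable has outliers.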

There could be unexpected inter-dependencies between the apps and various aspects of the mobile OS they run on. Updates to the OS could adversely impact a given app and could easily cause major dissatisfaction among its users. Thus, the development team should work with OS vendors (in the above example, Google, which is behind the Android OS) to prevent such chaotic situations. 4.7.2 x