iMedicalApps and JMIR Publications have partnered to help disseminate interesting & innovative digital health research being done worldwide. Each article in this series will feature summaries of interesting studies to help you keep up to date on the latest in digital health research. We invite you to share your thoughts on the study in the comments section.
Interrater Reliability of mHealth App Rating Measures: Analysis of Top Depression and Smoking Cessation Apps
1. What was the motivation behind your study?
With over 165,000 healthcare apps available for download today, finding the right app can be difficult. Although there is no gold standard for evaluating healthcare apps, numerous quality measures have been developed to help people find apps that are safe, easy to use, and effective. Despite the proliferation of app rating scales, little is known about how useful or reliable these tools are in practice. In this study, we assessed the interrater reliability of 22 measures for evaluating the quality of healthcare apps to learn which measures may be most reproducible.
2. Describe your study.
A panel of six expert reviewers (four doctors, one nurse practitioner, and one healthcare economist) each downloaded and reviewed the top 10 depression apps and top 10 smoking cessation apps from the Apple iTunes App Store against a common set of 22 metrics. Krippendorff's alpha was calculated for each measure to determine whether the reviewers produced consistent ratings. Interrater reliability was calculated separately for the depression and smoking cessation apps so that the effect of app category on reliability could be examined.
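The study's own analysis code is not reproduced here, but the statistic it relies on, Krippendorff's alpha, can be sketched in a few lines of Python. The function below is a minimal illustration only (the function name, the rating layout, and the interval/nominal distance options are our assumptions, not details from the paper): alpha compares the disagreement actually observed among raters to the disagreement expected by chance, so 1.0 means perfect agreement and values near or below 0 mean agreement no better than chance.

```python
import itertools

def krippendorff_alpha(ratings, metric="interval"):
    """Rough sketch of Krippendorff's alpha (hypothetical helper).

    ratings: list of per-rater lists; ratings[r][u] is rater r's score
    for unit u (e.g. one app on one metric), or None if missing.
    """
    # Keep only units rated by at least two raters (required for alpha).
    units = []
    for u in range(len(ratings[0])):
        vals = [r[u] for r in ratings if r[u] is not None]
        if len(vals) >= 2:
            units.append(vals)

    def delta(a, b):
        # Disagreement between two ratings: 0/1 for nominal labels,
        # squared difference for interval-scale scores.
        if metric == "nominal":
            return 0.0 if a == b else 1.0
        return float(a - b) ** 2

    n = sum(len(vals) for vals in units)  # total pairable ratings

    # Observed disagreement: average disagreement within each unit.
    d_o = sum(
        sum(delta(a, b) for a, b in itertools.permutations(vals, 2))
        / (len(vals) - 1)
        for vals in units
    ) / n

    # Expected disagreement: over all rating pairs pooled across units.
    pooled = [v for vals in units for v in vals]
    d_e = sum(
        delta(a, b) for a, b in itertools.permutations(pooled, 2)
    ) / (n * (n - 1))

    return 1.0 - d_o / d_e
```

For example, two raters who score four apps identically get alpha = 1.0, while two raters who systematically flip a binary label get a negative alpha; intuitively, a low alpha on a metric like "ease of use" means reviewers' scores scatter almost as widely within an app as across apps.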
3. What were the results of the study?
Despite rating the same apps on the same metrics, the reviewers varied substantially in how they rated the apps. Specifically, interrater reliability was low for all but one of the measures evaluated. While the metric for interactiveness and feedback had reasonable interrater reliability, reliability for other metrics, such as ease of use and perceived effectiveness, was poor.
4. What is the main point that readers should take away from this study?
Rating mHealth apps is difficult, and many of the metrics currently being used may have poor interrater reliability.
Our results suggest caution about widespread use of any app rating scale before it has been thoroughly tested. Until scales are validated, it may be best to seek out multiple opinions when picking an app rather than rely on any single scale.
5. What was the most surprising finding from your study?
Seemingly helpful metrics for selecting healthcare apps, such as ease of use and perceived efficacy, actually had some of the lowest interrater reliability. If reviewers agree so little on how smartphone apps perform on these metrics, the metrics themselves may be quite subjective.
6. What are the next steps? How do you envision this work ultimately translating into clinical practice or affecting R&D?
Since patients may rate apps differently than non-patient reviewers, we plan to expand the scope of this study to include patients and examine their interrater reliability in reviewing healthcare apps. By combining the most reliable clinician and patient metrics, we may be able to identify the measures best suited to guiding selection of healthcare apps.
This Q&A was contributed by all of the study authors.