Appraisal of Statistical Practices in HRI vis-a-vis the T-Test for Likert Items/Scales

Abstract

Likert items and scales are often used in human-subject studies to measure participants' subjective responses to treatment levels. In the field of human-robot interaction (HRI), which has few widely accepted quantitative metrics, researchers often rely on Likert items and scales to evaluate their systems. However, there is debate over which statistical method is best for evaluating differences between experimental treatments based on Likert item or scale responses. Likert responses are ordinal rather than interval: the differences between consecutive response options on a Likert item are not necessarily equally spaced. Hence, parametric tests such as the t-test, which assume interval-scaled, normally distributed data, are often claimed to be statistically unsound for evaluating Likert response data. The statistical purist would use non-parametric tests, such as the Mann-Whitney U test, to evaluate differences in ordinal datasets; however, non-parametric tests sacrifice sensitivity in detecting differences in exchange for a more conservative specificity, that is, a lower false positive rate. Finally, it is common practice in HRI to sum similar individual Likert items into a Likert scale and apply the t-test or ANOVA to the scale, seeking refuge in the central limit theorem. In this paper, we empirically evaluate the validity of the t-test versus the Mann-Whitney U test for Likert items and scales. We conduct our investigation via Monte Carlo simulation to quantify the sensitivity and specificity of the tests.
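
The kind of Monte Carlo comparison described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' simulation: the 5-point response distributions (`NULL`, `SHIFTED`), the group size, the trial count, and the 0.05 significance level are all assumed values, and every function name is hypothetical. To stay dependency-free, both tests use a normal approximation for the p-value (a Welch-style t statistic, and a tie-corrected Mann-Whitney U without continuity correction); in practice one would more likely call `scipy.stats.ttest_ind` and `scipy.stats.mannwhitneyu`.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Assumed 5-point Likert response distributions (not taken from the paper).
NULL = [0.10, 0.20, 0.40, 0.20, 0.10]
SHIFTED = [0.05, 0.10, 0.25, 0.35, 0.25]


def sample_likert(rng, probs, n):
    """Draw n responses on a 1..k item with the given category probabilities."""
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=n)


def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def t_test_p(a, b):
    """Welch-style t statistic; p-value via normal approximation (assumption)."""
    ma, va = mean_and_var(a)
    mb, vb = mean_and_var(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    if se == 0.0:
        return 1.0 if ma == mb else 0.0
    return two_sided_p((ma - mb) / se)


def mann_whitney_p(a, b):
    """Mann-Whitney U with midranks and tie-corrected normal approximation."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted(a + b)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # average of ranks i+1..j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u1 = sum(rank_of[x] for x in a) - n1 * (n1 + 1) / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0.0:
        return 1.0  # all responses identical: no evidence of a difference
    return two_sided_p((u1 - n1 * n2 / 2.0) / math.sqrt(var_u))


def rejection_rates(probs_a, probs_b, n=30, trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which each test rejects H0."""
    rng = random.Random(seed)
    hits = {"t": 0, "u": 0}
    for _ in range(trials):
        a = sample_likert(rng, probs_a, n)
        b = sample_likert(rng, probs_b, n)
        hits["t"] += t_test_p(a, b) < alpha
        hits["u"] += mann_whitney_p(a, b) < alpha
    return {name: count / trials for name, count in hits.items()}


if __name__ == "__main__":
    # Same distribution in both groups: rejection rate = false positive rate
    # (specificity is 1 minus this value).
    print("false positive rate:", rejection_rates(NULL, NULL))
    # Different distributions: rejection rate = sensitivity (power).
    print("sensitivity:", rejection_rates(NULL, SHIFTED))
```

Running both tests on the same simulated samples keeps the comparison paired, so differences in the reported rates reflect the tests themselves rather than sampling noise between runs.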

Publication
2016 AAAI Fall Symposium Series