Document Type
Article
Publication Title
Transactions of the Association for Computational Linguistics
Abstract
Despite the subjective nature of semantic textual similarity (STS) and pervasive disagreements in STS annotation, existing benchmarks have used averaged human ratings as the gold standard. Averaging masks the true distribution of human opinions on examples of low agreement and prevents models from capturing the semantic vagueness that the individual ratings represent. In this work, we introduce USTS, the first Uncertainty-aware STS dataset with ∼15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis reveals that neither a scalar nor a single Gaussian fits a set of observed judgments adequately. We further show that current STS models cannot capture the variance caused by human disagreement on individual instances, but rather reflect the predictive confidence over the aggregate dataset.
First Page
997
Last Page
1013
DOI
10.1162/tacl_a_00584
Publication Date
1-1-2023
Recommended Citation
Y. Wang et al., "Collective Human Opinions in Semantic Textual Similarity," Transactions of the Association for Computational Linguistics, vol. 11, pp. 997–1013, Jan. 2023.
The definitive version is available at https://doi.org/10.1162/tacl_a_00584
Additional Links
DOI link: https://doi.org/10.1162/tacl_a_00584
Comments
Open Access
Archived thanks to MIT Press Direct
License: CC BY 4.0
Uploaded: 22 March 2024