GWU NLP at SemEval-2016 shared task 1: Matrix factorization for crosslingual STS

Hanan Aldarmaki, The George Washington University & Mohamed bin Zayed University of Artificial Intelligence
Mona Diab, The George Washington University


We present a matrix factorization model for learning cross-lingual sentence representations. Using sentence-aligned corpora, the proposed model learns distributed representations by factoring the given data into language-dependent factors and one shared factor. As a result, input sentences from both languages can be mapped into fixed-length vectors and compared directly using cosine similarity. On Spanish-English semantic textual similarity, this approach achieves a Pearson correlation of 0.8. © 2016 Association for Computational Linguistics.
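
The factorization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the objective or the optimizer, so the ridge-regularized alternating least squares used here, the function names (`fit_shared_factorization`, `embed`, `cosine`), the hyperparameters, and the synthetic "aligned corpus" are all assumptions made for illustration. Two sentence-by-feature matrices with aligned rows are factored as `X_a ≈ S @ W_a` and `X_b ≈ S @ W_b`, where `S` is the shared factor and `W_a`, `W_b` are the language-dependent factors. A sentence from either language is then projected into the shared space and compared by cosine similarity.

```python
import numpy as np

def fit_shared_factorization(X_a, X_b, k=3, reg=0.1, iters=30, seed=0):
    """Factor X_a ~ S @ W_a and X_b ~ S @ W_b with one shared S.

    Assumed optimizer: alternating least squares with a ridge penalty.
    Rows of X_a and X_b are aligned sentence pairs.
    """
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(X_a.shape[0], k))
    ridge = reg * np.eye(k)
    X = np.hstack([X_a, X_b])
    for _ in range(iters):
        # Update the language-dependent factors with S held fixed.
        gram = S.T @ S + ridge
        W_a = np.linalg.solve(gram, S.T @ X_a)
        W_b = np.linalg.solve(gram, S.T @ X_b)
        # Update the shared factor with both language factors held fixed.
        W = np.hstack([W_a, W_b])
        S = np.linalg.solve(W @ W.T + ridge, W @ X.T).T
    return S, W_a, W_b

def embed(x, W, reg=0.1):
    """Map one sentence vector x into the shared k-dimensional space."""
    k = W.shape[0]
    return np.linalg.solve(W @ W.T + reg * np.eye(k), W @ x)

def cosine(u, v):
    """Cosine similarity with a small constant to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Synthetic stand-in for a sentence-aligned corpus (hypothetical data):
# both "languages" are linear views of the same latent sentence meaning.
rng = np.random.default_rng(1)
n, k, d_a, d_b = 40, 3, 6, 7
Z = rng.normal(size=(n, k))        # latent meanings of n sentence pairs
A = rng.normal(size=(k, d_a))      # view for language a
B = rng.normal(size=(k, d_b))      # view for language b
S, W_a, W_b = fit_shared_factorization(Z @ A, Z @ B, k=k)

# A held-out sentence pair that shares one latent meaning.
z_new = rng.normal(size=k)
u = embed(z_new @ A, W_a)
v = embed(z_new @ B, W_b)
aligned_similarity = cosine(u, v)
print(f"cosine similarity of held-out aligned pair: {aligned_similarity:.3f}")
```

In this toy setting the two views are exactly low-rank and noise-free, so an aligned pair should land close together in the shared space. Real sentence features are sparse and noisy, and the choice of features, rank, and regularization would need to be tuned against held-out similarity judgments.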