Machine Learning Faculty Publications

Assigning Topics to Documents by Successive Projections

Olga Klopp, ESSEC Business School
Maxim Panov, Mohamed Bin Zayed University of Artificial Intelligence
Suzanne Sigalla, Institut Polytechnique de Paris
Alexandre B. Tsybakov, Institut Polytechnique de Paris

Document Type

Article

Publication Title

Annals of Statistics

Abstract

Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular, to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an important task in various fields, such as image analysis, e-commerce, social networks and population genetics. Since the number of topics is typically substantially smaller than the size of the corpus and of the dictionary, the methods of topic modeling can lead to a dramatic dimension reduction. We study the problem of estimating the topic-document matrix, which gives the topics distribution for each document in a given corpus, that is, we focus on the clustering aspect of the problem. We introduce an algorithm that we call Successive Projection Overlapping Clustering (SPOC) inspired by the successive projection algorithm for separable matrix factorization. This algorithm is simple to implement and computationally fast. We establish upper bounds on the performance of the SPOC algorithm for estimation of the topic-document matrix, as well as near matching minimax lower bounds. We also propose a method that achieves analogous results when the number of topics is unknown and provides an estimate of the number of topics. Our theoretical results are complemented with a numerical study on synthetic and semisynthetic data.
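The abstract names the successive projection algorithm for separable matrix factorization as the inspiration for SPOC. As a rough illustration only, the following is a minimal sketch of the classical successive projection algorithm on a noiseless toy matrix; it is not the authors' SPOC procedure. The function name `successive_projection` and the matrices `W`, `H` and `M` are hypothetical and chosen for this example.

```python
import numpy as np


def successive_projection(M, r):
    """Pick r column indices of M with the classical successive projection algorithm.

    Illustrative sketch only; this is not the SPOC algorithm from the paper.
    """
    R = np.array(M, dtype=float)  # working copy that holds the residual columns
    selected = []
    for _ in range(r):
        sq_norms = np.sum(R * R, axis=0)    # squared Euclidean norm of each column
        j = int(np.argmax(sq_norms))        # column with the largest residual norm
        selected.append(j)
        u = R[:, j] / np.sqrt(sq_norms[j])  # unit vector along the chosen column
        R = R - np.outer(u, u @ R)          # project all columns onto the complement of u
    return selected


# Toy separable matrix (assumed data): the first three columns of M are the
# "pure" columns of W, the last three are convex mixtures of them.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
mixtures = np.array([[0.5, 0.2, 1 / 3],
                     [0.5, 0.3, 1 / 3],
                     [0.0, 0.5, 1 / 3]])
H = np.hstack([np.eye(3), mixtures])
M = W @ H

picked = successive_projection(M, 3)
print(sorted(picked))
```

On this noiseless example the selected indices should be the three pure columns, in some order. With noisy data the selection is only approximate, which is the setting the paper's error bounds address.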

First Page

1989

Last Page

2014

DOI

10.1214/23-AOS2316

Publication Date

10-2023

Keywords

Adaptive estimation, Latent variable model, Minimax rate of convergence, Nonnegative matrix factorization, Topic model

Comments

Preprint version from arXiv

Uploaded on May 30, 2024

Recommended Citation

O. Klopp, M. Panov, S. Sigalla, and A. Tsybakov, "Assigning Topics to Documents by Successive Projections," Annals of Statistics, vol. 51, no. 5, pp. 1989-2014, Oct. 2023.

Included in

Computer Sciences Commons
