Optimization of scientific publications clustering with ensemble approach for topic extraction

Document Type

Article

Publication Title

Scientometrics

Abstract

The continually developing Internet generates a considerable amount of text data. When attempting to extract general topics or themes from a massive corpus of documents, dealing with such a large volume of text data in an unstructured format is a big problem. Text document clustering (TDC) is a technique for grouping texts based on their content similarity. Partitioning text collection based on the documents’ content significance is one of the most challenging tasks at TDC. This study proposes the Bare-Bones Based Salp Swarm Algorithm (BBSSA) to solve the problem of TDC. In addition, to extract the topics from the clusters, an ensemble approach for automatic topic extraction (TE) is proposed. The proposed BBSSA and the ensemble TE approach are tested using six standard benchmarks and six scientific publishing datasets from top QS ranking UAE universities. BBSSA’s findings are compared with sixteen well-known techniques, including eleven metaheuristic algorithms, such as the Whale Optimization Algorithm (WOA), Firefly Algorithm (FFA), Bat Algorithm (BAT), Harmony Search (HS), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Multi-Verse Optimizer (MVO), Grey Wolf Optimizer (GWO), Moth-Flame Optimization (MFO), Krill Herd Algorithm (KHA), SSA, and five clustering methods, such as K-means++, K-means, Density-based Spatial Clustering of Applications with Noise (DBSCAN), Spectral, and Agglomerative. The results of the ensemble TE approach are compared with those of seven well-known statistical methods, including Mutual Information (MI), TextRank (TR), Co-Occurrence Statistical Information-based Keyword Extraction (CSI), Term Frequency-Inverse Document Frequency (TF-IDF), most frequent based keyword extraction (TF), YAKE!, and RAKE. According to the experiments, the BBSSA outperforms all other approaches and is exceedingly competitive. The results also reveal that for most datasets, the proposed ensemble TE strategy outperforms all existing TE methods based on external metrics. Thus, the ensemble TE approach can be seen as a supplement to the other methods.

First Page

2819

Last Page

2877

DOI

10.1007/s11192-023-04674-w

Publication Date

3-21-2023

Keywords

Bare Bones, Ensemble method, Salp swarm algorithm, Scientific publications clustering, Topic extraction

Comments

IR Deposit conditions:

OA version (pathway b) Accepted version

12 month embargo

Published source must be acknowledged with citation

Must link to publisher version with DOI

Post-prints are subject to Springer Nature re-use terms

Share

COinS