Clustering of texts

- Python Automation and Machine Learning for ICs -
- An Online Book -

Python Automation and Machine Learning for ICs http://www.globalsino.com/ICs/

=================================================================================

Text clustering groups a collection of texts so that similar texts fall into the same cluster and dissimilar texts fall into different clusters. It can be applied at several levels of granularity: document, paragraph, sentence, or phrase. Grouping texts manually is labor-intensive and time-consuming, which motivates automating the task with machine learning.

A frequently used representation of textual data is Term Frequency-Inverse Document Frequency (TF-IDF). However, TF-IDF does not account for word position or context within a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model addresses this limitation by generating text representations that capture both word position and sentence context. Various feature extraction and normalization methods can additionally be applied to improve the representation obtained from BERT. To evaluate BERT-based representations, different clustering algorithms are employed: k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering.
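The TF-IDF limitation mentioned above can be made concrete with a small, standard-library-only sketch. The function names (`tfidf_vectors`, `cosine`), the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration; real projects would more likely use an existing library implementation. The sketch shows that two sentences containing the same words in a different order receive identical TF-IDF vectors.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = ["dog bites man", "man bites dog", "stock prices fall"]
vocab, vecs = tfidf_vectors(docs)

print(vecs[0] == vecs[1])         # same words, different order: identical vectors
print(cosine(vecs[0], vecs[2]))   # no shared words: similarity is zero
```

Because the vector only records how often each term occurs, "dog bites man" and "man bites dog" are indistinguishable, which is the gap that context-aware models such as BERT are meant to close.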

From a machine learning point of view, text clustering is an unsupervised learning method that works on unlabeled data [1]. For instance, text data available on the internet generally carry no labels. Various unsupervised learning algorithms have been used to perform text clustering, for example k-means clustering (KM) [4], eigenspace-based fuzzy c-means (EFCM) [5], deep embedded clustering (DEC) [6], and improved deep embedded clustering (IDEC) [7].
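As an illustration of the simplest algorithm in this list, the following is a minimal sketch of k-means (Lloyd's algorithm) using only the standard library. The toy vectors, the function names, and the deterministic farthest-point initialization are assumptions made for this example; they are not taken from the cited papers, which use other initialization schemes.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def farthest_point_init(vectors, k):
    """Deterministic initialization (assumed here for reproducibility):
    start from the first vector, then repeatedly add the vector that is
    farthest from its nearest already-chosen centroid."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(vectors,
                  key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, n_iter=100):
    """Lloyd's algorithm: return (labels, centroids)."""
    centroids = farthest_point_init(vectors, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical term-count vectors over the vocabulary
# ["chip", "wafer", "goal", "match"]; no labels are supplied.
vectors = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 1],
    [0, 1, 2, 3],
]
labels, centroids = kmeans(vectors, k=2)
print(labels)  # [0, 0, 1, 1]
```

The algorithm never sees a label; the two groups emerge only from distances between the vectors, which is what makes the method unsupervised.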

SentenceTransformers, as shown in Figure 4029, is a framework for state-of-the-art sentence, text and image embeddings in Python.

Figure 4029. SentenceTransformers.
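A minimal sketch of how sentence embeddings from SentenceTransformers might be obtained and compared is given below. It assumes the `sentence-transformers` package is installed and uses the pretrained model name `all-MiniLM-L6-v2` as an example choice. The `encoder` parameter and the helper names are hypothetical and exist only so the comparison logic can be tried without downloading a model.

```python
import math


def embed_sentences(sentences, model_name="all-MiniLM-L6-v2", encoder=None):
    """Return one embedding (list of floats) per sentence.

    `encoder` is a hypothetical hook: any callable mapping a list of
    sentences to a sequence of vectors. If omitted, a SentenceTransformer
    model is loaded (assumes the sentence-transformers package).
    """
    if encoder is None:
        # Imported lazily so this file loads without the package installed.
        from sentence_transformers import SentenceTransformer
        encoder = SentenceTransformer(model_name).encode
    return [[float(x) for x in vec] for vec in encoder(sentences)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def demo():
    """Embed a few example sentences and print their pairwise similarities."""
    sentences = [
        "The wafer failed the electrical test.",
        "Electrical testing rejected the wafer.",
        "The match ended with a late goal.",
    ]
    sim = similarity_matrix(embed_sentences(sentences))
    for sentence, row in zip(sentences, sim):
        print(sentence, [round(x, 2) for x in row])


# demo()  # uncomment to run; downloads the model on first use
```

The resulting embeddings can be passed to any clustering algorithm that accepts numeric vectors, such as k-means.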

Text clustering has been applied in many fields such as book organization, corpus summarization, document classification [2], and topic detection [3].

============================================

[1] Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006.
[2] Aggarwal CC, Zhai C. A survey of text clustering algorithms. In: Mining text data. New York, London: Springer; 2012. p. 77–128.
[3] Parlina A, Ramli K, Murfi H. Exposing emerging trends in smart sustainable city research using deep autoencoders-based fuzzy c-means. Sustainability. 2021;13(5):2876.
[4] Xiong C, Hua Z, Lv K, Li X. An improved k-means text clustering algorithm by optimizing initial cluster centers. In: 2016 7th International Conference on Cloud Computing and Big Data (CCBD). New York: IEEE; 2016. p. 265–268.
[5] Murfi H. The accuracy of fuzzy c-means in lower-dimensional space for topic detection. In: International Conference on Smart Computing and Communication. Berlin: Springer; 2018. p. 321–334.
[6] Xie J, Girshick R, Farhadi A. Unsupervised deep embedding for clustering analysis. In: International Conference on Machine Learning, PMLR. 2016; p. 478–487.
[7] Guo X, Gao L, Liu X, Yin J. Improved deep embedded clustering with local structure preservation. In: IJCAI; 2017. p. 1753–175.

=================================================================================