Clustering with empty clusters
DOI:
https://doi.org/10.24866/2311-2271/2024-2/1132
Keywords:
clustering, cluster analysis, empty clusters, imputations, machine learning
Abstract
Cluster analysis is widely used in scientific and applied fields that involve data analysis. It is an important tool in areas such as machine learning, image processing, and text recognition. The absence of observations is not always the absence of information; it is therefore assumed that gaps in the data, that is, "empty" clusters, carry information about the object of study just as real observations do. In this study, it is assumed that what goes unobserved is not just a single variable but a whole set of objects forming a separate cluster. In other words, what is missing from the data is not a cluster that does not exist, but potentially existing objects that are absent from our sample. An algorithm is proposed that identifies potential "empty" clusters in one-dimensional and two-dimensional data sets, taking into account their size and location in the feature space, which depend on the initial distribution of the sample. A method is implemented that fills these gaps and estimates how far the centroids of the initial clustering shift once an empty cluster is taken into account. The use of this approach to remove outliers from the data is demonstrated.
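The idea of locating a gap, filling it, and measuring the centroid shift can be illustrated for the one-dimensional case with a minimal sketch. It is not the algorithm proposed in the article, whose details are not given in the abstract. The gap rule (widest spacing at least `factor` times the median spacing), the even filling of the gap, and the helper names `find_largest_gap`, `fill_gap`, and `kmeans_1d` are all assumptions made for illustration.

```python
import numpy as np

def find_largest_gap(x, factor=3.0):
    """Return (lo, hi) of the widest gap between sorted neighbours, or None.

    Assumption: a gap counts as a candidate "empty" cluster when it is at
    least `factor` times the median spacing. Both the rule and `factor`
    are hypothetical, not taken from the article.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    spacing = np.diff(xs)
    i = int(np.argmax(spacing))
    if spacing[i] < factor * np.median(spacing):
        return None
    return xs[i], xs[i + 1]

def fill_gap(lo, hi, n):
    """Place n evenly spaced synthetic points strictly inside (lo, hi)."""
    return np.linspace(lo, hi, n + 2)[1:-1]

def kmeans_1d(x, k, iters=100):
    """Plain Lloyd iterations in 1D, started from quantile-based centroids."""
    x = np.asarray(x, dtype=float)
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Recompute centroids; keep the old value if a cluster lost all points.
        new = np.array([
            x[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return np.sort(centroids)

# Two observed groups with nothing between them (made-up toy data).
observed = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 9.5, 10.0])

gap = find_largest_gap(observed)          # candidate "empty" cluster
before = kmeans_1d(observed, k=2)         # initial clustering
filled = np.concatenate([observed, fill_gap(*gap, n=5)])
after = kmeans_1d(filled, k=2)            # same k, with the gap filled
shift = np.abs(after - before)            # displacement of each centroid

print("gap:", gap)
print("centroids before:", before)
print("centroids after:", after)
print("shift:", shift)
```

In this toy example the number of clusters is held at two on purpose, so that the shift measures how the original centroids move toward the filled region. Treating the filled region as a third cluster would be an equally reasonable choice under different assumptions.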