Effects of Constraint Quality and Distribution on Semi-Supervised Clustering Performance
Main Article Content
Abstract
Semi-supervised clustering enhances traditional clustering algorithms by leveraging prior knowledge, such as class labels or pairwise constraints, to improve clustering quality. Despite numerous proposed methods, the effects of constraint set quality and distribution—particularly imbalanced or erroneous constraints—remain poorly understood. This paper investigates these issues through systematic simulations, evaluating six state-of-the-art semi-supervised clustering algorithms under imbalanced or incorrect constraint conditions. Experiments on multiple real-world and synthetic benchmark datasets reveal that must-link constraints generally provide greater benefit than cannot-link constraints, while relying solely on cannot-link constraints may reduce performance. Furthermore, incorrect constraints—especially erroneous must-link constraints—can significantly degrade clustering performance. These findings underscore the critical importance of carefully designing constraint sets to achieve reliable and effective semi-supervised clustering, providing clear guidance on the conditions under which different algorithmic approaches perform best.
Downloads
Article Details
Data Availability Statement
The data presented in this study are available in UCI Machine Learning Repository at https://archive.ics.uci.edu, reference number [52], and in mlbench: Machine Learning Benchmark Problems at https://CRAN.R-project.org/package=mlbench, reference number [53].
Section

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Interdisciplinary Journal of AI, Machine Learning & Data Science (IJAIMLDS) are licensed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
This license allows others to share, copy, distribute, and adapt the work, provided that proper credit is given to the original author(s) and the source.
Authors retain copyright and grant Interdisciplinary Journal of AI, Machine Learning & Data Science (IJAIMLDS) the right of first publication.
How to Cite
References
[1] Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005. https://doi.org/10.1109/TNN.2005.845141. DOI: https://doi.org/10.1109/TNN.2005.845141
[2] Dongkuan Xu and Yingjie Tian. A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2):165–193, 2015. https://doi.org/10.1007/s40745-015-0040-1. DOI: https://doi.org/10.1007/s40745-015-0040-1
[3] Kiri L. Wagstaff. Value, cost, and sharing: Open issues in constrained clustering. In Proceedings of the International Workshop on Knowledge Discovery in Inductive Databases, pages 1–10. Springer International Publishing, 2006. https://doi.org/10.1007/978-3-540-75549-4_1. DOI: https://doi.org/10.1007/978-3-540-75549-4_1
[4] Derya Dinler and Mustafa Kemal Tural. A survey of constrained clustering. In Unsupervised Learning Algorithms, pages 207–235. Springer International Publishing, 2016. https://doi.org/10.1007/978-3-319-24211-8_9. DOI: https://doi.org/10.1007/978-3-319-24211-8_9
[5] Pierre Gançarski, Bruno Crémilleux, Germain Forestier, Thomas Lampert, et al. Constrained clustering: Current and new trends. In A Guided Tour of Artificial Intelligence Research, pages 447–484. Springer International Publishing, 2020. https://doi.org/10.1007/978-3-030-06167-8_14. DOI: https://doi.org/10.1007/978-3-030-06167-8_14
[6] Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schroedl, et al. Constrained K-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning, volume 1, pages 577–584. Morgan Kaufmann Publishers Inc., 2001. https://dl.acm.org/doi/10.5555/645530.655669.
[7] Kiri Wagstaff and Claire Cardie. Clustering with instance-level constraints. In Proceedings of the 17th International Conference on Machine Learning, pages 1103–1110. Morgan Kaufmann Publishers Inc., 2000. https://doi.org/10.5555/645529.658275.
[8] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine learning, page 11. Association for Computing Machinery, 2004. https://doi.org/10.1145/1015330.1015360. DOI: https://doi.org/10.1145/1015330.1015360
[9] Ian Davidson and S.S. Ravi. Clustering with constraints: Feasibility issues and the Kmeans algorithm. In Proceedings of the SIAM International Conference on Data Mining, pages 138–149. SIAM, 2005. https://doi.org/10.1137/1.9781611972757.13. DOI: https://doi.org/10.1137/1.9781611972757.13
[10] Dan Pelleg and Dorit Baras. K-means with large and noisy constraint sets. In Proceedings of the European Conference on Machine Learning, pages 674–682. Springer International Publishing, 2007. https://doi.org/10.1007/978-3-540-74958-5_67.
[11] Mohadeseh Ganji, James Bailey, and Peter J Stuckey. Lagrangian constrained clustering. In Proceedings of the SIAM International Conference on Data Mining, pages 288–296. SIAM, 2016. https://doi.org/10.1137/1.9781611974348.33. DOI: https://doi.org/10.1137/1.9781611974348.33
[12] Eric P Xing, Andrew Y Ng, Michael I Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Proceedings of the 15th International Conference on Neural Information Processing Systems, pages 521–528. MIT Press, 2002. https://doi.org/10.5555/2968618.2968683.
[13] Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall, and Greg Ridgeway. Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6(6):937–965, 2005. https://dl.acm.org/doi/10.5555/1046920.1088704.
[14] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Informationtheoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216. Association for Computing Machinery, 2007. https://doi.org/10.1145/1273496.1273523. DOI: https://doi.org/10.1145/1273496.1273523
[15] Guo-Jun Qi, Jinhui Tang, Zheng-Jun Zha, Tat-Seng Chua, and Hong-Jiang Zhang. An efficient sparse metric learning in high-dimensional space via ℓ1-penalized log-determinant regularization. In Proceedings of the 26th International Conference on Machine Learning, pages 841–848. Association for Computing Machinery, 2009. https://doi.org/10.1145/1553374.1553482. DOI: https://doi.org/10.1145/1553374.1553482
[16] Ian Davidson, SS Ravi, and Leonid Shamis. A SAT-based framework for efficient constrained clustering. In Proceedings of the SIAM International Conference on Data Mining, pages 94–105. SIAM, 2010. https://doi.org/10.1137/1.9781611972801.9. DOI: https://doi.org/10.1137/1.9781611972801.9
[17] Tias Guns, Siegfried Nijssen, and Luc De Raedt. k-pattern set mining under constraints. IEEE Transactions on Knowledge and Data Engineering, 25(2):402–418, 2011. https://doi.org/10.1109/TKDE.2011.204. DOI: https://doi.org/10.1109/TKDE.2011.204
[18] Abdelkader Ouali, Samir Loudni, Yahia Lebbah, Patrice Boizumault, Albrecht Zimmermann, and Lakhdar Loukil. Efficiently finding conceptual clustering models with integer linear programming. In Proceedings of the 25th International Joint Conferences on Artificial Intelligence, pages 647–654. AAAI Press, 2016. https://doi.org/10.5555/3060621.3060712.
[19] Sean Gilpin and Ian Davidson. A flexible ILP formulation for hierarchical clustering. Artificial Intelligence, 244:95–109, 2017. https://doi.org/10.1016/j.artint.2015.05.009. DOI: https://doi.org/10.1016/j.artint.2015.05.009
[20] Germain Forestier, Pierre Gançarski, and Cédric Wemmert. Collaborative clustering with background knowledge. Data & Knowledge Engineering, 69(2):211–228, 2010. https://doi.org/10.1016/j.datak.2009.10.004. DOI: https://doi.org/10.1016/j.datak.2009.10.004
[21] Muna Al-Razgan and Carlotta Domeniconi. Clustering ensembles with active constraints. In Applications of Supervised and Unsupervised Ensemble Methods, pages 175–189. Springer International Publishing, 2009. https://doi.org/10.1007/978-3-642-03999-7_10. DOI: https://doi.org/10.1007/978-3-642-03999-7_10
[22] Ashraf Mohammed Iqbal, Abidalrahman Moh’d, and Zahoor Khan. Semi-supervised clustering ensemble by voting. arXiv preprint arXiv:1208.4138, pages 1–5, 2012.
[23] Wenchao Xiao, Yan Yang, Hongjun Wang, Tianrui Li, and Huanlai Xing. Semi-supervised hierarchical clustering ensemble and its application. Neurocomputing, 173:1362–1376, 2016. https://doi.org/10.1016/j.neucom.2015.09.009. DOI: https://doi.org/10.1016/j.neucom.2015.09.009
[24] Tianshu Yang, Nicolas Pasquier, and Frédéric Precioso. Semi-supervised consensus clustering based on closed patterns. Knowledge-Based Systems, 235:107599, 2022. https://doi.org/https://doi.org/10.1016/j.knosys.2021.107599. DOI: https://doi.org/10.1016/j.knosys.2021.107599
[25] Yazhou Ren, Kangrong Hu, Xinyi Dai, Lili Pan, Steven CH Hoi, and Zenglin Xu. Semisupervised deep embedded clustering. Neurocomputing, 325:121–130, 2019. https://doi.org/10.1016/j.neucom.2018.10.016. DOI: https://doi.org/10.1016/j.neucom.2018.10.016
[26] Hongjing Zhang, Tianyang Zhan, Sugato Basu, and Ian Davidson. A framework for deep constrained clustering. Data Mining and Knowledge Discovery, pages 1–28, 2021. https://doi.org/10.1007/s10618-020-00734-4. DOI: https://doi.org/10.1007/s10618-020-00734-4
[27] M Eduardo Ares, Javier Parapar, and Álvaro Barreiro. An experimental study of constrained clustering effectiveness in presence of erroneous constraints. Information Processing & Management, 48(3):537–551, 2012. https://doi.org/10.1016/j.ipm.2011.08.006. DOI: https://doi.org/10.1016/j.ipm.2011.08.006
[28] Xiatian Zhu, Chen Change Loy, and Shaogang Gong. Constrained clustering: Effective constraint propagation with imperfect oracles. In Proceedings of the 13th International Conference on Data Mining, pages 1307–1312. IEEE, 2013. https://doi.org/10.1109/ICDM.2013.45. DOI: https://doi.org/10.1109/ICDM.2013.45
[29] Blaine Nelson and Ira Cohen. Revisiting probabilistic models for clustering with pair-wise constraints. In Proceedings of the 24th International Conference on Machine Learning, pages 673–680. Association for Computing Machinery, 2007. https://doi.org/10.1145/1273496.1273581. DOI: https://doi.org/10.1145/1273496.1273581
[30] Hongjing Zhang, Sugato Basu, and Ian Davidson. A framework for deep constrained clustering-algorithms and advances. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 57–72. Springer International Publishing, 2019. https://doi.org/10.1007/978-3-030-46150-8_4. DOI: https://doi.org/10.1007/978-3-030-46150-8_4
[31] Tom Coleman, James Saunderson, and Anthony Wirth. Spectral clustering with inconsistent advice. In Proceedings of the 25th International Conference on Machine Learning, pages 152–159. Association for Computing Machinery, 2008. https://doi.org/10.1145/1390156.1390176. DOI: https://doi.org/10.1145/1390156.1390176
[32] Erliang Zeng, Chengyong Yang, Tao Li, and Giri Narasimhan. On the effectiveness of constraints sets in clustering genes. In Proceedings of the 7th International Symposium on BioInformatics and BioEngineering, pages 79–86. IEEE, 2007. https://doi.org/10.1109/BIBE.2007.4375548. DOI: https://doi.org/10.1109/BIBE.2007.4375548
[33] M. Eduardo Ares, Javier Parapar, and Alvaro Barreiro. Improving text clustering with social tagging. In Proceedings of the International AAAI Conference on Web and Social Media, volume 5, pages 430–433, 2021. https://doi.org/10.1609/icwsm.v5i1.14148. DOI: https://doi.org/10.1609/icwsm.v5i1.14148
[34] Ronglu Yan, J. Zhang, Jiangshuai Yang, and Alexander Hauptmann. Learning with pairwise constraints for video object classification, pages 397–430. Chapman and Hall/CRC, 2008. https://doi.org/10.1201/9781584889977. DOI: https://doi.org/10.1201/9781584889977.ch17
[35] Dan Pelleg and Dorit Baras. K-means with large and noisy constraint sets. In Proceedings of the European Conference on Machine Learning, pages 674–682. Springer, 2007. https://doi.org/10.1007/978-3-540-74958-5_67. DOI: https://doi.org/10.1007/978-3-540-74958-5_67
[36] Chen Gong, Keren Fu, Qiang Wu, Enmei Tu, and Jie Yang. Semi-supervised classification with pairwise constraints. Neurocomputing, 139:130–137, 2014. https://doi.org/10.1016/j.neucom.2014.02.053. DOI: https://doi.org/10.1016/j.neucom.2014.02.053
[37] Hui Liu, Yuheng Jia, Junhui Hou, and Qingfu Zhang. Imbalance-aware pairwise constraint propagation. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1605–1613. ACM, 2019. https://doi.org/10.1145/3343031.3350968. DOI: https://doi.org/10.1145/3343031.3350968
[38] Stephen Mussmann, Robin Jia, and Percy Liang. On the importance of adaptive data collection for extremely imbalanced pairwise tasks. In Empirical Methods in Natural Language Processing, pages 3400–3413. Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.305. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.305
[39] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009. https://doi.org/10.1109/TKDE.2008.239. DOI: https://doi.org/10.1109/TKDE.2008.239
[40] Bartosz Krawczyk. Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence, 5(4):221–232, 2016. https://doi.org/10.1007/s13748-016-0094-0. DOI: https://doi.org/10.1007/s13748-016-0094-0
[41] S. Layaq and Dr. B. Manjula. A recapitulation of imbalanced data. International Journal of Innovative Technology and Exploring Engineering, 9(3):452–455, 2020. https://doi.org/10.35940/ijitee.b8120.019320. DOI: https://doi.org/10.35940/ijitee.B8120.019320
[42] Eric Bair. Semi-supervised clustering methods. Wiley interdisciplinary reviews. Computational statistics, 5(5):349–361, 2013. https://doi.org/10.1002/wics.1270. DOI: https://doi.org/10.1002/wics.1270
[43] Korinna Bade and Andreas Nurnberger. Personalized hierarchical clustering. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pages 181–187, 2006. https://doi.org/10.1109/WI.2006.131. DOI: https://doi.org/10.1109/WI.2006.131
[44] Ian Davidson, Kiri L Wagstaff, and Sugato Basu. Measuring constraint-set utility for partitional clustering algorithms. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, pages 115–126. Springer International Publishing, 2006. https://doi.org/10.1007/11871637_15. DOI: https://doi.org/10.1007/11871637_15
[45] Ian Davidson and SS Ravi. Identifying and generating easy sets of constraints for clustering. In Proceedings of the 21st National Conference on Artificial Intelligence, volume 1, pages 336–341. AAAI Press, 2006. https://doi.org/10.5555/1597538.1597593.
[46] Thiago F. Covoes, Eduardo R. Hruschka, and Joydeep Ghosh. A study of K-means-based algorithms for constrained clustering. Intelligent Data Analysis, 17(3):485–505, 2013. https://doi.org/10.5555/2595566.2595574. DOI: https://doi.org/10.3233/IDA-130590
[47] Noam Shental, Tomer Hertz, Daphna Weinshall, and Misha Pavel. Adjustment learning and relevant component analysis. In Proceedings of the European Conference on Computer Vision, pages 776–790. Springer International Publishing, 2002. https://doi.org/10.1007/3-540-47979-1_52. DOI: https://doi.org/10.1007/3-540-47979-1_52
[48] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967. https://scispace.com/pdf/some-methods-for-classification-and-analysis-of-multivariate-4pswti19oz.pdf.
[49] Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning, pages 27–34. Morgan Kaufmann Publishers Inc., 2002. https://dl.acm.org/doi/10.5555/645531.656012.
[50] Liu Yang and Rong Jin. Distance Metric Learning: A Comprehensive Survey. Michigan State University, 2006. https://api.semanticscholar.org/CorpusID:850937.
[51] Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2012. https://doi.org/10.1561/2200000019. DOI: https://doi.org/10.1561/2200000019
[52] Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu
[53] Friedrich Leisch and Evgenia Dimitriadou. mlbench: Machine learning benchmark problems. Available online: https://CRAN.R-project.org/package=mlbench R package version 2.1-5.
[54] Jakub Svehla. Active semi-supervised clustering. Available online: https://github.com/datamole-ai/active-semi-supervised-clustering, 2020.
[55] William de Vazelhes, C. J. Carey, Yuan Tang, Nathalie Vauquier, and Aurélien Bellet. metric-learn: Metric Learning Algorithms in Python. Journal of Machine Learning Research, 21(138):1–6, 2020. https://doi.org/10.48550/arXiv.1908.04710.
[56] Tarald Kvålseth. On Normalized Mutual Information: Measure derivations and properties. Entropy, 19(11):631, 2017. https://doi.org/10.3390/e19110631. DOI: https://doi.org/10.3390/e19110631
[57] Christian Wiwie, Jan Baumbach, and Richard Röttger. Comparing the performance of biomedical clustering methods. Nature Methods, 12(11):1033–1038, 2015. https://doi.org/10.1038/nmeth.3583. DOI: https://doi.org/10.1038/nmeth.3583