Supervised clustering with SHAP values

Student: Rodrigo Queirós Conceição


Abstract
In recent years, data has grown at a fast rate. Beyond growing in size, data is also becoming far more complex than it used to be. As companies shift to data-driven environments, this complexity hinders the analysis of data and the extraction of value from it. As a result, traditional methods are becoming obsolete as their performance decreases, while machine learning and deep learning models are becoming more complex in order to reach the desired accuracy. This work proposes an approach that recognizes complex relationships and identifies groups that are not visible at first glance, while keeping the methods used fully interpretable. It combines a black-box model with SHAP values to generate previously unknown clusters from the model's explanations. The resulting clusters combine the multiple local explanations that SHAP values offer and are easily interpretable, since the values being clustered correspond to the feature importance assigned by the model. To implement this approach, a dataset containing the properties of benign and malware samples, designed for malware detection tasks, was used. The results show that combining SHAP values with XGBoost makes it possible to generate new clusters that were previously hidden and unobtainable with traditional approaches. These clusters are highly interpretable, as they derive from SHAP values and are backed by a supervised setting.
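The idea of clustering samples in explanation space rather than in raw feature space can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration and not the setup used in this work: `model` is a hypothetical hand-written scoring function standing in for a trained XGBoost classifier, Shapley values are computed by exact enumeration over feature subsets instead of with the SHAP library, k-means is an assumed choice of clustering algorithm (the abstract does not name one), and the six data rows are invented.

```python
from itertools import combinations
from math import factorial


def model(x):
    # Hypothetical black-box score (stand-in for a trained XGBoost model).
    return x[0] * x[1] + 2.0 * x[2]


def shapley_values(f, x, background):
    """Exact Shapley values of f at x; absent features are taken from background rows."""
    n = len(x)

    def value(subset):
        # Mean model output when only the features in `subset` are fixed to x.
        total = 0.0
        for b in background:
            z = [x[j] if j in subset else b[j] for j in range(n)]
            total += f(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                contrib += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(contrib)
    return phi


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic start (the first k points)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy samples with three features each (illustrative only).
X = [
    [3.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
    [3.2, 2.8, 0.1],
    [0.1, 0.2, 2.2],
    [2.9, 3.1, 0.0],
    [0.2, 0.1, 1.9],
]

# One row of per-feature contributions per sample: the "explanation space".
shap_matrix = [shapley_values(model, x, X) for x in X]

# Cluster the explanations, not the raw feature values.
labels = kmeans(shap_matrix, k=2)

for x, phi, lab in zip(X, shap_matrix, labels):
    print(lab, [round(v, 2) for v in phi])
```

Each row of `shap_matrix` sums to the model output for that sample minus the mean output over the background data, so cluster centers can be read directly as typical per-feature contributions to the prediction. With a real tree ensemble, exact enumeration would be replaced by a dedicated explainer, since its cost grows exponentially with the number of features.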


Master's final work