Assessing pre-training bias in Health data and estimating its impact on machine learning algorithms
Date
2023
Author
Advisor
Academic level
Graduation
Subject
Abstract
Machine learning (ML) is a rapidly growing field of computer science that has found many fruitful applications in several domains, including Health. However, ML is also highly susceptible to bias, which raises concerns about its potential to cause harm. Bias can come from various sources, such as the design of the algorithm, the selection of data, and the strategies underlying data collection. Thus, data scientists must be vigilant in ensuring that the developed models do not perpetuate social disparities based on gender, religion, sexual orientation, or ethnicity. This work aims to explore pre-training bias metrics to investigate the existence of bias in Health data. The metrics also analyze how protected attributes and their correlated features are distributed for the predicted class against the target attributes, giving insight into how the trained model may produce biased predictions. Our goal is to evaluate pre-training bias metrics in three different health datasets and assess the impact of bias on the performance of ML algorithms. Our experiments involve artificially modified versions of the dataset, both to increase the values of the pre-training bias metrics in favor of privileged classes and to lower the values of these metrics, reducing the discrepancy in the data and the risk of bias. We trained models using four supervised learning algorithms: Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors. Each algorithm was tested on six to ten different training sets, with varying random seeds used to split the data in each iteration. We evaluated the performance of the trained models using the same test sets for every dataset variation, reporting the Accuracy and F1-Score.
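As an illustration of what a pre-training bias metric measures, the sketch below computes two commonly used ones on a toy dataset: class imbalance (CI) and the difference in proportions of labels (DPL). The abstract does not name the specific metrics used in the work, so the choice of CI and DPL, the function names, and the toy data are all assumptions made here for illustration only.

```python
def class_imbalance(groups, privileged):
    """CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to 1.

    n_a counts records in the privileged group, n_d all other records.
    Assumes the dataset is non-empty.
    """
    n_a = sum(1 for g in groups if g == privileged)
    n_d = len(groups) - n_a
    return (n_a - n_d) / (n_a + n_d)


def label_proportion_difference(groups, labels, privileged, positive=1):
    """DPL = q_a - q_d, the gap in positive-label rates between groups.

    Assumes both groups contain at least one record.
    """
    n_a = sum(1 for g in groups if g == privileged)
    n_d = len(groups) - n_a
    pos_a = sum(1 for g, y in zip(groups, labels)
                if g == privileged and y == positive)
    pos_d = sum(1 for g, y in zip(groups, labels)
                if g != privileged and y == positive)
    return pos_a / n_a - pos_d / n_d


# Hypothetical toy data: a protected attribute and a binary target.
toy_groups = ["M"] * 6 + ["F"] * 4
toy_labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

ci = class_imbalance(toy_groups, privileged="M")
dpl = label_proportion_difference(toy_groups, toy_labels, privileged="M")
print(f"CI = {ci:.3f}, DPL = {dpl:.3f}")
```

Values near zero suggest the two groups are similarly represented and similarly labelled; values far from zero indicate a skew that a trained model may pick up.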
By analyzing pre-training bias metrics and the predictive performance of models, this study demonstrates that performance can be significantly affected by skewed data distribution and that the performance metrics may sometimes mask the bias incorporated by the algorithm. In some cases, classification errors may be more pronounced in one group (e.g., the disadvantaged group), accentuating specific errors such as false positives and false negatives, which may have different implications depending on the clinical prediction problem under analysis.
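To make concrete how an aggregate score can hide group-specific errors, the following sketch breaks false positive and false negative rates down by group. It is a minimal illustration, not the procedure used in the work: the function name, the group labels, and the toy predictions are hypothetical.

```python
def group_error_rates(y_true, y_pred, groups):
    """Return the false positive and false negative rate for each group.

    Labels are assumed binary (1 = positive, 0 = negative). A rate is
    NaN when a group has no negatives (FPR) or no positives (FNR).
    """
    counts = {}
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"pos": 0, "neg": 0, "fp": 0, "fn": 0})
        if t == 1:
            c["pos"] += 1
            if p == 0:
                c["fn"] += 1
        else:
            c["neg"] += 1
            if p == 1:
                c["fp"] += 1
    nan = float("nan")
    return {
        g: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else nan,
            "fnr": c["fn"] / c["pos"] if c["pos"] else nan,
        }
        for g, c in counts.items()
    }


# Hypothetical predictions for two groups of four patients each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
rates = group_error_rates(y_true, y_pred, groups)
print("overall accuracy:", accuracy)
for g, r in rates.items():
    print(g, r)
```

In this toy case every error falls on group B, which a single overall accuracy figure does not reveal.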
Institution
Universidade Federal do Rio Grande do Sul. Instituto de Informática. Curso de Ciência da Computação: Ênfase em Ciência da Computação: Bacharelado.
Collections
This item is licensed under a Creative Commons License