Assessment of Factors Influencing the Survival of Breast Cancer Patients using a Machine Learning Approach
Shivani Motarwar1, Dixshant Kumar Jha2

1Shivani Motarwar, School of Electronics Engineering, Vellore Institute of Technology, Chennai (Tamil Nadu), India.
2Dixshant Kumar Jha*, School of Electrical Engineering, Vellore Institute of Technology, Chennai (Tamil Nadu), India.
Manuscript received on January 09, 2022. | Revised Manuscript received on February 14, 2022. | Manuscript published on February 28, 2022. | PP: 80-84 | Volume-11, Issue-3, January 2022 | Retrieval Number: 100.1/ijitee.C97130111322 | DOI: 10.35940/ijitee.C9713.0111322
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Breast cancer is one of the deadliest diseases, claiming approximately 627,000 lives worldwide in 2018–2019. Early, automated prediction of the disease can therefore help clinicians treat it at an early stage and substantially reduce the risk of death. In the present study, the Breast Cancer Wisconsin (Diagnostic) Data Set was taken from the University of California Irvine (UCI) Machine Learning Repository. The dataset (n=699) contains 30 predictor variables and one dependent variable, the type of cancer tissue (benign or malignant). To predict the type of cancer tissue present in the patient, models were built using 1) Logistic Regression (LR), 2) Decision Tree Classifier (DTC), 3) Random Forest Classifier (RFC), 4) K Nearest Neighbor (KNN), 5) Support Vector Machine (SVM), and 6) AdaBoost Classifier (ABC). To improve model accuracy, a correlation matrix was used to select the top 8 features. The Synthetic Minority Oversampling Technique (SMOTE) was then applied to address class imbalance, and accuracy was compared before and after SMOTE. Precision, recall, and F1 score were the performance metrics considered in selecting the best model. The results show that, with SMOTE applied to each of the six algorithms, KNN gives the highest accuracy, 95.321%. SMOTE improves the accuracy of some algorithms but adversely affects the performance of others. This research may be developed into practical tools that the medical field can use to predict the stage of disease more accurately for better treatment management.
Keywords: Breast Cancer, Database, K Nearest Neighbors, Machine Learning, Random Forest, SMOTE.
Scope of the Article: Machine Learning
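
The three preprocessing and evaluation steps named in the abstract (correlation-based feature selection, SMOTE-style oversampling, and precision/recall/F1 scoring) can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' code. The function names, the toy data, and the choice to rank features by absolute Pearson correlation with the label are assumptions made here for illustration, and the abstract does not say exactly how the correlation matrix was used. The oversampler is a simplified version of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours); in practice, libraries such as scikit-learn and imbalanced-learn would normally be used instead.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def top_k_features(X, y, k):
    """Indices of the k features with the largest |correlation| to the label.

    Assumption: ranking against the label; the paper may have used the
    correlation matrix differently.
    """
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def smote_like(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: build n_new synthetic minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data (hypothetical): 3 features, label 1 is the minority class.
    X = [[1.0, 5.0, 0.3], [2.0, 4.0, 0.1], [3.0, 6.0, 0.4],
         [4.0, 5.5, 0.2], [5.0, 4.5, 0.3], [6.0, 5.0, 0.1]]
    y = [0, 0, 0, 0, 1, 1]
    print("selected feature indices:", top_k_features(X, y, 2))
    minority = [row for row, label in zip(X, y) if label == 1]
    print("synthetic samples:", smote_like(minority, n_new=2))
    print("precision, recall, F1:",
          precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a setup like the one described, the oversampler would typically be applied to the training split only, so that synthetic samples do not leak into the data used to compute the reported metrics; whether the study did this is not stated in the abstract.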