Using Pyspark Environment for Solving a Big Data Problem: Searching for Supersymmetric Particles
Mourad Azhari1, Abdallah Abarda2, Badia Ettaki3, Jamal Zerouaoui4, Mohamed Dakkon5

1Mourad Azhari*, Laboratory of Engineering Sciences and Modeling, Faculty of Sciences, Ibn Tofail University, Kenitra, Morocco.
2Abdallah Abarda, Laboratoire de Modélisation Mathématiques et de Calculs Economiques, FSJES, Université Hassan 1er, Settat, Morocco.
3Badia Ettaki, Laboratory of Research in Computer Science, Data Sciences and Knowledge Engineering, Department of Data, Content and Knowledge Engineering, School of Information Sciences, Rabat, Morocco.
4Jamal Zerouaoui, Laboratory of Engineering Sciences and Modeling, Faculty of Sciences, Ibn Tofail University, Kenitra, Morocco.
5Mohamed Dakkon, Département de Statistique et Informatique de Gestion, Université Abdelmalek Essaadi, Tétouan, Morocco.
Manuscript received on April 20, 2020. | Revised Manuscript received on April 30, 2020. | Manuscript published on May 10, 2020. | PP: 541-546 | Volume-9 Issue-7, May 2020. | Retrieval Number: G5308059720/2020©BEIESP | DOI: 10.35940/ijitee.G5308.059720
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (

Abstract: Supersymmetry theory predicts that every particle in the Standard Model has a superpartner particle with a different mass. Classifying supersymmetric particles in high-energy physics is a major challenge for physicists. This paper addresses this big data classification problem using the Apache Spark environment and its “MLlib” library. It evaluates the performance of machine learning methods on a large dataset, the “Susy” dataset from the UCI Machine Learning Repository. Performance is measured using three metrics: accuracy, Area Under Curve (AUC), and training Computation Time (CT). The results are promising: the Gradient Boosted Tree (GBT) classifier achieves a high accuracy score (79%), while the Logistic Regression (LR) algorithm achieves a good AUC score (86%).
Keywords: Machine Learning methods, performance, Spark Environment, Pyspark, Supersymmetric Particles.
Scope of the Article: Machine Learning
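
The three metrics named in the abstract (accuracy, AUC, and training computation time) can be illustrated with a minimal sketch. This is not the authors' code and does not use Spark: it is plain Python on a small hypothetical set of labels and scores, and the helper names (`accuracy`, `auc`, `fit_threshold`, `timed`) are assumptions introduced only for illustration. The stand-in "model" is a single score threshold, chosen so that the timing wrapper has something to measure.

```python
import time


def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve in its rank (Mann-Whitney) form:
    the probability that a random positive example scores higher
    than a random negative one, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fit_threshold(labels, scores):
    """Hypothetical stand-in for training: pick the score threshold
    that gives the best accuracy on the given data."""
    def acc_at(t):
        return accuracy(labels, [1 if s >= t else 0 for s in scores])
    return max(sorted(set(scores)), key=acc_at)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical toy data: 1 = signal, 0 = background.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2]

threshold, train_seconds = timed(fit_threshold, labels, scores)
predictions = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy at 0.5:", accuracy(labels, predictions))
print("AUC:", auc(labels, scores))
print("fitted threshold:", threshold)
print("training time (s):", train_seconds)
```

Accuracy depends on the decision threshold while AUC depends only on how the scores rank the examples, which is one reason two classifiers can be ordered differently by the two metrics. In a Spark setting, the analogous quantities would normally come from the evaluators in `pyspark.ml.evaluation` (for example `BinaryClassificationEvaluator` for the area under the ROC curve) rather than from hand-written loops like these.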