
Technology is one of the driving factors of human progress, and it is undeniable that today's technological advances are growing rapidly. In many parts of society, technology has become a tool for obtaining information: with it, information is easier and faster to obtain, and some people process that information into something useful. Data sources will only continue to grow over time. Every day, data is generated from sources such as social media posts, product reviews, digital images and video, purchase transaction records, and more.
A very large data source is called Big Data. The data may be structured or unstructured, which complicates Big Data analysis. The Big Data problem is commonly divided into three characteristics, namely Volume, Velocity, and Variety (the 3Vs). These three characteristics challenge any system implementing a Machine Learning framework. Thus, a strong Machine Learning framework, strategy, and environment are needed to analyze large data properly [1].
Many technologies can be used to perform large data processing. Typically, data processing is done by distributing data storage and computing across multiple computers. One of the technologies to handle large data processing is Hadoop.
Hadoop is a platform for processing Big Data. It is an open-source software project that allows distributed processing of large data sets across clusters of commodity servers. Hadoop is designed for analytic workloads and to scale from a single server to thousands of machines, with a very high degree of fault tolerance [2].
Within the Hadoop ecosystem, there is a well-known architecture, the MapReduce framework. The framework lets a user specify an operation to be applied to a huge data set; it then divides the problem and the data and runs the pieces in parallel. However, MapReduce has some important flaws. It incurs high overhead when launching a job, and it writes the intermediate and final results of each computation to disk. This makes Hadoop relatively ill-suited for iterative or low-latency use cases [3].
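The divide-and-run-in-parallel idea behind MapReduce can be illustrated with the classic word-count example. The sketch below is a minimal, single-process analogy in plain Python, not Hadoop's actual Java API: each "split" stands in for a data block that would live on a different node, and the shuffle step imitates the framework's grouping of values by key between the map and reduce phases.

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in one data split.
    return [(word, 1) for word in split.split()]

def shuffle_phase(mapped_pairs):
    # Shuffle: group emitted values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the grouped counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# In a real cluster, each split would be processed on a separate node.
splits = ["big data big", "data spark"]
mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle_phase(mapped))
print(result)  # {'big': 2, 'data': 2, 'spark': 1}
```

Note that in real Hadoop the output of `reduce_phase` would be written back to disk, which is exactly the overhead for iterative jobs described above.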
Along with advances in technology, frameworks for data computing also keep developing. One of the most actively developed frameworks is Apache Spark, which was created to solve some of the problems of earlier frameworks such as Hadoop.
Apache Spark is a distributed, memory-based computing framework. It is optimized for low-latency tasks and stores intermediate data and results in memory. Therefore, Spark is suitable for machine learning and iterative applications.
Spark is a general distributed computing framework based on the Hadoop MapReduce model. It absorbs the advantages of Hadoop MapReduce but, unlike MapReduce, the intermediate and output results of Spark jobs can be stored in memory, an approach called Memory Computing. Memory Computing improves the efficiency of data computation, so Spark is better suited to iterative applications such as Data Mining and Machine Learning [3].
Machine Learning Library (MLlib) is an Apache Spark component that provides common machine learning algorithms and utilities. This research focuses on two MLlib classification algorithms for prediction: Naive Bayes (NB) and Linear Support Vector Machine (LSVM). Naive Bayes and SVM are among the best techniques for classifying data and can be regarded as baseline learning methods [4].
Naive Bayes is a linear classifier based on Bayes' theorem. It creates simple, well-performing models and assumes the features in the dataset are mutually independent, hence the term "naive" [5]. SVM, in turn, is a learning algorithm that performs classification by finding the hyperplane that maximizes the margin between two classes; the points nearest to the hyperplane are the support vectors that determine the maximum margin [6].
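The two classifiers can be summarized by their standard textbook formulations (these are the usual definitions, not equations taken from the cited works):

```latex
% Naive Bayes: choose the class c that maximizes the posterior,
% assuming features x_1, ..., x_n are conditionally independent given c.
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)

% Linear SVM: for a separating hyperplane w^{T} x + b = 0, maximizing
% the margin 2 / \lVert w \rVert is equivalent to the constrained problem
\min_{w,\, b} \; \tfrac{1}{2} \lVert w \rVert^{2}
\quad \text{s.t.} \quad y_j \,(w^{T} x_j + b) \ge 1, \quad j = 1, \dots, m
```

The support vectors mentioned above are exactly the training points $x_j$ for which the constraint holds with equality.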
Based on previous research, the Naive Bayes and SVM algorithms remain prominent candidates for future work. Both algorithms are suitable for classification tasks such as Text Mining, Opinion Mining, and Sentiment Analysis. In terms of pairing classification algorithms with a framework, Apache Spark has shown better results than several other frameworks. Previous studies are summarized in the second part of the Related Work section, Classification Using Spark and Non-Spark.
Sentiment analysis was used in this study. Sentiment analysis can be seen as an application of text analysis, computational linguistics, and natural language processing. It involves several processes, namely extracting, preprocessing, understanding, classifying, and presenting the sentiments expressed by users. Sentiment analysis generally classifies the polarity of a text as positive, negative, or neutral. It also covers extraction of subjectivity, prediction of intensity, and classification of emotions, and it is carried out at the term, sentence, paragraph, and document levels, as well as extended to other aspects [26].
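To make polarity classification concrete, the sketch below labels a text as positive, negative, or neutral using a tiny hand-made lexicon. The word list and scores are invented purely for illustration; the actual study classifies reviews with the trained NB and LSVM models, not a lexicon.

```python
# Toy sentiment lexicon; words and scores are illustrative only.
LEXICON = {"good": 1, "great": 1, "love": 1,
           "bad": -1, "slow": -1, "crash": -1}

def polarity(text):
    # Sum per-word scores and map the total to a polarity label.
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("great app love it"))  # positive
print(polarity("slow and it crash"))  # negative
print(polarity("it is an app"))       # neutral
```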
The data used in this research comprises 122,138 rows. The dataset was taken from the Google Play Store by collecting Indonesian-language reviews of BlackBerry Messenger (BBM), saved in .csv format. From these reviews, we can perform sentiment analysis of the BBM application.
The purpose of this research is to compare the Naive Bayes and SVM classification algorithms under the Apache Spark framework. Evaluation is done using Precision, Recall, F-Measure, and the ROC Curve. The comparative results can also serve as a reference for further studies when choosing a classification algorithm. In addition, this research examines how well and how fast the Apache Spark framework processes large-scale data.
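Precision, Recall, and F-Measure follow their standard definitions over confusion-matrix counts. The sketch below computes them from true-positive, false-positive, and false-negative counts; the example numbers are made up for illustration and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp)
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn)
    # F-Measure (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 80 true positives, 20 false positives,
# 10 false negatives.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.889 0.842
```

The ROC Curve used in the evaluation is built analogously, by plotting the true-positive rate against the false-positive rate across decision thresholds.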
The remainder of the paper is structured as follows: Section 2 presents the related work and the concepts that support this research. Section 3 presents the methodology and the flow of the research. Section 4 presents the results and discussion. Finally, Section 5 presents the conclusions of the research.
