1Source: http://hadoop.apache.org/ accessed 16 Feb 2018.
2Source: https://spark.apache.org/docs/2.1.0/index.html accessed 16 Feb
2018.
ABSTRACT
Flight delays have negative effects on passengers, airlines, and airports. It is now possible to predict whether a flight will be delayed based on the statistics of past flights. This paper focuses on passenger satisfaction, unlike most previous research, which is concerned with airlines and airports. In this work a new Dynamic Double Delay Flight Predicting Web (3DFPW) model is created to give a passenger the predicted delay status, and its probability, at the origin and destination airports for a given airline through a website, even before booking an airline ticket. Most previous studies focused on either departure delay or arrival delay only; this work addresses both delays at the same time. Spark is used as a cluster computing layer over a Hadoop cluster and is driven from R through the SparkR library. This work answers two questions. First, which classification algorithm from SparkR MLlib is the best to use? Second, which SparkR caching level gives the best performance and robustness, and why?
Keywords: Machine Learning, Big Data, SparkR, Caching, Flight Delay, R, Classification, Prediction, Naïve Bayes.
1- INTRODUCTION
Delays in air travel can be very expensive for both passengers and airlines. While many delays caused by weather or mechanical failures are unpredictable, it may be possible to predict that a flight will be delayed based on the statistics of past flights. Flight delays have adverse effects, especially economic ones, on passengers, airlines and airports. Delay estimates can improve the tactical and operational decisions of airport and airline executives, and can alert passengers so that they can adjust their plans [1].
The passenger is the paying client. A passenger who is dissatisfied with an airport or an airline will stop using it and will choose only flights that are satisfactory. Airlines and airports therefore compete to offer flights that satisfy the passenger, which is why this paper focuses on passenger satisfaction, whereas most previous research is concerned with airlines and airports. A new Dynamic Double Delay Flight Predicting Web (3DFPW) model is proposed to give a passenger the delay prediction and its probability for a flight before booking an airline ticket. Most previous studies focused on either departure delay or arrival delay only. This work covers both departure and arrival delays at the same time to give the passenger full information about delays at the origin airport and the destination airport. The 3DFPW model is built on big data machine learning prediction algorithms, so it needs to learn from many years of historical flights to make a good prediction. Therefore, a Hadoop cluster is used to store the big data and to support fast prediction. Spark is used as a cluster computing layer over Hadoop and is driven through the SparkR library from RStudio.
SparkR is a distributed system. It is simpler than Hadoop and its code is easier to read. Algorithms built on it are fast and scale well because the data is held in Spark memory. SparkR can run faster for projects with large-scale data files that require parallel solutions [2].
To implement a 3DFPW model on a commodity cluster, two questions need to be answered. The first question: which classification algorithm from SparkR MLlib is the best to use? The second question: which SparkR caching level gives the best performance and robustness, and why? This work answers these questions.
The rest of this paper is organized as follows. Section 2 gives background on the technologies used in the 3DFPW model. Section 3 briefly reviews related work on flight delay. Section 4 presents the methods and design. Section 5 offers the results and discussion. Section 6 presents the conclusion and future work.
2- BACKGROUND
Machine learning is a research field that explores the development of algorithms that can learn from data and make predictions based on it. Work on flight systems increasingly uses machine learning methods [1].

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.1

Apache Spark is a fast and general cluster computing system. It offers high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs. It also supports a number of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.2

Best Caching Storage Technique Using Sparkr for Big Data
Ahmed Elsayed
College of Computing and Information
Technology, AAST, Egypt
[email protected]
Prof. Dr. Mohamed Shaheen
College of Computing and Information
Technology, AAST, Egypt
[email protected]
Prof. Dr. Osama Badawy
College of Computing and Information
Technology, AAST, Egypt
[email protected]

3Source: https://www.r-project.org/about.html accessed 16 Feb 2018.
4Source: https://www.rstudio.com/ accessed 16 Feb 2018.

R is an open source programming language and software environment widely used for statistical computation in data-intensive roles such as data mining and statistics.3

RStudio is an integrated development environment (IDE) for R. It includes a console that supports direct code execution, a syntax highlighting editor, and tools for plotting, history, debugging, and workspace management.4

SparkR is an R package that provides a lightweight interface for using Apache Spark from R. SparkR provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.5

A SparkDataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R, but with more optimizations under the hood. SparkDataFrames can be created from a wide variety of sources, such as structured data files, Hive tables, external databases, or existing local R data frames.5

Shiny is an R package that makes it easy to create interactive web applications directly from R. Applications can stand alone on a web page or be embedded in R Markdown documents, and dashboards can also be built. Shiny applications can be extended with CSS themes, HTML widgets, and JavaScript actions.4

Apache Zeppelin is a web-based notebook that provides data-driven, interactive data analysis and collaborative documents with SQL, Scala and more.6
3- RELATED WORK
Flight delays have led to significant costs for passengers, airlines, and society. Such high costs motivate the analysis and prediction of air traffic delays and the development of better mechanisms for handling them. Predicting flight delays has been the topic of several previous efforts.

Sternberg, Soares, Carvalho & Ogasawara (2017) developed a taxonomy and classified the flight delay prediction models of previous researchers by their detailed components. That work analyzes these models from a Data Science perspective, based on arrival delay [1]. Mazzeo (2003) examined the hypothesis that the market power enjoyed by dominant airlines allows them to provide lower service quality through increased flight delays, based on arrival delay [3]. Ding (2017) carried out regression and ordinal classification tasks with a multiple linear regression model to predict delay, and compared the model with the Naïve Bayes and C4.5 approaches, based on arrival delay [4]. Ugwu, Ntuk & Ekaete (2016) observed that the airline carrier had the highest impact on predicting on-time and delayed flight status. Their research therefore aimed to predict on-time and delayed flight status by exploiting the interpretability of a decision tree model; the reported accuracy of the system is 74.3%, based on departure delay [5]. Tu, Ball & Jank (2006) estimated a flight departure delay distribution, focusing exclusively on downstream delays caused by factors such as weather conditions and airport surface congestion. Specifically, they present a model that responds to changes in real-time parameter measurements, based on departure delay [6]. Van Montfort & van den Berg (2017) used two measures of delay: the delay in minutes beyond the scheduled time, and whether the delay exceeded 15 minutes. Their results suggest that the larger the nationwide size of an airline, the shorter and less frequent its delays. This result seems robust to the choice of specification, controls and variable set-up. Larger airlines have more resources, and the efficient use of these may decrease delays, based on arrival delay [7]. Cole & Donoghue (2017) trained a logistic regression model to predict whether a flight will be delayed by more than 15 minutes, based on departure delay [8].

Venkataraman et al. (2016) found that their results are in line with previous studies that measured the importance of caching in Spark. The benefits come not only from using faster storage media but also from avoiding the CPU time spent decompressing data and parsing CSV files. Caching helps to achieve the low latencies that make SparkR suitable for interactive query processing from the R shell, and caching the data can improve performance by 10x to 30x for this workload [9].
4- METHODS AND DESIGN
4.1- System Components
Hadoop cluster specs (version 2.6 on one namenode and 5 datanodes):
1 master: processor AMD Phenom(TM) 8600B, 3 cores, 8 GB memory, 120 GB hard disk, Gigabit network card, OS Linux (Ubuntu 14), 64-bit.
5 slaves: processor Intel Core 2 Duo CPU E8400 3.00GHz, 2 cores, 4 GB memory, 40 GB hard disk, Gigabit network card, OS Linux (Ubuntu 14), 64-bit.
The 6 machines are connected together on one Gigabit switch, with a speed of approximately 600 Mbit/s.
Spark cluster specs: 1 driver and 6 workers on 6 machines over 13 cores; memory in use: 19.6 GB total, 14.9 GB used; Spark master at spark://hdmaster:7077; Spark version 2.1.0 installed on the same cluster as Hadoop, in standalone mode.
Dataset: the data were obtained from the Bureau of Transportation Statistics, a federal agency of the United States of America.7 The dataset consists of records of all USA domestic flights of major carriers: the Airline On-Time Performance dataset, downloaded as a CSV file. It contains details of the arrival and departure of all commercial flights in the US from October 1987 to April 2008. This is an extensive dataset: nearly 123 million records in total and 12 gigabytes of unpacked data.8

Variable descriptions (29 variables): Year: 1987-2008, Month: 1-12, DayofMonth: 1-31, DayOfWeek: 1 (Monday) to 7 (Sunday), DepTime: actual departure time, CRSDepTime: scheduled departure time, ArrTime: actual arrival time, CRSArrTime: scheduled arrival time, UniqueCarrier: unique carrier code (lookup CSV),7 FlightNum: flight number, TailNum: plane tail number, ActualElapsedTime: in minutes, CRSElapsedTime: in minutes, AirTime: in minutes, ArrDelay: arrival delay in minutes, DepDelay: departure delay in minutes, Origin: origin IATA airport code (lookup CSV),7 Dest: destination IATA airport code (lookup CSV),7 Distance: in miles, TaxiIn: taxi in time in minutes, TaxiOut: taxi out time in minutes, Cancelled: was the flight cancelled?, CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security), Diverted: 1 = yes, 0 = no, CarrierDelay: in minutes, WeatherDelay: in minutes, NASDelay: in minutes, SecurityDelay: in minutes, LateAircraftDelay: in minutes.8

Figure 1: 3DFPW model

5Source: https://spark.apache.org/docs/2.1.0/sparkr.html accessed 16 Feb 2018.
6Source: https://zeppelin.apache.org/ accessed 16 Feb 2018.
4.2- Dataset Preparing and Preprocessing
SparkR initiating: SparkR is initiated by loading the SparkR library, specifying the IP address and port of the Spark cluster, and starting a new session from R and RStudio.

Reading dataset: a CSV file is read from the Hadoop cluster and converted to a SparkR DataFrame whose partitions are distributed over all Spark cluster machines and cores. The full dataset has 123534969 rows and occupies 12 gigabytes.

Preprocessing: with SparkR SQL it is easy to preprocess and prepare the dataset. This work focuses on both departure delay and arrival delay. The dataset contains many attributes, some of which are irrelevant; the irrelevant attributes were pruned during extensive preprocessing. The resulting data was partitioned into training and test sets.

SQL select statement: using a select statement, new columns were created from the dataset to make the data more meaningful to an ordinary passenger who wants to know whether the selected flight will be on time or delayed, as follows.
Month: created from the Month column, with the month numbers converted into nominal names (Jan, Feb, Mar, etc.).
Weekday: created from the DayOfWeek column, with the day numbers converted into short day names (1=’Mo’, 2=’Tu’, etc.).
UniqueCarrier: taken from the dataset; it is a unique code for each airline company.
Origin: origin airport code.
Dest: destination airport code.
CRSDepTime: created from the dataset CRSDepTime column, with the times grouped into three short meaningful names: times between 0001 and 1159 into Morning (‘MO_01_to_12’), times between 1200 and 1759 into Afternoon (‘AN_12_to_18’), and times between 1800 and 2359 into Night (‘NI_18_to_24’).9 Because canceled flights have no actual DepTime, the scheduled departure time (CRSDepTime) was used instead of DepTime.
CRSArrTime: handled in the same way as CRSDepTime.
Canceled (0/1): canceled flights were considered delayed flights.
Class: the dataset has no class column, so a class was built according to the U.S. Department of Transportation Federal Aviation Administration (FAA) Air Traffic Organization policy on delays to instrument flight rules (IFR) traffic, under which airborne delays are reported for all aircraft that incur a delay of 15 minutes or more.11 Ontime binary class: if the departure delay is less than 15 minutes then ‘yes’; if the departure delay is 15 minutes or more, or the flight is canceled, then ‘no’.

Criteria: records with an invalid CRSDepTime were discarded; only rows whose CRSDepTime is less than 2401 were selected, and the same rule was applied to CRSArrTime.
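The derived time-of-day column and the on-time class described in this section can be illustrated with a minimal Python sketch. It is not the SparkR SQL used in this work; the function names `time_bucket` and `ontime_label` are hypothetical, and returning `None` for values outside the three stated ranges (for example 0000 or 2400) is an assumption, since the text does not say how those values are labeled.

```python
from typing import Optional


def time_bucket(hhmm: int) -> Optional[str]:
    """Map a scheduled time in hhmm form (e.g. 1345) to one of the three labels."""
    if 1 <= hhmm <= 1159:
        return "MO_01_to_12"  # Morning
    if 1200 <= hhmm <= 1759:
        return "AN_12_to_18"  # Afternoon
    if 1800 <= hhmm <= 2359:
        return "NI_18_to_24"  # Night
    # Assumption: values outside the stated ranges are left unlabeled.
    return None


def ontime_label(delay_minutes: Optional[float], cancelled: bool) -> str:
    """Return 'yes' for an on-time flight and 'no' for a delayed or canceled one."""
    if cancelled:
        # Canceled flights are treated as delayed; their delay value is ignored.
        return "no"
    # A delay of 15 minutes or more counts as delayed (FAA reporting threshold).
    return "yes" if delay_minutes < 15 else "no"


print(time_bucket(930), ontime_label(14, False))
print(time_bucket(1815), ontime_label(None, True))
```

In a cluster setting the same rules would be expressed as column expressions so that they run on every partition; the plain functions above only make the thresholds explicit.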
4.3- Dynamic Double Delay Flight Predicting Web
(3DFPW) model
Criteria: the selected range of years is from 1989 to 2008 (19 years), nearly 112M rows. Splitting dataset: after preprocessing, the dataset resulting from the SQL select statement was separated into training and test sets, 80% for training (89585408 rows) and 20% for test (22390987 rows); the training and test sets are cached as Spark DataFrames on the cluster. Naïve Bayes algorithm: the SparkR Naïve Bayes model was run on the training set with the ontime column as the class, using the combination of columns (Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSDepTime) for departure delay in iteration1 and (Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSArrTime) for arrival delay in iteration2. Both iterations use the same select statement and the same criteria, and each was run from beginning to end separately from the other. Prediction: after learning from the training set, prediction was run on the test set to obtain the predicted class for random combinations of feature columns, again for each iteration separately. Confusion matrix: the R confusion matrix library is compatible only with local R data frames, which live on a single machine; it cannot work with a Spark DataFrame on the cluster and cannot read data larger than the machine's RAM. Therefore, in this work a confusion matrix was written in the R language to read from big data and produce results such as accuracy, recall, precision, and F-score, based on related research [2].

7Source: https://www.transtats.bts.gov/Fields.asp?Table_ID=236 accessed 17 Feb 2018.
8Source: http://stat-computing.org/dataexpo/2009/the-data.html accessed 17 Feb 2018.
9Source: https://www.fluentu.com/blog/english/how-to-tell-time-inenglish/ accessed 17 Feb 2018.
10Source: http://spark.apache.org/docs/latest/rdd-programming-guide.html accessed 19 Feb 2018.
11Source: https://www.faa.gov/documentlibrary/media/order/7210.55fbasic.pdf accessed 17 Feb 2018.
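The confusion matrix step can be illustrated with a minimal Python sketch, not the R code written for this work. It assumes that the cluster has already reduced the test set to counts per (label, prediction) pair, for example through a distributed group-by, so that only four numbers reach the driver. The function name `binary_metrics` and the example counts are hypothetical and are not results from this paper.

```python
from typing import Dict, Tuple


def binary_metrics(counts: Dict[Tuple[str, str], int],
                   positive: str = "yes") -> Dict[str, float]:
    """Compute accuracy, precision, recall and F-score from (label, prediction) counts."""
    tp = fp = fn = tn = 0
    for (label, prediction), n in counts.items():
        if label == positive and prediction == positive:
            tp += n  # true positives
        elif label != positive and prediction == positive:
            fp += n  # false positives
        elif label == positive and prediction != positive:
            fn += n  # false negatives
        else:
            tn += n  # true negatives
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Made-up counts for illustration only: keys are (label, prediction).
example_counts = {("yes", "yes"): 70, ("yes", "no"): 10,
                  ("no", "yes"): 15, ("no", "no"): 5}
metrics = binary_metrics(example_counts)
print(metrics)
```

Because the aggregation happens before the metrics are computed, the driver never needs to hold the full prediction table in memory, which is the limitation of the local confusion matrix library noted above.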
Shiny: as shown in Figure 1, interacting online with passengers requires a website that can work dynamically with R and SparkR machine learning to achieve the goal of the 3DFPW big data model; Shiny was therefore used for this purpose. The Shiny file has two sections, UI and Server. The select statement of the model is used as the dataset. The Naïve Bayes algorithm was executed using the full dataset as the training set, without splitting it into training and test sets. The DepDelay iteration and the ArrDelay iteration were run one after the other. Likewise, both predictions are made in the Server section from the input data entered by the passenger in the UI section. The input data is used as the test set for both predictions.
4.4- Classification Algorithms Comparison
To choose the best classification algorithm for implementing the 3DFPW model, three classification algorithms from the standard SparkR MLlib were tested and compared: Naïve Bayes (NB), Random Forest (RF) and Gradient Boosted Tree (GBT). Accuracy was also compared with related research [5] to strengthen the confirmation of the process. Criteria: the instances of January 2004 were selected (583944 rows). Splitting dataset: the result of the same SQL select statement used in the 3DFPW model was separated into training and test sets, 70% for training (407761 rows) and 30% for test (176183 rows); the training and test sets are cached as Spark DataFrames on the cluster. Columns: the three classification algorithms were run on the training set with the ontime column as the class, using the combination of columns (ontime, Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled and CRSDepTime). Related research [5]: the author of that work used the same criteria and the same columns as this paper's model, with a C4.5 algorithm.
4.5- Persisting Performance Evaluation
One of the most important capabilities in Spark is persisting (caching) a dataset in memory across operations. When an RDD is persisted, each node stores in memory any partitions of it that it computes and reuses them in other actions on that dataset (or on datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use. An RDD can be marked as persistent using the persist() or cache() methods. The first time it is computed in an action, it is kept on the nodes. The Spark cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.10 The caching storage levels are Memory_Only, Memory_And_Disk, Disk_Only, Memory_Only_Ser, and Memory_And_Disk_Ser.10

To choose the best caching level in both the best case (fully functional Hadoop and Spark clusters) and the worst case (a low number of Spark cores or any case of dead cores), the caching storage levels had to be tested. Performance evaluation: the test measured the processing time of the NB algorithm on all columns of the dataset (29 variables) and over the whole range of years (21 years), in order to maximize the load on the algorithm. Each caching level was tested individually over five stages: the first stage runs 6 Hadoop datanodes and 13 Spark cores (executors), the second stage runs 5 Hadoop datanodes and 11 Spark cores, and so on until the last stage with 2 Hadoop datanodes and 5 Spark cores. The cause of the observed behavior was then identified by decreasing or increasing the number of nodes and the number of rows on a variety of machine learning algorithms.
5- RESULTS AND DISCUSSIONS
As shown in Figures 2 and 3, some airlines were selected as samples to compare the class label with the predicted class and illustrate the difference.
As a result of the 3DFPW model, and as shown in Tables 1 and 2, the true positive count for DepDelay is better than for ArrDelay. As shown in Table 3, the model time in both iterations is nearly 2.5 minutes; however, the accuracy for DepDelay (82%) is better than the accuracy for ArrDelay (78%). Likewise, precision and F-score are better for DepDelay. The a priori probabilities for the DepDelay iteration are 0.82 for Ontime and 0.18 for Delayed, and for the ArrDelay iteration 0.78 for Ontime and 0.22 for Delayed. The whole dataset is used as the training set to take advantage of the full knowledge and to increase the chance of correctly predicting the incoming data from the passenger, which arrives as a single row; this row is treated as the test set for the prediction process in both iterations. Prediction time is almost the same in the two iterations. Consequently, iteration1 results are better.

Figure 2: Actual flight status against the carriers
Figure 3: Prediction flight status against the carriers
Table 1: DepDelay
Table 2: ArrDelay
Table 3: the test for DepDelay and ArrDelay prediction
Once the passenger enters a combination of flight data and presses the button ‘predict ontime status’, the status (ontime or delayed) and the probability of this status are displayed in the browser for both delays, as shown in Figure 4.
In answer to the first question, and as shown in Table 4, the NB model has the shortest time (8 seconds) and the highest accuracy (79.8%). RF and GBT reach the same level of accuracy (79.6%) and a higher precision (79.3%). RF takes 10 times as long as NB. GBT is the worst model in time at 505 seconds, 63 times the time of NB. When a full range of years, or even more than two years, was selected for RF and GBT on the current cluster hardware configuration, the algorithms could not complete processing and threw errors about connections and executors. NB, however, handled a range of 19 years well. The SparkR NB algorithm has a better accuracy (79.8%) than the C4.5 algorithm (74.3%) of the related work [5]. Prediction time is almost the same in all tests and is calculated for one row only. Therefore, NB is the best algorithm to use in the 3DFPW model because of its accuracy and its time.
In answer to the second question, Table 5 shows the results of running the Naive Bayes algorithm over the full dataset (123M rows). In the first stage Memory_Only, Disk_Only, and Memory_And_Disk take almost the same time, whereas Memory_Only_Ser, Memory_And_Disk_Ser and Uncached take almost the same time as one another, 3.3 times that of the first three caching levels. Memory_Only_Ser and Memory_And_Disk_Ser were therefore not tested further as caching levels, because their time is almost the same as the Uncached time.

In the second stage Memory_Only, Disk_Only, and Memory_And_Disk again take almost the same time, while the Uncached time is 3 times that of these three caching levels.

In the third stage Memory_Only and Memory_And_Disk take almost the same time, but it is 1.9 times the Disk_Only time and 2.6 times their own time in the second stage, whereas the Disk_Only time is just 1.4 times its time in the second stage. This indicates an overload when using Memory_Only and Memory_And_Disk: while processing the dataset, the cluster reached its memory limit. The Uncached time is 1.4 times that of Memory_Only and Memory_And_Disk and 2.6 times that of Disk_Only.
In the fourth stage Memory_Only was no longer available: it exceeded the limit of cluster memory, raised errors, and did not complete the processing. Memory_And_Disk took 2.2 times as long as Disk_Only and 1.5 times its own third-stage time, while Disk_Only took 1.3 times its third-stage time. The Uncached time is 1.2 times the Memory_And_Disk time and 2.7 times the Disk_Only time. In the fifth stage Memory_And_Disk also became unavailable, like Memory_Only in the fourth stage, and the Uncached run did not complete either. As shown in Figure 6, the only caching level that survived in the worst case was Disk_Only. Consequently, Disk_Only had the best time in all stages and is the best caching level to use in the 3DFPW model, giving the best performance and robustness.

Figure 4: 3DFPW model webpage
Table 4: Performance classification comparison
Table 5: Persisting performance comparison (time in minutes)
Figure 6: Persisting performance comparison (time in minutes)
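As a quick arithmetic check on the ratios reported for the third and fourth stages, the Uncached to Disk_Only ratio should roughly equal the product of the two intermediate ratios. The Python sketch below assumes that "times more than" is meant multiplicatively and uses only the rounded ratios quoted in the text; the helper name `chain` is hypothetical.

```python
def chain(*ratios: float) -> float:
    """Multiply successive time ratios to get the end-to-end ratio."""
    product = 1.0
    for ratio in ratios:
        product *= ratio
    return product


# Stage 3: memory-based levels / Disk_Only = 1.9, Uncached / memory-based levels = 1.4
stage3_uncached_vs_disk = chain(1.9, 1.4)
# Stage 4: Memory_And_Disk / Disk_Only = 2.2, Uncached / Memory_And_Disk = 1.2
stage4_uncached_vs_disk = chain(2.2, 1.2)

print(round(stage3_uncached_vs_disk, 2))  # reported in the text as 2.6
print(round(stage4_uncached_vs_disk, 2))  # reported in the text as 2.7
```

Both products land within about 0.1 of the reported figures, which is the agreement one would expect when every ratio has been rounded to one decimal place.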
To understand why Disk_Only was the best caching storage when the Naive Bayes algorithm reached the overload limit, a test on a variety of other algorithms was needed. Running ML algorithms on the dataset divided into halves and quarters showed that, when the portion of the dataset puts no overload on the ML algorithm and the cluster configuration, the three caching storages (memory, memory and disk, disk) take the same time. Therefore each ML algorithm had to be run at its overload limit: the point at which all three caching storages still complete successfully, without any failure, while taking the greatest time. Reaching this limit required running each algorithm many times while decreasing or increasing the number of nodes and the number of rows. The results are shown in Table 6:
Logit reached the overload limit running on 6 nodes with 43.8M rows (beyond that it fails); Memory_Only gives the best caching storage time. Naive Bayes reached the overload limit running on 4 nodes with 123.4M rows; Disk_Only gives the best caching storage time. Random forest reached the overload limit running on 6 nodes with 3.5M rows (beyond that it fails); Memory_And_Disk gives the worst caching storage time. Kmeans reached the overload limit running on 5 nodes with 61.7M rows (beyond that it fails); the three caching storages take almost the same time. Consequently, the best caching storage depends on the ML technique and on how it accesses the data when it is overloaded.
Some limitations of SparkR version 2.1.0 were observed during the experiments of this paper. For instance, the R confusion matrix does not support Spark DataFrames, so it was implemented programmatically instead. The ggplot2 library does not support Spark DataFrames, so Apache Zeppelin was used to make the bar charts. Likewise, the plot library does not support Spark DataFrames. Some well-known classification algorithms such as C4.5 (decision tree) are not supported in SparkR, although they are supported in PySpark and Scala.
6- CONCLUSION
Flight delays are a hot topic for passengers. This research introduces a model that presents the predicted status of both departure delay and arrival delay to the passenger through a website, unlike most previous studies, which focused on either departure delay or arrival delay only. The experiments in this paper showed that predicting departure delay gives better accuracy than predicting arrival delay, although both are used in the 3DFPW model. After testing RF, GBT and NB and comparing with the C4.5 results of related research, the NB classification algorithm was the best in SparkR MLlib. Disk_Only had the best time and robustness in all test stages of the Naive Bayes algorithm, and it is the best caching level to use in the 3DFPW model for best performance.

Pushing a variety of machine learning algorithms to their overload limit, in order to understand why Disk_Only is the best caching storage for the Naive Bayes algorithm, showed that the best caching storage depends on the ML technique and on how it accesses the data when it is overloaded.

In future work, offering the passenger alternatives drawn from the top ten on-time carriers and airports will be considered, as will using Spark from PySpark instead of SparkR for greater efficiency, flexibility and wider adoption.
REFERENCES
[1] Alice Sternberg, Jorge Soares, Diego Carvalho, Eduardo Ogasawara. A Review on Flight Delay Prediction. arXiv:1703.06118v1 [cs.CY], 2017.
[2] Udeh Tochukwu Livinus, Rachid Chelouah, Houcine Senoussi. Recommender System in Big Data Environment. IJCSI, ISSN (Online): 1694-0784, 2016.
[3] Michael J. Mazzeo. Competition and Service Quality in the U.S. Airline Industry. Kluwer Academic Publishers, 2003.
[4] Yi Ding. Predicting flight delay based on multiple linear regression. Earth and Environmental Science, 10.1088/1755-1315/81/1/012198, 2017.
[5] C. Ugwu, Ntuk, Ekaete. Dynamic Decision Tree Based Ensembled Learning Model to Forecast Flight Status. European Centre for Research Training and Development UK, Vol.4, No.6, pp.15-24, 2016.
[6] Yufeng Tu, Michael Ball, Wolfgang Jank. Estimating Flight Departure Delay Distributions: A Statistical Approach With Long-term Trend and Short-term Pattern. Robert H. Smith School Research Paper No. RHS 06-034, 2006.
[7] Joep van Montfort, Vincent A.C. van den Berg. The total size of an airline and the quality of its flights. 2017.
[8] Scott Cole, Thomas Donoghue. Predicting departure delays of US domestic flights. 2017.
[9] Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, Matei Zaharia (AMPLab UC Berkeley, Databricks Inc., MIT CSAIL). SparkR: Scaling R Programs with Spark. SIGMOD, San Francisco, CA, USA. ACM, ISBN 978-1-4503-3531-7/16/06, 2016.
Table6: Persisting performance comparison in overload status
