The World Wide Web has seen massive growth in different kinds of web services, including social networking and blogs. Sites like Facebook, Twitter and LinkedIn are among the most viewed websites on the Web. The fact that Google detects about 300,000 malicious websites per month is proof that these opportunities are extensively exploited by criminals. These malicious websites are employed to steal personal information such as passwords and credit card details, and to carry out drive-by downloads. Criminals also employ phishing in order to tempt users into visiting their fake websites.
All of the above threats have something in common: each requires the person to either click on a link or type a website address into a web browser, the former being more commonplace. In both instances, the vector used to entice users is a Uniform Resource Locator (URL). Therefore, each time a user clicks a URL, the user performs sanity checks, such as paying close attention to the spelling of the website's address and evaluating the associated threat that might be encountered by visiting the URL.
Recent work has shown that security practitioners have developed techniques, such as blacklisting, to shield users from phishing websites. In blacklisting, a third party compiles the names of known phishing websites. Although it has a minimal query overhead, the blacklisting technique does not offer thorough protection, as no blacklist can be both comprehensive and up-to-date. As a result, a user could click a link that directs to a phishing website before the link appears on a blacklist. Another important aspect to consider is that most domains registered with phishing intentions become active rapidly upon registration, and it usually takes some time to detect phishing activity. Thus, the phishing website will not be part of a blacklist during this period before its detection. Blacklisting is therefore a helpful, but not thorough, approach to protecting novice users, as a window of vulnerability still remains. Security analysts have done in-depth research to discover dishonest accounts on social networks that are used for spreading messages luring the recipient to a potential phishing website. Though their results are promising and successful in detecting a large number of dishonest accounts, their technique of using honey-profiles to identify single spam bots as well as large-scale campaigns does not offer thorough protection for users.
Social networks feature real-time interaction; however, the technique of using honey-profiles can incur delays in the detection of a dishonest account because of the need to build up a profile of inappropriate activity. Thus, users remain vulnerable, and criminals will be able to exploit users who lack the knowledge to distinguish between benign and phishing URLs. As a result of the limitations of current techniques, and keeping in mind that a real-time URL classification mechanism is best served by a minimal query overhead, a system that can classify URLs in real time and adapt to new and evolving trends in URL characteristics is proposed in the current work. Requiring only the URL, and not the context in which the URL appears, makes the approach applicable to any domain where a URL appears.
2.1 SOFTWARE REQUIREMENTS
Operating System – Windows 7, Windows 8
2.2 HARDWARE REQUIREMENTS
Processor – minimum Intel core i3 to maximum Intel core i7
RAM – minimum 2GB to maximum 4GB
Hard Disk- minimum 20GB to maximum 100GB
2.3 REASON FOR CHOOSING WEKA
Weka is open-source software released under the GNU General Public License. Weka stands for the Waikato Environment for Knowledge Analysis. It contains modules for data preprocessing, classification, clustering and association rule extraction. The Weka Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the Weka software.
MACHINE LEARNING ALGORITHMS
There is a rich family of machine learning algorithms in the literature which can be applied to malicious URL detection. After converting URLs into feature vectors, many of these learning algorithms can generally be applied to train a predictive model in a fairly straightforward manner. However, to solve the problem efficiently, some efforts have also been made to devise specific learning algorithms that either exploit the properties exhibited by the training data of malicious URLs, or address specific challenges the application faces.
In this section, we categorize and review the learning algorithms that have been applied to this task, and also suggest suitable machine learning technologies that may be used to address the specific challenges encountered.
We categorize the learning algorithms into: batch learning algorithms, online algorithms, representation learning, and others. Batch learning algorithms work under the assumption that the entire training data is available prior to the training task. Online learning algorithms treat the data as a stream of instances, and learn a prediction model by sequentially making predictions and updates.
This makes them extremely scalable compared to batch algorithms. Next, we discuss representation learning techniques, which in the context of malicious URL detection are largely focused on feature selection strategies. Lastly, we discuss other learning algorithms in which challenges specific to malicious URL detection are addressed, including cost-sensitive learning, active learning, similarity learning, unsupervised learning and string pattern matching.
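The sequential predict-then-update loop of online learning can be sketched with a simple perceptron-style rule. This is only an illustration of the general idea, not the algorithm used in this work; the feature vectors and labels below are made up.

```python
# Minimal sketch of an online (perceptron-style) learning update.
# Feature vectors and labels are illustrative, not real URL data.

def perceptron_update(w, x, y, lr=1.0):
    """Predict on one instance, then update weights only on a mistake."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1 if score >= 0 else -1
    if pred != y:                       # mistake-driven update
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w, pred

# A stream of (features, label) pairs; label +1 = phishing, -1 = benign.
stream = [([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 0.0], -1), ([1.0, 1.0, 1.0], 1)]
w = [0.0, 0.0, 0.0]
for x, y in stream:
    w, _ = perceptron_update(w, x, y)
print(w)
```

Because each instance is processed once and then discarded, memory use is constant in the size of the stream, which is the scalability advantage noted above.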
In order to identify these malicious sites, several studies in the literature approach this problem from a machine learning standpoint. That is, they compile a list of URLs that have been classified as either malicious or benign and characterize each URL via a set of attributes. Classification algorithms are then expected to learn the boundary between the decision classes. Phishing URLs for the experimentation were collected from PhishTank, which is an open-source community site where manually verified phishing URLs are uploaded. PhishTank is known to provide accurate data for developers and researchers actively seeking to identify criminals on the Web.
PhishTank verifies each individual submitted URL extensively before it is categorized as a phishing URL. Judgment is passed by human experts who have experience in identifying phishing URLs. The number of people verifying a URL is always more than one. Thus, the judgments are cross-verified and the final judgment for each individual URL is prepared. This makes PhishTank a reliable source for obtaining phishing URLs with correct ground truth labels.
The benign and phishing URL datasets used in the present work are original, and were generated using the feature extraction module. Different researchers have used different sets of features to solve the problem of URL classification; as a result, there is no standardized dataset upon which researchers conduct their research. Due to the ephemeral nature of phishing and the continuously changing threat landscape, it is difficult to have a standardized dataset against which classification systems could be tested. Since the datasets used in the current work are original, the findings and comparisons made with other researchers' work are made in a generic manner; specific comparisons are not made, since the underlying datasets differ. The current work treats the detection of malicious URLs as a binary classification problem and studies the performance of several well-known classifiers, namely Naïve Bayes, Support Vector Machines, Multi-Layer Perceptron, Decision Trees, Random Forest and k-Nearest Neighbors. Furthermore, we examine and compare the results.
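As a toy illustration of one of the classifiers named above, the following pure-Python sketch implements k-Nearest Neighbors over two-dimensional feature vectors. The feature values and labels are purely hypothetical, not drawn from the datasets used in this work.

```python
# Sketch of k-Nearest Neighbors classification on toy feature vectors
# (e.g. scaled URL length and dot count); all values are illustrative.
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([0.9, 0.8], "phishing"), ([0.8, 0.9], "phishing"),
         ([0.1, 0.2], "benign"), ([0.2, 0.1], "benign"), ([0.15, 0.25], "benign")]
print(knn_predict(train, [0.85, 0.85]))   # query lies near the phishing examples
```

The same feature vectors could equally be fed to any of the other classifiers listed; only the decision boundary they learn differs.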
4.1 Lexical Features
Phishing URLs and domains are known to exhibit characteristics that are different from those of benign URLs and domains on the Web. Criminals on the Web are known to use various innovative methods in order to lure unsuspecting users. In recent times, they have resorted to a method called typosquatting, also known as URL hijacking. This method targets users who incorrectly type the address of a website into their web browser. For instance, a user might type the URL www.paypak.com or www.pavpal.com instead of www.paypal.com. In such cases, they might be led to an alternative website that closely resembles the original website, where they might be asked to enter login details or financial credentials. Thus, the user's credentials are compromised.
Phishing URLs of this type are usually captured by analyzing the lexical content to find incorrectly spelt tokens, so that the user can be alerted. The host-based content also helps in identifying such websites, since the network infrastructure of other phishing campaigns might be closely related to the one being analyzed. Criminals are known to target specific branded URLs during specific time frames. Lexical features can be extracted quickly, and an online learning algorithm can retrain the model continuously, updating itself based upon emerging trends in such phishing URLs. When a new brand is targeted, the classifier will be able to adapt to changes in the characteristics of lexical features, and as a consequence novice users can be protected. A wide variety of lexical features are extracted from a URL. The authors extract continuous-valued features, such as the length of the URL and the number of dots present in the URL. Various other authors also support using the length of the URL as a feature.
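The continuous-valued lexical features mentioned above can be extracted in a few lines. The sketch below is illustrative: the exact feature set used in this work is broader, and the example URL is made up.

```python
# Sketch of simple lexical feature extraction from a URL.
from urllib.parse import urlparse

def lexical_features(url):
    parsed = urlparse(url)
    return {
        "url_length": len(url),          # phishing URLs often differ in length
        "num_dots": url.count("."),      # many dots can signal nested subdomains
        "hostname_length": len(parsed.netloc),
        "path_length": len(parsed.path),
        "num_hyphens": url.count("-"),
    }

feats = lexical_features("http://secure-login.paypal.example.com/update/account")
print(feats["num_dots"], feats["hostname_length"])
```

Because these features are computed from the URL string alone, they can be produced at classification time with no network queries, which is what makes real-time use feasible.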
Regarding the aesthetic properties of phishing URLs, they usually tend to have different lengths when compared to other URLs and domains on the Web. The order of tokens from the URLs is not preserved; however, a distinction is made between tokens belonging to the hostname, path, and primary domain name of the URL. Most URLs are not necessarily constructed using proper English terms, and many times they are just strings of random characters.
For instance, a commonly occurring term in phishing URLs is ‘ebayisapi’, whose lexical properties reveal that it is not a proper English word, but rather a string constructed with random characters. The motivation behind adding bigrams is that phishing URLs might exhibit a certain pattern of character strings permuted randomly and occurring in certain combinations. The subtlety of such random occurrences in character strings can be captured by the use of bigrams. As a result, bigrams that appear often in phishing URLs should cause the online algorithm to classify the URL as phishing. Benign bigrams would have the opposite effect, thereby confirming the benign nature of a URL.
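Character bigram extraction of the kind described above can be sketched as follows, using the ‘ebayisapi’ token from the example.

```python
# Sketch of character bigram extraction from a URL token.
from collections import Counter

def char_bigrams(token):
    """Return counts of adjacent character pairs in a token."""
    return Counter(token[i:i + 2] for i in range(len(token) - 1))

bigrams = char_bigrams("ebayisapi")
print(sorted(bigrams))   # e.g. 'eb', 'ba', 'ay', ... in sorted order
```

Each bigram then becomes one dimension of the feature vector, so tokens with unusual character combinations light up dimensions that rarely occur in benign URLs.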
4.2 Host-Based Features
Prior work observes that criminals who register domains on the Internet for malicious purposes often operate those domains using related sets of name servers. As a result, identifying the name servers commonly used by perpetrators would serve as a useful indicator for identifying a phishing website. Name server records support this identification, as they represent the DNS infrastructure that leads the user to a phishing website. Further, the infrastructure may also be hosted on ISPs that are known to host phishing websites. MX records for known phishing websites are collected, and if a new website in the future has an associated mail server that is present in this set, then the new website could be classified as phishing.
Prior research has also shown that criminals have been known to obtain hundreds of domains by registering in bulk with a registrar. The hosting infrastructure of these domains has a very high probability of being closely related. Thus, the hosting infrastructure of such domains might be a reliable indicator of phishing on the Web, as phishers often reuse the underlying hosting infrastructure for a significant period. Obtaining DNS-infrastructure-related information and characterizing URLs with it, with the aim of classifying them, would result in domains hosting phishing websites acquiring a poor reputation and record. Phishing domains end up having a poor record for hosting phishing sites and distributing exploits, malware, and Trojans. This information helps the classification system learn a model that is able to discern between phishing URLs and benign URLs.
The wide literature on classification models, often simply called “classifiers”, offers a range of possible solutions and methods for tackling classification problems. Multilayer Perceptron, Naive Bayes, IBk, Random Forest and Random Tree stand among the most famous classifiers, each having its own advantages and downsides.
5.1 Multilayer Perceptron
A multilayer perceptron (MLP) is a feed-forward artificial neural network model that attempts to map a set of input data onto a set of corresponding outputs. An MLP can be described as a directed graph whose nodes, called neurons, are organized in three types of layers: input nodes, output nodes and hidden layers. Each layer is fully connected to the next one, and each neuron (except for the inputs) has a nonlinear activation function (e.g. hyperbolic tangent, logistic function). Each neuron updates its value taking into account the values of the connected neurons and the weights of these connections.
Several supervised learning algorithms have been proposed in the literature for changing connection weights after each instance of historical data is presented, based on the amount of error in the output compared to the expected result. Backpropagation is perhaps the most widespread one, consisting of a generalization of the least mean squares algorithm of the linear perceptron. MLPs have been widely used in the literature for classification problems of diverse origin; however, other techniques like RF or SVM are reported to be very competitive.
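The neuron update described above (weighted sum of connected neurons followed by a nonlinear activation) can be sketched as a single forward pass through a tiny MLP. The weights and biases below are arbitrary illustrative values, not trained ones, and training itself (backpropagation) is omitted.

```python
# Minimal sketch of a forward pass through a tiny MLP with logistic
# activations; weights are illustrative, not trained values.
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then activation."""
    return [logistic(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.0]                                            # two input features
hidden = layer(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1])   # two hidden neurons
output = layer(hidden, [[1.0, 1.0]], [-1.0])               # one output neuron
print(round(output[0], 3))
```

The logistic output can be read as a score between 0 and 1; thresholding it at 0.5 would yield a binary phishing/benign decision.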
5.2 Naive Bayes
The naive Bayes classifier technique is based on the so-called Bayes theorem and is particularly suited when the dimensionality of the input is high. It is highly scalable, requiring a number of parameters linear in the number of variables in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
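The closed-form maximum-likelihood estimation mentioned above is just counting. The sketch below trains a naive Bayes classifier on toy binary features by counting class and feature frequencies; the data is illustrative and, for brevity, no smoothing is applied (a real implementation would add Laplace smoothing to avoid zero probabilities).

```python
# Sketch of naive Bayes with maximum-likelihood (counting) estimates
# on toy binary features; data is illustrative, no smoothing applied.
from collections import defaultdict

def train_nb(data):
    """Estimate P(class) and P(feature=1 | class) by counting."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for features, label in data:
        class_counts[label] += 1
        for i, v in enumerate(features):
            feat_counts[label][i] += v
    total = len(data)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {c: {i: feat_counts[c][i] / class_counts[c]
                       for i in feat_counts[c]} for c in class_counts}
    return priors, likelihoods

def predict_nb(priors, likelihoods, x):
    def score(c):   # P(class) * product of per-feature likelihoods
        p = priors[c]
        for i, v in enumerate(x):
            q = likelihoods[c].get(i, 0.0)
            p *= q if v else (1 - q)
        return p
    return max(priors, key=score)

data = [([1, 1], "phish"), ([1, 0], "phish"), ([0, 0], "benign"), ([0, 1], "benign")]
priors, likes = train_nb(data)
print(predict_nb(priors, likes, [1, 1]))
```

Training is a single pass over the data, which is the linear-time property noted above.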
5.3 Random Forest
Random Forest (RF) is a well-known ensemble learning method for supervised classification or regression. This machine learning technique operates by building an ensemble of random decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Therefore an RF is a classifier consisting of a collection of tree-structured classifiers which uses random selection at two points. In a first step, the algorithm selects several (e.g. 500) bootstrap samples from the historical data. For each bootstrap selection k, the size of the selected data is roughly 2/3 of the total training data (exactly 63.2%). Cases are selected randomly with replacement from the original data, and observations in the original data set that do not occur in a bootstrap sample are called out-of-bag (OOB) observations. In a second step, a classification tree is trained using each bootstrap sample, but only a small number of randomly selected variables (commonly the square root of the number of variables) are used for partitioning the tree. The OOB error rate is computed for each tree using the rest (36.8%) of the historical data, and the overall OOB error rate is then aggregated; observe that RF does not require a split-sampling method to assess the accuracy of the model. The final output of the model is the mode (or mean) of the predictions from each individual tree. Random Forest comes at the expense of some loss of interpretability, but generally greatly boosts the performance of the final model, making it one of the classifiers most likely to perform best in real-world classification problems.
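The 63.2% / 36.8% figures quoted above follow directly from the arithmetic of bootstrap sampling, and can be checked in a couple of lines.

```python
# Each of n draws (with replacement) misses a given observation with
# probability (1 - 1/n), so P(out-of-bag) = (1 - 1/n)^n -> 1/e ~ 36.8%,
# and P(in-bag) -> 1 - 1/e ~ 63.2%.
import math

n = 10_000                       # size of the training set
p_oob = (1 - 1 / n) ** n         # probability an observation is out-of-bag
print(round(p_oob, 4), round(math.exp(-1), 4))
```

This is why the OOB observations form a "free" held-out set for each tree, removing the need for a separate validation split.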
5.4 Random Tree
Random Trees is a collection of individual decision trees where each tree is generated from different samples and subsets of the training data. The idea behind calling these decision trees is that for every instance that is classified, a number of decisions are made in rank order of importance; when you graph these out for an instance, it looks like a branch, and when you classify the entire dataset, the branches form a tree. This method is called random trees because the dataset is classified a number of times based on a random sub-selection of training instances, thus resulting in many decision trees. To make a final decision, each tree has a vote. This process works to mitigate overfitting. Random Trees is a supervised machine-learning classifier based on constructing a multitude of decision trees, choosing random subsets of variables for each tree, and using the most frequent tree output as the overall classification. Random Trees corrects for decision trees' propensity to overfit their training sample data. In this method, a number of trees are grown (by analogy, a forest), and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree. The decision at each node is optimized by a randomized procedure.
5.5 Decision Tree
In the context of machine learning, a decision tree is a tree-like graph structure where each node represents a test on an attribute. Each branch represents the outcome of the test, and the leaf nodes represent the class label obtained after all decisions made along that branch. The paths from root to leaf represent classification rules. The goal in this scheme is to represent the data while minimizing the complexity of the model. Several algorithms for constructing such optimized trees have been proposed. For example, C4.5 derives from the well-known divide-and-conquer technique and has been widely used in several application fields, being one of the most popular machine learning algorithms. This scheme builds decision trees from a set of training data using the concept of information entropy. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples according to the normalized information gain (difference in entropy). It is important to note that this simple, yet efficient, technique is capable of handling missing values in the data sets as well as both numerical and categorical attributes.
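The entropy and information-gain computation used to choose split attributes can be sketched as follows; the class counts are illustrative (and the normalization step C4.5 applies, the gain ratio, is omitted for brevity).

```python
# Sketch of entropy and information gain for choosing a split attribute.
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Parent node: 10 phishing, 10 benign. A candidate attribute splits it into
# a pure child (8 phishing) and a mixed child (2 phishing, 10 benign).
parent = entropy([10, 10])                                   # 1.0 bit
children = (8 / 20) * entropy([8, 0]) + (12 / 20) * entropy([2, 10])
gain = parent - children
print(round(parent, 3), round(gain, 3))
```

The attribute with the highest gain (here about 0.61 bits) would be chosen for the split, and the procedure recurses on each child node.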
6.1 SCREEN SHOTS
Figure 6.1.1 Random Forest classifier
Figure 6.1.2 IBk classifier
Figure 6.1.3 Random Tree classifier
Figure 6.1.4 Multilayer Perceptron classifier
6.2 AFTER MODIFYING THE DATASET
Figure 6.2.1 Multilayer Perceptron classifier
Figure 6.2.2 IBk Classifier
Figure 6.2.3 Random Forest Classifier
Figure 6.2.4 Random Tree Classifier