1. Introduction to the Logistic Regression Model
Fitting the Logistic Regression Model
Testing for the Significance of the Coefficients
Confidence Interval Estimation
2. The Multiple Logistic Regression Model
The Multiple Logistic Regression Model
Fitting the Multiple Logistic Regression Model
Testing for the Significance of the Model
Confidence Interval Estimation
3. Interpretation of the Fitted Logistic Regression Model
Dichotomous Independent Variable
Polychotomous Independent Variable
Continuous Independent Variable
Presentation and Interpretation of the Fitted Values
4. Model-Building Strategies and Methods for Logistic Regression
Purposeful Selection of Covariates
Types of Data & Measurement Scales: Nominal, Ordinal, Interval, and Ratio.
These are simply ways to categorize different types of variables.
Nominal - Nominal scales are used for labeling variables, without any quantitative value. A good way to remember this is that “nominal” sounds a lot like “name”, and nominal scales act as names or labels. Examples of nominal variables include region, zip code, gender, and religious affiliation. The researcher can also code the categories of a nominal scale to simplify the analysis, for example M = Male, F = Female.
Ordinal - This level of measurement involves ordering or ranking the values of the variable being measured. The order of the values is what matters; the differences between them are not really known. For example, is the difference between “OK” and “Unhappy” the same as the difference between “Very Happy” and “Happy”? We can’t say. Ordinal scales are typically measures of non-numeric concepts such as satisfaction, happiness, and discomfort.
Interval - The interval level of measurement not only classifies and orders the measurements, but also specifies that the distances between adjacent points on the scale are equal along the whole scale, from low to high. For example, on a standardized intelligence measure, a 10-point difference in IQ scores has the same meaning anywhere along the scale. Thus, the difference in IQ test scores between 80 and 90 is the same as the difference between 110 and 120. However, it would not be correct to say that a person with an IQ score of 100 is twice as intelligent as a person with a score of 50. The reason is that intelligence test scales (and other similar interval scales) do not have a true zero that represents a complete absence of intelligence.
Ratio - In this level of measurement, the observations have equal intervals and, in addition, a true zero point. This true zero is what sets the ratio level apart from the other levels of measurement; its remaining properties match those of the interval level, including equal distances between the points on the scale.
The four data types
Attribute | Nominal | Ordinal | Interval | Ratio
Name2 | Categorical | Sequence | Equal interval | Ratio
Name3 | Set | Fully ordered, rank ordered | Unit size fixed | Zero or ref. pt fixed
Statistics | Count, mode, chi-squared | + median, rank order correlation | + ANOVA, mean, SD | + Logs??
Example1 | Set of participants, makes of car | Order of finishing a race | Centigrade scale | Degrees Kelvin or absolute
Types of relativity | A ? B | A > B | |A - B| > |C - D| | ?
Types of absolute | The identity of individual entities | Order, sequence | Intervals, differences | Ratios, proportions
Probability:
$$P = \frac{\text{outcomes of interest}}{\text{all possible outcomes}}$$
Odds:
$$\text{Odds} = \frac{P(\text{occurring})}{P(\text{not occurring})} = \frac{p}{1 - p}$$
The odds of an event are the number of events divided by the number of non-events.
Odds ratio - the odds ratio is a ratio of two odds:
$$\text{Odds ratio} = \frac{\text{odds}_1}{\text{odds}_0} = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}$$
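The definitions above can be tried out numerically. The following is a minimal Python sketch; the function names and the two probabilities are illustrative assumptions, not values taken from any dataset in these notes.

```python
def odds(p: float) -> float:
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_ratio(p1: float, p0: float) -> float:
    """Ratio of the odds in group 1 to the odds in group 0."""
    return odds(p1) / odds(p0)


# Hypothetical probabilities, chosen only for illustration.
p_group1, p_group0 = 0.8, 0.5

print(odds(p_group1))                  # odds in group 1 (about 4 to 1)
print(odds(p_group0))                  # odds in group 0 (1 to 1)
print(odds_ratio(p_group1, p_group0))  # odds are about 4 times higher in group 1
```

Note that a probability of 0.5 corresponds to odds of 1, and that the odds grow without bound as p approaches 1.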
Introduction to the Logistic Regression Model
Logistic regression is the appropriate regression analysis to conduct when the dependent variable (y) is dichotomous (binary), such as “yes” or “no”, “0” or “1”, or “A” or “B”. Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. Typical dichotomous dependent variables are male/female, smoker/non-smoker, and success/failure. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables. The logistic regression model is the most frequently used regression model for the analysis of these data. The independent variables are often called covariates.
What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is categorical. This difference between logistic and linear regression is reflected both in the form of the model and its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression.
There are three primary uses of logistic regression:
1. Prediction of group membership and outcome.
The goal is to correctly predict the category of the outcome for individual cases. Thus, the research question asked is whether an outcome can be predicted from a selected set of independent variables. For instance, in epidemiological studies, can the development of lung cancer be predicted from the incidence and duration of smoking as well as from demographic variables such as gender, age, and socioeconomic status (SES)?
2. Logistic regression provides knowledge of the relationships and strengths among the variables.
The goal is to identify which independent variables predict the outcome, that is, increase or decrease the probability of the outcome or have no effect. For example, does inclusion of information about the incidence and duration of smoking improve prediction of lung cancer, and is a particular variable associated with an increase or decrease in the probability that a case has lung cancer? These parameter estimates (the coefficients of the predictors included in a model) can also be used to calculate and interpret the odds ratio. For instance, what are the odds that a person has lung cancer at age 65, given that he has smoked 10 packs a day for the past 30 years?
3. Classification of cases.
The goal is to understand how reliable the logistic regression model is in classifying cases for whom the outcome is known. For instance, how many people with or without lung cancer are diagnosed correctly? The researcher establishes a cut point of, say, .5, and then asks, for instance: How many people with lung cancer are correctly classified if everyone with a predicted probability greater than .5 is diagnosed as having lung cancer?
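The classification use can be sketched in a few lines of Python. The predicted probabilities and known outcomes below are hypothetical, as are the function names; the sketch only shows how a cut point of .5 turns predicted probabilities into classes and how the correct and incorrect classifications are counted.

```python
def classify(probabilities, cut_point=0.5):
    """Assign 1 (outcome present) when the predicted probability exceeds the cut point."""
    return [1 if p > cut_point else 0 for p in probabilities]


def classification_counts(actual, predicted):
    """Cross-tabulate known outcomes against predicted classes (both coded 0/1)."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["tp" if p == 1 else "fn"] += 1
        else:
            counts["fp" if p == 1 else "tn"] += 1
    return counts


# Hypothetical predicted probabilities and known outcomes (1 = lung cancer present).
predicted_prob = [0.91, 0.72, 0.55, 0.48, 0.35, 0.62, 0.20, 0.08]
actual = [1, 1, 1, 1, 0, 0, 0, 0]

counts = classification_counts(actual, classify(predicted_prob))
# Share of people with the disease who are classified as having it.
sensitivity = counts["tp"] / (counts["tp"] + counts["fn"])
# Share of people without the disease who are classified as not having it.
specificity = counts["tn"] / (counts["tn"] + counts["fp"])
print(counts, sensitivity, specificity)
```

Moving the cut point trades one kind of error for the other: a lower cut point classifies more cases as diseased, which raises the first share and lowers the second.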
Why will other regression procedures not work?
Simple linear regression uses one quantitative variable to predict another.
Multiple regression is linear regression with more than one independent variable.
Nonlinear regression still relates two quantitative variables, but the relationship is curvilinear.
Running a typical linear regression on a binary outcome causes major problems: binary data are not normally distributed, and normally distributed errors are an assumption of most of these other types of regression.
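A further practical symptom, beyond the distributional problem, is that a straight line fitted to a 0/1 outcome can give fitted values below 0 or above 1, which cannot be read as probabilities. The sketch below fits an ordinary least squares line to a small hypothetical dataset; the data and the helper name fit_line are assumptions made for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (closed-form solution)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: a binary outcome that switches from 0 to 1 as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]

# The smallest fitted value is negative and the largest exceeds one.
print(round(min(fitted), 3), round(max(fitted), 3))
```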
Example 1: Table 1.1 lists the age in years (AGE), and presence or absence of
evidence of significant coronary heart disease (CHD) for 100 subjects in a hypothetical
study of risk factors for heart disease. The table also contains an identifier
variable (ID) and an age group variable (AGEGRP). The outcome variable is CHD,
which is coded with a value of “0” to indicate that CHD is absent, or “1” to indicate
that it is present in the individual. In general, any two values could be used, but
we have found it most convenient to use zero and one. We refer to this dataset as the CHDAGE data.
A scatterplot of the data in Table 1.1 is given in Figure 1.1.
In this scatterplot, all points fall on one of two parallel lines representing the
absence of CHD (y = 0) or the presence of CHD (y = 1). There is some tendency
for the individuals with no evidence of CHD to be younger than those with evidence
of CHD. While this plot does depict the dichotomous nature of the outcome variable
quite clearly, it does not provide a clear picture of the nature of the relationship
between CHD and AGE.
The main problem with Figure 1.1 is that the variability in CHD at all ages is
large. This makes it difficult to see any functional relationship between AGE and
CHD. One common method of removing some variation, while still maintaining
the structure of the relationship between the outcome and the independent variable,
is to create intervals for the independent variable and compute the mean of the
outcome variable within each group. We use this strategy by grouping age into the
categories (AGEGRP) defined in Table 1.1. Table 1.2 contains, for each age group,
the frequency of occurrence of each outcome, as well as the percent with CHD present.
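Table 1.2: Frequency of CHD by age group
Age group | n | Absent | Present | Mean
20–29 | 10 | 9 | 1 | 0.100
30–34 | 15 | 13 | 2 | 0.133
35–39 | 12 | 9 | 3 | 0.250
40–44 | 15 | 10 | 5 | 0.333
45–49 | 13 | 7 | 6 | 0.462
50–54 | 8 | 3 | 5 | 0.625
55–59 | 17 | 4 | 13 | 0.765
60–69 | 10 | 2 | 8 | 0.800
Total | 100 | 57 | 43 | 0.430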
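The Mean column of Table 1.2 is simply the number of subjects with CHD present divided by the group size. A short Python sketch, using only the counts from the table (the variable names are illustrative), recomputes these group means:

```python
# Rows of Table 1.2: (age group, n, CHD absent, CHD present).
table_1_2 = [
    ("20-29", 10, 9, 1),
    ("30-34", 15, 13, 2),
    ("35-39", 12, 9, 3),
    ("40-44", 15, 10, 5),
    ("45-49", 13, 7, 6),
    ("50-54", 8, 3, 5),
    ("55-59", 17, 4, 13),
    ("60-69", 10, 2, 8),
]

# With CHD coded 0/1, the group mean equals the proportion with CHD present.
group_means = [round(present / n, 3) for _, n, _, present in table_1_2]

total_n = sum(n for _, n, _, _ in table_1_2)
total_present = sum(present for _, _, _, present in table_1_2)
overall_mean = total_present / total_n

for (group, *_), mean in zip(table_1_2, group_means):
    print(group, mean)
print("Total", overall_mean)
```

Because the outcome is coded 0 or 1, each group mean is an estimate of the probability of CHD in that age group.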
By examining Table 1.2, a clearer picture of the relationship begins to emerge. It
shows that as age increases, the proportion (mean) of individuals with evidence of
CHD increases. Figure 1.2 presents a plot of the percent of individuals with CHD
versus the midpoint of each age interval. This plot provides considerable insight
into the relationship between CHD and AGE in this study, but the functional form
for this relationship needs to be described. The plot in this figure is similar to what
one might obtain if this same process of grouping and averaging were performed
in a linear regression. We note two important differences.
Some important facts:
The dependent variable in logistic regression follows the Bernoulli distribution with an unknown probability, p.
The Bernoulli distribution is a special case of the binomial distribution where n = 1 (just one trial).
Success is “1” and failure is “0”.
The probability of success is “p” and the probability of failure is “q = 1 - p”.
In logistic regression, we estimate the unknown p for any given linear combination of the independent variables.
Therefore, we need a function that links the linear combination of the independent variables to the Bernoulli distribution; that link is called the logit.
The first difference concerns the nature of the relationship between the outcome
and independent variables.
In any regression problem, the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and is expressed as “E(Y|x)”, where Y denotes the outcome variable and x denotes a specific value of the independent variable. The quantity E(Y|x) is read “the expected value of Y, given the value x”. In linear regression, we assume that this mean may be expressed as an equation linear in x, such as E(Y|x) = β0 + β1x. This expression implies that it is possible for E(Y|x) to take on any value as x ranges between −∞ and +∞. The column labeled “Mean” in Table 1.2 provides an estimate of E(Y|x). We assume, for purposes of exposition, that the estimated values plotted in Figure 1.2 are close enough to the true values of E(Y|x) to provide a reasonable assessment of the functional relationship between CHD and AGE. With a dichotomous outcome variable, the conditional mean must be greater than or equal to zero and less than or equal to one (i.e., 0 ≤ E(Y|x) ≤ 1). This can be seen in Figure 1.2. In addition, the plot shows that this mean approaches zero and one “gradually”. The change in E(Y|x) per unit change in x becomes progressively smaller as the conditional mean gets closer to zero or one. The curve is said to be S-shaped and resembles a plot of the cumulative distribution of a continuous random variable. Thus, it should not seem surprising that some well-known cumulative distributions have been used to provide a model for E(Y|x) in the case when Y is dichotomous. The model we use is based on the logistic distribution.
In order to simplify notation, we use the quantity π(x) = E(Y|x) to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is:
$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{1.1}$$
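To see how equation (1.1) behaves, the sketch below evaluates it in Python over a range of ages. The coefficient values are hypothetical, chosen only to give a readable curve; they are not estimates from the CHDAGE data. The output illustrates the two properties discussed above: the values stay strictly between zero and one, and the change per unit of x shrinks as the curve nears either bound.

```python
import math


def logistic(x, beta0, beta1):
    """Conditional mean pi(x) in the form of equation (1.1)."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical coefficients for illustration only (not fitted to any data).
BETA0, BETA1 = -5.0, 0.1

ages = list(range(20, 81, 10))
probs = [logistic(a, BETA0, BETA1) for a in ages]
for a, p in zip(ages, probs):
    print(a, round(p, 3))


def unit_change(x):
    """Change in pi(x) when x increases by one unit."""
    return logistic(x + 1, BETA0, BETA1) - logistic(x, BETA0, BETA1)


# The curve is steepest in the middle and flattens toward both ends (S-shape).
print(unit_change(20), unit_change(50), unit_change(80))
```

With these assumed coefficients the curve passes through 0.5 at x = 50, where β0 + β1x = 0.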
A transformation of π(x) that is central to our study of logistic regression is the logit transformation. This transformation is defined, in terms of π(x), as:
$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x \tag{1.1*}$$
The importance of this transformation is that g(x) has many of the desirable properties of a linear regression model. The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x.
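The link between equation (1.1) and the logit follows in one step: since 1 − π(x) = 1/(1 + e^(β0 + β1x)), the odds π(x)/(1 − π(x)) equal e^(β0 + β1x), and taking the natural log gives β0 + β1x. The Python sketch below checks this numerically; the coefficients are hypothetical and the function names are illustrative.

```python
import math


def logistic(z):
    """Map a linear predictor z to a probability, as in equation (1.1)."""
    return math.exp(z) / (1.0 + math.exp(z))


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients for illustration only.
BETA0, BETA1 = -5.0, 0.1

for x in (20, 40, 60, 80):
    linear_predictor = BETA0 + BETA1 * x
    recovered = logit(logistic(linear_predictor))
    # The logit of pi(x) returns the linear predictor, up to rounding error.
    print(x, linear_predictor, round(recovered, 6))
```

The probability scale is bounded, while the logit scale is unbounded and linear in x, which is why the model is specified on the logit scale.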
The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as y = E(Y|x) + ε. The quantity ε is called the error and expresses an observation’s deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x is normal with mean E(Y|x) and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation, we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values. If y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).
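The mean and variance of the error can be checked directly from its two possible values. The following Python sketch does this for a few hypothetical values of π(x); the function name is illustrative.

```python
def error_moments(pi_x):
    """Mean and variance of the error for a 0/1 outcome with P(y = 1) = pi_x."""
    # Error is 1 - pi_x when y = 1 and -pi_x when y = 0.
    values = [1.0 - pi_x, -pi_x]
    probabilities = [pi_x, 1.0 - pi_x]
    mean = sum(v * p for v, p in zip(values, probabilities))
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, probabilities))
    return mean, variance


# Hypothetical values of the conditional mean, for illustration.
for pi_x in (0.1, 0.43, 0.8):
    mean, variance = error_moments(pi_x)
    # The variance should match pi_x * (1 - pi_x), up to rounding error.
    print(pi_x, round(mean, 12), round(variance, 6), round(pi_x * (1 - pi_x), 6))
```

Unlike the constant variance assumed in linear regression, this variance changes with x through π(x); it is largest at π(x) = 0.5 and shrinks toward zero as π(x) approaches 0 or 1.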
In summary, we have shown that in a regression analysis when the outcome variable is dichotomous:
1. The model for the conditional mean of the regression equation must be bounded between zero and one. The logistic regression model, π(x), given in equation (1.1), satisfies this constraint.
2. The binomial, not the normal, distribution describes the distribution of the errors and is the statistical distribution on which the analysis is based.
3. The principles that guide an analysis using linear regression also guide us in