Edit filter log

Details for log entry 36,830,797

02:06, 22 January 2024: Safetystuff (talk | contribs) triggered filter 550, performing the action "edit" on Logistic regression. Actions taken: Tag; Filter description: nowiki tags inserted into an article (examine | diff)

Changes made in edit

Action parameters

Variable	Value
Edit count of the user (`user_editcount`)	108
Name of the user account (`user_name`)	'Safetystuff'
Age of the user account (`user_age`)	472637
Groups (including implicit) the user is in (`user_groups`)	[ 0 => '*', 1 => 'user', 2 => 'autoconfirmed' ]
Rights that the user has (`user_rights`)	[ 0 => 'createaccount', 1 => 'read', 2 => 'edit', 3 => 'createtalk', 4 => 'writeapi', 5 => 'viewmyprivateinfo', 6 => 'editmyprivateinfo', 7 => 'editmyoptions', 8 => 'abusefilter-log-detail', 9 => 'urlshortener-create-url', 10 => 'centralauth-merge', 11 => 'abusefilter-view', 12 => 'abusefilter-log', 13 => 'vipsscaler-test', 14 => 'collectionsaveasuserpage', 15 => 'reupload-own', 16 => 'move-rootuserpages', 17 => 'createpage', 18 => 'minoredit', 19 => 'editmyusercss', 20 => 'editmyuserjson', 21 => 'editmyuserjs', 22 => 'sendemail', 23 => 'applychangetags', 24 => 'viewmywatchlist', 25 => 'editmywatchlist', 26 => 'spamblacklistlog', 27 => 'mwoauthmanagemygrants', 28 => 'reupload', 29 => 'upload', 30 => 'move', 31 => 'autoconfirmed', 32 => 'editsemiprotected', 33 => 'skipcaptcha', 34 => 'ipinfo', 35 => 'ipinfo-view-basic', 36 => 'transcode-reset', 37 => 'transcode-status', 38 => 'createpagemainns', 39 => 'movestable', 40 => 'autoreview' ]
Whether the user is editing from mobile app (`user_app`)	false
Whether or not a user is editing through the mobile interface (`user_mobile`)	true
Page ID (`page_id`)	226631
Page namespace (`page_namespace`)	0
Page title without namespace (`page_title`)	'Logistic regression'
Full page title (`page_prefixedtitle`)	'Logistic regression'
Edit protection level of the page (`page_restrictions_edit`)	[]
Page age in seconds (`page_age`)	653025557
Action (`action`)	'edit'
Edit summary/reason (`summary`)	'/* Applications */'
Old content model (`old_content_model`)	'wikitext'
New content model (`new_content_model`)	'wikitext'
Old page wikitext, before the edit (`old_wikitext`)	'{{Short description\|Statistical model for a binary dependent variable}} {{Redirect-distinguish\|Logit model\|Logit function}} [[File:Exam pass logistic curve.svg\|thumb\|400px\|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See {{slink\|\|Example}} for worked details.]] In [[statistics]], the '''logistic model''' (or '''logit model''') is a [[statistical model]] that models the [[log-odds]] of an event as a [[linear function (calculus)\|linear combination]] of one or more [[independent variable]]s. In [[regression analysis]], '''logistic regression'''<ref>{{cite journal\|last1=Tolles\|first1=Juliana\|last2=Meurer\|first2=William J\|date=2016\|title=Logistic Regression Relating Patient Characteristics to Outcomes\|journal=JAMA \|language=en\|volume=316\|issue=5\|pages=533–4\|issn=0098-7484\|oclc=6823603312\|doi=10.1001/jama.2016.7653\|pmid=27483067}}</ref> (or '''logit regression''') is [[estimation theory\|estimating]] the parameters of a logistic model (the coefficients in the linear combination). Formally, in binary logistic regression there is a single [[binary variable\|binary]] [[dependent variable]], coded by an [[indicator variable]], where the two values are labeled "0" and "1", while the [[independent variable]]s can each be a binary variable (two classes, coded by an indicator variable) or a [[continuous variable]] (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling;<ref name=Hosmer/> the function that converts log-odds to probability is the [[logistic function]], hence the name. The [[unit of measurement]] for the log-odds scale is called a ''[[logit]]'', from '''''log'''istic un'''it''''', hence the alternative names. See {{slink\|\|Background}} and {{slink\|\|Definition}} for formal mathematics, and {{slink\|\|Example}} for a worked example. Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see {{slink\|\|Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn\|Cramer\|2002\|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and the binary logistic regression generalized to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale\|ordered]], one can use the [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink\|\|Extensions}} for further extensions. The logistic regression model itself simply models probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other; this is a common way to make a [[binary classifier]]. Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink\|\|Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the logistic function is the [[natural parameter]] for the [[Bernoulli distribution]], and in this sense is the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions of the data being modeled; see {{slink\|\|Maximum entropy}}. The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)\|linear least squares]]; see {{section link\|\|Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)\|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink\|\|Comparison with linear regression}} for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn\|Cramer\|2002\|p=8}} beginning in {{harvtxt\|Berkson\|1944}}, where he coined "logit"; see {{slink\|\|History}}. {{Regression bar}} {{TOC limit}} ==Applications== Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score ([[TRISS]]), which is widely used to predict mortality in injured patients, was originally developed by Boyd ''{{Abbr\|et al.\|''et alia'', with others - usually other authors}}'' using logistic regression.<ref>{{cite journal\| last1 = Boyd \| first1 = C. R.\| last2 = Tolson \| first2 = M. A.\| last3 = Copes \| first3 = W. S.\| title = Evaluating trauma care: The TRISS method. Trauma Score and the Injury Severity Score\| journal = The Journal of Trauma\| volume = 27 \| issue = 4\| pages = 370–378\| year = 1987 \| pmid = 3106646 \| doi= 10.1097/00005373-198704000-00005\| doi-access = free}}</ref> Many other medical scales used to assess severity of a patient have been developed using logistic regression.<ref>{{cite journal \|pmid= 11268952 \|year= 2001\|last1= Kologlu \|first1= M.\|title=Validation of MPI and PIA II in two different groups of patients with secondary peritonitis \|journal=Hepato-Gastroenterology \|volume= 48 \|issue=37 \|pages= 147–51 \|last2=Elker\|first2=D. \|last3= Altun \|first3= H. \|last4= Sayek \|first4= I.}}</ref><ref>{{cite journal \|pmid= 11129812 \|year= 2000 \|last1= Biondo \|first1= S. \|title= Prognostic factors for mortality in left colonic peritonitis: A new scoring system \|journal= Journal of the American College of Surgeons\|volume= 191 \|issue= 6 \|pages= 635–42 \|last2= Ramos\|first2=E.\|last3=Deiros \|first3= M. \|last4=Ragué\|first4=J. M.\|last5=De Oca \|first5= J. \|last6= Moreno \|first6=P.\|last7=Farran\|first7=L.\|last8= Jaurrieta \|first8= E. \|doi= 10.1016/S1072-7515(00)00758-4}}</ref><ref>{{cite journal\|pmid=7587228 \|year= 1995 \|last1=Marshall \|first1= J. C.\|title=Multiple organ dysfunction score: A reliable descriptor of a complex clinical outcome\|journal=Critical Care Medicine\|volume= 23 \|issue= 10\|pages= 1638–52 \|last2= Cook\|first2=D. J.\|last3=Christou\|first3=N. V. \|last4= Bernard \|first4= G. R. \|last5=Sprung\|first5=C. L.\|last6=Sibbald\|first6=W. J.\|doi= 10.1097/00003246-199510000-00007}}</ref><ref>{{cite journal\|pmid=8254858\|year=1993 \|last1= Le Gall \|first1= J. R.\|title=A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study\|journal=JAMA\|volume=270\|issue= 24 \|pages= 2957–63 \|last2= Lemeshow \|first2=S.\|last3=Saulnier\|first3=F.\|doi= 10.1001/jama.1993.03510240069035}}</ref> Logistic regression may be used to predict the risk of developing a given disease (e.g. [[Diabetes mellitus\|diabetes]]; [[Coronary artery disease\|coronary heart disease]]), based on observed characteristics of the patient (age, sex, [[body mass index]], results of various [[blood test]]s, etc.).<ref name = "Freedman09">{{cite book \|author=David A. Freedman \|year=2009\|title=Statistical Models: Theory and Practice \|publisher=[[Cambridge University Press]]\|page=128\|author-link=David A. Freedman}}</ref><ref>{{cite journal \| pmid = 6028270 \| year = 1967 \| last1 = Truett \| first1 = J \| title = A multivariate analysis of the risk of coronary heart disease in Framingham \| journal = Journal of Chronic Diseases \| volume = 20 \| issue = 7 \| pages = 511–24 \| last2 = Cornfield\| first2 = J\| last3 = Kannel\| first3 = W \| doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name= strano05>{{cite journal \| author = M. Strano \| author2 = B.M. Colosimo \| year = 2006 \| title = Logistic regression analysis for experimental determination of forming limit diagrams \| journal = International Journal of Machine Tools and Manufacture \| volume = 46 \| issue = 6 \| pages = 673–682 \| doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name= safety>{{cite journal \| last1 = Palei \| first1 = S. K. \| last2 = Das \| first2 = S. K. \| doi = 10.1016/j.ssci.2008.01.002 \| title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach \| journal = Safety Science \| volume = 47 \| pages = 88–96 \| year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book\|title=Data Mining Techniques For Marketing, Sales and Customer Support\|last= Berry \|first=Michael J.A\|publisher=Wiley\|year=1997\|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]]. ==Example== ===Problem=== As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question: <blockquote> A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam? </blockquote> The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not [[cardinal number]]s. If the problem was changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple [[regression analysis]] could be used. The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0). {\| class="wikitable" \|- ! Hours (''x<sub>k</sub>'') \| 0.50\|\| 0.75\|\| 1.00\|\| 1.25\|\| 1.50\|\| 1.75\|\| 1.75\|\| 2.00\|\| 2.25\|\| 2.50\|\| 2.75\|\| 3.00\|\| 3.25\|\| 3.50\|\| 4.00\|\| 4.25\|\| 4.50\|\| 4.75\|\| 5.00 \|\| 5.50 \|- ! Pass (''y<sub>k</sub>'') \| 0\|\| 0\|\| 0\|\| 0\|\| 0\|\| 0\|\| 1\|\| 0\|\| 1\|\| 0\|\| 1\|\| 0\|\| 1\|\| 0\|\| 1\|\| 1\|\| 1\|\| 1\|\| 1\|\| 1 \|} We wish to fit a logistic function to the data consisting of the hours studied (''x<sub>k</sub>'') and the outcome of the test (''y<sub>k</sub>'' =1 for pass, 0 for fail). The data points are indexed by the subscript ''k'' which runs from <math>k=1</math> to <math>k=K=20</math>. The ''x'' variable is called the "[[explanatory variable]]", and the ''y'' variable is called the "[[categorical variable]]" consisting of two categories: "pass" or "fail" corresponding to the categorical values 1 and 0 respectively. ===Model=== [[File:Exam pass logistic curve.svg\|thumb\|400px\|Graph of a logistic regression curve fitted to the (''x<sub>m</sub>'',''y<sub>m</sub>'') data. The curve shows the probability of passing an exam versus hours studying.]] The [[logistic function]] is of the form: :<math>p(x)=\frac{1}{1+e^{-(x-\mu)/s}}</math> where ''μ'' is a [[location parameter]] (the midpoint of the curve, where <math>p(\mu)=1/2</math>) and ''s'' is a [[scale parameter]]. This expression may be rewritten as: :<math>p(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}</math> where <math>\beta_0 = -\mu/s</math> and is known as the [[vertical intercept\|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> (inverse scale parameter or [[rate parameter]]): these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>. ===Fit=== The usual measure of [[goodness of fit]] for a logistic regression uses [[logistic loss]] (or [[log loss]]), the negative [[log-likelihood]]. For a given ''x<sub>k</sub>'' and ''y<sub>k</sub>'', write <math>p_k=p(x_k)</math>. The {{tmath\|p_k}} are the probabilities that the corresponding {{tmath\|y_k}} will equal one and {{tmath\|1-p_k}} are the probabilities that they will be zero (see [[Bernoulli distribution]]). We wish to find the values of {{tmath\|\beta_0}} and {{tmath\|\beta_1}} which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (''y<sub>k</sub>''), the [[squared error loss]], is taken as a measure of the goodness of fit, and the best fit is obtained when that function is ''minimized''. The log loss for the ''k''-th point {{tmath\|\ell_k}} is: :<math>\ell_k = \begin{cases} -\ln p_k & \text{ if } y_k = 1, \\ -\ln (1 - p_k) & \text{ if } y_k = 0. \end{cases}</math> The log loss can be interpreted as the "[[surprisal]]" of the actual outcome {{tmath\|y_k}} relative to the prediction {{tmath\|p_k}}, and is a measure of [[information content]]. Log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when <math>p_k = 1</math> and <math>y_k = 1</math>, or <math>p_k = 0</math> and <math>y_k = 0</math>), and approaches infinity as the prediction gets worse (i.e., when <math>y_k = 1</math> and <math>p_k \to 0</math> or <math>y_k = 0 </math> and <math>p_k \to 1</math>), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any points, since {{tmath\|y_k}} is either 0 or 1, but {{tmath\|0 < p_k < 1}}. These can be combined into a single expression: :<math>\ell_k = -y_k\ln p_k - (1 - y_k)\ln (1 - p_k).</math> This expression is more formally known as the [[cross-entropy]] of the predicted distribution <math>\big(p_k, (1-p_k)\big)</math> from the actual distribution <math>\big(y_k, (1-y_k)\big)</math>, as probability distributions on the two-element space of (pass, fail). The sum of these, the total loss, is the overall negative log-likelihood {{tmath\|-\ell}}, and the best fit is obtained for those choices of {{tmath\|\beta_0}} and {{tmath\|\beta_1}} for which {{tmath\|-\ell}} is ''minimized''. Alternatively, instead of ''minimizing'' the loss, one can ''maximize'' its inverse, the (positive) log-likelihood: :<math>\ell = \sum_{k:y_k=1}\ln(p_k) + \sum_{k:y_k=0}\ln(1-p_k) = \sum_{k=1}^K \left(\,y_k \ln(p_k)+(1-y_k)\ln(1-p_k)\right)</math> or equivalently maximize the [[likelihood function]] itself, which is the probability that the given data set is produced by a particular logistic function: :<math>L = \prod_{k:y_k=1}p_k\,\prod_{k:y_k=0}(1-p_k)</math> This method is known as [[maximum likelihood estimation]]. ===Parameter estimation=== Since ''ℓ'' is nonlinear in {{tmath\|\beta_0}} and {{tmath\|\beta_1}}, determining their optimum values will require numerical methods. One method of maximizing ''ℓ'' is to require the derivatives of ''ℓ'' with respect to {{tmath\|\beta_0}} and {{tmath\|\beta_1}} to be zero: :<math>0 = \frac{\partial \ell}{\partial \beta_0} = \sum_{k=1}^K(y_k-p_k)</math> :<math>0 = \frac{\partial \ell}{\partial \beta_1} = \sum_{k=1}^K(y_k-p_k)x_k</math> and the maximization procedure can be accomplished by solving the above two equations for {{tmath\|\beta_0}} and {{tmath\|\beta_1}}, which, again, will generally require the use of numerical methods. The values of {{tmath\|\beta_0}} and {{tmath\|\beta_1}} which maximize ''ℓ'' and ''L'' using the above data are found to be: :<math>\beta_0 \approx -4.1</math> :<math>\beta_1 \approx 1.5</math> which yields a value for ''μ'' and ''s'' of: :<math>\mu = -\beta_0/\beta_1 \approx 2.7</math> :<math>s = 1/\beta_1 \approx 0.67</math> ===Predictions=== The {{tmath\|\beta_0}} and {{tmath\|\beta_1}} coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam. For example, for a student who studies 2 hours, entering the value <math>x = 2</math> into the equation gives the estimated probability of passing the exam of 0.25: : <math> t = \beta_0+2\beta_1 \approx - 4.1 + 2 \cdot 1.5 = -1.1 </math> : <math> p = \frac{1}{1 + e^{-t} } \approx 0.25 = \text{Probability of passing exam} </math> Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87: : <math>t = \beta_0+4\beta_1 \approx - 4.1 + 4 \cdot 1.5 = 1.9</math> : <math>p = \frac{1}{1 + e^{-t} } \approx 0.87 = \text{Probability of passing exam} </math> This table shows the estimated probability of passing the exam for several values of hours studying. {\| class="wikitable" \|- ! rowspan="2" \| Hours<br />of study<br />(x) ! colspan="3" \| Passing exam \|- ! Log-odds (t) !! Odds (e<sup>t</sup>) !! Probability (p) \|- style="text-align: right;" \| 1\|\| −2.57 \|\| 0.076 ≈ 1:13.1 \|\| 0.07 \|- style="text-align: right;" \| 2\|\| −1.07 \|\| 0.34 ≈ 1:2.91 \|\| 0.26 \|- style="text-align: right;" \|{{tmath\|\mu \approx 2.7}} \|\| 0 \|\|1 \|\| <math>\tfrac{1}{2}</math> = 0.50 \|- style="text-align: right;" \| 3\|\| 0.44 \|\| 1.55 \|\| 0.61 \|- style="text-align: right;" \| 4\|\| 1.94 \|\| 6.96 \|\| 0.87 \|- style="text-align: right;" \| 5\|\| 3.45 \|\| 31.4 \|\| 0.97 \|} ===Model evaluation=== The logistic regression analysis gives the following output. {\| class="wikitable" \|- ! !! Coefficient!! Std. Error !! ''z''-value !! ''p''-value (Wald) \|- style="text-align:right;" ! Intercept (''β''<sub>0</sub>) \| −4.1 \|\| 1.8 \|\| −2.3 \|\| 0.021 \|- style="text-align:right;" ! Hours (''β''<sub>1</sub>) \| 1.5 \|\| 0.6 \|\| 2.4 \|\| 0.017 \|} By the [[Wald test]], the output indicates that hours studying is significantly associated with the probability of passing the exam (<math>p = 0.017</math>). Rather than the Wald method, the recommended method<ref name="NeymanPearson1933">{{citation \| last1 = Neyman \| first1 = J. \| author-link1 = Jerzy Neyman\| last2 = Pearson \| first2 = E. S. \| author-link2 = Egon Pearson\| doi = 10.1098/rsta.1933.0009 \| title = On the problem of the most efficient tests of statistical hypotheses \| journal = [[Philosophical Transactions of the Royal Society of London A]] \| volume = 231 \| issue = 694–706 \| pages = 289–337 \| year = 1933 \| jstor = 91247 \|bibcode = 1933RSPTA.231..289N \| url = http://www.stats.org.uk/statistical-inference/NeymanPearson1933.pdf \| doi-access = free }}</ref> to calculate the ''p''-value for logistic regression is the [[likelihood-ratio test]] (LRT), which for these data give <math>p \approx 0.00064</math> (see {{slink\|\|Deviance and likelihood ratio tests}} below). ===Generalizations=== This simple model is an example of binary logistic regression, and has one explanatory variable and a binary categorical variable which can assume one of two categorical values. [[Multinomial logistic regression]] is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories. ==Background== [[Image:Logistic-curve.svg\|thumb\|320px\|right\|Figure 1. The standard logistic function <math>\sigma (t)</math>; <math>\sigma (t) \in (0,1)</math> for all <math>t</math>.]] ===Definition of the logistic function=== An explanation of logistic regression can begin with an explanation of the standard [[logistic function]]. The logistic function is a [[sigmoid function]], which takes any [[Real number\|real]] input <math>t</math>, and outputs a value between zero and one.<ref name=Hosmer/> For the logit, this is interpreted as taking input [[log-odds]] and having output [[probability]]. The ''standard'' logistic function <math>\sigma:\mathbb R\rightarrow (0,1)</math> is defined as follows: :<math>\sigma (t) = \frac{e^t}{e^t+1} = \frac{1}{1+e^{-t}}</math> A graph of the logistic function on the ''t''-interval (−6,6) is shown in Figure 1. Let us assume that <math>t</math> is a linear function of a single [[dependent and independent variables\|explanatory variable]] <math>x</math> (the case where <math>t</math> is a ''linear combination'' of multiple explanatory variables is treated similarly). We can then express <math>t</math> as follows: :<math>t = \beta_0 + \beta_1 x</math> And the general logistic function <math>p:\mathbb R \rightarrow (0,1)</math> can now be written as: :<math>p(x) = \sigma(t)= \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}</math> In the logistic model, <math>p(x)</math> is interpreted as the probability of the dependent variable <math>Y</math> equaling a success/case rather than a failure/non-case. It is clear that the [[Dependent and independent variables\|response variables]] <math>Y_i</math> are not identically distributed: <math>P(Y_i = 1\mid X)</math> differs from one data point <math>X_i</math> to another, though they are independent given [[design matrix]] <math>X</math> and shared parameters <math>\beta</math>.<ref name = "Freedman09" /> ===Definition of the inverse of the logistic function=== We can now define the [[logit]] (log odds) function as the inverse <math>g = \sigma^{-1}</math> of the standard logistic function. It is easy to see that it satisfies: :<math>g(p(x)) = \sigma^{-1} (p(x)) = \operatorname{logit} p(x) = \ln \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x ,</math> and equivalently, after exponentiating both sides we have the odds: :<math>\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}.</math> ===Interpretation of these terms=== In the above equations, the terms are as follows: * <math>g</math> is the logit function. The equation for <math>g(p(x))</math> illustrates that the [[logit]] (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression. * <math>\ln</math> denotes the [[natural logarithm]]. * <math>p(x)</math> is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for <math>p(x)</math> illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability <math>p(x)</math> ranges between 0 and 1. * <math>\beta_0</math> is the [[Y-intercept\|intercept]] from the linear regression equation (the value of the criterion when the predictor is equal to zero). * <math>\beta_1 x</math> is the regression coefficient multiplied by some value of the predictor. * base <math>e</math> denotes the exponential function. ===Definition of the odds=== The odds of the dependent variable equaling a case (given some linear combination <math>x</math> of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the [[logit]] serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression and the logit is easily converted back into the odds.<ref name=Hosmer/> So we define odds of the dependent variable equaling a case (given some linear combination <math>x</math> of the predictors) as follows: :<math>\text{odds} = e^{\beta_0 + \beta_1 x}.</math> ===The odds ratio=== For a continuous independent variable the odds ratio can be defined as: :[[File:Odds Ratio-1.jpg\|thumb\|The image represents an outline of what an odds ratio looks like in writing, through a template in addition to the test score example in the "Example" section of the contents. In simple terms, if we hypothetically get an odds ratio of 2 to 1, we can say... "For every one-unit increase in hours studied, the odds of passing (group 1) or failing (group 0) are (expectedly) 2 to 1 (Denis, 2019).]]<math> \mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{\left(\frac{p(x+1)}{1 - p(x+1)}\right)}{\left(\frac{p(x)}{1 - p(x)}\right)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}</math> This exponential relationship provides an interpretation for <math>\beta_1</math>: The odds multiply by <math>e^{\beta_1}</math> for every 1-unit increase in x.<ref>{{cite web\|url=https://stats.idre.ucla.edu/stata/faq/how-do-i-interpret-odds-ratios-in-logistic-regression/\|title=How to Interpret Odds Ratio in Logistic Regression?\|publisher=Institute for Digital Research and Education}}</ref> For a binary independent variable the odds ratio is defined as <math>\frac{ad}{bc}</math> where ''a'', ''b'', ''c'' and ''d'' are cells in a 2×2 [[contingency table]].<ref>{{cite book \| last = Everitt \| first = Brian \| title = The Cambridge Dictionary of Statistics \| publisher = Cambridge University Press \| location = Cambridge, UK New York \| year = 1998 \| isbn = 978-0-521-59346-5 \| url-access = registration \| url = https://archive.org/details/cambridgediction00ever_0 }}</ref> ===Multiple explanatory variables=== If there are multiple explanatory variables, the above expression <math>\beta_0+\beta_1x</math> can be revised to <math>\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+ \sum_{i=1}^m \beta_ix_i</math>. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a [[multiple regression]] with ''m'' explanators; the parameters <math>\beta_j</math> for all <math>j = 0, 1, 2, \dots, m</math> are all estimated. Again, the more traditional equations are: :<math>\log \frac{p}{1-p} = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m</math> and :<math>p = \frac{1}{1+b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_mx_m )}}</math> where usually <math>b=e</math>. ==Definition== The basic setup of logistic regression is as follows. We are given a dataset containing ''N'' points. Each point ''i'' consists of a set of ''m'' input variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''m,i''</sub> (also called [[independent variable]]s, explanatory variables, predictor variables, features, or attributes), and a [[binary variable\|binary]] outcome variable ''Y''<sub>''i''</sub> (also known as a [[dependent variable]], response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable. As in linear regression, the outcome variables ''Y''<sub>''i''</sub> are assumed to depend on the explanatory variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''m,i''</sub>. ; Explanatory variables The explanatory variables may be of any [[statistical data type\|type]]: [[real-valued]], [[binary variable\|binary]], [[categorical variable\|categorical]], etc. The main distinction is between [[continuous variable]]s and [[discrete variable]]s. (Discrete variables referring to more than two possible choices are typically coded using [[Dummy variable (statistics)\|dummy variables]] (or [[indicator variable]]s), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".) ;Outcome variables Formally, the outcomes ''Y''<sub>''i''</sub> are described as being [[Bernoulli distribution\|Bernoulli-distributed]] data, where each outcome is determined by an unobserved probability ''p''<sub>''i''</sub> that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms: ::<math> \begin{align} Y_i\mid x_{1,i},\ldots,x_{m,i} \ & \sim \operatorname{Bernoulli}(p_i) \\ \operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}] &= p_i \\ \Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= \begin{cases} p_i & \text{if }y=1 \\ 1-p_i & \text{if }y=0 \end{cases} \\ \Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= p_i^y (1-p_i)^{(1-y)} \end{align} </math> The meanings of these four lines are: # The first line expresses the [[probability distribution]] of each ''Y''<sub>''i''</sub> : conditioned on the explanatory variables, it follows a [[Bernoulli distribution]] with parameters ''p''<sub>''i''</sub>, the probability of the outcome of 1 for trial ''i''. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success ''p''<sub>''i''</sub> is not observed, only the outcome of an individual Bernoulli trial using that probability. # The second line expresses the fact that the [[expected value]] of each ''Y''<sub>''i''</sub> is equal to the probability of success ''p''<sub>''i''</sub>, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success ''p''<sub>''i''</sub>, then take the average of all the 1 and 0 outcomes, then the result would be close to ''p''<sub>''i''</sub>. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success. # The third line writes out the [[probability mass function]] of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes. # The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that ''Y''<sub>''i''</sub> can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either ''p''<sub>''i''</sub> or 1 − ''p''<sub>''i''</sub>, as in the previous line. ; Linear predictor function The basic idea of logistic regression is to use the mechanism already developed for [[linear regression]] by modeling the probability ''p''<sub>''i''</sub> using a [[linear predictor function]], i.e. a [[linear combination]] of the explanatory variables and a set of [[regression coefficient]]s that are specific to the model at hand but the same for all trials. The linear predictor function <math>f(i)</math> for a particular data point ''i'' is written as: :<math>f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},</math> where <math>\beta_0, \ldots, \beta_m</math> are [[regression coefficient]]s indicating the relative effect of a particular explanatory variable on the outcome. The model is usually put into a more compact form as follows: * The regression coefficients ''β''<sub>0</sub>, ''β''<sub>1</sub>, ..., ''β''<sub>''m''</sub> are grouped into a single vector '''''β''''' of size ''m'' + 1. * For each data point ''i'', an additional explanatory pseudo-variable ''x''<sub>0,''i''</sub> is added, with a fixed value of 1, corresponding to the [[Y-intercept\|intercept]] coefficient ''β''<sub>0</sub>. * The resulting explanatory variables ''x''<sub>0,''i''</sub>, ''x''<sub>1,''i''</sub>, ..., ''x''<sub>''m,i''</sub> are then grouped into a single vector '''''X<sub>i</sub>''''' of size ''m'' + 1. This makes it possible to write the linear predictor function as follows: :<math>f(i)= \boldsymbol\beta \cdot \mathbf{X}_i,</math> using the notation for a [[dot product]] between two vectors. [[File:Logistic Regression in SPSS.png\|thumb\|356x356px\|This is an example of an SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).]] === Many explanatory variables, two categories === The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables ''x<sub>1</sub>, x<sub>2</sub>,...'' and any number of categorical values <math>y=0,1,2,\dots</math>. To begin with, we may consider a logistic model with ''M'' explanatory variables, ''x<sub>1</sub>'', ''x<sub>2</sub>'' ... ''x<sub>M</sub>'' and, as in the example above, two categorical values (''y'' = 0 and 1). For the simple binary logistic regression model, we assumed a [[linear model\|linear relationship]] between the predictor variable and the log-odds (also called [[logit]]) of the event that <math>y=1</math>. This linear relationship may be extended to the case of ''M'' explanatory variables: :<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math> where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (''b'') is not restricted to the [[Euler number]] ''e''. In most applications, the base <math>b</math> of the logarithm is usually taken to be ''[[E (mathematical constant)\|e]]''. However, in some cases it can be easier to communicate results by working in base 2 or base 10. For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath\|(M+1)}}-dimensional vectors: :<math>\boldsymbol{x}=\{x_0,x_1,x_2,\dots,x_M\}</math> :<math>\boldsymbol{\beta}=\{\beta_0,\beta_1,\beta_2,\dots,\beta_M\}</math> with an added explanatory variable ''x<sub>0</sub>'' =1. The logit may now be written as: :<math>t =\sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot x</math> Solving for the probability ''p'' that <math>y=1</math> yields: :<math>p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}= \frac{1}{1+b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}}=S_b(t)</math>, where <math>S_b</math> is the [[sigmoid function]] with base <math>b</math>. The above formula shows that once the <math>\beta_m</math> are fixed, we can easily compute either the log-odds that <math>y=1</math> for a given observation, or the probability that <math>y=1</math> for a given observation. The main use-case of a logistic model is to be given an observation <math>\boldsymbol{x}</math>, and estimate the probability <math>p(\boldsymbol{x})</math> that <math>y=1</math>. The optimum beta coefficients may again be found by maximizing the log-likelihood. For ''K'' measurements, defining <math>\boldsymbol{x}_k</math> as the explanatory vector of the ''k''-th measurement, and <math>y_k</math> as the categorical outcome of that measurement, the log likelihood may be written in a form very similar to the simple <math>M=1</math> case above: :<math>\ell = \sum_{k=1}^K y_k \log_b(p(\boldsymbol{x_k}))+\sum_{k=1}^K (1-y_k) \log_b(1-p(\boldsymbol{x_k}))</math> As in the simple example above, finding the optimum ''β'' parameters will require numerical methods. One useful technique is to equate the derivatives of the log likelihood with respect to each of the ''β'' parameters to zero yielding a set of equations which will hold at the maximum of the log likelihood: :<math>\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^K y_k x_{mk} - \sum_{k=1}^K p(\boldsymbol{x}_k)x_{mk}</math> where ''x<sub>mk</sub>'' is the value of the ''x<sub>m</sub>'' explanatory variable from the ''k-th'' measurement. Consider an example with <math>M=2</math> explanatory variables, <math>b=10</math>, and coefficients <math>\beta_0=-3</math>, <math>\beta_1=1</math>, and <math>\beta_2=2</math> which have been determined by the above method. To be concrete, the model is: :<math>t=\log_{10}\frac{p}{1 - p} = -3 + x_1 + 2 x_2</math> :<math>p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot x}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1+b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} } = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}</math>, where ''p'' is the probability of the event that <math>y=1</math>. This can be interpreted as follows: * <math>\beta_0 = -3</math> is the [[y-intercept\|''y''-intercept]]. It is the log-odds of the event that <math>y=1</math>, when the predictors <math>x_1=x_2=0</math>. By exponentiating, we can see that when <math>x_1=x_2=0</math> the odds of the event that <math>y=1</math> are 1-to-1000, or <math>10^{-3}</math>. Similarly, the probability of the event that <math>y=1</math> when <math>x_1=x_2=0</math> can be computed as <math> 1/(1000 + 1) = 1/1001.</math> * <math>\beta_1 = 1</math> means that increasing <math>x_1</math> by 1 increases the log-odds by <math>1</math>. So if <math>x_1</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^1</math>. The '''probability''' of <math>y=1</math> has also increased, but it has not increased by as much as the odds have increased. * <math>\beta_2 = 2</math> means that increasing <math>x_2</math> by 1 increases the log-odds by <math>2</math>. So if <math>x_2</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^2.</math> Note how the effect of <math>x_2</math> on the log-odds is twice as great as the effect of <math>x_1</math>, but the effect on the odds is 10 times greater. But the effect on the '''probability''' of <math>y=1</math> is not as much as 10 times greater, it's only the effect on the odds that is 10 times greater. === Multinomial logistic regression: Many explanatory variables and many categories === {{main\|Multinomial logistic regression}} In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: The probability that the outcome was in category 1 was given by <math>p(\boldsymbol{x})</math>and the probability that the outcome was in category 0 was given by <math>1-p(\boldsymbol{x})</math>. The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup. In general, if we have {{tmath\|M+1}} explanatory variables (including ''x<sub>0</sub>'') and {{tmath\|N+1}} categories, we will need {{tmath\|N+1}} separate probabilities, one for each category, indexed by ''n'', which describe the probability that the categorical outcome ''y'' will be in category ''y=n'', conditional on the vector of covariates '''x'''. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base ''e'', these probabilities are: :<math>p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n\cdot \boldsymbol{x}}}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math> for <math>n=1,2,\dots,N</math> :<math>p_0(\boldsymbol{x}) = 1-\sum_{n=1}^N p_n(\boldsymbol{x})=\frac{1}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math> Each of the probabilities except <math>p_0(\boldsymbol{x})</math> will have their own set of regression coefficients <math>\boldsymbol{\beta}_n</math>. It can be seen that, as required, the sum of the <math>p_n(\boldsymbol{x})</math> over all categories ''n'' is 1. The selection of <math>p_0(\boldsymbol{x})</math> to be defined in terms of the other probabilities is artificial. Any of the probabilities could have been selected to be so defined. This special value of ''n'' is termed the "pivot index", and the log-odds (''t<sub>n</sub>'') are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables: :<math>t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}</math> Note also that for the simple case of <math>N=1</math>, the two-category case is recovered, with <math>p(\boldsymbol{x})=p_1(\boldsymbol{x})</math> and <math>p_0(\boldsymbol{x})=1-p_1(\boldsymbol{x})</math>. The log-likelihood that a particular set of ''K'' measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by ''k'', let the ''k''-th set of measured explanatory variables be denoted by <math>\boldsymbol{x}_k</math> and their categorical outcomes be denoted by <math>y_k</math> which can be equal to any integer in [0,N]. The log-likelihood is then: :<math>\ell = \sum_{k=1}^K \sum_{n=0}^N \Delta(n,y_k)\,\ln(p_n(\boldsymbol{x}_k))</math> where <math>\Delta(n,y_k)</math> is an [[indicator function]] which equals 1 if ''y<sub>k</sub> = n'' and zero otherwise. In the case of two explanatory variables, this indicator function was defined as ''y<sub>k</sub>'' when ''n'' = 1 and ''1-y<sub>k</sub>'' when ''n'' = 0. This was convenient, but not necessary.<ref>For example, the indicator function in this case could be defined as <math>\Delta(n,y)=1-(y-n)^2</math></ref> Again, the optimum beta coefficients may be found by maximizing the log-likelihood function generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients: :<math>\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^K \Delta(n,y_k)x_{mk} - \sum_{k=1}^K p_n(\boldsymbol{x}_k)x_{mk}</math> where <math>\beta_{nm}</math> is the ''m''-th coefficient of the <math>\boldsymbol{\beta}_n</math> vector and <math>x_{mk}</math> is the ''m''-th explanatory variable of the ''k''-th measurement. Once the beta coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories. ==Interpretations== There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations. ===As a generalized linear model=== The particular model used by logistic regression, which distinguishes it from standard [[linear regression]] and from other types of [[regression analysis]] used for [[binary-valued]] outcomes, is the way the probability of a particular outcome is linked to the linear predictor function: :<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}]) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}</math> Written using the more compact notation described above, this is: :<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i]) = \operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i</math> This formulation expresses logistic regression as a type of [[generalized linear model]], which predicts variables with various types of [[probability distribution]]s by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable. The intuition for transforming using the logit function (the natural log of the odds) was explained above{{Clarify\|reason=What exactly was explained?\|date=February 2023}}. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over <math>(-\infty,+\infty)</math> — thereby matching the potential range of the linear prediction function on the right side of the equation. Both the probabilities ''p''<sub>''i''</sub> and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. [[maximum likelihood estimation]], that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to [[regularization (mathematics)\|regularization]] conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing [[maximum a posteriori]] (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using [[Ridge regression\|a squared regularizing function]], which is equivalent to placing a zero-mean [[Gaussian distribution\|Gaussian]] [[prior distribution]] on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as [[iteratively reweighted least squares]] (IRLS) or, more commonly these days, a [[quasi-Newton method]] such as the [[L-BFGS\|L-BFGS method]].<ref>{{cite conference \|url=https://dl.acm.org/citation.cfm?id=1118871 \|title=A comparison of algorithms for maximum entropy parameter estimation \|last1=Malouf \|first1=Robert \|date= 2002\|book-title= Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002) \|pages= 49–55 \|doi=10.3115/1118853.1118871 \|doi-access=free }}</ref> The interpretation of the ''β''<sub>''j''</sub> parameter estimates is as the additive effect on the log of the [[odds]] for a unit change in the ''j'' the explanatory variable. In the case of a dichotomous explanatory variable, for instance, gender <math>e^\beta</math> is the estimate of the odds of having the outcome for, say, males compared with females. An equivalent formula uses the inverse of the logit function, which is the [[logistic function]], i.e.: :<math>\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}</math> The formula can also be written as a [[probability distribution]] (specifically, using a [[probability mass function]]): : <math>\Pr(Y_i=y\mid \mathbf{X}_i) = {p_i}^y(1-p_i)^{1-y} =\left(\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y} \left(1-\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1-y} = \frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i \cdot y} }{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}</math> ===As a latent-variable model=== The logistic model has an equivalent formulation as a [[latent-variable model]]. This formulation is common in the theory of [[discrete choice]] models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related [[probit model]]. Imagine that, for each trial ''i'', there is a continuous [[latent variable]] ''Y''<sub>''i''</sub><sup></sup> (i.e. an unobserved [[random variable]]) that is distributed as follows: : <math> Y_i^\ast = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i \, </math> where : <math>\varepsilon_i \sim \operatorname{Logistic}(0,1) \, </math> i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random [[error variable]] that is distributed according to a standard [[logistic distribution]]. Then ''Y''<sub>''i''</sub> can be viewed as an indicator for whether this latent variable is positive: : <math> Y_i = \begin{cases} 1 & \text{if }Y_i^\ast > 0 \ \text{ i.e. } - \varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i, \\ 0 &\text{otherwise.} \end{cases} </math> The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter ''μ'' (which sets the mean) is equivalent to a distribution with a zero location parameter, where ''μ'' has been added to the intercept coefficient. Both situations produce the same value for ''Y''<sub>''i''</sub><sup></sup> regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter ''s'' is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by ''s''. In the latter case, the resulting value of ''Y''<sub>''i''</sub><sup>''''</sup> will be smaller by a factor of ''s'' than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same ''Y''<sub>''i''</sub> choice. (This predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.) It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the [[generalized linear model]] and without any [[latent variable]]s. This can be shown as follows, using the fact that the [[cumulative distribution function]] (CDF) of the standard [[logistic distribution]] is the [[logistic function]], which is the inverse of the [[logit function]], i.e. :<math>\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)</math> Then: :<math> \begin{align} \Pr(Y_i=1\mid\mathbf{X}_i) &= \Pr(Y_i^\ast > 0\mid\mathbf{X}_i) \\[5pt] &= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i > 0) \\[5pt] &= \Pr(\varepsilon_i > -\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt] &= \Pr(\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(because the logistic distribution is symmetric)} \\[5pt] &= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt] &= p_i & & \text{(see above)} \end{align} </math> This formulation—which is standard in [[discrete choice]] models—makes clear the relationship between logistic regression (the "logit model") and the [[probit model]], which uses an error variable distributed according to a standard [[normal distribution]] instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat [[heavy-tailed distribution\|heavier tails]], which means that it is less sensitive to outlying data (and hence somewhat more [[robust statistics\|robust]] to model mis-specifications or erroneous data). ===Two-way latent-variable model=== Yet another formulation uses two separate latent variables: : <math> \begin{align} Y_i^{0\ast} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \, \\ Y_i^{1\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 \, \end{align} </math> where : <math> \begin{align} \varepsilon_0 & \sim \operatorname{EV}_1(0,1) \\ \varepsilon_1 & \sim \operatorname{EV}_1(0,1) \end{align} </math> where ''EV''<sub>1</sub>(0,1) is a standard type-1 [[extreme value distribution]]: i.e. :<math>\Pr(\varepsilon_0=x) = \Pr(\varepsilon_1=x) = e^{-x} e^{-e^{-x}}</math> Then : <math> Y_i = \begin{cases} 1 & \text{if }Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 &\text{otherwise.} \end{cases} </math> This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the [[multinomial logit]] model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical [[utility]] associated with making the associated choice, and thus motivate logistic regression in terms of [[utility theory]]. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating [[discrete choice]] models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.) The choice of the type-1 [[extreme value distribution]] seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through [[rational choice theory]]. It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions: :<math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math> :<math>\varepsilon = \varepsilon_1 - \varepsilon_0</math> An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values — and this effectively removes one [[Degrees of freedom (statistics)\|degree of freedom]]. Another critical fact is that the difference of two type-1 extreme-value-distributed variables is a logistic distribution, i.e. <math>\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0,1) .</math> We can demonstrate the equivalent as follows: :<math>\begin{align} \Pr(Y_i=1\mid\mathbf{X}_i) = {} & \Pr \left (Y_i^{1\ast} > Y_i^{0\ast}\mid\mathbf{X}_i \right ) & \\[5pt] = {} & \Pr \left (Y_i^{1\ast} - Y_i^{0\ast} > 0\mid\mathbf{X}_i \right ) & \\[5pt] = {} & \Pr \left (\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - \left (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \right ) > 0 \right ) & \\[5pt] = {} & \Pr \left ((\boldsymbol\beta_1 \cdot \mathbf{X}_i - \boldsymbol\beta_0 \cdot \mathbf{X}_i) + (\varepsilon_1 - \varepsilon_0) > 0 \right ) & \\[5pt] = {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) & \\[5pt] = {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute } \varepsilon\text{ as above)} \\[5pt] = {} & \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute }\boldsymbol\beta\text{ as above)} \\[5pt] = {} & \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(now, same as above model)}\\[5pt] = {} & \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt] = {} & \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt] = {} & p_i \end{align}</math> ====Example==== : {{Original research\|example\|discuss=Talk:Logistic_regression#Utility_theory_/_Elections_example_is_irrelevant\|date=May 2022}} As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the [[Parti Québécois]], which wants [[Quebec]] to secede from [[Canada]]). We would then use three latent variables, one for each choice. Then, in accordance with [[utility theory]], we can then interpret the latent variables as expressing the [[utility]] that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility — or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or moderate utility increase) for middle-incoming people; would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility since he/she is likely to own companies, which will have a harder time doing business in such an environment and probably lose money. These intuitions can be expressed as follows: {\|class="wikitable" \|+Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables \|- ! !! Center-right !! Center-left !! Secessionist \|- ! High-income \| strong + \|\| strong − \|\| strong − \|- ! Middle-income \| moderate + \|\| weak + \|\| none \|- ! Low-income \| none \|\| strong + \|\| none \|- \|} This clearly shows that # Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic. # Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable. Either it needs to be directly split up into ranges, or higher powers of income need to be added so that [[polynomial regression]] on income is effectively done. ==={{anchor\|log-linear model}}As a "log-linear" model=== Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the [[multinomial logit]]. Here, instead of writing the [[logit]] of the probabilities ''p''<sub>''i''</sub> as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes: : <math> \begin{align} \ln \Pr(Y_i=0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\ \ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z \end{align} </math> Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear a form that writes the [[logarithm]] of the associated probability as a linear predictor, with an extra term <math>- \ln Z</math> at the end. This term, as it turns out, serves as the [[normalizing factor]] ensuring that the result is a distribution. This can be seen by exponentiating both sides: : <math> \begin{align} \Pr(Y_i=0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\[5pt] \Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \end{align} </math> In this form it is clear that the purpose of ''Z'' is to ensure that the resulting distribution over ''Y''<sub>''i''</sub> is in fact a [[probability distribution]], i.e. it sums to 1. This means that ''Z'' is simply the sum of all un-normalized probabilities, and by dividing each probability by ''Z'', the probabilities become "[[normalizing constant\|normalized]]". That is: :<math> Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}</math> and the resulting equations are :<math> \begin{align} \Pr(Y_i=0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\[5pt] \Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align} </math> Or generally: :<math>\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}</math> This shows clearly how to generalize this formulation to more than two outcomes, as in [[multinomial logit]]. This general formulation is exactly the [[softmax function]] as in :<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math> In order to prove that this is equivalent to the previous model, the above model is overspecified, in that <math>\Pr(Y_i=0)</math> and <math>\Pr(Y_i=1)</math> cannot be independently specified: rather <math>\Pr(Y_i=0) + \Pr(Y_i=1) = 1</math> so knowing one automatically determines the other. As a result, the model is [[nonidentifiable]], in that multiple combinations of '''''β'''''<sub>0</sub> and '''''β'''''<sub>1</sub> will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities: :<math> \begin{align} \Pr(Y_i=1) &= \frac{e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 +\mathbf{C})\cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i}e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i}(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i})} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align} </math> As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set <math>\boldsymbol\beta_0 = \mathbf{0} .</math> Then, :<math>e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1</math> and so :<math> \Pr(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1+e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i</math> which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where <math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math> will produce equivalent results.) Most treatments of the [[multinomial logit]] model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in [[econometrics]] and [[political science]], where [[discrete choice]] models and [[utility theory]] reign, while the "log-linear" formulation here is more common in [[computer science]], e.g. [[machine learning]] and [[natural language processing]]. ===As a single-layer perceptron=== The model has an equivalent formulation :<math>p_i = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}. \, </math> This functional form is commonly called a single-layer [[perceptron]] or single-layer [[artificial neural network]]. A single-layer neural network computes a continuous output instead of a [[step function]]. The derivative of ''p<sub>i</sub>'' with respect to ''X'' = (''x''<sub>1</sub>, ..., ''x''<sub>''k''</sub>) is computed from the general form: : <math>y = \frac{1}{1+e^{-f(X)}}</math> where ''f''(''X'') is an [[analytic function]] in ''X''. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in [[backpropagation]]. This function is also preferred because its derivative is easily calculated: : <math>\frac{\mathrm{d}y}{\mathrm{d}X} = y(1-y)\frac{\mathrm{d}f}{\mathrm{d}X}. \, </math> ===In terms of binomial data=== A closely related model assumes that each ''i'' is associated not with a single Bernoulli trial but with ''n''<sub>''i''</sub> [[independent identically distributed]] trials, where the observation ''Y''<sub>''i''</sub> is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a [[binomial distribution]]: :<math>Y_i \,\sim \operatorname{Bin}(n_i,p_i),\text{ for }i = 1, \dots , n</math> An example of this distribution is the fraction of seeds (''p''<sub>''i''</sub>) that germinate after ''n''<sub>''i''</sub> are planted. In terms of [[expected value]]s, this model is expressed as follows: :<math>p_i = \operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_{i}}\,\right\|\,\mathbf{X}_i \right]\,, </math> so that :<math>\operatorname{logit}\left(\operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_i}\,\right\|\,\mathbf{X}_i \right]\right) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i\,,</math> Or equivalently: :<math>\Pr(Y_i=y\mid \mathbf{X}_i) = {n_i \choose y} p_i^y(1-p_i)^{n_i-y} ={n_i \choose y} \left(\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^y \left(1-\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{n_i-y}\,.</math> This model can be fit using the same sorts of methods as the above more basic model. ==Model fitting== {{expand section\|date=October 2016}} ===Maximum likelihood estimation (MLE)=== The regression coefficients are usually estimated using [[maximum likelihood estimation]].<ref name=Menard/><ref>{{cite journal \|first1=Christian \|last1=Gourieroux \|first2=Alain \|last2=Monfort \|title=Asymptotic Properties of the Maximum Likelihood Estimator in Dichotomous Logit Models \|journal=Journal of Econometrics \|volume=17 \|issue=1 \|year=1981 \|pages=83–97 \|doi=10.1016/0304-4076(81)90060-9 }}</ref> Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so that an iterative process must be used instead; for example [[Newton's method]]. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.<ref name="Menard" /> In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, [[multicollinearity]], [[sparse matrix\|sparseness]], or complete [[Separation (statistics)\|separation]]. Having a large ratio of variables to cases results in an overly conservative Wald statistic (discussed below) and can lead to non-convergence. [[Regularization (mathematics)\|Regularized]] logistic regression is specifically intended to be used in this situation. * Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases.<ref name=Menard/> To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic <ref name=Menard/> used to assess whether multicollinearity is unacceptably high. * Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The model will not converge with zero cell counts for categorical predictors because the natural logarithm of zero is an undefined value so that the final solution to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or add a constant to all cells.<ref name=Menard/> * Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion – all cases are accurately classified and the likelihood maximized with infinite coefficients. In such instances, one should re-examine the data, as there may be some kind of error.<ref name=Hosmer/>{{explain\|date=May 2017\|reason= Why is there likely some kind of error? How can this be remedied?}} * One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions of a parametric form for the index function and is robust to the choice of the link function (e.g., probit or logit).<ref name="sciencedirect.com">{{cite journal\| doi=10.1016/j.csda.2016.10.024 \| volume=108 \| title=Nonparametric estimation of dynamic discrete choice models for time series data \| year=2017 \| journal=Computational Statistics & Data Analysis \| pages=97–120 \| last1 = Park \| first1 = Byeong U. \| last2 = Simar \| first2 = Léopold \| last3 = Zelenyuk \| first3 = Valentin\| url=https://espace.library.uq.edu.au/view/UQ:415620/UQ415620_OA.pdf }}</ref> === Iteratively reweighted least squares (IRLS) === Binary logistic regression (<math>y=0</math> or <math> y=1</math>) can, for example, be calculated using ''iteratively reweighted least squares'' (IRLS), which is equivalent to maximizing the [[log-likelihood]] of a [[Bernoulli distribution\|Bernoulli distributed]] process using [[Newton's method]]. If the problem is written in vector matrix form, with parameters <math>\mathbf{w}^T=[\beta_0,\beta_1,\beta_2, \ldots]</math>, explanatory variables <math>\mathbf{x}(i)=[1, x_1(i), x_2(i), \ldots]^T</math> and expected value of the Bernoulli distribution <math>\mu(i)=\frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}(i)}}</math>, the parameters <math>\mathbf{w}</math> can be found using the following iterative algorithm: :<math>\mathbf{w}_{k+1} = \left(\mathbf{X}^T\mathbf{S}_k\mathbf{X}\right)^{-1}\mathbf{X}^T \left(\mathbf{S}_k \mathbf{X} \mathbf{w}_k + \mathbf{y} - \mathbf{\boldsymbol\mu}_k\right) </math> where <math>\mathbf{S}=\operatorname{diag}(\mu(i)(1-\mu(i)))</math> is a diagonal weighting matrix, <math>\boldsymbol\mu=[\mu(1), \mu(2),\ldots]</math> the vector of expected values, :<math>\mathbf{X}=\begin{bmatrix} 1 & x_1(1) & x_2(1) & \ldots\\ 1 & x_1(2) & x_2(2) & \ldots\\ \vdots & \vdots & \vdots \end{bmatrix}</math> The regressor matrix and <math>\mathbf{y}(i)=[y(1),y(2),\ldots]^T</math> the vector of response variables. More details can be found in the literature.<ref>{{cite book\|last1=Murphy\|first1=Kevin P.\|title=Machine Learning – A Probabilistic Perspective\|publisher=The MIT Press\|date=2012\|page=245\|isbn=978-0-262-01802-9}}</ref> ===Bayesian=== [[File:Logistic-sigmoid-vs-scaled-probit.svg\|right\|300px\|thumb\|Comparison of [[logistic function]] with a scaled inverse [[probit function]] (i.e. the [[cumulative distribution function\|CDF]] of the [[normal distribution]]), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the [[heavy-tailed distribution\|heavier tails]] of the logistic distribution.]] In a [[Bayesian statistics]] context, [[prior distribution]]s are normally placed on the regression coefficients, for example in the form of [[Gaussian distribution]]s. There is no [[conjugate prior]] of the [[likelihood function]] in logistic regression. When Bayesian inference was performed analytically, this made the [[posterior distribution]] difficult to calculate except in very low dimensions. Now, though, automatic software such as [[OpenBUGS]], [[Just another Gibbs sampler\|JAGS]], [[PyMC3]], [[Stan (software)\|Stan]] or [[Turing.jl]] allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as [[variational Bayesian methods]] and [[expectation propagation]]. ==="Rule of ten"=== {{main\|One in ten rule}} A widely used rule of thumb, the "[[one in ten rule]]", states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV); where ''event'' denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use <math>k</math> explanatory variables for an event (e.g. [[myocardial infarction]]) expected to occur in a proportion <math>p</math> of participants in the study will require a total of <math>10k/p</math> participants. However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning.<ref>{{cite journal\|pmid=27881078\|pmc=5122171\|year=2016\|last1=Van Smeden\|first1=M.\|title=No rationale for 1 variable per 10 events criterion for binary logistic regression analysis\|journal=BMC Medical Research Methodology\|volume=16\|issue=1\|page=163\|last2=De Groot\|first2=J. A.\|last3=Moons\|first3=K. G.\|last4=Collins\|first4=G. S.\|last5=Altman\|first5=D. G.\|last6=Eijkemans\|first6=M. J.\|last7=Reitsma\|first7=J. B.\|doi=10.1186/s12874-016-0267-3 \|doi-access=free }}</ref> According to some authors<ref>{{cite journal\|last=Peduzzi\|first=P\|author2=Concato, J \|author3=Kemper, E \|author4=Holford, TR \|author5=Feinstein, AR \|title=A simulation study of the number of events per variable in logistic regression analysis\|journal=[[Journal of Clinical Epidemiology]]\|date=December 1996\|volume=49\|issue=12\|pages=1373–9\|pmid=8970487\|doi=10.1016/s0895-4356(96)00236-3\|doi-access=free}}</ref> the rule is overly conservative in some circumstances, with the authors stating, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV. The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV".<ref>{{cite journal\|last1=Vittinghoff\|first1=E.\|last2=McCulloch\|first2=C. E.\|title=Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression\|journal=American Journal of Epidemiology\|date=12 January 2007\|volume=165\|issue=6\|pages=710–718\|doi=10.1093/aje/kwk052\|pmid=17182981\|doi-access=free}}</ref> Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.<ref name=plo14mod/> Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.<ref name=rms/> == Error and significance of fit == === Deviance and likelihood ratio test ─ a simple case === In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "[[overfitting]]" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is simply that which may be expected from overfitting. In short, for logistic regression, a statistic known as the [[deviance (statistics)\|deviance]] is defined which is a measure of the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is [[Chi-squared distribution\|chi-squared]] distributed, which allows a [[chi-squared test]] to be implemented in order to determine the significance of the explanatory variables. Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of ''K'' data points (''x<sub>k</sub>'', ''y<sub>k</sub>'') are fitted to a proposed model function of the form <math>y=b_0+b_1 x</math>. The fit is obtained by choosing the ''b'' parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point: :<math>\epsilon^2=\sum_{k=1}^K (b_0+b_1 x_k-y_k)^2.</math> The minimum value which constitutes the fit will be denoted by <math>\hat{\epsilon}^2</math> The idea of a [[null model]] may be introduced, in which it is assumed that the ''x'' variable is of no use in predicting the y<sub>k</sub> outcomes: The data points are fitted to a null model function of the form ''y=b<sub>0</sub>'' with a squared error term: :<math>\epsilon^2=\sum_{k=1}^K (b_0-y_k)^2.</math> The fitting process consists of choosing a value of ''b<sub>0</sub>'' which minimizes <math>\epsilon^2</math> of the fit to the null model, denoted by <math>\epsilon_\varphi^2</math> where the <math>\varphi</math> subscript denotes the null model. It is seen that the null model is optimized by <math>b_0=\overline{y}</math> where <math>\overline{y}</math> is the mean of the ''y<sub>k</sub>'' values, and the optimized <math>\epsilon_\varphi^2</math> is: :<math>\hat{\epsilon}_\varphi^2=\sum_{k=1}^K (\overline{y}-y_k)^2</math> which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points. We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum error will follow a [[chi-squared distribution]], with degrees of freedom equal those of the proposed model minus those of the null model which, in this case, will be 2-1=1. Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield an minimum error less than or equal to the minimum error using the original ''y<sub>k</sub>'', and so we can estimate how significant an improvement is given by the inclusion of the ''x'' variable in the proposed model. For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the <math>\epsilon^2</math> in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by <math>\hat{\ell}</math>. In the case of simple binary logistic regression, the set of ''K'' data points are fitted in a probabilistic sense to a function of the form: :<math>p(x)=\frac{1}{1+e^{-t}}</math> where {{tmath\|p(x)}} is the probability that <math>y=1</math>. The log-odds are given by: :<math>t=\beta_0+\beta_1 x</math> and the log-likelihood is: :<math>\ell=\sum_{k=1}^K \left( y_k \ln(p(x_k))+(1-y_k) \ln(1-p(x_k))\right)</math> For the null model, the probability that <math>y=1</math> is given by: :<math>p_\varphi(x)=\frac{1}{1+e^{-t_\varphi}}</math> The log-odds for the null model are given by: :<math>t_\varphi=\beta_0</math> and the log-likelihood is: :<math>\ell_\varphi=\sum_{k=1}^K \left( y_k \ln(p_\varphi)+(1-y_k) \ln(1-p_\varphi)\right)</math> Since we have <math>p_\varphi=\overline{y}</math> at the maximum of ''L'', the maximum log-likelihood for the null model is :<math>\hat{\ell}_\varphi=K(\,\overline{y} \ln(\overline{y}) + (1-\overline{y})\ln(1-\overline{y}))</math> The optimum <math>\beta_0</math> is: :<math>\beta_0=\ln\left(\frac{\overline{y}}{1-\overline{y}}\right)</math> where <math>\overline{y}</math> is again the mean of the ''y<sub>k</sub>'' values. Again, we can conceptually consider the fit of the proposed model to every permutation of the ''y<sub>k</sub>'' and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model: :<math> \hat{\ell} \ge \hat{\ell}_\varphi</math> Also, as an analog to the error of the linear regression case, we may define the [[deviance (statistics)\|deviance]] of a logistic regression fit as: :<math>D=\ln\left(\frac{\hat{L}^2}{\hat{L}_\varphi^2}\right) = 2(\hat{\ell}-\hat{\ell}_\varphi)</math> which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (''K'') increases, becoming exactly chi-square distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so have an estimate how significantly the model is improved by including the ''x<sub>k</sub>'' data points in the proposed model. For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is <math>\hat{\ell}_\varphi= -13.8629...</math> The maximum value of the log-likelihood for the simple model is <math>\hat{\ell}=-8.02988...</math> so that the deviance is <math>D = 2(\hat{\ell}-\hat{\ell}_\varphi)=11.6661...</math> Using the [[chi-squared test]] of significance, the integral of the [[chi-squared distribution]] with one degree of freedom from 11.6661... to infinity is equal to 0.00063649... This effectively means that about 6 out of a 10,000 fits to random ''y<sub>k</sub>'' can be expected to have a better fit (smaller deviance) than the given ''y<sub>k</sub>'' and so we can conclude that the inclusion of the ''x'' variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the [[null hypothesis]] with <math>1-D\approx 99.94 \%</math> confidence. ===Goodness of fit summary=== [[Goodness of fit]] in linear regression models is generally measured using [[R square\|R<sup>2</sup>]]. Since this has no direct analog in logistic regression, various methods<ref name=Greene>{{cite book \|last=Greene \|first=William N. \|title=Econometric Analysis \|edition=Fifth \|publisher=Prentice-Hall \|year=2003 \|isbn=978-0-13-066189-0 }}</ref>{{rp\|ch.21}} including the following can be used instead. ====Deviance and likelihood ratio tests==== In linear regression analysis, one is concerned with partitioning variance via the [[Partition of sums of squares\|sum of squares]] calculations – variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, [[Deviance (statistics)\|deviance]] is used in lieu of a sum of squares calculations.<ref name=Cohen/> Deviance is analogous to the sum of squares calculations in linear regression<ref name=Hosmer/> and is a measure of the lack of fit to the data in a logistic regression model.<ref name=Cohen/> When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model.<ref name=Hosmer/> This computation gives the [[likelihood-ratio test]]:<ref name=Hosmer/> :<math> D = -2\ln \frac{\text{likelihood of the fitted model}} {\text{likelihood of the saturated model}}.</math> In the above equation, {{mvar\|D}} represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign. {{mvar\|D}} can be shown to follow an approximate [[chi-squared distribution]].<ref name=Hosmer/> Smaller values indicate better fit as the fitted model deviates less from the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square values indicate very little unexplained variance and thus, good model fit. Conversely, a significant chi-square value indicates that a significant amount of the variance is unexplained. When the saturated model is not available (a common case), deviance is calculated simply as −2·(log likelihood of the fitted model), and the reference to the saturated model's log likelihood can be removed from all that follows without harm. Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept (which means "no predictors") and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model.<ref name=Cohen/> In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a <math>\chi^2_{s-p},</math> chi-square distribution with [[Degrees of freedom (statistics)\|degrees of freedom]]<ref name=Hosmer/> equal to the difference in the number of parameters estimated. Let :<math>\begin{align} D_{\text{null}} &=-2\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}\\[6pt] D_{\text{fitted}} &=-2\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}. \end{align} </math> Then the difference of both is: :<math>\begin{align} D_\text{null}- D_\text{fitted} &= -2 \left(\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}-\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}\right)\\[6pt] &= -2 \ln \frac{ \left( \dfrac{\text{likelihood of null model}}{\text{likelihood of the saturated model}}\right)}{\left(\dfrac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}\right)}\\[6pt] &= -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of fitted model}}. \end{align}</math> If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improve the model's fit. This is analogous to the {{mvar\|F}}-test used in linear regression analysis to assess the significance of prediction.<ref name=Cohen/> ====Pseudo-R-squared==== {{main article\| Pseudo-R-squared}} In linear regression the squared multiple correlation, {{mvar\|R}}<sup>2</sup> is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.<ref name=Cohen/> In logistic regression analysis, there is no agreed upon analogous measure, but there are several competing measures each with limitations.<ref name=Cohen/><ref name=":0">{{cite web \|url=https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf \|title=Measures of fit for logistic regression \|last=Allison \|first=Paul D. \|publisher=Statistical Horizons LLC and the University of Pennsylvania}}</ref> Four of the most commonly used indices and one less commonly used one are examined on this page: * Likelihood ratio {{mvar\|R}}<sup>2</sup>{{sub\|L}} * Cox and Snell {{mvar\|R}}<sup>2</sup>{{sub\|CS}} * Nagelkerke {{mvar\|R}}<sup>2</sup>{{sub\|N}} * McFadden {{mvar\|R}}<sup>2</sup>{{sub\|McF}} * Tjur {{mvar\|R}}<sup>2</sup>{{sub\|T}} ====Hosmer–Lemeshow test==== The [[Hosmer–Lemeshow test]] uses a test statistic that asymptotically follows a [[chi-squared distribution\|<math>\chi^2</math> distribution]] to assess whether or not the observed event rates match expected event rates in subgroups of the model population. This test is considered to be obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and relative low power.<ref>{{cite journal\|last1=Hosmer\|first1=D.W.\|title=A comparison of goodness-of-fit tests for the logistic regression model\|journal=Stat Med\|date=1997\|volume=16\|issue=9\|pages=965–980\|doi=10.1002/(sici)1097-0258(19970515)16:9<965::aid-sim509>3.3.co;2-f\|pmid=9160492}}</ref> ===Coefficient significance=== After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.<ref name=Cohen/> In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient – the odds ratio (see [[#Logistic function, odds, odds ratio, and logit\|definition]]). In linear regression, the significance of a regression coefficient is assessed by computing a ''t'' test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic. ====Likelihood ratio test==== The [[likelihood-ratio test]] discussed above to assess model fit is also the recommended procedure to assess the contribution of individual "predictors" to a given model.<ref name=Hosmer/><ref name=Menard/><ref name=Cohen/> In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-square distribution with a single degree of freedom. If the predictor model has significantly smaller deviance (c.f. chi-square using the difference in degrees of freedom of the two models), then one can conclude that there is a significant association between the "predictor" and the outcome. Although some common statistical packages (e.g. SPSS) do provide likelihood ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case.{{Citation needed\|date=October 2019}} To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous to determine the contribution of each predictor.<ref name=Cohen/> There is some debate among statisticians about the appropriateness of so-called "stepwise" procedures.{{weasel inline\|date=October 2019}} The fear is that they may not preserve nominal statistical properties and may become misleading.<ref>{{cite book \|first=Frank E. \|last=Harrell \|title=Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis \|location=New York \|publisher=Springer \|year=2010 \|isbn=978-1-4419-2918-1 }}{{page needed\|date=October 2019}}</ref> ====Wald statistic==== Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the [[Wald test\|Wald statistic]]. The Wald statistic, analogous to the ''t''-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution.<ref name=Menard/> : <math>W_j = \frac{\beta^2_j} {SE^2_{\beta_j}}</math> Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger increasing the probability of [[Type I and Type II errors\|Type-II error]]. The Wald statistic also tends to be biased when data are sparse.<ref name=Cohen/> ====Case-control sampling==== Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose there is a disease that affects 1 person in 10,000 and to collect our data we need to do a complete physical. It may be too expensive to do thousands of physicals of healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also retrospective sampling, or equivalently it is called unbalanced data. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data.<ref name="islr">https://class.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf slide 16</ref> Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data, if the model is correct in the general population, the <math>\beta_j</math> parameters are all correct except for <math>\beta_0</math>. We can correct <math>\beta_0</math> if we know the true prevalence as follows:<ref name="islr"/> : <math>\widehat{\beta}_0^* = \widehat{\beta}_0+\log \frac \pi {1 - \pi} - \log{ \tilde{\pi} \over {1 - \tilde{\pi}} } </math> where <math>\pi</math> is the true prevalence and <math>\tilde{\pi}</math> is the prevalence in the sample. ==Discussion== Like other forms of [[regression analysis]], logistic regression makes use of one or more predictor variables that may be either continuous or categorical. Unlike ordinary linear regression, however, logistic regression is used for predicting dependent variables that take [[categorical variable\|membership in one of a limited number of categories]] (treating the dependent variable in the binomial case as the outcome of a [[Bernoulli trial]]) rather than a continuous outcome. Given this difference, the assumptions of linear regression are violated. In particular, the residuals cannot be normally distributed. In addition, linear regression may make nonsensical predictions for a binary dependent variable. What is needed is a way to convert a binary variable into a continuous one that can take on any real value (negative or positive). To do that, binomial logistic regression first calculates the [[odds]] of the event happening for different levels of each independent variable, and then takes its [[logarithm]] to create a continuous criterion as a transformed version of the dependent variable. The logarithm of the odds is the {{math\|[[logit]]}} of the probability, the {{math\|logit}} is defined as follows: <math display="block"> \operatorname{logit} p = \ln \frac p {1-p} \quad \text{for } 0<p<1\,. </math> Although the dependent variable in logistic regression is Bernoulli, the logit is on an unrestricted scale.<ref name=Hosmer/> The logit function is the [[link function]] in this kind of generalized linear model, i.e. <math display="block"> \operatorname{logit} \operatorname{\mathcal E}(Y) = \beta_0 + \beta_1 x </math> {{mvar\|Y}} is the Bernoulli-distributed response variable and {{mvar\|x}} is the predictor variable; the {{mvar\|β}} values are the linear parameters. The {{math\|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math\|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success. == Maximum entropy == Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. [[Probit model\|probit regression]], [[Poisson regression]], etc.), the logistic regression solution is unique in that it is a [[Maximum entropy probability distribution\|maximum entropy]] solution.<ref name="Mount2011">{{cite web \|url=http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf \|title=The Equivalence of Logistic Regression and Maximum Entropy models \|last=Mount \|first=J. \|date=2011 \|website= \|publisher= \|access-date=Feb 23, 2022 \|quote=}}</ref> This is a case of a general property: an [[exponential family]] of distributions maximizes entropy, given an expected value. In the case of the logistic model, the logistic function is the [[natural parameter]] of the Bernoulli distribution (it is in "[[canonical form]]", and the logistic function is the canonical link function), while other sigmoid functions are non-canonical link functions; this underlies its mathematical elegance and ease of optimization. See {{slink\|Exponential family\|Maximum entropy derivation}} for details. === Proof === In order to show this, we use the method of [[Lagrange multipliers]]. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression.<ref name="Mount2011"/> As in the above section on [[#Multinomial logistic regression : Many explanatory variable and many categories\|multinomial logistic regression]], we will consider {{tmath\|M+1}} explanatory variables denoted {{tmath\|x_m}} and which include <math>x_0=1</math>. There will be a total of ''K'' data points, indexed by <math>k=\{1,2,\dots,K\}</math>, and the data points are given by <math>x_{mk}</math> and {{tmath\|y_k}}. The ''x<sub>mk</sub>'' will also be represented as an {{tmath\|(M+1)}}-dimensional vector <math>\boldsymbol{x}_k = \{x_{0k},x_{1k},\dots,x_{Mk}\}</math>. There will be {{tmath\|N+1}} possible values of the categorical variable ''y'' ranging from 0 to N. Let ''p<sub>n</sub>('''x''')'' be the probability, given explanatory variable vector '''x''', that the outcome will be <math>y=n</math>. Define <math>p_{nk}=p_n(\boldsymbol{x}_k)</math> which is the probability that for the ''k''-th measurement, the categorical outcome is ''n''. The Lagrangian will be expressed as a function of the probabilities ''p<sub>nk</sub>'' and will minimized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally and the fact that they sum to 1 is part of the Lagrangian formulation, rather than being assumed from the beginning. The first contribution to the Lagrangian is the [[Entropy (information theory)\|entropy]]: :<math>\mathcal{L}_{ent}=-\sum_{k=1}^K\sum_{n=0}^N p_{nk}\ln(p_{nk})</math> The log-likelihood is: :<math>\ell=\sum_{k=1}^K\sum_{n=0}^N \Delta(n,y_k)\ln(p_{nk})</math> Assuming the multinomial logistic function, the derivative of the log-likelihood with respect the beta coefficients was found to be: :<math>\frac{\partial \ell}{\partial \beta_{nm}}=\sum_{k=1}^K ( p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math> A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities ''p<sub>nk</sub>'' and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of ''p<sub>nk</sub>''. There are then (''M''+1)(''N''+1) fitting constraints and the fitting constraint term in the Lagrangian is then: :<math>\mathcal{L}_{fit}=\sum_{n=0}^N\sum_{m=0}^M \lambda_{nm}\sum_{k=1}^K (p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math> where the ''λ<sub>nm</sub>'' are the appropriate Lagrange multipliers. There are ''K'' normalization constraints which may be written: :<math>\sum_{n=0}^N p_{nk}=1</math> so that the normalization term in the Lagrangian is: :<math>\mathcal{L}_{norm}=\sum_{k=1}^K \alpha_k \left(1-\sum_{n=1}^N p_{nk}\right) </math> where the ''α<sub>k</sub>'' are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms: :<math>\mathcal{L}=\mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}</math> Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields: :<math>\frac{\partial \mathcal{L}}{\partial p_{n'k'}}=0=-\ln(p_{n'k'})-1+\sum_{m=0}^M (\lambda_{n'm}x_{mk'})-\alpha_{k'}</math> Using the more condensed vector notation: :<math>\sum_{m=0}^M \lambda_{nm}x_{mk} = \boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k</math> and dropping the primes on the ''n'' and ''k'' indices, and then solving for <math>p_{nk}</math> yields: :<math>p_{nk}=e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}/Z_k</math> where: :<math>Z_k=e^{1+\alpha_k}</math> Imposing the normalization constraint, we can solve for the ''Z<sub>k</sub>'' and write the probabilities as: :<math>p_{nk}=\frac{e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u\cdot\boldsymbol{x}_k}}</math> The <math>\boldsymbol{\lambda}_n</math> are not all independent. We can add any constant {{tmath\|(M+1)}}-dimensional vector to each of the <math>\boldsymbol{\lambda}_n</math> without changing the value of the <math>p_{nk}</math> probabilities so that there are only ''N'' rather than {{tmath\|N+1}} independent <math>\boldsymbol{\lambda}_n</math>. In the [[#Multinomial logistic regression : Many explanatory variable and many categories\|multinomial logistic regression]] section above, the <math>\boldsymbol{\lambda}_0</math> was subtracted from each <math>\boldsymbol{\lambda}_n</math> which set the exponential term involving <math>\boldsymbol{\lambda}_0</math> to 1, and the beta coefficients were given by <math>\boldsymbol{\beta}_n=\boldsymbol{\lambda}_n-\boldsymbol{\lambda}_0</math>. ===Other approaches=== In machine learning applications where logistic regression is used for binary classification, the MLE minimises the [[cross-entropy]] loss function. Logistic regression is an important [[machine learning]] algorithm. The goal is to model the probability of a random variable <math>Y</math> being 0 or 1 given experimental data.<ref>{{cite journal \| last = Ng \| first = Andrew \| year = 2000 \| pages = 16–19 \| journal = CS229 Lecture Notes \| title = CS229 Lecture Notes \| url = http://akademik.bahcesehir.edu.tr/~tevfik/courses/cmp5101/cs229-notes1.pdf}}</ref> Consider a [[generalized linear model]] function parameterized by <math>\theta</math>, :<math> h_\theta(X) = \frac{1}{1 + e^{-\theta^TX}} = \Pr(Y=1 \mid X; \theta) </math> Therefore, :<math> \Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X) </math> and since <math> Y \in \{0,1\}</math>, we see that <math> \Pr(y\mid X;\theta) </math> is given by <math> \Pr(y \mid X; \theta) = h_\theta(X)^y(1 - h_\theta(X))^{(1-y)}. </math> We now calculate the [[likelihood function]] assuming that all the observations in the sample are independently Bernoulli distributed, :<math>\begin{align} L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\ &= \prod_i \Pr(y_i \mid x_i; \theta) \\ &= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)} \end{align}</math> Typically, the log likelihood is maximized, :<math> N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) </math> which is maximized using optimization techniques such as [[gradient descent]]. Assuming the <math>(x, y)</math> pairs are drawn uniformly from the underlying distribution, then in the limit of large ''N'', :<math>\begin{align} & \lim \limits_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt] = {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt] = {} & - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X) \end{align}</math> where <math>H(Y\mid X)</math> is the [[conditional entropy]] and <math>D_\text{KL}</math> is the [[Kullback–Leibler divergence]]. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution. Intuitively searching for the model that makes the fewest assumptions in its parameters. ==Comparison with linear regression== Logistic regression can be seen as a special case of the [[generalized linear model]] and thus analogous to [[linear regression]]. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution <math>y \mid x</math> is a [[Bernoulli distribution]] rather than a [[Gaussian distribution]], because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the [[logistic function\|logistic distribution function]] because logistic regression predicts the '''probability''' of particular outcomes rather than the outcomes themselves. ==Alternatives== A common alternative to the logistic model (logit model) is the [[probit model]], as the related names suggest. From the perspective of [[generalized linear model]]s, these differ in the choice of [[link function]]: the logistic model uses the [[logit function]] (inverse logistic function), while the probit model uses the [[probit function]] (inverse [[error function]]). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard [[logistic distribution]] of errors and the second a standard [[normal distribution]] of errors.<ref>{{cite book\|title=Lecture Notes on Generalized Linear Models\|last=Rodríguez\|first=G.\|year=2007\|pages=Chapter 3, page 45\|url=http://data.princeton.edu/wws509/notes/}}</ref> Other [[sigmoid function]]s or error distributions can be used instead. Logistic regression is an alternative to Fisher's 1936 method, [[linear discriminant analysis]].<ref>{{cite book \|author1=Gareth James \|author2=Daniela Witten \|author3=Trevor Hastie \|author4=Robert Tibshirani \|title=An Introduction to Statistical Learning \|publisher=Springer \|year=2013 \|url=http://www-bcf.usc.edu/~gareth/ISL/ \|page=6}}</ref> If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis.<ref>{{cite journal\|last1=Pohar\|first1=Maja\|last2=Blas\|first2=Mateja\|last3=Turk\|first3=Sandra\|year=2004\|title=Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study\|url=https://www.researchgate.net/publication/229021894\|journal=Metodološki Zvezki\|volume= 1\|issue= 1}}</ref> The assumption of linear predictor effects can easily be relaxed using techniques such as [[Spline (mathematics)\|spline functions]].<ref name=rms/> ==History== A detailed history of the logistic regression is given in {{harvtxt\|Cramer\|2002}}. The logistic function was developed as a model of [[population growth]] and named "logistic" by [[Pierre François Verhulst]] in the 1830s and 1840s, under the guidance of [[Adolphe Quetelet]]; see {{slink\|Logistic function\|History}} for details.{{sfn\|Cramer\|2002\|pp=3–5}} In his earliest paper (1838), Verhulst did not specify how he fit the curves to the data.<ref>{{cite journal\|first= Pierre-François \|last=Verhulst \|year= 1838\| title = Notice sur la loi que la population poursuit dans son accroissement \| journal = Correspondance Mathématique et Physique \|volume = 10\| pages = 113–121 \|url = https://books.google.com/books?id=8GsEAAAAYAAJ \| format = PDF\| access-date = 3 December 2014}}</ref><ref>{{harvnb\|Cramer\|2002\|p=4\|ps=, "He did not say how he fitted the curves."}}</ref> In his more detailed paper (1845), Verhulst determined the three parameters of the model by making the curve pass through three observed points, which yielded poor predictions.<ref>{{cite journal\|first= Pierre-François \|last=Verhulst \|year= 1845\| title = Recherches mathématiques sur la loi d'accroissement de la population \| journal = Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles \|volume = 18 \| url = http://gdz.sub.uni-goettingen.de/dms/load/img/?PPN=PPN129323640_0018&DMDID=dmdlog7\| access-date = 2013-02-18\|trans-title= Mathematical Researches into the Law of Population Growth Increase}}</ref>{{sfn\|Cramer\|2002\|p=4}} The logistic function was independently developed in chemistry as a model of [[autocatalysis]] ([[Wilhelm Ostwald]], 1883).{{sfn\|Cramer\|2002\|p=7}} An autocatalytic reaction is one in which one of the products is itself a [[catalyst]] for the same reaction, while the supply of one of the reactants is fixed. This naturally gives rise to the logistic equation for the same reason as population growth: the reaction is self-reinforcing but constrained. The logistic function was independently rediscovered as a model of population growth in 1920 by [[Raymond Pearl]] and [[Lowell Reed]], published as {{harvtxt\|Pearl\|Reed\|1920}}, which led to its use in modern statistics. They were initially unaware of Verhulst's work and presumably learned about it from [[L. Gustave du Pasquier]], but they gave him little credit and did not adopt his terminology.{{sfn\|Cramer\|2002\|p=6}} Verhulst's priority was acknowledged and the term "logistic" revived by [[Udny Yule]] in 1925 and has been followed since.{{sfn\|Cramer\|2002\|p=6–7}} Pearl and Reed first applied the model to the population of the United States, and also initially fitted the curve by making it pass through three points; as with Verhulst, this again yielded poor results.{{sfn\|Cramer\|2002\|p=5}} In the 1930s, the [[probit model]] was developed and systematized by [[Chester Ittner Bliss]], who coined the term "probit" in {{harvtxt\|Bliss\|1934}}, and by [[John Gaddum]] in {{harvtxt\|Gaddum\|1933}}, and the model fit by [[maximum likelihood estimation]] by [[Ronald A. Fisher]] in {{harvtxt\|Fisher\|1935}}, as an addendum to Bliss's work. The probit model was principally used in [[bioassay]], and had been preceded by earlier work dating to 1860; see {{slink\|Probit model\|History}}. The probit model influenced the subsequent development of the logit model and these models competed with each other.{{sfn\|Cramer\|2002\|p=7–9}} The logistic model was likely first used as an alternative to the probit model in bioassay by [[Edwin Bidwell Wilson]] and his student [[Jane Worcester]] in {{harvtxt\|Wilson\|Worcester\|1943}}.{{sfn\|Cramer\|2002\|p=9}} However, the development of the logistic model as a general alternative to the probit model was principally due to the work of [[Joseph Berkson]] over many decades, beginning in {{harvtxt\|Berkson\|1944}}, where he coined "logit", by analogy with "probit", and continuing through {{harvtxt\|Berkson\|1951}} and following years.<ref>{{harvnb\|Cramer\|2002\|p=8\|ps=, "As far as I can see the introduction of the logistics as an alternative to the normal probability function is the work of a single person, Joseph Berkson (1899–1982), ..."}}</ref> The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the logit",{{sfn\|Cramer\|2002\|p=11}} particularly between 1960 and 1970. By 1970, the logit model achieved parity with the probit model in use in statistics journals and thereafter surpassed it. This relative popularity was due to the adoption of the logit outside of bioassay, rather than displacing the probit within bioassay, and its informal use in practice; the logit's popularity is credited to the logit model's computational simplicity, mathematical properties, and generality, allowing its use in varied fields.{{sfn\|Cramer\|2002\|p=10–11}} Various refinements occurred during that time, notably by [[David Cox (statistician)\|David Cox]], as in {{harvtxt\|Cox\|1958}}.<ref name=wal67est>{{cite journal\|last1=Walker\|first1=SH\|last2=Duncan\|first2=DB\|title=Estimation of the probability of an event as a function of several independent variables\|journal=Biometrika\|date=1967\|volume=54\|issue=1/2\|pages=167–178\|doi=10.2307/2333860\|jstor=2333860}}</ref> The multinomial logit model was introduced independently in {{harvtxt\|Cox\|1966}} and {{harvtxt\|Theil\|1969}}, which greatly increased the scope of application and the popularity of the logit model.{{sfn\|Cramer\|2002\|p=13}} In 1973 [[Daniel McFadden]] linked the multinomial logit to the theory of [[discrete choice]], specifically [[Luce's choice axiom]], showing that the multinomial logit followed from the assumption of [[independence of irrelevant alternatives]] and interpreting odds of alternatives as relative preferences;<ref>{{cite book \|chapter=Conditional Logit Analysis of Qualitative Choice Behavior \|chapter-url=https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf \|archive-url=https://web.archive.org/web/20181127110612/https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf \|archive-date=2018-11-27 \|access-date=2019-04-20 \|first=Daniel \|last=McFadden \|author-link=Daniel McFadden \|editor=P. Zarembka \|title=Frontiers in Econometrics \|pages=105–142 \|publisher=Academic Press \|location=New York \|year=1973 }}</ref> this gave a theoretical foundation for the logistic regression.{{sfn\|Cramer\|2002\|p=13}} ==Extensions== There are large numbers of extensions: * [[Multinomial logistic regression]] (or '''multinomial logit''') handles the case of a multi-way [[categorical variable\|categorical]] dependent variable (with unordered values, also called "classification"). The general case of having dependent variables with more than two values is termed ''polytomous regression''. * [[Ordered logistic regression]] (or '''ordered logit''') handles [[Levels of measurement#Ordinal measurement\|ordinal]] dependent variables (ordered values). * [[Mixed logit]] is an extension of multinomial logit that allows for correlations among the choices of the dependent variable. * An extension of the logistic model to sets of interdependent variables is the [[conditional random field]]. * [[Conditional logistic regression]] handles [[Matching (statistics)\|matched]] or [[stratification (clinical trials)\|stratified]] data when the strata are small. It is mostly used in the analysis of [[observational studies]]. ==Software== Most [[statistical software]] can do binary logistic regression. * [[SPSS]] ** [http://www-01.ibm.com/support/docview.wss?uid=swg21475013] for basic logistic regression. * [[Stata]] * [[SAS (software)\|SAS]] [https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#logistic_toc.htm PROC LOGISTIC] for basic logistic regression. [https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_catmod_sect003.htm PROC CATMOD] when all the variables are categorical. ** [https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#glimmix_toc.htm PROC GLIMMIX] for [[multilevel model]] logistic regression. * [[R (programming language)\|R]] [[Generalized linear model\|<code>glm</code>]] in the stats package (using family = binomial)<ref>{{cite book \|first1=Andrew \|last1=Gelman \|first2=Jennifer \|last2=Hill\|author2-link=Jennifer Hill \|title=Data Analysis Using Regression and Multilevel/Hierarchical Models \|location=New York \|publisher=Cambridge University Press \|year=2007 \|isbn=978-0-521-68689-1 \|pages=79–108 \|url=https://books.google.com/books?id=lV3DIdV0F9AC&pg=PA79 }}</ref> <code>lrm</code> in the [https://cran.r-project.org/web/packages/rms rms package] GLMNET package for an efficient implementation regularized logistic regression lmer for mixed effects logistic regression Rfast package command <code>gm_logistic</code> for fast and heavy calculations involving large scale data. arm package for bayesian logistic regression * [[Python (programming language)\|Python]] [http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html <code>Logit</code>] in the [[Statsmodels]] module. [http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html <code>LogisticRegression</code>] in the [[scikit-learn]] module. [https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/learn/LogisticRegressor <code>LogisticRegressor</code>] in the [[TensorFlow]] module. Full example of logistic regression in the Theano tutorial [http://deeplearning.net/software/theano/tutorial/examples.html] Bayesian Logistic Regression with ARD prior [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/skbayes/rvm_ard_models/fast_rvm.py code], [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/ipython_notebooks_tutorials/rvm_ard/ard_classification_demo.ipynb tutorial] Variational Bayes Logistic Regression with ARD prior [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/skbayes/rvm_ard_models/vrvm.py code], [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/ipython_notebooks_tutorials/rvm_ard/vbard_classification.ipynb tutorial] ** Bayesian Logistic Regression [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/skbayes/linear_models/bayes_logistic.py code], [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/ipython_notebooks_tutorials/linear_models/bayesian_logistic_regression_demo.ipynb tutorial] * [[NCSS (statistical software)\|NCSS]] ** [http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Logistic_Regression.pdf Logistic Regression in NCSS] * [[Matlab]] <code>mnrfit</code> in the [[Statistics Toolbox for MATLAB\|Statistics and Machine Learning Toolbox]] (with "incorrect" coded as 2 instead of 0) <code>fminunc/fmincon, fitglm, mnrfit, fitclinear, mle</code> can all do logistic regression. [[Java (programming language)\|Java]] ([[Java virtual machine\|JVM]]) [[LibLinear]] [https://nightlies.apache.org/flink/flink-ml-docs-release-2.0/docs/operators/classification/logisticregression/ Apache Flink] [[Apache Spark]] *[[Apache Spark\|SparkML]] supports Logistic Regression [[Field-programmable gate array\|FPGA]] ** [https://github.com/inaccel/logisticregression <code>Logistic Regresesion IP core</code>] in [[High-level synthesis\|HLS]] for [[Field-programmable gate array\|FPGA]]. Notably, [[Microsoft Excel]]'s statistics extension package does not include it. ==See also== {{Portal\|Mathematics}} * [[Logistic function]] * [[Discrete choice]] * [[Jarrow–Turnbull model]] * [[Limited dependent variable]] * [[Multinomial logit\|Multinomial logit model]] * [[Ordered logit]] * [[Hosmer–Lemeshow test]] * [[Brier score]] * [[mlpack]] - contains a [[C++]] implementation of logistic regression * [[Local case-control sampling]] * [[Logistic model tree]] ==References== {{Reflist\|32em\|refs= <ref name=Hosmer>{{cite book \| last1 = Hosmer \| first1 = David W. \| first2= Stanley \|last2=Lemeshow \| title = Applied Logistic Regression \|edition= 2nd \| publisher = Wiley \| year = 2000 \| isbn = 978-0-471-35632-5 }} {{page needed\|date=May 2012}}</ref> <ref name=Menard>{{cite book \| last = Menard \| first = Scott W. \| title = Applied Logistic Regression \|edition= 2nd \| publisher = SAGE \| year = 2002 \| isbn = 978-0-7619-2208-7 }} {{page needed\|date=May 2012}}</ref> <ref name=Cohen>{{cite book \| last1 = Cohen \| first1 = Jacob \| first2= Patricia \|last2=Cohen \|first3= Steven G. \|last3= West \|first4= Leona S. \|last4= Aiken \| author4-link= Leona S. Aiken \| title = Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences \|edition= 3rd \| publisher = Routledge \| year = 2002 \| isbn = 978-0-8058-2223-6 }} {{page needed\|date=May 2012}} </ref> <ref name=rms>{{cite book \| last = Harrell \| first = Frank E. \| title = Regression Modeling Strategies \| edition = 2nd \| publisher = New York; Springer \| year = 2015 \| isbn = 978-3-319-19424-0 \| doi = 10.1007/978-3-319-19425-7\| series = Springer Series in Statistics }} </ref> <ref name=plo14mod>{{cite journal\|last1=van der Ploeg\|first1=Tjeerd\|last2=Austin\|first2=Peter C.\|last3=Steyerberg\|first3=Ewout W.\|title=Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints\|journal=BMC Medical Research Methodology\|date=2014\|volume=14\|page=137\|doi=10.1186/1471-2288-14-137\|pmid=25532820\|pmc=4289553 \|doi-access=free }} </ref> }} ==Sources== {{Refbegin\|32em}} * {{Cite journal \| last = Berkson\| first = Joseph \| doi = 10.1080/01621459.1944.10500699 \| title = Application of the Logistic Function to Bio-Assay \| journal = Journal of the American Statistical Association \| date = 1944\| volume = 39\| issue = 227\| pages = 357–365\| jstor = 2280041}} * {{cite journal \|last1=Berkson \|first1=Joseph \|title=Why I Prefer Logits to Probits \|journal=Biometrics \|date=1951 \|volume=7 \|issue=4 \|pages=327–339 \|doi=10.2307/3001655 \|jstor=3001655 \|issn=0006-341X}} * {{cite journal \|first=C. I. \|last=Bliss \|title=The Method of Probits \|journal=[[Science (journal)\|Science]] \|volume=79 \|issue=2037 \|pages=38–39 \|year=1934 \|quote=These arbitrary probability units have been called 'probits'. \|doi=10.1126/science.79.2037.38 \|pmid=17813446 \|bibcode=1934Sci....79...38B }} * {{cite journal \|last=Cox\|first=David R. \|author-link=David Cox (statistician) \|title=The regression analysis of binary sequences (with discussion)\|journal=J R Stat Soc B\|date=1958\|volume=20\|issue=2\|pages=215–242\|jstor=2983890}} * {{cite book \|author-link=David Cox (statistician) \|last=Cox \|first=David R. \|year=1966 \|chapter=Some procedures connected with the logistic qualitative response curve \|title=Research Papers in Probability and Statistics (Festschrift for J. Neyman) \|editor=F. N. David \|publisher=Wiley \|location=London \|pages=55–71 }} * {{cite tech report \|last=Cramer\|first=J. S. \|title=The origins of logistic regression \|institution=Tinbergen Institute \|date=2002\|volume=119\|issue=4\|pages=167–178 \|doi=10.2139/ssrn.360300 \|url=https://papers.tinbergen.nl/02119.pdf }} ** Published in: {{cite journal \|title=The early origins of the logit model \|last=Cramer\|first=J. S. \|journal=Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences \|volume=35 \|number=4 \|year=2004 \|pages=613–626 \|doi=10.1016/j.shpsc.2004.09.003 }} * {{cite journal \|last=Fisher \|first=R. A. \|title=The Case of Zero Survivors in Probit Assays \|journal=Annals of Applied Biology \|volume=22 \|pages=164–165 \|year=1935 \|url=https://ebooks.adelaide.edu.au/dspace/handle/2440/15223 \|archive-url=https://archive.today/20140430203018/http://ebooks.adelaide.edu.au/dspace/handle/2440/15223 \|archive-date=2014-04-30 \|doi=10.1111/j.1744-7348.1935.tb07713.x }} * {{cite book \|last=Gaddum \|first=John H. \|title=Reports on Biological Standards: Methods of biological assay depending on a quantal response. III \|date=1933 \|publisher=H.M. Stationery Office \|oclc=808240121}} * {{cite journal \|first=Henri \|last=Theil \|title=A Multinomial Extension of the Linear Logit Model \|journal=International Economic Review \|volume=10 \|number=3 \|pages=251–59 \|year=1969 \|doi=10.2307/2525642 \|jstor=2525642 }} {{Cite journal \| last1 = Pearl\| first1 = Raymond \| last2 = Reed\| first2 = Lowell J. \| title = On the Rate of Growth of the Population of the United States since 1790 and Its Mathematical Representation \| journal = Proceedings of the National Academy of Sciences \| date = June 1920 \| volume = 6 \| number = 6 \| pages = 275–288 \| doi = 10.1073/pnas.6.6.275 \| pmid = 16576496 \| pmc = 1084522 \| bibcode = 1920PNAS....6..275P \| doi-access = free }} {{cite journal \|title=The Determination of L.D.50 and Its Sampling Error in Bio-Assay. \|last1=Wilson \|first1=E.B. \|author-link1=Edwin Bidwell Wilson \|last2=Worcester \|first2=J. \|author-link2=Jane Worcester \|year=1943 \|volume=29 \|number=2 \|pages=79–85 \|journal=[[Proceedings of the National Academy of Sciences of the United States of America]] \|pmid=16588606 \|pmc=1078563 \|doi=10.1073/pnas.29.2.79 \|bibcode=1943PNAS...29...79W \|doi-access=free }} * {{cite book \| last = Agresti \| first = Alan. \| title = Categorical Data Analysis \| publisher = New York: Wiley-Interscience \| year = 2002 \| isbn = 978-0-471-36093-3 }} * {{cite book \|last=Amemiya \|first=Takeshi \|chapter=Qualitative Response Models \|title=Advanced Econometrics \|year=1985 \|publisher=Basil Blackwell \|location=Oxford \|isbn=978-0-631-13345-2 \|pages=267–359 \|chapter-url=https://books.google.com/books?id=0bzGQE14CwEC&pg=PA267 }} * {{cite book \| last = Balakrishnan \| first = N. \| title = Handbook of the Logistic Distribution \| publisher = Marcel Dekker, Inc. \| year = 1991 \| isbn = 978-0-8247-8587-1 }} * {{cite book \|first=Christian \|last=Gouriéroux \|author-link=Christian Gouriéroux \|chapter=The Simple Dichotomy \|title=Econometrics of Qualitative Dependent Variables \|location=New York \|publisher=Cambridge University Press \|year=2000 \|isbn=978-0-521-58985-7 \|pages=6–37 \|chapter-url=https://books.google.com/books?id=dE2prs_U0QMC&pg=PA6 }} * {{cite book \| last = Greene \| first = William H. \| title = Econometric Analysis, fifth edition \| publisher = Prentice Hall \| year = 2003 \| isbn = 978-0-13-066189-0 }} * {{cite book \| last = Hilbe \| first = Joseph M. \| author-link=Joseph Hilbe \| title = Logistic Regression Models \| publisher = Chapman & Hall/CRC Press \| year = 2009 \| isbn = 978-1-4200-7575-5}} * {{cite book \| last = Hosmer \| first = David \| title = Applied logistic regression \| publisher = Wiley \| location = Hoboken, New Jersey \| year = 2013 \| isbn = 978-0-470-58247-3 }} * {{cite book \| last = Howell \| first = David C. \| title = Statistical Methods for Psychology, 7th ed \| publisher = Belmont, CA; Thomson Wadsworth \| year = 2010 \| isbn = 978-0-495-59786-5 }} * {{cite journal \| last = Peduzzi \| first = P. \|author2=J. Concato \|author3=E. Kemper \|author4=T.R. Holford \|author5=A.R. Feinstein \| title = A simulation study of the number of events per variable in logistic regression analysis \| journal = [[Journal of Clinical Epidemiology]] \| volume = 49 \| issue = 12 \| pages = 1373–1379 \| year = 1996 \| pmid = 8970487 \| doi=10.1016/s0895-4356(96)00236-3\| doi-access = free }} {{cite book \| last1 = Berry \| first1 = Michael J.A. \| first2 = Gordon \| last2 = Linoff \| title = Data Mining Techniques For Marketing, Sales and Customer Support \| publisher = Wiley \| year = 1997}} {{Refend}} ==External links== {{Wikiversity}} {{Commons category-inline}} {{YouTube\|id=JvioZoK1f4o&t=64m48s\|title=Econometrics Lecture (topic: Logit model)}} by [[Mark Thoma]] [http://www.omidrouhani.com/research/logisticregression/html/logisticregression.htm Logistic Regression tutorial] *[https://czep.net/stat/mlelr.html mlelr]: software in [[C (programming language)\|C]] for teaching purposes {{Statistics\|correlation}} {{Authority control}} [[Category:Logistic regression\| ]] [[Category:Predictive analytics]] [[Category:Regression models]]'
New page wikitext, after the edit (`new_wikitext`)	'{{Short description\|Statistical model for a binary dependent variable}} {{Redirect-distinguish\|Logit model\|Logit function}} [[File:Exam pass logistic curve.svg\|thumb\|400px\|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See {{slink\|\|Example}} for worked details.]] In [[statistics]], the '''logistic model''' (or '''logit model''') is a [[statistical model]] that models the [[log-odds]] of an event as a [[linear function (calculus)\|linear combination]] of one or more [[independent variable]]s. In [[regression analysis]], '''logistic regression'''<ref>{{cite journal\|last1=Tolles\|first1=Juliana\|last2=Meurer\|first2=William J\|date=2016\|title=Logistic Regression Relating Patient Characteristics to Outcomes\|journal=JAMA \|language=en\|volume=316\|issue=5\|pages=533–4\|issn=0098-7484\|oclc=6823603312\|doi=10.1001/jama.2016.7653\|pmid=27483067}}</ref> (or '''logit regression''') is [[estimation theory\|estimating]] the parameters of a logistic model (the coefficients in the linear combination). Formally, in binary logistic regression there is a single [[binary variable\|binary]] [[dependent variable]], coded by an [[indicator variable]], where the two values are labeled "0" and "1", while the [[independent variable]]s can each be a binary variable (two classes, coded by an indicator variable) or a [[continuous variable]] (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling;<ref name=Hosmer/> the function that converts log-odds to probability is the [[logistic function]], hence the name. The [[unit of measurement]] for the log-odds scale is called a ''[[logit]]'', from '''''log'''istic un'''it''''', hence the alternative names. See {{slink\|\|Background}} and {{slink\|\|Definition}} for formal mathematics, and {{slink\|\|Example}} for a worked example. Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see {{slink\|\|Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn\|Cramer\|2002\|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and the binary logistic regression generalized to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale\|ordered]], one can use the [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink\|\|Extensions}} for further extensions. The logistic regression model itself simply models probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other; this is a common way to make a [[binary classifier]]. Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink\|\|Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the logistic function is the [[natural parameter]] for the [[Bernoulli distribution]], and in this sense is the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions of the data being modeled; see {{slink\|\|Maximum entropy}}. The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)\|linear least squares]]; see {{section link\|\|Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)\|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink\|\|Comparison with linear regression}} for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn\|Cramer\|2002\|p=8}} beginning in {{harvtxt\|Berkson\|1944}}, where he coined "logit"; see {{slink\|\|History}}. {{Regression bar}} {{TOC limit}} ==Applications== Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score ([[TRISS]]), which is widely used to predict mortality in injured patients, was originally developed by Boyd ''{{Abbr\|et al.\|''et alia'', with others - usually other authors}}'' using logistic regression.<ref>{{cite journal\| last1 = Boyd \| first1 = C. R.\| last2 = Tolson \| first2 = M. A.\| last3 = Copes \| first3 = W. S.\| title = Evaluating trauma care: The TRISS method. Trauma Score and the Injury Severity Score\| journal = The Journal of Trauma\| volume = 27 \| issue = 4\| pages = 370–378\| year = 1987 \| pmid = 3106646 \| doi= 10.1097/00005373-198704000-00005\| doi-access = free}}</ref> Many other medical scales used to assess severity of a patient have been developed using logistic regression.<ref>{{cite journal \|pmid= 11268952 \|year= 2001\|last1= Kologlu \|first1= M.\|title=Validation of MPI and PIA II in two different groups of patients with secondary peritonitis \|journal=Hepato-Gastroenterology \|volume= 48 \|issue=37 \|pages= 147–51 \|last2=Elker\|first2=D. \|last3= Altun \|first3= H. \|last4= Sayek \|first4= I.}}</ref><ref>{{cite journal \|pmid= 11129812 \|year= 2000 \|last1= Biondo \|first1= S. \|title= Prognostic factors for mortality in left colonic peritonitis: A new scoring system \|journal= Journal of the American College of Surgeons\|volume= 191 \|issue= 6 \|pages= 635–42 \|last2= Ramos\|first2=E.\|last3=Deiros \|first3= M. \|last4=Ragué\|first4=J. M.\|last5=De Oca \|first5= J. \|last6= Moreno \|first6=P.\|last7=Farran\|first7=L.\|last8= Jaurrieta \|first8= E. \|doi= 10.1016/S1072-7515(00)00758-4}}</ref><ref>{{cite journal\|pmid=7587228 \|year= 1995 \|last1=Marshall \|first1= J. C.\|title=Multiple organ dysfunction score: A reliable descriptor of a complex clinical outcome\|journal=Critical Care Medicine\|volume= 23 \|issue= 10\|pages= 1638–52 \|last2= Cook\|first2=D. J.\|last3=Christou\|first3=N. V. \|last4= Bernard \|first4= G. R. \|last5=Sprung\|first5=C. L.\|last6=Sibbald\|first6=W. J.\|doi= 10.1097/00003246-199510000-00007}}</ref><ref>{{cite journal\|pmid=8254858\|year=1993 \|last1= Le Gall \|first1= J. R.\|title=A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study\|journal=JAMA\|volume=270\|issue= 24 \|pages= 2957–63 \|last2= Lemeshow \|first2=S.\|last3=Saulnier\|first3=F.\|doi= 10.1001/jama.1993.03510240069035}}</ref> Logistic regression may be used to predict the risk of developing a given disease (e.g. [[Diabetes mellitus\|diabetes]]; [[Coronary artery disease\|coronary heart disease]]), based on observed characteristics of the patient (age, sex, [[body mass index]], results of various [[blood test]]s, etc.).<ref name = "Freedman09">{{cite book \|author=David A. Freedman \|year=2009\|title=Statistical Models: Theory and Practice \|publisher=[[Cambridge University Press]]\|page=128\|author-link=David A. Freedman}}</ref><ref>{{cite journal \| pmid = 6028270 \| year = 1967 \| last1 = Truett \| first1 = J \| title = A multivariate analysis of the risk of coronary heart disease in Framingham \| journal = Journal of Chronic Diseases \| volume = 20 \| issue = 7 \| pages = 511–24 \| last2 = Cornfield\| first2 = J\| last3 = Kannel\| first3 = W \| doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name= strano05>{{cite journal \| author = M. Strano \| author2 = B.M. Colosimo \| year = 2006 \| title = Logistic regression analysis for experimental determination of forming limit diagrams \| journal = International Journal of Machine Tools and Manufacture \| volume = 46 \| issue = 6 \| pages = 673–682 \| doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name= safety>{{cite journal \| last1 = Palei \| first1 = S. K. \| last2 = Das \| first2 = S. K. \| doi = 10.1016/j.ssci.2008.01.002 \| title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach \| journal = Safety Science \| volume = 47 \| pages = 88–96 \| year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book\|title=Data Mining Techniques For Marketing, Sales and Customer Support\|last= Berry \|first=Michael J.A\|publisher=Wiley\|year=1997\|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]].Disaster planners and engineers rely on these models to predict decision take by householders or building occupants in small-scale and large-scales evacuations ,such as building fires, wildfires, hurricanes among others.<ref>{{Cite journal \|last=Mesa-Arango \|first=Rodrigo \|last2=Hasan \|first2=Samiul \|last3=Ukkusuri \|first3=Satish V. \|last4=Murray-Tuite \|first4=Pamela \|date=2013-02 \|title=Household-Level Model for Hurricane Evacuation Destination Type Choice Using Hurricane Ivan Data \|url=https://ascelibrary.org/doi/10.1061/%28ASCE%29NH.1527-6996.0000083 \|journal=Natural Hazards Review \|language=en \|volume=14 \|issue=1 \|pages=11–20 \|doi=10.1061/(ASCE)NH.1527-6996.0000083 \|issn=1527-6988}}</ref><ref>{{Cite journal \|last=Wibbenmeyer \|first=Matthew J. \|last2=Hand \|first2=Michael S. \|last3=Calkin \|first3=David E. \|last4=Venn \|first4=Tyron J. \|last5=Thompson \|first5=Matthew P. \|date=2013-06 \|title=Risk Preferences in Strategic Wildfire Decision Making: A Choice Experiment with U.S. Wildfire Managers \|url=https://onlinelibrary.wiley.com/doi/10.1111/j.1539-6924.2012.01894.x \|journal=Risk Analysis \|language=en \|volume=33 \|issue=6 \|pages=1021–1037 \|doi=10.1111/j.1539-6924.2012.01894.x \|issn=0272-4332}}</ref><ref>{{Cite journal \|last=Lovreglio \|first=Ruggiero \|last2=Borri \|first2=Dino \|last3=dell’Olio \|first3=Luigi \|last4=Ibeas \|first4=Angel \|date=2014-02-01 \|title=A discrete choice model based on random utilities for exit choice in emergency evacuations \|url=https://www.sciencedirect.com/science/article/pii/S0925753513002294 \|journal=Safety Science \|volume=62 \|pages=418–426 \|doi=10.1016/j.ssci.2013.10.004 \|issn=0925-7535}}</ref> These models help in the development of reliable [[Emergency management\|disaster managing plan]]<nowiki/>s and safer design for the [[built environment]]. ==Example== ===Problem=== As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question: <blockquote> A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam? </blockquote> The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not [[cardinal number]]s. If the problem was changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple [[regression analysis]] could be used. The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0). {\| class="wikitable" \|- ! Hours (''x<sub>k</sub>'') \| 0.50\|\| 0.75\|\| 1.00\|\| 1.25\|\| 1.50\|\| 1.75\|\| 1.75\|\| 2.00\|\| 2.25\|\| 2.50\|\| 2.75\|\| 3.00\|\| 3.25\|\| 3.50\|\| 4.00\|\| 4.25\|\| 4.50\|\| 4.75\|\| 5.00 \|\| 5.50 \|- ! Pass (''y<sub>k</sub>'') \| 0\|\| 0\|\| 0\|\| 0\|\| 0\|\| 0\|\| 1\|\| 0\|\| 1\|\| 0\|\| 1\|\| 0\|\| 1\|\| 0\|\| 1\|\| 1\|\| 1\|\| 1\|\| 1\|\| 1 \|} We wish to fit a logistic function to the data consisting of the hours studied (''x<sub>k</sub>'') and the outcome of the test (''y<sub>k</sub>'' =1 for pass, 0 for fail). The data points are indexed by the subscript ''k'' which runs from <math>k=1</math> to <math>k=K=20</math>. The ''x'' variable is called the "[[explanatory variable]]", and the ''y'' variable is called the "[[categorical variable]]" consisting of two categories: "pass" or "fail" corresponding to the categorical values 1 and 0 respectively. ===Model=== [[File:Exam pass logistic curve.svg\|thumb\|400px\|Graph of a logistic regression curve fitted to the (''x<sub>m</sub>'',''y<sub>m</sub>'') data. The curve shows the probability of passing an exam versus hours studying.]] The [[logistic function]] is of the form: :<math>p(x)=\frac{1}{1+e^{-(x-\mu)/s}}</math> where ''μ'' is a [[location parameter]] (the midpoint of the curve, where <math>p(\mu)=1/2</math>) and ''s'' is a [[scale parameter]]. This expression may be rewritten as: :<math>p(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}</math> where <math>\beta_0 = -\mu/s</math> and is known as the [[vertical intercept\|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> (inverse scale parameter or [[rate parameter]]): these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>. ===Fit=== The usual measure of [[goodness of fit]] for a logistic regression uses [[logistic loss]] (or [[log loss]]), the negative [[log-likelihood]]. For a given ''x<sub>k</sub>'' and ''y<sub>k</sub>'', write <math>p_k=p(x_k)</math>. The {{tmath\|p_k}} are the probabilities that the corresponding {{tmath\|y_k}} will equal one and {{tmath\|1-p_k}} are the probabilities that they will be zero (see [[Bernoulli distribution]]). We wish to find the values of {{tmath\|\beta_0}} and {{tmath\|\beta_1}} which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (''y<sub>k</sub>''), the [[squared error loss]], is taken as a measure of the goodness of fit, and the best fit is obtained when that function is ''minimized''. The log loss for the ''k''-th point {{tmath\|\ell_k}} is: :<math>\ell_k = \begin{cases} -\ln p_k & \text{ if } y_k = 1, \\ -\ln (1 - p_k) & \text{ if } y_k = 0. \end{cases}</math> The log loss can be interpreted as the "[[surprisal]]" of the actual outcome {{tmath\|y_k}} relative to the prediction {{tmath\|p_k}}, and is a measure of [[information content]]. Log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when <math>p_k = 1</math> and <math>y_k = 1</math>, or <math>p_k = 0</math> and <math>y_k = 0</math>), and approaches infinity as the prediction gets worse (i.e., when <math>y_k = 1</math> and <math>p_k \to 0</math> or <math>y_k = 0 </math> and <math>p_k \to 1</math>), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any points, since {{tmath\|y_k}} is either 0 or 1, but {{tmath\|0 < p_k < 1}}. These can be combined into a single expression: :<math>\ell_k = -y_k\ln p_k - (1 - y_k)\ln (1 - p_k).</math> This expression is more formally known as the [[cross-entropy]] of the predicted distribution <math>\big(p_k, (1-p_k)\big)</math> from the actual distribution <math>\big(y_k, (1-y_k)\big)</math>, as probability distributions on the two-element space of (pass, fail). The sum of these, the total loss, is the overall negative log-likelihood {{tmath\|-\ell}}, and the best fit is obtained for those choices of {{tmath\|\beta_0}} and {{tmath\|\beta_1}} for which {{tmath\|-\ell}} is ''minimized''. Alternatively, instead of ''minimizing'' the loss, one can ''maximize'' its inverse, the (positive) log-likelihood: :<math>\ell = \sum_{k:y_k=1}\ln(p_k) + \sum_{k:y_k=0}\ln(1-p_k) = \sum_{k=1}^K \left(\,y_k \ln(p_k)+(1-y_k)\ln(1-p_k)\right)</math> or equivalently maximize the [[likelihood function]] itself, which is the probability that the given data set is produced by a particular logistic function: :<math>L = \prod_{k:y_k=1}p_k\,\prod_{k:y_k=0}(1-p_k)</math> This method is known as [[maximum likelihood estimation]]. ===Parameter estimation=== Since ''ℓ'' is nonlinear in {{tmath\|\beta_0}} and {{tmath\|\beta_1}}, determining their optimum values will require numerical methods. One method of maximizing ''ℓ'' is to require the derivatives of ''ℓ'' with respect to {{tmath\|\beta_0}} and {{tmath\|\beta_1}} to be zero: :<math>0 = \frac{\partial \ell}{\partial \beta_0} = \sum_{k=1}^K(y_k-p_k)</math> :<math>0 = \frac{\partial \ell}{\partial \beta_1} = \sum_{k=1}^K(y_k-p_k)x_k</math> and the maximization procedure can be accomplished by solving the above two equations for {{tmath\|\beta_0}} and {{tmath\|\beta_1}}, which, again, will generally require the use of numerical methods. The values of {{tmath\|\beta_0}} and {{tmath\|\beta_1}} which maximize ''ℓ'' and ''L'' using the above data are found to be: :<math>\beta_0 \approx -4.1</math> :<math>\beta_1 \approx 1.5</math> which yields a value for ''μ'' and ''s'' of: :<math>\mu = -\beta_0/\beta_1 \approx 2.7</math> :<math>s = 1/\beta_1 \approx 0.67</math> ===Predictions=== The {{tmath\|\beta_0}} and {{tmath\|\beta_1}} coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam. For example, for a student who studies 2 hours, entering the value <math>x = 2</math> into the equation gives the estimated probability of passing the exam of 0.25: : <math> t = \beta_0+2\beta_1 \approx - 4.1 + 2 \cdot 1.5 = -1.1 </math> : <math> p = \frac{1}{1 + e^{-t} } \approx 0.25 = \text{Probability of passing exam} </math> Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87: : <math>t = \beta_0+4\beta_1 \approx - 4.1 + 4 \cdot 1.5 = 1.9</math> : <math>p = \frac{1}{1 + e^{-t} } \approx 0.87 = \text{Probability of passing exam} </math> This table shows the estimated probability of passing the exam for several values of hours studying. {\| class="wikitable" \|- ! rowspan="2" \| Hours<br />of study<br />(x) ! colspan="3" \| Passing exam \|- ! Log-odds (t) !! Odds (e<sup>t</sup>) !! Probability (p) \|- style="text-align: right;" \| 1\|\| −2.57 \|\| 0.076 ≈ 1:13.1 \|\| 0.07 \|- style="text-align: right;" \| 2\|\| −1.07 \|\| 0.34 ≈ 1:2.91 \|\| 0.26 \|- style="text-align: right;" \|{{tmath\|\mu \approx 2.7}} \|\| 0 \|\|1 \|\| <math>\tfrac{1}{2}</math> = 0.50 \|- style="text-align: right;" \| 3\|\| 0.44 \|\| 1.55 \|\| 0.61 \|- style="text-align: right;" \| 4\|\| 1.94 \|\| 6.96 \|\| 0.87 \|- style="text-align: right;" \| 5\|\| 3.45 \|\| 31.4 \|\| 0.97 \|} ===Model evaluation=== The logistic regression analysis gives the following output. {\| class="wikitable" \|- ! !! Coefficient!! Std. Error !! ''z''-value !! ''p''-value (Wald) \|- style="text-align:right;" ! Intercept (''β''<sub>0</sub>) \| −4.1 \|\| 1.8 \|\| −2.3 \|\| 0.021 \|- style="text-align:right;" ! Hours (''β''<sub>1</sub>) \| 1.5 \|\| 0.6 \|\| 2.4 \|\| 0.017 \|} By the [[Wald test]], the output indicates that hours studying is significantly associated with the probability of passing the exam (<math>p = 0.017</math>). Rather than the Wald method, the recommended method<ref name="NeymanPearson1933">{{citation \| last1 = Neyman \| first1 = J. \| author-link1 = Jerzy Neyman\| last2 = Pearson \| first2 = E. S. \| author-link2 = Egon Pearson\| doi = 10.1098/rsta.1933.0009 \| title = On the problem of the most efficient tests of statistical hypotheses \| journal = [[Philosophical Transactions of the Royal Society of London A]] \| volume = 231 \| issue = 694–706 \| pages = 289–337 \| year = 1933 \| jstor = 91247 \|bibcode = 1933RSPTA.231..289N \| url = http://www.stats.org.uk/statistical-inference/NeymanPearson1933.pdf \| doi-access = free }}</ref> to calculate the ''p''-value for logistic regression is the [[likelihood-ratio test]] (LRT), which for these data give <math>p \approx 0.00064</math> (see {{slink\|\|Deviance and likelihood ratio tests}} below). ===Generalizations=== This simple model is an example of binary logistic regression, and has one explanatory variable and a binary categorical variable which can assume one of two categorical values. [[Multinomial logistic regression]] is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories. ==Background== [[Image:Logistic-curve.svg\|thumb\|320px\|right\|Figure 1. The standard logistic function <math>\sigma (t)</math>; <math>\sigma (t) \in (0,1)</math> for all <math>t</math>.]] ===Definition of the logistic function=== An explanation of logistic regression can begin with an explanation of the standard [[logistic function]]. The logistic function is a [[sigmoid function]], which takes any [[Real number\|real]] input <math>t</math>, and outputs a value between zero and one.<ref name=Hosmer/> For the logit, this is interpreted as taking input [[log-odds]] and having output [[probability]]. The ''standard'' logistic function <math>\sigma:\mathbb R\rightarrow (0,1)</math> is defined as follows: :<math>\sigma (t) = \frac{e^t}{e^t+1} = \frac{1}{1+e^{-t}}</math> A graph of the logistic function on the ''t''-interval (−6,6) is shown in Figure 1. Let us assume that <math>t</math> is a linear function of a single [[dependent and independent variables\|explanatory variable]] <math>x</math> (the case where <math>t</math> is a ''linear combination'' of multiple explanatory variables is treated similarly). We can then express <math>t</math> as follows: :<math>t = \beta_0 + \beta_1 x</math> And the general logistic function <math>p:\mathbb R \rightarrow (0,1)</math> can now be written as: :<math>p(x) = \sigma(t)= \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}</math> In the logistic model, <math>p(x)</math> is interpreted as the probability of the dependent variable <math>Y</math> equaling a success/case rather than a failure/non-case. It is clear that the [[Dependent and independent variables\|response variables]] <math>Y_i</math> are not identically distributed: <math>P(Y_i = 1\mid X)</math> differs from one data point <math>X_i</math> to another, though they are independent given [[design matrix]] <math>X</math> and shared parameters <math>\beta</math>.<ref name = "Freedman09" /> ===Definition of the inverse of the logistic function=== We can now define the [[logit]] (log odds) function as the inverse <math>g = \sigma^{-1}</math> of the standard logistic function. It is easy to see that it satisfies: :<math>g(p(x)) = \sigma^{-1} (p(x)) = \operatorname{logit} p(x) = \ln \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x ,</math> and equivalently, after exponentiating both sides we have the odds: :<math>\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}.</math> ===Interpretation of these terms=== In the above equations, the terms are as follows: * <math>g</math> is the logit function. The equation for <math>g(p(x))</math> illustrates that the [[logit]] (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression. * <math>\ln</math> denotes the [[natural logarithm]]. * <math>p(x)</math> is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for <math>p(x)</math> illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability <math>p(x)</math> ranges between 0 and 1. * <math>\beta_0</math> is the [[Y-intercept\|intercept]] from the linear regression equation (the value of the criterion when the predictor is equal to zero). * <math>\beta_1 x</math> is the regression coefficient multiplied by some value of the predictor. * base <math>e</math> denotes the exponential function. ===Definition of the odds=== The odds of the dependent variable equaling a case (given some linear combination <math>x</math> of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the [[logit]] serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression and the logit is easily converted back into the odds.<ref name=Hosmer/> So we define odds of the dependent variable equaling a case (given some linear combination <math>x</math> of the predictors) as follows: :<math>\text{odds} = e^{\beta_0 + \beta_1 x}.</math> ===The odds ratio=== For a continuous independent variable the odds ratio can be defined as: :[[File:Odds Ratio-1.jpg\|thumb\|The image represents an outline of what an odds ratio looks like in writing, through a template in addition to the test score example in the "Example" section of the contents. In simple terms, if we hypothetically get an odds ratio of 2 to 1, we can say... "For every one-unit increase in hours studied, the odds of passing (group 1) or failing (group 0) are (expectedly) 2 to 1 (Denis, 2019).]]<math> \mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{\left(\frac{p(x+1)}{1 - p(x+1)}\right)}{\left(\frac{p(x)}{1 - p(x)}\right)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}</math> This exponential relationship provides an interpretation for <math>\beta_1</math>: The odds multiply by <math>e^{\beta_1}</math> for every 1-unit increase in x.<ref>{{cite web\|url=https://stats.idre.ucla.edu/stata/faq/how-do-i-interpret-odds-ratios-in-logistic-regression/\|title=How to Interpret Odds Ratio in Logistic Regression?\|publisher=Institute for Digital Research and Education}}</ref> For a binary independent variable the odds ratio is defined as <math>\frac{ad}{bc}</math> where ''a'', ''b'', ''c'' and ''d'' are cells in a 2×2 [[contingency table]].<ref>{{cite book \| last = Everitt \| first = Brian \| title = The Cambridge Dictionary of Statistics \| publisher = Cambridge University Press \| location = Cambridge, UK New York \| year = 1998 \| isbn = 978-0-521-59346-5 \| url-access = registration \| url = https://archive.org/details/cambridgediction00ever_0 }}</ref> ===Multiple explanatory variables=== If there are multiple explanatory variables, the above expression <math>\beta_0+\beta_1x</math> can be revised to <math>\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+ \sum_{i=1}^m \beta_ix_i</math>. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a [[multiple regression]] with ''m'' explanators; the parameters <math>\beta_j</math> for all <math>j = 0, 1, 2, \dots, m</math> are all estimated. Again, the more traditional equations are: :<math>\log \frac{p}{1-p} = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m</math> and :<math>p = \frac{1}{1+b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_mx_m )}}</math> where usually <math>b=e</math>. ==Definition== The basic setup of logistic regression is as follows. We are given a dataset containing ''N'' points. Each point ''i'' consists of a set of ''m'' input variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''m,i''</sub> (also called [[independent variable]]s, explanatory variables, predictor variables, features, or attributes), and a [[binary variable\|binary]] outcome variable ''Y''<sub>''i''</sub> (also known as a [[dependent variable]], response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable. As in linear regression, the outcome variables ''Y''<sub>''i''</sub> are assumed to depend on the explanatory variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''m,i''</sub>. ; Explanatory variables The explanatory variables may be of any [[statistical data type\|type]]: [[real-valued]], [[binary variable\|binary]], [[categorical variable\|categorical]], etc. The main distinction is between [[continuous variable]]s and [[discrete variable]]s. (Discrete variables referring to more than two possible choices are typically coded using [[Dummy variable (statistics)\|dummy variables]] (or [[indicator variable]]s), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".) ;Outcome variables Formally, the outcomes ''Y''<sub>''i''</sub> are described as being [[Bernoulli distribution\|Bernoulli-distributed]] data, where each outcome is determined by an unobserved probability ''p''<sub>''i''</sub> that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms: ::<math> \begin{align} Y_i\mid x_{1,i},\ldots,x_{m,i} \ & \sim \operatorname{Bernoulli}(p_i) \\ \operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}] &= p_i \\ \Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= \begin{cases} p_i & \text{if }y=1 \\ 1-p_i & \text{if }y=0 \end{cases} \\ \Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= p_i^y (1-p_i)^{(1-y)} \end{align} </math> The meanings of these four lines are: # The first line expresses the [[probability distribution]] of each ''Y''<sub>''i''</sub> : conditioned on the explanatory variables, it follows a [[Bernoulli distribution]] with parameters ''p''<sub>''i''</sub>, the probability of the outcome of 1 for trial ''i''. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success ''p''<sub>''i''</sub> is not observed, only the outcome of an individual Bernoulli trial using that probability. # The second line expresses the fact that the [[expected value]] of each ''Y''<sub>''i''</sub> is equal to the probability of success ''p''<sub>''i''</sub>, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success ''p''<sub>''i''</sub>, then take the average of all the 1 and 0 outcomes, then the result would be close to ''p''<sub>''i''</sub>. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success. # The third line writes out the [[probability mass function]] of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes. # The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that ''Y''<sub>''i''</sub> can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either ''p''<sub>''i''</sub> or 1 − ''p''<sub>''i''</sub>, as in the previous line. ; Linear predictor function The basic idea of logistic regression is to use the mechanism already developed for [[linear regression]] by modeling the probability ''p''<sub>''i''</sub> using a [[linear predictor function]], i.e. a [[linear combination]] of the explanatory variables and a set of [[regression coefficient]]s that are specific to the model at hand but the same for all trials. The linear predictor function <math>f(i)</math> for a particular data point ''i'' is written as: :<math>f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},</math> where <math>\beta_0, \ldots, \beta_m</math> are [[regression coefficient]]s indicating the relative effect of a particular explanatory variable on the outcome. The model is usually put into a more compact form as follows: * The regression coefficients ''β''<sub>0</sub>, ''β''<sub>1</sub>, ..., ''β''<sub>''m''</sub> are grouped into a single vector '''''β''''' of size ''m'' + 1. * For each data point ''i'', an additional explanatory pseudo-variable ''x''<sub>0,''i''</sub> is added, with a fixed value of 1, corresponding to the [[Y-intercept\|intercept]] coefficient ''β''<sub>0</sub>. * The resulting explanatory variables ''x''<sub>0,''i''</sub>, ''x''<sub>1,''i''</sub>, ..., ''x''<sub>''m,i''</sub> are then grouped into a single vector '''''X<sub>i</sub>''''' of size ''m'' + 1. This makes it possible to write the linear predictor function as follows: :<math>f(i)= \boldsymbol\beta \cdot \mathbf{X}_i,</math> using the notation for a [[dot product]] between two vectors. [[File:Logistic Regression in SPSS.png\|thumb\|356x356px\|This is an example of an SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).]] === Many explanatory variables, two categories === The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables ''x<sub>1</sub>, x<sub>2</sub>,...'' and any number of categorical values <math>y=0,1,2,\dots</math>. To begin with, we may consider a logistic model with ''M'' explanatory variables, ''x<sub>1</sub>'', ''x<sub>2</sub>'' ... ''x<sub>M</sub>'' and, as in the example above, two categorical values (''y'' = 0 and 1). For the simple binary logistic regression model, we assumed a [[linear model\|linear relationship]] between the predictor variable and the log-odds (also called [[logit]]) of the event that <math>y=1</math>. This linear relationship may be extended to the case of ''M'' explanatory variables: :<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math> where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (''b'') is not restricted to the [[Euler number]] ''e''. In most applications, the base <math>b</math> of the logarithm is usually taken to be ''[[E (mathematical constant)\|e]]''. However, in some cases it can be easier to communicate results by working in base 2 or base 10. For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath\|(M+1)}}-dimensional vectors: :<math>\boldsymbol{x}=\{x_0,x_1,x_2,\dots,x_M\}</math> :<math>\boldsymbol{\beta}=\{\beta_0,\beta_1,\beta_2,\dots,\beta_M\}</math> with an added explanatory variable ''x<sub>0</sub>'' =1. The logit may now be written as: :<math>t =\sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot x</math> Solving for the probability ''p'' that <math>y=1</math> yields: :<math>p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}= \frac{1}{1+b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}}=S_b(t)</math>, where <math>S_b</math> is the [[sigmoid function]] with base <math>b</math>. The above formula shows that once the <math>\beta_m</math> are fixed, we can easily compute either the log-odds that <math>y=1</math> for a given observation, or the probability that <math>y=1</math> for a given observation. The main use-case of a logistic model is to be given an observation <math>\boldsymbol{x}</math>, and estimate the probability <math>p(\boldsymbol{x})</math> that <math>y=1</math>. The optimum beta coefficients may again be found by maximizing the log-likelihood. For ''K'' measurements, defining <math>\boldsymbol{x}_k</math> as the explanatory vector of the ''k''-th measurement, and <math>y_k</math> as the categorical outcome of that measurement, the log likelihood may be written in a form very similar to the simple <math>M=1</math> case above: :<math>\ell = \sum_{k=1}^K y_k \log_b(p(\boldsymbol{x_k}))+\sum_{k=1}^K (1-y_k) \log_b(1-p(\boldsymbol{x_k}))</math> As in the simple example above, finding the optimum ''β'' parameters will require numerical methods. One useful technique is to equate the derivatives of the log likelihood with respect to each of the ''β'' parameters to zero yielding a set of equations which will hold at the maximum of the log likelihood: :<math>\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^K y_k x_{mk} - \sum_{k=1}^K p(\boldsymbol{x}_k)x_{mk}</math> where ''x<sub>mk</sub>'' is the value of the ''x<sub>m</sub>'' explanatory variable from the ''k-th'' measurement. Consider an example with <math>M=2</math> explanatory variables, <math>b=10</math>, and coefficients <math>\beta_0=-3</math>, <math>\beta_1=1</math>, and <math>\beta_2=2</math> which have been determined by the above method. To be concrete, the model is: :<math>t=\log_{10}\frac{p}{1 - p} = -3 + x_1 + 2 x_2</math> :<math>p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot x}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1+b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} } = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}</math>, where ''p'' is the probability of the event that <math>y=1</math>. This can be interpreted as follows: * <math>\beta_0 = -3</math> is the [[y-intercept\|''y''-intercept]]. It is the log-odds of the event that <math>y=1</math>, when the predictors <math>x_1=x_2=0</math>. By exponentiating, we can see that when <math>x_1=x_2=0</math> the odds of the event that <math>y=1</math> are 1-to-1000, or <math>10^{-3}</math>. Similarly, the probability of the event that <math>y=1</math> when <math>x_1=x_2=0</math> can be computed as <math> 1/(1000 + 1) = 1/1001.</math> * <math>\beta_1 = 1</math> means that increasing <math>x_1</math> by 1 increases the log-odds by <math>1</math>. So if <math>x_1</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^1</math>. The '''probability''' of <math>y=1</math> has also increased, but it has not increased by as much as the odds have increased. * <math>\beta_2 = 2</math> means that increasing <math>x_2</math> by 1 increases the log-odds by <math>2</math>. So if <math>x_2</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^2.</math> Note how the effect of <math>x_2</math> on the log-odds is twice as great as the effect of <math>x_1</math>, but the effect on the odds is 10 times greater. But the effect on the '''probability''' of <math>y=1</math> is not as much as 10 times greater, it's only the effect on the odds that is 10 times greater. === Multinomial logistic regression: Many explanatory variables and many categories === {{main\|Multinomial logistic regression}} In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: The probability that the outcome was in category 1 was given by <math>p(\boldsymbol{x})</math>and the probability that the outcome was in category 0 was given by <math>1-p(\boldsymbol{x})</math>. The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup. In general, if we have {{tmath\|M+1}} explanatory variables (including ''x<sub>0</sub>'') and {{tmath\|N+1}} categories, we will need {{tmath\|N+1}} separate probabilities, one for each category, indexed by ''n'', which describe the probability that the categorical outcome ''y'' will be in category ''y=n'', conditional on the vector of covariates '''x'''. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base ''e'', these probabilities are: :<math>p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n\cdot \boldsymbol{x}}}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math> for <math>n=1,2,\dots,N</math> :<math>p_0(\boldsymbol{x}) = 1-\sum_{n=1}^N p_n(\boldsymbol{x})=\frac{1}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math> Each of the probabilities except <math>p_0(\boldsymbol{x})</math> will have their own set of regression coefficients <math>\boldsymbol{\beta}_n</math>. It can be seen that, as required, the sum of the <math>p_n(\boldsymbol{x})</math> over all categories ''n'' is 1. The selection of <math>p_0(\boldsymbol{x})</math> to be defined in terms of the other probabilities is artificial. Any of the probabilities could have been selected to be so defined. This special value of ''n'' is termed the "pivot index", and the log-odds (''t<sub>n</sub>'') are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables: :<math>t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}</math> Note also that for the simple case of <math>N=1</math>, the two-category case is recovered, with <math>p(\boldsymbol{x})=p_1(\boldsymbol{x})</math> and <math>p_0(\boldsymbol{x})=1-p_1(\boldsymbol{x})</math>. The log-likelihood that a particular set of ''K'' measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by ''k'', let the ''k''-th set of measured explanatory variables be denoted by <math>\boldsymbol{x}_k</math> and their categorical outcomes be denoted by <math>y_k</math> which can be equal to any integer in [0,N]. The log-likelihood is then: :<math>\ell = \sum_{k=1}^K \sum_{n=0}^N \Delta(n,y_k)\,\ln(p_n(\boldsymbol{x}_k))</math> where <math>\Delta(n,y_k)</math> is an [[indicator function]] which equals 1 if ''y<sub>k</sub> = n'' and zero otherwise. In the case of two explanatory variables, this indicator function was defined as ''y<sub>k</sub>'' when ''n'' = 1 and ''1-y<sub>k</sub>'' when ''n'' = 0. This was convenient, but not necessary.<ref>For example, the indicator function in this case could be defined as <math>\Delta(n,y)=1-(y-n)^2</math></ref> Again, the optimum beta coefficients may be found by maximizing the log-likelihood function generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients: :<math>\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^K \Delta(n,y_k)x_{mk} - \sum_{k=1}^K p_n(\boldsymbol{x}_k)x_{mk}</math> where <math>\beta_{nm}</math> is the ''m''-th coefficient of the <math>\boldsymbol{\beta}_n</math> vector and <math>x_{mk}</math> is the ''m''-th explanatory variable of the ''k''-th measurement. Once the beta coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories. ==Interpretations== There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations. ===As a generalized linear model=== The particular model used by logistic regression, which distinguishes it from standard [[linear regression]] and from other types of [[regression analysis]] used for [[binary-valued]] outcomes, is the way the probability of a particular outcome is linked to the linear predictor function: :<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}]) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}</math> Written using the more compact notation described above, this is: :<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i]) = \operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i</math> This formulation expresses logistic regression as a type of [[generalized linear model]], which predicts variables with various types of [[probability distribution]]s by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable. The intuition for transforming using the logit function (the natural log of the odds) was explained above{{Clarify\|reason=What exactly was explained?\|date=February 2023}}. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over <math>(-\infty,+\infty)</math> — thereby matching the potential range of the linear prediction function on the right side of the equation. Both the probabilities ''p''<sub>''i''</sub> and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. [[maximum likelihood estimation]], that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to [[regularization (mathematics)\|regularization]] conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing [[maximum a posteriori]] (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using [[Ridge regression\|a squared regularizing function]], which is equivalent to placing a zero-mean [[Gaussian distribution\|Gaussian]] [[prior distribution]] on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as [[iteratively reweighted least squares]] (IRLS) or, more commonly these days, a [[quasi-Newton method]] such as the [[L-BFGS\|L-BFGS method]].<ref>{{cite conference \|url=https://dl.acm.org/citation.cfm?id=1118871 \|title=A comparison of algorithms for maximum entropy parameter estimation \|last1=Malouf \|first1=Robert \|date= 2002\|book-title= Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002) \|pages= 49–55 \|doi=10.3115/1118853.1118871 \|doi-access=free }}</ref> The interpretation of the ''β''<sub>''j''</sub> parameter estimates is as the additive effect on the log of the [[odds]] for a unit change in the ''j'' the explanatory variable. In the case of a dichotomous explanatory variable, for instance, gender <math>e^\beta</math> is the estimate of the odds of having the outcome for, say, males compared with females. An equivalent formula uses the inverse of the logit function, which is the [[logistic function]], i.e.: :<math>\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}</math> The formula can also be written as a [[probability distribution]] (specifically, using a [[probability mass function]]): : <math>\Pr(Y_i=y\mid \mathbf{X}_i) = {p_i}^y(1-p_i)^{1-y} =\left(\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y} \left(1-\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1-y} = \frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i \cdot y} }{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}</math> ===As a latent-variable model=== The logistic model has an equivalent formulation as a [[latent-variable model]]. This formulation is common in the theory of [[discrete choice]] models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related [[probit model]]. Imagine that, for each trial ''i'', there is a continuous [[latent variable]] ''Y''<sub>''i''</sub><sup></sup> (i.e. an unobserved [[random variable]]) that is distributed as follows: : <math> Y_i^\ast = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i \, </math> where : <math>\varepsilon_i \sim \operatorname{Logistic}(0,1) \, </math> i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random [[error variable]] that is distributed according to a standard [[logistic distribution]]. Then ''Y''<sub>''i''</sub> can be viewed as an indicator for whether this latent variable is positive: : <math> Y_i = \begin{cases} 1 & \text{if }Y_i^\ast > 0 \ \text{ i.e. } - \varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i, \\ 0 &\text{otherwise.} \end{cases} </math> The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter ''μ'' (which sets the mean) is equivalent to a distribution with a zero location parameter, where ''μ'' has been added to the intercept coefficient. Both situations produce the same value for ''Y''<sub>''i''</sub><sup></sup> regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter ''s'' is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by ''s''. In the latter case, the resulting value of ''Y''<sub>''i''</sub><sup>''''</sup> will be smaller by a factor of ''s'' than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same ''Y''<sub>''i''</sub> choice. (This predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.) It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the [[generalized linear model]] and without any [[latent variable]]s. This can be shown as follows, using the fact that the [[cumulative distribution function]] (CDF) of the standard [[logistic distribution]] is the [[logistic function]], which is the inverse of the [[logit function]], i.e. :<math>\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)</math> Then: :<math> \begin{align} \Pr(Y_i=1\mid\mathbf{X}_i) &= \Pr(Y_i^\ast > 0\mid\mathbf{X}_i) \\[5pt] &= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i > 0) \\[5pt] &= \Pr(\varepsilon_i > -\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt] &= \Pr(\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(because the logistic distribution is symmetric)} \\[5pt] &= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt] &= p_i & & \text{(see above)} \end{align} </math> This formulation—which is standard in [[discrete choice]] models—makes clear the relationship between logistic regression (the "logit model") and the [[probit model]], which uses an error variable distributed according to a standard [[normal distribution]] instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat [[heavy-tailed distribution\|heavier tails]], which means that it is less sensitive to outlying data (and hence somewhat more [[robust statistics\|robust]] to model mis-specifications or erroneous data). ===Two-way latent-variable model=== Yet another formulation uses two separate latent variables: : <math> \begin{align} Y_i^{0\ast} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \, \\ Y_i^{1\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 \, \end{align} </math> where : <math> \begin{align} \varepsilon_0 & \sim \operatorname{EV}_1(0,1) \\ \varepsilon_1 & \sim \operatorname{EV}_1(0,1) \end{align} </math> where ''EV''<sub>1</sub>(0,1) is a standard type-1 [[extreme value distribution]]: i.e. :<math>\Pr(\varepsilon_0=x) = \Pr(\varepsilon_1=x) = e^{-x} e^{-e^{-x}}</math> Then : <math> Y_i = \begin{cases} 1 & \text{if }Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 &\text{otherwise.} \end{cases} </math> This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the [[multinomial logit]] model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical [[utility]] associated with making the associated choice, and thus motivate logistic regression in terms of [[utility theory]]. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating [[discrete choice]] models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.) The choice of the type-1 [[extreme value distribution]] seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through [[rational choice theory]]. It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions: :<math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math> :<math>\varepsilon = \varepsilon_1 - \varepsilon_0</math> An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values — and this effectively removes one [[Degrees of freedom (statistics)\|degree of freedom]]. Another critical fact is that the difference of two type-1 extreme-value-distributed variables is a logistic distribution, i.e. <math>\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0,1) .</math> We can demonstrate the equivalent as follows: :<math>\begin{align} \Pr(Y_i=1\mid\mathbf{X}_i) = {} & \Pr \left (Y_i^{1\ast} > Y_i^{0\ast}\mid\mathbf{X}_i \right ) & \\[5pt] = {} & \Pr \left (Y_i^{1\ast} - Y_i^{0\ast} > 0\mid\mathbf{X}_i \right ) & \\[5pt] = {} & \Pr \left (\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - \left (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \right ) > 0 \right ) & \\[5pt] = {} & \Pr \left ((\boldsymbol\beta_1 \cdot \mathbf{X}_i - \boldsymbol\beta_0 \cdot \mathbf{X}_i) + (\varepsilon_1 - \varepsilon_0) > 0 \right ) & \\[5pt] = {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) & \\[5pt] = {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute } \varepsilon\text{ as above)} \\[5pt] = {} & \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute }\boldsymbol\beta\text{ as above)} \\[5pt] = {} & \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(now, same as above model)}\\[5pt] = {} & \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt] = {} & \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt] = {} & p_i \end{align}</math> ====Example==== : {{Original research\|example\|discuss=Talk:Logistic_regression#Utility_theory_/_Elections_example_is_irrelevant\|date=May 2022}} As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the [[Parti Québécois]], which wants [[Quebec]] to secede from [[Canada]]). We would then use three latent variables, one for each choice. Then, in accordance with [[utility theory]], we can then interpret the latent variables as expressing the [[utility]] that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility — or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or moderate utility increase) for middle-incoming people; would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility since he/she is likely to own companies, which will have a harder time doing business in such an environment and probably lose money. These intuitions can be expressed as follows: {\|class="wikitable" \|+Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables \|- ! !! Center-right !! Center-left !! Secessionist \|- ! High-income \| strong + \|\| strong − \|\| strong − \|- ! Middle-income \| moderate + \|\| weak + \|\| none \|- ! Low-income \| none \|\| strong + \|\| none \|- \|} This clearly shows that # Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic. # Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable. Either it needs to be directly split up into ranges, or higher powers of income need to be added so that [[polynomial regression]] on income is effectively done. ==={{anchor\|log-linear model}}As a "log-linear" model=== Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the [[multinomial logit]]. Here, instead of writing the [[logit]] of the probabilities ''p''<sub>''i''</sub> as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes: : <math> \begin{align} \ln \Pr(Y_i=0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\ \ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z \end{align} </math> Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear a form that writes the [[logarithm]] of the associated probability as a linear predictor, with an extra term <math>- \ln Z</math> at the end. This term, as it turns out, serves as the [[normalizing factor]] ensuring that the result is a distribution. This can be seen by exponentiating both sides: : <math> \begin{align} \Pr(Y_i=0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\[5pt] \Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \end{align} </math> In this form it is clear that the purpose of ''Z'' is to ensure that the resulting distribution over ''Y''<sub>''i''</sub> is in fact a [[probability distribution]], i.e. it sums to 1. This means that ''Z'' is simply the sum of all un-normalized probabilities, and by dividing each probability by ''Z'', the probabilities become "[[normalizing constant\|normalized]]". That is: :<math> Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}</math> and the resulting equations are :<math> \begin{align} \Pr(Y_i=0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\[5pt] \Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align} </math> Or generally: :<math>\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}</math> This shows clearly how to generalize this formulation to more than two outcomes, as in [[multinomial logit]]. This general formulation is exactly the [[softmax function]] as in :<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math> In order to prove that this is equivalent to the previous model, the above model is overspecified, in that <math>\Pr(Y_i=0)</math> and <math>\Pr(Y_i=1)</math> cannot be independently specified: rather <math>\Pr(Y_i=0) + \Pr(Y_i=1) = 1</math> so knowing one automatically determines the other. As a result, the model is [[nonidentifiable]], in that multiple combinations of '''''β'''''<sub>0</sub> and '''''β'''''<sub>1</sub> will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities: :<math> \begin{align} \Pr(Y_i=1) &= \frac{e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 +\mathbf{C})\cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i}e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i}(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i})} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align} </math> As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set <math>\boldsymbol\beta_0 = \mathbf{0} .</math> Then, :<math>e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1</math> and so :<math> \Pr(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1+e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i</math> which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where <math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math> will produce equivalent results.) Most treatments of the [[multinomial logit]] model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in [[econometrics]] and [[political science]], where [[discrete choice]] models and [[utility theory]] reign, while the "log-linear" formulation here is more common in [[computer science]], e.g. [[machine learning]] and [[natural language processing]]. ===As a single-layer perceptron=== The model has an equivalent formulation :<math>p_i = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}. \, </math> This functional form is commonly called a single-layer [[perceptron]] or single-layer [[artificial neural network]]. A single-layer neural network computes a continuous output instead of a [[step function]]. The derivative of ''p<sub>i</sub>'' with respect to ''X'' = (''x''<sub>1</sub>, ..., ''x''<sub>''k''</sub>) is computed from the general form: : <math>y = \frac{1}{1+e^{-f(X)}}</math> where ''f''(''X'') is an [[analytic function]] in ''X''. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in [[backpropagation]]. This function is also preferred because its derivative is easily calculated: : <math>\frac{\mathrm{d}y}{\mathrm{d}X} = y(1-y)\frac{\mathrm{d}f}{\mathrm{d}X}. \, </math> ===In terms of binomial data=== A closely related model assumes that each ''i'' is associated not with a single Bernoulli trial but with ''n''<sub>''i''</sub> [[independent identically distributed]] trials, where the observation ''Y''<sub>''i''</sub> is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a [[binomial distribution]]: :<math>Y_i \,\sim \operatorname{Bin}(n_i,p_i),\text{ for }i = 1, \dots , n</math> An example of this distribution is the fraction of seeds (''p''<sub>''i''</sub>) that germinate after ''n''<sub>''i''</sub> are planted. In terms of [[expected value]]s, this model is expressed as follows: :<math>p_i = \operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_{i}}\,\right\|\,\mathbf{X}_i \right]\,, </math> so that :<math>\operatorname{logit}\left(\operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_i}\,\right\|\,\mathbf{X}_i \right]\right) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i\,,</math> Or equivalently: :<math>\Pr(Y_i=y\mid \mathbf{X}_i) = {n_i \choose y} p_i^y(1-p_i)^{n_i-y} ={n_i \choose y} \left(\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^y \left(1-\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{n_i-y}\,.</math> This model can be fit using the same sorts of methods as the above more basic model. ==Model fitting== {{expand section\|date=October 2016}} ===Maximum likelihood estimation (MLE)=== The regression coefficients are usually estimated using [[maximum likelihood estimation]].<ref name=Menard/><ref>{{cite journal \|first1=Christian \|last1=Gourieroux \|first2=Alain \|last2=Monfort \|title=Asymptotic Properties of the Maximum Likelihood Estimator in Dichotomous Logit Models \|journal=Journal of Econometrics \|volume=17 \|issue=1 \|year=1981 \|pages=83–97 \|doi=10.1016/0304-4076(81)90060-9 }}</ref> Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so that an iterative process must be used instead; for example [[Newton's method]]. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.<ref name="Menard" /> In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, [[multicollinearity]], [[sparse matrix\|sparseness]], or complete [[Separation (statistics)\|separation]]. Having a large ratio of variables to cases results in an overly conservative Wald statistic (discussed below) and can lead to non-convergence. [[Regularization (mathematics)\|Regularized]] logistic regression is specifically intended to be used in this situation. * Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases.<ref name=Menard/> To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic <ref name=Menard/> used to assess whether multicollinearity is unacceptably high. * Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The model will not converge with zero cell counts for categorical predictors because the natural logarithm of zero is an undefined value so that the final solution to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or add a constant to all cells.<ref name=Menard/> * Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion – all cases are accurately classified and the likelihood maximized with infinite coefficients. In such instances, one should re-examine the data, as there may be some kind of error.<ref name=Hosmer/>{{explain\|date=May 2017\|reason= Why is there likely some kind of error? How can this be remedied?}} * One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions of a parametric form for the index function and is robust to the choice of the link function (e.g., probit or logit).<ref name="sciencedirect.com">{{cite journal\| doi=10.1016/j.csda.2016.10.024 \| volume=108 \| title=Nonparametric estimation of dynamic discrete choice models for time series data \| year=2017 \| journal=Computational Statistics & Data Analysis \| pages=97–120 \| last1 = Park \| first1 = Byeong U. \| last2 = Simar \| first2 = Léopold \| last3 = Zelenyuk \| first3 = Valentin\| url=https://espace.library.uq.edu.au/view/UQ:415620/UQ415620_OA.pdf }}</ref> === Iteratively reweighted least squares (IRLS) === Binary logistic regression (<math>y=0</math> or <math> y=1</math>) can, for example, be calculated using ''iteratively reweighted least squares'' (IRLS), which is equivalent to maximizing the [[log-likelihood]] of a [[Bernoulli distribution\|Bernoulli distributed]] process using [[Newton's method]]. If the problem is written in vector matrix form, with parameters <math>\mathbf{w}^T=[\beta_0,\beta_1,\beta_2, \ldots]</math>, explanatory variables <math>\mathbf{x}(i)=[1, x_1(i), x_2(i), \ldots]^T</math> and expected value of the Bernoulli distribution <math>\mu(i)=\frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}(i)}}</math>, the parameters <math>\mathbf{w}</math> can be found using the following iterative algorithm: :<math>\mathbf{w}_{k+1} = \left(\mathbf{X}^T\mathbf{S}_k\mathbf{X}\right)^{-1}\mathbf{X}^T \left(\mathbf{S}_k \mathbf{X} \mathbf{w}_k + \mathbf{y} - \mathbf{\boldsymbol\mu}_k\right) </math> where <math>\mathbf{S}=\operatorname{diag}(\mu(i)(1-\mu(i)))</math> is a diagonal weighting matrix, <math>\boldsymbol\mu=[\mu(1), \mu(2),\ldots]</math> the vector of expected values, :<math>\mathbf{X}=\begin{bmatrix} 1 & x_1(1) & x_2(1) & \ldots\\ 1 & x_1(2) & x_2(2) & \ldots\\ \vdots & \vdots & \vdots \end{bmatrix}</math> The regressor matrix and <math>\mathbf{y}(i)=[y(1),y(2),\ldots]^T</math> the vector of response variables. More details can be found in the literature.<ref>{{cite book\|last1=Murphy\|first1=Kevin P.\|title=Machine Learning – A Probabilistic Perspective\|publisher=The MIT Press\|date=2012\|page=245\|isbn=978-0-262-01802-9}}</ref> ===Bayesian=== [[File:Logistic-sigmoid-vs-scaled-probit.svg\|right\|300px\|thumb\|Comparison of [[logistic function]] with a scaled inverse [[probit function]] (i.e. the [[cumulative distribution function\|CDF]] of the [[normal distribution]]), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the [[heavy-tailed distribution\|heavier tails]] of the logistic distribution.]] In a [[Bayesian statistics]] context, [[prior distribution]]s are normally placed on the regression coefficients, for example in the form of [[Gaussian distribution]]s. There is no [[conjugate prior]] of the [[likelihood function]] in logistic regression. When Bayesian inference was performed analytically, this made the [[posterior distribution]] difficult to calculate except in very low dimensions. Now, though, automatic software such as [[OpenBUGS]], [[Just another Gibbs sampler\|JAGS]], [[PyMC3]], [[Stan (software)\|Stan]] or [[Turing.jl]] allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as [[variational Bayesian methods]] and [[expectation propagation]]. ==="Rule of ten"=== {{main\|One in ten rule}} A widely used rule of thumb, the "[[one in ten rule]]", states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV); where ''event'' denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use <math>k</math> explanatory variables for an event (e.g. [[myocardial infarction]]) expected to occur in a proportion <math>p</math> of participants in the study will require a total of <math>10k/p</math> participants. However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning.<ref>{{cite journal\|pmid=27881078\|pmc=5122171\|year=2016\|last1=Van Smeden\|first1=M.\|title=No rationale for 1 variable per 10 events criterion for binary logistic regression analysis\|journal=BMC Medical Research Methodology\|volume=16\|issue=1\|page=163\|last2=De Groot\|first2=J. A.\|last3=Moons\|first3=K. G.\|last4=Collins\|first4=G. S.\|last5=Altman\|first5=D. G.\|last6=Eijkemans\|first6=M. J.\|last7=Reitsma\|first7=J. B.\|doi=10.1186/s12874-016-0267-3 \|doi-access=free }}</ref> According to some authors<ref>{{cite journal\|last=Peduzzi\|first=P\|author2=Concato, J \|author3=Kemper, E \|author4=Holford, TR \|author5=Feinstein, AR \|title=A simulation study of the number of events per variable in logistic regression analysis\|journal=[[Journal of Clinical Epidemiology]]\|date=December 1996\|volume=49\|issue=12\|pages=1373–9\|pmid=8970487\|doi=10.1016/s0895-4356(96)00236-3\|doi-access=free}}</ref> the rule is overly conservative in some circumstances, with the authors stating, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV. The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV".<ref>{{cite journal\|last1=Vittinghoff\|first1=E.\|last2=McCulloch\|first2=C. E.\|title=Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression\|journal=American Journal of Epidemiology\|date=12 January 2007\|volume=165\|issue=6\|pages=710–718\|doi=10.1093/aje/kwk052\|pmid=17182981\|doi-access=free}}</ref> Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.<ref name=plo14mod/> Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.<ref name=rms/> == Error and significance of fit == === Deviance and likelihood ratio test ─ a simple case === In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "[[overfitting]]" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is simply that which may be expected from overfitting. In short, for logistic regression, a statistic known as the [[deviance (statistics)\|deviance]] is defined which is a measure of the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is [[Chi-squared distribution\|chi-squared]] distributed, which allows a [[chi-squared test]] to be implemented in order to determine the significance of the explanatory variables. Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of ''K'' data points (''x<sub>k</sub>'', ''y<sub>k</sub>'') are fitted to a proposed model function of the form <math>y=b_0+b_1 x</math>. The fit is obtained by choosing the ''b'' parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point: :<math>\epsilon^2=\sum_{k=1}^K (b_0+b_1 x_k-y_k)^2.</math> The minimum value which constitutes the fit will be denoted by <math>\hat{\epsilon}^2</math> The idea of a [[null model]] may be introduced, in which it is assumed that the ''x'' variable is of no use in predicting the y<sub>k</sub> outcomes: The data points are fitted to a null model function of the form ''y=b<sub>0</sub>'' with a squared error term: :<math>\epsilon^2=\sum_{k=1}^K (b_0-y_k)^2.</math> The fitting process consists of choosing a value of ''b<sub>0</sub>'' which minimizes <math>\epsilon^2</math> of the fit to the null model, denoted by <math>\epsilon_\varphi^2</math> where the <math>\varphi</math> subscript denotes the null model. It is seen that the null model is optimized by <math>b_0=\overline{y}</math> where <math>\overline{y}</math> is the mean of the ''y<sub>k</sub>'' values, and the optimized <math>\epsilon_\varphi^2</math> is: :<math>\hat{\epsilon}_\varphi^2=\sum_{k=1}^K (\overline{y}-y_k)^2</math> which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points. We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum error will follow a [[chi-squared distribution]], with degrees of freedom equal those of the proposed model minus those of the null model which, in this case, will be 2-1=1. Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield an minimum error less than or equal to the minimum error using the original ''y<sub>k</sub>'', and so we can estimate how significant an improvement is given by the inclusion of the ''x'' variable in the proposed model. For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the <math>\epsilon^2</math> in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by <math>\hat{\ell}</math>. In the case of simple binary logistic regression, the set of ''K'' data points are fitted in a probabilistic sense to a function of the form: :<math>p(x)=\frac{1}{1+e^{-t}}</math> where {{tmath\|p(x)}} is the probability that <math>y=1</math>. The log-odds are given by: :<math>t=\beta_0+\beta_1 x</math> and the log-likelihood is: :<math>\ell=\sum_{k=1}^K \left( y_k \ln(p(x_k))+(1-y_k) \ln(1-p(x_k))\right)</math> For the null model, the probability that <math>y=1</math> is given by: :<math>p_\varphi(x)=\frac{1}{1+e^{-t_\varphi}}</math> The log-odds for the null model are given by: :<math>t_\varphi=\beta_0</math> and the log-likelihood is: :<math>\ell_\varphi=\sum_{k=1}^K \left( y_k \ln(p_\varphi)+(1-y_k) \ln(1-p_\varphi)\right)</math> Since we have <math>p_\varphi=\overline{y}</math> at the maximum of ''L'', the maximum log-likelihood for the null model is :<math>\hat{\ell}_\varphi=K(\,\overline{y} \ln(\overline{y}) + (1-\overline{y})\ln(1-\overline{y}))</math> The optimum <math>\beta_0</math> is: :<math>\beta_0=\ln\left(\frac{\overline{y}}{1-\overline{y}}\right)</math> where <math>\overline{y}</math> is again the mean of the ''y<sub>k</sub>'' values. Again, we can conceptually consider the fit of the proposed model to every permutation of the ''y<sub>k</sub>'' and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model: :<math> \hat{\ell} \ge \hat{\ell}_\varphi</math> Also, as an analog to the error of the linear regression case, we may define the [[deviance (statistics)\|deviance]] of a logistic regression fit as: :<math>D=\ln\left(\frac{\hat{L}^2}{\hat{L}_\varphi^2}\right) = 2(\hat{\ell}-\hat{\ell}_\varphi)</math> which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (''K'') increases, becoming exactly chi-square distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so have an estimate how significantly the model is improved by including the ''x<sub>k</sub>'' data points in the proposed model. For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is <math>\hat{\ell}_\varphi= -13.8629...</math> The maximum value of the log-likelihood for the simple model is <math>\hat{\ell}=-8.02988...</math> so that the deviance is <math>D = 2(\hat{\ell}-\hat{\ell}_\varphi)=11.6661...</math> Using the [[chi-squared test]] of significance, the integral of the [[chi-squared distribution]] with one degree of freedom from 11.6661... to infinity is equal to 0.00063649... This effectively means that about 6 out of a 10,000 fits to random ''y<sub>k</sub>'' can be expected to have a better fit (smaller deviance) than the given ''y<sub>k</sub>'' and so we can conclude that the inclusion of the ''x'' variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the [[null hypothesis]] with <math>1-D\approx 99.94 \%</math> confidence. ===Goodness of fit summary=== [[Goodness of fit]] in linear regression models is generally measured using [[R square\|R<sup>2</sup>]]. Since this has no direct analog in logistic regression, various methods<ref name=Greene>{{cite book \|last=Greene \|first=William N. \|title=Econometric Analysis \|edition=Fifth \|publisher=Prentice-Hall \|year=2003 \|isbn=978-0-13-066189-0 }}</ref>{{rp\|ch.21}} including the following can be used instead. ====Deviance and likelihood ratio tests==== In linear regression analysis, one is concerned with partitioning variance via the [[Partition of sums of squares\|sum of squares]] calculations – variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, [[Deviance (statistics)\|deviance]] is used in lieu of a sum of squares calculations.<ref name=Cohen/> Deviance is analogous to the sum of squares calculations in linear regression<ref name=Hosmer/> and is a measure of the lack of fit to the data in a logistic regression model.<ref name=Cohen/> When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model.<ref name=Hosmer/> This computation gives the [[likelihood-ratio test]]:<ref name=Hosmer/> :<math> D = -2\ln \frac{\text{likelihood of the fitted model}} {\text{likelihood of the saturated model}}.</math> In the above equation, {{mvar\|D}} represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign. {{mvar\|D}} can be shown to follow an approximate [[chi-squared distribution]].<ref name=Hosmer/> Smaller values indicate better fit as the fitted model deviates less from the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square values indicate very little unexplained variance and thus, good model fit. Conversely, a significant chi-square value indicates that a significant amount of the variance is unexplained. When the saturated model is not available (a common case), deviance is calculated simply as −2·(log likelihood of the fitted model), and the reference to the saturated model's log likelihood can be removed from all that follows without harm. Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept (which means "no predictors") and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model.<ref name=Cohen/> In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a <math>\chi^2_{s-p},</math> chi-square distribution with [[Degrees of freedom (statistics)\|degrees of freedom]]<ref name=Hosmer/> equal to the difference in the number of parameters estimated. Let :<math>\begin{align} D_{\text{null}} &=-2\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}\\[6pt] D_{\text{fitted}} &=-2\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}. \end{align} </math> Then the difference of both is: :<math>\begin{align} D_\text{null}- D_\text{fitted} &= -2 \left(\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}-\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}\right)\\[6pt] &= -2 \ln \frac{ \left( \dfrac{\text{likelihood of null model}}{\text{likelihood of the saturated model}}\right)}{\left(\dfrac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}\right)}\\[6pt] &= -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of fitted model}}. \end{align}</math> If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improve the model's fit. This is analogous to the {{mvar\|F}}-test used in linear regression analysis to assess the significance of prediction.<ref name=Cohen/> ====Pseudo-R-squared==== {{main article\| Pseudo-R-squared}} In linear regression the squared multiple correlation, {{mvar\|R}}<sup>2</sup> is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.<ref name=Cohen/> In logistic regression analysis, there is no agreed upon analogous measure, but there are several competing measures each with limitations.<ref name=Cohen/><ref name=":0">{{cite web \|url=https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf \|title=Measures of fit for logistic regression \|last=Allison \|first=Paul D. \|publisher=Statistical Horizons LLC and the University of Pennsylvania}}</ref> Four of the most commonly used indices and one less commonly used one are examined on this page: * Likelihood ratio {{mvar\|R}}<sup>2</sup>{{sub\|L}} * Cox and Snell {{mvar\|R}}<sup>2</sup>{{sub\|CS}} * Nagelkerke {{mvar\|R}}<sup>2</sup>{{sub\|N}} * McFadden {{mvar\|R}}<sup>2</sup>{{sub\|McF}} * Tjur {{mvar\|R}}<sup>2</sup>{{sub\|T}} ====Hosmer–Lemeshow test==== The [[Hosmer–Lemeshow test]] uses a test statistic that asymptotically follows a [[chi-squared distribution\|<math>\chi^2</math> distribution]] to assess whether or not the observed event rates match expected event rates in subgroups of the model population. This test is considered to be obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and relative low power.<ref>{{cite journal\|last1=Hosmer\|first1=D.W.\|title=A comparison of goodness-of-fit tests for the logistic regression model\|journal=Stat Med\|date=1997\|volume=16\|issue=9\|pages=965–980\|doi=10.1002/(sici)1097-0258(19970515)16:9<965::aid-sim509>3.3.co;2-f\|pmid=9160492}}</ref> ===Coefficient significance=== After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.<ref name=Cohen/> In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient – the odds ratio (see [[#Logistic function, odds, odds ratio, and logit\|definition]]). In linear regression, the significance of a regression coefficient is assessed by computing a ''t'' test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic. ====Likelihood ratio test==== The [[likelihood-ratio test]] discussed above to assess model fit is also the recommended procedure to assess the contribution of individual "predictors" to a given model.<ref name=Hosmer/><ref name=Menard/><ref name=Cohen/> In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-square distribution with a single degree of freedom. If the predictor model has significantly smaller deviance (c.f. chi-square using the difference in degrees of freedom of the two models), then one can conclude that there is a significant association between the "predictor" and the outcome. Although some common statistical packages (e.g. SPSS) do provide likelihood ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case.{{Citation needed\|date=October 2019}} To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous to determine the contribution of each predictor.<ref name=Cohen/> There is some debate among statisticians about the appropriateness of so-called "stepwise" procedures.{{weasel inline\|date=October 2019}} The fear is that they may not preserve nominal statistical properties and may become misleading.<ref>{{cite book \|first=Frank E. \|last=Harrell \|title=Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis \|location=New York \|publisher=Springer \|year=2010 \|isbn=978-1-4419-2918-1 }}{{page needed\|date=October 2019}}</ref> ====Wald statistic==== Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the [[Wald test\|Wald statistic]]. The Wald statistic, analogous to the ''t''-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution.<ref name=Menard/> : <math>W_j = \frac{\beta^2_j} {SE^2_{\beta_j}}</math> Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger increasing the probability of [[Type I and Type II errors\|Type-II error]]. The Wald statistic also tends to be biased when data are sparse.<ref name=Cohen/> ====Case-control sampling==== Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose there is a disease that affects 1 person in 10,000 and to collect our data we need to do a complete physical. It may be too expensive to do thousands of physicals of healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also retrospective sampling, or equivalently it is called unbalanced data. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data.<ref name="islr">https://class.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf slide 16</ref> Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data, if the model is correct in the general population, the <math>\beta_j</math> parameters are all correct except for <math>\beta_0</math>. We can correct <math>\beta_0</math> if we know the true prevalence as follows:<ref name="islr"/> : <math>\widehat{\beta}_0^* = \widehat{\beta}_0+\log \frac \pi {1 - \pi} - \log{ \tilde{\pi} \over {1 - \tilde{\pi}} } </math> where <math>\pi</math> is the true prevalence and <math>\tilde{\pi}</math> is the prevalence in the sample. ==Discussion== Like other forms of [[regression analysis]], logistic regression makes use of one or more predictor variables that may be either continuous or categorical. Unlike ordinary linear regression, however, logistic regression is used for predicting dependent variables that take [[categorical variable\|membership in one of a limited number of categories]] (treating the dependent variable in the binomial case as the outcome of a [[Bernoulli trial]]) rather than a continuous outcome. Given this difference, the assumptions of linear regression are violated. In particular, the residuals cannot be normally distributed. In addition, linear regression may make nonsensical predictions for a binary dependent variable. What is needed is a way to convert a binary variable into a continuous one that can take on any real value (negative or positive). To do that, binomial logistic regression first calculates the [[odds]] of the event happening for different levels of each independent variable, and then takes its [[logarithm]] to create a continuous criterion as a transformed version of the dependent variable. The logarithm of the odds is the {{math\|[[logit]]}} of the probability, the {{math\|logit}} is defined as follows: <math display="block"> \operatorname{logit} p = \ln \frac p {1-p} \quad \text{for } 0<p<1\,. </math> Although the dependent variable in logistic regression is Bernoulli, the logit is on an unrestricted scale.<ref name=Hosmer/> The logit function is the [[link function]] in this kind of generalized linear model, i.e. <math display="block"> \operatorname{logit} \operatorname{\mathcal E}(Y) = \beta_0 + \beta_1 x </math> {{mvar\|Y}} is the Bernoulli-distributed response variable and {{mvar\|x}} is the predictor variable; the {{mvar\|β}} values are the linear parameters. The {{math\|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math\|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success. == Maximum entropy == Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. [[Probit model\|probit regression]], [[Poisson regression]], etc.), the logistic regression solution is unique in that it is a [[Maximum entropy probability distribution\|maximum entropy]] solution.<ref name="Mount2011">{{cite web \|url=http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf \|title=The Equivalence of Logistic Regression and Maximum Entropy models \|last=Mount \|first=J. \|date=2011 \|website= \|publisher= \|access-date=Feb 23, 2022 \|quote=}}</ref> This is a case of a general property: an [[exponential family]] of distributions maximizes entropy, given an expected value. In the case of the logistic model, the logistic function is the [[natural parameter]] of the Bernoulli distribution (it is in "[[canonical form]]", and the logistic function is the canonical link function), while other sigmoid functions are non-canonical link functions; this underlies its mathematical elegance and ease of optimization. See {{slink\|Exponential family\|Maximum entropy derivation}} for details. === Proof === In order to show this, we use the method of [[Lagrange multipliers]]. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression.<ref name="Mount2011"/> As in the above section on [[#Multinomial logistic regression : Many explanatory variable and many categories\|multinomial logistic regression]], we will consider {{tmath\|M+1}} explanatory variables denoted {{tmath\|x_m}} and which include <math>x_0=1</math>. There will be a total of ''K'' data points, indexed by <math>k=\{1,2,\dots,K\}</math>, and the data points are given by <math>x_{mk}</math> and {{tmath\|y_k}}. The ''x<sub>mk</sub>'' will also be represented as an {{tmath\|(M+1)}}-dimensional vector <math>\boldsymbol{x}_k = \{x_{0k},x_{1k},\dots,x_{Mk}\}</math>. There will be {{tmath\|N+1}} possible values of the categorical variable ''y'' ranging from 0 to N. Let ''p<sub>n</sub>('''x''')'' be the probability, given explanatory variable vector '''x''', that the outcome will be <math>y=n</math>. Define <math>p_{nk}=p_n(\boldsymbol{x}_k)</math> which is the probability that for the ''k''-th measurement, the categorical outcome is ''n''. The Lagrangian will be expressed as a function of the probabilities ''p<sub>nk</sub>'' and will minimized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally and the fact that they sum to 1 is part of the Lagrangian formulation, rather than being assumed from the beginning. The first contribution to the Lagrangian is the [[Entropy (information theory)\|entropy]]: :<math>\mathcal{L}_{ent}=-\sum_{k=1}^K\sum_{n=0}^N p_{nk}\ln(p_{nk})</math> The log-likelihood is: :<math>\ell=\sum_{k=1}^K\sum_{n=0}^N \Delta(n,y_k)\ln(p_{nk})</math> Assuming the multinomial logistic function, the derivative of the log-likelihood with respect the beta coefficients was found to be: :<math>\frac{\partial \ell}{\partial \beta_{nm}}=\sum_{k=1}^K ( p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math> A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities ''p<sub>nk</sub>'' and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of ''p<sub>nk</sub>''. There are then (''M''+1)(''N''+1) fitting constraints and the fitting constraint term in the Lagrangian is then: :<math>\mathcal{L}_{fit}=\sum_{n=0}^N\sum_{m=0}^M \lambda_{nm}\sum_{k=1}^K (p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math> where the ''λ<sub>nm</sub>'' are the appropriate Lagrange multipliers. There are ''K'' normalization constraints which may be written: :<math>\sum_{n=0}^N p_{nk}=1</math> so that the normalization term in the Lagrangian is: :<math>\mathcal{L}_{norm}=\sum_{k=1}^K \alpha_k \left(1-\sum_{n=1}^N p_{nk}\right) </math> where the ''α<sub>k</sub>'' are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms: :<math>\mathcal{L}=\mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}</math> Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields: :<math>\frac{\partial \mathcal{L}}{\partial p_{n'k'}}=0=-\ln(p_{n'k'})-1+\sum_{m=0}^M (\lambda_{n'm}x_{mk'})-\alpha_{k'}</math> Using the more condensed vector notation: :<math>\sum_{m=0}^M \lambda_{nm}x_{mk} = \boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k</math> and dropping the primes on the ''n'' and ''k'' indices, and then solving for <math>p_{nk}</math> yields: :<math>p_{nk}=e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}/Z_k</math> where: :<math>Z_k=e^{1+\alpha_k}</math> Imposing the normalization constraint, we can solve for the ''Z<sub>k</sub>'' and write the probabilities as: :<math>p_{nk}=\frac{e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u\cdot\boldsymbol{x}_k}}</math> The <math>\boldsymbol{\lambda}_n</math> are not all independent. We can add any constant {{tmath\|(M+1)}}-dimensional vector to each of the <math>\boldsymbol{\lambda}_n</math> without changing the value of the <math>p_{nk}</math> probabilities so that there are only ''N'' rather than {{tmath\|N+1}} independent <math>\boldsymbol{\lambda}_n</math>. In the [[#Multinomial logistic regression : Many explanatory variable and many categories\|multinomial logistic regression]] section above, the <math>\boldsymbol{\lambda}_0</math> was subtracted from each <math>\boldsymbol{\lambda}_n</math> which set the exponential term involving <math>\boldsymbol{\lambda}_0</math> to 1, and the beta coefficients were given by <math>\boldsymbol{\beta}_n=\boldsymbol{\lambda}_n-\boldsymbol{\lambda}_0</math>. ===Other approaches=== In machine learning applications where logistic regression is used for binary classification, the MLE minimises the [[cross-entropy]] loss function. Logistic regression is an important [[machine learning]] algorithm. The goal is to model the probability of a random variable <math>Y</math> being 0 or 1 given experimental data.<ref>{{cite journal \| last = Ng \| first = Andrew \| year = 2000 \| pages = 16–19 \| journal = CS229 Lecture Notes \| title = CS229 Lecture Notes \| url = http://akademik.bahcesehir.edu.tr/~tevfik/courses/cmp5101/cs229-notes1.pdf}}</ref> Consider a [[generalized linear model]] function parameterized by <math>\theta</math>, :<math> h_\theta(X) = \frac{1}{1 + e^{-\theta^TX}} = \Pr(Y=1 \mid X; \theta) </math> Therefore, :<math> \Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X) </math> and since <math> Y \in \{0,1\}</math>, we see that <math> \Pr(y\mid X;\theta) </math> is given by <math> \Pr(y \mid X; \theta) = h_\theta(X)^y(1 - h_\theta(X))^{(1-y)}. </math> We now calculate the [[likelihood function]] assuming that all the observations in the sample are independently Bernoulli distributed, :<math>\begin{align} L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\ &= \prod_i \Pr(y_i \mid x_i; \theta) \\ &= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)} \end{align}</math> Typically, the log likelihood is maximized, :<math> N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) </math> which is maximized using optimization techniques such as [[gradient descent]]. Assuming the <math>(x, y)</math> pairs are drawn uniformly from the underlying distribution, then in the limit of large ''N'', :<math>\begin{align} & \lim \limits_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt] = {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt] = {} & - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X) \end{align}</math> where <math>H(Y\mid X)</math> is the [[conditional entropy]] and <math>D_\text{KL}</math> is the [[Kullback–Leibler divergence]]. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution. Intuitively searching for the model that makes the fewest assumptions in its parameters. ==Comparison with linear regression== Logistic regression can be seen as a special case of the [[generalized linear model]] and thus analogous to [[linear regression]]. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution <math>y \mid x</math> is a [[Bernoulli distribution]] rather than a [[Gaussian distribution]], because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the [[logistic function\|logistic distribution function]] because logistic regression predicts the '''probability''' of particular outcomes rather than the outcomes themselves. ==Alternatives== A common alternative to the logistic model (logit model) is the [[probit model]], as the related names suggest. From the perspective of [[generalized linear model]]s, these differ in the choice of [[link function]]: the logistic model uses the [[logit function]] (inverse logistic function), while the probit model uses the [[probit function]] (inverse [[error function]]). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard [[logistic distribution]] of errors and the second a standard [[normal distribution]] of errors.<ref>{{cite book\|title=Lecture Notes on Generalized Linear Models\|last=Rodríguez\|first=G.\|year=2007\|pages=Chapter 3, page 45\|url=http://data.princeton.edu/wws509/notes/}}</ref> Other [[sigmoid function]]s or error distributions can be used instead. Logistic regression is an alternative to Fisher's 1936 method, [[linear discriminant analysis]].<ref>{{cite book \|author1=Gareth James \|author2=Daniela Witten \|author3=Trevor Hastie \|author4=Robert Tibshirani \|title=An Introduction to Statistical Learning \|publisher=Springer \|year=2013 \|url=http://www-bcf.usc.edu/~gareth/ISL/ \|page=6}}</ref> If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis.<ref>{{cite journal\|last1=Pohar\|first1=Maja\|last2=Blas\|first2=Mateja\|last3=Turk\|first3=Sandra\|year=2004\|title=Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study\|url=https://www.researchgate.net/publication/229021894\|journal=Metodološki Zvezki\|volume= 1\|issue= 1}}</ref> The assumption of linear predictor effects can easily be relaxed using techniques such as [[Spline (mathematics)\|spline functions]].<ref name=rms/> ==History== A detailed history of the logistic regression is given in {{harvtxt\|Cramer\|2002}}. The logistic function was developed as a model of [[population growth]] and named "logistic" by [[Pierre François Verhulst]] in the 1830s and 1840s, under the guidance of [[Adolphe Quetelet]]; see {{slink\|Logistic function\|History}} for details.{{sfn\|Cramer\|2002\|pp=3–5}} In his earliest paper (1838), Verhulst did not specify how he fit the curves to the data.<ref>{{cite journal\|first= Pierre-François \|last=Verhulst \|year= 1838\| title = Notice sur la loi que la population poursuit dans son accroissement \| journal = Correspondance Mathématique et Physique \|volume = 10\| pages = 113–121 \|url = https://books.google.com/books?id=8GsEAAAAYAAJ \| format = PDF\| access-date = 3 December 2014}}</ref><ref>{{harvnb\|Cramer\|2002\|p=4\|ps=, "He did not say how he fitted the curves."}}</ref> In his more detailed paper (1845), Verhulst determined the three parameters of the model by making the curve pass through three observed points, which yielded poor predictions.<ref>{{cite journal\|first= Pierre-François \|last=Verhulst \|year= 1845\| title = Recherches mathématiques sur la loi d'accroissement de la population \| journal = Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles \|volume = 18 \| url = http://gdz.sub.uni-goettingen.de/dms/load/img/?PPN=PPN129323640_0018&DMDID=dmdlog7\| access-date = 2013-02-18\|trans-title= Mathematical Researches into the Law of Population Growth Increase}}</ref>{{sfn\|Cramer\|2002\|p=4}} The logistic function was independently developed in chemistry as a model of [[autocatalysis]] ([[Wilhelm Ostwald]], 1883).{{sfn\|Cramer\|2002\|p=7}} An autocatalytic reaction is one in which one of the products is itself a [[catalyst]] for the same reaction, while the supply of one of the reactants is fixed. This naturally gives rise to the logistic equation for the same reason as population growth: the reaction is self-reinforcing but constrained. The logistic function was independently rediscovered as a model of population growth in 1920 by [[Raymond Pearl]] and [[Lowell Reed]], published as {{harvtxt\|Pearl\|Reed\|1920}}, which led to its use in modern statistics. They were initially unaware of Verhulst's work and presumably learned about it from [[L. Gustave du Pasquier]], but they gave him little credit and did not adopt his terminology.{{sfn\|Cramer\|2002\|p=6}} Verhulst's priority was acknowledged and the term "logistic" revived by [[Udny Yule]] in 1925 and has been followed since.{{sfn\|Cramer\|2002\|p=6–7}} Pearl and Reed first applied the model to the population of the United States, and also initially fitted the curve by making it pass through three points; as with Verhulst, this again yielded poor results.{{sfn\|Cramer\|2002\|p=5}} In the 1930s, the [[probit model]] was developed and systematized by [[Chester Ittner Bliss]], who coined the term "probit" in {{harvtxt\|Bliss\|1934}}, and by [[John Gaddum]] in {{harvtxt\|Gaddum\|1933}}, and the model fit by [[maximum likelihood estimation]] by [[Ronald A. Fisher]] in {{harvtxt\|Fisher\|1935}}, as an addendum to Bliss's work. The probit model was principally used in [[bioassay]], and had been preceded by earlier work dating to 1860; see {{slink\|Probit model\|History}}. The probit model influenced the subsequent development of the logit model and these models competed with each other.{{sfn\|Cramer\|2002\|p=7–9}} The logistic model was likely first used as an alternative to the probit model in bioassay by [[Edwin Bidwell Wilson]] and his student [[Jane Worcester]] in {{harvtxt\|Wilson\|Worcester\|1943}}.{{sfn\|Cramer\|2002\|p=9}} However, the development of the logistic model as a general alternative to the probit model was principally due to the work of [[Joseph Berkson]] over many decades, beginning in {{harvtxt\|Berkson\|1944}}, where he coined "logit", by analogy with "probit", and continuing through {{harvtxt\|Berkson\|1951}} and following years.<ref>{{harvnb\|Cramer\|2002\|p=8\|ps=, "As far as I can see the introduction of the logistics as an alternative to the normal probability function is the work of a single person, Joseph Berkson (1899–1982), ..."}}</ref> The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the logit",{{sfn\|Cramer\|2002\|p=11}} particularly between 1960 and 1970. By 1970, the logit model achieved parity with the probit model in use in statistics journals and thereafter surpassed it. This relative popularity was due to the adoption of the logit outside of bioassay, rather than displacing the probit within bioassay, and its informal use in practice; the logit's popularity is credited to the logit model's computational simplicity, mathematical properties, and generality, allowing its use in varied fields.{{sfn\|Cramer\|2002\|p=10–11}} Various refinements occurred during that time, notably by [[David Cox (statistician)\|David Cox]], as in {{harvtxt\|Cox\|1958}}.<ref name=wal67est>{{cite journal\|last1=Walker\|first1=SH\|last2=Duncan\|first2=DB\|title=Estimation of the probability of an event as a function of several independent variables\|journal=Biometrika\|date=1967\|volume=54\|issue=1/2\|pages=167–178\|doi=10.2307/2333860\|jstor=2333860}}</ref> The multinomial logit model was introduced independently in {{harvtxt\|Cox\|1966}} and {{harvtxt\|Theil\|1969}}, which greatly increased the scope of application and the popularity of the logit model.{{sfn\|Cramer\|2002\|p=13}} In 1973 [[Daniel McFadden]] linked the multinomial logit to the theory of [[discrete choice]], specifically [[Luce's choice axiom]], showing that the multinomial logit followed from the assumption of [[independence of irrelevant alternatives]] and interpreting odds of alternatives as relative preferences;<ref>{{cite book \|chapter=Conditional Logit Analysis of Qualitative Choice Behavior \|chapter-url=https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf \|archive-url=https://web.archive.org/web/20181127110612/https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf \|archive-date=2018-11-27 \|access-date=2019-04-20 \|first=Daniel \|last=McFadden \|author-link=Daniel McFadden \|editor=P. Zarembka \|title=Frontiers in Econometrics \|pages=105–142 \|publisher=Academic Press \|location=New York \|year=1973 }}</ref> this gave a theoretical foundation for the logistic regression.{{sfn\|Cramer\|2002\|p=13}} ==Extensions== There are large numbers of extensions: * [[Multinomial logistic regression]] (or '''multinomial logit''') handles the case of a multi-way [[categorical variable\|categorical]] dependent variable (with unordered values, also called "classification"). The general case of having dependent variables with more than two values is termed ''polytomous regression''. * [[Ordered logistic regression]] (or '''ordered logit''') handles [[Levels of measurement#Ordinal measurement\|ordinal]] dependent variables (ordered values). * [[Mixed logit]] is an extension of multinomial logit that allows for correlations among the choices of the dependent variable. * An extension of the logistic model to sets of interdependent variables is the [[conditional random field]]. * [[Conditional logistic regression]] handles [[Matching (statistics)\|matched]] or [[stratification (clinical trials)\|stratified]] data when the strata are small. It is mostly used in the analysis of [[observational studies]]. ==Software== Most [[statistical software]] can do binary logistic regression. * [[SPSS]] ** [http://www-01.ibm.com/support/docview.wss?uid=swg21475013] for basic logistic regression. * [[Stata]] * [[SAS (software)\|SAS]] [https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#logistic_toc.htm PROC LOGISTIC] for basic logistic regression. [https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_catmod_sect003.htm PROC CATMOD] when all the variables are categorical. ** [https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#glimmix_toc.htm PROC GLIMMIX] for [[multilevel model]] logistic regression. * [[R (programming language)\|R]] [[Generalized linear model\|<code>glm</code>]] in the stats package (using family = binomial)<ref>{{cite book \|first1=Andrew \|last1=Gelman \|first2=Jennifer \|last2=Hill\|author2-link=Jennifer Hill \|title=Data Analysis Using Regression and Multilevel/Hierarchical Models \|location=New York \|publisher=Cambridge University Press \|year=2007 \|isbn=978-0-521-68689-1 \|pages=79–108 \|url=https://books.google.com/books?id=lV3DIdV0F9AC&pg=PA79 }}</ref> <code>lrm</code> in the [https://cran.r-project.org/web/packages/rms rms package] GLMNET package for an efficient implementation regularized logistic regression lmer for mixed effects logistic regression Rfast package command <code>gm_logistic</code> for fast and heavy calculations involving large scale data. arm package for bayesian logistic regression * [[Python (programming language)\|Python]] [http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html <code>Logit</code>] in the [[Statsmodels]] module. [http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html <code>LogisticRegression</code>] in the [[scikit-learn]] module. [https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/learn/LogisticRegressor <code>LogisticRegressor</code>] in the [[TensorFlow]] module. Full example of logistic regression in the Theano tutorial [http://deeplearning.net/software/theano/tutorial/examples.html] Bayesian Logistic Regression with ARD prior [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/skbayes/rvm_ard_models/fast_rvm.py code], [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/ipython_notebooks_tutorials/rvm_ard/ard_classification_demo.ipynb tutorial] Variational Bayes Logistic Regression with ARD prior [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/skbayes/rvm_ard_models/vrvm.py code], [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/ipython_notebooks_tutorials/rvm_ard/vbard_classification.ipynb tutorial] ** Bayesian Logistic Regression [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/skbayes/linear_models/bayes_logistic.py code], [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/ipython_notebooks_tutorials/linear_models/bayesian_logistic_regression_demo.ipynb tutorial] * [[NCSS (statistical software)\|NCSS]] ** [http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Logistic_Regression.pdf Logistic Regression in NCSS] * [[Matlab]] <code>mnrfit</code> in the [[Statistics Toolbox for MATLAB\|Statistics and Machine Learning Toolbox]] (with "incorrect" coded as 2 instead of 0) <code>fminunc/fmincon, fitglm, mnrfit, fitclinear, mle</code> can all do logistic regression. [[Java (programming language)\|Java]] ([[Java virtual machine\|JVM]]) [[LibLinear]] [https://nightlies.apache.org/flink/flink-ml-docs-release-2.0/docs/operators/classification/logisticregression/ Apache Flink] [[Apache Spark]] *[[Apache Spark\|SparkML]] supports Logistic Regression [[Field-programmable gate array\|FPGA]] ** [https://github.com/inaccel/logisticregression <code>Logistic Regresesion IP core</code>] in [[High-level synthesis\|HLS]] for [[Field-programmable gate array\|FPGA]]. Notably, [[Microsoft Excel]]'s statistics extension package does not include it. ==See also== {{Portal\|Mathematics}} * [[Logistic function]] * [[Discrete choice]] * [[Jarrow–Turnbull model]] * [[Limited dependent variable]] * [[Multinomial logit\|Multinomial logit model]] * [[Ordered logit]] * [[Hosmer–Lemeshow test]] * [[Brier score]] * [[mlpack]] - contains a [[C++]] implementation of logistic regression * [[Local case-control sampling]] * [[Logistic model tree]] ==References== {{Reflist\|32em\|refs= <ref name=Hosmer>{{cite book \| last1 = Hosmer \| first1 = David W. \| first2= Stanley \|last2=Lemeshow \| title = Applied Logistic Regression \|edition= 2nd \| publisher = Wiley \| year = 2000 \| isbn = 978-0-471-35632-5 }} {{page needed\|date=May 2012}}</ref> <ref name=Menard>{{cite book \| last = Menard \| first = Scott W. \| title = Applied Logistic Regression \|edition= 2nd \| publisher = SAGE \| year = 2002 \| isbn = 978-0-7619-2208-7 }} {{page needed\|date=May 2012}}</ref> <ref name=Cohen>{{cite book \| last1 = Cohen \| first1 = Jacob \| first2= Patricia \|last2=Cohen \|first3= Steven G. \|last3= West \|first4= Leona S. \|last4= Aiken \| author4-link= Leona S. Aiken \| title = Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences \|edition= 3rd \| publisher = Routledge \| year = 2002 \| isbn = 978-0-8058-2223-6 }} {{page needed\|date=May 2012}} </ref> <ref name=rms>{{cite book \| last = Harrell \| first = Frank E. \| title = Regression Modeling Strategies \| edition = 2nd \| publisher = New York; Springer \| year = 2015 \| isbn = 978-3-319-19424-0 \| doi = 10.1007/978-3-319-19425-7\| series = Springer Series in Statistics }} </ref> <ref name=plo14mod>{{cite journal\|last1=van der Ploeg\|first1=Tjeerd\|last2=Austin\|first2=Peter C.\|last3=Steyerberg\|first3=Ewout W.\|title=Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints\|journal=BMC Medical Research Methodology\|date=2014\|volume=14\|page=137\|doi=10.1186/1471-2288-14-137\|pmid=25532820\|pmc=4289553 \|doi-access=free }} </ref> }} ==Sources== {{Refbegin\|32em}} * {{Cite journal \| last = Berkson\| first = Joseph \| doi = 10.1080/01621459.1944.10500699 \| title = Application of the Logistic Function to Bio-Assay \| journal = Journal of the American Statistical Association \| date = 1944\| volume = 39\| issue = 227\| pages = 357–365\| jstor = 2280041}} * {{cite journal \|last1=Berkson \|first1=Joseph \|title=Why I Prefer Logits to Probits \|journal=Biometrics \|date=1951 \|volume=7 \|issue=4 \|pages=327–339 \|doi=10.2307/3001655 \|jstor=3001655 \|issn=0006-341X}} * {{cite journal \|first=C. I. \|last=Bliss \|title=The Method of Probits \|journal=[[Science (journal)\|Science]] \|volume=79 \|issue=2037 \|pages=38–39 \|year=1934 \|quote=These arbitrary probability units have been called 'probits'. \|doi=10.1126/science.79.2037.38 \|pmid=17813446 \|bibcode=1934Sci....79...38B }} * {{cite journal \|last=Cox\|first=David R. \|author-link=David Cox (statistician) \|title=The regression analysis of binary sequences (with discussion)\|journal=J R Stat Soc B\|date=1958\|volume=20\|issue=2\|pages=215–242\|jstor=2983890}} * {{cite book \|author-link=David Cox (statistician) \|last=Cox \|first=David R. \|year=1966 \|chapter=Some procedures connected with the logistic qualitative response curve \|title=Research Papers in Probability and Statistics (Festschrift for J. Neyman) \|editor=F. N. David \|publisher=Wiley \|location=London \|pages=55–71 }} * {{cite tech report \|last=Cramer\|first=J. S. \|title=The origins of logistic regression \|institution=Tinbergen Institute \|date=2002\|volume=119\|issue=4\|pages=167–178 \|doi=10.2139/ssrn.360300 \|url=https://papers.tinbergen.nl/02119.pdf }} ** Published in: {{cite journal \|title=The early origins of the logit model \|last=Cramer\|first=J. S. \|journal=Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences \|volume=35 \|number=4 \|year=2004 \|pages=613–626 \|doi=10.1016/j.shpsc.2004.09.003 }} * {{cite journal \|last=Fisher \|first=R. A. \|title=The Case of Zero Survivors in Probit Assays \|journal=Annals of Applied Biology \|volume=22 \|pages=164–165 \|year=1935 \|url=https://ebooks.adelaide.edu.au/dspace/handle/2440/15223 \|archive-url=https://archive.today/20140430203018/http://ebooks.adelaide.edu.au/dspace/handle/2440/15223 \|archive-date=2014-04-30 \|doi=10.1111/j.1744-7348.1935.tb07713.x }} * {{cite book \|last=Gaddum \|first=John H. \|title=Reports on Biological Standards: Methods of biological assay depending on a quantal response. III \|date=1933 \|publisher=H.M. Stationery Office \|oclc=808240121}} * {{cite journal \|first=Henri \|last=Theil \|title=A Multinomial Extension of the Linear Logit Model \|journal=International Economic Review \|volume=10 \|number=3 \|pages=251–59 \|year=1969 \|doi=10.2307/2525642 \|jstor=2525642 }} {{Cite journal \| last1 = Pearl\| first1 = Raymond \| last2 = Reed\| first2 = Lowell J. \| title = On the Rate of Growth of the Population of the United States since 1790 and Its Mathematical Representation \| journal = Proceedings of the National Academy of Sciences \| date = June 1920 \| volume = 6 \| number = 6 \| pages = 275–288 \| doi = 10.1073/pnas.6.6.275 \| pmid = 16576496 \| pmc = 1084522 \| bibcode = 1920PNAS....6..275P \| doi-access = free }} {{cite journal \|title=The Determination of L.D.50 and Its Sampling Error in Bio-Assay. \|last1=Wilson \|first1=E.B. \|author-link1=Edwin Bidwell Wilson \|last2=Worcester \|first2=J. \|author-link2=Jane Worcester \|year=1943 \|volume=29 \|number=2 \|pages=79–85 \|journal=[[Proceedings of the National Academy of Sciences of the United States of America]] \|pmid=16588606 \|pmc=1078563 \|doi=10.1073/pnas.29.2.79 \|bibcode=1943PNAS...29...79W \|doi-access=free }} * {{cite book \| last = Agresti \| first = Alan. \| title = Categorical Data Analysis \| publisher = New York: Wiley-Interscience \| year = 2002 \| isbn = 978-0-471-36093-3 }} * {{cite book \|last=Amemiya \|first=Takeshi \|chapter=Qualitative Response Models \|title=Advanced Econometrics \|year=1985 \|publisher=Basil Blackwell \|location=Oxford \|isbn=978-0-631-13345-2 \|pages=267–359 \|chapter-url=https://books.google.com/books?id=0bzGQE14CwEC&pg=PA267 }} * {{cite book \| last = Balakrishnan \| first = N. \| title = Handbook of the Logistic Distribution \| publisher = Marcel Dekker, Inc. \| year = 1991 \| isbn = 978-0-8247-8587-1 }} * {{cite book \|first=Christian \|last=Gouriéroux \|author-link=Christian Gouriéroux \|chapter=The Simple Dichotomy \|title=Econometrics of Qualitative Dependent Variables \|location=New York \|publisher=Cambridge University Press \|year=2000 \|isbn=978-0-521-58985-7 \|pages=6–37 \|chapter-url=https://books.google.com/books?id=dE2prs_U0QMC&pg=PA6 }} * {{cite book \| last = Greene \| first = William H. \| title = Econometric Analysis, fifth edition \| publisher = Prentice Hall \| year = 2003 \| isbn = 978-0-13-066189-0 }} * {{cite book \| last = Hilbe \| first = Joseph M. \| author-link=Joseph Hilbe \| title = Logistic Regression Models \| publisher = Chapman & Hall/CRC Press \| year = 2009 \| isbn = 978-1-4200-7575-5}} * {{cite book \| last = Hosmer \| first = David \| title = Applied logistic regression \| publisher = Wiley \| location = Hoboken, New Jersey \| year = 2013 \| isbn = 978-0-470-58247-3 }} * {{cite book \| last = Howell \| first = David C. \| title = Statistical Methods for Psychology, 7th ed \| publisher = Belmont, CA; Thomson Wadsworth \| year = 2010 \| isbn = 978-0-495-59786-5 }} * {{cite journal \| last = Peduzzi \| first = P. \|author2=J. Concato \|author3=E. Kemper \|author4=T.R. Holford \|author5=A.R. Feinstein \| title = A simulation study of the number of events per variable in logistic regression analysis \| journal = [[Journal of Clinical Epidemiology]] \| volume = 49 \| issue = 12 \| pages = 1373–1379 \| year = 1996 \| pmid = 8970487 \| doi=10.1016/s0895-4356(96)00236-3\| doi-access = free }} {{cite book \| last1 = Berry \| first1 = Michael J.A. \| first2 = Gordon \| last2 = Linoff \| title = Data Mining Techniques For Marketing, Sales and Customer Support \| publisher = Wiley \| year = 1997}} {{Refend}} ==External links== {{Wikiversity}} {{Commons category-inline}} {{YouTube\|id=JvioZoK1f4o&t=64m48s\|title=Econometrics Lecture (topic: Logit model)}} by [[Mark Thoma]] [http://www.omidrouhani.com/research/logisticregression/html/logisticregression.htm Logistic Regression tutorial] *[https://czep.net/stat/mlelr.html mlelr]: software in [[C (programming language)\|C]] for teaching purposes {{Statistics\|correlation}} {{Authority control}} [[Category:Logistic regression\| ]] [[Category:Predictive analytics]] [[Category:Regression models]]'
Unified diff of changes made by edit (`edit_diff`)	'@@ -23,5 +23,5 @@ \| issue = 7 \| pages = 511–24 -\| last2 = Cornfield\| first2 = J\| last3 = Kannel\| first3 = W \| doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name= strano05>{{cite journal \| author = M. Strano \| author2 = B.M. Colosimo \| year = 2006 \| title = Logistic regression analysis for experimental determination of forming limit diagrams \| journal = International Journal of Machine Tools and Manufacture \| volume = 46 \| issue = 6 \| pages = 673–682 \| doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name= safety>{{cite journal \| last1 = Palei \| first1 = S. K. \| last2 = Das \| first2 = S. K. \| doi = 10.1016/j.ssci.2008.01.002 \| title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach \| journal = Safety Science \| volume = 47 \| pages = 88–96 \| year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book\|title=Data Mining Techniques For Marketing, Sales and Customer Support\|last= Berry \|first=Michael J.A\|publisher=Wiley\|year=1997\|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]]. +\| last2 = Cornfield\| first2 = J\| last3 = Kannel\| first3 = W \| doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name= strano05>{{cite journal \| author = M. Strano \| author2 = B.M. Colosimo \| year = 2006 \| title = Logistic regression analysis for experimental determination of forming limit diagrams \| journal = International Journal of Machine Tools and Manufacture \| volume = 46 \| issue = 6 \| pages = 673–682 \| doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name= safety>{{cite journal \| last1 = Palei \| first1 = S. K. \| last2 = Das \| first2 = S. K. \| doi = 10.1016/j.ssci.2008.01.002 \| title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach \| journal = Safety Science \| volume = 47 \| pages = 88–96 \| year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book\|title=Data Mining Techniques For Marketing, Sales and Customer Support\|last= Berry \|first=Michael J.A\|publisher=Wiley\|year=1997\|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]].Disaster planners and engineers rely on these models to predict decision take by householders or building occupants in small-scale and large-scales evacuations ,such as building fires, wildfires, hurricanes among others.<ref>{{Cite journal \|last=Mesa-Arango \|first=Rodrigo \|last2=Hasan \|first2=Samiul \|last3=Ukkusuri \|first3=Satish V. \|last4=Murray-Tuite \|first4=Pamela \|date=2013-02 \|title=Household-Level Model for Hurricane Evacuation Destination Type Choice Using Hurricane Ivan Data \|url=https://ascelibrary.org/doi/10.1061/%28ASCE%29NH.1527-6996.0000083 \|journal=Natural Hazards Review \|language=en \|volume=14 \|issue=1 \|pages=11–20 \|doi=10.1061/(ASCE)NH.1527-6996.0000083 \|issn=1527-6988}}</ref><ref>{{Cite journal \|last=Wibbenmeyer \|first=Matthew J. \|last2=Hand \|first2=Michael S. \|last3=Calkin \|first3=David E. \|last4=Venn \|first4=Tyron J. \|last5=Thompson \|first5=Matthew P. \|date=2013-06 \|title=Risk Preferences in Strategic Wildfire Decision Making: A Choice Experiment with U.S. Wildfire Managers \|url=https://onlinelibrary.wiley.com/doi/10.1111/j.1539-6924.2012.01894.x \|journal=Risk Analysis \|language=en \|volume=33 \|issue=6 \|pages=1021–1037 \|doi=10.1111/j.1539-6924.2012.01894.x \|issn=0272-4332}}</ref><ref>{{Cite journal \|last=Lovreglio \|first=Ruggiero \|last2=Borri \|first2=Dino \|last3=dell’Olio \|first3=Luigi \|last4=Ibeas \|first4=Angel \|date=2014-02-01 \|title=A discrete choice model based on random utilities for exit choice in emergency evacuations \|url=https://www.sciencedirect.com/science/article/pii/S0925753513002294 \|journal=Safety Science \|volume=62 \|pages=418–426 \|doi=10.1016/j.ssci.2013.10.004 \|issn=0925-7535}}</ref> These models help in the development of reliable [[Emergency management\|disaster managing plan]]<nowiki/>s and safer design for the [[built environment]]. ==Example== '
New page size (`new_size`)	132155
Old page size (`old_size`)	130346
Size change in edit (`edit_delta`)	1809
Lines added in edit (`added_lines`)	[ 0 => '\| last2 = Cornfield\| first2 = J\| last3 = Kannel\| first3 = W \| doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name= strano05>{{cite journal \| author = M. Strano \| author2 = B.M. Colosimo \| year = 2006 \| title = Logistic regression analysis for experimental determination of forming limit diagrams \| journal = International Journal of Machine Tools and Manufacture \| volume = 46 \| issue = 6 \| pages = 673–682 \| doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name= safety>{{cite journal \| last1 = Palei \| first1 = S. K. \| last2 = Das \| first2 = S. K. \| doi = 10.1016/j.ssci.2008.01.002 \| title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach \| journal = Safety Science \| volume = 47 \| pages = 88–96 \| year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book\|title=Data Mining Techniques For Marketing, Sales and Customer Support\|last= Berry \|first=Michael J.A\|publisher=Wiley\|year=1997\|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]].Disaster planners and engineers rely on these models to predict decision take by householders or building occupants in small-scale and large-scales evacuations ,such as building fires, wildfires, hurricanes among others.<ref>{{Cite journal \|last=Mesa-Arango \|first=Rodrigo \|last2=Hasan \|first2=Samiul \|last3=Ukkusuri \|first3=Satish V. \|last4=Murray-Tuite \|first4=Pamela \|date=2013-02 \|title=Household-Level Model for Hurricane Evacuation Destination Type Choice Using Hurricane Ivan Data \|url=https://ascelibrary.org/doi/10.1061/%28ASCE%29NH.1527-6996.0000083 \|journal=Natural Hazards Review \|language=en \|volume=14 \|issue=1 \|pages=11–20 \|doi=10.1061/(ASCE)NH.1527-6996.0000083 \|issn=1527-6988}}</ref><ref>{{Cite journal \|last=Wibbenmeyer \|first=Matthew J. \|last2=Hand \|first2=Michael S. \|last3=Calkin \|first3=David E. \|last4=Venn \|first4=Tyron J. \|last5=Thompson \|first5=Matthew P. \|date=2013-06 \|title=Risk Preferences in Strategic Wildfire Decision Making: A Choice Experiment with U.S. Wildfire Managers \|url=https://onlinelibrary.wiley.com/doi/10.1111/j.1539-6924.2012.01894.x \|journal=Risk Analysis \|language=en \|volume=33 \|issue=6 \|pages=1021–1037 \|doi=10.1111/j.1539-6924.2012.01894.x \|issn=0272-4332}}</ref><ref>{{Cite journal \|last=Lovreglio \|first=Ruggiero \|last2=Borri \|first2=Dino \|last3=dell’Olio \|first3=Luigi \|last4=Ibeas \|first4=Angel \|date=2014-02-01 \|title=A discrete choice model based on random utilities for exit choice in emergency evacuations \|url=https://www.sciencedirect.com/science/article/pii/S0925753513002294 \|journal=Safety Science \|volume=62 \|pages=418–426 \|doi=10.1016/j.ssci.2013.10.004 \|issn=0925-7535}}</ref> These models help in the development of reliable [[Emergency management\|disaster managing plan]]<nowiki/>s and safer design for the [[built environment]].' ]
Lines removed in edit (`removed_lines`)	[ 0 => '\| last2 = Cornfield\| first2 = J\| last3 = Kannel\| first3 = W \| doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name= strano05>{{cite journal \| author = M. Strano \| author2 = B.M. Colosimo \| year = 2006 \| title = Logistic regression analysis for experimental determination of forming limit diagrams \| journal = International Journal of Machine Tools and Manufacture \| volume = 46 \| issue = 6 \| pages = 673–682 \| doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name= safety>{{cite journal \| last1 = Palei \| first1 = S. K. \| last2 = Das \| first2 = S. K. \| doi = 10.1016/j.ssci.2008.01.002 \| title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach \| journal = Safety Science \| volume = 47 \| pages = 88–96 \| year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book\|title=Data Mining Techniques For Marketing, Sales and Customer Support\|last= Berry \|first=Michael J.A\|publisher=Wiley\|year=1997\|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]].' ]
Whether or not the change was made through a Tor exit node (`tor_exit_node`)	false
Unix timestamp of change (`timestamp`)	'1705889177'