

ORIGINAL ARTICLE 

Year : 2021  Volume : 9  Issue : 1  Page : 14-15

Statistical corner: Logistic regression using R
Mikko Pyysalo
City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic Brain Pathology Research Group, University of Tampere, Finland
Date of Submission  16-Jun-2021
Date of Decision  21-Jun-2021
Date of Acceptance  01-Jul-2021
Date of Web Publication  27-Aug-2021
Correspondence Address: Dr. Mikko Pyysalo, City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic Brain Pathology Research Group, University of Tampere, Finland
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/jcvs.jcvs_14_21
Introduction: Logistic regression is a regression with a categorical outcome variable and predictor variables that can be either continuous or categorical. Objectives: To demonstrate the basic workflow of logistic regression using R. Materials and Methods: A real-world dataset has been used to present an example of the basic workflow of logistic regression using R. Results: Accurate results were obtained, including the deviance for analysing the fit of the model. Conclusions: Performing basic statistical modeling in R is a simple and straightforward procedure. Analysing model fit is essential to be able to report the results.
Keywords: Logistic, regression, statistics
How to cite this article: Pyysalo M. Statistical corner: Logistic regression using R. J Cerebrovasc Sci 2021;9:14-5
Introduction   
Logistic regression is a regression with a categorical outcome variable and predictor variables that can be either continuous or categorical. In this short tutorial, the basic workflow of logistic regression using the R statistical software is shown. As introduced earlier,^{[1]} R is a statistical programming language widely used in data science and statistics. R can be downloaded for Windows, macOS and Linux from the https://www.r-project.org/ webpage. RStudio is an integrated development environment for R; a free version can be downloaded from https://rstudio.com/. R can be used without RStudio, but RStudio makes many tasks easier, such as importing data into R.
Methods   
In this example, a dataset by Unda et al.^{[2]} is used to show a basic workflow of logistic regression. The purpose of this short tutorial is not to explain the mathematical background behind the regression models.
First, the data should be saved to the desired location (the desktop in this case) and read into R as described in the earlier Statistical corner.^{[1]}
library(readxl)
Dataset <- read_excel('~/Desktop/Dataset.xls')
Then, all the variable names are cleaned. Every character other than a letter or a digit is replaced with an underscore (_) to avoid possible subsequent problems. '[^[:alnum:]]' is a regular expression meaning 'any character that is not alphanumeric'. If variable names contain spaces or mathematical symbols, there might be problems during the analyses.
names(Dataset) <- gsub('[^[:alnum:]]', '_', names(Dataset))
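As a minimal sketch of what this substitution does, the same gsub() call can be tried on a few made-up column names. The names below are hypothetical and are not taken from the dataset.

```r
# Hypothetical column names, used only to illustrate the clean-up step
raw_names <- c("mRS at discharge", "BMI (kg/m2)", "Age")

# Replace every character that is not a letter or a digit with an underscore
clean_names <- gsub("[^[:alnum:]]", "_", raw_names)

print(clean_names)
```

Each offending character is replaced separately, so two adjacent symbols (such as a space followed by a bracket) produce two underscores.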
Then, a simple boxplot showing the rough relationship between age and the patient's condition at discharge is created [Figure 1].
Figure 1: Example analysis. Boxplot of the age distribution in the different condition groups of the patients
boxplot(Age ~ mRS_at_dIscharge, data = Dataset)
To create a dichotomous variable, the transform() function is used. This piece of code tells R to create a new variable, categorical_mRs, and give it the value '0' if mRS_at_dIscharge is 1, 2 or 3. All other values of the new variable are set to '1'. Thus, the value of categorical_mRs is '0' in patients with good recovery and '1' when the recovery is poorer. The basic form of the ifelse() function is explained at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse.
Dataset <- transform(Dataset, categorical_mRs = ifelse(mRS_at_dIscharge == 1 | mRS_at_dIscharge == 2 | mRS_at_dIscharge == 3, 0, 1))
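The recoding can be checked on a small made-up data frame before it is applied to the real data. This is only a sketch: the toy values below are invented for illustration, and only the column names follow the example above.

```r
# Toy data: one hypothetical patient for each mRS value from 1 to 6
toy <- data.frame(mRS_at_dIscharge = c(1, 2, 3, 4, 5, 6))

# Same recoding rule as above: 1, 2 or 3 becomes 0, everything else becomes 1
toy <- transform(
  toy,
  categorical_mRs = ifelse(
    mRS_at_dIscharge == 1 | mRS_at_dIscharge == 2 | mRS_at_dIscharge == 3,
    0, 1
  )
)

# Cross-tabulate the original and the recoded variable to verify the mapping
print(table(toy$mRS_at_dIscharge, toy$categorical_mRs))
```

A cross-tabulation like this is a quick way to confirm that every original value ended up in the intended group.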
Then, only the variables of interest are selected and all the rows with missing values are dropped using tidyverse functions:
library(tidyverse)
Dataset <- Dataset %>%
  select(Age, BMI, mRS_at_dIscharge, categorical_mRs) %>%
  drop_na()
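The effect of this step can be sketched on a small invented data frame. The values and the extra Sex column below are hypothetical, and the sketch assumes the dplyr and tidyr packages (both part of the tidyverse) are installed. Note that the result of the pipeline has to be assigned to a name, otherwise the cleaned data are only printed and not kept.

```r
library(dplyr)  # provides select() and %>%
library(tidyr)  # provides drop_na()

# Invented example data with two missing values and one extra column
toy <- data.frame(
  Age = c(54, 61, NA, 47),
  BMI = c(24.1, NA, 30.2, 27.5),
  mRS_at_dIscharge = c(1, 4, 2, 5),
  categorical_mRs = c(0, 1, 0, 1),
  Sex = c("F", "M", "F", "M")
)

# Keep only the variables of interest, then drop every row containing an NA
toy_clean <- toy %>%
  select(Age, BMI, mRS_at_dIscharge, categorical_mRs) %>%
  drop_na()

print(toy_clean)
```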
It is not necessary to get rid of the extra variables, but with bigger datasets the view might be confusing if the whole dataset is printed on the screen. It is only the author's personal preference to work with as small a dataset as possible. Then, the logistic regression model is built using the generalized linear model command glm() from base R. Both age and body mass index (BMI) are chosen as predictor variables to see if they explain the condition of the patient at discharge:
model <- glm(categorical_mRs ~ Age + BMI, family = 'binomial', data = Dataset)
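For readers without access to the dataset, the same call can be tried on simulated data. The sketch below is not the author's analysis: the sample size, the simulated distributions and the assumed effect of age are all invented for illustration, and only the variable names follow the example above.

```r
set.seed(1)  # make the simulation reproducible

# Simulated predictors (hypothetical distributions)
n <- 200
sim <- data.frame(
  Age = round(rnorm(n, mean = 55, sd = 12)),
  BMI = rnorm(n, mean = 27, sd = 4)
)

# Assumed relationship: the log-odds of a poor outcome rise with age,
# while BMI has no effect
log_odds <- -4 + 0.07 * sim$Age
sim$categorical_mRs <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit the logistic regression in the same way as in the example above
sim_model <- glm(categorical_mRs ~ Age + BMI, family = "binomial", data = sim)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
print(summary(sim_model))
print(exp(coef(sim_model)))
```

Exponentiating the coefficients is a common way to report the results, because an odds ratio per year of age or per BMI unit is usually easier to interpret than a change in log-odds.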
Results   
When summary(model) is typed, R shows the overall information about the model. In this case, BMI does not seem to explain the condition of the patient (P > 0.05), but age seems to be a significant predictor (P < 0.001). The deviance of a model is a statistic that describes the overall fit of the model: the larger the deviance, the poorer the fit. The null deviance is the deviance of a 'null model', which is a model that contains only a constant (the intercept). The residual deviance is the deviance of the model with the given predictors. In this case, the null deviance is 205.27 and the residual deviance is 192.49, a reduction of 12.78, which means that the model with the predictors fits better than the 'null model', which is of course a good thing in this case. To analyse further, one could build other models on the same dataset and test which of them explains the chosen outcome better, using anova() to compare the models.
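The drop in deviance can also be tested formally. The sketch below does two things under stated assumptions. First, it compares the reported null and residual deviances with a chi-squared distribution, assuming two degrees of freedom because the model has two predictors (Age and BMI); this test is not reported in the article. Second, it shows how anova() could compare two nested models, using simulated data with invented values rather than the real dataset.

```r
# Part 1: likelihood ratio test from the deviances reported above
null_deviance <- 205.27
residual_deviance <- 192.49
df_difference <- 2  # assumption: two predictors were added to the null model

lr_statistic <- null_deviance - residual_deviance
p_value <- pchisq(lr_statistic, df = df_difference, lower.tail = FALSE)
cat("Deviance reduction:", lr_statistic, " P value:", p_value, "\n")

# Part 2: comparing nested models with anova() on simulated (invented) data
set.seed(1)
n <- 200
sim <- data.frame(
  Age = round(rnorm(n, mean = 55, sd = 12)),
  BMI = rnorm(n, mean = 27, sd = 4)
)
sim$categorical_mRs <- rbinom(n, size = 1, prob = plogis(-4 + 0.07 * sim$Age))

model_age <- glm(categorical_mRs ~ Age, family = "binomial", data = sim)
model_age_bmi <- glm(categorical_mRs ~ Age + BMI, family = "binomial", data = sim)

# test = "Chisq" requests a chi-squared test on the difference in deviance
comparison <- anova(model_age, model_age_bmi, test = "Chisq")
print(comparison)
```

A small P value in the anova() table would suggest that the larger model explains the outcome better than the smaller one. Both models must be fitted to exactly the same rows for the comparison to be valid.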
Conclusion   
To conclude, performing basic statistical modeling in R is a fairly simple and straightforward procedure. Analysing model fit is essential to be able to report the results.
References   
1.  Pyysalo M, Vesterinen T. Statistical corner: Using R to build, analyse and plot clinical neurological datasets. J Cerebrovasc Sci 2020;8:107-12.
2.  Unda SR, Labagnara K, Birnbaum J, Wong M, de Silva N, Terala H, et al. Impact of hospital-acquired complications in long-term clinical outcomes after subarachnoid hemorrhage. Clin Neurol Neurosurg 2020;194:105945.
