   ORIGINAL ARTICLE
 Year : 2021  |  Volume : 9  |  Issue : 1  |  Page : 14-15

Statistical corner: Logistic regression using R

City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic Brain Pathology Research Group, University of Tampere, Finland

Date of Submission: 16-Jun-2021 | Date of Decision: 21-Jun-2021 | Date of Acceptance: 01-Jul-2021 | Date of Web Publication: 27-Aug-2021

Dr. Mikko Pyysalo
City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic Brain Pathology Research Group, University of Tampere
Finland

Source of Support: None, Conflict of Interest: None

DOI: 10.4103/jcvs.jcvs_14_21

Abstract

Introduction: Logistic regression is a regression with a categorical outcome variable and predictor variables that can be either continuous or categorical.
Objectives: To demonstrate the basic workflow of logistic regression using R.
Materials and Methods: A real-world dataset was used as an example of the basic workflow of logistic regression using R.
Results: The model was fitted successfully, and its output included the deviance used for analysing the fit of the model.
Conclusions: Performing basic statistical modeling in R is a simple and straightforward procedure. Analysing model fit is essential to be able to report the results.

Keywords: Logistic, regression, statistics

How to cite this article: Pyysalo M. Statistical corner: Logistic regression using R. J Cerebrovasc Sci 2021;9:14-5

How to cite this URL: Pyysalo M. Statistical corner: Logistic regression using R. J Cerebrovasc Sci [serial online] 2021 [cited 2021 Sep 18];9:14-5. Available from: http://www.jcvs.com/text.asp?2021/9/1/14/324810

Introduction

Logistic regression is a regression with a categorical outcome variable and predictor variables that can be either continuous or categorical. This short tutorial shows the basic workflow of logistic regression using the R statistical software. As introduced earlier,[1] R is a statistical programming language widely used in data science and statistics. R can be downloaded for Windows, macOS and Linux from https://www.r-project.org/. RStudio is an integrated development environment for R; a free version can be downloaded from https://rstudio.com/. R can be used without RStudio, but RStudio makes many tasks easier, such as importing data into R.

Methods

In this example, a dataset by Unda et al.[2] is used to show a basic workflow of logistic regression. The purpose of this short tutorial is not to explain the mathematical background behind the regression models.

First, the data should be saved to the desired location (the desktop in this case) and read into R as described in the earlier Statistical corner.[1]
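As a minimal sketch of this step, assuming the data were exported as a comma-separated file: the column names and values below are hypothetical, and a temporary file stands in for the real dataset so that the example runs on any machine.

```r
# Hypothetical stand-in for the real file: write a tiny CSV to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Age,BMI,mRS at discharge",
             "54,27.1,2",
             "71,31.4,5",
             "63,,3"),
           csv_path)

# check.names = FALSE keeps the original column names (including spaces),
# so that they can be cleaned explicitly afterwards.
Dataset <- read.csv(csv_path, check.names = FALSE)

str(Dataset)   # inspect column types and the first values
```

In real use, csv_path would simply be the path to the file saved on the desktop.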

Then, all the variable names are cleaned. Every character other than a letter or a number is replaced with the _ symbol to avoid possible subsequent problems. '[^[:alnum:]]' is a regular expression meaning 'any character that is not alphanumeric'. If variable names contain spaces or mathematical symbols, there might be problems during the analyses.

names(Dataset) <- gsub('[^[:alnum:]]', '_', names(Dataset))
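The effect of the regular expression can be checked on a few example names before applying it to the dataset. The names below are hypothetical and only illustrate the kind of column headers found in exported spreadsheets.

```r
# Hypothetical column names containing spaces, brackets, a slash and a hyphen.
raw_names <- c("Age", "BMI (kg/m2)", "mRS at discharge", "Hospital-acquired infection")

# Replace every character that is not a letter or a digit with an underscore.
clean_names <- gsub("[^[:alnum:]]", "_", raw_names)

print(clean_names)
```

Note that each offending character is replaced separately, so consecutive symbols produce consecutive underscores.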

Then, a simple boxplot showing the rough relationship between age and the patient's condition at discharge is created [Figure 1].

boxplot(Age ~ mRS_at_dIscharge, data = Dataset)

Figure 1: Example analysis. Boxplot of the age distribution in the different condition groups of the patients.

To create a dichotomous variable, the transform() function is used. The code below tells R to create a new variable categorical_mRs and give it the value 0 if mRS_at_dIscharge is 1, 2 or 3, and the value 1 otherwise. Thus, categorical_mRs is 0 in patients with good recovery and 1 when the recovery is poorer. The basic form of the ifelse() function is explained at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse.

Dataset <- transform(Dataset, categorical_mRs = ifelse(mRS_at_dIscharge == 1 | mRS_at_dIscharge == 2 | mRS_at_dIscharge == 3, 0, 1))
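To see what this recoding does to every possible score, it can be tried on a small toy data frame. The scores below are made up and are not from the Unda et al. data; they include a 0 and a missing value to show the edge cases.

```r
# Toy data: hypothetical mRS scores at discharge, including 0 and a missing value.
toy <- data.frame(mRS_at_dIscharge = c(0, 1, 2, 3, 4, 5, 6, NA))

toy <- transform(
  toy,
  categorical_mRs = ifelse(mRS_at_dIscharge == 1 |
                             mRS_at_dIscharge == 2 |
                             mRS_at_dIscharge == 3, 0, 1)
)

# Cross-tabulate original and recoded values; useNA also shows the missing row.
print(table(toy$mRS_at_dIscharge, toy$categorical_mRs, useNA = "ifany"))
```

With this rule a score of 0 would be coded as 1, and a missing score stays missing, so it is worth checking which scores actually occur in the real data.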

Then, only the variables of interest are selected and all the rows with missing values are dropped using tidyverse functions:

library(tidyverse)

Dataset <- Dataset %>%
  select(Age, BMI, mRS_at_dIscharge, categorical_mRs) %>%
  drop_na()
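For readers who do not have the tidyverse installed, the same two operations can be sketched in base R. This is an alternative shown for illustration, not the approach used in this tutorial, and the toy values (including the ID column standing in for the extra variables) are hypothetical.

```r
# Toy data frame with hypothetical values and some missing entries.
toy <- data.frame(
  ID = 1:4,
  Age = c(54, 71, NA, 63),
  BMI = c(27.1, NA, 24.8, 30.2),
  categorical_mRs = c(0, 1, 1, 0)
)

# Keep only the variables of interest ...
toy <- toy[, c("Age", "BMI", "categorical_mRs")]

# ... and keep only the rows with no missing value in any remaining column.
toy <- toy[complete.cases(toy), ]

print(nrow(toy))   # number of complete rows left
```

In both versions the result has to be assigned back to the data frame; otherwise the rows with missing values are only dropped in the printed output.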

It is not necessary to remove the extra variables, but with bigger datasets the view can be confusing if the whole dataset is printed on the screen. It is only the author's personal preference to work with as small a dataset as possible.

Then, the logistic regression model is built using the generalized linear model command glm() from base R. Both age and body mass index (BMI) are chosen as predictor variables to see if they explain the condition of the patient at discharge:

model <- glm(categorical_mRs ~ Age + BMI, family = 'binomial', data = Dataset)
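Because the original data are not reproduced here, the following sketch fits the same kind of model to simulated data. All values, the sample size and the simulated effect of age are assumptions made for illustration only. It also shows one common way to express the coefficients as odds ratios, which goes slightly beyond the steps described above.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for the real dataset (all values are made up).
n <- 200
sim <- data.frame(
  Age = round(rnorm(n, mean = 55, sd = 12)),
  BMI = rnorm(n, mean = 27, sd = 4)
)
# By construction the outcome depends on Age only; BMI has no effect.
p_poor <- plogis(-4 + 0.06 * sim$Age)
sim$categorical_mRs <- rbinom(n, size = 1, prob = p_poor)

model <- glm(categorical_mRs ~ Age + BMI, family = "binomial", data = sim)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
odds_ratios <- exp(coef(model))
# Wald-type 95% confidence intervals, also on the odds-ratio scale.
ci <- exp(confint.default(model))

print(cbind(OR = odds_ratios, ci))
```

An odds ratio above 1 for Age means that the odds of a poorer outcome increase with each additional year of age.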

Results

When summary(model) is typed, R prints an overview of the model. In this case, BMI does not appear to explain the condition of the patient (P > 0.05), whereas age is a significant predictor (P < 0.001). The deviance of a model describes its overall fit: the larger the deviance, the poorer the fit. The null deviance is the deviance of a 'null model', which contains only a constant (the intercept) and no predictors. The residual deviance is the deviance of the model with the given predictors. Here the null deviance is 205.27 and the residual deviance is 192.49, which means that the fitted model fits the data better than the null model. To analyse further, one could build other models on the same dataset and test which of them explains the chosen outcome better, using anova() to compare the models.
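The drop in deviance can also be tested formally. The sketch below first applies a chi-squared (likelihood-ratio) test to the two deviances reported above, assuming two degrees of freedom because the model adds two predictors to the null model; this gives a P value of roughly 0.002. It then shows how anova() compares two nested models, using simulated data, since the original data are not reproduced here. The simulated values and model names are assumptions for illustration.

```r
# Part 1: likelihood-ratio test from the deviances reported in the text.
null_dev  <- 205.27
resid_dev <- 192.49
lr_stat   <- null_dev - resid_dev   # drop in deviance
df_diff   <- 2                      # assumed: two added predictors (Age, BMI)
p_value   <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(statistic = lr_stat, p = p_value))

# Part 2: comparing nested models with anova(), on simulated data.
set.seed(2)
n <- 200
sim <- data.frame(Age = rnorm(n, 55, 12), BMI = rnorm(n, 27, 4))
sim$categorical_mRs <- rbinom(n, 1, plogis(-4 + 0.06 * sim$Age))

model_age  <- glm(categorical_mRs ~ Age,       family = "binomial", data = sim)
model_both <- glm(categorical_mRs ~ Age + BMI, family = "binomial", data = sim)

# Tests whether adding BMI reduces the deviance more than expected by chance.
comparison <- anova(model_age, model_both, test = "Chisq")
print(comparison)
```

Both models must be fitted to exactly the same rows for the comparison to be valid, which is one reason to drop rows with missing values before modeling.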

Conclusion

To conclude, performing basic statistical modeling in R is a fairly simple and straightforward procedure. Analysing model fit is essential to be able to report the results.

References

1. Pyysalo M, Vesterinen T. Statistical corner: Using R to build, analyse and plot clinical neurological datasets. J Cerebrovasc Sci 2020;8:107-12.
2. Unda SR, Labagnara K, Birnbaum J, Wong M, de Silva N, Terala H, et al. Impact of hospital-acquired complications in long-term clinical outcomes after subarachnoid hemorrhage. Clin Neurol Neurosurg 2020;194:105945.
