ORIGINAL ARTICLE
Year : 2021  |  Volume : 9  |  Issue : 1  |  Page : 14-15

Statistical corner: Logistic regression using R


City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic Brain Pathology Research Group, University of Tampere, Finland

Date of Submission: 16-Jun-2021
Date of Decision: 21-Jun-2021
Date of Acceptance: 01-Jul-2021
Date of Web Publication: 27-Aug-2021

Correspondence Address:
Dr. Mikko Pyysalo
City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic Brain Pathology Research Group, University of Tampere
Finland

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/jcvs.jcvs_14_21

  Abstract 


Introduction: Logistic regression is a regression with a categorical outcome variable and predictor variables that can be either continuous or categorical.
Objectives: To demonstrate the basic workflow of logistic regression using R.
Materials and Methods: A real-world dataset was used to present an example of the basic workflow of logistic regression using R.
Results: A fitted model was obtained, including the deviance statistics used for analysing the fit of the model.
Conclusions: Performing basic statistical modeling in R is a simple and straightforward procedure. Analysing model fit is essential to be able to report the results.

Keywords: Logistic, regression, statistics


How to cite this article:
Pyysalo M. Statistical corner: Logistic regression using R. J Cerebrovasc Sci 2021;9:14-5

How to cite this URL:
Pyysalo M. Statistical corner: Logistic regression using R. J Cerebrovasc Sci [serial online] 2021 [cited 2021 Dec 1];9:14-5. Available from: http://www.jcvs.com/text.asp?2021/9/1/14/324810




  Introduction


Logistic regression is a regression with a categorical outcome variable and predictor variables that can be either continuous or categorical. In this short example tutorial, the basic workflow of logistic regression using the R statistical software is shown. As introduced earlier,[1] R is a statistical programming language widely used in the field of data science and statistics. R can be downloaded for Windows, macOS and Linux platforms from the https://www.r-project.org/ webpage. RStudio is an integrated development environment for R. The free version of RStudio can be downloaded from https://rstudio.com/. R can be used without RStudio, but using it makes many things easier, such as loading data into R.


  Methods


In this example, a dataset by Unda et al.[2] is used to show a basic workflow of logistic regression. The purpose of this short tutorial is not to explain the mathematical background behind the regression models.

First, the data should be stored in the desired location (the desktop in this case) and read into R as described in the earlier Statistical corner.[1]

library(readxl)

Dataset <- read_excel('~/Desktop/Dataset.xls')

Then, all the variable names are cleaned. Every character other than a letter or a digit is replaced with the _ symbol to avoid possible subsequent problems. '[^[:alnum:]]' is a regular expression meaning 'any character that is not alphanumeric'. If variable names contain spaces or mathematical symbols, there might be problems during the analyses.

names(Dataset) <- gsub('[^[:alnum:]]', '_', names(Dataset))
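The effect of this regular expression can be checked on a few names before applying it to the real data. The following is a minimal sketch; the names in raw_names are made up for illustration and are not taken from the dataset of Unda et al.

```r
# Hypothetical column names, as they might appear in a spreadsheet header
raw_names <- c("mRS at discharge", "BMI (kg/m2)", "Age")

# Replace every character that is not a letter or a digit with an underscore
clean_names <- gsub("[^[:alnum:]]", "_", raw_names)

print(clean_names)
```

Each space, bracket and slash becomes its own underscore, so a name with several adjacent symbols ends up with several consecutive underscores.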

Then, a simple boxplot showing a rough relationship between age and the patient's condition at discharge is created [Figure 1].
Figure 1: Example analysis. Boxplot of the age distribution of different condition-groups of the patients



boxplot(Age ~ mRS_at_dIscharge, data = Dataset)

To create a dichotomous variable, the transform() function is used. This piece of code tells R to create a new variable categorical_mRs and give it the value '0' if mRS_at_dIscharge is 1, 2 or 3. All other values of the new variable are set to '1'. So, the value of categorical_mRs is '0' in patients with good recovery and '1' when the recovery is poorer. The basic form of the ifelse() function is explained at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse.

Dataset <- transform(Dataset, categorical_mRs = ifelse(mRS_at_dIscharge == 1 | mRS_at_dIscharge == 2 | mRS_at_dIscharge == 3, 0, 1))
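It can be useful to check how this recoding treats each possible input before using it on real data. The sketch below applies the same ifelse() condition to a made-up vector of scores (mrs and poor_outcome are illustrative names, not variables from the dataset), including a score of 0 and a missing value.

```r
# Hypothetical scores covering the full 0-6 range plus one missing value
mrs <- c(0, 1, 2, 3, 4, 5, 6, NA)

# Same condition as in the tutorial: 1, 2 or 3 -> 0, everything else -> 1
poor_outcome <- ifelse(mrs == 1 | mrs == 2 | mrs == 3, 0, 1)

# Show the input next to the recoded value
print(data.frame(mrs, poor_outcome))
```

Two details follow from how ifelse() works: a score of 0 does not match the condition and is therefore coded '1', and a missing score stays missing (NA) rather than being coded '1'. Whether that is the intended behaviour depends on the data at hand.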

Then, only the variables of interest are selected and all the rows with missing values are dropped using tidyverse functions:

library(tidyverse)

Dataset <- Dataset %>%
  select(Age, BMI, mRS_at_dIscharge, categorical_mRs) %>%
  drop_na()
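Functions that drop incomplete rows return a new data frame and leave their input untouched, so the result has to be assigned for the rows to actually disappear. The sketch below illustrates this with base R's na.omit() on a made-up data frame (toy is a hypothetical name); drop_na() behaves the same way in this respect.

```r
# Hypothetical data with missing values in rows 2 and 3
toy <- data.frame(Age = c(54, 61, NA, 47),
                  BMI = c(27.1, NA, 30.4, 24.8))

# Without assignment the original data frame is unchanged
invisible(na.omit(toy))
rows_before <- nrow(toy)

# With assignment only the complete rows are kept
toy_complete <- na.omit(toy)
rows_after <- nrow(toy_complete)

print(c(before = rows_before, after = rows_after))
```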

It is not necessary to get rid of the extra variables, but with bigger datasets the view might be confusing if the whole dataset is printed on the screen. It is only the author's personal preference to work with as small a dataset as possible. Then, the logistic regression model is built using the generalized linear model command glm() from base R. Both age and body mass index (BMI) are chosen as predictor variables to see if they explain the condition of the patient at discharge:

model <- glm(categorical_mRs ~ Age + BMI, family = 'binomial', data = Dataset)
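The coefficients of such a model are on the log-odds scale, so they are often exponentiated to obtain odds ratios. Because the original dataset is not reproduced here, the sketch below fits the same kind of model to simulated data; the data frame sim, the outcome poor_outcome and the assumed effect sizes are all invented for illustration and say nothing about the real results.

```r
set.seed(1)
n <- 200

# Simulated predictors (assumed distributions, for illustration only)
sim <- data.frame(Age = round(rnorm(n, mean = 55, sd = 12)),
                  BMI = rnorm(n, mean = 27, sd = 4))

# Assumed data-generating process: age raises the odds, BMI has no effect
sim$poor_outcome <- rbinom(n, size = 1, prob = plogis(-4 + 0.06 * sim$Age))

# Same model structure as in the tutorial
sim_model <- glm(poor_outcome ~ Age + BMI, family = "binomial", data = sim)

# Exponentiated coefficients are odds ratios per one-unit increase
odds_ratios <- exp(coef(sim_model))

# Wald-type 95% confidence intervals on the odds-ratio scale
or_ci <- exp(confint.default(sim_model))

print(round(cbind(OR = odds_ratios, or_ci), 3))
```

An odds ratio above 1 for Age would mean that each additional year multiplies the odds of the outcome coded '1' by that factor, with the other predictor held constant.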


  Results


When summary(model) is typed, R shows the overall information about the model. In this case, BMI does not seem to explain the condition of the patient (P > 0.05), but age seems to be a significant predictor (P < 0.001). The deviance of a model is a statistic that describes the overall fit of the model: the larger the deviance, the poorer the fit. Null deviance is the deviance of a 'null model', which is a model that contains only a constant term (the intercept). Residual deviance is the deviance of the model with the given predictors. In this case, the null deviance is 205.27 and the residual deviance is 192.49, which means that the fitted model fits the data better than the 'null model'. Because adding predictors can never increase the deviance, the size of the drop (here 12.78) should be judged against the number of added predictors rather than taken as evidence on its own. To analyse further, one could build other models on the same dataset and test which of them explains the chosen outcome better, using anova() on the models.
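The drop in deviance can be turned into a formal test by comparing it with a chi-squared distribution. The sketch below uses only the two deviances reported above and assumes that both were computed on the same observations and that the two predictors (age and BMI) contribute two degrees of freedom.

```r
# Deviances as reported in the text
null_dev  <- 205.27
resid_dev <- 192.49

# Likelihood ratio statistic: the drop in deviance
lr_stat <- null_dev - resid_dev

# Assumption: two added predictors, hence 2 degrees of freedom
p_value <- pchisq(lr_stat, df = 2, lower.tail = FALSE)

print(c(lr_stat = lr_stat, p_value = p_value))
```

With the fitted model object available, the same comparison could be made directly, for example by creating the intercept-only model with update(model, . ~ 1) and passing both models to anova() with test = 'Chisq'.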


  Conclusion


To conclude, performing basic statistical modeling in R is a fairly simple and straightforward procedure. Analysing model fit is essential to be able to report the results.



 
  References

1. Pyysalo M, Vesterinen T. Statistical corner: Using R to build, analyse and plot clinical neurological datasets. J Cerebrovasc Sci 2020;8:107-12.
2. Unda SR, Labagnara K, Birnbaum J, Wong M, de Silva N, Terala H, et al. Impact of hospital-acquired complications in long-term clinical outcomes after subarachnoid hemorrhage. Clin Neurol Neurosurg 2020;194:105945.

