Implementing a LASSO with healthcare data
Published:
Link to the GitHub repository with illustrative code
Note, the following details and code are simplified for illustrative purposes
I use a LASSO applied to healthcare data with hundreds of independent variables. The outcome of interest (dependent variable) was hospital out-of-network billing prevalence. I create a figure which shows conditional correlates of variables selected from the LASSO. Data are from the AHA, Equality of Opportunity Project, SK&A, and Health Leader Interstudy.
For more details please see the following paper: Surprise! Out-of-Network Billing for Emergency Care in the United States
Technical Overview
A lasso (least absolute shrinkage and selection operator) can be used for variable selection in a dataset with hundreds to thousands of possible indpendent variables. A relatively straightforward guide to the mathematics behind the technique is summarized in a post from Rob Tibshirani, who is credited with the methodology.
In short, the LASSO minimizes the sum of squared errors (as in OLS) subject to a constraint such that the sum of the absolute value of the coefficients is less than some parameter, t. This selects a subset of variables most correlated with the outcome of interest.
Glmnet in R
Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The package allows the user to use cross-validation, a technique for prediction that assigns a training dataset (outcome is known) and a testing dataset (outcome is unkown). Cross-validation tests the model’s prediction ability. Glmnet obtains a penalizing parameter to limit the number of independent variables included in the model.
#matrix of dependent
y<-as.matrix(mydata[c("oon_rate")])
fit <- cv.glmnet(x,y)
sdlamb=data.frame(fit$lambda.1se)
#Run glmnet with chosen lambda
cvfit <- glmnet(x,y,lambda=sdlamb)
#Obtain coefficients from cvfit
coef(cvfit)
output<-as.matrix(coef(cvfit))
Cleaning and preparing data
Prior to running the LASSO, Stata was used to clean the existing data. Stata was used to create the conditional correlates plot seen above.
Continuous variables are scaled so that they have a mean zero and SD of one.
unab varlist: _all
unab exclude: emcare teamhealth year_start inout_ratio aha_mapp* aha_c_np aha_c_g eop_intersects_msa aha_techtotal_q* eop_gini99_q* hli_inshare_q*
local newlist: list varlist - exclude
*Standardize all other variables
foreach var of local newlist {
egen `var'_temp=std(`var')
drop `var'
rename `var'_temp `var'
}
Measures for technology, insurer coverage, income, and gini coefficients were put into quintiles.
local qlist hli_inshare aha_techtotal eop_gini99 eop_hhinc00
foreach f of local qlist {
xtile pct=`f', n(5)
tab pct, gen(`f'_q)
drop `f'
drop pct
}
Coefficient Plot in Stata
The coefficient plot is then relatively simple using Stata’s syntax. The local “reglist” denotes variables selected from the LASSO.
reg inout_ratio `reglist' i.year_start
estimates store s1
graph set window fontface "Times New Roman"
*Plot coefficients from the regression
coefplot (s1, mcolor(`red') msymbol(circle) msize(small)), ///
mlabel format(%9.2g) mlabsize(vsmall) mlabcolor(`red') mlabposition(12) mlabgap(*.5) ///
headings(emcare="{bf:Physician Group Indicator}" aha_syshhi_15m="{bf:Market Characteristics}" aha_c_np="{bf: Hospital Characteristics}" cen_countypop3="{bf:Local Area Characteristics}" ) ///
xlabel(-20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50, labsize(small)) xtitle("Out-of-Network Rate (%)") ///
ciopts(recast(rcap) lwidth(thin) lcolor(`blue')) drop(_cons *year_start*) xline(0, lcolor(black) lstyle(makes_thin)) coeflabels(,labsize(vsmall))