Tuesday, June 16, 2026

Tutorial on using GLM to estimate costs associated with disease model states – Healthcare Economist






Modeling health care costs is often problematic because costs are distributed in unusual ways. Typically, there is a large mass of $0 observations (i.e., individuals who do not use any health care), and the distribution of costs among health care users is heavily right-skewed because a small share of individuals incur very high costs. This is well known to health economists, but a complicating factor for modelers is mapping disease costs to specific health states. For example, the cost of cancer treatment may vary with the stage of the disease and whether the cancer has progressed, while the cost of cardiovascular disease will vary depending on whether the patient has had a myocardial infarction (MI).

A paper by Zhou et al. (2023) provides a good tutorial on how to use generalized linear models (GLMs) to estimate the costs of disease model states. The main steps are summarized below.

Step 1: Prepare the data set

  • Data sets often require cost calculations over discrete time periods. For example, claims data record costs by date of service, but for analysis you may want one row per person, with columns listing costs by year (or month). Alternatively, you could make the unit of observation the person-year (or person-month), so that each row is a separate person-year record.
  • Next, the disease states must be defined. In each time period, each person is assigned to a disease state. Challenges include determining the granularity of states (e.g., MI only versus time since MI) and how to handle multi-state scenarios.
  • When data are censored, one can (i) add a covariate indicating that the observation is censored or (ii) exclude observations with partial data. If cost data are missing (but the patient is not otherwise censored), imputation methods can be used. The time period used in the analysis needs to match the cycle length of the decision model, with censoring handled appropriately and the data transformed if necessary.
  • The paper includes a sample data set illustrating this structure.
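The person-period layout and state assignment described in Step 1 could be sketched roughly as follows. This is not the paper's sample data: the claims, the follow-up windows, the `event_year` lookup, the three state labels, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claim-level records: (person_id, service_date, cost).
claims = [
    ("A", date(2019, 3, 14), 1200.0),
    ("A", date(2019, 11, 2), 300.0),
    ("A", date(2021, 6, 30), 4500.0),
    ("B", date(2020, 1, 9), 80.0),
]

# Hypothetical follow-up window per person: (first_year, last_year), inclusive.
follow_up = {"A": (2019, 2022), "B": (2019, 2020), "C": (2019, 2019)}

# Hypothetical year of the index event (e.g., an MI); absent means no event.
event_year = {"A": 2021}


def disease_state(year, event):
    """Assign one of three illustrative states to a person-year."""
    if event is None or year < event:
        return "no_event"
    return "event_year" if year == event else "post_event"


def to_person_years(claims, follow_up, event_year):
    """Return one row per person-year, keeping years with zero cost.

    Claims dated outside a person's follow-up window are ignored.
    """
    totals = defaultdict(float)
    for person_id, service_date, cost in claims:
        totals[(person_id, service_date.year)] += cost
    rows = []
    for person_id, (first, last) in sorted(follow_up.items()):
        for year in range(first, last + 1):
            rows.append({
                "person_id": person_id,
                "year": year,
                "cost": totals.get((person_id, year), 0.0),
                "state": disease_state(year, event_year.get(person_id)),
            })
    return rows


person_years = to_person_years(claims, follow_up, event_year)
for row in person_years:
    print(row)
```

Keeping the zero-cost person-years matters here, because those rows are exactly the mass of $0 observations that the two-part model in Step 2 is meant to handle.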

Step 2: Model selection

  • The paper proposes using a two-part model within a generalized linear model (GLM) framework because the OLS assumptions of normality and homoscedasticity of the residuals are frequently violated by cost data.
  • In a GLM, the expected value of cost is related to the covariates through a nonlinear transformation, g(E[y | x]) = xβ, where g is the link function. You need to choose both the link function and the distribution of the error term. According to the paper, the most popular combinations of link function and distribution for health care costs are linear regression (identity link with a Gaussian distribution) and gamma regression with a natural log link.
  • To combine a GLM with a two-part model, estimate the GLM using only the observations with positive costs, and then estimate a logit or probit model of the probability that an individual has positive costs.
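As a rough illustration of the two-part idea, the sketch below fits a logit model for whether costs are positive and a gamma regression with a log link on the positive costs, then multiplies the two predictions. It is not the authors' code (theirs is in R). The data are hypothetical, the function names `solve`, `fit_glm`, and `predict_two_part` are made up for this example, and the hand-rolled iteratively reweighted least squares (IRLS) routine is there only to keep the sketch dependency-free. In practice one would normally use a statistics package.

```python
import math


def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_glm(X, y, family, n_iter=25):
    """Fit a GLM by IRLS.

    family="logit": Bernoulli outcome with a logit link.
    family="gamma_log": gamma outcome with a log link (unit IRLS weights).
    """
    p = len(X[0])
    beta = [0.0] * p
    if family == "gamma_log":
        beta[0] = math.log(sum(y) / len(y))  # assumes column 0 is the intercept
    for _ in range(n_iter):
        eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
        if family == "logit":
            mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
            w = [m * (1.0 - m) for m in mu]
            z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        else:
            mu = [math.exp(e) for e in eta]
            w = [1.0] * len(y)
            z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Weighted least squares step: solve (X'WX) beta = X'Wz.
        A = [[sum(wi * r[j] * r[k] for r, wi in zip(X, w)) for k in range(p)]
             for j in range(p)]
        b = [sum(wi * r[j] * zi for r, zi, wi in zip(X, z, w)) for j in range(p)]
        beta = solve(A, b)
    return beta


def predict_two_part(x, beta_any, beta_pos):
    """Expected cost = P(cost > 0 | x) * E[cost | cost > 0, x]."""
    eta_any = sum(b * v for b, v in zip(beta_any, x))
    eta_pos = sum(b * v for b, v in zip(beta_pos, x))
    return (1.0 / (1.0 + math.exp(-eta_any))) * math.exp(eta_pos)


# Hypothetical data: columns are [intercept, in_disease_state].
X = [[1, 0]] * 4 + [[1, 1]] * 4
cost = [0.0, 0.0, 100.0, 300.0, 0.0, 500.0, 1500.0, 1000.0]

# Part 1: logit model of having any positive cost, on all observations.
beta_any = fit_glm(X, [1.0 if c > 0 else 0.0 for c in cost], "logit")

# Part 2: gamma GLM with log link, on positive-cost observations only.
positive = [(row, c) for row, c in zip(X, cost) if c > 0]
beta_pos = fit_glm([r for r, _ in positive], [c for _, c in positive], "gamma_log")

expected_without = predict_two_part([1, 0], beta_any, beta_pos)
expected_with = predict_two_part([1, 1], beta_any, beta_pos)
print(round(expected_without, 2), round(expected_with, 2))
```

Because the only covariate in this toy data is a single indicator, the two-part predictions simply reproduce the group means (about 100 and 750). That makes the sketch easy to check by hand. With continuous or multiple covariates the fitted values would no longer be simple averages.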

Step 3: Select the final model

  • Model selection must first consider which covariates to include in the regression. This can be done through stepwise selection using a prespecified significance threshold, although this may lead to overfitting. Alternative covariate selection techniques include bootstrapped stepwise selection and penalized techniques (e.g., the least absolute shrinkage and selection operator, LASSO). Interactions between covariates can also be considered.
  • Overall fit can be evaluated using mean error, mean absolute error, and root mean square error (the last of which is most commonly used). A better fitting model has smaller errors.
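The three fit measures can be computed in a few lines. The sketch below is only illustrative: the observed costs and the two sets of predictions are hypothetical, `fit_metrics` is a made-up helper name, and defining the error as predicted minus observed is an assumption (the opposite sign convention is also common and affects only the sign of the mean error).

```python
import math


def fit_metrics(observed, predicted):
    """Return mean error, mean absolute error, and root mean square error.

    Error is defined here as predicted minus observed (an assumption).
    """
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }


# Hypothetical observed costs and predictions from two candidate models.
observed = [100.0, 200.0, 300.0]
model_a = [110.0, 190.0, 330.0]
model_b = [150.0, 150.0, 400.0]

metrics_a = fit_metrics(observed, model_a)
metrics_b = fit_metrics(observed, model_b)
best = "A" if metrics_a["RMSE"] < metrics_b["RMSE"] else "B"
print(metrics_a, metrics_b, best)
```

Mean error can be close to zero even when individual predictions are poor, because positive and negative errors cancel. This is one reason it is usually read alongside MAE and RMSE rather than on its own.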

Step 4: Model prediction

  • While predicting costs is straightforward, estimating the impact of disease states on costs is more complex. The authors make the following suggestions:

For one-part nonlinear models or two-part models, marginal effects can be derived using recycled predictions. This involves two steps: (1) run two scenarios across the target population, setting the disease state of interest to be (a) present (e.g., cancer recurrence) or (b) absent (e.g., no cancer recurrence); (2) calculate the difference in mean costs between the two scenarios. The standard error of the mean difference can be estimated using bootstrapping.
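The two steps above, plus the bootstrap, might be sketched as follows. Everything here is an assumption made for illustration: the simulated data, the covariate `older`, and the function names. In particular, `fit_cell_means` is a deliberately simple stand-in for the fitted GLM or two-part model. It predicts the mean cost within each (state, older) cell, and in a real analysis its `predict` would be replaced by the fitted model's prediction.

```python
import random
import statistics
from collections import defaultdict


def fit_cell_means(rows):
    """Stand-in model (assumption): mean cost per (state, older) cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        key = (r["state"], r["older"])
        sums[key] += r["cost"]
        counts[key] += 1
    overall = sum(r["cost"] for r in rows) / len(rows)

    def predict(state, older):
        key = (state, older)
        # Fall back to the overall mean if a resample leaves a cell empty.
        return sums[key] / counts[key] if counts[key] else overall

    return predict


def recycled_difference(rows, fit=fit_cell_means):
    """Mean predicted cost with the state present minus with it absent."""
    predict = fit(rows)
    # Everyone keeps their own covariates; only the disease state is reset.
    present = [predict(1, r["older"]) for r in rows]
    absent = [predict(0, r["older"]) for r in rows]
    return statistics.fmean(present) - statistics.fmean(absent)


def bootstrap_se(rows, n_boot=500, seed=0):
    """Standard error of the difference, refitting on each resample."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))  # resample with replacement
        diffs.append(recycled_difference(sample))
    return statistics.stdev(diffs)


# Hypothetical simulated population with many zero-cost observations.
sim = random.Random(42)
rows = []
for _ in range(400):
    older = sim.random() < 0.5
    state = sim.random() < (0.4 if older else 0.2)
    if sim.random() < (0.9 if state else 0.5):  # probability of any cost
        mean = 2000.0 * (3.0 if state else 1.0) * (1.5 if older else 1.0)
        cost = sim.gammavariate(2.0, mean / 2.0)
    else:
        cost = 0.0
    rows.append({"state": int(state), "older": int(older), "cost": cost})

difference = recycled_difference(rows)
standard_error = bootstrap_se(rows)
print(round(difference), round(standard_error))
```

Refitting the model inside each bootstrap resample, rather than only re-averaging fixed predictions, lets the standard error reflect uncertainty in the estimated model as well as in the sample's covariate mix.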

The authors also provide an illustrative example that applies this approach to modeling hospital costs associated with cardiovascular events in the UK. They also provide sample R code for download.




