Generation of machine-learning derived cancer vulnerability indicator to determine the spatial burden of cancer outcomes.
Due to the difficulty of obtaining population-based individual-level data, ecological studies are often used to explore factors related to geographic variations in health outcomes. This study proposes a novel framework to identify area-level predictors of spatial variations in lung cancer outcomes and generate a lung cancer vulnerability index (LcVI) based on these predictors.
Data on 11,313 persons diagnosed with invasive lung cancer in Queensland, Australia (2016-2019) were sourced from the population-based Queensland Cancer Register. Bayesian spatial models estimated smoothed standardised incidence ratios (SIRs) for 519 geographic areas. Area-level variables (n = 911) were extracted from multiple data collections. Random forest models were fitted to identify important predictors for lung cancer incidence rates. A novel non-parametric dimensionality reduction approach incorporating the final random forest model results was developed to generate the LcVI which ranged from 0-10.
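As a rough illustration of the quantity being modelled, the sketch below computes raw (unsmoothed) standardised incidence ratios as observed cases divided by expected cases. It is not the Bayesian spatial model used in the study: it applies no smoothing, and the expected counts use a single crude reference rate rather than age-sex-specific rates. The function names and the three-area counts are hypothetical and chosen only to make the arithmetic visible.

```python
def expected_counts(populations, reference_rate):
    """Expected cases per area if every area had the reference rate.

    Assumption: one crude rate for the whole population; a real analysis
    would usually standardise by age and sex.
    """
    return [p * reference_rate for p in populations]


def raw_sir(observed, expected):
    """Raw standardised incidence ratio per area: observed / expected."""
    return [o / e for o, e in zip(observed, expected)]


# Hypothetical counts for three areas (not data from the study).
observed = [12, 30, 18]
populations = [10_000, 20_000, 30_000]

# Reference rate: all observed cases over the total population.
reference_rate = sum(observed) / sum(populations)

expected = expected_counts(populations, reference_rate)
sir = raw_sir(observed, expected)

for area, value in enumerate(sir, start=1):
    print(f"area {area}: SIR = {value:.2f}")
```

An SIR above 1 indicates more cases than expected under the reference rate, and an SIR below 1 indicates fewer. Raw ratios of this kind are unstable in areas with small populations, which is one common motivation for smoothing them with a spatial model.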
Eight variables were identified as predictors of lung cancer incidence, with the top two being the prevalence of diabetes and adequate fruit intake. Areas with incidence rates below the Queensland average had significantly lower LcVI than those with average incidence rates (mean difference = 2.80, 95% CI: 2.34-3.25, p < 0.001), while areas with above-average incidence rates had significantly higher LcVI than those with average incidence rates (mean difference = 2.70, 95% CI: 2.20-3.19, p < 0.001). The LcVI was strongly associated with the continuous SIR, explaining 57% of the variation (R² = 0.57, p < 0.001).
This novel approach identified a small number of important predictors of lung cancer incidence from a high-dimensional dataset. The lung cancer vulnerability index partially explained the geographic variations, potentially offering insights into underlying drivers. As this is an ecological analysis, these associations reflect relationships at the population level. Future research incorporating individual-level data is needed to confirm whether the area-level associations observed here hold true for individuals.