New Tool Predicts Lung Cancer Risk
Machine learning tool assesses lung cancer risk within five years.
Lung cancer is a major cause of death around the world. Early detection is crucial because it leads to better survival rates. This article describes a new tool that uses machine learning to estimate a person's chance of developing lung cancer within five years. The tool was trained on data from one large cancer screening trial and then validated on data from a second, independent trial.
Data Used
Datasets
The tool is based on two main datasets. The first is from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial (n=55,161), which collected comprehensive information about risk factors, clinical measurements, and outcomes related to lung cancer. The second is from the National Lung Screening Trial (NLST) (n=48,595), which focused on using low-dose computed tomography to detect lung cancer in high-risk individuals.
Risk Factors
To identify people who might be at high risk for lung cancer, the study focused on former and current smokers. Smoking is the leading cause of lung cancer because harmful substances in tobacco smoke can damage lung cells. Other risk factors include exposure to secondhand smoke, certain workplace hazards, and air pollution. Age, gender, and a family history of lung cancer also affect a person's risk.
Model Development
The machine learning model was built with XGBoost, an ensemble learning algorithm that combines gradient boosting and decision trees. The model was trained on data from the PLCO study and then tested on the NLST data. Before training, the data were cleaned to remove participants who had never smoked and those who had died of causes unrelated to lung cancer, so that the model focused only on people at higher risk. The authors also took steps to reduce bias caused by censored data, and they tuned the model's hyper-parameters and calibrated its outputs.
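The two exclusion rules described above can be sketched in a few lines of Python. This is only an illustration: the `Participant` class, its field names, and the example records are hypothetical and do not reflect the actual PLCO or NLST column names or the authors' code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    # Hypothetical fields; the real trial datasets use different names.
    smoking_status: str           # "never", "former", or "current"
    lung_cancer_within_5y: bool   # outcome label
    death_cause: Optional[str]    # None if alive at the end of follow-up


def keep_for_training(p: Participant) -> bool:
    """Return True if the participant passes both exclusion rules."""
    # Rule 1: keep only current or former smokers.
    if p.smoking_status not in ("former", "current"):
        return False
    # Rule 2: drop participants who died of causes unrelated to lung cancer.
    if p.death_cause is not None and p.death_cause != "lung_cancer":
        return False
    return True


# Made-up records, for illustration only.
cohort = [
    Participant("current", True, None),
    Participant("never", False, None),        # removed: never smoked
    Participant("former", False, "cardiac"),  # removed: unrelated death
    Participant("former", False, None),
]
filtered = [p for p in cohort if keep_for_training(p)]
print(len(filtered))
```

Filtering before training keeps the model focused on the population the tool is meant for, at the cost of making it inapplicable to people who have never smoked.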
Feature Selection
When building the model, a limited set of features was chosen to make predictions. These included the participant's age, gender, smoking history, medical diagnoses, and family history of lung cancer. The goal was to keep the model simple while still making accurate predictions.
Performance of the Model
Once trained, the model was evaluated on the NLST dataset as an external test. It reached an area under the ROC curve (ROC-AUC) of 82% on the PLCO dataset and 70% on the NLST dataset, so its ability to separate high-risk from low-risk individuals dropped on the external data. The area under the precision-recall curve (PR-AUC) was 29% and 11% respectively, and the model was well calibrated, with a Brier score of 0.044. Compared with existing screening guidelines, the model reached the same recall with higher precision, meaning fewer of the people it flagged were false alarms.
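The paper's abstract reports the 82% and 70% figures as ROC-AUC and describes calibration with a Brier score. As a minimal sketch of what those two metrics measure, the functions below compute them from scratch. The labels and predicted probabilities are invented for illustration and are not taken from the study.

```python
def roc_auc(y_true, y_score):
    """Chance that a random positive case is scored above a random
    negative case; ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome.
    Lower is better; 0 is a perfect score."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Invented example: 1 = developed lung cancer within five years.
y_true = [0, 0, 1, 0, 1, 0]
y_prob = [0.02, 0.10, 0.30, 0.05, 0.08, 0.01]

auc = roc_auc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
print(auc)  # 0.875: 7 of the 8 positive/negative pairs are ranked correctly
```

ROC-AUC measures only how well cases are ranked, while the Brier score also penalizes probabilities that are far from the observed outcomes, which is why the two are reported together.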
Web Application
An online tool was developed based on this model to allow users to estimate their own risk of developing lung cancer over the next five years. This tool consists of a simple questionnaire that users can complete, making it easy for individuals to assess their risk without needing extensive medical knowledge.
Importance of Early Detection
Lung cancer can be much more treatable when detected early. The five-year survival rate for lung cancer is significantly higher for individuals diagnosed in the early stages compared to those diagnosed later when the disease has spread. The current guidelines recommend screening for those aged 55 to 80 who have a significant smoking history. However, the new risk model provides a more personalized assessment, allowing for early detection in more individuals who might otherwise not be screened.
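To make the contrast with a personalized model concrete, a guideline of this kind can be written as a fixed rule. The sketch below is an assumption-laden illustration, not the official criteria: the age range comes from the text above, the pack-year threshold is left as a required argument because the text does not state it, and real guidelines include further conditions (such as time since quitting) that are omitted here. Pack-years follow the usual definition of packs smoked per day multiplied by years smoked, with 20 cigarettes per pack.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Packs per day (20 cigarettes per pack) times years smoked."""
    return (cigarettes_per_day / 20.0) * years_smoked


def guideline_eligible(age: int,
                       cigarettes_per_day: float,
                       years_smoked: float,
                       min_pack_years: float,
                       min_age: int = 55,
                       max_age: int = 80) -> bool:
    """Simplified age-plus-smoking rule. min_pack_years has no default
    because the threshold is an assumption the caller must supply."""
    in_age_range = min_age <= age <= max_age
    enough_exposure = pack_years(cigarettes_per_day, years_smoked) >= min_pack_years
    return in_age_range and enough_exposure


# Illustrative threshold of 30 pack-years (an assumption for this example).
print(guideline_eligible(62, 20, 35, min_pack_years=30))  # True: 35 pack-years, age in range
print(guideline_eligible(52, 20, 35, min_pack_years=30))  # False: below the age range
```

A rule like this gives a yes-or-no answer from two cutoffs, whereas a risk model produces a probability that can weigh several factors at once.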
Comparison to Current Guidelines
The model was compared to the current recommendations from the US Preventive Services Task Force (USPSTF). At the same recall as the guidelines, that is, when catching the same share of people who went on to develop lung cancer, the model had higher precision: 13.1% versus 9.3% on the PLCO dataset and 3.2% versus 3.1% on the NLST dataset. In practice, this means fewer people would be flagged unnecessarily for the same number of cancers found. The gain on the external NLST data is small, so the advantage over the guidelines is clearest on the PLCO data.
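A comparison at equal recall can be sketched as follows: measure the recall of the guideline rule, pick the model threshold that reaches the same recall, and then compare precision. The data, the helper names, and the threshold-selection procedure here are hypothetical and are not the authors' implementation.

```python
def precision_recall(y_true, flagged):
    """Precision and recall of a list of boolean screening decisions."""
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def threshold_for_recall(y_true, y_score, target_recall):
    """Highest threshold at which flagging score >= threshold
    reaches at least the target recall."""
    for t in sorted(set(y_score), reverse=True):
        flagged = [s >= t for s in y_score]
        _, recall = precision_recall(y_true, flagged)
        if recall >= target_recall:
            return t
    raise ValueError("target recall cannot be reached")


# Invented data: 1 = developed lung cancer within five years.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
rule_flag = [True, False, True, True, True, False, False, False]
model_score = [0.30, 0.05, 0.40, 0.10, 0.02, 0.03, 0.01, 0.04]

rule_precision, rule_recall = precision_recall(y_true, rule_flag)
threshold = threshold_for_recall(y_true, model_score, rule_recall)
model_flag = [s >= threshold for s in model_score]
model_precision, model_recall = precision_recall(y_true, model_flag)

print(rule_precision, rule_recall)    # 0.25 0.5
print(model_precision, model_recall)  # 0.5 0.5
```

Fixing recall first makes the comparison fair: both approaches find the same share of cancers, so any difference in precision reflects how many people are screened unnecessarily.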
Limitations
Despite its strengths, the model does have limitations. The data used for training and testing were collected only in the United States, so the findings may not apply to other populations. Additionally, the model's effectiveness might be impacted by missing data from the studies. Future research may work on improving the model, especially in terms of its applicability to diverse populations.
Future Directions
The goal is to further refine the model so that it can be effectively integrated into routine healthcare practices. The easy-to-use web tool could help in shared decision-making about lung cancer screening, promoting early detection and improving patient outcomes.
Conclusion
This lung cancer risk estimation tool offers a personalized alternative to fixed screening criteria for current and former smokers. Through a freely available web application, individuals can estimate their own five-year risk and make informed decisions about screening.
Early detection of lung cancer remains critical for improving survival rates. With continued development, in particular validation on more diverse populations, risk assessment tools of this kind could help reduce lung cancer mortality by directing screening toward the people most likely to benefit.
Title: Development and external validation of a lung cancer risk estimation tool using gradient-boosting
Abstract: Lung cancer is a significant cause of mortality worldwide, emphasizing the importance of early detection for improved survival rates. In this study, we propose a machine learning (ML) tool trained on data from the PLCO Cancer Screening Trial and validated on the NLST to estimate the likelihood of lung cancer occurrence within five years. The study utilized two datasets, the PLCO (n=55,161) and NLST (n=48,595), consisting of comprehensive information on risk factors, clinical measurements, and outcomes related to lung cancer. Data preprocessing involved removing patients who were not current or former smokers and those who had died of causes unrelated to lung cancer. Additionally, a focus was placed on mitigating bias caused by censored data. Feature selection, hyper-parameter optimization, and model calibration were performed using XGBoost, an ensemble learning algorithm that combines gradient boosting and decision trees. The ML model was trained on the pre-processed PLCO dataset and tested on the NLST dataset. The model incorporated features such as age, gender, smoking history, medical diagnoses, and family history of lung cancer. The model was well-calibrated (Brier score=0.044). ROC-AUC was 82% on the PLCO dataset and 70% on the NLST dataset. PR-AUC was 29% and 11% respectively. When compared to the USPSTF guidelines for lung cancer screening, our model provided the same recall with a precision of 13.1% vs. 9.3% on the PLCO dataset and 3.2% vs. 3.1% on the NLST dataset. The developed ML tool provides a freely available web application for estimating the likelihood of developing lung cancer within five years. By utilizing risk factors and clinical data, individuals can assess their risk and make informed decisions regarding lung cancer screening. This research contributes to the efforts in early detection and prevention strategies, aiming to reduce lung cancer-related mortality rates.
Authors: Pierre-Louis Benveniste, Julie Alberge, Lei Xing, Jean-Emmanuel Bibault
Last Update: 2023-08-23 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2308.12188
Source PDF: https://arxiv.org/pdf/2308.12188
Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.