AI detecting early pancreatic cancer from CT scan
a recent study found that an artificial intelligence (AI) system was more accurate in detecting early pancreatic cancer from CT scans than trained radiologists (pancreatic cancer AI radiology LancetOnc2026 in dropbox, or https://doi.org/10.1016/S1470-2045(25)00567-4)
Details:
-- this international, paired, non-inferiority, confirmatory, observational study (PANORAMA) compared the detection of pancreatic ductal adenocarcinoma (PDAC) by trained radiologists versus an AI system developed to read/interpret contrast-enhanced CT scans
-- the AI system was trained and externally validated within an international benchmark, using a cohort of 2310 patients from four tertiary care centers in the Netherlands and the USA for training (n=2224) and tuning (n=86), and a sequestered cohort of 1130 patients from five tertiary care centers (the Netherlands, Sweden, and Norway) for testing.
-- AI system used: the top three performing algorithms were deep learning models developed by teams based at Siemens Healthineers (Princeton, NJ, USA), Cedars Sinai Medical Center (Los Angeles, CA, USA), and Guerbet Research (Paris, France). the resulting AI system, an ensemble of these top three models, was trained on thousands of images and tested on the sequestered cohort of 1130 patients (median age 67; 550 females, 580 males), of whom 406 had histologically confirmed PDAC
-- the international teams of AI developers who created the algorithms for PDAC detection on CT scans used a newly introduced, multicenter public dataset for both model training and tuning
-- then a multi-case reader study was conducted with 68 radiologists (40 centers, 12 countries; median 9·0 [IQR 6·0–14·5] years of radiologic experience; 6 radiologists had a history of practicing at one of the participating centers) on a subset of 391 patients from the testing cohort. the reference standard was established with known histopathology and at least 3 years of clinical follow-up; histopathology in this subset revealed 144 (37%) with pancreatic ductal adenocarcinoma (PDAC)
-- in the derivation study:
-- median age 67, 44% female
-- PDAC cases: 1130/3440 (32%), 95% histologically confirmed
-- negative control cases: 2337/3440 (68%); 25% histologically confirmed, with at least 3 years of clinical follow-up in 1762/2337 (75%)
-- PDAC cases: mean 30.7 mm tumor; 71% located at head of pancreas, 18% body of pancreas, 11% tail of pancreas
-- T-stage: T1 6%, T2 25%, T3 22%, T4 32%
-- clinical stage: resectable 145/1103 (13%), borderline resectable 20%, locally advanced 27%, metastasized 31%
-- reader study from contrast-enhanced CT images:
-- 391 cases from 4 centers in the Netherlands and Sweden, with random assignment of readers and cases (each reader evaluated 98 exams, so each case was read by about 17 readers)
-- readers were asked to rate the likelihood of PDAC on a scale of 0-100 and, since there is no standardized reporting system for PDAC, to also document the clinically meaningful Likert score, where 0=no abnormality, 1=very low suspicion of PDAC, 2=low suspicion, 3=equivocal, 4=high suspicion, and 5=very high suspicion. any case given a Likert score of at least 2 required the reader to provide a point coordinate for the suspected PDAC lesion
-- main outcomes: the AI system's performance vs the average radiologist's performance in the reader study, assessed by the AUROC (the area under the receiver operating characteristic curve): 0.5 means no better than guessing, 1.0 is a perfect test (sensitivity of 1.0 with a false-positive rate of zero), 0.80-0.90 is considered good discrimination, and 0.90-1.0 excellent/outstanding discrimination
-- the primary hypothesis was that the stand-alone AI was non-inferior to the radiologists; the secondary hypothesis was that the AI was superior to the radiologists
-- the study protocol and statistical plan were prespecified to test non-inferiority (with a margin of 0·05), followed by superiority of the AI system (a schematic sketch of this analysis follows this list)
-- secondary endpoints were the sensitivity, specificity, positive and negative predictive values in the comparison
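-- to make the statistical plan concrete, below is a minimal Python sketch of how an AUROC is computed and how the non-inferiority margin and superiority test would be applied; the suspicion scores here are invented (not the study's data), and the study itself used confidence intervals of the AUROC difference rather than the point estimates shown here:

import random

def auroc(pos_scores, neg_scores):
    # AUROC via the Mann-Whitney statistic: the probability that a
    # randomly chosen positive case scores higher than a randomly
    # chosen negative case (ties count half)
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

random.seed(0)
# hypothetical 0-100 suspicion scores for a 144-positive/247-negative
# case mix mirroring the reader study
ai_pos = [random.gauss(70, 15) for _ in range(144)]
ai_neg = [random.gauss(40, 15) for _ in range(247)]
rad_pos = [random.gauss(65, 18) for _ in range(144)]
rad_neg = [random.gauss(40, 18) for _ in range(247)]

auc_ai = auroc(ai_pos, ai_neg)
auc_rad = auroc(rad_pos, rad_neg)
MARGIN = 0.05  # the prespecified non-inferiority margin

print(f"AI AUROC {auc_ai:.3f} vs readers {auc_rad:.3f}")
print("non-inferior (point estimate):", auc_ai - auc_rad > -MARGIN)
print("superior (point estimate):", auc_ai - auc_rad > 0)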
Results:
-- 3440 patients with 3454 contrast-enhanced CT examinations were included in the initial study, drawn from five participating centers and two publicly available data sets.
-- A total of 432 individuals from 46 countries participated in the development of the AI system within the PANORAMA challenge, resulting in 258 submitted algorithms.
-- the AI system achieved an AUROC of 0.92 (0.90–0.93) in the sequestered testing cohort, with 85.7% (82.8–89.5) sensitivity and 83.5% (80.9–86.3) specificity at the optimal ROC threshold. more information on AI performance is provided in their appendix (pp 16, 40–41).
-- results of the reader study, evaluating images from a subset of 391 patients, 144 of whom (37%) had histologically-confirmed PDAC:
-- AUROC results (plotted in the original post):
-- radiologists: 0.88 (0.85-0.91)
-- AI system for the same cohort: AUROC 0.922 (0.89-0.94)
-- these results met the primary outcome of AI non-inferiority to the radiologists and achieved the prespecified difference indicating superiority of the AI system
-- [figure in original post: ROC curves showing the relative merits of the AI system over the radiology reader group]
-- in terms of the diagnostic accuracy using a Likert score of at least 1:
-- the pool of 68 radiologists’ average sensitivity was 96.1% (93.9–98.0) with a specificity of 44.2% (38.8–50.4)
-- the AI system reduced the number of false positives by 38% (85 false positives with AI vs 138 false positives with radiologist readers)
-- with a Likert score of at least 2:
-- the AI system missed 38% fewer cancers (13 missed cancers with AI vs 21 missed cancers with the readers, out of 144 positive cases)
-- the AI system's specificity was 73.8% (69.0–78.9), with 26% fewer false positives than the readers (48 false positives with AI vs 65 with the readers)
-- the AI system's sensitivity was 85.6% (81.0–89.7) (see the endpoint-arithmetic sketch at the end of this Results section)
-- at the optimal ROC operating point, the AI system achieved 85.4% (75.5–87.8) sensitivity and 84.6% (81.0–89.8) specificity
-- the higher the Likert threshold (ie, the more concerned the radiologists were about PDAC), the smaller the improvement of the AI system over the readers
-- of the 144 PDAC cases, none were classified as having a Likert score <2 by all of the radiologists; and of the 247 negative cases, 5 (2%) were assigned a Likert score >2 by all radiology readers and were also flagged by the AI system, making them false positives for both
-- all five of these false positive cases by the AI system had been described as highly suspicious for PDAC in the original radiology reports, but were ultimately diagnosed as other diseases after histological assessment (one was a common bile duct cancer, one a metastatic duodenal carcinoma, and one a metastatic colon carcinoma in the pancreatic head; ie, though not PDAC, they were probably quite useful findings that may have altered therapy)
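-- as referenced above, here is a minimal Python sketch of the secondary-endpoint arithmetic; the helper function and its names are mine, the base counts (144 PDAC cases, 247 negatives, and the missed-cancer and false-positive counts at a Likert threshold of at least 2) are from the reader study, and note that sensitivity/specificity computed from pooled raw counts will differ somewhat from the paper's reported reader-averaged values:

def diagnostics(tp, fp, tn, fn):
    # standard definitions of the secondary endpoints
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

POS, NEG = 144, 247  # reader-study cases with and without PDAC

# Likert >=2: 13 cancers missed by AI vs 21 by readers; 48 false
# positives with AI vs 65 with readers
ai = diagnostics(tp=POS - 13, fp=48, tn=NEG - 48, fn=13)
readers = diagnostics(tp=POS - 21, fp=65, tn=NEG - 65, fn=21)

print(f"missed cancers reduced by {(21 - 13) / 21:.0%}")   # ~38%
print(f"false positives reduced by {(65 - 48) / 65:.0%}")  # ~26%
for name, d in (("AI", ai), ("readers", readers)):
    print(name, {k: f"{v:.1%}" for k, v in d.items()})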
Commentary:
-- pancreatic ductal adenocarcinoma (PDAC) is the deadliest cancer in the world, with a median overall survival of 4 months across all disease stages and >467,000 deaths annually
-- the global incidence of PDAC is rising, leading to its being declared a medical emergency by several European clinical groups
-- the presumed reasons for these awful outcomes are that PDAC is typically asymptomatic early in its course, metastasizes rapidly, and, when diagnosed late, has limited effective interventions
-- early-stage disease with resection, on the other hand, is associated with a median survival of 32 months
-- hence this PANORAMA study, to see whether AI systems adapted to identify early irregularities in the pancreas and surrounding tissues might lead to more people being identified at the early, presymptomatic, and potentially curable (or at least less aggressive) stage
-- it should be noted that PDAC detection on contrast-enhanced CT scan is not standardized, and there is high inter-reader variability
--- the PANDA study in 2023 (https://www.nature.com/articles/s41591-023-02640-w), whose AI model was developed with 3208 patients from a single center, found an AUROC of 0.986-0.996 for lesion detection, with the AI exceeding mean trained radiologist sensitivity by 34.1% and achieving a specificity of 99.9% for PDAC detection
-- the PANORAMA study achieved an important advance in evaluating the real-world diagnostic performance of an AI system for PDAC detection based on contrast-enhanced CT
-- and there is potentially very broad application of their approach, since 40-52% of all CTs performed are contrast-enhanced CTs in the portal-venous phase
-- this study showed that the AI system outperformed radiologists by directly comparing the performance of AI to a large, diverse group of 68 radiologists from 12 countries in a multicentre setting, using the clinically meaningful Likert score thresholds. Notably, the AI system reduced false positives at matched sensitivity levels, addressing concerns about unnecessary follow-up testing and patient anxiety. Furthermore, PANORAMA distinguished itself by openly releasing its benchmark dataset, promoting transparency, reproducibility, and standardization in a field that has long lacked these important elements.
the development of AI brings up lots of issues, which are well-documented in https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/; most of the info below is from this MIT Technology Review article:
-- Energy consumption/environmental concerns:
-- in 2024, energy demand for AI data centers was calculated to be 4.4% of US electricity demand (roughly what it takes to power Thailand for a year)
-- Trump, through his Stargate initiative, plans to spend $500 billion to build up to 10 AI data centers, each requiring 5 gigawatts of power (more than the total needed to power New Hampshire); Apple plans to spend $500 billion over the next 4 years; and Google $75 billion on AI infrastructure in 2025 alone
-- and, in the setting of huge competition to build as much as possible as quickly as possible, this all translates to using much more fossil fuel, a highly carbon-intensive form of energy
-- and using lots of computer chips, reliant on rare earth elements and other critical minerals (neodymium, gallium, germanium, lanthanum...), which, parenthetically, seem to be trashed instead of reclaimed/reused later...
-- by 2028 (not so far from now), AI energy consumption is projected to be 12% of total US electricity demand (going from 165 terawatt-hours to 326 terawatt-hours/year, the amount of energy needed to power 22% of US households for a year, or to drive cars over 300 billion miles, about 1,600 round trips between the earth and the sun)
-- one estimate is that 0.3 watt-hours (1,080 joules) of energy are used for one ChatGPT message. ChatGPT is the fifth most visited website in the world, with about 1 billion messages every day and now about 78 million images/day
-- one billion messages/day translates to the annual electricity requirement of about 10,400 US homes; adding the images to the texts would add roughly another 3,300 homes' worth per year (see the back-of-envelope sketch at the end of this list)
-- and all of these estimates are based on current utilization; the future might be dramatically different as AI advances, and AI may be used without our even knowing that it is performing tasks
-- and, on top of all that, the US, increasingly relying on fossil fuels for electricity, is a major contributor to climate change, which is very likely to only get much worse. one recent paper calculated that the carbon intensity of electricity used by data centers was 48% higher than the US average (which is already pretty high). and the comments by Trump, Meta, Amazon, and Google that we will just build more nuclear power reflect solutions that are years/decades away (nuclear currently supplies only about 20% of US electricity generation)
-- and, of course, companies are now planning to build multi-gigawatt AI constructions around the world, to "spread democratic AI"
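-- as a sanity check on the ChatGPT energy arithmetic above, a back-of-envelope Python sketch; the per-message figure is from the MIT Technology Review article, while the per-household usage and household count are my rough assumptions:

WH_PER_MESSAGE = 0.3              # ~1,080 joules per ChatGPT message
MESSAGES_PER_DAY = 1_000_000_000
HOME_KWH_PER_YEAR = 10_500        # assumed average US household usage
US_HOUSEHOLDS = 131_000_000       # assumed number of US households

kwh_per_year = WH_PER_MESSAGE * MESSAGES_PER_DAY * 365 / 1000
print(f"messages: {kwh_per_year / 1e6:.0f} GWh/year "
      f"~= {kwh_per_year / HOME_KWH_PER_YEAR:,.0f} US homes")
# -> ~110 GWh/year, on the order of the ~10,400 homes cited above

TWH_2028 = 326                    # projected AI consumption by 2028
share = TWH_2028 * 1e9 / HOME_KWH_PER_YEAR / US_HOUSEHOLDS
print(f"326 TWh/year ~= {share:.0%} of US households")
# -> ~24%, in the ballpark of the 22%-of-households figure cited above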
-- my medical concerns about AI:
-- i do believe that there are some specific tasks in medicine that AI will really help, and likely increasingly so in the future:
-- radiology: this is a primary one, as found in this study. AI is no doubt better than even the best radiologists at picking up very subtle differences on a CT scan that are so small they are likely to be passed over when a radiologist is scanning the whole abdominal CT (or MRI, or ultrasound...)
-- the radiologist is less likely (i presume) to find minor problems in the image if the referring clinician directs them to assess a very different part of the body for a potential problem (ie, a radiologist asked to comment on one area of the CT, such as the kidneys, may not notice a very small aberrancy in the common bile duct)
-- and the AI system has the advantage of continual retuning: the stored database of CT scans will increase dramatically over time, especially for such a common study as an abdominal CT
-- studies of the consistency of trained radiologists arriving at the same x-ray interpretation are not so reassuring: for example, comparisons of even specialized mammography radiologists with many years of evaluating mammograms are quite concerning; one study found that the range in performance between radiologists exceeded 40% (https://pubmed.ncbi.nlm.nih.gov/8546556/, or https://www.nejm.org/doi/full/10.1056/NEJM199412013312206)
-- and this inconsistency holds pretty much across the radiology board (https://pmc.ncbi.nlm.nih.gov/articles/PMC8259547/); a small sketch of how inter-reader agreement is quantified follows below
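-- as referenced above, a small Python sketch of how inter-reader agreement is commonly quantified (Cohen's kappa, with made-up reads; this is my illustration, not a statistic from the PANORAMA paper):

def cohens_kappa(labels_a, labels_b):
    # agreement between two readers beyond what chance would predict
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pa = sum(labels_a) / n  # reader A's positive-call rate
    pb = sum(labels_b) / n  # reader B's positive-call rate
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

# hypothetical binary calls (1 = suspicious) by two readers on 10 scans
reader1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
reader2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(f"kappa = {cohens_kappa(reader1, reader2):.2f}")  # 0.40 here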
-- photo images: dermatology (especially when there are high-quality photos that can provide some 3-dimensional views), retinal scans, etc, would likely similarly benefit from AI system involvement
-- the common thread in these areas where AI systems would likely add much benefit (to me) is that they rely on a huge and ever-increasing database of images for the computers to use as their baseline, along with lots of high-speed computing to pore through those images for matches (ie, much more sophisticated pattern recognition than by us mere mortals)
-- BUT i have real concerns about the role of AI in eliciting patient concerns as well as potentially adversely modifying the clinician's approach:
-- there is no doubt that based on key words from the patients, AI can generate a very large differential diagnosis, and clinicians may well assess possibilities they had not considered
-- but is that large differential diagnosis going to lead to lots more testing, lots more false positives, lots more invasive procedures, and potentially lots more ionizing radiation exposure/morbidity/mortality, perhaps in pursuit of very unlikely items in the AI-generated differential?
-- and is the enlarged differential going to make it more likely that clinicians will focus on less likely diagnoses that undermine a holistic assessment of the patient? will clinicians lose their step-wise approach, initially focusing on the diagnoses that are clinically most likely given the aggregate of the patient's other diagnoses, social issues, and the current presenting symptom? will the clinician be swayed to just order the tests, perhaps fearing medicolegal repercussions, since AI raised the possibility of unlikely diagnoses that would typically be addressed, if needed, only after the more likely ones had been pursued?
-- and, perhaps most importantly (to me), a thoughtful clinician who knows the patient well, and perhaps their family and their community, is in a better position to interpret what a patient is saying. is this patient a symptom minimizer (ie, "macho") or a symptom maximizer (ie, "somatic")? the words patients use to describe their symptoms also vary a lot with the patient's education; with cultural issues in the broadest sense (eg, if their health advisor, perhaps their grandmother, used a certain phrase); with their emotional state (stress, depression, other psych issues); with their understanding of how the body works (patients may well say the pain is in the nerves, though actually it is in their muscles or bones); and with the medical problems of friends/family (real examples: a woman saying she has breast pain, when the issue is that her mother was just diagnosed with breast cancer; or a woman afraid to have sex with her husband because of concern that he has an infection, when he actually had bladder cancer, which she preferred not to mention though it came out with empathetic questioning). and the computer is not so able to interpret the non-verbal aspects of an interaction with a clinician, and thereby to investigate more deeply the words the patient used in describing the problem
-- my own experience working in an inner-city health center, with many patients from different cultural/linguistic backgrounds who are largely not so medically literate (many older patients without formal education), is that they may well use the word "pain" to describe a problem, which AI might well accept as is, yet the patient actually means something very different from the "pain" box they might be put in. i honestly do find this to be the case fairly often. my guess is that it is not so unusual in very different clinical settings that patients sometimes need continued empathetic exploratory questioning to get at the real issue
-- tangentially to the above study, there are interesting animal and human studies suggesting that pitavastatin and atorvastatin, both of which have significant anti-inflammatory effects, may decrease pancreatic cancer risk (chronic inflammation increases cancer risk by wreaking immunologic havoc): https://gmodestmedblogs.blogspot.com/2024/06/pitavastatin-decreases-pancreatic.html
Limitations:
-- is the AI detection actually just lead-time bias? would those PDACs picked up earlier, with a median life expectancy of about 32 months vs 4 months in those with later, symptomatic detection, simply reflect the fact that they were picked up 28 months earlier, the actual longevity of the patients being in fact quite similar (see the small numeric sketch after this list)? and perhaps those picked up at an early stage by the AI approach actually ended up with a very extensive surgery that might not have been needed if they had been detected much later
-- and those getting the early detection would be subjected to the stress and anxiety of the diagnosis and the subsequent stress and anxiety post-op of whether they would still get metastatic disease
-- there are also a few aspects of their methodology that might undercut the generalizability of the results: this was a retrospective dataset largely from European tertiary care centers. are their PDACs different, including in their detection, given the current increase in pancreatitis and the potential for other risk factors in different settings, with patients having different diet/exercise/environmental/occupational exposures...?
-- confirmation of PDAC was not consistent in the study, varying among cytology, biopsy, surgical resection, and clinical follow-up in the preliminary study, and was not based on the readings made in the reader study
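-- to make the lead-time concern concrete, a small numeric Python sketch (all numbers hypothetical, anchored to the quoted medians of 32 vs 4 months and the 28 months of lead time):

death_month = 36             # months from hypothetical disease onset
dx_symptomatic = 32          # late, symptomatic diagnosis
dx_ai = dx_symptomatic - 28  # AI detection 28 months sooner

print(f"measured survival, late dx:  {death_month - dx_symptomatic} months")  # 4
print(f"measured survival, early dx: {death_month - dx_ai} months")           # 32
# the same death date yields "survival" that differs by exactly the lead
# time; only an RCT with mortality endpoints can distinguish lead-time
# bias from a true benefit of earlier detection and resection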
so,
-- the impressive findings of the PANORAMA study suggest that AI may well have an important role in the future for detecting early and much more treatable PDACs.
-- a real RCT would help clarify this, and it would provide information as to whether the results above reflect lead-time bias or a true improvement in clinical outcomes for those with early-detected PDAC
-- a relevant extension of this study is the potential to identify other cancers through trained AI systems. perhaps lung cancer, which, similar to PDAC, is a really terrible disease in which early identification is associated with dramatically better outcomes? breast cancer? others?
-- so as noted above, this type of clinical intervention by AI, as per the current study, is likely a great leap forward diagnostically for radiology, dermatology, ophthalmology and perhaps other areas where there are very high-definition images that can be screened in much more detail by the huge array of computers used in AI, with huge and rapidly increasing databases of the images and outcomes.
-- but there are concerns about AI socially that need to be considered in its explosive expansion, including huge energy consumption; huge diversion of resources (land, and money that could be used to provide health care and meet important social needs); increasing inequities (the extremely rich getting extremely richer, vs the rest potentially losing jobs and income); the potential decrease in electricity access for the majority of people, given the inadequacies of our current electrical grid; and the currently perceived necessity of using lots more fossil fuels to get the electricity for AI (with the result of surging climate change, etc, etc)
-- and there are concerns about its overall benefits in being applied more broadly to clinical medicine
geoff
-----------------------------------
If you would like to be on the regular email list for upcoming blogs, please contact me at gmodest@bidmc.harvard.edu
to get access to all of the blogs: go to http://gmodestmedblogs.blogspot.com/ to see the blogs in reverse chronological order
or you can just click on the magnifying glass on top right, then type in a name in the search box and get all the blogs with that name in them