A treasure trove of health data helps AI predict disease
Artificial intelligence can be used to understand what is actually happening to a human, rather than what the doctor thinks.
Madhumita Murgia in London

Ziad Obermeyer, a physician and machine-learning scientist at the University of California, Berkeley, launched Nightingale Open Science last month: a treasure trove of unique medical data sets, each curated around an unsolved medical mystery that AI could help solve.
The data sets, released after the project received $2m in funding from former Google CEO Eric Schmidt, could help train computer algorithms to predict medical conditions earlier, treat them better and save lives.
The data includes 40 terabytes of medical images, such as X-rays, ECGs and pathology samples, from patients with a range of conditions, including high-risk breast cancer, sudden cardiac arrest, fractures and Covid-19. Each image is labelled with the patient's medical outcomes, such as the stage of breast cancer and whether it resulted in death, or whether a Covid patient needed a ventilator.
Obermeyer, who made the data sets free to use, worked mainly with hospitals in the US and Taiwan to build them over two years. He plans to expand this to Kenya and Lebanon in the coming months to reflect as much medical diversity as possible.
"There is nothing like it," said Obermeyer, who announced the new project in December alongside colleagues at NeurIPS, the global academic conference for artificial intelligence. "What sets this apart from anything available online is that the data sets are labelled with the 'ground truth', which means what actually happened to the patient, not just a doctor's opinion."
This means that the data sets of ECGs, for example, are labelled not with whether a cardiologist saw something suspicious, but with whether the patient eventually had a heart attack. "We can learn from actual patient outcomes, rather than repeating flawed human judgment," Obermeyer said.
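As a rough illustration of outcome-based labelling (not drawn from the project itself; the record fields and names below are hypothetical), the training target becomes what happened to the patient rather than what a clinician concluded at the time:

```python
from dataclasses import dataclass

@dataclass
class EcgRecord:
    # Hypothetical record: these fields are illustrative,
    # not Nightingale's actual schema.
    waveform_path: str        # raw ECG trace on disk
    cardiologist_flag: bool   # did a doctor flag something suspicious?
    had_heart_attack: bool    # what eventually happened to the patient

def ground_truth_label(record: EcgRecord) -> int:
    # The training target is the actual outcome, not the
    # clinician's judgement at the time of the scan.
    return int(record.had_heart_attack)

# A scan the cardiologist cleared, from a patient who later
# had a heart attack, is still labelled 1.
print(ground_truth_label(EcgRecord("trace_001.npy", False, True)))  # 1
```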
In the past year, the AI community has seen an industry-wide shift from collecting "big data", meaning as much data as possible, towards meaningful data: information that is more curated and relevant to a specific problem, whether that problem is entrenched human bias in healthcare, image recognition or natural language processing.
To date, many healthcare algorithms have been shown to amplify existing health disparities. For example, Obermeyer found that an AI system used by hospitals to allocate extra medical support to chronically ill patients, applied to as many as 70 million Americans, was prioritising the healthiest white patients over the sickest Black patients who needed the help. The system ranked patients' risk using data that included each person's total healthcare costs for the year, in effect treating healthcare costs as a proxy for healthcare needs.
The crux of the problem, reflected in the model's underlying data, is that not everyone generates healthcare costs in the same way. Minorities and other disadvantaged populations may lack access to healthcare and resources, may be less able to take time off work for doctor's visits, or may suffer discrimination within the system and receive fewer treatments or tests. Any of these can leave them classified in the data as less costly, which does not necessarily mean they are less sick.
The researchers estimated that roughly 47 per cent of Black patients should have been referred for additional care, but the algorithm's bias meant that only 17 per cent were.
"Your care costs will be lower even though your needs are the same," Obermayer said. "That was the basis of the bias we found." He found that many other similar AI systems use cost as a proxy, a decision estimated to affect the lives of some 200 million patients.
Unlike widely used data sets in computer vision, such as the ImageNet database, which was built from images scraped from the internet that do not necessarily reflect real-world diversity, the wealth of new data sets includes information that is more representative of the population. That promises not only wider applicability and greater accuracy for the resulting algorithms, but also an expansion of our scientific knowledge.
Schmidt, whose foundation funded the Nightingale Open Science project, said this new, diverse, high-quality data could be used to root out entrenched biases in healthcare systems, such as discrimination against women and minorities. "Artificial intelligence can be used to understand what is actually happening to a human, rather than what the doctor thinks," he added.
The Nightingale data sets were among dozens proposed this year at the NeurIPS conference.
Other projects included a speech data set of Mandarin and eight of its sub-dialects, recorded by 27,000 speakers in 34 cities in China; the largest audio data set of Covid respiratory sounds, such as breathing, coughing and voice recordings, from more than 36,000 participants, to help diagnose the disease; and a data set of satellite images covering the whole of South Africa from 2006 to 2017, divided and labelled by neighbourhood, to study the social effects of spatial apartheid.
New types of data could also help in studying the spread of disease across diverse locations, since people from different cultures respond differently to illness, said Elaine Nsoesie, a computational epidemiologist at Boston University School of Public Health.
Her grandmother in Cameroon, for example, might think about health differently from the way Americans do, she said. "If someone has a flu-like illness in Cameroon, they may be looking for herbal remedies or traditional home remedies, compared with different medicines or home remedies in the US."
Computer scientists Serena Yeung and Joaquin Vanschoren, who proposed that research on building new data sets should be shared at the NeurIPS conference, noted that the vast majority of the AI community still struggles to find good data sets on which to evaluate their algorithms. This means AI researchers are still turning to data that may be "full of bias", they said. "There are no good models without good data."