Patient data is inherently sensitive: healthcare providers are bound by ethical and legal frameworks to keep personal details, diagnoses and genetic information confidential. However, exceptional circumstances such as a pandemic demand a delicate balance between individual privacy and collective benefit.
Jun Jie Sim from A*STAR’s Institute for Infocomm Research (I2R) explained that although viral genome data from patient samples are anonymised for contact tracing in pandemic management, there’s a chance that patient identities can inadvertently be revealed.
“During the onset of a new variant, people who were classified to have the same [viral] strain can be assumed to have had close contact,” explained Sim, adding that metadata such as the geographic location of the sample may be enough to identify an individual.
Sim led a team that developed a privacy-preserving machine learning framework to help facilitate pandemic management without compromising patient confidentiality. The system, called CoVnita, was built using genomic sequence data from eight common SARS-CoV-2 strains and data-sharing simulations between multiple clinical providers.
The workflow was designed around an honest-but-curious threat model, a data security assumption in which participants follow the protocol faithfully but may still try to learn more than they should from the data they can see. “You can think of them as the kaypoh (nosy) neighbour that helps keep the corridor clean, but never fails to listen in on what’s happening in the block,” illustrated Sim.
CoVnita allowed multiple organisations to upload patient samples and jointly train the model while ensuring patient information stayed private throughout the training process. The team’s framework used three key technologies: Differentially Private Stochastic Gradient Descent (DP-SGD) and federated learning (FL) were used to train the model ‘in the clear’ (i.e., with unencrypted patient data), while homomorphic encryption (HE) was used to perform classification on encrypted data.
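As a rough illustration of how the two training-side pieces fit together, the sketch below combines per-example gradient clipping and Gaussian noise (DP-SGD) with federated averaging in plain NumPy. The logistic-regression model, the toy “hospital” datasets and every hyperparameter here are illustrative assumptions, not details drawn from CoVnita itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.1):
    """One DP-SGD step for a logistic-regression classifier:
    clip each per-example gradient, add Gaussian noise, average."""
    grads = []
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))   # predicted probability
        g = (p - yi) * xi                   # per-example gradient
        g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # clip norm
        grads.append(g)
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    return w - lr * (np.sum(grads, axis=0) + noise) / len(X)

def federated_round(w_global, sites):
    """One federated-averaging round: each site runs DP-SGD locally
    on its own data, and only model weights leave the site."""
    local_ws = [dp_sgd_step(w_global.copy(), X, y) for X, y in sites]
    return np.mean(local_ws, axis=0)

# Toy setup: two hospitals, each holding 32 samples with 4
# (hypothetical) genomic features and a binary strain label.
sites = [(rng.normal(size=(32, 4)), rng.integers(0, 2, size=32))
         for _ in range(2)]
w = np.zeros(4)
for _ in range(50):
    w = federated_round(w, sites)
```

The key property is visible in federated_round: raw samples never leave a site; only clipped, noised model updates are shared and averaged.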
“Differential privacy ensures protection against model inversion attacks, which try to sniff out information related to the inputs of the model—in this case, sequencing data,” said Sim. “In this work, we used those three technologies to ensure two things: that the original patient data stayed protected during training, thanks to DP-SGD and FL; and that new patient data was protected during classification, thanks to HE.”
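On the classification side, a minimal sketch of HE inference might look like the following, here using the open-source TenSEAL library and a hypothetical linear strain classifier; CoVnita’s actual encryption scheme, model and parameters are not reproduced here.

```python
import tenseal as ts  # pip install tenseal

# CKKS context: the key holder (e.g., a hospital) keeps the secret key.
context = ts.context(ts.SCHEME_TYPE.CKKS,
                     poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[60, 40, 40, 60])
context.global_scale = 2 ** 40
context.generate_galois_keys()  # needed for the rotations inside dot()

# Hypothetical trained linear-classifier weights for one strain class.
weights = [0.8, -1.2, 0.5, 0.3]
bias = 0.1

# Patient features are encrypted before they ever leave the hospital.
patient_features = [1.0, 0.0, 1.0, 0.0]
enc_features = ts.ckks_vector(context, patient_features)

# The classification service computes the score directly on the
# ciphertext; it never sees the plaintext features.
enc_score = enc_features.dot(weights) + bias

# Only the secret-key holder can decrypt the (approximate) result.
print(enc_score.decrypt()[0])  # roughly 1.4
```

In a real deployment the evaluating server would receive a copy of the context stripped of the secret key, so only the submitting hospital could decrypt the classification score.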
CoVnita provided quick and accurate classification of SARS-CoV-2 strains, which can improve patient triage and ease the burden on hospital infrastructure. The framework also enables secure and private data sharing for the bioinformatics analyses that are crucial for monitoring and managing pandemics.
Sim said that CoVnita demonstrates the feasibility of using privacy-preserving machine learning in real-world healthcare settings. “We now plan to extend this framework to support other models, statistical methods and other forms of medical data like images,” Sim concluded.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).