Diffusion Deconvolution of DIA-MS/MS Data

Abstract

As biological analysis machines and methodologies become more sophisticated and capable of handling more complex samples, the data they output also become more complicated to analyze. Modern generative machine learning techniques such as diffusion and score-based modeling have been used with great success in the domains of image, video, text, and audio data.

We aim to apply the same principles to highly multiplexed biological data signals and leverage the ability of generative models to learn the underlying distribution of the data, instead of just the boundaries using discriminative methods. We hope to apply diffusion models to signal denoising, specifically the deconvolution of highly multiplexed DIA-MS/MS data.

DIA-MS/MS features two types of data: MS1 and MS2. In MS1 data, information such as mass-to-charge ratio and chromatography elution time are recorded for entire peptides as they are analyzed. In MS2 data, the same information is recorded for the set MS2 peptide fragments belonging to the MS1 peptides onto the same data map. This means that although the data between MS1 and MS2 are correlated, the MS2 data can be highly multiplexed with signals from multiple MS1 peptides showing up.

Our project aims to train a diffusion model and condition it on MS1 data to deconvolute the corresponding MS2 signal, effectively simulating the case where the MS1 scan captured fewer peptides in its analysis window, producing cleaner MS2 data. This would be extremely useful for downstream analysis, identification, and quantification tasks.

We currently have access to a set of clean MS2 data which we plan to use to generate synthetic multiplexed MS2 data, and to use the corresponding clean MS1 data as a conditioning factor to re-extract the clean MS2. This should be an effective proof of concept for diffusion-based denoising of biological signal data.