Nontargeted liquid chromatography–mass spectrometry (LC-MS) metabolomics datasets contain a wealth of information but present many challenges during analysis and processing. Often, two or more independently processed datasets must be aligned to form a complete dataset, yet existing software does not fully meet our needs. To address this, we created an open-source Python package called Eclipse. Eclipse uses a novel graph-based approach to handle the complex matching scenarios that arise when n > 2 datasets are aligned.
Nontargeted liquid chromatography–mass spectrometry (LC-MS) is a powerful method for assaying the metabolic state of biological samples (Clish 2015). In a typical data-processing workflow, feature-extraction software converts raw instrument files into tabular datasets by identifying and integrating thousands of features, each reported with its chromatographic retention time (RT) and mass-to-charge ratio (m/z) (Smith et al. 2006, Pluskal et al. 2010). Although many features receive chemical labels (annotations), a large fraction remain unannotated. This unannotated space contains biologically meaningful features (Chen et al. 2022, Tahir et al. 2022, Tavane et al. 2022), yet it poses challenges when separately acquired and processed datasets must be stitched together, i.e. aligned (Smith et al. 2015). These challenges are compounded when n > 2 datasets are introduced, producing complex matches that cannot be fully represented in tabular form (Supplementary Fig. S1). Although solutions exist for aligning datasets based on feature descriptors (Brunius et al. 2016; Koch et al. 2016; Mak et al. 2020; Habra et al. 2021, 2024; Climaco Pinto et al. 2022), none fully satisfied our requirements: the tool must perform robustly on default settings, must not produce multiple matches, must be written in Python, and must align n > 2 datasets with results that do not depend on dataset order.
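To make the data model concrete, the sketch below builds a toy feature table of the kind produced by feature-extraction software. The column names and values are illustrative only, not a required schema.

import pandas as pd

# Toy feature table: one row per extracted feature, carrying its descriptors,
# chromatographic retention time (RT, minutes), mass-to-charge ratio (m/z),
# and integrated intensity. Names and values here are illustrative.
features = pd.DataFrame(
    {
        "Feature_ID": ["F001", "F002", "F003"],
        "RT": [1.02, 4.87, 9.33],
        "MZ": [118.0865, 304.2402, 524.3711],
        "Intensity": [1.8e6, 3.2e5, 7.4e4],
        "Annotation": ["betaine", None, None],  # many features stay unannotated
    }
)
print(features)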
Figure 1. Overview of the Eclipse algorithm. (a) High-level overview of an example Eclipse alignment with three datasets. (b) Scaler generation during the DS1→DS2 subalignment, one of the six subalignments to be run. The datasets are reduced (s1, s2) and survey matching is performed. Scalers are generated from the residuals of each descriptor (RT, m/z, intensity) and are subtracted out, revealing the remaining standard error (RSE). (c) Match table generation for the DS1→DS2 subalignment. DS1 is scaled (1→Sc1), and DS2 is queried for each feature. DS2 features falling within ±6 RSE on all descriptors are ranked, and the best match is recorded in the DS1→DS2 match table. (d) Aggregation and reporting of the alignment results. After all subalignments have been run, they are collected into a directed graph. The graph is condensed and clustered to generate the results tables.
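The directed-graph aggregation described in panel (d) can be illustrated with a toy sketch. The code below is not the Eclipse implementation: it assumes networkx as a stand-in graph library and uses made-up feature IDs, simply to show how pairwise best matches from directed subalignments can be collected into one graph and grouped into clusters that become rows of a results table.

import networkx as nx

# Best matches recorded by directed subalignments ("dataset:feature" nodes).
# Three datasets yield six directed subalignments (DS1→DS2, DS2→DS1, DS1→DS3, ...).
matches = [
    ("DS1:F001", "DS2:F010"), ("DS2:F010", "DS1:F001"),  # reciprocal matches
    ("DS1:F001", "DS3:F220"), ("DS3:F220", "DS1:F001"),
    ("DS2:F010", "DS3:F220"), ("DS3:F220", "DS2:F010"),
    ("DS1:F002", "DS2:F011"),  # a one-directional match
]
g = nx.DiGraph(matches)

# Toy clustering rule: each weakly connected component becomes one row of the
# results table. Eclipse's actual condensing and clustering are more involved.
for component in nx.weakly_connected_components(g):
    print(sorted(component))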
https://github.com/broadinstitute/bmxp
https://github.com/broadinstitute/bmxp/blob/main/tests/test_blueshift.py
import bmxp
from bmxp.eclipse import MSAligner
from bmxp.blueshift import DriftCorrection
from bmxp.gravity import cluster

# Override the default schema so bmxp recognizes our column headers:
# the feature-metadata field 'Compound_ID' is renamed 'Feature_ID', and
# the injection-metadata field 'injection_id' is renamed 'Filename'.
bmxp.FMDATA['Compound_ID'] = 'Feature_ID'
bmxp.IMDATA['injection_id'] = 'Filename'
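A minimal alignment sketch follows for context. It is hypothetical: the constructor arguments and the `align` and `to_csv` methods are assumptions based on the repository's documentation rather than this text, and the CSV file names are placeholders; consult the repository linked above for the authoritative API.

from bmxp.eclipse import MSAligner

# Hypothetical usage sketch: align three independently processed datasets.
# File names are placeholders; method names are assumed from the repository
# documentation, not stated in this text.
a = MSAligner("DS1.csv", "DS2.csv", "DS3.csv", names=["DS1", "DS2", "DS3"])
a.align()   # run the directed subalignments and aggregate the match graph
a.to_csv()  # write the clustered results tables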
# pylint: disable=redefined-outer-name, missing-function-docstring, consider-using-with
"""
Tests for blueshift
"""
import pickle
from pathlib import Path
import pytest
import pandas as pd
import numpy as np
from bmxp import blueshift as b
@pytest.fixture()
def path_dc_input_1():
return Path(__file__).parent / "DCinput1.csv"
@pytest.fixture()
def path_sample_info_1():
return Path(__file__).parent / "DCinfo1.csv"
@pytest.fixture()
def path_dc_input_2():
return Path(__file__).parent / "DCinput2.csv"
@pytest.fixture()
def path_sample_info_2():
return Path(__file__).parent / "DCinfo2.csv"
@pytest.fixture()
def df_dc_input_1(path_dc_input_1):
return pd.read_csv(path_dc_input_1)
@pytest.fixture()
def df_sample_info_1(path_sample_info_1):
return pd.read_csv(path_sample_info_1)
@pytest.fixture()
def df_dc_input_2(path_dc_input_2):
return pd.read_csv(path_dc_input_2)
@pytest.fixture()
def df_sample_info_2(path_sample_info_2):
return pd.read_csv(path_sample_info_2)
@pytest.fixture()
def pickled_results():
return pd.read_pickle(Path(__file__).parent / "blueshift.pickle")
def test_data_validation(df_dc_input_1, df_sample_info_1):
# missing required column in injection information
info = df_sample_info_1.drop("injection_order", axis=1)
with pytest.raises(ValueError) as e:
b.DriftCorrection(df_dc_input_1, info)
assert "injection_order" in str(e.value)
# missing injection in data input
data = df_dc_input_1.drop("B0005_COL_ExampleProject_CN-M36058078", axis=1)
with pytest.raises(ValueError) as e:
b.DriftCorrection(data, df_sample_info_1)
assert "data sheet: B0005_COL_ExampleProject_CN-M36058078" in str(e.value)
# no error when missing "not_used" injection in data input
data = df_dc_input_1.drop("B0008_COL_ExampleProject_CN-M59244903", axis=1)
b.DriftCorrection(data, df_sample_info_1)
# duplicate injection order
info = df_sample_info_1.copy()
info.loc[14, "injection_order"] = info.loc[15, "injection_order"]
with pytest.raises(ValueError) as e:
b.DriftCorrection(df_dc_input_1, info)
assert "duplicate values" in str(e.value)
# duplicate injection id
info = df_sample_info_1.copy()
info.loc[14, "injection_id"] = info.loc[15, "injection_id"]
with pytest.raises(ValueError) as e:
b.DriftCorrection(df_dc_input_1, info)
assert "duplicate injection_ids" in str(e.value)
# out-of-order injection order
info = df_sample_info_1.copy()
info.loc[14, "injection_order"] = 700
with pytest.raises(ValueError) as e:
b.DriftCorrection(df_dc_input_1, info)
assert "must be sorted" in str(e.value)
# invalid label in batches column
info = df_sample_info_1.copy()
info.loc[13, "batches"] = "batch nd"
with pytest.raises(ValueError) as e:
b.DriftCorrection(df_dc_input_1, info)
assert "invalid label" in str(e.value)
# non-numeric character in data
data = df_dc_input_1.copy()
data.iloc[5, 5] = "f"
with pytest.raises(TypeError) as e:
b.DriftCorrection(data, df_sample_info_1)
assert "non-numeric" in str(e.value)
# data and sample are not in same order
data = df_dc_input_1.copy()
col_list = list(data.columns)
col_list = col_list[:10] + col_list[11:] + col_list[10:11]
data = data.loc[:, col_list]
with pytest.raises(ValueError) as e:
b.DriftCorrection(data, df_sample_info_1)
assert "usable samples" in str(e.value)
def test_batch_start_end(df_sample_info_1):
# batch_end shifts up to nearest valid injection
info = df_sample_info_1.copy()
info.loc[16, ["batches", "QCRole"]] = ["batch_end", "NA"]
info[["batches", "QCRole"]] = info[["batches", "QCRole"]].fillna("")
batches = b.find_batch_start_end(info)
assert batches.loc[15] == "batch_end" and batches.loc[16] == ""
info = df_sample_info_1.copy()
info.loc[7, "batches"] = "batch_end"
info.loc[:7, "QCRole"] = "NA"
info[["batches", "QCRole"]] = info[["batches", "QCRole"]].fillna("")
with pytest.raises(ValueError) as e:
b.find_batch_start_end(info)
assert "Cannot move " in str(e.value)
def test_batch_generation(
df_dc_input_1,
df_sample_info_1,
path_dc_input_2,
path_sample_info_2,
pickled_results,
):
a = b.DriftCorrection(df_dc_input_1, df_sample_info_1)
for batch, ref_batch in zip(a.batches["default"], pickled_results["default1"]):
assert (batch.values == ref_batch.values).all()
for batch, ref_batch in zip(a.batches["override"], pickled_results["override1"]):
assert (batch.values == ref_batch.values).all()
a = b.DriftCorrection(path_dc_input_2, path_sample_info_2)
for batch, ref_batch in zip(a.batches["default"], pickled_results["default2"]):
assert (batch.values == ref_batch.values).all()
for batch, ref_batch in zip(a.batches["override"], pickled_results["override2"]):
assert (batch.values == ref_batch.values).all()
def test_internal_standard_correction(
path_dc_input_1,
df_dc_input_1,
path_sample_info_1,
df_dc_input_2,
df_sample_info_2,
pickled_results,
):
# one internal standard
a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
a.internal_standard_correct("Internal Standard 1")
assert np.isclose(
a.data.round().fillna(0),
pickled_results["DCinput1_IS_InternalStandard1"].round().loc[:, a.data.columns],
equal_nan=True,
).all()
# one internal standard with nonquant duplicate
nonquant_df = df_dc_input_1.copy()
nonquant_df.loc[4, "Metabolite"] = "Internal Standard 1"
nonquant_df.loc[4, "Non_Quant"] = True
a = b.DriftCorrection(nonquant_df, path_sample_info_1)
a.internal_standard_correct("Internal Standard 1")
assert np.isclose(
a.data.round().fillna(0),
pickled_results["DCinput1_IS_InternalStandard1"].round().loc[:, a.data.columns],
equal_nan=True,
).all()
# nonquant "missing" internal standard
nonquant_df = df_dc_input_1.copy()
nonquant_df.loc[0, "Non_Quant"] = True
a = b.DriftCorrection(nonquant_df, path_sample_info_1)
with pytest.raises(ValueError) as e:
a.internal_standard_correct("Internal Standard 1")
assert "not found in" in str(e.value)
# two internal standards
a = b.DriftCorrection(df_dc_input_2, df_sample_info_2)
a.internal_standard_correct(["15R-15-methyl-PGA2", "15R-15-methyl-PGF2a"])
assert np.isclose(
a.data.round(),
pickled_results["DCinput2_IS_PGA2_PGF2a"].loc[:, a.data.columns],
equal_nan=True,
).all()
# missing IS value
data = df_dc_input_2.copy()
data.iloc[14, 50] = 0
a = b.DriftCorrection(data, df_sample_info_2)
with pytest.raises(ValueError) as e:
a.internal_standard_correct("15S-15-methyl-PGD2")
assert "missing values" in str(e.value)
# wrong IS name
with pytest.raises(ValueError) as e:
a.internal_standard_correct("not_a_real_metabolite")
assert "not found in" in str(e.value)
def test_pool_correction(
path_dc_input_1,
path_sample_info_1,
path_dc_input_2,
path_sample_info_2,
pickled_results,
):
# linear with override
a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
a.pool_correct(
interpolation="linear", pool="PREFA", override=True, max_missing_percent=100
)
assert np.isclose(
a.data.apply(np.floor),
pickled_results["DCinput1_linear_PREFA_override"].loc[:, a.data.columns],
equal_nan=True,
).all()
# linear without override
a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
a.pool_correct(
interpolation="linear", pool="PREFA", override=False, max_missing_percent=100
)
assert np.isclose(
a.data.apply(np.floor),
pickled_results["DCinput1_linear_PREFA"].loc[:, a.data.columns],
equal_nan=True,
).all()
# internal standard + NN
a = b.DriftCorrection(path_dc_input_2, path_sample_info_2)
a.internal_standard_correct("15R-15-methyl-PGA2")
a.pool_correct(interpolation="NN", pool="PREFB", max_missing_percent=100)
assert np.isclose(
a.data.round(),
pickled_results["DCinput2_IS_PGA2_NN_PREFB"].loc[:, a.data.columns],
equal_nan=True,
).all()
# linear with max_missing_percent=30
a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
a.pool_correct(
interpolation="linear", pool="PREFA", override=True, max_missing_percent=30
)
assert np.isclose(
a.data.apply(np.floor),
pickled_results["DCinput1_linear_PREFA_override_maxmissing30"]
.loc[:, a.data.columns]
.apply(np.floor),
equal_nan=True,
).all()
def test_cv_calculation(
path_dc_input_2,
path_sample_info_2,
pickled_results,
):
# CV calculation
a = b.DriftCorrection(path_dc_input_2, path_sample_info_2)
a.pool_correct(interpolation="linear", pool="PREFA", max_missing_percent=100)
a.calculate_cvs()
res = a.cvs.loc[:, ["CV" in col for col in a.cvs.columns]]
assert np.isclose(
res.fillna(0),
pickled_results["DCinput2_linear_PREFA_CVs"].loc[:, res.columns].fillna(0),
).all()
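Taken together, these tests document the public DriftCorrection workflow. The sketch below strings the tested calls into a single pass; the file names are placeholders, and the keyword values mirror those exercised by the tests.

from bmxp.blueshift import DriftCorrection

# Placeholders for inputs laid out like the test fixtures above.
dc = DriftCorrection("feature_data.csv", "injection_info.csv")

# Optionally normalize to one or more internal standards first.
dc.internal_standard_correct("Internal Standard 1")

# Pool-based drift correction; the tests exercise interpolation="linear"
# and interpolation="NN", with pool labels such as "PREFA" and "PREFB".
dc.pool_correct(interpolation="linear", pool="PREFA", max_missing_percent=30)

# Coefficients of variation for the corrected data.
dc.calculate_cvs()
print(dc.cvs.head())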