[Tool] Eclipse: a Python package for alignment of two or more nontargeted LC-MS metabolomics datasets


Table of Contents

    • Introduction
    • Code
    • References

Introduction


Nontargeted liquid chromatography–mass spectrometry (LC-MS) metabolomics datasets contain a wealth of information but present many challenges during analysis and processing. Often, two or more independently processed datasets must be aligned to form a complete dataset, but existing software did not fully meet our needs. To address this, we created an open-source Python package called Eclipse. Eclipse uses a novel graph-based approach to handle the complex matching scenarios that arise when aligning n > 2 datasets.
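To illustrate why a graph is a natural fit, here is a tiny sketch using networkx (an assumption for illustration only; the source does not say Eclipse uses networkx, and all feature IDs are invented). Each directed edge records the best match found by one pairwise subalignment, and connected groups of nodes recover one feature observed across datasets:

import networkx as nx

# Illustrative only: a tiny directed match graph for three datasets.
g = nx.DiGraph()
g.add_edges_from([
    ("DS1:F001", "DS2:A_17"),  # DS1->DS2 subalignment matched F001 to A_17
    ("DS2:A_17", "DS1:F001"),  # the reverse subalignment agrees
    ("DS1:F001", "DS3:X_3"),   # DS1->DS3 subalignment
    ("DS3:X_3", "DS2:A_17"),   # DS3->DS2 subalignment
])

# Grouping connected nodes recovers one cross-dataset feature, a relationship
# that is awkward to represent in a single flat table.
for group in nx.weakly_connected_components(g):
    print(sorted(group))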

Nontargeted liquid chromatography–mass spectrometry (LC-MS) is a powerful method for surveying the metabolic state of biological samples (Clish 2015). In a typical processing workflow, feature-extraction software converts raw instrument files into tabular datasets by identifying and integrating thousands of features, each reported with its chromatographic retention time (RT) and mass-to-charge ratio (m/z) (Smith et al. 2006, Pluskal et al. 2010). While many features receive chemical labels (annotations), a large fraction remain unannotated. This unannotated space contains biologically relevant features (Chen et al. 2022, Tahir et al. 2022, Tavane et al. 2022), but it poses challenges when attempting to stitch together (i.e., align) separately acquired and processed datasets (Smith et al. 2015). These challenges are compounded when n > 2 datasets are introduced, producing complex matches that cannot be fully represented in tabular form (Supplementary Fig. S1). Although solutions exist for aligning datasets on feature descriptors (Brunius et al. 2016; Koch et al. 2016; Mak et al. 2020; Habra et al. 2021, 2024; Climaco Pinto et al. 2022), none satisfied all of our requirements: the tool must run robustly with default settings, must not produce multiple matches, must be written in Python, and must align n > 2 datasets with results that are independent of dataset order.
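To make the alignment problem concrete, here is a minimal sketch of the kind of tabular feature datasets to be aligned. All IDs and values are invented, and the column names (Compound_ID, RT, MZ, Intensity) are assumptions based on the default bmxp schema labels remapped in the code snippet further below:

import pandas as pd

# Two hypothetical, independently processed feature tables.
ds1 = pd.DataFrame({
    "Compound_ID": ["F001", "F002", "F003"],
    "RT": [1.02, 3.55, 7.80],               # retention time (min)
    "MZ": [118.0865, 204.1230, 482.3610],   # mass-to-charge ratio
    "Intensity": [1.2e6, 8.4e5, 3.1e4],
})
ds2 = pd.DataFrame({
    "Compound_ID": ["A_17", "A_52", "A_90"],
    "RT": [1.08, 3.61, 7.95],               # systematically shifted RTs
    "MZ": [118.0867, 204.1228, 482.3605],
    "Intensity": [2.3e6, 1.6e6, 6.0e4],
})

# Alignment must decide which rows of ds1 and ds2 describe the same feature,
# despite systematic drift in RT and smaller differences in m/z and intensity.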

Code


Figure 1. Overview of the Eclipse algorithm. (a) High-level overview of an example Eclipse alignment with three datasets. (b) Generation of scalers during the DS1→DS2 subalignment, one of six subalignments that will be run. The datasets are simplified (s1, s2), and a survey matching is performed. Scalers are generated from the residuals of each descriptor (RT, m/z, intensity), then subtracted to reveal the residual standard error (RSE). (c) Match-table generation for the DS1→DS2 subalignment. DS1 is scaled (1→Sc1), and DS2 is queried for each feature. DS2 features falling within ±6 RSE on all descriptors are ranked, and the best match is recorded in the DS1→DS2 match table. (d) Aggregation and reporting of the alignment results. After all subalignments have run, they are collected into a directed graph. The graph is compressed and clustered to produce the results tables.
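As a rough illustration of panel (c), a best-match query might look like the sketch below. This is not Eclipse's implementation: the RSE values and the k = 6 window are hypothetical, and the datasets are assumed to be already scaled.

import pandas as pd

def best_matches(ds1, ds2, rse, k=6):
    """Toy best-match query: for each ds1 feature, find the ds2 feature
    lying within +/- k*RSE on every descriptor, ranked by total squared
    standardized distance. Not the actual Eclipse implementation."""
    matches = {}
    for _, feature in ds1.iterrows():
        # Standardized residuals of every ds2 feature against this feature.
        z = pd.DataFrame({d: (ds2[d] - feature[d]) / rse[d] for d in rse})
        # Keep only candidates inside the +/- k*RSE window on all descriptors.
        candidates = ds2[(z.abs() <= k).all(axis=1)]
        if candidates.empty:
            continue
        # Rank the candidates and record the best match.
        dist = (z.loc[candidates.index] ** 2).sum(axis=1)
        matches[feature["Compound_ID"]] = candidates.loc[dist.idxmin(), "Compound_ID"]
    return matches

# Hypothetical per-descriptor residual standard errors.
rse = {"RT": 0.05, "MZ": 0.002, "Intensity": 2e5}
print(best_matches(ds1, ds2, rse))  # reuses the ds1/ds2 tables sketched above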

https://github.com/broadinstitute/bmxp
https://github.com/broadinstitute/bmxp/blob/main/tests/test_blueshift.py

# Import the three bmxp modules: Eclipse (alignment), Blueshift (drift
# correction) and Gravity (clustering).
import bmxp
from bmxp.eclipse import MSAligner
from bmxp.blueshift import DriftCorrection
from bmxp.gravity import cluster

# Remap the default schema labels to match your own column headers.
bmxp.FMDATA['Compound_ID'] = 'Feature_ID'
bmxp.IMDATA['injection_id'] = 'Filename'
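For orientation, here is a minimal end-to-end sketch. The DriftCorrection calls are taken from the test file below; the MSAligner calls (align, to_csv) are assumptions based on the repository README and may differ between versions, and all file names are placeholders:

from bmxp.eclipse import MSAligner
from bmxp.blueshift import DriftCorrection

# Drift-correct one dataset from a feature table and injection information.
dc = DriftCorrection("DCinput1.csv", "DCinfo1.csv")
dc.internal_standard_correct("Internal Standard 1")
dc.pool_correct(interpolation="linear", pool="PREFA", max_missing_percent=100)
dc.calculate_cvs()

# Align two or more processed datasets and write the results.
a = MSAligner("DS1.csv", "DS2.csv", "DS3.csv")
a.align()
a.to_csv()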

# pylint: disable=redefined-outer-name, missing-function-docstring, consider-using-with
"""
Tests for blueshift
"""
import pickle
from pathlib import Path
import pytest
import pandas as pd
import numpy as np
from bmxp import blueshift as b


@pytest.fixture()
def path_dc_input_1():
    return Path(__file__).parent / "DCinput1.csv"


@pytest.fixture()
def path_sample_info_1():
    return Path(__file__).parent / "DCinfo1.csv"


@pytest.fixture()
def path_dc_input_2():
    return Path(__file__).parent / "DCinput2.csv"


@pytest.fixture()
def path_sample_info_2():
    return Path(__file__).parent / "DCinfo2.csv"


@pytest.fixture()
def df_dc_input_1(path_dc_input_1):
    return pd.read_csv(path_dc_input_1)


@pytest.fixture()
def df_sample_info_1(path_sample_info_1):
    return pd.read_csv(path_sample_info_1)


@pytest.fixture()
def df_dc_input_2(path_dc_input_2):
    return pd.read_csv(path_dc_input_2)


@pytest.fixture()
def df_sample_info_2(path_sample_info_2):
    return pd.read_csv(path_sample_info_2)


@pytest.fixture()
def pickled_results():
    return pd.read_pickle(Path(__file__).parent / "blueshift.pickle")


def test_data_validation(df_dc_input_1, df_sample_info_1):
    # missing required column in injection information
    info = df_sample_info_1.drop("injection_order", axis=1)
    with pytest.raises(ValueError) as e:
        b.DriftCorrection(df_dc_input_1, info)
    assert "injection_order" in str(e.value)

    # missing injection in data input
    data = df_dc_input_1.drop("B0005_COL_ExampleProject_CN-M36058078", axis=1)
    with pytest.raises(ValueError) as e:
        b.DriftCorrection(data, df_sample_info_1)
    assert "data sheet: B0005_COL_ExampleProject_CN-M36058078" in str(e.value)

    # no error when missing "not_used" injection in data input
    data = df_dc_input_1.drop("B0008_COL_ExampleProject_CN-M59244903", axis=1)
    b.DriftCorrection(data, df_sample_info_1)

    # duplicate injection order
    info = df_sample_info_1.copy()
    info.loc[14, "injection_order"] = info.loc[15, "injection_order"]
    with pytest.raises(ValueError) as e:
        b.DriftCorrection(df_dc_input_1, info)
    assert "duplicate values" in str(e.value)

    # duplicate injection id
    info = df_sample_info_1.copy()
    info.loc[14, "injection_id"] = info.loc[15, "injection_id"]
    with pytest.raises(ValueError) as e:
        b.DriftCorrection(df_dc_input_1, info)
    assert "duplicate injection_ids" in str(e.value)

    # out-of-order injection order
    info = df_sample_info_1.copy()
    info.loc[14, "injection_order"] = 700
    with pytest.raises(ValueError) as e:
        b.DriftCorrection(df_dc_input_1, info)
    assert "must be sorted" in str(e.value)

    # invalid label in batches column
    info = df_sample_info_1.copy()
    info.loc[13, "batches"] = "batch nd"
    with pytest.raises(ValueError) as e:
        b.DriftCorrection(df_dc_input_1, info)
    assert "invalid label" in str(e.value)

    # non-numeric character in data
    data = df_dc_input_1.copy()
    data.iloc[5, 5] = "f"
    with pytest.raises(TypeError) as e:
        b.DriftCorrection(data, df_sample_info_1)
    assert "non-numeric" in str(e.value)

    # data and sample are not in same order
    data = df_dc_input_1.copy()
    col_list = list(data.columns)
    col_list = col_list[:10] + col_list[11:] + col_list[10:11]
    data = data.loc[:, col_list]
    with pytest.raises(ValueError) as e:
        b.DriftCorrection(data, df_sample_info_1)
    assert "usable samples" in str(e.value)


def test_batch_start_end(df_sample_info_1):
    # batch_end shifts up to nearest valid injection
    info = df_sample_info_1.copy()
    info.loc[16, ["batches", "QCRole"]] = ["batch_end", "NA"]
    info[["batches", "QCRole"]] = info[["batches", "QCRole"]].fillna("")
    batches = b.find_batch_start_end(info)
    assert batches.loc[15] == "batch_end" and batches.loc[16] == ""

    info = df_sample_info_1.copy()
    info.loc[7, "batches"] = "batch_end"
    info.loc[:7, "QCRole"] = "NA"
    info[["batches", "QCRole"]] = info[["batches", "QCRole"]].fillna("")
    with pytest.raises(ValueError) as e:
        b.find_batch_start_end(info)
    assert "Cannot move " in str(e.value)


def test_batch_generation(
    df_dc_input_1,
    df_sample_info_1,
    path_dc_input_2,
    path_sample_info_2,
    pickled_results,
):
    a = b.DriftCorrection(df_dc_input_1, df_sample_info_1)
    for batch, ref_batch in zip(a.batches["default"], pickled_results["default1"]):
        assert (batch.values == ref_batch.values).all()
    for batch, ref_batch in zip(a.batches["override"], pickled_results["override1"]):
        assert (batch.values == ref_batch.values).all()

    a = b.DriftCorrection(path_dc_input_2, path_sample_info_2)
    for batch, ref_batch in zip(a.batches["default"], pickled_results["default2"]):
        assert (batch.values == ref_batch.values).all()
    for batch, ref_batch in zip(a.batches["override"], pickled_results["override2"]):
        assert (batch.values == ref_batch.values).all()


def test_internal_standard_correction(
    path_dc_input_1,
    df_dc_input_1,
    path_sample_info_1,
    df_dc_input_2,
    df_sample_info_2,
    pickled_results,
):
    # one internal standard
    a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
    a.internal_standard_correct("Internal Standard 1")
    assert np.isclose(
        a.data.round().fillna(0),
        pickled_results["DCinput1_IS_InternalStandard1"].round().loc[:, a.data.columns],
        equal_nan=True,
    ).all()

    # one internal standard with nonquant duplicate
    nonquant_df = df_dc_input_1.copy()
    nonquant_df.loc[4, "Metabolite"] = "Internal Standard 1"
    nonquant_df.loc[4, "Non_Quant"] = True
    a = b.DriftCorrection(nonquant_df, path_sample_info_1)
    a.internal_standard_correct("Internal Standard 1")
    assert np.isclose(
        a.data.round().fillna(0),
        pickled_results["DCinput1_IS_InternalStandard1"].round().loc[:, a.data.columns],
        equal_nan=True,
    ).all()

    # nonquant "missing" internal standard
    nonquant_df = df_dc_input_1.copy()
    nonquant_df.loc[0, "Non_Quant"] = True
    a = b.DriftCorrection(nonquant_df, path_sample_info_1)
    with pytest.raises(ValueError) as e:
        a.internal_standard_correct("Internal Standard 1")
    assert "not found in" in str(e.value)

    # two internal standards
    a = b.DriftCorrection(df_dc_input_2, df_sample_info_2)
    a.internal_standard_correct(["15R-15-methyl-PGA2", "15R-15-methyl-PGF2a"])
    assert np.isclose(
        a.data.round(),
        pickled_results["DCinput2_IS_PGA2_PGF2a"].loc[:, a.data.columns],
        equal_nan=True,
    ).all()

    # missing IS value
    data = df_dc_input_2.copy()
    data.iloc[14, 50] = 0
    a = b.DriftCorrection(data, df_sample_info_2)
    with pytest.raises(ValueError) as e:
        a.internal_standard_correct("15S-15-methyl-PGD2")
    assert "missing values" in str(e.value)

    # wrong IS name
    with pytest.raises(ValueError) as e:
        a.internal_standard_correct("not_a_real_metabolite")
    assert "not found in" in str(e.value)


def test_pool_correction(
    path_dc_input_1,
    path_sample_info_1,
    path_dc_input_2,
    path_sample_info_2,
    pickled_results,
):
    # linear with override
    a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
    a.pool_correct(
        interpolation="linear", pool="PREFA", override=True, max_missing_percent=100
    )
    assert np.isclose(
        a.data.apply(np.floor),
        pickled_results["DCinput1_linear_PREFA_override"].loc[:, a.data.columns],
        equal_nan=True,
    ).all()

    # linear without override
    a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
    a.pool_correct(
        interpolation="linear", pool="PREFA", override=False, max_missing_percent=100
    )
    assert np.isclose(
        a.data.apply(np.floor),
        pickled_results["DCinput1_linear_PREFA"].loc[:, a.data.columns],
        equal_nan=True,
    ).all()

    # internal standard + NN
    a = b.DriftCorrection(path_dc_input_2, path_sample_info_2)
    a.internal_standard_correct("15R-15-methyl-PGA2")
    a.pool_correct(interpolation="NN", pool="PREFB", max_missing_percent=100)
    assert np.isclose(
        a.data.round(),
        pickled_results["DCinput2_IS_PGA2_NN_PREFB"].loc[:, a.data.columns],
        equal_nan=True,
    ).all()

    # linear with max_missing_percent=30
    a = b.DriftCorrection(path_dc_input_1, path_sample_info_1)
    a.pool_correct(
        interpolation="linear", pool="PREFA", override=True, max_missing_percent=30
    )
    assert np.isclose(
        a.data.apply(np.floor),
        pickled_results["DCinput1_linear_PREFA_override_maxmissing30"]
        .loc[:, a.data.columns]
        .apply(np.floor),
        equal_nan=True,
    ).all()


def test_cv_calculation(
    path_dc_input_2,
    path_sample_info_2,
    pickled_results,
):
    # CV calculation
    a = b.DriftCorrection(path_dc_input_2, path_sample_info_2)
    a.pool_correct(interpolation="linear", pool="PREFA", max_missing_percent=100)
    a.calculate_cvs()
    res = a.cvs.loc[:, ["CV" in col for col in a.cvs.columns]]
    assert np.isclose(
        res.fillna(0),
        pickled_results["DCinput2_linear_PREFA_CVs"].loc[:, res.columns].fillna(0),
    ).all()

References

  • Eclipse: a Python package for alignment of two or more nontargeted LC-MS metabolomics datasets
  • https://github.com/broadinstitute/bmxp
