10X单细胞(10X空间转录组)RNA速率分析之scVelo

今天我们测试一下软件scVelo,这个软件应该很多人都用过了,但是玩的好的人,应该没几个。

加载

import scvelo as scv
scv.set_figure_params()

读取数据(loom, h5ad, csv), 我的数据是根据velocyto软件对bam进行分析的结果,这个过程大家自己好好做一下吧

adata = scv.read(filename, cache=True)

看一眼数据结构

图片.png
图片.png

如果是多个样本,可以进行merge

ldata = scv.read(filename.loom, cache=True)
adata = scv.utils.merge(adata, ldata)

我这里将三个样本的信息merge起来做后续分析

adata = scv.utils.merge(adata, ldata,sdata)
图片.png

看看里面都有些什么

which stores the data matrix (adata.X), annotation of cells / observations (adata.obs) and genes / variables (adata.var), unstructured annotation such as graphs (adata.uns) and additional data layers where spliced and unspliced counts are stored (adata.layers) 。
图片.png
图片.png

layer下面就是我们分析需要的最重要的信息。

预处理,在基本预处理(基因选择和归一化)之后,计算速度估计的一阶和二阶矩(均值和非中心方差):

scv.pp.filter_and_normalize(adata, **params)
scv.pp.moments(adata, **params)

函数的参数一定要注意,这里采用默认值

看一下函数scv.pp.moments,Computes moments for velocity estimation。First-/second-order moments are computed for each cell across its nearest neighbors, where the neighbor graph is obtained from euclidean distances in PCA space.数学的东西总是有点难以理解,但是我们还是要深入其中。注意这里的moments是矩阵的秩。

然后进行速率分析

###Velocity Tools
scv.tl.velocity(adata, mode='stochastic', **params)

先来看一下这个函数scv.tl.velocity,Estimates velocities in a gene-specific manner。 The steady-state model determines velocities by quantifying how observations deviate from a presumed steady-state equilibrium ratio of unspliced to spliced mRNA levels(这句话是核心). This steady-state ratio is obtained by performing a linear regression restricting the input data to the extreme quantiles. By including second-order moments, the stochastic model exploits not only the balance of unspliced to spliced mRNA levels but also their covariation. By contrast, the likelihood-based dynamical model solves the full splicing kinetics and generalizes RNA velocity estimation to transient systems. It is also capable of capturing non-observed steady states.(也比较难理解)。

我们总结一下

目前有三种方法评估 RNA 速率: (1)稳态/确定性模型 velocyto 中使用稳态/确定性模型对 RNA 速率进行估计。在假定转录阶段 (诱导和抑制) 持续足够长的时间以达到稳态平衡的情况下,速率被量化为实际观测值如何偏离稳态平衡。 平衡 mRNA 水平近似于在假定的上下分位数的稳定状态下的线性回归。这种简化是通过假设 一个跨基因的通用剪接率和数据中反映的稳态 mRNA 水平来实现的。基于这些假设,可能导 致速率估计和细胞状态的错误,特别是当一个种群包含多个异质亚种群动态时。 (2)随机模型 随机模型的目标是更好地捕捉稳态,但与稳态模型的假设相同。它是通过处理转录,剪 接和降解作为概率事件,从而纳入二阶矩。也就是说,稳态水平不仅与 mRNA 水平近似,而 且与内在表达变异性近似。 (3)动态模型 动态模型 (最强大,但计算量最大) 解决了每个基因的剪接动力学的全部动态。因此,它 使 RNA 速率适应广泛变化的规格,如非平稳群体,因为它不依赖于限制一个共同的剪接率或待抽样的稳态。通过迭代估计反应速率和潜在细胞特异性变量的可识别参数,即转录状态和 细胞内潜伏时间,在基于概率的期望最大化框架中求解剪接动力学。 该模型能够进一步以一种基于概率的方式系统地识别动态驱动基因,从而找到控制细胞 命运转变的关键驱动因素。此外,动态模型推断了一个普遍的细胞内潜伏时间共享的基因, 能够将相关基因和识别转录变化的机制联系起来。

看看函数的参数;

  • mode : 速度是通过求解转录动力学的随机模型获得的基因表达空间中的向量。 确定性模型的解决方案是通过设置 mode='deterministic' 获得的。 The solution to the dynamical model is obtained by setting mode='dynamical'(动态模型), which requires to run scv.tl.recover_dynamics(adata, **params) beforehand.

通过将速度转换为可能的细胞转换,将速度投影到低维嵌入中。 也就是说,对于每个速度向量,我们找到与该方向一致的可能的细胞转换。 一个细胞过渡到另一个细胞的概率使用余弦相关性(潜在单元格转换和速度向量之间)计算,并存储在表示为速度图的矩阵中:

scv.tl.velocity_graph(adata, **params)

最后可视化,相对简单了

scv.pl.velocity_embedding(adata, basis='umap', **params)
scv.pl.velocity_embedding_grid(adata, basis='umap', **params)
scv.pl.velocity_embedding_stream(adata, basis='umap', **params)

For every tool module there is a plotting counterpart, which allows you to examine your results in detail, e.g.:(感觉起来很有说服力)。

scv.pl.velocity(adata, var_names=['gene_A', 'gene_B'], **params)
scv.pl.velocity_graph(adata, **params)

来吧,看看我们的结果,先来看基础分析内容

scv.pl.proportions(adata)
proportions.png

和教程的图不太一样啊,下面是教程的图

图片.png

说明这个函数内部有一些处理,看一下

看了之后,groupby='clusters',我们没有设置,这里大家将Seurat的整合分析的聚类结果导入进去,这里我图省事,直接scanpy聚类了。

import scanpy as sc
sc.tl.leiden(adata)
scv.pl.proportions(adata,groupby='leiden')

看看结果,嗯,正常了,不过大家注意要做细胞定义啊,可不要都拿进来分析RNA Velocyto。

proportions.png

前处理数据,这个已经很常规了,跟前面的一样

###Preprocess the Data
scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.log1p(adata)
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

接下来重点,Estimate RNA velocity

Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode='deterministic'). For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated(大家要深入理解一下)。

我们先来看最基础的mode,stochastic

scv.tl.velocity(adata)

当然,更多时候我们需要低维空间展示,有函数计算二维空间,建议大家导入seurat分析出来的二维空间,这里展示我就用默认的。

scv.tl.velocity_graph(adata)
sc.tl.umap(adata)
scv.pl.velocity_embedding_stream(adata, basis='umap',color = 'leiden')
Velocyto.png

嗯,效果相当不错。再来一张酷炫的

scv.pl.velocity_embedding(adata, arrow_length=3, arrow_size=2, dpi=120)
Velocyto1.png

接下来Interprete the velocities,其实就是关键基因

This is perhaps the most important part as we advise the user not to limit biological conclusions to the projected velocities, but to examine individual gene dynamics via phase portraits to understand how inferred directions are supported by particular genes.

scv.pl.velocity(adata, ['Cpe',  'Gnao1', 'Ins2', 'Adk'], ncols=2)
图片.png

The black line corresponds to the estimated ‘steady-state’ ratio, i.e. the ratio of unspliced to spliced mRNA abundance which is in a constant transcriptional state. RNA velocity for a particular gene is determined as the residual, i.e. how much an observation deviates from that steady-state line. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.(图的意义,大家可要认真学习)

scv.pl.scatter(adata, 'Cpe', color=['clusters', 'velocity'],
               add_outline='Ngn3 high EP, Pre-endocrine, Beta')
图片.png

然后是Identify important genes

We need a systematic way to identify genes that may help explain the resulting vector field and inferred lineages.To do so, we can test which genes have cluster-specific differential velocity expression, being siginificantly higher/lower compared to the remaining population.(跟Seurat计算差异基因的方法类似)。The module scv.tl.rank_velocity_genes runs a differential velocity t-test and outpus a gene ranking for each cluster. Thresholds can be set (e.g. min_corr) to restrict the test on a selection of gene candidates.
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()
图片.png
kwargs = dict(frameon=False, size=10, linewidth=1.5,
              add_outline='Ngn3 high EP, Pre-endocrine, Beta')

scv.pl.scatter(adata, df['Ngn3 high EP'][:5], ylabel='Ngn3 high EP', **kwargs)
scv.pl.scatter(adata, df['Pre-endocrine'][:5], ylabel='Pre-endocrine', **kwargs)
图片.png

(可选)Velocities in cycling progenitors,了解一下

scv.tl.score_genes_cell_cycle(adata)
scv.pl.scatter(adata, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95])
图片.png

Speed and coherence

Two more useful stats: - The speed or rate of differentiation is given by the length of the velocity vector. - The coherence of the vector field (i.e., how a velocity vector correlates with its neighboring velocities) provides a measure of confidence.
scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95])
velocyto_length_confidence.png
These provide insights where cells differentiate at a slower/faster pace, and where the direction is un-/determined.(还是很重要的)。
df = adata.obs.groupby('leiden')[keys].mean().T
df.style.background_gradient(cmap='coolwarm', axis=1)
图片.png

Velocity graph and pseudotime(最后的可视化)

We can visualize the velocity graph to portray all velocity-inferred cell-to-cell connections/transitions. It can be confined to high-probability transitions by setting a threshold. The graph, for instance, indicates two phases of Epsilon cell production, coming from early and late Pre-endocrine cells.
scv.pl.velocity_graph(adata, threshold=.1)
clusters.png

Further, the graph can be used to draw descendents/anscestors coming from a specified cell. Here, a pre-endocrine cell is traced to its potential fate.

x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=70)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)
potential.png

然后拟时

scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
pseudotime.png

PAGA velocity graph(PAGA大家应该不陌生才对)

# this is needed due to a current bug - bugfix is coming soon.
adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']

scv.tl.paga(adata, groups='leiden')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')

图片.png

PAGA ,结果相当不错

scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
            min_edge_width=2, node_size_scale=1.5)
PAGA.png

接下来要分析动态模型

That yields several additional insights such as latent time and identification of putative driver genes。注意这里运用了函数scv.tl.recover_dynamics,Recovers the full splicing kinetics of specified genes.The model infers transcription rates, splicing rates, degradation rates,as well as cell-specific latent time and transcriptional states,estimated iteratively by expectation-maximization.
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, basis='umap',color = 'leiden')
dynamics.png

差别还是很大的,所有基因都囊括进来还是很值得商榷的。

Kinetic rate paramters(动力学参数)

无需任何实验数据即可估算 RNA 转录、剪接和降解的速率。
它们有助于更好地了解细胞身份和表型异质性。
df = adata.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]

kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
    pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs)
    pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
    pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)

scv.get_df(adata, 'fit*', dropna=True).head()
图片.png

图片.png

Latent time(相当于轨迹分析)

The dynamical model recovers the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and approximates the real time experienced by cells as they differentiate, based only on its transcriptional dynamics.
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80)
monocle.png

嗯,示例数据就很差了,毕竟没定义,全都拿进来当作示例了。

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata, var_names=top_genes, sortby='latent_time', col_color='leiden', n_convolve=100)
heatmap.png

嗯,很乱,我们看一看官网的示例

图片.png

Top-likelihood genes

Driver genes display pronounced dynamic behavior and are systematically detected via their characterization by high likelihoods in the dynamic model.
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False)
图片.png
var_names = ['Actn4', 'Ppp3ca', 'Cpe', 'Nnat']
scv.pl.scatter(adata, var_names, frameon=False)
scv.pl.scatter(adata, x='latent_time', y=var_names, frameon=False)
图片.png

Cluster-specific top-likelihood genes

scv.tl.rank_dynamical_genes(adata, groupby='leiden')
df = scv.get_df(adata, 'rank_dynamical_genes/names')
df.head(5)
图片.png
for cluster in ['Ductal', 'Ngn3 high EP', 'Pre-endocrine', 'Beta']:
    scv.pl.scatter(adata, df[cluster][:5], ylabel=cluster, frameon=False)
图片.png

第三部分,Differential Kinetics

One important concern is dealing with systems that represent multiple lineages and processes, where genes are likely to show different kinetic regimes across subpopulations. Distinct cell states and lineages are typically governed by different variations in the gene regulatory networks and may hence exhibit different splicing kinetics. This gives rise to genes that display multiple trajectories in phase space.(这也是最为常见的情况)。

为了解决这个问题,可以使用动力学模型来执行差分动力学的似然比测试。 通过这种方式,我们可以检测到无法通过整体动力学的单个模型很好解释的动力学行为的集群。 将细胞类型聚类到其不同的动力学机制中,然后允许分别拟合每个机制。

来看看这个过程,前面的预处理都是一样的。

这里首先是基础模型

###Basic Velocity Estimation
scv.tl.velocity(adata)
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, basis='umap',color = 'leiden')
Velocyto.png

Differential Kinetic Test

不同的细胞类型和谱系可能表现出不同的动力学机制,因为它们可以由不同的网络结构控制。 即使细胞类型或谱系相关,由于可变剪接、可变多聚腺苷酸化和降解调节,动力学也会有所不同。

动力学模型允许我们通过差分动力学的似然比测试来解决这个问题,以检测显示动力学行为的集群/谱系,而整体动力学的单个模型无法充分解释这些行为。 测试每种细胞类型是否独立拟合产生显着改善的可能性。

遵循渐近卡方分布的似然比可以检验显着性。 请注意,出于效率原因,默认情况下使用正交回归而不是完整的相轨迹来测试集群是否能被整体动力学很好地解释或表现出不同的动力学。

var_names = ['Tmsb10', 'Fam155a', 'Hn1', 'Rpl6']
scv.tl.differential_kinetic_test(adata, var_names=var_names, groupby='clusters')
scv.get_df(adata[:, var_names], ['fit_diff_kinetics', 'fit_pval_kinetics'], precision=2)
图片.png
kwargs = dict(linewidth=2, add_linfit=True, frameon=False)
scv.pl.scatter(adata, basis=var_names, add_outline='fit_diff_kinetics', **kwargs)
图片.png
diff_clusters=list(adata[:, var_names].var['fit_diff_kinetics'])
scv.pl.scatter(adata, legend_loc='right', size=60, title='diff kinetics',
               add_outline=diff_clusters, outline_width=(.8, .2))
图片.png

Testing top-likelihood genes

通过最高似然基因进行筛选,我们发现了一些显示多种动力学机制的基因动态。
scv.tl.recover_dynamics(adata)
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:100]
scv.tl.differential_kinetic_test(adata, var_names=top_genes, groupby='clusters')
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, add_outline='fit_diff_kinetics', **kwargs)
图片.png
scv.pl.scatter(adata, basis=top_genes[15:30], ncols=5, add_outline='fit_diff_kinetics', **kwargs)
图片.png

Recompute velocities

最后,可以利用多个相互竞争的动力学机制的信息重新计算速度。
scv.tl.velocity(adata, diff_kinetics=True)
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding(adata, dpi=120, arrow_size=2, arrow_length=2)
图片.png

确实很难,但是因为难,才有学习的价值

生活很好,有你更好

你可能感兴趣的:(10X单细胞(10X空间转录组)RNA速率分析之scVelo)