Bioinformatics: Introduction and Methods - Functional Prediction of Variants - Lecture Notes (11)

Chapter 6  Functional Prediction of Variants

6.1 Problem Overview

  • Where did your genetic variations come from?
  1. inherited from parents
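  2. de novo mutations (about 70 to 100 new mutations per individual)
  3. somatic mutations (e.g. in cancer)
  • Many congenital pediatric diseases arise when a child carries a de novo mutation that happens to fall in an important gene, which can cause severe disease.
  • Tumor cells generally carry some somatic mutation that causes the cells to proliferate uncontrollably.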
  • Types of genetic variations in a human genome
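  1. Chromosomal aneuploidy: the most severe type, in which the chromosome number is wrong. An example is Down syndrome, in which the child has three copies of chromosome 21.
  2. Structural Variations (SVs): insertion; deletion; inversion; translocation
  3. Copy Number Variations (CNVs)
  4. Short insertions/deletions (indels): may occur in intergenic or intronic regions, where the effect on phenotype is usually small, though not always; may also occur in coding regions, where they fall into two kinds: those that cause a frameshift and those that do not.
  5. Single Nucleotide Variations (SNVs): in general, each person's genome carries about 3 million single-nucleotide variations, roughly 1 per 1,000 bases. They may fall in intergenic regions, in introns, or in promoters.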
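  • Nomenclature: mutation vs. polymorphism vs. variation vs. variant
  • Across the whole human population, a variation with a frequency below 1% is generally called a mutation; one with a frequency above 1% is called a polymorphism. Sometimes 5% is used as the cutoff instead.
  • "Variation" or "variant" is generally the umbrella term covering both mutations and polymorphisms.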
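  • SNVs within coding regions
  1. Stop gain (nonsense): the most severe; introduces a stop codon, so the protein terminates prematurely. The protein is either not produced at all or produced as a truncated version.
  2. Stop loss: removes a stop codon, so the resulting protein is longer than the original; it may also fail to be translated at all.
  3. Non-synonymous (missense): changes the amino acid.
  4. Synonymous (silent): occurs in a protein-coding exon but does not change the encoded amino acid.
  5. Affect splicing: occurs at the boundary between an intron and an exon.
  • Missense (non-synonymous) SNVs are the most studied so far, because: 1. Missense SNVs change the amino acid; 2. Missense SNVs account for ~2% of the genome but >50% of all mutations known to be involved in human inherited disease. This figure may, however, be inflated simply because so many researchers study nonsynonymous mutations.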
  • What features differentiate disease-causing variants from neutral ones?
  • How can we predict whether a variation is disease-causing?
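
The coding-region categories listed above (stop gain, stop loss, missense, synonymous) and the frameshift distinction for indels can be sketched in a few lines of Python. This is only a minimal illustration, assuming the standard genetic code and a variant given as a codon plus a position within it; the function names `classify_coding_snv` and `indel_causes_frameshift` are hypothetical and not taken from any tool mentioned in the lecture. Splice-site effects are not covered.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_snv(ref_codon, pos, alt_base):
    """Classify a single-base change at index pos (0, 1 or 2) of a codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:pos] + alt_base.upper() + ref_codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "stop gain"    # new premature stop codon
    if ref_aa == "*":
        return "stop loss"    # stop codon replaced by an amino acid
    return "missense"         # amino acid changed


def indel_causes_frameshift(indel_length):
    """A coding indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


print(classify_coding_snv("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
print(indel_causes_frameshift(3))
```

A real annotation pipeline would also need the transcript model and strand to find the codon in the first place; that part is deliberately left out here.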

6.2 Databases That Record Variations

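  • 1976: a nonsynonymous mutation causing thalassemia was discovered.
  • 1993: the cause of Huntington's disease was found to be a variable number of nucleotide repeats.
  • 1995: the cause of Williams syndrome was identified as the deletion of a chromosomal segment, a representative example of this class of disease.
  • Genetic variations that affect proteins are recorded in Swiss-Prot, a database developed in 1986.
  • 1998: NCBI established dbSNP, which catalogs single-nucleotide variations (among others) observed in many healthy individuals.
  • 2010: NCBI established dbVar, which mainly stores larger structural variations.
  • 2012: the 1000 Genomes Project was published.
  • 1987: OMIM (Online Mendelian Inheritance in Man), which stores disease-related genetic variations.
  • 1996: the Human Gene Mutation Database (HGMD) was established to store data on genetic variations.
  • 2007: Locus Specific Databases (LSDBs) were established; each one gathers the genetic variations known for a specific locus.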
  • 2007 and 2012: dbGaP and ClinVar, which store genetic variations found in the results of GWAS and next-generation sequencing experiments.
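  • 2004: COSMIC was established; it mainly stores somatic mutations found in cancer.
  • dbSNP, established in 1998, is an important NCBI database whose main purpose is to store all identified genetic variations, from both healthy individuals and patients.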
  • LSDBs
  1. Collect all known variants of each disease-related gene in a specific database.
  2. Annotate with complete and accurate information on genetic mutations.
  3. Most LSDBs are built on LOVD (Leiden Open Variation Database), a database framework for storing variant information.

6.3 Conservation-Based and Rule-Based Prediction Methods: SIFT and PolyPhen

  • Phenotypical/functional "effects" of human genetic variations
  1. Disease vs. normal
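  2. Deleterious vs. neutral: an evolutionary concept, i.e. whether the variation affects the individual's fitness
  3. Personal trait differences (e.g. height): here the effect is not an extreme phenotype such as disease vs. normal, but a difference in some trait.
  • Beyond assessing the individual's final phenotype, establishing a true causal link between genotype and phenotype requires a lot of work in animal models and at the cellular level (animal model phenotypic changes and cellular phenotypic changes).
  • "Functional effect" actually refers to whether the variation changes the structure and function of the protein (protein function changes and protein structure changes).
  • At the lowest level, the question is whether it changes the protein sequence (protein sequence changes).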
  • Statistical and stochastic, not deterministic
  • Observations, not "truth"
  • Nonsense mutations are usually considered deleterious.
  • Known deleterious mutations are enriched in nonsynonymous mutations.
  • ~50% of the known mutations in single-gene (Mendelian) disorders are nonsynonymous mutations (ascertainment bias?)
  • Synonymous, intronic, and intergenic mutations are understudied (according to GWAS, 88% of trait-associated variants of weak effect are non-coding)
  • Most research so far had focused on nonsynonymous mutations.
  • More successful methods
  1. Conservation-based(e.g., SIFT)
  2. Rule-based(e.g., PolyPhen)
  3. Classifier-based(e.g., PolyPhen2, SAPRED)
  • Sort Intolerant From Tolerant substitutions (SIFT)
  1. Published in 2001 by Pauline C. Ng and Steven Henikoff
  2. The first tool for predicting deleterious amino acid substitutions
  3. Website: http://sift.jcvi.org
  • SIFT bets on evolution: Important positions (such as active sites) tend to be conserved in the protein family across species. Mutations at well-conserved positions tend to be deleterious.
  • SIFT bets on evolution: Some positions have a high degree of diversity across species. Mutations at these positions tend to be neutral.

SIFT is a multistep procedure.

Given a protein sequence:

Step1. Search for similar sequences

  • Sequence search database: SWISS-PROT
  • PSI-BLAST is run for four iterations to collect a pool of sequences similar to the query.

Step2. Choose closely related sequences that are likely to share similar function

  • The PSI-BLAST results are grouped together if they are >90% identical in the aligned regions.
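
The >90% identity grouping in Step 2 can be illustrated with a small sketch. The notes do not say how the grouping is carried out, so the greedy, representative-based clustering below is an assumption, as are the function names `percent_identity` and `group_by_identity` and the toy sequences. Identity is computed only over columns where both aligned sequences have a residue.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def group_by_identity(aligned, threshold=90.0):
    """Greedy grouping: each sequence joins the first group whose
    representative (first member) it matches above the threshold."""
    groups = []
    for name, seq in aligned.items():
        for group in groups:
            if percent_identity(seq, aligned[group[0]]) > threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no group was close enough; start a new one
    return groups


# Toy pre-aligned sequences (hypothetical data, equal length, '-' marks a gap).
aligned = {
    "query": "MKTAYIAKQRQISFVKSHFS",
    "hit1":  "MKTAYIAKQRQISFVKSHFT",
    "hit2":  "MRSAYLAKERQ-SWVRSHYS",
}
print(group_by_identity(aligned))
```

Here `hit1` differs from the query at one of 20 positions (95% identity) and joins its group, while `hit2` falls well below the threshold and forms its own group.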

Step3. Obtain the multiple alignment of these chosen sequences

Step4. Calculate normalized probabilities for all possible substitutions at each position in the alignment

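  • In step 4, a probability is computed at each position from the observed distribution of amino acids, and from it a final numeric prediction score is derived. If the score is below 0.05, the substitution is predicted to be deleterious; if it is above 0.05, it is predicted to be neutral, i.e. it does not change function or phenotype.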
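
To make Step 4 concrete, here is a heavily simplified sketch of a conservation score with the 0.05 cutoff. It is not SIFT's actual computation: it assumes plain amino acid counts with a small uniform pseudocount (the `pseudocount` value is arbitrary) and scales each probability by that of the most frequent residue in the column. The names `scaled_probabilities` and `predict` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def scaled_probabilities(column, pseudocount=0.01):
    """Per-residue probabilities for one alignment column,
    divided by the probability of the most frequent residue."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)  # gaps are ignored
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    probs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}


def predict(column, alt_aa, cutoff=0.05):
    """Return (score, label); a score below the cutoff is called deleterious."""
    score = scaled_probabilities(column)[alt_aa]
    return score, ("deleterious" if score < cutoff else "neutral")


conserved = "CCCCCCCCCC"   # e.g. a cysteine conserved in every homolog
variable = "ILVILVMILV"    # a position that tolerates several hydrophobic residues
print(predict(conserved, "S"))
print(predict(variable, "M"))
```

A substitution at the fully conserved column scores far below 0.05, while a residue already seen at the variable column scores well above it, which matches the "SIFT bets on evolution" intuition.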

How is accuracy defined?

[Figure 1: image from the original post]
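
Since the figure itself is not reproduced here, the sketch below shows the measures commonly used to evaluate a deleterious/neutral predictor against known labels: sensitivity, specificity, overall accuracy, and the Matthews correlation coefficient (MCC). That these are the measures the lecture figure defines is an assumption; the function name `evaluate` and the example labels are hypothetical.

```python
import math


def evaluate(predicted, actual, positive="deleterious"):
    """Compare predicted labels with known labels and return common measures."""
    tp = fp = tn = fn = 0
    for pred, truth in zip(predicted, actual):
        if pred == positive:
            if truth == positive:
                tp += 1   # true positive
            else:
                fp += 1   # false positive
        else:
            if truth == positive:
                fn += 1   # false negative
            else:
                tn += 1   # true negative

    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


predicted = ["deleterious", "deleterious", "neutral", "neutral", "deleterious", "neutral"]
actual = ["deleterious", "neutral", "neutral", "neutral", "deleterious", "deleterious"]
print(evaluate(predicted, actual))
```

Reporting sensitivity and specificity separately matters because benchmark sets are often unbalanced, so a single accuracy figure can hide a predictor that calls nearly everything neutral.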

Polymorphism Phenotyping (PolyPhen): a rule-based method

  • Amino acid variants may impact folding, interaction sites, solubility or stability of the protein.
  • Changes in protein structure may affect protein function, which may lead to phenotype change.
  • PolyPhen predicts the impact of amino acid allelic variants based on multiple sequence alignment AND protein 3D structure features.

PolyPhen

1. Multiple sequence alignment of homologous sequences

2. Get the protein 3D structure, or use homology modeling to predict it

3. Structure-based characterization of the substitution site

  • DISULFIDE, THIOLEST or THIOETH bond, BINDING site, ACTIVE site etc.
  • Whether the variant is located in transmembrane regions
  • Whether the variant is located in coiled coil regions
  • Whether the variant is located in signal peptide regions

4. Calculate the 3D structure features of the substitution site

  • Secondary structure
  • Solvent accessible surface area
  • Φ-Ψ dihedral angles
  • Normalized B-factor for the residue
  • Loss of hydrogen bond
  • Contacts with critical sites, ligands or other polypeptide chains

Pros:

  • Improved prediction accuracy when the protein 3D structure is available

Cons:

  • If the 3D structure is not available, it can only depend on the MSA.
  • The rules are empirical.
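
The general shape of a rule-based predictor that combines alignment and structure features can be sketched as follows. Every feature name and threshold here is an illustrative assumption and not PolyPhen's actual rule table; the sketch only shows how empirical rules are layered and how the call falls back on the alignment alone when no 3D structure is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteFeatures:
    in_functional_site: bool        # e.g. annotated active or binding site
    conservation_drop: float        # hypothetical MSA-based score; larger = less tolerated
    relative_accessibility: Optional[float] = None  # None when no 3D structure is known
    loses_hydrogen_bond: bool = False


def rule_based_call(f):
    """Apply hand-written rules in order; all thresholds are made up for illustration."""
    if f.in_functional_site:
        return "probably damaging"
    if f.conservation_drop > 2.0:
        return "probably damaging"
    if f.relative_accessibility is not None:
        # Structure available: a buried site with extra evidence is suspicious.
        buried = f.relative_accessibility < 0.25
        if buried and (f.loses_hydrogen_bond or f.conservation_drop > 1.0):
            return "possibly damaging"
    elif f.conservation_drop > 1.5:
        # No structure: only the alignment-based score can be used.
        return "possibly damaging"
    return "benign"


print(rule_based_call(SiteFeatures(False, 1.2, relative_accessibility=0.1)))
print(rule_based_call(SiteFeatures(False, 1.2)))
```

The same conservation score leads to different calls with and without structural data, which reflects both the advantage and the limitation listed above; classifier-based methods replace such hand-tuned thresholds with parameters learned from data.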
