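For more ADAM learning code and related material, see: https://github.com/xubo245/AdamLearning
1. In ADAM, the Avro schema for the FASTA format is NucleotideContigFragment. It is defined in the package org.bdgenomics.formats.avro, which belongs to the bdg-formats project; for details see the repository: https://github.com/bigdatagenomics/bdg-formats
2. Reference 【2】 walks through the FASTA operations in detail. An excerpt of the result (one record printed as JSON):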
{"contig": {"contigName": "chrUn_KN707606v1_decoy", "contigLength": 2200, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null, "referenceIndex": null}, "description": "AC:KN707606.1 gi:734691250 LN:2200 rl:unplaced M5:20c768ac79ca38077e5012ee0e5f8333 AS:hs38d1", "fragmentSequence": "ctagtagctgggactacaagcgcccgccaccacacccggctaatttttttgtatttttagtggagacaggtttcaccgtgttggccaggatggtctcgatctcctgaccttgtgatctgcccaccttgccctcccaaagtgctgggattacaggcatgagccaccatacccggcagTGTCCTATCCATTTTTAAGGCAGCCACTTGGAGTTGGAGCATGTCTTTCTCTCATAATCTCTTACCAGATGTCTCAGAGCAGCCTGTGCACTTTAACTCCAGACATTCTGCCACTGAGCCCCCTAGAGCTCCAGCTTTTAAAGCACTTGGGGTGAGCCTCGAGAGATGACAGACGGAGCTGCCCAAGAGCTGCCAGCTGCCAACCCTGCCTGGGGCTTCACGGCCCGCGCCCTACTTCCTCTCAGCTGGCTCCACACCCTGGGGCGTGTAATTTCCAAATTCTCACTCCCAGGGCTAATTTGGGGGATAAGACATTTGATTAGAAGTATCAgaaaccagctgggcatggtggctcacacctgtaatcccagcactttgggaggttatgactagaggatcatttgaactcaggaattcaagaccagcctggataacagtgagaccccatctctacaaaatataaacaattatgtgagcatggtggtgcacacctgtagtccctgttccttgggaggctgaggccggaggatcccttgagcccaggagttcaaggctgcagagagctgcgattgtgccactgcacactaacctgggagatagagcaagaacttgtctcagaaaaaaaaagtatcaggaaCTAATCTCCAGTCCTATCAAGTTAGGCATAAGGTCAATGTGTGATAGCTGAGTGTCACAGAAACCAAGGACAGGAATGCAACTGCCACTGGGGATGAACTGGAAGTGGGGAGTTAAACCACCTCAGAATGTccccatttttgtttcttctccagATGTGCTGCTTTGCTTTTCCGTATGTTTCTCTACGGACCAGCTACCTCTCCTCTGCCAACAGATCCAAGTTGTGCATGTTATGGGTCCAAACACCACGTGACAAGCCCATTCTTCCAGTTTCTCAGACCAGAAACTGCACTGTCCTCTAACTGCTTCTTCTCCCTCTTGCATCTGGTCCTTGGGGAAATCCTGTTTGCCCGGCCTTCAGCATATATCCACAGTTTAACCTTAACCACTCCTCGCCACCACTCGCGGGGGCGAGCAGCCTTCGCCCCCTGCCTAGATTACTACAGTAACTTCATTGTTCTTTCTACTTCTCTCTTTGCCCCTCTGCTATCTCAAAACAGCATCCAAAATGCACCTAGCAAGAGCATGTCATTCCTCTGCACAAAACTCTccaacttctctctttttttttttttttttttttttgagacggagtctcactctgtcacccaggctggagtgcaatagtgtgatcttggctcactgcaacctccacctcccagattcaagcgattctcctgcctcagcctcctgagtagctgagattacaggttcatgtcaccatgcccggctaatttttgtatttttagtagagacagggtttcaccatgttagtcaggctggtctcgaactcctgaccttgtgatccacccgcctcagcctcccaaagtgctgggattataggcatgagccaccgtgcatgacCAACTTCTCTTTTT
GTTCAGAGTAAAAGCCAACGGCCCATGAGGCTTTCCATGGTCACGCCTCCGCTCATTCGCTCTGTGGCTTTGTCTTACACGGGTTCACTCCTCACTGGCCGCCTTGCTGACCCCATAGCTCACGGGCCTTACTCTGCTctcggggcctttgcacttgctccaCTGCAAATGCTCCTCCCCCAGAGGCCTTTGTGGCCCATTCCCTCGGTTCCTTAGGAACAATCCCTTCCCTGGTCAAACCTCCACTGACATCTGTCTCCTtcccttctgaattttttttctccgGTAGTATTTATCACTCTGCTATCCTTAGGATTTCCTTATCTTGTTTATCATCATCTCCTCATCCAGAGcttaagtcctttttttttttttgagatagagtctcgctctgtcgcccaggctggagtgcagtggcgcgatctcgtctcgctgaaagctccacctcccgggttcacgccattctcccgcctcagcctcccgagtagctgggactacaggcactcg", "fragmentNumber": 0, "fragmentStartPosition": 0, "fragmentLength": 2200, "numberOfFragmentsInContig": 1}
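The record above keeps the contig name and the free-text description as separate fields. As a minimal sketch of how such a split could work, assuming a FASTA header line of the form ">name description" with space-separated key:value tags, the plain-Scala helper below separates the two parts and reads the tags. The object FastaHeaderSketch and its methods parseHeader and parseTags are hypothetical names used only for illustration; they are not part of ADAM, and this is not claimed to be how ADAM parses headers.

```scala
// Hypothetical helper for illustration only; not part of ADAM or bdg-formats.
object FastaHeaderSketch {

  /** Split a FASTA header line into (name, optional description). */
  def parseHeader(line: String): (String, Option[String]) = {
    require(line.startsWith(">"), "a FASTA header line starts with '>'")
    // Everything up to the first whitespace run is the name, the rest is the description.
    val parts = line.drop(1).trim.split("\\s+", 2)
    val description = if (parts.length > 1) Some(parts(1).trim) else None
    (parts(0), description)
  }

  /** Parse space-separated key:value tags such as LN:2200 into a map. */
  def parseTags(description: String): Map[String, String] =
    description.split("\\s+").toSeq.flatMap { tag =>
      tag.split(":", 2) match {
        case Array(key, value) => Some(key -> value)
        case _                 => None // ignore tokens without a colon
      }
    }.toMap

  def main(args: Array[String]): Unit = {
    val header = ">chrUn_KN707606v1_decoy AC:KN707606.1 gi:734691250 LN:2200 rl:unplaced"
    val (name, description) = parseHeader(header)
    println(name)
    println(description.map(parseTags))
  }
}
```

With the description from the record above, parseTags would expose the LN tag (2200), which matches the contigLength field of that record.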
3. Creating an empty record:
Code:
/**
 * @author xubo
 * Fasta/Fastq/SAM/BAM read
 */
package org.bdgenomics.adamLocal.algorithms.test

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.formats.avro.NucleotideContigFragment
import org.bdgenomics.formats.avro.Contig
//import scala.collection.parallel.Foreach

object NucleotideContigFragmentTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FastaAndNucleotideContigFragment").setMaster("local")
    val sc = new SparkContext(conf)
    val ac = new ADAMContext(sc)

    // Build a fragment whose fields are all left unset
    val builder = NucleotideContigFragment.newBuilder()
    val contig = Contig.newBuilder
    builder.setContig(contig.build)
    val build1 = builder.build()
    println(build1)

    sc.stop
  }
}
Output:
{"contig": {"contigName": null, "contigLength": null, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null, "referenceIndex": null}, "description": null, "fragmentSequence": null, "fragmentNumber": null, "fragmentStartPosition": null, "fragmentLength": null, "numberOfFragmentsInContig": null}
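Appendix: the loadFasta method in ADAMContext calls FastaConverter from the package org.bdgenomics.adam.converters. FastaConverter contains a piece of code that builds records of this format, so it can serve as a reference. Both excerpts follow.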
loadFasta in ADAMContext:
def loadFasta(
  filePath: String,
  fragmentLength: Long): RDD[NucleotideContigFragment] = {
  val fastaData: RDD[(LongWritable, Text)] = sc.newAPIHadoopFile(
    filePath,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text]
  )
  if (Metrics.isRecording) fastaData.instrument() else fastaData

  // (byte offset, line text) pairs
  val remapData = fastaData.map(kv => (kv._1.get, kv._2.toString))

  FastaConverter(remapData, fragmentLength)
}
Record construction in FastaConverter:
val fragments = sequencesAsFragments.zipWithIndex
  .map(si => {
    val (bases, index) = si

    val contig = Contig.newBuilder
      .setContigLength(sequenceLength)

    val builder = NucleotideContigFragment.newBuilder()
      .setFragmentSequence(bases)
      .setFragmentNumber(index)
      .setFragmentStartPosition(index * fragmentLength)
      .setNumberOfFragmentsInContig(fragmentCount)
      .setFragmentLength(bases.length)

    // map over optional fields
    name.foreach(contig.setContigName(_))
    description.foreach(builder.setDescription(_))
    builder.setContig(contig.build)

    // build and return
    builder.build()
  })
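The fragment bookkeeping in the excerpt above (fragment number, start position as index * fragmentLength, per-fragment length, total fragment count) can be tried without Spark or Avro. The following is a minimal plain-Scala sketch under stated assumptions: the case class Fragment is a hypothetical stand-in for NucleotideContigFragment, toFragments is a hypothetical helper, and the sequence is cut with String.grouped rather than with ADAM's own code.

```scala
// Hypothetical stand-in for the Avro record; not the real bdg-formats class.
final case class Fragment(
  contigName: Option[String],
  contigLength: Long,
  sequence: String,
  fragmentNumber: Int,
  fragmentStartPosition: Long,
  fragmentLength: Long,
  numberOfFragmentsInContig: Int)

object FragmentSketch {

  /** Split one contig sequence into fragments of at most fragmentLength bases. */
  def toFragments(name: Option[String], sequence: String, fragmentLength: Int): Seq[Fragment] = {
    require(fragmentLength > 0, "fragmentLength must be positive")
    val pieces = sequence.grouped(fragmentLength).toVector
    pieces.zipWithIndex.map { case (bases, index) =>
      Fragment(
        contigName = name,
        contigLength = sequence.length.toLong,
        sequence = bases,
        fragmentNumber = index,
        // same arithmetic as the excerpt: index * fragmentLength
        fragmentStartPosition = index.toLong * fragmentLength,
        fragmentLength = bases.length.toLong,
        numberOfFragmentsInContig = pieces.size)
    }
  }

  def main(args: Array[String]): Unit =
    toFragments(Some("chrTest"), "ACGTACGTAC", 4).foreach(println)
}
```

For a 10-base sequence and a fragment length of 4, this yields three fragments starting at positions 0, 4 and 8, the last one holding only 2 bases. The record in section 2 is the single-fragment case: a 2200-base contig loaded with a fragment length large enough to hold it, so fragmentNumber is 0 and numberOfFragmentsInContig is 1.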
References:
【1】https://github.com/xubo245/AdamLearning
【2】http://blog.csdn.net/xubo245/article/details/51288264
【3】https://github.com/bigdatagenomics/adam