Genomic data processing 49: successfully running cloud-scale-bwamem


1. First generate the data with art: see the previous post.
2. Upload the fastq to HDFS:
hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ spark-submit  --class cs.ucla.edu.bwaspark.BWAMEMSpark --master local[2]  /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar upload-fastq   0 1 fastq/G38L100c1Nhs20.fastq /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq
command: upload-fastq
Map('isPairEnd -> 0, 'filePartNum -> 1, 'inFilePath1 -> fastq/G38L100c1Nhs20.fastq, 'outFilePath -> /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq)
Upload FASTQ command line arguments: 0 1 fastq/G38L100c1Nhs20.fastq  /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq 250000
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Upload FASTQ to HDFS Finished!!!

3. Run the alignment:
hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ spark-submit --executor-memory 2g --class cs.ucla.edu.bwaspark.BWAMEMSpark --total-executor-cores 2 --master local[2]  --conf spark.driver.host=**MasterIP** --conf spark.driver.cores=2 --conf spark.driver.maxResultSize=2g --conf spark.storage.memoryFraction=0.7  --conf spark.akka.threads=2 --conf spark.akka.frameSize=1024 /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar cs-bwamem -bfn 1 -bPSW 1 -sbatch 10 -bPSWJNI 1  -oChoice 2 -oPath hdfs://**MasterIP**:9000/xubo/11.adam -localRef 1  -isSWExtBatched 1  0 GRCH38BWAindex/GRCH38chr1L3556522.fasta  /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq
command: cs-bwamem
Map('isPSWJNI -> 1, 'localRef -> 1, 'batchedFolderNum -> 1, 'isPSWBatched -> 1, 'subBatchSize -> 10, 'inFASTQPath -> /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq, 'inFASTAPath -> GRCH38BWAindex/GRCH38chr1L3556522.fasta, 'outputPath -> hdfs://**MasterIP**:9000/xubo/11.adam, 'isSWExtBatched -> 1, 'isPairEnd -> 0, 'outputChoice -> 2)
CS- BWAMEM command line arguments: false GRCH38BWAindex/GRCH38chr1L3556522.fasta /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq 1 true 10 true ./target/jniNative.so 2 hdfs://**MasterIP**:9000/xubo/11.adam
HDFS master: hdfs://Master:9000
Input HDFS folder number: 1
Head line: @RG  ID:foo  SM:bar
Read Group ID: foo
Load Index Files
Load BWA-MEM options
Output choice: 2
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
CS-BWAMEM Finished!!!
Jun 3, 2016 11:32:26 AM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
Jun 3, 2016 11:32:27 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jun 3, 2016 11:32:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
Jun 3, 2016 11:32:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jun 3, 2016 11:32:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 17 ms. row count = 1

The **MasterIP** placeholder in the command above must be replaced with the address of your own master node.
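As a small convenience, the address can be kept in one shell variable so that every occurrence changes together. The sketch below only builds and prints the two master-dependent arguments of the alignment command. The names `MASTER_IP`, `OUT_PATH` and `DRIVER_CONF` and the address 192.0.2.10 are placeholders chosen for illustration, not values from the original run.

```shell
#!/usr/bin/env bash
# Placeholder address (documentation range); use your own master node's IP.
MASTER_IP="192.0.2.10"

# Output location used by the alignment step, built from the variable.
OUT_PATH="hdfs://${MASTER_IP}:9000/xubo/11.adam"

# The two arguments of the spark-submit call that depend on the master.
DRIVER_CONF="spark.driver.host=${MASTER_IP}"
echo "--conf ${DRIVER_CONF}"
echo "-oPath ${OUT_PATH}"
```

The printed arguments can then be substituted into the spark-submit line in place of the hand-edited ones.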
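4. Viewing the adam file: cs-bwamem provides a merge command, but it did not succeed when run the way the project describes. The output can instead be read directly with Spark SQL: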

package org.bdgenomics.avocado.cli

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext._

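/**
  * Created by xubo on 2016/5/27.
  * Reads the cs-bwamem output (Parquet) from HDFS and prints it.
  * run: success
  */
object parquetRead2csbwamem {
  def main(args: Array[String]): Unit = {
    // Use the object name, without the trailing '$', as the application name
    val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$')))
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    println("start:")
    // Replace **MasterIp** with the address of your HDFS master
    val file = "hdfs://**MasterIp**:9000/xubo/14.adam/0"
    // Read the Parquet files directly, merging the schemas of the part files
    val df3 = sqlContext.read.option("mergeSchema", "true").parquet(file)
    //    df3.printSchema()
    df3.show()
    println("end")
    sc.stop()
  }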
}

Result:

|              contig|    start|      end|mapq|            readName|            sequence|                qual|cigar|basesTrimmedFromStart|basesTrimmedFromEnd|readPaired|properPair|readMapped|mateMapped|firstOfPair|secondOfPair|failedVendorQualityChecks|duplicateRead|readNegativeStrand|mateNegativeStrand|primaryAlignment|secondaryAlignment|supplementaryAlignment|mismatchingPositions|origQual|          attributes|recordGroupName|recordGroupSequencingCenter|recordGroupDescription|recordGroupRunDateEpoch|recordGroupFlowOrder|recordGroupKeySequence|recordGroupLibrary|recordGroupPredictedMedianInsertSize|recordGroupPlatform|recordGroupPlatformUnit|recordGroupSample|mateAlignmentStart|mateAlignmentEnd|mateContig|

|[chr1,248956422,n...|225496693|225496793|  60|chr1-1   RG  ID:foo  ...|CATATTTACCAATTAAA...|@C@D@FFDFHHHHIJ.J...| 100M|                    0|                  0|     false|     false|      true|     false|      false|       false|                    false|        false|             false|             false|            true|             false|                 false|               61A38|    null|NM:i:1    AS:i:95 XS...|            foo|                       null|                  null|                   null|                null|                  null|              null|                                null|               null|                   null|              bar|              null|            null|      null|


end

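Result analysis: the cs-bwamem alignment agrees with the bwa and snap alignments and with the position generated by art!!! See reference [2]. ADAM stores a 0-based start (225496693, the same value as in the art record), while the SAM POS field below is 1-based (225496694), so both denote the same position. The final two sequence lines are the reference and the read from the art alignment record.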
hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ cat G38L100c1Nhs20.sam
@SQ SN:chr1 LN:248956422
@PG ID:bwa  PN:bwa  VN:0.7.13-r1126 CL:bwa samse GRCH38chr1L3556522.fna G38L100c1Nhs20.sai G38L100c1Nhs20.fq
chr1-1  0   chr1    225496694   37  100M    *   0   0   CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGACAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT    @C@D@FFDFHHHHIJ.JBIJJGJGIJ:G47JHJ@IJJ91BJJIGHHHEIJDGD=IJJJBJJ'DG=3D)
chr1   chr1-1  225496693   +
CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGAAAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT
CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGACAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT
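
The agreement can also be checked mechanically. The bash sketch below assumes that the two sequence lines above are the art reference line followed by the read line, and that the alignment is ungapped (CIGAR 100M). It rebuilds an MD-style mismatch string from the two sequences and compares the coordinates. The helper `md_tag` and the variable names are illustrative only and belong to no tool used in this post.

```shell
#!/usr/bin/env bash
# Reference and read lines copied from the art alignment record above.
ref="CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGAAAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT"
read_seq="CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGACAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT"

# Build an MD-style string for an ungapped alignment:
# matching bases are counted as a run, a mismatch emits the run length
# followed by the reference base, and the final run is always printed.
md_tag() {
  local ref=$1 qry=$2 run=0 out="" i
  for ((i = 0; i < ${#ref}; i++)); do
    if [[ ${ref:i:1} == "${qry:i:1}" ]]; then
      run=$((run + 1))
    else
      out+="${run}${ref:i:1}"
      run=0
    fi
  done
  printf '%s%s\n' "$out" "$run"
}

md=$(md_tag "$ref" "$read_seq")
echo "MD from art record: $md"

# ADAM start is 0-based, SAM POS is 1-based.
adam_start=225496693
sam_pos=225496694
echo "ADAM start + 1 = $((adam_start + 1)), SAM POS = $sam_pos"
```

For the record above the helper yields 61A38, the same value as mismatchingPositions in the ADAM row, and the ADAM start plus one equals the SAM POS.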

References:
[1] https://github.com/ytchen0323/cloud-scale-bwamem
[2] http://blog.csdn.net/xubo245/article/details/51576880