Merging paired-end NGS reads with FLASH

Reposted from: https://mp.weixin.qq.com/s/lgJDpwk0vYipARfTorfCkA
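First, note that not every analysis requires merging paired-end reads. Transcriptome analysis, for example, does not; merging is most common in amplicon sequencing.
Why merge at all? In next-generation sequencing, DNA or RNA is broken into fragments of a given size, for example 300-400 bp, but the sequencer can only read a limited length with good quality, for example 150 nt. Beyond that length, base quality drops so sharply that the calls are of little use. Reading 150 nt from one end leaves the remaining 150-250 bp of such a fragment unsequenced, so the same fragment is also sequenced from the opposite end. When the fragment is shorter than the two reads combined, the reads overlap, and that overlap is what allows them to be merged into one longer sequence.
The figure below shows the forward and reverse reads from a single DNA fragment in a paired-end run, aligned with Clone Manager. Blue and red mark the region where the two reads match, about 111 bp long, while the original DNA/RNA fragment is about 189 bp long.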

Fig1.png
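
The numbers in the figure can be checked with simple arithmetic. The sketch below assumes both reads are 150 nt long, as in the example above; the function name expected_overlap is made up for illustration and is not part of FLASH.

```shell
#!/usr/bin/env bash
# expected_overlap READ_LEN FRAGMENT_LEN   (hypothetical helper, not a FLASH command)
# Prints 2*READ_LEN - FRAGMENT_LEN, the number of bases covered by both reads.
# A result of 0 or less means the reads do not overlap and cannot be merged.
expected_overlap() {
    local read_len=$1 fragment_len=$2
    echo $(( 2 * read_len - fragment_len ))
}

expected_overlap 150 189   # the fragment in the figure: prints 111
expected_overlap 150 350   # a 350 bp fragment: prints -50, so no overlap
```

A 189 bp fragment read with 150 nt from each end gives the 111 bp overlap seen in the alignment, whereas a 350 bp fragment leaves a 50 bp gap between the reads.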

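1. Installation

Download and install from the command line on Linux.
Manual download and build:

# If this URL does not resolve, fetch FLASH-1.2.11.tar.gz from the download links on http://ccb.jhu.edu/software/FLASH/index.shtml
wget http://ccb.jhu.edu/software/FLASH/index.shtml/FLASH-1.2.11.tar.gz
tar -zxvf FLASH-1.2.11.tar.gz   # extract the archive
cd FLASH-1.2.11/                # enter the FLASH-1.2.11 source directory
make                            # compile; this produces the executable 'flash' in this directory (add it to PATH to call it from anywhere)

Or install with conda: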

conda install -c bioconda flash
flash --help
Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
       flash [OPTIONS] --interleaved-input (MATES.FASTQ | -)
       flash [OPTIONS] --tab-delimited-input (MATES.TAB | -)

----------------------------------------------------------------------------
                                 DESCRIPTION                                
----------------------------------------------------------------------------

FLASH (Fast Length Adjustment of SHort reads) is an accurate and fast tool
to merge paired-end reads that were generated from DNA fragments whose
lengths are shorter than twice the length of reads.  Merged read pairs result
in unpaired longer reads, which are generally more desired in genome
assembly and genome analysis processes.

Briefly, the FLASH algorithm considers all possible overlaps at or above a
minimum length between the reads in a pair and chooses the overlap that
results in the lowest mismatch density (proportion of mismatched bases in
the overlapped region).  Ties between multiple overlaps are broken by
considering quality scores at mismatch sites.  When building the merged
sequence, FLASH computes a consensus sequence in the overlapped region.
More details can be found in the original publication
(http://bioinformatics.oxfordjournals.org/content/27/21/2957.full).

Limitations of FLASH include:
   - FLASH cannot merge paired-end reads that do not overlap.
   - FLASH is not designed for data that has a significant amount of indel
     errors (such as Sanger sequencing data).  It is best suited for Illumina
     data.

----------------------------------------------------------------------------
                               MANDATORY INPUT
----------------------------------------------------------------------------

The most common input to FLASH is two FASTQ files containing read 1 and read 2
of each mate pair, respectively, in the same order.

Alternatively, you may provide one FASTQ file, which may be standard input,
containing paired-end reads in either interleaved FASTQ (see the
--interleaved-input option) or tab-delimited (see the --tab-delimited-input
option) format.  In all cases, gzip compressed input is autodetected.  Also,
in all cases, the PHRED offset is, by default, assumed to be 33; use the
--phred-offset option to change it.

----------------------------------------------------------------------------
                                   OUTPUT
----------------------------------------------------------------------------

The default output of FLASH consists of the following files:

   - out.extendedFrags.fastq      The merged reads.
   - out.notCombined_1.fastq      Read 1 of mate pairs that were not merged.
   - out.notCombined_2.fastq      Read 2 of mate pairs that were not merged.
   - out.hist                     Numeric histogram of merged read lengths.
   - out.histogram                Visual histogram of merged read lengths.

FLASH also logs informational messages to standard output.  These can also be
redirected to a file, as in the following example:

  $ flash reads_1.fq reads_2.fq 2>&1 | tee flash.log

In addition, FLASH supports several features affecting the output:

   - Writing the merged reads directly to standard output (--to-stdout)
   - Writing gzip compressed output files (-z) or using an external
     compression program (--compress-prog)
   - Writing the uncombined read pairs in interleaved FASTQ format
     (--interleaved-output)
   - Writing all output reads to a single file in tab-delimited format
     (--tab-delimited-output)

----------------------------------------------------------------------------
                                   OPTIONS
----------------------------------------------------------------------------

  -m, --min-overlap=NUM   The minimum required overlap length between two
                          reads to provide a confident overlap.  Default:
                          10bp.

  -M, --max-overlap=NUM   Maximum overlap length expected in approximately
                          90% of read pairs.  It is by default set to 65bp,
                          which works well for 100bp reads generated from a
                          180bp library, assuming a normal distribution of
                          fragment lengths.  Overlaps longer than the maximum
                          overlap parameter are still considered as good
                          overlaps, but the mismatch density (explained below)
                          is calculated over the first max_overlap bases in
                          the overlapped region rather than the entire
                          overlap.  Default: 65bp, or calculated from the
                          specified read length, fragment length, and fragment
                          length standard deviation.

  -x, --max-mismatch-density=NUM
                          Maximum allowed ratio between the number of
                          mismatched base pairs and the overlap length.
                          Two reads will not be combined with a given overlap
                          if that overlap results in a mismatched base density
                          higher than this value.  Note: Any occurence of an
                          'N' in either read is ignored and not counted
                          towards the mismatches or overlap length.  Our
                          experimental results suggest that higher values of
                          the maximum mismatch density yield larger
                          numbers of correctly merged read pairs but at
                          the expense of higher numbers of incorrectly
                          merged read pairs.  Default: 0.25.

  -O, --allow-outies      Also try combining read pairs in the "outie"
                          orientation, e.g.

                               Read 1: <-----------
                               Read 2:       ------------>

                          as opposed to only the "innie" orientation, e.g.

                               Read 1:       <------------
                               Read 2: ----------->

                          FLASH uses the same parameters when trying each
                          orientation.  If a read pair can be combined in
                          both "innie" and "outie" orientations, the
                          better-fitting one will be chosen using the same
                          scoring algorithm that FLASH normally uses.

                          This option also causes extra .innie and .outie
                          histogram files to be produced.

  -p, --phred-offset=OFFSET
                          The smallest ASCII value of the characters used to
                          represent quality values of bases in FASTQ files.
                          It should be set to either 33, which corresponds
                          to the later Illumina platforms and Sanger
                          platforms, or 64, which corresponds to the
                          earlier Illumina platforms.  Default: 33.

  -r, --read-len=LEN
  -f, --fragment-len=LEN
  -s, --fragment-len-stddev=LEN
                          Average read length, fragment length, and fragment
                          standard deviation.  These are convenience parameters
                          only, as they are only used for calculating the
                          maximum overlap (--max-overlap) parameter.
                          The maximum overlap is calculated as the overlap of
                          average-length reads from an average-size fragment
                          plus 2.5 times the fragment length standard
                          deviation.  The default values are -r 100, -f 180,
                          and -s 18, so this works out to a maximum overlap of
                          65 bp.  If --max-overlap is specified, then the
                          specified value overrides the calculated value.

                          If you do not know the standard deviation of the
                          fragment library, you can probably assume that the
                          standard deviation is 10% of the average fragment
                          length.

  --cap-mismatch-quals    Cap quality scores assigned at mismatch locations
                          to 2.  This was the default behavior in FLASH v1.2.7
                          and earlier.  Later versions will instead calculate
                          such scores as max(|q1 - q2|, 2); that is, the
                          absolute value of the difference in quality scores,
                          but at least 2.  Essentially, the new behavior
                          prevents a low quality base call that is likely a
                          sequencing error from significantly bringing down
                          the quality of a high quality, likely correct base
                          call.

  --interleaved-input     Instead of requiring files MATES_1.FASTQ and
                          MATES_2.FASTQ, allow a single file MATES.FASTQ that
                          has the paired-end reads interleaved.  Specify "-"
                          to read from standard input.

  --interleaved-output    Write the uncombined pairs in interleaved FASTQ
                          format.

  -I, --interleaved       Equivalent to specifying both --interleaved-input
                          and --interleaved-output.

  -Ti, --tab-delimited-input
                          Assume the input is in tab-delimited format
                          rather than FASTQ, in the format described below in
                          '--tab-delimited-output'.  In this mode you should
                          provide a single input file, each line of which must
                          contain either a read pair (5 fields) or a single
                          read (3 fields).  FLASH will try to combine the read
                          pairs.  Single reads will be written to the output
                          file as-is if also using --tab-delimited-output;
                          otherwise they will be ignored.  Note that you may
                          specify "-" as the input file to read the
                          tab-delimited data from standard input.

  -To, --tab-delimited-output
                          Write output in tab-delimited format (not FASTQ).
                          Each line will contain either a combined pair in the
                          format 'tag  seq  qual' or an uncombined
                          pair in the format 'tag  seq_1  qual_1
                           seq_2  qual_2'.

  -o, --output-prefix=PREFIX
                          Prefix of output files.  Default: "out".

  -d, --output-directory=DIR
                          Path to directory for output files.  Default:
                          current working directory.

  -c, --to-stdout         Write the combined reads to standard output.  In
                          this mode, with FASTQ output (the default) the
                          uncombined reads are discarded.  With tab-delimited
                          output, uncombined reads are included in the
                          tab-delimited data written to standard output.
                          In both cases, histogram files are not written,
                          and informational messages are sent to standard
                          error rather than to standard output.

  -z, --compress          Compress the output files directly with zlib,
                          using the gzip container format.  Similar to
                          specifying --compress-prog=gzip and --suffix=gz,
                          but may be slightly faster.

  --compress-prog=PROG    Pipe the output through the compression program
                          PROG, which will be called as `PROG -c -',
                          plus any arguments specified by --compress-prog-args.
                          PROG must read uncompressed data from standard input
                          and write compressed data to standard output when
                          invoked as noted above.
                          Examples: gzip, bzip2, xz, pigz.

  --compress-prog-args=ARGS
                          A string of additional arguments that will be passed
                          to the compression program if one is specified with
                          --compress-prog=PROG.  (The arguments '-c -' are
                          still passed in addition to explicitly specified
                          arguments.)

  --suffix=SUFFIX, --output-suffix=SUFFIX
                          Use SUFFIX as the suffix of the output files
                          after ".fastq".  A dot before the suffix is assumed,
                          unless an empty suffix is provided.  Default:
                          nothing; or 'gz' if -z is specified; or PROG if
                          --compress-prog=PROG is specified.

  -t, --threads=NTHREADS  Set the number of worker threads.  This is in
                          addition to the I/O threads.  Default: number of
                          processors.  Note: if you need FLASH's output to
                          appear deterministically or in the same order as
                          the original reads, you must specify -t 1
                          (--threads=1).

  -q, --quiet             Do not print informational messages.

  -h, --help              Display this help and exit.

  -v, --version           Display version.

Run `flash --help | less' to prevent this text from scrolling by.

2. Usage

flash read1.fq read2.fq -p 33 -r 250 -f 500 -s 100 -o output

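Main parameters:

-m  minimum overlap length required to merge a pair; default 10 bp;
-M  maximum overlap length expected in about 90% of read pairs; default 65 bp, or calculated from -r, -f and -s. Longer overlaps are still accepted, but the mismatch density is then computed over only the first -M bases of the overlap;
-x  maximum allowed mismatch density in the overlap (number of mismatched bases / overlap length); default 0.25;
-p  PHRED quality offset, 33 or 64; default 33;
-r  average read length;
-f  average fragment length, i.e. the size of the sequenced library fragments;
-s  standard deviation of the fragment length;
-o  prefix of the output files; default "out";
-z  write gzip-compressed output files;
-t  number of worker threads; the default is the number of processors. FLASH is multithreaded and fast, but use -t 1 if the output must keep the order of the input reads;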
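
According to the help text, -r, -f and -s are used only to derive the maximum overlap: the overlap of average-length reads from an average-size fragment, plus 2.5 times the fragment length standard deviation. A small sketch of that calculation follows. The function name flash_max_overlap is hypothetical, and truncating the result to an integer is an assumption about how FLASH rounds.

```shell
#!/usr/bin/env bash
# flash_max_overlap READ_LEN FRAGMENT_LEN FRAGMENT_STDDEV   (hypothetical helper)
# Reproduces the formula from the FLASH help text:
#   max overlap = (2 * read_len - fragment_len) + 2.5 * fragment_stddev
# Truncation to an integer is an assumption, not confirmed FLASH behavior.
flash_max_overlap() {
    awk -v r="$1" -v f="$2" -v s="$3" \
        'BEGIN { printf "%d\n", int(2 * r - f + 2.5 * s) }'
}

flash_max_overlap 100 180 18    # FLASH defaults: prints 65
flash_max_overlap 250 500 100   # values from the example command: prints 250
```

The first call matches the 65 bp default quoted in the help text. With the values from the example command, the average overlap term 2*250 - 500 is zero, so the whole maximum overlap comes from the standard deviation term; passing -M explicitly overrides the calculated value.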

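By default FLASH writes 5 result files (named here with the prefix output); a sixth file, the log, exists only if you save FLASH's messages yourself:
output.extendedFrags.fastq: the merged reads;
output.flash.log: the log, recording the parameters used and the merge statistics. FLASH prints this information to standard output, so the file exists only if you redirect it, for example by appending 2>&1 | tee output.flash.log to the command;
output.hist: numeric histogram of merged read lengths;
output.histogram: visual (text) histogram of merged read lengths;
output.notCombined_1.fastq: read 1 of the pairs that could not be merged;
output.notCombined_2.fastq: read 2 of the pairs that could not be merged;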
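
The .hist file can be summarized with a few lines of awk. The sketch below assumes each line of the file holds two whitespace-separated columns, merged read length and number of reads; check your own file before relying on this. The function name hist_summary is hypothetical, and the demo histogram is made-up data.

```shell
#!/usr/bin/env bash
# hist_summary FILE   (hypothetical helper)
# Assumes FILE has two columns per line: merged length, read count.
# Prints the total number of merged reads and their mean length.
hist_summary() {
    awk '{ n += $2; total += $1 * $2 }
         END {
             if (n > 0) { printf "reads=%d mean_length=%.1f\n", n, total / n }
             else       { print "reads=0" }
         }' "$1"
}

# Demo on a tiny made-up histogram (not real data).
demo=$(mktemp)
printf '250\t10\n252\t30\n253\t60\n' > "$demo"
hist_summary "$demo"   # prints: reads=100 mean_length=252.4
rm -f "$demo"
```

For amplicon data the mean merged length should sit close to the expected amplicon size; a very different value suggests the overlap parameters need another look.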

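Batch merging

# Merge every <sample>_R1.fastq.gz / <sample>_R2.fastq.gz pair in the current directory
for r1 in *_R1.fastq.gz; do
    sample=${r1%_R1.fastq.gz}   # strip the suffix to get the sample name
    mkdir -p "$sample"
    # -O (--allow-outies) is a flag and takes no argument
    flash "$r1" "${sample}_R2.fastq.gz" -O \
        -m 10 -M 100 -x 0.25 -z -o "$sample" -d "./$sample"
done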
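
After a batch run it is useful to check what fraction of pairs was merged in each sample. The sketch below counts FASTQ records directly in the gzip-compressed output, assuming the standard 4 lines per record and the file names FLASH produces with -z. The function names count_fastq_reads and merge_rate are hypothetical.

```shell
#!/usr/bin/env bash
# count_fastq_reads FILE.fastq.gz   (hypothetical helper)
# Prints the number of records, assuming 4 lines per FASTQ record.
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print int(NR / 4) }'
}

# merge_rate DIR PREFIX   (hypothetical helper)
# Prints the percentage of read pairs that were merged, computed as
# merged / (merged + not combined), or NA if both files are empty.
merge_rate() {
    local merged unmerged
    merged=$(count_fastq_reads "$1/$2.extendedFrags.fastq.gz")
    unmerged=$(count_fastq_reads "$1/$2.notCombined_1.fastq.gz")
    awk -v m="$merged" -v u="$unmerged" \
        'BEGIN {
             if (m + u > 0) { printf "%.2f\n", 100 * m / (m + u) }
             else           { print "NA" }
         }'
}
```

For a sample written to ./sampleA with the prefix sampleA, calling merge_rate ./sampleA sampleA prints the merge percentage. The same figure appears in the statistics FLASH prints while it runs, so this is mainly a cross-check when the log was not saved.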
