Heritrix Source Code Analysis (2): Introduction to the order.xml Configuration File

 This is an original article. Reposting is welcome, but please be sure to cite the source: http://guoyunsky.iteye.com/blog/613412

     This blog has moved to my own site: http://www.yun5u.com/

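      order.xml is the core of Heritrix: every setting in it affects how Heritrix runs. Before reading the source code I could only learn how these settings are used from a handful of sources. After reading it, I realized how flexibly Heritrix can be used: you can throttle the crawl speed, tune how much of the machine's resources it uses, and resume a crawl from where an earlier one left off. I have not mastered everything in order.xml and only know what most of the settings do, so corrections and additions are welcome. Thanks!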

 

  1. The <meta> element describes the crawl job itself, much like the meta tags of an HTML page.
    <meta>
      <name>myheritrix</name>
      <description>my heritrix</description>
      <operator>Admin</operator>
      <organization></organization>
      <audience></audience>
      <date>20090520051654</date>
    </meta>

 

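2. The <controller> element holds every parameter related to crawling. There is a lot of it, and Heritrix already divides it into modules, so I split it up the same way below.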

 

 

    <controller>
      <string name="settings-directory">settings</string>
      <string name="disk-path"></string>
      <string name="logs-path">logs</string>
      <string name="checkpoints-path">checkpoints</string>
      <string name="state-path">state</string>
      <string name="scratch-path">scratch</string>
      <long name="max-bytes-download">0</long>
      <long name="max-document-download">0</long>
      <long name="max-time-sec">0</long>
      <integer name="max-toe-threads">30</integer>
      <integer name="recorder-out-buffer-bytes">4096</integer>
      <integer name="recorder-in-buffer-bytes">65536</integer>
      <integer name="bdb-cache-percent">0</integer>
      <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
      </newObject>
      <map name="http-headers">
      </map>
      <newObject name="robots-honoring-policy" class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
      </newObject>
      <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
      </newObject>
      <map name="uri-canonicalization-rules">
      </map>
      <map name="pre-fetch-processors">
      </map>
      <map name="fetch-processors">
      </map>
      <map name="extract-processors">
      </map>
      <map name="write-processors">
      </map>
      <map name="post-processors">
      </map>
      <map name="loggers">
      </map>
      <newObject name="credential-store" class="org.archive.crawler.datamodel.CredentialStore">
      </newObject>
    </controller>
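
Since every simple setting is a typed element (string, integer, long, float, boolean) with a name attribute, the file is easy to inspect with a script. Here is a minimal Python sketch of that idea. The XML fragment is a trimmed-down stand-in rather than a complete order.xml, and read_settings is a hypothetical helper of my own, not part of Heritrix.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for illustration; a real order.xml has many more settings.
ORDER_XML = """
<crawl-order>
  <controller>
    <string name="logs-path">logs</string>
    <long name="max-time-sec">0</long>
    <integer name="max-toe-threads">30</integer>
    <map name="http-headers">
      <string name="user-agent">example-agent</string>
    </map>
  </controller>
</crawl-order>
"""

# Element tag -> Python converter for the simple typed settings.
CONVERTERS = {
    "string": str,
    "integer": int,
    "long": int,
    "float": float,
    "boolean": lambda text: text.strip().lower() == "true",
}


def read_settings(element):
    """Hypothetical helper: collect the simple settings directly under element."""
    settings = {}
    for child in element:
        convert = CONVERTERS.get(child.tag)
        if convert is not None:  # skip nested <map> / <newObject> containers
            settings[child.get("name")] = convert(child.text or "")
    return settings


root = ET.fromstring(ORDER_XML)
controller = root.find("controller")
top_level = read_settings(controller)
headers = read_settings(controller.find("map[@name='http-headers']"))
print(top_level)
print(headers)
```

Nested containers such as <map> and <newObject> are skipped at each level on purpose, so a caller walks into them explicitly, as done for http-headers here.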

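3. Next, the configuration of each component is split out and covered one by one. The main Heritrix settings, that is, the ones that let us influence the crawl, are described at the end.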

   3.1: Crawl scope

  

    <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
      <boolean name="enabled">false</boolean>
      <string name="seedsfile">seeds.txt</string>
      <boolean name="reread-seeds-on-config">true</boolean>
      <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
        <map name="rules">
          <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule">
          </newObject>
          <newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
            <string name="decision">ACCEPT</string>
            <string name="surts-source-file"></string>
            <boolean name="seeds-as-surt-prefixes">true</boolean>
            <string name="surts-dump-file"></string>
            <boolean name="also-check-via">false</boolean>
            <boolean name="rebuild-on-reconfig">true</boolean>
          </newObject>
          <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
            <integer name="max-hops">20</integer>
          </newObject>
          <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
            <integer name="max-trans-hops">3</integer>
            <integer name="max-speculative-hops">1</integer>
          </newObject>
          <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
            <integer name="max-repetitions">2</integer>
          </newObject>
          <newObject name="rejectIfTooManyPathSegs" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
            <integer name="max-path-depth">20</integer>
          </newObject>
          <newObject name="acceptIfPrerequisite" class="org.archive.crawler.deciderules.PrerequisiteAcceptDecideRule">
          </newObject>
        </map>
      </newObject>
    </newObject>
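
The order of the rules in this sequence matters. As I understand a DecideRuleSequence, every rule answers ACCEPT, REJECT or PASS, and the last answer that is not PASS becomes the final decision. The Python sketch below mimics that with three simplified rules. It is an illustration under that assumption, not Heritrix code: the rule functions are hypothetical, and a plain URL prefix stands in for a real SURT prefix.

```python
ACCEPT, REJECT, PASS = "ACCEPT", "REJECT", "PASS"


def reject_by_default(uri):
    # Starting point: everything is out of scope unless a later rule accepts it.
    return REJECT


def accept_if_prefixed(prefixes):
    # Simplification: plain string prefixes instead of SURT prefixes.
    def rule(uri):
        return ACCEPT if uri["url"].startswith(tuple(prefixes)) else PASS
    return rule


def reject_if_too_many_hops(max_hops):
    def rule(uri):
        return REJECT if uri["hops"] > max_hops else PASS
    return rule


def decide(rules, uri):
    """Apply every rule in order; the last non-PASS answer wins."""
    decision = PASS
    for rule in rules:
        outcome = rule(uri)
        if outcome != PASS:
            decision = outcome
    return decision


rules = [
    reject_by_default,
    accept_if_prefixed(["http://example.com/"]),
    reject_if_too_many_hops(20),
]

in_scope = decide(rules, {"url": "http://example.com/a.html", "hops": 3})
off_site = decide(rules, {"url": "http://other.example.org/", "hops": 1})
too_deep = decide(rules, {"url": "http://example.com/deep", "hops": 21})
print(in_scope, off_site, too_deep)
```

This is why rejectByDefault sits first in the listing above: a later rule can override it, whereas placing it last would reject every URL.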

 

    3.2: HTTP protocol

   

    <map name="http-headers">
      <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.14.3 +http://127.0.0.1)</string>
      <string name="from">[email protected]</string>
    </map>

 

    3.3: Robots exclusion protocol

    

    <newObject name="robots-honoring-policy" class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
      <string name="type">classic</string>
      <boolean name="masquerade">false</boolean>
      <text name="custom-robots"></text>
      <stringList name="user-agents">
      </stringList>
    </newObject>

 

   3.4: Frontier (scheduler)

  

    <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
      <float name="delay-factor">4.0</float>
      <integer name="max-delay-ms">20000</integer>
      <integer name="min-delay-ms">2000</integer>
      <integer name="respect-crawl-delay-up-to-secs">300</integer>
      <integer name="max-retries">30</integer>
      <long name="retry-delay-seconds">900</long>
      <integer name="preference-embed-hops">1</integer>
      <integer name="total-bandwidth-usage-KB-sec">0</integer>
      <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
      <string name="queue-assignment-policy">org.archive.crawler.frontier.HostnameQueueAssignmentPolicy</string>
      <string name="force-queue-assignment"></string>
      <boolean name="pause-at-start">false</boolean>
      <boolean name="pause-at-finish">false</boolean>
      <boolean name="source-tag-seeds">false</boolean>
      <boolean name="recovery-log-enabled">true</boolean>
      <boolean name="hold-queues">true</boolean>
      <integer name="balance-replenish-amount">3000</integer>
      <integer name="error-penalty-amount">100</integer>
      <long name="queue-total-budget">-1</long>
      <string name="cost-policy">org.archive.crawler.frontier.ZeroCostAssignmentPolicy</string>
      <long name="snooze-deactivate-ms">300000</long>
      <integer name="target-ready-backlog">50</integer>
      <string name="uri-included-structure">org.archive.crawler.util.BdbUriUniqFilter</string>
      <boolean name="dump-pending-at-close">false</boolean>
    </newObject>
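
The first four frontier settings are the ones that control crawl speed per host. Here is a rough Python sketch of how I understand them to combine: the wait before the next request to the same host is the duration of the last fetch multiplied by delay-factor, clamped between min-delay-ms and max-delay-ms, and a longer Crawl-delay from robots.txt is honored up to respect-crawl-delay-up-to-secs. Treat this as an assumption and a simplification (the bandwidth limits are ignored); politeness_delay_ms is a hypothetical function, not the Heritrix implementation.

```python
def politeness_delay_ms(fetch_ms, delay_factor=4.0, min_delay_ms=2000,
                        max_delay_ms=20000, robots_crawl_delay_s=0,
                        respect_crawl_delay_up_to_secs=300):
    """Hypothetical sketch: wait time (ms) before hitting the same host again."""
    # Scale the wait with how long the previous fetch took.
    delay = int(delay_factor * fetch_ms)
    # Clamp it into the configured [min, max] window.
    delay = max(delay, min_delay_ms)
    delay = min(delay, max_delay_ms)
    # A robots.txt Crawl-delay can lengthen the wait, but only up to the cap.
    respect_ms = respect_crawl_delay_up_to_secs * 1000
    robots_ms = min(robots_crawl_delay_s * 1000, respect_ms)
    return max(delay, robots_ms)


fast_page = politeness_delay_ms(100)       # 400 ms raised to the minimum
normal_page = politeness_delay_ms(1000)    # 4.0 * 1000 ms
slow_page = politeness_delay_ms(10000)     # 40000 ms cut to the maximum
with_robots = politeness_delay_ms(100, robots_crawl_delay_s=60)
capped_robots = politeness_delay_ms(100, robots_crawl_delay_s=600)
print(fast_page, normal_page, slow_page, with_robots, capped_robots)
```

Under this model, lowering delay-factor and min-delay-ms speeds the crawl up, at the cost of putting more load on each site.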

 

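   3.5: URL canonicalization rules. These normalize every URL and, like the scope, work by applying a series of rules. The Heritrix defaults are fine, so I do not cover them here.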
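
To give a feel for what a chain of canonicalization rules does, here is a small Python sketch. The three rules (lowercasing, dropping a leading www, stripping a jsessionid) are simplified examples I wrote for illustration, not the rules Heritrix ships and not its implementation. The point is only that each rule rewrites the URL in turn, so that variants of the same address end up identical and can be recognized as already seen.

```python
import re


def lowercase(url):
    return url.lower()


def strip_www(url):
    # Drop a leading "www." or "www2." style host prefix.
    return re.sub(r"^(https?://)www[0-9]*\.", r"\1", url)


def strip_session_id(url):
    # Remove a ";jsessionid=..." path parameter (URL is already lowercased).
    return re.sub(r";jsessionid=[0-9a-z]+", "", url)


# Rules are applied in this order, each one seeing the previous rule's output.
RULES = [lowercase, strip_www, strip_session_id]


def canonicalize(url, rules=RULES):
    for rule in rules:
        url = rule(url)
    return url


a = canonicalize("HTTP://WWW.Example.com/Index.html;jsessionid=ABC123")
b = canonicalize("http://example.com/index.html")
print(a)
print(a == b)
```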

   3.6: Pre-fetch processing chain:

    <map name="pre-fetch-processors">
      <newObject name="Preselector" class="org.archive.crawler.prefetch.Preselector">
        <boolean name="enabled">true</boolean>
        <newObject name="Preselector#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="override-logger">false</boolean>
        <boolean name="recheck-scope">true</boolean>
        <boolean name="block-all">false</boolean>
        <string name="block-by-regexp"></string>
        <string name="allow-by-regexp"></string>
      </newObject>
      <newObject name="Preprocessor" class="org.archive.crawler.prefetch.PreconditionEnforcer">
        <boolean name="enabled">true</boolean>
        <newObject name="Preprocessor#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <integer name="ip-validity-duration-seconds">86400</integer>
        <integer name="robot-validity-duration-seconds">86400</integer>
        <boolean name="calculate-robots-only">false</boolean>
      </newObject>
    </map>

 

   3.7: Fetch chain:

    <map name="fetch-processors">
      <newObject name="DNS" class="org.archive.crawler.fetcher.FetchDNS">
        <boolean name="enabled">true</boolean>
        <newObject name="DNS#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="accept-non-dns-resolves">false</boolean>
        <boolean name="digest-content">true</boolean>
        <string name="digest-algorithm">sha1</string>
      </newObject>
      <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP">
        <boolean name="enabled">true</boolean>
        <newObject name="HTTP#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <newObject name="midfetch-decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <integer name="timeout-seconds">1200</integer>
        <integer name="sotimeout-ms">20000</integer>
        <integer name="fetch-bandwidth">0</integer>
        <long name="max-length-bytes">0</long>
        <boolean name="ignore-cookies">false</boolean>
        <boolean name="use-bdb-for-cookies">true</boolean>
        <string name="load-cookies-from-file"></string>
        <string name="save-cookies-to-file"></string>
        <string name="trust-level">open</string>
        <stringList name="accept-headers">
        </stringList>
        <string name="http-proxy-host"></string>
        <string name="http-proxy-port"></string>
        <string name="default-encoding">GB2312</string>
        <boolean name="digest-content">true</boolean>
        <string name="digest-algorithm">sha1</string>
        <boolean name="send-if-modified-since">true</boolean>
        <boolean name="send-if-none-match">true</boolean>
        <boolean name="send-connection-close">true</boolean>
        <boolean name="send-referer">true</boolean>
        <boolean name="send-range">false</boolean>
        <string name="http-bind-address"></string>
      </newObject>
    </map>

 

   3.8: Extractor chain

    <map name="extract-processors">
      <newObject name="ExtractorHTTP" class="org.archive.crawler.extractor.ExtractorHTTP">
        <boolean name="enabled">true</boolean>
        <newObject name="ExtractorHTTP#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
      </newObject>
      <newObject name="ExtractorHTML" class="org.archive.crawler.extractor.ExtractorHTML">
        <boolean name="enabled">true</boolean>
        <newObject name="ExtractorHTML#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="extract-javascript">true</boolean>
        <boolean name="treat-frames-as-embed-links">true</boolean>
        <boolean name="ignore-form-action-urls">true</boolean>
        <boolean name="extract-only-form-gets">true</boolean>
        <boolean name="extract-value-attributes">true</boolean>
        <boolean name="ignore-unexpected-html">true</boolean>
      </newObject>
    </map>

 

   3.9: Writer chain

    <map name="write-processors">
      <newObject name="Archiver" class="com.steel.heritrix.extend.MyWriterMirror">
        <boolean name="enabled">true</boolean>
        <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="case-sensitive">true</boolean>
        <stringList name="character-map" />
        <stringList name="content-type-map" />
        <string name="directory-file">index.html</string>
        <string name="dot-begin">%2E</string>
        <string name="dot-end">.</string>
        <stringList name="host-map" />
        <boolean name="host-directory">true</boolean>
        <string name="path">mirror</string>
        <integer name="max-path-length">1023</integer>
        <integer name="max-segment-length">255</integer>
        <boolean name="port-directory">false</boolean>
        <boolean name="suffix-at-end">true</boolean>
        <string name="too-long-directory">LONG</string>
        <stringList name="underscore-set" />
      </newObject>
    </map>

 

   3.10: Post-processing chain, where you can plug in your own scheduler

    <map name="post-processors">
      <newObject name="Updater" class="org.archive.crawler.postprocessor.CrawlStateUpdater">
        <boolean name="enabled">true</boolean>
        <newObject name="Updater#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
      </newObject>
      <newObject name="LinksScoper" class="org.archive.crawler.postprocessor.LinksScoper">
        <boolean name="enabled">true</boolean>
        <newObject name="LinksScoper#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="override-logger">false</boolean>
        <integer name="preference-depth-hops">-1</integer>
        <newObject name="scope-rejected-url-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
      </newObject>
      <newObject name="Scheduler" class="com.steel.heritrix.extend.MyFrontierScheduler">
        <boolean name="enabled">true</boolean>
        <newObject name="Scheduler#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
      </newObject>
    </map>

 

   3.11: Statistics tracking

    <map name="loggers">
      <newObject name="crawl-statistics" class="org.archive.crawler.admin.StatisticsTracker">
        <integer name="interval-seconds">20</integer>
      </newObject>
    </map>

  
