Introduction: Spring for Apache Hadoop provides integration with the Spring Framework to create and run Hadoop MapReduce, Hive, and Pig jobs as well as work with HDFS and HBase. If you have simple needs to work with Hadoop, including basic scheduling, you can add the Spring for Apache Hadoop namespace to your Spring-based project and get going quickly with Hadoop. As the complexity of your Hadoop application increases, you may want to use Spring Batch and Spring Integration to rein in the complexity of developing a large Hadoop application.
1. Using the Spring for Apache Hadoop Namespace
To use the SHDP namespace, one just needs to import it inside the configuration:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <bean id ... >

    <hdp:configuration ...>

</beans>
Notes on the declaration above:
- Spring for Apache Hadoop namespace prefix. Any name can do, but throughout the reference documentation the hdp prefix is used.
- The namespace URI.
- The namespace URI location. Note that even though the location points to an external address (which exists and is valid), Spring will resolve the schema locally as it is included in the Spring for Apache Hadoop library.
- Declaration example for the Hadoop namespace. Notice the prefix usage.
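If most beans in a file come from SHDP, the hadoop namespace can also be made the default namespace so that no prefix is needed on its elements; a minimal sketch (giving the core beans schema an explicit beans prefix instead is our choice here):

<?xml version="1.0" encoding="UTF-8"?>
<beans:beans xmlns="http://www.springframework.org/schema/hadoop"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:beans="http://www.springframework.org/schema/beans"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- no prefix needed: hadoop is now the default namespace -->
    <configuration />

</beans:beans>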
2. Configuring Hadoop
In order to use Hadoop, one first needs to configure it, namely by creating a Configuration object. The configuration holds information about the job tracker, the input/output format, and the various other parameters of the MapReduce job. In its simplest form, the declaration is a one-liner:
<hdp:configuration />
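The element above registers the Configuration under the default bean name (hadoopConfiguration, as noted in section 3). If a different name is needed, a regular Spring id can be assigned; a small sketch, with myConfig being our own choice of name:

<!-- register the Hadoop Configuration under the bean name "myConfig" -->
<hdp:configuration id="myConfig" />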
Using XML resource files:
<hdp:configuration resources="classpath:/custom-site.xml, classpath:/hq-site.xml"/>
Configuring properties inline:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <hdp:configuration>
        fs.default.name=hdfs://localhost:9000
        hadoop.tmp.dir=/tmp/hadoop
        electric=sea
    </hdp:configuration>
</beans>

Using property placeholders to avoid hard-coding (values externalized into a properties file):
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <hdp:configuration>
        fs.default.name=${hd.fs}
        hadoop.tmp.dir=file://${java.io.tmpdir}
        hangar=${number:18}
    </hdp:configuration>

    <context:property-placeholder location="classpath:hadoop.properties" />
</beans>
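The hadoop.properties file referenced above supplies the ${hd.fs} placeholder, while ${number:18} falls back to its default of 18 if no number property is defined. A matching file might look like this (value borrowed from the earlier inline example):

# hadoop.properties
hd.fs=hdfs://localhost:9000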
3. Creating a Hadoop Job
Example (no Configuration specified):
<hdp:job id="mr-job"
    input-path="/input/" output-path="/output/"
    mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
    reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>
Notice that there is no reference to the Hadoop configuration above - that's because, if not specified, the default naming convention (hadoopConfiguration) will be used instead.
Specifying a dedicated properties file (plus inline properties) for the job:
<hdp:job id="mr-job"
    input-path="/input/" output-path="/output/"
    mapper="mapper class" reducer="reducer class"
    jar-by-class="class used for jar detection"
    properties-location="classpath:special-job.properties">
    electric=sea
</hdp:job>
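Both the inline property (electric=sea) and those loaded from properties-location are applied to this job's configuration. A hypothetical special-job.properties could carry standard Hadoop job settings, for example:

# special-job.properties (hypothetical contents)
mapred.reduce.tasks=2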
4. Creating a Hadoop Streaming Job
<hdp:streaming id="streaming-env"
    input-path="/input/" output-path="/output/"
    mapper="${path.cat}" reducer="${path.wc}">
    <hdp:cmd-env>
        EXAMPLE_DIR=/home/example/dictionaries/
        ...
    </hdp:cmd-env>
</hdp:streaming>
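The ${path.cat} and ${path.wc} placeholders are resolved through a property placeholder, exactly as in section 2; in the classic streaming setup they would point at the Unix cat and wc executables. A hypothetical properties file:

# streaming placeholder values (hypothetical)
path.cat=/bin/cat
path.wc=/usr/bin/wc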
5. Running a Hadoop Job
For basic job submission SHDP provides the job-runner element (backed by the JobRunner class), which submits several jobs sequentially:

<hdp:job-runner id="myjob-runner" pre-action="cleanup-script" post-action="export-results" job="myjob" run-at-startup="true"/>

<hdp:job id="myjob"
    input-path="/input/" output-path="/output/"
    mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
    reducer="org.apache.hadoop.examples.WordCount.IntSumReducer" />

Multiple jobs can be specified and even nested if they are not used outside the runner:
<hdp:job-runner id="myjobs-runner" pre-action="cleanup-script" job="myjob1, myjob2" run-at-startup="true"/>
<hdp:job id="myjob1" ... />
<hdp:streaming id="myjob2" ... />

Do note that the runner will not run unless triggered manually or unless run-at-startup is set to true.
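As for manual triggering: JobRunner implements Callable, so the runner can, for example, be fired from Spring's task-scheduling namespace rather than at startup. A sketch, assuming the task schema is imported and using an arbitrary cron expression:

<!-- runner without run-at-startup; the scheduler triggers it instead -->
<hdp:job-runner id="scheduled-runner" job="myjob"/>

<task:scheduled-tasks>
    <!-- invokes the runner's call() method every night at 3 am -->
    <task:scheduled ref="scheduled-runner" method="call" cron="0 0 3 * * ?"/>
</task:scheduled-tasks>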
6. Using the Hadoop Job Tasklet

<hdp:job-tasklet id="hadoop-tasklet" job-ref="mr-job" wait-for-job="true" />

Here wait-for-job is set to true so that the tasklet waits for the job to complete when it executes.
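For context, this tasklet is meant to be wired into a Spring Batch step; a minimal sketch, assuming the batch namespace is imported and a job repository and transaction manager are configured elsewhere:

<batch:job id="hadoop-batch-job">
    <batch:step id="mr-step">
        <!-- delegate to the hadoop-tasklet bean declared above -->
        <batch:tasklet ref="hadoop-tasklet"/>
    </batch:step>
</batch:job>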
7. Running a Hadoop Tool
Running hadoop jar with command-line arguments:

bin/hadoop jar -conf hadoop-site.xml -jt darwin:50020 -D property=value someJar.jar org.foo.SomeTool data/in.txt data/out.txt
Spring for Apache Hadoop runs such a Tool, arguments included, through the tool-runner element:
<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
    property=value
</hdp:tool-runner>
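The runner above picks up the default hadoopConfiguration bean. To our reading of the spring-hadoop schema, runners also accept a configuration-ref attribute pointing at a specific Configuration bean; treat both the attribute and the toolConfig bean name below as assumptions to verify against your XSD version:

<!-- a dedicated configuration for this tool only (bean name "toolConfig" is our own) -->
<hdp:configuration id="toolConfig">
    fs.default.name=${hd.fs}
</hdp:configuration>

<!-- assumption: tool-runner accepts configuration-ref to override the default configuration -->
<hdp:tool-runner id="someToolWithConfig" tool-class="org.foo.SomeTool" configuration-ref="toolConfig" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
</hdp:tool-runner>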