Hadoop 0.20.2
1. Using the streaming command (excerpted from the Hadoop development documentation):
In addition to plain-text output, you can also generate gzip-compressed output by setting the following options on the streaming job: '-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec'.
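For example, a minimal identity streaming job with gzip output might look like the sketch below. The streaming jar location under contrib/ and the use of /bin/cat as mapper and reducer are assumptions for illustration only, not taken from the original:

$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input /temp/in \
    -output /temp/out-streaming \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec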
2. Using a MapReduce program:
Input files:
$ bin/hadoop fs -ls /temp/in
Found 2 items
-rw-r--r--   1 Administrator supergroup         52 2012-02-09 10:02 /temp/in/t1.txt
-rw-r--r--   1 Administrator supergroup         35 2012-02-09 10:02 /temp/in/t2.txt
Test code:
package com.hadoop.test;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ZipFile {

    // Identity-style mapper: emits each input line unchanged, with no value.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            output.collect(value, null);
        }
    }

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(com.hadoop.test.ZipFile.class);

        // Input and output are directories, not files.
        FileInputFormat.setInputPaths(conf, new Path("/temp/in"));
        FileOutputFormat.setOutputPath(conf, new Path("/temp/out-" + System.currentTimeMillis()));

        conf.setMapperClass(Map.class);

        // Compress the job output with gzip.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, org.apache.hadoop.io.compress.GzipCodec.class);

        // Map-only job: no reducer, so map output goes straight to the output format.
        conf.setNumReduceTasks(0);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
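To run the job, package the class into a jar and submit it with bin/hadoop jar; the jar name below is only a placeholder:

$ bin/hadoop jar zipfile.jar com.hadoop.test.ZipFile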
Output files:
$ bin/hadoop fs -ls /temp/out-1328857284203
Found 2 items
-rw-r--r--   3 Administrator supergroup         67 2012-02-10 15:01 /temp/out-1328857284203/part-00000.gz
-rw-r--r--   3 Administrator supergroup         53 2012-02-10 15:01 /temp/out-1328857284203/part-00001.gz
Using the command:
$ bin/hadoop fs -get /temp/out-1328857284203/part-00000.gz out1.gz
the compressed output can be downloaded to the local machine as a gzip file; decompressing and opening it shows contents identical to the original input.
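To double-check, the file can be decompressed locally with standard gzip tools, or read directly from HDFS (hadoop fs -text should decompress recognized codecs such as gzip on the fly); for example:

$ gunzip -c out1.gz
$ bin/hadoop fs -text /temp/out-1328857284203/part-00000.gz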