【2019-04-28】Hadoop Distributed File System

Hadoop comes with HDFS (Hadoop Distributed Filesystem).

The default HDFS block size is 128 MB.
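As a quick check of this default, the standard FileSystem#getDefaultBlockSize(Path) call reports the block size that will be used for new files at a given path. A minimal sketch, not from the original post (args[0] is whatever HDFS URI you pass in):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    // Prints 134217728 (128 MB) unless dfs.blocksize has been overridden
    System.out.println(fs.getDefaultBlockSize(new Path(args[0])));
  }
}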

HDFS has two types of nodes operating in a master-worker pattern: a single namenode (the master) and a number of datanodes (workers). The namenode maintains the filesystem tree and the metadata for all the files and directories in the tree, and it records the datanodes on which the blocks of each file are located. It does not, however, store block locations persistently; they are reconstructed from datanode block reports when the system starts.

If the namenode's metadata is destroyed, the filesystem cannot be recovered, so Hadoop provides two mechanisms for resilience: (1) back up the files that make up the persistent state of the filesystem metadata (the namenode can be configured to write this state to multiple filesystems, typically a local disk plus a remote NFS mount); (2) run a secondary namenode, which keeps a merged copy of the namespace image but cannot itself act as a namenode.

Normally a datanode reads blocks from disk, but for frequently accessed files the relevant blocks may be explicitly cached in the datanode's memory, in an off-heap block cache. Users or applications instruct the namenode which files to cache by adding a cache directive to a cache pool.

Hadoop Filesystems

(Three images omitted; they presumably showed the table of Hadoop filesystem implementations.)

The Java Interface

(1) Reading Data from a Hadoop URL

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// vv URLCat
public class URLCat {

  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }
  
  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
// ^^ URLCat

To make Java recognize Hadoop's filesystem URL schemes, call the setURLStreamHandlerFactory() method on java.net.URL with an instance of FsUrlStreamHandlerFactory. This method can be called at most once per JVM, so it is typically executed in a static block, as above.
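Assuming the class and the Hadoop libraries are on the classpath, a run might look like the following (host and path are placeholders):

% hadoop URLCat hdfs://localhost/user/tom/quangle.txt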
(2) Reading Data Using the FileSystem API
A file in a Hadoop filesystem is represented by a Hadoop Path object. FileSystem is the general filesystem API. A Configuration object encapsulates a client's or server's configuration, which is set using configuration files read from the classpath, such as core-site.xml.

// cc FileSystemCat Displays files from a Hadoop filesystem on standard output by using the FileSystem directly
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// vv FileSystemCat
public class FileSystemCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
// ^^ FileSystemCat
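FileSystem.get() has three public overloads; the sketch below shows all of them (an illustration, not from the original post; the hdfs://localhost URI and the user name "tom" are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class GetFileSystems {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Uses the default filesystem (fs.defaultFS in core-site.xml)
    FileSystem fs1 = FileSystem.get(conf);
    // Uses the filesystem given by the URI's scheme and authority
    FileSystem fs2 = FileSystem.get(URI.create("hdfs://localhost/"), conf);
    // As above, but retrieves the filesystem as the given user
    FileSystem fs3 = FileSystem.get(URI.create("hdfs://localhost/"), conf, "tom");
    System.out.println(fs1 + "\n" + fs2 + "\n" + fs3);
  }
}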

FSDataInputStream
The open() method of FileSystem actually returns an FSDataInputStream rather than a standard java.io class. FSDataInputStream supports random access through its seek() method; the following program uses seek() to display a file from a Hadoop filesystem on standard output twice:

// cc FileSystemDoubleCat Displays files from a Hadoop filesystem on standard output twice, by using seek
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// vv FileSystemDoubleCat
public class FileSystemDoubleCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
// ^^ FileSystemDoubleCat
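FSDataInputStream also implements the standard PositionedReadable interface, whose read(position, buffer, offset, length) method reads from a given offset without moving the current stream position. A minimal sketch of that API, not from the original post (the offset of 16 bytes is arbitrary):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      byte[] buf = new byte[100];
      // Read up to 100 bytes starting at byte 16, leaving the stream's
      // own position untouched
      int n = in.read(16L, buf, 0, buf.length);
      System.out.write(buf, 0, Math.max(n, 0));
    }
  }
}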

(3) Writing Data
FileSystem has a number of methods for creating a file: (i) the simplest, create(), takes a Path object for the file to be created and returns an output stream to write to; (ii) an overloaded version also takes a Progressable, a callback interface that is notified of the progress of the data being written; (iii) append() adds data to the end of an existing file (see the sketch after the next example).

// cc FileCopyWithProgress Copies a local file to a Hadoop filesystem, and shows progress
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

// vv FileCopyWithProgress
public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    });
    
    IOUtils.copyBytes(in, out, 4096, true);
  }
}
// ^^ FileCopyWithProgress
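append() is an optional operation: HDFS supports it, but not every Hadoop filesystem does. A minimal sketch, assuming the target filesystem supports append and args[0] points at an existing file:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendToFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    // Fails on filesystems that do not support append
    try (FSDataOutputStream out = fs.append(new Path(args[0]))) {
      out.write("appended line\n".getBytes("UTF-8"));
    }
  }
}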

(4) Directories and Querying the Filesystem
FileSystem provides a mkdirs() method for creating directories, although create() will create any missing parent directories automatically. To query file metadata, the following program shows the file statuses for a collection of paths in a Hadoop filesystem:

// cc ListStatus Shows the file statuses for a collection of paths in a Hadoop filesystem 
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// vv ListStatus
public class ListStatus {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    
    Path[] paths = new Path[args.length];
    for (int i = 0; i < paths.length; i++) {
      paths[i] = new Path(args[i]);
    }
    
    FileStatus[] status = fs.listStatus(paths);
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
      System.out.println(p);
    }
  }
}
// ^^ ListStatus
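With an HDFS daemon running, an invocation might look like this (placeholder URIs), printing the entries directly under each given path:

% hadoop ListStatus hdfs://localhost/ hdfs://localhost/user/tom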

A general-purpose PathFilter that matches paths against a regular expression, with a flag controlling whether matches are included or excluded:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexPathFilter implements PathFilter {
  
  private final String regex;
  private final boolean include;

  public RegexPathFilter(String regex) {
    this(regex, true);
  }
  
  public RegexPathFilter(String regex, boolean include) {
    this.regex = regex;
    this.include = include;
  }

  public boolean accept(Path path) {
    return (path.toString().matches(regex)) ? include : !include;
  }

}

And a PathFilter used to exclude paths that match a regular expression:

// cc RegexExcludePathFilter A PathFilter for excluding paths that match a regular expression
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// vv RegexExcludePathFilter
public class RegexExcludePathFilter implements PathFilter {
  
  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  public boolean accept(Path path) {
    return !path.toString().matches(regex);
  }
}
// ^^ RegexExcludePathFilter
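Path filters are most useful combined with glob patterns: globStatus() first expands the glob, then applies the filter to refine the matches. A minimal sketch, not from the original post (args[0] is a glob such as /2007/*/* and args[1] a regex such as ^.*/2007/12/31$, following the book's logfile-layout example):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class GlobWithFilter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    // Expand the glob, then drop any path the filter rejects
    FileStatus[] matches = fs.globStatus(new Path(args[0]),
        new RegexExcludePathFilter(args[1]));
    for (Path p : FileUtil.stat2Paths(matches)) {
      System.out.println(p);
    }
  }
}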

Data Flow

(Figure omitted: a client reading data from HDFS.)

(Figure omitted: a client writing data to HDFS.)
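In outline, since the original figures are missing: on a read, the client asks the namenode for the datanode locations of the first few blocks, then streams each block directly from the closest datanode, going back to the namenode for the locations of later blocks as needed. On a write, the namenode allocates new blocks, and the client pipelines the data through the chain of datanodes chosen to hold the replicas, with acknowledgments flowing back up the pipeline.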
