Hadoop

Hadoop简介

Hadoop是一个由Apache基金会所开发的分布式系统基础架构。用户可以在不了解分布式底层细节的情况下,开发分布式程序。充分利用集群的威力进行高速运算和存储。Hadoop实现了一个分布式文件系统(Hadoop Distributed File System),简称HDFS。HDFS有高容错性的特点,并且设计用来部署在低廉的(low-cost)硬件上;而且它提供高吞吐量(high throughput)来访问应用程序的数据,适合那些有着超大数据集(large data set)的应用程序。HDFS放宽了(relax)POSIX的要求,可以以流的形式访问(streaming access)文件系统中的数据。Hadoop的框架最核心的设计就是:HDFS和MapReduce。HDFS为海量的数据提供了存储,而MapReduce则为海量的数据提供了计算.

安装Hadoop

步骤一:环境准备
1)安装java环境

[root@nn01 ~]# yum -y install java-1.8.0-openjdk-devel
[root@nn01 ~]# java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)
[root@nn01 ~]# jps
1235 Jps

2)安装hadoop

[root@nn01 ~]# cd hadoop/
[root@nn01 hadoop]# ls
hadoop-2.7.7.tar.gz  
[root@nn01 hadoop]# tar -xf hadoop-2.7.7.tar.gz 
[root@nn01 hadoop]# mv hadoop-2.7.7 /usr/local/hadoop
[root@nn01 hadoop]# cd /usr/local/hadoop
[root@nn01 hadoop]# ls
bin  include  libexec      NOTICE.txt  sbin
etc  lib      LICENSE.txt  README.txt  share
[root@nn01 hadoop]# ./bin/hadoop   //报错,JAVA_HOME没有找到
Error: JAVA_HOME is not set and could not be found.
[root@nn01 hadoop]#

3)解决报错问题

[root@nn01 hadoop]# rpm -ql java-1.8.0-openjdk
[root@nn01 hadoop]# cd ./etc/hadoop/
[root@nn01 hadoop]# vim hadoop-env.sh
25 export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64    /jre"
33 export HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
[root@nn01 ~]# cd /usr/local/hadoop/
[root@nn01 hadoop]# ./bin/hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
  credential           interact with credential providers
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings
Most commands print help when invoked w/o parameters.
[root@nn01 hadoop]# mkdir /usr/local/hadoop/input
[root@nn01 hadoop]# ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  input  README.txt  sbin  share
[root@nn01 hadoop]# cp *.txt /usr/local/hadoop/input
[root@nn01 hadoop]# ./bin/hadoop jar  \
 share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar  wordcount input output        //wordcount为参数 统计input这个文件夹,存到output这个文件里面(这个文件不能存在,要是存在会报错,是为了防止数据覆盖)
[root@nn01 hadoop]#  cat   output/part-r-00000    //查看

安装配置Hadoop

准备四台虚拟机,由于之前已经准备过一台,所以只需再准备三台新的虚拟机即可,安装hadoop,使所有节点可以ping通,配置SSH信任关系,如图-1所示:
Hadoop_第1张图片
步骤一:环境准备
1)三台机器配置主机名为node1、node2、node3,配置ip地址
2)编辑/etc/hosts(四台主机同样操作,以nn01为例)

[root@nn01 ~]# vim /etc/hosts
192.168.1.60  nn01
192.168.1.61  node1
192.168.1.62  node2
192.168.1.63  node3

3)安装java环境,在node1,node2,node3上面操作(以node1为例)

[root@node1 ~]# yum -y install java-1.8.0-openjdk-devel

4)布置SSH信任关系

[root@nn01 ~]# vim /etc/ssh/ssh_config    //第一次登陆不需要输入yes
Host *
        GSSAPIAuthentication yes
        StrictHostKeyChecking no
[root@nn01 .ssh]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:Ucl8OCezw92aArY5+zPtOrJ9ol1ojRE3EAZ1mgndYQM root@nn01
The key's randomart image is:
+---[RSA 2048]----+
|        o*E*=.   |
|         +XB+.   |
|        ..=Oo.   |
|        o.+o...  |
|       .S+.. o   |
|        + .=o    |
|         o+oo    |
|        o+=.o    |
|        o==O.    |
+----[SHA256]-----+
[root@nn01 .ssh]# for i in 61 62 63 64 ; do  ssh-copy-id  192.168.1.$i; done   
//部署公钥给nn01,node1,node2,node3

5)测试信任关系

[root@nn01 .ssh]# ssh node1
Last login: Fri Sep  7 16:52:00 2018 from 192.168.1.60
[root@node1 ~]# exit
logout
Connection to node1 closed.
[root@nn01 .ssh]# ssh node2
Last login: Fri Sep  7 16:52:05 2018 from 192.168.1.60
[root@node2 ~]# exit
logout
Connection to node2 closed.
[root@nn01 .ssh]# ssh node3

步骤二:配置hadoop
1)修改slaves文件

[root@nn01 ~]# cd  /usr/local/hadoop/etc/hadoop
[root@nn01 hadoop]# vim slaves
node1
node2
node3

2)hadoop的核心配置文件core-site

[root@nn01 hadoop]# vim core-site.xml
<configuration>
<property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/hadoop</value>
    </property>
</configuration>
[root@nn01 hadoop]# mkdir /var/hadoop        //hadoop的数据根目录

3)配置hdfs-site文件

[root@nn01 hadoop]# vim hdfs-site.xml
<configuration>
 <property>
        <name>dfs.namenode.http-address</name>
        <value>nn01:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>nn01:50090</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>

4)同步配置到node1,node2,node3

 [root@nn01 hadoop]# for i in 62 63 64 ; do rsync -aSH --delete /usr/local/hadoop/ 
\   192.168.1.$i:/usr/local/hadoop/  -e 'ssh' & done
[1] 23260
[2] 23261
[3] 23262

5)查看是否同步成功

[root@nn01 hadoop]# ssh node1 ls /usr/local/hadoop/
bin
etc
include
lib
libexec
LICENSE.txt
NOTICE.txt
output
README.txt
sbin
share
input
[root@nn01 hadoop]# ssh node2 ls /usr/local/hadoop/
bin
etc
include
lib
libexec
LICENSE.txt
NOTICE.txt
output
README.txt
sbin
share
input
[root@nn01 hadoop]# ssh node3 ls /usr/local/hadoop/
bin
etc
include
lib
libexec
LICENSE.txt
NOTICE.txt
output
README.txt
sbin
share
input

步骤三:格式化

[root@nn01 hadoop]# cd /usr/local/hadoop/
[root@nn01 hadoop]# ./bin/hdfs namenode -format         //格式化 namenode
[root@nn01 hadoop]# ./sbin/start-dfs.sh        //启动
[root@nn01 hadoop]# jps        //验证角色
23408 NameNode
23700 Jps
23591 SecondaryNameNode
[root@nn01 hadoop]# ./bin/hdfs dfsadmin -report        //查看集群是否组建成功
Live datanodes (3):        //有三个角色成功

部署Hadoop

在准备好的环境下给master (nn01)主机添加ResourceManager的角色,在node1,node2,node3上面添加NodeManager的角色,如图所示:
Hadoop_第2张图片

步骤一:部署hadoop
1)配置mapred-site(nn01上面操作)

[root@nn01 ~]# cd /usr/local/hadoop/etc/hadoop/
[root@nn01 hadoop]# mv mapred-site.xml.template mapred-site.xml
[root@nn01 hadoop]# vim mapred-site.xml
<configuration>
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

2)配置yarn-site(nn01上面操作)

[root@nn01 hadoop]# vim yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>nn01</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

3)同步配置(nn01上面操作)

[root@nn01 hadoop]# for i in {62..64}; do rsync -aSH --delete /usr/local/hadoop/ 192.168.1.$i:/usr/local/hadoop/  -e 'ssh' & done
[1] 712
[2] 713
[3] 714

4)验证配置(nn01上面操作)

[root@nn01 hadoop]# cd /usr/local/hadoop
[root@nn01 hadoop]# ./sbin/start-dfs.sh
Starting namenodes on [nn01]
nn01: namenode running as process 23408. Stop it first.
node1: datanode running as process 22409. Stop it first.
node2: datanode running as process 22367. Stop it first.
node3: datanode running as process 22356. Stop it first.
Starting secondary namenodes [nn01]
nn01: secondarynamenode running as process 23591. Stop it first.
[root@nn01 hadoop]# ./sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-root-resourcemanager-nn01.out
node2: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-node2.out
node3: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-node3.out
node1: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-node1.out
[root@nn01 hadoop]# jps    //nn01查看有ResourceManager
23408 NameNode
1043 ResourceManager
1302 Jps
23591 SecondaryNameNode
[root@nn01 hadoop]# ssh node1 jps        //node1查看有NodeManager
25777 Jps
22409 DataNode
25673 NodeManager
[root@nn01 hadoop]# ssh node2 jps        //node1查看有NodeManager
25729 Jps
25625 NodeManager
22367 DataNode
[root@nn01 hadoop]# ssh node3 jps        //node1查看有NodeManager
22356 DataNode
25620 NodeManager
25724 Jps

5)web访问hadoop

http://192.168.1.60:50070/            //--namenode web页面(nn01)
http://192.168.1.60:50090/        //--secondory namenode web页面(nn01)
http://192.168.1.61:50075/        //--datanode web页面(node1,node2,node3)
http://192.168.1.60:8088/        //--resourcemanager web页面(nn01)
http://192.168.1.61:8042/        //--nodemanager web页面(node1,node2,node3)

Hadoop词频统计

步骤一:词频统计

[root@nn01 hadoop]# ./bin/hadoop fs -ls /        //查看集群文件系统的根,没有内容
[root@nn01 hadoop]# ./bin/hadoop fs -mkdir  /aaa        
//在集群文件系统下创建aaa目录
[root@nn01 hadoop]# ./bin/hadoop fs -ls /        //再次查看,有刚创建的aaa目录
Found 1 items
drwxr-xr-x   - root supergroup          0 2018-09-10 09:56 /aaa
[root@nn01 hadoop]#  ./bin/hadoop fs -touchz  /fa    //在集群文件系统下创建fa文件
[root@nn01 hadoop]# ./bin/hadoop fs -put *.txt /aaa     
//上传*.txt到集群文件系统下的aaa目录
[root@nn01 hadoop]#  ./bin/hadoop fs -ls /aaa    //查看
Found 3 items
-rw-r--r--   2 root supergroup      86424 2018-09-10 09:58 /aaa/LICENSE.txt
-rw-r--r--   2 root supergroup      14978 2018-09-10 09:58 /aaa/NOTICE.txt
-rw-r--r--   2 root supergroup       1366 2018-09-10 09:58 /aaa/README.txt
[root@nn01 hadoop]# ./bin/hadoop fs -get  /aaa  //下载集群文件系统的aaa目录
[root@nn01 hadoop]# ./bin/hadoop jar  \
 share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar  wordcount /aaa /bbb    //hadoop集群分析大数据,hadoop集群/aaa里的数据存到hadoop集群/bbb下
[root@nn01 hadoop]# ./bin/hadoop fs -cat /bbb/*        //查看集群里的数据

节点管理

步骤一:增加节点
1)增加一个新的节点node4

[root@hadoop5 ~]# echo node4 > /etc/hostname     //更改主机名为node4
[root@hadoop5 ~]# hostname node4
[root@node4 ~]# yum -y install java-1.8.0-openjdk-devel
[root@node4 ~]# mkdir /var/hadoop
[root@nn01 .ssh]# ssh-copy-id 192.168.1.64
[root@nn01 .ssh]# vim /etc/hosts
192.168.1.60  nn01
192.168.1.61  node1
192.168.1.62  node2
192.168.1.63  node3
192.168.1.64  node4
[root@nn01 .ssh]# scp /etc/hosts 192.168.1.64:/etc/
[root@nn01 ~]# cd /usr/local/hadoop/
[root@nn01 hadoop]# vim ./etc/hadoop/slaves
node1
node2
node3
node4
[root@nn01 hadoop]# for i in {61..64}; do rsync -aSH --delete /usr/local/hadoop/
\ 192.168.1.$i:/usr/local/hadoop/  -e 'ssh' & done        //同步配置
[1] 1841
[2] 1842
[3] 1843
[4] 1844
[root@node4 ~]# cd /usr/local/hadoop/
[root@node4 hadoop]# ./sbin/hadoop-daemon.sh start datanode  //启动

2)查看状态

[root@node4 hadoop]# jps
24439 Jps
24351 DataNode

3)设置同步带宽

[root@node4 hadoop]# ./bin/hdfs dfsadmin -setBalancerBandwidth 60000000
Balancer bandwidth is set to 60000000
[root@node4 hadoop]# ./sbin/start-balancer.sh

4)删除节点

[root@nn01 hadoop]# vim /usr/local/hadoop/etc/hadoop/slaves        
//去掉之前添加的node4
node1
node2
node3
[root@nn01 hadoop]# vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml        
//在此配置文件里面加入下面四行
<property>                                      
    <name>dfs.hosts.exclude</name>
    <value>/usr/local/hadoop/etc/hadoop/exclude</value>
</property>
[root@nn01 hadoop]# vim /usr/local/hadoop/etc/hadoop/exclude
node4

5)导出数据

[root@nn01 hadoop]# ./bin/hdfs dfsadmin -refreshNodes
Refresh nodes successful
[root@nn01 hadoop]# ./bin/hdfs dfsadmin -report  //查看node4显示Decommissioned
Name: 192.168.1.64:50010 (node4)
Hostname: node4
Decommission Status : Decommissioned
Configured Capacity: 2135949312 (1.99 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 1861509120 (1.73 GB)
DFS Remaining: 274436096 (261.72 MB)
DFS Used%: 0.00%
DFS Remaining%: 12.85%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 05 17:17:09 CST 2019
[root@node4 hadoop]# ./sbin/hadoop-daemon.sh stop datanode    //停止datanode
stopping datanode
[root@node4 hadoop]# ./sbin/yarn-daemon.sh start nodemanager             
//yarn 增加 nodemanager
[root@node4 hadoop]# ./sbin/yarn-daemon.sh stop  nodemanager  //停止nodemanager
stopping nodemanager
[root@node4 hadoop]# ./bin/yarn node -list        
//yarn 查看节点状态,还是有node4节点,要过一段时间才会消失
Total Nodes:4
         Node-Id         Node-State    Node-Http-Address    Number-of-Running-Containers
     node3:34628            RUNNING           node3:8042                               0
     node2:36300            RUNNING           node2:8042                               0
     node4:42459            RUNNING           node4:8042                               0
     node1:39196            RUNNING           node1:8042                               0

NFS配置

步骤一:基础准备
1)更改主机名,配置/etc/hosts(/etc/hosts在nn01和nfsgw上面配置)
Hadoop_第3张图片

[root@localhost ~]# echo nfsgw > /etc/hostname 
[root@localhost ~]# hostname nfsgw
[root@nn01 hadoop]# vim /etc/hosts
192.168.1.60  nn01
192.168.1.61  node1
192.168.1.62  node2
192.168.1.63  node3
192.168.1.64  node4
192.168.1.65  nfsgw

2)创建代理用户(nn01和nfsgw上面操作),以nn01为例子

[root@nn01 hadoop]# groupadd -g 800 nfsuser
[root@nn01 hadoop]# useradd -u 800 -g 800 -r -d /var/hadoop nfsuser

3)配置core-site.xml

[root@nn01 hadoop]# ./sbin/stop-all.sh   //停止所有服务
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [nn01]
nn01: stopping namenode
node2: stopping datanode
node4: no datanode to stop
node3: stopping datanode
node1: stopping datanode
Stopping secondary namenodes [nn01]
nn01: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
node2: stopping nodemanager
node3: stopping nodemanager
node4: no nodemanager to stop
node1: stopping nodemanager
...
[root@nn01 hadoop]# cd etc/hadoop
[root@nn01 hadoop]# >exclude
[root@nn01 hadoop]# vim core-site.xml
    <property>
        <name>hadoop.proxyuser.nfsuser.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.nfsuser.hosts</name>
        <value>*</value>
    </property>

4)同步配置到node1,node2,node3

[root@nn01 hadoop]# for i in {61..63}; do rsync -aSH --delete /usr/local/hadoop/ 192.168.1.$i:/usr/local/hadoop/  -e 'ssh' & done
[4] 2722
[5] 2723
[6] 2724

5)启动集群

[root@nn01 hadoop]# /usr/local/hadoop/sbin/start-dfs.sh

6)查看状态

[root@nn01 hadoop]# /usr/local/hadoop/bin/hdfs  dfsadmin -report

步骤二:NFSGW配置
1)安装java-1.8.0-openjdk-devel和rsync

[root@nfsgw ~]# yum -y install java-1.8.0-openjdk-devel
[root@nn01 hadoop]# rsync -avSH --delete \ 
/usr/local/hadoop/ 192.168.1.65:/usr/local/hadoop/  -e 'ssh'

2)创建数据根目录 /var/hadoop(在NFSGW主机上面操作)

[root@nfsgw ~]# mkdir /var/hadoop

3)创建转储目录,并给用户nfs 赋权

[root@nfsgw ~]# mkdir /var/nfstmp
[root@nfsgw ~]# chown nfsuser:nfsuser /var/nfstmp

4)给/usr/local/hadoop/logs赋权(在NFSGW主机上面操作)

[root@nfsgw ~]# setfacl -m user:nfsuser:rwx /usr/local/hadoop/logs
[root@nfsgw ~]# vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
    <property>
        <name>nfs.exports.allowed.hosts</name>
        <value>* rw</value>
    </property>
    <property>
        <name>nfs.dump.dir</name>
        <value>/var/nfstmp</value>
    </property>

5)可以创建和删除即可

[root@nfsgw ~]# su - nfs
[nfs@nfsgw ~]$ cd /var/nfstmp/
[nfs@nfsgw nfstmp]$ touch 1
[nfs@nfsgw nfstmp]$ ls
1
[nfs@nfsgw nfstmp]$ rm -rf 1
[nfs@nfsgw nfstmp]$ ls
[nfs@nfsgw nfstmp]$ cd /usr/local/hadoop/logs/
[nfs@nfsgw logs]$ touch 1
[nfs@nfsgw logs]$ ls
1 hadoop-root-secondarynamenode-nn01.log    yarn-root-resourcemanager-nn01.log
hadoop-root-namenode-nn01.log hadoop-root-secondarynamenode-nn01.out    yarn-root-resourcemanager-nn01.out
hadoop-root-namenode-nn01.out    hadoop-root-secondarynamenode-nn01.out.1
hadoop-root-namenode-nn01.out.1  SecurityAuth-root.audit
[nfs@nfsgw logs]$ rm -rf 1
[nfs@nfsgw logs]$ ls

6)启动服务

[root@nfsgw ~]# /usr/local/hadoop/sbin/hadoop-daemon.sh --script ./bin/hdfs start portmap        //portmap服务只能用root用户启动
starting portmap, logging to /usr/local/hadoop/logs/hadoop-root-portmap-nfsgw.out
[root@nfsgw ~]# jps
23714 Jps
23670 Portmap
[root@nfsgw ~]# su - nfs
Last login: Mon Sep 10 12:31:58 CST 2018 on pts/0
[nfs@nfsgw ~]$ cd /usr/local/hadoop/
[nfs@nfsgw hadoop]$ ./sbin/hadoop-daemon.sh  --script ./bin/hdfs start nfs3  
//nfs3只能用代理用户启动
starting nfs3, logging to /usr/local/hadoop/logs/hadoop-nfs-nfs3-nfsgw.out
[nfs@nfsgw hadoop]$ jps                    
1362 Jps
1309 Nfs3 
[root@nfsgw hadoop]# jps            //root用户执行可以看到portmap和nfs3
1216 Portmap
1309 Nfs3
1374 Jps

7)实现客户端挂载(客户端可以用node4这台主机)

[root@node4 ~]# rm -rf /usr/local/hadoop
[root@node4 ~]# yum -y install nfs-utils
[root@node4 ~]# mount -t nfs -o \
vers=3,proto=tcp,nolock,noatime,sync,noacl 192.168.1.64:/  /mnt/  //挂载
[root@node4 ~]# cd /mnt/
[root@node4 mnt]# ls
aaa  bbb  fa  system  tmp
[root@node4 mnt]# touch a
[root@node4 mnt]# ls
a  aaa  bbb  fa  system  tmp
[root@node4 mnt]# rm -rf a
[root@node4 mnt]# ls
aaa  bbb  fa  system  tmp

8)实现开机自动挂载

[root@node4 ~]# vim /etc/fstab
192.168.1.64:/  /mnt/ nfs  vers=3,proto=tcp,nolock,noatime,sync,noacl,_netdev 0 0 
[root@node4 ~]# mount -a
[root@node4 ~]# df -h
192.168.1.26:/   64G  6.2G   58G  10% /mnt
[root@node4 ~]# rpcinfo -p 192.168.1.64
   program vers proto   port  service
    100005    3   udp   4242  mountd
    100005    1   tcp   4242  mountd
    100000    2   udp    111  portmapper
    100000    2   tcp    111  portmapper
    100005    3   tcp   4242  mountd
    100005    2   tcp   4242  mountd
    100003    3   tcp   2049  nfs
    100005    2   udp   4242  mountd
    100005    1   udp   4242  mountd

你可能感兴趣的:(Hadoop)