Hadoop Effective Space = (MaxAllocFactor * DiskSize * (#Disk - RaidDisks)) / ReplicationFactor
MaxAllocFactor = Maximum recommended allocation; 75% (0.75) for Hadoop
DiskSize = Size of each drive
#Disk = Number of drives
RaidDisks = Number of disks consumed by RAID; for Hadoop this is 0
ReplicationFactor = Number of copies of the data; Hadoop recommends three copies, so the replication factor is 3
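The formula above can be sketched as a small Python function. This is only an illustration: the function name, parameter names, and the 12 x 2TB example node are assumptions chosen for the example, and the defaults simply mirror the values listed above.

```python
def hadoop_effective_space(disk_size_tb, num_disks,
                           max_alloc_factor=0.75,
                           raid_disks=0,
                           replication_factor=3):
    """Effective space per the formula above, in the same unit as disk_size_tb.

    All names here are hypothetical; they only mirror the formula's terms.
    """
    if replication_factor <= 0:
        raise ValueError("replication_factor must be positive")
    if raid_disks > num_disks:
        raise ValueError("raid_disks cannot exceed num_disks")
    # (MaxAllocFactor * DiskSize * (#Disk - RaidDisks)) / ReplicationFactor
    return (max_alloc_factor * disk_size_tb * (num_disks - raid_disks)) / replication_factor


# Assumed example: a node with twelve 2TB drives and no RAID.
print(hadoop_effective_space(2, 12))  # 6.0
```

With the default values, only a quarter of the raw capacity (0.75 / 3) ends up as effective space.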
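Calculating the usable space of a node:
Assume a replication factor of 3 and that temporary space takes up 25% of the raw disk space. Under these assumptions, the number of hosts needed to process 10TB of data on a cluster whose hosts each have 2TB of disk space is calculated as follows:
1. Divide the host's total storage by the replication factor
2TB / 3 ≈ 666 GB
2. From that, subtract the 25% reserved for temporary data storage
666 GB * 0.75 ≈ 500 GB
3. So each node with 2TB of disk storage has only about 500GB of usable space
4. Divide the dataset size by this value; the result is the number of nodes required
10TB / 500 GB = 20
So a cluster that processes 10TB of data needs at least 20 nodes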
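The four steps above can be sketched in Python as follows. The function names are hypothetical, and exact fractions are used (an implementation choice, not part of the original calculation) so that the intermediate rounding to 666 GB does not affect the result.

```python
from fractions import Fraction
import math


def usable_tb_per_node(raw_tb, replication=3, temp_fraction=Fraction(1, 4)):
    """Steps 1-3: divide raw storage by the replication factor,
    then remove the share reserved for temporary data."""
    per_replica = Fraction(raw_tb) / replication
    return per_replica * (1 - Fraction(temp_fraction))


def nodes_needed(dataset_tb, raw_tb, replication=3, temp_fraction=Fraction(1, 4)):
    """Step 4: divide the dataset size by the usable space per node,
    rounding up because a partial node is not possible."""
    usable = usable_tb_per_node(raw_tb, replication, temp_fraction)
    return math.ceil(Fraction(dataset_tb) / usable)


print(usable_tb_per_node(2))  # 1/2  (0.5 TB, i.e. about 500 GB)
print(nodes_needed(10, 2))    # 20
```

Rounding up matters when the dataset size is not an exact multiple of the usable space per node; under the same assumptions, 11TB would need 22 nodes.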
Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
1. 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
2. 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
3. 64-512GB of RAM
4. Bonded Gigabit Ethernet or 10 Gigabit Ethernet (the higher the storage density, the more network throughput is needed)
Here are the recommended specifications for NameNode/JobTracker/Standby NameNode nodes. The drive count will fluctuate depending on the amount of redundancy:
1. 4-6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for the Journal node)
2. 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
3. 64-128GB of RAM
4. Bonded Gigabit Ethernet or 10 Gigabit Ethernet
Below is a list of various hardware configurations for different workloads, including our original “balanced” recommendation:
1. Light Processing Configuration (1U/machine): Two hex-core CPUs, 24-64GB memory, and 8 disk drives (1TB or 2TB)
2. Balanced Compute Configuration (1U/machine): Two hex-core CPUs, 48-128GB memory, and 12-16 disk drives (1TB or 2TB) directly attached using the motherboard controller. These are often available as twins with two motherboards and 24 drives in a single 2U cabinet.
3. Storage Heavy Configuration (2U/machine): Two hex-core CPUs, 48-96GB memory, and 16-24 disk drives (2TB-4TB). This configuration will cause high network traffic in case of multiple node/rack failures.
4. Compute Intensive Configuration (2U/machine): Two hex-core CPUs, 64-512GB memory, and 4-8 disk drives (1TB or 2TB)