There are a couple of reasons why map-only jobs are efficient.
• Since no reducers are needed, data never has to be transmitted between the map and reduce phases. Most of the map tasks pull data off of their locally attached disks and then write back out to that node.
• Since there are no reducers, both the sort phase and the reduce phase are cut out. This usually doesn’t take very long, but every little bit helps.
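To make a job map-only, the driver simply sets the number of reduce tasks to zero. A minimal driver sketch (GrepMapper and the argument positions are placeholders for whatever filtering mapper the job uses):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "Map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(GrepMapper.class);
        // Zero reducers makes the job map-only: each mapper's output is
        // written straight to HDFS and the sort and reduce phases never run.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}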
Distributed grep
public static class GrepMapper
        extends Mapper<Object, Text, NullWritable, Text> {
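    // The body is missing from this excerpt; the sketch below assumes the
    // regular expression is passed through the job Configuration under a
    // hypothetical key "mapregex".
    private String mapRegex = null;

    protected void setup(Context context) throws IOException,
            InterruptedException {
        mapRegex = context.getConfiguration().get("mapregex");
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the entire line only when it matches the pattern
        if (value.toString().matches(mapRegex)) {
            context.write(NullWritable.get(), value);
        }
    }
}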
Simple Random Sampling
public static class SRSMapper
        extends Mapper<Object, Text, NullWritable, Text> {
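    // The body is missing from this excerpt; the sketch below assumes the
    // sampling percentage is passed through the job Configuration under a
    // hypothetical key "filter_percentage" (e.g. 20 keeps roughly 20% of
    // records). Assumes java.util.Random is imported.
    private Random rands = new Random();
    private double percentage;

    protected void setup(Context context) throws IOException,
            InterruptedException {
        String strPercentage = context.getConfiguration()
                .get("filter_percentage");
        percentage = Double.parseDouble(strPercentage) / 100.0;
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each record independently survives with the configured probability
        if (rands.nextDouble() < percentage) {
            context.write(NullWritable.get(), value);
        }
    }
}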
Bloom Filtering
Bloom filter training.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterDriver {
    public static void main(String[] args) throws Exception {
        // Parse command line arguments
        Path inputFile = new Path(args[0]);
        int numMembers = Integer.parseInt(args[1]);
        float falsePosRate = Float.parseFloat(args[2]);
        Path bfFile = new Path(args[3]);

        // Calculate our vector size and optimal K value based on approximations
        int vectorSize = getOptimalBloomFilterSize(numMembers, falsePosRate);
        int nbHash = getOptimalK(numMembers, vectorSize);

        // Create new Bloom filter
        BloomFilter filter = new BloomFilter(vectorSize, nbHash,
                Hash.MURMUR_HASH);

        System.out.println("Training Bloom filter of size " + vectorSize
                + " with " + nbHash + " hash functions, " + numMembers
                + " approximate number of records, and " + falsePosRate
                + " false positive rate");

        // Open file for read
        String line = null;
        int numElements = 0;
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(inputFile)) {
            BufferedReader rdr = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(fs.open(status.getPath()))));

            System.out.println("Reading " + status.getPath());
            while ((line = rdr.readLine()) != null) {
                filter.add(new Key(line.getBytes()));
                ++numElements;
            }

            rdr.close();
        }

        System.out.println("Trained Bloom filter with " + numElements
                + " entries.");

        System.out.println("Serializing Bloom filter to HDFS at " + bfFile);
        FSDataOutputStream strm = fs.create(bfFile);
        filter.write(strm);

        strm.flush();
        strm.close();

        System.exit(0);
    }
}
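The two sizing helpers called in main() are not shown in this excerpt. Using the standard approximations for a Bloom filter with n members and target false positive rate p, where the optimal bit-vector size is m = -n * ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m / n) * ln 2, they can be sketched as static members of BloomFilterDriver:

public static int getOptimalBloomFilterSize(int numMembers,
        float falsePosRate) {
    // m = -n * ln(p) / (ln 2)^2
    return (int) (-numMembers * (float) Math.log(falsePosRate)
            / Math.pow(Math.log(2), 2));
}

public static int getOptimalK(float numMembers, float vectorSize) {
    // k = (m / n) * ln 2
    return (int) Math.round(vectorSize / numMembers * Math.log(2));
}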
public static class BloomFilteringMapper extends
        Mapper<Object, Text, Text, NullWritable> {
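    // The body is missing from this excerpt. This sketch assumes the trained
    // filter was placed in the DistributedCache by the driver, that each
    // input record is a comment whose words are tested against the filter,
    // and that transformXmlToMap is the same XML helper used by the other
    // examples. Assumes imports of java.net.URI, java.io.*,
    // org.apache.hadoop.filecache.DistributedCache, and
    // org.apache.hadoop.util.bloom.*.
    private BloomFilter filter = new BloomFilter();

    protected void setup(Context context) throws IOException,
            InterruptedException {
        // Deserialize the Bloom filter from the cached file
        URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
        DataInputStream strm = new DataInputStream(new FileInputStream(
                files[0].getPath()));
        filter.readFields(strm);
        strm.close();
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = transformXmlToMap(value.toString());
        StringTokenizer tokenizer = new StringTokenizer(parsed.get("Text"));
        // Output the record if any word in the comment might be in the filter
        while (tokenizer.hasMoreTokens()) {
            if (filter.membershipTest(new Key(tokenizer.nextToken()
                    .getBytes()))) {
                context.write(value, NullWritable.get());
                break;
            }
        }
    }
}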
This Bloom filter was trained with all user IDs that have a reputation of at least 1,500.
public static class BloomFilteringMapper extends
        Mapper<Object, Text, Text, NullWritable> {
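    // Body sketch: setup() deserializes the filter from the DistributedCache
    // exactly as in the previous mapper; only map() differs. Each comment's
    // UserId is tested against the filter of high-reputation user IDs.
    private BloomFilter filter = new BloomFilter();

    protected void setup(Context context) throws IOException,
            InterruptedException {
        URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
        DataInputStream strm = new DataInputStream(new FileInputStream(
                files[0].getPath()));
        filter.readFields(strm);
        strm.close();
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = transformXmlToMap(value.toString());
        String userId = parsed.get("UserId");
        // Keep the comment only if its author may have reputation >= 1,500.
        // False positives are possible; false negatives are not.
        if (filter.membershipTest(new Key(userId.getBytes()))) {
            context.write(value, NullWritable.get());
        }
    }
}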
Top Ten
class mapper:
    setup():
        initialize top ten sorted list

    map(key, record):
        insert record into top ten sorted list
        if length of array is greater-than 10 then
            truncate list to a length of 10

    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list

    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
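In Java, the mapper can keep its local top ten in a TreeMap keyed by reputation. A sketch, assuming the Stack Overflow users XML input and the transformXmlToMap helper used by the reducer below (note that ties on reputation overwrite one another in this simplified form):

public static class TopTenMapper extends
        Mapper<Object, Text, NullWritable, Text> {

    // Holds this map task's local top ten, sorted ascending by reputation
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = transformXmlToMap(value.toString());
        repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")),
                new Text(value));
        // Keep only the ten highest reputations seen so far
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }

    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        // Emit the local top ten under a single null key so that one
        // reducer sees every map task's candidates
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}

The single reducer applies the same TreeMap trick to the combined candidates and writes the final ten records in descending order of reputation: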
public static class TopTenReducer extends
        Reducer<NullWritable, Text, NullWritable, Text> {

    // Maps user reputation to the full record for that user
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    public void reduce(NullWritable key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            Map<String, String> parsed = transformXmlToMap(value.toString());

            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")),
                    new Text(value));

            // If we have more than ten records, remove the one with the
            // lowest rep. As this tree map is sorted in ascending order,
            // the user with the lowest reputation is the first key.
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }

        for (Text t : repToRecordMap.descendingMap().values()) {
            // Output our ten records to the file system with a null key
            context.write(NullWritable.get(), t);
        }
    }
}
The mapper takes each record and extracts the data fields for which we want unique values. In our HTTP logs example, this means extracting the user, the web browser, and the device values. The mapper outputs the record as the key, and null as the value.
public static class DistinctUserMapper extends
        Mapper<Object, Text, Text, NullWritable> {
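    // Body sketch: emits each record's user ID as the key with a null
    // value, assuming the same transformXmlToMap helper used elsewhere.
    private Text outUserId = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = transformXmlToMap(value.toString());
        // The user ID is the field we want the distinct values of
        outUserId.set(parsed.get("UserId"));
        context.write(outUserId, NullWritable.get());
    }
}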
Combiner optimization. A combiner can and should be used in the distinct pattern. Duplicate keys will be removed from each local map’s output, thus reducing the amount of network I/O required. The same code for the reducer can be used in the combiner.
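Because the reducer writes each grouped key exactly once and ignores the values, the same class can serve as both combiner and reducer. A minimal sketch:

public static class DistinctUserReducer extends
        Reducer<Text, NullWritable, Text, NullWritable> {

    public void reduce(Text key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
        // The grouped key is already unique; write it once, ignore values
        context.write(key, NullWritable.get());
    }
}

In the driver, registering it with job.setCombinerClass(DistinctUserReducer.class) as well as job.setReducerClass(DistinctUserReducer.class) enables the local de-duplication.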
Given a sorted linked list, delete all nodes that have duplicate numbers, leaving only distinct numbers from the original list.
For example, given 1->2->3->3->4->4->5, return 1->2->5.
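A standard single-pass solution keeps a dummy head and unlinks every run of equal values (a sketch; ListNode is the usual minimal node type):

class ListNode {
    int val;
    ListNode next;
    ListNode(int val) { this.val = val; }
}

class Solution {
    public ListNode deleteDuplicates(ListNode head) {
        // The dummy head simplifies removing a run that starts at head
        ListNode dummy = new ListNode(0);
        dummy.next = head;
        ListNode prev = dummy;
        while (head != null) {
            if (head.next != null && head.next.val == head.val) {
                // Advance to the last node of this run of equal values...
                while (head.next != null && head.next.val == head.val) {
                    head = head.next;
                }
                // ...then unlink the entire run
                prev.next = head.next;
            } else {
                prev = prev.next;
            }
            head = head.next;
        }
        return dummy.next;
    }
}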