During the sort and merge phases of a job, Hadoop uses a RawComparator on the map output key to compare keys. Built-in Writable classes such as IntWritable ship with byte-level comparator implementations that are fast because they never need to deserialize the byte form of the object back into object form for the comparison. When writing your own Writable, it is tempting to implement only the WritableComparable interface, because that is easy to do without knowing the byte layout of your custom Writable. Unfortunately, comparing objects that way requires deserializing them from their byte form, which makes every comparison expensive.
In this blog post, I'll show you how to implement a custom RawComparator to avoid that inefficiency. For comparison, I'll first implement the WritableComparable interface, then implement a RawComparator for the same custom object.
Suppose you have a custom Writable called Person. To make it comparable, you implement WritableComparable like this:
import org.apache.hadoop.io.WritableComparable;

import java.io.*;

public class Person implements WritableComparable<Person> {
    private String firstName;
    private String lastName;

    public Person() {
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.lastName = in.readUTF();
        this.firstName = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(lastName);
        out.writeUTF(firstName);
    }

    @Override
    public int compareTo(Person other) {
        int cmp = this.lastName.compareTo(other.lastName);
        if (cmp != 0) {
            return cmp;
        }
        return this.firstName.compareTo(other.firstName);
    }

    public void set(String lastName, String firstName) {
        this.lastName = lastName;
        this.firstName = firstName;
    }
}
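As a quick sanity check of the pattern above, the sketch below round-trips a record through the same write/readFields logic and orders records with compareTo. To keep it self-contained (no Hadoop dependency), it uses a minimal stand-in class with the same serialization logic as Person, built only on java.io:

```java
import java.io.*;

public class PersonDemo {
    // Minimal stand-in mirroring Person's write/readFields/compareTo
    static class SimplePerson implements Comparable<SimplePerson> {
        String lastName, firstName;

        SimplePerson(String last, String first) {
            lastName = last;
            firstName = first;
        }

        void write(DataOutput out) throws IOException {
            out.writeUTF(lastName);
            out.writeUTF(firstName);
        }

        static SimplePerson read(DataInput in) throws IOException {
            String last = in.readUTF();
            String first = in.readUTF();
            return new SimplePerson(last, first);
        }

        @Override
        public int compareTo(SimplePerson other) {
            int cmp = lastName.compareTo(other.lastName);
            return cmp != 0 ? cmp : firstName.compareTo(other.firstName);
        }
    }

    public static void main(String[] args) throws IOException {
        SimplePerson p = new SimplePerson("Smith", "Anne");

        // Serialize to bytes, then deserialize and confirm the round trip
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        SimplePerson copy = SimplePerson.read(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.lastName + "," + copy.firstName);            // Smith,Anne
        System.out.println(p.compareTo(new SimplePerson("Smith", "Bob")) < 0); // true
    }
}
```

Ordering is by last name first, then first name, exactly as in the compareTo above.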
The trouble with this comparator is that MapReduce stores your intermediate map output in byte form, and every time it needs to sort your data it has to deserialize those bytes back into Writable objects to perform the comparison. This deserialization is expensive because it recreates your objects purely for comparison purposes.
To write a byte-level comparator for the Person class, we have to implement the RawComparator interface. Let's revisit the Person class and look at how to do this. In the Person class, we store the two fields, firstName and lastName, as strings, and use DataOutput's writeUTF method to write them out:
@Override
public void write(DataOutput out) throws IOException {
    out.writeUTF(lastName);
    out.writeUTF(firstName);
}
If you read the javadoc of writeUTF(String str, DataOutput out), you will see the following statement:
* First, two bytes are written to out as if by the <code>writeShort</code>
* method giving the number of bytes to follow. This value is the number of
* bytes actually written out, not the length of the string. Following the
* length, each character of the string is output, in sequence, using the
* modified UTF-8 encoding for the character. If no exception is thrown, the
* counter <code>written</code> is incremented by the total number of
* bytes written to the output stream. This will be at least two
* plus the length of <code>str</code>, and at most two plus
* thrice the length of <code>str</code>.
This simply means that writeUTF writes two bytes containing the length (in bytes) of the encoded string, followed by the modified UTF-8 byte form of the string.
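To see that layout concretely, here is a small sketch using only java.io: it writes one string with writeUTF and inspects the raw bytes. For ASCII input, the encoded byte length equals the character count:

```java
import java.io.*;

public class WriteUTFLayout {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);

        out.writeUTF("Smith");
        byte[] bytes = baos.toByteArray();

        // Two-byte big-endian length prefix, then the modified UTF-8 bytes
        int length = ((bytes[0] & 0xff) << 8) + (bytes[1] & 0xff);
        System.out.println(length);          // 5
        System.out.println(bytes.length);    // 7 = 2-byte prefix + 5 bytes of "Smith"
        System.out.println((char) bytes[2]); // S
    }
}
```

So any byte-level comparator has to skip 2 + length bytes to find the start of the next field.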
Assume you want to perform a lexicographic comparison that includes both the last and the first names. You cannot do this with the entire byte array, because the string lengths are also encoded in the array. Instead, the comparator needs to be smart enough to skip over the string lengths, as the following code shows:
import org.apache.hadoop.io.WritableComparator;

public class PersonBinaryComparator extends WritableComparator {

    protected PersonBinaryComparator() {
        super(Person.class, true);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the last names
        int lastNameResult = compare(b1, s1, b2, s2);

        // If the last names differ, that result decides the ordering
        if (lastNameResult != 0) {
            return lastNameResult;
        }

        // Read the byte length of each last name so we can skip past it
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Return the comparison result on the first names
        return compare(b1, s1 + b1l1 + 2, b2, s2 + b2l1 + 2);
    }

    // Compare a single UTF-encoded string in byte form
    public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        // Read the byte length of each UTF-8 string
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Perform a lexicographic comparison of the UTF-8 binary data
        // with the WritableComparator.compareBytes(...) method
        return compareBytes(b1, s1 + 2, b1l1, b2, s2 + 2, b2l1);
    }

    // Read a two-byte big-endian unsigned short; mask with 0xff because
    // Java bytes are signed
    public static int readUnsignedShort(byte[] b, int offset) {
        int ch1 = b[offset] & 0xff;
        int ch2 = b[offset + 1] & 0xff;
        return (ch1 << 8) + ch2;
    }
}
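To verify that this byte-level comparison agrees with Person's compareTo, here is a self-contained sketch. Since WritableComparator comes from Hadoop, its static compareBytes helper is re-implemented inline as an unsigned lexicographic byte comparison (which matches String.compareTo for ASCII names, though not for all Unicode input):

```java
import java.io.*;

public class RawCompareDemo {
    static int readUnsignedShort(byte[] b, int offset) {
        return ((b[offset] & 0xff) << 8) + (b[offset + 1] & 0xff);
    }

    // Unsigned lexicographic byte comparison, standing in for
    // WritableComparator.compareBytes(...)
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int d = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
            if (d != 0) return d;
        }
        return l1 - l2;
    }

    // Compare one writeUTF-encoded string starting at the given offset
    static int compareUTF(byte[] b1, int s1, byte[] b2, int s2) {
        int len1 = readUnsignedShort(b1, s1);
        int len2 = readUnsignedShort(b2, s2);
        return compareBytes(b1, s1 + 2, len1, b2, s2 + 2, len2);
    }

    // Compare two serialized (lastName, firstName) records
    static int compare(byte[] b1, byte[] b2) {
        int lastNameResult = compareUTF(b1, 0, b2, 0);
        if (lastNameResult != 0) return lastNameResult;
        // Skip the 2-byte length prefix plus the last name's bytes
        return compareUTF(b1, 2 + readUnsignedShort(b1, 0),
                          b2, 2 + readUnsignedShort(b2, 0));
    }

    static byte[] serialize(String last, String first) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeUTF(last);
        out.writeUTF(first);
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] a = serialize("Smith", "Anne");
        byte[] b = serialize("Smith", "Bob");
        byte[] c = serialize("Jones", "Zed");

        System.out.println(compare(a, b) < 0);  // true: Anne < Bob
        System.out.println(compare(a, c) > 0);  // true: Smith > Jones
        System.out.println(compare(a, a) == 0); // true
    }
}
```

Note that no deserialization happens anywhere: the ordering is computed directly on the serialized bytes, which is exactly what makes the RawComparator approach fast.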
A final note: writeUTF is limited because it can only support strings whose encoded form is at most 65535 bytes (the maximum value of the two-byte length prefix). If you need to work with larger strings, you should look at Hadoop's Text class, which supports much larger strings. The implementation of Text's comparator is similar to what we built in this blog post.