by Alex Nadalin

This is why the HyperLogLog algorithm is my new favorite

Every now and then I bump into a concept that’s so simple and powerful that I wish I’d discovered such an incredible and beautiful idea myself.

I discovered HyperLogLog (HLL) a couple of years ago, and fell in love with it right after reading how Redis decided to add an HLL data structure.

The idea behind HLL is devastatingly simple but extremely powerful. This is what makes it such a widespread algorithm, used by giants of the internet such as Google and Reddit.

Collecting phone numbers

My friend Tommy and I planned to go to a conference. While heading there, we decided to wager on who would meet the most new people. Once we reached the venue, we’d start mingling and keep a count of how many people we talked to.

At the end of the event, Tommy comes to me with his figure — say, 17 — and I tell him that I had a word with 46 people.

Clearly, I am the winner, but Tommy’s frustrated as he thinks I’ve counted the same people multiple times. He believes he only saw me talking to maybe 15–20 people in total.

So, the wager’s off. We decide that for our next event, we’ll be taking down names instead, to be sure we’re counting unique people, and not just the total number of conversations.

At the end of the following conference, we meet each other with a very long list of names and — guess what? Tommy had a couple more encounters than I did! We laugh it off, and while discussing our approach to counting uniques, Tommy comes up with a great idea:

“Alex, you know what? We can’t go around with pen and paper and track down a list of names, it’s really impractical! Today I spoke to 65 different people and counting their names on this paper was a real pain. I lost count 3 times and had to start from scratch!”

“Yeah, I know, but do we even have an alternative?”

“What if, for our next conference, instead of asking for names, we ask people the last 5 digits of their phone number? Instead of winning by counting their names, the winner will be the one who spoke to someone with the longest sequence of leading zeroes in those digits.”

“Wait Tommy, you’re going too fast! Slow down a second and give me an example…”

“Sure, just ask each person for those last 5 digits, ok? Let’s suppose they reply ‘54701’. There’s no leading zero, so the longest sequence of leading zeroes is 0. The next person you talk to says ‘02561’ — that’s a leading zero! So your longest sequence is now 1.”

“You’re starting to make sense to me…”

“Yeah, so if we speak to only a couple of people, chances are that our longest zero-sequence will be 0. But if we talk to maybe 10 people, we have a better chance of it being 1.”

“Now, imagine you tell me your longest zero-sequence is 5 — you must have spoken to thousands of people to find someone with 00000 in their phone number!”

“Dude, you’re a damn genius!”

And that, my friends, is how HyperLogLog fundamentally works.

It allows us to estimate unique items within a large dataset by recording the longest sequence of zeroes within that set.

This ends up creating an incredible advantage over keeping track of each and every element in the set. It is an incredibly efficient way to count unique values, with relatively high accuracy.
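
To make that concrete, here is a toy Python sketch of the wager. It is not HLL itself (no hashing, no buckets yet), and the function names are mine; it just shows how a single “longest run of leading zeroes” yields an order-of-magnitude estimate:

import random

def leading_zeros(digits: str) -> int:
    """Length of the run of '0's at the start of a digit string."""
    return len(digits) - len(digits.lstrip("0"))

def rough_estimate(digit_strings) -> int:
    """If the longest run of leading zeroes seen is k, we have probably
    met on the order of 10**k people (base 10: these are decimal digits)."""
    k = max(leading_zeros(d) for d in digit_strings)
    return 10 ** k

# Simulate asking 1,000 people for the last 5 digits of their phone number.
people = [f"{random.randrange(100_000):05d}" for _ in range(1000)]
print(rough_estimate(people))  # order of magnitude only -- variance is huge

A single unlucky observation can throw this off by a factor of ten, which is exactly the variance problem we’ll get to below.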

“The HyperLogLog algorithm can estimate cardinalities well beyond 10⁹ with a relative accuracy (standard error) of 2% while only using 1.5kb of memory”

Fangjin Yang — Fast, Cheap, and 98% Right: Cardinality Estimation for Big Data

Since I may be oversimplifying, let’s have a look at some more details of HLL.

More HLL details

HLL is part of a family of algorithms that aim to address cardinality estimation — otherwise known as the “count-distinct problem.” How can we efficiently count the number of unique objects in a data set?
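
For contrast, the exact answer requires remembering every element you have seen. A minimal Python sketch of that naive approach (my example, not from the article):

unique_visitors = set()
for ip in ["54.134.45.789", "10.0.0.1", "54.134.45.789"]:
    unique_visitors.add(ip)  # the set grows with every new unique value
print(len(unique_visitors))  # 2

The set’s memory footprint grows linearly with the number of uniques, which is what makes exact counting painful at web scale.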

This is extremely useful for lots of today’s web applications. For example, when you want to count how many unique views an article on your site has generated.

When HLL runs, it takes your input data and hashes it, turning it into a bit sequence:

IP address of the viewer: 54.134.45.789
HLL hash: 010010101010101010111010...
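
A minimal sketch of that hashing step, with SHA-1 standing in for whatever hash a real implementation picks:

import hashlib

def hash_bits(value: str) -> str:
    """Hash an input and return its first 32 bits as a '0'/'1' string."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return "".join(format(byte, "08b") for byte in digest)[:32]

print(hash_bits("54.134.45.789"))  # a 32-character bit string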

Now, an important part of HLL is to make sure that your hashing function distributes bits as evenly as possible. You don’t want to use a weak function such as:
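
(The code sample the article showed here was lost in extraction; the stand-in below is mine.) A “hash” like this one is weak because its output bits simply mirror the input instead of being uniformly random:

def weak_hash(ip: str) -> str:
    # Not a real hash: the output bits are just the IP's octets in
    # binary, so similar inputs produce similar, biased bit patterns.
    return "".join(format(int(octet), "08b") for octet in ip.split("."))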

A HLL using this hashing function would return biased results if, for example, the distribution of your visitors is tied to a specific geographic region.

The original paper has a few more details on what a good hashing function means for HLL:

“All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions.

The elements to be counted belonging to a certain data domain D, we assume given a hash function, h : D → {0, 1}∞; that is, we assimilate hashed values to infinite binary strings of {0, 1}∞, or equivalently to real numbers of the unit interval.

[…]

We postulate that the hash function has been designed in such a way that the hashed values closely resemble a uniform model of randomness, namely, bits of hashed values are assumed to be independent and to have each probability [0.5] of occurring.”

Philippe Flajolet — HyperLogLog: The Analysis of a Near-optimal Cardinality Estimation Algorithm

Now, after we’ve picked a suitable hash function, we need to address another pitfall: variance.

Going back to our example, imagine that the first person you talk to at the conference tells you their number ends with 00004 — jackpot!

You might have won a wager against Tommy, but if you use this method in real life, chances are that specific data in your set will negatively influence the estimation.

Fear no more, as this is a problem HLL was born to solve.

Not many are aware that Philippe Flajolet, one of the brains behind HLL, has been involved in cardinality-estimation problems for a long time. Long enough to have come up with the Flajolet-Martin algorithm in 1984 and (super-)LogLog in 2003.

These algorithms already addressed some of the problems with outlying hashed values, by dividing measurements into buckets, and (somewhat) averaging values across buckets.

If you got lost here, let me go back to our original example.

Instead of just taking the last 5 digits of a phone number, we take 6 of them. Now, we store the longest sequence of leading zeroes together with the first digit (the ‘bucket’).

This means that our data will look like:

Input:

708942 --> in the 7th bucket, the longest sequence of 0s is 1
518942 --> in the 5th bucket, the longest sequence of 0s is 0
500973 --> in the 5th bucket, the longest sequence of 0s is now 2
900000 --> in the 9th bucket, the longest sequence of 0s is 5
900672 --> in the 9th bucket, the longest sequence of 0s stays 5

Buckets:

0: 0
1: 0
2: 0
3: 0
4: 0
5: 2
6: 0
7: 1
8: 0
9: 5

Output:

avg(buckets) = 0.8
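
Here is the same bucketed scheme as a short Python sketch (using a plain average, as above; real HLL does something smarter):

def bucketed_estimate(numbers):
    """First digit picks the bucket; the rest contribute leading zeroes."""
    buckets = [0] * 10
    for n in numbers:
        bucket, rest = int(n[0]), n[1:]
        zeros = len(rest) - len(rest.lstrip("0"))
        buckets[bucket] = max(buckets[bucket], zeros)
    return sum(buckets) / len(buckets)

print(bucketed_estimate(["708942", "518942", "500973", "900000", "900672"]))
# 0.8 -- the lucky '00000' is averaged down instead of dominating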

As you can see, if we weren’t employing buckets, we would instead use 5 as the longest sequence of zeroes. This would negatively impact our estimation.

Although I simplified the math behind buckets (it’s not just a simple average), you can totally see how this approach makes sense.

It’s interesting to see how Flajolet addresses variance throughout his works:

“While we’ve got an estimate that’s already pretty good, it’s possible to get a lot better. Durand and Flajolet make the observation that outlying values do a lot to decrease the accuracy of the estimate; by throwing out the largest values before averaging, accuracy can be improved.

Specifically, by throwing out the 30% of buckets with the largest values, and averaging only 70% of buckets with the smaller values, accuracy can be improved from 1.30/sqrt(m) to only 1.05/sqrt(m)! That means that our earlier example, with 640 bytes of state and an average error of 4% now has an average error of about 3.2%, with no additional increase in space required.

Finally, the major contribution of Flajolet et al in the HyperLogLog paper is to use a different type of averaging, taking the harmonic mean instead of the geometric mean we just applied. By doing this, they’re able to edge down the error to 1.04/sqrt(m), again with no increase in state required”

Nick Johnson — Improving Accuracy: SuperLogLog and HyperLogLog
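
To see what “harmonic mean instead of geometric mean” buys concretely, here is the shape of the raw HyperLogLog estimate over m buckets, where M[j] is the value recorded in bucket j (a sketch that ignores the paper’s small- and large-range corrections):

def hll_raw_estimate(M):
    """alpha_m * m^2 / sum(2^-M[j]): a harmonic-style mean, so a single
    outlier bucket barely moves the result."""
    m = len(M)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias correction, valid for m >= 128
    return alpha * m * m / sum(2.0 ** -x for x in M)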

HLL in the wild

So, where can we find the application of HLLs? Two great web-scale examples are:

  • BigQuery, to efficiently count uniques in a table (APPROX_COUNT_DISTINCT())

  • Reddit, where it’s used to calculate how many unique views a post has gathered

In particular, see how HLL impacts queries on BigQuery:

SELECT COUNT(DISTINCT actor.login) exact_cnt
FROM `githubarchive.year.2016`

6,610,026 (4.1s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

SELECT APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.year.2016`

6,643,627 (2.6s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

The second result is an approximation (with an error rate of ~0.5%), but takes a fraction of the time.

Long story short: HyperLogLog is amazing!

Now you know what it is and when it can be used, so go out and do incredible stuff with it!

Further reading

  • HyperLogLog on Wikipedia

  • the original paper

  • HyperLogLog++, Google’s improved implementation of HLL

  • Redis new data structure: the HyperLogLog

  • Damn Cool Algorithms: Cardinality Estimation

  • HLL data types in Riak

  • HyperLogLog and MinHash

Originally published at odino.org (13th Jan 2018).

Translated from: https://www.freecodecamp.org/news/my-favorite-algorithm-and-data-structure-hyperloglog-6583a25c8a4f/
