(Go: >> BACK << -|- >> HOME <<)

MurmurHash: Difference between revisions

Content deleted Content added
→‎Variants: logical order
 
(354 intermediate revisions by more than 100 users not shown)
Line 1:
{{Short description|Computer function}}
{{cleanup|date=April 2009}}
{{Use dmy dates|date=January 2021}}
 
'''MurmurHash''' is a [[non-cryptographic hash function]] suitable for general hash-based lookup.<ref name="Hadoop">{{cite web |url=http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/util/MurmurHash.html |title=Hadoop in Java |publisher=Hbase.apache.org |date=24 July 2011 |accessdate=13 January 2012 |url-status=dead |archiveurl=https://web.archive.org/web/20120112023407/http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/util/MurmurHash.html |archivedate=12 January 2012}}
'''MurmurHash''' is a non-[[Cryptographic hash function|cryptographic]] [[hash function]] created by [[Austin Appleby]]. [http://murmurhash.googlepages.com/]
</ref><ref>
[http://materias.fi.uba.ar/7500/chouza-tesisingenieriainformatica.pdf Chouza et al].
</ref><ref>
{{cite web|url=http://www.inesc-id.pt/ficheiros/publicacoes/5453.pdf |title=Couceiro et al. |language=pt |accessdate=13 January 2012
|page= 14
}}
</ref>
It was created by Austin Appleby in 2008<ref>{{cite web|author=Tanjent (tanjent) wrote,3 March 2008 13:31:00 |url=http://tanjent.livejournal.com/756623.html |title=MurmurHash first announcement |publisher=Tanjent.livejournal.com |accessdate=13 January 2012}}</ref> and is currently hosted on GitHub along with its test suite named 'SMHasher'. It also exists in a number of variants,<ref name="Murmur160">{{cite web|url=http://simonhf.wordpress.com/2010/09/25/murmurhash160/ |title=MurmurHash2-160 |publisher=Simonhf.wordpress.com |date=25 September 2010 |accessdate=13 January 2012}}</ref> all of which have been released into the public domain. The name comes from two basic operations, multiply (MU) and rotate (R), used in its inner loop.
 
Unlike [[cryptographic hash function]]s, it is not specifically designed to be difficult to reverse by an adversary, making it unsuitable for cryptographic purposes.
==The hash==
 
==Variants==
MurmurHash 2.0 is a relatively simple algorithm, which compiles down to about 52 instruction on the [[x86]]. It is noted[http://torum.net/page/2/] for being exceptionally fast, often two to four times faster than comparable[http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html] aglorithms such as [[Fowler Noll Vo hash|FNV]][http://murmurhash.googlepages.com/fnvtestresults], Jenkins' [[Jenkins hash function|lookup3]][http://burtleburtle.net/bob/hash/doobs.html][http://murmurhash.googlepages.com/lookup3testresults] and Hsieh's [[SuperFastHash]][http://murmurhash.googlepages.com/superfasthashtestresults], with excellent distribution [http://murmurhash.googlepages.com/statistics], avalanche behavior [http://murmurhash.googlepages.com/avalanche] and overall collision resistance [http://www.strchr.com/hash_functions]. A survey of hash functions[http://www.strchr.com/hash_functions] concluded that "Murmur2 is the only of the complex hash functions that provides good performance for all kinds of keys. It can be recommended as a general-purpose hashing function."
 
==Variations=MurmurHash1===
The original MurmurHash was created as an attempt to make a faster function than [[Jenkins hash function#lookup3|Lookup3]].<ref>{{cite web |title=MurmurHash1 |url=https://github.com/aappleby/smhasher/wiki/MurmurHash1 |accessdate=12 January 2019}}</ref> Although successful, it hadn't been tested thoroughly and wasn't capable of providing 64-bit hashes as in Lookup3. It had a rather elegant design, that would be later built upon in MurmurHash2, combining a multiplicative hash (similar to [[Fowler–Noll–Vo hash function]]) with a [[Xorshift]].
The first version was MurmurHash 1.0, but it was superceded by 2.0, which Appleby says is "much faster & mixes better." [http://tanjent.livejournal.com/756623.html]. All have been released into the public domain.
 
===MurmurHash2===
The fastest implementation[http://murmurhash.googlepages.com/MurmurHash2.cpp] reads four bytes at a time and generates canonical values only on a processor with Intel [little-endian] byte. For processors that require aligned reads[http://murmurhash.googlepages.com/MurmurHashAligned2.cpp] or have Motorola [[big-endian]] byte order[http://murmurhash.googlepages.com/MurmurHashNeutral2.cpp], there are versions that work correctly but at approximately half speed. The algorithm generates a 32-bit hash value and is optimized for processors with 32-bit or larger registers.
MurmurHash2<ref>{{cite web|title=MurmurHash2 on Github|url=https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp}}</ref> yields a 32-bit or 64-bit value. It came in multiple variants, including some that allowed incremental hashing and aligned or neutral versions.
 
* MurmurHash2 (32-bit, x86) - The original version; contains a flaw that weakens collision in some cases.<ref>{{cite web |title=MurmurHash2Flaw |url=https://github.com/aappleby/smhasher/wiki/MurmurHash2Flaw |accessdate=15 January 2019}}</ref>
===Implementation===
* MurmurHash2A (32-bit, x86) - A fixed variant using [[Merkle–Damgård construction]]. Slightly slower.
* CMurmurHash2A (32-bit, x86) - MurmurHash2A but works incrementally.
* MurmurHashNeutral2 (32-bit, x86) - Slower, but endian and alignment neutral.
* MurmurHashAligned2 (32-bit, x86) - Slower, but does aligned reads (safer on some platforms).
* MurmurHash64A (64-bit, x64) - The original 64-bit version. Optimized for 64-bit arithmetic.
* MurmurHash64B (64-bit, x86) - A 64-bit version optimized for 32-bit platforms. It is not a true 64-bit hash due to insufficient mixing of the stripes.<ref>{{cite web |title=MurmurHash3 (see note on MurmurHash2_x86_64) |url=https://github.com/aappleby/smhasher/wiki/MurmurHash3#bulk-speed-test-hashing-an-8-byte-aligned-256k-block |accessdate=15 January 2019}}</ref>
 
The person who originally found the flaw{{what|date=July 2021}} in MurmurHash2 created an unofficial 160-bit version of MurmurHash2 called MurmurHash2_160.<ref>{{cite web |title=MurmurHash2_160 |url=https://simonhf.wordpress.com/2010/09/25/murmurhash160/ |accessdate=12 January 2019}}</ref>
The algorithm was originally expressed in C++, and has been ported to a number of popular languages, including Python[http://pypi.python.org/pypi/Murmur/0.1.3], C#[http://landman-code.blogspot.com/] and Java[http://www.getopt.org/murmur/MurmurHash.java].
 
===MurmurHash3===
//-----------------------------------------------------------------------------
The current version is MurmurHash3,<ref>{{cite web|title=MurmurHash3 on Github|url=https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp}}</ref><ref name="Horvath">{{cite web | first = Adam | last = Horvath | url = http://blog.teamleadnet.com/2012/08/murmurhash3-ultra-fast-hash-algorithm.html | title = MurMurHash3, an ultra fast hash algorithm for C# / .NET | date = 10 August 2012 }}</ref> which yields a 32-bit or 128-bit hash value. When using 128-bits, the x86 and x64 versions do not produce the same values, as the algorithms are optimized for their respective platforms. MurmurHash3 was released alongside SMHasher—a hash function test suite.
// MurmurHash2, by Austin Appleby
 
==Implementations==
// Note - This code makes a few assumptions about how your machine behaves -
The canonical implementation is in [[C++]], but there are efficient ports for a variety of popular languages, including [[Python (programming language)|Python]],<ref>{{cite web|url=https://code.google.com/p/pyfasthash/ |title=pyfasthash in Python |accessdate=13 January 2012}}</ref> [[C (programming language)|C]],<ref>{{cite web|url=https://github.com/wolkykim/qlibc |title=C implementation in qLibc by Seungyoung Kim}}</ref> [[Go (programming language)|Go]],<ref>{{cite web|url=https://github.com/spaolacci/murmur3 |title=murmur3 in Go}}</ref> [[C Sharp (programming language)|C#]],<ref name="Horvath"/><ref>{{cite web|last=Landman |first=Davy |url=http://landman-code.blogspot.com/2009/02/c-superfasthash-and-murmurhash2.html |title=Davy Landman in C# |publisher=Landman-code.blogspot.com |accessdate=13 January 2012}}</ref> [[D (programming language)|D]],<ref>{{Cite web|url=https://dlang.org/phobos/std_digest_murmurhash.html|title=std.digest.murmurhash - D Programming Language|website=dlang.org|access-date=2016-11-05}}</ref> [https://github.com/tkaemming/lua-murmurhash3 Lua], [[Perl]],<ref>{{cite web|url=http://metacpan.org/module/Digest::MurmurHash |title=Toru Maesaka in Perl |publisher=metacpan.org |accessdate=13 January 2012}}</ref> [[Ruby (programming language)|Ruby]],<ref>{{cite web|author=Yuki Kurihara |url=https://github.com/ksss/digest-murmurhash |title=Digest::MurmurHash |publisher=GitHub.com |date=16 October 2014 |accessdate=18 March 2015}}</ref> [[Rust (programming language)|Rust]],<ref>{{Cite web|title = stusmall/murmur3|url = https://github.com/stusmall/murmur3|website = GitHub|accessdate = 2015-11-29}}</ref> [[PHP]],<ref>{{cite web|url=https://github.com/lastguest/murmurhash-php |title=PHP userland implementation of MurmurHash3 |publisher=github.com |accessdate=18 December 2017}}</ref><ref>{{cite web|url=https://php.watch/versions/8.1/MurmurHash3 |title=PHP 8.1 with MurmurHash3 support}}</ref> [[Common Lisp]],<ref>{{cite web|url=https://bitbucket.org/tarballs_are_good/murmurhash3|title=tarballs_are_good / murmurhash3|accessdate=7 February 2015}}</ref> [[Haskell (programming language)|Haskell]],<ref>{{cite web|url=http://hackage.haskell.org/package/murmur-hash |title=Haskell |publisher=Hackage.haskell.org |accessdate=13 January 2012}}</ref> [[Elm (programming language)|Elm]],<ref>{{cite web|url=https://package.elm-lang.org/packages/Skinney/murmur3/latest/ |title=Elm |publisher=package.elm-lang.org |accessdate=12 June 2019}}</ref> [[Clojure]],<ref>{{cite web|url=https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/Murmur3.java|title=Murmur3.java in Clojure source code on Github|publisher=clojure.org|accessdate=2014-03-11}}</ref> [[Scala (programming language)|Scala]],<ref>{{cite web|url=https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/hashing/MurmurHash3.scala|title=Scala standard library implementation|date=26 September 2014}}</ref> [[Java (programming language)|Java]],<ref>[https://google.github.io/guava/releases/snapshot/api/docs/com/google/common/hash/Hashing.html Murmur3], part of Guava</ref><ref>{{cite web|url=https://github.com/greenrobot/greenrobot-common/blob/master/hash-functions.md|title=Murmur3A and Murmur3F Java classes on Github|publisher=greenrobot|accessdate=5 November 2014}}</ref> [[Erlang (programming language)|Erlang]],<ref>{{cite web|url=https://github.com/bipthelin/murmerl3|title=bipthelin/murmerl3|work=GitHub|accessdate=21 October 2015}}</ref> [[Swift (programming language)|Swift]],<ref>{{cite web|author=Daisuke T |url=https://github.com/daisuke-t-jp/MurmurHash-Swift |title=MurmurHash-Swift |publisher=GitHub.com |date=7 February 2019 |accessdate=10 February 2019}}</ref> [[Object Pascal]],<ref>[https://github.com/Xor-el/HashLib4Pascal GitHub - Xor-el/HashLib4Pascal: Hashing for Modern Object Pascal<!-- Bot generated title -->]</ref> [[Kotlin (programming language)|Kotlin]],<ref>{{cite web|url=https://github.com/goncalossilva/kotlinx-murmurhash |title=goncalossilva/kotlinx-murmurhash |publisher=GitHub.com |date=10 December 2021 |accessdate=14 December 2021}}</ref> [[JavaScript]].<ref>{{cite web|author=raycmorgan (owner) |url=https://gist.github.com/588423 |title=Javascript implementation by Ray Morgan |publisher=Gist.github.com |accessdate=13 January 2012}}</ref> and [[OCaml]], <ref>{{cite web|author=INRIA |url=https://github.com/ocaml/ocaml/blob/trunk/runtime/hash.c |title=OCaml Source |publisher=GitHub.com}}</ref>
 
// 1. We can read a 4-byte value from any address without crashing
It has been adopted into a number of open-source projects, most notably [[libstdc++]] (ver 4.6), [[nginx]] (ver 1.0.1),<ref>{{cite web|url=http://nginx.org/en/CHANGES |title=nginx |accessdate=13 January 2012}}</ref> [[Rubinius]],<ref>{{cite web|url=https://github.com/rubinius/rubinius/commit/1d69526c484cc9435a7198e41b8995db6c3acf1a |title=Rubinius |accessdate=29 February 2012}}</ref> libmemcached (the [[C (programming language)|C]] driver for [[Memcached]]),<ref>{{cite web|url=http://libmemcached.org/libMemcached.html|title=libMemcached|work=libmemcached.org|accessdate=21 October 2015}}</ref> [[Npm (software)|npm]] (nodejs package manager),<ref>{{cite web|url=https://github.com/npm/write-file-atomic/commit/22dd8759076fc8d0327d50693283060e479afafc |title=switch from MD5 to murmur}}</ref> [[maatkit]],<ref>{{cite web|url=https://code.google.com/p/maatkit/source/detail?r=3273 |title=maatkit |date=24 March 2009 |accessdate=13 January 2012}}</ref> [[Hadoop]],<ref name="Hadoop"/> Kyoto Cabinet,<ref>{{cite web|url=http://fallabs.com/kyotocabinet/spex.html |title=Kyoto Cabinet specification |publisher=Fallabs.com |date=4 March 2011 |accessdate=13 January 2012}}</ref> [[Apache Cassandra|Cassandra]],<ref>{{cite web|url=http://wiki.apache.org/cassandra/Partitioners|title=Partitioners|publisher=apache.org |date=2013-11-15|accessdate=2013-12-19}}</ref><ref>{{cite web |url=https://www.youtube.com/watch?v=d7o6a75sfY0 |title=Introduction to Apache Cassandra™ + What’s New in 4.0 by Patrick McFadin. DataStax Presents |work=YouTube |date=April 10, 2019}}</ref> [[Apache Solr|Solr]],<ref>{{cite web|url=http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/codecs/bloom/MurmurHash2.html|title=Solr MurmurHash2 Javadoc}}</ref> [[Vowpal Wabbit|vowpal wabbit]],<ref>{{cite web|url=https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/explore/hash.h|title=hash.cc in vowpalwabbit source code}}</ref> [[Elasticsearch]],<ref>{{cite web|url=https://www.elastic.co/guide/en/elasticsearch/reference/2.0/breaking_20_crud_and_routing_changes.html#_routing_hash_function|title=Elasticsearch 2.0 - CRUD and routing changes}}</ref> [[Google Guava|Guava]],<ref>{{cite web|url=https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/Hashing.java|title=Guava Hashing.java}}</ref> [[Apache Kafka|Kafka]]<ref>{{cite web|url=https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java|title=Kafka DefaultPartitioner.java}}</ref> and [https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/vdo-integration RedHat Virtual Data Optimizer (VDO)].<ref>Virtual Data Optimizer [https://github.com/dm-vdo/kvdo source code]</ref>
// 2. sizeof(int) == 4
 
==Vulnerabilities==
// And it has a few limitations -
Hash functions can be vulnerable to [[collision attack]]s, where a user can choose input data in such a way so as to intentionally cause hash collisions. Jean-Philippe Aumasson and [[Daniel J. Bernstein]] were able to show that even implementations of MurmurHash using a randomized seed are vulnerable to so-called [[HashDoS]] attacks.<ref>{{cite web|url=https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/|title=Breaking Murmur: Hash-flooding DoS Reloaded}}</ref> With the use of [[differential cryptanalysis]] they were able to generate inputs that would lead to a hash collision. The authors of the attack recommend to use their own [[SipHash]] instead.
 
// 1. It will not work incrementally.
==Algorithm==
// 2. It will not produce the same results on little-endian and big-endian
'''algorithm''' Murmur3_32 '''is'''
// machines.
unsigned int''// MurmurHash2Note: (In constthis voidversion, *all key,arithmetic intis len,performed with unsigned int seed32-bit )integers.''
''// In the case of overflow, the result is reduced modulo {{math|2<sup>32</sup>}}.''
{
'''input:''' ''key'', ''len'', ''seed''
// 'm' and 'r' are mixing constants generated offline.
// They're not really 'magic', they just happen to work well.
c1 ← 0xcc9e2d51
c2 ← 0x1b873593
const unsigned int m = 0x5bd1e995;
const intr1 r = 24;15
r2 ← 13
m ← 5
// Initialize the hash to a 'random' value
n ← 0xe6546b64
unsigned int h = seed ^ len;
hash ← ''seed''
// Mix 4 bytes at a time into the hash
'''for each''' fourByteChunk of ''key'' '''do'''
k ← fourByteChunk
const unsigned char * data = (const unsigned char *)key;
k ← k × c1
while(len >= 4)
k ← k ROL r1
{
unsigned int k = *(unsignedk int× *)data;c2
hash ← hash XOR k
k *= m;
k ^= k >> r; hash ← hash ROL r2
k *= hash ← (hash × m;) + n
'''with any''' remainingBytesInKey '''do'''
h *= m;
remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
h ^= k;
''// Note: Endian swapping is only necessary on big-endian machines.''
''// The purpose is to place the meaningful digits towards the low end of the value,''
data += 4;
''// so that these digits have the greatest potential to affect the low range digits''
len -= 4;
''// in the subsequent multiplication. Consider that locating the meaningful digits''
}
''// in the high range would produce a greater effect upon the high digits of the''
''// multiplication, and notably, that such high digits are likely to be discarded''
// Handle the last few bytes of the input array
''// by the modulo arithmetic under overflow. We don't want that.''
switch(len)
remainingBytes ← remainingBytes × c1
{
remainingBytes ← remainingBytes ROL r1
case 3: h ^= data[2] << 16;
remainingBytes ← remainingBytes × c2
case 2: h ^= data[1] << 8;
case 1: h ^= data[0];
hash hash hXOR *= m;remainingBytes
};
hash ← hash XOR ''len''
// Do a few final mixes of the hash to ensure the last few
hash ← hash XOR (hash >> 16)
// bytes are well-incorporated.
hash ← hash × 0x85ebca6b
h ^=hash h← hash XOR (hash >> 13;)
hash ← hash × 0xc2b2ae35
h *= m;
h ^=hash h← hash XOR (hash >> 15;16)
 
A sample C implementation follows (for little-endian CPUs):
return h;
 
<syntaxhighlight lang="c">
static inline uint32_t murmur_32_scramble(uint32_t k) {
k *= 0xcc9e2d51;
k = (k << 15) | (k >> 17);
k *= 0x1b873593;
return k;
}
uint32_t murmur3_32(const uint8_t* key, size_t len, uint32_t seed)
{
uint32_t h = seed;
uint32_t k;
/* Read in groups of 4. */
for (size_t i = len >> 2; i; i--) {
// Here is a source of differing results across endiannesses.
// A swap here has no effects on hash properties though.
memcpy(&k, key, sizeof(uint32_t));
key += sizeof(uint32_t);
h ^= murmur_32_scramble(k);
h = (h << 13) | (h >> 19);
h = h * 5 + 0xe6546b64;
}
/* Read the rest. */
k = 0;
for (size_t i = len & 3; i; i--) {
k <<= 8;
k |= key[i - 1];
}
// A swap is *not* necessary here because the preceding loop already
// places the low bytes in the low places according to whatever endianness
// we use. Swaps only apply when the memory is copied in a chunk.
h ^= murmur_32_scramble(k);
/* Finalize. */
h ^= len;
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
</syntaxhighlight>
{| class="wikitable"
|+Tests
!Test string
!Seed value
!Hash value (hexadecimal)
!Hash value (decimal)
|-
|
|<code>0x00000000</code>
|<code>0x00000000</code>
|<code>0</code>
|-
|
|<code>0x00000001</code>
|<code>0x514E28B7</code>
|<code>1,364,076,727</code>
|-
|
|<code>0xffffffff</code>
|<code>0x81F16F39</code>
|<code>2,180,083,513</code>
|-
|test
|<code>0x00000000</code>
|<code>0xba6bd213</code>
|<code>3,127,628,307</code>
|-
|test
|<code>0x9747b28c</code>
|<code>0x704b81dc</code>
|<code>1,883,996,636</code>
|-
|Hello, world!
|<code>0x00000000</code>
|<code>0xc0363e43</code>
|<code>3,224,780,355</code>
|-
|Hello, world!
|<code>0x9747b28c</code>
|<code>0x24884CBA</code>
|<code>612,912,314</code>
|-
|The quick brown fox jumps over the lazy dog
|<code>0x00000000</code>
|<code>0x2e4ff723</code>
|<code>776,992,547</code>
|-
|The quick brown fox jumps over the lazy dog
|<code>0x9747b28c</code>
|<code>0x2FA826CD</code>
|<code>799,549,133</code>
|}
 
== See also ==
* [[Non-cryptographic hash functions]]
 
==References==
{{reflist|colwidth=30em}}
 
==External links==
* [https://github.com/aappleby/smhasher Official SMHasher site]
* [http://murmurhash.googlepages.com/ Austin Appleby's MurmurHash page]
 
{{DEFAULTSORT:Murmurhash}}
[[Category:Hash functions]]
[[Category:Hash function (non-cryptographic)]]
[[Category:Articles with example pseudocode]]
[[Category:Articles with example C code]]