Tuesday, 19 June 2018


A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (although this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if conventional error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, of which 90% follow simple hyphenation rules, while the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; with limited core memory, on the other hand, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.

More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.





Description of algorithm

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps (hashes) a set element to one of the m array positions with a uniform random distribution. Typically, k is a constant, much smaller than m, and m is proportional to the number of elements to be added; the precise choice of k and of the constant of proportionality of m is determined by the intended false positive rate of the filter.

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set: if it were, all of those bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits were set to 1 by chance during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.
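The add and query steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `BloomFilter`, the method names, and the choice of SHA-256 over the string "i:item" as the i-th hash function are hypothetical and not part of the original description, and one byte is used per bit for readability rather than compactness.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, all initially 0

    def _positions(self, item: str) -> list[int]:
        # Assumption: the i-th hash function is SHA-256 of "i:item", reduced mod m.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        # Set the bits at all k positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False: definitely not in the set. True: possibly in the set.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=7)
for word in ("apple", "banana", "cherry"):
    bf.add(word)
```

Every inserted word is reported as possibly present, while a word that was never inserted is reported as absent unless all k of its positions happen to have been set by other insertions.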

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of the hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k - 1) to a hash function that takes an initial value, or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with a negligible increase in the false positive rate. (Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.)
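Two of these options, slicing a wide hash output and appending initial values to the key, might look as follows. This is a sketch under assumptions: the function names are hypothetical, SHA-512 and SHA-256 from `hashlib` are chosen here purely for convenience, and the 4-byte field width is arbitrary. It does not implement the enhanced double hashing of Dillinger & Manolios.

```python
import hashlib


def positions_by_slicing(item: bytes, k: int, m: int) -> list[int]:
    """Slice one wide digest into k 4-byte fields, one per 'hash function'."""
    digest = hashlib.sha512(item).digest()  # 64 bytes, so at most 16 fields
    if k * 4 > len(digest):
        raise ValueError("digest too short for this k")
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]


def positions_by_seeding(item: bytes, k: int, m: int) -> list[int]:
    """Append the initial values 0, 1, ..., k - 1 to the key before hashing."""
    positions = []
    for seed in range(k):
        digest = hashlib.sha256(item + seed.to_bytes(4, "big")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions
```

Slicing costs a single hash computation per element but caps k at the number of fields the digest can supply; seeding has no such cap but hashes the key k times.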

Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. An element maps to k bits, and although setting any one of those k bits to zero suffices to remove the element, it also removes any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits of the element to be removed, clearing any of the bits would introduce the possibility of false negatives.

One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains the items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter.

It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.




Space and time advantages

Despite the risk of false positives, Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). A Bloom filter, however, does not store the data items at all, and a separate solution must be provided for the actual storage. Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.

However, if the number of potential values is small and many of them can be in the set, the Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for each potential element. Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1.

Bloom filters also have the unusual property that the time needed either to add an item or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized.

To understand its space efficiency, it is instructive to compare the general Bloom filter with its special case when k = 1. If k = 1, then in order to keep the false positive rate sufficiently low, only a small fraction of the bits should be set, which means the array must be very large and contain long runs of zeros. The information content of the array relative to its size is low. The generalized Bloom filter (k greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set, and these will be apparently random, minimizing redundancy and maximizing information content.



Probability of false positives

Assume that a hash function selects each array position with equal probability. If m is the number of bits in the array, the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is

$$1 - \frac{1}{m}.$$

If k is the number of hash functions, the probability that the bit is not set to 1 by any of the hash functions is

$$\left(1 - \frac{1}{m}\right)^{k}.$$

If we have inserted n elements, the probability that a certain bit is still 0 is

$$\left(1 - \frac{1}{m}\right)^{kn};$$

the probability that it is 1 is therefore

$$1 - \left(1 - \frac{1}{m}\right)^{kn}.$$

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with the probability given above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is often given as

$$\left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.$$

This is not strictly correct, as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation, we have that the false positive probability decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases.
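The effect of m and n on the false positive probability can be checked numerically. The sketch below evaluates both the expression with (1 - 1/m)^(kn) and its exponential approximation; the function names and the sample parameters are illustrative assumptions, and both functions inherit the independence assumption noted above.

```python
import math


def fp_power_form(m: int, n: int, k: int) -> float:
    """(1 - (1 - 1/m)^(kn))^k, which still assumes independent bits."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def fp_approx(m: int, n: int, k: int) -> float:
    """The usual approximation (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Hypothetical sample parameters: 10,000 bits, 1,000 elements, 7 hash functions.
base = fp_approx(10_000, 1_000, 7)
more_bits = fp_approx(20_000, 1_000, 7)
more_items = fp_approx(10_000, 2_000, 7)
```

With these sample parameters the approximation gives a rate of about 0.0082; doubling m lowers it and doubling n raises it, as the text states.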

An alternative analysis that arrives at the same approximation without assuming independence is given by Mitzenmacher and Upfal. After all n items have been added to the Bloom filter, let q be the fraction of the m bits that are set to 0. (That is, the number of bits still set to 0 is qm.) Then, when testing membership of an element not in the set, for the array position given by any one of the k hash functions, the probability that the bit is found set to 1 is $1 - q$. So the probability that all k hash functions find their bit set to 1 is $(1 - q)^{k}$. Further, the expected value of q is the probability that a given array position is left untouched by each of the k hash functions for each of the n items, which is (as above)

$$E[q] = \left(1 - \frac{1}{m}\right)^{kn}.$$

It is possible to prove, without the independence assumption, that q is very strongly concentrated around its expected value. In particular, from the Azuma-Hoeffding inequality, they prove that

$$\Pr\left(\left|q - E[q]\right| \geq \frac{\lambda}{m}\right) \leq 2\exp\left(-2\lambda^{2}/kn\right).$$

Because of this, we can say that the exact probability of false positives is

$$\sum_{t} \Pr(q = t)(1 - t)^{k} \approx (1 - E[q])^{k} = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$

as before.

Optimal number of hash functions

The number of hash functions, k, must be a positive integer. Putting this constraint aside, for a given m and n, the value of k that minimizes the false positive probability is

$$k = \frac{m}{n} \ln 2.$$

The required number of bits, m, given n (the number of inserted elements) and a desired false positive probability p (and assuming the optimal value of k is used) can be computed by substituting the optimal value of k into the probability expression above:

$$p = \left(1 - e^{-\left(\frac{m}{n} \ln 2\right)\frac{n}{m}}\right)^{\frac{m}{n} \ln 2}$$

which can be simplified to:

$$\ln p = -\frac{m}{n} \left(\ln 2\right)^{2}.$$

This results in:

$$m = -\frac{n \ln p}{(\ln 2)^{2}}$$

So the optimal number of bits per element is

$$\frac{m}{n} = -\frac{\log_{2} p}{\ln 2} \approx -1.44 \log_{2} p$$

with the corresponding number of hash functions k (ignoring integrality):

$$k = -\frac{\ln p}{\ln 2} = -\log_{2} p.$$

This means that for a given false positive probability p, the length of a Bloom filter m is proportional to the number of elements being filtered n, and the required number of hash functions depends only on the target false positive probability p.
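These sizing formulas translate directly into a small helper. The sketch below is illustrative only: the function names are hypothetical, and rounding m up and k to the nearest integer is an assumption made here, since the formulas themselves ignore integrality.

```python
import math


def optimal_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k) for n elements and a target false positive probability p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m = -n ln p / (ln 2)^2
    k = max(1, round((m / n) * math.log(2)))              # k = (m / n) ln 2, rounded
    return m, k


def bits_per_element(p: float) -> float:
    """m / n = -log2(p) / ln 2, roughly -1.44 * log2(p)."""
    return -math.log2(p) / math.log(2)


m, k = optimal_parameters(1_000, 0.01)
extra_bits = bits_per_element(0.001) - bits_per_element(0.01)
```

For 1,000 elements and p = 0.01 this gives m = 9586 bits and k = 7, i.e. about 9.6 bits per element, and lowering p by a factor of ten costs roughly 4.8 extra bits per element.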

The formula $m = -\frac{n \ln p}{(\ln 2)^{2}}$ is approximate for three reasons. First, and of least concern, it approximates $1 - \frac{1}{m}$ as $e^{-\frac{1}{m}}$, which is a good asymptotic approximation (i.e., one that holds as $m \to \infty$). Second, and of more concern, it assumes that during the membership test the event that one tested bit is set to 1 is independent of the event that any other tested bit is set to 1. Third, and of most concern, it assumes that $k = \frac{m}{n} \ln 2$ is fortuitously integral.

Goel and Gupta, however, give a rigorous upper bound that makes no approximations and requires no assumptions. They show that the false positive probability for a finite Bloom filter with m bits ($m > 1$), n elements, and k hash functions is at most

$$\left(1 - e^{-\frac{k(n + 0.5)}{m - 1}}\right)^{k}.$$

This bound can be interpreted as saying that the approximate formula $\left(1 - e^{-\frac{kn}{m}}\right)^{k}$ can be applied at a penalty of at most half an extra element and at most one fewer bit.
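The "half an extra element, one fewer bit" reading can be made concrete by evaluating the Goel and Gupta bound, taken here to be (1 - e^(-k(n + 0.5)/(m - 1)))^k, next to the approximate formula. The function names and sample parameters below are hypothetical.

```python
import math


def fp_approx(m: float, n: float, k: int) -> float:
    """Approximate false positive probability (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def fp_upper_bound(m: int, n: int, k: int) -> float:
    """Upper bound (1 - e^(-k(n + 0.5)/(m - 1)))^k; requires m > 1."""
    if m <= 1:
        raise ValueError("m must be greater than 1")
    return (1.0 - math.exp(-k * (n + 0.5) / (m - 1))) ** k


# Hypothetical sample parameters.
approx = fp_approx(10_000, 1_000, 7)
bound = fp_upper_bound(10_000, 1_000, 7)
```

The bound equals the approximate formula evaluated at n + 0.5 elements and m - 1 bits, so it always lies slightly above the approximation and the gap shrinks as the filter grows.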



Approximating the number of items in a Bloom filter

Swamidass & Baldi (2007) showed that the number of items in a Bloom filter can be approximated with the following formula,

$$n^{*} = -\frac{m}{k} \ln\left[1 - \frac{X}{m}\right],$$

where $n^{*}$ is an estimate of the number of items in the filter, m is the length (size) of the filter, k is the number of hash functions, and X is the number of bits set to one.
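As a minimal sketch, the estimate can be computed from the filter length, the number of hash functions, and a count of the bits set to one. The function name and the sample values are hypothetical; rejecting X = m is a choice made here because the logarithm is undefined for a completely full filter.

```python
import math


def estimate_count(m: int, k: int, x: int) -> float:
    """n* = -(m / k) * ln(1 - X / m), where X is the number of bits set to one."""
    if not 0 <= x < m:
        raise ValueError("X must satisfy 0 <= X < m")
    return -(m / k) * math.log(1.0 - x / m)


# Hypothetical example: a 1,000-bit filter with 5 hash functions, half of its bits set.
estimate = estimate_count(1_000, 5, 500)
```

In this example the estimate is 200 ln 2, or about 138.6 items.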



The union and intersection of sets

Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union of two sets, and Bloom filters can be used to approximate both. Swamidass & Baldi (2007) showed that for two Bloom filters of length m, their respective counts can be estimated as

$$n(A^{*}) = -\frac{m}{k} \ln\left[1 - \frac{n(A)}{m}\right]$$

and

$$n(B^{*}) = -\frac{m}{k} \ln\left[1 - \frac{n(B)}{m}\right].$$

Source of the article : Wikipedia
