<p>Of bitpacking with or without SSE3. Paul Masurel, 2019-09-20. https://fulmicoton.com/posts/bitpacking</p>
<p><em>For those who came from Reddit and are not familiar with tantivy: <a href="https://github.com/tantivy-search/tantivy">tantivy</a> is a search engine library for Rust. It is strongly inspired by Lucene.</em></p>
<p>This blog post might interest three types of readers.</p>
<ul>
<li><strong>people interested in tantivy</strong>: You’ll learn how tantivy uses SIMD
instructions to decode posting lists, and what happens on platforms where the relevant
instruction set is not available.</li>
<li><strong>rustaceans</strong> who would like to hear a good SIMD in rust story.</li>
<li><strong>lucene core devs (yeah it is a very select club)</strong> who might be interested
in a possible (unconfirmed) optimization opportunity.</li>
</ul>
<p>Depending on the category you belong to, you may want to skip parts of this blog post.
Go ahead, I won’t be offended.</p>
<h1 id="integer-compression">Integer compression</h1>
<p>Full text search engines sequentially read <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">1</a></sup> long lists of sorted document ids. In tantivy and in Lucene, these <code class="language-plaintext highlighter-rouge">DocId</code>s are represented as 32-bit integers. 32 bits may sound a little small,
but since <code class="language-plaintext highlighter-rouge">DocId</code>s are local to a segment of the index, and an index can have more than one segment, both tantivy and Lucene can handle indices exceeding 4 billion documents.</p>
<p>It is important to compress this data in a compact way, and in a way that
makes uncompressing as fast as possible.
The best algorithms typically clock at more than 4 billion integers per second. At 4 bytes per integer,
that is more than 16GB/s of decoded output: depending on your architecture, you can actually
uncompress integers slightly faster than your memory bandwidth limit (16GB/s).</p>
<p>In comparison, a general compression scheme that optimizes for decompression speed like LZ4 will typically decompress 1 billion integers per second.</p>
<h1 id="compression-schemes">Compression schemes</h1>
<p>There is a wealth of integer compression algorithms, and they typically offer different trade-offs between compression/decompression speed
and compression ratio. <a href="https://github.com/powturbo/TurboPFor">TurboPFor</a>’s README offers a comprehensive benchmark of the most popular compression formats. The data shown in the graph below was
taken from there.</p>
<p><img src="/images/compression-algorithm.png" alt="Compression schemes" /></p>
<p>Let’s walk around this chart together.</p>
<p>Ideally you would want to appear in the top left corner of this graph,
but this does not tell the entire story. Different compression algorithm families have different usages.</p>
<p>The schemes you see on the left (with the exception of Elias-Fano<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">2</a></sup>) typically require compressing and decompressing integers in blocks. Tantivy uses SIMDPack128 for instance, which works on blocks of 128 integers.</p>
<p>The schemes you see on the right side are typically variable-byte schemes.
Algorithms in that family represent each integer over 1, 2, 3, 4, or 5 bytes.
They do not compress as well, especially for very small integers. On the other hand, they do
not require decompressing entire blocks at a time.</p>
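<p>To make the variable-byte idea concrete, here is a minimal sketch of one common convention: 7 payload bits per byte, with the high bit flagging that more bytes follow. The function names are hypothetical, and the schemes benchmarked in the chart may well lay out their bytes differently.</p>

```rust
/// Encodes `value` with a variable-byte scheme: 7 payload bits per byte,
/// with the high bit set on every byte except the last one.
/// A `u32` therefore takes between 1 and 5 bytes.
pub fn vbyte_encode(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Decodes one integer from the start of `input`.
/// Returns the value and the number of bytes consumed,
/// or `None` if no terminating byte is found within 5 bytes.
pub fn vbyte_decode(input: &[u8]) -> Option<(u32, usize)> {
    let mut value = 0u32;
    for (i, &byte) in input.iter().enumerate().take(5) {
        value |= u32::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

<p>Small integers cost a full byte each here, which is why this family compresses less tightly than bitpacking, but each integer can be decoded on its own.</p>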
<p>The compression implementations at the top of the chart use SIMD instructions…
As a result, they are unfortunately not usable in Lucene as Java does not support
SIMD instructions. But we’ll see at the end of this blog post that there might be an interesting
workaround for Java developers.</p>
<h1 id="bitpacking">Bitpacking</h1>
<p>The format used by tantivy is a pure Rust reimplementation of <a href="https://github.com/lemire/simdcomp">simdcomp</a> from <a href="https://lemire.me/blog/">Daniel Lemire</a>. It is an SSE3 implementation of delta-encoding + bitpacking.
But what are delta-encoding and bitpacking exactly?</p>
<p>Bitpacking relies on the idea that the integers you are trying to compress are small.
Delta-encoding consists in replacing your sorted list of integers by the difference between two consecutive integers.</p>
<p><code class="language-plaintext highlighter-rouge">1, 3, 7, 8, 13, ...</code> for instance, becomes <code class="language-plaintext highlighter-rouge">1, 2, 4, 1, 5, ...</code>.</p>
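<p>A minimal sketch of delta-encoding and its inverse could look like the following. The function names are hypothetical, and the first value is simply treated as a delta from 0, as in the example above.</p>

```rust
/// Replaces a sorted list of integers by the gaps between consecutive values.
/// The first value is kept as is (i.e. as its difference from 0).
/// The input is assumed to be sorted in increasing order.
pub fn delta_encode(sorted: &[u32]) -> Vec<u32> {
    let mut previous = 0u32;
    sorted
        .iter()
        .map(|&value| {
            let delta = value - previous;
            previous = value;
            delta
        })
        .collect()
}

/// Rebuilds the original sorted list with a running sum of the deltas.
pub fn delta_decode(deltas: &[u32]) -> Vec<u32> {
    let mut running = 0u32;
    deltas
        .iter()
        .map(|&delta| {
            running += delta;
            running
        })
        .collect()
}
```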
<p>Bitpacking then consists in identifying the minimum number of bits k required to represent
all of the integers in a pack, and then concatenating their lowest k bits.
For instance, in the previous example, the highest number is 5.
It requires 3 bits to be represented.</p>
<p>Written with the least significant bit of each value first, our bitpacked block for <code class="language-plaintext highlighter-rouge">1, 2, 4, 1, 5, ...</code> becomes:
<code class="language-plaintext highlighter-rouge">100 010 001 100 101 ...</code>.</p>
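<p>The two steps can be sketched as follows. Both function names are hypothetical, and the packer below only handles tiny inputs that fit in a single <code class="language-plaintext highlighter-rouge">u64</code>, which is enough to reproduce the example.</p>

```rust
/// Minimum number of bits needed to represent every value of the block.
pub fn num_bits(block: &[u32]) -> u8 {
    let max = block.iter().copied().max().unwrap_or(0);
    (32 - max.leading_zeros()) as u8
}

/// Concatenates the `bit_width` lowest bits of each value into a `u64`,
/// the first value landing in the least significant bits.
/// Values are assumed to fit in `bit_width` bits.
pub fn pack_lsb_first(block: &[u32], bit_width: u8) -> u64 {
    assert!(block.len() * bit_width as usize <= 64, "does not fit in a u64");
    let mut packed = 0u64;
    for (i, &value) in block.iter().enumerate() {
        packed |= u64::from(value) << (i * bit_width as usize);
    }
    packed
}
```

<p>For <code class="language-plaintext highlighter-rouge">1, 2, 4, 1, 5</code>, <code class="language-plaintext highlighter-rouge">num_bits</code> returns 3 and the packed value is <code class="language-plaintext highlighter-rouge">0b101_001_100_010_001</code>, which reads <code class="language-plaintext highlighter-rouge">100 010 001 100 101</code> when written least significant bit first.</p>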
<p>We typically want to make sure the size of our compressed block is a nice rounded number of 32-bit words.
Regardless of the bitwidth, using blocks of a multiple of 32 elements should do the trick<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote">3</a></sup>.
The size of the blocks is not just a matter of arithmetic.
We also need to prepend each encoded block with the bitwidth used to encode it. Tantivy is not very
smart about that and burns an entire byte for this, while 6 bits (or 5 if you forbid the value 0) would have been sufficient<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">4</a></sup>. If our blocks are too small, we risk wasting too much space on repeatedly encoding the bitwidth. If our blocks are too large, the average bitwidth used will be larger, since a single large value forces the bitwidth of its entire block.<br />
Both Lucene and tantivy use blocks of 128 elements, but let’s stick to blocks of 32 elements for the moment.</p>
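<p>As a rough illustration of the header overhead, the sketch below computes the size of one encoded block, assuming the layout described above (one header byte for the bitwidth, followed by the packed payload). The function name is hypothetical, and this has not been checked against tantivy’s actual on-disk format.</p>

```rust
/// Size in bytes of one encoded block: a one-byte bitwidth header
/// followed by `block_len` values of `bit_width` bits each.
pub fn encoded_block_len(block_len: usize, bit_width: u8) -> usize {
    // A multiple of 32 elements guarantees a whole number of 32-bit words,
    // whatever the bitwidth.
    assert_eq!(block_len % 32, 0, "block_len must be a multiple of 32");
    assert!(bit_width <= 32);
    1 + block_len * bit_width as usize / 8
}
```

<p>With a bitwidth of 3, four blocks of 32 elements take 4 * 13 = 52 bytes, while a single block of 128 elements takes 49 bytes.</p>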
<p>We will also forget about delta-encoding and focus on bitpacking.</p>
<p>A simple implementation of bitpacking in pseudo-code might look like this.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">cmp</span><span class="p">::</span><span class="n">Ordering</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">pack_len</span><span class="p">(</span><span class="n">bit_width</span><span class="p">:</span> <span class="nb">u8</span><span class="p">)</span> <span class="k">-></span> <span class="nb">usize</span> <span class="p">{</span>
<span class="mi">32</span> <span class="o">*</span> <span class="n">bit_width</span> <span class="k">as</span> <span class="nb">usize</span> <span class="o">/</span> <span class="mi">8</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">bitpack</span><span class="p">(</span><span class="n">bit_width</span><span class="p">:</span> <span class="nb">u8</span><span class="p">,</span> <span class="n">uncompressed</span><span class="p">:</span> <span class="o">&</span><span class="p">[</span><span class="nb">u32</span><span class="p">;</span> <span class="mi">32</span><span class="p">],</span> <span class="k">mut</span> <span class="n">compressed</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="nb">u8</span><span class="p">])</span> <span class="p">{</span>
<span class="nd">assert_eq!</span><span class="p">(</span><span class="n">compressed</span><span class="nf">.len</span><span class="p">(),</span> <span class="nf">pack_len</span><span class="p">(</span><span class="n">bit_width</span><span class="p">));</span>
<span class="c">// We will use a `u32` as a mini buffer of 32 bits.</span>
<span class="c">// We accumulate bits in it until capacity, at which point we just copy this </span>
<span class="c">// mini buffer to compressed.</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">mini_buffer</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">0u32</span><span class="p">;</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">cursor</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c">//< number of bits written in the mini_buffer.</span>
<span class="k">for</span> <span class="n">el</span> <span class="n">in</span> <span class="n">uncompressed</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">remaining</span> <span class="o">=</span> <span class="mi">32</span> <span class="o">-</span> <span class="n">cursor</span><span class="p">;</span>
<span class="k">match</span> <span class="n">bit_width</span><span class="nf">.cmp</span><span class="p">(</span><span class="o">&</span><span class="n">remaining</span><span class="p">)</span> <span class="p">{</span>
<span class="nn">Ordering</span><span class="p">::</span><span class="n">Less</span> <span class="k">=></span> <span class="p">{</span>
<span class="c">// Plenty of room remaining in our mini buffer.</span>
<span class="n">mini_buffer</span> <span class="p">|</span><span class="o">=</span> <span class="n">el</span> <span class="o"><<</span> <span class="n">cursor</span><span class="p">;</span>
<span class="n">cursor</span> <span class="o">+=</span> <span class="n">bit_width</span><span class="p">;</span>
<span class="p">}</span>
<span class="nn">Ordering</span><span class="p">::</span><span class="n">Equal</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">mini_buffer</span> <span class="p">|</span><span class="o">=</span> <span class="n">el</span> <span class="o"><<</span> <span class="n">cursor</span><span class="p">;</span>
<span class="c">// We have completed our minibuffer exactly.</span>
<span class="c">// Let's write it to `compressed`.</span>
<span class="n">compressed</span><span class="p">[</span><span class="o">..</span><span class="mi">4</span><span class="p">]</span><span class="nf">.copy_from_slice</span><span class="p">(</span><span class="o">&</span><span class="n">mini_buffer</span><span class="nf">.to_le_bytes</span><span class="p">());</span>
<span class="n">compressed</span> <span class="o">=</span> <span class="o">&</span><span class="k">mut</span> <span class="n">compressed</span><span class="p">[</span><span class="mi">4</span><span class="o">..</span><span class="p">];</span>
<span class="n">mini_buffer</span> <span class="o">=</span> <span class="mi">0u32</span><span class="p">;</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="nn">Ordering</span><span class="p">::</span><span class="n">Greater</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">mini_buffer</span> <span class="p">|</span><span class="o">=</span> <span class="n">el</span> <span class="o"><<</span> <span class="n">cursor</span><span class="p">;</span>
<span class="c">// We have completed our minibuffer.</span>
<span class="c">// Let's write it to `compressed` and set the fresh mini_buffer</span>
<span class="c">// with the remaining bits.</span>
<span class="n">compressed</span><span class="p">[</span><span class="o">..</span><span class="mi">4</span><span class="p">]</span><span class="nf">.copy_from_slice</span><span class="p">(</span><span class="o">&</span><span class="n">mini_buffer</span><span class="nf">.to_le_bytes</span><span class="p">());</span>
<span class="n">compressed</span> <span class="o">=</span> <span class="o">&</span><span class="k">mut</span> <span class="n">compressed</span><span class="p">[</span><span class="mi">4</span><span class="o">..</span><span class="p">];</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="n">bit_width</span> <span class="o">-</span> <span class="n">remaining</span><span class="p">;</span>
<span class="n">mini_buffer</span> <span class="o">=</span> <span class="n">el</span> <span class="o">>></span> <span class="n">remaining</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nd">debug_assert!</span><span class="p">(</span><span class="n">compressed</span><span class="nf">.is_empty</span><span class="p">());</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Ok I lied… This is not pseudo-code but Rust. Isn’t it very readable?
In a nutshell, we accumulate the bits of our values in a <code class="language-plaintext highlighter-rouge">mini_buffer</code>
until saturation, at which point we flush it out. Rinse and repeat.</p>
<p><strong>I haven’t tested this code, so please don’t use it</strong>. The <a href="https://github.com/tantivy-search/bitpacking">bitpacking</a> crate contains a well-tested efficient implementation. But we’ll get there in a
second.</p>
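<p>For completeness, here is a hypothetical sketch of the reverse operation, paired with a compact packer so that the round trip can be checked. It uses a <code class="language-plaintext highlighter-rouge">u64</code> accumulator rather than a three-way <code class="language-plaintext highlighter-rouge">match</code>, and it is meant to follow the same layout (least significant bits first, little-endian 32-bit words). It is not taken from the bitpacking crate.</p>

```rust
/// Packs 32 values of `bit_width` bits each, LSB first, as little-endian words.
/// Values are assumed to fit in `bit_width` bits.
pub fn bitpack32(bit_width: u8, values: &[u32; 32]) -> Vec<u8> {
    assert!(bit_width <= 32);
    let mut out = Vec::with_capacity(4 * bit_width as usize);
    let mut buffer = 0u64; // pending bits, lowest bits first
    let mut pending = 0u32; // number of valid bits in `buffer` (always < 32 here)
    for &value in values {
        buffer |= u64::from(value) << pending;
        pending += u32::from(bit_width);
        if pending >= 32 {
            // A full 32-bit word is ready: flush it.
            out.extend_from_slice(&(buffer as u32).to_le_bytes());
            buffer >>= 32;
            pending -= 32;
        }
    }
    out
}

/// Reverses `bitpack32`.
pub fn bitunpack32(bit_width: u8, packed: &[u8]) -> [u32; 32] {
    assert!(bit_width <= 32);
    assert_eq!(packed.len(), 4 * bit_width as usize);
    let mask = (1u64 << bit_width) - 1;
    let mut words = packed
        .chunks_exact(4)
        .map(|w| u32::from_le_bytes([w[0], w[1], w[2], w[3]]));
    let mut out = [0u32; 32];
    let mut buffer = 0u64;
    let mut available = 0u32; // number of valid bits in `buffer`
    for slot in out.iter_mut() {
        if available < u32::from(bit_width) {
            // Not enough bits left: refill with the next 32-bit word.
            let word = words.next().expect("length was checked above");
            buffer |= u64::from(word) << available;
            available += 32;
        }
        *slot = (buffer & mask) as u32;
        buffer >>= bit_width;
        available -= u32::from(bit_width);
    }
    out
}
```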
<h1 id="simd-for-the-win">SIMD for the win.</h1>
<p>Now the key idea of the SIMD version of this algorithm is very simple.
Let’s use a 128-bit SIMD register (in Rust, you will find the type in <code class="language-plaintext highlighter-rouge">std::arch::x86_64::__m128i</code>) to represent an array of four 32-bit ints (i.e.: <code class="language-plaintext highlighter-rouge">[u32; 4]</code>) and let’s pack 4 integers at a time.</p>
<p>Note that this method leads to a different format than the scalar implementation we just saw.
Imagine people packing 32 books in boxes. For simplification, we’ll assume
each box can fit exactly 4 books, so that 8 boxes will be required.</p>
<p>If only one person is accomplishing this task:</p>
<ul>
<li>Box #0 will contain book #0..#3,</li>
<li>Box #1 will contain book #4..#7,</li>
<li>Box #2 will contain book #8..#11,</li>
<li>…</li>
<li>Box #N will contain book #(N*4)..#((N+1)*4 - 1)</li>
</ul>
<p>But if several people work together, surely things will go smoother if they
work on filling their own individual box. If they pick books from a common stack,
you should end up with the following books in the boxes.</p>
<p>On the first round, Packer #0, #1, #2, #3 will respectively pick book #0, #1, #2, #3
from the stack.</p>
<p>On the second round, they will respectively pick book #4, #5, #6, #7.</p>
<p>Box #0 was filled by packer 0, so it will contain book #0, #4, #8, #12.
That’s exactly the way things will happen in the SIMD implementation.</p>
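<p>The same interleaving can be emulated in scalar code. The sketch below is only an illustration of the idea, with hypothetical names: it keeps one mini buffer per lane, feeds lane <code class="language-plaintext highlighter-rouge">l</code> with integers <code class="language-plaintext highlighter-rouge">l</code>, <code class="language-plaintext highlighter-rouge">l + 4</code>, <code class="language-plaintext highlighter-rouge">l + 8</code>… and flushes the four lanes side by side. It has not been checked against the byte layout of simdcomp or of the bitpacking crate.</p>

```rust
/// Number of 32-bit lanes in a 128-bit register.
const LANES: usize = 4;

/// Packs 128 integers the way four cooperating "packers" would:
/// lane `l` only ever sees integers `l`, `l + 4`, `l + 8`, ...
/// Values are assumed to fit in `bit_width` bits.
pub fn pack_interleaved(bit_width: u8, values: &[u32; 128]) -> Vec<u32> {
    assert!(bit_width <= 32);
    let mut out = Vec::with_capacity(LANES * bit_width as usize);
    let mut buffers = [0u64; LANES]; // one mini buffer per lane
    let mut pending = 0u32; // bits currently held in each mini buffer
    for quad in values.chunks_exact(LANES) {
        for lane in 0..LANES {
            buffers[lane] |= u64::from(quad[lane]) << pending;
        }
        pending += u32::from(bit_width);
        if pending >= 32 {
            // Every lane fills up at the same time: flush one word per lane.
            for buffer in buffers.iter_mut() {
                out.push(*buffer as u32);
                *buffer >>= 32;
            }
            pending -= 32;
        }
    }
    out
}
```

<p>With 8-bit values, the first output word holds integers #0, #4, #8 and #12, just like box #0 in the analogy.</p>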
<h1 id="the-implementation-a-story-where-rust-really-shined">The implementation, a story where Rust really shined</h1>
<p>tantivy should work on architectures that may lack the
SSE3 instruction set (e.g. ARM, WebAssembly, very old x86 CPUs), and
ideally the index format should not depend on the architecture.</p>
<p>It was therefore necessary for me to also implement a fallback scalar
implementation that was compatible with the SSE3 format.</p>
<p>Daniel Lemire’s <a href="https://github.com/lemire/simdcomp">simdcomp</a> also has an AVX2
implementation that produces blocks of 256 integers.
Tantivy does not use that, but surely it could be handy for a fellow rustacean
some day?<br />
Finally a good old well-optimized unrolled scalar implementation could
definitely be useful to some people right? Already, we are discussing implementing <code class="language-plaintext highlighter-rouge">2 + 2 + 1 = 5</code> different flavors of bitpacking.</p>
<p>simdcomp and Lucene generate unrolled code for the different
<code class="language-plaintext highlighter-rouge">bit_width</code> values using a Python script. Implementing and maintaining that
kind of script is not an easy task… Doing it for</p>
<ul>
<li>a scalar implementation</li>
<li>a SSE3 implementation</li>
<li>a scalar fallback implementation for SSE3</li>
<li>AVX2</li>
<li>a scalar fallback implementation for AVX2</li>
</ul>
<p>sounded like a daunting challenge.</p>
<p>Now… Since we said the algorithm was conceptually the same for all of
these implementations, could we abstract out the bit that is the same from
what is actually different?</p>
<p>As we will learn in a second, using a trait to build such an abstraction for this
would require us to use const generics, and these are unfortunately not yet
available in rust. For this reason, the <a href="https://github.com/tantivy-search/bitpacking">bitpacking</a> crate relies on a macro.</p>
<p>Each bitpacking implementation gets its own module in which I simply
need to define the type they operate on, and a simple set of atomic operations.</p>
<p>Here are the data types for the different formats:</p>
<table>
<thead>
<tr>
<th>Implementation</th>
<th>DataType</th>
</tr>
</thead>
<tbody>
<tr>
<td>scalar</td>
<td>u32</td>
</tr>
<tr>
<td>sse3</td>
<td>__m128i</td>
</tr>
<tr>
<td>scalar fixture for sse3</td>
<td>[u32; 4]</td>
</tr>
<tr>
<td>avx2</td>
<td>__m256i</td>
</tr>
<tr>
<td>scalar fixture for avx2</td>
<td>[u32; 8]</td>
</tr>
</tbody>
</table>
<p>The simple set of operations is then :</p>
<ul>
<li>how to apply an OR</li>
<li>how to apply an AND</li>
<li>how to load a new chunk of data type</li>
<li>how to store a new chunk of data type</li>
<li>how to set the datatype to a scalar value</li>
<li>how to left shift</li>
<li>how to right shift</li>
</ul>
<p>That’s it. This is sufficient for bitpacking and bitunpacking!</p>
<p>For instance, the module for scalar operation will look like</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">mod</span> <span class="n">scalar</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">BLOCK_LEN</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="mi">32</span><span class="p">;</span>
<span class="k">type</span> <span class="n">DataType</span> <span class="o">=</span> <span class="nb">u32</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="n">read_unaligned</span> <span class="k">as</span> <span class="n">load_unaligned</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="n">write_unaligned</span> <span class="k">as</span> <span class="n">store_unaligned</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">set1</span><span class="p">(</span><span class="n">el</span><span class="p">:</span> <span class="nb">i32</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="n">el</span> <span class="k">as</span> <span class="nb">u32</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">right_shift_32</span><span class="p">(</span><span class="n">el</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">shift</span><span class="p">:</span> <span class="nb">i32</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="n">el</span> <span class="o">>></span> <span class="n">shift</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">left_shift_32</span><span class="p">(</span><span class="n">el</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">shift</span><span class="p">:</span> <span class="nb">i32</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="n">el</span> <span class="o"><<</span> <span class="n">shift</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">op_or</span><span class="p">(</span><span class="n">left</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">right</span><span class="p">:</span> <span class="n">DataType</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="n">left</span> <span class="p">|</span> <span class="n">right</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">op_and</span><span class="p">(</span><span class="n">left</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">right</span><span class="p">:</span> <span class="n">DataType</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="n">left</span> <span class="o">&</span> <span class="n">right</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The module for SSE3, on the other hand, will look like this.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">mod</span> <span class="n">sse3</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">BLOCK_LEN</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">__</span><span class="n">m128i</span> <span class="k">as</span> <span class="n">DataType</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">_</span><span class="n">mm_and_si128</span> <span class="k">as</span> <span class="n">op_and</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">_</span><span class="n">mm_lddqu_si128</span> <span class="k">as</span> <span class="n">load_unaligned</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">_</span><span class="n">mm_or_si128</span> <span class="k">as</span> <span class="n">op_or</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">_</span><span class="n">mm_set1_epi32</span> <span class="k">as</span> <span class="n">set1</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">_</span><span class="n">mm_slli_epi32</span> <span class="k">as</span> <span class="n">left_shift_32</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">_</span><span class="n">mm_srli_epi32</span> <span class="k">as</span> <span class="n">right_shift_32</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="mi">_</span><span class="n">mm_storeu_si128</span> <span class="k">as</span> <span class="n">store_unaligned</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p><strong>This is the only amount of SSE3 code required!</strong></p>
<p>A single rust macro will then generate a lot of code to bitpack and bitunpack for
every single bit width from 0 to 32 inclusive… All of this comes for free
with unit tests and french fries.</p>
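<p>To give an idea of what the scalar fallback for SSE3 has to provide, here is a hypothetical sketch of such a module operating on <code class="language-plaintext highlighter-rouge">[u32; 4]</code>. The module name and the bodies are assumptions made for illustration, not the crate’s code. Note that a plain memory copy only matches the SSE3 layout on little-endian targets, and that scalar shifts by 32 or more are not allowed, whereas the SSE intrinsics return zero in that case.</p>

```rust
pub mod scalar_fallback_sse3 {
    pub const BLOCK_LEN: usize = 128;
    pub type DataType = [u32; 4];

    /// Broadcasts a scalar to the four lanes.
    pub fn set1(el: i32) -> DataType {
        [el as u32; 4]
    }

    /// Shifts every lane right. `shift` must be in `0..32`.
    pub fn right_shift_32(el: DataType, shift: i32) -> DataType {
        el.map(|lane| lane >> shift)
    }

    /// Shifts every lane left. `shift` must be in `0..32`.
    pub fn left_shift_32(el: DataType, shift: i32) -> DataType {
        el.map(|lane| lane << shift)
    }

    pub fn op_or(left: DataType, right: DataType) -> DataType {
        std::array::from_fn(|i| left[i] | right[i])
    }

    pub fn op_and(left: DataType, right: DataType) -> DataType {
        std::array::from_fn(|i| left[i] & right[i])
    }

    /// # Safety
    /// `addr` must be valid for reading 16 bytes.
    pub unsafe fn load_unaligned(addr: *const DataType) -> DataType {
        std::ptr::read_unaligned(addr)
    }

    /// # Safety
    /// `addr` must be valid for writing 16 bytes.
    pub unsafe fn store_unaligned(addr: *mut DataType, data: DataType) {
        std::ptr::write_unaligned(addr, data)
    }
}
```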
<p>One thing though… <code class="language-plaintext highlighter-rouge">_mm_slli_epi32</code> and <code class="language-plaintext highlighter-rouge">_mm_srli_epi32</code> require the operand
that represents the number of bits to shift to be a compile-time constant.</p>
<p>As you may have guessed, this is where const generics would have really been
helpful. Even using macros, our little routine for bitpacking will not compile
without a little more work. The number of bits in left/right bit shifts
is dynamically computed and depends on the loop iteration.
Of course forcing the compiler to unroll the loop will not cut it either. The
compiler error is happening in the early steps of the compilation.</p>
<p>Fortunately someone made a loop unrolling macro crate called
<a href="https://crates.io/crates/crunchy">crunchy</a>. It simply unrolls loops of the form</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">N</span> <span class="p">{</span>
<span class="c">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is perfect. After unrolling, all our dynamic bit shifts are effectively constant.</p>
<p>Our unpacking routine in our macro looks like this.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span><span class="p">(</span><span class="n">crate</span><span class="p">)</span> <span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">unpack</span><span class="p">(</span><span class="n">compressed</span><span class="p">:</span> <span class="o">&</span><span class="p">[</span><span class="nb">u8</span><span class="p">],</span> <span class="n">uncompressed</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="nb">u32</span><span class="p">])</span> <span class="p">{</span>
<span class="k">assert</span><span class="o">!</span><span class="p">(</span><span class="n">compressed</span><span class="nf">.len</span><span class="p">()</span> <span class="o">>=</span> <span class="n">NUM_BYTES_PER_BLOCK</span><span class="p">,</span> <span class="s">"Compressed array seems too small. ({} < {}) "</span><span class="p">,</span> <span class="n">compressed</span><span class="nf">.len</span><span class="p">(),</span> <span class="n">NUM_BYTES_PER_BLOCK</span><span class="p">);</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">input_ptr</span> <span class="o">=</span> <span class="n">compressed</span><span class="nf">.as_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="o">*</span><span class="k">const</span> <span class="n">DataType</span><span class="p">;</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">output_ptr</span> <span class="o">=</span> <span class="n">uncompressed</span><span class="nf">.as_mut_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="n">DataType</span><span class="p">;</span>
<span class="k">let</span> <span class="n">mask_scalar</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="p">((</span><span class="mi">1u64</span> <span class="o"><<</span> <span class="n">BIT_WIDTH</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1u64</span><span class="p">)</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">;</span>
<span class="k">let</span> <span class="n">mask</span> <span class="o">=</span> <span class="nf">set1</span><span class="p">(</span><span class="n">mask_scalar</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">);</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">in_register</span><span class="p">:</span> <span class="n">DataType</span> <span class="o">=</span> <span class="nf">load_unaligned</span><span class="p">(</span><span class="n">input_ptr</span><span class="p">);</span>
<span class="k">let</span> <span class="n">out_register</span> <span class="o">=</span> <span class="nf">op_and</span><span class="p">(</span><span class="n">in_register</span><span class="p">,</span> <span class="n">mask</span><span class="p">);</span>
<span class="nf">store_unaligned</span><span class="p">(</span><span class="n">output_ptr</span><span class="p">,</span> <span class="n">out_register</span><span class="p">);</span>
<span class="n">output_ptr</span> <span class="o">=</span> <span class="n">output_ptr</span><span class="nf">.add</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="nd">unroll!</span> <span class="p">{</span> <span class="c">// the loop unrolling macro</span>
<span class="k">for</span> <span class="n">iter</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">31</span> <span class="p">{</span> <span class="c">//< that's certainly a bummer, but it only handles</span>
<span class="c">// loops starting at 0 </span>
<span class="k">const</span> <span class="n">i</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="n">iter</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">const</span> <span class="n">inner_cursor</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">*</span> <span class="n">BIT_WIDTH</span><span class="p">)</span> <span class="o">%</span> <span class="mi">32</span><span class="p">;</span>
<span class="k">const</span> <span class="n">inner_capacity</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="mi">32</span> <span class="o">-</span> <span class="n">inner_cursor</span><span class="p">;</span>
<span class="c">// LLVM will not emit the shift operand if</span>
<span class="c">// `inner_cursor` is 0.</span>
<span class="k">let</span> <span class="n">shifted_in_register</span> <span class="o">=</span> <span class="nf">right_shift_32</span><span class="p">(</span><span class="n">in_register</span><span class="p">,</span> <span class="n">inner_cursor</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">);</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">out_register</span><span class="p">:</span> <span class="n">DataType</span> <span class="o">=</span> <span class="nf">op_and</span><span class="p">(</span><span class="n">shifted_in_register</span><span class="p">,</span> <span class="n">mask</span><span class="p">);</span>
<span class="c">// We consumed our current quadruplets entirely.</span>
<span class="c">// We therefore read another one.</span>
<span class="k">if</span> <span class="n">inner_capacity</span> <span class="o"><=</span> <span class="n">BIT_WIDTH</span> <span class="o">&&</span> <span class="n">i</span> <span class="o">!=</span> <span class="mi">31</span> <span class="p">{</span>
<span class="n">input_ptr</span> <span class="o">=</span> <span class="n">input_ptr</span><span class="nf">.add</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="n">in_register</span> <span class="o">=</span> <span class="nf">load_unaligned</span><span class="p">(</span><span class="n">input_ptr</span><span class="p">);</span>
<span class="c">// The value straddles two quadruplets: its remaining</span>
<span class="c">// high bits come from the one we just loaded.</span>
<span class="k">if</span> <span class="n">inner_capacity</span> <span class="o"><</span> <span class="n">BIT_WIDTH</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">shifted</span> <span class="o">=</span> <span class="nf">left_shift_32</span><span class="p">(</span><span class="n">in_register</span><span class="p">,</span> <span class="n">inner_capacity</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">);</span>
<span class="k">let</span> <span class="n">masked</span> <span class="o">=</span> <span class="nf">op_and</span><span class="p">(</span><span class="n">shifted</span><span class="p">,</span> <span class="n">mask</span><span class="p">);</span>
<span class="n">out_register</span> <span class="o">=</span> <span class="nf">op_or</span><span class="p">(</span><span class="n">out_register</span><span class="p">,</span> <span class="n">masked</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nf">store_unaligned</span><span class="p">(</span><span class="n">output_ptr</span><span class="p">,</span> <span class="n">out_register</span><span class="p">);</span>
<span class="n">output_ptr</span> <span class="o">=</span> <span class="n">output_ptr</span><span class="nf">.add</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Beautiful!</p>
<h1 id="what-was-it-about-this-lucene-performance-opportunity">What was it about this lucene performance opportunity?</h1>
<p>Sure, Lucene cannot use SSE3 instructions as they are not accessible
in Java… but here is a funky idea: what if we tried to do SIMD simply
using operations on 64-bit integers?</p>
<p>Technically, SIMD simply stands for “single instruction, multiple data”, doesn’t it?
Well, how far can we go emulating our operations over a <code class="language-plaintext highlighter-rouge">[u32; 2]</code> using a <code class="language-plaintext highlighter-rouge">u64</code>?
If we are lucky, we could process two integers at a time!</p>
<p>So let’s go through the operations one by one…</p>
<ul>
<li>bitwise AND operation. ✓</li>
<li>bitwise OR. ✓</li>
<li>set value ✓</li>
<li>store / load ✓</li>
<li>left/right bitshifts… Hum. This one is a bit tricky.</li>
</ul>
<p>We want these bitshifts to stay in our little <code class="language-plaintext highlighter-rouge">[u32; 2]</code> compartments, which is not the
default behavior of a bitshift on <code class="language-plaintext highlighter-rouge">u64</code>. A bit mask should prevent bits from
leaking to the next compartment, shouldn’t it?</p>
<p>Here is the implementation I ended up with.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">mod</span> <span class="n">fakesimd</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">BLOCK_LEN</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="mi">64</span><span class="p">;</span>
<span class="k">type</span> <span class="n">DataType</span> <span class="o">=</span> <span class="nb">u64</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">set1</span><span class="p">(</span><span class="n">el</span><span class="p">:</span> <span class="nb">i32</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">el</span> <span class="o">=</span> <span class="n">el</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">;</span>
<span class="n">el</span> <span class="p">|</span> <span class="n">el</span> <span class="o"><<</span> <span class="mi">32</span>
<span class="p">}</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="n">read_unaligned</span> <span class="k">as</span> <span class="n">load_unaligned</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="n">write_unaligned</span> <span class="k">as</span> <span class="n">store_unaligned</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">compute_mask</span><span class="p">(</span><span class="n">num_bits</span><span class="p">:</span> <span class="nb">u64</span><span class="p">)</span> <span class="k">-></span> <span class="nb">u64</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u64</span> <span class="o"><<</span> <span class="n">num_bits</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1u64</span><span class="p">;</span>
<span class="n">mask</span> <span class="p">|</span> <span class="p">(</span><span class="n">mask</span> <span class="o"><<</span> <span class="mi">32</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">right_shift_32</span><span class="p">(</span><span class="n">el</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">shift</span><span class="p">:</span> <span class="nb">i32</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">shift</span> <span class="o">=</span> <span class="n">shift</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">;</span>
<span class="k">let</span> <span class="n">mask</span> <span class="o">=</span> <span class="nf">compute_mask</span><span class="p">(</span><span class="n">shift</span><span class="p">);</span>
<span class="p">(</span><span class="n">el</span> <span class="o">&</span> <span class="o">!</span><span class="n">mask</span><span class="p">)</span> <span class="o">>></span> <span class="n">shift</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">left_shift_32</span><span class="p">(</span><span class="n">el</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">shift2</span><span class="p">:</span> <span class="nb">i32</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">shift</span> <span class="o">=</span> <span class="n">shift2</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">;</span>
<span class="k">let</span> <span class="n">mask</span> <span class="o">=</span> <span class="nf">compute_mask</span><span class="p">(</span><span class="mi">32</span><span class="o">-</span><span class="n">shift</span><span class="p">);</span>
<span class="p">(</span><span class="n">el</span> <span class="o">&</span> <span class="n">mask</span><span class="p">)</span> <span class="o"><<</span> <span class="n">shift</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">op_or</span><span class="p">(</span><span class="n">left</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">right</span><span class="p">:</span> <span class="n">DataType</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="n">left</span> <span class="p">|</span> <span class="n">right</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">op_and</span><span class="p">(</span><span class="n">left</span><span class="p">:</span> <span class="n">DataType</span><span class="p">,</span> <span class="n">right</span><span class="p">:</span> <span class="n">DataType</span><span class="p">)</span> <span class="k">-></span> <span class="n">DataType</span> <span class="p">{</span>
<span class="n">left</span> <span class="o">&</span> <span class="n">right</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">or_collapse_to_u32</span><span class="p">(</span><span class="n">accumulator</span><span class="p">:</span> <span class="n">DataType</span><span class="p">)</span> <span class="k">-></span> <span class="nb">u32</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">high</span> <span class="o">=</span> <span class="n">accumulator</span> <span class="o">>></span> <span class="mi">32u64</span><span class="p">;</span>
<span class="k">let</span> <span class="n">low</span> <span class="o">=</span> <span class="n">accumulator</span> <span class="o">%</span> <span class="p">(</span><span class="mi">1u64</span> <span class="o"><<</span> <span class="mi">32</span><span class="p">);</span>
<span class="p">(</span><span class="n">high</span> <span class="p">|</span> <span class="n">low</span><span class="p">)</span> <span class="k">as</span> <span class="nb">u32</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
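<p>To see what the mask buys us, here is a small standalone sketch that packs two <code class="language-plaintext highlighter-rouge">u32</code> lanes into a <code class="language-plaintext highlighter-rouge">u64</code> and compares a plain shift with the masked one. The helpers <code class="language-plaintext highlighter-rouge">pack</code>, <code class="language-plaintext highlighter-rouge">lanes</code> and <code class="language-plaintext highlighter-rouge">naive_right_shift</code> are hypothetical names introduced for illustration only, and shifts are assumed to stay within 0..=32.</p>

```rust
/// Packs two u32 lanes into one u64: `low` in bits 0..32, `high` in bits 32..64.
fn pack(low: u32, high: u32) -> u64 {
    (low as u64) | ((high as u64) << 32)
}

/// Splits a u64 back into its (low, high) u32 lanes.
fn lanes(el: u64) -> (u32, u32) {
    (el as u32, (el >> 32) as u32)
}

/// Same idea as `compute_mask` in the post: selects the `num_bits` lowest
/// bits of each lane. Assumes `num_bits <= 32`.
fn compute_mask(num_bits: u64) -> u64 {
    let mask = (1u64 << num_bits) - 1u64;
    mask | (mask << 32)
}

/// Plain u64 shift: the low bits of the high lane leak into the low lane.
fn naive_right_shift(el: u64, shift: u32) -> u64 {
    el >> shift
}

/// Masked shift: the bits that would cross the lane boundary are cleared first,
/// so each lane behaves like an independent u32 shift.
fn right_shift_32(el: u64, shift: u32) -> u64 {
    (el & !compute_mask(shift as u64)) >> shift
}
```

<p>With lanes <code class="language-plaintext highlighter-rouge">(0x10, 0x3)</code> and a shift of 4, the naive version returns <code class="language-plaintext highlighter-rouge">(0x3000_0001, 0)</code> because the 3 spills over into the low lane, whereas the masked version returns <code class="language-plaintext highlighter-rouge">(1, 0)</code>, which is what two separate <code class="language-plaintext highlighter-rouge">u32</code> shifts would give.</p>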
<p>Hurray, the tests are passing!
Was there a performance benefit to jumping through this hoop?</p>
<h1 id="benchmark">Benchmark</h1>
<p>The following benchmark was run on my laptop, which is powered by an
<code class="language-plaintext highlighter-rouge">Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz</code>.</p>
<p>Here are the results when compressing integers with a bitwidth of 15.</p>
<table>
<thead>
<tr>
<th>Implementation</th>
<th>Unpack throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>scalar</td>
<td>1.48 billion integers/s</td>
</tr>
<tr>
<td>fake SIMD using u64</td>
<td>2.71 billion integers/s</td>
</tr>
<tr>
<td>sse3</td>
<td>6 billion integers/s</td>
</tr>
</tbody>
</table>
<p>Can this technique be used to get faster bitpacking in Java? I have no
idea. Maybe my initial scalar implementation sucked for reasons I do not
grasp? Also, I did not talk about the integration of the deltas, which is a
rabbit hole of a subject in itself.</p>
<p>I also wonder if anyone has used this trick to get a little bit more
performance on a different problem.</p>
<p>But this blog post has reached a decent length and, trust me, you are probably
the only reader who has read this far.</p>
<hr />
<p>compute the minimum bit width required for a given array, but I decided to
deceptively show only the simple stuff in this blog post.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:4" role="doc-endnote">
<p>And seek into, but this is not the subject of this blog post. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>Elias Fano is very interesting too. I will try to blog about a possible cool usage of Elias Fano in search in a future blog post. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Of course, some of the possible bitwidths are not coprime with 32 and may reach alignment before 32, but 32 has the merit of working for any bitwidth. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Feel free to tell people tantivy wastes 3 bits for every 128 integers encoded at your next cocktail party :). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Of tantivy indexing datastructure, the stacker2019-04-16T00:00:00+00:00https://fulmicoton.com/posts/tantivy-stacker<p><em>For those who came from reddit and are not familiar with tantivy. <a href="https://github.com/tantivy-search/tantivy">tantivy</a> is a search engine library for Rust. It is strongly inspired by lucene.</em></p>
<h1 id="things-tantivy-does-differently">Things tantivy does differently</h1>
<p>Some developers have a strange aversion to reading other projects’ source code.
I am not sure if this is an aftereffect of the school system, where this would be called cheating, or if it is a side-effect of people taking pride in viewing programming as a “creative” job.</p>
<p>I think this approach to software development is suboptimal to say the least.</p>
<p>It is not a secret: Tantivy is very heavily inspired by Lucene’s design and implementation. Whenever I have a doubt about how to implement something, I dive into books and other search engines’ implementations, and I always treat Lucene as the default solution. <strong>It would be stupid to do otherwise: Lucene is simply the most battle-tested open-source search engine implementation.</strong></p>
<p>There are however a couple of parts where I knowingly decided to do things differently from Lucene… Some of these solutions are a little bit original, in that I have never seen them used before. If someone else happens to like these ideas, I’d love it if people came and applied them somewhere else.</p>
<p>I hope I will be able to find time in the future to describe these ideas one by one, but let’s be honest: my life has been very busy lately, and it is fairly difficult for me to find time to blog.</p>
<p>In this first post, I will describe a datastructure that is at the core of tantivy’s indexer and differs from what I have seen so far. I suspect it could be helpful in very different contexts, to build a real time search/reverse search engine, to implement a map-reduce engine more efficiently, sort a log into user sessions, implement an alternative <code class="language-plaintext highlighter-rouge">sort -k</code>, etc.</p>
<h3 id="the-problem-we-are-trying-to-solve--building-an-inverted-index-efficiently">The problem we are trying to solve : Building an inverted index efficiently.</h3>
<p>Most search engines rely on a central datastructure called an inverted index.
For the sake of simplification, let’s consider that documents consist of a single text field, and are identified by a <code class="language-plaintext highlighter-rouge">DocId</code>.</p>
<p>Each document’s text is split into a sequence of words. We call those words <em>tokens</em> and the splitting operation <em>tokenization</em>.
The inverted index is just like the index at the end of a book. For each token, it associates the list of document ids containing that token. This list of documents is also called a <em>posting list</em>.</p>
<p>In other words, your set of documents might look like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>doc1 -> [life, is, a, moderately, good, play, with, a, badly, written, third, act]
doc2 -> [life, is, a, long, process, of, getting, tired]
doc3 -> [life, is, not, so, bad, if, you, have, plenty, of, luck,...]
...
</code></pre></div></div>
<p>and your inverted index looks like :</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>play -> [1]
plenty -> [3]
luck -> [3]
process -> [2]
life -> [1,2,3]
is -> [1,2,3]
a -> [1,2]
moderately -> [1]
...
</code></pre></div></div>
<p>As explained in previous blogposts (<a href="/posts/behold-tantivy/">[part 1]</a> and <a href="/posts/behold-tantivy-part2/">[part 2]</a>), tantivy’s inverted index representation is extremely compact and efficient… but it is immutable.</p>
<p>Tantivy’s indexing process itself requires some mutable in-RAM datastructure.</p>
<p>A <code class="language-plaintext highlighter-rouge">HashMap<String, Vec<DocId>></code> would be a decent candidate for this job.</p>
<p>Our indexing function would then look as follows.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">collections</span><span class="p">::</span><span class="n">HashMap</span><span class="p">;</span>
<span class="k">pub</span> <span class="k">type</span> <span class="n">DocId</span> <span class="o">=</span> <span class="nb">u32</span><span class="p">;</span>
<span class="k">pub</span> <span class="k">type</span> <span class="n">InvertedIndex</span> <span class="o">=</span> <span class="n">HashMap</span><span class="o"><</span><span class="nb">String</span><span class="p">,</span> <span class="nb">Vec</span><span class="o"><</span><span class="n">DocId</span><span class="o">>></span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="o">&</span><span class="nb">str</span><span class="p">)</span> <span class="k">-></span> <span class="k">impl</span> <span class="n">Iterator</span><span class="o"><</span><span class="n">Item</span><span class="o">=&</span><span class="nb">str</span><span class="o">></span> <span class="p">{</span>
<span class="n">text</span><span class="nf">.split_whitespace</span><span class="p">()</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">build_index</span><span class="o"><</span><span class="nv">'a</span><span class="p">,</span> <span class="n">Corpus</span><span class="p">:</span> <span class="n">Iterator</span><span class="o"><</span><span class="n">Item</span><span class="o">=</span><span class="p">(</span><span class="n">DocId</span><span class="p">,</span> <span class="o">&</span><span class="nv">'a</span> <span class="nb">str</span><span class="p">)</span><span class="o">>></span><span class="p">(</span><span class="n">corpus</span><span class="p">:</span> <span class="n">Corpus</span><span class="p">)</span> <span class="k">-></span> <span class="n">InvertedIndex</span> <span class="p">{</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">inverted_index</span> <span class="o">=</span> <span class="nn">InvertedIndex</span><span class="p">::</span><span class="nf">default</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="n">doc_id</span><span class="p">,</span> <span class="n">doc_text</span><span class="p">)</span> <span class="n">in</span> <span class="n">corpus</span> <span class="p">{</span>
<span class="k">for</span> <span class="n">token</span> <span class="n">in</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">doc_text</span><span class="p">)</span> <span class="p">{</span>
<span class="n">inverted_index</span>
<span class="nf">.entry</span><span class="p">(</span><span class="n">token</span><span class="nf">.to_string</span><span class="p">())</span>
<span class="nf">.or_insert_with</span><span class="p">(</span><span class="nn">Vec</span><span class="p">::</span><span class="n">new</span><span class="p">)</span>
<span class="nf">.push</span><span class="p">(</span><span class="n">doc_id</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">inverted_index</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As we append documents to this mutable/in-RAM datastructure,
we need to detect when it reaches some user-defined memory budget,
pause indexing, and serialize this datastructure to our more compact & efficient on-disk index format.</p>
<p>In reality, depending on our schema, we record more information per term occurrence than simply <code class="language-plaintext highlighter-rouge">DocId</code>. We might for instance record the term frequency and the positions of the term in the document. Also, in order to stay memory efficient, these <code class="language-plaintext highlighter-rouge">DocId</code>s are somewhat compressed. This is not much of a problem: we can generalize our solution by exchanging our <code class="language-plaintext highlighter-rouge">HashMap<String, Vec<DocId>></code> for a <code class="language-plaintext highlighter-rouge">HashMap<String, Vec<u8>></code>.</p>
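<p>As an illustration of what such a byte buffer could hold, here is a minimal sketch that assumes each <code class="language-plaintext highlighter-rouge">DocId</code> is stored as the delta from the previous one, using a variable-length encoding of 7 bits per byte. This is an assumption made for the example, not a description of tantivy’s actual in-RAM format, and <code class="language-plaintext highlighter-rouge">PostingBuffer</code> is a hypothetical name.</p>

```rust
use std::collections::HashMap;

pub type DocId = u32;

/// Append-only posting list stored as raw bytes.
/// Hypothetical layout: each DocId is written as the delta from the previous
/// one, 7 bits per byte, the high bit meaning "more bytes follow".
#[derive(Default)]
pub struct PostingBuffer {
    bytes: Vec<u8>,
    last_doc: DocId,
}

impl PostingBuffer {
    /// Appends `doc`. DocIds must be pushed in non-decreasing order.
    pub fn push(&mut self, doc: DocId) {
        let mut delta = doc - self.last_doc;
        self.last_doc = doc;
        while delta >= 0x80 {
            self.bytes.push((delta as u8 & 0x7f) | 0x80);
            delta >>= 7;
        }
        self.bytes.push(delta as u8);
    }

    /// Decodes the whole buffer sequentially.
    pub fn decode(&self) -> Vec<DocId> {
        let mut docs = Vec::new();
        let (mut doc, mut delta, mut shift) = (0u32, 0u32, 0u32);
        for &byte in &self.bytes {
            delta |= ((byte & 0x7f) as u32) << shift;
            if byte & 0x80 == 0 {
                // Last byte of this delta: emit the DocId and reset.
                doc += delta;
                docs.push(doc);
                delta = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        docs
    }

    /// Number of bytes used by the encoded posting list.
    pub fn num_bytes(&self) -> usize {
        self.bytes.len()
    }
}

/// The generalized index: one byte buffer per term.
pub type InvertedIndex = HashMap<String, PostingBuffer>;
```

<p>In the indexing loop, the call would become something like <code class="language-plaintext highlighter-rouge">index.entry(token.to_string()).or_default().push(doc_id)</code>. Small deltas take a single byte, which is where the memory saving comes from.</p>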
<p>In the end, our problem can be summed up as: how can we maintain and efficiently write into tens of thousands of buffers at the same time?</p>
<h1 id="specifications-for-our-problem">Specifications for our problem</h1>
<p>Let’s sum up our specifications, to see where we can improve on our original <code class="language-plaintext highlighter-rouge">HashMap</code> solution.</p>
<p><strong>First, we need to know our memory usage accurately.</strong>
Tantivy’s API lets the user specify a memory budget, and it is tantivy’s duty to stick to it. That way, tantivy can index datasets of any size (I already indexed corpuses of 5TB on an 8GB RAM machine) without swapping. This is extremely comfortable for the user.</p>
<p><strong>Second, our memory usage should be a slim fit for all terms.</strong></p>
<p>We cannot really make an assumption about the distribution of the document frequencies of our terms. When indexing text, we will typically reach tens of thousands of terms before
we decide to flush the segment. Word frequencies typically follow <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf’s law</a>. Frequent words will be very frequent and their associated buffer will be very large. However a lot of rare words will appear only once and will require a tiny buffer. We need to be sure our solution is a tight fit for every one of them.</p>
<p>When indexing logs, we will also reach a lot of extremes depending on the field: timestamps may be unique, while an AB-test group or a country could be heavily saturated.</p>
<p><strong>Third, we do not care about reading our buffers until we start serializing our index</strong>… And at that point, we only read our data sequentially.</p>
<p><strong>Fourth, we never delete any data and we release all our memory at the same time.</strong> The pattern in which we populate our HashMap is very simple. We only insert new terms, and never delete any. We only append new data to posting lists. All memory is released in bulk at the very end.</p>
<h1 id="behold-the-stacker">Behold, the stacker</h1>
<p>I called tantivy’s solution to this problem the “stacker”.</p>
<p>You probably guessed it, we will put all of our data in a memory arena.
Using a memory arena will remove the overhead associated with allocating memory, and deallocation will be as fast and simple as wiping off a magna doodle. Also, we will keep a super accurate idea of the memory usage.</p>
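<p>For readers who have never met one, a memory arena can be sketched in a few lines. The version below is a deliberately simplified illustration backed by a single <code class="language-plaintext highlighter-rouge">Vec<u8></code>; the <code class="language-plaintext highlighter-rouge">Arena</code> type and its method names are assumptions made for this example and do not mirror tantivy’s real implementation.</p>

```rust
/// A minimal bump arena: memory is only ever appended, addressed with
/// 32-bit offsets, and released all at once.
pub struct Arena {
    buffer: Vec<u8>,
}

impl Arena {
    pub fn with_capacity(capacity: usize) -> Arena {
        Arena {
            buffer: Vec::with_capacity(capacity),
        }
    }

    /// Reserves `len` zeroed bytes and returns their address in the arena.
    pub fn allocate(&mut self, len: usize) -> u32 {
        let addr = self.buffer.len() as u32;
        self.buffer.resize(self.buffer.len() + len, 0u8);
        addr
    }

    /// Gives mutable access to `len` bytes previously allocated at `addr`.
    pub fn slice_mut(&mut self, addr: u32, len: usize) -> &mut [u8] {
        let start = addr as usize;
        &mut self.buffer[start..start + len]
    }

    /// Exact number of bytes handed out so far.
    pub fn mem_usage(&self) -> usize {
        self.buffer.len()
    }

    /// Releases everything at once; the backing allocation is kept for reuse.
    pub fn clear(&mut self) {
        self.buffer.clear();
    }
}
```

<p>Allocation is a length bump, the memory usage is known to the byte, and releasing everything is a single <code class="language-plaintext highlighter-rouge">clear()</code>. The price is that individual allocations can never be freed or resized in place.</p>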
<p>We cannot free memory anymore though. <code class="language-plaintext highlighter-rouge">Vec<u8></code> is not a viable solution anymore,
as we would not be able to reclaim the memory of its old buffer after it is resized. Instead, a common solution could be to use an <a href="https://en.wikipedia.org/wiki/Unrolled_linked_list">unrolled linked list</a>. The problem with unrolled linked lists is that choosing the size of our blocks is a very complicated problem. If our blocks are too large, space will be wasted for rare terms. If blocks are too small, then reading the more frequent terms will require jumping around in memory a lot. We would love to have large blocks for frequent terms, and small blocks for rare terms.</p>
<p><code class="language-plaintext highlighter-rouge">Vec</code>’s resize policy has an elegant solution to that problem. Doubling the capacity every time <code class="language-plaintext highlighter-rouge">Vec</code> reaches its limit guarantees that at most <code class="language-plaintext highlighter-rouge">50%</code> of the memory is wasted.</p>
<p>Tantivy’s stacker takes the best of both worlds. We keep the benefits of unrolled linked lists and <code class="language-plaintext highlighter-rouge">Vec</code> by using blocks growing exponentially in size. The first block has a capacity of 16 bytes, the second block has a size of 32, and then it goes 64B, 128B, 256B, 512B, 1KB, 2KB, 4KB, 8KB, 16KB, 32KB, above which it stagnates.</p>
<p>This way, depending on the payload, we remain somewhere in between 50% and 100% of memory utilization.</p>
<p><img src="/images/stacker.png" alt="Stacker figure" /></p>
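<p>The growth policy can be written down as a tiny function. The sketch below assumes strict doubling from 16 bytes up to the 32KB cap and ignores whatever bookkeeping each block needs to point to the next one; <code class="language-plaintext highlighter-rouge">block_capacity</code> and <code class="language-plaintext highlighter-rouge">blocks_for</code> are hypothetical names introduced for illustration.</p>

```rust
const FIRST_BLOCK_LEN: usize = 16;

/// Capacity in bytes of the `n`-th block (0-based) of a list.
fn block_capacity(n: u32) -> usize {
    // 16 << 11 == 32KB, so every later block keeps the maximum size.
    FIRST_BLOCK_LEN << n.min(11)
}

/// Returns (number of blocks, total bytes allocated) for a list
/// holding `payload_len` bytes.
fn blocks_for(payload_len: usize) -> (u32, usize) {
    let (mut num_blocks, mut allocated) = (0u32, 0usize);
    while allocated < payload_len {
        allocated += block_capacity(num_blocks);
        num_blocks += 1;
    }
    (num_blocks, allocated)
}
```

<p>Under these assumptions, a rare term with a 10-byte payload fits in a single 16-byte block, while a 1000-byte payload needs six blocks totalling 1008 bytes.</p>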
<h1 id="implementation-details">Implementation details.</h1>
<ul>
<li>The memory arena also makes it possible to rely on 32-bit addresses instead of full-width pointers.</li>
<li>The hashmap is not shaped like usual hashmaps that have a String as a key. Each bucket only contains the hash key (currently 32-bits), and an address in the arena. The arena contains the key length (over 16-bits) followed by the key bytes, followed by the first block of our exponential unrolled linked list. This improves the locality between the key and the value, while keeping the table itself as lean as possible.</li>
</ul>
<p>It also makes it possible for me to store all fields in the same hashmap, even though they require values of different types.</p>
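<p>The layout described in the list above can be sketched as follows. This is only an illustration under stated assumptions: the arena is a plain <code class="language-plaintext highlighter-rouge">Vec<u8></code>, the key length is written in little endian, and <code class="language-plaintext highlighter-rouge">Bucket</code>, <code class="language-plaintext highlighter-rouge">store_key</code> and <code class="language-plaintext highlighter-rouge">read_key</code> are hypothetical names, not tantivy’s API.</p>

```rust
/// One hash table bucket: only a hash and a 32-bit arena address.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bucket {
    hash: u32,
    addr: u32,
}

const FIRST_BLOCK_LEN: usize = 16;

/// Appends `[key length: u16][key bytes][first block]` to the arena
/// and returns the address of the entry.
fn store_key(arena: &mut Vec<u8>, key: &[u8]) -> u32 {
    assert!(key.len() <= u16::MAX as usize);
    let addr = arena.len() as u32;
    arena.extend_from_slice(&(key.len() as u16).to_le_bytes());
    arena.extend_from_slice(key);
    // Room for the first block of the value associated with this key.
    arena.resize(arena.len() + FIRST_BLOCK_LEN, 0u8);
    addr
}

/// Reads the key stored at `addr`, and returns it along with the
/// address of the first block that follows it.
fn read_key(arena: &[u8], addr: u32) -> (&[u8], u32) {
    let start = addr as usize;
    let key_len = u16::from_le_bytes([arena[start], arena[start + 1]]) as usize;
    let key_start = start + 2;
    let key = &arena[key_start..key_start + key_len];
    (key, (key_start + key_len) as u32)
}
```

<p>A lookup would first compare the 32-bit hash stored in the bucket, and only then follow <code class="language-plaintext highlighter-rouge">addr</code> to compare the key bytes. Since the first block sits right after the key, checking the key and appending to the value touch neighbouring memory.</p>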
Of hosting files in url minifiers2018-07-13T00:00:00+00:00https://fulmicoton.com/posts/urlminifier<h1 id="of-storing-files-in-url-minifiers">Of storing files in url minifiers</h1>
<p>Today I had an epiphany while staring at a very long url.
I thought: “Url minifiers are really nice to store all of this data for free”.
And then it struck me… One can really store 4KB of arbitrary data
with a url minifier system and share it for free.</p>
<h1 id="storing-files-as-a-programming-puzzle">Storing files as a programming puzzle</h1>
<p>So here is a programming puzzle. You are given a service that has two operations:</p>
<ul>
<li><em>PUT</em>: accepts payloads of at most 4KB of data, stores it and returns
a unique key of exactly 16 bytes</li>
<li><em>GET</em>: if you supply a key, you can then fetch your original payload.</li>
</ul>
<p>Given this abstraction, how would you store files of an arbitrary length,
such that random seeks and concurrent downloads are possible?</p>
<h1 id="a-solution">A solution</h1>
<p>We first split the file into 4KB pages and store each of them.</p>
<p>We are then left with a long list of keys pointing to these pages.
Of course sharing the list of keys would be lame. We want to store this list in
the url minifier service as well.</p>
<p>One page can hold 256 urls. We build packs of 256 urls and store them
as we did for the pages.</p>
<p>We can recursively apply the same trick until we are left with a single root key.</p>
<p>Anyone who knows this root key could now download our whole file.</p>
<p>Even better, the urls are forming a tree structure that allows for efficient
random access in the middle of the file.</p>
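<p>To make the random access concrete, here is a small sketch that computes which key to follow at each level of the tree in order to reach a given byte offset. It assumes 4KB pages holding 256 keys of 16 bytes each, and that a file fitting in a single page is addressed directly by its key; <code class="language-plaintext highlighter-rouge">num_index_levels</code> and <code class="language-plaintext highlighter-rouge">path_to_offset</code> are hypothetical names introduced for illustration.</p>

```rust
const KEY_SIZE: u64 = 16;
const PAGE_SIZE: u64 = 4096;
const KEYS_PER_PAGE: u64 = PAGE_SIZE / KEY_SIZE; // 256

/// Number of levels of key pages sitting above the data pages.
fn num_index_levels(file_len: u64) -> u32 {
    let mut num_pages = (file_len + PAGE_SIZE - 1) / PAGE_SIZE;
    let mut levels = 0;
    while num_pages > 1 {
        // Each level packs up to 256 keys of the level below into one page.
        num_pages = (num_pages + KEYS_PER_PAGE - 1) / KEYS_PER_PAGE;
        levels += 1;
    }
    levels
}

/// For each level of key pages, from the root down, the slot of the key to
/// follow in order to reach the data page containing `offset`.
fn path_to_offset(file_len: u64, offset: u64) -> Vec<u64> {
    let levels = num_index_levels(file_len);
    let page = offset / PAGE_SIZE;
    (0..levels)
        .rev()
        .map(|level| (page / KEYS_PER_PAGE.pow(level)) % KEYS_PER_PAGE)
        .collect()
}
```

<p>For a 3MB file, this gives two levels of key pages: reaching byte 2,000,000 means following slot 1 of the root page, then slot 232 of the key page it points to, so a random read costs three GETs in total rather than a scan of the whole file.</p>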
<h1 id="actual-implementation">Actual implementation</h1>
<p>I actually experimented with the idea with a famous url minifying service.
The first version was single threaded and I was downloading at a bit less than 20KB/s.
But when I tried to download 30 pages concurrently, I reached a very decent download
speed of 400KB/s, downloading my 3MB file in a little bit more than 7s.
I did not hit any rate-limiting at any time.</p>
<p>Since I don’t want to cause any trouble to any service, I will not share my script
as is… Instead here is a proof-of-concept version of my script, without
multithreading and with a mock in place of the url minifier implementation.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#!/usr/bin/python
# -*- coding: utf-8 -*-
</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">base64</span>
<span class="n">KEY_SIZE</span> <span class="o">=</span> <span class="mi">1</span> <span class="o"><<</span> <span class="mi">4</span> <span class="c1"># the url size
</span><span class="n">PAGE_SIZE</span> <span class="o">=</span> <span class="mi">1</span> <span class="o"><<</span> <span class="mi">12</span> <span class="c1"># the page size
</span><span class="n">NUM_KEY_PER_PAGE</span> <span class="o">=</span> <span class="n">PAGE_SIZE</span> <span class="o">//</span> <span class="n">KEY_SIZE</span>
<span class="k">class</span> <span class="nc">Minifier</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">put</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
<span class="k">class</span> <span class="nc">MockMinifier</span><span class="p">(</span><span class="n">Minifier</span><span class="p">):</span>
<span class="s">""" This replaces our url shortener service.
Very handy for unit tests."""</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">dictionary</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">def</span> <span class="nf">put</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="n">key</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dictionary</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">dictionary</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        <span class="k">return</span> <span class="s">":"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">key</span><span class="p">).</span><span class="n">zfill</span><span class="p">(</span><span class="n">KEY_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># in the mock we
</span>        <span class="c1"># do not return a url... just a small key.
</span>
    <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">dictionary</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">key</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>

<span class="k">class</span> <span class="nc">CachedMinifier</span><span class="p">(</span><span class="n">Minifier</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">minifier</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cache</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">minifier</span> <span class="o">=</span> <span class="n">minifier</span>

    <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">key</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">cache</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">cache</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">minifier</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">cache</span><span class="p">[</span><span class="n">key</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">chunk</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="n">chunk_len</span><span class="p">):</span>
    <span class="n">num_full_chunks</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span> <span class="o">//</span> <span class="n">chunk_len</span>
    <span class="n">start_chunk</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_full_chunks</span><span class="p">):</span>
        <span class="n">end_chunk</span> <span class="o">=</span> <span class="n">start_chunk</span> <span class="o">+</span> <span class="n">chunk_len</span>
        <span class="k">yield</span> <span class="n">arr</span><span class="p">[</span><span class="n">start_chunk</span><span class="p">:</span><span class="n">end_chunk</span><span class="p">]</span>
        <span class="n">start_chunk</span> <span class="o">=</span> <span class="n">end_chunk</span>
    <span class="k">if</span> <span class="n">start_chunk</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
        <span class="k">yield</span> <span class="n">arr</span><span class="p">[</span><span class="n">start_chunk</span><span class="p">:]</span>

<span class="k">def</span> <span class="nf">upload_aux</span><span class="p">(</span><span class="n">minifier</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
    <span class="n">keys</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="n">chunk</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">):</span>
        <span class="n">page_key</span> <span class="o">=</span> <span class="n">minifier</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">page_key</span><span class="p">)</span> <span class="o">==</span> <span class="n">KEY_SIZE</span>
        <span class="n">keys</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">page_key</span><span class="p">)</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">keys</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">keys</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">upload_aux</span><span class="p">(</span><span class="n">minifier</span><span class="p">,</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">keys</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">upload</span><span class="p">(</span><span class="n">compressor</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
    <span class="n">key</span> <span class="o">=</span> <span class="n">upload_aux</span><span class="p">(</span><span class="n">compressor</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">compressor</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">({</span>
        <span class="s">"root"</span><span class="p">:</span> <span class="n">key</span><span class="p">,</span>
        <span class="s">"len"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="p">}))</span>

<span class="k">def</span> <span class="nf">download</span><span class="p">(</span><span class="n">minifier</span><span class="p">,</span> <span class="n">key</span><span class="p">):</span>
    <span class="n">meta</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">minifier</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">))</span>
    <span class="nb">len</span> <span class="o">=</span> <span class="n">meta</span><span class="p">[</span><span class="s">"len"</span><span class="p">]</span>
    <span class="n">root</span> <span class="o">=</span> <span class="n">meta</span><span class="p">[</span><span class="s">"root"</span><span class="p">]</span>
    <span class="n">cached_minifier</span> <span class="o">=</span> <span class="n">CachedMinifier</span><span class="p">(</span><span class="n">minifier</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">download_aux</span><span class="p">(</span><span class="n">cached_minifier</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="nb">len</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">extract_key</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="n">key_ord</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">page</span><span class="p">[</span><span class="n">key_ord</span> <span class="o">*</span> <span class="n">KEY_SIZE</span><span class="p">:(</span><span class="n">key_ord</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="n">KEY_SIZE</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">get_path</span><span class="p">(</span><span class="n">minifier</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">path</span><span class="p">):</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">minifier</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">path</span><span class="p">:</span>
        <span class="n">head</span><span class="p">,</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">path</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">path</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
        <span class="k">return</span> <span class="n">get_path</span><span class="p">(</span><span class="n">minifier</span><span class="p">,</span> <span class="n">extract_key</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="n">head</span><span class="p">),</span> <span class="n">tail</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">page</span>

<span class="k">def</span> <span class="nf">build_path</span><span class="p">(</span><span class="n">page_id</span><span class="p">,</span> <span class="nb">len</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">len</span> <span class="o"><=</span> <span class="n">PAGE_SIZE</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">[]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">build_path</span><span class="p">(</span><span class="n">page_id</span> <span class="o">//</span> <span class="n">NUM_KEY_PER_PAGE</span><span class="p">,</span> <span class="p">(</span><span class="nb">len</span> <span class="o">+</span> <span class="n">NUM_KEY_PER_PAGE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">//</span> <span class="n">NUM_KEY_PER_PAGE</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="n">page_id</span> <span class="o">%</span> <span class="n">NUM_KEY_PER_PAGE</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">download_aux</span><span class="p">(</span><span class="n">minifier</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="nb">len</span><span class="p">):</span>
    <span class="n">num_pages</span> <span class="o">=</span> <span class="p">(</span><span class="nb">len</span> <span class="o">+</span> <span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">//</span> <span class="n">PAGE_SIZE</span>
    <span class="n">pages</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">page_id</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_pages</span><span class="p">):</span>
        <span class="n">path</span> <span class="o">=</span> <span class="n">build_path</span><span class="p">(</span><span class="n">page_id</span><span class="p">,</span> <span class="nb">len</span><span class="p">)</span>
        <span class="n">pages</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">get_path</span><span class="p">(</span><span class="n">minifier</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">path</span><span class="p">))</span>
    <span class="k">return</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">pages</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="c1"># a small test
</span>    <span class="n">dummy</span> <span class="o">=</span> <span class="n">MockMinifier</span><span class="p">()</span>
    <span class="n">msg</span> <span class="o">=</span> <span class="s">"arbitrary long message..."</span> <span class="o">*</span> <span class="mi">1000000</span>
    <span class="n">key</span> <span class="o">=</span> <span class="n">upload</span><span class="p">(</span><span class="n">dummy</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
    <span class="k">assert</span> <span class="n">download</span><span class="p">(</span><span class="n">dummy</span><span class="p">,</span> <span class="n">key</span><span class="p">)</span> <span class="o">==</span> <span class="n">msg</span></code></pre></figure>
Of using Common Crawl to play Family Feud2018-02-23T00:00:00+00:00https://fulmicoton.com/posts/commoncrawl<h1 id="family-feud-meets-big-data">Family feud meets Big Data</h1>
<p>When I was working at Exalead, I had the chance to play with a search engine indexing 16 billion pages.
During a hackathon, I plugged Exalead’s search engine together with a nifty python package called <a href="https://www.clips.uantwerpen.be/pages/pattern"><code class="language-plaintext highlighter-rouge">pattern</code></a>,
and a word cloud generator.</p>
<p><a href="https://www.clips.uantwerpen.be/pages/pattern"><code class="language-plaintext highlighter-rouge">Pattern</code></a> allows you to define phrase patterns and extract the text matching specific placeholders.
I packaged it with a straightforward GUI and presented the demo as a big-data-driven Family Feud.</p>
<p>To answer a question like, <strong>“Which adjectives are stereotypically associated with French people?”</strong>, one would simply enter</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>French people are <adjective>
</code></pre></div></div>
<p>The app would run the phrase query <code class="language-plaintext highlighter-rouge">"French people are"</code> on the search engine, stream the results to a short python program that would then try and find adjectives coming right after the phrase. The app would then display the results as a word cloud as follows.</p>
<p>I wondered how much it would cost me to try and reproduce this demo nowadays.
Exalead is a company with hundreds of servers to back this search engine. Obviously I’m on a tighter budget.</p>
<p><img src="https://fulmicoton.com/tantivy-logo/tantivy-logo.png" alt="Tantivy" /></p>
<p>I happen to develop a search engine library in Rust called <a href="https://github.com/tantivy-search/tantivy">tantivy</a>.
Indexing common-crawl would be a great way to test it, and a cool way to slap a well-deserved <a href="https://www.youtube.com/watch?v=b2F-DItXtZs">sarcastic webscale label on it</a>.</p>
<p>Well so far, I indexed a bit more than 25% of it, and indexing it entirely should cost me less than $400. Let me explain how I did it. If you are impatient, just scroll down, you’ll be able to see colorful pictures, I promise.</p>
<h1 id="common-crawl">Common Crawl</h1>
<p><a href="http://commoncrawl.org/">Common Crawl</a> is one of my favorite open datasets. It consists of 3.2 billion pages crawled from the
web. Of course, 3 billion is far from exhaustive. The web contains hundreds of trillions of webpages, and most of it is unindexed.</p>
<p>It would be interesting to compare this figure to recent search engines to give us some frame of reference.
Unfortunately Google and Bing are very secretive about the number of web pages they index.</p>
<p>We have some figures about the past:
In 2000, <a href="https://googleblog.blogspot.jp/2008/07/we-knew-web-was-big.html">Google reached its first billion indexed web pages</a>.
In 2012, <a href="https://thenextweb.com/insider/2012/06/27/yandex-expands-global-search-index-from-4-billion-to-tens-of-billions-of-pages/">Yandex (the leading Russian search engine) grew from 4 billion to tens of billions of web pages</a>.</p>
<blockquote>
<p>3 billion indexed pages might have been enough to compete in the global search engine market in 2002.</p>
</blockquote>
<p>Nothing to sneeze at, really.</p>
<p>The Common Crawl website <a href="http://commoncrawl.org/the-data/examples/">lists example projects</a>.
That kind of dataset is typically useful to mine for facts or linguistics. It can be helpful to train a language model, for instance, or to create a list of companies in a specific industry.</p>
<p>As far as I know, all of these projects are batching Common Crawl’s data. Since it sits conveniently on Amazon S3, it is possible to grep through it with EC2 instances for <a href="https://engineeringblog.yelp.com/2015/03/analyzing-the-web-for-the-price-of-a-sandwich.html">the price of a sandwich</a>.</p>
<p>As far as I know, nobody has actually indexed Common Crawl so far.
An open source project called <a href="https://github.com/commonsearch">Common Search</a> had the ambitious plan to make a public search engine out of it using elasticsearch. It seems inactive today unfortunately.
I would assume it lacked financial support to cover server costs. That kind of project would require a bare minimum of 40 relatively high-spec servers.</p>
<h1 id="my-initial-plan-and-back-of-the-envelope-computations">My initial plan and back of the envelope computations</h1>
<p>Since the data is conveniently sitting on <code class="language-plaintext highlighter-rouge">Amazon S3</code> as part of <a href="https://aws.amazon.com/fr/public-datasets/">Amazon’s public dataset program</a>, I naturally first considered indexing everything on <code class="language-plaintext highlighter-rouge">EC2</code>.</p>
<p>Let’s see how much that would have cost.</p>
<p>Since I focus on the documents containing English text, we can bring the 3.2 billion documents down to roughly 2.15 billion.</p>
<p>Common Crawl conveniently distributes so-called WET files that contain the text extracted from the HTML markup of each page.
The data is split into 80,000 WET files of roughly 115MB each, amounting overall to 9TB of gzipped data, and somewhere around 17TB uncompressed.</p>
<p>We can shard our index into 80 shards including 1,000 WET files each.</p>
<p>To reproduce the Family Feud demo, we will need to access the original text of the matched documents. For convenience, Tantivy makes this possible by defining our fields as <em>STORED</em> in our schema.</p>
<p>Tantivy’s docstore compresses the data using LZ4 compression. We typically get a compressed-to-original size ratio of 0.6 on natural language (by which I mean your compressed file is 60% the size of your original data).
The inverted index on the other hand, with positions, takes around 40% of the size of the uncompressed text. We should therefore expect our index, including the stored data, to be roughly equal to 17TB as well.</p>
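<p>As a sanity check, the size estimate above can be written down as a few lines of Python. This is only a sketch of the arithmetic: the constants are the figures quoted in the text, and the function name is made up for the illustration.</p>

```python
# Back-of-the-envelope index size, using the figures quoted in the text.
UNCOMPRESSED_TEXT_TB = 17.0   # uncompressed WET text
DOCSTORE_RATIO = 0.6          # LZ4-compressed stored text / original text
INVERTED_INDEX_RATIO = 0.4    # inverted index with positions / original text


def estimate_index_size_tb(text_tb):
    """Return (docstore, inverted index, total) sizes in TB."""
    docstore_tb = text_tb * DOCSTORE_RATIO
    inverted_tb = text_tb * INVERTED_INDEX_RATIO
    return docstore_tb, inverted_tb, docstore_tb + inverted_tb


docstore, inverted, total = estimate_index_size_tb(UNCOMPRESSED_TEXT_TB)
print("docstore: %.1f TB, inverted index: %.1f TB, total: %.1f TB"
      % (docstore, inverted, total))
```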
<p>Indexing cost should not be an issue. Tantivy is already quite fast at indexing.
Indexing wikipedia (8GB), even with stemming enabled and including stored data, typically takes around 10mn on my recently acquired Dell XPS 13 laptop.
We might want larger segments for Common Crawl, so maybe we should take a large margin and consider that a cheap t2.medium (2 vCPU) instance can index 1GB of text in 3mn?
Our 17TB would require an overall 875 hours to index on instances that cost $0.05 per hour. The problem is extremely easy to distribute over
80 instances, each of them in charge of 1,000 WET files for instance. The whole operation should cost us less than 50 bucks. Not bad…</p>
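<p>Here is the same estimate as a small Python sketch. It assumes 1TB = 1,000GB and reads the $0.05 figure as an hourly instance price, so it lands on 850 hours rather than the rounder 875 quoted above; both are in the same ballpark.</p>

```python
# Rough indexing time and cost on EC2, from the figures in the text.
TEXT_GB = 17 * 1000             # 17TB of uncompressed text (1TB = 1,000GB)
MINUTES_PER_GB = 3              # pessimistic throughput of one instance
PRICE_PER_INSTANCE_HOUR = 0.05  # USD, assumed to be an hourly price
NUM_INSTANCES = 80              # one instance per shard

total_hours = TEXT_GB * MINUTES_PER_GB / 60.0
total_cost = total_hours * PRICE_PER_INSTANCE_HOUR
wall_clock_hours = total_hours / NUM_INSTANCES

print("instance-hours: %.0f" % total_hours)
print("total cost: %.2f USD" % total_cost)
print("wall-clock time with %d instances: %.1f hours"
      % (NUM_INSTANCES, wall_clock_hours))
```

<p>The total cost does not depend on the number of instances. Spreading the work over 80 machines only shortens the wait.</p>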
<p>But where do we store this 17TB index? Should we upload all of these shards to S3? Then, when we eventually want to query it, start many instances, have them download their respective set of shards and start up a search engine instance? That sounds extremely expensive, and would require a very high start up time.</p>
<p>Interestingly, search engines are designed so that an individual query actually requires as little IO as possible.
My initial plan was therefore to leave the index on <code class="language-plaintext highlighter-rouge">Amazon S3</code>, and query the data directly from there. Tantivy abstracts file accesses via a <a href="https://tantivy-search.github.io/tantivy/tantivy/directory/trait.Directory.html"><code class="language-plaintext highlighter-rouge">Directory</code></a> trait. Maybe it would be a good solution to have some kind of <code class="language-plaintext highlighter-rouge">S3</code> directory that downloads specific slices of files while queries are being run?
How would that go?</p>
<p>The default dictionary in <code class="language-plaintext highlighter-rouge">tantivy</code> is based on a finite state transducer implementation: the excellent <code class="language-plaintext highlighter-rouge">fst</code> crate.
This is not ideal here, as accessing a key requires quite a few random accesses. When hitting S3, the cost of random accesses is magnified. We should expect 100ms of latency for each read. The API allows asking for several ranges at once,
but since we have no idea where the subsequent jumps will be, all of these reads will end up being sequential. Looking up a single keyword in our dictionary may end up taking close to a second.
Fortunately tantivy has an undocumented alternative dictionary format that should help us here.</p>
<p>Another problem is that files are accessed via a <a href="https://tantivy-search.github.io/tantivy/tantivy/directory/enum.ReadOnlySource.html"><code class="language-plaintext highlighter-rouge">ReadOnlySource</code></a> struct.
Currently, the only real directory relies on <code class="language-plaintext highlighter-rouge">Mmap</code>, so throughout the code, tantivy relies heavily on the OS paging data for us, and liberally requests huge slices of data. We will therefore also need to go through all lines of code that access data, and only request the amount of data that is needed. Alternatively we could try and hack a solution around
<a href="https://www.gnu.org/software/libsigsegv/">libsigsegv</a>, but really this sounds dangerous, and might not be worth the artistic points.</p>
<p>Well, overall this sounds like quite a bit of work, but it may result in valuable features for tantivy.</p>
<p>Oh by the way, what is the cost of simply storing this data in S3 ?</p>
<p>Well after checking the <a href="https://aws.amazon.com/fr/s3/pricing/">Amazon S3 pricing details</a>, just storing our 17TB data will cost us
around 400 USD per month. Ouch. Call me cheap… I know many people have more expensive hobbies but that’s still too much money for me!</p>
<blockquote>
<p>The most important cost of indexing this on EC2/S3 would have been the storage of the index. Around 400 USD per month.</p>
</blockquote>
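<p>The storage figure is easy to reproduce. In the sketch below, the price per GB per month is an assumption (a typical standard-storage list price) and not a number taken from this post, so check the pricing page for the current value.</p>

```python
# Monthly cost of keeping the index on S3.
INDEX_SIZE_TB = 17
GB_PER_TB = 1000
# Assumption: standard-storage list price in USD per GB per month.
ASSUMED_PRICE_PER_GB_MONTH = 0.023

monthly_cost = INDEX_SIZE_TB * GB_PER_TB * ASSUMED_PRICE_PER_GB_MONTH
print("storage: about %.0f USD per month, %.0f USD per year"
      % (monthly_cost, 12 * monthly_cost))
```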
<p>Back to the drawing board!</p>
<p>By the way, my estimates were not too far from reality.
I did not take into account the WET file headers, which end up being thrown away. Also, some of the documents which passed our English language detector
are multilingual. The tokenizer is configured to discard all tokens that do not contain exclusively characters in <code class="language-plaintext highlighter-rouge">[a-zA-Z0-9]</code>.</p>
<blockquote>
<p>In the end, one shard takes 165 GB, so the overall size of the index would be 13.2 TB.</p>
</blockquote>
<h1 id="indexing-common-crawl-for-less-than-a-dinner-at-a-2-star-michelin-restaurant">Indexing Common Crawl for less than a dinner at a 2-star Michelin Restaurant</h1>
<p>What’s great with back of the envelope computations is that they actually help you reconsider solutions that you unconsciously ruled out by “common sense”.
What about indexing the whole thing on my desktop computer… Downloading the whole thing using my private internet connection. Is this ridiculous?</p>
<p>Think about it, a 4TB hard drive nowadays on Amazon Japan costs around 85 dollars.
I could buy three or four of these and store the index there.
The 8ms-10ms random seek latency will be actually much more comfortable than the S3 solution.
That would cost me around $255, which is around the cost of dinner at a 2-star Michelin restaurant.</p>
<p>What about CPU time and download time ?
Well my internet connection seems to be able to download shards at a relatively stable 3MB/s to 4MB/s.
9TB will probably take 830 hours or 34 days. I can probably wait.
Once again, indexing at this speed is really not a problem.</p>
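<p>The download time follows directly from the bandwidth. A quick sketch, assuming decimal units (1TB = 10<sup>12</sup> bytes, 1MB = 10<sup>6</sup> bytes) and a connection that stays saturated the whole time:</p>

```python
# Time needed to download the gzipped WET files at a steady bandwidth.
DOWNLOAD_BYTES = 9e12  # 9TB of gzipped WET files


def download_days(bytes_per_second):
    """Return the download duration in days for a constant bandwidth."""
    seconds = DOWNLOAD_BYTES / bytes_per_second
    return seconds / 3600.0 / 24.0


for mb_per_s in (3, 4):
    days = download_days(mb_per_s * 1e6)
    print("%d MB/s: %.0f hours, %.1f days" % (mb_per_s, days * 24, days))
```

<p>The 830 hours quoted above correspond to the pessimistic 3MB/s end of the range.</p>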
<p>In fact, my bandwidth is only fast enough to keep two indexing threads busy, leaving me plenty of CPU to watch Netflix and code. On my laptop, 1 thread would probably be ok.
Explicitly limiting the number of threads has the positive side effect of allocating more RAM to each segment being indexed. As a result, new segments produced are larger and less merging work is needed.</p>
<p>So I randomly partitioned the 80,000 WET files into 80 shards of 1,000 files each.
I then started indexing these shards sequentially. For each shard, after having indexed all documents, I force-merge all of the segments into a single very large segment.</p>
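<p>The random partitioning step can be sketched as follows. The file names are placeholders and the fixed seed is my own addition, so that the same partition can be recomputed after a restart. It is not necessarily how the actual indexer does it.</p>

```python
import random


def partition_into_shards(filenames, num_shards, seed=0):
    """Shuffle the file list and deal it into num_shards groups of near-equal size."""
    shuffled = list(filenames)
    # A seeded generator makes the partition reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Shard i gets every num_shards-th file, starting at offset i.
    return [shuffled[i::num_shards] for i in range(num_shards)]


# Placeholder names; the real WET paths come from Common Crawl's file listing.
wet_files = ["wet-%05d.warc.wet.gz" % i for i in range(80000)]
shards = partition_into_shards(wet_files, 80)
print(len(shards), len(shards[0]))
```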
<p>I’m not gonna lie to you. I haven’t indexed Common Crawl entirely yet. I only bought one 4TB hard disk, and indexed 21 shards (26%).
Indexing is on hiatus at this point, because I have been quite busy recently (see the personal news below). Shards are independent: the feasibility of indexing Common Crawl entirely on one machine is proven at this point. Finishing the job is only a matter of throwing time and money at it.</p>
<h1 id="resuming">Resuming</h1>
<p>I recently bought a house in Tokyo, and the power installation was not really suited to our morning routine: dishwasher, heater and kettle together were apparently too much, and our fuses blew half a dozen times.</p>
<p>This was a very nice test for tantivy’s ability to avoid data corruption and resume indexing under a blackout scenario.
In order to make it easier to keep track of the progress of indexing and resume from the right position, tantivy 0.5.0 now makes it possible to embed a small payload with every commit. For Common Crawl, I commit after every 10 WET files. The payload is the last WET filename that got indexed.</p>
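<p>To make the resume logic concrete, here is a toy sketch in Python. It does not use tantivy’s API: <code class="language-plaintext highlighter-rouge">ToyIndex</code> is a hypothetical stand-in that stores the commit payload in a small JSON file, and <code class="language-plaintext highlighter-rouge">index_file</code> is whatever callback actually indexes one WET file.</p>

```python
import json
import os

COMMIT_EVERY = 10  # commit after every 10 WET files, as in the text


class ToyIndex:
    """Hypothetical stand-in for an index writer: a commit records a payload."""

    def __init__(self, path):
        self.path = path

    def last_payload(self):
        """Return the payload of the last commit, or None if there is none."""
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)["payload"]

    def commit(self, payload):
        # Write to a temp file and rename it, so that a blackout never
        # leaves a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"payload": payload}, f)
        os.replace(tmp, self.path)


def index_shard(index, wet_files, index_file):
    """Index wet_files in order, resuming right after the last committed file."""
    last = index.last_payload()
    start = wet_files.index(last) + 1 if last is not None else 0
    for i in range(start, len(wet_files)):
        index_file(wet_files[i])
        # Commit every COMMIT_EVERY files, and once more at the very end.
        if (i + 1) % COMMIT_EVERY == 0 or i == len(wet_files) - 1:
            index.commit(wet_files[i])
```

<p>Files indexed after the last commit are simply indexed again on restart. This is only safe if uncommitted work is discarded when the index is reopened, which this toy version does not model.</p>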
<h1 id="reproducing-it-at-home">Reproducing it at home</h1>
<p>On the off chance indexing Common-Crawl might interest businesses, academics or you,
I made the code I used to download and index common-crawl available <a href="https://github.com/tantivy-search/tantivy-ccrawl">here</a>.</p>
<p>The <code class="language-plaintext highlighter-rouge">README</code> file explains how to install and run the indexer part.
It’s fairly well packaged.</p>
<p>You can then query each shard individually using <a href="https://github.com/tantivy-search/tantivy-cli"><code class="language-plaintext highlighter-rouge">tantivy-cli</code></a>.</p>
<p>For instance, the search command will stream documents matching a given query.
You just need to pass it a shard directory and a query.
Its speed will be limited by your IO, so if you have more than one disc, you can
speed up the results by spreading shards over different discs and querying them in parallel.</p>
<p>For instance, running the following command</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tantivy search -i my_index/shard_01 --query "\"I like\""
</code></pre></div></div>
<p>will output all the documents containing the phrase <code class="language-plaintext highlighter-rouge">"I like"</code>, in a json format, one document per line, in no specific order.</p>
<h1 id="demo-time-">Demo time !</h1>
<p>I wrote a small python script that reproduces the “family feud” demo. The script just outputs the data; the tag clouds are actually created manually on <a href="https://www.wordclouds.com/">wordclouds.com</a>. Here are a few results.</p>
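<p>For readers who want to try something similar, here is a minimal sketch of such a script. It is not the script used for this post: it replaces part-of-speech patterns with a plain regular expression that grabs the next few words, and it assumes each JSON line stores the document body in a field named <code class="language-plaintext highlighter-rouge">text</code>, either as a string or as a list of strings.</p>

```python
import json
import re
from collections import Counter


def answers(json_lines, phrase, num_words=2, field="text"):
    """Count the (at most num_words) words that follow phrase in each document."""
    # Capture between one and num_words alphabetic words right after the phrase.
    pattern = re.compile(
        re.escape(phrase)
        + r"\s+((?:[A-Za-z]+\s+){0,%d}[A-Za-z]+)" % (num_words - 1),
        re.IGNORECASE,
    )
    counts = Counter()
    for line in json_lines:
        doc = json.loads(line)
        values = doc.get(field, [])
        # Assumption: a field holds either one string or a list of strings.
        if isinstance(values, str):
            values = [values]
        for text in values:
            for match in pattern.finditer(text):
                # Normalize case and whitespace before counting.
                counts[" ".join(match.group(1).lower().split())] += 1
    return counts


# Tiny hand-written sample standing in for the search output.
sample = [
    '{"text": ["RN stands for Registered Nurse."]}',
    '{"text": ["In this state RN stands for registered nurse, not LVN."]}',
    '{"text": ["Here RN stands for radon."]}',
]
print(answers(sample, "RN stands for").most_common(2))
```

<p>To run it on a real shard, the same function can be fed the lines printed by the search command, for example by passing <code class="language-plaintext highlighter-rouge">sys.stdin</code> instead of the sample list.</p>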
<h2 id="the-useful-stuff">The useful stuff</h2>
<p>First, we can use this to understand stereotypes.</p>
<p>At Indeed, I had to work a lot with domain-specific vocabulary.
A jobseeker might search for an <code class="language-plaintext highlighter-rouge">RN</code> or an <code class="language-plaintext highlighter-rouge">LVN</code> job for instance.
These acronyms were very obscure for me and most other non-native speakers.</p>
<p>If I search for <code class="language-plaintext highlighter-rouge">RN stands for</code>, I get the following results</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>registered nurse
retired nuisance
staff badges
registered nurses
the series code
registered nurse
removable neck
radon
rn(i
the input vector
resort network
certified nurses
registered nut
registered nurse
registered nurse
registered identification number
registered nurse
registered nurses
</code></pre></div></div>
<p>I got my answer: users searching for RN meant <em>registered nurse</em>.
For LVN, the results are similar :</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>license occupation nurse
registered nurse
license occupation nurse
license occupation nurse
sorry
license occupation nurse
registered nurse
the cause
license occupation nurse
licensed vocational nurse
licensed vocational nursing
license occupation nurse
license occupation nurse
license occupation nurse
</code></pre></div></div>
<p>LVN stands for <em>licensed vocational nurse</em>.</p>
<h1 id="boostrapping-dictionaries">Bootstrapping dictionaries</h1>
<p>It’s often handy for prototyping to rapidly bootstrap a dictionary. For instance,
I might quickly need a list of <em>job titles</em>. A fairly non-ambiguous pattern would be</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I work as a <noun phrase>
</code></pre></div></div>
<p>If I run this pattern on my index, I get these 5,000 unique job titles</p>
<textarea style="height: 400px; overflow-y: visible">
model
tutor
nurse
teacher
physical therapist
freelancer
writer
graphic designer
consultant
registered nurse
designer
post-doc fellow
solo
movie location scout
freelance writer
software engineer
teaching assistant
photographer
librarian
nutritional consultant
paralegal
lawyer
manager
journalist
receptionist
translator
team
programmer
financial consultant
waitress
freelance reporter
cloud engineer
substitute teacher
freelance illustrator
freelance designer
scholar
computer programmer
web developer
software developer
pastor
background actor
freelance editor
freelance journalist
social worker
volunteer
secretary
scientist
school teacher
baker
barista
professional photographer
full time
nanny
concept artist
geology technician
cook
copywriter
researcher
cashier
virtual assistant
public health nurse
dance teacher
professional ski
full time tutor
cna
credit education consultant i
chef
copy editor
systems analyst
contractor
java developer
medical assistant
freelance photographer
caregiver
program manager
private mathematics
geologist
tour guide
school administrator
hospice chaplain
counselor
therapist
tech support
project manager
high school librarian
hr
freelance artist
bookkeeper
ranger
people development specialist
communication consultant
doctor
director
blogger
flight attendant
bartender
lunch lady
library assistant
stylist
captain
lecturer
freelance translator
freelance interior designer
freelance digital marketer
freelance author
visual artist
preschool teacher
freelance fashion stylist
nurse aide
graphic artist
chemo nurse
safety specialist
senior consultant
professional ilustrator
developer
train conduktor
counter and rental clerk
retail pharmacist
personal trainer
technical writer
senior python engineer
communications specialist
waiter
labor
psychologist
chemist
tax preparer
product designer
human resource specialist
finance manager
counsellor
computer systems analyst
sub editor
videographer
full time freelancer
career
university lecturer
mechanic
university professor
paraeducator
hospital chaplain
food/wine/travel photographer
server
professor
paramedic
newspaper distributor
freelance copyeditor
writer/editor
quality manager
program assistant
guide
research
model stitcher
trainer
coach
tech
solution architect
peer support specialist
pediatric oncology nurse
human resources representative
feature writer
fashion designer
psychotherapist
freelance interior
freelance graphic designer
correctional officer
marketing coordinator
supervisor
freelance web-developer
financial analyst
finance officer
critical care nurse
technician
team lead
teacher aide
mortgage broker
library technician
front-end web developer
creative designer
high school math
freelance information visualizer
support escalation engineer
student assistance program specialist
reporter
freelance professional sculptor
fitness trainer
business analyst
systems developer
producer
part time
management assistant
gardener
fireman
web designer
publisher
neighbourhood advocate
lifeguard
fundraiser/consultant
film technician
doula
consulting engineer
carpenter
bookseller
senior library assistant
security guard
sales rep
real estate agent
marketing assistant
tech arch manager
research assistant
nutritionist
medical laboratory assistant
medical doctor
massage therapist
handyman
critical care nurses
zoological field assistant
visual merchandiser
support worker
health
freelance trainer
dental hygienist
storyboard artist
sound engineer
registered architect
principal consultant
management consultant
gymnastics coach
full time actor
director general
database administrator
cartoonist
wildlife technician
web content
teller
technical illustrator
sccm engineer
rn
police officer
freelance digital marketing consultant
business advisor
biologist
video editor
team-lead
school
sales
research associate
pizza driver
lineman
dental assistant
customer service representative
school nurse
recruiter
photojournalist
medium
freelance sound technician
children
spanish interpreter
physician assistant
personal assistant
paginator
makeup artist
hairdresser
freelance theatre technician
dental asst
teachers aide
sports
retail assistant
rehab instructor
primary school teacher
pharmacist
pediatric nurse practitioner
medical transcriptionist
marketing communications manager
full-time writer
csr
chaplain
web admin
tv
survey taker
speech-language pathologist
software test engineer
social media manager
sleep educator
senior lecturer
sauce maker
publicist
product director
part time care assistant
ophthalmic technician
medical illustrator
marine biologist
illustrator
housing
customer service rep
campaign assistant
business consultant
webdesigner
system engineer
remedial paraprofessional
psychic , reiki master
professional statistician
product manager
part time housing counselor
mechanical engineer
manger
humanitarian worker
horticulturist
dermatologist
cosmetologist
civil servant
tv director
sales manager
mentor
housekeeper
freelance project manager
freelance copywriter
firefighter
dentist
writing mentor
web coder guy
vet tech
unit secretary
system analyst
system administrator
sysadmin
sr
reference librarian
reading
private chef
part time tour guide
painter
freelance ebook writer
fin
debt counselor
travel consultant
student assistant
spiritual counselor
projectionist
professional content writer
product reviewer
principal road designer
partner
part time decorator
network engineer
infrastructure engineer
high school teacher
gp
freelance contractor
enterprise solutions architect
dog trainer
digital marketer
development manager
creative head
controller
civil engineer
case manager
barista i
accounting
technology
surveyor
special education assistant
secondary teacher
school counselor
research scientist
postdoc
mother
lifestyle photographer
legal secretary
kindergartenteacher
java developer/architect
front-end developer
freelance digital business coach
computer technician
client support manager
clerk
chief science officer
6th grade teacher
women
technical director
ta
systems engineer
senior software engineer
senior designer
senior concept artist
professional artist
physiotherapist
pharmacy technician
person
part-time events safety steward
paramedic and i love gaming
nurse practitioner
naturalist
museum teacher i stockholm
linux system engineer
graphic designer/web designer
freelance virtual assistant
digital designer
design engineer
design director
dental nurse
buyer
business development manager
zoologist
web author
video games programmer
telecommunications engineer
stripper
staff accountant
software architect
senior research engineer
research horticulturist
property manager
professional cook
pastry chef
part time water garden consultant
part time art teacher
math-teacher
marketing manager
janitor
heavy duty truck mechanic
freelance seo
fishing guide
field
dog groomer
digital strategy consultant
digital marketing consultant
construction worker
carer
cake decorator
book slave
beautician
a radiographer
911 dispatcher/telecommunicator
writing tutor
veterinarian
title-one tutor
technical architect
systems administrator
street fundraiser
staff nurse
software consultant
software
self
quality control inspector
production assistant
private tutor
nursing assistant
musical director
music teacher
multimedia journalist
math teacher
management level systems architect
make-up artist
library technology trainer
legal assistant
hospital pharmacist
harmony animator
hairstylist
general contractor
freelance harpist
florist
fisherman
esthetician
email responder
driver
dj
dispatcher
data analyst
cto
credit manager
countryside ranger
costume-designer
corrections officer
communication officer
clinical liaison
veterinary assistant
travel agent
telecom/network architect
substitute
student
specialist
scripter
science writer
school psychologist
school librarian
photo model
pharmacy tech
pa
nurse part
millwright
medical secretary
medical receptionist
mathematical statistician
hostess
homemaker
full time artist
freelance makeup artist
freelance court reporter
financial advisor
feng shui consultant
cycling coach
creative director
contact centre manager full-time
community manager
clinical psychologist
cardiologist
wedding coordinator
website designer
visual designer
user experience designer
train driver
tester
temp
support engineer
spin
special education teacher
somm
software engineering manager
software development consultant
senior software developer
senior design researcher
senior customer service manager
senior citrix administrator
senior backend developer
seismic tester
secondary school teacher
salesman
ruby developer
professional writer
professional actor
principal lecturer
planning manager
physician
php developer
pediatric icu doctor
nail technician
monitor
member
live
leader
intensive care nurse
hair stylist
full-time dental assistant
freelance web developer
freelance make-up artist
freelance costume maker
freelance climate
freelance business analyst
engineer
clown
character designer
butcher
beauty therapist
babysitter
youth
yoga instructor
veterinary technician
veterinary nurse
ux designer
unix developer
trader
team leader
teacher assistant
story artist
steel erector/scaffolder
staff writer
ski instructor
sharepoint
sexuality
senior manager
security officer
screenwriter
science teacher
rural mail carrier associate
public servant
public defender
professional visual effects artist
product manager/business analyst
planning analyst
pilot
network administrator
musician
motivational speaker
medical examiner
medical biller
mediator
media designer
maid
machinist
lumberjack
kindergarten teacher
home appraiser
hockey player
headhunter
graphics designer
game director
game designer
fundraiser
full-time cta
freelance musician
financial risk manager
field engineer
field biologist
fashion
family doctor
facilitator
design manager
deputy sheriff
delivery driver
dancer
cost accountant
cop
content writer
character artist
camp counselor
bouncer
biology lab technician
writer/pr guy
web dev
venture capitalist
truck driver
traffic controller
town clerk
technology consultant
speech therapist
social media strategist
security
seamstress
sales representative
registrar
realtor
ra
psychiatrist
proofreader
professional tutor
professional sideman
professional ecologist
principal architect
pastry
part-time pet stylist
panoramic photographer
nursery nurse
newspaper reporter
network
midwife
middle
marketing representative
maintenance tech
lutheran hospital priest
locum emergency physician
language teacher
interior designer
hotel butler
health coach
gyno doctor
guitar teacher
general secretary
general osteopath
full time illustrator
french teacher
floral designer
film location scout
farmhand
farmer
faculty
dog sitter
docent
digital nomad
content manager
construction engineer
conservationist
clinical supervisor
cleaner
carpenter/craftsman
bookkeeper/office manager
bank manager
3d environment artist
welder
webmaster
web
trauma
telemarketer
tefl teacher
technology lead
teacher ’
tattoo artist
tail guide
systems architect
surgical nurse
supply teacher
sub-editor
sonographer
solicitor
software tester
service tech
senior level software engineer
security consultant
sales analyst
retail manager
prostitute
professional developer
product photographer
product marketing manager
process engineer
postdoctoral researcher
post-doc
personal chef
pediatric icu nurse
pca house keeper
pca
payroll specialist
music producer
mom
mobile specialist
ministry assistant
medic
mechanical designer
marketing specialist
marketing consultant
marketing
marine ecologist
manufacturing engineer
line service tech
lead
layout/pre-press/graphic artist
lab assistant
kind
guidance counselor
guard
graphics artist
fulltime nanny
full-time freelancer
full time mechanic
freelance web designer
freelance graphic artist
foreign exchange trader
fitness instructor
film director
fashion consultant
dropout prevention counselor
digital intern
diesel mechanic
dialysis nurse
data entry operator
custodian
cpa
courier
coordinator
content marketer
content editor
construction electrician
concierge
computer tech
computer scientist
computer consultant
compliance analyst
cocktail bartender
cloud solutions architect
clinician
cinematographer
chinese-english interpreter/translator
child
charge nurse
chalet chef
certified nursing assistant
call center agent
weekend package
wedding photographer
website developer
web programmer
web editor
wardrobe stylist/assistant
voice actor
virtual infrastructure administrator
video news photographer
typist
tv sports anchor
truck-driver
test engineer
techsupport
technology facilitator
technical consultant
sys admin
sub
strategist
store manager
special assistant
sound technician
solutions architect
soloist
sole trader
software solutions architect
singer-songwriter
short order cook/clerk/waitress/janitor/whatever
shop assistant
senior portrait photographer
second grade teacher
scribe
screen printer
scientific and regulatory specialist
sales consultant
robotics engineer
rn parttime
respiratory therapist
resident assistant
research analyst
rental manager
recruitment officer
recruitment consultant
radio host
quantum physics researcher
public relation officer
proofreader/copyeditor
project leader
project coordinator
professional dj
professional software developer
private investigator
private eye
private english tutor
principal
pr
post-doctoral researcher
phd student
part time writer
part time software developer
part
paraprofessional
para
nurses aid
night guard
nanny part time
middle school teacher
meteorologist
mestiza cook
merchandiser
member services analyst
medical supervisor
marketer
man nurse
maintenance worker
life guard
life coach
library clerk
library aide
lexicographic editor
leadership
lash stylist
laboratory technician
knowledge management specialist
ios developer
hot tar roofer
high school
healer
head conservator
government contractor
gm marketing
glass technician
ghostwriter
geomatic technician
freelance new media producer
freelance health
free lance artist painting
flash animator
filmmaker
field instructor
district nurse
digital marketing manager
dice dealer
devops engineer
decorative
data scientist
data developer
curator
ct/mri technologist
creative writing teacher
copy writer
community
business
busboy
builder
bricklayer
bike mechanic
baler operator
asst
yoga teacher
weekend chef
webprogrammer
voice
vlsi chip designer
verification engineer
tv editor
trumpet teacher
trade publication editor
ticket agent
technology manager
teaching artist
taxi driver
system architect
subcontractor
student ambassador
store clerk
steward
statistician
sports event coordinator
speech pathologist
special projects advisor
spanish teacher
songwriter
software dev
social worker aid
slave
senior programmer
senior marketing executive
senior editor
senior developer
security systems
security specialist
sculptor
salesperson
sales account manager
robotics software
resource teacher
research coordinator
representative
rep
relief teacher
rehabilitation specialist
real estate paralegal
quality technician
quality engineer
qualified english teacher
public school teacher
professional translator
professional model
professional commercial photographer
production
private security specialist
private agronomist
primary teacher
postdoctoral fellow
portrait
plumber
play-by-play announcer
pilot biologist
personal shopper
personal banker
pension benefits analyst
pediatrician
pediatric nurse
payroll clerk
part-time wedding videographer
parking enforcement officer
p.
nurse ’
news production assistant
news anchor
network technician
motion designer
minister
middle school math teacher
meter reader
messenger
mental health therapist
medical office manager
medical language specialist
meat cutter
mathematics teacher
masseuse
marketing director
magician
machine
locksmith
live bookie
liquor
line cook
library page
landscape designer
laboratory engineer
insurance salesman
hypnotherapist
hotel bartender
host
hospice nurse
home security specialist
home health aide
home
healthcare giver
health researcher
hardware engineer
hall director
hair dresser
grant writer
graduate assistant
glass artist
ghost
general practitioner
freelance videographer
freelance television camera operator
freelance designer/illustrator
freelance content writer
freelance content creator
freelance consultant
freelance composer
free-lance architect/designer
free lancer
framer
food photographer
fluid dynamics engineer
fish biologist
fire investigator
fire fighter
file clerk/archivist
figure model
field technician
fashion stylist
dog trainer's assistant
dishwasher
digital illustrator
development officer
dealer
dba
data engineer
customer rep
cpo
corporate travel counsellor
corporate trainer
cookbook publicist
content lead
community support worker
community nurse
communications
commercial diver
college professor
clinical research nurse
cleaning lady
child abuse prevention educator
chief
certified school counselor
cashier/grocery/dairy/and deli clerk
career counselor
care aide
cancer co-ordinator
cable television technician
c
business coach
broker
bridal consultant
blast investigation expert
bike courier
barmaid
banker
bank teller
background painter
youth worker
youth librarian
yields researcher
writer/reporter
writer cake decorator
wire man
web-programmer
web writer
web strategy consultant
web content manager
web application engineer
waitress part time
video producer
video
vet
vendor manager
unit manager
tv journalist
travel
transportation planner
transport officer
transaction broker
training consultant
trainee
tpi fitness
tourleader
tourist guide
third party language interpreter
theoretical physicist
textile designer
testing designer
technical trainer
technical support engineer
tech support agent
tech support/customer
tech lead
tax analyst
system
sw engineer
surgical technologist
surgeon
support technician
structural engineer
strength
stem ambassador
standard user
stagehand
staff
spanish translator
spanish language instructor
sound designer
software development manager
shopmanager
shopkeeper
shift supervisor
shadow
service technician
service designer
service
server software engineer
servant
senior systems administrator
senior payroll administrator
senior digital strategist
senior analyst
seller
self-development consultant
select role
secondary drama teacher
scuba instructor
salesgirl
sales man
sales executive
sales clerk
sales agent
safety officer
roast master
roadside tech
river guide
reviewer
retoucher
research physicist
research nurse
research engineer
research compliance officer
research chemist
reporter—i set
repair technician
relief worker
regular interviewer
recreation therapist
reader
rape crisis counselor
radiographer
r&d engineer
public speaker
public relations
public librarian
pt
psychic medium
psychiatric nurse
psych nurse
promo director
project engineer
programmer part-time
program coordinator
professional musician
professional marketing specialist
professional illustrator
professional graphic designer
professional genealogist
professional book designer
production manager
production designer
production artist
process manager
probation officer
private teacher
private medical secretary
presenter
practice manager
pra
pr consultant
port-a-john mopper
pizza delivery man
pi
physics teacher
physical therapy tech
physical therapy assistant
phlebotomist
personal fitness trainer
personal finance coach
personal coach
pct
part timer
part time transcriber
part time teacher
part time sports photographer
part time nanny
part time adjunct professor
park ranger
office asst
novelist
news reporter
news editor
network analyst
nail tech
multimedia designer
mortgage planner
moderator
mobile massage therapist
mobile and desktop+raspberry pi software developer
minion
microsoft sharepoint sme
microsoft crm consultant
mental health counselor
medical writer
medical transcriber
medical laboratory technologist
mechanical design engineer
mechanical and electrical designer
master
marketing/public relations consultant
marketing executive
manicure master
manager we
madwomen
machine operator
lifeguard/swim instructor
life celebrant
licensed massage therapist
liaison
lead pastor
lead developer
language assistant
land surveyor
laboratory manager
kintsugi artist
jr
journalist my wife
jewelry designer
jeweller
incident
home tutor
home health nurse
home business entrepreneur
heavy equipment operator
heavy equipment mechanic
health care assistant
head
hands-on healer
hand engraver
greeter
graphic designer/illustrator
graduate research assistant
graduate engineer
graduate
grade
government contract administrator
global librarian
german teacher
geo
general manager
games artist
game programmer
game developer
game artist
fund-raising consultant
full-time nanny
full-time copywriter
full time surgeon
full time paramedic
full time nurse
full time nanny
full time engineer
front end developer
freelancer designer
freelance story artist
freelance social media manager
freelance network analyst
freelance model
freelance interpreter
freelance examiner
freelance director
freelance developer
freelance art pa/assistant
freelance animator
forensic dna analyst
foreman
fitness model
fish
fine artist
financial manager
finance
film editor
filing clerk
federal contractor
farm labourer
family aide
dynamics
drudge
drafter
domestic help
dog walker
distributor
direct support worker
development engineer
designer/seamstress
design consultant
design
dental technician
decorator
dean
day laborer
data specialist
data entry
data analytics person
customer service agent
customer service
critical thinking
credit analyst
creative recruiter
creative art photography support assistant
correspondent
corporate lawyer
content developer
consultant cardiologist
construction manager
construction engineering inspector
computer geek
compliance officer
communications assistant
communication specialist
communication manager
comic artist
code wrangler
cocktail waitress
co-director
cnc machinist
closer/funder
clinical social worker
clinical pharmacist
clinical manager
chiropractor
childminder
chief financial executive
chemical engineer
cg animator
care assistant
campus recruiter
campaigner
campaign
cameraman
cabinetmaker
c.
business manager
building engineer
broadcast engineer
brand strategist
bounser
book editor
biochemist
bicycle mechanic
behavior therapist
beauty consultant
barrister
bank clerk
band
bacp
backend developer
automotive technician
automotive designer
auto tech
assistant
animator
administrator
accountant
911 operator
3d engine/tools/etc programmer
3d artist
youth pastor
youth minister
youth councelor
writing coach
wraparound
worship leader
workforce management planner
work
wordpress specialist
wine sales consultant
windows/linux specialist
wildlife biologist
wildland firefighter
wilderness guide
whole
welder/ fabricator
webdeveloper
webcam model
web marketer
web content editor
web analyst
wastewater treatment plant operator
ward sister
vj
visual facilitator
visual effects
vfx
vet assistant
venture
vendor
vegetarian chef
valet
va
ux/ui designer
ux contractor
ux consultant
user experience specialist
unix system/web development specialist
unit clerk
unit
union stills photographer
union carpenter
ub-cit consultant
two-person team
tv-producer
tv news reporter
turbine engineer
treavle
treadmill coach
travel nurse
trasher
transparent partner
translator/interpreter
translation specialist
translation assistant
translator italian
training coordinator
traditional illustrator/creator
tradesman
tow truck driver
tour leader
tour guide i
tool
toddler teacher
ticket office clerk
textbook editor
territory sales manager
television editor
telemetry nurse
telecom
technology analyst
technical/professional writer
technical translator
technical support representative
technical lead
technical evangelist
technical editor
technical artist
technical analyst
tech writer
team player
team member
team coordinator
teacher-librarian
taxonomist
talent agent
t.
systems manager
systems integrator
system integrator
sysadmin/technician
sydney
swimsuit model
surgery manager
support chef
support
supplier
superintendent
successful model
subtitler
sub contractor
stylist it
student worker
student supervisor
student counselor
structural drafter
stringer
strategic procurement specialist
store-man
stocker
stand
stage hand
staffing specialist
staff member
staff attorney
spiritual advisor
speechwriter
special needs
special education secretary
special education paraprofessional
space pirate
sous chef
solution center consultant
solution architect/specialist
solo response
solo pianist/vocalist
solo librarian
soldier
software analyst
social studies teacher
social studies specialist
social researcher
social media specialist
social media coordinator
social media consultant
social media
snowboard instructor
small team
ski patroller
site manager
site
singing teacher
shop-assistant
sheet metal worker
sharing economy lifestyle blogger
seta contractor
set decorator
service supervisor
service provider
service management consultant
service engineer
service advisor
serious injury consultant
senior web developer
senior technician
senior scientist
senior researcher
senior habilitation provider
senior director
senior digital designer
senior concept designer
senior art director
senior analyst programmer
senior accountant
semi-professional musician
self-employed graphic designer
self-employed artist
security supervisor
security analyst
secretary d.
secretary
script reader
scientific researcher
science tutor
science technician
schoolteacher
school secretary
school bus driver
salon manager
salesmanager
sales floor guy
sales administrator
salary employee
salaried w2 employee
sailing instructor
safety engineer
safety
rural health educator
roofer
risk manager
revenue inspector
restorative aide
restaurant server
resource
resort receptionist
resident
reservation
research it specialist
research fellow
research doctor
replacement work
relational and existential therapist
registered veterinary nurse
registered nurse bsn
registered dental assistant
reference
referee
recreation
record producer
reactor operator
radiologist
quality controller
quality control point
quality control
quality assurance engineer
purrsonal assistant
purchasing co-ordinator
public engagement officer
public affairs officer
pt church secretary
pt aide
psw
pse
protestant minister
prosecutor
promoter
project supervisor
project consultant
project co-ordinator
project administrator
program director
professional video editor
professional urdu translator
professional story artist
professional magician
professional graphic designer/illustrator
professional freelance writer
professional editor
professional coach
professional chef
professional archaeologist
professional and academic coach
professional actress
professional 3d artist
production accountant
product development engineer
product developer
private oboe instructor
private independent physiotherapist
private home school teacher
print designer
principal member
primry teacher
primary
president
preschool teacher ’
premier field engineer
pre-k teacher
practice development lawyer
practical theologian
pr manager
potter
post-doc researcher
portrait photographer
porter
policy advisor
policeman
police
plm consultant
planner
piano teacher
physical trainer
physical therapist assistant
photographic and visual historian
photographer/videographer
photo editor
phone sex operator
performance artist
peer tutor
peer mentor
peer educator
pediatric oncology rn
pedagogic worker
pc technician
pc tech
payroll manager
patternmaker
patient care assistant
patient advocate
pastry artist
parts clerk
part-time waitress
part-time tour guide
part-time security guy
part-time math tutor
part-time maid
part-time interpreter
part-time baby
part time pro
part time phlebotomist
part time job
part time event planner
part-time clerk
pair
paid surveyor
paid consultant
p/t
one-man data
nursing aid
nursery tech
nurse tech
nurse specialist
nurse case manager
nurse assistant
nuclear pharmacy technician
nuclear equipment operator
notary
normal writer
nightclub dj
night-time taxi driver
night security guard
night auditor
newspaper columnist
new registered nurse
neonatal nurse
n.
musicotherapist
music video director
music minister
music director
museum attendant
movie theater manager
mover
motion picture camera assistant
motion graphics designer
mobile¹ mobile²
mobile notary public
mobile developer
missionary and international aid worker
miscarriage
minicab driver
milliner
military psychologist
military contractor
microbiologist
mental health peer specialist
mental health
medical nurse
medical laboratory technician
medical interpreter
medical editor
medical affairs
media supervisor
maths coach
math tutor
martial artist
marriage
marketing intern
marketing guy
marketing associate
marketing analyst
management representative
management accountant
male nurse
magazine editor
lot
long term volunteer
logo designer
logistics manager
logistics tech
locum
local realtor
local newspaper photographer
loan officer
live engineer
literacy coach
linux system administrator
linux server systems administrator
line editor
lighting technician
light technician
lieutenant
learning specialist
lead archives technician
language editor
landscaper
landscape
laboratory assistant
lab technician
lab monitor
lab manager
lab instructor
lab
kitchen porter
kitchen assistant
kindergarden teacher
kids
kennel assistant
kayak guide
junior executive
jungian analyst
judge
jazz player
japanese translator
jailer
jack
itt trainer
it analyst
individual
humorist
human resources assistant
housing advice officer
housemanager
hospitalist
hospital
horse caretaker
horse
homeopath
home health aid
home care aide
holistic therapist
hlta
history teacher
high
helper
healthcare attorney
health physicist
health expert
health educator
health benefits specialist
head hunter
handy man
hair
guest teacher
guest blogger
graphic ui designer
graduate student
government contractor it
global marketing manager
gis analyst
general practice manager
games journalist
game purchaser
gallery technician
ga
full-time realtor
full-time photographer
full-time english
full-time digital marketer
full-time artist
full timer
full time teacher
full time software engineer
full time school librarian
full time realtor
full time microstock photographer
full time lecturer
full time firefighter
full time employee
full time chef
full time artist/tattooer
full time analyst
full stack developer
ft pro
front desk rep
freelancer videomaker
freelancer spending my
freelance webdesigner
freelance teaching artist
freelance social media expert
freelance rental agent
freelance programmer
freelance print
freelance personal trainer
freelance personal stylist
freelance oboist
freelance layout sub-editor
freelance front-end developer
freelance exhibition stand design
freelance engineer
freelance employee
freelance database administrator
freelance casting agent
freelance bookkeeper
freelance blogger
free-lance pianist
forklift driver
forestry advisor
forester
floor painter
fitter
fish wholesaler
financial services expert
financial representative
financial planner
financial adviser
finance contractor
finance analyst
final expense agent
film
feldenkrais teacher
federal field archaeologist
federal employee
fashion model
family therapist
family lawyer
family law attorney
family
faculty member
fabricator
executive assistant
electrician
economist
dvd qc technician
dumptruck driver
drug
drilling technologist
drilling engineer
draftsman
dp
doorman
door knocker
dominatrix
dog bather
documentary film editor
dive instructor
disability transportation specialist
director/ designer
diplomat
digital marketing strategist
digital marketer executive
digital marketeer
dietary aide
diesel technician
developer i
developer advocate
desktop publisher
deputy
department manager
delphi dev
defense contractor
debt collector
data entry clerk
dancer/bodywork practitioner
cybersecurity engineer
customer services manager
customer service manager
customer
custom picture framer
crisis counsellor
criminal lawyer
crime scene photographer
creator
creative writer
creative consultant
couple
countryside warden
counseling psychologist
costume designer
corporate wellness consultant
corporate recruiter
cook-server-bartender
convenience store clerk
contributor
contracts administrator
contract fire fighter
content strategist
content provider
content assistant
contemporary dancer
consultant musculoskeletal radiologist
consultant ecologist
construction inspector
conductor
concert producer
concept designer
computer repair tech
computer operator
computer lab
computer installation technician
computer
composer
company secretary
companionship facilitator
communicator
communications planner
communications analyst
commercial photographer
commercial interior designer
columnist
college recruiter
cognitive behavioural therapist
coder
cns
cm
club promoter part-time
clinical laboratory scientist
classroom assistant
clapham
clairvoyant medium
church
choreographer
chinese teacher
childcare provider
chief technician
chief marketing officer
chief financial officer
chief engineer
chha
chartered accountant
channel
certified nursing aid
certified deaf interpreter
central supply
caterer
cataloger
casual laborer
casting director
carryout
caricaturist
caricature artist
caretaker
career firefighter
career councilor
career advisor
care worker
care provider
care giver
candidate/political opponent i
cad/cam designer
buyer ’
business intelligence
business developer
bus driver
building operator
broadcast journalist
brick layer
brian c.
bottle water
border patrol agent
bookings coordinator
book illustrator
book designer
bond enforcement agent
boily
body worker
bitcoin miner
birthday party coordinator
birth doula
birth
bilingual e/j tech translator
bike tech
behavioural therapist
behavior/habilitative interventionist working
bathroom fitter
bathroom attendant
bartender/server
barman
barista full time tuesday
bar tender
bank cashier
ballet trainee teacher
background artist
back-end developer
back office
baby sitter
audio technician
assistant professor
applied researcher
agent
academic tutor
academic superstar
911 telecommunicator
3d generalist
lunch lady
zookeeper
youtuber
youth educator
youth director
youth counselor
youth advisor
young people
young business owner
yoga therapist
year
yardmaster
ya librarian
xylitol educator
xxxx xxxx
xxxx
xx teacher
writing specialist
writing fellow
writing coach/tutor
writing coach teacher
writing center tutor
writer/producer
writer/photographer
writer/director/producer/editor/animator
write
worship
worldview consultant
workman
workamper
wordpress consultant
worcester county divorce
woodworker my passion
wireline engineer
wireless engineer
wipe
winemaker
wine tour guide
wine journalist
wine
windsurfing instructor
windows systems administrator
window dresser
window
wildlife tour guide
wildlife cameraman
wildlife artist
wide area comms/internal it/desktop support tech
wholesaler
whisky guide
wellness coach
welder/blacksmith
weekday team member
wedding stationer
wedding planner
wedding
webtoon artist
website developer/designer
website coordinator
webdev
web-developer
web worker
web ui designer
web staffer
web publisher
web producer
web pedagog
web marketing specialist
web marketing analyst
web marketing advisor
web marketing
web developer/programmer
web developer/designer
web designer/programmer
web designer/developer
web designer she
web architect
web applications developer
web application specialist
web application programmer
web application developer
web application
web analytics program manager
weaver
way
watershed practitioner
waterfowl technician
water microbiologist
watchman
washing machine repair man
warehouse supervisor
warehouse foreman
wardrobe stylist
wakeboarding coach
waitress/bartender
waitress
waiter part time
waiter dekat satu cafe area kursk
wage slave
voting member
volunteer web designer
volunteer tutor
volunteer soccer/basketball/softball coach
volunteer shelver
volunteer school board attorney
volunteer reader
volunteer pa
volunteer museum guide
volunteer helper
volunteer guide
volunteer firefighter
volunteer editor
volunteer counselor
volunteer counsellor
volunteer chaplain
volunteer ambassador
volunteer adult literacy tutor
voluntary worker
voluntary coach
voluntary art tutor
volkswagen tech
voice-over artiste
voice teacher
voice coach
vodafone dealer
vocational rehabilitation specialist
vocal teacher
vocal coach
visualising
visual field technician
visual effects artist
visual development
visitor experience associate
virtualization consultant
virtual teacher
virtual hr support assistant
virtual cfo
violin teacher
vintage hair
vintage bus driver
village trustee part time
vigilante group member
vietnamese tutor
videogames programming teacher
video game store
video game producer
video content producer
video blogger
victim witness
vicar
vfx supervisor
veteran
vehicle transporter
vehicle mechanic
vehicle emissions tester
vegetation ecologist
vegan chef
vector
vb
van driver
uxui designer
ux/ui mobile designer
ux lead
utility meter reader
user researcher
user interface engineer
user interface
user experience researcher
user experience
usability consultant
us
urology nurse
urban planner
unix system engineer
unix sysadmin
unix administrator
unix
university instructor
university grade software developer
university chaplain
united states census employee
unit supervisor
unit secretay
union electrician
union boilermaker
ui engineer
ui developer
ui designer
ui
typing machine
typesetter
tv talent
tv script writer
tv producer
tv presenter
tv news producer
tv news anchor/lineup editor
tv cameraman
tv analyst
tuk tuk driver
tss
trustee
trusted adviser
truck unloader
truck driver/warehouseman
trial attorney
tree faller
traveler
travel rn
travel publicist
travel photojournalist
travel agency
travel adviser
trauma nurse
trashy translator
transpricing consultant
transporter
transport industry trainer
transport driver
transport coordinator
translator/linguist
translator myself
translator i
translator german/english
translator english <-> urdu
transcriptionist part-time
transcriptionist
transcription quality controller
transcriber
transalator
transaction agent
training specialist i
training development director
trainer/content developer
trainer/consultant
trainee student
trainee purchaser
trainee bass teacher
trainee actuary
train guard
train
trails guide
trail crew member
traffic/billing director
traffic cop
traditional nonprofit consultant
tradeswoman
trademark paralegal
trade union official
trade manager
trade compliance specialist
tractor-trailer driver
toy designer
toxicologist
tourist
tourism photographer
tourism consultant
tourguide
tour-guide
topless waiter
tooth
tool designer
tl
tire tech
tiger team member
tier-1 technician
ticket seller
ticket
thief
these pastries
therapy dog
therapy aide
therapists
therapeutic support staff
therapeutic counsellor
theologian
theme wrangler
theatre nurse
theatre
theater critic
theater artist
theater
textiles sales rep
textile artist/illustrator
textile artist
textil worker
texas rancher
testing engineer
testing coordinator
test/automation engineer
test technician
test coordinator
terrain park ranger
terminal assistant
tenure track professor
tennis coach
temporary worker
temporary lecturer
temporary lab technician
temporary employee
temple worker
temp employee
television production manager
television director/scriptwriter
telephone repairman
telephone counsellor
telemetry assistant
telegraph officer
telecom expense management domain
teenager
teen librarian
technology writer
technology lawyer
technology investor
technology evangelist
technology enthusiast
technology coordinator
technologist
techno functional consultant
technician-programmer
technician ’
technicial manager
technical system administrator
technical support rep
technical support agent
technical support advisor
technical staff
technical services veterinarian
technical recruiter
technical project manager
technical producer
technical officer
technical instructor
technical drawer
technical diving instructor
technical content writer
technical author
technical assistant
technical account manager
tech service chemist
tech operator
tech integration specialist
tech guy
teatcher
teapot package designer
team you
team lead engineering
teaching/ research assistant
teaching staff
teaching assitant
teaching aid
teaching
teachers
teacherlect
teacher/manager
teacher/football coach
teacher/assistant
teacher working
teacher teaching
teacher spending
teacher part-time
teacher myself
teacher librarian
teacher aid
tea lady one day
tea boy
tc
taxidermyst
tax preparer part time
tax lawyer
tax consultant
tax advisor
tax accountant
tax
tattoomodel
tattooist
tarot reader
target warehouse member
tanner
talento
talent scout
talent manager
tailoress
systems/network engineer
systems software engineer
systems security engineer
systems
system operator
system developer consultant
system designer
system admin
sysadmin/head
sysad
sys administrator
sys
switchboard operator
swimming instructor
sw engr myself
sushi chef
survey supervisor assistant
survey interviewer
survey engineer
surgical vet tech
surgical technician
surgical tech
surgical fnp rnfa
surgical coord
surgery
surg
surf instructor
support staff
support specialist
support officer
supply sgt
supply manager
superviser
supermarket cashier
superior bikes marketing person
super-model
summer employee
summer counseler
summer associate
successful angel therapist™
subtitute teacher
substitute/on-call cashier
substitute teacher,do i
substitute teacher part-time
substitute p.
substance abuse counselor
subsitute teacher
submissions editor
submarine pilot
subject matter
style coach
stuntman
study career coach
studio arts educator
studio artist
studio
student-aid
student travel agent
student team manager
student teacher
student programmer
student library assistant
student intern
student guidance councellor
student greenskeeper
student aide
structural consultant
striptease dancer
street doctor
street counselor
strategy consultant/pm
strategy consultant
strategic digital marketing specialist
strategic adviser
stranger
stove promoter
storyteller
storeman
store accounting coordinator
store
storage systems administrator
stockroom guy
stockbroker
stock control
stitcher
stipend volunteer
stewardess
stenographer
stem teacher
steel worker
steel fabricator
steamfitter
stay-at-home mommy
stay-at-home mom
stay
stationery
start referee
start orientation leader
stand-in
stampin
stage manager
staffing coordinator
staff writer/editor
staff technician
staff systems engineer
staff photographer
staff lead
staff engineer
staff editor
staff assistant
staff artist
staff accoutant
stable manager
sr-associate
sr principle software engineer
squash centre manager
sql server administrator
sql database administrator
sprint team member
spotlight operator
sports teacher
sports reporter
sports photography stringer
sports photographer
sports performance coach
sports editor
sports coach
spokesman
spirituality/love coach/energy healer
spiritual/life/relationship coach
spiritual-healer
spiritual life coach
spiritual director
spiritual counselor offering sessions
spiritual care provider
speech/language pathologist
speech language pathologist
speech
specification manager
specialized assassin
specialist technician
specialist product photographer
special projects coordinator
special nurse
special needs resource consultant
special force candidate
special educator
special education para
special education
special ed.
special clerk
speaker coach i
speaker
spanish language interpreter
soundengineer and/or producer
sound tech engineer
sound tech
sound system engineer
sound editor
sound director/boom
soul retriever
solution manager
solution designer analyst
solo wedding photographer
solo photographer
solo performer
solo artist/star
solo artist
solo act
sole practitioner
sole operator
soil consultant
software support
software programmer
software guru
software executive
software enginer
software engineer andi
software development team leader
software development lead
software development engineer
software development director
software developer/application support engineer
software designer
soft dev btw
social worker/mental health counselor
social service work
social scientist
social media/marketing
social media team leader
social media person
social media manager/children
social media expert
social media assistant
social media ambassador
social managment network
social justice educator
soccer coach
socail worker
snowboard tour guide
snake handler
smm
smith
smart repairer
small time contractor
small engine technician
small business internet marketing consultant
small business consultant
small animal internist
sm @ pier1 imports
slp
slot attendant
skin therapist
skill game developer
ski mechanic
ski guide
ski coach
sketch artist
site supervisor
site hostess
single point
single mom
singing waitress
singer/ songwriter
singer
simple teacher
sign painter
sign language interpreter
sign language
sign holder
sideman
sideing installer
short order cook
shop-assistance
shitty generalist
ships fitter/ welder
shipping supervisor
shipping manager myself
shipper/receiver
shipper
shipelectroniks installer
shipbuilding engineer
shepherd
shelf stocker
sheet metal
sharepoint it
sharepoint developer
shampoo boy
shaman
sexual and reproductive health advisor
sexton
set dresser
sessional lobbyist
session musician
session leader
servicedesk analyst
service technician i
service tech/electronics repairman
service girl
service development engineer
service delivery manager
service coordinator
servant leader
serious orgy fest
sergeant
seo manager
seo freelancer
seo engineer
seo consultant
sensory research specialist
senior wordpress engineer
senior user experience designer
senior treasury analyst
senior testing specialist
senior teller
senior technical recruiter
senior technical analyst
senior systems/software engineer
senior systems engineer
senior system engineer
senior sysadmin
senior support worker
senior subeditor
senior sql server dba
senior software architect
senior server team engineer
senior search engineer
senior rov pilot
senior reservoir geologist
senior property accountant
senior program manager
senior product security engineer
senior portfolio developer
senior policy advisor
senior oceans campaigner
senior member
senior loan officer
senior litigation counsel
senior lead
senior laser engineer
senior java/j2ee developer
senior infrastructure consultant
senior home companion
senior high school teacher
senior helpdesk consultant
senior graphic designer
senior game artist
senior executive
senior engineer
senior electronics
senior economist
senior ecm consultant
senior designations officer
senior design engineer
senior dba
senior database developer
senior data scientist
senior copywriter
senior consultant specialising
senior concept
senior communications strategist
senior communications officer
senior commercial advisor
senior civil engineer
senior auditor
senior architect
senior application support analyst
senior admissions counselor
senior administrator
senior account manager
semiotician
semi-professional short-order cook
semester-to-semester employee
selling team
seller-consultant
self-trained nutritionist
self-published manga artist
self-employed tutor
self-employed translator
self-employed sub-contractor
self-employed specialist
self-employed registered dietitian
self-employed real estate broker
self-employed housekeeper
self-employed freelance web designer
self-employed fiction editor
self-employed cyclist
self-advocate
self help coach
sekretary
seduction
security worker
security officer working
security office
security network administrator
security manager
security inspector
security guard i
security gaurd
security administrator
secretary/personal assistant
secretary/middleman
secretary i
secretary general
secret shopper
secret agent
secret
secreatary
seafood clerk
sdet engineer
scuba
scrummaster
scripty
scriptwriter
screening officer
screener
scout
scientific project manager
scientific officer
science/chemistry teacher
science visualizer
science policy officer
science journalist
science editor
science communicator
science communication
science
school teacher i
school programs facilitator
school principal/teacher
school nurse we
school manager
school inspector
school home liaison
school custodian
school community resource specialist
school careers advisor
schoo/outdoor
scheme manager
scenic artist
saxophone player
santa
sandwich maker
saleswomen-consultant
saleswoman
salesman underwear
saleslady
sales/marketing officer
sales trainer
sales lady
sales engineer
sales department supervisor
sales associate
sale girl
sailor
safari tour guide
s.
russian tutor
russian teacher
russian escort
rural middle school
runwaymodel
runway model
runner
rule
rubbish web designer
routesetter
route-setter
route salesperson
room assistant
roofing salesman
roof
rollercoaster designer
roadside assistant
road designer
rmn
river biologist
ritzy investment banker
risk analyst
right size customer support
rib master
reviser
reviewer/proofreader
reverse engineer
revenue manager
revenue data clerk
retro-tainer
retained–search recruiter
retained firefighter
retail wireless consultant
retail supervisor
retail superviser
retail sales clerk
retail sales
retail experience specialist
retail customer service manager
retail cashier
retail accountant
resuscitation officer
restorer
restoration artist
restaurant manger
restaurant hostess
restaurant critic
restaurant accountant
resource specialist
resource room teacher
resource person
resource management officer
residential teacher
residential real estate appraiser
residential property consultant
residential counselor
residential appraiser
resident physician
resident manager
resident director
resident astronomer
resident advisor
reservoir engineer
reserve emt…
research technican
research specialist
research scholar
research historian
research assistant working
research assistant i
research administrator
reseacher
reporting
report,it ’
rentals manager
renovation specialist
removal occupation
remote-developer
remote teacher
remote medical officer
remote computer tech
remote cad operator
relief chef
relief advocate
release manager
release engineer
relationship counsellor
relationship coach
relate counsellor
rehabilitation counselor
regulatory associate
regulatory affairs assistant
regulator
regular english
registered respiratory therapist
registered nurse we
registered medical assistant
registered manager
registered clinical counsellor
regional workforce policy advisor
regional network administrator
regional manager sales(team
regional finance manager
regional director
refuge support worker
reflector
referral coordinator @ shands
redactor
recruitment manager
recruitment co-ordinator
recruitment
recruiter part time
recreational therapist
recreation programmer
recreation coordinator
recovery coach
recording studio assistant
recording
record mixer
recommendation
recess aide
receptionists
receptionist/secretary
receptionist/hall monitor
reception manager
recepcionist
realtor par-time
really good team
real np
real estate team
real estate investor
real estate investment analyst
real estate consultant
real estate broker
real estate appraiser
reader/note-taker
range officer
random old hairdresser
ranch hand
railway section hand
railroad
rail traffic controller
rail guide
raido promoter
raft guide
radiology supervisor
radio reporter
radio operator
radio journalist/newsreader
radio engineer
radio dj
radio d.
radio anchorperson
radiation therapist
racehorse trainer
race driver
ra/
r&d
quantity surveyor
quantitative user experience researcher
quantitative portfolio manager
quantitative analyst
quality control manager
quality contol data
quality checker
quality assurance specialist
quality assurance analyst day
quality analyst
quality
qualified biomechanist
qs cadet
qae
qa person
qa lead
qa graphic/layout designer
qa engineer
qa automation lead
qa analyst
python programmer
python developer
pyp hrt
pychologist
purchasing clerk
purchase
punching bag
pump consultant
publishing/writing consultant
public school bus driver
public safety officer
public relations manager
public purchasing agent
public hospital psychiatrist
public health registrar
public affairs specialist
public adjuster
pt officer
psychosocial counselor
psychology teacher
psychologist…
psychological counselor
psychodynamic therapist
psychodynamic psychotherapist
psychiatry resident
psychiatric social worker
psychiatric nurse practitioner
psych rn
pss
provider relations manager
protector
protection officer
prorammer
proposal writer
property manager myself
property developer
proofreader/editor
proofreader/copy-editor
proof-reader
promotora
promotional model
promo personnel
projet lea
project officer
project manger
project management consultant
project lead
project engineer/cost analyst
project director
project assistant
project architect
project admin
programmer/analyst
programmer today
programme officer
programme director
program support assistant
program manager/ instructor
program evaluator
program analyst/mission planning specialist/foreign liaison
program analyst
program
professionnal
professional wedding photographer
professional web developer
professional web designer
professional tour director
professional theatre director
professional tarot card reader
professional stylist
professional stripper
professional storyteller
professional software architect
professional snowboarder
professional seo
professional sculptor
professional psychic astologer
professional private guide
professional pianist
professional officer
professional nyc state electrologist
professional npc
professional musician i wish guitar pro
professional mixer
professional mediator
professional makeup artist
professional make-up artist
professional magician/mindreader
professional librarian
professional killer
professional interpreter
professional ifmga mountain guide
professional hairdresser
professional guitarist/vocalist
professional guide
professional gardener
professional game developer
professional fund manager
professional freelancing musician
professional firefighter/emt
professional filmmaker
professional fashion designer
professional family
professional environmentalist
professional english translator
professional educator
professional earth mover tire installer
professional driver
professional domme
professional dog masseur
professional designer
professional dancer
professional counselor
professional copywriter
professional consultant
professional computer programmer
professional communicator
professional cleaner
professional business coach
professional blogger
professional barber
professional backup/recovery sysadmin
professional author
professional animal communicator
professional aero engineer
profession photographer
profesional firefighter
prof
production supervisor
production sound mixer
production rigger
production dba
production coordinator
product tester
product specialist
product safety
product rep
product owner
product intern
product development manager
product design
product copywriter
product coordinator
product
producer/composer
procurer
procurement manager
procurement consultant
proctor
processor
process technician
problem solver
pro-grammer
pro-domme
pro house cleaner
pro fight judge
prn
privet
private writing coach
private trainer
private tour guide
private system administrator
private security officer
private practitioner
private practioner
private practice ibclc
private english teacher
private educational consultant
private duty nurse
private driver
private drive
private construction company
privatdozent
prison librarian
printer buy day
printer
principal imaging systems engineer
principal advisor
principal accounts officer
princess
primary school teacher teaching science
primary school teacher i ’ ve
primary school lsa
primary school librarian
primary instructor
primary care physician
priest
pride guide
pricing analyst
prevention specialist
pretty much voluntary moderator
press snapper
press officer
presidential ambassador
preschool teacher trainer
preschool teacher assistant
preschool substitute
presales solutions architect
preprimary teacher
prep cook
premises manager
preemie nurse
pre-school teacher
pre-sales consultant software
pre-op transsexual
pragensis journalist
practicing attorney
practice nurse
practical nurse
pr executive
pr director
powertrain
powerful team
power plant operator
power engineer
postman
postgresql dba
postdoctoral research associate
postal worker
postal service workman
postal carrier
post-doctoral clinical-health psychology researcher
post-audio engineer
post-anaesthetic nurse
post production runner
post doc researcher
post
possibility strategist
portrait artist
poll clerk
politician
political sociologist
political researcher
political consultant
polish/english translator/interpreter
policy officer
policy adviser
police/fire
police sergeant
police constable
police community engagement officer
police chaplain
pole dance
pol sci professor
poker tournament director
poetry-teacher member
poet
podiatrist
pm
plummer
playwright
playworker
playleader
playground monitor
player
plastic surgeon
plastic injection tool designer
plant breeder
planning engineer
planning director
planning
pj
pizza delivery man part time
pizza delivery driver
pizza delivery boy
pirate
pipeliner
pipeline system controller
pipefitter
pinterest consultant
piercer
picture editor
pick
pianist
pi paralegal
physiotherapists
physiotherapist alot
physio
physician living
physician liaison
physican assistant
physical therapy
physical educations teacher
physical education teacher
physical education
physiatrist
php programmer
photoshop web designer
photography technician
photography lecturer
photographic process workers
photographer/photo editor
photographer/editor
photographer/designer
photographer i concentrate
photo organizer
photo assistant
photo
phone english teacher
philosophical practitioner
phd-student/researcher
pharmacy technician trainee
pharmaceutical doctor
pharamcy tech
pf pm manager
petition-writer
pet stylist
pet sitter
pet nurse
pet groomer
personla trainer
personal tutor
personal training director
personal trainer/fitness instructor
personal trainer myself
personal support worker
personal support care giver
personal shopping assistant
personal rn
personal meditation coach
personal caregiver
personal care aide
person-centred counsellor
permanent substitute teacher
permanent researcher
perinatal nurse
peri operative tech
performance
perfectionist
peon
pen-tester
peer support worker
peer counselor
peer advisor
pee collector
pedicabber
pediatric speech-language pathologist
pediatric pt
pediatric physical therapist
pediatric oncologist
pediatric occupational therapist
pediatric home care nurse
pediatric hh lpn
pedagogical assistant
peace
pe teacher
pd dispatcher
pca/homemaker
pc/network technician
pc/network administrator
pc technician/network administrator full-time
pc tech cleaning
pc operator
payroll worker
payroll apprentice
payroll admin
pattern designer
patrol officer
patrol division deputy sheriff
patrol
patients
patient-churning
patient transporter
patient educator
patient education director
patient coordinator
patient companion
patient care technician
patient
pathfinder
patent writer
patent attorney
patent
pastry baker
party manager
party host
parttime
parts guy
partnerships locality officer
partner solution consultant
partner manager
partner driver
partime job
part-timer
part-time yoga teacher
part-time xxx
part-time worker
part-time wardrobe consultant
part-time volunteer i
part-time student
part-time software developer
part-time sales assistant
part-time sales
part-time research associate
part-time professor
part-time pastor
part-time nurse
part-time news editor
part-time nanny
part-time lifeguard
part-time library aide
part-time graphic designer
part-time game developer
part-time gallery manager
part-time french high-school teacher
part-time employee
part-time electrician
part-time doula
part-time contractor
part-time consultant
part-time cfo
part-time beauty artist/advisor
part-time assistant teacher
part-time assignment writer
part-time administrative assistant
part-time admin assistant
part time volunteer
part time vet
part time tutor
part time support worker
part time self
part time sales supervisor
part time relexologist
part time prof
part time police officer
part time nurse
part time lingerie
part time lifeguard
part time library ninja
part time instructor
part time illustrator
part time housekeeper
part time guardian
part time freelancer
part time fiction editor
part time engineer
part time consultant
part time circus coach
part time blogger
part time bartender
part time assistant
part ii
part i
parole officer
parliamentary assistant
parish
paraprofessional indexer
paramedic right
paramedic i
paramedic and i love it
para-librarian
para-educator
par time
paper-pusher
paper pusher
paper maker
painter,cam-girl,auctioneer,animal volunteer,lobbyist,petition writer
pain-nurse
pain consultant
page
paediatric nusre
padi instructor
padi dive instructor
pacs applications system specialist
packer
overnight charge nurse
orthodontist technican
optometric tech
operating manager
openstack solutions architect
online tutor
online marketer
one-woman crew
one man band
one
on-call tour guide
ojt tariner
oil distributor
officer
office worker
office manager
odds compiler
nyc process server
nutritional therapist
nutrition consultant
nursing student
nurses aide
nurses
nursery teacher
nurse-i need
nurse ‘ s aid
nurse shift worker
nurse part time
nurse manager
nurse educator
nurse anesthetist
nurse advisor
nurse acro
nuclear medicine technologist/ct tech
nuclear medicine technologist
nuclear engineer
nt administrator
novel editor
northbrook park wedding photographer
normal or plus size model
nonprofit fundraiser
nonlife actuary
nonformal educator
non-functional test contractor
noc systems administrator right
noc engg i.
no dig gardener
nite supervisor/case manager
nissan tech
night security officer
night security
night porter
night custodian/maintenance worker
night clerk
night
newspaper editor
news repporter
news photographer
news cameraman
news assistant
newborn baby photographer
new scotland yard intern
new real beast
new nurse
new media communications assistant
new hire mentor
neutral channel i
neurology nurse
neurologist
neuro surgeon
networking /electronics tech
network/systems
network specialist
network operations center technician
network manager
network geek
network engineer/system administrator
network engineer/administrator
network engineer/"your company
network engineer contractor
network assistant
network architect
network administator
net admin
nerdy scientist
neonatologist
neonatal nurse practitioner
negotiator
navigator
naturopathic doctor
naturopath
natural rhythms creation coach
natural resources planner
natural resource biologist
natural health care practitioner
national security researcher
national health service nurse
nannybabysitter
nanny/domestic helper
nanny/ sitter
mystic detective
musician/composer
musical instrument repair technician
music therapist
music librarian
music journalist
museum docent
museum conservator
munitions officer
municipal link officer
multimedia teacher
multimedia producer
multimedia artist
multimedia
multilingual intern
multi-disciplinary designer
mri researcher
movie editor
movement consultant
mountain rescuer
mountain rescue volunteer
mountain guide
motorcycle police officer
motorbike flight messenger
motor mechanic
mother i
mortgage underwriter
mortgage originator
mortgage lender
mortgage banker
montessori tutor
montessori teacher
montessori primary teacher
monorail guide
monkey
money coach
money adviser
model fulltime
mod
mobile reflexologist
mobile practitioner
mobile pet groomer
mobile massage
mobile hairdresser
mobile front end developer
mobile beautician
mixed-media
mixed martial artist reporter
ministers
mining engineer
miner
milling sharpener
milk recorder
military doctor
midwife/nurse
midweek treat
middle school math teaching associate
middle school math
middle school counselor
middle manager
mid-level provider
mid-level manager
microsoft systems
microsoft infrastructure consultant
meteorologist/computer programmer
metallurgist
metal worker
metal fabricator
messenger driver
mermaid
merchant mariner
merchant marine
merchant application underwriter
merchandising intern
mentors
mentor/coach/counselor
mental medium
mental health worker
mental health support worker
mental health practitioner
membership advisor
mekanik
meeting designer
meeting coordinator
meditation teacher
medicinal chemist
medical vri interpreter
medical translator
medical transcriptionist home base
medical technologist
medical student
medical sales rep
medical review officer
medical physicist
medical person
medical officer
medical laboratory tech
medical lab tech
medical electrophysiology technologist
medical courier
medical content editor full-time
medical consultant
medical coding associate
medical anaesthetist
medical admin
media relations consultant
media consultant
media buyer
media assistant
media archivist
media architect
media analyst
med/surg nurse
med surge nurse
mechanical/software engineer
mechanical q.
mechanical drafter
mechanic/lube tech
mechanic problem
mech/electrical engineer
mech
meat-cutter
me
mathematics language teacher
mathematics coach
mathematic tutor
math tutor part time
math teaching assistant
math specialist
math intervention teacher
math instructor
math consultant
math coach
math
materials inspector
material control officer a.
mate
master engineer
master creativity coach
master bear builder
massage/physical therapist
massage therapist trader joe
mason
marketingassistent
marketing strategist
marketing queen
marketing consultant i
marketing communications specialist
marketing analytics specialist
market research manager
market analyst
maritime archaeologist
marine service engineer
marine photographer
marine mechanic
marble merchant
marble
manufacturer
manual orthopedic physical therapist
manicurist
mandarin interpreter
manager
managing engineer
manager-hostess
management consultant specia
management
man
mammography technologist
male stripper
male
makeup/fx artist
makeup effects artist
makeup artist they
majority
major gifts officer
maintenance technician
maintenance person
maintenance landscaper
mail-man/postman
magazine art director
madison
machinest
machinecal draftsman
mac specialist
lyricist
lync trainer
lvn
lunchtime supervisor
lunchroom supervisor
lunch time assistant
lumber secretary
lube technician
lpn
lowly tech
lowly student assistant
low-level college administrator
loss prevention officer
loss adjuster
longshoreman
long term translator
long term substitute school guidance counselor
long shoreman
lone writer
london artist
logistics supervisor
logistics specialist
logistics engineer specialist
logistics coordinator
logistic officer
lodging managers
locomotive engineer
locations coordinator
local radio dj
local pastor
local news columnist
local lic
local government manager
lobbyist
loan
loader
lna
lived experience development worker
live-in-carer
live-in nanny
live nursery sales specialist
litigation consultant
literature tutor
literary consultant
literacy tutor
literacy teacher
linux systems
linux servers administrator
linguist
lingerie model
line therapist
limousine driver
limodriver
limo driver
lighting technican
lighting artist
lighting
lifestyle reporter
lifegaurd
life enrichment coordinator
life cycles educator
life coach —
licensed vocational nurse
licensed substances abuse counselor
licensed special education teacher
licensed sales producer
licensed psychotherapist
licensed professional counselor intern
licensed professional counselor
licensed practical nurse
licensed nursing assistant
licensed mental health counselor
licensed marriage family therapist
licensed marriage
licensed counselor
licensed consulting
licensed clinical laboratory scientist
licensed assistant
library technology assistant
library officer
library media tech
librarian…don
librarian ɑnd i
librarian technician
librarian clerk
liaison person
level iii network technician
level designer
level
legislative assistant
legal secretary/paralegal
legal researcher
legal nurse consultant
legal consultant
legal clerk
legal assistant downtown
legal advocate
legal adviser
lecturer i
learning technologist
learning services officer
learning advisor
lean expert/continuous improvement consultant
lead user researcher
lead artist
lead advisor
layout
law enformance officer
law enforcement ranger
law enforcement officer
law clerk
laundry attendant
launchsource employee
laser therapist
large format printer
large equipment fueler
laptop technician
language tutor
language services provider
language arts consultant
landscape photographer
landscape gardener
landscape architecture
lamaze
lactation counselor
lactation consultant
labtech
labourer
laborer
laboratory worker
laboratory monitor
laboratory
laboratorie technisian
labor/delivery nurse
labor organizer
labor doula
lab technologist
lab tech myself
lab tech
lab ta
lab scientist
lab research assistant
lab coordinator
l&d rn
knowledge-base
knowledge
knowldge management
knight
kitchen utility
kitchen designer
kinesiologist
kindergartner
kindergarten/second-grade paraprofessional
kindergarten aide
kindergarten
kids illustrator
kfc
key-worker specialising
key holder
kennel technician
karaoke dj
jv manager
jury consultant
juniorprofessor
junior web developer
junior web application developer
junior ux designer
junior sql developer
junior software engineer
junior researcher
junior physician
junior marine educator
junior lawyer
junior graphic designer
junior doctor
junior director
junior consultant
junior associate
junior architect
junior administrator
judicial law clerk
judgmental one
journalist writing
job title
job coach
jewelry photographer
jewelry d.
jeweler
jbpm quality engineer
jazz musician
javascript
java
it security consultant
it person
it network administrator
it guy
it business systems analyst
it
interpreter/translator
internet/intranet specialist
international tax
interior decorator
insurer
instructor
instructional assistant
infrastructure administrator
information consultant
infant teacher
industrial firefighter
india
independent professional consultant
independent media artist
independent escort
ict sector manager
i.
i
hypno-domina
hygienist
hybrid it-guy/developer
hvac engineer
husband
hunter
humanist celebrant
humaniatrian aid worker
human-wildlife conflict officer
human rights representative
human resources specialist
human resources
human resource manager
human resource assistant
hs language teacher
hr support assistant
hr manager
hr generalist
hr executive
hr director
housing counselor
housepainter
household maid
house parent
house painter
house manager
house keeper
house director
hotel night auditor
hotel manager myself
hotel maid
hotel housekeeper
hotel clerk
hostess/model
hostess/bartender
hospitalist np
hospital-based social worker
hospital nurse
hospice cna
horseshoer
horse trainer
honda salesman
homeopathic doctor
homehealth
home remodeler
home insurance agent
home instructor
home inspector
home health companion
home health attendant
home fashion consultant
home economist
home attendant
holistic wellbeing
holistic nutritionist
holistic medicine consultant/coach
holistic health consultant/coach
holistic health coach
holidays manager
hod carrier
hobby
history professor
historical research assistant
historical consultant
historian
histopathology technician
hinges
higher education administrator
high-school teacher
high school music teacher
high school maths teacher
high school library media specialist
high school esl teacher
high school english teacher
high school drama teacher
high school counselor
high school computer teacher
high school art teacher
hench man
helpdesk/support technician
helpdesk
help desk analyst
helicopter maintenance supervisor
heavy engineer
heavy duty diesel mechanic
heath tech
healthcare worker
healthcare it
healthcare chaplain
healthcare
health visitor
health specialist
health promotion officer
health practitioner
health knowledge information scientist
health improvement lead
health educationist
health coordinator
health consumer
health club manager
health care worker
health care physician
health blogger
head chef
head \blanka{nurse
harpsichordist
harpist
hardware designer
handler
handicrafts teacher
hallmark retail merchandiser
half-time
hair-dresser
hair stylist part time
hair salon manager
h.
gynaecological cytologist
guy
guitarteacher
guitarist
guide-interprete
guide full time
guest relations manager
guest lecturer
guest host
gsi/gsr my
gs-9
growth hacker
group product manager
group home superviso
group fitness instructor
group facilitator
groundskeeper/jack
grounds manager
ground engineer
groom
grocery store
grocery stocker
grocery assistant
grief counsellor
grey patch
great team
grease monkey
graveyard cashier
grapic designer
graphics manager
graphics
graphic illustrator
graphic designer/product photographer/it support tech
graphic designer/interior decorator
graphic designer,animator
graphic designer entrepreneur
graphic design associate
graphic desi
graphi designer/art director
graduate-school professor
graduate trainee
graduate teaching fellow
graduate student instructor
graduate student adviser
graduate research
graduate nurse
graduate intern
graduate analyst
gradual assistant
grader
gpsi
gp registrar
govt
government performance auditor
government electronics contractor onboard u.
government documents
government
governess
gov
gopher
google app developer
goods manager
good team
golf pro
golf course superintent
golf course architect
gn
gm
gluten
global ambassador
global advocate
global accounts
glamour model
gis/remote
gis coordinator
ghost writer
ghost tour guide
german attorney
geophysicist
geographer
genetic counsellor
generalist
general/ family practitioner
general worker
general practioner
general pediatrician
general clerk
general adviser
general adult
geek
ged teacher
ged instructor
gay-affirmative therapist
gass station
gas station clerk
gardener tomorrow
garden designer
garden coordinator
gaming attendant
g.
fіnancial officer
futures consultant
furniture specialist
furniture design/ consultant…
funeral director
fund raiser
fulltime specialty pharmacy technician
fulltime freelancer
fulltime freelance illustrator
fulltime fire fighter paramedic
fullstack developer
full-time web developer/programmer
full-time web developer
full-time va
full-time tutor
full-time travel blogger
full-time translator
full-time support worker
full-time supervisor
full-time self-employed therapist
full-time rn
full-time proofreader
full-time programmer
full-time professional tutor
full-time professional model
full-time personal trainer
full-time personal support worker
full-time officer
full-time night shift nurse
full-time mountain employee
full-time marketer
full-time lecturer
full-time kindergarten teacher
full-time karaoke host/ events coordinator
full-time janitor
full-time hebrew
full-time freelancer translator
full-time freelance writer
full-time freelance translator
full-time entrepreneur
full-time employee
full-time editor
full-time developer
full-time content writer
full-time client representative
full-time cardiac nurse
full-time army contractor
full-time alderman
full-time academician
full-stack
full time studio artist
full time sr
full time sculptor
full time professor
full time pharmacist
full time musician
full time mommy
full time metal sculptor
full time manager
full time java developer
full time hairdresser
full time hair stylist
full time genealogist
full time fashion designer/stylist
full time designer
full time contractor
full time composer
full time chaplain
full time call center agent
full time bookkeeper
full time blogger
full time artist/illustrator
full stack web developer
full figured/plus size
fu
front office assistant
front end manager
french assistant
freelancer/independent contractor
freelancer translator
freelancer tour guide
freelancer programmer
freelancer myself
freelancer developer
freelancer business consultant
freelancer artist
freelancer article writer
freelancer a.
freelanced screen-designer
freelance/hobbyist developer
freelance writer/editor
freelance whiteboard animator
freelance website designer
freelance video
freelance ux concept creator
freelance tv sound recorsist
freelance travel photographer
freelance tourist guide
freelance television production manager
freelance television cameraman
freelance technical writer
freelance technical editor
freelance technical diving instructor trainer
freelance teacher
freelance stylist
freelance style author
freelance storyteller
freelance stage manager
freelance spanish translator
freelance songwriter
freelance software engineer
freelance software designer
freelance software
freelance shooting director
freelance set
freelance seo consultant
freelance security consultant
freelance sales specialist
freelance researcher
freelance record producer
freelance public relations writer
freelance production assistant
freelance producer
freelance pr consultant
freelance photojournalist
freelance photographer specialising
freelance organisational consultant
freelance network/computer consultant
freelance net designer
freelance motion picture projectionist
freelance motion
freelance media producer
freelance master
freelance marketing consultant
freelance marine biologist
freelance manga editor
freelance makeup artist part time
freelance location sound mixer
freelance lifestyle
freelance layout artist
freelance landscape designer
freelance instructional designer
freelance illustrator/animator
freelance guitarist
freelance graphic-artist
freelance graphic designer/art director
freelance graphic artist/painter
freelance ghostwriter
freelance french translator
freelance food writer
freelance financial consultant
freelance filmmaker
freelance film projectionist
freelance entertainment journalist
freelance english
freelance educator
freelance educational counselor
freelance editor/writer
freelance editor/proofreader
freelance dj
freelance digital strategy manager
freelance digital filmmaker
freelance digital designer
freelance designe
freelance deckhand/engineer
freelance database programmer
freelance dancer
freelance dance artist
freelance cycling journalist
freelance curator
freelance cross media designer
freelance creative writer
freelance cook
freelance content developer
freelance consultant engineer
freelance conference interpreter
freelance commercial
freelance choreographer
freelance ceramic designer
freelance cartoonist
freelance carpenter
freelance business consultant
freelance book designer
freelance artist working
freelance article writer
freelance art director
freelance application platform consultant
freelance and i study culture management
freelance advertising copywriter
freelance administrative assistant makeup
freelance academic editor
freelance 3d artist
free-lancer translator
free-lancer i aim
free-lancer
free-lance trainer
free-lance science writer
free-lance journalist
free-lance designer
free-lance consultant
free lancer i
free lance web
free lance translator
free lance photographer
free economist
free artist
franchise consultant
framework developer
fox news contributor
fourth grade teacher
foster carer
foster care case worker
foster care
fossil preparator
formulations chemist
formulation scientist
fork-lift driver
forest ranger
forest firefighter
forest fire ranger
forensic mental health support worker
foreign student
forecasting analyst
forced labor
footwear technologist
footman
football coach
food stylist
food service manger
food service managers
food seller
food safety/quality assurance specialist
food reviewer
folklorist
fne
flower-decorator
flooring specialist
flight nurse
fittings model
fitting model
fitter welder
fitness trainer n nutritionist
fitness coach
fitness
fit model
fisheries biologist
fish packer
first-grade chinese teacher
first-aider
first year analyst
first nations support worker
first grade teacher
first assistant director
firmware engineer
firefighter/emt
firefighter paramedic
fire sprinkler pipe fitter
fire protection project manager
fire fighter/emt
fire engineer
fire captain
fine art portrait
fine art photographer
fine
financial/loss adjuster
financial systems developer
financial specialist
financial rep
financial modeller
financial journalist
financial executive
financial controller
financial content writer
finance consultant
finance associate
finance assistant
filmmaker/art director/production designer/editor
film production manager
film critic
fill
file clerk
field tech
field supervisor
field operation assessor
field manager
field executive
field contract administrator
field archaeologist
field agent
fiddle player
fiction writer
festival photographer
fencing referee
fencing contract
felony prosecutor
fee
federal contract
features reporter
feature film editor
fax operator
fat acceptance activist
fast food manager
fashion-photographer
fashion photographer
fashion liaison
farrier
family physician
family nurse practitioner
family medicine physician
family counsellor
fair hostess
factory storeman
factor
facility consultant
facilities project manager
fabric desiger
fabric artist
fabracator
fa
events photographer
european tour guide
ese teacher
escort
equestrian course designer
environmental scientist
english teacher
engineer/researcher
engineer-technologist
emt
employee
electronics tech
electronic salesmen
electronic designer/programmer
electrician,my own place
educational researcher
editor
duty manager
duo
dubstep dj
dsp
dsa practitioner
dry wall contractor
drupal developer
drug rep
drug investigator
drug counselor
drug counsellor,which
driving instructor
driver/guide
driver education instructor part time
dressmaker
dressing room attendant
dressing chef
dramaturg
dramatist
dramatherapist
drama trainer
dpe
dp ’
dotor
doormat
door
donor relations officer
domestic violence prevention worker
domestic violence advocate
domestic maid
domestic care nurse
domain software tester
domain admin
dog handler
dog
doer
dod contractor
docutech
documentation manager
documentary photographer
documentary cameraman
documentalist
document translator
doctor physician
doctor it
dock manager
docent-tour guide
dj host
division manager
diving instructor
diversity educator
diversity consultant
diversional therapist
divemaster
dive guide
ditch digger/factory worker
district manager
distric contractor/architect
disability rights
disability
director hr
directional driller
direct support professionals
direct support
direct service provider
direct sales rep fpr comcast/xfinity
direct care staff
digital trainee
digital tech
digital strategist
digital signal processing
digital retoucher
digital press operator
digital portfolio manager
digital photographer
digital media specialist
digital media educator
digital marketing apprentice
digital literacy coach
digital guy
digital design
digital creative director
digital communications specialist
digital communication consultant
digital arts teacher
digital artist
digital analyst
dietitian full-time
dietitian
dietary supervisor
dietary aid
diesel instructor
dictation typist
dialysis technician
dialysis tech
diabetic specialist nurse
devil summoner
developmental specialist
development producer
development director
development coordinator
development
developer/architect
developer/analyst
developer researcher
developer programs engineer
developer evangelist
develoment finance manager
dev
detail cleaner
destination photographer
desktop soe guy
desktop engineer
desk attendant
desk assistant
designer/printer
designer/illustrator
designer/fe developer
designer sr
designer marketing exec
designer manager
design-engineer/designer
design specialist
design eng
design educator
desiger
dermatology nurse practitioner
deputy sales manager
deputy manger
deputy district attorney
deputy corrections officer
dept
departmental manager
department lead
dental nurse it
dental hygienst
dental hygienist(i
dental hygienist full time
dental consultant
dental clinician
dental assistant/receptionist
demonstrator
dementia champion
delivery woman
delivery postman
delivery partner
delivery man
deli clerk
delhi
delegate
defense attorney
deer biologist
deep sea
deep desktop support expert
deburr tech
death surrogate
dealership technician
dce
daycare manager
daycare assistant
day trader
day counselor
day
database/web developer
database/web administrator
database programming
database developer
database analyst
data typish
data science programmer
data retrieval specialist
data journalist
data engineer/scientist
data conversion analyst
data artist
data analytics consultant
data analyst writing code
dance instructor remember
dallas county master gar
daka
cytologist
cyclo driver
cycle courier
cybersecurity
cyber security analyst
cutter
cusual labourer
customs inspector
customs clearance agent
customs broker
customer support executive
customer service operator
customer service officer
customer care representative
custom car designer
custom cabinet designer
custodian/bus driver archdale trinity
cust
curriculum writer
curriculum reviewer
curriculum assistant
current affairs researcher reporter
currency volatility trader
culture/feature journalist
culinary producer
ctrs my mother
ctr
cti/telephony specialist
cs(canine supervisor
cs professor
cs
croupier part time
cross-disciplinary artist
cross platform developer/engineer
crma
crisis therapist
crisis manager
crisis counselor
criminal defense lawyer
criminal defense attorney
crime scene investigator
crime investigator
crime analyst
credit counsellor
creativity performance coach
creative technologist
creative strategy director
creative solutions consultant
creative lead
creative agent
crane operator
craftsperson
craft designer
cpp programmer
cpe supervisor
cpa auditor
cp
cow-cockey
courtesy clerk
court reporter
course manager
couples counsellors
couple therapist
couple counselor
counselor/teacher
counselor part-time
counselling psychologist
councillor
council officer
cota
costume-m
costume actor
cosplay photographer
cosmetic merchandiser
corrections deputy
correctional officer/ dispatcher
corporate video producer
corporate secretary
corporate sales rep
corporate sales executive
corporate paralegal
corporate level microbiologist
corporate flight attendant
corporate finance
corporate educator
corporate chaplain
corporate attorney
corp
core team
core process psychotherapist
copywriter/proofreader
copyright agent
copyist
copyeditor
copier technician
coop
cooker
cook supervisor
controls system programmer
control systems engineer
control room
control engineer
contributor fox
contractual freelancer
contractor/llr
contracting carpenter
contracted grant writer
contract surveyor
contract specialist
contract researcher
contract nursing
contract lawyer
contract individual
contract engineer(sw/hw
contract database administrator
contract case writer
contract bike
continuous improvement consultant
content/socialmedia manager
content producer(www
content producer
content marketing intern
content head
content director
content designer
content creator
contemporary mixed media artist
contact centre customer service advisor
consumer affairs specialist
consulting software tester
consultant/contract electrical engineer
consultant travel planner
consultant software
consultant psychiatrist
consultant programmer
consultant pharmacist
consultant i
consultant historian
consultant archaeologist/heritage consultant
consultant anesthesiologist
consultant @
constructor
construction site inspector
construction management engineering
construction labour
construction estimator
conservation
confidential informant
conference interpreter
conference coordinator
conduit assembler
concert master
conceptual portrait
concept manager
concept design
concept
computerized accounting
computer teacher
computer systems engineer
computer systems department manager
computer system administrator
computer shop-assistant
computer science researcher
computer repair technician
computer programmer half
computer operations mgr
computer network administrator
computer lab attendant
computer games artists
computer engineer
computer assistant
computational scientist
compulsory freelance social worker
comptuer network software technician
compliance coordinator
completely freelance painter
complementary team
compassionate , transformative guide
company secretary reporting
companion
community support individual
community rail development officer
community police officer
community pharmacist
community organizer
community officer
community minister
community liaison officer
community health worker
community health volunteer
community facilitator
community educator
community drug educator
community developer
community arts project coordinator
communications secretary
communications officer
communications manager
communications director
communications consultant
communications advisor
communication trainer
communication technologist
communication designer
communication coordinator
communication
commodoties analyst
commodity lumber broker
commissioned salesman
commissioned portrait artist
commission
commercial stingless beekeeper
commercial security officer
commercial realtor
commercial real estate lease administrator
commercial real estate attorney
commercial radio installer
commercial model
commercial marketing student writer
commercial litigation attorney
commercial lender
commercial landscaper
commercial animator
comics illustrator
comic book artist
comfort keeper
comercial
comedian
combat medic
combat engineer
colorist
color kitchen lead
color advisor
collision tech bodyman
colliery viewer
college student
college program advisor
college library assistant
college lecturer
college instructor
college english professor
college administrator
college adjunct professor
collector
collective worship councillor
colleague
collateral loan broker
collaborator
collaborative pianist
cognitive behavioral therapist
coffee distributor
code monkey
cobbler
coach/counsellor/guide
co-owner
co-ordinator
co-op student
co-author
co
cnc programmer
cnc machine programming engineer
cna full time
cma
club promoter
clown-dance character
clown doctor
cloud evangelist
clothing designer
clothes-maker
closing coordinator
close team
clinical trials program manager
clinical technician
clinical research coordinator
clinical research assistant
clinical pediatric dietitian
clinical office assistant
clinical nutritionist
clinical nurse specialist
clinical nurse
clinical laboratory technician
clinical informaticist
clinical hypnotist
clinical educator
clinical chemist
client executive
client delivery manager
client coordinator
client advisor
client
clerical worker
clerical staff
clerical assistant
clerck
cleaning team
cleaner i
classical homeopath
classic bike mechanic
class assistant
civilian love it
civilian human resources specialist
civilian contractor
civil/environmental engineer
civil litigator
civil engineer tech
civil enforcement officer
city worker
city planner
city guide
circus coach
circulation supervisor
circulation assistant
cio
cinema projectionist
church secretary
church planter
church minister
church administrator
choral director
chiropractic physician
chip engineer
chinese language tutor
chimney sweeper
children photographer
children event planner
childminder i
childcare teacher
child psychologist
child practitioner
child health nurse
child care teacher i
child abuse investigator
chief technical officer
chief officer
chief information security officer
chief curator
chicken farmer
chemistry lecturer
chemistry laboratory technician
chemistry
chef/food service director
cheer coach
check-in agent
chauffeur
chartered psychologist
charity lawyer
character art supervisor
change agent
chalkboard artist
cg operator
cg artist
cfe
certified yoganurse
certified veterinary technician
certified toefl instructor
certified substance abuse counselor
certified public accountant
certified professional coder
certified occupational therapy asst
certified nursing
certified nurse midwife
certified nurse assistant
certified nurse aide
certified microbiologist
certified medical interpreter
certified medical asst
certified master mechanic
certified health coach
certified flavor chemist
certified esl teacher
certified educator
certified dog trainer
certified clinical research coordinator
certified beauty advisor
cert iv
ceritifed nurses
ceo inn
ceo
centreless
celebrity manicurist
cd
cca
cave guide
catty receptionist
catholic church musician--i
catering manager
cater-waiter
catechist
catastrophe insurance adjuster
catalyst
catalog assistant
cat sitter
cat hunter
cat
casual employee
casual community nurse
casting producer
cashier/stocker
cashier/customer assistance person
cashier/clerk
cashier/bagger
cashier part time
case worker/teacher
case worker
cart puller
cart girl
carrier
carpet cleaner
carpenter my hobbies
carman
careworker
caregiver/cna
careers adviser
career coach
care team leader
care partner(which
care manager
care
cardinals writer/columnist
caravan technitian
car-salesman
car mechanic
car detailer
car dealership co-owner
car
cantor
canoe guide
canine massage therapist
canine behavioral consultant
cancer surgeon
cancer registrar
campus supervisor
campus nurse
campus minister
campaigns manager
camp nurse
camp counsellor
camera woman
cam model
café bar team member
cadet
cad/cam technician
cad manager
cable installer
cable
cabinet-maker
cabin crew
cab driver
buying
butler
buteyko breathing educator
bussiness development executive
busser
businessman
business/sales analyst
business technology analyst
business system analyst
business software developer
business psychologist
business process
business man
business growth coach
business development
business communication skills trainer
business analysts
business administrator
bus schedules compiler
burrito roller
burger
bunny girl waitress
building renovation worker
building performance engineer
building manager
build
buckner children
bridge
brazer
branding specialist
brand protection analyst
brand manager
brand ambassador
braille translation editor
bpo quality training
boxer
bow technician
bouncer/ doorman
botanist
botanic apothecary
boost
boom-operator
booking agent
book translator
book shepherd
book seller
book scout
book scanner
book publicist
book
bond enforcement officer
boiler operator
boiler engineer
bodywork therapist
bodyguard
body shop manager
body shop estimator
body guard
boat carpenter
board
bloody analyst
blacksmith
black topper
bisexual women
birth parent counselor
biotechnologist
biostatistician
biomedical scientist/cytotechnologist
biomedical scientist
biomechanical engineer
biological statistician
bioethicist
billing specialist
billing clerk
bilingual/esl teacher
bilingual sales representative
bilingual kindergarten teacher
bilingual chief poll worker
bikini barista
bigdata engineer
bicycle
bfing peer couselor
bespoke
bereavement counsellor
belly dancer
bellman
behavioural support worker
behaviorist
behavioral specialist
behavioral medicine specialist
behavioral health tech
behavioral health
behavior specialist
beauty-consultant
beautician i
bear guide
bass teacher
basketmaker
basis consultant
bartender/stripper
bartender,sounds
barn assistant/trainer
barista part time
bariatric educator/clinical nutritionist
bargeman
barber
bar supervisor
bar manager
bar man
bar maid
bar
banking it
banking analyst
bank courier
band teacher
baltimore city
ballroom dance teacher
ballet instructor
bail bondsman
bail bonds enforcer
baggage service agent
baggage handler
bag boy
background
backend java developer
backend engineer
backcountry
baby-stylist
baby photographer
auditing officer
audio-engineer
asst prof
associate professor
assitant teacher
assistant chef
asphalt technologist
artist instructor
art director
army guard soldier
armed security office
armed guard
archivist
architecture
arborist assistant
apprentice
applications interface designer
android os developer
analyst
an rn
alltime powerpoint designer
alarm operator
airline
air conditioning tech
aerospace engineer
acupuncturist
a pharmacy dispenser
a goat
a cowboy
911 dispatcher
911 call receiver
8th grade teacher
411 operator
3rd party retailer
3d modeler
3d artvisualiser
3d artist/animator
3d artist myself
2nd grade
2nd assoc
2d artist
2d animator
1st ad i
1:1 tutor
1:1 aide
</textarea>
<p>This is not perfect, but it is pretty good value considering how little work was required.</p>
<h2 id="national-stereotypes">National stereotypes</h2>
<p>Here are the results for different nationalities.
<em>Hopefully useless disclaimer: This is measuring stereotypes, not actual fact.</em></p>
<p><img src="/images/commoncrawl/french.jpg" alt="French people" /></p>
<center><b>French people are...</b></center>
<p><br /></p>
<p><img src="/images/commoncrawl/japanese_people_are_jj.png" alt="Japanese people" /></p>
<center><b>Japanese people are...</b></center>
<p><br /></p>
<p><img src="/images/commoncrawl/russians_are_jj.png" alt="Russian people" /></p>
<center><b>Russian people are...</b></center>
<p><br /></p>
<p><img src="/images/commoncrawl/americans_are_jj.png" alt="American people" /></p>
<center><b>Americans are...</b></center>
<p><br /></p>
<p><img src="/images/commoncrawl/italians_are_jj.png" alt="Italian people" /></p>
<center><b>Italians are...</b></center>
<p><br /></p>
<h2 id="peoples-favorite">People’s favorite?</h2>
<p>One can also try to extract noun phrases instead of a single word. For instance,
here are people’s favorite things. The first tag cloud was generated using the pattern: <code class="language-plaintext highlighter-rouge">my favorite city is <noun phrase></code>.</p>
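<p>To make the idea of a pattern concrete, here is a minimal sketch in Rust. It is not the code used to produce these tag clouds: the function name <code class="language-plaintext highlighter-rouge">extract_after</code> is made up for illustration, and cutting the match at the first punctuation mark is only a crude stand-in for a real noun-phrase chunker.</p>

```rust
use std::collections::HashMap;

/// Returns the text following each occurrence of `prefix`, cut at the first
/// punctuation mark or newline. Hypothetical helper: a real pipeline would use
/// a part-of-speech tagger to delimit the noun phrase.
fn extract_after<'a>(text: &'a str, prefix: &str) -> Vec<&'a str> {
    let mut matches = Vec::new();
    if prefix.is_empty() {
        return matches;
    }
    let mut rest = text;
    while let Some(pos) = rest.find(prefix) {
        let tail = &rest[pos + prefix.len()..];
        let end = tail
            .find(|c: char| matches!(c, '.' | ',' | '!' | '?' | ';' | '\n'))
            .unwrap_or(tail.len());
        let candidate = tail[..end].trim();
        if !candidate.is_empty() {
            matches.push(candidate);
        }
        // Resume the search after the text we just consumed.
        rest = &tail[end..];
    }
    matches
}

fn main() {
    let text = "my favorite city is paris. well, my favorite city is new york, \
                obviously. my favorite city is paris!";
    // Count how often each candidate shows up; a tag cloud is a drawing of these counts.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for city in extract_after(text, "my favorite city is ") {
        *counts.entry(city).or_insert(0) += 1;
    }
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort();
    println!("{:?}", sorted);
}
```

<p>On a real crawl the interesting work is in filtering noise out of the candidates, which this sketch does not attempt.</p>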
<p><img src="/images/commoncrawl/my_favorite_city_is_np.png" alt="Favorite city" /></p>
<center><b>Favorite city</b></center>
<p><br /></p>
<p><img src="/images/commoncrawl/my_favorite_band_is_np.png" alt="Favorite band" /></p>
<center><b>Favorite band</b></center>
<p><br /></p>
<h2 id="whats-google-whats-trump">What’s Google? What’s Trump?</h2>
<p>It’s also fun to search for the noun phrases associated
with something.</p>
<p>For instance, here is what is said about Google.</p>
<p><img src="/images/commoncrawl/google_is_np.png" alt="Google is ..." /></p>
<center><b>Google is ...</b></center>
<p><br /></p>
<p>… or what is said about Donald Trump.</p>
<p><img src="/images/commoncrawl/trump_is_np.png" alt="Donald Trump is ..." /></p>
<center><b>Trump is...</b></center>
<p><br /></p>
<h1 id="a-couple-of-personal-news">Some personal news</h1>
<p>You may have noticed I haven’t blogged for a while. In the last few months, I
came across a lot of very interesting things I wanted to blog about, but I preferred
to spend my spare time on the development of <a href="https://github.com/tantivy-search/tantivy">tantivy</a>.
0.5.0 was a pretty big milestone. It includes a lot of query-time performance
improvements, faceting, range queries, and more… Development has been going
full steam recently and tantivy is rapidly getting close to becoming a decent
alternative to Lucene.</p>
<p>I am unfortunately pretty sure I won’t be able to keep
up the nice pace.</p>
<p><img src="/images/commoncrawl/baby.jpg" alt="First daughter" /></p>
<p>First, my daughter was just born! I don’t expect to have
much time to work on tantivy or blog for quite a while.</p>
<p>Second, I will join Google Tokyo in April. I expect this new position
to nurture my imposter syndrome. Besides, starting a new job usually brings
its share of overhead while getting used to the new position and development
environment. The next year will be very busy for me!</p>
<p>By the way, I will travel to Mountain View in May, for Google Orientation.
If you know about some interesting events in San Francisco or Mountain View
during that period, please let me know!</p>
Of tantivy's indexing2017-07-16T00:00:00+00:00https://fulmicoton.com/posts/behold-tantivy-part2<p><em>This post is the second post of a series describing the
inner workings of a <a href="https://github.com/tantivy-search/tantivy/">rust search engine library called tantivy</a>.</em></p>
<h1 id="foreword">Foreword</h1>
<p>In my <a href="/posts/behold-tantivy/">last blog post</a>, I talked about the data-structures that are used in a tantivy index, but I did not explain how indexes are actually built.
In other words, how do you get from a file containing documents <em>(possibly too big to fit in RAM)</em> to the index described in my first post <em>(possibly too big to fit in RAM too)</em>.</p>
<p>You may have noticed that index data representation is very compact. If you do not index positions nor store documents, an inverted index is in fact typically much smaller in size than the original data itself.</p>
<p>This compactness is great because this makes it possible to put a large portion, if not all, of the index in RAM. Unfortunately, compact structures like this are typically not very easy to modify:</p>
<p>Imagine you have indexed 10 million documents, and you want to add a new document.
This new document will bear document id <code class="language-plaintext highlighter-rouge">10,000,000</code>. To keep things simple, let’s say this document contains a single token, the word “rabbit”.</p>
<p>In order to add this new document, we need to add the document id <code class="language-plaintext highlighter-rouge">10,000,000</code> to the posting list associated with the word “rabbit”. No matter how optimized our algorithm is, this will require moving around all of the posting lists coming after the one associated with the word “rabbit”.</p>
<p>Another problem is that <em>tantivy</em> is supposed to be designed to handle building an index that does not fit in RAM.</p>
<p>This blog post will explain in detail how this is done in tantivy…</p>
<h1 id="a-lot-of-small-indexes-called-segments">A lot of small indexes called segments</h1>
<p>Well, our problems are quite similar to
<a href="http://neopythonic.blogspot.jp/2008/10/sorting-million-32-bit-integers-in-2mb.html">sorting a list of integers that does not fit in RAM</a>.
In this all-time-favorite interview question, a common solution is to split the input file in more chewable chunks, sort the different parts independently,
and merge the resulting parts. Tantivy’s indexing works in a similar fashion.</p>
<p>While my <a href="/posts/behold-tantivy/">last post</a> described a large single index,
a <a href="http://github.com/tantivy-search/tantivy">tantivy</a> index actually
consists of several smaller pieces called <strong>segments</strong>.</p>
<p>If you went through <a href="https://github.com/tantivy-search/tantivy-cli">tantivy’s tutorial</a>, you may have noticed that after
indexing Wikipedia, your index directory contains a bunch of files,
and all of their filenames follow the pattern</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SOME_UUID . SOME_EXTENSION
</code></pre></div></div>
<p>The UUID part identifies the segment the files belong to, while the extension
identifies which datastructure is stored in the file (as described in the first
post). Really, each segment contains all of the information to be a complete
index, including its own entire term dictionary.</p>
<h1 id="segments-commits-and-multithreading">Segments, commits, and multithreading</h1>
<p class="disclaimer">
While tantivy supports delete operations since version 0.3, I will not address deletes in this blog post, as they add a lot of complexity to the index.
</p>
<p>Let’s assume we want to create a brand new index.</p>
<p>After defining our schema, and creating our brand new empty index,
we need to add our documents. Tantivy is designed to ingest documents
in large batches.</p>
<p>API-wise, this is done by getting an <code class="language-plaintext highlighter-rouge">IndexWriter</code>,
calling <code class="language-plaintext highlighter-rouge">index_writer.add_document(doc)</code> once for each document
of your batch, and finally calling <code class="language-plaintext highlighter-rouge">index_writer.commit()</code> to finalize the batch.</p>
<p>Before calling <code class="language-plaintext highlighter-rouge">.commit()</code>, none of your documents are visible for search.
In fact, before calling <code class="language-plaintext highlighter-rouge">.commit()</code>, none of your documents are persisted either.</p>
<p>If a power surge happens while you are indexing some documents, or even during <code class="language-plaintext highlighter-rouge">commit</code>, your index will not be corrupted. Tantivy will restart in the state of your last successful commit.</p>
<p>Under the hood, when you call <code class="language-plaintext highlighter-rouge">.add_document(...)</code>, your document is in fact just added to an indexing queue. As long as the queue is not saturated, the call should not block and return right away. Applications using <code class="language-plaintext highlighter-rouge">tantivy</code> are in charge of managing a journal if they want to ensure persistence for each insert.</p>
<p>The index writer internally manages several indexing threads that consume this queue. Each thread works on building its own little segment.</p>
<p><img src="/images/tantivy/multithreading.png" /></p>
<p>Eventually, one of the threads will pick up your newly added document and add it to its segment. You have no control over which segment your document will be routed to.</p>
<p>These indexing threads work in RAM and use very different data-structures from those described in <a href="/posts/behold-tantivy/">part 1</a>, as they need to be writable.
They are presented in detail in the <a href="#stacker">stacker section</a>.</p>
<p>Every thread has a user-defined memory budget. Once this memory budget is about to be exceeded, the indexing thread automatically finalizes the segment:
it stops processing documents and proceeds to serialize this in-RAM representation to the compact representation I described in my <a href="/posts/behold-tantivy/">previous post</a>.</p>
<p>The resulting segment has reached its final form, and none of its files will ever be modified. This strategy is often called write-once-read-many, or WORM.</p>
<p>At this point, your new documents are still not searchable.
Our fresh segment is internally called an <code class="language-plaintext highlighter-rouge">uncommitted segment</code>.
An uncommitted segment will not be used in search queries until the user calls <code class="language-plaintext highlighter-rouge">.commit()</code>.</p>
<p>When you commit, your call blocks and all the documents that were added to the queue before the commit get processed by the indexing threads. All of the indexing threads get a signal that they need to finalize the segment they were building, regardless of its size.</p>
<p>Finally, all <code class="language-plaintext highlighter-rouge">uncommitted segments</code> become <code class="language-plaintext highlighter-rouge">committed segments</code> and your documents are now searchable. Your commit call finally returns.</p>
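<p>The lifecycle described above can be summarized with a toy model. This is only a sketch: <code class="language-plaintext highlighter-rouge">ToyWriter</code> and all of its fields are hypothetical names, it is single-threaded, and a maximum number of documents per segment stands in for the memory budget. It is not tantivy’s <code class="language-plaintext highlighter-rouge">IndexWriter</code>.</p>

```rust
/// Toy, single-threaded model of the segment lifecycle (hypothetical names).
struct ToyWriter {
    /// Stands in for the per-thread memory budget.
    docs_per_segment: usize,
    /// Segment currently being built in RAM.
    in_progress: Vec<String>,
    /// Finalized segments that are not searchable yet.
    uncommitted: Vec<Vec<String>>,
    /// Finalized segments that are visible to search.
    committed: Vec<Vec<String>>,
}

impl ToyWriter {
    fn new(docs_per_segment: usize) -> Self {
        ToyWriter {
            docs_per_segment,
            in_progress: Vec::new(),
            uncommitted: Vec::new(),
            committed: Vec::new(),
        }
    }

    /// Adds a document; finalizes the segment once the "budget" is reached.
    fn add_document(&mut self, doc: &str) {
        self.in_progress.push(doc.to_string());
        if self.in_progress.len() >= self.docs_per_segment {
            self.finalize_segment();
        }
    }

    /// Turns the in-progress segment into an immutable, uncommitted one.
    fn finalize_segment(&mut self) {
        if !self.in_progress.is_empty() {
            let segment = std::mem::take(&mut self.in_progress);
            self.uncommitted.push(segment);
        }
    }

    /// Finalizes whatever is in progress and publishes every uncommitted segment.
    fn commit(&mut self) {
        self.finalize_segment();
        self.committed.append(&mut self.uncommitted);
    }

    fn num_searchable_docs(&self) -> usize {
        self.committed.iter().map(Vec::len).sum()
    }
}
```

<p>With a budget of two documents per segment, adding three documents leaves one uncommitted segment and one document still in progress, and nothing is searchable until <code class="language-plaintext highlighter-rouge">commit()</code> is called.</p>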
<h1 id="search-performance-the-need-for-some-merging">Search performance, the need for some merging</h1>
<p>Having many small segments instead of a few larger segments has a negative impact on search IO time, search CPU time, and index size.</p>
<p>Assuming we are building an index with 10M documents, and our individual thread heap memory limit was producing segments of around 100K documents, indexing all of the documents would produce 100 segments. This is definitely too many.</p>
<p>For this reason, tantivy’s <code class="language-plaintext highlighter-rouge">index_writer</code> continuously considers opportunities to merge segments together. The strategy used by the <code class="language-plaintext highlighter-rouge">index_writer</code> is defined by a <code class="language-plaintext highlighter-rouge">MergePolicy</code>.</p>
<p>You can read about <a href="http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html">merge policies in Lucene in this blog post</a>.</p>
<p>The default merge policy in tantivy is called <a href="https://github.com/tantivy-search/tantivy/blob/master/src/indexer/log_merge_policy.rs">LogMergePolicy</a> and was contributed by <em>currymj</em>.</p>
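<p>To give an idea of what a logarithmic merge policy does, here is a minimal sketch. It is not the code of <code class="language-plaintext highlighter-rouge">LogMergePolicy</code>: the function name, the parameter, and the exact grouping rule (levels based on the base-2 logarithm of the document count) are assumptions made for illustration.</p>

```rust
use std::collections::BTreeMap;

/// Groups segments into levels by floor(log2(number of documents)) and
/// proposes one merge per level that holds at least `min_segments_per_merge`
/// segments. Each proposed merge is a list of indices into `segment_num_docs`.
/// This is a simplified, hypothetical policy.
fn log_merge_candidates(
    segment_num_docs: &[u32],
    min_segments_per_merge: usize,
) -> Vec<Vec<usize>> {
    let mut levels: BTreeMap<u32, Vec<usize>> = BTreeMap::new();
    for (segment_idx, &num_docs) in segment_num_docs.iter().enumerate() {
        // floor(log2(num_docs)); empty segments are treated like 1-doc segments.
        let level = 31 - num_docs.max(1).leading_zeros();
        levels.entry(level).or_default().push(segment_idx);
    }
    levels
        .into_values()
        .filter(|segments| segments.len() >= min_segments_per_merge)
        .collect()
}
```

<p>The point of grouping by order of magnitude is that segments of similar size get merged together, so a large segment is not rewritten every time a tiny one shows up.</p>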
<p class="disclaimer">
It may be tempting to always try to have one single segment.
In practice, if most of your index fits in RAM, as you merge
segments, the benefit of having fewer segments will
become less and less apparent.
Having half a dozen segments instead of one big
segment makes very little difference in practice.
</p>
<h1 id="indexing-latency-vs-throughput-search-performance">Indexing Latency vs Throughput, search performance</h1>
<p>As we explained, <strong>adding a document does not make it searchable right away.</strong>
You might be tempted to call <code class="language-plaintext highlighter-rouge">.commit()</code> very often in order to lower the time it takes for a document you added to become visible for search, aka the <strong>indexing latency</strong> of your search engine.</p>
<p>Please do this carefully, as there are downsides to calling <code class="language-plaintext highlighter-rouge">.commit()</code> frequently.</p>
<p>First, it will considerably hurt the indexing throughput.
Second, by committing frequently, you will produce a lot of very small segments.</p>
<p>In other words, committing too often will hurt your indexing throughput considerably, as well as your search performance if the merge policy does not keep the number of segments low, and finally, it will raise the CPU time spent merging segments.</p>
<p>Yet again we face the everlasting war of latency versus throughput.</p>
<h1 id="stacker-datastructure"><a name="stacker"></a>Stacker datastructure</h1>
<p>Now let’s talk a little bit about how the segments are built in RAM.
I will not talk about how fast fields or stored fields are written, as their implementation is quite straightforward. Let’s focus on the inverted index instead.</p>
<p>When serializing the segment on disk, we will need to iterate over the sorted terms,
and for each of these terms, we need to iterate over the <code class="language-plaintext highlighter-rouge">sorted docids</code> that contain this term.</p>
<p>The first prototype of <code class="language-plaintext highlighter-rouge">tantivy</code> was simply using a <code class="language-plaintext highlighter-rouge">BTreeMap<String, Vec<u32>></code> to do this job.
The code would go through the documents one by one, and:</p>
<ul>
<li>increment the doc id</li>
<li>tokenize the document</li>
<li>for each token, look up the posting list (<code class="language-plaintext highlighter-rouge">Vec<u32></code>) associated with the term and append the <code class="language-plaintext highlighter-rouge">DocId</code> to it.</li>
</ul>
<p>The code probably looked something like this.</p>
<figure class="highlight"><pre><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="n">tokenize</span><span class="o"><</span><span class="nv">'a</span><span class="o">></span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="o">&</span><span class="nv">'a</span> <span class="nb">str</span><span class="p">)</span> <span class="k">-></span> <span class="k">impl</span> <span class="n">Iterator</span><span class="o"><</span><span class="n">Item</span><span class="o">=&</span><span class="nv">'a</span> <span class="nb">str</span><span class="o">></span> <span class="p">{</span>
<span class="c">// ...</span>
<span class="p">}</span>
<span class="k">struct</span> <span class="n">SegmentWriter</span> <span class="p">{</span>
<span class="n">num_docs</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
<span class="n">inverted_index</span><span class="p">:</span> <span class="n">BTreeMap</span><span class="o"><</span><span class="nb">String</span><span class="p">,</span> <span class="nb">Vec</span><span class="o"><</span><span class="nb">u32</span><span class="o">>></span><span class="p">,</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">SegmentWriter</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">add_document</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">document</span><span class="p">:</span> <span class="o">&</span><span class="nb">str</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// `DocId` are allocated by auto-incrementing</span>
<span class="k">let</span> <span class="n">doc_id</span> <span class="o">=</span> <span class="k">self</span><span class="py">.num_docs</span><span class="p">;</span>
<span class="k">for</span> <span class="n">token</span> <span class="n">in</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">document</span><span class="p">)</span> <span class="p">{</span>
<span class="k">self</span>
<span class="py">.inverted_index</span><span class="nf">.entry</span><span class="p">(</span><span class="n">token</span><span class="nf">.to_string</span><span class="p">())</span>
<span class="nf">.or_insert</span><span class="p">(</span><span class="nd">vec!</span><span class="p">())</span>
<span class="nf">.push</span><span class="p">(</span><span class="n">doc_id</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">self</span><span class="py">.num_docs</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>I call this task <em>stacking</em>, as it feels like we are trying to
push <code class="language-plaintext highlighter-rouge">DocId</code>s to stacks associated with each term.</p>
<p><img src="/images/tantivy/indexing.png" /></p>
<p>For simplification, I omitted term frequencies and term positions.
Depending on the indexing options, we may also need to keep the term frequencies and the term positions.</p>
<p>If we index term frequencies, then the <code class="language-plaintext highlighter-rouge">Vec<u32></code> above will contain a lasagna of <code class="language-plaintext highlighter-rouge">doc_id_1</code>, <code class="language-plaintext highlighter-rouge">term_freq_1</code>, <code class="language-plaintext highlighter-rouge">doc_id_2</code>, <code class="language-plaintext highlighter-rouge">term_freq_2</code>, etc.
If we index positions as well, then the <code class="language-plaintext highlighter-rouge">Vec<u32></code> will also contain the term position as follows,
<code class="language-plaintext highlighter-rouge">doc_id</code>, <code class="language-plaintext highlighter-rouge">term_freq</code>, <code class="language-plaintext highlighter-rouge">position1</code>, …, <code class="language-plaintext highlighter-rouge">position_termfreq</code> .</p>
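<p>As a rough illustration of that interleaved layout, here is a sketch of how one entry could be appended to the buffer and read back. The function names are hypothetical, and the real implementation may well differ.</p>

```rust
/// Appends one document's entry for a term to its posting buffer:
/// doc_id, term_freq, then every position of the term in the document.
fn push_posting(postings: &mut Vec<u32>, doc_id: u32, positions: &[u32]) {
    postings.push(doc_id);
    postings.push(positions.len() as u32);
    postings.extend_from_slice(positions);
}

/// Reads the buffer back as (doc_id, positions) pairs.
/// The term frequency tells us how many positions follow.
fn read_postings(postings: &[u32]) -> Vec<(u32, Vec<u32>)> {
    let mut entries = Vec::new();
    let mut cursor = 0;
    while cursor < postings.len() {
        let doc_id = postings[cursor];
        let term_freq = postings[cursor + 1] as usize;
        let start = cursor + 2;
        entries.push((doc_id, postings[start..start + term_freq].to_vec()));
        cursor = start + term_freq;
    }
    entries
}
```

<p>Because the term frequency is stored before the positions, the reader always knows where the next document entry starts, without needing any separator.</p>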
<h2 id="no-more-btreemap">No more BTreeMap</h2>
<p>The current version of tantivy is slightly more complicated than a <code class="language-plaintext highlighter-rouge">BTreeMap</code>.</p>
<p>First, since we only need the terms to be sorted when the segment is flushed to disk, it is better to use a <code class="language-plaintext highlighter-rouge">HashMap</code> and just sort the terms at the very end.</p>
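<p>With the standard library types, the idea could be sketched as follows (the function name is made up for this example): terms accumulate in an unordered <code class="language-plaintext highlighter-rouge">HashMap</code> during indexing, and the sort is paid for only once, when the segment is flushed.</p>

```rust
use std::collections::HashMap;

/// Consumes the unordered in-RAM index and returns its entries sorted by
/// term, which is the order needed when serializing the segment.
fn sorted_terms(inverted_index: HashMap<String, Vec<u32>>) -> Vec<(String, Vec<u32>)> {
    let mut entries: Vec<(String, Vec<u32>)> = inverted_index.into_iter().collect();
    // Keys are unique, so an unstable sort is enough.
    entries.sort_unstable_by(|left, right| left.0.cmp(&right.0));
    entries
}
```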
<h1 id="an-ad-hoc-hashmap">An ad-hoc HashMap</h1>
<p>So we will focus on a hash map implementation that fills the following contract:</p>
<ul>
<li>We should reduce the time spent in memory allocation and copies as much as possible.</li>
<li>The hash should be computed only once per token.</li>
<li>As long as the hashes differ, we should jump at most three times in memory to stack our <code class="language-plaintext highlighter-rouge">DocId</code>.</li>
</ul>
<p>The standard library <code class="language-plaintext highlighter-rouge">HashMap</code> does not make it possible to fill that contract, so I had to implement a rudimentary HashMap.</p>
<h1 id="using-a-memory-arena">Using a memory arena</h1>
<p>Indexing will require a lot of allocations. It might be interesting to make those as fast as possible by using an ad-hoc memory arena with a bump allocator.</p>
<p>Also, a memory arena makes it trivial to enforce the user-defined memory budget we discussed earlier.</p>
<p>This memory budget is split between the threads. Then, for each thread, the memory budget is split between the hash table and the size of the memory arena (roughly with the ratio 1/3, 2/3). After inserting each document, we simply check if the hash table is reaching saturation or if the memory arena is getting close to its limit. If it is, we finalize the segment being written.</p>
<p>This memory arena does not offer any API to deallocate objects. We just wipe it clean entirely <a href="https://en.wikipedia.org/wiki/Magna_Doodle"><em>Magna doodle style</em></a> after finalizing the segment and before starting a new segment.</p>
<p><img src="/images/tantivy/magnadoodle.jpg" alt="magna doodle" /></p>
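<p>A bump allocator is simple enough to sketch in a few lines. The version below is an illustration under assumptions: the method names are mine, the arena is a plain byte buffer whose capacity is assumed to fit in a <code class="language-plaintext highlighter-rouge">u32</code>, and addresses are offsets into that buffer.</p>

```rust
/// Minimal bump allocator over a fixed-size byte buffer (illustrative sketch).
/// Addresses are u32 offsets, so the capacity is assumed to fit in a u32.
struct MemoryArena {
    buffer: Vec<u8>,
    next_free: usize,
}

impl MemoryArena {
    fn with_capacity(num_bytes: usize) -> Self {
        MemoryArena { buffer: vec![0u8; num_bytes], next_free: 0 }
    }

    /// Reserves `num_bytes` and returns their address,
    /// or `None` if the memory budget would be exceeded.
    fn allocate(&mut self, num_bytes: usize) -> Option<u32> {
        let end = self.next_free.checked_add(num_bytes)?;
        if end > self.buffer.len() {
            return None;
        }
        let addr = self.next_free as u32;
        self.next_free = end;
        Some(addr)
    }

    /// Gives mutable access to a previously allocated range.
    fn slice_mut(&mut self, addr: u32, num_bytes: usize) -> &mut [u8] {
        let start = addr as usize;
        &mut self.buffer[start..start + num_bytes]
    }

    /// Number of bytes left before the budget is exhausted.
    fn remaining(&self) -> usize {
        self.buffer.len() - self.next_free
    }

    /// Forgets every allocation at once. There is no per-object deallocation.
    fn clear(&mut self) {
        self.next_free = 0;
    }
}
```

<p>Allocation is just a bounds check and an addition, and checking the budget after each document boils down to looking at <code class="language-plaintext highlighter-rouge">remaining()</code>.</p>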
<h1 id="location-location-location">Location, location, location</h1>
<p>Stacking (that is, building these posting lists) requires jumping in memory quite a lot. By jumping, I mean accessing a random memory address, which is likely to trigger some kind of cache miss.</p>
<p>The array used to store the buckets of the <code class="language-plaintext highlighter-rouge">HashMap</code> cannot reasonably include our keys as they have a variable length. Instead, each bucket contains the pair:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(hash: u32, addr: u32)
</code></pre></div></div>
<p>An empty bucket is simply expressed using the special value <code class="language-plaintext highlighter-rouge">addr==u32::max_value()</code>.</p>
<p>When the bucket is not empty, <code class="language-plaintext highlighter-rouge">addr</code> is the address in the memory arena at which both the key and the value are stored,
one after the other, as follows:</p>
<ul>
<li>the length of the key (2 bytes)</li>
<li>the key (variable length)</li>
<li>an object that represents the posting list object (24 bytes).</li>
</ul>
<p>Keeping the key and the value contiguous in the memory arena not only saves us from having two addresses, it also gives us better memory locality.</p>
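<p>Here is a sketch of that entry layout. It uses a growable byte vector in place of the real memory arena, and it assumes a little-endian encoding for the key length; both are simplifications, and the function names are hypothetical.</p>

```rust
/// Size of the posting list object stored right after the key.
const POSTING_LIST_OBJECT_LEN: usize = 24;

/// Appends an entry to `arena`: key length (2 bytes, little-endian here),
/// the key itself, then 24 zeroed bytes reserved for the posting list object.
/// Returns the address of the entry.
fn write_entry(arena: &mut Vec<u8>, key: &[u8]) -> u32 {
    assert!(key.len() <= u16::MAX as usize, "key does not fit in 2 bytes of length");
    let addr = arena.len() as u32;
    arena.extend_from_slice(&(key.len() as u16).to_le_bytes());
    arena.extend_from_slice(key);
    let new_len = arena.len() + POSTING_LIST_OBJECT_LEN;
    arena.resize(new_len, 0u8);
    addr
}

/// Returns the key stored at `addr`, and the address of the posting list
/// object that immediately follows it.
fn read_entry(arena: &[u8], addr: u32) -> (&[u8], u32) {
    let start = addr as usize;
    let key_len = u16::from_le_bytes([arena[start], arena[start + 1]]) as usize;
    let key_start = start + 2;
    let key = &arena[key_start..key_start + key_len];
    (key, (key_start + key_len) as u32)
}
```

<p>A single address stored in the bucket is enough to reach both the key, for the equality check, and the value, for stacking the new <code class="language-plaintext highlighter-rouge">DocId</code>.</p>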
<p>You might be surprised that the posting list object has a fixed size. Its size is constant in the same sense that our original <code class="language-plaintext highlighter-rouge">Vec</code> object was <code class="language-plaintext highlighter-rouge">Sized</code>. It simply includes pointers to other areas in the memory arena. Let’s dive into the details.</p>
<h1 id="exponential-unrolled-linked-list">Exponential unrolled linked list</h1>
<p>We cannot reimplement <code class="language-plaintext highlighter-rouge">Vec</code> over our memory arena.</p>
<p>When it reaches capacity a <code class="language-plaintext highlighter-rouge">Vec</code> allocates twice its capacity, copies its previous data, and finally deallocates its previous data. Unfortunately, our <code class="language-plaintext highlighter-rouge">MemoryArena</code> does not allow for deallocation. Also, we do not really care about having fast random access. We only read our values when our segment is serialized to disk, so we are satisfied with a decent sequential access.</p>
<p><em>Unrolled linked list</em> is a common data-structure to address this problem. If like me you are not too familiar with data-structure terminology, an unrolled linked list is simply a linked list of blocks of <code class="language-plaintext highlighter-rouge">B</code> values.</p>
<p>Assuming a block size of <code class="language-plaintext highlighter-rouge">B</code>, iterating over an unrolled linked list of <code class="language-plaintext highlighter-rouge">N</code> elements now requires <code class="language-plaintext highlighter-rouge">N</code> / <code class="language-plaintext highlighter-rouge">B</code> jumps in memory.</p>
<p>Of course, the last block may not be full, but we will waste at most <code class="language-plaintext highlighter-rouge">4 * (B - 1)</code> bytes of memory per term (the 4 is there because we are storing <code class="language-plaintext highlighter-rouge">u32</code>).</p>
<p>Choosing a good value of <code class="language-plaintext highlighter-rouge">B</code> is a bit tricky. Ideally, we would like a large <code class="language-plaintext highlighter-rouge">B</code> for terms that are extremely frequent, and we would like a small <code class="language-plaintext highlighter-rouge">B</code> for a dataset where there are many terms associated with few documents.</p>
<p>Instead of choosing a specific value, <code class="language-plaintext highlighter-rouge">tantivy</code> uses the same trick as <code class="language-plaintext highlighter-rouge">Vec</code> here, and allocates blocks that are exponentially bigger and bigger.
Each new block is twice as big as the previous block.</p>
<p>That way, we know that we are wasting at most half of the memory.
We also require only about <code class="language-plaintext highlighter-rouge">log2(N)</code> jumps in memory to go through a long posting list of <code class="language-plaintext highlighter-rouge">N</code> elements.</p>
<p>In addition, in order to further optimize for terms that belong to a single document, the first 3 elements are inlined with the value, so that our posting list object looks like this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct ExpUnrolledLinkedList {
len: u32,
end: u32,
// -- inlined first block
val0: u32,
val1: u32,
val2: u32,
// -- pointer to the next block
next: u32,
}
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">end</code> contains the address of the tail of our list. It is useful when adding a new element to the list.</p>
<p><code class="language-plaintext highlighter-rouge">len</code> is also required in order to detect when a new block should be created.</p>
<p><code class="language-plaintext highlighter-rouge">next</code> is a pointer to the second block. It is useful when iterating through the list, as we serialize our segment.</p>
<p>A block of size N simply consists of 4*N bytes to encode the N u32-values, followed by 4 bytes to store the address of its successor.</p>
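<p>Putting these pieces together, a simplified version could look like the sketch below. Several things here are assumptions made to keep it short: the arena is a vector of <code class="language-plaintext highlighter-rouge">u32</code> words rather than bytes, the three inlined values live in an array instead of three separate fields, and the first external block holds 4 values, with each following block twice as large.</p>

```rust
const INLINED: u32 = 3;
const NIL: u32 = u32::MAX;

/// Simplified sketch: the arena is addressed in u32 words, not bytes.
struct ExpUnrolledLinkedList {
    len: u32,
    /// Address of the next free slot in the tail block.
    end: u32,
    /// Inlined first block.
    vals: [u32; 3],
    /// Address of the second block (the first one living in the arena).
    next: u32,
}

impl ExpUnrolledLinkedList {
    fn new() -> Self {
        ExpUnrolledLinkedList { len: 0, end: NIL, vals: [0; 3], next: NIL }
    }

    fn push(&mut self, value: u32, arena: &mut Vec<u32>) {
        if self.len < INLINED {
            self.vals[self.len as usize] = value;
        } else {
            // Blocks hold 4, 8, 16, ... values, so the tail block is full
            // exactly when len is 3, 7, 15, ..., i.e. len + 1 is a power of two.
            if (self.len + 1).is_power_of_two() {
                let capacity = (self.len + 1) as usize;
                let new_block = arena.len() as u32;
                // `capacity` value slots, plus one slot for the successor address.
                let new_len = arena.len() + capacity + 1;
                arena.resize(new_len, NIL);
                if self.len == INLINED {
                    self.next = new_block;
                } else {
                    // `end` currently points at the full block's successor slot.
                    arena[self.end as usize] = new_block;
                }
                self.end = new_block;
            }
            arena[self.end as usize] = value;
            self.end += 1;
        }
        self.len += 1;
    }

    /// Reads all values back in insertion order, following the block chain.
    fn to_vec(&self, arena: &[u32]) -> Vec<u32> {
        let inlined = self.len.min(INLINED) as usize;
        let mut values = self.vals[..inlined].to_vec();
        let mut remaining = self.len as usize - inlined;
        let mut block = self.next as usize;
        let mut capacity = 4usize;
        while remaining > 0 {
            let in_block = remaining.min(capacity);
            values.extend_from_slice(&arena[block..block + in_block]);
            remaining -= in_block;
            if remaining > 0 {
                block = arena[block + capacity] as usize;
                capacity *= 2;
            }
        }
        values
    }
}
```

<p>Several lists can share the same arena: their blocks end up interleaved, and the successor addresses are what stitches each list back together at serialization time.</p>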
<h1 id="which-hash-function">Which hash function?</h1>
<p>At this point, profiling showed that the major part of the time is spent hashing our terms.</p>
<p>I tested a bunch of hash functions. Previous versions of tantivy were using <code class="language-plaintext highlighter-rouge">djb2</code> which has the benefit of being fast and simple.</p>
<p>It performed really well on the Wikipedia dataset, but not as well on the <code class="language-plaintext highlighter-rouge">Movielens</code> dataset. <code class="language-plaintext highlighter-rouge">Movielens</code> is a dataset of movie reviews and it includes a lot of <em>close to unique</em> terms, like <code class="language-plaintext highlighter-rouge">userIds</code>.</p>
<p>More precisely, I noticed that indexing a segment was relatively fast at the beginning of the segment. But as the hash table was getting more saturated, indexing would get slower and slower.</p>
<p>I naturally suspected that we were suffering from collisions.</p>
<p>There are really two kinds of collisions:</p>
<ul>
<li>
<p>two keys are mapped to the same bucket, in which case testing the equality of the hash key in the hash table should help to identify that we need to find another bucket using a probing method that has an ok locality (<code class="language-plaintext highlighter-rouge">tantivy</code> uses quadratic probing).
This happens very frequently. The frequency is precisely equal to the saturation of our hash table.</p>
</li>
<li>
<p>two keys are different but have the same hash, in which case <em>tantivy</em> has to check for string equality. This requires painfully jumping in memory and comparing the two strings.</p>
</li>
</ul>
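<p>Both kinds of collisions show up in the probing loop. The sketch below is not tantivy’s hash map: the names are hypothetical, the table size is assumed to be a power of two with at least one empty bucket, and the probing sequence uses triangular steps (+1, +2, +3, …), which is one common way of doing quadratic probing. The <code class="language-plaintext highlighter-rouge">key_at</code> closure stands in for reading a key from the memory arena.</p>

```rust
/// Special `addr` value marking an empty bucket.
const EMPTY: u32 = u32::MAX;

#[derive(Clone, Copy)]
struct Bucket {
    hash: u32,
    addr: u32,
}

/// Outcome of probing the table for a key.
#[derive(Debug, PartialEq)]
enum Probe {
    /// Index of an empty bucket where the new key can be stored.
    Vacant(usize),
    /// Index of the bucket that already holds this key.
    Occupied(usize),
}

/// Probes `table` for `key`. The table length must be a power of two and
/// the table must contain at least one empty bucket.
fn probe<'a>(
    table: &[Bucket],
    hash: u32,
    key: &str,
    key_at: impl Fn(u32) -> &'a str,
) -> Probe {
    debug_assert!(table.len().is_power_of_two());
    let mask = table.len() - 1;
    let mut idx = hash as usize & mask;
    let mut step = 0;
    loop {
        let bucket = table[idx];
        if bucket.addr == EMPTY {
            return Probe::Vacant(idx);
        }
        // First kind of collision: same bucket, different hash. Comparing the
        // stored hash is cheap and avoids jumping into the arena.
        // Second kind: same 32-bit hash, so the keys themselves are compared.
        if bucket.hash == hash && key_at(bucket.addr) == key {
            return Probe::Occupied(idx);
        }
        step += 1;
        idx = (idx + step) & mask;
    }
}
```

<p>Storing the full hash in the bucket is what makes the first kind of collision cheap: the key is only read from the arena when the 32-bit hashes are equal.</p>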
<p>Assuming a good 32-bit hash, the rate at which these collisions should happen is roughly <code class="language-plaintext highlighter-rouge">K / 2^32</code>, where K is the number of keys inserted so far (in fact slightly less than this, but this is a good approximation). So if we have 1 million terms in our segment, this should happen at a rate of roughly 1 out of 4000 new terms inserted.</p>
<p>Unfortunately, by construction, <code class="language-plaintext highlighter-rouge">djb2</code> tends to generate
way more <code class="language-plaintext highlighter-rouge">hash</code> collisions for short terms.</p>
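<p>To see why, here is the classic <code class="language-plaintext highlighter-rouge">djb2</code> recurrence (start from 5381, then multiply by 33 and add each byte), written as a small sketch. I am assuming the 32-bit wrapping variant; the exact version tantivy used is not shown here. For a two-byte term, the result only depends on <code class="language-plaintext highlighter-rouge">33 * first + second</code>, so it is easy to find pairs of short terms that collide.</p>

```rust
/// djb2 over bytes, with 32-bit wrapping arithmetic:
/// hash = 5381, then hash = hash * 33 + byte for every byte.
fn djb2(term: &[u8]) -> u32 {
    term.iter().fold(5381u32, |hash, &byte| {
        hash.wrapping_mul(33).wrapping_add(u32::from(byte))
    })
}
```

<p>For instance, <code class="language-plaintext highlighter-rouge">"az"</code> and <code class="language-plaintext highlighter-rouge">"c8"</code> get the same hash, since 33 * 97 + 122 and 33 * 99 + 56 are both equal to 3323. Short alphanumeric identifiers are exactly the kind of terms where this bites.</p>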
<p>I tried different crates offering implementations of various hash functions and ended up settling for a short Rust reimplementation of murmurhash32. Problem solved!</p>
<h1 id="benchmark">Benchmark</h1>
<p>English Wikipedia contains 5 million documents (8GB).</p>
<p>In my benchmark, I did not store any of the fields, and the text and the title of the article are indexed with their positions. There is no stemming enabled, and we only lowercase our terms.
I also disabled segment merging, so we are really measuring
raw indexing speed.</p>
<p>The Wikipedia articles are read from a regular hard drive, but the index itself is written on a separate SSD disk.</p>
<p>My desktop has 4 cores with hyperthreading. I have no faith in hyperthreading so I only displayed the results for up to 4 indexing threads. If I increase the number of threads further, the indexing time decreases a bit more, down to 80 seconds.
Here is the result of this benchmark :</p>
<p><img src="/images/tantivy/benchmark.png" /></p>
<p>4 threads, 8 gigabytes, 94s is not too shabby, is it?
That’s around 300GB / hour on an outdated desktop.</p>
<p>In comparison, the first version of tantivy would take 40 minutes to index Wikipedia on my desktop, without merging any segments.</p>
<p>Giving honest figures with segment merging enabled is a bit tricky. Scheduling merges is a bit like scheduling pit stops in a formula 1 race. There is a lot of room to tweak it and get better figures.</p>
<p>That being said, count between 3 minutes and 4 minutes to get an index with between 2 and 8 segments and a memory budget of between 4GB and 8GB.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Tantivy 0.4.0 is already very fast at indexing.</p>
<p>I did try to compare its performance with Lucene, but simply decoding utf-8 and reading lines from my file took over 60 seconds in Java. That did not seem like a fair match: remember it took tantivy 94 seconds to read the file, decode JSON, and build a search index for the same amount of data. I was too lazy to work out a binary format to palliate Java’s suckiness and do a proper comparison with Lucene indexing performance.</p>
<p>While there is still room for improvement, the next version of tantivy will focus on adding a proper text processing pipeline (tokenization, stemming, removing stop words, etc.). <code class="language-plaintext highlighter-rouge">tantivy</code> is getting rapidly close to a decent search engine solution.</p>
<p>If you enjoyed this post, you may also want to have a look at this blog post from <code class="language-plaintext highlighter-rouge">JDemler</code> that explains how index building is done in another rust search engine project called <a href="https://github.com/JDemler/perlin">Perlin</a>.</p>
Of tantivy, a search engine in Rust2017-01-07T00:00:00+00:00https://fulmicoton.com/posts/behold-tantivy<h1 id="foreword-search-rust">Foreword. Search. Rust.</h1>
<p>I have been working more or less with search engines since 2010. Since then, I have entertained the idea of coding my own search engine. I never got around to starting this project, but I accumulated more and more information over the years about how to implement a search engine, mostly by learning from coworkers, going through Lucene’s code, and reading academic papers and blogs.</p>
<p>Last year, after hearing a lot of good things about <a href="https://www.rust-lang.org/">Rust</a>
from <a href="http://guilload.com/">an old friend</a> and then a coworker, I started studying the language. I was very skeptical in the beginning, but the <a href="https://doc.rust-lang.org/stable/book/">Rust book</a> sold me rapidly. As I went through its pages, I learnt how Rust was solving all of the pain points I experienced with C++ or Java with mindblowing elegance.</p>
<p>I then started going through all of the exercises on <a href="http://www.exercism.io/languages/rust/about">exercism.io</a>.
The exercises are well calibrated and introduce new concepts gradually. If you want to try and learn Rust, I warmly recommend them; they should only take you around a weekend. After finishing the exercises, I decided
it was time to go out for a test drive on a real-life project. I started working on a simple search engine…</p>
<p><img src="/images/tantivy/ppap.jpg" alt="Ppap" />
<em>Mildly relevant obscure Japanese reference (<a href="https://youtu.be/HFlgNoUsr4k">PPAP</a>)</em></p>
<p>Around two weeks later, to my own surprise, I was more productive in Rust than I was in C++ in which I have 5 years of experience. Don’t get me wrong. I am not saying Rust is a simple language. I was not an expert in Rust at that time, nor am I an expert in Rust today… But Rust is just a much more productive language. Also, while my code was sometimes clumsy, I felt a degree of confidence that my code was not buggy, that I had never experienced in any other language (Well, my experience of OCaml is so tiny it does not count).</p>
<p>The first version was a bit silly but only took a couple of months of my spare time to implement. Next step was to actually refactor, clean up the rookie mistakes, add documentation… <a href="https://github.com/tantivy-search/tantivy">tantivy</a> was born.</p>
<p><img src="/tantivy-logo/tantivy-logo.png" alt="Tantivy's logo" />
<em>The logo is so neat, you can feel it’s webscale.</em></p>
<p>But this blog post is not about my experience with rust, but about how <a href="https://github.com/tantivy-search/tantivy">tantivy</a> works.</p>
<p>Tantivy is strongly inspired by Lucene, and if you are a Lucene user, this will sound incredibly familiar… Like Lucene, Tantivy is a search engine library and does not address the problem of distribution. Making a proper distributed search engine that scales would require adding an extra layer around tantivy, playing the role of what ElasticSearch or Solr are to Lucene.</p>
<h1 id="so-what-happens-when-i-search">So what happens when I search?</h1>
<p>Imagine that you indexed wikipedia with tantivy, as described in <a href="https://github.com/tantivy-search/tantivy-cli">tantivy-cli’s tutorial</a> for instance.
Let’s go through what happens when you search for <strong><code class="language-plaintext highlighter-rouge">President Obama</code></strong> on this index, and receive the 10 most relevant documents as a result.</p>
<p>This will introduce the datastructures at stake, before we eventually dive into the details.</p>
<p>This blog post will not describe how the index is built as it will be the subject of the part 2.</p>
<h3 id="query-parser">Query Parser</h3>
<p>First, the user query <code class="language-plaintext highlighter-rouge">President Obama</code> goes through the query parser, which will transform it into something more structured. For instance, depending on your configuration, the query could be transformed into <code class="language-plaintext highlighter-rouge">(title:president OR body:president) AND (title:obama OR body:obama)</code>. In other words, we want any document that contains the word president and the word obama regardless of whether they are in the body field or the title field. Obviously, a document having “President Obama” in its title field is probably more relevant and should appear at the top, but we will rely on scoring for that.</p>
<p>Following Lucene’s terminology, the couples <code class="language-plaintext highlighter-rouge">field:text</code> (e.g. <code class="language-plaintext highlighter-rouge">(title, obama)</code>) are called <strong>Term</strong>s in tantivy.</p>
<h3 id="term-dictionary-term-file">Term dictionary (.term file)</h3>
<p>We now have a boolean query with 4 terms. We first look up all of these terms in a datastructure called the <strong>term dictionary</strong>.
For each term, it stores the following information:</p>
<ul>
<li>the number of documents containing the term (also called document frequency)</li>
<li>a pointer (or an address) into the inverted index file</li>
</ul>
<h3 id="inverted-index-idx-file">Inverted index (.idx file)</h3>
<p>The inverted index has its own separate file. It contains, for each term, a sorted list of document ids. Such a list is usually called inverted list, postings, or posting list. The pointer that was given to us from the term dictionary is simply an offset within this file.</p>
<p class="disclaimer">
Note that I haven't explained what is a document id. For the moment, just consider them as an incremental internal id identifying a document. I will tell you more about what they are in part 2.
</p>
<p>We can start and read in parallel all of these inverted lists. Since they are sorted, computing the relevant intersections and unions can be done very efficiently.</p>
<p>Currently I have put very little effort into optimizing this part. Whatever the number of terms involved (well, if there is more than one), and whatever the boolean formula, tantivy will compute the union of the terms using a simple <a href="https://en.wikipedia.org/wiki/Merge_algorithm#K-way_merging">binary-heap k-way merge</a>, and post-filter the result.
There is therefore still a lot of room for improvement.</p>
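<p>As an illustration only, here is a minimal Python sketch of that "union, then post-filter" strategy. It is not tantivy's code: the names <code>tag</code>, <code>union</code> and <code>search</code>, the toy posting lists and the predicate are all assumptions made for the example. It relies on <code>heapq.merge</code>, which performs a heap-based k-way merge of sorted inputs.</p>

```python
import heapq


def tag(posting_list, term_index):
    """Pair every doc id with the index of the term it came from."""
    for doc_id in posting_list:
        yield doc_id, term_index


def union(posting_lists):
    """k-way merge: yield (doc_id, matching_term_indices) by increasing doc id."""
    streams = [tag(plist, i) for i, plist in enumerate(posting_lists)]
    current, terms = None, set()
    for doc_id, term_index in heapq.merge(*streams):
        if doc_id != current:
            if current is not None:
                yield current, terms
            current, terms = doc_id, set()
        terms.add(term_index)
    if current is not None:
        yield current, terms


def search(posting_lists, predicate):
    """Post-filter the union with the boolean formula of the query."""
    return [doc_id for doc_id, terms in union(posting_lists) if predicate(terms)]


# Hypothetical posting lists for title:president, body:president,
# title:obama and body:obama (in that order).
postings = [[1, 4, 9], [2, 4, 7], [4, 8], [2, 9, 11]]
matches = search(postings, lambda t: (0 in t or 1 in t) and (2 in t or 3 in t))
```

<p>With these toy lists, only the documents that contain both words in some field survive the filter.</p>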
<p>At this point, we have an iterator over the doc ids of the documents that match our initial boolean query. But they are sorted by doc id, and what we really want is the top 10 most relevant docs.</p>
<p>We will go through this iterator entirely, and for each doc id, compute a relevance score for each document. We then push all of the pairs <code class="language-plaintext highlighter-rouge">(DocId, Score)</code> to a collector object. The collector is
in charge of retaining the ten documents with the highest score. This can be done simply using a heap.</p>
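<p>A collector of this kind can be sketched in a few lines of Python. This is a hedged illustration, not tantivy's actual collector: the class name <code>TopKCollector</code> and its methods are made up for the example. The idea is to keep a min-heap of at most <code>k</code> entries, so that the weakest retained document is always at the top and can be evicted cheaply.</p>

```python
import heapq


class TopKCollector:
    """Retain the k (doc_id, score) pairs with the highest scores."""

    def __init__(self, k=10):
        self.k = k
        self.heap = []  # min-heap of (score, doc_id): the worst kept doc is heap[0]

    def collect(self, doc_id, score):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, doc_id))
        elif score > self.heap[0][0]:
            # The new doc beats the worst retained one: swap them.
            heapq.heapreplace(self.heap, (score, doc_id))

    def harvest(self):
        """Return the retained docs, best score first."""
        return [(doc_id, score) for score, doc_id in sorted(self.heap, reverse=True)]


collector = TopKCollector(k=2)
for doc_id, score in [(1, 0.5), (2, 2.0), (3, 1.0), (4, 0.1)]:
    collector.collect(doc_id, score)
top_docs = collector.harvest()
```

<p>Each call to <code>collect</code> costs at most O(log k), whatever the number of matching documents.</p>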
<h3 id="scoring">Scoring</h3>
<p>Tantivy’s relevance score is a flavor of the very classical <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">Tf-Idf</a>. I won’t get into the details, but Tf-Idf expresses a distance between the query and documents. Computing it requires knowing, for each term of the query:</p>
<ul>
<li>the <strong>document frequency</strong> - that is the number of documents containing the term. It was given in our term dictionary.</li>
<li>the <strong>term frequency</strong> - the number of occurrences of the term within the document. As we will see, it is actually encoded within the inverted index file, interlaced in blocks with the doc ids.</li>
<li>the number of terms in each field for the document. This is how we know that being in the title field is more important than being in the body field. A dedicated file and datastructure stores our <strong>fieldnorm</strong>.</li>
</ul>
<h3 id="doc-store-store">Doc store (.store)</h3>
<p>After having scored all of our documents, we are then left with a list of winning <code class="language-plaintext highlighter-rouge">DocId</code>s. We finally fetch the actual content of our documents in a datastructure called the doc store.</p>
<h1 id="index-files-and-directory">Index files, and Directory</h1>
<p>So far, we talked about four big components of tantivy:</p>
<ul>
<li>the term dictionary</li>
<li>the inverted index</li>
<li>the doc store</li>
<li>the field norms</li>
</ul>
<p>Each of them is stored in its own file.</p>
<p class="disclaimer">
There are in fact two other types of files: the fast fields and the position files, but they are not useful for this type of query.
</p>
<p>Tantivy embraces the write-once-read-many (WORM) concept.
This means that all of these files are written once and for all, and can then be considered read-only. This does not mean that you cannot add any
documents. This will all be explained in the next part.</p>
<p>Like in Lucene, writing and reading these files is actually
abstracted by a <code class="language-plaintext highlighter-rouge">Directory</code> trait. By default, tantivy is meant to be used with the <code class="language-plaintext highlighter-rouge">MmapDirectory</code>, in which <code class="language-plaintext highlighter-rouge">File</code>s are actual files on disk, accessed via “mmap”.</p>
<p>Tantivy does not require loading any data structure into anonymous memory, so when used with the <code class="language-plaintext highlighter-rouge">MmapDirectory</code>, tantivy’s resident memory footprint is extremely low.</p>
<p class="disclaimer">
This is actually a very nice feature.
Since the page cache is shared, n servers reading the same index consume about as much RAM as a single server.
Deploying a new version, or running two instances for AB-testing, has close to zero impact on memory usage.
<br /><br />
Tantivy can also easily work on indexes that do not fit entirely in RAM.
The OS will be in charge of deciding which pages are the most useful.
<br /><br />
Finally, Tantivy has a very small loading time, and is perfect for a command line interface usage.
</p>
<p>Tantivy also comes with another <code class="language-plaintext highlighter-rouge">Directory</code> implementation called <code class="language-plaintext highlighter-rouge">RAMDirectory</code>, which stores all of the data in anonymous memory and is mostly useful when writing unit tests.</p>
<p>As we will see, the IO required for search is mostly sequential, so there might be a use case for more exotic
directories, backed by HDFS or an HTTP interface for instance…</p>
<p>Tantivy’s file interface is however very different from that of Lucene, in that it lets the user take a slice out of the file, and then access a byte array (<code class="language-plaintext highlighter-rouge">&[u8]</code>) from it.</p>
<p>It is up to the client of the directory to behave responsibly and avoid asking for gigantic slices of data.
Unfortunately, the current version of tantivy does not behave well in this respect, and this should be improved in the future.</p>
<p>Implementing a directory is quite subtle, as we need to ensure that writes are persistent and that at least some writes are atomic. You can have a look at its contract in the <a href="https://tantivy-search.github.io/tantivy/tantivy/trait.Directory.html">reference documentation</a>.</p>
<h1 id="the-term-dictionary">The term dictionary</h1>
<p>The term dictionary is arguably one of the most complicated data structures to code in a search engine. While using a hash map might come to mind, it is often handy to be able to enumerate terms in sorted order. For this reason, tries and finite state transducers (FST) are popular data structures for a search engine’s term dictionary. Rust is blessed with a great implementation of <a href="https://github.com/BurntSushi/fst/">FST</a> by <a href="http://blog.burntsushi.net/">BurntSushi</a>, so this was a no-brainer for tantivy.</p>
<p>Recent versions of Lucene also use an FST, while earlier versions of Lucene used a trie.
FSTs are more compact than tries and they are only a tad more CPU intensive.
You can read more about FST on <a href="http://blog.burntsushi.net/transducers/">BurntSushi’s blog post</a>.</p>
<h1 id="inverted-index">Inverted index</h1>
<p>Because we want to make sure that most of our data fits in RAM, and to reduce the amount of data read from RAM and possibly disk, it is crucial to compress our lists of integers. Let’s describe the way tantivy represents our posting lists.</p>
<p>Doc ids and term frequencies are encoded together, in interlaced blocks of 128 documents. That way, as we iterate through our inverted list, we don’t have to jump between two lists. A block of 128 doc ids is followed by a block of 128 term frequencies.</p>
<p>The block of 128 term frequencies is simply encoded using bit packing: for instance, assuming the largest value in the block is 10, we really only need 4 bits to encode each of the term frequencies, as <code class="language-plaintext highlighter-rouge">2^4 - 1 = 15 >= 10</code>. Bit packing simply means we use the first byte to express how many
bits are used in our representation (here 4), and then we concatenate the 4-bit representations of our 128 integers. As a result,
<code class="language-plaintext highlighter-rouge">1 + (4 * 128) / 8 = 65 bytes</code> are required for the storage of our term frequencies.</p>
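<p>The arithmetic above can be checked with a small Python sketch. This is an illustration under simplifying assumptions, not tantivy's implementation: <code>bitpack</code> and <code>bitunpack</code> are hypothetical names, the bit order within the payload is an arbitrary choice, and no SIMD is involved.</p>

```python
def bitpack(values):
    """Encode a block as: 1 byte holding the bit width, then the packed values."""
    num_bits = max(values).bit_length()
    acc = 0
    for i, value in enumerate(values):
        # Value number i occupies bits [i * num_bits, (i + 1) * num_bits).
        acc |= value << (i * num_bits)
    payload_len = (len(values) * num_bits + 7) // 8
    return bytes([num_bits]) + acc.to_bytes(payload_len, "little")


def bitunpack(data, count):
    """Decode `count` values from a block produced by `bitpack`."""
    num_bits = data[0]
    acc = int.from_bytes(data[1:], "little")
    mask = (1 << num_bits) - 1
    return [(acc >> (i * num_bits)) & mask for i in range(count)]


# 128 made-up term frequencies whose largest value is 10.
term_freqs = [(i * 7) % 11 for i in range(128)]
packed = bitpack(term_freqs)
```

<p>Here <code>packed</code> starts with the byte <code>4</code> and is 65 bytes long, instead of the 512 bytes needed for 128 plain 32-bit integers.</p>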
<p>Doc ids on the other hand are sorted. We therefore start by delta-encoding them. We replace the list of doc ids by the consecutive intervals between them.</p>
<p>For instance, assuming the document id list goes</p>
<p>7, 12, 15, 17, 25</p>
<p>We encode it as
7, 5, 3, 2, 8</p>
<p>The resulting deltas can then be bitpacked.</p>
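<p>For the record, delta-encoding and its inverse fit in a couple of lines of Python. The function names are only illustrative.</p>

```python
from itertools import accumulate


def delta_encode(doc_ids):
    """Replace each doc id (sorted input) by its distance to the previous one."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


def delta_decode(deltas):
    """Recover the doc ids with a running sum of the deltas."""
    return list(accumulate(deltas))


deltas = delta_encode([7, 12, 15, 17, 25])
```

<p>The deltas are much smaller than the original doc ids, which is what makes the subsequent bitpacking effective.</p>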
<p>Our last block is very likely to contain fewer than 128 documents. In that case, we use a <a href="https://en.wikipedia.org/wiki/Variable-length_quantity">variable-length integer</a> encoding in place of bitpacking.</p>
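<p>Here is a sketch of one common variable-length integer layout: 7 payload bits per byte, least significant group first, with the high bit flagging that more bytes follow. This layout is an assumption chosen for the example, and the exact byte layout used by tantivy may differ.</p>

```python
def vint_encode(n):
    """Encode a non-negative integer using 7 bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def vint_decode_all(data):
    """Decode a concatenation of variable-length integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


encoded = b"".join(vint_encode(n) for n in [5, 300, 2])
```

<p>Small values, which are the common case for deltas and term frequencies, take a single byte each.</p>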
<p><img src="/images/tantivy/interlace.png" alt="Encoding of an inverted list of 263 docs" /></p>
<p>By default, these operations are actually not implemented in tantivy, but delegated to a state-of-the-art C library called <a href="https://github.com/lemire/simdcomp">simdcomp</a>, which uses SIMD instructions.</p>
<p>Because some platforms do not support SIMD instructions, this is actually a Cargo <code class="language-plaintext highlighter-rouge">feature</code> that can be disabled by compiling tantivy with <code class="language-plaintext highlighter-rouge">--no-default-features</code>. Tantivy then uses a pure Rust, SIMD-free implementation of this encoding.</p>
<h1 id="field-norms">Field norms</h1>
<p>The field norm file contains the field norms for all of the fields and all of the documents in the index.</p>
<p>For each document, the difference between the field norm and the minimum field norm is simply bitpacked, which makes random access possible.</p>
<h1 id="doc-store">Doc Store</h1>
<p>Once we have identified the list of doc ids that must be returned to the user, we still need to fetch the actual content of the documents.</p>
<p>Tantivy’s docstore is very similar to Lucene’s doc store.
For each document, the subset of fields that have been configured as stored in your index schema is serialized and appended to a buffer of data. Once the buffer exceeds 16KB, it is compressed and written to disk.</p>
<p>Choosing a lossless compression algorithm is a matter of picking the right speed / compression ratio trade-off for your use case. Tantivy uses LZ4, which sits on the very fast compression/decompression, but not so compact side of the spectrum.</p>
<p>Obviously we still need to identify the block to which our document belongs. The store file also embeds a skip list that associates the last doc id of each block with the start of the next block.</p>
<p>This index makes it easy to identify in which block a doc id belongs. The whole block is then decompressed and the document pulled out.</p>
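<p>The following Python sketch mimics this layout under loud simplifying assumptions: <code>DocStoreSketch</code> is a made-up class, <code>zlib</code> from the standard library stands in for LZ4, blocks live in a Python list rather than in a file, and a sorted list searched with <code>bisect</code> plays the role of the skip list.</p>

```python
import bisect
import zlib


class DocStoreSketch:
    BLOCK_SIZE = 16 * 1024  # flush threshold for the uncompressed buffer

    def __init__(self):
        self.blocks = []        # compressed blocks (zlib here, LZ4 in tantivy)
        self.last_doc_ids = []  # last doc id of each block, in increasing order
        self._buffer = []       # serialized docs of the block being built
        self._buffer_len = 0
        self._num_docs = 0

    def add(self, doc_bytes):
        """Append a serialized document; its doc id is its insertion rank."""
        self._buffer.append(doc_bytes)
        self._buffer_len += len(doc_bytes)
        self._num_docs += 1
        if self._buffer_len > self.BLOCK_SIZE:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # Each document is prefixed with its length on 4 bytes.
        raw = b"".join(len(d).to_bytes(4, "little") + d for d in self._buffer)
        self.blocks.append(zlib.compress(raw))
        self.last_doc_ids.append(self._num_docs - 1)
        self._buffer, self._buffer_len = [], 0

    def get(self, doc_id):
        # First block whose last doc id is >= doc_id.
        block_idx = bisect.bisect_left(self.last_doc_ids, doc_id)
        first_doc = self.last_doc_ids[block_idx - 1] + 1 if block_idx else 0
        raw = zlib.decompress(self.blocks[block_idx])
        pos = 0
        for _ in range(doc_id - first_doc):  # skip the preceding docs
            pos += 4 + int.from_bytes(raw[pos:pos + 4], "little")
        size = int.from_bytes(raw[pos:pos + 4], "little")
        return raw[pos + 4:pos + 4 + size]
```

<p>Note how fetching a single document requires decompressing its whole block, which is why the block size is a trade-off between compression ratio and retrieval cost.</p>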
<p class="disclaimer">
For many use cases, it can be a good idea to decouple the doc store from the search index, and possibly use an external KV store or database of your choice for this. Decoupling the hardware doing search on one hand and fetching documents on the other hand can drastically lower the overall amount of RAM required for your architecture.
<br /><br />
Also, as we will see in part 2 of this blog post, updating a document in a search engine is not instantaneous. Imagine a search engine for a newspaper: you might want to be able to correct a typo in an article instantaneously, while it is not an issue at all if, for a few minutes, users cannot find the article when searching for the mistyped word.
<br /><br />
Nevertheless, tantivy is meant to come with batteries included, and therefore includes a doc store which should do just fine for many use cases!
</p>
<h1 id="wrapping-up">Wrapping up…</h1>
<p>In the next blog post, I will tell you how tantivy’s indexes are built.</p>
<p>In the meanwhile, if you are interested in the project, you can check out the
<a href="https://github.com/tantivy-search/tantivy">GitHub repository page</a>.
The <code class="language-plaintext highlighter-rouge">README.md</code> gives a bunch of pointers on how to get started.</p>
<p>If you want to contribute, or discuss a use case with me, feel free to comment or drop me an email.</p>
Of caret awareness2016-08-17T00:00:00+00:00https://fulmicoton.com/posts/levenshtein-caret-aware<h1 id="caret-awareness--a-neat-feature-for-autocomplete">Caret Awareness : A neat feature for autocomplete</h1>
<p>Around 8 years ago, I read the description of a nice UI
improvement to the traditional autocomplete search box.
I cannot recall the name the author used for the feature, but I like to call it
<strong>caret awareness</strong>. (caret is just another fancy name for text cursor).</p>
<p>Here is the problem it was addressing.
When I search for something, and the results do not seem accurate, I like to add
an extra keyword to refine my query. Sometimes (especially in English), it makes more sense
to prepend than to append this keyword. In that case I bring the caret at the beginning
of the search box and start typing my extra keyword.</p>
<p>As I am typing these new words, an autocomplete system strictly working on prefix matching
will have a hard time offering me any suggestion.</p>
<p><em>“BarObama”, what the hell is this user searching for?</em></p>
<p>The idea of caret awareness is to send the autocomplete service
the position of the caret along with the query. For the query above,
the request to the service is along the line of</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?q=BarObama&caret=3
</code></pre></div></div>
<p>So we added the feature to <a href="http://indeed.com">indeed.com</a>. Here is what it looks like.</p>
<p><img src="/images/caret_aware/caret_aware.gif" /></p>
<p>This feature can be implemented in different ways. But here comes the twist:
our autocomplete is also fuzzy. If your query is long enough, it
will start considering options that are at a Levenshtein-Damerau distance of up to 2.</p>
<p>In the following example, even if the user misspelled <strong>“attorney”</strong>, indeed guessed that <strong>“litigation attorney”</strong> is really what they were trying to type.</p>
<p><img src="/images/caret_aware/caret_aware_fuzzy.gif" /></p>
<p>Let’s see how it works.</p>
<h1 id="caret-aware-levenshtein-automaton-for-the-win">Caret aware Levenshtein automaton for the win!</h1>
<p>When I first heard about the existence of Levenshtein automaton, I was very surprised.</p>
<p>A mind-boggling implication, for instance, is that for any given string $s$ and any given $k$, there is a regular expression that matches exactly the strings that are at a Levenshtein distance of less than $k$ from $s$.</p>
<p>While the result is not really practical at all, it is pretty cool, isn’t it?</p>
<p>Well actually let’s go further : let’s consider a regular expression $s$.
For instance, <code class="language-plaintext highlighter-rouge">ab*c</code>.</p>
<p>It matches an infinite set of strings :</p>
<ul>
<li>abc</li>
<li>abbc</li>
<li>abbbc</li>
<li>abbbbc</li>
<li>…</li>
</ul>
<p>Let’s now extend this set by adding all of the strings that are at a Levenshtein distance of at most 1
from one of the original elements.</p>
<p>We end up with a much larger set. For instance, the strings below have been added.</p>
<ul>
<li>yabc</li>
<li>ab</li>
<li>bac</li>
<li>…</li>
</ul>
<p>One can show that there once again exists a deterministic finite automaton <em>(and hence, a regular expression)</em>
that matches exactly the strings of this new set.</p>
<h1 id="what-does-this-have-to-do-with-caret-awareness-">What does this have to do with caret awareness ?</h1>
<p>Well, our caret-aware fuzzy search really is all about trying to find entries in a dictionary
that are at a Levenshtein distance of at most 2 from a string that matches the regular expression <code class="language-plaintext highlighter-rouge">lit.*atorney</code>.</p>
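<p>Turning a query and a caret position into such a regular expression is the easy part. The helper below is a hypothetical sketch (<code>caret_aware_pattern</code> is not a function of our codebase), and it deliberately ignores fuzziness: it only inserts the <code>.*</code> wildcard at the caret.</p>

```python
import re


def caret_aware_pattern(query, caret):
    """Build the regex `<before caret>.*<after caret>` for a query."""
    before, after = query[:caret], query[caret:]
    return re.compile(re.escape(before) + ".*" + re.escape(after))


pattern = caret_aware_pattern("BarObama", 3)
is_match = pattern.fullmatch("Barack Obama") is not None
```

<p>Here <code>pattern.pattern</code> is <code>Bar.*Obama</code>. The fuzzy part is what the rest of this post is about.</p>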
<p>We now know that there is a DFA, possibly huge, that actually does the job. But can we build it efficiently ?</p>
<h1 id="building-the-automaton">Building the automaton</h1>
<p><em>This section is very technical, and assumes you have read my <a href="http://fulmicoton.com/posts/levenshtein">previous blog post about Levenshtein Automata</a>.</em></p>
<p>Adapting the implicit NFA approach is relatively simple.</p>
<p>Essentially, we change our transition function</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def transitions(self, state, c):
    (offset, D) = state
    if D > 0:
        yield (offset, D - 1)
        yield (offset + 1, D - 1)
    for d in range(min(D + 1, len(self.query) - offset)):
        if c == self.query[offset + d]:
            yield offset + d + 1, D - d
</code></pre></div></div>
<p>by adding the caret information</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def transitions(self, state, c):
    (offset, D) = state
    # we matched up to the caret,
    # any character we get can be matched thanks to the ".*"
    # pattern, so staying in the same state is always an option
    if offset == self.caret:
        yield (offset, D)
    if D > 0:
        yield (offset, D - 1)
        yield (offset + 1, D - 1)
    for d in range(min(D + 1, len(self.query) - offset)):
        if c == self.query[offset + d]:
            yield offset + d + 1, D - d
</code></pre></div></div>
<p>In my previous post, I argued that the implicit NFA solution was not as efficient as the parametric
DFA approach of the original paper of Klaus Schulz and Stoyan Mihov.
Caret-awareness is a very pathological case, as the number of states can rapidly explode.</p>
<p>Without caret-awareness, the number of states that can coexist at the same time
was bounded by <code class="language-plaintext highlighter-rouge">2k + 1</code> where <code class="language-plaintext highlighter-rouge">k</code> is the Levenshtein distance considered. For Levenshtein-Damerau, a generous bound would be <code class="language-plaintext highlighter-rouge">2(2k + 1)</code>.</p>
<p>With caret-awareness, there is no such bound: the number of states in the NFA grows linearly with the length of the query. More accurately, it grows linearly with the length from the caret position to the end of the string. <strong>ouch</strong>.</p>
<p>For the same reason, Klaus Schulz and Stoyan Mihov’s parametric DFA caching trick cannot be applied directly: the parametric DFA would have infinitely many states.</p>
<p>Without going into too much detail, what we did is approximate the automaton by one that is kind enough to be bounded. The approximation works as follows: when implementing the NFA that is then used to build the parametric DFA, we always trim the set of states by removing the states that have too low an offset. More accurately, if the largest offset is $m$, we remove all states associated with an offset lower than $m - (2k + 1) - 2$.</p>
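<p>In terms of the <code>(offset, D)</code> states used in the transition function above, the trimming rule can be sketched as follows. The function name <code>trim_states</code> is made up for the illustration, and the real code operates on the parametric representation rather than on a plain Python set.</p>

```python
def trim_states(states, k):
    """Drop the NFA states whose offset lags too far behind the largest one.

    `states` is a set of (offset, D) pairs and `k` is the maximum
    edit distance supported by the automaton.
    """
    if not states:
        return set()
    m = max(offset for offset, _ in states)
    threshold = m - (2 * k + 1) - 2
    return {(offset, d) for offset, d in states if offset >= threshold}


trimmed = trim_states({(0, 2), (3, 1), (9, 0), (10, 2)}, k=2)
```

<p>With <code>k = 2</code> and a largest offset of 10, every state with an offset below 3 is discarded, so the set of live states stays bounded whatever the length of the query.</p>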
<p>This approximation can only create false negatives, and only for terms in the dictionary that include some long repetitions. For instance, if we are searching for <code class="language-plaintext highlighter-rouge">I love.*Jar Jar Binks</code> and our dictionary contains <code class="language-plaintext highlighter-rouge">I love Jar Jar Binks</code>,
the trimmed automaton will make a mistake because of the repetition <code class="language-plaintext highlighter-rouge">Jar </code> coming right after the caret.</p>
<p>So we pre-built a parametric caret-aware automaton for Levenshtein-Damerau with a distance of 1 and 2. The resulting file takes around 2MB, and it is shipped with
our code.</p>
<p>Once we have that, we can either use the parametric DFA directly, or build an explicit DFA for our language. We currently build the explicit DFA, because this is pretty fast in practice anyway.</p>
<h1 id="the-bizarro-dictionary">The bizarro dictionary</h1>
<p>Another issue with caret awareness is that if the caret is toward the beginning of the string, our automaton has to visit all or most of our trie. This phenomenon has nothing to do with fuzziness, so let’s forget Levenshtein for this section.</p>
<p>In the case of <code class="language-plaintext highlighter-rouge">l.*atorney</code>, the intersection with the trie will end up exploring
all of the words starting with an <code class="language-plaintext highlighter-rouge">l</code>.</p>
<p>We solved this problem by processing these queries in what I like to call <em>the bizarro world</em>, where all strings are reversed.
In this world, our query <code class="language-plaintext highlighter-rouge">l.*atorney</code> becomes <code class="language-plaintext highlighter-rouge">yenrota.*l</code>, and we will only visit
the strings that start with <code class="language-plaintext highlighter-rouge">yenrota</code>.</p>
<p>This means we ship a bizarro trie along with our original dictionary trie.
If the caret is in the second half of the query, we do our regular matching, but
if the caret is in the first half of the query, then we reverse the query, and
run it against our bizarro dictionary.</p>
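<p>To make the idea concrete, here is a small non-fuzzy Python sketch. It is an illustration under assumptions, not our production code: <code>BizarroIndex</code> is a made-up name, and sorted lists searched with <code>bisect</code> stand in for the two tries.</p>

```python
import bisect


class BizarroIndex:
    def __init__(self, entries):
        self.forward = sorted(entries)
        self.bizarro = sorted(entry[::-1] for entry in entries)

    @staticmethod
    def _with_prefix(sorted_entries, prefix):
        """Return the entries starting with `prefix` (what a trie walk visits)."""
        lo = bisect.bisect_left(sorted_entries, prefix)
        hi = bisect.bisect_left(sorted_entries, prefix + "\U0010ffff")
        return sorted_entries[lo:hi]

    def search(self, query, caret):
        """Find the entries matching `<query[:caret]>.*<query[caret:]>`."""
        head, tail = query[:caret], query[caret:]
        if caret >= len(query) / 2:
            # Long head: the regular dictionary narrows the search quickly.
            candidates = self._with_prefix(self.forward, head)
            found = [e for e in candidates if e.endswith(tail)]
        else:
            # Short head: go through the bizarro world, using the reversed tail.
            candidates = self._with_prefix(self.bizarro, tail[::-1])
            found = [e[::-1] for e in candidates if e.endswith(head[::-1])]
        # Head and tail must not overlap inside the entry.
        return sorted(e for e in found if len(e) >= len(query))


index = BizarroIndex(
    ["litigation attorney", "lawyer", "tax attorney", "litigation paralegal"]
)
results = index.search("lattorney", caret=1)
```

<p>For the query <code>l.*attorney</code>, only the two reversed entries starting with <code>yenrotta</code> are visited, instead of every entry starting with an <code>l</code>.</p>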
Of sum of dices2015-11-12T00:00:00+00:00https://fulmicoton.com/posts/dices<h1 id="a-nice-little-puzzle">A nice little puzzle</h1>
<p>So my next post was supposed to be about an extension to
levenshtein automata, but I had some good reason to want to delay that
for a bit. Anyway stay tuned !</p>
<p>In the meanwhile, as I was talking about dice at work, I thought about a very cool puzzle. Here it goes:</p>
<blockquote>
<p><strong>Find all ways to relabel the faces of two dice with non-negative integers in such a way that, if you roll them, their sum follows a uniform law on [0, 35].</strong></p>
</blockquote>
<p>So you probably know that when you roll two classical dice, their sum does not follow a uniform distribution.</p>
<p>7 is the most likely sum with a probability of $1 \over 6$.
2 and 12 are the least likely, with a probability of $1 \over 36$</p>
<p>Actually here is the actual distribution.</p>
<iframe width="600" height="371" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1rilZTWR_L_xjDfo-tVqIhmF2qvKupPRmzZzmLcFfHsw/pubchart?oid=1282500553&format=image"></iframe>
<p>The goal is to change the numbers on the faces in order to make this distribution uniform over [0, 35]. Some of the solutions are rather simple, some others look very cool.</p>
<p>Stop reading here if you want to tackle the puzzle on your own.</p>
<div style="height:100px;"> </div>
<h1 id="sicherman-dice">Sicherman dice</h1>
<p>A very similar problem is to find a way to relabel dice such that
their sum still follows the same distribution as two regular dice.</p>
<p>As explained in the Wikipedia page, there is only one solution to this problem: the so-called Sicherman dice. They go:</p>
<ul>
<li>1, 2, 2, 3, 3, 4</li>
<li>1, 3, 4, 5, 6, 8.</li>
</ul>
<p>Don’t click on the Wikipedia page, as the theory is basically the same for the Sicherman dice and for this problem.</p>
<h2 style="color:red; font-weight:bold; text-align:center">spoiler coming</h2>
<div style="height:100px"></div>
<h2 style="color:red; font-weight:bold; text-align:center">spoiler coming</h2>
<div style="height:100px"></div>
<h2 style="color:red; font-weight:bold; text-align:center">spoiler coming</h2>
<div style="height:100px"></div>
<h2 style="color:red; font-weight:bold; text-align:center">Tyrion is a dragon who took human form</h2>
<div style="height:100px"></div>
<h2 style="color:red; font-weight:bold; text-align:center">spoiler coming</h2>
<div style="height:100px"></div>
<h2 style="color:red; font-weight:bold; text-align:center">spoiler coming</h2>
<div style="height:100px"></div>
<h2 style="color:red; font-weight:bold; text-align:center">spoiler coming</h2>
<div style="height:100px"></div>
<h1 id="polynomials-showing-up">Polynomials showing up</h1>
<p>Algebra gives us all of the necessary tools to solve this problem in a very neat way.
Let’s start by having a look at the probability of the sum of two dice… Let’s call $D_1$ and $D_2$ the random variables associated with the outcome of the first die and the second die respectively.</p>
<p>Then we can compute the probability that their sum is $S$:</p>
<pre>
$$ p(D_1 + D_2 = S) = \sum_{k=1}^{S-1} p(D_1 = k) p(D_2 = S-k) $$
</pre>
<p>You may recognize here a <a href="https://en.wikipedia.org/wiki/Convolution">convolution</a>, or the formula for polynomial multiplication.
We’re going to rely heavily on this observation. Given a random variable $D$ taking integer values $k$ with $0 \leq k \leq n$, we can define a polynomial $\phi(D)$ by</p>
<pre>
$$ \phi(D) = \sum_{k=0}^{n} p(D=k) X^k $$
</pre>
<p>We then have the interesting property that</p>
<pre>
$$ \phi({D_1} + {D_2}) = \phi({D_1})\phi({D_2})$$
</pre>
<p>Let’s see what it means with a very concrete example.
Assume that we have a tetrahedral die with the values <code class="language-plaintext highlighter-rouge">0, 1, 4, 5</code>
on one hand, and another one with the values <code class="language-plaintext highlighter-rouge">0, 2, 8, 10</code> on the other hand.</p>
<p>Up to a normalization factor of $1 \over 4$, the polynomial associated with the first one is</p>
<pre>
$$1 + X + X^4 + X^5$$
</pre>
<p>The polynomial associated to the second one is</p>
<pre>
$$1 + X^2 + X^8 + X^{10}$$
</pre>
<p>Then I can compute the distribution of their sum by multiplying these two polynomials.</p>
<pre>
$$1 + X + X^2 + X^3 + ... + X^{15}$$
</pre>
<p>Which is the polynomial for the uniform law over [0, 15].
These two polynomials actually solve our problem for dice with 4 faces.</p>
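<p>This example is easy to check by machine. In the sketch below, a polynomial is a plain list of coefficients (index = exponent) and the normalization factor is dropped, so a coefficient simply counts the faces showing that value. The helper names are mine.</p>

```python
def die_polynomial(faces):
    """Coefficient k counts how many faces of the die show the value k."""
    coeffs = [0] * (max(faces) + 1)
    for face in faces:
        coeffs[face] += 1
    return coeffs


def poly_mul(p, q):
    """Multiply two polynomials, i.e. convolve their coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


sum_polynomial = poly_mul(die_polynomial([0, 1, 4, 5]), die_polynomial([0, 2, 8, 10]))
```

<p>All of the coefficients of <code>sum_polynomial</code> are equal to 1: each sum is produced by exactly one of the 16 possible rolls, which is what a uniform law looks like.</p>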
<h1 id="cyclotomic">Cyclotomic</h1>
<p>Finding two dice that solve our problem then boils down to finding pairs of polynomials $\phi(D_1)$ and $\phi(D_2)$ such that</p>
<pre>
$$ \phi(D_1) \phi(D_2) = 1 + X + X^2 + X^3 + ... + X^{35} $$
</pre>
<p>We will call this polynomial U.</p>
<p>Just as there is a unique prime decomposition for integers, there is a unique (up to scalar factors) decomposition of a polynomial into irreducible factors. If we can find it for U, then we can just test all of the ways to allocate
each factor to one die or the other. Going through all of the possible partitions will give us all of the possible pairs of dice.</p>
<p>Factorizing a polynomial is usually not easy at all. But this one is very special. You may have recognized a geometric series. We have:</p>
<pre>
$$ 1 + X + X^2 + X^3 + ... + X^{35} = { {1-X^{36}} \over {1- X}} $$
</pre>
<p>${1-X^{36}}$’s factorization is a well known problem. Its roots, also called roots of unity, are the complex numbers $(e^{2 i \pi k \over 36})_{0 \leq k < 36}$. In $\mathbf{Q}$, its irreducible factors are the so-called <a href="https://en.wikipedia.org/wiki/Cyclotomic_polynomial">cyclotomic polynomials</a> associated with the divisors of 36 (1, 2, 3, 4, 6, 9, 12, 18, 36). We will just remove the one for 1, $X-1$, as it is divided away.</p>
<p>The resulting factorization for U is given by</p>
<pre>
$$ U = (1 + X) \\
~~~~(1 + X + X^2)\\
~~(1 + X^2)\\
~~~~(1 - X + X^2)\\
~~~~(1 + X^3 + X^6)\\
~~~~(1 - X^2 + X^4)\\
~~~~(1 - X^3 + X^6)\\
~~~~(1 - X^6 + X^{12})$$
</pre>
<p>We then need to attribute each factor to one or the other of the dice,
multiply each die’s set of factors, and check whether the result is a valid die (for instance, a negative coefficient is a no-no, and a die with six faces cannot have more than 6 monomials).</p>
<p>The possible solutions are listed below, each pair of consecutive lines being one pair of dice:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[0, 3, 12, 15, 24, 27]
[0, 1, 2, 6, 7, 8]
[0, 1, 12, 13, 24, 25]
[0, 2, 4, 6, 8, 10]
[0, 3, 6, 9, 12, 15]
[0, 1, 2, 18, 19, 20]
[0, 1, 6, 7, 12, 13]
[0, 2, 4, 18, 20, 22]
[0, 1, 2, 9, 10, 11]
[0, 3, 6, 18, 21, 24]
[0, 1, 4, 5, 8, 9]
[0, 2, 12, 14, 24, 26]
[0, 1, 2, 3, 4, 5]
[0, 6, 12, 18, 24, 30]
</code></pre></div></div>
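<p>For the curious, the whole search can be sketched in Python. This is a reconstruction written for this post rather than the script I originally used. Polynomials are lists of coefficients (index = exponent), the eight factors are the cyclotomic polynomials listed above, and a die is considered valid when its coefficients are non-negative and sum to 6.</p>

```python
# Cyclotomic factors of U = 1 + X + ... + X^35, as coefficient lists.
FACTORS = [
    [1, 1],                                     # 1 + X
    [1, 1, 1],                                  # 1 + X + X^2
    [1, 0, 1],                                  # 1 + X^2
    [1, -1, 1],                                 # 1 - X + X^2
    [1, 0, 0, 1, 0, 0, 1],                      # 1 + X^3 + X^6
    [1, 0, -1, 0, 1],                           # 1 - X^2 + X^4
    [1, 0, 0, -1, 0, 0, 1],                     # 1 - X^3 + X^6
    [1, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1],   # 1 - X^6 + X^12
]


def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def faces_of(poly):
    """Return the faces of the die described by `poly`, or None if invalid."""
    if any(c < 0 for c in poly) or sum(poly) != 6:
        return None
    return tuple(k for k, c in enumerate(poly) for _ in range(c))


def find_dice_pairs():
    pairs = set()
    for mask in range(1 << len(FACTORS)):
        first, second = [1], [1]
        for i, factor in enumerate(FACTORS):
            if (mask >> i) & 1:
                first = poly_mul(first, factor)
            else:
                second = poly_mul(second, factor)
        faces_a, faces_b = faces_of(first), faces_of(second)
        if faces_a is not None and faces_b is not None:
            # Sorting makes (A, B) and (B, A) count as the same pair.
            pairs.add(tuple(sorted((faces_a, faces_b))))
    return sorted(pairs)


dice_pairs = find_dice_pairs()
```

<p>There are only 2^8 = 256 allocations to try, so brute force is more than enough here.</p>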
Of Levenshtein automata implementations2015-09-05T00:00:00+00:00https://fulmicoton.com/posts/levenshtein<p><em>Thanks to Ken Hironaka for kindly taking a lot of time to read and fix countless errors in this blog post!</em></p>
<h1 id="back-to-tokyo">Back to Tokyo</h1>
<p>It’s been such a long time since my last post, and so much has happened. I moved to Tokyo in November 2014 and started working for Indeed Japan. I’m still kind of foreign to the dev community in Japan, so if you are also in Tokyo and you have some good tips about tech/startup events of any kind in Tokyo, drop me a message!</p>
<h1 id="reacting-to-another-blog-post">Reacting to another blog post</h1>
<p>Earlier this year, Jules Jacobs wrote an awesome blog post titled <a href="http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html"><strong>Levenshtein automata can be simple and fast</strong></a>. While reading it, you might notice that it is kind of a rebuke against the convoluted language of the original paper: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652">Fast String Correction with Levenshtein-Automata</a> by Klaus Schulz and Stoyan Mihov. I read this paper, and I have to agree its style is rather abstract and opaque.</p>
<p>Jules’s blog post, on the other hand, shows great pedagogy, and walks the reader step-by-step through a simpler algorithm to build Levenshtein automata.</p>
<p>While I love this blogpost, I am afraid that I disagree with Jules, when he claims : <strong>After a bit of tinkering I had a working Levenshtein automaton in just a handful of lines of code with the same time complexity as the paper claims to have.</strong></p>
<p>The complexity of Jules’s algorithm is indeed linear in the number of characters. However, if you consider the complexity in the maximum edit distance supported, the algorithm does not do that well. The blog post dismisses this by saying that we will only consider small edit distances (up to 2) anyway, so why not consider it constant. I would counter-argue that at distance 2, the algorithm he describes is already too slow to be usable in practice to build a search autocomplete system.</p>
<p>Moreover, the paper actually describes in <code class="language-plaintext highlighter-rouge">Chapter 6</code> a way to avoid computing the DFA at all… so isn’t calling it the <strong>same time complexity</strong> a bit of a stretch?</p>
<p>In this blog post I will try to take the subject where Jules left it, and explain the actual algorithm in the article. I will also explain some specificities about Lucene’s implementation.</p>
<h1 id="what-are-levenshtein-automata-anyway">What are Levenshtein Automata anyway?</h1>
<p>I recently got interested in building an autocomplete service. You probably are familiar with those : the user starts typing a query, and is offered a bunch of suggestions before he has even finished typing.</p>
<p>Imagine you had to implement one of these…<br />
As a first shot, you might consider building a trie with
a list of suggestions. For each of the suggestions, you also probably want to store some kind of score. When a request comes, you can then use the trie to list the suggestions which admit the user input as a prefix, and serve back the top 10 best entries.</p>
<p>But users make typos, and sometimes they don’t actually know how to spell the thing that they are searching for. So you might want to relax the prefix constraint and allow for spelling mistakes. The <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652">paper</a> explains precisely how to rapidly find, in a dictionary, the entries that are at an edit distance lower than k from a query. I will leave the “prefix” part of the problem for a next blog post.</p>
<p>The solution starts by building a so-called Levenshtein Automaton for the user query. It is a Deterministic Finite Automaton (<a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton">DFA</a>) which has the property of matching exactly the strings that are at an edit distance of at most D from the query.</p>
<p>Now, if our dictionary is also stored in a trie (or a transducer, or any kind of automaton), the problem consists in running the automaton over the trie. This operation is called an intersection and is rather straightforward.</p>
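<p>Here is a rough sketch of what such an intersection can look like. It assumes a nested-dict trie and an automaton object exposing <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">step</code>, <code class="language-plaintext highlighter-rouge">is_match</code> and <code class="language-plaintext highlighter-rouge">can_match</code>; all of these names are made up for the example. As a stand-in for the DFA we have not built yet, <code class="language-plaintext highlighter-rouge">RowAutomaton</code> uses one row of the classic edit distance table as its state. It is not the construction described in the paper.</p>

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a word (hypothetical layout)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


class RowAutomaton:
    """Stand-in automaton: a state is one row of the edit distance table."""

    def __init__(self, query, max_distance):
        self.query = query
        self.max_distance = max_distance

    def start(self):
        return tuple(range(len(self.query) + 1))

    def step(self, state, c):
        row = [state[0] + 1]
        for i, q in enumerate(self.query):
            cost = 0 if q == c else 1
            row.append(min(row[i] + 1, state[i] + cost, state[i + 1] + 1))
        return tuple(row)

    def is_match(self, state):
        return state[-1] <= self.max_distance

    def can_match(self, state):
        # The minimum of a row never decreases, so this detects dead states.
        return min(state) <= self.max_distance


def intersect(trie, automaton):
    """Depth-first walk of the trie, advancing the automaton in lockstep."""
    matches = []
    stack = [(trie, automaton.start(), "")]
    while stack:
        node, state, prefix = stack.pop()
        if not automaton.can_match(state):
            continue  # dead state: prune the whole subtree
        if "$" in node and automaton.is_match(state):
            matches.append(prefix)
        for ch, child in node.items():
            if ch != "$":
                stack.append((child, automaton.step(state, ch), prefix + ch))
    return sorted(matches)
```

<p>The walk only visits the parts of the trie that the automaton can still accept, which is why the whole dictionary never needs to be scanned.</p>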
<p>The construction of such a DFA on the other hand is a bit tricky. Building it fast is quite a challenge. In this blog post, I will describe precisely the algorithm presented in the paper. I will also talk about the specifics of Lucene’s implementation.</p>
<p><em>If you are not familiar with the concept of NFA, DFA, or levenshtein distance I really advise you to have a look at Jules’ blog post before reading this post.</em></p>
<p>In my next post, I will talk about an extension of Levenshtein Automata, with hopefully some actual original material.</p>
<h1 id="lets-get-started">Let’s get started</h1>
<p>As a warm up, let’s write the simplest implementation we can think of that checks whether two strings are at an edit distance less than or equal to D.
In practice, you probably want to get the distance itself as an output as well, to compute a score for your suggestion, but for the sake of simplicity, I deliberately left this refinement out of this blog post :
Our implementations will simply return True iff the matched string is at an edit distance less than or equal to D.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
<span class="s">"""
Returns True iff the edit distance between the two strings
s1 and s2 is less than or equal to D
"""</span>
<span class="k">if</span> <span class="n">D</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span> <span class="o"><</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">):</span>
<span class="k">return</span> <span class="n">levenshtein</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="n">s1</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span> <span class="o"><=</span> <span class="n">D</span>
<span class="k">return</span> <span class="p">(</span><span class="n">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">s2</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">D</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># substitution\
</span> <span class="ow">or</span> <span class="n">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">D</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># insertion\
</span> <span class="ow">or</span> <span class="n">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">s2</span><span class="p">,</span> <span class="n">D</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># deletion\
</span> <span class="ow">or</span> <span class="p">(</span>
<span class="c1"># character match
</span> <span class="p">(</span><span class="n">s1</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">s2</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="ow">and</span> \
<span class="n">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">s2</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">D</span><span class="p">)</span>
<span class="p">))</span></code></pre></figure>
<p>Pretty straightforward, isn’t it? This version of the algorithm will unfortunately not help us build our automaton. <code class="language-plaintext highlighter-rouge">s1</code> and <code class="language-plaintext highlighter-rouge">s2</code> play symmetric roles in this code.</p>
<p>On our way to build our automaton, we will have to break this symmetry : we build the automaton for one of those strings, <code class="language-plaintext highlighter-rouge">s2</code>, and apply the automaton on <code class="language-plaintext highlighter-rouge">s1</code>.</p>
<p>So let’s modify our algorithm to make sure that we munch one character <code class="language-plaintext highlighter-rouge">c</code> away from s1 at each call.</p>
<p>At each step we will consider two cases.
Either <code class="language-plaintext highlighter-rouge">c</code> will not be used to recreate <code class="language-plaintext highlighter-rouge">s2</code> from <code class="language-plaintext highlighter-rouge">s1</code>, or it will be used. If it is used, it has to be used in a position of at most <code class="language-plaintext highlighter-rouge">D</code> in <code class="language-plaintext highlighter-rouge">s2</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
<span class="s">"""
Returns True iff the edit distance between
the two strings s1 and s2 is less than or
equal to D
"""</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o"><=</span> <span class="n">D</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span> <span class="o"><=</span> <span class="n">D</span>
<span class="c1"># assuming s1[0] is NOT used to build s2,
</span> <span class="k">if</span> <span class="n">D</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="k">if</span> <span class="n">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">s2</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="c1"># deletion
</span> <span class="k">return</span> <span class="bp">True</span>
<span class="k">if</span> <span class="n">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">s2</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="c1"># substitution
</span> <span class="k">return</span> <span class="bp">True</span>
<span class="c1"># assuming s1[0] is used to build s2
</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">D</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">))):</span>
<span class="c1"># d is the position where s1[0]
</span> <span class="c1"># might be used.
</span> <span class="c1"># it is also the number of character
</span> <span class="c1"># that are required to be inserted before
</span> <span class="c1"># using s1[0].
</span> <span class="k">if</span> <span class="n">s1</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">s2</span><span class="p">[</span><span class="n">d</span><span class="p">]:</span>
<span class="k">if</span> <span class="n">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">s2</span><span class="p">[</span><span class="n">d</span><span class="o">+</span><span class="mi">1</span><span class="p">:],</span> <span class="n">D</span> <span class="o">-</span> <span class="n">d</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">return</span> <span class="bp">False</span></code></pre></figure>
<p>I can already hear you grumbling : why are we copying all of these strings around? Let’s replace the string arguments by offsets to a const string.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
<span class="s">"""
Returns True iff the edit distance between
the two strings s1 and s2 is less than or
equal to D
"""</span>
<span class="k">def</span> <span class="nf">aux</span><span class="p">(</span><span class="n">i1</span><span class="p">,</span> <span class="n">i2</span><span class="p">,</span> <span class="n">D</span><span class="p">):</span>
<span class="k">if</span> <span class="n">i1</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">s1</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o">-</span> <span class="n">i2</span> <span class="o"><=</span> <span class="n">D</span>
<span class="k">if</span> <span class="n">D</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="k">if</span> <span class="n">aux</span><span class="p">(</span><span class="n">i1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">i2</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="c1"># deletion
</span> <span class="k">return</span> <span class="bp">True</span>
<span class="k">if</span> <span class="n">aux</span><span class="p">(</span><span class="n">i1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">i2</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="c1"># substitution
</span> <span class="k">return</span> <span class="bp">True</span>
<span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">D</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o">-</span> <span class="n">i2</span><span class="p">)):</span>
<span class="k">if</span> <span class="n">s1</span><span class="p">[</span><span class="n">i1</span><span class="p">]</span> <span class="o">==</span> <span class="n">s2</span><span class="p">[</span><span class="n">i2</span> <span class="o">+</span> <span class="n">d</span><span class="p">]:</span>
<span class="c1"># d insertion, followed
</span> <span class="c1"># by a character match.
</span> <span class="k">if</span> <span class="n">aux</span><span class="p">(</span><span class="n">i1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">i2</span> <span class="o">+</span> <span class="n">d</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="n">d</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="k">return</span> <span class="n">aux</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span></code></pre></figure>
<p>One of the problems with that kind of recursive program is that <code class="language-plaintext highlighter-rouge">aux</code> is called many times with the same arguments.</p>
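<p>To get a feel for how much work is repeated, here is a small experiment. It is only a sketch: <code class="language-plaintext highlighter-rouge">levenshtein_counting</code> and <code class="language-plaintext highlighter-rouge">levenshtein_memo</code> are names I introduce here, and the recursion is the same as in the previous listing. The first function records how often each triple of arguments is visited, the second one lets <code class="language-plaintext highlighter-rouge">functools.lru_cache</code> answer the repeated calls.</p>

```python
from collections import Counter
from functools import lru_cache


def levenshtein_counting(s1, s2, D=2):
    """Same recursion as above, but counts the visits to each (i1, i2, D)."""
    calls = Counter()

    def aux(i1, i2, D):
        calls[(i1, i2, D)] += 1
        if i1 == len(s1):
            return len(s2) - i2 <= D
        if D > 0:
            # deletion, then substitution
            if aux(i1 + 1, i2, D - 1) or aux(i1 + 1, i2 + 1, D - 1):
                return True
        for d in range(min(D + 1, len(s2) - i2)):
            # d insertions followed by a character match
            if s1[i1] == s2[i2 + d] and aux(i1 + 1, i2 + d + 1, D - d):
                return True
        return False

    return aux(0, 0, D), calls


def levenshtein_memo(s1, s2, D=2):
    """Same recursion again, with lru_cache answering the repeated calls."""

    @lru_cache(maxsize=None)
    def aux(i1, i2, D):
        if i1 == len(s1):
            return len(s2) - i2 <= D
        if D > 0:
            if aux(i1 + 1, i2, D - 1) or aux(i1 + 1, i2 + 1, D - 1):
                return True
        for d in range(min(D + 1, len(s2) - i2)):
            if s1[i1] == s2[i2 + d] and aux(i1 + 1, i2 + d + 1, D - d):
                return True
        return False

    return aux(0, 0, D)
```

<p>Memoization fixes the repeated work, but it does not bring us any closer to an automaton. Grouping the arguments character by character does.</p>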
<p>Let’s transform this method to make it iterative, and let’s group the calls with the same arguments by putting them in a set.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
<span class="s">"""
Returns True iff the edit distance between
the two strings s1 and s2 is less than or
equal to D
"""</span>
<span class="k">def</span> <span class="nf">aux</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">i2</span><span class="p">,</span> <span class="n">D</span><span class="p">):</span>
<span class="c1"># i2 is the number of character
</span> <span class="c1"># consumed in the string s2.
</span> <span class="c1"># D is the number of error that we
</span> <span class="c1"># still allow.
</span> <span class="k">if</span> <span class="n">D</span> <span class="o">>=</span> <span class="mi">1</span><span class="p">:</span>
<span class="c1"># deletion
</span> <span class="k">yield</span> <span class="n">i2</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span>
<span class="c1"># substitution
</span> <span class="k">yield</span> <span class="n">i2</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">D</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o">-</span> <span class="n">i2</span><span class="p">)):</span>
<span class="k">if</span> <span class="n">c</span> <span class="o">==</span> <span class="n">s2</span><span class="p">[</span><span class="n">i2</span> <span class="o">+</span> <span class="n">d</span><span class="p">]:</span>
<span class="c1"># d insertions followed by a
</span> <span class="c1"># character match
</span> <span class="k">yield</span> <span class="n">i2</span> <span class="o">+</span> <span class="n">d</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="n">d</span>
<span class="n">current_args</span> <span class="o">=</span> <span class="p">{(</span><span class="mi">0</span><span class="p">,</span> <span class="n">D</span><span class="p">)}</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">s1</span><span class="p">:</span>
<span class="n">next_args</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i2</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="ow">in</span> <span class="n">current_args</span><span class="p">:</span>
<span class="k">for</span> <span class="n">next_arg</span> <span class="ow">in</span> <span class="n">aux</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">i2</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
<span class="n">next_args</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">next_arg</span><span class="p">)</span>
<span class="n">current_args</span> <span class="o">=</span> <span class="n">next_args</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i2</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="ow">in</span> <span class="n">current_args</span><span class="p">:</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span> <span class="o">-</span> <span class="n">i2</span> <span class="o"><=</span> <span class="n">D</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">return</span> <span class="bp">False</span></code></pre></figure>
<p>Now this is seriously starting to look like an automaton, whose states are labeled by the pairs <code class="language-plaintext highlighter-rouge">(i2, D)</code>.</p>
<p>Let’s just rename some variables, and rearrange the code to let the NFA appear.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">NFA</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">accept</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">initial_states</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
<span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">eval</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_string</span><span class="p">):</span>
<span class="n">states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">initial_states</span><span class="p">()</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">input_string</span><span class="p">:</span>
<span class="n">next_states</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
<span class="n">next_states</span> <span class="o">|=</span> <span class="nb">set</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">transitions</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">c</span><span class="p">))</span>
<span class="n">states</span> <span class="o">=</span> <span class="n">next_states</span>
<span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
<span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">accept</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">class</span> <span class="nc">LevenshteinAutomaton</span><span class="p">(</span><span class="n">NFA</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">query</span>
<span class="bp">self</span><span class="p">.</span><span class="n">max_D</span> <span class="o">=</span> <span class="n">D</span>
<span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
<span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span>
<span class="k">if</span> <span class="n">D</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="k">yield</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">yield</span> <span class="p">(</span><span class="n">offset</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">D</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">)</span> <span class="o">-</span> <span class="n">offset</span><span class="p">)):</span>
<span class="k">if</span> <span class="n">c</span> <span class="o">==</span> <span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">[</span><span class="n">offset</span> <span class="o">+</span> <span class="n">d</span><span class="p">]:</span>
<span class="k">yield</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">d</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="n">d</span>
<span class="k">def</span> <span class="nf">accept</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
<span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">)</span> <span class="o">-</span> <span class="n">offset</span> <span class="o"><=</span> <span class="n">D</span>
<span class="k">def</span> <span class="nf">initial_states</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
<span class="k">return</span> <span class="p">{(</span><span class="mi">0</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_D</span><span class="p">)}</span>
<span class="k">def</span> <span class="nf">levenshtein</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
<span class="k">return</span> <span class="n">LevenshteinAutomaton</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="n">D</span><span class="p">).</span><span class="nb">eval</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span></code></pre></figure>
<p>That looks awesome!
Let’s step back for a second here. The states of a Levenshtein NFA are parameterized by two integers.</p>
<ul>
<li>the <code class="language-plaintext highlighter-rouge">offset</code> that tells you how many characters of the query you have already matched</li>
<li>the number <code class="language-plaintext highlighter-rouge">d</code> of mistakes that are still allowed to match the remaining <code class="language-plaintext highlighter-rouge">len(query) - offset</code> characters.</li>
</ul>
<p>At this point our algorithm is very similar to that of Jules Jacobs.</p>
<h1 id="observations-lets-count-states">Observations, let’s count states.</h1>
<p>The next step is to get a DFA from this. This is typically done by running a <a href="https://en.wikipedia.org/wiki/Powerset_construction">powerset construction</a>. The cost of the powerset operation is highly dependent on the number of sets of states that are accessible. Let’s get a reasonable upper bound on that.</p>
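<p>For reference, a generic powerset construction can be sketched as follows. This is a textbook version under simplifying assumptions, not the construction from the paper: it takes an explicit finite <code class="language-plaintext highlighter-rouge">alphabet</code>, and the function names <code class="language-plaintext highlighter-rouge">powerset_construction</code> and <code class="language-plaintext highlighter-rouge">run_dfa</code> are mine. The <code class="language-plaintext highlighter-rouge">transitions</code> and <code class="language-plaintext highlighter-rouge">accept</code> callbacks mirror the methods of the NFA class above.</p>

```python
from collections import deque


def powerset_construction(initial_states, alphabet, transitions, accept):
    """Build a DFA whose states are frozensets of NFA states.

    `transitions(state, c)` returns an iterable of NFA states and
    `accept(state)` a boolean.
    """
    start = frozenset(initial_states)
    dfa = {}  # frozenset of NFA states -> {character: frozenset of NFA states}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue  # this set of states was already expanded
        dfa[current] = {}
        for c in alphabet:
            target = frozenset(
                nxt for state in current for nxt in transitions(state, c)
            )
            dfa[current][c] = target
            if target not in dfa:
                queue.append(target)
    accepting = {s for s in dfa if any(accept(state) for state in s)}
    return start, dfa, accepting


def run_dfa(start, dfa, accepting, text):
    """Evaluate the DFA; characters outside the alphabet raise a KeyError."""
    state = start
    for c in text:
        state = dfa[state][c]
    return state in accepting
```

<p>Each key of <code class="language-plaintext highlighter-rouge">dfa</code> is one accessible set of NFA states, so <code class="language-plaintext highlighter-rouge">len(dfa)</code> is precisely the quantity we want to bound.</p>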
<p>To help us figure out what happens, here is a visualization of our Levenshtein Automaton for the word “flees” and a maximum edit distance of 1. You can type in strings (<code class="language-plaintext highlighter-rouge">flyers</code>, <code class="language-plaintext highlighter-rouge">flee</code>).
The states you end up in after stepping through
the automaton will be displayed in blue.</p>
<p><b>Levenshtein Automaton for flees (type in!)</b></p>
<div id="levenshtein-simulator">
</div>
<p>The most striking thing to notice here is that after consuming k characters with our NFA, while we end up in more than one state (in blue in the small visualization), the states we are in are always very close to one another.</p>
<p>When you think about it, the reason is actually pretty simple : after n characters, you cannot reach any state with an offset of more than <code class="language-plaintext highlighter-rouge">n + D</code> (that would mean that you have inserted more than D characters). The same applies to states with an offset of less than <code class="language-plaintext highlighter-rouge">n - D</code>, as reaching them would require deleting more than D characters.</p>
<p>In other words, at any point in time, you know that all of the active states will lie between the offsets <code class="language-plaintext highlighter-rouge">n - D</code> and <code class="language-plaintext highlighter-rouge">n + D</code>. That’s at most <code class="language-plaintext highlighter-rouge">2D + 1</code> possible positions.</p>
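<p>If you cannot play with the visualization, the same observation can be checked with a few lines of code. The sketch below copies the transition rule of our NFA into a standalone function and records which offsets are active after each character; <code class="language-plaintext highlighter-rouge">active_offsets</code> is a helper name I made up for this check.</p>

```python
def transitions(query, state, c):
    """Transition rule of the Levenshtein NFA, as a standalone function."""
    offset, D = state
    if D > 0:
        yield offset, D - 1        # deletion
        yield offset + 1, D - 1    # substitution
    for d in range(min(D + 1, len(query) - offset)):
        if c == query[offset + d]:
            # d insertions followed by a character match
            yield offset + d + 1, D - d


def active_offsets(query, text, max_D):
    """For each prefix of `text`, return the sorted offsets of the active states."""
    states = {(0, max_D)}
    history = []
    for c in text:
        states = {nxt for state in states for nxt in transitions(query, state, c)}
        history.append(sorted({offset for offset, _ in states}))
    return history


D = 1
for n, offsets in enumerate(active_offsets("flees", "flyers", D), start=1):
    # Every active offset stays inside the window [n - D, n + D].
    assert all(n - D <= offset <= n + D for offset in offsets)
    print(n, offsets)
```

<p>The assertion never fires, whatever the query and the input are: the window only depends on the number of characters consumed.</p>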
<p>At this point, this gives us the only upper bound we have on the number of sets of states in the DFA, and therefore on the complexity of building it.
(Note that this is an upper bound and that the reality is probably less grim)</p>
<pre>
$$ O \left((D+1)^{2D + 1}N \right)$$
</pre>
<p><em>Where N is the number of characters in the string we are building the automaton for, and D is the max edit distance allowed.</em></p>
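<p>To see how quickly this bound grows with D, we can simply evaluate it. The helper name <code class="language-plaintext highlighter-rouge">state_set_bound</code> is mine, and the query length of 5 is just the length of “flees”.</p>

```python
def state_set_bound(D, N):
    """Evaluate the (D + 1)^(2D + 1) * N upper bound from the formula above."""
    return (D + 1) ** (2 * D + 1) * N


for D in (1, 2, 3):
    print(D, state_set_bound(D, N=5))
```

<p>For a 5 letter query, this gives 40 for a distance of 1, 1215 for a distance of 2 and 81920 for a distance of 3.</p>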
<p>We also said that our second parameter for each state was the number of edit operations that we can still do and still belong to the language.</p>
<p>So in a sense if we reached the <code class="language-plaintext highlighter-rouge">State(n, d)</code>, it does not really matter whether we are in <code class="language-plaintext highlighter-rouge">State(n, d-1)</code> as well. The texts that will match or not in the end will be the same.</p>
<h1 id="removing-the-redundant-states">Removing the redundant states</h1>
<p>Let’s remove the states that are actually implied by other states.</p>
<p>The rule is that for any integer k (note that k can be negative), <code class="language-plaintext highlighter-rouge">(n, d)</code> implies <code class="language-plaintext highlighter-rouge">(n+k, d-|k|)</code>, as it is just a matter of burning our jokers to insert or delete characters.</p>
<p>So with our simplification function, our code now looks like:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">NFA</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
        <span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">accept</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
        <span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">initial_states</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
        <span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
        <span class="n">next_states</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
            <span class="n">next_states</span> <span class="o">|=</span> <span class="nb">set</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">transitions</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">c</span><span class="p">))</span>
        <span class="n">states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">simplify</span><span class="p">(</span><span class="n">next_states</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">states</span>
    <span class="k">def</span> <span class="nf">step_all</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_string</span><span class="p">):</span>
        <span class="n">states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">initial_states</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">input_string</span><span class="p">:</span>
            <span class="n">states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">states</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">states</span>
    <span class="k">def</span> <span class="nf">eval</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">s</span><span class="p">):</span>
        <span class="n">final_states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">step_all</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">final_states</span><span class="p">:</span>
            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">accept</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
                <span class="k">return</span> <span class="bp">True</span>
        <span class="k">return</span> <span class="bp">False</span>
    <span class="k">def</span> <span class="nf">simplify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">states</span>

<span class="k">class</span> <span class="nc">LevenshteinNFA</span><span class="p">(</span><span class="n">NFA</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">query</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">D</span> <span class="o">=</span> <span class="n">D</span>
    <span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
        <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span>
        <span class="k">if</span> <span class="n">d</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="c1"># the input character is an insertion
</span>            <span class="k">yield</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">d</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
            <span class="c1"># the input character is a substitution
</span>            <span class="k">yield</span> <span class="p">(</span><span class="n">offset</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">d</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
        <span class="c1"># k characters of the query are deleted, then c matches
</span>        <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">d</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">)</span> <span class="o">-</span> <span class="n">offset</span><span class="p">)):</span>
            <span class="k">if</span> <span class="n">c</span> <span class="o">==</span> <span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">[</span><span class="n">offset</span> <span class="o">+</span> <span class="n">k</span><span class="p">]:</span>
                <span class="k">yield</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">k</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">d</span> <span class="o">-</span> <span class="n">k</span>
    <span class="k">def</span> <span class="nf">accept</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
        <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">)</span> <span class="o">-</span> <span class="n">offset</span> <span class="o">&lt;=</span> <span class="n">d</span>
    <span class="k">def</span> <span class="nf">initial_states</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
        <span class="k">return</span> <span class="p">{(</span><span class="mi">0</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">D</span><span class="p">)}</span>
    <span class="k">def</span> <span class="nf">simplify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
        <span class="k">def</span> <span class="nf">implies</span><span class="p">(</span><span class="n">state1</span><span class="p">,</span> <span class="n">state2</span><span class="p">):</span>
            <span class="s">"""
            Returns true, if state1 implies state2
            """</span>
            <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="o">=</span> <span class="n">state1</span>
            <span class="p">(</span><span class="n">offset2</span><span class="p">,</span> <span class="n">d2</span><span class="p">)</span> <span class="o">=</span> <span class="n">state2</span>
            <span class="k">if</span> <span class="n">d2</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">True</span>
            <span class="k">return</span> <span class="n">d</span> <span class="o">-</span> <span class="n">d2</span> <span class="o">&gt;=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">offset2</span> <span class="o">-</span> <span class="n">offset</span><span class="p">)</span>
        <span class="k">def</span> <span class="nf">is_useful</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">s2</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
                <span class="k">if</span> <span class="n">s</span> <span class="o">!=</span> <span class="n">s2</span> <span class="ow">and</span> <span class="n">implies</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="n">s</span><span class="p">):</span>
                    <span class="k">return</span> <span class="bp">False</span>
            <span class="k">return</span> <span class="bp">True</span>
        <span class="k">return</span> <span class="nb">filter</span><span class="p">(</span><span class="n">is_useful</span><span class="p">,</span> <span class="n">states</span><span class="p">)</span></code></pre></figure>
<p>This will not necessarily make our automaton minimal, but it is definitely less hairy.</p>
<p>The new complexity for the number of states in our automaton is</p>
<pre>
$$ O \left( D^2 N \right)$$
</pre>
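<p>To get a feel for how much pruning this gives us, here is a small standalone sketch of the same subsumption rule applied to a handful of states. The module-level <code class="language-plaintext highlighter-rouge">implies</code> and <code class="language-plaintext highlighter-rouge">simplify</code> functions and the sample states are only there for illustration; they mirror the methods above but return a set.</p>

```python
def implies(state1, state2):
    """True if state1 subsumes state2: keeping state1 makes state2 redundant."""
    (offset1, d1) = state1
    (offset2, d2) = state2
    if d2 < 0:
        # A state with a negative budget can never accept anything.
        return True
    return d1 - d2 >= abs(offset2 - offset1)


def simplify(states):
    """Keep only the states that are not subsumed by another state."""
    states = set(states)
    return {
        s for s in states
        if not any(s != other and implies(other, s) for other in states)
    }


# Illustrative set of (offset, remaining edits) states.
states = {(2, 2), (2, 1), (3, 1), (1, 0), (5, 1)}
print(sorted(simplify(states)))  # [(2, 2), (5, 1)]
```

<p>Here <code class="language-plaintext highlighter-rouge">(2, 2)</code> swallows its three neighbours, while <code class="language-plaintext highlighter-rouge">(5, 1)</code> survives because it is too far away to be reached with the budget that is left.</p>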
<h1 id="so-whats-next">So what’s next</h1>
<p>In their paper, Klaus Schulz and Stoyan Mihov then notice that the result of the <code class="language-plaintext highlighter-rouge">transitions</code> function actually only depends on the indices <code class="language-plaintext highlighter-rouge">i</code> for which we have <code class="language-plaintext highlighter-rouge">c == query[i]</code>. In plain English, as I am about to receive character <code class="language-plaintext highlighter-rouge">c</code>, the next state only depends on which of the D+1 characters following my offset are equal to <code class="language-plaintext highlighter-rouge">c</code>.
Because of that, they define what they call <strong>a characteristic vector</strong>, a vector of length <code class="language-plaintext highlighter-rouge">len(query)</code> (for the moment) where the value at index <code class="language-plaintext highlighter-rouge">i</code> is True iff <code class="language-plaintext highlighter-rouge">query[i] == c</code>.</p>
<p>… And some noisy code to make sure that we do not go past the last character of the query.</p>
<p>Our code now becomes:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">LevenshteinNFA</span><span class="p">(</span><span class="n">NFA</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query_length</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">D</span> <span class="o">=</span> <span class="n">D</span>
    <span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">chi</span><span class="p">):</span>
        <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span>
        <span class="k">if</span> <span class="n">D</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">yield</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
            <span class="k">yield</span> <span class="p">(</span><span class="n">offset</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chi</span><span class="p">[</span><span class="n">offset</span><span class="p">:]):</span>
            <span class="k">if</span> <span class="n">val</span><span class="p">:</span>
                <span class="k">yield</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">d</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="n">d</span>
    <span class="k">def</span> <span class="nf">accept</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
        <span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">initial_states</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
        <span class="k">return</span> <span class="p">{(</span><span class="mi">0</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">D</span><span class="p">)}</span>
    <span class="k">def</span> <span class="nf">simplify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
        <span class="k">def</span> <span class="nf">implies</span><span class="p">(</span><span class="n">state1</span><span class="p">,</span> <span class="n">state2</span><span class="p">):</span>
            <span class="s">"""
            Returns true, if state1 implies state2
            """</span>
            <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="o">=</span> <span class="n">state1</span>
            <span class="p">(</span><span class="n">offset2</span><span class="p">,</span> <span class="n">D2</span><span class="p">)</span> <span class="o">=</span> <span class="n">state2</span>
            <span class="k">if</span> <span class="n">D2</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">True</span>
            <span class="k">return</span> <span class="n">D</span> <span class="o">-</span> <span class="n">D2</span> <span class="o">&gt;=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">offset2</span> <span class="o">-</span> <span class="n">offset</span><span class="p">)</span>
        <span class="k">def</span> <span class="nf">is_useful</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">s2</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
                <span class="k">if</span> <span class="n">s</span> <span class="o">!=</span> <span class="n">s2</span> <span class="ow">and</span> <span class="n">implies</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="n">s</span><span class="p">):</span>
                    <span class="k">return</span> <span class="bp">False</span>
            <span class="k">return</span> <span class="bp">True</span>
        <span class="k">return</span> <span class="nb">filter</span><span class="p">(</span><span class="n">is_useful</span><span class="p">,</span> <span class="n">states</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">levenshtein</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">input_string</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
    <span class="n">nfa</span> <span class="o">=</span> <span class="n">LevenshteinNFA</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">query</span><span class="p">),</span> <span class="n">D</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">characteristic</span><span class="p">(</span><span class="n">c</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">tuple</span><span class="p">(</span>
            <span class="n">v</span> <span class="o">==</span> <span class="n">c</span>
            <span class="k">for</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        <span class="p">)</span>
    <span class="n">states</span> <span class="o">=</span> <span class="n">nfa</span><span class="p">.</span><span class="n">initial_states</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">input_string</span><span class="p">:</span>
        <span class="n">chi</span> <span class="o">=</span> <span class="n">characteristic</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="n">states</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">nfa</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">chi</span><span class="p">,</span> <span class="n">states</span><span class="p">))</span>
    <span class="c1"># accept if the rest of the query fits in the remaining edit budget d
</span>    <span class="k">for</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">query</span><span class="p">)</span> <span class="o">-</span> <span class="n">offset</span> <span class="o">&lt;=</span> <span class="n">d</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="bp">False</span></code></pre></figure>
<p>By doing so, we have built an NFA that works on the alphabet of characteristic vectors.
The benefit is that we almost completely removed the part that depends on the query.</p>
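<p>As a quick illustration of why this matters, here is a minimal sketch of the full-length characteristic vector. The standalone <code class="language-plaintext highlighter-rouge">characteristic(query, c)</code> signature and the sample strings are assumptions made for the example. Two unrelated queries can produce exactly the same vector, and the vector is all the NFA gets to see.</p>

```python
def characteristic(query, c):
    """Tuple of booleans: position i is True iff query[i] == c."""
    return tuple(q == c for q in query)


# Different queries, different characters, same characteristic vector.
print(characteristic("abc", "b"))    # (False, True, False)
print(characteristic("xyz", "y"))    # (False, True, False)

# A character that occurs twice in the query sets two positions.
print(characteristic("flees", "e"))  # (False, False, True, True, False)
```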
<p>This opens the door to building a DFA once and reusing it for all queries, which is the key idea behind the paper.</p>
<p>There are still a few issues to solve before reaching this holy grail.</p>
<p>First of all, the length of the characteristic vector currently depends on the length of the query. But if you look closely, <code class="language-plaintext highlighter-rouge">transitions</code> yields a bunch of useless states for the values that go after <code class="language-plaintext highlighter-rouge">offset + D</code>. Also, we saw before that the offsets of a set of states all lie within a range of length <code class="language-plaintext highlighter-rouge">2D + 1</code>. We will therefore only need the values of the characteristic vector over a range of <code class="language-plaintext highlighter-rouge">3D + 1</code>.</p>
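<p>A minimal sketch of such a fixed-width characteristic vector could look like the following. The name <code class="language-plaintext highlighter-rouge">windowed_characteristic</code> and the choice of padding with <code class="language-plaintext highlighter-rouge">False</code> past the end of the query are assumptions made for this example.</p>

```python
def windowed_characteristic(query, c, offset, D=2):
    """Characteristic vector of c restricted to a window of 3D + 1
    positions starting at `offset`, padded with False past the query end."""
    width = 3 * D + 1
    return tuple(
        offset + k < len(query) and query[offset + k] == c
        for k in range(width)
    )


# With D=1 the window always has 4 entries, whatever the query length.
print(windowed_characteristic("banana", "a", 2, D=1))  # (False, True, False, True)
print(windowed_characteristic("banana", "a", 4, D=1))  # (False, True, False, False)
```

<p>The vector now has the same length for every query, so the alphabet of the automaton is finite: there are only <code class="language-plaintext highlighter-rouge">2 ** (3D + 1)</code> possible values.</p>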
<p>The second problem is that if we try and apply a <a href="https://en.wikipedia.org/wiki/Powerset_construction">powerset construction</a> blindly on this NFA, we will see that it is not really finite. This NFA actually has an infinite number of states: imagine that it has to handle queries of any size! Well, the trick here is to normalize our states into two parts:</p>
<ul>
<li>a global offset that is the minimum offset of the states</li>
<li>the shape of the shifted states (we already saw that there was around )</li>
</ul>
<p>With this parametric DFA, a transition tells you, given a “shape”, which shape to transition to, as well as how much you should add to the global offset.</p>
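<p>Here is a small sketch of that normalization step on its own, using a standalone <code class="language-plaintext highlighter-rouge">normalize</code> function and made-up sets of states. Two sets that sit at different positions in the query collapse to the same shape and only differ by their global offset.</p>

```python
def normalize(states):
    """Split a set of (offset, d) states into (global offset, shape)."""
    if not states:
        return (0, ())
    min_offset = min(offset for (offset, _) in states)
    # Sorting gives a canonical, hashable representation of the shape.
    shape = tuple(sorted((offset - min_offset, d) for (offset, d) in states))
    return (min_offset, shape)


print(normalize({(3, 1), (4, 2)}))    # (3, ((0, 1), (1, 2)))
print(normalize({(10, 1), (11, 2)}))  # (10, ((0, 1), (1, 2)))
```

<p>Since the shape is a sorted tuple, it can be used directly as a dictionary key when building the transition table.</p>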
<p>The implementation of these ideas is a tad tricky, so I am too lazy to detail the code step-by-step, but here is an implementation for reference.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">LevenshteinParametricDFA</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">max_D</span> <span class="o">=</span> <span class="n">D</span>
        <span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">chi</span><span class="p">):</span>
            <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span>
            <span class="k">yield</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
            <span class="k">yield</span> <span class="p">(</span><span class="n">offset</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
            <span class="k">for</span> <span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chi</span><span class="p">[</span><span class="n">offset</span><span class="p">:]):</span>
                <span class="k">if</span> <span class="n">val</span><span class="p">:</span>
                    <span class="k">yield</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">d</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">D</span> <span class="o">-</span> <span class="n">d</span>
        <span class="k">def</span> <span class="nf">simplify</span><span class="p">(</span><span class="n">states</span><span class="p">):</span>
            <span class="k">def</span> <span class="nf">implies</span><span class="p">(</span><span class="n">state1</span><span class="p">,</span> <span class="n">state2</span><span class="p">):</span>
                <span class="s">"""
                Returns true, if state1 implies state2
                """</span>
                <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="o">=</span> <span class="n">state1</span>
                <span class="p">(</span><span class="n">offset2</span><span class="p">,</span> <span class="n">D2</span><span class="p">)</span> <span class="o">=</span> <span class="n">state2</span>
                <span class="k">if</span> <span class="n">D2</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
                    <span class="k">return</span> <span class="bp">True</span>
                <span class="k">return</span> <span class="n">D</span> <span class="o">-</span> <span class="n">D2</span> <span class="o">&gt;=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">offset2</span> <span class="o">-</span> <span class="n">offset</span><span class="p">)</span>
            <span class="k">def</span> <span class="nf">is_useful</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
                <span class="k">for</span> <span class="n">s2</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
                    <span class="k">if</span> <span class="n">s</span> <span class="o">!=</span> <span class="n">s2</span> <span class="ow">and</span> <span class="n">implies</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="n">s</span><span class="p">):</span>
                        <span class="k">return</span> <span class="bp">False</span>
                <span class="k">return</span> <span class="bp">True</span>
            <span class="c1"># materialize the result: normalize() iterates over it more than once
</span>            <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="nb">filter</span><span class="p">(</span><span class="n">is_useful</span><span class="p">,</span> <span class="n">states</span><span class="p">))</span>
        <span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
            <span class="n">next_states</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
            <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">:</span>
                <span class="n">next_states</span> <span class="o">|=</span> <span class="nb">set</span><span class="p">(</span><span class="n">transitions</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">c</span><span class="p">))</span>
            <span class="k">return</span> <span class="n">simplify</span><span class="p">(</span><span class="n">next_states</span><span class="p">)</span>
        <span class="k">def</span> <span class="nf">enumerate_chi_values</span><span class="p">(</span><span class="n">width</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">width</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">yield</span> <span class="p">()</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">for</span> <span class="n">chi_value</span> <span class="ow">in</span> <span class="n">enumerate_chi_values</span><span class="p">(</span><span class="n">width</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
                    <span class="k">yield</span> <span class="p">(</span><span class="bp">False</span><span class="p">,)</span> <span class="o">+</span> <span class="n">chi_value</span>
                    <span class="k">yield</span> <span class="p">(</span><span class="bp">True</span><span class="p">,)</span> <span class="o">+</span> <span class="n">chi_value</span>
        <span class="n">width</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_D</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="n">chi_values</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">enumerate_chi_values</span><span class="p">(</span><span class="n">width</span><span class="p">))</span>
        <span class="p">(</span><span class="n">global_offset</span><span class="p">,</span> <span class="n">norm_states</span><span class="p">)</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">normalize</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">initial_states</span><span class="p">())</span>
        <span class="n">dfa</span> <span class="o">=</span> <span class="p">{</span><span class="n">norm_states</span><span class="p">:</span> <span class="p">{}}</span>
        <span class="n">yet_to_visit</span> <span class="o">=</span> <span class="p">[</span><span class="n">norm_states</span><span class="p">]</span>
        <span class="k">while</span> <span class="n">yet_to_visit</span><span class="p">:</span>
            <span class="n">current_state</span> <span class="o">=</span> <span class="n">yet_to_visit</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
            <span class="n">state_transitions</span> <span class="o">=</span> <span class="p">{}</span>
            <span class="k">for</span> <span class="n">chi</span> <span class="ow">in</span> <span class="n">chi_values</span><span class="p">:</span>
                <span class="n">new_states</span> <span class="o">=</span> <span class="n">step</span><span class="p">(</span><span class="n">chi</span><span class="p">,</span> <span class="n">current_state</span><span class="p">)</span>
                <span class="p">(</span><span class="n">min_offset</span><span class="p">,</span> <span class="n">norm_states</span><span class="p">)</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">new_states</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">norm_states</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dfa</span><span class="p">:</span>
                    <span class="n">dfa</span><span class="p">[</span><span class="n">norm_states</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
                    <span class="n">yet_to_visit</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">norm_states</span><span class="p">)</span>
                <span class="n">state_transitions</span><span class="p">[</span><span class="n">chi</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">min_offset</span><span class="p">,</span> <span class="n">norm_states</span><span class="p">)</span>
            <span class="c1"># store the transitions under the state we just expanded
</span>            <span class="n">dfa</span><span class="p">[</span><span class="n">current_state</span><span class="p">]</span> <span class="o">=</span> <span class="n">state_transitions</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dfa</span> <span class="o">=</span> <span class="n">dfa</span>
    <span class="k">def</span> <span class="nf">initial_states</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
        <span class="k">return</span> <span class="p">{(</span><span class="mi">0</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_D</span><span class="p">)}</span>
    <span class="k">def</span> <span class="nf">normalize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">states</span><span class="p">:</span>
            <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="p">())</span>
        <span class="n">min_offset</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">offset</span> <span class="k">for</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">_</span><span class="p">)</span> <span class="ow">in</span> <span class="n">states</span><span class="p">)</span>
        <span class="n">shifted_states</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span>
            <span class="nb">sorted</span><span class="p">([(</span><span class="n">offset</span> <span class="o">-</span> <span class="n">min_offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span>
                    <span class="k">for</span> <span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="ow">in</span> <span class="n">states</span><span class="p">]))</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">min_offset</span><span class="p">,</span> <span class="n">shifted_states</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">characteristic</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">offset</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">tuple</span><span class="p">(</span>
            <span class="n">query</span><span class="p">[</span><span class="n">offset</span> <span class="o">+</span> <span class="n">d</span><span class="p">]</span> <span class="o">==</span> <span class="n">c</span> <span class="k">if</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">d</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">query</span><span class="p">)</span> <span class="k">else</span> <span class="bp">False</span>
            <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_D</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">)</span>
    <span class="k">def</span> <span class="nf">step_all</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">s</span><span class="p">):</span>
        <span class="p">(</span><span class="n">global_offset</span><span class="p">,</span> <span class="n">norm_states</span><span class="p">)</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">normalize</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">initial_states</span><span class="p">())</span>
        <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">s</span><span class="p">:</span>
            <span class="n">chi</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">characteristic</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">global_offset</span><span class="p">)</span>
            <span class="p">(</span><span class="n">shift_offset</span><span class="p">,</span> <span class="n">norm_states</span><span class="p">)</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dfa</span><span class="p">[</span><span class="n">norm_states</span><span class="p">][</span><span class="n">chi</span><span class="p">]</span>
            <span class="n">global_offset</span> <span class="o">+=</span> <span class="n">shift_offset</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">global_offset</span><span class="p">,</span> <span class="n">norm_states</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">eval</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">input_string</span><span class="p">):</span>
        <span class="p">(</span><span class="n">global_offset</span><span class="p">,</span> <span class="n">final_state</span><span class="p">)</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">step_all</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">input_string</span><span class="p">)</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">local_offset</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="ow">in</span> <span class="n">final_state</span><span class="p">:</span>
            <span class="n">offset</span> <span class="o">=</span> <span class="n">local_offset</span> <span class="o">+</span> <span class="n">global_offset</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">query</span><span class="p">)</span> <span class="o">-</span> <span class="n">offset</span> <span class="o">&lt;=</span> <span class="n">d</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="n">param_dfa</span> <span class="o">=</span> <span class="n">LevenshteinParametricDFA</span><span class="p">(</span><span class="n">D</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">levenshtein</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">input_string</span><span class="p">):</span>
<span class="k">return</span> <span class="n">param_dfa</span><span class="p">.</span><span class="nb">eval</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">input_string</span><span class="p">)</span></code></pre></figure>
<p>The style is a bit weird, but I wanted to emphasize that all of the processing is done in the constructor, and that at eval time the class behaves like a regular automaton.</p>
<p>So what’s the catch?
Well, in a sense our automaton construction has a complexity of <code class="language-plaintext highlighter-rouge">O(1)</code> if we leave the preprocessing aside. The catch is in the eval function: for every input character, we need to evaluate what we called our characteristic function. Why is it not all that bad?</p>
<p>First, there are many ways to implement it in such a way that it is really cheap. I would be amazed if there weren’t any SSE methods to compute it. But that actually does not really matter.</p>
<p>In the process of building your DFA, you will need a way to map unicode codepoints to the alphabet that really matters: basically the letters in your query PLUS a symbol that represents letters that are not in your query. Building this alphabet and mapping it to values of characteristic vectors is similarly very cheap.
Sure, if we want to talk about complexity, that’s <code class="language-plaintext highlighter-rouge">O(nD)</code>.</p>
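<p>To illustrate why this mapping is cheap, here is a minimal sketch, written in JavaScript rather than the Python used above. The names <code class="language-plaintext highlighter-rouge">buildCharMasks</code> and <code class="language-plaintext highlighter-rouge">characteristic</code> are made up for this example, and it assumes a query of at most 31 characters so that each mask fits in a 32-bit integer. Every character that is absent from the query falls back to the same all-false vector.</p>

```javascript
// Map each distinct query character to a bitmask of the positions where it occurs.
// Assumption: query.length <= 31, so every mask fits in a 32-bit integer.
// Characters are indexed by UTF-16 code unit, which is enough for a sketch.
function buildCharMasks(query) {
  const masks = new Map();
  for (let i = 0; i < query.length; i++) {
    const c = query[i];
    masks.set(c, (masks.get(c) || 0) | (1 << i));
  }
  return masks;
}

// Characteristic vector of `c` over the window query[offset .. offset + width),
// returned as an array of booleans. Positions past the end of the query are false.
function characteristic(masks, c, offset, width) {
  const mask = (masks.get(c) || 0) >>> offset;
  const chi = [];
  for (let d = 0; d < width; d++) {
    chi.push(((mask >>> d) & 1) === 1);
  }
  return chi;
}

const masks = buildCharMasks("banana");
console.log(characteristic(masks, "a", 0, 7)); // window of width 3 * D + 1 with D = 2
console.log(characteristic(masks, "z", 0, 7)); // "z" is not in the query
```

<p>With such a table, evaluating the characteristic function at eval time boils down to one lookup and one shift per input character.</p>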
<h1 id="lucenes-implementation">Lucene’s implementation</h1>
<p>Lucene has an <a href="https://github.com/apache/lucene-solr/blob/72aa5784ecd7024dce7599c358b658bed4b31596/lucene/core/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java">implementation</a> of this algorithm. There are a bunch of interesting things and one quirk in its implementation.</p>
<p>First the result of the preprocessing is directly serialized into the java code. That approach will shave a few ms off the startup of the library.</p>
<p>Also, the parametric Levenshtein automaton is not used directly, but is rather used to construct a DFA. This is also the approach that I take in my current project.</p>
<p>But this DFA works on unicode characters, while their dictionary structure is encoded in UTF-8. Rather than converting UTF-8 on the fly, they prefer to convert the DFA itself to UTF-8, which I found pretty nifty.</p>
<p>So where is the quirk? Well the algorithm used to build the DFA is very strange.</p>
<p>Rather than just browsing the reachable states of the parametric automaton, it shoves in all of the parametric states and all of their transitions.
This is hurting performance pretty badly, but I assume automaton creation is already fast enough for most users’ needs.</p>
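<p>To make the difference concrete, here is a generic sketch of a construction that only materializes reachable states. It is not Lucene’s code: <code class="language-plaintext highlighter-rouge">buildReachableDfa</code>, <code class="language-plaintext highlighter-rouge">step</code> and the toy automaton are hypothetical stand-ins for the parametric automaton.</p>

```javascript
// Breadth-first exploration: only states reachable from `initial` get materialized.
// `step(state, symbol)` is a hypothetical transition function returning the next state.
function buildReachableDfa(initial, alphabet, step) {
  const transitions = new Map(); // state -> Map(symbol -> next state)
  const queue = [initial];
  transitions.set(initial, new Map());
  while (queue.length > 0) {
    const state = queue.shift();
    for (const symbol of alphabet) {
      const next = step(state, symbol);
      if (!transitions.has(next)) {
        transitions.set(next, new Map());
        queue.push(next);
      }
      transitions.get(state).set(symbol, next);
    }
  }
  return transitions;
}

// Toy automaton over the states 0..9 in which only even states are reachable from 0.
const toyDfa = buildReachableDfa(0, ["a", "b"], (state, symbol) =>
  symbol === "a" ? (state + 2) % 10 : state
);
console.log(toyDfa.size); // 5 states materialized out of 10 possible ones
```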
<h1 id="conclusion">Conclusion</h1>
<p>I hope this blog post will help people who have to implement the construction of Levenshtein automata in an efficient manner.</p>
<p>In my next post, I will talk about extending the concept of the Levenshtein automaton, and building this parametric DFA will suddenly become crucial.</p>
<script src="/js/levenshtein/demo.js"></script>
Of performance tricks for the webprogrammer2014-03-04T00:00:00+00:00https://fulmicoton.com/posts/fattable<h1 id="fattable">Fattable</h1>
<p>Quite recently, I released under the MIT license a JavaScript library for displaying large tables, called <a href="http://fulmicoton.com/fattable/index2.html">fattable</a>. The project got an unexpected amount of good publicity: many tweets and, as of this day, <a href="https://github.com/fulmicoton/fattable">270 github stars</a>, which is very rewarding!</p>
<p>Everything started with a problem we needed to address at <a href="http://www.dataiku.com">Dataiku</a>: our product gives data scientists a nice view of their dataset as they go through their data preparation. The dataset was displayed as an HTML table using the popular UI pattern of infinite scroll.</p>
<p>When the user scrolled down past the last row, an AJAX call would populate the table with 100 extra rows. We had two issues, however. First, while the tool worked like a charm with regular datasets, some of the datasets our customers deal with have close to a thousand columns. For these datasets, our UI was getting sluggish to the point of ruining the user experience.</p>
<p>Second, infinite scroll makes it impossible for the user to jump straight to the middle of the dataset to quickly sample the data. Browsing rapidly through data is a nice-to-have feature.</p>
<h1 id="if-js-is-the-new-assembly-code-the-browser-is-your-os-and-your-hardware">If JS is the new assembly code, the browser is your OS and your hardware</h1>
<p>There is a popular saying that <a href="http://www.hanselman.com/blog/JavaScriptIsAssemblyLanguageForTheWebSematicMarkupIsDeadCleanVsMachinecodedHTML.aspx">Javascript is the Assembly Language for the web</a>. I could not agree more with this statement, and my journey coding fattable led me to think that in addition, browsers are your hardware.</p>
<p>I’m not exactly specialized in front-end programming, but these days, that’s what I do.
In backend programming or in scientific computing, optimization typically shreds apart, one by one, all the nice abstractions that your OS and your hardware offer. For instance, when I started as a software engineer, I thought of RAM as a uniformly addressed memory universe to which the CPU had random access for free. One day, I noticed that multiplying two square matrices A and B was way slower than multiplying the transposition of A by B. This phenomenon is well known in linear algebra libraries, and is due to your CPU cache. I had just experienced an abstraction leak.</p>
<p>As a software engineer, optimization gives you the excitement of a physicist. As you gain experience you get
a better understanding of how your hardware or OS works and build your own new mental models or abstractions. The whole process is very close to the scientific method.</p>
<p>In front-end programming, the browser is your OS, the browser is your hardware, the browser is Mother Nature.</p>
<h1 id="how-browsers-render-your-page">How do browsers render your page?</h1>
<p>“The more DOM elements displayed, the worse the performance” is true most of the time.
But let me write here in detail what I understand about browser rendering. <strong>Don’t hold it as the truth, as it is just a set of beliefs I accumulated from a mixture of experiments and reading about browsers</strong>.</p>
<h2 id="one-paint-per-event-loop-">One paint per event loop …</h2>
<p>Javascript in your browser as well as in nodeJS is executed in a single thread. Events are queued and executed by a so-called event loop.</p>
<p>CSS transitions and animations aside, your browser will paint a new frame at most once per loop, after your javascript code has been executed.</p>
<p>To check that we can run <a href="http://jsfiddle.net/w9g4u/">Experiment 1</a>.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"> <span class="nb">window</span><span class="p">.</span><span class="nx">move</span> <span class="o">=</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">$square</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">"</span><span class="s2">#square</span><span class="dl">"</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="nx">i</span><span class="o"><</span><span class="mi">100000</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">$square</span><span class="p">.</span><span class="nx">css</span><span class="p">(</span><span class="dl">"</span><span class="s2">top</span><span class="dl">"</span><span class="p">,</span> <span class="nx">i</span><span class="o">/</span><span class="mi">1000</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>When clicking on the link, the function takes a couple of seconds to run. Rather than seeing the red square move smoothly, the square just stays at the same place during the code execution, and only appears at its final destination when the javascript has finished running.</p>
<h2 id="-but-possibly-many-reflows">… but possibly many reflows</h2>
<p>But now, what happens when JS tries to access some layout-related attribute
within the loop?</p>
<p>To check that, we run a <a href="http://jsfiddle.net/QZMt4/">second experiment</a>. We put two <code class="language-plaintext highlighter-rouge">div</code>s with <code class="language-plaintext highlighter-rouge">float: left;</code> next to each other and grow the left one, so that the
right <code class="language-plaintext highlighter-rouge">div</code> should mechanically move to the right.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"> <span class="nb">window</span><span class="p">.</span><span class="nx">grow</span> <span class="o">=</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">$left</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">"</span><span class="s2">div.left</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">$right</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">"</span><span class="s2">div.right</span><span class="dl">"</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="nx">i</span><span class="o"><</span><span class="mi">100</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">$left</span><span class="p">.</span><span class="nx">css</span><span class="p">(</span><span class="dl">"</span><span class="s2">width</span><span class="dl">"</span><span class="p">,</span> <span class="nx">i</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">$right</span><span class="p">.</span><span class="nx">position</span><span class="p">())</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>The console outputs all the intermediary positions of the right container:
while the browser avoided painting a new frame, it did actually update
the layout many times within the loop.</p>
<p>The truth is that there are two big distinct phases in browser rendering.
These two phases are called <code class="language-plaintext highlighter-rouge">paint</code> and <code class="language-plaintext highlighter-rouge">reflow</code> respectively.</p>
<h2 id="reflow">Reflow</h2>
<p>Reflow consists in computing the position of your elements as so many (top, left, width, height) boxes.</p>
<p>It is called reflow because of the way it is computed. HTML was born at a time when internet
connections were pretty slow. My first modem was a 14400bps one. That’s right: that’s a max of
1.8 kB/s! At that time, everybody appreciated the fact that HTML pages were rendered partially as
they were getting downloaded. For this reason HTML was built upon the following golden rule:
<strong>the size and position of a DOM element should not be affected by the stuff coming after</strong>.</p>
<p>HTML elements were therefore appended one by one, hence the image of a “flow”.</p>
<p>There can be more than one reflow per event loop. For instance, it may be triggered by a piece of javascript asking the browser for the value of a layout-related property, or at the end of the event loop before paint.</p>
<p>Contrary to what I read in many places, the browser is rather smart when it comes to avoiding computing reflow,
and asking twenty times for the position of DOM elements will not necessarily end up triggering twenty reflows.</p>
<p>It relies on a dirty bit strategy to know whether it should trigger a reflow. Basically, the browser will mark your DOM as dirty if you add new elements or change the CSS properties of some of them.
It will not trigger a reflow right away, but will wait for the next read operation to happen.</p>
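<p>Here is a toy model of that dirty bit strategy, as I understand it. It is not browser code: the <code class="language-plaintext highlighter-rouge">ToyLayout</code> class and its methods are invented for the illustration, and it only counts how many reflows a given access pattern would trigger under this model.</p>

```javascript
// Toy model of the dirty-bit strategy: writes only mark the layout as dirty,
// and a reflow is computed lazily on the next read.
class ToyLayout {
  constructor() {
    this.width = 0;
    this.dirty = false;
    this.reflowCount = 0;
  }
  // A "write": cheap, it just invalidates the layout.
  setWidth(width) {
    this.width = width;
    this.dirty = true;
  }
  // A "read": forces a reflow only if something changed since the last one.
  readPosition() {
    if (this.dirty) {
      this.reflowCount += 1;
      this.dirty = false;
    }
    return this.width;
  }
}

// Interleaving writes and reads forces one reflow per iteration...
const interleaved = new ToyLayout();
for (let i = 0; i < 100; i++) {
  interleaved.setWidth(i);
  interleaved.readPosition();
}

// ...while batching the writes before reading needs a single reflow,
// and repeated reads with no write in between cost nothing extra.
const batched = new ToyLayout();
for (let i = 0; i < 100; i++) {
  batched.setWidth(i);
}
for (let i = 0; i < 20; i++) {
  batched.readPosition();
}

console.log(interleaved.reflowCount, batched.reflowCount); // 100 1
```

<p>The practical takeaway, under this model, is to group DOM writes together and to read layout properties only once the writes are done.</p>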
<p>The cost of a reflow depends on many things. Some elements, tables in particular, are especially expensive. But in the end the rule of thumb is:
<strong>Reflow’s cost is linear in the number of elements in your DOM with display != none.</strong></p>
<h2 id="paint">Paint</h2>
<p>The repaint phase happens at most once per JS loop, or as you are scrolling. It actually computes the color of the pixels visible on your screen.</p>
<p>Repaint’s cost depends on the elements that are actually visible on the screen, and on the CSS effects you might have applied to them.</p>
<p><strong>Repaint’s cost only depends on what is visible on your screen.</strong></p>
<h2 id="how-do-we-make-things-faster">How do we make things faster?</h2>
<p>There are countless tricks to optimize your browser speed.</p>
<p>First of all, make sure that your JS code is not triggering more reflows than required.
Most of the time one reflow per event loop is enough.</p>
<p>You might also “help” reflow by explicitly making an element’s content irrelevant to the layout. For instance using <code class="language-plaintext highlighter-rouge">overflow:hidden</code> may help.</p>
<p>Shaving milliseconds off the paint phase is a bit more tricky. If you are on a tight budget, avoid using crazy combinations of blur / opacity.</p>
<p>Another nice trick is to disable hover effects while scrolling using <code class="language-plaintext highlighter-rouge">pointer-events: none</code>, as documented in <a href="http://www.thecssninja.com/javascript/pointer-events-60fps">this blogpost of css ninja</a>.</p>
<h2 id="what-about-fattable">What about fattable?</h2>
<p>In our case, reflow was clearly the culprit. We had to display tens of thousands of DOM elements and our interaction with the table was triggering very expensive reflows.
The key for us was to go off the DOM. The idea is to make sure that only the elements that are visible on the screen are within the DOM at any given moment.</p>
<p>Time to pull out the big guns. You need to hook a JS callback on scroll events, remove from the DOM
the elements that just disappeared, and append to the DOM the elements that are now visible.</p>
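<p>The bookkeeping behind such a scroll callback can be sketched as follows. This is not fattable’s actual code: the function names are hypothetical, and the sketch assumes that all rows share the same fixed height in pixels.</p>

```javascript
// Compute which rows must be in the DOM for a given scroll position.
// Assumption: every row has the same fixed height (in pixels).
function visibleRowRange(scrollTop, viewportHeight, rowHeight, rowCount) {
  const first = Math.max(0, Math.floor(scrollTop / rowHeight));
  const last = Math.min(
    rowCount - 1,
    Math.floor((scrollTop + viewportHeight - 1) / rowHeight)
  );
  return { first, last };
}

// Given the previous and the new range, list the rows to remove and the rows to append.
function diffRanges(previous, next) {
  const toRemove = [];
  const toAppend = [];
  for (let row = previous.first; row <= previous.last; row++) {
    if (row < next.first || row > next.last) toRemove.push(row);
  }
  for (let row = next.first; row <= next.last; row++) {
    if (row < previous.first || row > previous.last) toAppend.push(row);
  }
  return { toRemove, toAppend };
}

const rangeBefore = visibleRowRange(0, 100, 20, 1000); // rows 0 to 4
const rangeAfter = visibleRowRange(50, 100, 20, 1000); // rows 2 to 7
console.log(diffRanges(rangeBefore, rangeAfter));
```

<p>In a real scroll handler, <code class="language-plaintext highlighter-rouge">toRemove</code> and <code class="language-plaintext highlighter-rouge">toAppend</code> would drive the actual DOM removals and insertions, so the number of elements in the DOM stays proportional to the viewport and not to the dataset.</p>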
<p><img src="/images/fattable/captainplanet.png" alt="Chrome Inspector" /></p>
<h1 id="recycling-saves-the-dolphins">Recycling saves the dolphins</h1>
<p>When scrolling fast, such a strategy may stress javascript’s garbage collector. This will result in a small
stutter from time to time. A simple way to address the problem is to recycle your elements. In fattable it is done
explicitly, but the usual pattern for that is to use a <a href="http://en.wikipedia.org/wiki/Object_pool_pattern">pool pattern</a>.</p>
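<p>For reference, a minimal object pool could look like the sketch below. The <code class="language-plaintext highlighter-rouge">Pool</code> class is made up for this example and is not what fattable does internally. Plain objects stand in for DOM elements so that the snippet also runs outside of a browser.</p>

```javascript
// Minimal object pool: released objects are kept and handed back on the next acquire,
// so fast scrolling does not keep allocating (and discarding) new objects.
class Pool {
  constructor(create) {
    this.create = create; // factory, only called when the pool is empty
    this.free = [];
  }
  acquire() {
    return this.free.length > 0 ? this.free.pop() : this.create();
  }
  release(item) {
    this.free.push(item);
  }
}

let created = 0;
// In a browser, the factory would be something like () => document.createElement("div").
const cellPool = new Pool(() => ({ id: created++, text: "" }));

const cellA = cellPool.acquire();
cellPool.release(cellA); // the cell scrolled out of view
const cellB = cellPool.acquire(); // reused for a cell that scrolled into view
console.log(cellA === cellB, created); // true 1
```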
<h1 id="how-do-i-test-this-out">How do I test this out?</h1>
<p>Chrome inspector’s timeline/frame view is extremely helpful in your quest for performance.</p>
<p><img src="/images/fattable/inspector.png" alt="Chrome Inspector" /></p>
<p>Yellow is your javascript cost, purple is reflow, and green is your paint.
Check out <a href="https://www.youtube.com/watch?v=Vp524yo0p44">this video</a> from Paul Irish to learn more about its usage.</p>
<!--
# onScroll is not always synchronous
This technique require to bind an event to onScroll to be able to add and remove elements as the user
scrolls. On most recent browsers, this is very simple. Your callback will be called before render, and you will be able to
do all the processing you need before the render. If your callback is slow, less frames will be painted, and the scrollbar will be somewhat late compared to the mouse pointer moving it.
I however observed that things were not working quite that way on some webkit navigator. Most notably on safari versions and Chromium v28. The debate is more detailed on [stackoverflow](http://stackoverflow.com/questions/21830056/onscroll-fired-after-or-before-repaint). This kind of behavior is catastrophic for our use case. While the user is scrolling, part of the cell will appear missing.
Because of that, fattable relies on scroll proxies. Two big div are hosting respectively a vertical and an horizontal scroll.
The onscroll event is hooked on them, and then applied to our real container.
# Binding things on mouse move is terrible on firefox
When -->
Of how to implement transient in JavaScript2013-11-17T00:00:00+00:00https://fulmicoton.com/posts/transient<h1 id="whats-transient-anyway-">What’s transient anyway ?</h1>
<p><img src="/images/transient/ninja1.jpg" alt="Ninjavascript" /></p>
<p>Java programmers are probably familiar with the concept of <em>transient</em> as it is a keyword in this language. By marking an object property as transient, you tell Java that this property should be skipped in serialization.</p>
<p>While this kind of functionality should arguably not be part of a programming language, but live in its standard library (as a decorator maybe), last week at work, I kind of wished Javascript had such a functionality.</p>
<p>We are using AngularJS for our UI, and our UI model had some extra properties that we did not want to persist on the server. There are a couple of different ways to address this problem when it happens:</p>
<ul>
<li>Write a method extracting the part of the data you actually want to send to the server. <code class="language-plaintext highlighter-rouge">JSON.stringify(scope.model.getData())</code></li>
<li>Split your model into two objects, one holding the part that will be persisted, and one with the part that will not. <code class="language-plaintext highlighter-rouge">JSON.stringify(scope.model.persisted)</code> This can be especially tricky if your model is within a collection as it was in our case.</li>
<li>Go NinJavascript and implement the transient keyword in Javascript !</li>
</ul>
<p><em>To tell the truth, I just went with solution 1. While tricks are exciting, they can rapidly make you a bad coworker, as magic tends to obfuscate code.</em></p>
<p>Anyway, let’s state formally our …</p>
<h1 id="javascript-puzzle-of-the-day">Javascript puzzle of the day</h1>
<p>Implement the function called <em>transient</em> such that the following script does not print any error on your console.
Alternatively you can use this <a href="http://jsfiddle.net/9RpV9/">JsFiddle</a>.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">transient</span><span class="p">(</span><span class="nx">obj</span><span class="p">,</span> <span class="nx">key</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ... you need to implement this</span>
<span class="p">}</span>
<span class="c1">// ... while the following should stay untouched</span>
<span class="kd">function</span> <span class="nx">assert</span><span class="p">(</span><span class="nx">predicate</span><span class="p">,</span> <span class="nx">description</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">predicate</span> <span class="o">!==</span> <span class="kc">true</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">"</span><span class="s2">FAILED</span><span class="dl">"</span><span class="p">,</span> <span class="nx">description</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nx">SomeObject</span><span class="p">()</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">someProp</span> <span class="o">=</span> <span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">José Bové</span><span class="dl">"</span> <span class="p">}</span>
<span class="k">this</span><span class="p">.</span><span class="nx">transientProp</span> <span class="o">=</span> <span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Aimé Jacquet</span><span class="dl">"</span> <span class="p">}</span>
<span class="p">}</span>
<span class="kd">var</span> <span class="nx">obj</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">SomeObject</span><span class="p">();</span>
<span class="nx">transient</span><span class="p">(</span><span class="nx">obj</span><span class="p">,</span> <span class="dl">"</span><span class="s2">transientProp</span><span class="dl">"</span><span class="p">)</span>
<span class="kd">var</span> <span class="nx">obj2</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">SomeObject</span><span class="p">();</span>
<span class="nx">transient</span><span class="p">(</span><span class="nx">obj2</span><span class="p">,</span> <span class="dl">"</span><span class="s2">transientProp</span><span class="dl">"</span><span class="p">)</span>
<span class="nx">obj2</span><span class="p">.</span><span class="nx">transientProp</span><span class="p">.</span><span class="nx">age</span> <span class="o">=</span> <span class="mi">53</span><span class="p">;</span>
<span class="nx">assert</span><span class="p">(</span><span class="nx">obj</span><span class="p">.</span><span class="nx">someProp</span> <span class="o">!==</span> <span class="kc">undefined</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">someProp should stay accessible</span><span class="dl">"</span> <span class="p">)</span>
<span class="nx">assert</span><span class="p">(</span><span class="nx">obj</span><span class="p">.</span><span class="nx">transientProp</span> <span class="o">!==</span> <span class="kc">undefined</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">transientProp should stay accessible</span><span class="dl">"</span><span class="p">)</span>
<span class="nx">assert</span><span class="p">(</span><span class="nx">obj</span><span class="p">.</span><span class="nx">transientProp</span><span class="p">.</span><span class="nx">age</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">transientProp should not be shared between objects</span><span class="dl">"</span><span class="p">)</span>
<span class="nx">assert</span><span class="p">(</span><span class="nx">obj2</span><span class="p">.</span><span class="nx">transientProp</span><span class="p">.</span><span class="nx">age</span> <span class="o">===</span> <span class="mi">53</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">transientProp should not be shared between objects</span><span class="dl">"</span><span class="p">)</span>
<span class="nx">assert</span><span class="p">(</span><span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">obj</span><span class="p">)).</span><span class="nx">someProp</span> <span class="o">!==</span> <span class="kc">undefined</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">someProp should still be serialized</span><span class="dl">"</span><span class="p">)</span>
<span class="nx">assert</span><span class="p">(</span><span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">obj</span><span class="p">)).</span><span class="nx">transientProp</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">transientProp should not be serialized</span><span class="dl">"</span><span class="p">)</span></code></pre></figure>
<h1 id="the-solution">The solution</h1>
<p><strong>Disclaimer</strong> <em>Some adaptation is needed to make the following solution compatible with IE, as it relies heavily on <code class="language-plaintext highlighter-rouge">__proto__</code>. I won’t do it here as it would make the trick harder to read.</em></p>
<p>The idea relies on the fact that <code class="language-plaintext highlighter-rouge">JSON.stringify</code> only serializes an object’s own properties, and ignores those it has access to through prototypal inheritance.</p>
<p>But what is JS’s prototypal inheritance all about?</p>
<p><br />
<img src="/images/transient/shooting-stars.jpg" alt="Ninjavascript" /></p>
<center><b>Linked lists</b>, by Jean-Francois Millet (1814 - 1875)</center>
<p><br />
<br /></p>
<p>Well, prototypal inheritance is just about linked lists. Every javascript object belongs to a <a href="http://en.wikipedia.org/wiki/Linked_list">linked list</a>. The reference leading to the next object in this linked list is explicitly accessible via <code class="language-plaintext highlighter-rouge">yourObj.__proto__</code> on most browsers (sorry for IE).</p>
<p>When looking for an object’s property via <code class="language-plaintext highlighter-rouge">obj.myattr</code> or <code class="language-plaintext highlighter-rouge">obj["myattr"]</code>, a JS interpreter will first check if <code class="language-plaintext highlighter-rouge">obj</code> has a property <em>of its own</em> named <code class="language-plaintext highlighter-rouge">myattr</code>. If it doesn’t, the interpreter will look it up recursively in the next element of the prototype chain, until it finds the property or reaches the end of the prototype chain.</p>
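<p>A few lines are enough to see both behaviours side by side: the lookup that walks up the chain, and <code class="language-plaintext highlighter-rouge">JSON.stringify</code> that does not. The object names below are arbitrary, and the snippet uses the standard <code class="language-plaintext highlighter-rouge">Object.create</code> to set the prototype.</p>

```javascript
// `layer` sits between `chained` and Object.prototype in the prototype chain.
const layer = { inherited: "found on the prototype" };
const chained = Object.create(layer); // the prototype of `chained` is `layer`
chained.own = "found on the object itself";

console.log(chained.own); // own property
console.log(chained.inherited); // resolved by walking up the chain
console.log(Object.prototype.hasOwnProperty.call(chained, "inherited")); // false
console.log(JSON.stringify(chained)); // {"own":"found on the object itself"}
```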
<p>This mechanism is mostly used for inheritance purposes. In that case, an instance’s prototype points to its class’s prototype, which in turn points to the prototype of its parent class.</p>
<p>But there are other uses to the prototype chain. For instance it brilliantly backs up the concept of child-scope in <a href="http://angularjs.org/">angularJS</a>.</p>
<p>In our problem, we use it to dynamically add an extra prototype layer to host the transient properties of our object. This layer will be unique for each instance of the object, which is kind of unusual.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">transient</span><span class="p">(</span><span class="nx">obj</span><span class="p">,</span> <span class="nx">k</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">obj</span><span class="p">.</span><span class="nx">hasOwnProperty</span><span class="p">(</span><span class="nx">k</span><span class="p">))</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">v</span> <span class="o">=</span> <span class="nx">obj</span><span class="p">[</span><span class="nx">k</span><span class="p">]</span>
<span class="k">if</span> <span class="p">(</span><span class="k">typeof</span> <span class="nx">v</span> <span class="o">!=</span> <span class="dl">"</span><span class="s2">object</span><span class="dl">"</span><span class="p">)</span> <span class="p">{</span>
<span class="k">throw</span> <span class="dl">"</span><span class="s2">Does not work well with primitive types.</span><span class="dl">"</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">delete</span> <span class="nx">obj</span><span class="p">[</span><span class="nx">k</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">obj</span><span class="p">.</span><span class="nx">__proto__</span><span class="p">.</span><span class="nx">__transientninja__</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">obj</span><span class="p">.</span><span class="nx">__proto__</span> <span class="o">=</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">__proto__</span><span class="dl">"</span><span class="p">:</span> <span class="nx">obj</span><span class="p">.</span><span class="nx">__proto__</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">__transientninja__</span><span class="dl">"</span><span class="p">:</span> <span class="kc">true</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nx">obj</span><span class="p">.</span><span class="nx">__proto__</span><span class="p">[</span><span class="nx">k</span><span class="p">]</span> <span class="o">=</span> <span class="nx">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
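<p>For comparison, here is a variant of the same trick written against the standardized prototype functions (<code class="language-plaintext highlighter-rouge">Object.getPrototypeOf</code>, <code class="language-plaintext highlighter-rouge">Object.create</code>, <code class="language-plaintext highlighter-rouge">Object.setPrototypeOf</code>) instead of <code class="language-plaintext highlighter-rouge">__proto__</code>. Take it as a sketch: <code class="language-plaintext highlighter-rouge">transientStd</code> is a name I made up, it differs slightly from the version above (it rejects <code class="language-plaintext highlighter-rouge">null</code> and checks the marker as an own property), and I make no claim about which browsers support it.</p>

```javascript
// Variant of transient() built on standard prototype functions instead of __proto__.
function transientStd(obj, key) {
  if (!Object.prototype.hasOwnProperty.call(obj, key)) return;
  const value = obj[key];
  if (value === null || typeof value !== "object") {
    throw new Error("Only object values are supported.");
  }
  delete obj[key];
  let layer = Object.getPrototypeOf(obj);
  if (!Object.prototype.hasOwnProperty.call(layer, "__transientninja__")) {
    // Insert a fresh, per-instance layer between obj and its former prototype.
    layer = Object.create(layer);
    Object.defineProperty(layer, "__transientninja__", { value: true });
    Object.setPrototypeOf(obj, layer);
  }
  layer[key] = value;
}

const user = { name: "José Bové", cache: { hits: 3 } };
transientStd(user, "cache");
console.log(user.cache.hits); // 3, still reachable through the prototype chain
console.log(JSON.stringify(user)); // {"name":"José Bové"}
```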
<p>What other use of the prototype chain can you think of?</p>
Of running multiple regexp at once2013-10-23T00:00:00+00:00https://fulmicoton.com/posts/multiregexp<h1 id="new-job-new-life">New job, new life</h1>
<p><img src="/slides/tokyo_webmining/kidlab.jpg" width="400" /></p>
<p>I changed jobs! I recently joined <a href="http://www.dataiku.com/">Dataiku</a>. We’re creating the perfect Data Science Platform. And so far, it has been pretty awesome… By the way, we are still recruiting, so if you are looking for a job in a top-notch tech startup in Paris, drop me an email: paul.masurel at dataiku.com.</p>
<p>Back to today’s subject. Last week I was discussing with a colleague the painful lack of a Java equivalent of <a href="https://code.google.com/p/re2/">re2</a>.</p>
<p>re2 is a regular expression matching library open sourced by Google, and it is blazing fast. It also makes it possible to compile several regular expressions together, which we might have a use for at Dataiku.
Basically, matching N patterns against a string of length L has a complexity linear in L with <code class="language-plaintext highlighter-rouge">re2</code>. Yes, you read that right. It is theoretically independent of the number of patterns.</p>
<p><a href="https://twitter.com/sylvainutard/status/390378369168461824">A friend pointed me to a library</a> for manipulating deterministic finite-state automata in Java: <a href="http://www.brics.dk/automaton/">dk.brics.automaton</a>. So guess what I did last weekend? I implemented the part that compiles several patterns together. You can get it and use it (MIT License) on <a href="https://github.com/poulejapon/multiregexp">github</a>.</p>
<p>Using it is as simple as :</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="nc">MultiPattern</span> <span class="n">multiPattern</span> <span class="o">=</span> <span class="nc">MultiPattern</span><span class="o">.</span><span class="na">compile</span><span class="o">(</span>
<span class="s">"ab+"</span><span class="o">,</span> <span class="c1">// 0</span>
<span class="s">"abc+"</span><span class="o">,</span> <span class="c1">// 1</span>
<span class="s">"ab?c"</span><span class="o">,</span> <span class="c1">// 2</span>
<span class="s">"v"</span><span class="o">,</span> <span class="c1">// 3</span>
<span class="s">"v.*"</span><span class="o">,</span> <span class="c1">// 4</span>
<span class="s">"(def)+"</span> <span class="c1">// 5</span>
<span class="o">);</span>
<span class="kt">int</span><span class="o">[]</span> <span class="n">matching</span> <span class="o">=</span> <span class="n">multiPattern</span><span class="o">.</span><span class="na">match</span><span class="o">(</span><span class="s">"abc"</span><span class="o">);</span> <span class="c1">// return {1, 2}</span></code></pre></figure>
<p>But how does it work?</p>
<h1 id="regular-expressions-">Regular expressions …</h1>
<p><img src="/images/regexp/xkcd.png" alt="xkcd" />
<small>(source: <a href="http://xkcd.com/208/">XKCD</a> )</small></p>
<p>Regular expressions are by far the most successful DSL ever. While most programmers have mastered their use, it is pretty useful to understand how they work. First, a good low-level understanding helps when dealing with regexp-related performance bottlenecks, and second, what you’ll find under the hood is awesome.</p>
<p>Leaving Perl regular expressions aside for the moment, a regular expression defines what is called a <a href="http://en.wikipedia.org/wiki/Formal_language_theory">formal language</a>. Basically, it is a boolean function that says whether a string matches or not. Not all such functions can be expressed as regular expressions.
For instance, <code class="language-plaintext highlighter-rouge">strings that have as many a's as b's</code> cannot be written as a regular expression.</p>
<p>Actually, a formal language that can be described with a regular expression is called, tautologically, a regular language in the <a href="http://en.wikipedia.org/wiki/Regular_language">Chomsky hierarchy</a>.</p>
<p>But let’s stay practical. What happens when I try to match a string with <code class="language-plaintext highlighter-rouge">.*ab</code> ?</p>
<h1 id="-are-all-about-automata">.. are all about automata</h1>
<p>The regular expression is parsed and compiled into the following automaton.</p>
<p><img src="/images/regexp/some_ab.png" alt="Nondeterministic automaton matching .*ab" /></p>
<p>How do we read this thing? A finite automaton has a finite number of states: those are the three circles. State 1 is our starting state.
In the beginning it is the only active state. When trying to match a string against a regular expression, we just scan through the characters of the string, and update the set of active states by going through all of the active states and following the arrows matching the current character.</p>
<p>Once the whole string has been scanned, deciding whether the string matches is just a matter of looking at the states that are still active.</p>
<p>Here only state 3 is marked as valid. If one of the active states is valid, the regular expression matches.</p>
<p>For instance, when matching the string <code class="language-plaintext highlighter-rouge">aab</code>, the automaton starts with <code class="language-plaintext highlighter-rouge">{1}</code> and successively has the following states activated: <code class="language-plaintext highlighter-rouge">{1,2}</code>, <code class="language-plaintext highlighter-rouge">{1,2}</code> and finally <code class="language-plaintext highlighter-rouge">{1,3}</code>. State 3 is valid and therefore the string matches. Notice how more than one state was activated at the same time. This is why such an automaton is called nondeterministic.</p>
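<p>The scan described above can be sketched in a few lines of Python. This is only an illustration: the encoding of the three states in <code class="language-plaintext highlighter-rouge">step</code> is an assumption read off the description (state 1 loops on any character, <code class="language-plaintext highlighter-rouge">a</code> leads from 1 to 2, <code class="language-plaintext highlighter-rouge">b</code> leads from 2 to 3), and the function names are made up for the example.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Hypothetical encoding of the three-state automaton for ".*ab".
START, VALID = 1, 3

def step(active, char):
    """Return the set of states active after reading one character."""
    following = set()
    for state in active:
        if state == 1:
            following.add(1)      # the ".*" loop accepts any character
            if char == "a":
                following.add(2)
        elif state == 2 and char == "b":
            following.add(3)
    return following

def trace(string):
    """List the successive sets of active states, starting from {1}."""
    active = {START}
    history = [active]
    for char in string:
        active = step(active, char)
        history.append(active)
    return history

def matches(string):
    # The string matches if the valid state is active at the end.
    return VALID in trace(string)[-1]

print([sorted(states) for states in trace("aab")])
# [[1], [1, 2], [1, 2], [1, 3]]
print(matches("aab"), matches("aba"))
# True False</code></pre></figure>
<p>The cost of each character is proportional to the number of active states, which is why a large nondeterministic automaton can be slow to run.</p>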
<p>On the other hand, a deterministic automaton can have only one state activated at a time. Deterministic automata perform better because only one arrow is followed per character. Their running time is straightforwardly linear in the number of characters.</p>
<p>With nondeterministic automata, things can get hairy.</p>
<h1 id="powerset-transformation-to-the-rescue">Powerset transformation to the rescue</h1>
<p>For this very reason, it is very often a good idea to convert our nondeterministic finite automaton into a deterministic one. This is done by using the so-called powerset transformation. The idea is to consider the <a href="http://en.wikipedia.org/wiki/Power_set">power set</a> of the states of our automaton. If you are not familiar with the concept, it is the set of all subsets of the states of our automaton. Our nondeterministic automaton having only three states, it is actually possible to enumerate all of the elements of its powerset:
<code class="language-plaintext highlighter-rouge">∅</code>, <code class="language-plaintext highlighter-rouge">{1}</code>, <code class="language-plaintext highlighter-rouge">{2}</code>, <code class="language-plaintext highlighter-rouge">{3}</code>, <code class="language-plaintext highlighter-rouge">{1,2}</code>, <code class="language-plaintext highlighter-rouge">{1,3}</code>, <code class="language-plaintext highlighter-rouge">{2,3}</code>, <code class="language-plaintext highlighter-rouge">{1,2,3}</code>. If we have N states in the beginning, the powerset has 2<sup>N</sup> elements.</p>
<p>After consuming k characters, let’s consider the subset of activated states. Given the next character of the string, we will reach another subset of activated states, and that subset depends only on the current subset and the character. We can therefore build a deterministic automaton whose nodes are labelled with subsets of the states of the nondeterministic automaton.</p>
<p>In our case, the automaton will look as follows.</p>
<p><img src="/images/regexp/some_ab_det.png" alt="Deterministic automaton matching .*ab" /></p>
<p>Luckily enough, our deterministic automaton has only 3 states… Yet deterministic automata are typically bigger than their nondeterministic counterparts. Hence, you can put this trick in the memory vs CPU trade-off box.</p>
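<p>Here is a minimal Python sketch of the powerset transformation applied to the automaton for <code class="language-plaintext highlighter-rouge">.*ab</code>. The state encoding, the three-symbol alphabet (<code class="language-plaintext highlighter-rouge">"other"</code> stands for any character that is neither a nor b) and the function names are assumptions made for the example. Only the subsets reachable from <code class="language-plaintext highlighter-rouge">{1}</code> are built, which is why we end up with 3 states rather than 2<sup>3</sup>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">ALPHABET = ("a", "b", "other")  # "other": any character except a and b

def nfa_step(active, char):
    """One step of the (hypothetically encoded) automaton for ".*ab"."""
    following = set()
    for state in active:
        if state == 1:
            following.add(1)
            if char == "a":
                following.add(2)
        elif state == 2 and char == "b":
            following.add(3)
    return following

def determinize(start=frozenset({1})):
    """Build the deterministic automaton whose states are subsets."""
    transitions = {}
    todo = [start]
    while todo:
        subset = todo.pop()
        if subset in transitions:
            continue  # already explored
        transitions[subset] = {}
        for char in ALPHABET:
            target = frozenset(nfa_step(subset, char))
            transitions[subset][char] = target
            todo.append(target)
    return transitions

dfa = determinize()
print(sorted(sorted(subset) for subset in dfa))
# [[1], [1, 2], [1, 3]]</code></pre></figure>
<p>Each subset has exactly one outgoing arrow per symbol, so matching follows a single arrow per character.</p>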
<h1 id="matching-more-than-one-regular-expression">Matching more than one regular expression</h1>
<p>Matching the same string against many regular expressions is a very common problem. Imagine a web log from which you want to extract statistics. You might want to identify the different parts of your website using regular expressions applied to the URLs. If you are in e-commerce, we are probably talking about hundreds of regular expressions. Your for-loop over the regular expressions might be a little too CPU intensive.</p>
<p>We can use a trick very similar to the powerset transformation. Instead of considering a powerset, we label the states with elements of the cartesian product of the sets of states of the regular expressions’ respective automata.</p>
<p>Each matching state will then hold the set of matched regular expressions. For instance, we can try to merge the previous automaton with the automaton matching <code class="language-plaintext highlighter-rouge">a.*b</code>.</p>
<p>Once relabelled, the automaton for <code class="language-plaintext highlighter-rouge">.*ab</code> looks like this.
<img src="/images/regexp/some_ab_det_relabelled.png" alt="Deterministic automaton matching .*ab" /></p>
<p><code class="language-plaintext highlighter-rouge">a.*b</code>’s deterministic automaton looks like this.
<img src="/images/regexp/a_some_b_det.png" alt="Deterministic automaton matching a.*b" /></p>
<p>And the deterministic automaton matching both <code class="language-plaintext highlighter-rouge">a.*b</code> and <code class="language-plaintext highlighter-rouge">.*ab</code> looks as follows.</p>
<p><img src="/images/regexp/merge_automaton.png" alt="Deterministic automaton matching both a.*b and .*ab" /></p>
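<p>The merge can be sketched in Python as follows. The two transition tables are written by hand and their state numbers are hypothetical (they are not guaranteed to match the labels in the pictures), <code class="language-plaintext highlighter-rouge">"other"</code> stands for any character that is neither a nor b, and this is an illustration of the idea rather than the code of the multiregexp library.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">ALPHABET = ("a", "b", "other")

# Pattern 0 is ".*ab": state 2 is valid.
DFA_SOME_AB = {
    0: {"a": 1, "b": 0, "other": 0},
    1: {"a": 1, "b": 2, "other": 0},
    2: {"a": 1, "b": 0, "other": 0},
}
# Pattern 1 is "a.*b": state 2 is valid, state 3 is a dead end.
DFA_A_SOME_B = {
    0: {"a": 1, "b": 3, "other": 3},
    1: {"a": 1, "b": 2, "other": 1},
    2: {"a": 1, "b": 2, "other": 1},
    3: {"a": 3, "b": 3, "other": 3},
}
AUTOMATA = (DFA_SOME_AB, DFA_A_SOME_B)
VALID = (2, 2)  # valid state of each automaton

def merge(automata=AUTOMATA):
    """Explore the reachable tuples of states, one component per automaton."""
    start = tuple(0 for _ in automata)
    transitions, todo = {}, [start]
    while todo:
        states = todo.pop()
        if states in transitions:
            continue
        transitions[states] = {
            char: tuple(dfa[s][char] for dfa, s in zip(automata, states))
            for char in ALPHABET
        }
        todo.extend(transitions[states].values())
    return start, transitions

def match(string):
    """Return the indices of the patterns matching the whole string."""
    start, transitions = merge()
    states = start
    for char in string:
        symbol = char if char in ("a", "b") else "other"
        states = transitions[states][symbol]
    return {i for i, (s, valid) in enumerate(zip(states, VALID)) if s == valid}

print(match("ab"), match("bab"), match("acb"), match("abc"))
# {0, 1} {0} {1} set()</code></pre></figure>
<p>Whatever the number of patterns, the matching loop follows exactly one arrow per character. The price is paid at compile time and in memory, since the merged automaton can have many more states than each of its components.</p>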
<hr />
<p><em>Thanks to <a href="http://www.reddit.com/r/programming/comments/1pbuab/of_running_multiple_regexp_at_once/">evmar</a> for correcting me : V8 does not use re2.</em></p>
Of solving the rubik's from scratch2013-08-18T00:00:00+00:00https://fulmicoton.com/posts/rubix<p>The Internet is full of solutions for the rubik’s cube. However, how these solutions were discovered is seldom described. In this post I’ll try to detail how one can solve the rubik’s cube from scratch.</p>
<p><strong>Disclaimer:</strong> The solution presented here is by no means the fastest… It actually takes a very long time to solve the rubik’s cube using this algorithm. It is just the one I came up with, so I guess it is probably, in some sense, one of the simplest. Though I am very proud of having cracked it, there isn’t much to be proud about: it took me about a year to come up with a solution. At that time, I was always carrying a rubik’s cube and a notebook to search for the magic moves on the train like a lunatic.</p>
<p>Back then, I was kind of making a point of using a computer as little as possible. It was years ago and I forgot the moves I came up with. So in this post I try to find these moves again, but in Python.</p>
<h2 id="what-is-a-rubiks-cube-anyway">What is a rubik’s cube anyway?</h2>
<h3 id="a-fixed-referential-and-six-possible-moves">A fixed referential and six possible moves.</h3>
<p>I’m pretty sure you know what a rubik’s cube is.
Let’s still make a couple of obvious statements.</p>
<p>We’ll assume that the centers of the rubik’s cube’s faces are
fixed, and that we never rotate the whole thing. With such
a setting, there are 6 atomic moves you can make with a rubik’s cube. Each consists of turning one of its faces by a quarter turn.</p>
<p>Each operation will be named after the face that we are turning, and the sense of rotation will be the so-called positive sense, more commonly called counterclockwise.</p>
<p>We will attach a 3D coordinate system to the rubik’s cube to
easily code the rotation operations. The coordinate system will be right-handed (so-called direct). The normal vectors of the right face, the upper face and the front face will respectively be <code class="language-plaintext highlighter-rouge">(1,0,0)</code>, <code class="language-plaintext highlighter-rouge">(0,1,0)</code>, and <code class="language-plaintext highlighter-rouge">(0,0,1)</code>.</p>
<p>A clockwise turn can be obtained by repeating the counterclockwise turn three times.</p>
<h3 id="two-different-sets-of-blocks">Two different sets of blocks</h3>
<p><img src="/images/rubix/rubix2.png" /></p>
<p>The rubik’s cube is made of smaller blocks. Leaving aside the block at the very center of the rubik’s cube, and the centers of the faces,
we have <code class="language-plaintext highlighter-rouge">3x3x3 - 1 - 6 = 20</code> blocks. They are of two types: corners (8 blocks) and side blocks (12 blocks), showing respectively 3 and 2 faces. While moving the rubik’s cube, corners will not become side blocks and side blocks will not become corner blocks
(obvious statements, remember?). Everything happens as if they were living independent lives. That will be the root of the method I’m presenting here.</p>
<h2 id="whats-the-plan-then">What’s the plan then?</h2>
<p>We will solve the rubik’s cube the following way.</p>
<p><strong>Step 1</strong></p>
<p>Place the side blocks at the correct position with their correct orientation.</p>
<p><img src="/images/rubix/rubix1.jpg" /></p>
<p><strong>Step 2</strong></p>
<p>Place the corner blocks at their correct location regardless of their orientation.</p>
<p><img src="/images/rubix/rubix2.jpg" /></p>
<p><strong>Step 3</strong></p>
<p>Fix the orientation of the corners.</p>
<p><img src="/images/rubix/rubix3.jpg" /></p>
<p>Reaching step 1 is a very nice and cute puzzle that does not
require much crunching. I will not detail its solution here.</p>
<p>Step 2 and step 3 however are very difficult.</p>
<p>The trick to achieve step 2 will be to find some simple sequences of moves that make it possible to move corner blocks without moving the side blocks.</p>
<p>The trick to achieve step 3 will be to find some simple sequence of moves that leaves all blocks in place, but changes the orientation of some of the corners.</p>
<p>We also want these moves to be able to generate all the possible positions
of the rubik’s cube.</p>
<h1 id="a-bit-of-math">A bit of math</h1>
<p>For step 2, we would ideally want to generate all permutations of the corners while leaving the sides in place. However, this is not possible. Let’s prove that we cannot apply an arbitrary permutation to the corners while leaving the sides unchanged.</p>
<p>The proof relies on ignoring the orientations of the blocks of the rubik’s cube, and considering only the effect of the basic moves on the permutation of the side blocks on one hand, and the permutation of the corners on the other hand. Each basic move is a cycle of length 4 in both cases. A cycle of length 4 has an odd signature, while the identity has an even signature. In other words, any sequence of basic moves leaving the sides untouched consists of an even number of moves. Hence, the permutation applied to the corners by any sequence of moves leaving the sides untouched must have an even signature. A transposition, for instance, is not possible.</p>
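<p>The signatures used in this argument are easy to check with a few lines of Python. The sketch below is an illustration only; it represents a permutation as a list where position <code class="language-plaintext highlighter-rouge">i</code> is sent to <code class="language-plaintext highlighter-rouge">permutation[i]</code>, and the example permutations are made up for the occasion.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def signature(permutation):
    """Return +1 for an even permutation and -1 for an odd one."""
    seen = set()
    sign = 1
    for start in range(len(permutation)):
        if start in seen:
            continue
        # Walk the cycle containing `start` and measure its length.
        length = 0
        current = start
        while current not in seen:
            seen.add(current)
            current = permutation[current]
            length += 1
        # A cycle of length n is a product of n - 1 transpositions,
        # so cycles of even length flip the signature.
        if length % 2 == 0:
            sign = -sign
    return sign

four_cycle = [1, 2, 3, 0]      # what a basic move does to 4 blocks
three_cycle = [1, 2, 0, 3]     # the kind of move we want for step 2
transposition = [1, 0, 2, 3]   # swapping two blocks only

print(signature(four_cycle), signature(three_cycle), signature(transposition))
# -1 1 -1</code></pre></figure>
<p>A cycle of length 3 is even, which is why it is a legitimate target, while a single transposition is odd.</p>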
<p><strong>For step 2</strong>, we’ll be happy if we find a move that generates all the permutations with an even signature. <strong>A cycle of length 3 on corners that all belong to one face should do the trick.</strong></p>
<p>For similar reasons, it is not possible to turn a single corner on its own either.
<strong>For step 3, the move we will be looking for is a move that changes the orientation of at most 3 corners belonging to the same face</strong>.</p>
<p>Let’s brute-force the search for these moves.</p>
<h1 id="coding-a-rubiks-cube-in-python">Coding a rubik’s cube in python</h1>
<p>We will need to be able to handle very simple geometry
operations. Since numpy would be a bit overkill, let’s just recode a couple of 3D vector operations.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># The six directions
DIRECTIONS = (
    ( 1,  0,  0),  # right
    ( 0,  1,  0),  # up
    ( 0,  0,  1),  # front
    (-1,  0,  0),  # left
    ( 0, -1,  0),  # down
    ( 0,  0, -1),  # back
)
DIRECTIONS_NAME = dict(zip(DIRECTIONS, "rufldb"))

def cross(axis, direction):
    # cross product
    return (axis[1]*direction[2] - axis[2]*direction[1],
            axis[2]*direction[0] - axis[0]*direction[2],
            axis[0]*direction[1] - axis[1]*direction[0])

def dot(va, vb):
    # dot product
    return sum(a*b for (a, b) in zip(va, vb))

def scale(alpha, v):
    # scaling a vector
    (x, y, z) = v
    return (alpha*x, alpha*y, alpha*z)

def add(u, v):
    # adding two vectors
    return (u[0] + v[0], u[1] + v[1], u[2] + v[2])

def rotate(axis, u):
    # rotation by a quarter turn in the
    # positive sense around a normal vector.
    axis_projection = scale(dot(axis, u), axis)
    ortho_projection = cross(axis, u)
    return add(axis_projection, ortho_projection)</code></pre></figure>
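<p>As a quick sanity check, here is a small self-contained sketch (the compact <code class="language-plaintext highlighter-rouge">rotate</code> below is a condensed rewrite of the functions above, made for this example) showing that a quarter turn around the front axis sends the right direction onto the up direction, that four quarter turns bring a vector back, and that three quarter turns undo one.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def rotate(axis, u):
    # The component along the axis is kept, the orthogonal component
    # is turned by a quarter turn thanks to the cross product.
    along = sum(a * b for a, b in zip(axis, u))
    cross = (axis[1]*u[2] - axis[2]*u[1],
             axis[2]*u[0] - axis[0]*u[2],
             axis[0]*u[1] - axis[1]*u[0])
    return tuple(along * a + c for a, c in zip(axis, cross))

def repeat(axis, u, times):
    for _ in range(times):
        u = rotate(axis, u)
    return u

front = (0, 0, 1)
corner = (1, 1, 1)

print(rotate(front, (1, 0, 0)))   # (0, 1, 0)
print(rotate(front, corner))      # (-1, 1, 1)
print(repeat(front, corner, 4))   # (1, 1, 1)
# Three counterclockwise quarter turns undo one counterclockwise
# quarter turn: together they act as a clockwise quarter turn.
print(repeat(front, rotate(front, corner), 3))  # (1, 1, 1)</code></pre></figure>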
<p>Representing the rubik’s cube by a data structure is actually pretty tricky. To keep code small and cute, I chose to avoid implementing a rubik’s cube class, but instead represent the rubik’s cube state as a simple dictionary.</p>
<p>In order to solve step 2 or step 3, we will also need
to be able to consider a rubik’s cube with only side cubes,
a rubik’s cube with only corner cubes, a rubik’s cube for which corners are not oriented but are distinct from one another (as if they had a number associated) and finally a full rubik’s cube.</p>
<p>In the following piece of code, the rubik’s cube will just be a dictionary whose keys are coordinates in the 3x3x3 3D grid,
and whose values are objects describing the associated part of the rubik’s cube (simply called cube in the code).</p>
<p>We use two alternative implementations of a cube:
one stores orientation information while the other does not.</p>
<p>Note that it would have been possible to use a much more abstract and efficient way to describe the rubik’s cube, but I prefer to keep things as explicit as possible here.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from itertools import product

def degree(coords):
    # Given the position of a block,
    # return the number of faces
    # that are visible.
    return sum(map(abs, coords))

class NonOrientedCube(object):
    def rotate(self, axis):
        return self

class OrientedCube(object):
    __slots__ = ("orientation",)

    def __init__(self, orientation=DIRECTIONS[:2]):
        self.orientation = orientation

    def rotate(self, axis):
        return OrientedCube(
            tuple(rotate(axis, u)
                  for u in self.orientation)
        )

    def __eq__(self, other):
        return self.orientation == other.orientation

    def __ne__(self, other):
        return self.orientation != other.orientation

# The oriented rubik's cube at its initial
# state. All blocks are oriented the same way.
zero_oriented = {
    coords: OrientedCube()
    for coords in product(*([(-1,0,1)]*3))
    if degree(coords) >= 2
}

# The non-oriented rubik's cube at its initial
# state. Cubes are not oriented. Only their position counts.
zero_non_oriented = {
    coords: NonOrientedCube()
    for coords in product(*([(-1,0,1)]*3))
    if degree(coords) >= 2
}

# Applying a basic operation on the rubik's
# cube.
#
# Turning the face facing the direction
# axis by a quarter in the positive sense.
# (counterclockwise)
def turn(axis, rubix_cube):
    parts = {}
    for (coord, cube) in rubix_cube.items():
        if any(x == y != 0 for x, y in zip(axis, coord)):
            # this cube is on the face rotating,
            # let's rotate it and register it to
            # its destination.
            new_cube = cube.rotate(axis)
            new_coord = rotate(axis, coord)
            parts[new_coord] = new_cube
        else:
            # this cube is not on the face that is rotating.
            parts[coord] = cube
    return parts

# Returns a partial rubik's cube:
# only the blocks with d faces visible.
def project(rubix, d):
    return {
        coords: v
        for (coords, v) in rubix.items()
        if degree(coords) == d
    }

# Returns a partial rubik's cube:
# only the side blocks
def sides(rubix):
    return project(rubix, 2)

# Returns a partial rubik's cube:
# only the corner blocks
def corners(rubix):
    return project(rubix, 3)</code></pre></figure>
<h2 id="finding-your-own-magic-move">Finding your own magic move</h2>
<p><img src="/images/rubix/magic_move.jpg" /></p>
<p>All the difficulty left here is to find a combination of moves that leaves the sides untouched and yet has some effect on the corners. Let’s call such a combination a <strong>magic move</strong>!</p>
<p>When trying to find a magic move, especially without a computer, a good trick is to test many moves and consider, for each of them, what happens if you repeat this move over and over. The set of states obtained by repeating the operation is called an orbit.</p>
<p>When doing that by hand, it can be tested very rapidly by representing the underlying permutations as a union of cycles.</p>
<p>But with a computer we can do all this very simply.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Possible moves
# we add clockwise quarter turn (counterclockwise * 3)
# and half turn (counterclockwise * 2)
</span><span class="n">OPERATIONS</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">[</span> <span class="n">direction</span> <span class="p">]</span><span class="o">*</span><span class="n">i</span>
<span class="k">for</span> <span class="n">direction</span> <span class="ow">in</span> <span class="n">DIRECTIONS</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">)</span>
<span class="p">]</span>
<span class="k">def</span> <span class="nf">sequence</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">rubix</span><span class="p">):</span>
<span class="k">for</span> <span class="n">axis</span> <span class="ow">in</span> <span class="n">seq</span><span class="p">:</span>
<span class="n">rubix</span> <span class="o">=</span> <span class="n">turn</span><span class="p">(</span><span class="n">axis</span><span class="p">,</span> <span class="n">rubix</span><span class="p">)</span>
<span class="k">return</span> <span class="n">rubix</span>
<span class="k">def</span> <span class="nf">differences</span><span class="p">(</span><span class="n">rubix_1</span><span class="p">,</span> <span class="n">rubix_2</span><span class="p">):</span>
<span class="k">return</span> <span class="p">[</span>
<span class="n">k</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">rubix_1</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
<span class="k">if</span> <span class="n">rubix_1</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">!=</span> <span class="n">rubix_2</span><span class="p">[</span><span class="n">k</span><span class="p">]</span>
<span class="p">]</span>
<span class="c1"># yields all possible tuples of size n
# of a given set of elements
</span><span class="k">def</span> <span class="nf">browse_with_length</span><span class="p">(</span><span class="n">els</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
<span class="k">if</span> <span class="n">n</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
<span class="k">yield</span> <span class="p">[]</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">for</span> <span class="n">head</span> <span class="ow">in</span> <span class="n">els</span><span class="p">:</span>
<span class="k">for</span> <span class="n">tail</span> <span class="ow">in</span> <span class="n">browse_with_length</span><span class="p">(</span><span class="n">els</span><span class="p">,</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
<span class="k">yield</span> <span class="n">head</span> <span class="o">+</span> <span class="n">tail</span>
<span class="c1"># yields all possible tuples of a
# given set of elements
</span><span class="k">def</span> <span class="nf">browse_tuples</span><span class="p">(</span><span class="n">els</span><span class="p">):</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">count</span><span class="p">(</span><span class="mi">1</span><span class="p">):</span>
<span class="k">for</span> <span class="n">seq</span> <span class="ow">in</span> <span class="n">browse_with_length</span><span class="p">(</span><span class="n">els</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
<span class="k">yield</span> <span class="n">seq</span>
<span class="c1"># Returns true if all the position given
# belong to the same face
</span><span class="k">def</span> <span class="nf">all_on_one_face</span><span class="p">(</span><span class="n">positions</span><span class="p">):</span>
<span class="k">for</span> <span class="n">els</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">positions</span><span class="p">):</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">els</span><span class="p">))</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="c1"># Search within the orbit of an operation
# for a repetition that leaves fixed_rubix unchanged,
# and changes at most 3 elements of diff_rubix,
# all from the same face.
</span><span class="k">def</span> <span class="nf">search_orbit</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">fixed_rubix</span><span class="p">,</span> <span class="n">diff_rubix</span><span class="p">,</span> <span class="n">max_depth</span><span class="p">):</span>
<span class="n">iter_fixed_rubix</span> <span class="o">=</span> <span class="n">fixed_rubix</span>
<span class="n">iter_diff_rubix</span> <span class="o">=</span> <span class="n">diff_rubix</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">max_depth</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span> <span class="c1"># we don't want to find moves
</span> <span class="c1"># that we repeat more than max_depth times.
</span> <span class="n">iter_fixed_rubix</span> <span class="o">=</span> <span class="n">sequence</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">iter_fixed_rubix</span><span class="p">)</span>
<span class="n">iter_diff_rubix</span> <span class="o">=</span> <span class="n">sequence</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">iter_diff_rubix</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">differences</span><span class="p">(</span><span class="n">fixed_rubix</span><span class="p">,</span> <span class="n">iter_fixed_rubix</span><span class="p">):</span>
<span class="n">diff</span> <span class="o">=</span> <span class="n">differences</span><span class="p">(</span><span class="n">diff_rubix</span><span class="p">,</span> <span class="n">iter_diff_rubix</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">diff</span><span class="p">:</span>
<span class="k">break</span> <span class="c1"># we ran through a full orbit.
</span> <span class="k">elif</span> <span class="n">all_on_one_face</span><span class="p">(</span><span class="n">diff</span><span class="p">)</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">diff</span><span class="p">)</span> <span class="o"><=</span> <span class="mi">3</span><span class="p">:</span>
<span class="k">return</span> <span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">diff</span><span class="p">)</span>
<span class="n">DIRECTIONS_NAME</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">DIRECTIONS</span><span class="p">,</span>
<span class="p">[</span><span class="s">"right"</span><span class="p">,</span>
<span class="s">"up"</span><span class="p">,</span>
<span class="s">"front"</span><span class="p">,</span>
<span class="s">"left"</span><span class="p">,</span>
<span class="s">"down"</span><span class="p">,</span>
<span class="s">"back"</span> <span class="p">]))</span>
<span class="k">def</span> <span class="nf">operation_to_string</span><span class="p">(</span><span class="n">seq</span><span class="p">):</span>
<span class="k">return</span> <span class="s">"-"</span><span class="p">.</span><span class="n">join</span><span class="p">([</span> <span class="n">DIRECTIONS_NAME</span><span class="p">[</span><span class="n">axis</span><span class="p">]</span>
<span class="k">for</span> <span class="n">axis</span> <span class="ow">in</span> <span class="n">seq</span> <span class="p">])</span>
<span class="k">print</span> <span class="s">"""
Step 2
Searching for a move letting sides
untouched, letting all but three corners belonging to the
same face at the same place.
"""</span>
<span class="k">def</span> <span class="nf">search_step2_move</span><span class="p">():</span>
<span class="k">for</span> <span class="n">seq</span> <span class="ow">in</span> <span class="n">browse_tuples</span><span class="p">(</span><span class="n">OPERATIONS</span><span class="p">):</span>
<span class="n">seq</span> <span class="o">=</span> <span class="p">[</span><span class="n">DIRECTIONS</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+</span> <span class="n">seq</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">seq</span><span class="p">)</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">magic_move</span> <span class="o">=</span> <span class="n">search_orbit</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span>
<span class="n">sides</span><span class="p">(</span><span class="n">zero_oriented</span><span class="p">),</span>
<span class="n">corners</span><span class="p">(</span><span class="n">zero_non_oriented</span><span class="p">),</span> <span class="mi">4</span><span class="p">)</span>
<span class="k">if</span> <span class="n">magic_move</span><span class="p">:</span>
<span class="p">(</span><span class="n">operation</span><span class="p">,</span> <span class="n">repeat</span><span class="p">,</span> <span class="n">dist</span><span class="p">)</span><span class="o">=</span><span class="n">magic_move</span>
<span class="k">print</span> <span class="n">operation_to_string</span><span class="p">(</span><span class="n">operation</span><span class="p">),</span>
<span class="k">print</span> <span class="s">"x"</span> <span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">repeat</span><span class="p">),</span>
<span class="k">print</span> <span class="n">dist</span>
<span class="k">break</span>
<span class="n">search_step2_move</span><span class="p">()</span>
<span class="k">print</span> <span class="s">"</span><span class="se">\n</span><span class="s">---------------</span><span class="se">\n</span><span class="s">"</span>
<span class="k">def</span> <span class="nf">search_step3_move</span><span class="p">():</span>
<span class="k">for</span> <span class="n">seq</span> <span class="ow">in</span> <span class="n">browse_tuples</span><span class="p">(</span><span class="n">OPERATIONS</span><span class="p">):</span>
<span class="n">seq</span> <span class="o">=</span> <span class="p">[</span><span class="n">DIRECTIONS</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+</span> <span class="n">seq</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">seq</span><span class="p">)</span> <span class="o">%</span><span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">corners_non_oriented</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span>
<span class="n">zero_oriented</span><span class="p">,</span>
<span class="o">**</span><span class="n">corners</span><span class="p">(</span><span class="n">zero_non_oriented</span><span class="p">))</span>
<span class="n">magic_move</span> <span class="o">=</span> <span class="n">search_orbit</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span>
<span class="n">corners_non_oriented</span><span class="p">,</span>
<span class="n">corners</span><span class="p">(</span><span class="n">zero_oriented</span><span class="p">),</span>
<span class="mi">6</span><span class="p">)</span>
<span class="k">if</span> <span class="n">magic_move</span><span class="p">:</span>
<span class="p">(</span><span class="n">operation</span><span class="p">,</span> <span class="n">repeat</span><span class="p">,</span> <span class="n">dist</span><span class="p">)</span><span class="o">=</span><span class="n">magic_move</span>
<span class="k">print</span> <span class="n">operation_to_string</span><span class="p">(</span><span class="n">operation</span><span class="p">),</span>
<span class="k">print</span> <span class="s">"x"</span> <span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">repeat</span><span class="p">),</span>
<span class="k">print</span> <span class="n">dist</span>
<span class="k">break</span>
<span class="k">print</span> <span class="s">"""
Step 3
Searching a sequence that only change the orientation
of three corners.
"""</span>
<span class="n">search_step3_move</span><span class="p">()</span></code></pre></figure>
<h2 id="we-gat-da-moves">We gat da moves!</h2>
<p><a href="https://github.com/poulejapon/poulejapon.github.com/blob/master/code/rubix/rubix.py">The code is available here</a>,
and should take a couple of minutes to run on your computer.</p>
<p>The output you should get is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Step 2
Searching for a move letting sides
untouched, letting all but three corners belonging to the
same face at the same place.
right-left-left-up-down-down x4
[(-1, 1, 1), (-1, -1, 1), (1, -1, 1)]
---------------
Step 3
Searching a sequence that only change the orientation
of three corners.
right-up-right-right-right-front-front-up-up-up x6
[(-1, 1, 1), (-1, -1, 1), (1, 1, 1)]
</code></pre></div></div>
<p>Let’s see what these operations look like when applied to a solved Rubik’s cube.</p>
<p><img src="/images/rubix/initial.jpg" /></p>
<p>The operation returned for step 2 permutes 3 cubes on the back of the Rubik’s cube.</p>
<p><img src="/images/rubix/step2.jpg" /></p>
<p>The operation returned for step 3 changes the orientation of 3 cubes on the front face.</p>
<p><img src="/images/rubix/step3.jpg" /></p>
Of how much of a file is in RAM2013-08-10T00:00:00+00:00https://fulmicoton.com/posts/pagecache<h1 id="memory-my-friend-">Memory my friend !</h1>
<p>Nowadays RAM is so cheap that you might be tempted to just rely on your database being in RAM to get the performance you want. Disk is just there for persistence.</p>
<p>Many people on the web talk about their production setup being in tmpfs, or using the RAMDirectory.</p>
<p>But isn’t your OS supposed to make sure that the stuff you’re accessing is in page cache? Let’s see how we can measure how much of your db/index/data is in page cache.</p>
<h1 id="whats-page-cache-anyway">What’s page cache anyway?</h1>
<p>It takes from 5 to 10ms to read something from a random part of your hard disk. Accessing data in RAM on the other hand, takes between 50 ns and 100 ns. It is only natural for the OS to make sure that the same data is not loaded twice if we can afford caching it in RAM. That’s precisely the role of the page cache.</p>
<p>If you are on Linux or MacOS, here is a very simple experiment to see the page cache in action. Go find a fat and useless file sleeping on your hard disk. That video of <code class="language-plaintext highlighter-rouge">Free Willy 2</code> will do. Do not open it, just run the following command twice:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time cat ./free-willy-2.mpg > /dev/null
</code></pre></div></div>
<p>The command reads your whole file and prints out the duration of the operation. The second time, you should get a pretty nice performance improvement. By reading the file the first time, we made sure that the file was sitting in RAM for the second run.</p>
<p>This trick is actually pretty legit. You can warm up files by cat’ing them to your good old <code class="language-plaintext highlighter-rouge">/dev/null</code>.</p>
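<p>If you would rather run the same experiment from a script, here is a minimal sketch in Python. The helper name <code class="language-plaintext highlighter-rouge">timed_read</code> is mine and purely illustrative; it just reads the file to the end and throws the bytes away, like <code class="language-plaintext highlighter-rouge">cat</code> to <code class="language-plaintext highlighter-rouge">/dev/null</code> does.</p>

```python
import time


def timed_read(path, chunk_size=1 << 20):
    """Read `path` to the end, discarding the data.

    Returns (bytes_read, elapsed_seconds). Hypothetical helper,
    equivalent in spirit to `time cat path > /dev/null`.
    """
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    import sys
    # Run it twice on the same file: the second read should be served
    # from the page cache, provided the file fits in free RAM.
    for attempt in (1, 2):
        size, seconds = timed_read(sys.argv[1])
        print("read %d: %d bytes in %.3f s" % (attempt, size, seconds))
```

<p>How big the gap between the two runs is depends on your disk, on the file size, and on whether the file was already cached before the first run.</p>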
<h1 id="pmap-to-the-rescue">pmap to the rescue</h1>
<p>Assuming your database is using memory mapping (mmap), pmap will give you a nice picture of what’s in your virtual memory and a rough idea of how much of your database files are in RAM.</p>
<p>The default parameters, however, won’t tell you how much of your files are in RAM. To know that, you need to pass it the <code class="language-plaintext highlighter-rouge">-x</code> flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pmap -x <pid>
</code></pre></div></div>
<p>You can find the pid of your process by running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps aux
</code></pre></div></div>
<p>Let’s take a look at a very cold Solr in which I just pushed 1M+ documents.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Address Kbytes RSS Dirty Mode Mapping
0000000000400000 4 4 0 r-x-- java
0000000000600000 4 4 4 rw--- java
000000000234e000 132 12 12 rw--- [ anon ]
00000006fae00000 56704 27564 27564 rw--- [ anon ]
00000006fe560000 4800 0 0 ----- [ anon ]
00000006fea10000 22464 0 0 rw--- [ anon ]
0000000700000000 146304 144384 144384 rw--- [ anon ]
0000000708ee0000 23744 0 0 ----- [ anon ]
000000070a610000 2626176 0 0 rw--- [ anon ]
00000007aaab0000 1398080 1387668 1387668 rw--- [ anon ]
00007f6c071fe000 280 4 0 r--s- _1.fdx
00007f6c07244000 64492 4 0 r--s- _1.fdt
00007f6c0b13f000 36 4 0 r--s- _1_nrm.cfs
00007f6c0b148000 1460 540 0 r--s- _1_Lucene40_0.tim
00007f6c0b2b5000 3472 4 0 r--s- _1_Lucene40_0.prx
00007f6c0b619000 4732 184 0 r--s- _1_Lucene40_0.frq
00007f6c0bab8000 284 4 0 r--s- _2.fdx
00007f6c0baff000 66200 4 0 r--s- _2.fdt
00007f6c0fba5000 36 4 0 r--s- _2_nrm.cfs
00007f6c0fbae000 1392 488 0 r--s- _2_Lucene40_0.tim
00007f6c0fd0a000 3532 4 0 r--s- _2_Lucene40_0.prx
00007f6c1007d000 4892 164 0 r--s- _2_Lucene40_0.frq
00007f6c3f21f000 284 4 0 r--s- _d.fdx
00007f6c3f266000 69544 4 0 r--s- _d.fdt
00007f6c43650000 69224 4 0 r--s- _e.fdt
00007f6c479ea000 280 4 0 r--s- _f.fdx
00007f6c47a30000 68916 4 0 r--s- _f.fdt
00007f6c4bd7d000 68552 4 0 r--s- _g.fdt
00007f6c54f25000 705388 4 0 r--s- _i.fdt
00007f6c80000000 132 8 8 rw--- [ anon ]
00007f6c80021000 65404 0 0 ----- [ anon ]
00007f6d9789d000 1016 120 120 rw--- [ anon ]
00007f6d9799b000 32 28 0 r-x-- libmanagement.so
00007f6d979a3000 2044 0 0 ----- libmanagement.so
00007f6d9c296000 1016 92 92 rw--- [ anon ]
00007f6d9c394000 12 12 0 r--s- lucene-highlighter-4.0.0.jar
</code></pre></div></div>
<p>Anonymous mappings are all the stuff that is not associated with a file, in this case
your Java heap. You should also see shared native libraries and jars: they are indeed mapped in your process’s virtual memory. At this point you need to locate which files hold the actual data of your database. They may not appear here if you are using a database working mainly in anonymous space, or if your database does not rely on mmap to access the data.</p>
<p>In my case, we see that the files of our index are mapped into memory. The so-called <a href="http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termindex">posting lists</a> are the files matching <code class="language-plaintext highlighter-rouge">_*_Lucene40_0.(frq|tim|prx|tip)</code>.</p>
<p>Let’s check how much of these are in RAM.</p>
<p>RSS stands for resident set size. It’s the part of your virtual memory that is actually sitting in physical memory, rather than in your file on the filesystem (for mmapped files) or in your swap (for anonymous memory).</p>
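<p>Adding up these columns by hand gets old quickly. Here is a small sketch that sums the <code class="language-plaintext highlighter-rouge">Kbytes</code> and <code class="language-plaintext highlighter-rouge">RSS</code> columns of a <code class="language-plaintext highlighter-rouge">pmap -x</code> output for the posting list files. The function name and the suffix list are my own choices, and it assumes the six-column layout shown above.</p>

```python
def resident_kbytes(pmap_lines, suffixes=(".frq", ".tim", ".prx", ".tip")):
    """Sum the RSS and Kbytes columns of `pmap -x` lines.

    Only mappings whose name ends with one of `suffixes` are counted.
    Assumes the layout: Address Kbytes RSS Dirty Mode Mapping.
    Returns (rss_kb, total_kb).
    """
    rss_kb = 0
    total_kb = 0
    for line in pmap_lines:
        fields = line.split()
        # Skips the header, anonymous mappings and unrelated files.
        if len(fields) < 6 or not fields[-1].endswith(suffixes):
            continue
        total_kb += int(fields[1])
        rss_kb += int(fields[2])
    return rss_kb, total_kb


sample = """\
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000700000000  146304  144384  144384 rw---    [ anon ]
00007f6c07244000   64492       4       0 r--s-  _1.fdt
00007f6c0b148000    1460     540       0 r--s-  _1_Lucene40_0.tim
00007f6c0b2b5000    3472       4       0 r--s-  _1_Lucene40_0.prx
00007f6c0b619000    4732     184       0 r--s-  _1_Lucene40_0.frq
""".splitlines()

rss, total = resident_kbytes(sample)
print("%d / %d KBytes resident" % (rss, total))
```

<p>In practice you would feed it the lines of <code class="language-plaintext highlighter-rouge">pmap -x</code> for your own process instead of the sample above.</p>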
<h1 id="wait-a-minute-pmap-showing-its-limits">Wait a minute… pmap showing its limits.</h1>
<p>Ok, let’s check whether the RSS column is working out as expected.</p>
<p>We saw earlier that cat’ing a file to <code class="language-plaintext highlighter-rouge">/dev/null</code> loads it into RAM. Let’s do that with <code class="language-plaintext highlighter-rouge">_2_Lucene40_0.prx</code>. Right now only 4 of its 3532 KBytes are resident, so we should see this figure go to 100%.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat _2_Lucene40_0.prx > /dev/null
pmap -x 10988 | grep _2_Lucene40_0.prx
</code></pre></div></div>
<p>gives me back :</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00007f6c0fd0a000 3532 4 0 r--s- _2_Lucene40_0.prx
</code></pre></div></div>
<p>This does not work as expected. Why the hell did this happen?</p>
<h1 id="minor-and-major-page-faults">Minor and major page faults</h1>
<p>mmap maps a segment of the virtual memory of our program to a segment of the file on disk. This operation is lazy: at this point nothing has been read from disk.</p>
<p>On the first attempt to access data from this virtual memory range, the OS will do whatever is necessary to map the virtual memory page to a physical memory page that holds the same information as the disk.</p>
<p>If at this moment the file is already in page cache, the OS just has to create the mapping between the virtual memory and the page cache (yes, most of the time mmapped files are directly mapped to the page cache!). This is usually called a <code class="language-plaintext highlighter-rouge">minor page fault</code>.</p>
<p>If however the page is not in page cache, we need to wait for the system to read the data from the disk and put it in page cache. This is the dreaded <code class="language-plaintext highlighter-rouge">major page fault</code>.</p>
<p>If our process tried to access a segment of our Lucene file that is not marked as resident right now, this would result in a minor page fault, but not a major page fault. The OS would just have to map the virtual memory to the already filled page cache.</p>
<p>You can check the number of page faults (minor and major) by using ps:
<code class="language-plaintext highlighter-rouge">ps -o min_flt,maj_flt &lt;pid&gt;</code></p>
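<p>A process can also read its own counters. The sketch below uses Python’s standard <code class="language-plaintext highlighter-rouge">resource</code> module (Unix only); the two helper names are mine. It mmaps a file, touches one byte per page, and lets you compare the counters before and after.</p>

```python
import mmap
import resource


def page_faults():
    """Return (minor, major) page fault counts for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_every_page(path):
    """mmap `path` read-only and read one byte in every page.

    Each first access to a page triggers a page fault: minor if the
    page is already in page cache, major if it must come from disk.
    Returns the sum of the bytes read, just so the reads are not
    optimized away. The file must not be empty.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(mm[i] for i in range(0, len(mm), mmap.PAGESIZE))


if __name__ == "__main__":
    import sys
    minor_before, major_before = page_faults()
    touch_every_page(sys.argv[1])
    minor_after, major_after = page_faults()
    print("minor faults: +%d" % (minor_after - minor_before))
    print("major faults: +%d" % (major_after - major_before))
```

<p>On a file that is already in page cache, I would expect the major count to barely move while the minor count grows. The exact numbers depend on your kernel’s read-ahead and fault-around behavior, so take them as an indication only.</p>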
<h1 id="what-can-we-do-mincore-to-the-rescue">What can we do? mincore to the rescue.</h1>
<p>A database may mmap and munmap files, you may restart your process, or a process may mmap a file that has just been created by another process. Since what we really want to avoid is major page faults, <code class="language-plaintext highlighter-rouge">pmap</code>’s figures are not exactly reliable.</p>
<p>I don’t know of any Linux command that answers this question directly, but <a href="http://man7.org/linux/man-pages/man2/mincore.2.html"><code class="language-plaintext highlighter-rouge">mincore</code></a> is a system call that makes it possible to know whether accessing a virtual memory page will require an IO or not.</p>
<p>We can therefore mmap a file, and ask mincore whether accessing each page would trigger a major page fault or not.</p>
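<p>To give an idea of what this looks like, here is a rough sketch calling <code class="language-plaintext highlighter-rouge">mmap</code> and <code class="language-plaintext highlighter-rouge">mincore</code> from Python through <code class="language-plaintext highlighter-rouge">ctypes</code>. This is not the code of my utility, just an illustration. It assumes a Unix system where <code class="language-plaintext highlighter-rouge">off_t</code> has the size of a C <code class="language-plaintext highlighter-rouge">long</code> (true on 64-bit Linux), and the function name <code class="language-plaintext highlighter-rouge">resident_pages</code> is made up.</p>

```python
import ctypes
import mmap
import os


def resident_pages(path):
    """Return (resident_pages, total_pages) for the file at `path`.

    Sketch only: maps the file with the C mmap, then asks mincore which
    pages are resident. Assumes off_t fits in a C long.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]

    size = os.path.getsize(path)
    if size == 0:
        return 0, 0
    n_pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE

    fd = os.open(path, os.O_RDONLY)
    try:
        addr = libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        # MAP_FAILED is (void *) -1.
        if addr is None or addr == ctypes.c_void_p(-1).value:
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            # mincore fills one byte per page; the lowest bit is set
            # when the page is resident.
            vec = (ctypes.c_ubyte * n_pages)()
            if libc.mincore(addr, size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            resident = sum(byte & 1 for byte in vec)
        finally:
            libc.munmap(addr, size)
    finally:
        os.close(fd)
    return resident, n_pages


if __name__ == "__main__":
    import sys
    resident, total = resident_pages(sys.argv[1])
    print("%d / %d pages resident" % (resident, total))
```

<p>Note that mmapping the file does not load anything by itself, so measuring does not warm up the file as a side effect.</p>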
<p>I wrote a little utility doing that, and you can find it on <a href="https://github.com/poulejapon/isresident">github</a>.
Let’s use it to take a look at our <code class="language-plaintext highlighter-rouge">_2_Lucene40_0.prx</code> file again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ isresident _2_Lucene40_0.prx
FILE                RSS   SIZE  PERCT
_2_Lucene40_0.prx   3530  3530  100 %
</code></pre></div></div>
<p>Hurray! We observe that the file is indeed completely in RAM.</p>
<p>You can also use a wildcard to run it on a whole directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./isresident /usr/lib/*
</code></pre></div></div>
Of collapsing in Solr2013-08-06T00:00:00+00:00https://fulmicoton.com/posts/grouping-in-solr<h2 id="a-post-about-solr">A post about Solr.</h2>
<p>This post is about the inner workings of one of the two most popular open source search engines: <a href="http://wiki.apache.org/solr/SchemaRESTAPI">Solr</a>. I noticed that many questions (one or two every day) on the solr-user mailing list were about Solr’s collapsing functionality.</p>
<p>I thought it would be a good idea to explain how Solr’s collapsing works, because its documentation is very sparse, and because a search engine is the kind of car you want to take a peek under the hood of to make sure you’ll drive it right.</p>
<h2 id="regular-search">Regular search</h2>
<h3 id="phase-1---getting-the-document-ids-list">Phase 1 - Getting the document ids list.</h3>
<p>Before jumping into collapsing, let’s review, without going into the details, how a regular search engine works.</p>
<p>When ingesting a document, a search engine starts by assigning it a document id, usually an integer as small as possible. Think of it as a simple incremental counter.</p>
<p>Then, it splits the text of the document into words. For each word encountered, it maintains the list of doc ids of the documents containing this word. This data structure is usually called an inverted list, or posting list.</p>
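<p>As a toy illustration of this indexing step, here is a sketch in Python. The function name is mine, the tokenization is a naive lowercase split on whitespace, and the whole index is a plain dictionary, which is of course nothing like what Lucene actually does on disk.</p>

```python
from collections import defaultdict


def build_posting_lists(documents):
    """Build a toy inverted index: word -> sorted list of doc ids.

    Doc ids are assigned incrementally, so appending them in order
    keeps every posting list sorted for free.
    """
    postings = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # set() so that a word repeated in a document is listed once.
        for word in set(text.lower().split()):
            postings[word].append(doc_id)
    return dict(postings)


docs = ["Burgundy wine", "red wine", "Burgundy snails"]
index = build_posting_lists(docs)
print(index["burgundy"])
print(index["wine"])
```

<p>The important property is that each list comes out sorted by doc id without any explicit sort.</p>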
<p>What happens then when someone searches for the ten most recent documents containing the words <code class="language-plaintext highlighter-rouge">Burgundy</code> and <code class="language-plaintext highlighter-rouge">wine</code>? Both posting lists, for <code class="language-plaintext highlighter-rouge">Burgundy</code> and for <code class="language-plaintext highlighter-rouge">wine</code>, are opened. The list of document ids containing both words is then logically the intersection of the two lists.</p>
<p>An important point here is that posting lists are sorted. The main benefit of that is that it makes computing the intersection or the union of two posting lists much easier. You just need to scan the two lists at the same time, and compare their current first elements. Here is the implementation of this algorithm in Python. The algorithm is linear in time and bounded in memory.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="k">def</span> <span class="nf">intersection</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">):</span>
<span class="n">left_head</span> <span class="o">=</span> <span class="n">left</span><span class="p">.</span><span class="nb">next</span><span class="p">()</span>
<span class="n">right_head</span> <span class="o">=</span> <span class="n">right</span><span class="p">.</span><span class="nb">next</span><span class="p">()</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="k">if</span> <span class="n">left_head</span> <span class="o">==</span> <span class="n">right_head</span><span class="p">:</span>
<span class="k">yield</span> <span class="n">left_head</span>
<span class="n">left_head</span> <span class="o">=</span> <span class="n">left</span><span class="p">.</span><span class="nb">next</span><span class="p">()</span>
<span class="n">right_head</span> <span class="o">=</span> <span class="n">right</span><span class="p">.</span><span class="nb">next</span><span class="p">()</span>
<span class="k">elif</span> <span class="n">left_head</span> <span class="o"><</span> <span class="n">right_head</span><span class="p">:</span>
<span class="n">left_head</span> <span class="o">=</span> <span class="n">left</span><span class="p">.</span><span class="nb">next</span><span class="p">()</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">right_head</span> <span class="o">=</span> <span class="n">right</span><span class="p">.</span><span class="nb">next</span><span class="p">()</span>
</code></pre></figure>
<p>This simple sequential scan makes it possible to keep decent performance even when your posting lists are still on your hard disk.</p>
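<p>The union works with the same kind of scan. Here is a sketch of it, written for Python 3 (hence the builtin <code class="language-plaintext highlighter-rouge">next</code> with a default value instead of the <code class="language-plaintext highlighter-rouge">.next()</code> method). It assumes both inputs are sorted and contain no duplicates.</p>

```python
def union(left, right):
    """Yield the sorted union of two sorted iterables of doc ids."""
    left, right = iter(left), iter(right)
    done = object()  # sentinel marking an exhausted list
    left_head = next(left, done)
    right_head = next(right, done)
    while left_head is not done and right_head is not done:
        if left_head == right_head:
            yield left_head
            left_head = next(left, done)
            right_head = next(right, done)
        elif left_head < right_head:
            yield left_head
            left_head = next(left, done)
        else:
            yield right_head
            right_head = next(right, done)
    # At most one of the two lists still has elements: flush it.
    while left_head is not done:
        yield left_head
        left_head = next(left, done)
    while right_head is not done:
        yield right_head
        right_head = next(right, done)


print(list(union([1, 3, 5], [2, 3, 6, 7])))  # [1, 2, 3, 5, 6, 7]
```

<p>Like the intersection, it only ever holds the two current heads in memory.</p>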
<h3 id="phase-2---getting-the-ten-best-document-ids-sorted">Phase 2 - Getting the ten best document ids sorted.</h3>
<p>Once the list of document ids has been retrieved, the search engine
goes through this list and retrieves the sorting field for each of the documents. Here it is the date. This is why it is very important that the index holds in RAM a map going from document ids to the sort field.
In practice, the sort field must be either an indexed field (in which case it gets uninverted, one of Lucene’s exotic features) or a docValue.</p>
<p>It then appends the document to a collector object which makes sure to only retain the n best documents. Many algorithms exist for that. A pythonista could just call <a href="http://docs.python.org/2/library/heapq.html"><code class="language-plaintext highlighter-rouge">heapq.nlargest</code></a>.</p>
<p>All of these algorithms are linear in the number of document ids we have, and bounded in memory.</p>
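<p>For the record, here is what the pythonista version could look like. The function name is mine, and the in-RAM map from doc id to sort field is modelled here as a plain dictionary.</p>

```python
import heapq


def top_n_doc_ids(doc_ids, sort_field, n=10):
    """Return the n doc ids with the largest sort field value.

    `sort_field` maps a doc id to its sort value (here a date).
    heapq.nlargest keeps only n candidates at any time, so memory
    stays bounded however long `doc_ids` is.
    """
    return heapq.nlargest(n, doc_ids, key=sort_field.__getitem__)


dates = {0: 20130101, 1: 20130805, 2: 20120304, 3: 20130701}
print(top_n_doc_ids(iter([0, 1, 2, 3]), dates, n=2))  # [1, 3]
```

<p>Since <code class="language-plaintext highlighter-rouge">doc_ids</code> can be any iterable, the output of the intersection generator can be plugged in directly.</p>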
<h3 id="phase-3---get-the-storables">Phase 3 - Get the storables</h3>
<p>Once the documents have been selected, we can finally iterate over them and fetch all the other fields that we need to give back to the user. There are only ten documents here, so it is ok if some of them actually require hitting the disk. These fields are what Solr calls stored fields.</p>
<h2 id="distributed-search">Distributed search</h2>
<p>When you reach a large number of documents, Solr makes it easy to split your index and distribute it across different computers called shards.</p>
<p>The server receiving the request will play the role of a master for the request. It will dispatch relevant requests to the shards,
and merge their answers.</p>
<p>A typical search query will be done in two rounds.</p>
<h3 id="round-1">Round 1</h3>
<p>The server receiving the query asks all of the shards for their ten best document ids, along with their score (here the date).
It can then merge these lists and retrieve the ten best document ids in the whole index.</p>
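<p>The merge in this first round can be sketched as follows. This is only an illustration of the idea, not Solr’s code: the function name and the <code class="language-plaintext highlighter-rouge">(score, doc_id)</code> pairs are assumptions of mine. Since doc ids are local to a shard, the merged result has to remember which shard each document came from.</p>

```python
import heapq


def merge_shard_results(shard_results, n=10):
    """Merge the per-shard top lists into a global top n.

    `shard_results[i]` is the list of (score, doc_id) pairs returned
    by shard i. Returns (score, shard, doc_id) triples, best first.
    """
    candidates = (
        (score, shard, doc_id)
        for shard, results in enumerate(shard_results)
        for (score, doc_id) in results
    )
    return heapq.nlargest(n, candidates)


shards = [
    [(9, 1), (4, 7)],  # answers of shard 0
    [(8, 2), (7, 3)],  # answers of shard 1
]
print(merge_shard_results(shards, n=3))
# [(9, 0, 1), (8, 1, 2), (7, 1, 3)]
```

<p>The shard number kept in each triple is what tells the second round which shard to ask for the full document.</p>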
<h3 id="round-2">Round 2</h3>
<p>The server then asks the different shards for the full documents associated with these document ids.</p>
<p>I think we are all set to think about how things are done when grouping / collapsing.</p>
<h2 id="non-distributed-grouping-queries">Non-distributed Grouping queries</h2>
<p>As we did for a regular search, let’s first consider the non-distributed case.</p>
<h3 id="phase-1">Phase 1</h3>
<p>We fetch the list of document ids matching the query. Nothing different here.</p>
<h3 id="phase-2">Phase 2</h3>
<p>We loop over the documents and fetch from some map living in RAM both
the score and the grouping field value. Our collector is slightly
trickier here. Instead of keeping a data structure holding the ten best doc ids, we keep the ten best group values.</p>
<p>The collector implementation in Lucene is in <a href="https://github.com/apache/lucene-solr/search?q=AbstractFirstPassGroupingCollector&ref=cmdform">AbstractFirstPassGroupingCollector</a> and maintains the list of the n best groups seen so far, implicitly sorted.</p>
<p>We are once again bounded in memory and linear in number of doc ids.</p>
<p>Below is a simple possible implementation of such a collector.
Python doesn’t come with any equivalent of a red-black tree. I used here the very nice bintrees package, which needs to be pip-installed. It basically acts as a dictionary whose items remain sorted by their keys.</p>
<p>Lucene itself relies on Java’s TreeMap.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">bintrees</span> <span class="kn">import</span> <span class="n">BinaryTree</span>

<span class="k">def</span> <span class="nf">collapsing_first_round_collector</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">n_bests</span><span class="p">):</span>
    <span class="c1"># we actually use it as an ordered set of (score, group) keys.</span>
    <span class="n">top_score_group</span> <span class="o">=</span> <span class="n">BinaryTree</span><span class="p">()</span>
    <span class="n">top_group_score</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">doc_id</span><span class="p">,</span> <span class="n">group_val</span><span class="p">,</span> <span class="n">score</span><span class="p">)</span> <span class="ow">in</span> <span class="n">docs</span><span class="p">:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">top_score_group</span><span class="p">)</span> <span class="o">>=</span> <span class="n">n_bests</span><span class="p">:</span>
            <span class="n">worst_score</span><span class="p">,</span> <span class="n">worst_group</span> <span class="o">=</span> <span class="n">top_score_group</span><span class="p">.</span><span class="n">min_key</span><span class="p">()</span>
            <span class="k">if</span> <span class="n">score</span> <span class="o"><=</span> <span class="n">worst_score</span><span class="p">:</span>
                <span class="c1"># there are already n candidates and this document</span>
                <span class="c1"># is not better than the worst of them</span>
                <span class="k">continue</span>
        <span class="k">if</span> <span class="n">group_val</span> <span class="ow">in</span> <span class="n">top_group_score</span><span class="p">:</span>
            <span class="n">former_score</span> <span class="o">=</span> <span class="n">top_group_score</span><span class="p">[</span><span class="n">group_val</span><span class="p">]</span>
            <span class="k">if</span> <span class="n">score</span> <span class="o"><</span> <span class="n">former_score</span><span class="p">:</span>
                <span class="c1"># the group already holds a better score:</span>
                <span class="c1"># nothing to update</span>
                <span class="k">continue</span>
            <span class="c1"># otherwise drop the former entry before re-inserting the group</span>
            <span class="k">del</span> <span class="n">top_score_group</span><span class="p">[(</span><span class="n">former_score</span><span class="p">,</span> <span class="n">group_val</span><span class="p">)]</span>
        <span class="n">top_group_score</span><span class="p">[</span><span class="n">group_val</span><span class="p">]</span> <span class="o">=</span> <span class="n">score</span>
        <span class="n">top_score_group</span><span class="p">[(</span><span class="n">score</span><span class="p">,</span> <span class="n">group_val</span><span class="p">)]</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">top_score_group</span><span class="p">)</span> <span class="o">==</span> <span class="n">n_bests</span> <span class="o">+</span> <span class="mi">1</span><span class="p">:</span>
            <span class="c1"># we need to erase the extra element</span>
            <span class="p">(</span><span class="n">last_score</span><span class="p">,</span> <span class="n">last_group</span><span class="p">)</span> <span class="o">=</span> <span class="n">top_score_group</span><span class="p">.</span><span class="n">pop_min</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
            <span class="k">del</span> <span class="n">top_group_score</span><span class="p">[</span><span class="n">last_group</span><span class="p">]</span>
    <span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="nb">reversed</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">top_score_group</span><span class="p">.</span><span class="n">keys</span><span class="p">())))</span></code></pre></figure>
<h3 id="phase-3">Phase 3</h3>
<p>Once the groups to be returned are selected, we need to return the best <code class="language-plaintext highlighter-rouge">group.limit</code> hits associated with each of these groups.</p>
<p>Once again this is only a matter of scanning through the doc ids, checking whether the group value belongs to the top 10 groups and, if so, appending the document to a dedicated collector. The sort used here to select the best hits belonging to a group can be completely different from the one used to select groups.</p>
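<p>A possible sketch of this second pass is given below. It is not Lucene’s code: the name <code class="language-plaintext highlighter-rouge">collapsing_second_round_collector</code> is made up, documents are assumed to arrive as <code class="language-plaintext highlighter-rouge">(doc_id, group_val, score)</code> triples, and for simplicity hits within a group are ranked with the same score as the groups themselves.</p>

```python
import heapq
from collections import defaultdict


def collapsing_second_round_collector(docs, top_groups, group_limit):
    """For each selected group, keep its group_limit best (score, doc_id) hits."""
    top_groups = set(top_groups)
    best_hits = defaultdict(list)  # group value -> min-heap of (score, doc_id)
    for doc_id, group_val, score in docs:
        if group_val not in top_groups:
            continue  # this group was not selected during the first pass
        heap = best_hits[group_val]
        if len(heap) < group_limit:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            heapq.heapreplace(heap, (score, doc_id))
    return {group: sorted(heap, reverse=True) for group, heap in best_hits.items()}


# hypothetical (doc_id, group_val, score) triples
docs = [(1, "x", 5), (2, "y", 9), (3, "x", 7), (4, "z", 8), (5, "x", 6)]
print(collapsing_second_round_collector(docs, ["x", "y"], 2))
```

<p>Memory is bounded by the number of selected groups times <code class="language-plaintext highlighter-rouge">group.limit</code>, and the scan is still linear in the number of doc ids.</p>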
<h2 id="distributed-grouping-queries">Distributed Grouping queries</h2>
<p>The server receiving the request will play the role of a master for the request. It will dispatch relevant requests to the shards,
and merge their answers.</p>
<p>Distributed grouping queries are done in three rounds.</p>
<h3 id="round-1-1">Round 1</h3>
<p>The master asks all of the shards for their ten best group ids, along with their scores (here the date).
Each shard computes them by running phases 1 and 2 of the non-distributed case. They give back their local ten best group values and their scores.</p>
<p>The master can then merge these lists and retrieve the ten best group ids in the whole index.</p>
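<p>Merging group results differs slightly from merging documents, because the same group can be reported by several shards. A minimal sketch, assuming each shard answers with <code class="language-plaintext highlighter-rouge">(group_val, score)</code> pairs and that a group’s score is its best score on any shard (the function name is hypothetical):</p>

```python
def merge_group_results(shard_groups, n):
    """Merge per-shard (group_val, score) lists and keep the n best groups."""
    best = {}
    for groups in shard_groups:
        for group_val, score in groups:
            # a group may show up on several shards: keep its best score
            if group_val not in best or score > best[group_val]:
                best[group_val] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# hypothetical answers from two shards
shard_a = [("x", 9), ("y", 4)]
shard_b = [("y", 7), ("z", 5)]
print(merge_group_results([shard_a, shard_b], 2))
```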
<h3 id="round-2-1">Round 2</h3>
<p>All of the shards are asked for their best <code class="language-plaintext highlighter-rouge">group.limit</code> representative doc ids and their scores for each of these best groups. The group ids are passed within the query.</p>
<p>The server can then merge these results and deduce the best hits to be returned for each of the best groups.</p>
<h3 id="round-3">Round 3</h3>
<p>The shards are then asked for the documents themselves.</p>
<h2 id="what-can-we-deduce-from-that">What can we deduce from that?</h2>
<p>At this point, all the extra queries appearing in your log should start to make sense.</p>
<p>In addition, you should rapidly get a sense of what can and cannot be done. Sorting groups by descending lowest value of a field is conceptually impossible in linear time and bounded memory without pre-processing, while sorting by ascending lowest value is very simple.</p>
<p>You should also get a sense that giving back the exact number of groups would require the shards, at some point, to send back the list of all the group terms they encountered, which is way too expensive. Solr chose to have each shard send back the number of groups it encountered, and to return the sum of all of these. This result is actually only correct if you made sure to partition your index with respect to your group values. If it is not the case, Solr
will only give you back a big upper bound.</p>
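<p>A toy example makes the over-counting visible. The sketch below is only an illustration (the function name and the shard contents are made up): it compares the sum of per-shard group counts with the true number of distinct groups.</p>

```python
def group_counts(shards):
    """Return (sum of per-shard distinct group counts, exact distinct count)."""
    summed = sum(len(set(groups)) for groups in shards)
    exact = len(set().union(*shards))
    return summed, exact


# group "y" lives on both shards: the sum over-counts it
print(group_counts([["x", "y"], ["y", "z"]]))
# index partitioned by group value: the sum is exact
print(group_counts([["x", "y"], ["z"]]))
```

<p>Computing the exact figure requires the union of all group values, which is precisely the data the shards avoid sending.</p>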
<p>We also now understand that, grouped or not, queries asking a distributed search engine for results from the 100th to the 110th (page 10) are very expensive, as they require querying every shard for its results from 0 to 110.</p>
<p>Finally we observe that Solr could run round 2 and round 3 at once if the index were partitioned with respect to the group values.</p>
Of bayesian average and star ratings2013-03-17T00:00:00+00:00https://fulmicoton.com/posts/bayesian_rating<h2 id="e-commerce-sometimes-doing-it-wrong">E-Commerce (sometimes) doing it wrong</h2>
<p>Most e-commerce websites let you sort your search results by customer ratings… and quite a lot are doing it wrong. Let’s assume here I’m looking for a book about CSS. I want to get the best book money can buy, so I will definitely hit the sort by rating button. The website is offering two options</p>
<ul>
<li>book A : 1 rating of 5. Average rating of 5.</li>
<li>book B : 50 ratings. Average rating of 4.5</li>
</ul>
<p>Think about it, would you rather have <em>book A</em> come first or <em>book B</em>
come first? Probably <em>book B</em>, right? That means we need something
smarter than just sorting by average rating.</p>
<p>A first simple answer, which would definitely be an improvement compared to sorting by average rating, might be to put products with fewer than k ratings at the bottom. But then, how to choose k? What if we are looking at a niche and all products have fewer than k ratings except one, which has k+1 awful ratings? Should it go on top?</p>
<p>A second answer you might come up with would be to choose an empirical scoring formula that seems to match our constraints.</p>
<p>Most of the formulas out there rely on Bayesian estimation. Generally speaking, Bayesian estimation really shines on this kind of situation : you want to measure something, but you know you won’t have enough data to reach a perfect estimation.</p>
<p>If m is the mean of the ratings and n is the number of ratings, we might consider something like :</p>
<pre>
$$rating(m, n) = {mn \over {n+K}}$$
</pre>
<p>This will probably work just fine. <strong>Probably</strong>… Still you have to choose the right K without knowing to what physical value it relates. More importantly you will have to convince your coworkers that this is the nice solution that will cover the edge cases perfectly.</p>
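<p>To get a feel for this empirical formula, here is a small sketch applying it to the two books of the introduction. The value K=5 is an arbitrary assumption, and the function name is made up for the example.</p>

```python
def damped_rating(m, n, K):
    """Empirical score: mean rating m damped by the number of ratings n."""
    return m * n / (n + K)


K = 5  # arbitrary choice, which is exactly the problem with this formula
book_a = damped_rating(5.0, 1, K)   # 1 rating of 5
book_b = damped_rating(4.5, 50, K)  # 50 ratings, average 4.5
print(round(book_a, 2), round(book_b, 2))
```

<p>Book B does come first, but a product with a single rating is pulled all the way toward 0, which is hard to justify.</p>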
<h2 id="bayesian-estimation-crash-course">Bayesian estimation crash course</h2>
<p>The big idea is, rather than trying to directly compute our estimate, first we compute a probability distribution describing “what we know” of the value we want to estimate, and then (and only then) we can extract an estimate of this value that fits our purpose.</p>
<p>The separation of concerns in that last bit is actually quite important. Depending on your point of view you may consider very different values as estimates of a physical value.</p>
<p>For instance, if I need to estimate the number of serums that a government needs to buy in order to cope with an epidemic, I will want to deliver a figure for which I can say : I am 90% sure that this will be sufficient. That figure can sometimes <a href="http://www.infowars.com/french-government-plans-mass-swine-flu-vaccination-program/">be very far away from the expectation</a>. If I am actually working in accounting at the company selling those serums, and I want to get an idea of a lower bound for my income for next month, I will probably take a totally different <a href="http://en.wikipedia.org/wiki/Quantile">quantile</a>.</p>
<h2 id="a-simple-example">A simple example</h2>
<p>Let’s assume you just discovered a disease called <a href="http://en.wikipedia.org/wiki/Toxoplasmosis">toxoplasmosis</a>, caused by a parasite, and you want to estimate the ratio $X$ of people infected.</p>
<p>Human patients infected by the parasite do not show any symptoms at all, so as far as you know this ratio could be anything. We might describe your view of the probability distribution of this value as a uniform distribution.</p>
<p>Talking about probability here might feel a little bit weird.
First of all, is it legitimate to talk about probability when we are estimating a very tangible, non-random value? In terms of Bayesian probability, a variable is random if you don’t know its value exactly. Its distribution is a piece of information that sums up our knowledge about something.</p>
<p>But let’s get back to our problem. As you test people for toxoplasmosis, you will make <strong>observations</strong>. Each person has a probability <code class="language-plaintext highlighter-rouge">X</code> of having toxoplasmosis, and you want to estimate this very X. Let’s assume that after seeing $N$ people, you detected $K$ people with toxoplasmosis.</p>
<p>You started with a uniform prior probability, and each observation will bend your vision on X, making it more and more accurate.
This updated vision of X is called its <strong>posterior distribution</strong>.
We call <code class="language-plaintext highlighter-rouge">O</code> (as in observation) the sequence of results of our N tests.</p>
<p>Bayes delivers a little formula to compute it :</p>
<pre>
$$P(X | O) = { P( O | X) P(X) \over { P(O)} }$$
</pre>
<p>$P(O)$ is the probability of observing what we observed. It is constant with X, and therefore of little interest. Likewise we chose our prior probability $P(X)$ to be uniform and it therefore does not vary with X. We are only interested in the proportionality relation :</p>
<pre>
$$ P(X | O) \propto P( O | X) $$
</pre>
<p>$P(O | X)$ is called the likelihood. It is, given X (the value we are looking for), the probability of observing what we observed. That’s usually something rather straightforward to compute.</p>
<p>In our case, the probability of observing the sequence of independent observations</p>
<pre>
$$ O = ({o_1}, ..., {o_N}) $$
</pre>
<p>is given by multiplying the probability of each observation :</p>
<pre>
$$ P(O | X) = P({o_1}| X) \times ... \times P({o_N} | X) $$
</pre>
<p>For one single observation, the probability to observe o<sub>i</sub> positive (respectively negative) is by definition X (respectively 1-X). In the end, if we observe K positive, and N-K negative the posterior probability is</p>
<pre>
$$ P(X | O) \propto X^{K}(1-X)^{N-K} $$
</pre>
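<p>Seen as a function of X, this is (up to normalization) a <a href="http://en.wikipedia.org/wiki/Beta_distribution">Beta distribution</a> with parameters K+1 and N-K+1. The likelihood itself comes from the <a href="http://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a>, up to a binomial coefficient that does not depend on X.</p>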
<p>It’s interesting to see how the posterior probability evolves with the number of observations. The graph below shows how the posterior gets more and more refined with the number of observations we get.</p>
<p><img src="https://docs.google.com/spreadsheet/oimg?key=0As3ux_ykgGX1dEk3LV9WQ1E0SE03RTMzbmlIbUFzbmc&oid=1&zx=2u5tfzvqm8zf" alt="Posterior probabilities" /></p>
<p>Now that we have the exact probability, we might consider computing any kind of estimates from this distribution. Arguably the most common output would be to compute a confidence interval : an interval [a,b] for which we can claim with a confidence of 90% our value lies somewhere between a and b.</p>
<p>Nowadays everybody has a computer, and the simplest way to produce such a confidence interval is probably to compute the cumulative distribution function of this distribution numerically.</p>
<p>A lot of statisticians also worked on finding very accurate confidence intervals for binomial distributions when the normal approximation does not hold. You might want to check <a href="http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval">this wikipedia page</a> if you want to use one of these formulas.</p>
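<p>The numerical route through the cumulative distribution function can be sketched in a few lines. This is a rough illustration rather than a reference implementation: the function name, the grid size and the choice of an equal-tailed interval are all assumptions. It evaluates the unnormalized posterior X<sup>K</sup>(1-X)<sup>N-K</sup> on a grid, accumulates it, and reads off the bounds.</p>

```python
def credible_interval(K, N, mass=0.90, steps=10_000):
    """Equal-tailed interval holding `mass` of the posterior X^K (1-X)^(N-K)."""
    # midpoints of a regular grid on [0, 1]
    xs = [(i + 0.5) / steps for i in range(steps)]
    density = [x ** K * (1 - x) ** (N - K) for x in xs]
    total = sum(density)  # normalization constant (up to the grid step)
    lower_target = (1 - mass) / 2
    upper_target = 1 - lower_target
    cumulative = 0.0
    lower = upper = None
    for x, d in zip(xs, density):
        cumulative += d / total
        if lower is None and cumulative >= lower_target:
            lower = x
        if upper is None and cumulative >= upper_target:
            upper = x
            break
    return lower, upper


# hypothetical survey: 3 infected people out of 20 tested
print(credible_interval(3, 20))
```

<p>For very large N the products underflow to zero in floating point, so a real implementation would rather work with logarithms or call a library implementation of the Beta distribution.</p>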
<h2 id="back-to-the-stars">Back to the stars</h2>
<p>Let’s go back to star ratings! In this section, for simplification we will consider a range of 1, 2, or 3 stars. We will try to estimate, given people’s answers, the posterior distribution of the proportion of people who would give respectively 1, 2, or 3 stars if we had the chance to ask an infinite number of people.</p>
<p>The random variable we observe follows a so-called categorical distribution. That’s basically a variable that takes its values within <code class="language-plaintext highlighter-rouge">{1,2,3}</code> with some probabilities p<sub>1</sub>, p<sub>2</sub>, p<sub>3</sub> with</p>
<pre>
$$ {p_1} + {p_2} + {p_3} = 1 $$
</pre>
<p>What makes it harder is that we are not looking at the distribution of a scalar value, but the joint distribution of three scalar values (or rather two considering the linear constraint).</p>
<p>Still, we can apply the same reasoning as we did with the estimation of a single probability :</p>
<pre>
$$ P({p_1}, {p_2}, {p_3} | O) \propto P( O | {p_1}, {p_2}, {p_3}) P({p_1}, {p_2}, {p_3}) $$
</pre>
<p>This time we will however include a prior. In order to simplify computations, it is always a good idea to choose a prior that has the same shape as the likelihood. Let’s first compute the likelihood.</p>
<p>Just like in our previous parameter estimation example, we can use the independence of our observations.</p>
<pre>
$$ P(O | {p_1}, {p_2}, {p_3}) = P({o_1}| {p_1}, {p_2}, {p_3}) \times \cdots \times P({o_N} | {p_1}, {p_2}, {p_3}) $$
</pre>
<p>And the likelihood of each individual observation is given by the associated probability</p>
<pre>
$$\forall j \in \{1,2,3\}, ~~ \forall 1\leq i \leq N, ~~P( {o_i = j} | {p_1}, {p_2}, {p_3}) = {p_j} $$
</pre>
<p>Therefore if within the N reviews we received there were respectively K<sub>1</sub>, K<sub>2</sub>, K<sub>3</sub> reviews with 1, 2 and 3 stars, we have a likelihood of</p>
<pre>
$$ P(O | {p_1}, {p_2}, {p_3}) = {p_1}^{K_1} {p_2}^{K_2} {p_3}^{K_3} $$
</pre>
<p>As a function of (p<sub>1</sub>, p<sub>2</sub>, p<sub>3</sub>), this is, up to normalization, a <a href="http://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet distribution</a> with parameter</p>
<pre>
$$
\alpha = \left(
\begin{array}{c}
{K_1} + 1 \\
{K_2} + 1 \\
{K_3} + 1
\end{array}
\right)
$$
</pre>
<p>In order to make the math much simpler, let’s consider a prior with the very same shape, and parameter alpha<sup>0</sup>.</p>
<p>The posterior is proportional to</p>
<pre>
$$ P({p_1}, {p_2}, {p_3} | O) \propto { {p_1}^{K_1} } { {p_2}^{K_2} } { {p_3}^{K_3} } { {p_1}^{ {\alpha_1^0} - 1 } } { {p_2}^{ {\alpha_2^0} - 1 } } { {p_3}^{ {\alpha_3^0} - 1 } } $$
</pre>
<p>Which we can factorize into</p>
<pre>
$$ P({p_1}, {p_2}, {p_3} | O) \propto { {p_1}^{ {K_1} + {\alpha_1^0} - 1 } } { {p_2}^{ {K_2} + {\alpha_2^0} - 1 } } { {p_3}^{ {K_3} + {\alpha_3^0} - 1 } }. $$
</pre>
<p>in which we recognize a Dirichlet distribution with parameter</p>
<pre>
$$ {\alpha^1} = \left( \begin{array}{c}
{K_1} + \alpha_1^0 \\
{K_2} + \alpha_2^0 \\
{K_3} + \alpha_3^0
\end{array}
\right)
$$
</pre>
<p>Now what we really want is an estimate of the average number of stars. Let’s consider the use of the expectation of this average, given our posterior.</p>
<pre>
$$ E( {p_1} + 2{p_2} + 3{p_3} | O ) = E( {p_1} | O ) + 2 E({p_2} | O ) + 3E({p_3} | O ) $$
</pre>
<p>The expectation of the probability of getting 1, 2, or 3 stars is given by the mean of the Dirichlet distribution</p>
<pre>
$$ E(p_i | O) = { {\alpha_i^1} \over { {\alpha_1^1} + {\alpha_2^1} + {\alpha_3^1} } } $$
</pre>
<p>We therefore have for our bayesian average :</p>
<pre>
$$ rating({K_1}, {K_2}, {K_3}) = \frac{ {K_1} + \alpha_1^0}{ N + A} + 2 \frac{ {K_2} + \alpha_2^0}{ N + A} + 3 \frac{ {K_3} + \alpha_3^0}{ N + A}, $$
</pre>
<p>where we define</p>
<pre>
$$ N = {K_1} + {K_2} + {K_3}~~and~~A = {\alpha_1^0} + {\alpha_2^0} + {\alpha_3^0} $$
</pre>
<p>We can regroup that as</p>
<pre>
$$ rating({K_1}, {K_2}, {K_3}) = \frac{ \left(\alpha_1^0 + 2 \alpha_2^0 + 3 \alpha_3^0 \right) + \left({K_1} + 2{K_2} + 3{K_3}\right) }{A + N} $$
</pre>
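<p>Voilà ! Let’s just digest this formula in order to turn it into something usable in real life. Setting C = A and C × m = α<sub>1</sub><sup>0</sup> + 2α<sub>2</sub><sup>0</sup> + 3α<sub>3</sub><sup>0</sup>, the Bayesian average for star ratings consists of choosing two parameters C and m in which</p>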
<ul>
<li>m represents a prior for the average of the stars</li>
<li>C represents how confident we are in our prior. It is equivalent to a number of observations.</li>
</ul>
<p>Then the bayesian average will be</p>
<pre>
$$ rating({K_1}, {K_2}, {K_3}) = \frac{ C \times m + total~number~of~stars }{C + number~of~reviews } $$
</pre>
<p>If you have the relevant data and infinite time, you may set these two values by fitting a Dirichlet distribution on the dataset of the ratings of all your computer books. However it is very common to just choose a pair of parameters that mimics the behavior we are looking for. m is the value toward which we will adjust the average review of products with very few reviews. The bigger C is, the higher the number of reviews required to “get away from m”.</p>
<p>Let’s now take a look at our first example. Two possible values might be for instance, <code class="language-plaintext highlighter-rouge">m=3</code> and <code class="language-plaintext highlighter-rouge">C=5</code>.</p>
<p>The bayesian averages for the two books become</p>
<pre>
$$ {rating_{book~A}} = \frac{5 \times 3 + 5 \times 1}{ 5 + 1 } = 3.3 $$
$$ {rating_{book~B}} = \frac{5 \times 3 + 4.5 \times 50 }{ 5 + 50 } = 4.36 $$
</pre>
<p>As expected, Book B has a better bayesian average than Book A.</p>
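<p>Both forms of the formula are easy to check numerically. The sketch below is only an illustration (the function names and the small 3-star example are made up): it computes the rating once from the Dirichlet posterior and once from the (C, m) form, then reproduces the figures of the two books.</p>

```python
def dirichlet_rating(counts, prior):
    """Posterior mean number of stars; counts[i] and prior[i] refer to i+1 stars."""
    posterior = [k + a for k, a in zip(counts, prior)]
    total = sum(posterior)
    return sum((i + 1) * alpha / total for i, alpha in enumerate(posterior))


def bayesian_average(total_stars, n_reviews, m, C):
    """Same quantity, written with the prior mean m and its weight C."""
    return (C * m + total_stars) / (C + n_reviews)


# a prior of (1, 1, 1) pseudo-reviews means C = 3 and m = (1 + 2 + 3) / 3 = 2
counts = (0, 1, 4)  # hypothetical product: one 2-star and four 3-star reviews
print(dirichlet_rating(counts, (1, 1, 1)))
print(bayesian_average(0 * 1 + 1 * 2 + 4 * 3, sum(counts), m=2, C=3))

# the two books of the introduction, with m=3 and C=5
book_a = bayesian_average(5 * 1, 1, m=3, C=5)
book_b = bayesian_average(4.5 * 50, 50, m=3, C=5)
print(round(book_a, 2), round(book_b, 2))
```

<p>Only the total number of stars and the number of reviews are needed, which makes the (C, m) form cheap to maintain as a sortable field.</p>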
Of wiggle photography2013-02-24T00:00:00+00:00https://fulmicoton.com/posts/wiggle<h2 id="demo">Demo</h2>
<canvas id="wigglestereoscopy"></canvas>
<h2 id="wiggle-stereoscopy">Wiggle Stereoscopy</h2>
<p>Wiggle stereoscopy is probably the cheapest trick to give your brain a 3D feeling. While most techniques try to take advantage of binocular vision by giving your left eye and your right eye different pictures, <a href="http://en.wikipedia.org/wiki/Wiggle_stereoscopy">wiggle stereoscopy</a> is just about looping over a couple of pictures with a slight shift of point of view. Your brain will interpret the parallax and give you a sense of 3D. What’s <a href="http://en.wikipedia.org/wiki/Parallax">parallax</a> ? Imagine you’re in the mountains, riding in a car. You’re looking out the window. Objects far away seem to move slowly, while closer objects seem to move very fast.
That’s parallax!</p>
<p>In this post, I’ll show how to implement this effect with CoffeeScript, Canvas (no need for WebGL, Flash or anything), and a simple image editor.</p>
<h2 id="depth-map">Depth Map</h2>
<p>But first, we need to be able to get a description of the 3D shape of our picture. <strong>The three animations above have been created very simply from 2D images.</strong> The 3D information has been added using a “depth map”. A depth map consists of an image of the same size as your original picture, but encoding the z-coordinate of each pixel. Basically it’s encoding how close to us the pixel is supposed to be.</p>
<p>To create one, just open your favorite image editor, and create a new picture. Fill it with black. Black will be our background. It is assumed to be far away. Now just paint very roughly, in shades of gray, the regions you want to “pop out”. The brighter a region is, the closer it will feel. Don’t try to be too accurate, 3 shades are probably sufficient.</p>
<p><img src="/images/wiggle/depthmap.png" alt="Wiggle stereoscopy geometry" /></p>
<p>At this point the resulting depth map probably consists of a couple of wide areas associated to different levels. But let’s get real, most of the world is not that square. One very neat trick here is to smooth your depth map with a gaussian blur. It will very likely make everything look more natural. And to tell the truth, our rendering will look very shitty if your depth map is not smooth.</p>
<h2 id="the-poor-mans-3d">The poor man’s 3D</h2>
<p>We could write a ray-tracer, but I’m afraid it might take a little too long at execution (And I’m lazy, so prove me wrong internet!).
Instead we’re going to do something very cheap. High Schooler cheap actually.
We’ll do the opposite of ray tracing and compute the projection of each point of our initial image onto our camera’s sensor.</p>
<p>So let’s take a look at the geometry of our little problem here.
Given a pixel <code class="language-plaintext highlighter-rouge">(i,j)</code> and its depth <code class="language-plaintext highlighter-rouge">d(i,j)</code>, we want to find the coordinates <code class="language-plaintext highlighter-rouge">(x,y)</code> of its projection in our camera. In order to get a nice effect we decided that all the cameras would point at a very specific point of the image, which would always be projected at the center of our rendered image <code class="language-plaintext highlighter-rouge">(x0, y0)</code>.</p>
<p>For simplicity I drew a little schema in 2D, but it is actually easy to adapt it to 3D.
We just apply Thales’ theorem to this problem in order to get the distance x of the projected
point to the center of our rendered image <code class="language-plaintext highlighter-rouge">x0</code>.</p>
<p><img src="/images/wiggle/wiggle.png" alt="Wiggle stereoscopy geometry" /></p>
<p>Ok so we assume we are looking at the center <code class="language-plaintext highlighter-rouge">O</code> of our scene, and take this point as the depth 0 reference. It means that the image of <code class="language-plaintext highlighter-rouge">O</code> is the center <code class="language-plaintext highlighter-rouge">Q</code> of our scene. <code class="language-plaintext highlighter-rouge">H</code> is the projection of the camera <code class="language-plaintext highlighter-rouge">C</code> on the scene plane.</p>
<p>Given a pixel at a position i, its image <code class="language-plaintext highlighter-rouge">A</code> is at a depth <code class="language-plaintext highlighter-rouge">d(i)</code>. We want to find the coordinates of its image on the projection screen (or rather, the sensor of our camera). Basically we want to find the distance QB=y.</p>
<p>Thalès gives us the answer to this problem :
<img src="http://latex.codecogs.com/gif.latex?y=QB=\frac{L}{L-d}\left(i%20-%20{c_x}%20\right%20)%20+{c_x}" alt="parallax formula" /></p>
<p>So if we have a value for a pixel at coordinates (i,j) it should be projected into the pixel (x,y) such that :</p>
<p><img src="http://latex.codecogs.com/gif.latex?\left\{\begin{matrix}%20x=\frac{L}{L-d}\left(i-{c_x}\right)+{c_x}\\%20y=\frac{L}{L-d}\left(j-{c_y}\right)+{c_y}%20\end{matrix}\right." alt="parallax formula" /></p>
<p>where c<sub>x</sub> and c<sub>y</sub> are the coordinates of the projection of the camera on the scene.</p>
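<p>The projection formula above is small enough to be checked by hand. Here is a minimal Python sketch of it (the function name <code class="language-plaintext highlighter-rouge">project</code> and the sample values are assumptions made for the illustration): a pixel at depth 0 stays in place, while a pixel closer to the camera is pushed away from (c<sub>x</sub>, c<sub>y</sub>).</p>

```python
def project(i, j, d, L, cx, cy):
    """Project pixel (i, j) of depth d for a camera at distance L above (cx, cy)."""
    r = L / (L - d)  # magnification: grows as the point gets closer to the camera
    x = r * (i - cx) + cx
    y = r * (j - cy) + cy
    return x, y


# hypothetical camera at distance 100, right above the point (50, 50)
print(project(10, 20, 0, 100, 50, 50))   # depth 0: unchanged
print(project(60, 50, 50, 100, 50, 50))  # halfway to the camera: offset doubled
```

<p>The point (c<sub>x</sub>, c<sub>y</sub>) itself is a fixed point of the projection whatever its depth, which is what makes the picture seem to pivot around it.</p>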
<p>In this simple version, our projection screen has been considered parallel
to the scene, whereas we might want to keep
it perpendicular to the axis of the camera. With the values we used for our camera, the correcting factor was however negligible (0.99 instead of 1).</p>
<figure class="highlight"><pre><code class="language-coffeescript" data-lang="coffeescript"><span class="nx">render_scene</span> <span class="o">=</span> <span class="p">(</span><span class="nx">img</span><span class="p">,</span> <span class="nx">depth</span><span class="p">,</span> <span class="nx">camera</span><span class="p">)</span><span class="o">-></span>
    <span class="c1"># the poor man's 3d</span>
    <span class="nx">W</span> <span class="o">=</span> <span class="nx">img</span><span class="p">.</span><span class="na">width</span>
    <span class="nx">H</span> <span class="o">=</span> <span class="nx">img</span><span class="p">.</span><span class="na">height</span>
    <span class="nx">dest</span> <span class="o">=</span> <span class="nx">create_image_data</span> <span class="nx">W</span><span class="p">,</span><span class="nx">H</span>
    <span class="nx">imgp</span> <span class="o">=</span> <span class="nx">img</span><span class="p">.</span><span class="na">data</span>
    <span class="nx">depthp</span> <span class="o">=</span> <span class="nx">depth</span><span class="p">.</span><span class="na">data</span>
    <span class="nx">destp</span> <span class="o">=</span> <span class="nx">dest</span><span class="p">.</span><span class="na">data</span>
    <span class="nx">x0</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="na">ceil</span> <span class="p">(</span><span class="nx">W</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span>
    <span class="nx">y0</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="na">ceil</span> <span class="p">(</span><span class="nx">H</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span>
    <span class="nx">d0</span> <span class="o">=</span> <span class="nx">depthp</span><span class="p">[</span> <span class="p">(</span><span class="nx">x0</span><span class="o">+</span><span class="nx">W</span><span class="o">*</span><span class="nx">y0</span><span class="p">)</span><span class="o">*</span><span class="mi">4</span> <span class="p">]</span>
    <span class="nx">c</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="nx">N</span> <span class="o">=</span> <span class="nx">W</span><span class="o">*</span><span class="nx">H</span><span class="o">*</span><span class="mi">4</span>
    <span class="k">for</span> <span class="nx">j</span> <span class="o">in</span> <span class="p">[</span><span class="mi">0</span><span class="p">...</span><span class="na">H</span><span class="p">]</span>
        <span class="k">for</span> <span class="nx">i</span> <span class="o">in</span> <span class="p">[</span><span class="mi">0</span><span class="p">...</span><span class="na">W</span><span class="p">]</span>
            <span class="nx">d</span> <span class="o">=</span> <span class="p">(</span><span class="nx">depthp</span><span class="p">[</span><span class="nx">c</span><span class="p">]</span><span class="o">-</span><span class="nx">d0</span><span class="p">)</span>
            <span class="nx">r</span> <span class="o">=</span> <span class="nx">camera</span><span class="p">.</span><span class="na">L</span> <span class="o">/</span> <span class="p">(</span><span class="nx">camera</span><span class="p">.</span><span class="na">L</span> <span class="o">-</span> <span class="nx">d</span><span class="p">)</span>
            <span class="c1"># compute x,y : the projection of pixel (i,j,d)</span>
            <span class="c1"># according to our camera (scale by r first, then round down).</span>
            <span class="nx">x</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="na">floor</span><span class="p">(</span><span class="nx">x0</span> <span class="o">-</span> <span class="nx">camera</span><span class="p">.</span><span class="na">x</span> <span class="o">+</span> <span class="p">(</span><span class="nx">camera</span><span class="p">.</span><span class="na">x</span> <span class="o">+</span> <span class="nx">i</span> <span class="o">-</span> <span class="nx">x0</span><span class="p">)</span> <span class="o">*</span> <span class="nx">r</span><span class="p">)</span>
            <span class="nx">y</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="na">floor</span><span class="p">(</span><span class="nx">y0</span> <span class="o">-</span> <span class="nx">camera</span><span class="p">.</span><span class="na">y</span> <span class="o">+</span> <span class="p">(</span><span class="nx">camera</span><span class="p">.</span><span class="na">y</span> <span class="o">+</span> <span class="nx">j</span> <span class="o">-</span> <span class="nx">y0</span><span class="p">)</span> <span class="o">*</span> <span class="nx">r</span><span class="p">)</span>
            <span class="nx">destc</span> <span class="o">=</span> <span class="p">(</span><span class="nx">x</span><span class="o">+</span><span class="nx">W</span><span class="o">*</span><span class="nx">y</span><span class="p">)</span><span class="o">*</span><span class="mi">4</span>
            <span class="k">if</span> <span class="p">(</span><span class="mi">0</span> <span class="o"><=</span> <span class="nx">destc</span> <span class="o"><</span> <span class="nx">N</span><span class="p">)</span>
                <span class="k">for</span> <span class="nx">v</span> <span class="o">in</span> <span class="p">[</span><span class="mi">0</span><span class="p">..</span><span class="mi">3</span><span class="p">]</span>
                    <span class="nx">destp</span><span class="p">[</span><span class="nx">destc</span><span class="o">+</span><span class="nx">v</span><span class="p">]</span> <span class="o">=</span> <span class="nx">imgp</span><span class="p">[</span><span class="nx">c</span><span class="o">+</span><span class="nx">v</span><span class="p">]</span>
            <span class="nx">c</span><span class="o">+=</span><span class="mi">4</span>
    <span class="nx">fill_with_average</span> <span class="nx">destp</span><span class="p">,</span> <span class="nx">W</span><span class="p">,</span> <span class="nx">H</span>
    <span class="k">for</span> <span class="nx">c</span> <span class="o">in</span> <span class="p">[</span><span class="mi">0</span><span class="p">...</span><span class="na">W</span><span class="o">*</span><span class="nx">H</span><span class="o">*</span><span class="mi">4</span><span class="p">]</span> <span class="k">by</span> <span class="mi">4</span>
        <span class="nx">destp</span><span class="p">[</span><span class="nx">c</span><span class="o">+</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="mi">255</span>
    <span class="k">return</span> <span class="nx">dest</span></code></pre></figure>
<p>At the end of the function we call a function named <code class="language-plaintext highlighter-rouge">fill_with_average</code>. It is there to fill the gaps. Since we are just
moving pixels, some pixels are likely to be moved to the same destination pixel, which means that some pixels of
the destination image never get written. Yes, that’s a cheap algorithm, but we
just run a small method to detect these gap pixels and fill them with
an average of their neighbors.</p>
<p>Wiggle stereoscopy typically uses two pictures. Actually, I found that the effect was much more effective when rendering a lot of pictures, just turning around a given point of view.</p>
<p>You can check out the final script on <a href="https://github.com/poulejapon/wigglejs">github</a>.</p>
<script src="/js/wiggle/ready.min.js"></script>
<script src="/js/wiggle/wiggle.js"></script>
<script type="text/javascript">
(function() {
domready(function() {
loadWiggle('poulejapon');
});
function loadWiggle(imgId) {
if (window.animation) {
window.animation.stop();
window.animation = null;
}
canvas = document.getElementById('wigglestereoscopy');
load_animation('/images/wiggle/' + imgId, canvas, function(animation) {
window.animation = animation;
animation.play(canvas, 24);
});
}
window.loadWiggle = loadWiggle;
})();
</script>
Of the pearl puzzle2013-02-16T00:00:00+00:00https://fulmicoton.com/posts/pearls<h2 id="comparison-sort-complexity">Comparison sort complexity</h2>
<p>In my first post I dropped a line about <code class="language-plaintext highlighter-rouge">525</code> being the theoretical minimum number of comparisons required to sort a list of <code class="language-plaintext highlighter-rouge">100</code> elements, without explaining it. I will do so here, and show how the same thought process can help solve the 12 pearls puzzle.</p>
<p>Let’s go through a <a href="http://en.wikipedia.org/wiki/Thought_experiment">thought experiment</a>.
Let’s imagine you are playing a game with a friend.
You are locked in a room and your friend is outside, holding a randomly shuffled deck of cards marked with numbers ranging from 1 to n. You have a similar deck of cards, but sorted.
Your goal is to put it in the same order as your friend’s deck. The only questions you can ask are of the form “does the card at position <code class="language-plaintext highlighter-rouge">#6</code> show a bigger number than the card at position <code class="language-plaintext highlighter-rouge">#29</code>?” and your friend on the other side will answer yes or no.</p>
<p>We assume you are a very reasonable person and that you ask your questions deterministically. In other words, if you played the game a second time, you would ask the same questions as long as you got the same answers.</p>
<p>Suppose you play the game a LOT of times. We could log in a book
the shuffle of the cards and the answer you got to each question.
The book would look something like:</p>
<p><code class="language-plaintext highlighter-rouge">[ 3, 17, 29, 12, ..., 15 ]: Yes No Yes Yes No No Yes</code></p>
<p>For two different shuffles, you cannot possibly have gotten the exact same series of answers: if you had, you wouldn’t have had enough information to tell the two shuffles apart while playing.</p>
<p>In other words, if you were to play all the <code class="language-plaintext highlighter-rouge">n!</code> (factorial n) possible shuffles, you would get <code class="language-plaintext highlighter-rouge">n!</code> different series of answers.
Now let’s call C the highest number of questions you had to ask
to finish a game. Since every answer is a yes or a no, 2<sup>C</sup> is an upper bound for the number of different series of answers you can get. We therefore have</p>
<p><img src="http://latex.codecogs.com/gif.latex?2^C \geq n!" title="2^C \geq n!" /></p>
<p>The algorithm you ran as you were playing is called a comparison sort. C is the worst-case complexity of your algorithm.</p>
<p>Applying the base-2 logarithm to this inequality leads us to
<img src="http://latex.codecogs.com/gif.latex?C \geq log_2 (n! )" title="C \geq log_2 (n!)" /></p>
<p>However smart you are, you will never come up with a strategy that always sorts your deck of cards in fewer than
<code class="language-plaintext highlighter-rouge">log_2 (n!)</code> questions.</p>
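<p>The counting argument can be checked by brute force on a small deck. The sketch below is not part of the original post: it plays the game on every shuffle of 4 cards with a deterministic insertion sort (the helper name <code class="language-plaintext highlighter-rouge">answers_for</code> is made up for this illustration), logs the yes/no answers, and checks that no two shuffles share the same log.</p>

```python
from itertools import permutations


def answers_for(shuffle):
    """Play the game with a deterministic insertion sort.

    Returns the tuple of yes/no answers received for this shuffle.
    """
    log = []
    order = []  # positions of the friend's deck, sorted by card value
    for pos in range(len(shuffle)):
        i = len(order)
        while i > 0:
            # question: "is the card at order[i-1] bigger than the card at pos?"
            bigger = shuffle[order[i - 1]] > shuffle[pos]
            log.append(bigger)
            if not bigger:
                break
            i -= 1
        order.insert(i, pos)
    return tuple(log)


n = 4
book = {shuffle: answers_for(shuffle) for shuffle in permutations(range(n))}
distinct = len(set(book.values()))          # number of different answer series
worst = max(len(a) for a in book.values())  # C, the worst-case number of questions
```

<p>For this particular strategy every one of the 24 shuffles gets its own series of answers, and the worst case C satisfies 2<sup>C</sup> ≥ 4!, as the argument predicts.</p>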
<p>This result is usually stated as: the best possible complexity for a comparison sort is of the order of <code class="language-plaintext highlighter-rouge">n ln n</code>.
This can be shown very easily using the <a href="http://en.wikipedia.org/wiki/Stirling's_approximation">Stirling formula</a>.</p>
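<p>As a quick check of the <code class="language-plaintext highlighter-rouge">525</code> figure, here is a small sketch (the function names are mine, not from the post). It computes the smallest C such that 2<sup>C</sup> ≥ n! with exact integer arithmetic, and compares it with the leading terms of Stirling’s formula.</p>

```python
import math


def min_comparisons(n):
    """Smallest C with 2**C >= n!, i.e. ceil(log2(n!)), using integers only."""
    return (math.factorial(n) - 1).bit_length()


def stirling_estimate(n):
    """Leading terms of Stirling's formula for log2(n!): n*log2(n) - n*log2(e)."""
    return n * math.log2(n) - n * math.log2(math.e)


bound = min_comparisons(100)       # the exact lower bound for 100 elements
estimate = stirling_estimate(100)  # slightly below the exact value
```

<p>The estimate drops the lower-order terms of Stirling’s formula, which is why it lands a little under the exact bound while growing like <code class="language-plaintext highlighter-rouge">n ln n</code>.</p>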
<p><strong>In the rest of this post, I’ll try to show that this kind of reasoning can actually help solve problems like the 12 pearls puzzle.</strong></p>
<h2 id="the-12-pearls-puzzle">The 12 pearls puzzle</h2>
<p><img src="/images/pearls/balance_scale.jpg" alt="A balance scale" /></p>
<p>You’ve been given 12 pearls and a balance scale
like the one in the picture. It has two plates, and you
can compare the weight of the things you put on each plate.</p>
<p>One of the pearls is fake, and you don’t know which one.
All pearls have the very same weight, except for the fake one which is either heavier or lighter (and you don’t know which).</p>
<p>The problem consists of finding out which pearl is fake and whether it is heavier or lighter, using the balance scale at most three times.</p>
<h2 id="three-times-cant-be-enough">Three times can’t be enough?</h2>
<p><em>Spoiler alert. If like me, you like puzzles, you might
want to stop reading now, and come back after you solved it.</em></p>
<p>The problem is intentionally misleading. You might think that
the balance gives you only two outputs: “heavier” or “lighter”.</p>
<p>If that were so, the same argument as before would show that using the balance three times only makes it possible to discriminate between 2<sup>3</sup>=8 configurations.</p>
<p>But in this problem, your answer consists of identifying the fake pearl (12 candidates) and whether it is lighter or heavier. The problem has 24 possible answers, which is greater than 8.</p>
<p>The first trick is to notice that such a balance has a ternary output: it may also tell you that the objects on both plates have the exact same weight. Three weighings therefore give 3<sup>3</sup>=27 possible outcomes, which is greater than 24.</p>
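<p>The same counting argument fits in a few lines. This is only a sketch with a made-up function name: it gives a lower bound on the number of weighings, and says nothing about whether a strategy reaching that bound actually exists.</p>

```python
def weighings_lower_bound(nb_pearls, outcomes_per_weighing=3):
    """Smallest k such that outcomes_per_weighing**k >= number of possible answers.

    Each pearl can be the fake one, and either heavier or lighter,
    hence 2 * nb_pearls possible answers.
    """
    answers = 2 * nb_pearls
    k = 0
    reachable = 1  # number of outcome sequences distinguishable with k weighings
    while reachable < answers:
        reachable *= outcomes_per_weighing
        k += 1
    return k


ternary = weighings_lower_bound(12)                          # real balance scale
binary = weighings_lower_bound(12, outcomes_per_weighing=2)  # "heavier"/"lighter" only
```

<p>With a ternary scale the bound for 12 pearls is 3 weighings; if the scale could only say “heavier” or “lighter”, at least 5 would be needed.</p>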
<h2 id="finding-out-the-first-weighting-with-no-sweat">Finding out the first weighing with no sweat</h2>
<p>Now let’s consider the first weighing. First we reject the possibility of putting a different number of pearls on each plate,
as it doesn’t give much information if the balance tips toward the plate holding more pearls.</p>
<p>The first weighing will be of 1, 2, 3, 4, 5, or 6 pearls on each plate.</p>
<p>Let’s show how simple reasoning lets us get rid of some of these options. After this first weighing, we will have only 2 shots left to find our pearl. With two weighings we can discriminate between at most 3<sup>2</sup> = 9 possibilities.</p>
<p>For 1, 2, or 3 pearls per plate, if the scale tells us that the two plates have equal weights, we will only know that the fake pearl is among the remaining 10, 8, or 6 pearls. We won’t have any information about the fake pearl being heavier or lighter either. That’s respectively 20, 16, or 12 possible answers. We won’t be able to discriminate between that many answers with only 2 weighings. 1, 2, and 3 are therefore not an option.</p>
<p>Now let’s consider weighing 5 pearls against 5 pearls, or 6 pearls against 6 pearls. If the scale says that the left plate is heavier than the right plate, we will be stuck with the possibilities that either the fake pearl is on the left plate
and is heavier (respectively 5 and 6 possibilities), or that the fake pearl is on the right plate and is lighter than the other pearls (respectively 5 and 6 possibilities). Overall we will have to
discriminate between 10 or 12 possibilities with two weighings, which is impossible.</p>
<p>We haven’t explored any of the possibilities, yet we showed that <strong>the only possible first move is to weigh 4 pearls against 4 other pearls!</strong></p>
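<p>The elimination above can be replayed mechanically. The following sketch (names are hypothetical) only applies the counting test: a first weighing of k pearls against k pearls is kept if every outcome leaves at most 3<sup>2</sup> = 9 candidate answers. Passing the test is necessary, not sufficient.</p>

```python
def viable_first_weighings(nb_pearls=12, weighings_left=2):
    """Sizes k for which no outcome of 'k against k' leaves too many answers."""
    budget = 3 ** weighings_left
    viable = []
    for k in range(1, nb_pearls // 2 + 1):
        # scale tilts: k "heavier" suspects on one plate, k "lighter" on the other
        tilted = 2 * k
        # scale balances: each pearl left aside may be heavier or lighter
        balanced = 2 * (nb_pearls - 2 * k)
        if max(tilted, balanced) <= budget:
            viable.append(k)
    return viable


first_moves = viable_first_weighings()
```

<p>For 12 pearls the only surviving value is k = 4, which matches the reasoning above.</p>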
<h2 id="a-naive-implementation">A naive implementation</h2>
<p>Let’s now try to write an algorithm that solves this problem
for n pearls. More accurately, it will return, given n, the minimal number of measures needed to solve the problem for n pearls. The first implementation will be very naive and have a complexity I honestly don’t want to think about. It will help, however, to make some of the concepts I talked about more concrete.</p>
<p>We will go through all possible weighings, while keeping a list of all the remaining answers that are still possible. For each weighing we will consider the new list of possible configurations for each of the three outputs of the scale… A bit like you would do in a game of <em>Guess Who?</em>: we keep a list of all the possible answers, ask questions, and depending on the answers, rule out candidates until only one is left.</p>
<p><img src="/images/pearls/guessWho.jpg" alt="Guess who" /></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># let's call that the naive implementation!
</span>
<span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">combinations</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>

<span class="k">def</span> <span class="nf">measure</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">):</span>
    <span class="c1"># returns the result of a measure:
</span>    <span class="c1"># 1 if left is heavier, -1 if lighter, 0 if balanced
</span>    <span class="c1"># (same as the Python 2 builtin cmp, which is gone in Python 3)
</span>    <span class="n">left_weight</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">left</span><span class="p">)</span>
    <span class="n">right_weight</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">right</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">left_weight</span> <span class="o">></span> <span class="n">right_weight</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">left_weight</span> <span class="o"><</span> <span class="n">right_weight</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">measures</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="c1"># generator yielding all the possible ways
</span>    <span class="c1"># to select 2 sets of k pearls
</span>    <span class="c1"># to put on the plates of the scale
</span>    <span class="k">for</span> <span class="n">nb_pearls</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">n</span><span class="o">//</span><span class="mi">2</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">pearls_involved</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="n">nb_pearls</span><span class="o">*</span><span class="mi">2</span><span class="p">):</span>
            <span class="n">pearls_involved_set</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">pearls_involved</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">left</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="n">pearls_involved</span><span class="p">,</span> <span class="n">nb_pearls</span><span class="p">):</span>
                <span class="n">right</span> <span class="o">=</span> <span class="n">pearls_involved_set</span><span class="p">.</span><span class="n">difference</span><span class="p">(</span><span class="n">left</span><span class="p">)</span>
                <span class="k">yield</span> <span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">populations_after_measures</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="c1"># loops on the possible ways to make a measure
</span>    <span class="c1"># and yields the list of the three populations
</span>    <span class="c1"># matching the 3 possible outcomes of
</span>    <span class="c1"># the scale
</span>    <span class="k">for</span> <span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span> <span class="ow">in</span> <span class="n">measures</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">measure_results</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">configuration</span> <span class="ow">in</span> <span class="n">population</span><span class="p">:</span>
            <span class="n">measure_output</span> <span class="o">=</span> <span class="n">measure</span><span class="p">(</span><span class="n">configuration</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span>
            <span class="n">measure_results</span><span class="p">[</span><span class="n">measure_output</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">configuration</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">measure_results</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">yield</span> <span class="n">measure_results</span><span class="p">.</span><span class="n">values</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">browse_solutions</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">branches</span> <span class="ow">in</span> <span class="n">populations_after_measures</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="k">yield</span> <span class="nb">max</span><span class="p">(</span>
            <span class="n">solve</span><span class="p">(</span><span class="n">branch_population</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">branch_population</span> <span class="ow">in</span> <span class="n">branches</span>
        <span class="p">)</span>

<span class="k">def</span> <span class="nf">solve</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">population</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">0</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">solutions</span> <span class="o">=</span> <span class="n">browse_solutions</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="nb">min</span><span class="p">(</span><span class="n">solutions</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">pearl_problem</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">n</span> <span class="o"><=</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="n">population</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">pearl_weights</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">n</span>
        <span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
        <span class="n">population</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">))</span>
        <span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># negative weight haha!
</span>        <span class="n">population</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">solve</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span></code></pre></figure>
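<p>To make the partition step concrete, here is a small standalone sketch (separate from the listing above, with helper names of my own) that builds the 24 configurations for 12 pearls and counts how many of them fall into each outcome of a given weighing.</p>

```python
from collections import Counter


def all_configurations(n):
    """One tuple per possible answer: pearl i is heavier (+1) or lighter (-1)."""
    configs = []
    for i in range(n):
        for delta in (1, -1):
            weights = [0] * n
            weights[i] = delta
            configs.append(tuple(weights))
    return configs


def outcome(weights, left, right):
    """1 if the left plate is heavier, -1 if it is lighter, 0 if balanced."""
    left_weight = sum(weights[i] for i in left)
    right_weight = sum(weights[i] for i in right)
    return (left_weight > right_weight) - (left_weight < right_weight)


def branch_sizes(n, left, right):
    """Sizes of the populations left after one weighing, largest first."""
    counts = Counter(outcome(c, left, right) for c in all_configurations(n))
    return sorted(counts.values(), reverse=True)


four_vs_four = branch_sizes(12, range(0, 4), range(4, 8))
three_vs_three = branch_sizes(12, range(0, 3), range(3, 6))
```

<p>Weighing 4 against 4 splits the 24 answers into three groups of 8, whereas 3 against 3 leaves a group of 12, which two further weighings cannot resolve.</p>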
<h2 id="cutting-branches">Cutting branches</h2>
<p>With the naive implementation, things get reaaallllly slow starting at <code class="language-plaintext highlighter-rouge">n=5</code>. It is slow because recursive calls themselves perform recursive calls, and so on.
Your program is running through a gigantic tree. And how do you make a gigantic tree thinner? You cut its branches of course.</p>
<p>One simple way to cut branches, for instance, would be to let the different calls know about the current best result. That way, as soon as they detect they won’t be able to beat the high score, they can just stop exploring this branch. I won’t implement this optimization here. What I am going to do is return results directly when I reach a theoretical minimum. If I find a solution that finds our fake pearl in <code class="language-plaintext highlighter-rouge">ceil(log_3(len(population)))</code> measures, then I can just stop exploring sibling branches.</p>
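<p>In isolation, the early-exit idea looks like the sketch below. It is a simplified illustration with hypothetical names, not the exact helper used in this post: it takes the minimum of a lazy sequence of costs and stops pulling new values as soon as a known lower bound is reached.</p>

```python
def min_with_floor(values, floor):
    """Minimum of a lazy iterable, stopping as soon as `floor` is reached.

    `floor` is a value we know no candidate can beat, so reaching it
    means the remaining candidates need not be evaluated.
    """
    best = None
    for value in values:
        if best is None or value < best:
            best = value
        if best <= floor:
            break
    return best


def costs(seen):
    """Stand-in for an expensive generator; records what was actually computed."""
    for cost in [5, 4, 3, 3, 6, 2]:
        seen.append(cost)
        yield cost


seen = []
result = min_with_floor(costs(seen), floor=3)
```

<p>Here the generator is abandoned after three values: once a cost of 3 shows up, nothing that follows is computed. Note that the final 2 is never seen, so the shortcut is only valid when the floor really is a lower bound.</p>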
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># let's call that the "cutting branches"
# implementation!
</span>
<span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">combinations</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">ceil</span><span class="p">,</span> <span class="n">log</span>

<span class="k">def</span> <span class="nf">measure</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">):</span>
    <span class="c1"># returns the result of a measure:
</span>    <span class="c1"># 1 if left is heavier, -1 if lighter, 0 if balanced
</span>    <span class="n">left_weight</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">left</span><span class="p">)</span>
    <span class="n">right_weight</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">right</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">left_weight</span> <span class="o">></span> <span class="n">right_weight</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">left_weight</span> <span class="o"><</span> <span class="n">right_weight</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">measures</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="c1"># generator yielding all the possible ways
</span>    <span class="c1"># to select 2 sets of k pearls
</span>    <span class="c1"># to put on the plates of the scale
</span>    <span class="n">pearls</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">nb_pearls</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">n</span><span class="o">//</span><span class="mi">2</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">pearls_involved</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="n">pearls</span><span class="p">,</span> <span class="n">nb_pearls</span><span class="o">*</span><span class="mi">2</span><span class="p">):</span>
            <span class="n">pearls_involved_set</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">pearls_involved</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">left</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="n">pearls_involved</span><span class="p">,</span> <span class="n">nb_pearls</span><span class="p">):</span>
                <span class="n">right</span> <span class="o">=</span> <span class="n">pearls_involved_set</span><span class="p">.</span><span class="n">difference</span><span class="p">(</span><span class="n">left</span><span class="p">)</span>
                <span class="k">yield</span> <span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">populations_after_measures</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span> <span class="ow">in</span> <span class="n">measures</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">measure_results</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">configuration</span> <span class="ow">in</span> <span class="n">population</span><span class="p">:</span>
            <span class="n">measure_output</span> <span class="o">=</span> <span class="n">measure</span><span class="p">(</span><span class="n">configuration</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span>
            <span class="n">measure_results</span><span class="p">[</span><span class="n">measure_output</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">configuration</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">measure_results</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">yield</span> <span class="n">measure_results</span><span class="p">.</span><span class="n">values</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">browse_solutions</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">branches</span> <span class="ow">in</span> <span class="n">populations_after_measures</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="k">yield</span> <span class="nb">max</span><span class="p">(</span>
            <span class="n">solve</span><span class="p">(</span><span class="n">branch_population</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">branch_population</span> <span class="ow">in</span> <span class="n">branches</span>
        <span class="p">)</span>

<span class="k">def</span> <span class="nf">min_with_limit</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">limit</span><span class="p">):</span>
    <span class="c1"># min of the generator g, returning early
</span>    <span class="c1"># once the theoretical minimum is reached
</span>    <span class="n">res</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">g</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">g</span><span class="p">:</span>
        <span class="n">res</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">res</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">res</span><span class="o"><=</span><span class="n">limit</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">res</span>
    <span class="k">return</span> <span class="n">res</span>

<span class="k">def</span> <span class="nf">solve</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">population</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">0</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">limit</span> <span class="o">=</span> <span class="n">ceil</span><span class="p">(</span><span class="n">log</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">population</span><span class="p">),</span><span class="mi">3</span><span class="p">))</span><span class="o">-</span><span class="mi">1</span>
        <span class="n">solutions</span> <span class="o">=</span> <span class="n">browse_solutions</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">min_with_limit</span><span class="p">(</span><span class="n">solutions</span><span class="p">,</span> <span class="n">limit</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">pearl_problem</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">n</span> <span class="o"><=</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="n">population</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">pearl_weights</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">n</span>
        <span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
        <span class="n">population</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">))</span>
        <span class="n">pearl_weights</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># negative weight haha!
</span>        <span class="n">population</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">pearl_weights</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">solve</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span></code></pre></figure>
<p>Not much has changed, right? Our algorithm performs slightly better. We can now compute in 15s the solution of our problem for <code class="language-plaintext highlighter-rouge">n=7</code>. We still cannot solve our initial problem, which is for <code class="language-plaintext highlighter-rouge">n=12</code>.
How can we do better?</p>
<p>A good trick is to try to cut branches as soon as possible. In our case, it would be nice to explore the most promising branches first. That way, if there is a perfect answer, we will find it sooner.</p>
<p>To do so we only need to tune our <code class="language-plaintext highlighter-rouge">populations_after_measures</code> function.
We will sort its output so that the measures splitting the population as evenly as possible come first.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># let's call that the "cutting sooner"
# implementation!
</span>
<span class="k">def</span> <span class="nf">populations_after_measures</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="n">branches</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span> <span class="ow">in</span> <span class="n">measures</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">measure_results</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">configuration</span> <span class="ow">in</span> <span class="n">population</span><span class="p">:</span>
            <span class="n">measure_output</span> <span class="o">=</span> <span class="n">measure</span><span class="p">(</span><span class="n">configuration</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span>
            <span class="n">measure_results</span><span class="p">[</span><span class="n">measure_output</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">configuration</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">measure_results</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">branches</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">measure_results</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
    <span class="k">return</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">branches</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span><span class="nb">max</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">len</span><span class="p">,</span> <span class="n">x</span><span class="p">)))</span></code></pre></figure>
<p>That doesn’t look too promising, but it is actually a huge improvement over the previous algorithm. We can now solve our problem for 12 pearls!</p>
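<p>To make the effect of the sort key concrete, here is a minimal sketch on made-up data. The <code class="language-plaintext highlighter-rouge">branches</code> list and the helper name <code class="language-plaintext highlighter-rouge">worst_case</code> are illustrative assumptions, not part of the implementation above; each branch stands for the sub-populations left by one measure.</p>

```python
# Each branch is the list of sub-populations produced by one measure
# (one sub-population per balance outcome). The data is made up.
branches = [
    [list(range(10)), list(range(1)), list(range(1))],  # very uneven split
    [list(range(4)), list(range(4)), list(range(4))],   # perfectly even split
    [list(range(6)), list(range(3)), list(range(3))],   # in between
]

def worst_case(branch):
    # Size of the largest sub-population left after the measure.
    return max(map(len, branch))

# Same key as in the sorted(...) call above: smallest worst case first.
ordered = sorted(branches, key=worst_case)
print([worst_case(b) for b in ordered])  # [4, 6, 10]
```

<p>The even split comes first, so a search that stops at the first optimal answer tends to meet the best measures early.</p>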
<p><strong>So let’s sum up what we did here! We cut branches by returning a result as soon as we have some way to know it is optimal. We try to find those optimal results faster by going through the most promising branches first. We also saw, but didn’t implement, the idea of stopping as soon as we detect that a branch cannot do as well as its siblings.</strong></p>
<h2 id="being-more-human">Being more human</h2>
<p>Still, people solve this problem right? They’re definitely doing something smarter than this, or it would take them days to find the
solution.</p>
<p>I think the difference lies in the way our brain pictures
a population compared to the way the previous algorithm does. My brain doesn’t consider all the combinations like we do here, but understands that pearl 1 and pearl 2 play symmetric roles.</p>
<p>For the last algorithm we will try to sum up a population by a tuple (n,l,h,r) where</p>
<ul>
<li>n is the number of pearls for which we don’t know anything</li>
<li>l is the number of pearls for which we know that if they are fake, they are lighter</li>
<li>h is the number of pearls for which we know that if they are fake, they are heavier</li>
<li>r is the number of pearls for which we know that they are real.</li>
</ul>
<p>This greatly reduces the number of branches in our tree. In addition, the space of possible arguments with which <code class="language-plaintext highlighter-rouge">solve</code> is called is small enough to cache the results.</p>
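<p>As a rough illustration of this compact representation, the sketch below applies one weighing to the starting population of 12 pearls. The tuple order follows the way the implementation further down unpacks it, <code class="language-plaintext highlighter-rouge">(anything, light, heavy, real)</code>. The function <code class="language-plaintext highlighter-rouge">weigh_unknowns</code> is a hypothetical helper that only handles the simple case where nothing is known yet, and the <code class="language-plaintext highlighter-rouge">lru_cache</code> decorator is just one possible way to cache results keyed on such small tuples.</p>

```python
from functools import lru_cache

def nb_of_answers(pop):
    # An unknown pearl can be fake-heavy or fake-light (2 answers),
    # a "light" or "heavy" candidate only has 1 answer left.
    n, l, h, r = pop
    return 2 * n + l + h

@lru_cache(maxsize=None)  # tuples are hashable, so caching is straightforward
def weigh_unknowns(pop, k):
    """Hypothetical helper: weigh k unknown pearls against k other unknown
    pearls, assuming no light/heavy candidates exist yet.
    Returns the populations for (even, left lighter, left heavier)."""
    n, l, h, r = pop
    assert l == 0 and h == 0 and 2 * k <= n
    # Balance is even: the 2k weighed pearls are all real.
    even = (n - 2 * k, 0, 0, r + 2 * k)
    # Balance tilts: k pearls become "light" candidates, k become "heavy"
    # candidates, and every pearl left off the balance is real.
    tilted = (0, k, k, r + n - 2 * k)
    return (even, tilted, tilted)

outcomes = weigh_unknowns((12, 0, 0, 0), 4)
print(outcomes)                              # ((4, 0, 0, 8), (0, 4, 4, 4), (0, 4, 4, 4))
print([nb_of_answers(p) for p in outcomes])  # [8, 8, 8]
```

<p>Weighing 4 against 4 splits the 24 possible answers into three groups of 8, and the two tilted outcomes collapse into the same tuple, which is exactly the kind of symmetry the brute-force representation could not see.</p>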
<p><em>(I erased the previous mention of an implementation which was not as performant as the one below, and a lot more difficult to understand.)</em></p>
<p>The other great benefit is that it turns the population into a mathematical object which is much easier to grasp.</p>
<p>Let’s call <code class="language-plaintext highlighter-rouge">c(n,l,h,r)</code> the function that gives us the minimum number of weighings required by the best possible strategy.
We will also call p the function of our initial problem, that is:
<img src="http://latex.codecogs.com/gif.latex?p: n \rightarrow c(n,0,0,0)" title="p: n \rightarrow c(n,0,0,0)" />.</p>
<p>With some good coffee, paper and pen, it is actually easy to prove that the pearl problem for n=3<sup>k</sup> can be solved in k+1 weighings.</p>
<p>In terms of our function <code class="language-plaintext highlighter-rouge">p</code>, we have
<img src="http://latex.codecogs.com/gif.latex?\forall k\geq1,~ p(3^k) \leq k+1" title="\forall k\geq1,~ p(3^k) \leq k+1" /></p>
<p>That’s actually a very strong result because, as we have shown before, m weighings can distinguish at most 3<sup>m</sup> outcomes while n pearls allow 2n possible answers, so we also know that</p>
<p><img src="http://latex.codecogs.com/gif.latex?\forall k\geq1,~ 3^{p(3^k)} \geq 2 \times 3^k" title="\forall k\geq1,~ 3^{p(3^k)} \geq 2 \times 3^k" /></p>
<p>Applying the log, we get very tight bounds for p(3<sup>k</sup>):
<img src="http://latex.codecogs.com/gif.latex?\forall k\geq1,~ k + \log_3(2) \leq p(3^k) \leq k+1" title="\forall k\geq1,~ k + \log_3(2) \leq p(3^k) \leq k+1" />.</p>
<p>This interval contains only one integer, so we have proven that:
<img src="http://latex.codecogs.com/gif.latex?\forall k\geq1,~ p(3^k) = k+1" title="\forall k\geq1,~ p(3^k) = k+1" /></p>
<p>It is also possible to show that the function p is non-decreasing, so that for all n,</p>
<p><img src="http://latex.codecogs.com/gif.latex?3^k \leq n\leq 3^{k+1} \Rightarrow k+1~ \leq p(n) \leq k+2" title="3^k \leq n\leq 3^{k+1} \Rightarrow k+1~ \leq p(n) \leq k+2" /></p>
<p>So for all n, we actually have only two possible values. Surely we can use that for optimization! Imagine that : as soon as we get a solution with a cost of the lower bound we can return, as soon as we show that a solution will not reach the lower bound we can return.</p>
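<p>The two bounds can be sketched in a few lines. The function names <code class="language-plaintext highlighter-rouge">lower_bound</code> and <code class="language-plaintext highlighter-rouge">candidates</code> are made up for this illustration; they use integer arithmetic only, to stay clear of floating point rounding in logarithms.</p>

```python
def lower_bound(n):
    """Information argument: m weighings have at most 3**m outcomes,
    and n pearls allow 2*n answers, so we need 3**m >= 2*n."""
    m = 0
    while 3 ** m < 2 * n:
        m += 1
    return m

def candidates(n):
    """For 3**k <= n < 3**(k+1), the answer is either k+1 or k+2."""
    k = 0
    while 3 ** (k + 1) <= n:
        k += 1
    return (k + 1, k + 2)

for n in (3, 9, 12, 13):
    print(n, lower_bound(n), candidates(n))
# 3 2 (2, 3)
# 9 3 (3, 4)
# 12 3 (3, 4)
# 13 3 (3, 4)
```

<p>Note that the information argument alone is not always tight: for 13 pearls it only says "at least 3", while the assertions at the end of the implementation below expect 4. This is why the search still has to test the lower candidate before falling back to the higher one.</p>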
<p>Here goes the implementation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">floor</span><span class="p">,</span> <span class="n">log</span>
<span class="c1"># Let's call that the "smart" algorithm
</span>
<span class="s">"""In this implementation we represent our knowledge
on the pearls as a quadruplet (n,l,h,r) where
* n is the number of pearls for which we don't
know anything.
* l is the number of pearls for which we know
that if they are fake they must be lighter
than the real pearls.
* h is the number of pearls for which we know
that if they are fake, they must be heavier
than the real pearls.
* r is the number of pearls for which we know they are real.
"""</span>
<span class="k">def</span> <span class="nf">diff</span><span class="p">(</span><span class="n">pop_a</span><span class="p">,</span> <span class="n">pop_b</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">a</span><span class="o">-</span><span class="n">b</span> <span class="k">for</span> <span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">pop_a</span><span class="p">,</span><span class="n">pop_b</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="o">*</span><span class="n">pops</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">tuple</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">els</span><span class="p">)</span> <span class="k">for</span> <span class="n">els</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">pops</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">minus</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">tuple</span><span class="p">(</span><span class="o">-</span><span class="n">el</span> <span class="k">for</span> <span class="n">el</span> <span class="ow">in</span> <span class="n">pop</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">heavier</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
    <span class="p">(</span><span class="n">anything</span><span class="p">,</span> <span class="n">light</span><span class="p">,</span> <span class="n">heavy</span><span class="p">,</span> <span class="n">real</span><span class="p">)</span> <span class="o">=</span> <span class="n">pop</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">heavy</span> <span class="o">+</span> <span class="n">anything</span><span class="p">,</span> <span class="n">real</span> <span class="o">+</span> <span class="n">light</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">lighter</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
    <span class="p">(</span><span class="n">anything</span><span class="p">,</span> <span class="n">light</span><span class="p">,</span> <span class="n">heavy</span><span class="p">,</span> <span class="n">real</span><span class="p">)</span> <span class="o">=</span> <span class="n">pop</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">light</span> <span class="o">+</span> <span class="n">anything</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">real</span> <span class="o">+</span> <span class="n">heavy</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">even</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pop</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">population_after_measures</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">measure</span><span class="p">):</span>
    <span class="s">"""Given a population and a measure, returns
    the three resulting populations, depending on
    the outcome of the measure.
    """</span>
    <span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span> <span class="o">=</span> <span class="n">measure</span>
    <span class="n">pop_no_weighted</span> <span class="o">=</span> <span class="n">add</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">minus</span><span class="p">(</span><span class="n">left</span><span class="p">),</span> <span class="n">minus</span><span class="p">(</span><span class="n">right</span><span class="p">))</span>
    <span class="c1"># if the balance says the two plates are even
</span>    <span class="k">yield</span> <span class="n">add</span><span class="p">(</span><span class="n">pop_no_weighted</span><span class="p">,</span> <span class="n">even</span><span class="p">(</span><span class="n">left</span><span class="p">),</span> <span class="n">even</span><span class="p">(</span><span class="n">right</span><span class="p">))</span>
    <span class="n">O</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pop_no_weighted</span><span class="p">)</span>
    <span class="c1"># if the balance says the left plate is lighter
</span>    <span class="k">yield</span> <span class="n">add</span><span class="p">(</span> <span class="n">lighter</span><span class="p">(</span><span class="n">left</span><span class="p">),</span> <span class="n">heavier</span><span class="p">(</span><span class="n">right</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="n">O</span><span class="p">)</span> <span class="p">)</span>
    <span class="c1"># if the balance says the left plate is heavier
</span>    <span class="k">yield</span> <span class="n">add</span><span class="p">(</span> <span class="n">heavier</span><span class="p">(</span><span class="n">left</span><span class="p">),</span> <span class="n">lighter</span><span class="p">(</span><span class="n">right</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="n">O</span><span class="p">)</span> <span class="p">)</span>

<span class="k">def</span> <span class="nf">nb_of_answers</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
    <span class="s">""" Returns the number of possible
    answers given a population.
    """</span>
    <span class="k">return</span> <span class="n">pop</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">2</span> <span class="o">+</span> <span class="n">pop</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">pop</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">fill_plate</span><span class="p">(</span><span class="n">pop</span><span class="p">,</span> <span class="n">plate_size</span><span class="p">):</span>
    <span class="s">""" Yields all the possible sub populations
    of plate_size pearls within pop.
    """</span>
    <span class="n">head</span><span class="p">,</span><span class="n">tail</span> <span class="o">=</span> <span class="n">pop</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">pop</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">pop</span><span class="p">)</span><span class="o">==</span><span class="mi">1</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">head</span> <span class="o">>=</span> <span class="n">plate_size</span><span class="p">:</span>
            <span class="k">yield</span> <span class="p">(</span><span class="n">plate_size</span><span class="p">,)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">plate_size</span><span class="p">,</span><span class="n">pop</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">fill_remaining</span> <span class="ow">in</span> <span class="n">fill_plate</span><span class="p">(</span><span class="n">tail</span><span class="p">,</span> <span class="n">plate_size</span><span class="o">-</span><span class="n">i</span><span class="p">):</span>
                <span class="k">yield</span> <span class="p">(</span><span class="n">i</span><span class="p">,)</span> <span class="o">+</span> <span class="n">fill_remaining</span>

<span class="k">def</span> <span class="nf">measures</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
    <span class="s">""" Returns all possible normalized
    measures for a given population.
    A measure is described as a couple (left, right)
    where left is the population to put in the left
    plate and right is the population to put in the
    right plate.
    Since (left,right) is equivalent to
    (right, left), we only yield measures for
    which left >= right
    """</span>
    <span class="n">N</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pop</span><span class="p">)</span> <span class="c1"># the number of pearls
</span>    <span class="c1"># integer division, so that range() gets an int on Python 3 as well
</span>    <span class="n">possible_plate_sizes</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span><span class="o">//</span><span class="mi">2</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">plate_size</span> <span class="ow">in</span> <span class="n">possible_plate_sizes</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">left</span> <span class="ow">in</span> <span class="n">fill_plate</span><span class="p">(</span><span class="n">pop</span><span class="p">,</span> <span class="n">plate_size</span><span class="p">):</span>
            <span class="n">remaining</span> <span class="o">=</span> <span class="n">diff</span><span class="p">(</span><span class="n">pop</span><span class="p">,</span> <span class="n">left</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">right</span> <span class="ow">in</span> <span class="n">fill_plate</span><span class="p">(</span><span class="n">remaining</span><span class="p">,</span> <span class="n">plate_size</span><span class="p">):</span>
                <span class="k">if</span> <span class="n">left</span> <span class="o">>=</span> <span class="n">right</span><span class="p">:</span>
                    <span class="k">yield</span> <span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">solved</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">pop</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">==</span><span class="mi">0</span> <span class="ow">and</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pop</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">3</span><span class="p">])</span> <span class="o"><=</span> <span class="mi">1</span>

<span class="k">def</span> <span class="nf">solve</span><span class="p">(</span><span class="n">pop</span><span class="p">,</span> <span class="n">m</span><span class="p">):</span>
    <span class="s">"""Returns True if the pearl problem
    of the population pop can be solved
    in at most m measures.
    To do so, apart from the special cases
    we test all possible measures.
    If every outcome of one measure can be
    solved in m-1 measures, we return True.
    If one outcome of a measure cannot,
    we test the next measure.
    """</span>
    <span class="k">if</span> <span class="n">m</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">False</span>
    <span class="k">if</span> <span class="n">solved</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">if</span> <span class="mi">3</span><span class="o">**</span><span class="n">m</span> <span class="o"><</span> <span class="n">nb_of_answers</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
        <span class="c1"># we will never be able to
</span>        <span class="c1"># reach the limit of m
</span>        <span class="c1"># because of the information
</span>        <span class="c1"># argument
</span>        <span class="k">return</span> <span class="bp">False</span>
    <span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="n">l</span><span class="p">,</span><span class="n">h</span><span class="p">,</span><span class="n">r</span><span class="p">)</span> <span class="o">=</span> <span class="n">pop</span>
    <span class="k">if</span> <span class="n">l</span><span class="o"><</span><span class="n">h</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">solve</span><span class="p">((</span><span class="n">n</span><span class="p">,</span><span class="n">h</span><span class="p">,</span><span class="n">l</span><span class="p">,</span><span class="n">r</span><span class="p">),</span> <span class="n">m</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">measure</span> <span class="ow">in</span> <span class="n">measures</span><span class="p">(</span><span class="n">pop</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">pops_branch</span> <span class="ow">in</span> <span class="n">population_after_measures</span><span class="p">(</span><span class="n">pop</span><span class="p">,</span> <span class="n">measure</span><span class="p">):</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">solve</span><span class="p">(</span><span class="n">pops_branch</span><span class="p">,</span> <span class="n">m</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
                <span class="k">break</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># for/else: no outcome failed, this measure works
</span>            <span class="k">return</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">pearl_smart</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">n</span><span class="o"><</span><span class="mi">3</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="n">pop</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="c1"># our solution is either m
</span>    <span class="c1"># or m+1
</span>    <span class="n">m</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">floor</span><span class="p">(</span><span class="n">log</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="mi">3</span><span class="p">)))</span><span class="o">+</span><span class="mi">1</span>
    <span class="k">if</span> <span class="n">solve</span><span class="p">(</span><span class="n">pop</span><span class="p">,</span> <span class="n">m</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">m</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">m</span><span class="o">+</span><span class="mi">1</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">assert</span> <span class="n">pearl_smart</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
    <span class="k">assert</span> <span class="n">pearl_smart</span><span class="p">(</span><span class="mi">12</span><span class="p">)</span> <span class="o">==</span> <span class="mi">3</span>
    <span class="k">assert</span> <span class="n">pearl_smart</span><span class="p">(</span><span class="mi">13</span><span class="p">)</span> <span class="o">==</span> <span class="mi">4</span></code></pre></figure>
<h2 id="results-because-everyone-loves-a-colorful-graph">Results, because everyone loves a colorful graph.</h2>
<p><img src="https://docs.google.com/spreadsheet/oimg?key=0As3ux_ykgGX1dG1USEgwcGdrSlZFR2VVMkw5RnYxcXc&oid=3&zx=bsk5yv5pjw75" /></p>
<p>Here is the running time of the algorithm for the different implementations.</p>
<p>As you can see, the “smart” implementation is doing way better than the other implementations.</p>
<p>Interestingly, the computation time of the smart implementation is not an increasing function of the number of pearls. If we plot it for a bigger number of pearls:</p>
<p><img src="https://docs.google.com/spreadsheet/oimg?key=0As3ux_ykgGX1dG1USEgwcGdrSlZFR2VVMkw5RnYxcXc&oid=4&zx=5e6pre1ocudh" /></p>
Of being lazy2013-02-02T00:00:00+00:00https://fulmicoton.com/posts/lazy<h2 id="whats-lazy-evaluation-about-">What’s lazy evaluation about ?</h2>
<p>Some functional programming languages (like Haskell)
offer a feature called lazy evaluation by default.
It consists of deferring the evaluation of functions
to the moment their results are actually used.</p>
<p>Everything works as if, instead of a result, your
function calls returned the recipe to compute the actual result. In Python, it is actually pretty straightforward to hack <code class="language-plaintext highlighter-rouge">__getattr__</code> and <code class="language-plaintext highlighter-rouge">__setattr__</code> to implement a hackish lazy evaluation as a decorator.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">LazyObject</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>

    <span class="n">__slots__</span> <span class="o">=</span> <span class="p">[</span> <span class="s">"_recipe"</span><span class="p">,</span> <span class="s">"_result"</span><span class="p">,</span> <span class="s">"_evaluated"</span> <span class="p">]</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">recipe</span><span class="p">):</span>
        <span class="c1"># bypass our own __setattr__, which would trigger the evaluation
</span>        <span class="nb">object</span><span class="p">.</span><span class="n">__setattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s">"_recipe"</span><span class="p">,</span> <span class="n">recipe</span><span class="p">)</span>
        <span class="nb">object</span><span class="p">.</span><span class="n">__setattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s">"_result"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="nb">object</span><span class="p">.</span><span class="n">__setattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s">"_evaluated"</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">_eval</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_evaluated</span><span class="p">:</span>
            <span class="nb">object</span><span class="p">.</span><span class="n">__setattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s">"_result"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">_recipe</span><span class="p">())</span>
            <span class="nb">object</span><span class="p">.</span><span class="n">__setattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s">"_evaluated"</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_result</span>

    <span class="k">def</span> <span class="nf">__getattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">getattr</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_eval</span><span class="p">(),</span> <span class="n">name</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__setattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">setattr</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_eval</span><span class="p">(),</span> <span class="n">name</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_eval</span><span class="p">().</span><span class="n">__getitem__</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_eval</span><span class="p">())</span>

    <span class="k">def</span> <span class="nf">__add__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="o">*</span><span class="n">args</span><span class="p">,</span><span class="o">**</span><span class="n">kargs</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_eval</span><span class="p">().</span><span class="n">__add__</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span><span class="o">**</span><span class="n">kargs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,):</span>
        <span class="k">return</span> <span class="nb">repr</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_eval</span><span class="p">())</span>

    <span class="c1"># ... __mul__, __iter__ and so on ...
</span>
<span class="c1"># the lazy evaluation decorator !
</span><span class="k">def</span> <span class="nf">lazy</span><span class="p">(</span><span class="n">f</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">aux</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">):</span>
        <span class="k">def</span> <span class="nf">recipe</span><span class="p">():</span>
            <span class="k">return</span> <span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span><span class="o">**</span><span class="n">kargs</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">LazyObject</span><span class="p">(</span><span class="n">recipe</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">aux</span></code></pre></figure>
<p>Let’s now check that the evaluation is done
at the last moment with a couple of print statements.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">@</span><span class="n">lazy</span>
<span class="k">def</span> <span class="nf">returns_two</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"evaluation for good"</span><span class="p">)</span>
    <span class="k">return</span> <span class="mi">2</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">returns_two</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="s">"lazy evaluation"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span></code></pre></figure>
<p>This should result in the following output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lazy evaluation
evaluation for good
3 # 2+1
</code></pre></div></div>
<p>As expected, the call to <code class="language-plaintext highlighter-rouge">returns_two</code> does not
actually run our implementation of <code class="language-plaintext highlighter-rouge">returns_two</code>, but instead creates a <code class="language-plaintext highlighter-rouge">LazyObject</code> embedding this “recipe”.</p>
<p>Eventually, when we try to add 1 to it,
python will call our object’s own implementation of
<code class="language-plaintext highlighter-rouge">__add__</code>, which triggers the call to <code class="language-plaintext highlighter-rouge">returns_two</code>.
The result is cached, and later uses of the object will
not require further calls to <code class="language-plaintext highlighter-rouge">returns_two</code>.</p>
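<p>To make the caching behaviour concrete, here is a minimal, self-contained sketch. The class name <code class="language-plaintext highlighter-rouge">CachedLazy</code> and its attributes are made up for this illustration and only approximate the <code class="language-plaintext highlighter-rouge">LazyObject</code> discussed above: the recipe runs on first use, and the stored value is reused afterwards.</p>

```python
class CachedLazy(object):
    """Hypothetical stand-in for LazyObject: evaluates a recipe at most once."""

    def __init__(self, recipe):
        self._recipe = recipe
        self._evaluated = False
        self._value = None

    def _eval(self):
        # Run the recipe on first use only, then serve the cached value.
        if not self._evaluated:
            self._value = self._recipe()
            self._evaluated = True
            self._recipe = None  # the recipe is no longer needed
        return self._value

    def __add__(self, other):
        return self._eval() + other


calls = []  # records every real evaluation


def returns_two():
    calls.append("evaluated")
    return 2


lazy_two = CachedLazy(returns_two)
nothing_yet = list(calls)   # still empty: nothing has been computed
first = lazy_two + 1        # triggers the evaluation
second = lazy_two + 5       # reuses the cached value
```

<p>After both additions, <code class="language-plaintext highlighter-rouge">calls</code> holds a single entry, which is the whole point of the cache.</p>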
<h2 id="a-simple-example--the-k-smallest-elements">A simple example: the k smallest elements</h2>
<p>I can enjoy a free lunch like anyone else…
However I used to see this functionality as just another
of these functional programmers’ toys.</p>
<p>I could see how some applications could get a performance
boost from lazy evaluation “by default”, yet I had the
feeling that optimizing such applications was a no-brainer
anyway.</p>
<p>I was wrong on two points. First, explicitly deferring evaluation can have a bad impact on readability. The second point is
a little trickier, and that’s the whole point of my post. Lazy evaluation can have a non-trivial impact on
performance. It can actually change a program’s very complexity.</p>
<p>I was a little skeptical when I read that point in
some OCaml-related newsletter. The author took the example
of trying to pick the k smallest numbers
of a list of n elements.</p>
<h2 id="the-classical-answer-to-this-problem">The classical answer to this problem</h2>
<p>Let’s detail the textbook way to address this problem. The idea is to put the first k elements into a binary max-heap (that’s a complexity of <code class="language-plaintext highlighter-rouge">k log k</code> to build the heap), then go through the remaining elements, pushing each of them onto the heap and popping the greatest element each time so that the heap stays at size k (hence a complexity of <code class="language-plaintext highlighter-rouge">n log k</code>).
When the last element is reached, the heap contains the k lowest elements. We then need to pop the elements from the heap to get them in order (a cost of <code class="language-plaintext highlighter-rouge">k log k</code>). Since k is at most n, the overall complexity of the operation is therefore <code class="language-plaintext highlighter-rouge">n log k</code>. More importantly, the memory complexity is linear in k.</p>
<p>An implementation of this algorithm is available as <code class="language-plaintext highlighter-rouge">heapq.nsmallest</code>.</p>
<p>We also consider a simpler, yet presumably less performant, algorithm which consists of building a complete heap
and then popping the elements one after the other. The nature of heap sort
should help us defer part of the comparisons.</p>
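<p>For reference, the bounded-heap idea can be sketched in a few lines. This is not the code of <code class="language-plaintext highlighter-rouge">heapq.nsmallest</code>; the function name <code class="language-plaintext highlighter-rouge">k_smallest</code> is hypothetical, values are assumed to be numbers (the max-heap is simulated by negating them), and an element is only pushed when it beats the current greatest one, which is a small variation on the description above.</p>

```python
import heapq


def k_smallest(values, k):
    """Return the k smallest numbers of `values`, in ascending order."""
    if k <= 0:
        return []
    heap = []  # simulated max-heap: stores negated values, at most k of them
    for x in values:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:
            # x is smaller than the greatest element kept so far:
            # drop that element and keep x instead.
            heapq.heapreplace(heap, -x)
    # Un-negate and order the k survivors.
    return sorted(-y for y in heap)


sample = [5, 1, 4, 2, 3]
two_smallest = k_smallest(sample, 2)
```

<p>The heap never holds more than k items, which is where the memory bound linear in k comes from.</p>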
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">heapq</span> <span class="kn">import</span> <span class="n">heappush</span><span class="p">,</span> <span class="n">heappop</span>

<span class="k">def</span> <span class="nf">heapsort</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="c1"># Build the complete heap first, then yield its elements in ascending order.
</span>    <span class="n">heap</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">l</span><span class="p">:</span>
        <span class="n">heappush</span><span class="p">(</span><span class="n">heap</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
    <span class="k">while</span> <span class="n">heap</span><span class="p">:</span>
        <span class="k">yield</span> <span class="n">heappop</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span></code></pre></figure>
<h2 id="lazy-implementation">Lazy Implementation</h2>
<p>Now let’s assume you skipped algorithm class, and can only
remember the good old merge sort. Let’s take a look at a simple implementation.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">zip_merge</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">left</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">right</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">left</span> <span class="o">+</span> <span class="n">right</span>
    <span class="k">elif</span> <span class="n">left</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o"><=</span> <span class="n">right</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span>
        <span class="k">return</span> <span class="p">[</span><span class="n">left</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+</span> <span class="n">zip_merge</span><span class="p">(</span><span class="n">left</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">right</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">zip_merge</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">left</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="c1"># Assuming l is a list, returns
</span>    <span class="c1"># a sorted version of l.
</span>    <span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">L</span> <span class="o"><=</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">l</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">L</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="n">left</span> <span class="o">=</span> <span class="n">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">m</span><span class="p">])</span>
        <span class="n">right</span> <span class="o">=</span> <span class="n">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="n">m</span><span class="p">:])</span>
        <span class="k">return</span> <span class="n">zip_merge</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span></code></pre></figure>
<p>Ok, so let’s take a look at the code.
Merge sort adopts a “divide and conquer” strategy.
We split the list in two parts, recursively sort both of them
and merge them back. Merging two sorted lists can
be done in linear time by peeking at the head of
both lists and picking the lowest of the two values.
This algorithm is sometimes called <code class="language-plaintext highlighter-rouge">zip_merge</code>.</p>
<p>In order to take advantage of lazy evaluation,
we cannot candidly apply our <code class="language-plaintext highlighter-rouge">lazy</code> decorator,
because the concatenation in the <code class="language-plaintext highlighter-rouge">zip_merge</code> algorithm
requires the complete evaluation of both lists. To get the drop
in complexity I promised, we need to get close to ML’s
definition of a list.</p>
<p>In ML, a list is defined recursively as follows. A list can be either:</p>
<ul>
<li>the empty list <code class="language-plaintext highlighter-rouge">nil</code>, in python this will translate as the empty tuple ()</li>
<li><code class="language-plaintext highlighter-rouge">head::tail</code> where <code class="language-plaintext highlighter-rouge">head</code> is the first element of the list, and <code class="language-plaintext highlighter-rouge">tail</code> is the list of the other elements. In python this will translate as the tuple (head, &lt;rest-of-the-list&gt;).</li>
</ul>
<p>For instance, if our algorithm is to return <code class="language-plaintext highlighter-rouge">range(5)</code> as
its output, it will actually return a lazy version of
<code class="language-plaintext highlighter-rouge">(0, (1, (2, (3, (4, ())))))</code></p>
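<p>To get a feel for this nested-tuple representation, here is a small sketch of two conversion helpers. Both <code class="language-plaintext highlighter-rouge">to_cons</code> and <code class="language-plaintext highlighter-rouge">from_cons</code> are hypothetical names introduced for this illustration; they work on plain (non-lazy) tuples only.</p>

```python
def to_cons(items):
    """Turn an iterable into nested (head, tail) tuples ending with ()."""
    cell = ()
    # Build from the last element backwards so that the first ends up outermost.
    for item in reversed(list(items)):
        cell = (item, cell)
    return cell


def from_cons(cell):
    """Turn nested (head, tail) tuples back into a regular python list."""
    out = []
    while cell != ():
        head, cell = cell
        out.append(head)
    return out


nested = to_cons(range(5))
flat = from_cons(nested)
```

<p>Here <code class="language-plaintext highlighter-rouge">nested</code> is the five-level tuple whose innermost tail is the empty tuple, and <code class="language-plaintext highlighter-rouge">flat</code> is the original list of five integers.</p>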
<p>Now let’s adapt our algorithm to this new representation!</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">@</span><span class="n">lazy</span>
<span class="k">def</span> <span class="nf">zip_merge</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">left</span> <span class="o">==</span> <span class="p">():</span>
        <span class="k">return</span> <span class="n">right</span>  <span class="c1"># right is never empty.
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="p">(</span><span class="n">left_head</span><span class="p">,</span> <span class="n">left_tail</span><span class="p">)</span> <span class="o">=</span> <span class="n">left</span>
        <span class="p">(</span><span class="n">right_head</span><span class="p">,</span> <span class="n">right_tail</span><span class="p">)</span> <span class="o">=</span> <span class="n">right</span>
        <span class="k">if</span> <span class="n">left_head</span> <span class="o"><=</span> <span class="n">right_head</span><span class="p">:</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">left_head</span><span class="p">,</span> <span class="n">zip_merge</span><span class="p">(</span><span class="n">left_tail</span><span class="p">,</span> <span class="n">right</span><span class="p">))</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">zip_merge</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">left</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="c1"># Assuming l is a list, returns a sorted
</span>    <span class="c1"># version of l in the format (head, tail)
</span>    <span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">L</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">()</span>
    <span class="k">elif</span> <span class="n">L</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">())</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">L</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="n">left</span> <span class="o">=</span> <span class="n">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">m</span><span class="p">])</span>
        <span class="n">right</span> <span class="o">=</span> <span class="n">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="n">m</span><span class="p">:])</span>
        <span class="k">return</span> <span class="n">zip_merge</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span></code></pre></figure>
<p>Note that we added our decorator to the zip_merge function.</p>
<h2 id="now-with-generators">Now with generators</h2>
<p>Now let’s go back to our first implementation of merge sort, and
let’s try to implement laziness in a pythonic way this time.</p>
<p>Impermeable to irony, we rely on heapq.merge which
basically will do the same job as <code class="language-plaintext highlighter-rouge">zip_merge</code>,
but with iterators.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">heapq</span> <span class="kn">import</span> <span class="n">merge</span> <span class="k">as</span> <span class="n">zip_merge</span>

<span class="k">def</span> <span class="nf">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
    <span class="c1"># Assuming l is a list, returns an
</span>    <span class="c1"># iterator on a sorted version of
</span>    <span class="c1"># the list.
</span>    <span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">L</span> <span class="o"><=</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">iter</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">L</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="n">left</span> <span class="o">=</span> <span class="n">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">m</span><span class="p">])</span>
        <span class="n">right</span> <span class="o">=</span> <span class="n">merge_sort</span><span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="n">m</span><span class="p">:])</span>
        <span class="k">return</span> <span class="n">zip_merge</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">)</span></code></pre></figure>
<p>Not much has changed, right? It’s not even longer.
But here comes the awesomeness.</p>
<p>The generator returned here acts pretty much
like lazy evaluation. As long as we don’t consume the elements of the sorted generator, the computation is not done.
Moreover, the elements get sorted as we consume them.</p>
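<p>As a usage sketch, the standard <code class="language-plaintext highlighter-rouge">itertools.islice</code> is one way to consume only the first few elements of such a generator. The snippet repeats a compact version of the generator-based <code class="language-plaintext highlighter-rouge">merge_sort</code> so that it runs on its own; the sample data is arbitrary.</p>

```python
from heapq import merge
from itertools import islice


def merge_sort(l):
    # Compact variant of the generator-based merge sort: returns an iterator.
    if len(l) <= 1:
        return iter(l)
    m = len(l) // 2
    return merge(merge_sort(l[:m]), merge_sort(l[m:]))


data = [7, 3, 9, 1, 8, 2]
# Only the work needed to produce three elements is performed here.
first_three = list(islice(merge_sort(data), 3))
```

<p>The rest of the list is never fully sorted unless somebody keeps iterating.</p>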
<h2 id="number-of-comparisons-required">Number of comparisons required</h2>
<p>Here we only focus on the number of comparisons involved.
For various reasons, studying runtime would probably tell a whole different story. For the sake of readability, the implementation above also performs useless copies of list slices. A runtime-focused algorithm would take intervals as arguments.</p>
<p>The graph below shows the number of comparisons required to peek at the k lowest elements of a randomly shuffled list of 100 elements
for the following 4 algorithms.</p>
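<p>One possible way to count comparisons, sketched below, is to wrap each value in an object whose <code class="language-plaintext highlighter-rouge">__lt__</code> increments a counter. The names <code class="language-plaintext highlighter-rouge">Counted</code> and <code class="language-plaintext highlighter-rouge">comparisons_for</code> are hypothetical, and this is not necessarily how the figures in this post were obtained: the exact counts depend on the python version and on the internals of <code class="language-plaintext highlighter-rouge">heapq.merge</code>.</p>

```python
import random
from heapq import merge
from itertools import islice


class Counted(object):
    """Wraps a value and counts every '<' comparison performed on it."""
    count = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.count += 1
        return self.value < other.value


def merge_sort(l):
    # Generator-based merge sort, as in the previous section.
    if len(l) <= 1:
        return iter(l)
    m = len(l) // 2
    return merge(merge_sort(l[:m]), merge_sort(l[m:]))


def comparisons_for(values, k):
    """Number of '<' calls made while reading the k smallest values."""
    Counted.count = 0
    wrapped = [Counted(v) for v in values]
    list(islice(merge_sort(wrapped), k))
    return Counted.count


random.seed(0)  # fixed seed so that runs are repeatable
values = random.sample(range(1000), 100)
partial = comparisons_for(values, 5)
full = comparisons_for(values, 100)
```

<p>Since reading 5 elements is a prefix of reading all 100, <code class="language-plaintext highlighter-rouge">partial</code> can never exceed <code class="language-plaintext highlighter-rouge">full</code>.</p>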
<ul>
<li>complete merge sort</li>
<li>lazy merge sort</li>
<li>complete heap sort</li>
<li>heapq.nsmallest</li>
</ul>
<p>The x-axis is the number of elements we requested, and the y-axis
is the cumulative number of comparisons required.</p>
<p>Notice that the number of comparisons required to perform a full sort
is <code class="language-plaintext highlighter-rouge">535</code>, which is very close to the theoretical minimum of <code class="language-plaintext highlighter-rouge">525</code>.</p>
<p>The real surprise here is how badly heapq.nsmallest performs. It starts on par with the lazy algorithm and then gets even more expensive than a complete heap sort. At this point, I don’t exactly understand why it does so. Keep in mind however that another benefit of this algorithm is that its memory usage is linear in k, which is not true of the other algorithms.</p>
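<p>The theoretical minimum quoted here is the information-theoretic bound: a comparison sort has to distinguish between the n! possible orderings, so it needs at least ceil(log2(n!)) comparisons in the worst case. A short sketch to check the figure, using exact integer arithmetic (the function name <code class="language-plaintext highlighter-rouge">min_comparisons</code> is made up for the occasion):</p>

```python
import math


def min_comparisons(n):
    """Lower bound ceil(log2(n!)) on worst-case comparisons to sort n distinct items."""
    # For any integer x >= 1, ceil(log2(x)) equals (x - 1).bit_length().
    return (math.factorial(n) - 1).bit_length()


bound_for_100 = min_comparisons(100)
```

<p>For a list of 100 elements this gives 525.</p>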
<p><img src="https://docs.google.com/spreadsheet/oimg?key=0As3ux_ykgGX1dEk4Sk01ak41UHJOVXJ2SGN6XzdrZnc&oid=5&zx=3i4d8q37pnig" alt="Number of comparisons required to fetch the k first elements of a list of 100 elements." /></p>
<hr />
<p><em>Thanks to NewCarSmell from reddit for pointing out a flaw in my first implementation of LazyObject.</em></p>