Implementing search as a jekyll plugin

Feb 28, 2014

Search for most dynamically-generated websites is usually implemented as a server-side feature. The server generates and maintains an index of all unique words and their occurrences on the site. When a client runs a search query, the query is sent to the server which then looks up the query in its index and returns a list of results, which are then rendered on the client.

An approach like this is clearly infeasible on static websites such as those generated using a static site generator like Jekyll, so one has to come up with alternative solutions. One possible alternative is to send the content of the entire site to the client, and to then search through it clientside. This approach is infeasible for two reasons:

  1. Downloading the entire site will take a while, even though you would only be sending the text and none of the images/javascript files/stylesheets.
  2. Searching through the entire site will also be pretty slow. Building an index would speed this up, but would be overkill for a client who is probably going to search your site a couple of times per session.

So, we need some sort of optimization during the site’s generation process that will optimize on both the network usage (the content that the client has to download) and execution time (how long it takes the client to search through the downloaded content). With respect to the second requirement, a hash map would be the fastest possible data structure for the client to search through, but it would be ever worse than downloading the entire site in terms of bandwidth usage. Enter bloom filters.

(I won’t go into details about how bloom filters work here, for that have a look at my previous post)

So, for the purposes of implementing search for a statically-generated site (namely this one), here is the approach that I followed:

  1. Statically generate a bloom filter per post/page in your site. The filter will contain the bitmap itself, along wit h the probability of false positives, the size of your dataset (number of unique words inserted into the filter) and the URL of the associated page.
  2. Statically generate a JSON file that has an array of all the generated bloom filters.
  3. On the client, download the generated JSON file and recreate the bloom filters. As long as you use the same hashing functions that you used for generating the filters, the data mentioned in the first point is everything that’s required for searching for membership of elements in each filter
  4. Whenever a search query is received, iterate over all the filters and return those that contain the word(s) in the query
  5. Fill a div with the results

Below I’ll describe my implementation of each of the above steps in some detail.

Generate a bloom filter for each page

Once I had my implementation of the bloom filter class ready, creating one for each page was relatively trivial. Jekyll plugins have a Generator category that lets you generate new pages besides the templates that you already have. If you implement a def generator, that method gets invoked with the entire site’s contents just prior to the actual generation. site.posts contains a list of all posts and site.pages contains a list of all pages. All I had to do was iterate over each page and

  1. Convert the content to lowercase, e.g. so that python and Python will not be treated as unique elements in the set
  2. Remove everything between {% raw %} {% and %} {% endraw %} and between {% raw %}{{ and }}{% endraw %}. This strips out liquid template variables and code segments
  3. Replace all non-alphabetic characters with a space
  4. Split on space and remove all duplicates

The following two lines accomplished this:

content = post.content.downcase.gsub(/{% raw %}{%.*?%}{% endraw %}/, " ").gsub(/{{.*?}}/, " ").gsub(/[^a-zA-Z]/, " ")
words = content.split.uniq

After that, I iterate over each word and call filter.insert where filter is the bloom filter created for that page. Once I’m done with the page, the final filters is inserted into an array.

Generate a JSON file from all bloom filters

This was slightly trickier. I need a template to fill up with the generated JSON, so I obviously need a file somewhere in my site that will be filled up. I opted for a file named search.json in the root of my site. I started off filling it manually by iterating over the generated filters via liquid tags and then slotting member variables into different fields, but then I thought of an easier approach: have the template be empty, and fill it entirely from ruby!

In my def generate method, once I’m done creating the array of bloom filters, I convert the array to json and insert it directly into the template:

json_page = site.pages.detect {|page| page.name == "search.json"}
json_page.data['filters'] = filters.to_json

Of course this needed a corresponding to_json for each of my filter objects, which was also trivial:

def to_json(options = {})
  {
   'u' => @url,
   'l' => @length,
   'h' => @hash_count,
   'f' => @filter
  }.to_json
end

Now, running jekyll build generates _site/search.json which contains the bloom filters for all posts in my site.

Fetch and recreate the filters on the client

Since I want my page loads to appear faster than they actually are, I opted for fetching the search JSON asynchronously after pageload was done. This was easily done with a simple jquery $.get call that fetches the json, iterates over each item in the JSON (which will be defined by the to_json method above) and creates a bloom filter out of it.

var initialize = function() {
    var index;

    $.get("/search.json", function(data) {
    for ( index = 0; index < data.length; index += 1 ) {
        filters.push( BloomFilter.create(data[index].u,
                         data[index].l,
                         data[index].h,
                         data[index].f) );
    }
    })
    .fail(function(jqXHR, textStatus, errorThrown) {
        console.log("textStatus = " + textStatus + ", errorThrown = " + errorThrown);
    });
};

The BloomFilter.create method uses the exact same hashing functions as were used to create the filter in the Jekyll plugin. I ended up using Murmur hashes.

Look for matches in a search query

The search function has to do the same sort of setup that was required for inserting words into the filters. It converts the query to lowercase and replaces all non-alphabetic characters to spaces. It then iterates over bloom filter (and hence over each post) and returns a list of words that were found in that bloom filter. The return-type of this function is a list of JSON objects that have two fields - the URL of the page and a list of matches for that page.

var search = function(query) {
    var matches = [];
    var i;

    query = query.toLowerCase().replace(/[^a-zA-Z]/, " ");
    var words = query.split(/\s+/);

    for ( i = 0; i < filters.length; i += 1 ) {
    var results = words.filter( function(word) {
        return filters[i].has(word);
    } );

    matches.push({
        url: filters[i].url(),
        words: results
    });
    }

    return matches.filter(removeMisses).sort(rsortMatches);
};

The removeMisses function on the last line of the function removes any object whose list of matched words is empty. That’s required because the var results = words.filter will be an empty array if there was no match for any of the words in words, and that’s therefore a page that is of no interest to us. The rsortMatches function sorts the matches in reverse order by:

  1. Number of matches. For a multi-word query, pages that had more matches are sorted to the top of the array
  2. Alphabetically for pages with equal number of matches

Display the results to the user

Once have a list of all the posts that matched the user’s query and also the words that matched for each page, I display the results to the user in a div. This div starts off hidden on each page and sits alongside the main #content div. When I have a list of search results, I hide the #content div, append the results to the #search-results div and unhide it. Clearing all text from the search input box hides the #search-results div and unhides the #content div. Pretty simple.

Bonus - Show snippets of text surround matches

While the approach outlined above is pretty solid in its own right, it has two shortcomings:

  1. The probability of false positives is ever-present. Even with the probability set to 0.01, I invariable ran into a couple of false matches during testing
  2. Relevance of results is usually hard for users to judge based solely on the link of a page. Think of how most of us usually search on Google. We type the query, and then we quickly scan through the snippets of text that are displayed below the link to the page that matched the query to determine if that page would help us

The solution to both of the above involves downloading a bit more content, but I believe it’s worth it. What I do is, I display the results of the user’s query as expected the moment I’ve checked through all the bloom filters and got the results. However, as soon as that is done, I iterate over all the matched pages and I fire off ajax requests to my webserve to fetch the post itself. In the callback for the request:

  1. I run an indexOf on the fetched content, and if that returns -1 then I know that the page was a false positive, so I remove it from the search results. This fixes the first drawback mentioned above.
  2. If the indexOf returned a valid index into the content, I run a regexp that extracts the entire sentence in which the match was found, up to a maximum of ten words before and ten words after the match. I match the regexp a maximum of two times over the post’s content, highlight the searched word in the result of the regexp, and then replace the simple matched: keyword1 keyword2 of the search result with a ... and in the context of blah and bloop, the awesome keyword1 was unmatched ... however, some people have recommended keyword2 as a highly viable alternative, and I am highly tempted to .... This (I hope) addresses the second drowback by making it easier for users to judge for themselves if the content on that page is what they are looking for.