Space Vatican

Ramblings of a curious coder

Fun With Elasticsearch's Children and Nested Documents

When you’re indexing data, the world is rarely as simple as each document existing in isolation. Sometimes, you’re better off denormalizing all data into the child documents. For example if you were modelling books, adding an author field to books could be a sensible choice (even if in the database that is your authoritative datasource the data is split into separate authors and books table). It’s simple and you can easily construct queries on both attributes of the book and the author’s name.

That’s not always practical - there might be too much data in the parent document to duplicate it in each child document. If you had your typical blog/comment app then you wouldn’t want to repeat the entire content of the blog post in each comment as this would vastly increase the amount of indexed data several. Yet without that you can’t easily write queries to find comments on posts matching certain criteria (other than by doing a 2 step of process of first finding matching posts and then fetching comments with a certain post_id, which is often unwieldy or slow (or both)).

Another option is to place child documents inside the parent document, for example you can have documents of the form

1
2
3
4
5
6
7
8
9
10
11
12
{
  "name": "A. N Author",
  "biography": "A leading wordsmith",

  "books": [
    {
      "title": "My first book",
      "genre": "scifi",
      "publisher": "penguin"
    }
  ]
}

One drawback here is that adding a child requires reindexing the entire document. Finding authors that have written books with a certain genre is easy but finding authors that have had a science fiction book published by penguin is harder.

If our index contains

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
curl -XPUT localhost:9200/authors/author/1 -d'{
  "name": "Multi G. Enre",
  "books": [
    {
      "name": "Guns and lasers",
      "genre": "scifi",
      "publisher": "orbit"
    },
    {
      "name": "Dead in the night",
      "genre": "thriller",
      "publisher": "penguin"
    }
  ]
}'
curl -XPUT localhost:9200/authors/author/2 -d'{
  "name": "Alastair Reynolds",
  "books": [
    {
      "name": "Revelation Space",
      "genre": "scifi",
      "publisher": "penguin"
    }
  ]
}'

then the most obvious query

1
2
3
4
5
6
7
8
9
10
11
12
13
curl -XGET localhost:9200/authors/author/_search -d '{
  "query": {
    "filtered": {
       "query": {"match_all": {}},
       "filter": {
          "and": [
            {"term": {"books.publisher": "penguin"}},
            {"term": {"books.genre": "scifi"}}
          ]
       }
    }
  }
}''

finds both authors - you can’t express that your conditions on published and genre need to match against the same book.

Nested Documents

ElasticSearch provides two things that help with this. The first is the concept of a nested document/query. This allows you to say that you are looking for authors where at least one book satisfies both of your criteria.

First you need to setup a mapping that says that the books field is going to be nested:

1
2
3
4
5
6
7
8
9
curl -XPOST localhost:9200/authors/nested_author/_mapping -d '{
  "nested_author":{
    "properties":{
      "books": {
        "type": "nested"
      }
    }
  }
}'

If we then insert the same data as before into this new index then this query

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
curl -XGET localhost:9200/authors/nested_author/_search -d '
{
  "query": {
    "filtered": {
      "query": {"match_all": {}},
      "filter": {
        "nested": {
          "path": "books",
          "query":{
            "filtered": {
              "query": { "match_all": {}},
              "filter": {
                "and": [
                  {"term": {"books.publisher": "penguin"}},
                  {"term": {"books.genre": "scifi"}}
                ]
              }
            }
          }
        }
      }
    }
  }
}'

Here the nested filter allows you to run a query against the nested documents (ie the books) and filter authors by those that have at least one nested document matching the query. The path option tells us which part of the authors document this query applies to and then the query option is a query to run against these nested documents. Unlike the previous query this requires that an individual book be found satisfying both requirements so only Alaistair Reynolds is returned

Parent & child

The other concept elasticsearch provides is that of a parent and child relationship between documents. The previous example can be reworked with authors as the parent documents and books as the child documents.

This time, index the authors separately from their books:

1
2
3
4
5
6
7
curl -XPUT localhost:9200/authors/bare_author/1 -d'{
  "name": "Multi G. Enre"
}'

curl -XPUT localhost:9200/authors/bare_author/2 -d'{
  "name": "Alastair Reynolds"
}'

Then configure the mapping for the book type and say that its parent type is bare_author. You need to do this before you create any books.

1
2
3
4
5
curl -XPOST localhost:9200/authors/book/_mapping -d '{
  "book":{
    "_parent": {"type": "bare_author"}
  }
}'

When we indexing books, you must then give the id of their parent (ie we supply the id of one of the previously created authors)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
curl -XPOST localhost:9200/authors/book/1?parent=2 -d '{
   "name": "Revelation Space",
   "genre": "scifi",
   "publisher": "penguin"
}'

curl -XPOST localhost:9200/authors/book/2?parent=1 -d '{
   "name": "Guns and lasers",
   "genre": "scifi",
   "publisher": "orbit"
}'

curl -XPOST localhost:9200/authors/book/3?parent=1 -d '{
   "name": "Dead in the night",
   "genre": "thriller",
   "publisher": "penguin"
}'

Elasticsearch provides a has_child filter that does pretty much what is says on the tin: it selects parent documents with a least one child satisfying a certain query. This query then finds only Alastair Reynolds:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
curl -XPOST localhost:9200/authors/bare_author/_search -d '{
  "query": {
    "has_child": {
      "type": "book",
      "query" : {
        "filtered": {
          "query": { "match_all": {}},
          "filter" : {
            "and": [
              {"term": {"publisher": "penguin"}},
              {"term": {"genre": "scifi"}}
            ]
          }
        }
      }
    }
  }
}'

Solr 4.0 will apparently have the ability to do joins, although as far as I can tell this comes with some restrictions, in particular no joins if you are operated in a distributed environment. By limiting itself to parent/child type relationships elasticsearch makes life easier for itself: a child is always indexed in the same shard as its parent, so has_child doesn’t have to do awkward cross shard operations.

Building lists

You can also use this to model user specific lists of shared global items - e.g. if you wanted items that a user had rated. In this case your child documents would represent the fact that a specific user had rated a specific post - they are nothing more than a user_id, post_id and a rating: a join table in relational database lingo.

Using a parent/child relationships and has_child allows you to easily find all the posts favourited by a user while enabling users to search through their favourites based on post content, date or any other of a post’s attributes or any of the child item properties. Adding an item to the rated items list is cheap - it requires only indexing a very small rating item.

With these documents

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
curl -XPOST localhost:9200/posts

curl -XPOST localhost:9200/posts/rating/_mapping -d '{
  "rating":{
    "_parent": {"type": "post"}
  }
}'

curl -XPUT localhost:9200/posts/post/1 -d '{ "title": "bolivia rated 4" }'
curl -XPUT localhost:9200/posts/post/2 -d '{ "title": "bolivia rated 2" }'
curl -XPUT localhost:9200/posts/post/3 -d '{ "title": "another country rated 4" }'
curl -XPUT localhost:9200/posts/rating/1?parent=1 -d '{ "user_id": 1234, "rating": 4}'
curl -XPUT localhost:9200/posts/rating/2?parent=2 -d '{ "user_id": 1234, "rating": 2}'
curl -XPUT localhost:9200/posts/rating/3?parent=2 -d '{ "user_id": 4234, "rating": 5}'
curl -XPUT localhost:9200/posts/rating/4?parent=3 -d '{ "user_id": 1234, "rating": 5}'

This query

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
curl -XPOST localhost:9200/posts/post/_search -d '{
  "query": {
    "filtered": {
      "query": {
        "text": {"title": "bolivia"}
      },
      "filter":{
        "has_child": {
          "type": "rating",
          "query" : {
            "filtered": {
              "query": { "match_all": {}},
              "filter" : {
                "and": [
                  {"term": {"user_id": 1234}},
                  {"range": {"rating": {"gt" : 3}}}
                ]
              }
            }
          }
        }
      }
    }
  }
}'

finds only “bolivia rated 4” since that is the only post mentioning bolivia that has been rated over 3 by the user we are interested in. The top level query on title applies to the posts, wheres the query inside the has_child filter describes conditions that the children must match (in this case they must belong to a specific user and have at least a certain rating).

Ordering

What has_child doesn’t let you do is order based on attributes of the children or return attributes of the child. If you wanted to order a user’s rated posts based on when they were rated or by decreasing rating then you can search against posts/rating directly but you might want to apply some search critera to the posts too. For example you might want to only find rated posts on a certain topic (still ordering by the rating the user gave). With has_child, you’re out of luck. Nested documents don’t help either.

As of 0.19.10 you can use the has_parent filter. This works almost exactly the same as has child but allows you to specify a query against the parent items instead. This query returns the ratings by user 1234, on posts whose title matches “bolivia”, in decreasing score order

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
curl -XPOST localhost:9200/posts/rating/_search?pretty=true -d '{
  "query": {
    "filtered": {
      "query": {"match_all": {}},
      "filter": {
        "and": [
          {"term": {"user_id": 1234}},
          {
            "has_parent": {
              "type": "post",
              "query": {
                "term": {"title": "bolivia"}
              }
            }
          }
        ]
      }
    }
  },
  "sort": [
    { "rating" : {"order" : "desc"} }
  ]
}'

This returns the rating objects - you’d have to then fetch the corresponding posts with a separate query.

Faking it

If you’re stuck on an older version of elasticsearch, you can get most of the way there with top_children. As the documentation says top_children first queries the child documents and then aggregates them into parent documents. In our example, this means that elasticsearch will first find the rating documents that match our query. Then it will match each rating to its parent post, aggregating duplicate post where they exist.

The fiddly bit with top children is that elasticsearch doesn’t know ahead of time how many documents it will lose when the aggregation happens. In this particular case it’s easy because two distinct ratings by the same user always correspond to two distinct posts, so we don’t need to bother with factor and incremental_factor settings because the aggregation phase never does anything. Similarly, score mode doesn’t matter either. If you need to provide an accurate count of the total number of results you just need to set factor big enough that the first sweep elasticsearch does of the child documents finds all of them. If you know that the user has 500 rated items on their list and you’re asking for the first 10 items, then a factor of 50 should do the trick. Factor need only be an upper bound - you don’t need to know exactly how many items the user has on their list (which might be awkward to work out without a separate elastic search query if the user is searching a specific subset of their ratings).

What you get after all this is a list of parent documents (posts) sorted by the query score of the children documents (ratings). To achieve the original goal of sorting the posts based on attributes of the children documents we just need to ensure that this query score has the right value. For example, have top_children query wrap a custom_score query so that you have control on what the score for each child is.

With the same documents in the index, this query returns the posts that user 1234 has rated, ordered by their rating:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
curl -XPOST localhost:9200/posts/_search -d '{
  "query": {
    "top_children":{
      "type": "rating",
      "query": {
        "custom_score": {
          "script": "doc['\''rating'\''].value",
          "query": {
            "filtered": {
              "query": {"match_all": {}},
              "filter": {"term": {"user_id": 1234}}
            }
          }
        }
      }
    }
  }
}'

We’re running a top_children query, so the first thing we need to do is to say what is the type of the children we’re considering (rating). Then we provide the query that finds those children. This is a custom_score query, wrapping a filtered query. The filtered query ensures that we find only ratings given by the user we’re interested in, and then the script element makes the score of the rating document be rating itself, so that we get our posts sorted by rating. The funkiness with the backslashes is only because I’m trying to include a literal single quote in a shell friendly string delimited by single quotes - the actual json that we’re sending has just "script": "doc['rating'].value".

Ruby fun

Unfortunately the tire library doesn’t really support any of this fun stuff at the moment - there’s a bit of a moratorium on this sort of feature being added because at the moment every single little query type and option ends up being separate methods scattered all over tire, which the maintainer understandably doesn’t like. You can sort of hack it in though.

Tire doesn’t allow you to set a document’s parent id when indexing. This is straightforward to add and is only being held up by the aforementioned moratorium. My fork adds this ability. With that you end up with

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
class Post < ActiveRecord::Base
  mapping do
    indexes :content
    indexes :id, :type => :integer
  end
end

class Rating < ActiveRecord::Base
  belongs_to :post
  belongs_to :user

  index_name 'posts'

  mapping(:_parent => {:type => 'post'}) do
    indexes :id, :type => :integer
    indexes :user_id, :type => :integer
    indexes :rating, :type => :integer
  end

  def parent
    post_id
  end
end

The next bit of unhappiness is that tire’s auto index creation presumes one type per index but for a parent/child relationship to exist both types have to be in the same index. I’ve ended up doing something like this to create my indexes.

1
2
3
4
5
6
Tire::Index.new "posts" do
  create :mappings => Post.mapping_to_hash, :settings => Post.settings
  Tire::Configuration.client.put(
        "#{Tire::Configuration.url}/posts/#{Rating.document_type}/_mapping",
        Rating.mapping_to_hash.to_json)
end

which isn’t quite as beautiful but gets the job done.

Lastly you need to actually do the query. In the absence of top_children actually being part of tire’s api you can fudge it like so

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
class Post < ActiveRecord::Base
  ...
  def self.top_children(options, &block)
    {:top_children => {:query => Tire::Search::Query.new(&block).to_hash}.merge(options)}
  end

  def self.rated_by_user(user_id)
    query = top_children(:type => 'rating', :score => 'max', :factor => 50) do
      custom_score(script: "doc['rating'].value") do
        filtered do
          query {all}
          filter :term, user_id: user_id
        end
      end
    end
    Post.search do |search|
      search.query do
        @value = query
      end
    end
  end
end

This little bit of unpleasantness builds up the query as a hash and then shoves it into tire while it’s looking the other way. Obviously you can structure it in such a way that it’s easy to add other conditions (whether on Post or on Rating) to the search. You can also build up the json manually and use Post.search :payload => my_json (there is a bug with the payload option that clashes with tire-contrib’s logger extension)