When you’re indexing data, the world is rarely as simple as each document existing in isolation. Sometimes, you’re better off denormalizing all data into the child documents. For example if you were modelling books, adding an author field to books could be a sensible choice (even if in the database that is your authoritative datasource the data is split into separate authors
and books
table). It’s simple and you can easily construct queries on both attributes of the book and the author’s name.
That’s not always practical - there might be too much data in the parent document to duplicate it in each child document. If you had your typical blog/comment app then you wouldn’t want to repeat the entire content of the blog post in each comment as this would vastly increase the amount of indexed data several. Yet without that you can’t easily write queries to find comments on posts matching certain criteria (other than by doing a 2 step of process of first finding matching posts and then fetching comments with a certain post_id
, which is often unwieldy or slow (or both)).
Another option is to place child documents inside the parent document, for example you can have documents of the form
1 2 3 4 5 6 7 8 9 10 11 12 |
|
One drawback here is that adding a child requires reindexing the entire document. Finding authors that have written books with a certain genre is easy but finding authors that have had a science fiction book published by penguin is harder.
If our index contains
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
then the most obvious query
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
finds both authors - you can’t express that your conditions on published
and genre
need to match against the same book.
Nested Documents
ElasticSearch provides two things that help with this. The first is the concept of a nested document/query. This allows you to say that you are looking for authors where at least one book satisfies both of your criteria.
First you need to setup a mapping that says that the books field is going to be nested:
1 2 3 4 5 6 7 8 9 |
|
If we then insert the same data as before into this new index then this query
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
|
Here the nested
filter allows you to run a query against the nested documents (ie the books) and filter authors by those that have at least one nested document matching the query. The path
option tells us which part of the authors document this query applies to and then the query
option is a query to run against these nested documents. Unlike the previous query this requires that an individual book be found satisfying both requirements so only Alaistair Reynolds is returned
Parent & child
The other concept elasticsearch provides is that of a parent and child relationship between documents. The previous example can be reworked with authors as the parent documents and books as the child documents.
This time, index the authors separately from their books:
1 2 3 4 5 6 7 |
|
Then configure the mapping for the book type and say that its parent type is bare_author
. You need to do this before you create any books.
1 2 3 4 5 |
|
When we indexing books, you must then give the id of their parent (ie we supply the id of one of the previously created authors)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Elasticsearch provides a has_child
filter that does pretty much what is says on the tin: it selects parent documents with a least one child satisfying a certain query. This query then finds only Alastair Reynolds:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Solr 4.0 will apparently have the ability to do joins, although as far as I can tell this comes with some restrictions, in particular no joins if you are operated in a distributed environment. By limiting itself to parent/child type relationships elasticsearch makes life easier for itself: a child is always indexed in the same shard as its parent, so has_child
doesn’t have to do awkward cross shard operations.
Building lists
You can also use this to model user specific lists of shared global items - e.g. if you wanted items that a user had rated. In this case your child documents would represent the fact that a specific user had rated a specific post - they are nothing more than a user_id
, post_id
and a rating: a join table in relational database lingo.
Using a parent/child relationships and has_child
allows you to easily find all the posts favourited by a user while enabling users to search through their favourites based on post content, date or any other of a post’s attributes or any of the child item properties. Adding an item to the rated items list is cheap - it requires only indexing a very small rating
item.
With these documents
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
This query
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
finds only “bolivia rated 4” since that is the only post mentioning bolivia that has been rated over 3 by the user we are interested in. The top level query on title applies to the posts, wheres the query inside the has_child
filter describes conditions that the children must match (in this case they must belong to a specific user and have at least a certain rating).
Ordering
What has_child
doesn’t let you do is order based on attributes of the children or return attributes of the child. If you wanted to order a user’s rated posts based on when they were rated or by decreasing rating then you can search against posts/rating directly but you might want to apply some search critera to the posts too. For example you might want to only find rated posts on a certain topic (still ordering by the rating the user gave). With has_child
, you’re out of luck. Nested documents don’t help either.
As of 0.19.10 you can use the has_parent
filter. This works almost exactly the same as has child but allows you to specify a query against the parent items instead. This query returns the ratings by user 1234, on posts whose title matches “bolivia”, in decreasing score order
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
|
This returns the rating objects - you’d have to then fetch the corresponding posts with a separate query.
Faking it
If you’re stuck on an older version of elasticsearch, you can get most of the way there with top_children
. As the documentation says top_children
first queries the child documents and then aggregates them into parent documents. In our example, this means that elasticsearch will first find the rating documents that match our query. Then it will match each rating to its parent post, aggregating duplicate post where they exist.
The fiddly bit with top children is that elasticsearch doesn’t know ahead of time how many documents it will lose when the aggregation happens. In this particular case it’s easy because two distinct ratings by the same user always correspond to two distinct posts, so we don’t need to bother with factor
and incremental_factor
settings because the aggregation phase never does anything. Similarly, score mode doesn’t matter either. If you need to provide an accurate count of the total number of results you just need to set factor
big enough that the first sweep elasticsearch does of the child documents finds all of them. If you know that the user has 500 rated items on their list and you’re asking for the first 10 items, then a factor of 50 should do the trick. Factor need only be an upper bound - you don’t need to know exactly how many items the user has on their list (which might be awkward to work out without a separate elastic search query if the user is searching a specific subset of their ratings).
What you get after all this is a list of parent documents (posts) sorted by the query score of the children documents (ratings). To achieve the original goal of sorting the posts based on attributes of the children documents we just need to ensure that this query score has the right value. For example, have top_children
query wrap a custom_score
query so that you have control on what the score for each child is.
With the same documents in the index, this query returns the posts that user 1234 has rated, ordered by their rating:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
We’re running a top_children
query, so the first thing we need to do is to say what is the type of the children we’re considering (rating). Then we provide the query that finds those children. This is a custom_score
query, wrapping a filtered
query. The filtered
query ensures that we find only ratings given by the user we’re interested in, and then the script
element makes the score of the rating document be rating itself, so that we get our posts sorted by rating. The funkiness with the backslashes is only because I’m trying to include a literal single quote in a shell friendly string delimited by single quotes - the actual json that we’re sending has just "script": "doc['rating'].value"
.
Ruby fun
Unfortunately the tire library doesn’t really support any of this fun stuff at the moment - there’s a bit of a moratorium on this sort of feature being added because at the moment every single little query type and option ends up being separate methods scattered all over tire, which the maintainer understandably doesn’t like. You can sort of hack it in though.
Tire doesn’t allow you to set a document’s parent id when indexing. This is straightforward to add and is only being held up by the aforementioned moratorium. My fork adds this ability. With that you end up with
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
|
The next bit of unhappiness is that tire’s auto index creation presumes one type per index but for a parent/child relationship to exist both types have to be in the same index. I’ve ended up doing something like this to create my indexes.
1 2 3 4 5 6 |
|
which isn’t quite as beautiful but gets the job done.
Lastly you need to actually do the query. In the absence of top_children
actually being part of tire’s api you can fudge it like so
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
This little bit of unpleasantness builds up the query as a hash and then shoves it into tire while it’s looking the other way. Obviously you can structure it in such a way that it’s easy to add other conditions (whether on Post or on Rating) to the search. You can also build up the json manually and use Post.search :payload => my_json
(there is a bug with the payload option that clashes with tire-contrib’s logger extension)