Indexing proposal
Gary Poster
gary.poster at canonical.com
Thu Sep 26 18:59:41 UTC 2013
On 09/26/2013 02:25 PM, Benji York wrote:
> We have a couple of outstanding bugs about indexing charm names better
> (1205477 and 1220909). After looking into Elasticsearch's various
> tokenizing options, the approach we should try is to index the
> charm/bundle "name" into two fields: one will be "non-analyzed" (i.e.,
> indexed in its entirety); the second will use an ngram tokenizer (min=2,
> max=20) but will not apply ngram analysis at search time, because the
> maximum ngram size is large enough to cover all expected search
> strings. We will also use the "dis_max" query type in order to score
> the two fields correctly.
Thank you, Benji. How expensive do you estimate this experiment would be
to implement? Would it affect deployment?
For everyone else, I didn't initially understand why we needed the
non-analyzed index, but the second reference explains:
"The general approach is to index ngrams in a separate field and then
craft a query that searches on both fields but boosts matches on the non
ngram field. This way you match on partial words (ngrams) but favor
matches on whole tokens. This is generally where DisMax is useful
because the query plays an important role in fine tuning the relevance."
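To illustrate the quoted approach, a dis_max query over the two fields might look roughly like this (again as a plain Python dict; the field names, the example search term "wordpress", and the boost and tie_breaker values are illustrative, not decided):

```python
# Sketch: dis_max takes the best-scoring sub-query per document, so a
# whole-token match on the non-analyzed field (boosted) wins over a
# partial ngram match, while ngrams still catch partial-word searches.
query = {
    "query": {
        "dis_max": {
            "tie_breaker": 0.3,
            "queries": [
                # Exact match on the non-analyzed field, boosted.
                {"term": {"name": {"value": "wordpress", "boost": 3.0}}},
                # Partial match via the ngram-indexed field.
                {"match": {"name.ngrams": "wordpress"}},
            ],
        },
    },
}
```

Tuning the boost (and tie_breaker) is where the relevance fine-tuning the reference describes would happen.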
Gary
>
> Reference:
> http://www.elasticsearch.org/guide/reference/index-modules/analysis/ngram-tokenizer/
> http://elasticsearch-users.115913.n3.nabble.com/Which-is-the-best-right-use-of-NGrams-td4030176.html
> http://www.elasticsearch.org/guide/reference/query-dsl/dis-max-query/
>
More information about the Juju-GUI mailing list