Monday, February 18, 2013

SOLR: Unordered exact match - Restrict matching based on token count


If your use case demands strict matching, here is an example of how you can restrict matching based on token count. In the example below, we narrow the search to documents whose token count is within +/- one of the number of keywords in the query. You can change the range to span +/- any count. You can also tune the matching by adding filters to the field analyzer, for example stop word filters or duplicate removal.
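To make the idea concrete before touching Solr, here is a minimal in-memory sketch of the matching rule. It is not Solr code: the class name TokenCountMatchSketch, the whitespace tokenizer, and the assumption that every query keyword must be present are all illustrative choices, not something the setup below enforces by itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of "unordered match within a token count range".
class TokenCountMatchSketch {

    // Lower-cases and splits on whitespace; a crude stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase().split("\\s+"));
    }

    // True when the title contains every query token (in any order) and the
    // title's token count is within +/- tolerance of the query's token count.
    static boolean matches(String title, String query, int tolerance) {
        List<String> titleTokens = tokenize(title);
        List<String> queryTokens = tokenize(query);
        Set<String> titleSet = new HashSet<>(titleTokens);
        boolean allPresent = titleSet.containsAll(queryTokens);
        int diff = Math.abs(titleTokens.size() - queryTokens.size());
        return allPresent && diff <= tolerance;
    }

    public static void main(String[] args) {
        // Same three tokens in a different order: prints true
        System.out.println(matches("red leather sofa", "sofa red leather", 1));
        // One extra token in the title: prints true
        System.out.println(matches("large red leather sofa", "leather sofa red", 1));
        // Four extra tokens in the title: prints false
        System.out.println(matches("large red leather corner sofa with cushions", "red leather sofa", 1));
    }
}
```

Raising the tolerance argument loosens the match in the same way that widening the range does in the Solr setup.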

Setting Token Count field

First we will add the token count field in our Schema to hold the count of tokens for the field “title”.  
<field name="titleToken" type="int" indexed="true" stored="true" />
<field name="title" type="text" indexed="true" stored="true" />


Next we extend the SearchComponent class to update the titleToken field with the count of tokens in the field “title” after the analyzer settings take effect; in this example, the analyzer settings for fieldType=”text”.

Extend SearchComponent

Here we extend SearchComponent to read the field or fields on which we want to restrict matching based on token count, title in this example. The analyzer is read in the inform() method so that the settings for the title field in schema.xml are applied.

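import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.util.plugin.SolrCoreAware;

public class QueryTokenComponent extends SearchComponent implements SolrCoreAware {
    // Field whose token count restricts matching
    private String fieldName = "title";
    private Analyzer analyzer;

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    // Called once the core is loaded; picks up the analyzers defined in schema.xml
    @Override
    public void inform(SolrCore core) {
        analyzer = core.getSchema().getAnalyzer();
    }

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // filled in below
    }

    // process(), getDescription() and any other abstract methods are omitted here
}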

Next we override the prepare() method in the class above to compute the token range for the query, add it as a filter query, and set the updated ModifiableSolrParams on the request.

@Override
public void prepare(ResponseBuilder rb) throws IOException {
    SolrQueryRequest req = rb.req;
    SolrParams params = req.getParams();

    // Copy the request params so a filter query can be appended
    ModifiableSolrParams modparams = new ModifiableSolrParams(params);
    String queryString = modparams.get(CommonParams.Q);
    if (queryString == null) {
        return; // no q parameter, nothing to restrict
    }
    int tokenCnt = AnalyzerUtils.getTokens(analyzer, fieldName, queryString);
    // Keep only documents whose title token count is within +/- 1 of the query's
    modparams.add(CommonParams.FQ, "titleToken:[" + (tokenCnt - 1) + " TO " + (tokenCnt + 1) + "]");
    req.setParams(modparams);
}
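
If you want the +/- range to be configurable, the range clause can be pulled out into a small helper. This is only a sketch: TokenRangeFilter and its tolerance parameter are hypothetical names, and clamping the lower bound at zero is my own assumption (the prepare() method above does not clamp).

```java
// Hypothetical helper, not part of Solr: builds the range clause for the filter query.
class TokenRangeFilter {

    // countField: the schema field holding the token count (titleToken in this post)
    // tokenCount: number of tokens the analyzer produced for the query
    // tolerance:  how many tokens above or below the query count are still accepted
    static String build(String countField, int tokenCount, int tolerance) {
        int lower = Math.max(0, tokenCount - tolerance); // assumption: never go below zero
        int upper = tokenCount + tolerance;
        return countField + ":[" + lower + " TO " + upper + "]";
    }

    public static void main(String[] args) {
        System.out.println(build("titleToken", 3, 1)); // titleToken:[2 TO 4]
        System.out.println(build("titleToken", 0, 1)); // titleToken:[0 TO 1]
    }
}
```

The returned string could be passed to modparams.add(CommonParams.FQ, ...) in place of the inline concatenation.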
And here is what the getTokens method looks like:

// Counts the tokens the analyzer produces for the given field and text
public static int getTokens(Analyzer analyzer, String field, String query) throws IOException {
    TokenStream tokenStream = analyzer.tokenStream(field, new StringReader(query));
    CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    List<String> tokens = new ArrayList<String>();
    try {
        tokenStream.reset(); // part of the TokenStream contract before incrementToken()
        while (tokenStream.incrementToken()) {
            tokens.add(termAttribute.toString());
        }
        tokenStream.end();
    } finally {
        tokenStream.close();
    }
    return tokens.size();
}
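
Since the count comes from the analyzer output, any filter in the chain changes it, and the query and the indexed title need to go through comparable analysis for the counts to line up. The sketch below shows the effect with a stop word filter. It does not use Lucene: StopWordCountSketch, its whitespace tokenizer, and the stop word list are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an analyzer chain with an optional stop word filter.
class StopWordCountSketch {

    // Assumed stop word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "a", "an"));

    // Lower-cases, splits on whitespace, optionally drops stop words, returns the count.
    static int countTokens(String text, boolean removeStopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields one empty string from split
            }
            if (removeStopWords && STOP_WORDS.contains(token)) {
                continue;
            }
            kept.add(token);
        }
        return kept.size();
    }

    public static void main(String[] args) {
        String title = "The Lord of the Rings";
        System.out.println(countTokens(title, false)); // 5
        System.out.println(countTokens(title, true));  // 2
    }
}
```

With stop words removed, a query like "lord rings" has the same count as the full title, so it would fall inside the +/- one range.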

Debug

If you’d like to see the token count, or the tokens that come into play, add a field for it to schema.xml (wrd_cnt in the code below) and populate it in an extension of the UpdateRequestProcessor class.

class TokenCountProcessHandler extends UpdateRequestProcessor {
    private Analyzer analyzer;

    public TokenCountProcessHandler(SolrQueryRequest req,
                                    SolrQueryResponse rsp,
                                    UpdateRequestProcessor next) {
        super(next);
        analyzer = req.getSchema().getAnalyzer();
    }

    @Override
    public void processAdd(final AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue("title");
        if (v != null) {
            String title = v.toString();
            // getTokens already returns the count, so no .size() call is needed
            doc.addField("wrd_cnt", AnalyzerUtils.getTokens(analyzer, "title", title));
        }
        cmd.solrDoc = doc;

        // pass it up the chain
        super.processAdd(cmd);
    }
}