Wednesday, March 13, 2013

Minimum Match per index field: SOLR Ranking and Relevance improvement


With the SOLR minimum match parameter (mm), the constraint is applied collectively across all the fields used for matching (qf). So if the query has two keywords and each keyword is found in a different matching field, the document is still deemed matched and relevant.
For example,
qf = title, description, keyword
mm=2<75%
q= adopt a pet dog, where the matching keywords are “adopt”, “pet” and “dog”.

This could match a document titled “Adopting Animals” whose description talks about all kinds of pet animals and whose keyword field lists animals, including dog. It could equally match a document titled “How to adopt a dog”, with the page describing exactly that. Yet the second document might be ranked lower than the first because of document size and keyword counts in the description, even though it is the more relevant result for this query.
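To make the mm arithmetic concrete, here is a minimal plain-Java sketch of how a spec such as 2<75% can be turned into a required clause count. It is a simplification written for illustration, not Solr's own code: the class and method names are made up, only a plain integer, a plain percentage, or a single N<value condition is handled, and percentages are assumed to round down.

```java
final class MinShouldMatchSketch {

    // How many of clauseCount optional clauses must match under the given mm spec.
    static int required(String spec, int clauseCount) {
        String s = spec.trim();
        int lt = s.indexOf('<');
        if (lt >= 0) {
            // Conditional form "N<value": at or below N clauses, all of them are required.
            int threshold = Integer.parseInt(s.substring(0, lt).trim());
            if (clauseCount <= threshold) {
                return clauseCount;
            }
            s = s.substring(lt + 1).trim();
        }
        int result;
        if (s.endsWith("%")) {
            int percent = Integer.parseInt(s.substring(0, s.length() - 1).trim());
            int part = (clauseCount * percent) / 100; // integer division rounds toward zero
            result = percent < 0 ? clauseCount + part : part;
        } else {
            int n = Integer.parseInt(s);
            result = n < 0 ? clauseCount + n : n;
        }
        // Never require more clauses than exist, and never fewer than zero.
        return Math.max(0, Math.min(clauseCount, result));
    }

    public static void main(String[] args) {
        for (int clauses = 1; clauses <= 5; clauses++) {
            System.out.println(clauses + " clauses, mm=2<75% -> " + required("2<75%", clauses));
        }
    }
}
```

Under these assumptions, the three keywords adopt, pet and dog need only two matches, and nothing says those two matches have to come from the same field.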

Matching tokens in the description field can also dilute the ranking relevance, and the document might still get ranked higher because of tf-idf. We can lower the weight of the description field relative to title, e.g. qf = title^5, description^2, keyword, which addresses the issue to some extent.

Here we will talk about setting a different minimum match criterion (mm) for each index field, to further restrict the matching so that keywords scattered across different index fields do not dilute the relevancy. This solution can help improve document relevancy by 12%-20% (as measured by a simple result-text similarity score generator).
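The difference between a collective mm and a per-field mm can be sketched without Solr at all. The toy model below is an assumption-laden illustration: documents are maps from field name to a set of already-analyzed tokens, all names are hypothetical, and scoring is ignored entirely. It only counts matches.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PerFieldMatchSketch {

    // Builds a toy document: field name -> set of already-analyzed tokens.
    static Map<String, Set<String>> doc(String title, String description, String keyword) {
        Map<String, Set<String>> fields = new LinkedHashMap<>();
        fields.put("title", new HashSet<>(Arrays.asList(title.split(" "))));
        fields.put("description", new HashSet<>(Arrays.asList(description.split(" "))));
        fields.put("keyword", new HashSet<>(Arrays.asList(keyword.split(" "))));
        return fields;
    }

    // Query terms found in at least one field: the count a single, collective mm sees.
    static int collectiveMatches(Map<String, Set<String>> doc, List<String> terms) {
        int matched = 0;
        for (String term : terms) {
            for (Set<String> tokens : doc.values()) {
                if (tokens.contains(term)) {
                    matched++;
                    break;
                }
            }
        }
        return matched;
    }

    // Query terms found inside one named field.
    static int fieldMatches(Map<String, Set<String>> doc, String field, List<String> terms) {
        Set<String> tokens = doc.get(field);
        int matched = 0;
        if (tokens != null) {
            for (String term : terms) {
                if (tokens.contains(term)) matched++;
            }
        }
        return matched;
    }

    // Per-field rule: either every field must reach `required`, or any single field is enough.
    static boolean perFieldMatch(Map<String, Set<String>> doc, List<String> terms,
                                 int required, boolean requireAllFields) {
        boolean any = false;
        boolean all = true;
        for (String field : doc.keySet()) {
            boolean ok = fieldMatches(doc, field, terms) >= required;
            any |= ok;
            all &= ok;
        }
        return requireAllFields ? all : any;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("adopt", "pet", "dog");
        Map<String, Set<String>> scattered = doc("adopt animal", "pet animal care", "cat dog bird");
        Map<String, Set<String>> focused = doc("how to adopt a dog", "adopt dog shelter", "dog");
        System.out.println("scattered, collective count: " + collectiveMatches(scattered, query));
        System.out.println("focused, collective count: " + collectiveMatches(focused, query));
        System.out.println("scattered, any field >= 2: " + perFieldMatch(scattered, query, 2, false));
        System.out.println("focused, any field >= 2: " + perFieldMatch(focused, query, 2, false));
    }
}
```

In this toy setup both documents satisfy a collective requirement of two terms, but only the focused one has a single field that holds two query terms by itself.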
Configure new SearchComponent params in solrconfig.xml to set up the per-index-field mm value. This is only an example; the format of the per-field mm value depends on your implementation-
com.test.solr.qparser.MinimumMatchFieldQueryProcessor

title_mm=3<75%||description_mm=3<75%||keyword_mm=3<75%
or


Since the minimum match (mm) parameter is processed and set in the QParser class, we will set the minimum match criteria per field and update the parameters there.
Here is the QueryProcessor interface to implement-

public interface QueryProcessor
{
     void preprocess(QParser qPlugin);
     Query process(QParser qPlugin, Query prevQuery) throws ParseException;
}

The MinimumMatchFieldQueryProcessor implementation-


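// Imports assume Solr/Lucene 3.x-4.x and Commons Lang 2.x.
// ParseException comes from the Lucene query parser; its package depends on the Lucene version.
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.solr.search.QParser;

public class MinimumMatchFieldQueryProcessor implements QueryProcessor {
    // field name -> mm spec, e.g. "title" -> "3<75%"
    private Map<String, String> minMatchFieldsMap = null;
    private String mmOP;

    @Override
    public void preprocess(QParser qPlugin) {
        String fieldsToMatch = qPlugin.getParams().get("minmatch.mm");
        mmOP = qPlugin.getParams().get("minmatch.op", "AND");
        minMatchFieldsMap = new HashMap<String, String>();
        if (fieldsToMatch == null) {
            return; // no per-field mm configured
        }
        // Entries look like title_mm=3<75% and are separated by "||"
        String[] fields = fieldsToMatch.split("\\|\\|");
        for (String field : fields) {
            int indx = field.indexOf("=");
            if (indx != -1) {
                minMatchFieldsMap.put(field.substring(0, indx).replaceAll("_mm", ""),
                        field.substring(indx + 1));
            }
        }
    }

    @Override
    public Query process(QParser qPlugin, Query prevQuery) throws ParseException {
        // CommonUtils.extractPureQuery is a helper that is not shown in this post
        String queryString = CommonUtils.extractPureQuery(qPlugin.getString());

        // Nothing to restrict: keep the query built so far
        if (StringUtils.isBlank(queryString) || minMatchFieldsMap.isEmpty()) return prevQuery;

        BooleanQuery bq = new BooleanQuery(true);
        for (Map.Entry<String, String> entry : minMatchFieldsMap.entrySet()) {
            // One edismax sub-query per field, each carrying its own mm
            String subQueryString = String.format("_query_:\"{!edismax qf=%s mm=%s}%s\"",
                    entry.getKey(), entry.getValue(), queryString);

            Query minMatchQuery = qPlugin.subQuery(subQueryString, "lucene").getQuery();

            // minmatch.op=AND: every field must satisfy its mm; otherwise any field may
            if ("and".equalsIgnoreCase(mmOP)) {
                bq.add(minMatchQuery, Occur.MUST);
            } else {
                bq.add(minMatchQuery, Occur.SHOULD);
            }
        }
        return bq;
    }
}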
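To see what the processor hands to Solr, the string handling can be reproduced in plain Java. This is only a sketch with hypothetical class and method names: it parses the example per-field setting and prints the nested edismax sub-queries, without touching any Solr API. Escaping of backslashes and double quotes in the user query is an addition of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SubQuerySketch {

    // Parses "title_mm=3<75%||description_mm=3<75%" into field -> mm spec, keeping the order.
    static Map<String, String> parseFieldMm(String spec) {
        Map<String, String> result = new LinkedHashMap<>();
        if (spec == null) {
            return result;
        }
        for (String entry : spec.split("\\|\\|")) {
            int eq = entry.indexOf('=');
            if (eq == -1) {
                continue; // skip malformed entries
            }
            String key = entry.substring(0, eq).trim();
            if (key.endsWith("_mm")) {
                key = key.substring(0, key.length() - 3);
            }
            result.put(key, entry.substring(eq + 1).trim());
        }
        return result;
    }

    // Builds one nested edismax sub-query string per field.
    static List<String> buildSubQueries(Map<String, String> fieldMm, String userQuery) {
        // Escape characters that would end the quoted nested query early.
        String escaped = userQuery.replace("\\", "\\\\").replace("\"", "\\\"");
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : fieldMm.entrySet()) {
            out.add(String.format("_query_:\"{!edismax qf=%s mm=%s}%s\"",
                    e.getKey(), e.getValue(), escaped));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> fieldMm =
                parseFieldMm("title_mm=3<75%||description_mm=3<75%||keyword_mm=3<75%");
        for (String subQuery : buildSubQueries(fieldMm, "adopt a pet dog")) {
            System.out.println(subQuery);
        }
    }
}
```

Each printed line corresponds to one clause of the BooleanQuery, added as MUST or SHOULD depending on minmatch.op.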

In the next blog I will talk about how to add a customized QueryProcessor.

Monday, February 18, 2013

SOLR: Unordered exact match - Restrict matching based on token count


If your use case demands strict matching, here is an example of how you can restrict matching based on token count. In the example below, we narrow the search to documents whose token count is within +/- one of the number of keywords in the query. You can certainly change the range parameter to span +/- any count. You can also tune the matching by adding filters to the field analyzer: stop word filters, duplicate removal, and so on.
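The idea can be tried out in isolation before wiring it into Solr. The sketch below is a stand-in, not the real analysis chain: the hypothetical countTokens method lowercases, splits on non-alphanumerics, drops a tiny made-up stop word list and removes duplicates, and withinWindow applies the +/- range.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class TokenWindowSketch {

    // Made-up stop word list, standing in for a stop word filter in the analyzer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to"));

    // Counts unique, non-stop-word tokens, roughly as a simple analyzer chain might.
    static int countTokens(String text) {
        Set<String> unique = new LinkedHashSet<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                unique.add(raw);
            }
        }
        return unique.size();
    }

    // True when the title's token count is within +/- slack of the query's token count.
    static boolean withinWindow(int queryTokens, int titleTokens, int slack) {
        return Math.abs(queryTokens - titleTokens) <= slack;
    }

    public static void main(String[] args) {
        int queryTokens = countTokens("adopt a pet dog");
        String[] titles = {
            "How to adopt a dog",
            "The complete guide to adopting and caring for a rescue dog in the city"
        };
        for (String title : titles) {
            int titleTokens = countTokens(title);
            System.out.println(title + " -> " + titleTokens + " tokens, kept: "
                    + withinWindow(queryTokens, titleTokens, 1));
        }
    }
}
```

Changing the stop word list or the slack value shows how strongly the analyzer settings influence which titles survive the filter.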

Setting Token Count field

First we will add the token count field to our schema to hold the count of tokens for the field “title”.
<field name="titleToken" type="int" indexed="true" stored="true" />
<field name="title" type="text" indexed="true" stored="true" />


Next we extend the SearchComponent class to work with the titleToken field, which holds the count of tokens in the field “title” once the analyzer settings have taken effect; in the example case, the analyzer settings for fieldType=”text”.

Extend SearchComponent

Here we will extend SearchComponent to read the field or fields on which we want to restrict the matching based on token count, title for example. The analyzer is read in the inform() method so that the settings for the title field in schema.xml are applied.

public class QueryTokenComponent extends SearchComponent implements SolrCoreAware {
    private String fieldName = "title";
    private Analyzer analyzer;

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    @Override
    public void inform(SolrCore core) {
        // The schema analyzer applies each field's own analysis chain
        analyzer = core.getSchema().getAnalyzer();
    }

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // filled in below
    }

    // process() and the other methods SearchComponent requires are omitted here
}

Next we fill in the prepare() method of the class above: it computes the token range for the query and adds it to the request as a filter query through ModifiableSolrParams.

@Override
public void prepare(ResponseBuilder rb) throws IOException {
   SolrQueryRequest req = rb.req;
   SolrParams params = req.getParams();
       
   ModifiableSolrParams modparams = new ModifiableSolrParams(params);
   String queryString = modparams.get(CommonParams.Q);
   int tokenCnt = AnalyzerUtils.getTokens(analyzer, fieldName, queryString);
   modparams.add(CommonParams.FQ, "titleToken:[ " + (tokenCnt - 1) + " TO " + (tokenCnt +1) +"]");
   req.setParams(modparams);
}
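The range string built in prepare() can also be factored into a small helper, which makes the width of the window configurable. This is a sketch with hypothetical names; clamping the lower bound at zero is an assumption added here so that very short queries do not produce a negative bound.

```java
final class TokenRangeFilterSketch {

    // Builds a range filter such as titleToken:[2 TO 4] around the query's token count.
    static String buildFilter(String countField, int queryTokens, int slack) {
        int low = Math.max(0, queryTokens - slack); // assumption: counts are never negative
        int high = queryTokens + slack;
        return countField + ":[" + low + " TO " + high + "]";
    }

    public static void main(String[] args) {
        // A three-token query with the +/- one window used in this post
        System.out.println(buildFilter("titleToken", 3, 1));
        // A wider window of +/- two
        System.out.println(buildFilter("titleToken", 3, 2));
    }
}
```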
And here’s what the getTokens method of AnalyzerUtils looks like-

// Static so that it can be called as AnalyzerUtils.getTokens(...)
public static int getTokens(Analyzer analyzer, String field, String query) throws IOException {
    TokenStream tokenStream = analyzer.tokenStream(field, new StringReader(query));
    CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    List<String> tokens = new ArrayList<String>();
    try {
        tokenStream.reset(); // must be called before the first incrementToken()
        while (tokenStream.incrementToken()) {
            tokens.add(termAttribute.toString());
        }
        tokenStream.end();
    } finally {
        tokenStream.close();
    }
    return tokens.size();
}

Debug

If you’d like to see the token count, or the tokens that come into play, add a field for it in schema.xml and populate it in an UpdateRequestProcessor subclass.

class TokenCountProcessHandler extends UpdateRequestProcessor
{
     private Analyzer analyzer;
   
     public TokenCountProcessHandler(SolrQueryRequest req,
                                     SolrQueryResponse rsp,
                                     UpdateRequestProcessor next)
     {
         super(next);
         analyzer = req.getSchema().getAnalyzer();
     }
    @Override
    public void processAdd(final AddUpdateCommand cmd) throws IOException
    { 
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue( "title" );
        if( v != null )
        {
             String title =  v.toString();
             // getTokens already returns the count as an int
             doc.addField("wrd_cnt", AnalyzerUtils.getTokens(analyzer, "title", title));
         }
         cmd.solrDoc = doc;

        // pass it up the chain
        super.processAdd(cmd);
     }
}