Thursday, February 16, 2012

SOLR: Improve relevancy by boosting exact and phrase match

Once the index is ready for searching, the natural next step is to improve the relevancy of the search results. SOLR provides many ways to tune relevancy, but one obvious technique is almost always overlooked: boosting exact and phrase matches over plain keyword matches can improve relevancy significantly.

Exact Match Setup


To set up a field for exact matching, add another field in Schema.xml and copy the content into it using copyField:
<field name="title" type="text" indexed="true" stored="true" />
<field name="titleExact" type="textExact" indexed="true" stored="true" />
<copyField source="title" dest="titleExact"/>


Notice that the type of titleExact is set to "textExact" (defined below). A similar exact-match effect can be achieved by setting the type to "string", but defining our own type lets us fine-tune the matching by adding the appropriate tokenizer and filters.
<fieldType name="textExact" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>

Here I have used WhitespaceTokenizer without stopword or stemming filters. LimitTokenCountFilter limits the number of tokens, and LowerCaseFilter makes the matching case-insensitive. We can tune the textExact type further to make the exact match more lenient or more strict, depending on the use case.
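To get a feel for what the textExact chain produces, here is a minimal plain-Java sketch that imitates the three steps (split on whitespace, cap at 20 tokens, lowercase). It does not call Lucene or SOLR; the class name TextExactSketch and the method analyze are made up for illustration, and the real filters may differ in edge cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java imitation of the textExact analyzer chain (illustrative only).
class TextExactSketch {
    static final int MAX_TOKEN_COUNT = 20; // mirrors maxTokenCount="20"

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on runs of whitespace only; punctuation stays attached to the token.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields no tokens
            }
            if (tokens.size() == MAX_TOKEN_COUNT) {
                break; // tokens beyond the limit are dropped
            }
            tokens.add(token.toLowerCase(Locale.ROOT)); // case-insensitive matching
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Lord of the Rings")); // [the, lord, of, the, rings]
        System.out.println(analyze("Running, Shoes"));        // [running,, shoes]
    }
}
```

Note that "the" and "of" survive because there is no stopword filter, and "Running," keeps its comma because the tokenizer only splits on whitespace.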

Putting it All Together


Now, to boost the exact match field and phrase matching, set the following in SolrConfig.xml:
<str name="qf">title titleExact^10</str>
<str name="pf">title^10 titleExact^100</str>

For both keyword and phrase matching, a match on the exact field "titleExact" is boosted higher than a match on the non-exact field "title". The same fields are also boosted higher for phrase search (pf) than for keyword search (qf). This is a simple first step toward improving relevancy.
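The same boosts can also be tried out per request before committing them to the config. The sketch below only builds the query string; the endpoint URL is hypothetical, and defType=dismax is my assumption that the handler uses the dismax parser (which is the parser that reads qf and pf).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Builds a query string carrying the qf/pf boosts shown above (illustrative only).
class BoostQuerySketch {
    // Hypothetical endpoint; replace with your own host, core and handler.
    static final String SELECT_URL = "http://localhost:8983/solr/select";

    static String buildQueryString(String userQuery) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", userQuery);
        params.put("defType", "dismax");             // assumption: dismax parser is in use
        params.put("qf", "title titleExact^10");     // keyword match boosts
        params.put("pf", "title^10 titleExact^100"); // phrase match boosts

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(e.getKey()).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(SELECT_URL + "?" + buildQueryString("lord of the rings"));
    }
}
```

Changing the boost values in the request and comparing the result order is a quick way to find factors worth putting into SolrConfig.xml.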

Saturday, February 11, 2012

Adding Ranking Support using SOLR SearchComponent

While adding region-based ranking support to one of our search indexes, the SOLR SearchComponent hook came in handy and was quick to implement. I have oversimplified the use case below to outline the steps for adding external ranking support to a SOLR search index, but feel free to drop a comment or email if you need more info.

Define the SearchComponent

In SolrConfig.xml, define the new SearchComponent as:

<searchComponent name="rank" class="com.solr.searchindex.component.RankingComponent">
   <lst name="rank">
      <lst name="DEFAULT">
         <str name="bf">hostRankdefault^10</str>
      </lst>
      <lst name="US">
         <str name="bf">hostRankus^5</str>
      </lst>
      <lst name="UK">
         <str name="bf">hostRankuk^5</str>
      </lst>
   </lst>
</searchComponent>

Here I have two regions, US and UK, and I would like the documents ranked according to the host ranking defined for each region in an external file. Boosting documents based on ranks in an external file gives us the flexibility to tune the ranks at any time, or even add more regions, without regenerating the index, which is a huge gain if you have a large index.

Register the new SearchComponent in the list of components. Note: the order in which the components are registered matters.

<arr name="components">
   <str>rank</str>
   <str>query</str>
   <str>highlight</str>
   <str>debug</str>
</arr>

Now we need to define, in Schema.xml, the three fields that the SearchComponent uses for boosting.

Define ExternalFileFields


First, define the new ExternalFileField as a fieldType in Schema.xml, with keyField referring to the field 'site', which stores the host/domain name.

<fieldType name="hostRankExt" keyField="site" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
<field name="site" type="string" indexed="true" stored="true"/>

Here I have defined the value type 'valType' for this field as float.

Now we define the boost fields which will refer to this ExternalFileField.
<field name="hostRankdefault" type="hostRankExt"/>
<field name="hostRankus" type="hostRankExt"/>
<field name="hostRankuk" type="hostRankExt"/>

Define External Files


The next step is to add three host files containing the ranks for these three boost fields. Each file must be named external_<fieldname> and placed in the index directory for SOLR to pick it up. A few things to note: if an external file has already been loaded and is then updated, the changes become visible only after a commit; and it is recommended to keep the external file sorted by key.

external_hostRankdefault

uk.yahoo.com=0.5
www.yahoo.com=0.5

external_hostRankus

uk.yahoo.com=0.5
www.yahoo.com=1.0

external_hostRankuk

uk.yahoo.com=1.0
www.yahoo.com=0.5
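Since the files are plain key=value lines and should stay sorted by key, they are easy to generate. Here is a minimal sketch that renders such a file from a map; the class name ExternalFileSketch and the idea of passing the target path as a command-line argument are my own choices for illustration, not something SOLR requires.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Renders host ranks as key=value lines sorted by key (illustrative only).
class ExternalFileSketch {
    static String render(Map<String, Float> ranks) {
        StringBuilder sb = new StringBuilder();
        // TreeMap iterates in ascending key order, which keeps the file sorted.
        for (Map.Entry<String, Float> e : new TreeMap<String, Float>(ranks).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Float> usRanks = new HashMap<String, Float>();
        usRanks.put("www.yahoo.com", 1.0f);
        usRanks.put("uk.yahoo.com", 0.5f);

        String content = render(usRanks);
        if (args.length > 0) {
            // args[0] is the target file, for example the path to external_hostRankus
            Files.write(Paths.get(args[0]), content.getBytes(StandardCharsets.UTF_8));
        } else {
            System.out.print(content);
        }
    }
}
```

After rewriting a file that SOLR has already loaded, remember that the new ranks only take effect after the next commit.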


Extend the SearchComponent to add the ranking support


Here is a quick code sample that adds ranking support based on the region passed in the query URL.
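import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.util.plugin.SolrCoreAware;

public class RankingComponent extends SearchComponent implements SolrCoreAware {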
   private static final String RANK = "rank";
   private static final String RANK_US = "US";
   private static final String RANK_UK = "UK";
   private static final String RANK_DEFAULT = "DEFAULT";
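   // Region name (DEFAULT, US, UK) -> parameters configured for that region
   private final Map<String, NamedList> initParamMap = new HashMap<String, NamedList>();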

   @Override
   public void prepare(ResponseBuilder rb) throws IOException {

      SolrQueryRequest req = rb.req;
      SolrParams params = req.getParams();
      ModifiableSolrParams modparams = new ModifiableSolrParams(params);

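      // A missing "rank" parameter falls back to the default region instead of throwing a NullPointerException.
      String region = params.get(RANK, RANK_DEFAULT).toUpperCase();
      if (region.equals(RANK_US))
      {
         updateParams(modparams, RANK_US);
      }
      else if (region.equals(RANK_UK))
      {
         updateParams(modparams, RANK_UK);
      }
      else
      {
         updateParams(modparams, RANK_DEFAULT);
      }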
      req.setParams(modparams);
   }

   @Override
   public void init(NamedList args) {
      super.init(args);
      NamedList rankList = (NamedList) args.get(RANK);
      for (int i = 0; i < rankList.size(); i++) {
         initParamMap.put(rankList.getName(i), (NamedList) rankList.getVal(i));
      }
   }

   private void updateParams(ModifiableSolrParams modparams, String rankid) {
      NamedList rankParams = initParamMap.get(rankid);
      int rankLength = rankParams.size();
      String name;
      Object val;
      for (int i = 0; i < rankLength; i++) {
         name = rankParams.getName(i); // parameter name, 'bf' in our case
         val = rankParams.getVal(i);   // parameter value, the 'bf' boost function in our case
         if (val != null) {
            modparams.set(name, val.toString());
         }
      }
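   }

   @Override
   public void process(ResponseBuilder rb) throws IOException {
      // Nothing to do here: the boost parameters are applied in prepare().
   }

   @Override
   public void inform(org.apache.solr.core.SolrCore core) {
      // Required by SolrCoreAware; this component needs no core-level setup.
   }

   // SearchComponent also declares abstract description methods (such as getDescription(),
   // getSource() and getVersion(), depending on the Solr version) that must be implemented
   // before the class compiles.
}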

Now we have ranking support in our search index. Try it out by changing the parameter value to see the difference it makes. Depending on your score distribution, you might have to adjust your boost factors or rank values.
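If you want to check the region selection without a running SOLR instance, the lookup can be exercised on its own. The sketch below is a plain-Java stand-in, not the component itself: the class RegionBoostSketch and the method resolveBf are hypothetical names, and the map simply hard-codes the bf values from the config above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-alone imitation of the region-to-bf lookup (illustrative only).
class RegionBoostSketch {
    static final String DEFAULT_REGION = "DEFAULT";
    static final Map<String, String> BF_BY_REGION = new HashMap<String, String>();

    static {
        // Same values as the <lst name="rank"> section of the searchComponent config
        BF_BY_REGION.put("DEFAULT", "hostRankdefault^10");
        BF_BY_REGION.put("US", "hostRankus^5");
        BF_BY_REGION.put("UK", "hostRankuk^5");
    }

    // Returns the bf value for the given rank parameter, falling back to the
    // default region when the parameter is missing or not configured.
    static String resolveBf(String rankParam) {
        String region = rankParam == null
                ? DEFAULT_REGION
                : rankParam.toUpperCase(Locale.ROOT);
        String bf = BF_BY_REGION.get(region);
        return bf != null ? bf : BF_BY_REGION.get(DEFAULT_REGION);
    }

    public static void main(String[] args) {
        System.out.println(resolveBf("us")); // hostRankus^5
        System.out.println(resolveBf("UK")); // hostRankuk^5
        System.out.println(resolveBf(null)); // hostRankdefault^10
        System.out.println(resolveBf("fr")); // hostRankdefault^10
    }
}
```

This mirrors what a request with rank=us, rank=UK, no rank parameter, or an unknown region should end up with as its bf.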