Wednesday, March 13, 2013

Minimum Match per index field: SOLR Ranking and Relevance improvement


With the SOLR minimum match parameter (mm), the constraint is applied collectively across all the fields used in matching (qf). So, if the query has two keywords and each keyword is found in a different matching field, the document is deemed matched and relevant.
For example,
qf = title description keyword
mm=2<75%
q=adopt a pet dog, where the matching keywords are “adopt”, “pet” and “dog”.

This could match a document with title – “Adopting Animals”, whose description talks about all the pet animals and whose keyword field has a list of animals including dog. It could equally match a document with title – “How to adopt a dog” with the page describing just that. But the second document might be ranked lower than the first due to document size and keyword count in the description, even though it is more relevant to the query here.

Matching tokens in the description field can also dilute ranking relevance, yet the document might get ranked higher because of tf-idf. We can weight the title field higher than the description, e.g. qf = title^5 description^2 keyword, which addresses the issue to some extent.
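As an illustration, such a weighted request could look like this (a sketch against a standard /select edismax handler, left un-encoded for readability)-

q=adopt a pet dog&defType=edismax&qf=title^5 description^2 keyword&mm=2<75%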

Here we will talk about setting a different minimum match criterion (mm) for each index field, to further restrict the matching and not let matching keywords found across different index fields dilute the relevancy. This solution can help improve document relevancy by 12%-20% (per a simple result text similarity score generator).

Configure the new params in SolrConfig.xml to set up the per-index-field mm values for the processor class, com.test.solr.qparser.MinimumMatchFieldQueryProcessor. This is only an example; the format of the per-field mm parameter depends on your implementation (note the &lt; escaping required inside XML)-

<str name="minmatch.mm">title_mm=3&lt;75%||description_mm=3&lt;75%||keyword_mm=3&lt;75%</str>
<str name="minmatch.op">AND</str>
Since the minimum match (mm) parameter is processed and set in the QParser class, we will set the minimum match criteria per field and update the parameters in this class.
Here is the QueryProcessor interface to extend from-

public interface QueryProcessor
{
     // read and validate the custom request params
     void preprocess(QParser qPlugin);
     // build the (possibly rewritten) query
     Query process(QParser qPlugin, Query prevQuery) throws ParseException;
}

The MinimumMatchFieldQueryProcessor implementation-


public class MinimumMatchFieldQueryProcessor implements QueryProcessor {

    private Map<String, String> minMatchFieldsMap = null;
    private String mmOP;

    @Override
    public void preprocess(QParser qPlugin) {
        // e.g. minmatch.mm=title_mm=3<75%||description_mm=3<75%||keyword_mm=3<75%
        String fieldsToMatch = qPlugin.getParams().get("minmatch.mm");
        mmOP = qPlugin.getParams().get("minmatch.op", "AND");
        minMatchFieldsMap = new HashMap<String, String>();
        String[] fields = fieldsToMatch.split("\\|\\|");
        for (String field : fields) {
            int indx = field.indexOf("=");
            if (indx != -1) {
                // strip the _mm suffix to get the index field name
                minMatchFieldsMap.put(field.substring(0, indx).replace("_mm", ""),
                        field.substring(indx + 1));
            }
        }
    }

    @Override
    public Query process(QParser qPlugin, Query prevQuery) throws ParseException {
        String queryString = CommonUtils.extractPureQuery(qPlugin.getString());
        if (StringUtils.isBlank(queryString)) return prevQuery;

        // coord scoring disabled for the wrapping boolean query
        BooleanQuery bq = new BooleanQuery(true);
        for (Map.Entry<String, String> entry : minMatchFieldsMap.entrySet()) {
            // one edismax sub-query per field, each with its own mm value
            String subQueryString = String.format("_query_:\"{!edismax qf=%s mm=%s}%s\"",
                    entry.getKey(), entry.getValue(), queryString);
            Query minMatchQuery = qPlugin.subQuery(subQueryString, "lucene").getQuery();
            if ("AND".equalsIgnoreCase(mmOP)) {
                bq.add(minMatchQuery, Occur.MUST);
            } else {
                bq.add(minMatchQuery, Occur.SHOULD);
            }
        }
        return bq;
    }
}
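With this in place, a request could look like the following (a sketch; it assumes the processor has been wired into a QParserPlugin registered as "minmatch", which the next post covers, and the URL is left un-encoded for readability)-

q=adopt a pet dog&defType=minmatch&minmatch.mm=title_mm=3<75%||description_mm=3<75%||keyword_mm=3<75%&minmatch.op=AND

Each field then gets its own edismax sub-query with its own mm value, and with minmatch.op=AND a document must satisfy every per-field constraint to match.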

In the next blog I will talk about how to add a customized QueryProcessor.

Monday, February 18, 2013

SOLR: Unordered exact match - Restrict matching based on token count


If your use case demands strict matching, here is an example of how you can restrict matching based on token count. In the example below, we narrow the search to documents whose token count is within +/- one of the query's keyword count. You can certainly change the range parameter to span +/- any count. You can also tune the matching by adding filters in the field analyzer: stop word filters, duplicate removal, etc.

Setting Token Count field

First we will add the token count field in our Schema to hold the count of tokens for the field “title”.  
<field name="titleToken" type="int" indexed="true" stored="true" />
<field name="title" type="text" indexed="true" stored="true" />


Next we extend the SearchComponent class to restrict matching using the titleToken field, counting tokens as they come out after the analyzer settings take effect, in this example the analyzer settings for fieldType="text".

Extend SearchComponent

Here we will extend the SearchComponent to read the field(s) on which we want to restrict the matching based on token count, title for example. The analyzer setup is read in the inform() method to pick up the settings defined for the title field in Schema.xml.

public class QueryTokenComponent extends SearchComponent implements SolrCoreAware {

    private String fieldName = "title";
    private Analyzer analyzer;

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    @Override
    public void inform(SolrCore core) {
        // pick up the analyzer configured in Schema.xml
        analyzer = core.getSchema().getAnalyzer();
    }

    // prepare() is overridden below
}

Next we override the prepare() method in the above class to add the token range in the filter query and update the ModifiableSolrParams with the new filter query on the token range.

@Override
public void prepare(ResponseBuilder rb) throws IOException {
   SolrQueryRequest req = rb.req;
   SolrParams params = req.getParams();

   ModifiableSolrParams modparams = new ModifiableSolrParams(params);
   String queryString = modparams.get(CommonParams.Q);
   // count the tokens the query produces under the "title" field analyzer
   int tokenCnt = AnalyzerUtils.getTokens(analyzer, fieldName, queryString);
   // restrict matches to documents whose title has tokenCnt +/- 1 tokens
   modparams.add(CommonParams.FQ, "titleToken:[" + (tokenCnt - 1) + " TO " + (tokenCnt + 1) + "]");
   req.setParams(modparams);
}
And here is how the getTokens method looks-

public int getTokens(Analyzer analyzer, String field, String query) throws IOException {
    TokenStream tokenStream = analyzer.tokenStream(field, new StringReader(query));
    CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    List<String> tokens = new ArrayList<String>();
    tokenStream.reset();
    while (tokenStream.incrementToken())
    {
        tokens.add(termAttribute.toString());
    }
    tokenStream.end();
    tokenStream.close();

    return tokens.size();
}
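For the component to take effect, it also has to be registered in SolrConfig.xml and added to the handler's components list; a minimal sketch (the package name is illustrative)-

<searchComponent name="queryToken" class="com.test.solr.component.QueryTokenComponent"/>

<arr name="components">
   <str>queryToken</str>
   <str>query</str>
</arr>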

Debug

If you’d like to see the token count or the tokens that come into play, add the field in the schema.xml and update its value in an UpdateRequestProcessor extension.

class TokenCountProcessHandler extends UpdateRequestProcessor
{
    private Analyzer analyzer;

    public TokenCountProcessHandler(SolrQueryRequest req,
                                    SolrQueryResponse rsp,
                                    UpdateRequestProcessor next)
    {
        super(next);
        analyzer = req.getSchema().getAnalyzer();
    }

    @Override
    public void processAdd(final AddUpdateCommand cmd) throws IOException
    {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue("title");
        if (v != null)
        {
            String title = v.toString();
            // getTokens (shown above) returns the token count directly
            doc.addField("wrd_cnt", getTokens(analyzer, "title", title));
        }
        cmd.solrDoc = doc;

        // pass it up the chain
        super.processAdd(cmd);
    }
}
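Since SOLR instantiates update processors through a factory, the handler above would be returned from an UpdateRequestProcessorFactory and wired into an update chain in SolrConfig.xml; a minimal sketch (the factory class name is illustrative)-

<updateRequestProcessorChain name="tokenCount">
   <processor class="com.test.solr.update.TokenCountProcessorFactory"/>
   <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>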


Thursday, February 16, 2012

SOLR: Improve relevancy by boosting exact and phrase match

Once we have the index ready for searching, the next implicit step is to improve the relevancy of the search index. SOLR of course provides ways to tune search relevancy, but one very obvious way to improve relevancy almost always gets ignored: boosting exact and phrase matches over plain query matching can improve relevancy by a significant factor.

Exact Match Setup


To set up a field for exact matching, add another field in Schema.xml and copy the content into it using copyField-
<field name="title" type="text" indexed="true" stored="true" />
<field name="titleExact" type="textExact" indexed="true" stored="true" />
<copyField source="title" dest="titleExact"/>


You will notice that the data type for titleExact is set to "textExact" (defined below). A similar exact-match effect can be achieved by setting the datatype to "string", but by adding our own datatype we can fine tune further with an appropriate tokenizer and filters.
<fieldType name="textExact" class="solr.TextField" positionIncrementGap="100" >
   <analyzer type="index">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
   <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
   <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

Here I have used WhitespaceTokenizer without stopword or stemming filters. I am using LimitTokenCountFilter to limit the number of tokens and LowerCaseFilter to make the matching case-insensitive. We can fine tune the textExact dataType further to make the exact match more lenient or strict per our use case.

Putting it All Together


Now to boost the exact match field and phrase matching, in the SolrConfig.xml -
<str name="qf">title titleExact^10</str>
<str name="pf">title^10 titleExact^100</str>

Now, for both query and phrase matching, we boost the exact matching field "titleExact" higher than the non-exact matching field "title"; the same fields are also boosted higher for phrase search (pf) compared to query or keyword search (qf). This is a simple first step to improving relevancy.
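In context, these settings sit in the handler defaults; a sketch, assuming an edismax handler named /select-

<requestHandler name="/select" class="solr.SearchHandler">
   <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">title titleExact^10</str>
      <str name="pf">title^10 titleExact^100</str>
   </lst>
</requestHandler>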

Saturday, February 11, 2012

Adding Ranking Support using SOLR SearchComponent

While adding region-based ranking support to one of our search indexes, the SOLR SearchComponent hook came in pretty handy and was quick to implement. I have oversimplified the use case below to outline the steps for adding external ranking support to a SOLR search index, but feel free to drop a comment /email if you need more info.

Define the SearchComponent

In the SolrConfig.xml, define the new SearchComponent as

<SearchComponent name="rank" class="com.solr.searchindex.component.RankingComponent">
   <lst name="rank">
      <lst name="DEFAULT">
         <str name="bf">hostRankdefault^10<str>
      <lst>
      <lst name="US">
         <str name="bf">hostRankus^5<str>
      </lst>
      <lst name="UK">
         <str name="bf">hostRankuk^5</str>
      </lst>
   </lst>
</searchComponent>

Here I have two regions, US and UK, based on which I would like the documents to rank per the host ranking defined in an external file. Boosting documents based on rankings in an external file gives us the flexibility to tune the ranks anytime, or even add more regions, without regenerating the index, which is a huge gain if you have a large index.

Register the new SearchComponent in the array of components list. Note: Order of registering the components matters.

<arr name="components">
   <str>rank</str>
   <str>query</str>
   <str>highlight</str>
   <str>debug</str>
</arr>

Now we need to define the 3 fields used for boosting in the SearchComponent in the Schema.xml.

Define ExternalFileFields


First we define the new ExternalFileField as a fieldType in Schema.xml with keyField referring to the field ‘site’ which stores the host /domain name.

<fieldType name="hostRankExt" keyField="site" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
<field name="site"type="String" indexed=”true” stored=”true”/>

Here I have defined the value type 'valType' for this field as float.

Now we define the boost fields which will refer to this ExternalFileField.
<field name="hostRankdefault" type="hostRankExt"/>
<field name="hostRankus" type="hostRankExt"/>
<field name="hostRankuk" type="hostRankExt"/>

Define External Files


The next step is to add three host files with ranks, to be referred to by these three boost fields. The file name should be of the format external_<fieldname>, and the file placed in the index directory to be picked up by SOLR. A few things to note: if an external file has already been loaded and is then updated, the changes become visible only after a commit, and it is suggested to keep the external file sorted on the key.

external_hostRankdefault
uk.yahoo.com=0.5
www.yahoo.com=0.5

external_hostRankus
uk.yahoo.com=0.5
www.yahoo.com=1.0

external_hostRankuk
uk.yahoo.com=1.0
www.yahoo.com=0.5


Extend the SearchComponent to add the ranking support


Here is a quick code sample to add ranking support based on the region passed in the query URL.
public class RankingComponent extends SearchComponent implements SolrCoreAware {

   private static final String RANK = "rank";
   private static final String RANK_US = "US";
   private static final String RANK_UK = "UK";
   private static final String RANK_DEFAULT = "DEFAULT";
   private Map<String, NamedList> initParamMap = new HashMap<String, NamedList>();

   @Override
   public void prepare(ResponseBuilder rb) throws IOException {
      SolrQueryRequest req = rb.req;
      SolrParams params = req.getParams();
      ModifiableSolrParams modparams = new ModifiableSolrParams(params);

      // fall back to DEFAULT when no (or an unknown) rank param is passed
      String rank = params.get(RANK, RANK_DEFAULT).toUpperCase();
      if (RANK_US.equals(rank))
      {
         updateParams(modparams, RANK_US);
      }
      else if (RANK_UK.equals(rank))
      {
         updateParams(modparams, RANK_UK);
      }
      else
      {
         updateParams(modparams, RANK_DEFAULT);
      }
      req.setParams(modparams);
   }

   @Override
   public void init(NamedList args) {
      super.init(args);
      // read the per-region boost settings from the component config
      NamedList rankList = (NamedList) args.get(RANK);
      for (int i = 0; i < rankList.size(); i++) {
         initParamMap.put(rankList.getName(i), (NamedList) rankList.getVal(i));
      }
   }

   private void updateParams(ModifiableSolrParams modparams, String rankid) {
      NamedList rankParams = initParamMap.get(rankid);
      for (int i = 0; i < rankParams.size(); i++) {
         String name = rankParams.getName(i); // parameter name, 'bf' in our case
         Object val = rankParams.get(name);   // parameter value, the boost field
         if (val != null) {
            modparams.set(name, val.toString());
         }
      }
   }
}
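For example (host and core path illustrative)-

http://localhost:8983/solr/select?q=yahoo&rank=US

would apply bf=hostRankus^5, while omitting the rank parameter (or passing an unknown region) falls back to the DEFAULT boost.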

Now we have the ranking support in our search index. Try it out by changing the parameter value to see the difference it makes. Based on your score distribution you might have to adjust your boost factors or rank values.

Thursday, May 26, 2011

JMeter setup for QPS evaluation

JMeter is a feature-rich tool for load testing and analyzing your system. Here I plan to share the steps for setting up a simple JMeter test to evaluate throughput and determine the QPS (queries per second) the system can support.

The rule of thumb for determining QPS is to keep increasing the requests per second to the system until you find the saturation point, where throughput drops dramatically and response time climbs. In any load test you will see that response time stays roughly constant (with minor fluctuation) and throughput increases linearly as the load, or number of requests to the server, grows. As the server reaches its saturation point, throughput stops increasing and instead dips sharply while response time shoots up. The throughput your server handled just before reaching this saturation point is the maximum throughput it can handle, in other words the QPS your server can support.

To set up this test we will start with downloading and setting up JMeter.

Once you have JMeter set up, you can start it in your preferred way; I start it in UI mode. To create a simple test for our purpose, here are the steps I followed.
Setup test plan

Step 1: Add a Thread Group to your JMeter test plan.

Right Click on Test Plan and Add -> Threads (Users) -> Thread Group

Step 2: Next we will add an HTTP Request Sampler

Right click on Thread Group and Add -> Sampler -> HTTP Request


Notice the path has a q parameter with value substitution, as we will fill in unique values from a file to send unique URL requests to the server.

Step 3: Now to see the results in Graph, we need to add Graph Results Listener

Right Click on HTTP Request and Add -> Listener -> Graph Results


Pass Unique Values

Step 4: To send different parameter values with each user request sent to the server, we can add "CSV Data Set Config" Config Element.

Right Click on HTTP Request and Add -> Config Element -> CSV Data Set Config

The MD5.csv file has unique ids, one per line. If you have multiple parameters, you can add them comma-separated in Variable Names.
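For instance, if the Variable Names field is set to QUERY, the HTTP Request path could look like this (endpoint illustrative)-

/solr/select?q=${QUERY}

JMeter substitutes a fresh line from MD5.csv into ${QUERY} for each request.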
Monitor results /response

Step 5: To see the summarized result

Right Click on HTTP Request and Add -> Listener -> Summary Report

Now we can start generating load on the server from the Thread Group section in the left panel. To show the effect of throughput dropping as load increases, in this example I am increasing the number of users (threads).

From the Thread Group tab, I will increase the concurrent users, starting with 10, with the Ramp-up Period set to zero.
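As a side note, once the plan is saved you can also run it headless, which avoids UI overhead during heavy load generation (file names illustrative)-

jmeter -n -t qps-test.jmx -l results.jtl

Here -n runs JMeter in non-GUI mode, -t points to the test plan and -l writes the results log.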

Here is my summarized report of the experiment-

# of Users (concurrent) | Avg. resp. time (ms) | Throughput (qps) | KB/sec
10                      | 193                  | 33.44481605      | 101.2881
50                      | 280                  | 42.08754209      | 153.5357
75                      | 225                  | 78.45188285      | 234.3444
100                     | 209                  | 105.9322034      | 306.0768
150                     | 314                  | 83.01051467      | 246.4645
200                     | 255                  | 198.764146       | 317.7128
250                     | 266                  | 241.070028       | 521.4105
275                     | 323                  | 251.8878357      | 773.7123
300                     | 619                  | 88.13160987      | 260.3781

This shows that the QPS of my server is around 250-275. As we increase the load to 300, we see a spike in the average response time and a dip in the throughput, indicating that the server has reached its saturation level. The summarized report also gives you the average response time your server can sustain. This information is crucial when designing a system.

This experiment can be varied in different ways to introduce other factors more applicable to the system you are testing, e.g., adding load in steps or with delays, or creating a set of user behaviors and running them in loops.


Monday, March 29, 2010

Unix - find and remove files

Since the rm command does not support searching for files, we need to use the find and rm commands in combination. While searching for and removing files today, I thought I would share some of these combinations to make your searches easier.

find

Search for files whose names match a certain pattern, e.g. '&'

find . -name "*&*"

Here we are looking in the current directory for all files which have '&' in their name. In case you need to exclude files matching another pattern, e.g. '_':

find . -name "*&*" -and -not -name "*_*"

If the directory we are searching also has subdirectories which you would like to exclude from the search, use the -type switch:

find . -type f -name "*&*" -and -not -name "*_*"

find and rm

Now that we have found the list of files we would like to remove from the directory, we need to pass this list to the 'rm' (remove) command. Since rm does not read file names from standard input, there are two common ways to do this.

find . -name "FILE_SEARCH_PATTERN"-exec rm -i {} \;

If you don't want a confirmation before removing each file, replace the rm switch -i with -f.

Or

find . -name "FILE_SEARCH_PATTERN" | xargs rm

Where 'xargs' builds argument lists for a UNIX command from standard input and executes it. UNIX shells place a limit on the number of arguments allowed on a command line; 'xargs' helps by bundling the arguments into smaller groups and executing the rm command for each group separately.
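One caveat: the plain pipe form can break on file names containing spaces or newlines. With GNU find and xargs, null-delimited output handles this safely-

find . -type f -name "FILE_SEARCH_PATTERN" -print0 | xargs -0 rm -f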