Monday, February 18, 2013

SOLR: Unordered exact match - Restrict matching based on token count


If your use case demands strict matching, here is an example of how you can restrict matching based on token count. In the example below, we narrow the search to documents whose token count is within +/- one of the number of keywords in the query. You can change the range to span +/- any count. You can also tune the matching by adding filters to the field analyzer, for example stop word filters or duplicate removal.

Setting Token Count field

First we add a token count field to the schema to hold the number of tokens in the field "title".
<field name="titleToken" type="int" indexed="true" stored="true" />
<field name="title" type="text" indexed="true" stored="true" />


Next we extend the SearchComponent class to work with the titleToken field, which holds the count of tokens in the field "title" after the analyzer settings take effect (in this example, the analyzer settings for fieldType="text").

Extend SearchComponent

Here we extend SearchComponent to read the field (or fields) on which we want to restrict matching based on token count, "title" in this example. We read the analyzer setup in the inform() method so that the settings for the title field in schema.xml are applied.

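public class QueryTokenComponent extends SearchComponent implements SolrCoreAware {
    // Field whose analyzer is used to count the query tokens
    private String fieldName = "title";
    private Analyzer analyzer;

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    // Called once the core is loaded; picks up the analyzers defined in schema.xml
    public void inform(SolrCore core) {
        analyzer = core.getSchema().getAnalyzer();
    }

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // filled in below
    }

    // process() and the remaining required overrides are omitted here
}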

Next we override the prepare() method of the class above to build a filter query on the token range and set the updated ModifiableSolrParams back on the request.

@Override
public void prepare(ResponseBuilder rb) throws IOException {
   SolrQueryRequest req = rb.req;
   SolrParams params = req.getParams();
       
   ModifiableSolrParams modparams = new ModifiableSolrParams(params);
   String queryString = modparams.get(CommonParams.Q);
   int tokenCnt = AnalyzerUtils.getTokens(analyzer, fieldName, queryString);
   modparams.add(CommonParams.FQ, "titleToken:[ " + (tokenCnt - 1) + " TO " + (tokenCnt +1) +"]");
   req.setParams(modparams);
}
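And here is what the getTokens method (in the AnalyzerUtils helper class) looks like:

public static int getTokens(Analyzer analyzer, String field, String query) throws IOException {
    TokenStream tokenStream = analyzer.tokenStream(field, new StringReader(query));
    CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    List<String> tokens = new ArrayList<String>();
    try {
        // Lucene 4.x and later expect reset() before the first incrementToken()
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            tokens.add(termAttribute.toString());
        }
        tokenStream.end();
    } finally {
        tokenStream.close();
    }

    // The count is what we filter on; the list itself is handy when debugging
    return tokens.size();
}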
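The range filter built in prepare() can be tried out without a running Solr instance. Below is a minimal sketch under stated assumptions: the class and method names are hypothetical, and countTokens is only a stand-in for the analyzer (it lowercases and splits on whitespace), so a real analyzer with stop words or stemming may give a different count.

```java
import java.util.Locale;

class TokenRangeFilterSketch {

    // Stand-in for the Solr analyzer (assumption): lowercase, then split on whitespace.
    static int countTokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Builds a filter such as titleToken:[2 TO 4]; the lower bound is clamped at 0.
    static String rangeFilter(String countField, int tokenCount, int slack) {
        int low = Math.max(0, tokenCount - slack);
        int high = tokenCount + slack;
        return countField + ":[" + low + " TO " + high + "]";
    }

    public static void main(String[] args) {
        int n = countTokens("Apache Solr search");
        System.out.println(rangeFilter("titleToken", n, 1));
    }
}
```

Passing a larger slack value widens the range, which is the "+/- any count" tuning mentioned at the top of this post.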

Debug

If you would like to see the token count, or the tokens that come into play, add a field for it in schema.xml (wrd_cnt below) and populate it in an UpdateRequestProcessor extension.

class TokenCountProcessHandler extends UpdateRequestProcessor
{
    private Analyzer analyzer;

    public TokenCountProcessHandler(SolrQueryRequest req,
                                    SolrQueryResponse rsp,
                                    UpdateRequestProcessor next)
    {
        super(next);
        analyzer = req.getSchema().getAnalyzer();
    }

    @Override
    public void processAdd(final AddUpdateCommand cmd) throws IOException
    {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue("title");
        if (v != null)
        {
            String title = v.toString();
            // getTokens returns the number of tokens in the analyzed title
            doc.addField("wrd_cnt", AnalyzerUtils.getTokens(analyzer, "title", title));
        }
        cmd.solrDoc = doc;

        // pass it up the chain
        super.processAdd(cmd);
    }
}


Thursday, February 16, 2012

SOLR: Improve relevancy by boosting exact and phrase match

Once the index is ready for searching, the next step is to improve the relevancy of the search results. SOLR provides several ways to tune relevancy, but one obvious way is almost always ignored: boosting exact and phrase matches over plain query matches can improve relevancy significantly.

Exact Match Setup


To set up a field (or fields) for exact matching, add another field in schema.xml and copy the content into it using copyField:
<field name="title" type="text" indexed="true" stored="true" />
<field name="titleExact" type="textExact" indexed="true" stored="true" />
<copyField source="title" dest="titleExact"/>


Notice that the type of titleExact is set to "textExact" (defined below). A similar exact match effect can be achieved by setting the type to "string", but with our own type we can fine tune the matching by adding an appropriate tokenizer and filters.
<fieldType name="textExact" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>

Here I have used WhitespaceTokenizer without stop word or stemming filters. I am using LimitTokenCountFilter to limit the number of tokens and LowerCaseFilter to make the matching case-insensitive. We can further tune the textExact type to make the exact match more lenient or more strict, depending on the use case.

Putting it All Together


Now to boost the exact match field and phrase matching, in solrconfig.xml:
<str name="qf">title titleExact^10</str>
<str name="pf">title^10 titleExact^100</str>

Now, for both query and phrase matching, a match on the exact matching field "titleExact" is boosted higher than a match on the non-exact field "title". The same fields are also boosted higher for phrase search (pf) compared to query or keyword search (qf). This is a simple first step towards improving relevancy.

Saturday, February 11, 2012

Adding Ranking Support using SOLR SearchComponent

While working on adding region-based ranking support to one of our search indexes, the SOLR SearchComponent hook came in handy and was quick to implement. I have oversimplified the use case below to outline the steps for adding external ranking support to a SOLR search index, but feel free to drop a comment or email if you need more info.

Define the SearchComponent

In solrconfig.xml, define the new SearchComponent as

<searchComponent name="rank" class="com.solr.searchindex.component.RankingComponent">
   <lst name="rank">
      <lst name="DEFAULT">
         <str name="bf">hostRankdefault^10</str>
      </lst>
      <lst name="US">
         <str name="bf">hostRankus^5</str>
      </lst>
      <lst name="UK">
         <str name="bf">hostRankuk^5</str>
      </lst>
   </lst>
</searchComponent>

Here I have two regions, US and UK, and I would like the documents to rank according to the per-region host ranking I have defined in an external file. Boosting documents based on a ranking in an external file gives us the flexibility to tune the rank at any time, or even add more regions, without regenerating the index, which is a huge gain if your index is large.

Register the new SearchComponent in the components list. Note: the order in which components are registered matters.

<arr name="components">
   <str>rank</str>
   <str>query</str>
   <str>highlight</str>
   <str>debug</str>
</arr>

Now we need to define the 3 fields used for boosting in the SearchComponent in the Schema.xml.

Define ExternalFileFields


First we define the new ExternalFileField as a fieldType in Schema.xml with keyField referring to the field ‘site’ which stores the host /domain name.

<fieldType name="hostRankExt" keyField="site" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
<field name="site" type="string" indexed="true" stored="true"/>

Here I have defined the value type 'valType' for this field as float.

Now we define the boost fields which will refer to this ExternalFileField.
<field name="hostRankdefault" type="hostRankExt"/>
<field name="hostRankus" type="hostRankExt"/>
<field name="hostRankuk" type="hostRankExt"/>

Define External Files


The next step is to add three host files with the ranks to be referred to by these three boost fields. Each file name should have the format external_<fieldname>, and the file should be placed in the index directory to be picked up by SOLR. A few things to note: if an external file has already been loaded and is then updated, the changes become visible only after a commit, and it is suggested to keep the external file sorted on the key.

external_hostRankdefault

uk.yahoo.com=0.5
www.yahoo.com=0.5

external_hostRankus

uk.yahoo.com=0.5
www.yahoo.com=1.0

external_hostRankuk

uk.yahoo.com=1.0
www.yahoo.com=0.5


Extend the SearchComponent to add the ranking support


Here is a quick code sample that adds ranking support based on the region passed in the query URL.
public class RankingComponent extends SearchComponent implements SolrCoreAware {
   private static final String RANK = "rank";
   private static final String RANK_US = "US";
   private static final String RANK_UK = "UK";
   private static final String RANK_DEFAULT = "DEFAULT";
   private Map<String, NamedList> initParamMap = new HashMap<String, NamedList>();

   @Override
   public void prepare(ResponseBuilder rb) throws IOException {
      SolrQueryRequest req = rb.req;
      SolrParams params = req.getParams();
      ModifiableSolrParams modparams = new ModifiableSolrParams(params);

      // Fall back to DEFAULT when the rank parameter is missing or unknown
      String rank = params.get(RANK, RANK_DEFAULT).toUpperCase();
      if (rank.equals(RANK_US)) {
         updateParams(modparams, RANK_US);
      } else if (rank.equals(RANK_UK)) {
         updateParams(modparams, RANK_UK);
      } else {
         updateParams(modparams, RANK_DEFAULT);
      }
      req.setParams(modparams);
   }

   @Override
   public void init(NamedList args) {
      super.init(args);
      // Read the per-region parameter lists defined in solrconfig.xml
      NamedList rankList = (NamedList) args.get(RANK);
      for (int i = 0; i < rankList.size(); i++) {
         initParamMap.put(rankList.getName(i), (NamedList) rankList.getVal(i));
      }
   }

   // Required by SolrCoreAware; nothing to do here
   public void inform(SolrCore core) {
   }

   private void updateParams(ModifiableSolrParams modparams, String rankid) {
      NamedList rankParams = initParamMap.get(rankid);
      if (rankParams == null) {
         return;
      }
      for (int i = 0; i < rankParams.size(); i++) {
         String name = rankParams.getName(i); // parameter name, 'bf' in our case
         Object val = rankParams.getVal(i);   // parameter value, the 'bf' value in our case
         if (val != null) {
            modparams.set(name, val.toString());
         }
      }
   }

   // process() and the remaining required overrides are omitted here
}
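The region lookup itself is easy to try outside Solr. The sketch below is only an illustration with hypothetical names (RegionBoostLookup, boostFor); it hardcodes the boost strings from the configuration above instead of reading them from solrconfig.xml, and it falls back to DEFAULT for a missing or unknown region.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class RegionBoostLookup {
    private static final String DEFAULT_REGION = "DEFAULT";
    private static final Map<String, String> BOOST_BY_REGION = new HashMap<>();

    static {
        // Same values as the <lst name="rank"> configuration (hardcoded here for the sketch)
        BOOST_BY_REGION.put(DEFAULT_REGION, "hostRankdefault^10");
        BOOST_BY_REGION.put("US", "hostRankus^5");
        BOOST_BY_REGION.put("UK", "hostRankuk^5");
    }

    // Returns the bf value for a region, or the DEFAULT one if the region is null or unknown
    static String boostFor(String region) {
        if (region == null) {
            return BOOST_BY_REGION.get(DEFAULT_REGION);
        }
        String key = region.trim().toUpperCase(Locale.ROOT);
        return BOOST_BY_REGION.getOrDefault(key, BOOST_BY_REGION.get(DEFAULT_REGION));
    }

    public static void main(String[] args) {
        System.out.println(boostFor("uk"));
        System.out.println(boostFor(null));
    }
}
```

Keeping the fallback in one place means a request without the rank parameter still gets a boost function instead of failing.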

Now we have ranking support in our search index. Try changing the rank parameter value to see the difference it makes. Depending on your score distribution, you might have to adjust your boost factors or rank values.

Thursday, May 26, 2011

JMeter setup for QPS evaluation

JMeter is a feature-rich tool to load test and analyze your system. Here I share the steps for setting up a simple JMeter test to evaluate the throughput and determine the QPS (queries per second) the system can support.

The rule of thumb for determining the QPS is to keep increasing the requests per second sent to the system until you find the saturation point, where throughput drops dramatically while response time increases. In a typical load test, response time stays roughly constant (with minor fluctuations) and throughput increases linearly as the load on the server increases. Once the server reaches its saturation point, throughput stops increasing and dips sharply, and response time shoots up. The throughput your server handled just before reaching this saturation point is the maximum throughput it can handle, in other words the QPS your server can support.

To set up this test, we start by downloading and setting up JMeter.

Once you have JMeter set up, you can start it in your preferred way; I start it in UI mode. To create a simple test for our purpose, here are the steps I followed.

Setup test plan

Step 1: Add a Thread Group to your JMeter test plan.

Right Click on Test Plan and Add -> Threads (Users) -> Thread Group

Step 2: Next we will add an HTTP Request Sampler

Right click on Thread Group and Add -> Sampler -> HTTP Request


Notice the path has a q parameter with the value substitution, as we will fill in unique values from a file to pass unique URL requests to the server.

Step 3: To see the results as a graph, we need to add a Graph Results listener

Right Click on HTTP Request and Add -> Listener -> Graph Results


Pass Unique Values

Step 4: To send different parameter values with each user request sent to the server, we can add "CSV Data Set Config" Config Element.

Right Click on HTTP Request and Add -> Config Element -> CSV Data Set Config

The MD5.csv file has one unique id per line. If you have multiple parameters, you can list them comma-separated in Variable Names.

Monitor results /response

Step 5: To see the summarized result

Right Click on HTTP Request and Add -> Listener -> Summary Report

Now we can start generating load for the server by going to the Thread Group section in the left panel. To show the drop in throughput as the load increases, in this example I increase the number of users (threads).

In the Thread Group tab, I increase the number of concurrent users, starting with 10, with the ramp-up period set to zero.

Here is my summarized report of the experiment:

# of Users     Avg. resp. time   Throughput     KB/sec
(concurrent)                     (qps)
10             193               33.44481605    101.2881
50             280               42.08754209    153.5357
75             225               78.45188285    234.3444
100            209               105.9322034    306.0768
150            314               83.01051467    246.4645
200            255               198.7641463    17.7128
250            266               241.0700285    21.4105
275            323               251.8878357    773.7123
300            619               88.13160987    260.3781

This shows that the QPS of my server is around 250-275. As we increase the load to 300 users, we can see a spike in the average response time and a dip in the throughput, which indicates that the server has reached its saturation level. The summarized report also gives the average response time the server delivers at each load level. This information is crucial when designing a system.
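Picking the peak out of such a report can also be done in a few lines of code. The following is a rough sketch, not part of the JMeter setup: the class and method names are made up, and the throughput values are the ones from my report rounded to two decimals. It takes the highest throughput over all load levels, so a noisy dip like the one at 150 users does not end the search early.

```java
class SaturationPoint {

    // Index of the load level with the highest throughput
    static int peakIndex(double[] throughput) {
        int best = 0;
        for (int i = 1; i < throughput.length; i++) {
            if (throughput[i] > throughput[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] users = {10, 50, 75, 100, 150, 200, 250, 275, 300};
        double[] qps = {33.44, 42.09, 78.45, 105.93, 83.01, 198.76, 241.07, 251.89, 88.13};
        int i = peakIndex(qps);
        System.out.println("peak " + qps[i] + " qps at " + users[i] + " users");
    }
}
```

With these numbers the peak lands at 275 users, which matches reading the table by eye.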

This experiment can be varied in different ways to introduce other factors relevant to the use case of the system you are testing, e.g. add load in steps or with delays, or create a set of user behaviors and run it in loops.


 

Monday, March 29, 2010

Unix - find and remove files

Since the rm command does not support searching for files, we need to use the find and rm commands in combination. While searching for and removing files today, I thought I would share some of these combinations to make your search easy.

find

Search for files matching a certain pattern, e.g. '&'

find . -name "*&*"

Here we are looking in the current directory (and below) for all files that have '&' in their name. In case you need to exclude files matching a particular pattern, e.g. '_'

find . -name "*&*" -and -not -name "*_*"

If the directory we are searching in also has subdirectories whose names match and which you would like to exclude from the results, use the -type switch,

find . -type f -name "*&*" -and -not -name "*_*"

find and rm

Now that we have found the list of files we would like to remove from the directory, we need to pass this list to the 'rm' (remove) command. Since rm does not search for files itself, there are two ways we can do this.

find . -name "FILE_SEARCH_PATTERN" -exec rm -i {} \;

If you don't want the confirmation before removing each file, replace the rm switch -i with -f

Or

find . -name "FILE_SEARCH_PATTERN" | xargs rm

Here 'xargs' builds an argument list for a UNIX command from standard input and executes the command. UNIX shells restrict the number of arguments allowed on a command line; 'xargs' helps by bundling the arguments into smaller groups and executing the rm command once for each group. If file names may contain spaces, GNU find and xargs offer find . -name "FILE_SEARCH_PATTERN" -print0 | xargs -0 rm for that case.

Wednesday, March 10, 2010

Append rows to the SELECT query result

Something I stumbled over recently while working with a list of categories. Simply put, I had to sort the categories by the number of questions each has and append the category 'Other' at the end.

UNION ALL can help resolve this issue. All we need to do is:

SELECT category
FROM t_question
WHERE category <> 'other'
GROUP BY category

UNION all

SELECT DISTINCT category
FROM t_question
WHERE category = 'other'



which gives me the list of categories from the table and adds the category 'other' at the end of the list. But if you need to sort the list by question count (ORDER BY), as I did, and you are working on MS SQL Server 2005, ORDER BY on a column that is not in the select list does not work with UNION.

So I updated the above query to:

SELECT c.category
FROM (SELECT category, count(*) num
      FROM t_question
      WHERE category <> 'other'
      GROUP BY category

      UNION ALL

      -- 0 keeps 'other' below every real category (count >= 1)
      SELECT DISTINCT category, 0
      FROM t_question
      WHERE category = 'other') c
ORDER BY num DESC



In case you bump into the same issue, now you know how to tweak your query.

Wednesday, April 1, 2009

Perl script to send alert notification email

We usually have a bunch of automated tasks running on different schedules, and it is not feasible to keep monitoring them in person. At a minimum, we almost always need some kind of email alert to notify us when something suspicious happens in a scheduled task. The Perl script below addresses that minimum requirement. You can always extend it with further features, or call it from your Java (or any other) programs.

The script has one subroutine msgSender.

eval{
}
or do {
};

In the eval block we trap any exception thrown in the msgSender subroutine, and in the or do block we handle that exception by logging it.

First we need to open the file and read the list of recipient email addresses from it.

open (SENDTOLISTFILE, $sendToList ) or die("Could not open sender list file.");
while (my $line = <SENDTOLISTFILE>) {
    chomp($line);
    push(@sendTo,$line);
}
close(SENDTOLISTFILE);

chomp($line) removes the trailing \n from the line we read, and we push each line onto the array @sendTo. At the end we close the file.

Next we read the subject of the email and content to be sent as arguments. If you only have one task running, you can have your subject and message hardcoded.

my $messageSubject = $_[0];
my $messageContent = $_[1];

Now comes the actual part: setting up the SMTP server to send the email notification.

# Setup SMTP mail server to send the alert email
use Net::SMTP;
$smtp = Net::SMTP->new('SMTP_SERVER_INFO'); # connect to an SMTP server
$smtp->mail('alertNotify@YOUR_DOMAINNAME.com'); # use the sender's address here
for($count = @sendTo; $count>0 ; $count--)
{         
    $sendToVal = $sendTo[$count-1];
    $smtp->to($sendToVal);            # recipient's address
}
$smtp->data(); # Start writing the mail
        
# Send the header.    
$smtp->datasend("To: alertNotify\@YOUR_DOMAINNAME.com\n");
$smtp->datasend("From: alertNotify\@YOUR_DOMAINNAME.com\n");
$smtp->datasend("Subject: $messageSubject\n");
$smtp->datasend("\n");
        
# Send the body.
$smtp->datasend("This is a Broadcast Message. Please DO NOT reply to this email. It is not monitored\n");
$smtp->datasend("------------------------------------------------------------------------------------------\n");
$smtp->datasend("$messageContent\n");
$smtp->dataend(); # Finish sending the mail
$smtp->quit; # Close the SMTP connection

Update SMTP_SERVER_INFO with your SMTP server info and update the sender addresses with your sender email address.

Once the SMTP server info is set, we send the header and the body. Finally we close the SMTP connection with the quit statement.

At the end we log errors if any.

or do{
my $errorLogging = "C:/logs/alertNotify.err";
open(LOG,">$errorLogging") || die("Cannot Open File");
print LOG "alertNotify:$@";
close(LOG);
exit(-1);
};

The or do block is reached only if the eval block throws an exception, which is captured in the $@ variable.

Instead of the or do block, we can also combine the eval block with an if condition, as below, to capture and log errors.

eval{
...
};
if($@){
my $errorLogging = "C:/logs/alertNotify.err";
open(LOG,">$errorLogging") || die("Cannot Open File");
print LOG "alertNotify:$@";
close(LOG);
exit(-1);
}

Feel free to post comments and questions here or at bhawnablog@gmail.com

Putting it all together…the complete Perl script is below

#!/usr/bin/perl

msgSender($ARGV[0],$ARGV[1]);
 

sub msgSender{
eval{    
        #Read the email sendTo list from the txt file
        my $sendToList = "C:/alerts/sendToList.txt";
        
        #Open Pipe to the sendTo list file
        open (SENDTOLISTFILE, $sendToList ) or die("Could not open sender list file C:/alerts/sendToList.txt");
         while (my $line = <SENDTOLISTFILE>) {
            chomp($line);
             push(@sendTo,$line);
        }
        close(SENDTOLISTFILE);
        
        #Read arguments for subject and content
        my $messageSubject = $_[0];
        my $messageContent = $_[1];
        
        # Setup SMTP mail server to send the alert email
        use Net::SMTP;
        
        $smtp = Net::SMTP->new('SMTP_SERVER_HOST'); # connect to an SMTP server
        $smtp->mail('alertNotify@YOUR_DOMAINNAME.com'); # sender's address here
        for($count = @sendTo; $count>0 ; $count--)
        {         
            $sendToVal = $sendTo[$count-1];
            $smtp->to($sendToVal);            # recipient's address
        }
        $smtp->data(); # Start writing the mail
        
        # Send the header.    
        $smtp->datasend("To: alertAlias\@YOUR_DOMAINNAME.com\n");
        $smtp->datasend("From: alertNotify\@YOUR_DOMAINNAME.com\n");
        $smtp->datasend("Subject: $messageSubject\n");
        $smtp->datasend("\n");
        
        # Send the body.
        $smtp->datasend("This is a Broadcast Message. Please DO NOT reply to this email. It is not monitored\n");
        $smtp->datasend("------------------------------------------------------------------------------------------\n");
        $smtp->datasend("$messageContent\n");
        $smtp->dataend(); # Finish sending the mail
        $smtp->quit; # Close the SMTP connection
        1; # eval returns true on success, so the or do block runs only on failure
    }
    or do{
            my $errorLogging = "C:/logs/alertNotify.err";
            open(LOG,">$errorLogging") || die("Cannot Open File");
            print LOG "alertNotify:$@";
            close(LOG);
            exit(-1);
    };        
    exit(0);    
}

Tuesday, March 31, 2009

SQL Server Locking Mechanism Quick Facts

  1. The size of memory made available to SQL Server defines the lock granularity that the server will pick while processing a transaction.
  2. The lowest granularity level is row.
  3. SQL Server takes shared locks on data being queried, which means all queries can see the data, but queries block writes and writes block queries. Oracle, in contrast, uses snapshots for executing queries, so queries do not block writes and writes do not block queries (although writes block other writes).
  4. For updating a single row, SQL Server acquires a single lock, but if you are updating a large set of rows, e.g. 1000 rows, SQL Server might decide to acquire a page, extent or whole-table lock depending on how the data is stored physically. One can control this by specifying the ROWLOCK hint in the update statement, although tuning queries using hints should be left to experts or done under expert supervision.
  5. SQL Server chooses a Bulk Update lock for bulk copy operations, which improves performance at the cost of concurrency.
  6. The TRANSACTION ISOLATION LEVEL in effect can affect SQL Server's choice of one lock level over another.
  7. SERIALIZABLE is the most restrictive of all the transaction isolation levels (READ COMMITTED, READ UNCOMMITTED, REPEATABLE READ, SERIALIZABLE). It ensures that each transaction is completely isolated from others.
  8. By default, SQL Server transactions do not time out, unless LOCK_TIMEOUT is specified.
  9. SQL Server has a deadlock detection and resolution mechanism which picks one of the transactions involved in the deadlock to roll back. One can control which transaction gets rolled back using the SET DEADLOCK_PRIORITY statement (LOW, NORMAL, HIGH or an integer from -10 to 10; the default is NORMAL). The session with the lower priority is picked for rollback in deadlock situations. For sessions with the same deadlock priority, the one that is least expensive to roll back is picked, and if that does not decide it, the transaction to roll back is picked randomly.

Tuesday, March 24, 2009

Useful /Handy SQL Queries: MS SQL server

Q. Find duplicates in a table

SELECT zip ,
COUNT (zip) AS NumOccurrences
FROM zipcode GROUP BY zip
HAVING
(COUNT(zip)> 1 )

Q. Select a row or column value at random

SELECT TOP 1 city
FROM cityAddress ORDER BY NEWID()

Q. List items in one table that are not in the other

(LEFT JOIN)
SELECT customers.*
FROM customers LEFT JOIN orders ON customers.customer_id = orders.customer_id
WHERE orders.customer_id IS NULL

Alternatively,

SELECT customers.*
FROM customers
WHERE customers.customer_id NOT IN(SELECT customer_id FROM orders)

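Note: IN clauses with a subquery can be slower in execution, depending on the optimizer. Also, NOT IN returns no rows if the subquery yields a NULL customer_id.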

Q. List items in one table that are also in another table

(INNER JOIN)

SELECT
DISTINCT customers.*
FROM customers INNER JOIN orders
ON customers.customer_id = orders.customer_id

Alternatively,

SELECT customers.*
FROM customers
WHERE customers.customer_id IN(SELECT customer_id FROM orders)

Note: IN clauses with a subquery can be slower in execution, depending on the optimizer.

Q. Get Total count of distinct column value

Notice the DISTINCT keyword placement

SELECT COUNT(DISTINCT customer_state) AS total
FROM customers

Q. Copy data from one table into another

INSERT INTO customers(customer_id, customer_name)
SELECT customer_id, customer_name
FROM partnerCustList

Q. Bulk copy data from one table into a new table.

SELECT ... INTO creates the new table, so it must not already exist. The new table will have the same structure as the one the data is copied from.

SELECT *
INTO customers
FROM partnerCustList

Sunday, March 22, 2009

Java Buffer

A buffer is an object that primitive-type data is written into or read from. A buffer provides structured access to the data while keeping track of the reading and writing processes. Buffers allow I/O operations on blocks of data instead of working byte by byte (stream-oriented), which speeds up the I/O operations.

To understand buffers in depth we need to take a tour to the buffer internals.

Buffer Internals

State Variables

Buffer state variables do the "internal accounting" for a buffer. With each read or write operation, the buffer's state is updated, which helps the buffer manage its resources and lets us perform I/O operations in blocks. A buffer has 3 state variables to track its state and the data it holds:

Position – keeps track of how much data has been written to or read from the buffer, i.e. where the next block of data will be added to the buffer or read from it.

Limit – marks how far reading or writing may go: when reading, how much data is left in the buffer to read; when writing, how much space is left in the buffer to write data into.

Capacity – specifies the maximum amount of data that the buffer can hold.

This brings us to the equation,

position ≤ limit ≤ capacity where none of the state variables can be negative.

Now let us try to visualize these variables. Assuming the capacity of our buffer is 16 bytes shown by dashes below,

State: Empty

[position = 0] ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ [limit, capacity = 16]

State: First write of 8 bytes

__1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ [position = 8] ____ ____ ____ ____ ____ ____ ____ ____ [limit, capacity = 16]

State: Second write of 4 bytes

__1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ [position = 12] ____ ____ ____ ____ [limit, capacity = 16]

Now let us flip the buffer so that we can read from it. flip() sets the limit to the current position and resets the position to 0.

State: flip()

[position = 0] __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ [limit = 12] ____ ____ ____ ____ [capacity = 16]

The buffer is now ready to be read from.

State: Read 8 bytes

__1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ [position = 8] __1_ __1_ __1_ __1_ [limit = 12] ____ ____ ____ ____ [capacity = 16]

The next read can read at most 4 more bytes from our buffer because the limit is set to 12.

State: Read 4 bytes

__1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ __1_ [position, limit = 12] ____ ____ ____ ____ [capacity = 16]

And finally we clear the buffer before using it further. clear() sets the position to 0 and the limit to the buffer capacity. It does not erase the bytes; it only resets the state variables so that the old data can be overwritten.

State: clear()

[position = 0] ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ [limit, capacity = 16]
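The same sequence of states can be reproduced with a real ByteBuffer. This is a small sketch for illustration; the class name and the state() helper are my own, and the bytes written are just zeros since only the state variables matter here.

```java
import java.nio.ByteBuffer;

class BufferStateDemo {

    // Formats the three state variables of a buffer
    static String state(ByteBuffer b) {
        return "position=" + b.position() + " limit=" + b.limit() + " capacity=" + b.capacity();
    }

    // Replays the walk-through above and records the state after each step
    static String[] walkThrough() {
        ByteBuffer b = ByteBuffer.allocate(16);
        String[] states = new String[7];
        states[0] = state(b);                      // empty
        b.put(new byte[8]);  states[1] = state(b); // first write of 8 bytes
        b.put(new byte[4]);  states[2] = state(b); // second write of 4 bytes
        b.flip();            states[3] = state(b); // switch from writing to reading
        b.get(new byte[8]);  states[4] = state(b); // read 8 bytes
        b.get(new byte[4]);  states[5] = state(b); // read the remaining 4 bytes
        b.clear();           states[6] = state(b); // ready for writing again
        return states;
    }

    public static void main(String[] args) {
        for (String s : walkThrough()) {
            System.out.println(s);
        }
    }
}
```

Trying to read one more byte after the second read would throw a BufferUnderflowException, because position has reached limit.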

Accessor Methods

Get (ByteBuffer)
  1. byte get(); - returns single byte.
  2. ByteBuffer get( byte dest[] ); - reads a group of bytes into the array dest
  3. ByteBuffer get( byte dest[], int offset, int length ); - reads a group of bytes into the array dest
  4. byte get( int index ); - returns a byte of data from the position specified by index

Methods 1-3 respect the buffer state variables and advance the position, whereas method 4 reads from the given index without using or changing the position (the index must still be below the limit). Method 4 is referred to as an absolute method while the others are relative. Methods 2 and 3 return the buffer they were called on, which allows method chaining when needed.

buffer.get(data).flip();
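To see the difference between relative and absolute get, here is a short sketch (class and method names are only for illustration). It reads one byte relatively, one byte absolutely, and then checks that an absolute read at or beyond the limit is rejected.

```java
import java.nio.ByteBuffer;

class RelativeVsAbsoluteGet {

    // Returns {first byte, position after relative get, third byte, position after absolute get}
    static int[] demo() {
        ByteBuffer b = ByteBuffer.wrap(new byte[] {10, 20, 30, 40});
        byte first = b.get();              // relative: reads index 0, position becomes 1
        int afterRelative = b.position();
        byte third = b.get(2);             // absolute: reads index 2, position stays 1
        int afterAbsolute = b.position();
        return new int[] {first, afterRelative, third, afterAbsolute};
    }

    // An absolute get still has to stay below the limit
    static boolean absoluteGetPastLimitThrows() {
        ByteBuffer b = ByteBuffer.wrap(new byte[] {10, 20, 30, 40});
        b.limit(2);
        try {
            b.get(2);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]);
        System.out.println(absoluteGetPastLimitThrows());
    }
}
```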

Put (ByteBuffer)

1. ByteBuffer put( byte b ); - puts one byte in the buffer

2. ByteBuffer put( byte src[] ); - puts an array of bytes in the buffer

3. ByteBuffer put( byte src[], int offset, int length ); - puts an array of bytes in the buffer

4. ByteBuffer put( ByteBuffer src ); - copies data from source buffer into this buffer

5. ByteBuffer put( int index, byte b ); - puts data byte into the position specified by index

Here method 5 is absolute and all the others are relative.
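The five put() variants can be exercised together as follows. This is a sketch with made-up values, and the class name PutDemo is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class PutDemo {
    // Fills an 8-byte buffer using each put() variant once
    static byte[] fill() {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        buffer.put((byte) 1);                         // relative, position = 1
        buffer.put(new byte[] {2, 3});                // relative bulk, position = 3
        buffer.put(new byte[] {9, 4, 5, 9}, 1, 2);    // copies 4 and 5, position = 5
        buffer.put(ByteBuffer.wrap(new byte[] {6}));  // copies src's remaining bytes, position = 6
        buffer.put(7, (byte) 8);                      // absolute, position stays 6
        return buffer.array();                        // backing array of the heap buffer
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fill())); // [1, 2, 3, 4, 5, 6, 0, 8]
    }
}
```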

The methods discussed above all belong to the ByteBuffer class. The other buffer types have equivalent get() and put() methods for the primitive type they handle.

ByteBuffer also has methods to get or put values of a specific primitive type, such as getInt() and putLong(), in both absolute and relative form.
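As a brief illustration of the typed accessors (the class name TypedAccessDemo and the values are assumptions for the example), the sketch below writes an int relatively and a long absolutely, then reads both back by index.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class TypedAccessDemo {
    // Returns {position after the writes, int read at index 0, long read at index 4}
    static long[] roundTrip() {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.putInt(42);          // relative: writes 4 bytes, position = 4
        buffer.putLong(4, 7L);      // absolute: writes 8 bytes at index 4, position unchanged
        return new long[] { buffer.position(), buffer.getInt(0), buffer.getLong(4) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip())); // [4, 42, 7]
    }
}
```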

ByteBuffer Quick Facts

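  1. A newly allocated ByteBuffer is filled with zeros, and its state variables start out as position = 0 and limit = capacity.
  2. The duplicate() and slice() methods create a shallow copy of the original ByteBuffer. The content is shared, so any change made to the data through the returned buffer is visible in the original, but each buffer keeps its own position, limit and mark.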
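The shallow-copy behaviour can be checked with a short sketch. DuplicateDemo is a hypothetical name used only for this example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class DuplicateDemo {
    // Returns {byte seen through the original, original position, duplicate position}
    static int[] run() {
        ByteBuffer original = ByteBuffer.allocate(4);
        ByteBuffer copy = original.duplicate();
        copy.put((byte) 99);   // writes into the shared storage, moves only copy's position
        return new int[] { original.get(0), original.position(), copy.position() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // [99, 0, 1]
    }
}
```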

Other handy methods

Creating buffers: allocate() and wrap()

Buffers can be created by allocating space with allocate() or by wrapping an existing array with wrap().

ByteBuffer buffer = ByteBuffer.allocate(1024);

This allocates a buffer with a capacity of 1024 bytes.

You can also wrap an array of a primitive type into the corresponding buffer.

byte[] arr = new byte[1024];

ByteBuffer buffer = ByteBuffer.wrap(arr);

Both buffer and arr share the same memory space now.
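A quick way to convince yourself of the sharing is to write on one side and read on the other. The class name WrapDemo and the values below are only illustrative.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class WrapDemo {
    // Returns {value read from the array, value read from the buffer}
    static int[] run() {
        byte[] arr = new byte[4];
        ByteBuffer buffer = ByteBuffer.wrap(arr);
        buffer.put(0, (byte) 7);   // write through the buffer, read from the array
        arr[1] = 8;                // write to the array, read through the buffer
        return new int[] { arr[0], buffer.get(1) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // [7, 8]
    }
}
```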

Direct vs. non-direct ByteBuffer Allocations

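A direct ByteBuffer is allocated in native OS memory, outside the normal garbage-collected heap. The JVM then makes a best effort to perform native I/O operations directly on it, although this is not guaranteed. Allocating a direct ByteBuffer is more costly than allocating a non-direct one, but it typically provides faster I/O.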

ByteBuffer byte_buff = ByteBuffer.allocateDirect (2000);

There is no allocateDirect() method for the other primitive buffer types, but we can create a view buffer on a direct ByteBuffer to read the data as another primitive type while still using direct memory underneath.

ByteBuffer byte_buff = ByteBuffer.allocateDirect (2000);
CharBuffer cbuf = byte_buff.asCharBuffer();
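The following sketch shows that a view created this way is itself direct and holds half as many chars as the byte buffer holds bytes. The class name DirectViewDemo and the sample text are assumptions for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

class DirectViewDemo {
    // Returns "<text>:<isDirect>:<capacity in chars>"
    static String run() {
        ByteBuffer bytes = ByteBuffer.allocateDirect(2000);
        CharBuffer chars = bytes.asCharBuffer();   // view over the same direct memory
        chars.put("nio");                          // each char occupies 2 bytes
        chars.flip();                              // prepare the view for reading
        return chars.toString() + ":" + chars.isDirect() + ":" + chars.capacity();
    }

    public static void main(String[] args) {
        System.out.println(run()); // nio:true:1000
    }
}
```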

Slicing buffers: slice()

slice() creates a sub-buffer from the original buffer, starting at the original's current position and ending at its limit. Both buffers share the same memory space, so slicing creates a shallow copy.

ByteBuffer origBuffer = ByteBuffer.allocate(16);

origBuffer.position(4);

origBuffer.limit(12);

ByteBuffer slicedBuffer = origBuffer.slice();

Now if we add 4 to each value in slicedBuffer, the original buffer can be represented as

position = 0          position(slicedBuffer) = 4                limit(slicedBuffer) = 12  capacity = 16
↓ __1_ __1_ __1_ __1_ ↓ __5_ __5_ __5_ __5_ __5_ __5_ __5_ __5_ ↓ ____ ____ ____ ____     ↓

This supports data abstraction: you can write functions that work on a whole buffer or on just a slice of it.
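The slicing example can be run end to end with the sketch below. It assumes the first 12 bytes of origBuffer were filled with 1, as in the diagram. The class name SliceDemo is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class SliceDemo {
    // Returns the original buffer's content after modifying it through the slice
    static byte[] run() {
        ByteBuffer origBuffer = ByteBuffer.allocate(16);
        for (int i = 0; i < 12; i++) {
            origBuffer.put(i, (byte) 1);               // assumed initial content
        }
        origBuffer.position(4);
        origBuffer.limit(12);
        ByteBuffer slicedBuffer = origBuffer.slice();  // slice indices 0..7 map to 4..11
        for (int i = 0; i < slicedBuffer.capacity(); i++) {
            slicedBuffer.put(i, (byte) (slicedBuffer.get(i) + 4));
        }
        return origBuffer.array();
    }

    public static void main(String[] args) {
        // [1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0]
        System.out.println(Arrays.toString(run()));
    }
}
```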

Marking the buffer position: mark()

Marks the current position in the buffer so that a subsequent reset() brings the buffer position back to the marked position. Calling reset() when no mark has been set throws an InvalidMarkException.

Rewind Buffer: rewind()

Sets the buffer position to 0 and discards any mark. The limit is left unchanged.
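Here is a small sketch of how mark(), reset() and rewind() interact. The class name MarkDemo and the chosen positions are only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;
import java.util.Arrays;

class MarkDemo {
    // Returns {position after reset(), position after rewind(), 1 if the mark was discarded}
    static int[] run() {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        buffer.position(2);
        buffer.mark();                 // mark = 2
        buffer.position(6);
        buffer.reset();                // position goes back to 2
        int afterReset = buffer.position();
        buffer.rewind();               // position = 0, mark discarded
        int markDiscarded = 0;
        try {
            buffer.reset();            // no mark any more
        } catch (InvalidMarkException e) {
            markDiscarded = 1;
        }
        return new int[] { afterReset, buffer.position(), markDiscarded };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // [2, 0, 1]
    }
}
```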

Creating read-only buffers: asReadOnlyBuffer()

ByteBuffer buffer = ByteBuffer.allocate(1024);
ByteBuffer readOnlyBuffer = buffer.asReadOnlyBuffer();
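A read-only buffer shares content with the buffer it was created from, but rejects writes. The sketch below illustrates both points. The class name ReadOnlyDemo is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

class ReadOnlyDemo {
    // Returns {change visible through the read-only view, write on the view rejected}
    static boolean[] run() {
        ByteBuffer buffer = ByteBuffer.allocate(4);
        ByteBuffer readOnly = buffer.asReadOnlyBuffer();
        buffer.put(0, (byte) 5);                 // write through the original
        boolean visible = readOnly.get(0) == 5;  // the view sees the change
        boolean rejected = false;
        try {
            readOnly.put(0, (byte) 1);           // writing through the view is not allowed
        } catch (ReadOnlyBufferException e) {
            rejected = true;
        }
        return new boolean[] { visible, rejected };
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("visible=" + r[0] + " rejected=" + r[1]);
    }
}
```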

Buffer in Action

Copying data from an input stream into a buffer and writing the data from the buffer to an output stream.


import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class BufferCopy
{
  public static void main(String[] args) throws IOException
  {
    // args[0] is the source file, args[1] is the destination file
    try (FileChannel inChannel = new FileInputStream(args[0]).getChannel();
         FileChannel outChannel = new FileOutputStream(args[1]).getChannel())
    {
      ByteBuffer buffer = ByteBuffer.allocate(1024*1024);

      // read() fills the buffer and returns -1 at end of file
      while (inChannel.read(buffer) != -1)
      {
        buffer.flip();                 // switch from filling to draining
        while (buffer.hasRemaining())  // write() may not drain everything in one call
          outChannel.write(buffer);
        buffer.clear();                // make the buffer ready for the next read
      }
    }
  }
}

Converting ByteBuffer to CharBuffer

char[] data = "ByteToCharBuffer".toCharArray();
ByteBuffer bb = ByteBuffer.allocate(data.length * 2);  // 2 bytes per char
CharBuffer cb = bb.asCharBuffer();                     // view over bb's bytes
cb.put(data);
cb.flip();                                             // prepare the view for reading
while (cb.hasRemaining())
  System.out.print(cb.get() + " ");


Wrap a char array into a CharBuffer

CharBuffer buffer = CharBuffer.allocate(8);
char[] myBuffer = new char[100];
CharBuffer cb = CharBuffer.wrap(myBuffer);


Converting between string and bytes

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

// Create the encoder and decoder
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try
{
  // Convert string to bytes (ISO-LATIN-1) in ByteBuffer
  ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("string"));

  // Convert bytes from ByteBuffer into CharBuffer and then to a string.
  CharBuffer cbuf = decoder.decode(bbuf);
  String s = cbuf.toString();
}
catch (CharacterCodingException e) {
  // Thrown when the input is malformed or cannot be mapped in this charset
  e.printStackTrace();
}

String and byte conversion using the direct allocation for ByteBuffer

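// Reuses encoder and decoder from the previous example; reset them before a new conversion
encoder.reset();
decoder.reset();
// Create a direct ByteBuffer for channeling the data
ByteBuffer bytebuf = ByteBuffer.allocateDirect(1024);
// Create a non-direct CharBuffer
CharBuffer charbuf = CharBuffer.allocate(1024);
// Fill charbuf with the characters to encode, then flip it for reading
charbuf.put("string");
charbuf.flip();
// Convert characters in charbuf to bytes in bytebuf (true: no more input follows)
encoder.encode(charbuf, bytebuf, true);
encoder.flush(bytebuf);
// flip bytebuf before reading from it
bytebuf.flip();
// clear charbuf so that the decoder can write into it
charbuf.clear();
// Convert bytes in bytebuf to characters in charbuf
decoder.decode(bytebuf, charbuf, true);
decoder.flush(charbuf);
// flip charbuf before reading from it
charbuf.flip();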