Tinkering with Bobo-browse MultiValueFacetHandler Limit

6 09 2010

In one of my recent projects implemented with Bobo-Browse for faceting, I ran into an issue with MultiValueFacetHandler’s limit of 1024 values per field per record.  I have some odd cases in my data set where a publication can have more than 2,000 authors.  The limit would stop at 1024 authors and leave the rest of the authors uncredited in my search results.  It wasn’t a great loss, as there are perhaps only a handful of publications with this many authors.  However, it was not great for the business, so I had to come up with a solution.  After some searching around and tinkering with the source a little, I found that removing the hard 1024 limit was a viable solution for this particular project.  My benchmark numbers didn’t change after the hack, and I was able to get all authors in the facet.  I can understand that the hard limit is there to prevent overloading the facet with too many values, which could kill performance or, worse, cause an out-of-memory issue.  In fact, 1024 is a pretty high limit for a single field; in my experience, I have hardly ever seen a multi-value field reach anywhere near that number on a single record.  However, I just encountered such an oddity.  Luckily, the number of publications with more than 1024 authors is negligible compared to the millions of publications in my index, so this little hack didn’t produce any adverse effect for me.

There are a couple of ways to do this hack.  One is to modify BigNestedIntArray.MAX_ITEMS directly with your desired max value.  The other is to modify the following three files.

  • MultiValueFacetDataCache.java
  • MultiValueFacetHandler.java
  • BigNestedIntArray.java

Look for the following code:

_maxItems = Math.min(maxItems, BigNestedIntArray.MAX_ITEMS);

and replace it with:

_maxItems = maxItems;

And of course, call setMaxItems on your MultiValueFacetHandler to set the desired max value.
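
For reference, here is a minimal sketch of how the handler might be wired up once the clamp is gone.  The field name "author" and the 2048 ceiling are just example values, not something dictated by Bobo:

import com.browseengine.bobo.facets.impl.MultiValueFacetHandler;

// Sketch only: assumes the Math.min(..., BigNestedIntArray.MAX_ITEMS) clamp
// has been removed as described above, so setMaxItems() is no longer capped at 1024.
MultiValueFacetHandler authorHandler = new MultiValueFacetHandler("author");

// 2048 is an example ceiling; per the warning below, the 11-bit key
// means going past 2048 will likely fail anyway.
authorHandler.setMaxItems(2048);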

Just want to reiterate that this hack has only been tested in one particular project.  It may create instability in your project if you have a lot of records with 1024+ facet values.  I was told that the array’s key is 11 bits, so the maximum is 2^11 = 2048 values per record; having more than 2048 values will likely cause an exception.  Memory consumption can also become a real issue.  Consider yourself warned.




Apache DocumentRoot does not exist

29 05 2010

I’ve got to write this down this time, although it is not related to Solr/Lucene.  This issue has come back and bitten me many times.  The error is due to an incorrect SELinux context on the DocumentRoot directory.  Here is what you need to do to correct it (substitute your own DocumentRoot path if it is not /var/www):

chcon -R user_u:object_r:httpd_sys_content_t /var/www

You may want to check on /var/www first to see if the context is correct by issuing this command:

ls -al --context /var/www

Okay, this is it.  This should fix this annoying issue.




High Performance Faceting with Bobo-Browse

24 05 2010

I got the chance to do a bare-bones Lucene implementation for a client with 40 million records.  They wanted to introduce faceting on the author field.  I was tempted to just go ahead with Solr.  However, it would have been counterproductive for the project because they don’t need the full package provided by Solr.  My client only wanted to build facets on top of their existing index with minimal changes.  Bobo became the obvious choice in this matter.  To say the least, Bobo is amazingly simple to use, and yet it provides decent performance.
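
To give a rough idea of what that setup looks like, here is a minimal sketch of faceting an existing index with Bobo.  This assumes a 2.x-era Bobo/Lucene API; the index path, field names, query, and facet options are placeholders, not my client’s actual configuration:

import java.io.File;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import com.browseengine.bobo.api.BoboBrowser;
import com.browseengine.bobo.api.BoboIndexReader;
import com.browseengine.bobo.api.BrowseFacet;
import com.browseengine.bobo.api.BrowseRequest;
import com.browseengine.bobo.api.BrowseResult;
import com.browseengine.bobo.api.FacetSpec;
import com.browseengine.bobo.facets.FacetHandler;
import com.browseengine.bobo.facets.impl.MultiValueFacetHandler;

public class AuthorFacetSketch {
  public static void main(String[] args) throws Exception {
    // Open the existing Lucene index (path is a placeholder).
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")), true);

    // Declare a multi-value facet on the "author" field (field name assumed).
    List<FacetHandler<?>> handlers =
        Arrays.<FacetHandler<?>>asList(new MultiValueFacetHandler("author"));

    // Wrap the plain reader with Bobo's decorated reader and browser.
    BoboIndexReader boboReader = BoboIndexReader.getInstance(reader, handlers);
    BoboBrowser browser = new BoboBrowser(boboReader);

    // Build a browse request: an ordinary Lucene query plus facet options.
    BrowseRequest req = new BrowseRequest();
    req.setQuery(new TermQuery(new Term("title", "lucene")));
    req.setCount(10);

    FacetSpec authorSpec = new FacetSpec();
    authorSpec.setMaxCount(20); // top 20 authors
    authorSpec.setOrderBy(FacetSpec.FacetSortSpec.OrderHitsDesc);
    req.setFacetSpec("author", authorSpec);

    // Run the browse and print the author facet counts.
    BrowseResult result = browser.browse(req);
    for (BrowseFacet f : result.getFacetMap().get("author").getFacets()) {
      System.out.println(f.getValue() + " (" + f.getFacetValueHitCount() + ")");
    }

    boboReader.close();
  }
}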

The biggest roadblock we faced with this implementation was the memory footprint.  When the author index was loaded through Bobo, it allocated 12G of memory.  Initially we set our young generation size way too small, so the GC algorithm we selected, CMS (Concurrent Mark Sweep), had to do a full sweep after every 2-3 searches.  Each full sweep would halt the entire service for about a minute before returning.  That was unacceptable, as it pretty much killed search altogether.  It appeared that Bobo allocates quite a bit of temporary memory to calculate the facets.  Perhaps it was the nature of our data, with a lot of intersection between authors, that caused the excessive memory usage.  We slowly increased the young generation size from 2G (yes, I know, it’s very small) to around 8G to get a stable system with virtually zero full sweeps.

Here is our current JVM config:

java -Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=9091 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Djava.rmi.server.hostname=se01.us.researchgate.net \
-verbosegc -XX:+PrintGCDetails \
-XX:+UseConcMarkSweepGC \
-XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing \
-XX:+UseParNewGC \
-XX:+CMSParallelRemarkEnabled \
-XX:+DisableExplicitGC \
-XX:MaxGCPauseMillis=2000 \
-XX:SoftRefLRUPolicyMSPerMB=1 \
-XX:CMSIncrementalDutyCycleMin=10 \
-XX:CMSIncrementalDutyCycle=50 \
-XX:ParallelGCThreads=8 \
-XX:GCTimeRatio=10 \
-Xmn8g \
-Xms22g \
-Xmx22g

This configuration works for us.  If you run into a similar JVM garbage collection issue, I hope it will help you too.
