High Performance Faceting with Bobo-Browse

24 05 2010

I got the chance to do a bare-bones Lucene implementation for a client with 40 million records.  They wanted to introduce faceting on the author field.  I was tempted to just go ahead with Solr.  However, that would have been counterproductive for the project because they don’t need the full package provided by Solr.  My client only wants to build the facets on top of their index with minimal changes.  Bobo became the obvious choice.  It is amazingly simple to use, and yet it provides decent performance.

The biggest roadblock we faced with this implementation was the memory footprint.  When the author index was loaded using Bobo, it allocated 12G of memory.  Initially, we set our young generation size far too small, so the GC algorithm we selected, CMS (Concurrent Mark Sweep), had to do a full sweep after every 2-3 searches.  Each full sweep halted the entire service for about a minute before returning.  That was unacceptable, as it pretty much killed search altogether.  It appeared that Bobo allocates quite a bit of temporary memory to calculate the facets.  Perhaps it was the nature of our data, with a lot of intersection between authors, that caused the excessive memory usage.  We gradually increased the young generation size from 2G (yes, I know, it’s very small) to around 8G, which gave us a stable system with virtually zero full sweeps.
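
One way to check whether full sweeps are still happening is to read the JVM's own collector counters.  The sketch below uses the standard java.lang.management API.  The class name GcSnapshot and the output format are made up for illustration, and the collector names it prints depend on the JVM version and GC flags in use, so treat it as a starting point rather than the monitoring we actually ran.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Bobo or Lucene.
class GcSnapshot {

    // Returns collector name -> {collection count, accumulated collection time in ms}.
    // The MXBean reports -1 when a value is undefined for a collector, so clamp to 0.
    static Map<String, long[]> read() {
        Map<String, long[]> stats = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = Math.max(0L, gc.getCollectionCount());
            long millis = Math.max(0L, gc.getCollectionTime());
            stats.put(gc.getName(), new long[] {count, millis});
        }
        return stats;
    }

    public static void main(String[] args) {
        // Print one line per collector registered in this JVM.
        for (Map.Entry<String, long[]> e : read().entrySet()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    e.getKey(), e.getValue()[0], e.getValue()[1]);
        }
    }
}
```

Calling read() before and after a batch of searches and comparing the counts for the old generation collector should show whether a change to the young generation size reduced the number of full sweeps.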

Here is our current JVM config:

java -Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=9091 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Djava.rmi.server.hostname=se01.us.researchgate.net \
-verbosegc -XX:+PrintGCDetails \
-XX:+UseConcMarkSweepGC \
-XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing \
-XX:+UseParNewGC \
-XX:+CMSParallelRemarkEnabled \
-XX:+DisableExplicitGC \
-XX:MaxGCPauseMillis=2000 \
-XX:SoftRefLRUPolicyMSPerMB=1 \
-XX:CMSIncrementalDutyCycleMin=10 \
-XX:CMSIncrementalDutyCycle=50 \
-XX:ParallelGCThreads=8 \
-XX:GCTimeRatio=10 \
-Xmn8g \
-Xms22g \
-Xmx22g
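
To make the sizing flags easier to reason about: -Xmn fixes the young generation, and the rest of the -Xmx heap is left for the old generation.  The sketch below works through that arithmetic for the values in this config.  The class and method names are illustrative, and it assumes the 12G index mentioned earlier ends up in the old generation once loaded, which is my reading rather than something we measured.

```java
// Hypothetical helper for illustration; the numbers come from the flags above.
class HeapSizing {
    static final long GIB = 1024L * 1024L * 1024L;

    // Old generation size = total heap minus young generation.
    // Assumes a fixed heap (-Xms equal to -Xmx), as in the config above.
    static long oldGenBytes(long maxHeapBytes, long youngGenBytes) {
        if (youngGenBytes < 0 || youngGenBytes > maxHeapBytes) {
            throw new IllegalArgumentException("young generation must fit inside the heap");
        }
        return maxHeapBytes - youngGenBytes;
    }

    public static void main(String[] args) {
        long heap = 22 * GIB;   // -Xmx22g
        long young = 8 * GIB;   // -Xmn8g
        long index = 12 * GIB;  // memory allocated when the author index was loaded

        long old = oldGenBytes(heap, young);
        System.out.println("old generation: " + (old / GIB) + "g");
        System.out.println("headroom over index: " + ((old - index) / GIB) + "g");
    }
}
```

With these values the old generation comes to 14g, leaving roughly 2g above the index, while the 8g young generation absorbs the temporary allocations made during facet calculation.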

This configuration works for us.  If you run into similar JVM garbage collection issues, I hope these settings help you too.
