Effective Use of Solr Index Distribution Scripts

10 02 2009

Operations and automation tasks are sometimes an afterthought at the end of development. For Solr development, it’s actually not that bad to think about automation at the very end, because Solr provides a set of very useful scripts that make automation easy. Consider yourself lucky if you are short on time to build automation. I will first talk about basic architecture with Solr, and then I will dive into leveraging Solr’s distribution and operation scripts.

The most basic architecture for a Solr-based application requires only a single application server. Assuming you develop in Java, you can have both Solr and your webapp served by the same application server. A more common and effective architecture involves a dedicated indexing server (or indexer) and one or more slave index servers. The idea is to separate all index building work from normal queries. Conceptually, this is similar to database clustering, where you have a read/write server as master and read-only servers as slaves.
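To make the read/write split concrete, here is a minimal Java sketch of how a client might pick a target server. The class `SolrTopology` and its method names are hypothetical. The host names `indexer`, `slave1` and `slave2` and port 8080 simply follow the sample configuration in this post, and `/solr/update` and `/solr/select` are Solr's standard update and query handlers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: sends writes to the indexer and spreads reads across slaves. */
class SolrTopology {
    private final String indexerHost;
    private final List<String> slaveHosts;
    private final AtomicInteger next = new AtomicInteger();

    SolrTopology(String indexerHost, List<String> slaveHosts) {
        if (slaveHosts.isEmpty()) {
            throw new IllegalArgumentException("at least one slave is required");
        }
        this.indexerHost = indexerHost;
        this.slaveHosts = new ArrayList<>(slaveHosts);
    }

    /** All adds, deletes, commits and optimizes go to the single indexer. */
    String updateUrl() {
        return "http://" + indexerHost + ":8080/solr/update";
    }

    /** Queries rotate over the read-only slaves (simple round robin). */
    String selectUrl() {
        int i = Math.floorMod(next.getAndIncrement(), slaveHosts.size());
        return "http://" + slaveHosts.get(i) + ":8080/solr/select";
    }

    public static void main(String[] args) {
        SolrTopology topology = new SolrTopology("indexer", Arrays.asList("slave1", "slave2"));
        System.out.println(topology.updateUrl());
        System.out.println(topology.selectUrl());
        System.out.println(topology.selectUrl());
    }
}
```

In practice a load balancer in front of the slaves can replace the round robin; the point is only that the application never writes to a slave.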

The following setup involves Tomcat, Apache and Linux, and assumes Solr’s home is /solr on every Solr server.

Note: you may be able to replicate a similar configuration in a Windows environment running Cygwin. I haven’t tried it on Windows yet, so YMMV.

  • Scripts configuration
    • Environment can be configured in solr/conf/scripts.conf. Here is a sample indexer configuration:
    • user=solr
      solr_hostname=indexer
      solr_port=8080
      rsyncd_port=18080
      data_dir=data
      webapp_name=solr
      master_host=indexer
      master_data_dir=/solr/data
      master_status_dir=/solr/logs
    • Sample slave server configuration:
    • user=solr
      solr_hostname=slave1
      solr_port=8080
      rsyncd_port=18080
      data_dir=data
      webapp_name=solr
      master_host=indexer
      master_data_dir=/solr/data
      master_status_dir=/solr/logs
  • SSH set up
    • Solr uses SSH and rsync in its index distribution scripts, so we need to make sure SSH keys are configured and public keys are exchanged between the indexer and the slave index servers. If you haven’t configured an SSH key yet, use the ssh-keygen command to generate a public/private key pair on every Solr server.
    • $ ssh-keygen
      Generating public/private rsa key pair.
      Enter file in which to save the key (/home/solr/.ssh/id_rsa):
      Enter passphrase (empty for no passphrase):
      Enter same passphrase again:
      Your identification has been saved in /home/solr/.ssh/id_rsa.
      Your public key has been saved in /home/solr/.ssh/id_rsa.pub.
      The key fingerprint is:
      0c:27:27:f5:81:36:87:82:0f:4f:39:b5:aa:fd:e4:2f solr@solr
    • Exchange public keys between the indexer and the slave index servers:
    • $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      $ chmod 644 ~/.ssh/authorized_keys
      $ ssh solr@indexer "cat .ssh/id_rsa.pub" >> ~/.ssh/authorized_keys
  • Rsyncd set up
    • Solr uses rsync for index distribution, so you need to make sure rsync is functional on your operating system. Start rsyncd the first time with the following commands:
    • $ /solr/bin/rsyncd-enable
      $ /solr/bin/rsyncd-start
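  • Configure Solr to automatically generate a snapshot after optimize. Update solr/conf/solrconfig.xml with the following:
  • <listener event="postOptimize" class="solr.RunExecutableListener">
        <str name="exe">/solr/bin/snapshooter</str>
        <str name="dir">/solr/bin/</str>
        <bool name="wait">true</bool>
    </listener>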
  • Enable snapshot pulling on slave servers:
  • $ /solr/bin/snappuller-enable
  • Set up snapshot pulling on slave servers at 3am in cron:
  • 0 3 * * * /solr/bin/snappuller; /solr/bin/snapinstaller; /solr/bin/snapcleaner -N 3
  • OPTIONAL: to set up Apache load balancing across your slave index servers (running Tomcat), update /etc/conf/httpd.conf with the following:
  • LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
    ....
    
        ProxyRequests Off
        ProxyPreserveHost On
        ProxyPass / balancer://tomcats/ stickysession=JSESSIONID lbmethod=byrequests
        ProxyPassReverse / balancer://tomcats/
        <Proxy balancer://tomcats>
            BalancerMember ajp://slave1:8080 route=jvm1 loadfactor=20
            BalancerMember ajp://slave2:8080 route=jvm2 loadfactor=20
        </Proxy>
    

All indexing work should be done on your indexer. When you issue the optimize command, Solr will automatically generate a snapshot. The snapshot should be generated well ahead of the scheduled snapshot pulling time (3am in this case). You can skip Apache load balancing if you only have one slave server or you already have another load balancing solution.
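Since the snapshot is tied to the optimize command, it can help to issue optimize from a scheduled job that runs comfortably before the 3am pull. The following is a minimal Java sketch of that step, not part of Solr's own scripts. It posts the standard `<optimize/>` XML update message to the indexer's update handler. The class name `OptimizeTrigger`, its methods, and the `--send` switch are all hypothetical, and the host, port and webapp name are taken from the sample scripts.conf.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that asks the indexer to optimize its index. */
class OptimizeTrigger {
    /** XML update message that tells Solr to optimize. */
    static final String OPTIMIZE_BODY = "<optimize/>";

    /** Builds the update URL; host, port and webapp name mirror scripts.conf. */
    static String updateUrl(String host, int port, String webapp) {
        return "http://" + host + ":" + port + "/" + webapp + "/update";
    }

    /** Posts the optimize message and returns the HTTP status code. */
    static int optimize(String updateUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(OPTIMIZE_BODY.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = updateUrl("indexer", 8080, "solr");
        System.out.println("POST " + OPTIMIZE_BODY + " to " + url);
        // Only touch the network when explicitly asked to (invented switch).
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println("HTTP " + optimize(url));
        }
    }
}
```

Scheduling this an hour or two before the slaves' cron entry leaves room for the optimize and the snapshot to finish; the right margin depends on your index size.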

Reference links:

http://wiki.apache.org/solr/CollectionDistribution

http://wiki.apache.org/solr/SolrOperationsTools




Setting Solr Home (solr/home) in JNDI on Tomcat 5.5

28 01 2009

Here is another odd issue I encountered, for which I haven’t found a good solution yet. For now I can only offer a not-so-elegant workaround. Let me know if you have a better solution so I can post it here.

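According to Solr’s documentation on Tomcat configuration, you can use JNDI to initialize the solr/home variable.

<Context docBase="/some/path/solr.war" debug="0" crossContext="true" >
   <Environment name="solr/home" type="java.lang.String" value="/my/solr/home" override="true" />
</Context>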
When I tried the same configuration on CentOS, Solr would not start. In the error log, I found:

INFO: No /solr/home in JNDI

This basically says that Solr couldn’t find solr/home through JNDI. Although solr/home is initialized under the Environment tag, for some odd reason Solr does not see it in JNDI. Perhaps my Tomcat install is whacked, or Solr may be using some prefix or suffix to find this JNDI entry. I haven’t had time to look into Solr’s source code to see if it really looks for solr/home through JNDI, which is why I said I don’t have an elegant solution to this issue. My workaround is to pass a system property to the JVM via the “-D” parameter. Modify catalina.sh (or catalina.bat if running on Windows) and add the following to JAVA_OPTS.

JAVA_OPTS="... -Dsolr.solr.home=/my/solr/home"
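To illustrate why the system property works even when the JNDI entry is not visible, here is a small Java sketch of a lookup that tries JNDI first and then falls back to `-Dsolr.solr.home`. This is my reading of the behavior suggested by the log message, not code copied from Solr. The class name `SolrHomeResolver` is hypothetical, and both the JNDI name `java:comp/env/solr/home` and the final `solr/` default are assumptions you should verify against the Solr version you run.

```java
import javax.naming.InitialContext;
import javax.naming.NamingException;

/** Hypothetical sketch of a JNDI-then-system-property lookup for Solr home. */
class SolrHomeResolver {
    static String resolve() {
        // 1. Try JNDI. The java:comp/env/ prefix is where a servlet
        //    container normally exposes Environment entries (assumed name).
        try {
            Object value = new InitialContext().lookup("java:comp/env/solr/home");
            if (value != null) {
                return value.toString();
            }
        } catch (NamingException e) {
            // No JNDI provider or no such entry: fall through to the property.
        }
        // 2. Fall back to the -Dsolr.solr.home system property.
        String fromProperty = System.getProperty("solr.solr.home");
        if (fromProperty != null) {
            return fromProperty;
        }
        // 3. Last resort (assumed): a solr/ directory under the working directory.
        return "solr/";
    }

    public static void main(String[] args) {
        System.out.println("Resolved Solr home: " + resolve());
    }
}
```

Running it with `java -Dsolr.solr.home=/my/solr/home SolrHomeResolver` outside a container skips the JNDI step and returns the property value, which mirrors the workaround above.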



Update with Invalid Control Character Crashes Solr

28 01 2009

This is a common issue when you start indexing large amounts of text content, especially content that originates from web crawling. You may not encounter this issue if the content has already been filtered by a data cleansing library of some sort. However, if you build your own XML for Solr, it’s likely that you will run into this illegal character issue sooner or later. The cause is usually control characters embedded in the text being indexed. Strictly speaking these are valid UTF-8 but not legal in an XML 1.0 document, which is why the XML parser rejects them. It’s usually catastrophic when it happens because Solr (or the underlying technology, Lucene) does not handle the illegal character exception, and the Solr server would crash. The solution is to either remove or replace the invalid characters. You can use my sample code (in Java) below to strip these control characters.

Here is what you will see if you encounter the illegal character issue:

Jan 28, 2009 6:22:06 PM org.apache.solr.common.SolrException log
SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 25))
 at [row,col {unknown-source}]: [718,22]
        at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
        at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
        at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
        at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
        at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
        at java.lang.Thread.run(Thread.java:619)

Sample code:

public static String stripNonUTF8Chars(String string) {
    if (string == null) return null;
    StringBuilder sb = new StringBuilder(string.length());
    for (int i = 0; i < string.length(); i++) {
        char c = string.charAt(i);
        // Control characters below U+0020 are dropped, except tab (U+0009),
        // line feed (U+000A) and carriage return (U+000D), which XML allows.
        boolean illegalControl = c < 0x20 && c != '\t' && c != '\n' && c != '\r';
        // U+FFFE and U+FFFF are non-characters and are dropped as well.
        boolean nonCharacter = c == 0xFFFE || c == 0xFFFF;
        if (!illegalControl && !nonCharacter) {
            sb.append(c);
        }
    }
    return sb.toString();
}
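
If you would rather replace the invalid characters than remove them (for example, to keep words from running together), a regular expression can do the same job. The sketch below is a hypothetical variant, not a tested drop-in: the class name `ControlCharReplacer` and its method are invented for illustration, and the character set is meant to match the one used in the sample code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical variant: replaces illegal characters instead of removing them. */
class ControlCharReplacer {
    // C0 control characters except tab, line feed and carriage return,
    // plus the non-characters U+FFFE and U+FFFF.
    private static final Pattern ILLEGAL =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\uFFFE\\uFFFF]");

    /** Returns text with every illegal character swapped for the replacement. */
    static String replaceIllegalChars(String text, String replacement) {
        if (text == null) return null;
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return ILLEGAL.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // 0x19 is the control character (code 25) reported in the stack trace.
        String dirty = "bad" + (char) 0x19 + "text";
        System.out.println(replaceIllegalChars(dirty, " "));
    }
}
```

Passing an empty string as the replacement gives the same result as stripping.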