Effective Use of Solr Index Distribution Scripts

10 02 2009

Operations and automation tasks are sometimes an afterthought at the end of development. For Solr development, leaving automation until the very end is actually not that bad: Solr provides a set of very useful scripts that make automation easy, so you can consider yourself lucky if you are short on time to build it. I will first talk about a basic Solr architecture and then dive into leveraging Solr’s distribution and operation scripts.

The most basic architecture for a Solr-based application requires only a single application server. Assuming you develop in Java, you can have both Solr and your webapp served by the same application server. A more common and effective architecture involves a dedicated indexing server (or indexer) and one or more slave index servers. The idea is to separate all index building work from normal queries. Conceptually, this is similar to database clustering, where a read/write server acts as master and read-only servers act as slaves.
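To make the split concrete, here is a minimal sketch of how a client might route writes to the indexer and reads to a slave over Solr's HTTP update and select handlers. The host names (indexer, slave1), port 8080, webapp name solr, and all function names are assumptions for illustration; adjust them to your own environment.

```shell
#!/bin/sh
# Sketch: writes go to the indexer, reads go to a slave.
# Host name, port, and webapp name below are assumptions.
INDEXER_HOST="${INDEXER_HOST:-indexer}"
SOLR_PORT="${SOLR_PORT:-8080}"
WEBAPP_NAME="${WEBAPP_NAME:-solr}"

# URL of the update handler on the indexer (all index changes go here).
update_url() {
    echo "http://${INDEXER_HOST}:${SOLR_PORT}/${WEBAPP_NAME}/update"
}

# URL of the select handler on a given slave (all queries go here).
select_url() {
    echo "http://$1:${SOLR_PORT}/${WEBAPP_NAME}/select"
}

# Send an XML update message (for example <commit/> or <optimize/>)
# to the indexer.
post_update() {
    curl -s "$(update_url)" -H 'Content-Type: text/xml' --data-binary "$1"
}

# Run a query against one slave; the query string is URL-encoded by curl.
query_slave() {
    curl -s -G "$(select_url "$1")" --data-urlencode "q=$2"
}
```

With these helpers, `post_update '<optimize/>'` would hit only the indexer, while `query_slave slave1 'title:solr'` would hit only a slave, so query traffic never competes with index building.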

The following setup involves Tomcat, Apache and Linux, and assumes Solr’s home is /solr on every Solr server.

Note: you may be able to replicate a similar configuration in a Windows environment running Cygwin. I haven’t tried it on Windows yet, so YMMV.

  • Scripts configuration
    • The environment is configured in solr/conf/scripts.conf. Here is a sample indexer configuration:
    • user=solr
      solr_hostname=indexer
      solr_port=8080
      rsyncd_port=18080
      data_dir=data
      webapp_name=solr
      master_host=indexer
      master_data_dir=/solr/data
      master_status_dir=/solr/logs
    • Sample slave server configuration:
    • user=solr
      solr_hostname=slave1
      solr_port=8080
      rsyncd_port=18080
      data_dir=data
      webapp_name=solr
      master_host=indexer
      master_data_dir=/solr/data
      master_status_dir=/solr/logs
  • SSH set up
    • Solr uses SSH and rsync in its index distribution scripts, so we need to make sure SSH keys are configured and public keys are exchanged between the indexer and the slave index servers. If you haven’t configured SSH keys yet, use the ssh-keygen command to generate a public/private key pair on every Solr server.
    • $ ssh-keygen
      Generating public/private rsa key pair.
      Enter file in which to save the key (/home/solr/.ssh/id_rsa):
      Enter passphrase (empty for no passphrase):
      Enter same passphrase again:
      Your identification has been saved in /home/solr/.ssh/id_rsa.
      Your public key has been saved in /home/solr/.ssh/id_rsa.pub.
      The key fingerprint is:
      0c:27:27:f5:81:36:87:82:0f:4f:39:b5:aa:fd:e4:2f solr@solr
    • Exchange public keys between the indexer and the slave index servers:
    • $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      $ chmod 644 ~/.ssh/authorized_keys
      $ ssh solr@indexer "cat .ssh/id_rsa.pub" >> ~/.ssh/authorized_keys
  • Rsyncd set up
    • Solr uses rsync for index distribution, so you need to make sure rsync is functional on your operating system. Start rsyncd the first time with the following commands:
    • $ /solr/bin/rsyncd-enable
      $ /solr/bin/rsyncd-start
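  • Configure Solr to automatically generate a snapshot after optimize. Update solr/conf/solrconfig.xml with the following:
  • <listener event="postOptimize" class="solr.RunExecutableListener">
          <str name="exe">/solr/bin/snapshooter</str>
          <str name="dir">/solr/bin/</str>
          <bool name="wait">true</bool>
    </listener>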
  • Enable snapshot pulling on slave servers:
  • $ /solr/bin/snappuller-enable
  • Set up snapshot pulling on slave servers at 3am in cron:
  • 0 3 * * * /solr/bin/snappuller; /solr/bin/snapinstaller; /solr/bin/snapcleaner -N 3
  • OPTIONAL: to set up Apache load balancing across your slave index servers (running Tomcat), update /etc/conf/httpd.conf with the following:
  • LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
    ....
    
        ProxyRequests Off
        ProxyPreserveHost On
        ProxyPass / balancer://tomcats/ stickysession=JSESSIONID lbmethod=byrequests
        ProxyPassReverse / balancer://tomcats/
        <Proxy balancer://tomcats>
            BalancerMember ajp://slave1:8080 route=jvm1 loadfactor=20
            BalancerMember ajp://slave2:8080 route=jvm2 loadfactor=20
        </Proxy>
    

All indexing work should be done on your indexer. When you issue the optimize command, Solr will automatically generate a snapshot. The snapshot should be generated well ahead of the scheduled snapshot pulling time (3am in this case). Apache load balancing is unnecessary if you have only one slave server or already use another load balancing solution.
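The cron entry in the list above chains the three scripts with semicolons, so snapinstaller still runs when snappuller fails. As a hedged alternative, the sketch below stops at the first failing step. It assumes the Solr scripts exit non-zero on failure; the function name pull_and_install, the SOLR_BIN variable, and the default of keeping 3 snapshots are illustrative choices, not part of Solr.

```shell
#!/bin/sh
# Sketch of a wrapper for the slave-side cron job.
# SOLR_BIN and the keep-count default are assumptions.
SOLR_BIN="${SOLR_BIN:-/solr/bin}"

# Pull the latest snapshot, install it, then clean up old snapshots.
# Stops at the first failing step and reports which one failed.
#   $1: directory holding the Solr scripts (defaults to $SOLR_BIN)
#   $2: number of snapshots to keep (defaults to 3)
pull_and_install() {
    bin_dir="${1:-$SOLR_BIN}"
    keep="${2:-3}"
    "$bin_dir/snappuller" || { echo "snappuller failed" >&2; return 1; }
    "$bin_dir/snapinstaller" || { echo "snapinstaller failed" >&2; return 2; }
    "$bin_dir/snapcleaner" -N "$keep" || { echo "snapcleaner failed" >&2; return 3; }
}
```

To use it from cron, you could save this in a file of your choosing, add a final line that calls `pull_and_install`, and point the 3am entry at that file instead of the three chained commands. The distinct return codes make it easier to tell from cron mail or logs which step broke.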

Reference links:

http://wiki.apache.org/solr/CollectionDistribution

http://wiki.apache.org/solr/SolrOperationsTools
