28 Jan 2009
Here is another odd issue I ran into and haven’t found a good solution for yet. For now I can only offer a not-so-elegant workaround. Let me know if you have a better solution so I can post it here.
According to Solr’s documentation on Tomcat configuration, you can use JNDI to initialize the solr/home variable.
When I tried the same configuration on CentOS, Solr would not start. In the error log, I found:
INFO: No /solr/home in JNDI
This is basically saying that Solr couldn’t find solr/home through JNDI. Although solr/home is set in the Environment tag, for some odd reason Solr does not see it in JNDI. Perhaps my Tomcat install is broken, or Solr may be using some prefix or suffix to find this JNDI entry. I haven’t had time to look into Solr’s source code to see whether it really looks up solr/home through JNDI, which is why I said I don’t have an elegant solution to this issue. My workaround is to pass a system property to the JVM via the “-D” parameter: modify catalina.sh (or catalina.bat on Windows) and add the following to JAVA_OPTS.
JAVA_OPTS="... -Dsolr.solr.home=/my/solr/home"
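To make the two mechanisms concrete, here is a minimal sketch of a lookup that tries JNDI first and then falls back to the system property set above. It is only an illustration, not Solr’s actual code: the class and method names are hypothetical, and the java:comp/env/ prefix is the usual place a servlet container exposes Environment entries, which I am assuming applies here.

```java
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// Hypothetical helper for illustration; not part of Solr.
class SolrHomeLookup {

    // Returns the Solr home directory, or null if neither source defines it.
    static String resolveSolrHome() {
        try {
            // Assumption: container-defined Environment entries live under java:comp/env/
            Context ctx = new InitialContext();
            Object value = ctx.lookup("java:comp/env/solr/home");
            if (value != null) {
                return value.toString();
            }
        } catch (NamingException e) {
            // No such JNDI entry, or no JNDI provider at all: fall through.
        }
        // Fallback: the property passed as -Dsolr.solr.home=/my/solr/home
        return System.getProperty("solr.solr.home");
    }

    public static void main(String[] args) {
        System.out.println("solr home = " + resolveSolrHome());
    }
}
```

Run outside a servlet container, the JNDI lookup fails and the method returns whatever -Dsolr.solr.home was set to, which is the behavior the workaround relies on.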
Categories : Tips
28 Jan 2009
This is a common issue when you start indexing large amounts of text, especially text that originates from web crawling. You may not encounter it if the content has already been filtered by a data-cleansing library of some sort. However, if you build your own XML for Solr, you will likely hit this illegal character issue sooner or later. The cause is usually control characters embedded in the text being indexed. I loosely call them “non-UTF8” here, although strictly speaking they encode fine in UTF-8 and are simply not allowed in XML 1.0. It is usually catastrophic when it happens because Solr (or the underlying Lucene) does not handle the illegal character exception, and the Solr server would crash. The solution is to either remove or replace the invalid characters. You can use my sample code (in Java) below to strip these control characters.
Here is what you will see if you encounter the illegal character issue:
Jan 28, 2009 6:22:06 PM org.apache.solr.common.SolrException log
SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 25))
at [row,col {unknown-source}]: [718,22]
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
Sample code:
public static String stripNonUTF8Chars(String string) {
    if (string == null) return null;
    // A local buffer needs no synchronization, so StringBuilder is enough.
    StringBuilder sb = new StringBuilder(string.length());
    for (int i = 0; i < string.length(); i++) {
        char c = string.charAt(i);
        // Control characters below U+0020 are illegal in XML 1.0,
        // except tab (0x09), line feed (0x0A) and carriage return (0x0D).
        boolean illegalControl = c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
        // U+FFFE and U+FFFF are not allowed either.
        boolean nonCharacter = c == 0xFFFE || c == 0xFFFF;
        if (!illegalControl && !nonCharacter) {
            sb.append(c);
        }
    }
    return sb.toString();
}
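The sample above checks one char at a time, so it does not look at surrogate pairs. As a hedged alternative, here is a sketch that walks the string by code point and keeps only what the XML 1.0 Char production allows, which also drops unpaired surrogates. The class and method names are hypothetical and this is a variation on my sample, not something Solr provides.

```java
// Hypothetical helper for illustration; a code-point-based variant of the sample.
class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text with every code point that XML 1.0 forbids removed.
    static String stripInvalidXmlChars(String text) {
        if (text == null) return null;
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // reads a full surrogate pair when present
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x19 is the control character (code 25) from the stack trace above.
        String dirty = "bad" + (char) 0x19 + "char";
        System.out.println(stripInvalidXmlChars(dirty)); // prints "badchar"
    }
}
```

Call it on every field value before you write the value into the XML you post to Solr.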
Categories : Tips
19 Jan 2009
Many of you may have seen this exception when trying out Solr’s range search query based on Solr’s documentation.
HTTP Status 400 - org.apache.lucene.queryParser.ParseException:
Cannot parse 'rank:[1 to 10]': Encountered "10" at line 1, column 11.
Was expecting: "]" ...
This exception is saying that Lucene’s query parser does not recognize a third value “10” in the range filter query. The parser only expects two values. It’s kind of confusing because Solr’s documentation gives an example of “field:[1 to 100]”. Interestingly, Solr’s and Lucene’s documentation agree on the range search query format:
http://wiki.apache.org/solr/SolrQuerySyntax
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Range Searches
As you can see, they both describe the range search query syntax as <field>:[<from> to <to>]. However, this syntax did not work for me (at least in Solr 1.3). The syntax that worked is <field>:[<from> <to>]. Notice that the keyword “to” between from and to is not needed. For example:
- Return results with value in field “rank” between 1 and 100: rank:[1 100]
- Return results with value in field “rank” less than or equal to 10: rank:[* 10]
- Return results with value in field “rank” greater than or equal to 10: rank:[10 *]
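If you build these queries in code, a tiny helper keeps the format in one place. This is a minimal sketch with hypothetical names. It follows the form described in this post (no keyword between the two bounds), and it assumes that “*” marks an open-ended bound.

```java
// Hypothetical helper for illustration; builds range queries in the form used in this post.
class RangeQueries {

    // A null bound means "open-ended" and is written as "*" (an assumption of this sketch).
    static String range(String field, Integer from, Integer to) {
        String lower = (from == null) ? "*" : from.toString();
        String upper = (to == null) ? "*" : to.toString();
        return field + ":[" + lower + " " + upper + "]";
    }

    public static void main(String[] args) {
        System.out.println(range("rank", 1, 100));  // rank:[1 100]
        System.out.println(range("rank", null, 10)); // rank:[* 10]
        System.out.println(range("rank", 10, null)); // rank:[10 *]
    }
}
```

Remember to URL-encode the resulting string before putting it into the q parameter of a request.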
As I write this post, Solr’s documentation was last updated on 2008-10-20. Hopefully, the documentation will be corrected soon. There is also a possibility that the “to” keyword will be supported in future versions of Solr and Lucene. For now, omit the “to”. (Lucene’s query parser keywords are case-sensitive, so the uppercase form, rank:[1 TO 10], is also worth trying.)
Categories : Bugs