Update with Invalid Control Character Crashes Solr

28 01 2009

This is a common issue when you start indexing large amounts of text, especially text that originates from web crawling. You may not run into it if the content has already been filtered by a data cleansing library of some sort. However, if you build your own XML for Solr, you are likely to hit this illegal character issue sooner or later. The usual cause is a control character embedded in the text being indexed. Such characters can be encoded in UTF-8, but most of them are not legal in XML 1.0, so the XML parser rejects the document. It’s usually catastrophic when it happens because Solr (or the underlying technology, Lucene) does not handle the illegal character exception, and the Solr server would crash. The solution is to either remove or replace the invalid characters. You can use my sample code (in Java) below to strip these control characters.

Here is what you will see in the log if you encounter the illegal character issue:

Jan 28, 2009 6:22:06 PM org.apache.solr.common.SolrException log
SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 25))
 at [row,col {unknown-source}]: [718,22]
        at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
        at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
        at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
        at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
        at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
        at java.lang.Thread.run(Thread.java:619)
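The code reported in the log appears to be the decimal value of the offending character, so code 25 would be U+0019, one of the control characters that XML 1.0 does not allow. To find out which field value carries such a character before it reaches Solr, a small check along the following lines may help. This is only a sketch: the class name IllegalCharFinder and the method indexOfFirstIllegalChar are hypothetical names chosen for illustration, not part of Solr or Woodstox, and the check covers only the same set of characters as the sample code in this post.

```java
public class IllegalCharFinder {

    // Hypothetical helper: returns the index of the first character that
    // XML 1.0 forbids (C0 controls other than tab, line feed and carriage
    // return, plus U+FFFE and U+FFFF), or -1 if the text has none.
    public static int indexOfFirstIllegalChar(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean illegalControl = c < 0x20 && c != '\t' && c != '\n' && c != '\r';
            if (illegalControl || c == 0xFFFE || c == 0xFFFF) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up field value containing the character from the log (code 25).
        String fieldValue = "price" + (char) 25 + "list";
        int pos = indexOfFirstIllegalChar(fieldValue);
        if (pos >= 0) {
            System.out.println("Illegal character code "
                    + (int) fieldValue.charAt(pos) + " at index " + pos);
        }
    }
}
```

Logging the document ID together with the index makes it easier to trace a bad character back to the crawled source.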

Sample code:

public static String stripNonUTF8Chars(String string) {
    if (string == null) return null;
    // Despite the method name, the characters removed here are valid in UTF-8.
    // They are stripped because XML 1.0 does not allow them.
    StringBuilder sb = new StringBuilder(string.length());
    for (int i = 0; i < string.length(); i++) {
        char c = string.charAt(i);
        // Control characters below U+0020, except tab (U+0009),
        // line feed (U+000A) and carriage return (U+000D), which XML allows.
        boolean illegalControl = c < 0x20 && c != '\t' && c != '\n' && c != '\r';
        // U+FFFE and U+FFFF are also rejected by XML parsers.
        boolean nonCharacter = c == 0xFFFE || c == 0xFFFF;
        if (!illegalControl && !nonCharacter) {
            sb.append(c);
        }
    }
    return sb.toString();
}
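The method above removes the offending characters. The other option mentioned earlier is to replace them, which keeps words from running together when a control character was acting as a separator. A minimal sketch of that variant, using a regular expression instead of a loop, could look like this. The class name XmlCharCleaner, the method replaceInvalidXmlChars and the choice of a space as the replacement are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlCharCleaner {

    // Same character set as the stripping method: C0 controls other than
    // tab, line feed and carriage return, plus U+FFFE and U+FFFF.
    private static final Pattern INVALID_XML_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\uFFFE\\uFFFF]");

    // Hypothetical helper: replaces each invalid character with the given text.
    public static String replaceInvalidXmlChars(String text, String replacement) {
        if (text == null) return null;
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return INVALID_XML_CHARS.matcher(text)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String dirty = "foo" + (char) 0x19 + "bar";
        System.out.println(replaceInvalidXmlChars(dirty, " ")); // prints "foo bar"
    }
}
```

Either method should be applied to each field value before it is escaped and written into the XML sent to Solr.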

