DataImportHandler Runs Out of Memory on Large Table

16 05 2009

One DataImportHandler(DIH) configuration people may overlook is the batchSize attribute.  If you start your JVM with enough memory to store the entire table, you won’t even need to set batchSize at all.  batchSize basically tells DIH to call setFetchSize through JDBC to bring back certain number of records at once.  If you use MySQL, you may still run out of memory even when you set the batchSize attribute.  That’s due to a limitation in MySQL’s drive where the setting is ignored.  The workaround is setting batchSize to “-1”.  This will pass Integer.MIN_VALUE to MySQL as fetch size and prevent the driver from running out of memory.

Join the forum discussion on this post - (1) Posts



Denormalized Data Structure

7 05 2009

The best data structure for effective use of faceted search in Solr is a flat (denormalized) structure. This may be contrary to a lot of application design principles, but this is an important concept to understand in order to use Solr effectively. To decide on how you store data in Solr, you have to know the kind of record(s) you want to return ultimately. Do you have multiple record types you like to return? Are there any commonalities between the record types? The main idea is knowing what a base record is and how child records associate with the base record.

Let’s look at an books inventory example:

Author
---------------------------
auth_id | author_name
--------+------------------
      1 | J.K. Rowling
      2 | Michael Crichton
      3 | J. R. R. Tolkien

Category
------------------
cat_id | category
-------+----------
     1 | Fantasy
     2 | Sci-Fi

Book
--------------------------------------------------------------------------------
id | auth_id | cat_id | release_date | title
---+---------+--------+--------------+------------------------------------------
 1 |       1 |      1 |   1999-09-01 | H. P. and the Sorcerer’s Stone
 2 |       1 |      1 |   2004-08-01 | H. P. and the Order of the Phoenix
 3 |       2 |      2 |   1991-03-01 | Jurassic Park
 4 |       2 |      2 |   2000-10-01 | Timeline
 5 |       3 |      1 |   1990-04-01 | LOTR, the Fellowship of the Ring
 6 |       3 |      1 |   1990-04-01 | LOTR, the Two Towers
 7 |       3 |      1 |   1990-04-01 | LOTR, Return of the King

It should be obvious to you that the base record in the above example is Book.  We will denormalize the above tables into a flat structure.  The final table to be fed into Solr will look like this:

-------------------------------------------------------------------------------------
id | author_name      | category | release_date | title
---+------------------+----------+--------------+------------------------------------
 1 | J.K. Rowling     | Fantasy  |   1999-09-01 | H. P. and the Sorcerer’s Stone
 2 | J.K. Rowling     | Fantasy  |   2004-08-01 | H. P. and the Order of the Phoenix
 3 | Michael Crichton | Sci-Fi   |   1991-03-01 | Jurassic Park
 4 | Michael Crichton | Sci-Fi   |   2000-10-01 | Timeline
 5 | J. R. R. Tolkien | Fantasy  |   1990-04-01 | LOTR, the Fellowship of the Ring
 6 | J. R. R. Tolkien | Fantasy  |   1990-04-01 | LOTR, the Two Towers
 7 | J. R. R. Tolkien | Fantasy  |   1990-04-01 | LOTR, Return of the King

Notice that the flatten records look more verbose now.  The redundant data is intended for Solr to build facets from.  Let’s create facets on author_name and category.  First, enable indexing on author_name and category.  Second, enable facet search in the requestHandler.  Here is a sample configuration:


author_name
category

true

100

1

After you rebuilt your index and restarted Solr, you should start seeing facets in your search results.  This is a very simple example of flattening records for faceted search.  As you add more record types to the index, things can get a little messy.  Assuming we have done so well with book search that we want to add music search as well.  Let’s flatten music records using technique we just discussed.  Here is the flatten version of music records:

-------------------------------------------------------------------
id | artist_name      | genre         | release_date | album_title
---+------------------+---------------+--------------+-------------
 1 | The Beatles      | Classic Rock  |   1969-09-26 | Abbey Road
 2 | The Doors        | Classic Rock  |   1970-08-01 | The Doors
 3 | Madonna          | Pop           |   1998-03-03 | Ray of Light
 4 | Prince           | Sci-Fi        |   1990-10-17 | Purple Rain

Notice that this table look awfully similar to the flattened version of the book records.  We can in fact reuse fields to map this new music data so a common search query can search on both book and music records.  However, we cannot just simply mix the two tables together.  That’s because doing so will lose the data type (book, music) information once the tables is merged.  We will have to introduce a new field for record type identification.  We also need to create a globally unique identifier for these records.  Here is what the final merged table look like:

--------------------------------------------------------------------------------------------------------
guid | id | type  | author_name      | category      | release_date | title
-----+----+-------+------------------+---------------+--------------+-----------------------------------
B_1  |  1 | Book  | J.K. Rowling     | Fantasy       |   1999-09-01 | H. P. and the Sorcerer’s Stone
B_2  |  2 | Book  | J.K. Rowling     | Fantasy       |   2004-08-01 | H. P. and the Order of the Phoenix
B_3  |  3 | Book  | Michael Crichton | Sci-Fi        |   1991-03-01 | Jurassic Park
B_4  |  4 | Book  | Michael Crichton | Sci-Fi        |   2000-10-01 | Timeline
B_5  |  5 | Book  | J. R. R. Tolkien | Fantasy       |   1990-04-01 | LOTR, the Fellowship of the Ring
B_6  |  6 | Book  | J. R. R. Tolkien | Fantasy       |   1990-04-01 | LOTR, the Two Towers
B_7  |  7 | Book  | J. R. R. Tolkien | Fantasy       |   1990-04-01 | LOTR, Return of the King
M_1  |  1 | Music | The Beatles      | Classic Rock  |   1969-09-26 | Abbey Road
M_2  |  2 | Music | The Doors        | Classic Rock  |   1970-08-01 | The Doors
M_3  |  3 | Music | Madonna          | Pop           |   1998-03-03 | Ray of Light
M_4  |  4 | Music | Prince           | Sci-Fi        |   1990-10-17 | Purple Rain

As you can see, when we reference more tables, the bigger the flattened table will become.  The table itself also become more human readable because you no longer have to refer to other tables to retrieve the textual representation of a key.  This is exactly what Lucene (underlying search technology of Solr) needs.  Lucene doesn’t care about normalized data structure.  Lucene only care about building indexes on top of flat records and you search against those indexes.  It’s as simple as that.  Lucene is not a database.

There are limitation to this denormalization approach though as we deal with more complex data structure such as multivalued fields.  Let’s say we want to add some sellers and their location information to our search engine where a seller can sell many books/music and the same book/music can be sold by many sellers (This is essentially a many-to-many relationship).  We also want to enable faceting on seller and seller location so user can filter by sellers and their location to locate books/music availability at certain location.  The records would look something like this when flatten.

-------------------------------------------------------------------------------------------------
guid | id | type | author_name  | category | seller          | seller_location    | title
-----+----+------+--------------+----------+-----------------+--------------------+--------------
B_1  |  1 | Book | J.K. Rowling | Fantasy  | Smith, Johnson  | Boston, New York   | ..Sorcerer..
B_2  |  2 | Book | J.K. Rowling | Fantasy  | Smith, Williams | Boston, Providence | ..Phoenix
B_3  |  3 | Book | M. Crichton  | Sci-Fi   | Smith, Wilson   | Boston, Austin     | Jurassic..
B_4  |  4 | Book | M. Crichton  | Sci-Fi   | Smith, Johnson  | Boston, New York   | Timeline

Notice that each book has 2 sellers and 2 seller locations.  Look more carefully and you will see that the relationship between seller and seller location is not there anymore.  We can use their position as a hint to map between seller and seller location but position is not very reliable so this piece of information is considered lost.  Besides, the facet values will be incorrect when user start filtering by seller or seller location.  Here is an example to illustrate the point:

  • Apply seller -> Smith as filter
  • seller_location facet values would return Boston, New York, Providence, Austin

From Solr/Lucene’s point of view, this is a correct result because it’s basically returning possible seller location values based on records found by the filter.  From user’s point of view, the results is totaly wrong, because seller Smith does not have locations in New York, Providence or Austin.

So, how do we fix this?  Remember I mentioned earlier that Lucene is not a database.  It really isn’t, so a join query is out of the question.  Normally, when we deal with such query, we would do a inner join between seller and book tables and filter on the seller field.  The results will be a list of books with the associating seller location.  In this case, seller_location should return Boston on every records.  It is apparent that the only way to perform this query is by keeping the book and seller tables separate.  To make this query possible in Solr/Lucene, we need to workaround Lucene’s limitation.  We will need additional record type and we will need to run multiple queries.  First, we need to keep the seller table separate (as separate records) because it let us maintain relationship between seller and seller_location.  Second, we need additional column in seller records to map to book/music records to make filtering book/music by seller possible.  The seller records would look like this:

---------------------------------------------------------------------
guid | id | type   | seller   | seller_location | product_guid
-----+----+--------+----------+-----------------+--------------------
S_1  |  1 | Seller | Smith    | Boston          | B_1, B_2, B_3, B_4
S_2  |  2 | Seller | Williams | Providence      | B_2
S_3  |  3 | Seller | Wilson   | Austin          | B_3
S_4  |  4 | Seller | Johnson  | New York        | B_4

For this query to work, we will need to query Solr twice.  The first query will filter on seller -> Smith, type -> Seller.  The facet values on seller_location should return Boston only.  Keep this facet values so you can display it as part of the results.  Then, we need to find all the facet values in product_guid (assuming facet is already enabled on this field) to compose the second query to query against Book and Music records.  The second query will return all associating Book and Music records of seller Smith.  Now you have the right Book and Music records and correct facet value on seller_location.

Solr/Lucene may not be the best solution when dealing with many-to-many relationships.  A database would do a much better job at joining tables.  If it becomes unavoidable and you can only use Solr/Lucene, give this workaround a try and let me know how things go.  If you have a better solution than this, let me know as well.  I like to hear how people solve similar challenge when using Solr/Lucene.

Join the forum discussion on this post - (1) Posts



Getting Client IP in Apache/Tomcat

19 03 2009

This is an issue that just keep coming back to me but I always tend to forget how it’s been resolved before.  I would have to research into the issue every single time.  The problem is that request.getRemoteAddr() doesn’t always return the client IP in Tomcat (or any app servers) when it’s sitting behind a proxy.  In my case, Tomcat is behind Apache’s balancer.  When you call request.getRemoteAddr(), it would return the Apache server’s IP instead of the client IP.  

To get the actual client IP, you need to pull it from the header.  Apache sets a header variable called “x-forwarded-for” with the client IP.  You can simply grab it from the header:

String ip = request.getHeader("x-forwarded-for");

This is just one of the thing I found annoying. I can understand why request.getRemoteAddr() doesn’t always return client IP. I’m hoping eventually Tomcat will have a reliable method to return a proper client IP.

Join the forum discussion on this post - (1) Posts