Sunday, February 13, 2011

Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

I'd love to know how Stack Overflow's tagging and search is architected, because it seems to work pretty well.

What is a good database/search model if I want to do all of the following:

  1. Storing Tags on various entities, (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
  2. Searching for items with particular tags
  3. Building a tag cloud of all tags that apply to a particular search result set
  4. How to show a tag list for each item in a search result?

Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?

I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.

  • I don't know if they qualify as optimal, but both DotNetKicks and Kigg are open source digg clone implementations. You can look at how they're doing tags and search.

    My best guesses without a lot of deliberation :)

    1. I never like the idea of serializing multiple values into a single field, so delimited strings stored in one field don't appeal to me... might work for adjacency paths with trees, but those are always ordered and tags need not be. This seems like it would tax the LIKE operator work you might do to find them.

    So my initial take is probably Entity -> EntityTag <- Tag.

    1. This approach makes finding items via Tag pretty easy, join back through EntityTag, call it a day.

    2. You need a secondary operation here to select the distinct tags for the result set. So a.) pull the result set, b.) normalize the tag space. I think you do this no matter what the answer is to #1 -- even stuffing tags into one field will still yield duplicate tags (and you have to deserialize them to perform this op--so more work, another argument for a fully relational approach).

    3. Still easy. Here's one area where the serialized approach works better. No need to join for child tags, it's right there in the Entity. That said, pulling out 0..n tags via the two table join doesn't seem too challenging to me. If you're talking perf considerations, build it normalized first then optimize via cache or denorm.

    The other option is "do both". This feels like a premature optimization, but you could do the full normalized approach to support any tag-centric operations and serialize upon persist to have a denormalized version right there in the Entity. A bit more work, some potential to fall out of synch if not fully covered, but best of both worlds if there's real limitations to the fully normalized way in your use cases.

    Lucene is interesting as well, you can declare specific metadata in the indices IIRC, so you could potentially leverage tag search this way as well. My suspicion is, if you go too far down this road, then you end up having some disconnects between what you store in the database and the index at some point. I can speak favorably about Lucene, it's very capable and easy to use--I believe .Text used it for it's search capabilities and it supported all of weblogs.asp.net prior to it switching over to Community Server. I'd stick to it for full-text search if MSSQL isn't in the picture/sufficient, solve the tag issues in the database imo.

    From Grant
  • Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

    So here I go again...

    Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.

    Regarding the OS projects recommended by @Grant:

    • *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
    • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
    • Both projects have a 3-table approach to tagging as everyone generally seems to recommend

    I also found some other questions on SO that I'd missed before:

    What I'm currently doing for each of the items I mentioned:

    1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
      • Build site-wide tag clouds
      • browse by tag (i.e. urls like SO's /questions/tagged/ASP.NET)
    2. For search I use Lucene + NHibernate.Search
      • Tags are concat'd into a TagString that is indexed by Lucene
        • So I have the full power of the Lucene query engine (AND / OR / NOT queries)
        • I can search for text and filter by tags at the same time
        • The Lucene analyzer merges words for better tag searches (i.e. a tag search for "test" will also find stuff tagged "testing")
      • Lucene returns a potentially enormous result set, which I paginate to 20 results
      • Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
      • So it's entirely possible that a search results in 0 hits to the DB
    3. Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
    4. Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.

    This means that whenever an Entity's tags are modified, I have to:

    • Insert any new Tags that do not already exist
    • Insert/Delete from the EntityTag table
    • Update Entity.TagString
    • Update the Lucene index for the Entity

    Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.

    Time will tell...

    Shawn Simon : cant upmod this post enough
    Funka : the first link in this answer ("SQL server 2005 full text search") no longer seems to work?
    Richard Collette : The updated link would probably be: http://meta.stackoverflow.com/questions/19548/what-search-engine-stackoverflow-is-using
  • Good recap, Winston, thx for following up with your approach.

    From Grant

0 comments:

Post a Comment