Tuesday, March 15, 2011

Another ID generation queston

I'm trying to generate unique IDs for use in a Google App Engine application and would like feedback on the feasibility of the approach I'm thinking of using (questions at the end). I've read quite a few questions on this topic, but I don't remember coming across this particular approach.

I'd like random-looking IDs, e.g., MD5 hashes, but I also want them to be small. Four to six characters, along the lines of tinyurl, would be ideal. The IDs will be for user-generated content, in the context of my application, things like test questions that people will be writing. It's not necessary that the IDs be random-looking (it's fine if they look like serial IDs), but the approach I'm thinking of using lends itself to this, so it's not really an issue.

People familiar with Google App Engine will know that writes to the data store are particularly expensive and can result in timeouts if there are too many of them to the same entity group. Sharded counters are an approach that is often used to avoid write contention on a single global counter and the failed transactions that go with it.

Along with getting short IDs and avoiding write contention, I'm trying to avoid the birthday paradox. I would like to prepare for the possibility of there being millions of IDs, even if this is going overboard a little.

I was thinking of using a sharded counter along the following lines:

  • The counter is sharded on users, so that there is a shard for each user. Each counter object has its own count that is specific to a given user, which is incremented when a new item is created by that user. The count is incremented regardless of whether an item is successfully created.
  • The basis of an ID is an MD5 hash of the following string: "<user-email-address>|<latest-counter-value>".
  • The resulting MD5 hash is then truncated, initially to four characters.
  • A single global "length" value is maintained. Whenever the previous steps result in a duplicate key (one imagines this will happen fairly quickly at first), the value of length will be increased by one. MD5 hashes for new IDs will now be truncated at "length" characters, rather than four characters.
  • I don't want to expose the user email address, which suggests that a hash of some kind would be a good way to go.

My questions are: Am I right in thinking that this will largely avoid write contention as a result of duplicate keys and that write contention on the length field would probably not be an issue, especially at longer lengths? Can anyone describe the mathematics involved here? Would the length quickly increase to near the length of an MD5 hash, calling into question the value of the whole approach? Would it be better just to go with the full (longer) MD5 hash in order to keep things easier to maintain? Is there anything that I'm overlooking?

From stackoverflow
  • How about something like this:

    1. If you want 4 character keys using a-zA-Z0-9, then you'd have: 62^4 = > 14 million possible values.

    2. Break this up into N partitions: 0000 ... 1000, 1001 ... 2000 , ... , ZZAA ... ZZZZ

      Each partition is represented by an entity with: start id end id current id

    Now, to generate an ID:

    1. randomly pick a partition. Use the start key for each partition as a key. Start Transaction:
    2. get_or_create() that partition entity.
    3. if current id == end id, go to step 1
    4. save current id
    5. increment current id by one End Transaction

    use the current id that was saved as your id.

    If you pick N as 140 you each partition would have 100,000 values. This would allow quite a few concurrent inserts while limiting the number of retries due to picking an "empty" partition.

    You might need to think about this more. Especially, how will you handle the case when you need to move to 5 or 6 digit keys?

    -Dave

    Eric W. : Thanks for the interesting approach. I'll give it some thought and try to better understand it. One question I have is how much it is how much it would result in collisions (or retries) as the number of keys grows large. I'm trying to keep collisions to as close to zero as possible.
    Dave : You will only run into collisions when the partitions fill up.
    Dave : There are other optimizations you can do with this: 1. Memcache a list of "filled partitions" 2. If you are going to get a bunch of ids at once, then you can grab a block of n ids from a partition and then increment its counter by that value.
    Eric W. : What happens when two server instances request a new id from the same partition at the same time? Would they get the same id?
    Dave : No, because you'll access the partition in a transaction. So if two try at the same time one of them will fail and retry.
    mcherm : As I read the question I was about to suggest this approach, but you beat me to it. I recommend this solution. Among other things, it should result in shorter IDs than the original poster's suggestion.
  • Just to add some hard numbers to the question above, I've implemented a small program to generate ids along the lines mentioned in the question, and these were the results of one of the trial runs:

    Length     Count      MD5             Base 62
    4          400        3d0e            446
    5          925        4fc04           1Myi
    6          2434       0e9368          40V6
    7          29155      e406d96         GBFiA
    8          130615     7ba787c8        2GOiCm
    9          403040     75525d163       YNKjL9
    10         1302992    e1b3581f52      H47JAIs
    Total:     1869561
    

    Each time the length of the hash increased, this was due to a collision, so there were six collisions in this instance. The count is the number of keys of a given length that were generated before a collision. The MD5 column shows the last truncated key that was successfully added before there was a duplicate key error. The column on the far right shows the key in base 62 (if I've done the conversion correctly).

    It looks like more and more keys are being generated with each additional character, which is what you would imagine. I'm hopeful that this approach will scale for user-generated content.

    For anyone who's interested, here's how I got the base 62 representation for the last column:

    def base_62_encode(input):
        "Inspired by code at http://en.wikipedia.org/wiki/Base_36."
        CLIST="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        rv = ""
        while input != 0:
            rv = CLIST[input % 62] + str(rv)
            input /= 62
        return rv
    
    base62_id = base_62_encode(long(truncated_hash, 16))
    


    (Added later on:)

    Here is a class that takes care of several things related to the generation of these IDs in the context of Google App Engine. By default, keys that are generated by this class are not purely base 62, since the first character of a GAE key name must be alphabetic. That requirement has been addressed here by using base 52 for the first character.

    The primary base can be changed to something other than 62 by altering the "clist" value that has been passed into (or omitted from) the constructor. One might want to remove characters that are easy to get mixed up, for example, "1", "l", "i", etc.

    Usage:

    keygen = LengthBackoffIdGenerator(SomeClass, initial_length=5)
    small_id, modified = keygen.small_id(seed_value_from_sharded_counter)
    

    Here is the class:

    class LengthBackoffIdGenerator(object):
        """Class to generate ids along the lines of tinyurl.
    
        By default, ids begin with a alphabetic character.  Each id that is created is checked against previous ones
        to ensure uniqueness.  When a duplicate is generated, the length of the seed value used is increased by one
        character.
    
        Ids become gradually longer over time while remaining relatively short, and there are very few retries in 
        relation to the number of keys generated.
        """
        ids = set()
    
        def __init__(self, cls, clist='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', alpha_first=False,
                initial_length=5, check_db=False):
            self.clist = clist
            self.alpha_first = alpha_first
            if self.alpha_first:
                import re
                alpha_list = re.sub(r'\d', '', clist)
                if len(alpha_list) < 1:
                    raise Exception, 'At least one non-numeric character is required.'
                self.alpha_list = alpha_list
            self.cls = cls
            self.length = initial_length
            self.check_db = check_db
    
        def divide_long_and_convert(self, radix, clist, input, rv):
            "Inspired by code at http://en.wikipedia.org/wiki/Base_36."
            rv += clist[input % radix]
            input /= radix
            return (input, rv)
    
        def convert_long(self, input):
            rv = ""
            if self.alpha_first:
                input, rv = self.divide_long_and_convert(len(self.alpha_list), self.alpha_list, input, rv)
            while input != 0:
                input, rv = self.divide_long_and_convert(len(self.clist),      self.clist,      input, rv)
            return rv
    
        def hash_truncate_and_binify(self, input, length):
            """Compute the MD5 hash of an input string, truncate it and convert it to a long.
    
            input  - A seed value.
            length - The length to truncate the MD5 hash to.
            """
            from assessment.core.functions import md5_hash
            input = md5_hash(input)[0:length]
            return long(input, 16)
    
        def _small_id(self, input):
            """Create an ID that looks similar to a tinyurl ID.
    
            The first letter of the ID will be non-numeric by default.
            """
            return self.convert_long(self.hash_truncate_and_binify(input, self.length))
    
        def small_id(self, seed):
            key_name = self._small_id(seed)
            increased_length = False
            if self.check_db and not self.cls.get_by_key_name(key_name) is None:
                self.ids.add(key_name)
            if key_name in self.ids:
                self.length += 1
                increased_length = True
                key_name = self._small_id(seed)
            self.ids.add(key_name)
            return (key_name, increased_length)
    
  • This algorithm is smart and would indeed minimize write operations. So the length value will converge to a balance between shortest length and absence of collisions.

    You can relax the constrain of absence of collision by using strategies used in hash tables. Try some other unique IDs before falling back to incrementing the length.

    I would thus suggest you add a test counter to the end of the hashed string initialized to 0. If the generated ID is already used, increment the counter and retry until a maximum counter value is reached after what you increase the length.

    You will end up with a more efficient use of your ID address space and much less frequent length increments.

    Regarding your question about the length limit of MD5 I think the choice of MD5 is an overkill. You don't really need a cryptographic (pseudo)secure hash. What you need is a random bit generator for which you can use crc32 (or adler which is faster). Especially if the code is to be run on a mobile phone. To implement a random bit generator with crc32, prepend a 32bits value to the string to hash and initialize it with a constant value of your choice (the seed). Then compute the crc32 on this string. If you need more bytes, write the obtained crc32 value in the 32bits in front of the string and recompute the crc32. Iterate until you have enough bits. You may replace the crc32 with the algorithm of your choice. This yields a cheap and fast random bit generator where the initial constant is the seed.

    With this algorithm you minimize the number of random bits generated and also have no upper limit in length.

    Regarding encoding, you didn't specify the constrains. Can you use upper and lower case letters with the digits ? Your example uses an alphabet of 36 different ASCII values. Once you have the pseudo random number generator described above that can generate as many bytes as desired, simply define length to be the number of letters of your ID and pick the next letter with a modulo of the next random byte. You'll know then exactly how many bytes to generate in one go and generating the ID is trivial.

0 comments:

Post a Comment