Saturday, February 5, 2011

Determining distribution so I can generate test data

I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution. From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?

Thanks!

  • While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis on underlying population distribution) as follows.

    You need a file structure that's rapidly searchable for "highest entry with key <= X" -- Sleepycat's Berkeley database has a btree structure for that, for example; SQLite is even easier though maybe not quite as fast (but with an index on the key it should be OK).

    Put your data in the form of pairs, sorted by increasing value, where the key is the cumulative count of all the pairs before that one (so the first key is 0). Call N the total of all the counts.

    To generate a random value that follows exactly the same distribution as the sample, generate a random integer X with 0 <= X < N, look it up in that file structure with the mentioned "highest that's <=", and use the corresponding value. Each value then owns a range of exactly count consecutive integers, so it comes up with probability count/N.

    Not sure how to do all this in R. In your shoes I'd try a Python/R bridge, doing the logic and control in Python and only the statistics in R itself, but that's a personal choice!
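
    As a minimal sketch of the lookup described above, here is one way it might go in Python with the standard sqlite3 module. The table name cdf, its column names, and the helpers build_table, lookup, and draw are made up for illustration; each key is taken to be the running total of the counts before that row, and X is drawn from 0 up to (but not including) the total count.

```python
import random
import sqlite3


def build_table(conn, pairs):
    """Store each value under the cumulative count of all earlier pairs.

    Returns the total count, which is the exclusive upper bound for X.
    """
    conn.execute("CREATE TABLE cdf (cum_start INTEGER PRIMARY KEY, val)")
    total = 0
    for value, count in sorted(pairs):  # sorted by increasing value
        if count <= 0:
            continue  # a value that never occurred gets no range
        conn.execute("INSERT INTO cdf VALUES (?, ?)", (total, value))
        total += count
    conn.commit()
    return total


def lookup(conn, x):
    """Return the value stored under the highest key that is <= x."""
    row = conn.execute(
        "SELECT val FROM cdf WHERE cum_start <= ? "
        "ORDER BY cum_start DESC LIMIT 1",
        (x,),
    ).fetchone()
    return row[0]


def draw(conn, total, rng=random):
    """Draw one value with probability proportional to its count."""
    return lookup(conn, rng.randrange(total))


# Tiny made-up sample: "a" seen 3 times, "b" once, "c" 6 times.
pairs = [("a", 3), ("b", 1), ("c", 6)]
conn = sqlite3.connect(":memory:")
total = build_table(conn, pairs)
sample = [draw(conn, total) for _ in range(5)]
```

    For 100M pairs you would presumably point sqlite3.connect() at a file on disk instead of ":memory:" and insert in batches; declaring the key as INTEGER PRIMARY KEY is what lets SQLite answer the "highest key <= X" query from its b-tree rather than by scanning.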

    Jaime : +1 As simple as beautiful: why constrain yourself to ideal representations, when a computer allows you to have reality itself?
  • To see whether you have a real power law distribution, make a log-log plot of frequencies and see whether they line up roughly on a straight line. If you do have a straight line, you might want to read this article on the Pareto distribution for more on how to describe your data.
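
    A numeric companion to eyeballing the plot, as a rough sketch in Python: fit a least-squares line through (log value, log count) and look at the slope and R^2. The helper name loglog_fit and the demo data are made up for illustration, and the sketch assumes the values are positive numbers. A good straight-line fit is consistent with a power law but does not prove one.

```python
import math


def loglog_fit(pairs):
    """Least-squares line through (log value, log count).

    Returns (slope, intercept, r_squared). Pairs with a non-positive
    value or count are skipped because their logarithm is undefined.
    """
    pts = [(math.log(v), math.log(c)) for v, c in pairs if v > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Made-up data that follows count = 1e6 * value**-2 exactly.
demo = [(v, 1e6 * v ** -2.0) for v in range(1, 101)]
slope, intercept, r_squared = loglog_fit(demo)
```

    On real data the slope estimates the exponent of the power law, and an R^2 well below 1 (or a visibly curved plot) suggests some other distribution.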

  • Hi twk -

    I'm assuming that you're interested in understanding the distribution over your categorical values.

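    The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts. The counts can be passed as-is through the prob argument: sample() treats them as weights, so they need not sum to one.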

    To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:

    ## Affiliations and the number of voters registered under each
    affils <- as.factor(c('democrat', 'republican', 'independent'))
    counts <- c(552, 431, 27)
    ## Simulate 20 new voters, sampling from affiliation distribution
    new.voters <- sample(affils, 20, replace = TRUE, prob = counts)
    ## Tally the simulated voters by affiliation
    new.counts <- table(new.voters)
    

    In practice, you will probably bring in your 100M rows of values and counts using R's read.csv() function. Assuming you've got a tab-separated file with a header line "values\tcounts", that code might look something like this:

    dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
    new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)
    

    One caveat: as you may know, R keeps all of its objects in memory, so be sure you've got enough memory free for 100M rows of data (storing character strings as factors will help reduce the footprint).
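
    A rough Python counterpart to the sample() call, as a sketch: random.choices() from the standard library also accepts raw counts as weights. The file layout (tab-separated, with a header line naming the columns values and counts), the helper names read_pairs and simulate, and the inline demo data standing in for the real file are all assumptions.

```python
import csv
import io
import random


def read_pairs(handle):
    """Read tab-separated rows with header columns 'values' and 'counts'."""
    reader = csv.DictReader(handle, delimiter="\t")
    values, counts = [], []
    for row in reader:
        values.append(row["values"])
        counts.append(int(row["counts"]))
    return values, counts


def simulate(values, counts, k, rng=random):
    """Draw k values with replacement, weighted by their counts."""
    return rng.choices(values, weights=counts, k=k)


# Inline stand-in for the real file; use open(path, newline="") for a file.
demo_file = io.StringIO(
    "values\tcounts\n"
    "democrat\t552\n"
    "republican\t431\n"
    "independent\t27\n"
)
values, counts = read_pairs(demo_file)
new_voters = simulate(values, counts, 20, random.Random(0))
```

    The same memory caveat applies here: both lists are held in memory, so 100M rows will need a correspondingly large machine.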

    From dataspora
