Saturday, December 27, 2008

The Grown-Up Purity Test!

During the company ski trip last year, we somehow found ourselves taking purity tests, something we hadn't done since college. Much fun was had (at least until Akshay turned off the Internet in a fit of rage, and we had to score the tests using the Firebug console). Many, many drinks later, we somehow decided that what the world really needed was a hip web-2.0ish purity test, with keyboard shortcuts and big fonts and statistics comparing you to other test takers and fewer "have you ever held hands with a MOS?" type questions. So I registered for The Grown-Up Purity Test (pronounced tee-gupt) and figured it couldn't take more than a weekend of hacking to do.

Since it's now a year later, it's clear I severely underestimated what was involved. Writing the questions was surprisingly hard, and took a lot of feedback and help from various folks, especially David. I'm still not 100% happy with them, but this project has dragged out much longer than anything this frivolous and puerile ever should.

<technical details>
Also tricky was figuring out how to work around the limitations of the App Engine datastore. In a relational database, it would be pretty easy to go from normalized data to selecting the mean score for some demographic group. But App Engine doesn't offer aggregation functions, doesn't do joins, and doesn't fetch more than 1,000 rows per query. I ended up doing several writes on each answered question and each finished test, keeping running aggregates up to date as answers came in. If a 27-year-old straight male answers yes to Question 1, I write the following:

Question 1, Yes: 1, Total: 1
Question 1, Gender: male, Orientation: straight, Yes: 1, Total: 1
Question 1, Gender: male, Orientation: straight, Age: 27, Yes: 1, Total: 1
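
For the curious, here is a minimal sketch of that write pattern. It is not the actual TGUPT code: a plain in-memory dict stands in for the datastore, and the names `counters`, `aggregate_keys`, and `record_answer` are made up for illustration. In the real thing each counter update would be an atomic datastore write.

```python
from collections import defaultdict

# Stand-in for the datastore (an assumption for this sketch):
# each key maps to a running {"yes": ..., "total": ...} aggregate.
counters = defaultdict(lambda: {"yes": 0, "total": 0})


def aggregate_keys(question, gender, orientation, age):
    """Return the three aggregate keys touched by one answer."""
    return [
        (question,),                           # everyone
        (question, gender, orientation),       # gender + orientation
        (question, gender, orientation, age),  # gender + orientation + age
    ]


def record_answer(question, gender, orientation, age, answered_yes):
    """Bump the Yes/Total counters for every aggregate this answer belongs to."""
    for key in aggregate_keys(question, gender, orientation, age):
        row = counters[key]
        row["total"] += 1
        if answered_yes:
            row["yes"] += 1


# A 27-year-old straight male answers yes to Question 1.
record_answer(1, "male", "straight", 27, True)
```

Reading a statistic back is then a single lookup by key, with no aggregation query needed.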

Initially, I thought I would have to store all combinations of gender, orientation, and age as separate aggregates to future-proof myself for any graphs I might want to make. This was really slow, especially with atomic writes. Then I realized that for attributes with few possible values, like gender and orientation, I could fetch, say, both the male and female values from the datastore and combine them in the application code to create a gender-neutral statistic as needed. For overall scores, I updated both a global mean as well as a set of score buckets in order to make histograms easier to generate. There's more discussion of these sorts of solutions here and here. You can check out the TGUPT code at
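
The combine-on-read trick and the score buckets can be sketched roughly as follows. Again, this is an illustration rather than the real TGUPT code: the dict layout, the function names, and the bucket width of 10 are all assumptions.

```python
BUCKET_WIDTH = 10  # hypothetical bucket size for the score histogram


def combine(rows):
    """Merge several Yes/Total aggregates (e.g. male + female) into one."""
    yes = sum(row["yes"] for row in rows)
    total = sum(row["total"] for row in rows)
    fraction = yes / total if total else None
    return {"yes": yes, "total": total, "fraction": fraction}


def new_score_stats():
    """Empty running statistics for overall test scores."""
    return {"count": 0, "sum": 0, "buckets": {}}


def record_score(stats, score):
    """Update the running sum/count and the histogram bucket for one score."""
    stats["count"] += 1
    stats["sum"] += score
    bucket = (score // BUCKET_WIDTH) * BUCKET_WIDTH  # lower edge of the bucket
    stats["buckets"][bucket] = stats["buckets"].get(bucket, 0) + 1


def mean_score(stats):
    """Global mean derived from the running sum and count."""
    return stats["sum"] / stats["count"] if stats["count"] else None


# Gender-neutral statistic built from two stored aggregates.
male = {"yes": 3, "total": 4}
female = {"yes": 1, "total": 4}
everyone = combine([male, female])

# Overall scores feed both the mean and the histogram buckets.
stats = new_score_stats()
for score in (42, 47, 63):
    record_score(stats, score)
```

Storing a sum and a count instead of the mean itself keeps each update to a simple increment, and the mean falls out with one division at read time.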
</technical details>

Anyway, go take the test and let me know what you think.