Posted on 15 Apr 2013 in Python PostgreSQL MongoDB
I recently came across a few fun coding challenges. I won't name the site or the challenge so as not to spoil the fun for others. This post's purpose is to document one such challenge and some things I came across. If you have style or performance tips, please let me know.
Update 2 (4/17/2013): I was doing the sub-query method the stupid way; after adjusting it to be more like the temp table method, it is now the fastest.
The gist of the challenge is, given a dataset (as a tsv file compressed with bz2) of 100,000 networks across the globe, such as:
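A few made-up rows in roughly that shape (the real columns and values differ; the column names here are my guess based on the fields used later in the post):

```python
# Purely illustrative rows; not the actual data from the challenge file.
sample_rows = [
    ("192.0.2.0/24",    "US", "CA", "San Francisco"),
    ("198.51.100.0/24", "US", "NY", "New York"),
    ("203.0.113.0/24",  "DE", "BE", "Berlin"),
]
```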
A hypothetical customer is only interested in a certain part of this data:
After recently taking 10gen's Mongo101 course, my first thought went straight to the MongoDB aggregation framework: $match, $group, $count, $unwind, $match, $skip, $limit. I also thought of Underscore, which I had recently used on a project. Underscore has some nice tools (countBy, groupBy) and works great in the browser and on node. What really got me thinking, though, were two tools I have recently been learning: Python and PostgreSQL.
The first thing I wanted to do was to group the networks by the three identifying location fields (country, region, city). When using GROUP BY, though, the rows are flattened. We need something similar to MongoDB's $unwind. PostgreSQL's window functions to the rescue (update: a sub-query and JOIN turned out to be the real rescue)! A window function basically allows you to run aggregate functions without squashing the original rows.
Below is the resulting window function query. Starting on line 3 is the initial sub-query/window function to get the counts by region. On line one we make sure the networks are distinct. Following the sub-query, we add our filters.
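Roughly, the shape is something like this (run through psycopg2; the table and column names are generic stand-ins and the filter is just a placeholder, so this is a sketch of the idea rather than the exact query):

```python
import psycopg2

conn = psycopg2.connect(dbname="no_belgium")
cur = conn.cursor()
# The window function attaches a per-location count to every row without
# collapsing the rows the way GROUP BY would; DISTINCT then de-duplicates
# the networks before the filter, ordering, and offset are applied.
cur.execute("""
    SELECT DISTINCT network, location_count
    FROM (
        SELECT network, country, region, city,
               count(*) OVER (PARTITION BY country, region, city) AS location_count
        FROM networks
    ) AS counted
    WHERE country = %s            -- placeholder filter
    ORDER BY network
    LIMIT 1 OFFSET 9999
""", ("US",))
print(cur.fetchone())
```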
A gotcha on this was that we need to sort the results using the C locale. For me on Ubuntu 12.10 and PostgreSQL 9.1, all new databases default to the system collate and ctype, in this case en_US.UTF-8. The quickest way to remedy this was to create a new database using createdb -T template0 --lc-collate=C --lc-ctype=C -E UNICODE no_belgium and re-import the data. I came across other suggestions for this, but none worked until I tried the re-import.
I was also curious how this query might perform without using a window function. The first thought I had was to use a sub-query to count the networks by location, then join it back onto the un-grouped networks to preserve the individual network rows.

Update: much faster now! This query was originally taking ~1100ms because I was doing the location match as a pseudo-field (country, region, city) instead of DISTINCT ON country, region, city. The latter allows it to hit the index. These are both pretty slow, but considering all 100,000 rows are being scanned multiple times and this is an ad hoc aggregation query, maybe it is not so bad.
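As a sketch of the sub-query + JOIN shape (same stand-in table and column names and placeholder filter as before, not the exact query):

```python
import psycopg2

conn = psycopg2.connect(dbname="no_belgium")
cur = conn.cursor()
# The inner query counts distinct networks per location; the JOIN puts that
# count back onto every un-grouped network row, and DISTINCT keeps the
# networks unique before paging to the row we care about.
cur.execute("""
    SELECT DISTINCT n.network, c.location_count
    FROM networks AS n
    JOIN (
        SELECT country, region, city,
               count(DISTINCT network) AS location_count
        FROM networks
        GROUP BY country, region, city
    ) AS c USING (country, region, city)
    WHERE n.country = %s          -- placeholder filter
    ORDER BY n.network
    LIMIT 1 OFFSET 9999
""", ("US",))
print(cur.fetchone())
```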
After initially making this post, John told me that using temporary tables might be a good approach. After I worked it out as below, the two queries return in ~150ms combined. Thanks for the tip John!
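A sketch of how splitting the work into two queries might look, again with stand-in names rather than the exact queries:

```python
import psycopg2

conn = psycopg2.connect(dbname="no_belgium")
cur = conn.cursor()
# First query: materialize the per-location counts into a temporary table.
cur.execute("""
    CREATE TEMP TABLE location_counts AS
    SELECT country, region, city,
           count(DISTINCT network) AS location_count
    FROM networks
    GROUP BY country, region, city
""")
# Second query: join the counts back onto the un-grouped rows and filter.
cur.execute("""
    SELECT DISTINCT n.network, c.location_count
    FROM networks AS n
    JOIN location_counts AS c USING (country, region, city)
    WHERE n.country = %s          -- placeholder filter
    ORDER BY n.network
    LIMIT 1 OFFSET 9999
""", ("US",))
print(cur.fetchone())
```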
Python has everything we need for this in the standard library (code below). We first need to decompress and read the file. There are a couple of ways to do this, but with is my preferred method because you don't need to issue close(). Going straight to for line in was how I did it at first, but this left stray \n's hanging around and was also ~50ms slower than grabbing the entire file at once. For larger files and different applications this may not work.
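A minimal sketch of the read-it-all-at-once approach (the file name and the assumption of four tab-separated columns are stand-ins):

```python
import bz2

# Read and decompress the whole file in one go; splitlines() also avoids the
# stray '\n' that iterating over the file line by line would leave behind.
with bz2.BZ2File("networks.tsv.bz2") as f:
    raw = f.read().decode("utf-8")

rows = [line.split("\t") for line in raw.splitlines() if line]
```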
Now that the data is in a neat format, we just need to group it, filter it, and sort it. Being relatively new to Python, one of my favorite things is being able to use list comprehensions. List comprehensions were introduced in Python 2.0 and echo the set-builder notation from mathematics (they appear in other languages too). filter_data is one such example and is a nice, compact way to represent oft-used for/if blocks. After this we just need to convert the list to a set to make sure we have all unique networks, then return the 10,000th item.
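A rough sketch of that grouping and filtering step, under the same assumptions as above (four columns per row, with a placeholder standing in for the customer's actual criteria):

```python
import bz2
from collections import Counter

with bz2.BZ2File("networks.tsv.bz2") as f:
    rows = [line.split("\t") for line in f.read().decode("utf-8").splitlines() if line]

# Count networks per (country, region, city); this plays the role of the
# GROUP BY / window function on the PostgreSQL side.
counts = Counter((country, region, city)
                 for network, country, region, city in rows)

# A filter_data-style list comprehension: keep rows whose location passes the
# (made-up) criteria, pairing each network with its location's count.
filter_data = [(network, counts[(country, region, city)])
               for network, country, region, city in rows
               if country == "US"]                      # placeholder filter

# De-duplicate the networks, sort for a stable order (Python compares strings
# by code point, which is why C collation was needed on the Postgres side to
# get a matching order), and grab the 10,000th item.
unique_networks = sorted(set(filter_data))
print(unique_networks[9999])
```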
Interestingly, the PostgreSQL queries are faster than this Python example. Most of the time is probably spent reading and decompressing the file, whereas PostgreSQL can operate entirely in memory. I am sure more experienced Python/Postgres developers could make both much faster, but it is worth noting anyway. I may revisit these and refactor them for fun as I learn more.
I also decided to write a MongoDB aggregation pipeline to see how it would compare. It came out at 1083ms, which is not very good. I may revisit this one at some point...
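A sketch of the kind of pipeline I mean, written here with pymongo (assuming a reasonably recent driver, where aggregate() returns a cursor); the database, collection, field names, and filter are stand-ins, not the exact query:

```python
from pymongo import MongoClient

client = MongoClient()
networks = client.challenge.networks          # stand-in database/collection names

# Match, group by location with a running count and a set of networks,
# unwind the set back into rows, then page to the item we care about.
pipeline = [
    {"$match": {"country": "US"}},            # placeholder filter
    {"$group": {"_id": {"country": "$country",
                        "region": "$region",
                        "city": "$city"},
                "location_count": {"$sum": 1},
                "networks": {"$addToSet": "$network"}}},
    {"$unwind": "$networks"},
    {"$sort": {"networks": 1}},
    {"$skip": 9999},
    {"$limit": 1},
]
print(list(networks.aggregate(pipeline)))
```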