No Belgium: Code challenge on MongoDB, PostgreSQL, and Python

Posted on 15 Apr 2013 in Python PostgreSQL MongoDB

I recently came across a few fun coding challenges. I won't name the site or the challenge so as not to spoil the fun for others. This post's purpose is to document one such challenge and some things I came across. If you have style or performance tips, please let me know.

Update: 4/17/2013 John suggested using temporary tables to address this type of problem, so I tried this method, and it is indeed faster.

Update 2: 4/17/2013 I was doing the sub-query method the stupid way, and adjusted it to be more like the temp table method, and now it is fastest.

The gist of the challenge is, given a dataset (as a tsv file compressed with bz2) of 100,000 networks across the globe, such as:

     network      |   source   | country |      region      |    city    | score  
------------------+------------+---------+------------------+------------+--------
 88.255.200.8/29  | CONJECTURE | GB      | Northamptonshire | Daventry   |  419.4
 71.206.20.0/24   | CONJECTURE | US      | CA               | San Jose   | 851.04
 218.2.196.0/23   | HEARSAY    | IT      |                  |            | 706.06
 222.68.108.76/30 | TRUTH      | US      | DC               | Washington | 190.13
 62.118.8.56/30   | CONJECTURE | CA      | SK               | Regina     |  625.9

A hypothetic customer is only interested in a certain part of this data:

Exclude networks in Belgium, New Zealand, and Hawaii
A location is a unique tuple defined by (country, region, city)
Count the number of networks in each group, and only include the networks that are in locations with 11-99 networks
Un-group the networks
Sort them lexicographically using C Locale
Return the 10,000th distinct network, 1-indexed i.e. [9999] 0-indexed

Getting Started

After recently taking 10gen's Mongo101 course, my first thought went straight to the MongoDB aggregation framework: $match, $group, $count, $unwind, $match, $skip, $limit. I also thought of Underscore which I had recently used on a project. Underscore has some nice tools (countBy, groupBy) and works great in the browser and on node. What really got me thinking though was two tools I have recently been learning: Python and PostgreSQL.

PostgreSQL and the Window Function

The first thing I wanted to do was to group the networks by the three identifying location fields (country, region, city). When using count() with GROUP BY though, the rows are flattened. We need something similar to MongoDB's $unwind. PostgreSQL's window functions (Update: sub-query and JOIN to the rescue) to the rescue! A window function basically allows you to run aggregate functions without squashing the original rows.

Below is the resulting window function query. Starting on line 3 is the initial sub-query/window function to get the counts by region. On line one we make sure the networks are distinct. Following the sub-query, we add our filters.

Gotcha: PostgreSQL and C locale

A gotcha on this was that we need to sort the results using C locale. For me on Ubuntu 12.10 and PostgreSQL 9.1, all new databases default to the system Collate and Ctype, in this case en_US.UTF-8. The quickest way to remedy this was to create a new database using createdb -T template0 --lc-collate=C --lc-ctype=C -E UNICODE no_belgium and re-import the data. I came across other suggestions on this, but none worked until I tried the re-import.

PostgreSQL using window function

SELECT DISTINCT ON (network) network
FROM
    (SELECT *,
            count(*) OVER (Partition BY (country, region, city)) AS netsreg
     FROM networks) AS sq
WHERE netsreg > 10
    AND netsreg < 100
    AND country != 'BE'
    AND country != 'NZ'
    AND region != 'HI'
ORDER BY network
OFFSET 9999 LIMIT 1;

-- 63.127.181.128/27
-- (1 row)
--
-- Time: 626.143 ms

PostgreSQL JOIN on sub-query

I was also curious how this query may perform without using a window function. The first thought I had was to use a sub-query to count the networks by locations, then join on the un-grouped networks to preserve the individual network rows.

SELECT DISTINCT ON (networks.network) networks.network
FROM networks
LEFT OUTER JOIN
  (SELECT DISTINCT country,
                   region,
                   city,
                   count(*)
   FROM networks
   GROUP BY country,
            region,
            city) AS nets ON (networks.country = nets.country
                              AND networks.region = nets.region
                              AND networks.city = nets.city)
WHERE nets.count > 10
  AND nets.count < 100
  AND networks.country != 'BE'
  AND networks.country != 'NZ'
  AND networks.region != 'HI'
ORDER BY network
OFFSET 9999 LIMIT 1;

-------------------
-- 63.127.181.128/27
-- (1 row)

-- Time: 124.223 ms

Update: much faster now! This query was originally taking ~1100ms because I was doing the location match as a pseudo-field (country, region, city) instead of DISTINCT on country, region, city The later allows it to hit the index. These are both pretty slow, but considering all 100000 rows are being scanned multiple times and this is an ad hoc aggregation query, maybe it is not so bad.

Update: PostgreSQL with temporary table

After initially making this post, John told me that using temporary tables might be a good approach. After I worked it out as below, the two queries return in ~150ms combined. Thanks for the tip John!

CREATE temp TABLE nets AS
SELECT DISTINCT country,
                region,
                city,
                count(*)
FROM networks
GROUP BY country,
         region,
         city;

-- SELECT 9740
-- Time: 94.347 ms

SELECT DISTINCT ON (networks.network) networks.network,
                                      nets.count
FROM networks
LEFT OUTER JOIN nets ON (networks.country = nets.country
                         AND networks.region = nets.region
                         AND networks.city = nets.city)
WHERE nets.count > 10
    AND nets.count < 100
    AND networks.country != 'BE'
    AND networks.country != 'NZ'
    AND networks.region != 'HI'
ORDER BY networks.network
OFFSET 9999 LIMIT 1;

--      network      | count 
-------------------+-------
-- 63.127.181.128/27 |    85
-- (1 row)

-- Time: 57.389 ms

Doing it in Python

Python has everything we need for this in the standard library (code below). We first need to decompress and read the file. There are a couple of ways to do this, but with is my preferred method because you don't need to issue close(). Going straight to for line in was how I did it at first, but this left stray \n's hanging around and was also ~50ms slower then grabbing the entire file at once. For larger files and different applications this may not work.

Now that the data is in a neat format, we just need to group it, filter it, and sort it. Being relatively new to Python, one of my favorite things is being able to use list comprehensions. List comprehensions came about in Python 2 and are used in mathematics and other languages.

filter_data is one such example and is a nice compact way to represent oft-used for/if blocks. After this we just need to convert the list to a set to make sure we have all unique networks, then return the 10000th item.

Interestingly, the PostgreSQL queries are faster than this Python example. Most of the time may be spent reading the file, while PostgreSQL can operate all in memory. I am sure more experienced Python/Postgres developers could make both much faster, but it is worth noting anyway. I may revisit this and re-factor them for fun as I learn more.

import bz2
from itertools import groupby


def read_data(file_name):
    """ Parse the bz2 file to get the networks into a list of lists. """
    data = []
    with bz2.BZ2File(file_name, 'rb') as input_file:
        lines = input_file.read().splitlines()
        for line in lines:
            line = line.split('\t')
            if not line[2] in ["BE", "NZ"]:
                if not line[3] == "HI":
                    data.append(line)
    return data


def group_by(data):
    """ Group the networks by unique location (country, region, city). """
    groups = []
    keyfunc = lambda y: (y[2], y[3], y[4])
    data = sorted(data, key=keyfunc)
    for k, g in groupby(data, keyfunc):
        groups.append(list(g))
    return groups


def filter_data(a):
    """ Return list of CIDRs of networks from locations with
    11-99 networks. """
    return [j[0] for i in a if len(i) > 10 and len(i) < 100 for j in i]


# print the 10000th item in the sorted, unique list
nets = set(filter_data(group_by(read_data('no_belgium.txt.bz2'))))
print(sorted(list(nets))[9999])

$ time python no_belgium.py
63.127.181.128/27
python no_belgium.py  0.76s user 0.02s system 96% cpu 0.808 total

MongoDB Aggregation Example

I also decided to write a MongoDB aggregation pipeline to see how it would compare. It came out at 1083ms, which is not very good. I may revisit this one at some point...

use nobelgium
db.networks.aggregate([
    // using match instead of $nor:[] to utilize index, save 100ms+
    {$match:
        {
            country: {$ne: "NZ"}
        }
    },
    {$match:
        {
            country: {$ne: "BE"}
        }
    },
    {$match:
        {
            region: {$ne: "HI"}
        }
    },
    {$group:
        {
            _id: {country: "$country", region: "$region", city: "$city"},
            netz: {$push: "$network" },
            count: {$sum: 1}
        }
    },
    {$project:
        {
            _id: 0,
            count: 1,
            netz: 1
        }
    },
    {$unwind: "$netz"},
    {$match:
        {
            count: {$gt: 10, $lt: 100}
        }
    },
    {$group:
        {
            _id: "$netz",
            uniq: {$sum: 1}
        }
    },
    {$sort:
        {
            _id: 1
        }
    },
    {
        $skip: 9999
    },
    {
        $limit: 1
    }
]);

{
    "result" : [
        {
            "_id" : "63.127.181.128/27",
            "uniq" : 1
        }
    ],
    "ok" : 1
}