Jun 19

Functional for the win

As part of my new job as a research student, I have to process a lot of data, on the order of several hundred thousand records. So I turned to my favourite language, Python. It’s what caused my earlier issues with memory management.

A few more lessons I’ve learned:

  • sometimes there’s a faster way to do stuff
  • make sure you aren’t putting repetitive data into a database
  • verify the data format before you download 2-3 GB of it

Okay, so, faster way to do stuff. Python has both a performance advantage and disadvantage. The advantage is that it only takes one or two hours to code up something to process data. The disadvantage is that in an effort to be clear and simple, one can end up coding a task that takes 20-100x longer than it should. Like I did.

I had a lot of nested loops, all running at Python-loop speed, that handled duplicates, inserted into a database, and so on. It took five hours to get through only 14 sets of queries. I had a lot of duplicate rows that needed to be erased: a query for rows with an id of 15 returns 25,607 records, and removing the duplicates leaves… 807. Brilliant.
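A minimal sketch of what order-preserving duplicate removal can look like. This is not the code from my script: it assumes each row is a hashable tuple, and `unique_rows` and the sample data are made up for illustration.

```python
def unique_rows(rows):
    """Return rows with exact duplicates dropped, keeping first-seen order."""
    seen = set()      # rows already emitted; set membership checks are fast
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result

# Hypothetical sample data: four rows, two distinct.
sample = [(15, 'a'), (15, 'a'), (15, 'b'), (15, 'a')]
deduped = unique_rows(sample)
```

Using a set for the "seen" check keeps the whole pass roughly linear in the number of rows, instead of rescanning a list for every row.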

Today, I decided to take most of the loops and replace them with their functional counterparts: map, lambda, and goopy, Google’s functional tools for Python. In only two hours, it has now gone through 160 ids. That’s more than ten times as many, in less than half the time. Pretty amazing, eh?

Here’s a sample:

for row in all:
     curs=curs.execute('insert into elements2 values(?, ?, ?, ?, ?)', (row[0], row[1], str(row[2]), str(row[3]), ''))

replaced with:

map(lambda row: conn.execute('insert into elements2 values(?, ?, ?, ?, ?)', (row[0], row[1], str(row[2]), str(row[3]), '')), all)

A little bit more difficult to read, yes, but the speedup is well over 50 times…
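For comparison, here is a hedged sketch of the same bulk insert done with `executemany` from Python’s standard `sqlite3` module, wrapped in a single transaction. It assumes the database is SQLite (the `?` placeholders are consistent with that, but nothing above confirms it). The column names, the `insert_rows` helper, and the sample rows are invented for the example. I haven’t timed this against the map version.

```python
import sqlite3

def insert_rows(conn, rows):
    """Insert all rows with one executemany call inside one transaction."""
    # Same conversion as the loop above: stringify fields 2 and 3, pad with ''.
    params = ((r[0], r[1], str(r[2]), str(r[3]), '') for r in rows)
    with conn:  # commits once on success, rolls back if an insert fails
        conn.executemany('insert into elements2 values(?, ?, ?, ?, ?)', params)

# Hypothetical setup: in-memory database with made-up column names.
conn = sqlite3.connect(':memory:')
conn.execute('create table elements2 (a, b, c, d, e)')
sample_rows = [(1, 'x', 2.5, 10), (2, 'y', 3.5, 20)]
insert_rows(conn, sample_rows)
inserted = conn.execute('select * from elements2 order by a').fetchall()
```

The idea is the same as with map: hand the whole batch to the library in one call rather than driving each insert from a Python-level loop.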

All this work to clear out duplicate entries, and how intensive it is to fix, is a perfect example of why the “Look Before You Leap” programming paradigm works better than “Ask for Forgiveness, Not Permission”.
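A rough sketch of what “looking before leaping” could mean here: check whether an identical row is already stored before inserting it. This assumes SQLite via the standard `sqlite3` module. The column names `a` to `e` and the `insert_if_absent` helper are hypothetical, not taken from my actual schema.

```python
import sqlite3

def insert_if_absent(conn, row):
    """Insert a 4-field row only if an identical row is not stored yet."""
    exists = conn.execute(
        'select 1 from elements2 where a = ? and b = ? and c = ? and d = ?',
        row,
    ).fetchone()
    if exists is None:
        # Pad with '' for the fifth column, as in the insert shown earlier.
        conn.execute('insert into elements2 values(?, ?, ?, ?, ?)', row + ('',))
        return True
    return False

# Hypothetical setup and data: the second row duplicates the first.
conn = sqlite3.connect(':memory:')
conn.execute('create table elements2 (a, b, c, d, e)')
rows = [(15, 'x', '1', '2'), (15, 'x', '1', '2'), (15, 'y', '1', '2')]
results = [insert_if_absent(conn, r) for r in rows]
count = conn.execute('select count(*) from elements2').fetchone()[0]
```

Each insert costs an extra lookup, but the duplicates never reach the table, so there is nothing to clean up afterwards.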

Now, once the processing is done, I’ll be able to go through all the data… and add a piece of information I forgot to process in the first place to each entry. Yay me!
