Troubleshooting Performance in Complex Python Applications

At one of my jobs I helped build an application that processed tens of thousands of reports on a nightly basis. Whenever you start running code at scale, you invariably run into performance bottlenecks. Below is an example of how I triaged and improved performance in a complicated Python application.

Often, finding the bottleneck is the hardest part. I started out using cProfile.

Our application is a Django application, written in Python, that takes PDFs and warehouses their data in PostgreSQL. To triage this I took a report and ran it through our report processing pipeline with cProfile enabled. The command I ran was similar to this:

python -m cProfile -o myscript.cprof ./process_report.py

One thing I immediately noticed was that profiling had a huge impact on the performance of the application. This task, which normally runs in a minute or so, took over 24 hours, to the point where I canceled the job before it was complete. Fortunately the profiler had collected enough information for me to start diagnosing the issue.

I grabbed the .cprof file created above and loaded it into IntelliJ for analysis (Tools->Open CProfile Snapshot), which gave a visualization of the code below:

This gives a sense of how complex our application is: each of those colored blocks is a distinct function call. We ran close to 50 distinct functions to extract, transform, and load this data, with about half of them being our code and the other half being part of the Django framework.
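If an IDE is not at hand, a .cprof file can also be read with the standard library's pstats module. The sketch below is only an illustration: slow_sum is a made-up stand-in for the real report pipeline, and the profile is written to a temporary directory so the example runs on its own.

```python
import cProfile
import io
import os
import pstats
import tempfile


def slow_sum(n):
    """Stand-in workload; the real report pipeline is not shown in the post."""
    return sum(i * i for i in range(n))


def write_profile(path):
    """Profile the stand-in workload and dump the stats to `path`."""
    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(50_000)
    profiler.disable()
    profiler.dump_stats(path)


def summarize(path, limit=10):
    """Return a text report of the entries with the most cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(path, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


profile_path = os.path.join(tempfile.mkdtemp(), "myscript.cprof")
write_profile(profile_path)
report = summarize(profile_path)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run, including everything they call, at the top, which is roughly what the widest blocks in a graphical view show.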

The most notable thing about the timing information is that SQL is taking ~99% of our processing time. This should come as no surprise to someone familiar with databases. Queries have a number of qualities that make them very expensive, including network connections and the transaction time to commit data.

While SQL being a substantial performance issue should come as no surprise, I noticed something odd in a key part of the application highlighted here:

In our application we have a function, populate, that is responsible for writing out a single record. I noticed that for every populate call we were calling save to write our data 3 times! No good! Taking this information as a hint, we were able to track down some unintended triplicate writes in our application, speeding up this process by ~30%.
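The same save-to-populate ratio can be read out of a profile programmatically. This is a minimal sketch, not our actual code: Record and the body of populate are hypothetical stand-ins that only mimic the triplicate write, and get_stats_profile assumes Python 3.9 or newer.

```python
import cProfile
import pstats


class Record:
    """Hypothetical stand-in for a Django model; save() only counts writes."""

    writes = 0

    def save(self):
        Record.writes += 1


def populate(record):
    """Mimics the bug described above: three saves where one would do."""
    record.save()
    record.save()
    record.save()


def run_batch(n):
    for _ in range(n):
        populate(Record())


def call_counts(func, *args):
    """Profile `func` and return {function name: total number of calls}."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    profile = pstats.Stats(profiler).get_stats_profile()
    # ncalls is a string such as "300", or "total/primitive" for recursion.
    return {
        name: int(entry.ncalls.split("/")[0])
        for name, entry in profile.func_profiles.items()
    }


counts = call_counts(run_batch, 100)
ratio = counts["save"] / counts["populate"]
print(f"save calls per populate call: {ratio:.1f}")
```

A ratio above 1 for a function that should write once per record is the kind of hint that is worth chasing down in the code.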

Another aspect worth pointing out is that our application has some queries that are responsible for reading and validating the integrity of our data. This is visually represented by the get_gl_account function in the above screenshot, though there are many other functions like it in our application, depending on the data. It was often conjectured that these queries were happening at such a high frequency that they were negatively impacting performance. Taking a look at that function, we can see that despite running ~500 times, it was taking a negligible share (0%) of the time in this batch process. Using the power of cProfile, that conjecture was put to rest.
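A conjecture like this can be checked directly by asking the profile for one function's call count and its share of the total time. The sketch below uses made-up bodies for get_gl_account and for a hypothetical write_record that stands in for the expensive part of the batch; only the measuring approach is the point. It assumes Python 3.9 or newer for get_stats_profile.

```python
import cProfile
import pstats

ACCOUNTS = {code: f"Account {code}" for code in range(100)}  # made-up data


def get_gl_account(code):
    """Stand-in for a cheap validation lookup (hypothetical body)."""
    return ACCOUNTS[code % 100]


def write_record(i):
    """Stand-in for the expensive part of the batch (hypothetical body)."""
    return sum(j * j for j in range(10_000))


def run_batch():
    for i in range(500):
        get_gl_account(i)
    for i in range(20):
        write_record(i)


def share_of_time(func, name):
    """Profile `func`; return (call count, share of total time) for `name`."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    profile = pstats.Stats(profiler).get_stats_profile()
    entry = profile.func_profiles[name]
    calls = int(entry.ncalls.split("/")[0])
    # Guard against a zero total on very fast runs.
    share = entry.cumtime / profile.total_tt if profile.total_tt else 0.0
    return calls, share


calls, share = share_of_time(run_batch, "get_gl_account")
print(f"get_gl_account: {calls} calls, {share:.1%} of profiled time")
```

A high call count with a tiny share of the time is exactly the pattern that lets you rule a suspect out and move on.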

In summary, while a little knowledge of the application was necessary to tune the code, cProfile is an incredibly powerful tool for optimizing Python code in the right hands.

September 29, 2017 • Posted in: Technology

How to Use the Python Memory Profiler

Recently one of my coworkers was having an issue where some of our code running on Heroku was consuming a massive amount of memory. One of the tools I looked at to help troubleshoot this was the Python memory profiler. While I was loading a newer copy of our data set, my coworker identified the root cause of the issue, so we didn’t end up using this tool. That said, I found the information it provided while profiling code incredibly interesting, and potentially useful in the future.

October 7, 2016 • Posted in: Technology

How to Bulk Delete Your Twitter Followers

This post starts out with a slightly embarrassing story, followed by a script you can use to delete everyone you are following on Twitter. A couple of years ago, I signed up for Twitter because hey, someone once squatted my liyanage@hotmail.com email address and I wanted to reserve my Twitter name. I had no interest in actually doing Twitter. I just wanted the virtual real estate (and why isn’t that virtual estate?). Fast forward a couple of years, and I check out my Twitter account, and lo and behold, I am following close to 2,000 people. Someone had hacked my account and was selling my following off to the masses. Unfortunately, there is no bulk delete option on Twitter, and it would take an incredibly long time to delete that many people.

August 27, 2014 • Posted in: Technology