How to profile your Python code to improve performance
When your Python scripts take too much time to execute, it’s crucial to profile which part of your code is responsible. This becomes increasingly tricky as your code grows in size and complexity.
Let’s take an example: the function run loads the Titanic CSV file with pandas and does some computation on the loaded data. I want to know which part takes more time: the loading or the computation.
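Roughly, it looks like this (a sketch only: the file path, column names and exact computations below are assumptions; the full code is linked at the end of the post):

import pandas as pd

def load():
    # load the Titanic passengers CSV file (path is an assumption)
    return pd.read_csv("titanic.csv")

def compute_max_ticket_price_per_class(df):
    # maximum ticket price ("Fare") per passenger class, aligned with df's rows
    return df.groupby("Pclass")["Fare"].transform("max")

def normalize_ticket_price_per_class(df):
    # divide each ticket price by the maximum price of its class
    df["NormalizedFare"] = df["Fare"] / compute_max_ticket_price_per_class(df)
    return df

def run():
    df = load()
    return normalize_ticket_price_per_class(df)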
I tried the standard Python profiler, but it turned out not to help me much in addressing the performance issue. Instead, I wrote a Python decorator to time my code, allowing me to profile exactly what I need.
Why the standard Python profiler is not what I need
Let’s profile our piece of Python code with cProfile. To do so, load the cProfile module when you run your script:
python -m cProfile load_and_normalize.py
Alternatively, you can profile the “run” function directly in the Python code:
import cProfile
cProfile.run('run()')
In the end, you’ll get an output that looks like this:
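The report is a long table with one row per function called anywhere during the run, under these columns:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)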
Why it doesn’t help me:
- it gives way too much information: it tracks the time of all called functions, including the ones called deep inside pandas. Hence I got ~300 function calls. In my example, I am only interested in the execution time of my own functions, i.e., run, load, normalize_ticket_price_per_class, …
- “tottime” is not relevant: this is the actual time spent in the function excluding the time spent in subcalls. As all calls are tracked, it is almost always 0.000.
- exploring the profiler results is hard: timings are either printed to the console or written to a file. You may use pstats or tools such as cProfilev (basic display with sorting facilities), KCacheGrind or gprof2dot (call graph visualization) to explore the results, but it is heavy.
- exporting the results is not easy: in my case, I’d like to store the profiler results in a comprehensive way in an external database.
A simple Python decorator for tracking time
Instead, I’ll use this simple Python decorator for profiling:
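Here is a minimal sketch of it (the name simple_time_tracker and the dict passed to the log function are assumptions; the full 24-line version is linked at the end of the post):

import time
from functools import wraps

def simple_time_tracker(log_fun):
    def _simple_time_tracker(fn):
        @wraps(fn)
        def wrapped_fn(*args, **kwargs):
            start_time = time.time()
            try:
                result = fn(*args, **kwargs)
            finally:
                # log the elapsed time, even if fn raised an exception
                elapsed_time = time.time() - start_time
                log_fun({"function_name": fn.__name__, "total_time": elapsed_time})
            return result
        return wrapped_fn
    return _simple_time_tracker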
It computes the execution time of the decorated function fn and logs the result right after fn has finished executing:
- the @wraps decorator (line 6) sets the meta-attributes (name, docstring, …) of the decorated function from the original function’s.
- the try … finally block ensures the execution time is always logged, even if the decorated function raises an exception.
To use it, simply decorate the functions that you decided to profile:
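For instance, with the example functions sketched above (and reusing their pandas import), a console-printing log function and the decorated functions could look like this:

def _log(timing):
    # print the timing dict produced by the decorator
    print("[SimpleTimeTracker] {function_name} {total_time:.3f}".format(**timing))

@simple_time_tracker(_log)
def load():
    return pd.read_csv("titanic.csv")

@simple_time_tracker(_log)
def compute_max_ticket_price_per_class(df):
    return df.groupby("Pclass")["Fare"].transform("max")

@simple_time_tracker(_log)
def normalize_ticket_price_per_class(df):
    df["NormalizedFare"] = df["Fare"] / compute_max_ticket_price_per_class(df)
    return df

@simple_time_tracker(_log)
def run():
    return normalize_ticket_price_per_class(load())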
Executing the function run will print the following:
[SimpleTimeTracker] load 0.006
[SimpleTimeTracker] compute_max_ticket_price_per_class 0.003
[SimpleTimeTracker] normalize_ticket_price_per_class 0.004
[SimpleTimeTracker] run 0.010
- it is simple: the decorator is 24 lines long and has no dependencies.
- it is surgical: I only get timings for the functions I decided to profile. This lets you quickly understand which parts of your script are slow.
- it is easy to export the results: the logging function _log passed to the decorator handles the profiler results. In the example above, it simply prints timings to the console, but I could have stored the profiler results in a database.
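For example, a log function that stores the timings in a local SQLite database could be as simple as this sketch (the database file and table schema are assumptions):

import sqlite3

def _log_to_db(timing):
    # append one row per profiled call to a local SQLite database
    conn = sqlite3.connect("timings.db")
    conn.execute("CREATE TABLE IF NOT EXISTS timings (function_name TEXT, total_time REAL)")
    conn.execute("INSERT INTO timings VALUES (?, ?)",
                 (timing["function_name"], timing["total_time"]))
    conn.commit()
    conn.close()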
A better Python decorator that profiles partial times
The partial time of a function foo is its total execution time, i.e., the time delta between entering and exiting foo, minus the execution times of the functions called inside the body of foo.
In the example above, the partial execution time of the function run is zero: its total execution time (10 ms) minus the execution times of normalize_ticket_price_per_class (4 ms) and load (6 ms). This means that if I want to improve the performance of run, I should focus on improving load and normalize_ticket_price_per_class, as literally no time is spent in the body of run itself (which is obvious when looking at the code).
To profile partial execution times, I improved the decorator a little bit:
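Here is a sketch of the improved version (names are again assumptions, except _total_time_call_stack, which is discussed below):

import time
from functools import wraps

# stack of accumulators: one entry per nesting level, holding the time
# already spent in tracked sub-calls at that level
_total_time_call_stack = [0]

def better_time_tracker(log_fun):
    def _better_time_tracker(fn):
        @wraps(fn)
        def wrapped_fn(*args, **kwargs):
            # open a new accumulator for the sub-calls made by fn
            _total_time_call_stack.append(0)
            start_time = time.time()
            try:
                result = fn(*args, **kwargs)
            finally:
                elapsed_time = time.time() - start_time
                # time spent in the tracked sub-calls of fn
                sub_calls_time = _total_time_call_stack.pop()
                # credit fn's total time to its caller's accumulator
                _total_time_call_stack[-1] += elapsed_time
                log_fun({
                    "function_name": fn.__name__,
                    "total_time": elapsed_time,
                    "partial_time": elapsed_time - sub_calls_time,
                })
            return result
        return wrapped_fn
    return _better_time_tracker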
Note that we do need a global variable _total_time_call_stack to compute partial execution times.
Here are the profiling results when running the function run; the first number is the total time, the second the partial time:
[BetterTimeTracker] load 0.006 0.006
[BetterTimeTracker] compute_max_ticket_price_per_class 0.003 0.003
[BetterTimeTracker] normalize_ticket_price_per_class 0.004 0.001
[BetterTimeTracker] run 0.010 0.000
Full code, documentation and examples are available here.
[Edit]
Following this comment on Reddit, I tried the pyflame + flamegraph tools for profiling. It’s amazing: it shows which parts of the Python code take the most time, in a concise way and without modifying the code (e.g., adding decorators).
To do so, you need to install pyflame and flamegraph on your computer, then run the following command:
pyflame -t python your_code.py | flamegraph > profile.svg
You’ll get an SVG file that looks like this:
You can download the SVG file here. In a nutshell:
- Each bar represents a call to a function;
- Colors are randomly picked among warm colors :)
- The Y-axis represents call stacks;
- The X-axis is not time: pyflame samples the current call stack every millisecond, and the width of a bar represents the overall number of ticks spent in that function.
In this example, you can see that most of the time (91.17%) is spent loading pandas. To get more detail about the run function, press Ctrl+F to search the SVG and click on the matching element (the tiny purple bar in the right part); you should get something like:
In the end, both pyflame and the profiling decorators allow you to retrieve the ratio of execution time spent in the load and normalize_ticket_price_per_class functions, which is about 60/40.
Thanks to Reda Boumahdi, Adil Baaj, Tristan Roussel, Adrien Lina, Louis Nicolle, and Flavian Hautbois.
If you are looking for Machine Learning experts, don't hesitate to contact us!