Thursday, November 3, 2016

PL/pgSQL Profiler 3 (plprofiler) is released

Finding performance problems in deeply nested PL/pgSQL business logic is difficult.

One of the reasons is that PL/pgSQL hides anything happening in the PL code. None of the SQL queries inside of all the nested function and trigger calls ever shows up in the PostgreSQL logs or pg_stat_activity. They can be tracked in pg_stat_statements, but then comes the hunt for where in the source code they are. With this "Cone of Silence" utilities like pgbadger cannot point to the actual problem. All pgbadger tells you is that "SELECT slowfunc(123)" took two hours.

To help DBAs and developers tackle this problem, I started working on the plprofiler code about 6 months ago. The result of this work, which was sponsored by OpenSCG, is plprofiler version 3. 

The feature highlights of version 3 are:

  • Can be used by unprivileged users to profile PL code executing in their session.
  • Can be used by a Postgres superuser to profile an entire, running application and collect the profiling data in shared memory.
  • A command line utility that wraps around a Python Class with functionality for:
    • Profiling of SQL statements that invoke functions or triggers. The SQL can be given on the command line or in a separate file.
    • Monitoring a running application for an arbitrary period of time.
    • Monitoring a specific running backend (by PID) for an arbitrary period of time.
    • Saving monitoring data from any of the above into permanent tables. These data sets are fully resolved and do no longer rely on the system catalog to identify functions/triggers or get their function source code.
    • Exporting/importing of saved data sets.
    • Generating a self contained HTML report from local, shared or saved data. The HTML report contains a FlameGraph based on the wall clock time spent in PL/pgSQL functions and triggers.
The best thing here is the actual end result of profiling, the HTML report. The FlameGraph in it makes it very easy to identify the functions/triggers that use excessive amounts of time, however deeply they may be nested. In addition the report contains full source code (as pop-ups) of a user defined number of "Top" functions. That source code is presented together with the per source line execution statistics of number of executions, cumulative time spent and maximum execution time of that particular PL statement.

Sample Report of a pgbench implemented with PL/pgSQL functions

The first real-world case, we used the new plprofiler to solve, was an application consisting of approx. 600 PL/pgSQL functions with over 100,000 lines of PL code sitting on 1,800 tables with 13,000 indexes. Quite some schema I would say and all the business logic implemented in PL/pgSQL. The problem we found was a SELECT statement in a function, 9 call levels deep. The SELECT's WHERE clause did not specify the first column of the only available index, but due to the row width of the table the PostgreSQL optimizer chose a bitmap scan using a full scan of the entire index. Creating one additional index on that table resulted in a net gain of factor 2.6 in overall application performance!

This was a really tough case and our remote DBAs had already spent days on the problem without really getting anywhere. What makes this so difficult is the combination of PL/pgSQL hiding all queries and that the query in question was still producing an index scan (of sorts), so the table wouldn't show up as suspicious in pg_stat_user_tables or the like either. Finding the problem with plprofiler version 3 took about an hour, and that without any prior knowledge of the customer's application code.

Granted, it still takes an experienced DBA to spot the particular problem that was the root cause in this case.

Anyhow, if you are interested and want to play around with plprofiler version 3, BigSQL has binary packages as part of the BigSQL distribution that work on Linux, Mac OSX and Windows.

Saturday, July 16, 2016

My PL/pgSQL code is stuck, but where?

In the short time that I am enhancing the PL profiler (https://bitbucket.org/openscg/plprofiler) I have been asked multiple times if it can be abused as a debugging aid. Not directly. The conversation goes something like this:

Q: When my PL code is stuck somewhere, can I turn on profiling and see where?
A: No.
Q: Will this feature be added?
A: No.

Of course would that be a useful feature. I don't argue that. And it seems to be that this is precisely how Oracle users find out where their PL/SQL code gets stuck or moves like molasses. However, the reason why I am against adding this is because a profiler, or even parts of it, should not be enabled 24x7 on a production database server for the purpose of eventually using it to extract some debugging information some day. There are other ways to get that information and I will show you one of them.

If a program is stuck somewhere, one uses a debugger to get a backtrace. This works with C programs, like the database server, provided that symbol information is available. In that backtrace (and some variables) we also find the PL/pgSQL backtrace. Having symbol information available is also useful in case of a program crash, to find out why it crashed by loading the core dump into the debugger. 

Every line of PL code, that is executed, goes through the PL/pgSQL executor function exec_stmt(). At that place we find the current PL code's line in a variable. Likewise every function execution goes through either plpgsql_exec_function() or plpgsql_exec_trigger(). In those stack frames we find the OID of the function as well as its signature (name and call argument types).

Doing this eventually several times for a deeply nested PL/pgSQL program is tedious and no fun. So here is a little script called plbacktrace (https://github.com/wieck/plbacktrace.git) that does all of that. It is invoked with the PID of the PostgreSQL backend and will output information like



[postgres@db1 tmp]$ ./plbacktrace.py 13239
fn_oid=105672 lineno=5 func="life_get_cell_pl(life_board,integer,integer)"
fn_oid=105673 lineno=12 func="life_num_neighbors_pl(life_board,integer,integer)"
fn_oid=105675 lineno=11 func="life_tick_pl(life_board)"
fn_oid=105674 lineno=22 func="life_pl(integer,integer,boolean)"
 I will explain another time why I have a database server playing Conway's Game of Life written in PL/pgSQL. 


Regards, Jan