One of the reasons is that PL/pgSQL hides anything happening in the PL code. None of the SQL queries inside of all the nested function and trigger calls ever shows up in the PostgreSQL logs or pg_stat_activity. They can be tracked in pg_stat_statements, but then comes the hunt for where in the source code they are. With this "Cone of Silence" utilities like pgbadger cannot point to the actual problem. All pgbadger tells you is that "SELECT slowfunc(123)" took two hours.
To help DBAs and developers tackle this problem, I started working on the plprofiler code about 6 months ago. The result of this work, which was sponsored by OpenSCG, is plprofiler version 3.
The feature highlights of version 3 are:
- Can be used by unprivileged users to profile PL code executing in their session.
- Can be used by a Postgres superuser to profile an entire, running application and collect the profiling data in shared memory.
- A command line utility that wraps around a Python Class with functionality for:
- Profiling of SQL statements that invoke functions or triggers. The SQL can be given on the command line or in a separate file.
- Monitoring a running application for an arbitrary period of time.
- Monitoring a specific running backend (by PID) for an arbitrary period of time.
- Saving monitoring data from any of the above into permanent tables. These data sets are fully resolved and do no longer rely on the system catalog to identify functions/triggers or get their function source code.
- Exporting/importing of saved data sets.
- Generating a self contained HTML report from local, shared or saved data. The HTML report contains a FlameGraph based on the wall clock time spent in PL/pgSQL functions and triggers.
The best thing here is the actual end result of profiling, the HTML report. The FlameGraph in it makes it very easy to identify the functions/triggers that use excessive amounts of time, however deeply they may be nested. In addition the report contains full source code (as pop-ups) of a user defined number of "Top" functions. That source code is presented together with the per source line execution statistics of number of executions, cumulative time spent and maximum execution time of that particular PL statement.
|Sample Report of a pgbench implemented with PL/pgSQL functions|
The first real-world case, we used the new plprofiler to solve, was an application consisting of approx. 600 PL/pgSQL functions with over 100,000 lines of PL code sitting on 1,800 tables with 13,000 indexes. Quite some schema I would say and all the business logic implemented in PL/pgSQL. The problem we found was a SELECT statement in a function, 9 call levels deep. The SELECT's WHERE clause did not specify the first column of the only available index, but due to the row width of the table the PostgreSQL optimizer chose a bitmap scan using a full scan of the entire index. Creating one additional index on that table resulted in a net gain of factor 2.6 in overall application performance!
This was a really tough case and our remote DBAs had already spent days on the problem without really getting anywhere. What makes this so difficult is the combination of PL/pgSQL hiding all queries and that the query in question was still producing an index scan (of sorts), so the table wouldn't show up as suspicious in pg_stat_user_tables or the like either. Finding the problem with plprofiler version 3 took about an hour, and that without any prior knowledge of the customer's application code.
Granted, it still takes an experienced DBA to spot the particular problem that was the root cause in this case.
Anyhow, if you are interested and want to play around with plprofiler version 3, BigSQL has binary packages as part of the BigSQL distribution that work on Linux, Mac OSX and Windows.
This really is excellent, thanks.ReplyDelete
I got a little confused reading the docs, references to "save_interval" left me sitting around wondering why no stats were being transferred into the shared hash tables, when infact the parameter should have been "collect_interval".
Thanks for the feedback. Will check the docs when I get home.ReplyDelete