Our non-DBA performance engineers and testers spent hours pulling together data to answer the question ‘How did the Oracle database perform during that test?’ They used multiple xymonprd screens for host performance stats, multiple Oracle AWR reports for database performance stats, and sometimes Oracle’s OEM tool as well. Much time was spent moving from one source and screen to another, wading through more detail than they needed while trying to form a coherent picture of performance across both the host servers and the database. We wanted to speed up the process significantly and improve the quality of the analysis, so we decided to create our own set of fundamental performance statistics covering both host and database performance and present it on one basic summary page with drill-down for finer details. Since Splunk provides easy searching and presentation functionality and we already had it licensed in-house, we decided to build the app in Splunk.
How it works
The general approach of the solution is that four stored procedures run in the Oracle database. The stored procedures read Oracle performance views and write the data out to operating system text files. Splunk consumes the text performance data, and the DBALite app presents it via Simple XML views. The host performance stats come from the *nix add-on for Unix and Linux.

With this tool I can confidently look at the app’s Performance Summary page to find out how the database is performing. It is easy to compare current performance to a period in the past, because dates are entered for both a baseline and a “current” time window. So a recently completed test can be compared side by side with a baseline test, or a period of bad performance can be compared side by side with a period of good performance to see what changed.

Often I go no further than the Summary page, but for example:

• If I see elevated host CPU utilization, I can click through to the ‘Investigate CPU’ page and check whether User, IOWAIT, or System utilization underlies it. I can even see if something in the test elevated the context switching.
• If I see elevated memory utilization, I can quickly determine whether high use of swap space or database PGA process memory was the cause.
• I can see whether deadlocks were an issue or whether there was significant blocking lock activity.
• I can see if I/O durations were longer than expected and whether they were sequential reads, scattered reads, or direct reads.
• In the performance graphs and tables on the Summary page, if I see a SQL statement with questionable performance (high disk reads or high CPU, for example), I can click it and the runstats for that statement for both the baseline and “current” periods will display in a new window.
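To illustrate the text-file handoff between the database and Splunk, here is a minimal sketch (in Python rather than the PL/SQL the stored procedures actually use) of writing a performance snapshot as a timestamped key=value line, a format Splunk’s search-time field extraction parses automatically. The stat names and values are hypothetical, loosely resembling counters a procedure might read from an Oracle v$sysstat-style view.

```python
from datetime import datetime, timezone

def format_splunk_line(stats, ts=None):
    """Render a dict of performance metrics as one timestamped
    key=value event line suitable for a Splunk file monitor input."""
    ts = ts or datetime.now(timezone.utc)
    fields = " ".join(f"{k}={v}" for k, v in sorted(stats.items()))
    return f"{ts.strftime('%Y-%m-%d %H:%M:%S')} {fields}"

# Hypothetical snapshot of database-wide counters.
snapshot = {"db_cpu_sec": 412, "physical_reads": 98213, "user_commits": 5321}
print(format_splunk_line(snapshot))
```

Appending one such line per collection interval to a monitored file is enough for Splunk to index each snapshot as a separate event with the metrics available as fields.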
Challenges I ran into
The biggest challenge was that I was completely new to Splunk and so had to learn absolutely everything. Early on I received the advice to stick to Simple XML, which I struggled to do; we later decided we’d convert to the web framework (a future enhancement) to get richer scripting and interactivity.
Accomplishments that I'm proud of
Decreased the usual time to gather and report database performance after a test from three or four hours to half an hour or less. Gave non-DBAs a tool that makes the Oracle database seem less intimidating and its performance easier to grasp. Gave production support a tool that allows them to be more proactive about database issues, as well as more empowered to diagnose and help resolve them.
What I learned
I learned much about how to write an app for Splunk. There is much I would do differently now that I have been through creating an app, and we have enhancements and improvements in mind.
What's next for Oracle DBLite
We will be adding pages for performance analytics. With the information we are gathering now, we should be able to use queuing theory to help answer the questions “How much can we increase the workload before we expect a significant performance degradation?” and “How many CPUs do we need to add to meet expected growth in workload?”
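As a taste of the kind of model we have in mind, here is a minimal sketch using the classic M/M/1 approximation, where response time inflates as R = S / (1 − U) and degrades sharply as utilization approaches 100%. The service time, utilization knee, and scaling assumptions are illustrative, not measurements from the app.

```python
def mm1_response_time(service_time, utilization):
    """M/M/1 approximation: average response time R = S / (1 - U).
    Degradation becomes steep as utilization approaches 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

def headroom_factor(current_util, knee_util=0.75):
    """How much the workload can grow before utilization reaches a
    chosen knee (e.g. 75%), assuming utilization scales linearly
    with throughput."""
    return knee_util / current_util

# Example: CPUs currently 50% busy, 10 ms average service time.
print(mm1_response_time(0.010, 0.50))  # average response time in seconds
print(headroom_factor(0.50))           # workload growth multiple before the knee
```

A more faithful model for a multi-CPU host would use the M/M/c (Erlang C) formula, but even this simple version makes the “knee of the curve” visible and gives a first-order answer to both capacity questions.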