Short-circuiting the Java stack trace search

PCF Application Log Analytics
By Kyle Dunn

Many developers agree Java stack traces are a source of headaches and needless screen scrolling. Occasionally the verbosity is warranted and essential for debugging; more often, the overwhelming detail is just that: overwhelming. In the spirit of better developer productivity and shorter debugging cycles, this post demonstrates an increasingly relevant reference architecture for cognitive capabilities in Pivotal Cloud Foundry (PCF) using two of Pivotal’s flagship data products: GemFire, an in-memory data grid, and Greenplum, a scale-out data warehouse.

Before jumping directly into the bits and bytes, it’s worth justifying “why use those tools?”, especially given the vast landscape of mature log indexing and searching products. Among open source offerings, the Elasticsearch, Logstash, and Kibana (ELK) stack is quite powerful and effective for reactive, event-based investigations; the operative term being reactive. PCF is a platform built with developer productivity at the forefront, and proactively offering relevant application events in a single place, like PCF Metrics, is preferable to adding steps and dashboards to a developer’s workflow. Below is a screenshot of PCF Metrics, a powerful tool for visibility into application performance and debugging.

The Logs section at the bottom makes event searching very convenient for a developer. Taking this one step further can proactively surface the relevant pieces of recent stack trace messages, for example, those matching a “Caused by: ” filter. GemFire is well-suited for implementing such an enhancement, especially when coupled with its new Lucene text indexing capabilities. As a complement, the analytical capabilities of Greenplum’s GPText are a good fit for higher-order, longitudinal insights like “Which Java classes appear in the most stack traces?”. Combining GemFire’s in-memory capabilities with Greenplum’s storage-efficient archival and performant searching opens a wide range of possibilities.
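To make the GemFire half more concrete, below is a minimal sketch of a Lucene-indexed log region using the Apache Geode APIs that underpin GemFire 9. The region name (appLogs), index name (logIndex), and field name (message) are illustrative assumptions; in a real deployment the region would be defined on cache servers and fed by application logs rather than populated in a main method.

import java.util.Collection;
import java.util.concurrent.TimeUnit;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.lucene.LuceneQuery;
import org.apache.geode.cache.lucene.LuceneService;
import org.apache.geode.cache.lucene.LuceneServiceProvider;

public class StackTraceSearch {
    public static void main(String[] args) throws Exception {
        Cache cache = new CacheFactory().create();
        LuceneService lucene = LuceneServiceProvider.get(cache);

        // The Lucene index must be defined before the region it indexes.
        lucene.createIndexFactory().addField("message").create("logIndex", "appLogs");
        Region<String, String> appLogs = cache
                .<String, String>createRegionFactory(RegionShortcut.PARTITION)
                .create("appLogs");

        // In practice entries would stream in from the platform's log firehose; one entry here for illustration.
        appLogs.put("log-1", "Caused by: java.sql.SQLNonTransientConnectionException: Connection refused");

        // Lucene indexing is asynchronous, so let the index queue drain before querying.
        lucene.waitUntilFlushed("logIndex", "appLogs", 1, TimeUnit.MINUTES);

        // Full-text search for log entries containing the "Caused by" phrase.
        LuceneQuery<String, String> query = lucene.createLuceneQueryFactory()
                .create("logIndex", "appLogs", "\"Caused by\"", "message");
        Collection<String> hits = query.findValues();
        hits.forEach(System.out::println);

        cache.close();
    }
}

Because the index is Lucene-backed, the query string accepts full Lucene syntax, so the same region could just as easily be searched for a particular exception class or host name.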

Below is a very abridged stack trace a developer might see if the backing database became unavailable:
org.springframework.jdbc.support.MetaDataAccessException: Could not get Connection for extracting meta data; nested exception is org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLNonTransientConnectionException: [Pivotal][Greenplum JDBC Driver]Error establishing socket to host and port: 172.16.139.132:5432. Reason: Connection refused
at org.springframework.jdbc.support.JdbcUtils.extractDatabaseMetaData(JdbcUtils.java:339)
...
at io.pivotal.Application.main(Application.java:10) [classes/:na]
Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLNonTransientConnectionException: [Pivotal][Greenplum JDBC Driver]Error establishing socket to host and port: 172.16.139.132:5432. Reason: Connection refused
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80) ~[spring-jdbc-4.3.8.RELEASE.jar:4.3.8.RELEASE]
at org.springframework.jdbc.support.JdbcUtils.extractDatabaseMetaData(JdbcUtils.java:326) ~[spring-jdbc-4.3.8.RELEASE.jar:4.3.8.RELEASE]
... 61 common frames omitted
Caused by: java.sql.SQLNonTransientConnectionException: [Pivotal][Greenplum JDBC Driver]Error establishing socket to host and port: 172.16.139.132:5432. Reason: Connection refused

At first glance, there doesn’t appear to be much useful insight buried in the noise, but upon closer inspection a few gems avail themselves to subsequent analysis. For example, the term “Caused by: ” often prefixes a specific exception message, in this case: Could not get JDBC Connection and Error establishing socket to host and port: 172.16.139.132:5432. Reason: Connection refused. Messages with this format are great candidates for proactive exposure to a developer, as discussed above. A straightforward implementation realizes this capability by continuously applying a filter to live data via an asynchronous event listener pattern.
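The post doesn’t prescribe the listener code, but a bare-bones sketch of that pattern against the Apache Geode APIs that GemFire exposes might look like the following; the region name, queue name, batch size, and the act of simply printing matches are placeholder assumptions.

import java.util.List;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.asyncqueue.AsyncEvent;
import org.apache.geode.cache.asyncqueue.AsyncEventListener;
import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

public class CausedByListener implements AsyncEventListener {

    // Called asynchronously with batches of region events; keep only root-cause lines.
    @Override
    public boolean processEvents(List<AsyncEvent> events) {
        for (AsyncEvent event : events) {
            String line = String.valueOf(event.getDeserializedValue());
            if (line.contains("Caused by: ")) {
                // A real implementation might publish this to a dashboard region or notification channel.
                System.out.println("Root cause candidate: " + line);
            }
        }
        return true; // acknowledge the batch so the queue can discard it
    }

    @Override
    public void close() {
    }

    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();

        // Route every put on the log region through the listener, asynchronously and in batches.
        AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
                .setBatchSize(100)
                .create("causedByQueue", new CausedByListener());

        Region<String, String> appLogs = cache
                .<String, String>createRegionFactory(RegionShortcut.PARTITION)
                .addAsyncEventQueueId(queue.getId())
                .create("appLogs");

        appLogs.put("log-1", "Caused by: java.sql.SQLNonTransientConnectionException: Connection refused");
    }
}

The asynchronous queue decouples the filter from the write path, so log ingestion is never blocked while the interesting lines are surfaced.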

Moving into a related, but distinct, patch of weeds: stack trace text contained in parentheses surfaces the underlying Java class and source line number, e.g. (DataSourceUtils.java:80). This data is sufficient for the longitudinal insight example previously mentioned: “Which Java classes appear in the most stack traces?”. In this case we’ll show the utility of having application logs stored in an analytical data warehouse like Greenplum.

Stand back, I know regular expressions and (some) SQL:

SELECT
    regexp_matches(message, E'([A-Za-z0-9_]+\\.java):\\d+') AS javaclass,
    COUNT(*)
FROM applogs
WHERE message ~ 'at'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;

           javaclass            | count
--------------------------------+-------
 {PeerEurekaNode.java}          | 40392
 {AbstractPlainSocketImpl.java} | 33346
 {WebResource.java}             | 32314
 {ThreadPoolExecutor.java}      | 27406
 {DefaultRequestDirector.java}  | 21747
(5 rows)

Time: 8556.875 ms

Not bad: results in under 10 seconds from 1 million records using only 2 Greenplum data segments on a virtual machine. As a point of reference, production Greenplum deployments start at 32 segments spread over 4 servers.

Taking this approach one step further, and marrying the otherwise uninteresting stack traces for a given application, a visualization like the following is possible, generated programmatically with SQL and Java. In addition to a visual troubleshooting diagram, the underlying node-and-edge graph data structure enables trending of problematic classes and dependencies. Imagine having the ability to answer questions like “Which classes are the cause of most application errors?” At a higher level, a VP of Engineering would be empowered to do “rollups” of code stability, the same way you might in a finance or marketing context. The most effective management is often rooted in granular measurements.
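The generation code itself isn’t shown here, but as a rough, hypothetical sketch of the Java half: consecutive “at ...” frames of a stack trace can be folded into caller-to-callee edges. The Edge class and the frame-pairing heuristic below are illustrative assumptions, not the exact implementation behind the diagram.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StackTraceGraph {

    // A directed caller -> callee edge between two source files.
    static final class Edge {
        final String caller;
        final String callee;
        Edge(String caller, String callee) { this.caller = caller; this.callee = callee; }
    }

    // Same shape of pattern as the SQL above: ClassName.java followed by a line number.
    private static final Pattern FRAME = Pattern.compile("([A-Za-z0-9_]+\\.java):\\d+");

    // Stack traces list the deepest frame first, so each frame's successor is its caller.
    static List<Edge> edges(List<String> frames) {
        List<Edge> result = new ArrayList<>();
        String callee = null;
        for (String frame : frames) {
            Matcher m = FRAME.matcher(frame);
            if (!m.find()) {
                continue;
            }
            String current = m.group(1);
            if (callee != null) {
                result.add(new Edge(current, callee));
            }
            callee = current;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Edge> edges = edges(Arrays.asList(
                "at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)",
                "at org.springframework.jdbc.support.JdbcUtils.extractDatabaseMetaData(JdbcUtils.java:326)",
                "at io.pivotal.Application.main(Application.java:10)"));
        edges.forEach(e -> System.out.println(e.caller + " -> " + e.callee));
    }
}

Loaded into a (caller, callee, count) table in Greenplum, the same edges can be aggregated over time to show which classes and dependencies are trending toward instability.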

While these are just a few examples of useful tidbits, the combination of GemFire’s in-memory capabilities and Greenplum’s analytical performance serves as an unparalleled stack for moving the needle on developer-centric, cognitive capabilities using PCF’s abundant application logs.