Don’t Fear the Statistics – Using OBI for Statistical Analysis Part 1
Recently, Ranzal has been working with a client in the healthcare space implementing Oracle Business Intelligence (OBI), and a requirement surfaced to translate a scorecard report into an OBI dashboard. One of the data elements was simply captioned “Trend” and colored red, yellow, and green. It was discovered that this Trend was the slope of a linear regression plot (more on what that means in a moment) and the color was based on an arbitrarily chosen number. This immediately raised some concerns from the Ranzal team who then made some suggestions for more pertinent statistical analysis.
To set the stage, this healthcare client’s summarized (and greatly simplified) income statement divides Revenue into Inpatient and Outpatient and Expenses into Total Labor and Non Labor. Revenue and expenses are the primary focus of much of the analytics at an aggregate level. A single (seemingly arbitrarily chosen) number was used to determine the colored flags for each of these measures. This was despite Inpatient Revenue and Non Labor Expenses comprising the majority of the revenue and expense amounts (respectively). If we were to plot out these categories for the first five months of a fiscal year, we see the following (all data have been altered to preserve client confidentiality without overly affecting the overall analytic output):
Figure 1 Revenue and Expense Trend Plot
The trouble with plotting a trend of numbers is that it is sometimes difficult to understand, at a glance, how the organization is performing. In the plots above, clear downward and upward trends can be seen for Inpatient Revenue and Total Labor Expense (respectively). However, upon closer examination of Outpatient Revenue and Non Labor Expense, there are two upward trending months and two downward trending months. The overall trend is difficult to discern.
Oracle Business Intelligence Enterprise Edition (OBIEE) 12c introduced a Trendline function that allows the creation of a linear regression trendline. Once this is applied, the above trend plots can be augmented to get a clearer picture of performance:
Figure 2 Revenue and Expense Linear Regression
This trendline uses a simple linear regression formula made up of the slope (commonly represented by the letter m) and the intercept (commonly represented by the letter b):
y = mx + b
In our trend plots, the letter y represents the revenue and expense categories and x represents the fiscal periods.
The intercept is where the trendline crosses the y-axis, that is, the value of y when x is equal to zero. For this kind of trend analysis, the intercept is usually unimportant. The slope can be thought of as the average change in y for each one-unit change in x. Using OBI, the slope of each revenue and expense category can be calculated and the dashboard updated:
Figure 3 Linear Regression Slope
In the example above, the slope for Inpatient Revenue can be read as revenue decreasing by an average of $291,000 a month.
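To make the slope and intercept concrete, here is a minimal sketch of an ordinary least-squares fit in Python. It is not how OBI computes its trendline internally. The function name `linear_fit` and the monthly figures are made up for illustration and are not the client's data.

```python
from statistics import mean

def linear_fit(xs, ys):
    """Return (slope m, intercept b) of the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b

# Hypothetical monthly revenue in $ thousands for fiscal periods 1 through 5.
periods = [1, 2, 3, 4, 5]
revenue = [5200, 5000, 4900, 4500, 4100]

slope, intercept = linear_fit(periods, revenue)
print(f"slope: {slope:.1f} per period, intercept: {intercept:.1f}")
```

With these invented numbers the slope is negative, which reads the same way as the dashboard value: revenue falling by that amount, on average, each period.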
One issue with using the slope is that it is subjective. As was mentioned, our healthcare client had chosen a single arbitrary slope for each of the revenue and expense categories. The slopes in the example above range from 29 thousand to -291 thousand. Complicating matters, the client wanted the ability to run these analyses for individual hospitals, which can dramatically affect the slope. For instance, a hospital operating in Kansas City will probably not have the same revenue growth (or shrinkage) as a hospital operating in New York City. To use the slope properly as a quantifiable objective, a target slope would have to be determined for the enterprise and for each granular level expected to be benchmarked (hospital, department, etc.). This creates some obvious maintenance issues.
A more objective approach is to use the correlation coefficient, a number that ranges from negative one to positive one. A correlation coefficient of one indicates a perfect positive correlation, while a coefficient of negative one indicates a perfect negative correlation. For instance, for most companies, the number of units sold has a high degree of positive correlation to revenue. This would correspond to a correlation coefficient close to one. For many companies working in the commodities market, the more a competitor’s revenue increases, the lower the possible market share. This would be a negative correlation and result in a correlation coefficient close to negative one. A correlation coefficient of zero indicates a lack of any linear correlation. For instance, the number of broken arms set in a New York hospital is probably uncorrelated to the number of bowls of soup served by Panera Bread in Kansas City.
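The coefficient described here is usually the Pearson correlation coefficient. A minimal sketch follows, assuming that is the measure in use. The function name `pearson_r` and both data series are hypothetical, chosen only to mirror the units-sold and market-share examples above.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length series, or None if undefined."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / sqrt(sxx * syy)

# Hypothetical data: revenue rises with units sold.
units_sold = [10, 20, 30, 40, 50]
revenue = [105, 198, 310, 395, 502]

# Hypothetical data: market share falls as a competitor's revenue rises.
competitor_revenue = [10, 20, 30, 40, 50]
market_share = [50, 44, 41, 33, 30]

r_positive = pearson_r(units_sold, revenue)
r_negative = pearson_r(competitor_revenue, market_share)
print(f"units vs revenue: {r_positive:.2f}")
print(f"competitor revenue vs share: {r_negative:.2f}")
```

Because the result is always between negative one and one, the same reading applies whether the inputs are in dollars, units, or percentage points.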
It is worth noting that correlation does not mean causation. For example, consider the number of pirate attacks and the number of Microsoft Internet Explorer (IE) users:
Figure 4 IE Usage and Pirate Attacks
The number of pirate attacks and IE users have both been in decline since 2009. As can be seen by the scatter graph on the right, the more pirate attacks, the greater the use of IE. Regardless, naval security experts are probably not asking for adoption rate reports from Microsoft.
Returning to the client’s use case, adding the correlation coefficient to the dashboard provides a greater understanding of how the company is objectively performing:
Figure 5 Month and Revenue / Expense Category Figure Correlation
Inpatient Revenue has a correlation of -0.69, a moderately strong downward relationship for a metric most businesses want to increase. By contrast, Outpatient Revenue has a weaker negative correlation of -0.36. While this should be a cause for concern, a “wait and see” approach (or a deeper dive into Outpatient Revenue categories) might be more prudent. Because the correlation coefficient always ranges from negative one to one, filtering this analysis down to a more granular level, such as a hospital or department, will still return an objective number that can be interpreted on the same scale.
There are cases in which the subjectivity of the slope is particularly useful. In the case of our client, a full year budget was prepared at the beginning of the fiscal year and periodically updated as the year progressed. The slope of this budget could be used to generate the average dollar change desired per month. The advantage of this is that it reduces the possible volatility of a particular month into a single number that can be compared to the benchmark. As a final addition to the dashboard, a full year budget slope was added:
Figure 6 Full Year Budget Slope
With the exception of Non Labor Expenses, this organization is missing the mark on all of their budgetary goals, and the trend indicated by the actual slope and correlation coefficient means this situation is likely to get worse.
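The comparison between the actual slope and the full year budget slope can be sketched as follows. This is an illustration under stated assumptions, not the dashboard's logic: the helper names `slope` and `meets_budget_trend`, the rule that revenue should trend at or above budget while expenses should trend at or below it, and all figures are hypothetical.

```python
from statistics import mean

def slope(values):
    """Least-squares slope of values against periods 1..n."""
    periods = list(range(1, len(values) + 1))
    x_bar, y_bar = mean(periods), mean(values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(periods, values))
    sxx = sum((x - x_bar) ** 2 for x in periods)
    return sxy / sxx

def meets_budget_trend(actual_slope, budget_slope, higher_is_better):
    """Assumed rule: revenue should grow at least as fast as budgeted,
    expenses should grow no faster than budgeted."""
    if higher_is_better:
        return actual_slope >= budget_slope
    return actual_slope <= budget_slope

# Hypothetical figures in $ thousands (not the client's data).
budget_full_year = [5000 + 50 * month for month in range(12)]  # 12 budgeted months
actual_to_date = [5200, 5000, 4900, 4500, 4100]                # 5 months of actuals

budget_slope = slope(budget_full_year)
actual_slope = slope(actual_to_date)
on_track = meets_budget_trend(actual_slope, budget_slope, higher_is_better=True)
print(f"budget slope {budget_slope:.1f}, actual slope {actual_slope:.1f}, on track: {on_track}")
```

The benchmark here comes from the budget itself, so no separately maintained target slope is needed for each hospital or department.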
A word of warning about statistics in general and the use of slope and correlation coefficient in particular: both micro and macro trends should be considered, and extreme outliers can mask actual trends.
For an example of micro and macro trends, consider JCPenney, a retailer that has been struggling since 2010. The following visualization (created using Oracle Data Visualization Desktop) charts the quarterly revenue from 2004 Q3 to 2016 Q4 along with the trendline for the entire period. The bars represent the correlation coefficient calculated up to that particular quarter (i.e. the first bar is the correlation across 2004 Q3 and 2004 Q4, while the second bar is the correlation across 2004 Q3, 2004 Q4, and 2005 Q1, etc.):
Figure 7 JCPenney Revenue Trend and Correlation
Notice that the first correlation bar is equal to one. When there are only two data points, the correlation coefficient will be equal to one or negative one; if the two values are identical, it is mathematically undefined, which a tool may report as zero. The next data point and correlation for 2005 Q1 (JCPenney recognizes holiday revenue in Q1 of each year) continues the high correlation streak. However, the following quarter drops the correlation down to 0.35. The correlation fluctuates quarterly until about 2012 Q2, when a definite downward trend is established.
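The running correlation behind those bars can be sketched as an expanding-window calculation: at each quarter, correlate everything seen so far against the period number. This is an assumption about how such bars could be produced, not a description of Oracle Data Visualization Desktop internals. The function names and the revenue figures are hypothetical and are not JCPenney's numbers.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length series, or None if undefined."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # undefined, e.g. two identical values
    return sxy / sqrt(sxx * syy)

def expanding_correlation(values):
    """Correlation of the first k values against periods 1..k, for k = 2..n."""
    return [
        pearson_r(list(range(1, k + 1)), values[:k])
        for k in range(2, len(values) + 1)
    ]

# Hypothetical quarterly revenue in $ billions with a seasonal spike.
quarterly_revenue = [4.0, 4.5, 6.0, 4.2, 4.4, 4.8, 6.3, 4.1]

running = expanding_correlation(quarterly_revenue)
for quarters_used, r in enumerate(running, start=2):
    print(quarters_used, None if r is None else round(r, 2))
```

The first value is always one or negative one when the first two figures differ, which is why early bars say little about the real trend.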
A savvy analyst will break JCPenney’s performance during this time range into three distinct trends: an upward trend from 2004 to 2008 Q1, a diminished upward trend from 2008 Q2 to 2012 Q1, and then flat but greatly reduced revenue from there:
Figure 8 JCPenney Distinct Trends
As an example of how an extreme outlier can affect statistical analysis, consider GTx Incorporated, a pharmaceutical drug developer. In December 2010, GTx recognized $49.9 million in revenue from a partnership with Merck & Co., Inc., which spiked GTx’s revenue (previously averaging $2 million a quarter) to $56.7 million:
Figure 9 GTx Incorporated Revenue Trend
In the visualization above, the orange projected trendline was calculated using revenue from 2004 Q1 through 2009 Q4. The purple trendline is the projection calculated using 2010 Q1, which includes the huge revenue spike. Obviously, the orange trendline is the more accurate one due to the exclusion of the extreme data point.
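The effect of a single extreme point on a trendline can be reproduced with a small sketch. The baseline series below is invented to hover around the $2 million a quarter mentioned above, and only the 56.7 spike value comes from the article. The helper name `slope` is hypothetical.

```python
from statistics import mean

def slope(values):
    """Least-squares slope of values against periods 1..n."""
    periods = list(range(1, len(values) + 1))
    x_bar, y_bar = mean(periods), mean(values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(periods, values))
    sxx = sum((x - x_bar) ** 2 for x in periods)
    return sxy / sxx

# Hypothetical quarterly revenue in $ millions, roughly flat around 2.0.
baseline = [1.8, 2.1, 2.0, 2.2, 1.9, 2.0, 2.1, 1.9]
# The same series with one extreme quarter appended.
with_spike = baseline + [56.7]

slope_without = slope(baseline)
slope_with = slope(with_spike)
print(f"slope without spike: {slope_without:.3f}")
print(f"slope with spike:    {slope_with:.3f}")
```

One outlier moves the fitted slope from nearly flat to several million dollars of apparent growth per quarter, which is why the analyst has to decide whether such a point belongs in the trend at all.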
Statistical analytics is part science/technology and part art. As with any data and visualizations, a certain degree of intelligent interpretation is needed to determine what it all really means. Functional users should be focused on what the various statistical interpretations mean and not be distracted by the complexity of the underlying mathematical functions. Trend visualizations can aid users in understanding how to interpret these statistical calculations. Many organizations miss opportunities because individuals are unwilling to embrace statistical methods due to the lack of solid education and guidance about what these numbers really mean. Training, change management, and the creation of rich visualizations can help enterprises harness the capabilities of statistical analysis and extend the role of their business intelligence systems.
Jason L. Hodson is a Principal Architect with Edgewater Ranzal. He focuses on the Oracle Business Intelligence platform, with particular emphasis on the federation of EPM and relational data sources, Business Intelligence Cloud Service (BICS), as well as data governance with Hyperion DRM. He has experience with clients in the insurance, public utilities, manufacturing distribution, and healthcare industries. A former U.S. Marine, Jason has an undergraduate degree in mathematics/physics from Ball State University, an MBA and MS-Information Systems from the University of Cincinnati, and an MS-Information and Knowledge Strategy from Columbia University. He currently resides in Denver, CO and enjoys hiking, snowshoeing, and the local craft beer industry.