Wednesday, January 20, 2016

State of the Union Speeches and Data

I've done a couple of posts on the SOTU speeches.  In the past these dealt with word count, approval, and the vague notion that the applause the president receives is related to his approval rating at the time (a correlation that was in fact lower this year).

Wired had a good article highlighting the sentiment in the current and previous State of the Union (SOTU) speeches.  They went through the speeches from several past years, highlighted the events that occurred each year, and gave the frequency of the terms in each speech that communicated the impact of those events.  This blog post does not duplicate the article, but I did see the graph and wanted to see whether I would get similar sentiment scores for the speeches.  I used the 'syuzhet' library in R to conduct the analysis (big thanks to Matthew Jockers for the package).

The graph is similar to the one in the Wired article, but not identical.  Some smoothing was involved, and perhaps a different sentiment analysis technique.  We do see a similar finding for the most recent SOTU speech: it ended with the highest sentiment score of all the speeches.  Several of the speeches in my analysis curve upward toward the end, which fits the general idea of "ending on a positive note".  Additionally, one can see "valleys" of lower sentiment between time intervals 50 and 75.  This isn't too surprising, given that the same speechwriter is being used and that the SOTU perhaps has a fairly standard sentiment form (another analysis perhaps?).
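To make the idea of a smoothed sentiment curve over "time intervals" concrete, here is a minimal Python sketch. It is not the code behind the graph (that was done in R with 'syuzhet'); the tiny lexicon, the function names, and the example text are all made up for illustration. It scores each sentence, averages the scores into a fixed number of bins so speeches of different lengths share one time axis, and then applies a simple moving average.

```python
import re

# Toy lexicon for illustration only; this is NOT the syuzhet dictionary.
TOY_LEXICON = {"good": 1.0, "hope": 1.0, "strong": 1.0, "great": 1.0,
               "war": -1.0, "fear": -1.0, "crisis": -1.0, "threat": -1.0}

def sentence_scores(text):
    """Split text into sentences and sum the lexicon values of each one's words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [sum(TOY_LEXICON.get(w, 0.0) for w in re.findall(r"[a-z']+", s.lower()))
            for s in sentences]

def to_bins(scores, n_bins=100):
    """Average sentence scores into n_bins equal slices of the speech."""
    if not scores:
        raise ValueError("need at least one sentence score")
    n = len(scores)
    binned = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = max((b + 1) * n // n_bins, start + 1)  # never an empty slice
        chunk = scores[start:end]
        binned.append(sum(chunk) / len(chunk))
    return binned

def moving_average(values, window=5):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Made-up example text, not a quote from any speech.
speech = "We face a crisis. The threat of war is real. But we are strong. There is great hope."
curve = moving_average(to_bins(sentence_scores(speech), n_bins=4), window=3)
print(curve)
```

With a real speech one would keep the default of 100 bins, which is where a reading like "between intervals 50 and 75" comes from. A different lexicon or smoother will change the shape, which may be part of why two analyses of the same speeches do not match exactly.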

This same library has a function that assigns words to emotional categories.  Its 10 categories include positive and negative categorizations.  Along with these, I added the applause count for each speech and the approval rating at the time of each speech.  The matrix below shows the correlation between each pair of categories, colored by value.  Additionally, I computed a p-value for each relationship; those with p-values above 0.1 are marked with bubbles.
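For readers who want to see what one cell of such a matrix involves, below is a small Python sketch of a Pearson correlation with a p-value. The post does not say how its p-values were computed, so a permutation test is swapped in here as a simple stand-in. The applause and word counts are invented placeholder numbers, not the values from the speeches.

```python
import random
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided p-value: how often shuffled data correlates at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps the estimate above zero

# Placeholder numbers for seven hypothetical speeches (NOT real data).
applause = [80, 95, 70, 88, 76, 90, 84]
negative_words = [110, 130, 98, 121, 105, 126, 115]

r = pearson_r(applause, negative_words)
p = permutation_p_value(applause, negative_words)
print(round(r, 3), round(p, 4))
```

With only a handful of speeches, even a large correlation can have a wide margin of uncertainty, which is why flagging cells by p-value is useful when reading the matrix.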

There's a lot that could be said about the speeches, but I'll only mention a few things that I thought were interesting.  The applause/approval rating correlation showed a weaker value than last year (-.5), which isn't too surprising since this relationship is probably spurious anyway.  Negative word categorization and applause had a higher correlation than positive word categorization and applause.  In other words, when comparing applause and negative word use across speeches, the counts varied in a similar way (higher applause count with higher negative word count, and vice versa).  Speeches with words categorized as "anger" or "fear" had a weak correlation with the applause count.  Conversely, words categorized under emotions like "joy", "surprise", and "trust" show a stronger correlation with applause count in those same speeches.  So perhaps, to get more applause in general, certain positive words are better than others?  Yoda's advice about fear would make sense here, in that words associated with fear tend to vary similarly to words associated with anger.

We also see a decent amount of correlation among the more positive emotions, as well as among the more negative emotions.  This relates back to the common "curve" that these speeches may have: the sentiment used year over year tends to be similar, or at least the emotional categorization of words follows similar patterns.

Thanks to Matthew Jockers, Taiyun Wei, and Hadley Wickham for their work on the 'syuzhet', 'corrplot', and 'ggplot2' packages respectively.  Code for the above analysis is on my github page.

Thursday, January 14, 2016

Philip Glass Composition and Exploding Boxplot

This post will highlight a couple of my favorite things: R programming and the composer Philip Glass.  For those of you not familiar with his works, he basically pioneered the "minimalist" style of piano playing.  He has been writing and performing music since the 60s, and his pieces are still heard in film and other genres.  Much information on the composer can be found at

I put together a file listing all his compositions with their dates and composition styles.  Using a new R package, 'explodingboxplot', the compositions and dates can be visually inspected.  Click the image below to navigate to the interactive plot.

Philip Glass Compositions by Date and Type

When looking at an artist's compositions over time like this, we can make a few observations.  The style of music that he composed most evenly over time was "Chamber" music.  The Wikipedia page describing the time periods of his work labels the present day as the "Chamber music" period.  Early in his career, we can see that his output, and really his reputation, was built on his ensemble and solo compositions, though his solo compositions continued into more recent years (boxplot in red).  His ballets were mostly confined to the 80s and 90s, when his music accompanied the modern choreographer Twyla Tharp in her pioneering contribution to contemporary ballet.

In general, this visualization is quite useful for summarizing categorical time-series data.  As it pertains to the compositions of an artist, it is a nice way to quickly survey their lifetime of work.  The package 'explodingboxplot' is available on github.  Many thanks to Kent Russell for his work on this R package.
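As a rough illustration of the summary a boxplot per category encodes, here is a short Python sketch that groups (category, year) records and computes a five-number summary for each group. It is not how 'explodingboxplot' works internally, and the records are placeholder values invented for the example, not entries from the actual composition file.

```python
from collections import defaultdict
from statistics import quantiles

def five_number_summary(values):
    """Min, quartiles, and max: the numbers a basic boxplot draws."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}

def summarize_by_category(records):
    """Group (category, year) pairs and summarize the years in each group."""
    groups = defaultdict(list)
    for category, year in records:
        groups[category].append(year)
    return {c: five_number_summary(years) for c, years in groups.items()}

# Placeholder records (NOT the real catalogue); each group needs 2+ entries.
records = [("Solo", 1968), ("Solo", 1979), ("Solo", 1994), ("Solo", 2010),
           ("Ballet", 1983), ("Ballet", 1986), ("Ballet", 1991)]

summary = summarize_by_category(records)
for category, stats in summary.items():
    print(category, stats)
```

A narrow spread between the minimum and maximum year, as in the placeholder "Ballet" group, is what shows up on the plot as a short box confined to a couple of decades.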

Friday, December 18, 2015

Marathon Races Shiny App

About a year ago I posted about men's and women's marathon (and longer) distance races from the dataset.  In the meantime, Shiny development and the open source announcement of Plotly have brought data visualization to the next level.  As an avid (at least former) runner, I find exploring marathon data interesting at both the personal and "data science" (or is that personal too?) levels.  Thus, I finished a Shiny app that explores this dataset for 2014.  Unfortunately, 2015 data is not being updated for one reason or another, but 2014 provides a lot of observations about marathon and longer distance races.

Click here to access the app.

The values can be toggled between months for 2014, and a searchable table of all the data is below the graph.  You will notice many of the points are small ultra-marathons around the world.  Plotly provides nice interactive graph controls, which appear when hovering over the upper right corner of the graph.

Thanks to RStudio for all their work on Shiny and to Plotly for their plotly package and charting library.