Wednesday, June 17, 2015

Language Difficulty and Diversity

*For R users interested in the code rather than the post, a markdown file is available on GitHub.  Thanks to Zuguang Gu and Bob Rudis for the 'circlize' and 'waffle' packages, respectively.

I've been studying Arabic for about 10 months now and had some thoughts I wanted to post about.  It's challenging, but I didn't know exactly how challenging compared to other languages, even though I knew it was on the harder end of the spectrum.  It turns out someone has measured (or attempted to measure) the amount of "class time" a native English speaker would need in order to learn a given language.  In general, I've found my language ability improves most when I complement time in the classroom with practice with native speakers (which I think would be a more useful measure to track alongside "class time").

The data for the chord diagram below was retrieved from a language wiki site that drew on a study from the Foreign Service Institute.  The number of class hours for each language communicates a scale of difficulty more than the exact number of hours it would take to speak the language (not all learners are equal).  This seemed to be a pretty comprehensive list of world languages, so I thought it could look nice in a chord diagram.
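For anyone curious how a diagram like this is built, here is a minimal 'circlize' sketch.  The languages, difficulty categories, and hour figures below are placeholders for illustration, not the actual FSI data.

    library(circlize)

    # Hypothetical subset of FSI-style data: difficulty category, language,
    # and approximate class hours (placeholder values only).
    lang_df <- data.frame(
      category = c("Category I", "Category I", "Category IV", "Category V", "Category V"),
      language = c("Spanish", "French", "Russian", "Arabic", "Mandarin"),
      hours    = c(600, 600, 1100, 2200, 2200)
    )

    # chordDiagram() accepts a from/to/value data frame and draws links
    # between categories and languages, sized by class hours.
    chordDiagram(lang_df)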



From my own experience, I've been studying for about 10 months: not intensively per se, but about 6 hours per week along with conversation practice on my own.  That means, even allowing for a few missed weeks, I'll have logged about 300 class hours at the one-year mark, which is kind of disappointing considering I supposedly need 2,200 class hours!  I think these numbers are actually REALLY conservative, but they do set some sort of benchmark of difficulty for comparing languages from an English speaker's perspective.  I'm conversational now and feel comfortable with the language (though by no means fluent) after about 300 hours, which, as a side note, is why I think being immersed would decrease the amount of class time above dramatically.
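For what that back-of-the-envelope math looks like (assuming roughly 50 weeks of study in a year, which is my assumption rather than anything from the FSI figures):

    hours_per_week  <- 6
    weeks_per_year  <- 50                                  # allowing for a few missed weeks
    hours_logged    <- hours_per_week * weeks_per_year     # ~300 class hours in a year
    hours_remaining <- 2200 - hours_logged                 # ~1,900 hours to go
    pct_complete    <- round(hours_logged / 2200 * 100, 1) # ~13.6% of the FSI estimate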

So, in learning Arabic and spending a supposed 2,200 hours studying it, how many more people can I actually communicate with?  Well, a lot more.  But in terms of world population, I was surprised at the share of people accounted for by the top 5 most spoken languages (which include Arabic).  I honestly had no idea language was quite this diverse (the top 5 languages cover only about 35% of the world's population; I thought it would be more, but that's just me).  Furthermore, if we get into dialects, these percentages decrease further.

Link to data
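A minimal 'waffle' sketch of that idea is below.  The percentages are rounded placeholders standing in for the linked data, not the exact figures.

    library(waffle)

    # Rough share of world population by first language (placeholder values):
    # the top 5 languages vs. everything else, one square per percentage point.
    lang_share <- c(
      "Top 5 languages"     = 35,
      "All other languages" = 65
    )

    # Draw the waffle chart, 5 squares per row.
    waffle(lang_share, rows = 5)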

In real terms though, being able to speak with millions of additional people is fantastic, and I would encourage anyone to pursue such an endeavor.  As for measuring my ability to communicate globally in percentage terms, perhaps that scale is better reserved for those with a unique gift for language acquisition.


Sunday, April 19, 2015

Boston Elite Field 2015

Last year I posted about how the chances of a runner from a non-African country winning the Boston Marathon seemed good because of the widening interval of winning times (more recently there had been some historically "slower" races and some historically "faster" ones), and this actually happened.  Meb Keflezighi ran a remarkable race and was widely celebrated as he represented the US in a race more recently dominated by African countries.  His winning time was obviously the fastest on the day, but others in the field had faster PRs.  Because of the variation in winning times, my conclusion has been that this gives certain runners representing non-African countries an opportunity to contest the race well.


The number of participants from Africa in the elite field clearly increases the likelihood that the winner represents an African country.  The runners in the elite field mostly fall into or below the confidence interval shown in the graph above, with the slight exception of Matt Tegenkamp, whose marathon PR of roughly 2:12 sits just above what this statistical measurement would encompass.  It is clear that once again the elite field is dominated by African runners who are putting up some really impressive PRs.
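For anyone wondering how an interval like the one in the chart can be produced, here is a rough sketch.  The winning times below are hypothetical stand-ins for the historical Boston results used in the original plot.

    # Hypothetical historical winning times (in decimal hours) by year --
    # placeholders for the actual Boston Marathon results.
    boston <- data.frame(
      year = 2005:2014,
      time = c(2.195, 2.124, 2.240, 2.129, 2.143, 2.098, 2.051, 2.211, 2.171, 2.140)
    )

    # Fit a simple trend and ask for a 95% prediction interval around it;
    # the width of that band is the "interval of winning times" referred to above.
    fit  <- lm(time ~ year, data = boston)
    band <- predict(fit, newdata = data.frame(year = 2015), interval = "prediction")
    band  # fit, lwr, upr for the 2015 winning time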



And yet, despite the difference in PRs, last year showed a similar dynamic: Dennis Kimetto came to the race with a 2:03 PR, and Meb Keflezighi, whose previous best was 2:09, won the Boston Marathon.  Thus we have another great story this year: incredible athletes, some of whom have run much faster than others in the past.  And yet, who can tell what will happen on race day.

But why try?  Why did Meb think he could beat someone who, in marathon terms, could go somewhere he could not?  More broadly, why do we love these events?  Why should Matt Tegenkamp attempt to rival someone who would be 2 miles ahead of him on each of their best days?  Variance.  Among these elite athletes there is the notion that on any given day, the guy next to you could be at his best or his worst.  As spectators, we're drawn to variance...we love the possibility of things not turning out predictably, or of variation in what we assume to be true.  Athletes place their hopes in this: that they could run their absolute best while others may not.  Confidence intervals tell the story of variance, that statistically we can't know for certain.  I think this year, yet again, we could see this same variance play out: the athlete who doesn't have the fastest PR runs their best despite the odds.  That is what makes a great race and what we could see again tomorrow.

Monday, January 26, 2015

Presidential Approval and Applause

Some may have seen the Twitter post about spurious correlations that I and others mentioned.  Basically, it was a joke about how correlation can be found between many things that certainly have no influence on each other.  I mention this because this post may or may not be in that category ;-)

About this time last year I looked at the two most recent State of the Union speeches and talked about the political priorities ostensibly shown in each.  For those who don't know, the State of the Union is the speech the President of the United States delivers at the beginning of each year to a joint session of Congress (that is, both the House and Senate).  For the most part, or at least traditionally, the aim of the speech is to outline the President's priorities for the coming year and to give a general sense of where the United States stands, or the "state of the union".

One of the more nuanced parts of the speech is that there are periods where the President is either interrupted with applause by members who approve of what he is saying, or where he pauses to allow for applause (typically from his party).  The speeches are fairly lengthy; over the past several years they have averaged about an hour.  It turns out applause is a big part of the speech (it's polite, after all).  Across President Obama's speeches, the word "applause" appears more than any other word (outside of articles).  If each round of applause lasts about 10 seconds on average, we're looking at somewhere around 12-13 minutes of total applause per speech.  The transcripts only mark each instance as "(applause)", so the actual length of each round isn't captured.
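A rough sketch of that word count is below, assuming the transcript has been saved locally as a plain-text file; the file name is made up for illustration.

    # Read a State of the Union transcript (hypothetical file name) and count
    # how many times "applause" appears in the text.
    speech <- tolower(paste(readLines("sotu_2014.txt"), collapse = " "))
    applause_count <- lengths(regmatches(speech, gregexpr("applause", speech)))

    # If each round of applause lasts roughly 10 seconds, estimate total minutes.
    applause_minutes <- applause_count * 10 / 60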

So what's the point, other than that's a lot of clapping?  I wanted to look at whether there were any similarities between the applause given and the President's approval rating.  Appropriately, I'll be using a popular graph theme from the political analysis site fivethirtyeight to display this brief analysis.  The theme was put together in R here by Austin Clemens (thanks!).
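Something along these lines reproduces the look of the chart.  It is a sketch only: it uses theme_fivethirtyeight() from the 'ggthemes' package as a stand-in for Austin Clemens's theme, and the applause counts and approval figures are placeholders chosen to mimic the pattern described below, not the actual data.

    library(ggplot2)
    library(ggthemes)  # theme_fivethirtyeight() as a stand-in for the custom theme

    # Placeholder values -- counts of "(applause)" per speech and approval
    # around the time of each address (not the actual figures).
    sotu <- data.frame(
      year     = 2009:2014,
      applause = c(61, 92, 65, 70, 66, 84),
      approval = c(64, 48, 46, 50, 52, 43)
    )

    ggplot(sotu, aes(x = year)) +
      geom_line(aes(y = applause, colour = "Applause")) +
      geom_line(aes(y = approval, colour = "Approval")) +
      theme_fivethirtyeight() +
      labs(title = "Applause and approval, State of the Union")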


Just by looking at the two lines, one indicating the number of times applause occurs during the speech and the other indicating approval (though the scale on the left isn't in percentage terms), we can see that applause doesn't change a lot, except in two years: 2010 and 2014.  In 2010 he received about 50% more applause than in the other years, and about 30% more in 2014.

The question becomes: is the applause tactical, a way for the party to show support for the President in a period of declining approval?  The correlation coefficient was -0.50, but as with spurious correlations, this may say nothing about the influence of approval on the amount of applause.  Just looking at the graph, the percentage changes aren't the same for applause and approval, but the year-over-year changes clearly have an inverse relationship: the applause line moves up between 2009 and 2010 while the approval line moves down over the same period.
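And a quick check of the relationship itself, again with placeholder numbers chosen only to mimic the pattern described above rather than the actual counts:

    # Placeholder applause counts and approval figures, 2009-2014.
    applause <- c(61, 92, 65, 70, 66, 84)
    approval <- c(64, 48, 46, 50, 52, 43)

    cor(applause, approval)   # negative; the post reports roughly -0.50 for the actual data

    # Year-over-year changes generally move in opposite directions:
    diff(applause)
    diff(approval)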

Showing support and unity for a party leader by applauding is certainly reasonable, especially when support may be lacking from the general public.  Whether that applause is planned in advance in response to the approval rating is harder to guess.  Then again, it's a bit more fun to think that members of Congress would use it tactically:


Code for this will appear on my GitHub page.