Someone sent around a link this morning to data on grade inflation at Duke, which shows a table of average GPAs for undergraduates from 1932 on. Looking at the table you can sort of get a sense of when GPAs really started increasing (the ’60s), but it would be nicer to just plot them:
Or to plot the year-over-year change in average GPA, with some missing values interpolated:
I’ve never tried to scrape a website with R before, but it turns out for this it was pretty easy (with some help).
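Once the table is in R, the interpolation and year-over-year plotting only need base R. Here is a minimal sketch using made-up GPA values (not Duke's actual numbers) and `approx()` for the linear interpolation:

```r
# Illustrative only: a short GPA series with gaps, not the actual Duke data
years <- 1960:1969
gpa   <- c(2.4, NA, 2.5, 2.55, NA, NA, 2.7, 2.75, 2.8, 2.85)

# linearly interpolate the missing values
ok <- !is.na(gpa)
gpa_filled <- approx(years[ok], gpa[ok], xout = years)$y

# year-over-year change in average GPA
yoy <- diff(gpa_filled)
plot(years[-1], yoy, type = "h", xlab = "Year", ylab = "Change in average GPA")
```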
A few weeks ago we were asked to teach the basics of (interpreting) duration models to a group of consumers without using any math. When I learned this material it involved a lot of math and Stata, and when you look around the web it’s usually presented similarly. So this was a bit of a challenge.
A nice thing about duration analysis though is that a lot of the key concepts are already explicitly graphical, like survival curves (wikipedia) and hazard rates. Below, for example, is a survival curve for cancer patients diagnosed with acute lymphoblastic leukemia between 1988 and 2008 in the US, from SEER fast stats:
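For intuition, a Kaplan-Meier survival curve takes only a couple of lines in R. This sketch uses the survival package's built-in lung cancer dataset rather than the SEER data above:

```r
library(survival)  # a recommended package that ships with R

# Kaplan-Meier estimate of the survival curve for the built-in lung cancer data
fit <- survfit(Surv(time, status) ~ 1, data = lung)
plot(fit, xlab = "Days since diagnosis", ylab = "Proportion surviving")
```

The dashed lines `plot.survfit` draws by default are the pointwise confidence bands, which is exactly the kind of graphical object that makes duration analysis teachable without equations.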
Will Moore, Kentaro Fukumoto, and I have been working on a random walk negative binomial model for time series of counts, based on earlier work by Kentaro on a negative binomial integrated (NB I(1)) model. We just presented a related poster, in which we look at monthly civilian deaths in Iraq, at Peace Science in Savannah, Georgia. Here is the actual pdf poster (it’s a big file, be warned), but the basic point is that neither ARIMA nor classical count models are a good way to deal with time series of counts, like monthly deaths in a conflict, and that we have a tested model for non-stationary counts with some attractive features.
We are working on a draft paper, so I don’t want to go through the whole story, but if you’d like to try it out yourself and know how to use JAGS, all the R and JAGS code is available on github.
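To see what non-stationary counts look like, here is a quick toy simulation (my own illustration, not the model from the poster or the paper): a latent log-mean follows a random walk, and counts are drawn negative-binomially around it, so the series drifts persistently in a way stationary count models are not built for.

```r
set.seed(1)
n   <- 200
lmu <- log(50) + cumsum(rnorm(n, 0, 0.1))   # latent random walk on the log scale
y   <- rnbinom(n, size = 5, mu = exp(lmu))  # overdispersed counts around exp(lmu)
plot(y, type = "l", xlab = "Month", ylab = "Count")
```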
A few months ago I produced some thematic maps of Bosnia (paper) using
maptools and other packages in R, but I didn’t include scales or a north arrow. It sounds simple and
sp has functions for doing those things, but I couldn’t get it to work well with my maps. Here is a basic map of Bosnia’s pre-war municipalities:
The Iraq Body Count project collects reports of civilian deaths, and makes their event data publicly available. Each event gives the date, location, description and civilian deaths associated with an incident. Looking at a few examples [1, 2, 3], you can see that while the data values for the date and deaths are straightforward, the place values get a little bit complicated. I’m looking for the province in which incidents occurred, so the challenge is to associate each place value with a province.
Using the incident data from 2003 to February 2012, about 27,500 records, I’ve written an R script that assigns provinces to about 95 percent of the records (roughly 26,000).
Here’s a basic overview of how it works:
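In rough terms, the matching step looks something like this (hypothetical place strings and a stub province list for illustration, not the script's real lookup tables): scan each place value for a known province name and accept the match only when it is unambiguous.

```r
# Stub lookup table and place strings, for illustration only
provinces <- c("Baghdad", "Basra", "Ninewa")
places <- c("Sadr City, Baghdad", "central Basra", "Mosul, Ninewa", "unknown village")

# return the province if exactly one province name appears in the place string
match_province <- function(place) {
  hit <- provinces[vapply(provinces,
                          function(p) grepl(p, place, fixed = TRUE),
                          logical(1))]
  if (length(hit) == 1) hit else NA_character_
}
matched <- sapply(places, match_province)
matched  # the last record stays NA and would need manual or city-level matching
```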
Almost all states, at least at some point between 1995 and 2005.
The Ill-Treatment and Torture (ITT) project by Courtenay Conrad and Will Moore codes Amnesty International (AI) allegations of government torture, including the perpetrator, motive, and judicial response. The aggregated, country-year version of their data shows whether AI made allegations against a country in a given year and if so, what the extent of alleged torture or ill-treatment was, on a 5-point scale from “infrequent” to “systematic”.
Here is a video showing the AI torture allegations from 1995 to 2005 using their country-year data and shape files for world borders from Thematic Mapping.
The first thing that struck me about this is the sheer extent of (alleged) torture and ill-treatment. It looks like pretty much all major states engaged in torture at some point between 1995 and 2005. Only 8 out of 151 states had no allegations of torture at all (Costa Rica, Uruguay, Finland, Benin, Gabon, Qatar, Singapore, and New Zealand), and among the remaining states with AI allegations of torture, there were on average allegations in 7 out of 10 years. More than a quarter of states were accused of torture or ill-treatment in all 10 years covered by the data.
That doesn’t necessarily mean that a lot of torture or ill-treatment is going on in any specific country, nor that it is systematic. It doesn’t reflect what the specific acts of torture or ill-treatment were, e.g. whether someone was tortured to death or water-boarded (which may not be different). But, nevertheless, unpleasant stuff happens.
R code and source. This produces images for each year that I strung together in iMovie.
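The frame-per-year part is just a loop around a graphics device. A stripped-down sketch, with a placeholder plot standing in for the actual choropleth map:

```r
# One image per year, to be stitched into a video afterwards
for (yr in 1995:2005) {
  png(sprintf("map_%d.png", yr), width = 800, height = 600)
  # the real script draws the world map shaded by that year's AI allegations here
  plot.new()
  title(main = paste("AI torture allegations,", yr))
  dev.off()
}
```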
For the most part I don’t do things that are so computationally intensive that I can’t run them on my work desktop. There have been a few times, however, when I ran simulations or bootstrapped models, and now Bayesian models with MCMC, that take a while to run. One solution has been to run things on FSU’s high-performance computing cluster. For someone like me without a background in computer science or programming, that takes a bit of effort, and it is inconvenient in several ways.
An alternative is to use Amazon’s EC2 cloud computing service. A few weeks ago I started playing around with it, and running basic instances for a limited time is actually free. I use R/RStudio for the most part, but was unsure which AMI to use to avoid having to install R/RStudio myself. Fortunately someone has created AMIs (Amazon Machine Images) with RStudio Server preinstalled, which, once running, lets you use RStudio through your web browser.
If you started the instance with the default security group settings like I did, you will also have to open port 80 to get access. Go to the security group settings in the Amazon management console, select whichever group your instance runs under (e.g. default), and add a custom TCP rule for port 80 (i.e. port range 80). Add the rule and apply it. Find your instance address (under instances, at the bottom; it’s the string that ends with amazonaws.com, e.g. ec2-184-88-8-888.compute-1.amazonaws.com), paste it into your browser, and you should get to an RStudio login page. The default username and password are both “rstudio”. And there you are.
In many circumstances political scientists study binary dependent variables that have been measured with bias. For example, in surveys the strategic interests of actors can lead them to misrepresent an attitude or behavior to the surveyor in a non-random fashion. Data on terror or torture that are coded using media reports likely suffer from a similar bias related to factors such as freedom of the press in a country.
To give you an idea of what this new model allows one to do, consider the issue of self-reported infidelity between romantic partners. In survey data, the reported rate of infidelity is about 13% of the sample. Yet common sense suggests that this rate should be higher, at least due to social desirability bias leading respondents who did in fact cheat to lie about it to avoid the negative stigma. The split population logit model allows us to estimate respondents’ rates of honesty and infidelity separately, as shown in the table excerpt from our paper. It shows, for example, that 41% of the sample likely cheated on their partner, but also that around three-quarters chose to lie about it when surveyed. Quite a difference from the 13% reported in the observed data.
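The mechanics can be sketched in a few lines of R. This is my own toy illustration of a split population likelihood, with simulated data and made-up coefficients, not the estimator or data from the paper: a respondent reports infidelity only if they both cheated and answered honestly, so the probability of an observed "yes" is a product of two logits.

```r
set.seed(42)
n <- 50000
x <- rnorm(n)                          # covariate for the infidelity equation
z <- rnorm(n)                          # covariate for the honesty equation
p_cheat  <- plogis(-0.5 + 1.0 * x)     # Pr(cheated)
p_honest <- plogis( 1.0 - 0.8 * z)     # Pr(honest report | cheated)
y <- rbinom(n, 1, p_cheat * p_honest)  # observed admission of infidelity

# negative log-likelihood of the split population logit
negll <- function(b) {
  p <- plogis(b[1] + b[2] * x) * plogis(b[3] + b[4] * z)
  -sum(dbinom(y, 1, p, log = TRUE))
}
fit <- optim(c(0, 0, 0, 0), negll, method = "BFGS")
round(fit$par, 2)  # true values: -0.5, 1.0, 1.0, -0.8
```

Note that separating the two equations leans on having covariates that plausibly affect cheating and honesty differently; with identical covariates in both parts, the split is only weakly identified.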
Here are replication files for the simulations we used to evaluate our estimator and replication files for the infidelity example. The simulations were run through Florida State University High Performance Computing.
Paper to follow in a few weeks.
Another set of notes from when I was TA for our Advanced Quantitative Methods course with Prof. Matt Golder in 2008. The notes for Programming MLE models in Stata (pdf) walk you through how to recreate your own logit regression command and ado files for Stata, as well as how to use simulations to check your model. Here are also the associated ado and do files.
The notes are closely based on Maximum Likelihood Estimation with Stata (2006, see full citation in the notes), which is definitely worth it if you are considering writing your own MLE commands in Stata.
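For readers without Stata, the same exercise translates directly to R: hand-code the logit log-likelihood, maximize it numerically, and check the answer against the built-in estimator. A minimal sketch with simulated data:

```r
set.seed(7)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 0.9 * x))  # simulated logit data

# hand-coded negative log-likelihood for the logit model
negll <- function(b) -sum(dbinom(y, 1, plogis(b[1] + b[2] * x), log = TRUE))

fit     <- optim(c(0, 0), negll, method = "BFGS")  # our own MLE
glm_fit <- glm(y ~ x, family = binomial)           # the built-in estimator

round(rbind(optim = fit$par, glm = coef(glm_fit)), 3)  # should agree closely
```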
A couple of lab notes from 2009, when I was TA for our Basic Quantitative Methods course: