If you haven’t seen The Right Tool, you should check it out. It’s a little web doodad where you can rank programming languages according to how well they’re described by various statements, like “This language is very flexible” or “The thought that I may still be using this language in twenty years time fills me with dread”. It summarizes the results and displays them, letting you see what the consensus view is on each language, which languages are similar to or different from which others, etc.
I saw this a year or so ago, and one thing that intrigued me was the variety in the types of statements on which you could rate the languages. Some of them have an obviously subjective, value-judgment nature, and seem to be evaluating the language on overall quality. Other statements seem to have to do more with how widely-used the language is, independent of how much people like it. Still others describe the language’s suitability for particular tasks.
The Right Tool lets you see how languages compare on a particular statement, or how a particular language stacks up on various statements, and it lets you compare two languages. It also lets you see which statements are “similar to” other statements, in the sense that languages described well by one statement also tend to be described well by another. What it doesn’t do is let you see how languages perform on groups of statements like the ones I just described — i.e., “quality-judgment” statements vs “versatility” statements.
In a happy twist of fate, I recently discovered that raw data from The Right Tool is available. So I decided to use that data for my own nefarious purposes, namely, to see how languages compare when you group the statements into categories and aggregate language rankings within a category. What I wanted to see was whether, for instance, languages that are widely-used are also well-liked, and that sort of thing. Why did I want to see this? Because it’s… an interesting question!
The data from The Right Tool consists of rankings. To rank programming languages on The Right Tool, you have to tell it what languages you know. Then it shows you one statement at a time, and you have to rank all the languages you know from “best described by this statement” to “worst described by this statement”.
This is an interesting methodology for a couple reasons. First, it means that not everyone is ranking the same things. Since people only rank the languages they know (or say they know), everyone is leaving some languages out. Only someone who is willing to boldly claim that they know Cobol gets the chance to rank it. Second, it forces each person to rank every language they know, on every statement. This can produce peculiar results, because it forces people to make essentially meaningless choices at the bottom of the list where two languages are pretty much equally bad at something. (Quick, which is worse for writing a web app: R or Matlab?)
The data I downloaded from The Right Tool has 109 statements. I categorized each statement as being in zero or more of three categories:
These are statements that seem to be describing features that are inherently evaluative — that is, if a language is described well by one of these statements, that is either unequivocally a good thing or unequivocally a bad thing (depending on the statement). Prototypical examples would be statements like “I use this language out of choice” and “The thought that I may still be using this language in twenty years time fills me with dread”. There is no reasonable way that using a language because you like to can mean something bad about the language, and there is no reasonable way that dreading having to use the language can mean something good.1 This category also includes good and bad things that extend beyond the language per se, like “Third-party libraries are readily available, well-documented, and of high quality”. Third-party libraries may not be “part of the language” in the strict sense, but, nevertheless, having good third-party libraries available cannot be a bad thing.
These statements characterize the given language as popular, well-known, influential, or otherwise possessing some sort of widespread currency. This is, of course, distinct from quality: there are languages that are pretty universally regarded as sucky yet widely used (PHP being perhaps the classic example). A prototypical example is “This is a mainstream language”. Some statements simply indicate popularity piecemeal via individual use: “I use this language regularly” doesn’t say that it’s widely used, but if a lot of people answered yes to that, then the language is widely used.
This category is different from the others in that the statements don’t all index a single dimension. Rather, each statement in this category indicates that a language is good at one particular thing. Thus, a language’s aggregate performance in this category indicates how suitable it is across a range of tasks. These statements include things like “I would use this language for mobile applications” and “This language excels at text processing”.
Here’s a table showing all the statements, along with how I categorized each one.
A value of 1 in a particular category means that statement is a “plus” in that category, a -1 means it’s a negative, and zero means that statement is neutral with respect to that category. This accounts for statements that measure the same dimension with opposite polarity: for instance, “I enjoy using this language” and “This language has an annoying syntax” both describe quality, but in opposite directions.
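To make the polarity encoding concrete, here's a minimal sketch of how it could work. The statement-to-category table below is a tiny hypothetical excerpt (the full table has 109 statements), and the function name is my own invention, not from the actual analysis:

```python
# A few statements mapped to their polarity (+1, -1, or 0) in each category.
# This is an illustrative excerpt, not the real categorization table.
CATEGORIES = {
    "I enjoy using this language": {"quality": 1, "popularity": 0, "versatility": 0},
    "This language has an annoying syntax": {"quality": -1, "popularity": 0, "versatility": 0},
    "This is a mainstream language": {"quality": 0, "popularity": 1, "versatility": 0},
}

def category_vote(statement, winner, loser, category):
    """Turn 'winner was ranked above loser on statement' into a category-level
    vote, flipping the pair when the statement's polarity is negative."""
    polarity = CATEGORIES[statement][category]
    if polarity == 0:
        return None  # statement is neutral with respect to this category
    return (winner, loser) if polarity == 1 else (loser, winner)

# Ranking Perl above Python on an annoying-syntax statement is a vote
# that Python is the *higher-quality* of the two:
print(category_vote("This language has an annoying syntax",
                    "Perl", "Python", "quality"))
# ('Python', 'Perl')
```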
Some statements count in more than one category, and some have all zeros, meaning they don’t count for any of the three categories. These “neutral” statements are mostly descriptive, relatively objective statements like “This language is built on a small core of orthogonal features” or “This language has a strong static type system”. There are also a few statements that come in pairs where neither element of the pair is obviously good or bad, like “This language is best for very large/small projects”. 2
So what I did is I aggregated all the statements in each category, so that a vote for a language on any statement in that category counted as a vote in that category. That is, a vote that Language A is better than Language B on statement X counts as a vote that Language A is higher-quality than B if X is in the quality category; it counts as a vote that Language A is more popular than B if X is in the popularity category; and it counts as a vote that Language A is more versatile than B if X is in the versatility category. So, for example, if someone (brazenly) ranked Java higher than Cobol on the statement “This is a mainstream language”, it means they cast a vote for “Java is more popular than Cobol”, because that statement is in the Popularity category.
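A sketch of this aggregation step, using made-up rankings and a made-up membership set (each person's ranking of n languages yields a pairwise vote for every one of the n·(n−1)/2 pairs):

```python
from collections import defaultdict

# Hypothetical raw data: each entry is (statement, one person's ranking,
# best-described first). The category membership set is also hypothetical.
POPULARITY_STATEMENTS = {
    "This is a mainstream language",
    "I use this language regularly",
}

rankings = [
    ("This is a mainstream language", ["Java", "Python", "Cobol"]),
    ("I use this language regularly", ["Python", "Java"]),
]

# wins[(a, b)] = number of votes saying a is more popular than b
wins = defaultdict(int)
for statement, ranked in rankings:
    if statement not in POPULARITY_STATEMENTS:
        continue
    # every language gets a pairwise win over every language ranked below it
    for i, a in enumerate(ranked):
        for b in ranked[i + 1:]:
            wins[(a, b)] += 1

print(wins[("Java", "Cobol")])  # 1 -- the brazen Java-over-Cobol vote
```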
Then, for each category, I aggregated the rank information in two different ways.
The first and simpler way is I just averaged the ranks each language received in each category. So if a language got ranked first place once and second place once, and that’s it, its average rank would be 1.5.
This way of aggregating the ranks is pretty crude, though. It’s not indefensible, but it’s questionable. If someone ranked only Python and Perl and put Python first, that counts as a 1 for Python. If someone ranked Python first in a list of ten languages, that still counts as a 1 for Python, even though in the first case it beat only one other language and in the second case it beat nine others. Also, languages that were ranked less often can have more unstable averages. If a language was only ranked ten times, and was ranked first every time, it will have an average rank of 1, while a language that was ranked first 100 times out of 1000 total rankings will have an average rank worse than first place.
I have a little discussion below about how this metric penalizes unpopular languages, which makes it theoretically problematic. However, this way of doing it is certainly simple. It essentially answers the question, “On average, if someone ranks all the languages they know in terms of quality/popularity/versatility, how far down the list is language X?”
The other way of doing it is the one described by David MacIver, the creator of The Right Tool, in his blog post on the matter. This is the aggregation method used in The Right Tool. It’s a pretty interesting method, which he adapted from research on rank aggregation.
What you do is you make a Markov chain where each state is a language, and the state transition probabilities are calculated by looking at how often each language is ranked above each other. If you don’t know what a Markov chain is, you can just think of it like this: imagine you have a deck of cards where each card has the name of a programming language. You shuffle the deck and take the first card, which starts you off at a random language. Then you draw the next card, which shows a new language. If most people who ranked both languages thought the new language was better, you discard your old one and keep the new one; otherwise, you discard the new one and keep the one you already had. You just keep doing this over and over. 3
Then, to calculate the ranking, you simulate doing this a gazillion times, and look at what proportion of the time you spent holding each card. In general, languages that won more votes will be cards you hold longer, because you’re more likely to switch to them from other cards, and less likely to switch from them to other cards.
As MacIver describes, his method (which I replicated) is slightly different from this, in that you don’t always switch to the new language just because more people liked it. Rather, you have a chance of switching, with greater probability the greater the disparity was. So suppose you’re holding a card with Language A on it, and you draw Language B. If 25% of people who ranked them both thought A was better, and the other 75% thought B was better, then with 25% probability you keep A, and with 75% probability you switch to B. 4
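The card-drawing game can be simulated directly. This is a sketch of the idea with made-up pairwise vote counts for three languages, not MacIver's actual implementation (which computes the stationary distribution analytically rather than by simulation):

```python
import random

# wins[a][b]: hypothetical count of votes ranking a above b
wins = {
    "Python":  {"Perl": 75, "Haskell": 40},
    "Perl":    {"Python": 25, "Haskell": 30},
    "Haskell": {"Python": 60, "Perl": 70},
}
languages = list(wins)

def simulate(steps=100_000, seed=0):
    """Play the card game: draw a random language each turn and switch to it
    with probability equal to the fraction of voters who preferred it."""
    rng = random.Random(seed)
    held = rng.choice(languages)
    time_held = {lang: 0 for lang in languages}
    for _ in range(steps):
        time_held[held] += 1
        drawn = rng.choice(languages)
        if drawn == held:
            continue
        w, l = wins[drawn][held], wins[held][drawn]
        if rng.random() < w / (w + l):
            held = drawn
    total = sum(time_held.values())
    return {lang: t / total for lang, t in time_held.items()}

probs = simulate()
# Haskell beats both rivals head-to-head here, so we hold its card longest.
```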
So I calculated the rankings this way too.
I hasten to point out that the results of this should be taken with a grain of salt due to inherent imbalances in the data. When the statements are aggregated, each statement counts as much as any other statement. This means that features of the language which have more statements about them will be overrepresented, and features that don’t have many statements will be underrepresented. The statements were not designed to be evenly distributed across different features of languages, so the data may be skewed by what statements were available.
For instance, there is a statement (categorized under Quality) for “This language has an annoying syntax”, but there is no corresponding statement for “This language has an awesome syntax”. So a language that has annoying syntax could lose points for quality, while a language that has a great syntax has no way to gain points on quality for that particular feature.
This problem is especially likely to show up in the versatility category, because of the way it’s created by grouping together statements about how good a language is at various different tasks. The available statements don’t necessarily spread evenly over all possible task domains, though, and that means that the system may work against languages which are better in areas for which there are fewer statements. As an example, there are statements about a language’s suitability for “casual scripting”, “command-line apps”, and “programs for an embedded hardware platform”, but the first two are arguably more similar to each other than the last is to either. So the statements may cover the “casual scripting/command-line app” side of things more densely than the “embedded hardware” end of things. Basically, this would mean that languages that are good at “command-line app sort of stuff” might be able to get two points for that, while languages that are good at “embedded hardware sort of stuff” could only get one for that.
This isn’t to say the results are totally meaningless, but you have to keep in mind that they’re based on responses to an essentially arbitrary set of statements. Changing what statements are available could change how languages are ranked.
The X axis shows the “score”. You can click the radio buttons on the top to change which kind of score is displayed (average rank or Markov-chain probability). Languages further to the right are better. 5 For the average-rank metric, the axis just shows the average rank. For the Markov metric, the X axis shows the percentage of the time we spent “holding that card” in the system described above. (This is the probability of that language in the stationary distribution of the Markov chain.)
The Y axis just divides the points into the three categories of quality, popularity, and versatility. Vertical position within a single category is meaningless; the languages are just spread out vertically so they won’t overlap. So it doesn’t mean anything that O’Caml is above Clojure in the Quality row; since they’re both at about the same left-right position, they have nearly the same quality rank.
If you hover your mouse over a language, that language will be highlighted in all three categories, so you can compare its positions. You can click on a language to keep it highlighted, then click it again to unhighlight it.
You can check the “Show lines” checkbox at the top to show colored lines indicating “overratedness” versus “underratedness”, which I operationalized as just the difference between a language’s quality score and its popularity score. Languages that score higher on popularity than on quality are overrated (or over-used, you might say) and show up in red. Languages that score higher on quality than on popularity are underrated — not used as much as they “deserve to be” — and show up in green. The depth of the color indicates the magnitude of the over- or under-rating. Languages whose popularity and quality are roughly equal show up as white.
Also, if you click in a blank area of the graph, a little window pops up. Here you can see the whole list of languages, and select them by name if you can’t immediately find them on the graph. (Control-click to select multiple languages.)
What the results mean
For the average-rank model, the number of ratings a language received was pretty well correlated with its score in all three categories, although, as we might expect, the correlation was strongest for popularity. Languages that were voted on more often were considered more popular (Spearman’s rho=0.75), higher-quality (rho=0.61), and more versatile (rho=0.65). The three categories themselves are also highly correlated with each other (rho>0.86 for all three pairs). This seems to suggest that this metric doesn’t distinguish that well between the three categories.
Correlations among the categories
I’ll focus on the Markov-chain model, though, because it looks more robust and has a stronger statistical foundation. For this metric, the number of rankings a language received is strongly correlated with its popularity score — languages that were ranked by more people were also considered more popular (Spearman’s rho=0.89) — but there is only a weak correlation with quality (rho=0.35) and versatility (rho=0.25). Versatility and quality are highly correlated (rho=0.86), but neither is strongly correlated with popularity (versatility-popularity rho=0.46, quality-popularity rho=0.53).
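For anyone who wants to reproduce this kind of check, Spearman's rho is just the Pearson correlation of the ranks. Here's a self-contained sketch using invented scores (the real analysis presumably used a stats library; this version handles only distinct values, with no ties):

```python
def spearman_rho(xs, ys):
    """Spearman's rho for lists of distinct values (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # with distinct values: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical category scores for five languages, aligned by position:
quality     = [0.30, 0.25, 0.10, 0.05, 0.02]
versatility = [0.28, 0.22, 0.12, 0.06, 0.03]
popularity  = [0.05, 0.40, 0.30, 0.20, 0.01]

print(spearman_rho(quality, versatility))  # 1.0: identical orderings
print(spearman_rho(quality, popularity))   # 0.4: only partial agreement
```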
Looking at the graph (for the Markov metric), you can sort of see why this is. The range of popularity is much greater than the other two, and the growth is almost all at the top end. In other words, the languages are more or less evenly distributed in terms of quality and versatility, but when it comes to popularity, there is a “ruling class” of about half a dozen languages which are way more popular than the rest.
The somewhat depressing message is that just because a language is good and/or versatile doesn’t mean it will be popular. Rather, it would seem that there is a “this town isn’t big enough for the both of us” effect. You can be a great language, but it’s difficult to break away from the hoi polloi and into that elite club of widely used languages.
A similar story is told by the red/green overrated/underrated lines. Basically, all the popular languages are overrated and all the unpopular ones are underrated. Java, for instance, is a pretty good language according to this data, but its quality relative to other languages does not justify its popularity relative to other languages. Scala is again an intriguing case: it is apparently not overrated despite gaining popularity. Poor Haskell is regarded as the best language in quality, but is lost in the crowd when it comes to popularity. There are also a few languages at the bottom end, like AWK and Assembler, which people seem to regard as deservedly unpopular.
Of course, the most fun thing is to poke through the graph to find your favorite and least favorite languages and draw grandiose conclusions based on their positions. My favorite language, of course, is Python, and it’s interesting to note that this is basically the only language that is near the top in all three categories, never going lower than third place. Plus, it’s rated the most versatile language overall. Booyah.
It’s also satisfying to note Python’s definitive win in all categories over its arch-nemesis, Perl. This ties in, for instance, with PYPL’s observation of a long-term trend of increasing popularity for Python coupled with decreasing popularity for Perl. I would suspect that Perl might have ranked higher on some of these lists in earlier years, and it could be an example of a language that has suffered in the competition for the top slots. The Godot-like saga of waiting for Perl 6 has dragged on for over a decade, during which time Python has grown stronger, and apparently pushed Perl largely out of its niche.
Comparing the categories
Another way to look at the data is to scatterplot the scores in one category against the scores in another category. I made another little doodad that lets you do that:
At the top, you can just select the X and Y variables. You can compare different categories within the same metric, or you can compare across the two metrics. Doing the latter lets you see how the two metrics stack up against each other. As mentioned earlier, although the average-rank metric is conceptually easier to grasp, you can see that its scores on all three categories are correlated. Moreover, all three average-rank category scores are highly correlated with the Markov popularity score, but not the other Markov scores. This suggests that the average-rank metric doesn’t adequately separate the categories, giving popular languages extra weight even in the other categories.
It’s sort of easy to see why this is. People only ranked languages they knew. If you only know three languages, none of your responses will have a rank lower than three. But if you only know three languages, odds are good that they’re all popular languages, which means those popular languages will have their average rank improved (since you ranked them all third place or better). On the other hand, if you know 20 languages, odds are good at least a couple will be obscure and/or sucky languages, and those will be ranked low. So, basically, obscure languages get really walloped by the average-rank metric, because if they show up on a list at all, they probably show up at a low rank. Popular languages, on the other hand, show up on a lot of short lists, which means their rank is inflated.
By comparing categories within the Markov metric, you can get a visual sense of the correlations discussed above. The Versatility-Quality graph shows a pretty tight relationship; plotting either of those two against Popularity shows that the most popular languages span a wide range in terms of quality and versatility.
Philosophy of ranking systems
In another blog post, MacIver discusses a technique called local Kemeny optimization, which he uses to adjust the Markov-chain rankings. I found this an interesting technique, and it gets into some questions about voting methods that I think are quite profound, in the sense that they make you think about what you want a ranking to really capture about the items it ranks. So here I’m going to go into some of the technical details of ranking systems.
Let’s suppose we have a set of items that we’re trying to rank. There are two ways we might go about doing this, a global way and a local way. The global way is to come up with a single scale or metric, force all the items onto this scale, and then rank them by their position on that scale. The local way is to not look at all the items at once, but instead look at pairs of items. For each pair, we put the two items into a head-to-head matchup, and look at which one wins — that is, which one is the better of the two in whatever sense we’re trying to measure. In our overall ranking, we then try to make sure we rank the winning item of each head-to-head matchup higher than the loser.
Each of these ways makes our overall ranking “nice” in a certain sense. The global way makes our ranking transitive: because we have a single scale, we know that every item on the list is, in the absolute terms of that scale, “better” than every item lower on the list. The local way makes our ranking majoritarian: because we used majority-rules logic in the pairwise comparisons, we know (or we think we know — see below) that every item would win in a head-to-head matchup with items lower on the list.
Unfortunately, each method’s weakness is the other’s strength. The reason is that people’s preferences may not be transitive. In a group of 10 people, you may have 6 people who prefer apples to oranges, another (overlapping) group of 6 who prefer oranges to pears, and a third group of 6 who prefer pears to apples. This creates a “rock-paper-scissors”-style cycle: apples are better than oranges, and oranges are better than pears, but then pears are somehow better than apples again.
So the weakness of the global mechanism is that, although you can force all the items onto a single scale, in doing so you will forcibly break cycles, losing information about non-transitive preferences. The weakness of the local method is that, although you can see who wins each head-to-head matchup, you may not be able to put that information together into a transitive list. In the apples-oranges-pears example above, you can’t tell which fruit should be in first place, because each one lost and won an equal number of times. So the third item on your list might actually be “better” than the first item. (This is why I said “we think we know” above — it turns out we can’t actually ensure that every item wins a head-to-head matchup with every lower item.)
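The apples-oranges-pears cycle is easy to demonstrate with a few lines of code. The vote counts below are one hypothetical assignment of the three overlapping 6-voter majorities:

```python
# Pairwise preference counts among 10 voters (hypothetical overlap
# realizing the three 6-vs-4 majorities described above).
prefers = {
    ("apples", "oranges"): 6, ("oranges", "apples"): 4,
    ("oranges", "pears"): 6,  ("pears", "oranges"): 4,
    ("pears", "apples"): 6,   ("apples", "pears"): 4,
}

def beats(a, b):
    """True if a majority prefers a to b in a head-to-head matchup."""
    return prefers[(a, b)] > prefers[(b, a)]

# Each fruit wins one matchup and loses another -- a rock-paper-scissors
# cycle, so no transitive ranking can respect all three majorities.
print(beats("apples", "oranges"), beats("oranges", "pears"), beats("pears", "apples"))
# True True True
```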
Local Kemenization, as the name suggests, takes the local approach. Specifically, it satisfies the extended Condorcet criterion. Essentially, the extended Condorcet criterion says that if you can break your items into two groups, a “better” and “worse” group, then everything in the “better” group should be ranked above everything in the “worse” group. More technically, if you’re ranking a set of items, and you can partition that set into disjoint subsets A and B such that for every pair (a, b) — with a chosen from A and b chosen from B — a is ranked above b, then every item in A should be ranked above every item in B. (We say that such a ranking “is Condorcet”.)
This seems like a very reasonable criterion, and in a sense it is. However, there is a big “if” in the criterion. The criterion says that if you can split the group into a better and a worse group, you get to rank the better group above the worse group. It says nothing about what will happen if you can’t split it that way.
The way local Kemenization works to meet this criterion is it tweaks the ranking, moving items up or down to ensure that every item on the list beats the item right below it in a head-to-head matchup. You know that the first-place item beats second-place in a head-to-head matchup, and that the second-place beats the third-place, and so on. 6
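A minimal sketch of this procedure, following the insertion-based construction from the rank-aggregation literature (Dwork et al.): take the items in their initial order, and bubble each one up while a majority prefers it to the item directly above. The vote counts here are invented to reproduce the Java/C#/O'Caml cycle discussed below:

```python
def locally_kemenize(ranking, beats):
    """Local Kemenization sketch: insert items in their initial order,
    bubbling each up while a majority prefers it to the item above.
    The result has no adjacent pair a majority would swap -- but cycles
    among non-adjacent items can survive."""
    order = []
    for item in ranking:
        order.append(item)
        i = len(order) - 1
        while i > 0 and beats(order[i], order[i - 1]):
            order[i], order[i - 1] = order[i - 1], order[i]
            i -= 1
    return order

# Hypothetical pairwise vote counts forming a cycle:
# Java beats C#, C# beats OCaml, but OCaml beats Java.
votes = {
    ("Java", "C#"): 60, ("C#", "Java"): 40,
    ("C#", "OCaml"): 55, ("OCaml", "C#"): 45,
    ("OCaml", "Java"): 70, ("Java", "OCaml"): 30,
}

def beats(a, b):
    return votes[(a, b)] > votes[(b, a)]

print(locally_kemenize(["C#", "Java", "OCaml"], beats))
# ['Java', 'C#', 'OCaml'] -- every adjacent pair is in majority order,
# yet OCaml still beats Java head-to-head
```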
This is all well and good, but tweaking the rankings in this way doesn’t escape the fact that there may still be cycles where A beats B beats C beats A. You know that the first-place item beats the second-place item, and that the second-place beats the third-place, but it’s possible that the third-place item actually beats the first-place. In other words, the ranking may not be transitive.
This is in fact the case in the programming-language rankings on The Right Tool. For instance, in the data set I have, if we look at votes on statements in the Quality category, Java was ranked above C# more than vice versa, and C# was ranked above O’Caml more than vice versa, but O’Caml was ranked above Java more than vice versa. That is, Java is higher-quality than C#, which is higher-quality than O’Caml, which is higher-quality than Java.
These loops are a genuine feature of the intransitivity of people’s preferences, but to my mind it’s not that useful to attempt to represent them on an ordered list, because people just don’t interpret ordered lists that way. If you see a list with Java in first place and C in second place, you don’t just think that means Java beats C. You think it means Java beats every other item on the list. This may not be true in a “majority rules” sense, but it’s just how people interpret ranked lists.
To put it a bit more strongly, I’d even say that the list does mean that A beats every other item on the list, simply because that’s what ranked lists mean. You may create the list based on other sorts of information, but when someone looks at a ranked list, the list doesn’t represent what beats what, it defines what beats what for the viewer. No one looks at a ranked list and thinks, “Hmmm, Java’s ranked 5th and O’Caml is 8th, so that might mean O’Caml actually beats Java in a pairwise matchup.” They just look at it and think “Oh, Java beats O’Caml.”
Because of that, my own perspective is that, for rank aggregation, it’s less important to satisfy something like the Condorcet criterion than to simply have a clearly-defined metric on which the items are scored. For instance, in the graphs above, if you look at the Markov-chain metric, you’re seeing the actual proportion of the time that would be spent “holding that card” in a card-drawing game like I described above. That may or may not translate into winning pairwise matchups, but who cares? At least you know what it’s measuring.
So that’s why I didn’t do local Kemenization. Local Kemenization starts with some ranking — for instance, you can start with the Markov-probability ranking I used — and then tweaks it in the way I described. But the tweaks may disrupt the original ranking. For instance, if you compute the popularity of languages with a Markov chain, you get a total ordering of the languages, top to bottom. But it’s possible that this ordering doesn’t meet the Condorcet criterion. If you force it to be Condorcet by using local Kemenization, you may move items “out of order” with respect to the Markov chain. For instance, you might find that, on the Markov model, you spend more time holding the Java card than the O’Caml card, even though a majority prefers O’Caml to Java. (This could happen, for instance, if Java beats most other languages, but not O’Caml.)
The upshot of that is that there’s no way to meaningfully graph the local Kemenization of a list. All you can do is list it. You can’t say “how much better” the first-place language is than the third-place language, because it might actually be worse. Part of what I wanted to do with the graphs in this post is show the scores of all languages relative to all other languages, and that notion has no real meaning for a locally Kemenized list (or, more generally, a Condorcet-based ranking) if there are preference cycles.
The right tool for the job, or the right tool for the worker?
In an odd way, things like The Right Tool make me realize why I may always be a dilettante as a programmer. The philosophy of having “the right tool for the job” is prevalent in programming communities, and many times I’ve seen people suggest that someone should use a different programming language to accomplish some task. I can understand this in some extreme cases, but my own feeling is that the choice of language itself is, or ought to be, really more philosophical. You might pick a library within a particular language to do a particular task, but ideally (for me), every language would be able to do pretty much everything, and you’d just pick the one whose overall way of doing things you liked best.
Because of this, people don’t always use the “right” tool for the job so much as just the tool to which people have assigned that job. It’s not like PHP has some special powers that make it ideal for writing blog and forum apps; it’s just the language you write blog and forum apps in, because people are already in the habit of writing them in PHP.
Although I understand this, I still think it’s unfortunate, at least for programming weenies like me who can’t or won’t just learn a new language whenever the need arises. It would be nice if instead of picking the right tool for the job, you could pick the right tool for you, and be confident the tool could handle most jobs.
Epilogue: Tools I used to do this
After getting the data from The Right Tool, I used the world’s most versatile language to organize it. Remember what language that was? Python. Oh yeah baby. In particular I made heavy use of the excellent pandas library for dealing with tabular data. This enabled me to aggregate the individual rankings into totals for each category of statement, and then make them into big matrices showing how each language compared against every other language.
Of course “only” a couple hours isn’t exactly great for making a garden-variety scatterplot. My impression is that D3 is more useful the more complicated and unusual your visualization is. There is a fair amount of set-up work to do for simple stuff like displaying the axes and arranging the coordinate systems, and for something simple like a scatterplot, that wound up being most of the work. Once I have the data in the right form, if I just wanted a scatterplot, it’d be three ready-made lines of Python (that is, just plugging the names of the data fields into a function call), versus maybe 50 lines of have-to-think-about-it D3. However, for an interactive plot like the three-bar plot, being able to leverage the built-in UI infrastructure of HTML/CSS begins to make it worth the effort.
I would hope that, in time, people will develop additional tools for use with D3, providing ready-made solutions for standard plot types. In Python with matplotlib, doing something like a scatterplot or a boxplot is ridiculously easy, because there are just functions called “scatter” and “boxplot” that you can call, and they handle all the stuff like deciding what the axes’ limits will be. D3 has nothing like that, so to do even a simple plot, you have to manually create the plot background, the axes, their ranges, and their labels. To make the axis labels that say “Quality” and the like on the scatterplot, I had to just add separate SVG text elements that are totally unconnected to the axis objects; D3 axes don’t provide a way to specify an axis label at all. But it seems like it would be possible to create libraries that add one-step solutions for simple plots in D3. Hopefully that’ll happen as D3 gains in popularity.
- Of course, even if you say an unequivocally good statement applies to a language, it doesn’t mean you think that language is flawless. You could still think it has problems, but those thoughts aren’t expressed by a wholly positive statement. [↩]
- There are two statements for “There are many good open-source/commercial tools for this language”. I left these out of all three categories because there is already another statement for “There are many good tools for this language”, which subsumes the others. [↩]
- This isn’t how a “normal” Markov chain works, in that for a normal Markov chain you consider all other states at once, and choose among all of them on a probabilistic basis, instead of just choosing between your current state and one other state. However, the procedure I describe here does describe a Markov chain; it’s just one where the state transition probabilities aren’t exactly the same as the probability of switching the card you’re holding for a particular other card. The procedure I describe here is the one used by MacIver, who borrowed it from a research paper on rank aggregation methods. [↩]
- I didn’t include his other tweak, which involves slightly nudging the votes to smooth out small uncertainties. However, this tweak is less important for my application since I’m aggregating results over many statements. Essentially, the tweak is a sort of smoothing that helps in cases where only a tiny number of people voted on a particular combination of languages for a particular statement. This is needed because there are some statement-language combinations with very few votes. However, because my categories include many statements, there aren’t any category-language combinations with very few votes, so the smoothing would have negligible effect. [↩]
- To make things consistent, for average rank, the axis is numerically reversed so that better (i.e., lower-numbered) ranks are toward the right. [↩]
- Except in the case of an exact tie where the two had equal numbers of votes in the pairwise comparison. [↩]