I envy economists. Unlike computer scientists, they seem to be able to publish best-selling books based on innovative research. One such book is Piketty’s Capital, reminiscent of Marx’s Capital in its scope. If you haven’t heard about the book yet, it has a simple message: the yield on capital is higher than wage growth, which means that those with capital are bound to get richer and more powerful. The bulk of the population is doomed. A small elite will soon collect all the wealth, leaving none for the regular folks. This observation is hardly original… the idea of wealth concentration even has a catchphrase: the rich get richer and the poor get poorer.

Anyone could make similar claims. But such claims are not easy to prove, and some economists have even found evidence to the contrary (Kopczuk and Schrager): a big inheritance is less likely to land you on the list of the richest people today than it was in 1970 (see Edlund and Kopczuk, Kopczuk et al., and Kopczuk and Saez for further research in this direction).

What is remarkable about Piketty’s work is that he backed it with comprehensive data and thorough analysis. Unfortunately, like too many people, Piketty used spreadsheets instead of writing sane software. On the plus side, he published his code… on the negative side, it appears that Piketty’s code contains mistakes, fudging and other problems.

In his spreadsheets, however, there are transcription errors from the original sources and incorrect formulas. (…), once the FT cleaned up and simplified the data, the European numbers do not show any tendency towards rising wealth inequality after 1970. An independent specialist in measuring inequality shared the FT’s concerns. (Financial Times, May 23rd 2014)

The economics profession has been considerably more forgiving than the FT of the constructed data in Prof Piketty’s datasets. These include the entire 90-year period between 1870 and 1960 when there is no source material for the top 10 per cent wealth share in the US. For those years Prof Piketty simply adds 36 percentage points to his estimate of the top 1 per cent wealth share. (…) For France, a similar point is worth noting. Prof Piketty cites his 2006 paper in the American Economic Review as the original source of the French numbers. Footnote 32 of that paper raises doubts about the ability to come up with plausible 20th century numbers for wealth concentration in France. It says: “The only wealth-of-the-living concentration estimates we provide for the twentieth century are national estimates for 1947 and 1994.” There is no documentation in his book on how he overcomes the problem that stumped him and fellow researchers in the 2006 paper. (Financial Times, May 28th 2014)

What we therefore seem to have (…) is less a synthesis of other data sets, or even an update of their numbers using newer data, than a sloppy muddying of the water around them (…) then the projection of artificial trend lines upon them under the guise of “smoothing” and recalibration. (…) [the] product was put into print by a major academic press despite serious issues with the rigor of its analysis. (Magness)

I am not surprised. Last year, I reviewed the Reinhart-Rogoff case. Two famous economists had concluded that high debt led to slow economic growth based on an extensive statistical analysis of historical data. Unfortunately, they also used a spreadsheet.

Simply put, spreadsheets are good for quick and dirty work, but they are not designed for serious and reliable work.

  • All professional software should contain extensive tests… how do you know that your functions do what you think they do if you do not test them? Yet spreadsheets do not allow testing.
  • Spreadsheets make code review difficult. The code is hidden away in dozens if not hundreds of little cells… If you are not reviewing your code carefully… and if you make it difficult for others to review it, how do you expect it to be reliable?
  • Spreadsheets encourage copy-and-paste programming and ad hoc fudging. It is much harder to review, test and maintain such code.
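The first point is easy to demonstrate in a conventional language. A minimal sketch, with a hypothetical analysis function (not Piketty’s actual computation): write the formula once, then pin it down with tests:

```python
# A hypothetical analysis step written as a plain, testable function.
def wealth_share(top_wealth, total_wealth):
    """Return the share of total wealth held by a top group."""
    if total_wealth <= 0:
        raise ValueError("total wealth must be positive")
    return top_wealth / total_wealth

# Tests pin the function down with known inputs and known outputs;
# a formula hidden away in a cell has no equivalent safety net.
assert wealth_share(30.0, 100.0) == 0.3
assert abs(wealth_share(1.0, 3.0) - 1 / 3) < 1e-12
```

None of this is exotic; it is the baseline discipline that spreadsheets make awkward.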

I will happily use a spreadsheet to estimate the grades of my students, my retirement savings, or how much tax I paid last year… but I will not use Microsoft Excel to run a bank or to compute the trajectory of the space shuttle. Spreadsheets are convenient but error prone. They are at their best when errors are of little consequence or when problems are simple. It looks to me like Piketty was doing complicated work and betting his career on the accuracy of his results.

Like Reinhart and Rogoff, Piketty does not deny that there are problems, but (like Reinhart and Rogoff) he claims that his mistakes are probably inconsequential…

In his first answer to his critics, Piketty puts the burden of proof on them: “Of course, if the Financial Times produces statistics and wealth rankings showing the opposite, I would be very interested to see these statistics, and I would be happy to change my conclusion!”

He is missing the point. It is not enough to get the right answer… Would you be happy to fly in a plane run by buggy software… given that the programmer insists that the bugs are probably inconsequential? Sure, the plane might land safely… but how much is due to luck?

He is also missing the point when he insists on the fact that he did publish his code. Yes, it does indicate that his analysis was probably done in good faith… but, no, it does not tell us that it is correct.

We all make mistakes. When you ship software, be it a spreadsheet or an app, it will contain bugs of some sort… You cannot help it. But you can certainly do a lot of work to prevent, catch and limit bugs. I program every day. At least half the time I spend programming has to do with bug hunting. How much effort did Piketty, or Reinhart and Rogoff, put into checking the accuracy of their analysis? The fact that they used spreadsheets suggests that they spent very little time worrying about accuracy. Sure, a good enough surgeon can probably take out a tumor with a kitchen knife while wearing a tuxedo… but how much confidence do you have that the job will be done right? If you are going to write a 600-page book based on data crunching, you probably ought to spend months reviewing, testing and documenting your analysis.

It will be interesting to see how this affair impacts Piketty. I heard a dozen different economists last week state that he was bound to get a Nobel prize… will he still get it? I suppose the answer depends on how flawed his analysis is… and how much the quality of the analysis matters to his peers.

It is said that Piketty used an enormous amount of data, for an economist. I think that economists will have to cope with very complex and abundant data. They will need to do ever more complicated analysis. I hope that they will learn to use better tools and techniques.

Update: Richard Tol is another economist who was recently caught doing careless data analysis. He also used Excel.

Update 2: I do not know how common it is for economists to use Excel. It might not be so common. Sergio Salgado, an economist, wrote that “any serious statistical analysis is done either in STATA, SAS, or even R or FORTRAN. Nobody uses Excel.” Representatives from the SAS corporation, who sell statistical software, pointed out to me that they offer free training to academics.

Update 3: An anonymous reader sent me this email:

I work in a large company and I can’t help but notice the way the business team uses Excel for everything. There are times where emergency meetings are called because the numbers don’t add up. Sometimes the issue is a single cell among 60,000 containing a typo in the formula (a missing dollar sign).

57 Comments

  1. I worked as a DBA at a financial company for years and I learned the exact opposite: always use spreadsheets for financial work, unless you have large tables or several tables to connect; then use databases. However, you must use them correctly: each column contains one more step of the calculation, so it can be audited by hand for chosen records with a calculator.

    Spreadsheets are designed for financial documents and they work very well for that.

    Comment by Sam Bowne — 23/5/2014 @ 22:05

  2. @Sam

    What Piketty was doing was merging lots of different data sources, interpolating missing data, and so on. Very complex number crunching that spreadsheets were not designed for.
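    For what it is worth, that kind of pipeline stays reviewable as a short script. A hypothetical sketch with pandas (the numbers and column names are invented for illustration):

```python
import pandas as pd

# Two hypothetical sources covering different years.
estate = pd.DataFrame({"year": [1950, 1960, 1980], "top1": [33.0, 31.0, 25.0]})
survey = pd.DataFrame({"year": [1960, 1970, 1980], "top1_alt": [30.0, 28.0, 24.0]})

# Merge on year, keeping every year from either source.
merged = pd.merge(estate, survey, on="year", how="outer").sort_values("year")

# Fill the gap in the estate series by linear interpolation;
# every step is explicit, repeatable, and reviewable.
merged["top1"] = merged["top1"].interpolate()
print(merged)
```

Each transformation is named and ordered; in a spreadsheet, the same interpolation would be scattered across cells.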

    Comment by Daniel Lemire — 23/5/2014 @ 23:24

  3. If Barack Obama can get a Nobel prize without actually having done anything peaceful up to that time, I suspect that the “Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel” committee might not care about the actual correctness.

    Comment by John DOugan — 23/5/2014 @ 23:52

  4. You are very, very naive if you think these mistakes are not on purpose. Socialism sells in leftist media (also known as all the media minus neocon FOX), so he did what he needed to do to become the next hero of the MSM.
    Of course, it is possible that he is just a useful idiot.

    Comment by Ivan — 24/5/2014 @ 9:00

  5. I completely agree with you, of course.

    One small point in defense of Piketty: I am actually reading his book, and although I am not an economist, I find it very readable and full of interesting points of view. One thing that Piketty says at the very beginning is that what matters is the trend, not the precise values, because in effect there is no way to get precise values. So it is all a matter of how large the statistical error is. The first chapter contains a long critical analysis of what it means to collect these data, and a computer scientist may find appalling the number of approximations necessary to get anywhere in this kind of study.

    I do not know what he did wrong, and surely he should have used more professional statistical tools for analysing the data. At the same time, it is vain to think that it would be possible to find the “truth” in studies like these. Economics is still more closely related to politics and sociology than it is to mathematics.

    Comment by Giuseppe Lipari — 24/5/2014 @ 9:23

  6. If spreadsheets are used, Visual Basic macros in Excel can be used to automate the process. This gives the best of both worlds: viewing results is easy due to the spreadsheet display, but the code is still centralized and can be verified more easily.

    Comment by Frédéric Godin — 24/5/2014 @ 10:07

  7. I’ve worked on efforts to make spreadsheet-like systems that were smarter (worked at a higher level, so less prone to stupid manual errors). I used the Reinhart-Rogoff example in a grant application. So I’m well aware of the problem, but I think you aren’t right to yell at Piketty for not writing his own software; the blame is on software makers for making crap tools that lead to mistakes. Why should we expect an economist to also be a professional-grade programmer?

    Comment by mtraven — 24/5/2014 @ 14:29

  8. First, I am inclined to cut Piketty a lot more slack: he claims to have put all his data in public, and welcomed review.

    Just by itself, that is a very positive sign.

    (Bought his book, but not yet read. He got a highly positive review from a mind I trust.)

    Second, he is trying to make sense of a rather motley collection of data. No alternative, in this case.

    Last, there are a lot of folk using spreadsheets, with all their weakness, for lack of a better alternative.

    (The story of Lotus Improv is relevant. It should exist, but does not.)

    Comment by Preston L. Bannister — 24/5/2014 @ 22:27

  9. I wonder how much can be fixed by changing the UI. Some companies such as infotron.nl and useslate.com are trying to visualize the information flow and bugs in spreadsheets.

    Comment by Henk Poley — 26/5/2014 @ 0:51

  10. @Lipari

    One thing that Piketty says at the very beginning is that what matters is the trend, not the precise values, because in effect there is no way to get precise values. So it is all a matter of how large the statistical error is. (…) At the same time, it is vain to think that it would be possible to find the “truth” in studies like these. Economics is still more closely related to politics and sociology than it is to mathematics.

    It is not uncommon for scientists to have to deal with high margins of errors. They do end up, sometimes, concluding that the results are inconclusive. This is not what Piketty did… he found that the data was conclusive so he felt that the margin of error was sufficiently small.

    Comment by Daniel Lemire — 26/5/2014 @ 7:13

  11. It may not be his conclusions that are incorrect; it is Piketty’s approach to having errors pointed out to him that is. When Andrew Wiles was shown that there was an error/omission in his proof of Fermat’s Last Theorem, did he tell his detractors, “You show me otherwise”? No, he cracked on and corrected the proof.

    Piketty would have shown more judgement and rigour by reviewing the data and its conclusions in light of the issues highlighted by the FT (and others), seeing how they materially affected the conclusions, re-publishing if necessary, and taking the concept of peer review with the humility that it deserves. Had he done that, there would be no doubts about potential Nobel-worthiness.

    Comment by Len Pannett — 26/5/2014 @ 7:42

  12. @Pannett

    looking to see how it materially affected the conclusions and re-publishing if necessary and taking the concept of peer-review with the humility that it deserves

    It is wise advice but… when you publish a 600-page book out of original research… you are in a tough spot if mistakes are found. He simply cannot revise the book in a timely manner. He is stuck with it as it stands now for many years.

    Of course, he chose to publish a long book based on original research…

    Comment by Daniel Lemire — 26/5/2014 @ 8:23

  13. In particular, one should not use Excel for any computations, since it cannot correctly calculate a standard deviation:

    http://www.npl.co.uk/mathematics-scientific-computing/numerical-computation/research/testing-the-numerical-correctness-of-scientific-software
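    The numerical issue is catastrophic cancellation in the naive one-pass formula that older Excel versions used for their standard-deviation functions. A small Python illustration with made-up numbers:

```python
import statistics

# Naive one-pass variance: (sum of squares - n * mean^2) / (n - 1).
# This "calculator" formula cancels catastrophically when the mean is
# large relative to the spread of the data.
def naive_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)

data = [1e9 + x for x in (1.0, 2.0, 3.0)]  # true variance is exactly 1.0
print(naive_variance(data))        # wrong: the answer is lost to rounding
print(statistics.variance(data))   # stable computation recovers 1.0
```

The stable two-pass approach (subtract the mean first, then square) avoids the cancellation entirely.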

    Comment by JJG — 27/5/2014 @ 7:29

  14. Had a customer last week doing biotech research with Excel, processing data logs in the gigabytes, because of institute policy. She knows R, Python, Matlab and some crude C, but because “the environment is very international”, they settled for Excel.

    Comment by SomeDude — 27/5/2014 @ 10:33

  15. Nice article, but:

    In addition to numbers, natural language (here, English) has its challenges, too. Example: “Two famous economists had concluded that high debt lead to slow economic growth based on an extensive statistical analysis of historical data.” reads better as “Two famous economists had concluded that high debt led to slow economic growth based on an extensive statistical analysis of historical data.”
    TL;DR “lead” should be “led”

    Comment by Martin Haeberli — 27/5/2014 @ 11:13

  16. Devil’s advocate:

    When I’ve had to deal with “sane software” rather than spreadsheets, in 100% of cases when I’ve dug up the research paper with the original algorithm, the software had at least one elementary transcription bug. As you rightly point out, it’s not about being bug-free (which is virtually impossible), but whether spreadsheets as a tool are better or worse at this than other kinds of languages.

    Spreadsheets make testing hard, make code review difficult, and encourage copy/paste coding. Apparently normal programming languages are just the same, because I can go on Github and find that *most* source code, across all languages, is not tested, has not been reviewed, and is full of copy/paste sections, too.

    What is your threshold? For example, how much copy/paste does a language need to exhibit, in practice, before it’s bad? If we find that (hypothetically) Python users copy/paste as much as spreadsheet users, should we stop trusting results from Python programs? (Maybe the problem is not the tool but the economists writing software, and so telling them to stop using spreadsheets means they’ll just carry over their bad habits to other languages.)

    Just as you put the burden on spreadsheet users to demonstrate that their processes are as good as other programming languages, why isn’t the burden also on you to show that it’s worse? *Is* it worse? Is it bigger than the difference in robustness between, say, C++ and Lisp?

    I share a gut feeling that spreadsheets are terrible, but I don’t see this as an indictment of them. You’ve shown me nothing here I don’t also see in every other programming language ever.

    Comment by Foo — 27/5/2014 @ 12:22

  17. I came to this page thinking it was a joke, but you’ve made an excellent case! And if I compare an Excel spreadsheet with a Matlab source file, one big difference jumps out: the Matlab file has a clear linear progression from beginning to end. You simply cannot trust a human being to effectively review a two-dimensional array of equations with all kinds of random linkages.

    Comment by Paul Dulaney — 27/5/2014 @ 12:31

  18. @Martin

    Thanks.

    Comment by Daniel Lemire — 27/5/2014 @ 12:48

  19. @Foo

    Many Python or R programs are not tested or reviewed and rely extensively on copied code. And that is perfectly fine. Nobody is going to care if you get it wrong.

    However, if you write a 600-page book and believe you are going to be up for the Nobel prize… then the bar is higher… Your code should be tested, documented and reviewed.

    Excel does not allow any of these things.

    Comment by Daniel Lemire — 27/5/2014 @ 13:15

  20. I’ve made errors using econometric software and writing my own programs. It’s not spreadsheets per se that lead to errors. And programming is harder for the casual user (like an economist) than it was in the days of BASIC. Those of us who use spreadsheets where they are appropriate get comfortable with the tool and then push it into areas it probably shouldn’t be pushed, but that’s easier than struggling with a rarely-used programming language.

    Back in the 1980s I read something about Lotus 1-2-3: “Every complex spreadsheet has an error.”

    With that in mind, a spreadsheet creator can add lots of internal validity checks. I also pull out a calculator to hand-check some random results. I try not to copy data, but instead to link to source data. These techniques are not foolproof, but they help a lot.
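    In code, those internal validity checks become explicit assertions; a minimal sketch with hypothetical numbers:

```python
# Hypothetical rows with a derived "total" column, checked the way one
# would hand-check random spreadsheet cells with a calculator.
rows = [
    {"region": "A", "q1": 10.0, "q2": 12.0, "total": 22.0},
    {"region": "B", "q1": 7.0, "q2": 5.0, "total": 12.0},
]

for row in rows:
    # Cross-check the derived column against its parts.
    assert abs(row["q1"] + row["q2"] - row["total"]) < 1e-9, row
    # Range check: a negative quarterly figure would signal a data error.
    assert row["q1"] >= 0 and row["q2"] >= 0, row
print("all checks passed")
```

The difference from a spreadsheet is that the checks run every time, on every row, not just the ones you remember to spot-check.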

    Comment by Bill Conerly — 27/5/2014 @ 17:28

  21. …shocking how much investment banks abuse Excel, with many desks running intra-day risk off of spreadsheets, and PC kludging together recs and analysis… it’s a much-abused tool.

    Comment by D — 27/5/2014 @ 17:29

  22. A few points from a financial market economist with some experience in R:

    You use the tool that everybody else uses; otherwise ideas cannot be communicated effectively. If he had published his work in R, the vast majority of economists, including the reviewers at the FT, would not have been able to review his work.

    Excel is very useful for combining different data sets of different frequency and often irregular frequency. This would be a nightmare in R.

    Generally, basic data manipulation, which is actually the bulk of most interesting work in economics, can be done in Excel. Other tools, such as R or EViews, are used when more advanced statistical techniques are needed.

    Comment by Ashley — 27/5/2014 @ 19:48

  23. @Foo:

    Copying and pasting may be done with other programming languages, but spreadsheets practically *demand* copying and pasting just to get formulas into the cells. Worse, as there are more and more cells, more and more copying and pasting has to be done, far more than would be needed with an actual programming language. Furthermore, when writing actual program code, one can see where one is starting to repeat oneself via copy-and-paste. If one catches that repetition early, then it’s not too difficult to refactor by replacing the copy-paste code with a suitable function that contains the code that had originally been copied and pasted. You can’t do that in a spreadsheet.
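    A hypothetical before/after of that refactoring, with invented figures:

```python
# Copy-paste style: the same adjustment repeated with small variations,
# much as it would be repeated across spreadsheet cells.
gdp_2012_adj = (1520.0 - 40.0) * 1.02
gdp_2013_adj = (1584.0 - 42.0) * 1.02
gdp_2014_adj = (1610.0 - 45.0) * 1.02

# Refactored: the formula lives in one reviewable, testable place.
def adjust(gdp, subsidy, deflator=1.02):
    """Apply one hypothetical adjustment; fix it once, fix it everywhere."""
    return (gdp - subsidy) * deflator

assert adjust(1520.0, 40.0) == gdp_2012_adj
assert adjust(1584.0, 42.0) == gdp_2013_adj
```

If the deflator turns out to be wrong, the function is corrected in one line; the copy-paste version must be hunted down cell by cell.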

    Comment by J. J. Ramsey — 27/5/2014 @ 20:10

  24. ‘In his spreadsheets … there are transcription errors from the original sources & incorrect formulas.’

    and ‘code’ magically prevents this?

    as for testing, there’s nothing in Excel that prevents you from testing a formula. in either case (spreadsheet vs code) you have to understand your formulas and know what results to expect.

    for anyone but an experienced and highly skilled coder, writing code is as prone to error as using a spreadsheet, maybe more so. for manipulating tables of data, a spreadsheet is easier for the average person than writing code by several orders of magnitude. why would i write a GUI to manipulate tabular data when someone’s already done it? why would i learn how to code complex formulas, rounding rules, display rules for various number formats, etc etc when someone’s already done it – and it’s been tested over a couple of decades by millions of users?

    Comment by sneedy — 27/5/2014 @ 20:14

  25. Granted the spreadsheet in question is huge, but I wonder if http://paulhammant.com/2012/08/10/its-time-for-a-new-take-on-spreadsheets/ paints a better foundation for such work.

    Comment by Paul Hammant — 27/5/2014 @ 21:14

  26. I doubt it would have made much of a difference in this case. Even with other kinds of software, users have a distressing habit of introducing bugs that lead to results favoring their hypotheses and removing bugs that lead to results contradicting their hypotheses.

    Although it’s nice that people have tracked down specific errors in Piketty’s numerical analysis, I don’t think those are even the worst part; even his basic assumptions don’t make much sense.

    Comment by Tom — 27/5/2014 @ 21:29

  27. “Like Reinhart and Rogoff, Piketty does not deny that there are problems, but (like Reinhart and Rogoff) he claims that his mistakes are probably inconsequential…”

    Well, he’s probably right, and insofar as Reinhart and Rogoff were referring only to their spreadsheet errors (and not e.g. their glaring causality attribution error), they were right too. But of course such errors easily could be consequential and I agree that “spreadsheet science” is a problem. There’s another good article on the subject here: http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction/

    Comment by phayes — 27/5/2014 @ 21:50

  28. Of course, it’s possible to write crappy code in any language. The point is rather that some environments make it easier than others to write robust, maintainable code. Excel isn’t one of them.

    Another issue I’ve encountered with Excel (and other environments that generally combine data, program, and GUI in one monolithic file) is that it’s very difficult for multiple developers to work simultaneously, to build reusable components, and to use a source code control system. Anyone who has worked professionally in software development would recognize these as serious limitations for all but trivial projects.

    Of course, one can code in VBA within Excel, but VBA is not a very efficient language for software development. This probably isn’t apparent to folks who have worked only in VBA. There are half a dozen languages I’d rather use…

    Comment by Rich Plevin — 28/5/2014 @ 1:07

  29. This is a peer-review problem, not a spreadsheet problem. Economists don’t have the same tradition of peer review as scientists.

    Comment by Mark Pawelek — 28/5/2014 @ 1:19

  30. I work in research. With any kind of modeling or statistics you need a statistician. There is a proper process for predictive modeling and trends; it’s not something you just whip up yourself as an economist, regardless of what software you use.

    Comment by Vad — 28/5/2014 @ 2:16

  31. @Daniel

    You call for better tools and for [better] techniques. But are they flawed in equal proportion? You’ve left that point ambiguous, yet I think it’s an important distinction. Curious to know your view.

    Spreadsheets represent a resource tradeoff. To borrow from @D’s comment above, as I write this I am “running intra-day risk off of spreadsheets”. This is because I cannot afford a better solution if I want to keep my costs low enough to deliver competitive investment returns to my clients. The big banks don’t have that excuse at the institutional level, but as the “London Whale” episode illustrates, even JPM was making economies on risk management. Speaking from experience that is entirely commonplace.

    Academic researchers probably have relatively less demanding deadlines but we all have families and time is precious. Tradeoffs will be made. So if this is an optimization issue, are we first to solve for the tool or for the technique(s)?

    Comment by Indig-Gestion — 28/5/2014 @ 3:45

  32. Use spreadsheets to get all the calculations correct.

    Then move it to a real programming language to make a program that you can put in an automated production line, that is automatically tested, that integrates well, is fast, and that nobody will mess up by accidentally changing some random cell.

    That’s why accounting isn’t done in Excel, for example.

    Comment by Björn — 28/5/2014 @ 5:25

  33. @John DOugan
    The Nobel Peace Prize is decided on by Norwegian politicians. The other Nobel prizes are decided by scientists. You can’t really compare the two.

    Comment by Martin — 28/5/2014 @ 5:42

  34. Having been an IT guy for a long time now, the one biggest change I would like to make is to make spreadsheets illegal except for specific circumstances. What I’ve found over the years is that not only are they used for the wrong problems, they’re almost always used incorrectly as well. Errors abound. Software in general is pretty buggy, but spreadsheets thrown together by end users are absolutely horrible. Stop the madness, uninstall spreadsheet software!

    Comment by Brian Knoblauch — 28/5/2014 @ 6:38

  35. I used to work for an electricity supplier maintaining software for a quotations system for commercial customers. In essence what it did was provide bespoke prices based on estimated uses drawn from half-hourly meter readings. So whereas a domestic customer may get their meter read once every few months if they’re lucky, these companies had meters recording readings 48 times a day.

    So what the system did was take the last set of known readings, extrapolate them over the next contract period (adjusting so the days of the week and public holidays matched), factor in some of the add-on charges and try to predict costs based on that. In mathematical terms this meant a lot of manipulations of matrices with 48×365 or 48×730 values.

    This was conventional software, but how to test? Well, there was a guy in the contracts department with a spreadsheet. Whenever our figures didn’t match we had to figure out if it was a bug in our code or his spreadsheet. We found plenty of examples of both.

    And the irony of it all was that most times these figures were the starting point for negotiations. So the importance of the “accuracy” of the results was moot. It was primarily a way of giving a customer an impression that we were really rigorous and detailed with figures and therefore must be giving them the best possible deal. Whereas in fact the way to get the best deal is the same as always – negotiate as hard as possible and play suppliers off each other to find the lowest quote.

    Comment by shuggie — 28/5/2014 @ 6:49

  36. Very interesting article. I sent the URL to one of our actuaries. I really doubt he’ll agree. One thing especially mentioned was Excel. As an alternative, I would suggest that it would perhaps be easier to audit a spreadsheet if people used Libre/Open Office. The reason is that the file format is documented: the data are contained in XML files in a zip archive. This means that it is possible to transform the data into a different format (using XML tools), which would be easier to read and audit.

    Comment by John McKown — 28/5/2014 @ 6:55

  37. a big inheritance is less likely to land you in the list of the richest people today than in 1970…

    Fascinating. Has the population changed since 1970? Where are your numbers and code? Seems like you are responding to a fairly serious work with a lot of fluff because you disagree with it.

    Comment by SR — 28/5/2014 @ 7:47

  38. Theoretically, once a programming problem is solved it is never necessary to “re-invent the wheel” by inventing your own code to solve the problem. For years, the ACM has published all the fundamental algorithms in one form or another. Modern programmers are using code that programmers, such as myself, invented back in the ’60s and earlier. Cut-and-paste programming is not always such a bad thing.

    But errors creep in at other places: the floating-point processor can make errors; there can be memory errors, compile errors, and language deficiencies. Since so many systems are interdependent on multiple sources of software, the accumulation of errors can be disastrous.

    Excel made extensive changes to its probability and statistics functions for 2010, but there is still a maze of complex systems.

    All that said, the conclusions brought about by economists are a matter of logic, not statistics. Precision at lower levels should not affect the logical conclusions unless there is a major distortion in the analysis.

    Comment by Meburke — 28/5/2014 @ 10:47

  39. There’s a small study showing how Reinhart-Rogoff and others might have caught their errors by writing instead in Executable English.

    The study is here:
    http://www.astd.org/Publications/Magazines/The-Public-Manager/Archives/2013/Fall/Social-Knowledge-Transfer-Using-Executable-English

    and here is the “code”:
    http://www.reengineeringllc.com/demo_agents/GrowthAndDebt1.agent

    Comment by Adrian Walker — 28/5/2014 @ 12:11

  40. Spreadsheets certainly have a bad reputation and there is no shortage of spreadsheet horror stories (we did our own rather tongue-in-cheek round-up in this ebook – http://info.f1f9.com/dirty-dozen-12-financial-modelling-horror-stories-ebook)

    Spreadsheets are software, and all software development requires discipline. The FAST Standard Organisation is making good progress introducing a disciplined structured approach to spreadsheet model development. Check out http://www.fast-standard.org.

    Comment by Kenny Whitelaw-Jones — 28/5/2014 @ 14:44

  41. The only way you can guard against mis-transcribed data is to have two people do the transcription independently to create files with the same format. Then take the difference, or otherwise detect differences. This is what they did for data input long ago, during the “data processing” era of the ’70s and ’60s.

    Comment by Mark Pawelek — 28/5/2014 @ 15:02

  42. Blaming tools is a futile endeavor. The issue is rarely the result of the tool utilized. Reaching back to the analogy of the surgeon, a scalpel is an inactive piece of metal. However, no one ever blames the scalpel when a surgery goes awry. The success of an endeavor always comes down to the skill and due diligence of the individual approaching the effort. As a former financial analyst and current software engineer I can emphatically say that tools are helpful and may contribute or detract from intended outcomes, but the human being behind the tool is always the key factor in whether a quality result is achieved. (Typed and spell-checked in Microsoft Word).

    Comment by Tafari — 29/5/2014 @ 6:25

    @Tafari – “blaming tools is a futile endeavor” – totally agree with your comment.

    Comment by Kenny Whitelaw-Jones — 29/5/2014 @ 8:17

  44. By the way, Piketty has responded officially:

    http://www.nytimes.com/2014/05/30/upshot/thomas-piketty-responds-to-criticism-of-his-data.html?partner=rss&emc=rss&smid=tw-nytimes&_r=0

    and claims that FT experts made an arbitrary choice in their counter-analysis. Now it becomes too technical for non-economists like me, but I want to remark that the matter is much more complicated than it seems.

    Comment by Giuseppe Lipari — 29/5/2014 @ 12:17

  45. @Lipari

    I allude to his response in my post. I find it lacking. He puts the burden of proof on his critics. This is not how science works. He has the burden of proof.

    Comment by Daniel Lemire — 29/5/2014 @ 12:27

  46. The one you allude to is the quick response to the journalist after the FT published its concerns, and certainly it is lacking. This response I linked is a 10-page document with links to data:

    http://piketty.pse.ens.fr/files/capital21c/en/Piketty2014TechnicalAppendixResponsetoFT.pdf

    Comment by Giuseppe Lipari — 29/5/2014 @ 12:31

  47. Hi Giuseppe,

    You wrote: “Now it becomes too technical for non-economists like me, but I want to remark that the matter is much more complicated than it seems.”

    Well, yes. But if written in Executable English, people with varying amounts of expertise can understand and comment. Plus, they can get step-by-step English explanations of the execution results.

    The “code” for an RR-like example is here: http://www.reengineeringllc.com/demo_agents/GrowthAndDebt1.agent

    Comment by Adrian Walker — 29/5/2014 @ 13:07

  48. Thank you for a very interesting article, Daniel, and the comments were quite informative too.

    Tools can influence outcomes.

    IMHO, whatever mistakes were made in the spreadsheet would have been more visible, and hence more likely to get fixed, if something like R had been used, although I see Ashley has said R can’t easily handle sparse/irregular data.

    Perhaps Python with Pandas or some such?

    Or a Timeseries DB?

    Adrian’s Executable English comment is intriguing – will definitely take a look.

    Also, not entirely OT, you may have heard of Tufte’s comment on NASA’s use of Powerpoint:

    http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001yB&topic_id=1

    Comment by A.Lahiri — 29/5/2014 @ 16:33

  49. I think the bigger idea is that researchers of all stripes adopt the ideas of ‘reproducible research.’ A start for the ideas here:

    http://reproducibleresearch.net/
    http://en.wikipedia.org/wiki/Reproducibility

    The big idea to me is that all manipulations with data leave an audit trail, and you need a system (software) that supports this. And this is the main difference between R/SAS/Python-Pandas and Excel: all of the non-Excel examples really allow you to record *in code* what is being done with the data. Well, as long as you force yourself to never edit the underlying data, but only remove/edit data via code statements. This allows many errors to be surfaced purely because they are explicitly laid out in code. Compare that to moving data around in Excel by dragging and dropping. There is no way to record that motion in Excel, but there is in the other languages.
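    A minimal sketch of that audit-trail idea using pandas (the data and column names here are invented for illustration):

    ```python
    # The cleaning script IS the audit trail: anyone can re-run it against
    # the untouched raw file and see exactly which rows were dropped or
    # changed, and why. Contrast with dragging cells around in Excel,
    # which leaves no record.
    import pandas as pd

    def clean(raw: pd.DataFrame) -> pd.DataFrame:
        # Step 1: drop observations with a missing wealth share
        # (recorded as a code statement, not by deleting cells).
        out = raw.dropna(subset=["top1_share"]).copy()
        # Step 2: correct a known transcription error for one year,
        # with the correction visible and reviewable in the code.
        out.loc[out["year"] == 1910, "top1_share"] = 0.45
        return out

    raw = pd.DataFrame({"year": [1900, 1910, 1920],
                        "top1_share": [0.40, 0.99, None]})
    cleaned = clean(raw)
    ```

    Because the raw file is never edited in place, a reviewer can diff the code, not screenshots of a spreadsheet.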

    There is much more to it than that, but to me, that is the first big advantage…

    Comment by Matt Pettis — 29/5/2014 @ 22:17

  50. Also, hand in glove with Excel problems, I think of this animated true story… https://www.youtube.com/watch?v=N2zK3sAtr-4

    Comment by Matt Pettis — 29/5/2014 @ 22:39

  51. … I should also add that Daniel did address this here, but I just wanted to attach the google-able topic name that addresses this to this post…

    Comment by Matt Pettis — 29/5/2014 @ 23:06

  52. ‘Modern’ Excel using the Power BI add-ins, such as Power Pivot using the DAX formula language with pivot tables, can help. It gives you what Rob Collie calls ‘reusable formulas’. Trust me, I’m a programmer :)

    Comment by Donakd Parish — 31/5/2014 @ 8:51

  53. Code and spreadsheets can contain errors and be written to be indecipherable or more easily reviewable. My view, based on experience, is that it is just as likely with either method for novel analysis. Both methods are also amenable to manipulation to achieve desired results.

    You say “it appears that Piketty’s code contains mistakes, fudging and other problems” and link to discussions of some. I followed the link to Magnus who provides a detailed discussion of the problems he observes in the spreadsheet. I see this as evidence against your claim that spreadsheets are not reviewable.

    My view may be wrong, but I see only assertions in this article to contradict it.

    Comment by Ken Taylor — 1/6/2014 @ 7:52

  54. But even STATA cannot cope with assumptions initially poured into the formulas, or the blindness we all labor under with these assumptions.

    Comment by E.L. Beck — 1/6/2014 @ 8:47

  55. @Ken

    You can review Excel spreadsheets; my point is that “spreadsheets make code review difficult.”

    The only reason Piketty’s spreadsheets are being reviewed (by a few key individuals) is that his book is holding on in the best-seller list.

    But look at how difficult these reviews are. People have to attach screenshots to make their points… and, even then, it isn’t very clear…

    You wouldn’t want the people who write software for the space shuttle to work this way.

    Comment by Daniel Lemire — 2/6/2014 @ 8:28

  56. Daniel said: “You can review Excel spreadsheets; my point is that ‘spreadsheets make code review difficult.’”

    Adrian says: +1

    To Ken: a more radical fix is to write the data extraction in executable English [1]. You can then get a human-readable audit trail.

    Here’s an example: http://www.reengineeringllc.com/demo_agents/GrowthAndDebt1.agent

    Comment by Adrian Walker — 2/6/2014 @ 8:35

  57. Matt wrote: This is the main difference between R/SAS/Python-Pandas and Excel: all of the non-Excel examples really allow you to record *in code* what is being done with the data.

    Adrian agrees and says: Where possible, writing in *Executable English* goes further. In particular, it supports human readable explanations/audit-trails of what data were used and what the app has done with the data.

    Example: http://www.reengineeringllc.com/demo_agents/EnergyIndependence1.agent

    Comment by Adrian Walker — 2/6/2014 @ 12:39
