Almost all software I write for my research is open sourced. A fellow researcher argued today that I risk reducing the gap between me and my pursuers. By the same logic, I should keep my data to myself (and avoid listing good sources of research data).

Here is my take on this issue.

  1. Sharing can’t hurt the small fish. Almost nobody sets out to beat Daniel Lemire at some conference next year. I have no pursuer. And guess what? You probably don’t. But if you do, you are probably doing quite well already, so stop worrying. Yes, yes, they will give you a grant even if you don’t actively sabotage your competitors. Relax already!
  2. Sharing your code makes you more convincing. By making your work easier to reproduce, you are instantly more credible. Trust is important in science. Why would anyone trust that I actually wrote the code and ran the experiments? Because I published my code, that’s why!
  3. Source code helps spread your ideas faster. In the long run, you should not care about getting papers accepted at some hot conference. What matters is the impact you have had. Make it easy for me to use your ideas! Help yourself!
  4. Sharing raises your profile in industry. Having open source software makes you more attractive to software engineers.
  5. You write better software if you share it. While not all code I publish is bug-free, documented or even usable, I care slightly more about my code because I publish it.

Finally, does sharing code work? Do people download and use my software? Here are download statistics for my latest source-code publications:

- A compressed alternative to the Java BitSet class: >280
- Rolling Hash C++ Library: >200
- Lemur Bitmap Index C++ Library: >2,000
- Fast Nearest-Neighbor Retrieval under the Dynamic Time Warping: >1,400

Related reading: Good prototyping software and The challenge of doing good experimental work by Suresh Venkatasubramanian. And More on algorithms and implementation by Michael Mitzenmacher.

Update: Joachim Wuttke pointed out another potential benefit: your users will debug your code.

Update 2: This post appeared on Slashdot.

27 Comments

  1. Daniel,

    Excellent points. Your arguments are also backed up by this article that I’ve found and tweeted recently: http://bit.ly/9UxbNL

    Comment by Nicholas — 10/2/2010 @ 14:26

  2. The real question is why this is even an option. Why are papers accepted without the source code and data attached?

    Comment by Neil — 10/2/2010 @ 14:35

  3. I was the one who proposed the idea that publishing source code could “reduce the gap between a researcher and its pursuers”. Don’t get me wrong: I’m all in favor of publishing source code, but this was part of my reasoning as to why, in my field (computer animation, graphics), freely available source code is so rare.

    My tentative explanation, that publishing code is not a “fit strategy” for the academic reward system, is partly rooted in my disdain for this system. Other explanations might include:
    - The perceived reward-to-effort ratio is too low.
    - Researchers are afraid that publishing academic-grade (i.e. throw-away) code would tarnish their reputation.
    - The claimed results wouldn’t stand up to the public scrutiny enabled by open source code.

    In all cases, I wish we would see more openly available source code, test cases and experimental results… But I doubt this can be done simply by convincing researchers they should do it. There has to be some intrinsic motivation within the reward system.

    Comment by Philippe Beaudoin — 10/2/2010 @ 14:52

  4. @Philippe

    (1) It is hardly difficult to wrap up your code and post it online. Maybe people will have a hard time getting it to work, especially without proper documentation, but that is another issue.

    (2) I don’t buy the “tarnish your reputation” line. Researchers publish cheesy out-of-date web pages all the time. They show up for work unshaved and in days-old clothes. They give lectures out of lecture notes from two centuries back. They are just not such perfectionists.

    (3) Rightfully, they are afraid that their results might not stand up to scrutiny. And you know, maybe some paper I have published has *mistakes* in it and people might find out. Big deal.

    Nobody cares about the mistakes you make, as long as you make them in good faith.

    When measuring your contributions, people don’t subtract your mistakes.

    Comment by Daniel Lemire — 10/2/2010 @ 15:23

  5. I wonder if it has anything to do with keeping all your papers, presentations and posters open… I actually do post everything at my web page, since I hardly believe that the ‘stealing’ issue is really working in hard sciences. What do you think?

    Comment by Misha Lemeshko — 10/2/2010 @ 15:32

  6. @Misha

    They are all components of Open Scholarship. But I think that software and data are a different (yet related) issue from, say, posting your papers online. It is about showing the ingredients that went into making your paper.

    Comment by Daniel Lemire — 10/2/2010 @ 15:36

  7. That’s true for papers. But what about presenting stuff which is very important these days? I don’t care if anyone learns from my presentations that I post online…

    Comment by Misha Lemeshko — 10/2/2010 @ 15:39

  8. @Daniel: I’m not trying to make a case for them. I’m simply trying to figure out why we don’t see more open source research code. Also, I disagree with your reasoning about reputation. Most researchers I know, especially young profs, find it very important to maintain a good reputation. If not in front of their classroom, at least in front of fellow researchers. This reputation has nothing to do with the way they dress or shave, but rather with how well they write their papers, how innovative their idea is, or how good their code is.

    My point (3) wasn’t related to mistakes you have overlooked and that will be forgiven. I was referring to the problem of over-claiming or carefully hiding known defects and limitations. It’s much easier to over-claim when you don’t publish your source code. This might not be frequent in your field, but I swear it is in mine.

    Comment by Philippe Beaudoin — 10/2/2010 @ 15:51

  9. @Philippe

    Oh! Absolutely. People fudge their results all the time. But that’s an excellent reason to publish your code. Because *you don’t fudge* your results, you can share your code and gain an edge over those who don’t.

    Comment by Daniel Lemire — 10/2/2010 @ 16:02

  10. @Daniel What if the “fittest” strategy was to fudge your result? More formally:

    reward( fudge + no_source ) > reward ( not_fudge + source )

    This seems to hold in my area, at least.

    Comment by Philippe Beaudoin — 10/2/2010 @ 16:22

  11. @Philippe

    Fine. But what if you don’t fudge your results? Shouldn’t you then try to maximize your personal gain, even if everyone else is better with a different strategy?

    You know, I guess what I am trying to say is that we should not be content to just do what others do. We should think through the issues and adopt behaviours which best fit our own needs and wants.

    For me, given who I am, how I think, opening up is the most profitable course of action. I don’t give a damn if others don’t do it, for whatever reason.

    And you know.. I’m pretty sure that if I throw away my academic career one day… all this open code will help convince employers or clients that I am not just faking it…

    Because, let’s face it, when it comes down to it, it can be damn nice to be able to go back to writing code for a living. It beats serving fries.

    (Not that I plan to throw away my academic career, but who knows what I might want to do with my life in ten years? Maybe I’ll want to live in Siberia.)

    Comment by Daniel Lemire — 10/2/2010 @ 16:39

  12. @Neil That is why some journals, e.g. the Journal of Experimental Algorithmics, do not accept papers without source code.

    Comment by Itman — 10/2/2010 @ 17:09

  13. @Daniel I think I understand why we manage to both agree AND disagree at the same time. I agree with you on the desirable course of action. Publishing source code and not fudging my results is exactly what I would do in a total absence of external pressure. I guess I could reach a point — say, by getting tenure — where this external pressure would be small enough to let me do research the way I believe it should be done.

    However, until I get a job, and given the very competitive market profs-to-be are faced with, I feel tremendous external pressure to publish many papers in renowned conferences. That is why I’m so keen to analyze not “what is the good way to conduct research”, but rather “what is the optimal strategy given the reward system”.

    My experience is that, by doing research the way I like doing it, I don’t get as many publications as others. I’m therefore faced with a choice: do I try to do research in the same way as others, or do I bail out? I can’t bring myself to fudge my results, to be harsh with my fellow researchers, to pepper my texts with needlessly complex equations or to intentionally obfuscate my thoughts.

    The only choice therefore seems to be to bail out. That’s what I might do, but not without a fight! I will keep bringing arguments as to why I believe this reward system is shutting out people that could have made important contributions to research.

    Sour grapes? Could be that too. :)

    Comment by Philippe Beaudoin — 10/2/2010 @ 18:58

  14. I agree 100%. If only my former postdoc supervisor would…

    Comment by Nicolas — 10/2/2010 @ 23:29

  15. I agree that having researchers publish their code Open Source would be a great thing. Just like it would be a great thing if researchers published their lab book and detailed research notes online.

    However, I see at least two good reasons why one would not want to make scientific research software available as Open Source, neither of which involves doing anything bad (such as fudging results or hiding mistakes):

    1) In my experience, sharing code costs a lot of my personal time. Not because it takes lots of time to wrap up the code and share it. Not because I feel that it has to be beautiful code before I am willing to let other people see it. Not because I feel a need for it to be well commented and documented before I share it. But because of the amount of time I have to spend on helping the people who try to make my ugly, undocumented code work.

    2) Sharing raises your profile in industry – and it lowers it too. Making your tools available (for example via a web interface) increases the visibility of the tools, both in academia and in industry. But making it Open Source means the company would gain less by hiring you than they would if you had kept the intellectual property (i.e. the code) to yourself.

    Comment by Lars Juhl Jensen — 11/2/2010 @ 3:25

  16. Nice post! I don’t know whether you’re aware of this, but in the area of machine learning, we’ve initiated an effort to give researchers more incentive to publish their research code under an open source model. We’ve built a website http://mloss.org where you can register your software projects, and we’ve installed a special track with the Journal of Machine Learning Research for research related open source software http://jmlr.csail.mit.edu/mloss/

    Comment by Mikio L. Braun — 16/2/2010 @ 4:43

  17. As a department chair, I gave this some thought in terms of promotion and tenure of junior faculty members. I educated a dean on the value of widespread downloading of research software as a SUPPLEMENT to publication in conferences and journals. I have also been saddled with people for whom the software, not the research, became the end–these people were less successful academics. There is also the ongoing burden of responding to bug reports, requests for new capabilities, etc.

    As a program manager at DOE, I helped DOE/NNSA and DOE/Science/OASCR jointly declare that all software products produced by their funding would be available with an open-source license. I have also worked with researchers who have invested a lot (several generations of back-room grad students) in complex software infrastructures within which their experimental computer science could be conducted. Freely sharing such an infrastructure would indeed enable other researchers to compete without having made any investment in infrastructure–so I empathize with this point of view as well. Finally, lots of research software is simply not fit for consumption by other than the authors; if shared, the authors’ reputations could be diminished.

    As some commenters have noted, the issue is not as simple as the author briefly claims.

    Comment by Leading Edge Boomer — 29/1/2013 @ 21:54

  18. Once (in 1985) I had a good idea. I coded it up, published a paper and posted the code on the Internet. Google says that the paper has been cited 2576 times. The idea wasn’t _that_ good. Distributing the code led to early adopters and then it sort of avalanched.

    Comment by Andrew Fraser — 29/1/2013 @ 22:01

  19. Researchers, from what I have seen, refuse to share their source because they don’t want flaws found that would nullify their research. Usually, their research can’t be reproduced. The whole academic process is screwed up.

    Comment by Ateeq — 30/1/2013 @ 8:58

  20. @Leading Edge Boomer

    Just to be clear, my post does not state that people should be required to publish their software. I only invite people to think rationally about it instead of relying strictly on their gut reaction.

    I have also worked with researchers who have invested a lot (…) Freely sharing such an infrastructure would indeed enable other researchers to compete without having made any investment in infrastructure (…)

    I understand your argument. You are saying that one might feel he is giving his competitors an edge and losing out in the process. But that’s an impression. How does it work in reality?

    As a scientist, I’d rather work with facts than impressions. I see no practical evidence that people who publish their software suffer.

    You give the example of tenure-track professors who forgo scholarship in favor of maintaining software. But these few professors might also have been denied tenure without this open source software. It is hard for me to believe that publishing their software is what caused them to fail at their job.

    It is not like answering bug reports is likely to become an addiction.

    Rather, it is entirely possible that, on average, the tenure-track professors who do publish their software are more likely to get tenure.

    (…) lots of research software is simply not fit for consumption (…) if shared, the authors’ reputations could be diminished.

    This feels like an argument for improving the quality of the work rather than an argument against publishing the software.

    Comment by Daniel Lemire — 30/1/2013 @ 9:34

  21. I think a major point has been only alluded to. The reality is that publishing in top venues is only one part of the equation – citations count a whole lot too!

    So, publishing the code that comes with a paper means that people will build on that code, and cite the paper a lot. That can give you very good stats.

    Comment by M-A — 4/2/2013 @ 21:05

  22. I am very strongly in support of publishing source code in the interests of reproducibility and progress. Withholding code might do service to a career in a competitive field, but it does a disservice to science. It’s completely at odds with the spirit of collaboration.

    I have two questions for anyone reading this blog who has published source code:

    1) When publishing source code, do you advocate attaching it as supplementary material accessible through a journal’s website? Or would you tend toward something more truly open source, like uploading it to GitHub or to a field-specific code repository (and then pointing to this online repository somewhere in the paper’s end-matter)?

    2) Do you advocate including a GPL/equivalent licence along with a stipulation that XXX paper be cited if the code is used in published research?

    Comment by cgb — 29/3/2013 @ 2:47

  23. @cgb

    1) It makes much more sense for me to post it on GitHub or some similar site. The paper then contains a hyperlink to the source code.

    The downside is that GitHub might go away one day… but if your code sees any use at all, it will stick around (e.g., people will come to rely on it, make copies…). If not, then it is no big loss if it goes away.

    2) I think that requiring people to cite you if they use your work is an annoying practice. That’s like handing out presents along with a letter stating that you expect to be thanked for your present.

    Comment by Daniel Lemire — 29/3/2013 @ 8:36

  24. Are they really presents? Aren’t they more like signposts? What is inherently wrong in requesting a citation along with the source, in the interests of tracking provenance for career reasons and aiding traceability for “science”?

    When you publish code, you don’t ever prepend a GPL? (Genuine question. I’m not convinced about what I should do with the things I have ready right now – I want anyone to use/extend it freely, but after 6 months of writing/debugging I do feel like it’s intellectually “mine” and I would like people who use it to acknowledge where the source came from.)

    The possibility of GitHub disappearing is a bigger issue for me. Far bigger than ego issues. If I’m looking up FORTRAN code written in the 80s uploaded to a no-longer extant host, how useful is that? How open is open? Should all computational papers include a mandatory code appendix?

    Comment by cgb — 29/3/2013 @ 12:21

  25. It is not that hard to keep local copies of your software, and to make sure that if GitHub (or whatever host you use) goes down, you put it back online. Besides, if the software is popular it will be replicated in many repositories. At the very least, immediately following papers will be able to either confirm or refute results. A singleton paper (likely to be error-prone) promising some non-verifiable benefits? How useful is this? I guess that it could have been much more useful.

    Comment by Itman — 29/3/2013 @ 12:30

  26. @cgb

    What is inherently wrong in requesting a citation (…) I want anyone to use/extend it freely, but after 6 months of writing/debugging I do feel like it’s intellectually “mine” and I would like people who use it to acknowledge where the source came from

    Obviously, people who use your work should acknowledge the use of said work. Failing to acknowledge your inputs is a major faux pas in science.

    However, making it an explicit requirement is a breach of etiquette. When I post my papers online, I expect people who use them to cite me. I do not, however, add a license to my papers requiring people to cite me.

    And it is useless too: how are you going to enforce such agreements? Do you sue people who break this requirement? For what kind of damages?

    When you take a girl out, you don’t ask her to agree that if you pay for dinner she has to make out. It would be very rude. If the girl likes you, she will make out. Your goal is to make her like you, not make her sign agreements. But even if you did get her to sign an agreement that she will make out, what if she doesn’t? Are you going to rape her? Right, I did not think so.

    People who say upfront that I need to cite them rub me the wrong way. I find it rude. I’ll acknowledge you if I use your work. Please don’t make it an explicit requirement.

    When you publish code, you don’t ever prepend a GPL?

    I no longer use the GPL as I feel it is too restrictive for academic software. I prefer to use the Apache License 2.0.

    If I’m looking up FORTRAN code written in the 80s uploaded to a no-longer extant host, how useful is that?

    I would not assume that old Fortran code that has not been used in 40 years can be useful today. It probably won’t compile and even if it does, it may be unreliable.

    However, if it got any significant use over the years, then it has probably been maintained, ported, fixed, repackaged and so on.

    I simply don’t care very much for a lump of code that was dropped on a server 10 years ago. Who is going to provide support for it? Where do you file the bug reports? How do you contribute to it? How do you establish dependencies?

    Comment by Daniel Lemire — 29/3/2013 @ 12:46

  27. @Daniel

    I take your point about it rubbing you the wrong way.

    My thinking was more along the lines of establishing clear provenance of the source code by incorporating a reference to the associated publication.  The intention would simply be to remind, encourage and indeed enable end-users to cite the work from which the code originated by providing that reference.  Of course it would be academic misconduct to plagiarise work, so I don’t see why it would be offensive to prepend a statement like “This code is the product of work published as XXX, and any derivative works should reference this paper wherever citation is appropriate”. It’s helpful! No?

    In this case the work I’m talking about involves specific numerical computations of analytical work published decades ago. A few groups have published numerical data focusing on this or that aspect of the underlying physical phenomena, but nobody has been willing to publish their code. This offends me. I resent having to reinvent the wheel; that’s not how science should work. Having now written something fairly robust that can give very specific, somewhat novel insight into certain aspects of this system, I would like to share this with the community. Certain members of this community have proven to be rather antisocial, so I am inclined to make my expectation of openness and fair citation more explicit.

    RE: Apache vs GPL, I think I agree with you. While there are some philosophical costs, Apache provides more freedom in some important ways.

    Comment by cgb — 30/3/2013 @ 12:55
