Almost all software I write for my research is open sourced. A fellow researcher argued today that I risk reducing the gap between me and my pursuers. By the same logic, I should keep my data to myself (and avoid listing good sources of research data).

Here is my take on this issue.

  1. Sharing can’t hurt the small fish. Almost nobody sets out to beat Daniel Lemire at some conference next year. I have no pursuer. And guess what? You probably don’t. But if you do, you are probably doing quite well already, so stop worrying. Yes, yes, they will give you a grant even if you don’t actively sabotage your competitors. Relax already!
  2. Sharing your code makes you more convincing. By making your work easier to reproduce, you are instantly more credible. Trust is important in science. Why would anyone trust that I actually wrote the code and ran the experiments? Because I published my code, that’s why!
  3. Source code helps spread your ideas faster. In the long run, you should not care about getting papers accepted at some hot conference. What matters is the impact you have had. Make it easy for me to use your ideas! Help yourself!
  4. Sharing raises your profile in industry. Having open source software makes you more attractive to software engineers.
  5. You write better software if you share it. While not all code I publish is bug-free, documented or even usable, I care slightly more about my code because I publish it.

Finally, does sharing code work? Do people download and use my software? Here are download statistics for my latest source-code publications:

- A compressed alternative to the Java BitSet class: over 280 downloads
- Rolling Hash C++ Library: over 200 downloads
- Lemur Bitmap Index C++ Library: over 2,000 downloads
- Fast Nearest-Neighbor Retrieval under the Dynamic Time Warping: over 1,400 downloads

Related reading: Good prototyping software and The challenge of doing good experimental work by Suresh Venkatasubramanian, and More on algorithms and implementation by Michael Mitzenmacher.

Update: Joachim Wuttke pointed out another potential benefit: your users will debug your code.

16 Comments »

  1. Daniel,

    Excellent points. Your arguments are also backed up by this article that I’ve found and tweeted recently: http://bit.ly/9UxbNL

    Comment by Nicholas — 10/2/2010 @ 14:26

  2. The real question is why this is even an option. Why are papers accepted without the source code and data attached?

    Comment by Neil — 10/2/2010 @ 14:35

  3. I was the one who proposed the idea that publishing source code could “reduce the gap between a researcher and its pursuers”. Don’t get me wrong: I’m all in favor of publishing source code, but this was part of my reasoning as to why, in my field (computer animation, graphics), freely available source code is so rare.

    My tentative explanation, that publishing code is not a “fit strategy” for the academic reward system, is partly rooted in my disdain for this system. Other explanations might include:
    - The perceived reward-to-effort ratio is too low.
    - Researchers are afraid that publishing academic-grade (i.e. throw-away) code would tarnish their reputation.
    - The claimed results wouldn’t stand up to the public scrutiny enabled by open source code.

    In all cases, I wish we would see more openly available source code, test cases and experimental results… But I doubt this can be done simply by convincing researchers they should do it. There has to be some intrinsic motivation within the reward system.

    Comment by Philippe Beaudoin — 10/2/2010 @ 14:52

  4. @Philippe

    (1) It is hardly difficult to wrap up your code and post it online. Maybe people will have a hard time getting it to work, especially without proper documentation, but that is another issue.

    (2) I don’t buy the “tarnish your reputation” line. Researchers publish cheesy out-of-date web pages all the time. They show up for work unshaved and in days-old clothes. They give lectures out of lecture notes from two centuries back. They are just not such perfectionists.

    (3) Rightfully, they are afraid that their results might not stand up to scrutiny. And you know, maybe some paper I have published has *mistakes* in it and people might find out. Big deal.

    Nobody cares about the mistakes you make, as long as you make them in good faith.

    When measuring your contributions, people don’t subtract your mistakes.

    Comment by Daniel Lemire — 10/2/2010 @ 15:23

  5. I wonder if it has anything to do with keeping all your papers, presentations and posters open… I actually do post everything on my web page, since I hardly believe that ‘stealing’ is a real issue in the hard sciences. What do you think?

    Comment by Misha Lemeshko — 10/2/2010 @ 15:32

  6. @Misha

    They are all components of Open Scholarship. But I think that software and data are a different (yet related) issue from, say, posting your papers online. It is about showing the ingredients that went into making your paper.

    Comment by Daniel Lemire — 10/2/2010 @ 15:36

  7. That’s true for papers. But what about presentations, which are very important these days? I don’t care if anyone learns from the presentations that I post online…

    Comment by Misha Lemeshko — 10/2/2010 @ 15:39

  8. @Daniel: I’m not trying to make a case for them. I’m simply trying to figure out why we don’t see more open source research code. Also, I disagree with your reasoning about reputation. Most researchers I know, especially young profs, find it very important to maintain a good reputation. If not in front of their classroom, at least in front of fellow researchers. This reputation has nothing to do with the way they dress or shave, but rather with how well they write their papers, how innovative their idea is, or how good their code is.

    My point (3) wasn’t related to mistakes you have overlooked and that will be forgiven. I was referring to the problem of over-claiming or carefully hiding known defects and limitations. It’s much easier to over-claim when you don’t publish your source code. This might not be frequent in your field, but I swear it is in mine.

    Comment by Philippe Beaudoin — 10/2/2010 @ 15:51

  9. @Philippe

    Oh! Absolutely. People fudge their results all the time. But that’s an excellent reason to publish your code. Because *you don’t fudge* your results, you can share your code and gain an edge over those who don’t.

    Comment by Daniel Lemire — 10/2/2010 @ 16:02

  10. @Daniel What if the “fittest” strategy was to fudge your result? More formally:

    reward(fudge + no_source) > reward(not_fudge + source)

    This seems to hold in my area, at least.

    Comment by Philippe Beaudoin — 10/2/2010 @ 16:22

  11. @Philippe

    Fine. But what if you don’t fudge your results? Shouldn’t you then try to maximize your personal gain, even if everyone else is better off with a different strategy?

    You know, I guess what I am trying to say is that we should not be content to just do what others do. We should think through the issues and adopt behaviours which best fit our own needs and wants.

    For me, given who I am, how I think, opening up is the most profitable course of action. I don’t give a damn if others don’t do it, for whatever reason.

    And you know.. I’m pretty sure that if I throw away my academic career one day… all this open code will help convince employers or clients that I am not just faking it…

    Because, let’s face it, when it comes down to it, it can be damn nice to be able to go back to writing code for a living. It beats serving fries.

    (Not that I plan to throw away my academic career, but who knows what I might want to do with my life in ten years? Maybe I’ll want to go live in Siberia.)

    Comment by Daniel Lemire — 10/2/2010 @ 16:39

  12. @Neil That is why some journals, e.g. the Journal of Experimental Algorithmics, do not accept papers without source code.

    Comment by Itman — 10/2/2010 @ 17:09

  13. @Daniel I think I understand why we manage to both agree AND disagree at the same time. I agree with you on the desirable course of action. Publishing source code and not fudging my results is exactly what I would do in a total absence of external pressure. I guess I could reach a point — say, by getting tenure — where this external pressure would be small enough to let me do research the way I believe it should be done.

    However, until I get a job, and given the very competitive market profs-to-be are faced with, I feel tremendous external pressure to publish many papers in renowned conferences. That is why I’m so keen to analyze not “what is the good way to conduct research”, but rather “what is the optimal strategy given the reward system”.

    My experience is that, by doing research the way I like doing it, I don’t get as many publications as others. I’m therefore faced with a choice: do I try to do research in the same way as others, or do I bail out? I can’t bring myself to fudge my results, to be harsh with my fellow researchers, to pepper my texts with needlessly complex equations or to intentionally obfuscate my thoughts.

    The only choice therefore seems to be to bail out. That’s what I might do, but not without a fight! I will keep bringing arguments as to why I believe this reward system is shutting out people that could have made important contributions to research.

    Sour grapes? Could be that too. :)

    Comment by Philippe Beaudoin — 10/2/2010 @ 18:58

  14. I agree 100%. If only my former postdoc supervisor would…

    Comment by Nicolas — 10/2/2010 @ 23:29

  15. I agree that having researchers publish their code Open Source would be a great thing. Just like it would be a great thing if researchers published their lab book and detailed research notes online.

    However, I see at least two good reasons why one would not want to make scientific research software available as Open Source, neither of which involves doing anything bad (such as fudging results or hiding mistakes):

    1) In my experience, sharing code costs a lot of my personal time. Not because it takes lots of time to wrap up the code and share it. Not because I feel that it has to be beautiful code before I am willing to let other people see it. Not because I feel a need for it to be well commented and documented before I share it. But because of the amount of time I have to spend on helping the people who try to make my ugly, undocumented code work.

    2) Sharing raises your profile in industry – and it lowers it too. Making your tools available (for example via a web interface) increases the visibility of the tools, both in academia and in industry. But making them Open Source means a company would gain less by hiring you than it would if you had kept the intellectual property (i.e. the code) to yourself.

    Comment by Lars Juhl Jensen — 11/2/2010 @ 3:25

  16. Nice post! I don’t know whether you’re aware of this, but in the area of machine learning, we’ve initiated an effort to give researchers more incentive to publish their research code under an open source model. We’ve built a website http://mloss.org where you can register your software projects, and we’ve installed a special track with the Journal of Machine Learning Research for research related open source software http://jmlr.csail.mit.edu/mloss/

    Comment by Mikio L. Braun — 16/2/2010 @ 4:43
