Science and Technology links (May 18th, 2018)

  1. How is memory encoded in your brain? If you are like me, you assume that it is encoded in the manner in which your brain cells are connected together: strong and weak connections between brain cells create memories. Some people think that this is not how memories are encoded.

    To prove otherwise, scientists have recently transferred memories between snails by injecting small molecules taken from a trained snail. Maybe one day you could receive new memories through an injection. If true, this result is probably worth a Nobel prize. It is probably not true.

  2. Inflammation is a crude and undiscerning immune response that your body uses when it has nothing better to offer. One of the reasons aspirin is so useful is that it tends to reduce inflammation. There are many autoimmune diseases that can be described as “uncontrolled inflammation”. For example, many people suffer from psoriasis: their skin peels off and becomes sensitive. Richards et al. believe that most neurodegenerative diseases (such as Alzheimer’s, Parkinson’s, ALS) are of a similar nature:

    it is now clear that inflammation is (…) likely, a principal underlying cause in most if not all neurodegenerative diseases

  3. Scientists are sounding the alarm about the genetic tinkering carried out in garages and living rooms.
  4. The more intelligent you are, the less connected your brain cells are:

    (…)individuals with higher intelligence are more likely to have larger gray matter volume (…) intelligent individuals, despite their larger brains, tend to exhibit lower rates of brain activity during reasoning (…) higher intelligence in healthy individuals is related to lower values of dendritic density and arborization. These results suggest that the neuronal circuitry associated with higher intelligence is organized in a sparse and efficient manner, fostering more directed information processing and less cortical activity during reasoning.

  5. Moderate alcohol consumption is thought to have a protective effect on your heart. What about people who drink too much? A recent study found that patients with a troublesome alcohol history have a significantly lower prevalence of cardiovascular disease events, even after adjusting for demographic and traditional risk factors. Note that this does not imply that drinking alcohol will result in a healthier or longer life.
  6. A third of us have high blood pressure. And most of us are not treated for it.
  7. Eating lots of eggs every day is safe. Don’t be scared of their cholesterol. (Credit: Mangan.)
  8. According to data collected by NASA, global temperatures have fallen for the last two years. This is probably due to the El Niño event that caused record temperatures two years ago. What is interesting to me is that this decline gets no mention at all in the press, whereas a single high-temperature record (like the one set two years ago) makes the front page.

    That’s a problem in my opinion. You might think that by pushing aside data that could be misinterpreted, you are protecting the public. I don’t think it works that way. People are less stupid and more organized than you might think. They will find the data, they will talk about it among themselves, and they will lose confidence in you (rightly so). The press and the governments should report that the temperatures are decreasing… and then explain why this does not mean that the Earth has stopped warming.

    The Earth is definitely getting warmer, at a rate of about 0.15 degrees per decade. Your best bet is to report the facts.

  9. Low-carbohydrate, high-fat diets might prevent cancer progression.
  10. Participating in the recommended 150 minutes of moderate to vigorous activity each week, such as brisk walking or biking, in middle age may be enough to reduce your heart failure risk by 31 percent. (There is no strong evidence currently that people who exercise live longer. It does seem that they are more fit, however.)

Validating UTF-8 strings using as little as 0.7 cycles per byte

Most strings found on the Internet are encoded using a particular unicode format called UTF-8. However, not all strings of bytes are valid UTF-8. The rules as to what constitutes a valid UTF-8 string are somewhat arcane. Yet it seems important to quickly validate these strings before you consume them.

In a previous post, I pointed out that it takes about 8 cycles per byte to validate them using a fast finite-state machine. After hacking code found online, I showed that using SIMD instructions, we could bring this down to about 3 cycles per input byte.

Is that the best one can do? Not even close.

Many strings are just ASCII, which is a subset of UTF-8. They are easily recognized because they use just 7 bits per byte, with the most significant bit set to zero. Yet if you check each and every byte with silly scalar code, it is going to take over a cycle per byte to verify that a string is ASCII. For much better speed, you can vectorize the problem in this manner:

__m128i mask = _mm_setzero_si128();
for (...) {
    __m128i current_bytes = _mm_loadu_si128((const __m128i *)src);
    mask = _mm_or_si128(mask, current_bytes);
}
// a byte is negative (as a signed value) exactly when its most significant bit is set
__m128i has_error = _mm_cmpgt_epi8(
         _mm_setzero_si128(), mask);
return _mm_testz_si128(has_error, has_error);

Essentially, we are loading up vector registers and computing a logical OR with a running mask. Whenever a byte outside the ASCII range is present, the most significant bit of the corresponding byte in the running mask gets set. We continue until the very end no matter what, and only then do we examine the mask.
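
For completeness, here is one way the sketch above could be turned into a full function. The tail handling and the function name are my own assumptions, not necessarily what the benchmarked code does.

#include <smmintrin.h> // SSE4.1, for _mm_testz_si128
#include <stdbool.h>
#include <stddef.h>

bool is_ascii_sse(const char *src, size_t len) {
  __m128i mask = _mm_setzero_si128();
  size_t i = 0;
  for (; i + 16 <= len; i += 16) {
    __m128i current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
    mask = _mm_or_si128(mask, current_bytes);
  }
  __m128i has_error = _mm_cmpgt_epi8(_mm_setzero_si128(), mask);
  if (!_mm_testz_si128(has_error, has_error)) return false;
  for (; i < len; i++) { // finish the last few bytes with scalar code
    if ((signed char)src[i] < 0) return false;
  }
  return true;
}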

We can use the same general idea to validate UTF-8 strings. My code is available.

finite-state machine (is UTF-8?)        8 to 8.5 cycles per input byte
determining if the string is ASCII      0.07 to 0.1 cycles per input byte
vectorized code (is UTF-8?)             0.7 to 0.9 cycles per input byte

If you are almost certain that most of your strings are ASCII, then it makes sense to first test whether the string is ASCII, and only then fall back on the more expensive UTF-8 test.
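
Here is a sketch of that two-stage strategy. The function is_ascii_sse is the vectorized check above, while validate_utf8 stands in for whichever full UTF-8 check you use; both names are mine, not the library’s API.

#include <stdbool.h>
#include <stddef.h>

bool is_ascii_sse(const char *src, size_t len);   // vectorized ASCII check (above)
bool validate_utf8(const char *src, size_t len);  // full UTF-8 check (stand-in name)

bool validate(const char *src, size_t len) {
  if (is_ascii_sse(src, len))
    return true;                  // pure ASCII is always valid UTF-8
  return validate_utf8(src, len); // fall back to the expensive check
}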

So we are ten times faster than a reasonable scalar implementation. I doubt this scalar implementation is as fast as it can be… but it is not naive… And my own code is not nearly optimal. It does not use AVX, to say nothing of AVX-512. Furthermore, it was written in a few hours. I would not be surprised if one could double the speed using clever optimizations.

The exact results will depend on your machine and its configuration. But you can try the code.

I have created a C library out of this little project as it seems useful. Contributions are invited. For reliable results, please configure your server accordingly: benchmarking on a laptop is hazardous.

Credit: Kendall Willets made a key contribution by showing that you could “roll” counters using saturated subtractions.

Is research sick?

One of the most important database researchers of all time, Michael Stonebraker, has given a talk recently on the state of database research. I believe that many of his points are of general interest:

  • We have lost our consumers… Researchers write for other researchers. They are being insular and ultimately irrelevant.
  • We ignore the important problems… Researchers prefer to focus on the easy problems they know how to solve right away.
  • Irrelevant theory comes to dominate… Making things work in the real world is harder than publishing yet-another-theory paper.
  • We do not have research taste… Researchers jump on whatever is fashionable with little regard to whether it makes sense (XML, Semantic Web, MapReduce, Object databases and so forth).

The core message that Stonebraker is trying to pass is that the seminal work he did years ago would no longer be possible today.

I do not share his views, but I think that they are worth discussing.

The fact that we are publishing many papers is not, by itself, a problem. The irrelevant theory can be ignored. The fact that many people follow trends mindlessly whereas only a few make their own way is just human nature.

So what is the real problem?

My take is that there is an underlying supply-and-demand issue. We are training many more PhDs than there are desirable tenure-track jobs. This puts early-career researchers in an overcompetitive game. If you can signal your value by getting papers accepted at competitive venues, then that is what you are going to do. If you need to temporarily exaggerate your accomplishments, or find shortcuts, maybe you will do so to secure the much-needed job.

You cannot magically make a field less competitive without tackling the supply and demand. Either you reduce the number of PhDs, or you increase the number of desirable research jobs. I’d vote for the latter. Stonebraker says nothing about this. In fact, though I like him well enough, he is an elitist who cares for little but the top-20 schools in the US. The problem is that even just these top-20 schools produce many more PhDs than they consume. Is it any wonder if all these PhD students desperately play games?

The reason Stonebraker was able to get a tenure-track job in a good school when he completed his PhD without any publication is… Economics 101… there was no shortage of tenure-track jobs back then.

Is it any wonder if the new professors recruited in an overcompetitive system are status-obsessed folks?

There is no doubt that some PhD students and young faculty members will pursue volume at the expense of quality and relevance. However, there is no reason to believe that the best American universities (the ones Stonebraker cares about) do not already take this into account. Do you really think you can go from a PhD program to a faculty job at Harvard, Stanford or MIT without one or two great papers? Do you really think that these schools will pass on someone who has produced seminal work, and pick the fellow who has a large volume of irrelevant papers instead? If that’s the case, where are the examples?

Meanwhile, there are more great papers being published in my area, in any given year, than I can read. Here is a hint: you can find many of them on arXiv. No need to go to an expensive conference at some remote location. It might be the case that some disciplines are corrupted, but computer science research feels quite worthwhile, despite all its flaws.

Science and Technology links (May 11th, 2018)

  1. It looks like avoiding food most of the day, even if you do not eat less, is enough to partially rejuvenate you.
  2. Google researchers use deep learning to emulate how mammals find their way in challenging environments.
  3. We know that athletes live longer than the rest of us. It turns out that chess players also live longer.
  4. Google claims to have new pods of TPUs (tensor processing units) reaching 100 petaflops. I don’t know if this number makes sense. It seems higher than that of many supercomputers. Some estimates indicate that our brains have a capacity of about 1 exaflop, so if you put 10 pods together, you should have enough computational power to match the human brain. If Google has this technology, I’d consider it to be a massive step forward.
  5. Gwern Branwen has a fantastic essay on the genetics of the novel Dune. In the novel, thousands of years are spent improving human genetics. Gwern suggests that this is wrong:

    even the most pessimistic estimates of how many generations it might take to drastically increase average human intelligence (…) might be 20 generations, certainly not thousands of generations

    The point seems to be that there is no need to wait for the emergence of new genes. Instead, we just have to select the right alleles across many existing genes, which is far easier.

  6. In a clinical trial, weight loss cured most people of type 2 diabetes.
  7. There is currently no cure for male pattern baldness. A drug called WAY-316606 has been found to cure baldness… in mice. I could use more hair.

How quickly can you check that a string is valid unicode (UTF-8)?

Though character strings are represented as bytes (values in [0,255]), not all sequences of bytes are valid strings. By far the most popular character encoding today is UTF-8, part of the unicode standard. How quickly can we check whether a sequence of bytes is valid UTF-8?

Any ASCII string is a valid UTF-8 string. An ASCII character is simply a byte value in [0,127] or [0x00, 0x7F] in hexadecimal. That is, the most significant bit is always zero.

You can check that a string is made of ASCII characters easily in C:

bool is_ascii(const signed char *c, size_t len) {
  for (size_t i = 0; i < len; i++) {
    if(c[i] < 0) return false;
  }
  return true;
}

However, there are many more unicode characters than can be represented using a single byte. For other characters, outside the ASCII set, we need to use two or more bytes. All of these “fancier” characters are made of sequences of bytes all having the most significant bit set to 1. However, there are somewhat esoteric restrictions:

  • All of the two-byte characters are made of a byte in [0xC2,0xDF] followed by a byte in [0x80,0xBF].
  • There are four types of characters made of three bytes. For example, if the first byte is 0xE0, then the next byte must be in [0xA0,0xBF], followed by a byte in [0x80,0xBF].

It is all quite boring, but it can be summarized by the following table:

First Byte     Second Byte    Third Byte     Fourth Byte
[0x00,0x7F]
[0xC2,0xDF]    [0x80,0xBF]
0xE0           [0xA0,0xBF]    [0x80,0xBF]
[0xE1,0xEC]    [0x80,0xBF]    [0x80,0xBF]
0xED           [0x80,0x9F]    [0x80,0xBF]
[0xEE,0xEF]    [0x80,0xBF]    [0x80,0xBF]
0xF0           [0x90,0xBF]    [0x80,0xBF]    [0x80,0xBF]
[0xF1,0xF3]    [0x80,0xBF]    [0x80,0xBF]    [0x80,0xBF]
0xF4           [0x80,0x8F]    [0x80,0xBF]    [0x80,0xBF]
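
The table translates directly into a straightforward scalar validator. The following is just a sketch of mine, for clarity; it is not the code that gets benchmarked below.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Walk the string one character sequence at a time, checking each byte
// against the allowed ranges from the table.
bool is_valid_utf8_scalar(const uint8_t *s, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = s[i];
    if (b <= 0x7F) { i += 1; continue; }            // ASCII
    size_t n;                                       // number of continuation bytes
    uint8_t lo = 0x80, hi = 0xBF;                   // allowed range of the first continuation byte
    if (b >= 0xC2 && b <= 0xDF)      { n = 1; }
    else if (b == 0xE0)              { n = 2; lo = 0xA0; }
    else if (b >= 0xE1 && b <= 0xEC) { n = 2; }
    else if (b == 0xED)              { n = 2; hi = 0x9F; }
    else if (b >= 0xEE && b <= 0xEF) { n = 2; }
    else if (b == 0xF0)              { n = 3; lo = 0x90; }
    else if (b >= 0xF1 && b <= 0xF3) { n = 3; }
    else if (b == 0xF4)              { n = 3; hi = 0x8F; }
    else return false;                              // invalid leading byte
    if (i + n >= len) return false;                 // truncated sequence
    if (s[i + 1] < lo || s[i + 1] > hi) return false;
    for (size_t k = 2; k <= n; k++) {               // remaining continuation bytes
      if (s[i + k] < 0x80 || s[i + k] > 0xBF) return false;
    }
    i += n + 1;
  }
  return true;
}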

So, how quickly can we check whether a string satisfies these conditions?

I went looking for handy C/C++ code. I did not want to use a framework or a command-line tool.

The first thing I found is Björn Höhrmann’s finite-state machine. It looks quite fast. Without getting into the details, given a small table that includes character classes and state transitions, the gist of Höhrmann’s code consists in repeatedly calling this small function:

bool is_ok(uint32_t* state, uint32_t byte) {
  uint32_t type = utf8d[byte];
  *state = utf8d[256 + *state * 16 + type];
  return (*state != 1); // state 1 signals an error
}
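
Here is a sketch of how one might drive this function over a whole buffer, assuming (as in Höhrmann’s decoder) that state 0 is the accepting state, state 1 the error state, and utf8d his class/transition table:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool is_ok(uint32_t *state, uint32_t byte); // the function above

bool is_valid_utf8_fsm(const uint8_t *data, size_t len) {
  uint32_t state = 0;                          // start in the accepting state
  for (size_t i = 0; i < len; i++) {
    if (!is_ok(&state, data[i])) return false; // we reached the error state
  }
  return state == 0;                           // must end on a character boundary
}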

Then I went looking for a fancier, vectorized, solution. That is, I want a version that uses advanced vector registers.

I found something sensible by Olivier Goffart. Goffart’s original code translates UTF-8 into UTF-16, which is more than I want done. So I modified his code slightly, mostly by removing the unneeded part. His code will only run on x64 processors.

To test these functions, I wanted to generate some random strings quickly, but to measure accurately, I need the strings to be valid UTF-8. So I simply generated ASCII strings. This makes the problem easier, so I probably underestimate the difficulty of the problem. The problem is obviously dependent on the data, and lots of interesting inputs are mostly just ASCII anyhow.

Olivier Goffart’s code “cheats” and short-circuits the processing when detecting ASCII code. That’s fine, but I created two versions of his function, one with and one without the “cheat”.

So, how quickly can these functions check that strings are valid UTF-8?

string size    is ASCII?               Höhrmann’s finite-state machine   Goffart’s (with ASCII cheat)   Goffart’s (no ASCII cheat)
32             ~2.5 cycles per byte    ~8 cycles per byte                ~5 cycles per byte             ~6 cycles per byte
80             ~2 cycles per byte      ~8 cycles per byte                ~1.7 cycles per byte           ~4 cycles per byte
512            ~1.5 cycles per byte    ~8 cycles per byte                ~0.7 cycles per byte           ~3 cycles per byte

My source code is available.

The vectorized code gives us a nice boost… Sadly, in many applications, a lot of the strings can be quite small. In such cases, it seems that we need to spend something close to 8 cycles per byte just to check that the string is valid.

In many cases, you could short-circuit the check and just verify that the string is an ASCII string, but it is still not cheap, at about 2 cycles per input byte.

I would not consider any of the code that I have used to be “highly optimized” so it is likely that we can do much better. How much better remains an open question to me.

Update: Daniel Dunbar wrote on Twitter:

    I would expect that in practice best version would be highly optimized ascii only check for short segments, with fallback to full check if any in the segment fail

That is close to Goffart’s approach.

Science and Technology links (May 5th, 2018)

  1. Oculus, a subsidiary of Facebook, has released its $200 VR headset (the Oculus Go). You can order it on Amazon. The reviews are good. It is standalone and wireless, which is excellent. The higher-quality Oculus Rift and its nifty controllers are down to only $400, with the caveat that it needs a cable back to a powerful PC.

    Some analysts are predicting massive growth for virtual reality headsets this year: IDC anticipates that sales will reach 12.4 million units.
    To me, the killer application has to be conferencing in virtual reality. Who wouldn’t like to go chat with a friend in some virtual world? But it is unclear whether we have the hardware to do it: you’d need good eye tracking if you are going to be able to look people in the eyes.

  2. Consuming glucose (sugar) impairs memory (in rats). The theory is that glucose reduces neurogenesis: you are making fewer new neurons when eating sugar.
  3. In 1930, a 20-year-old could expect to live another 47 years (to age 67 or 68). This means that when the retirement age of 65 was enacted in the US and other countries, we expected people to have reached the very end of their life by then.
  4. Some coffee drinkers, me included, report side-effects when abstaining from coffee. Yet the belief that one has ingested caffeine is sufficient to reduce caffeine withdrawal symptoms and cravings.
  5. A private charity, the Charity Open Access Fund, pays publishers to ensure that research papers are available freely (under open access). Oxford University Press was apparently happy to take the money, but failed to deliver: a total of 34 per cent of articles paid for by the Charity Open Access Fund in 2016-17 were not made available as promised. The Oxford University Press is treated as a tax-exempt non-profit.
  6. It is believed that part of the biological aging process can be attributed to a reduction in nicotinamide adenine dinucleotide (NAD+) in our cells. A new paper in Cell Metabolism shows that this reduction can be reversed with some drug called 78c. It improves muscle function, exercise capacity, and cardiac function in mice.
  7. In Singapore, the number of people aged 65 and above is projected to double to 900,000, or 25% of the population, by 2030. In Japan, 27% of the population was 65 years or older in 2016; that share will be over 33% by 2030.

How fast can you parse JSON?

JSON has become the de facto standard exchange format on the web today. A JSON document is quite simple and is akin to a simplified form of JavaScript:

{
     "Image": {
        "Width":  800,
        "Height": 600,
        "Animated" : false,
        "IDs": [116, 943, 234, 38793]
      }
}

These documents need to be generated and parsed on a large scale. Thankfully, we have many fast libraries to parse and manipulate JSON documents.

In a recent paper by Microsoft (Mison: A Fast JSON Parser for Data Analytics), the researchers report parsing JSON documents at roughly 0.2 to 0.4 GB/s with common libraries such as RapidJSON. It is hard to tell the exact numbers as you need to read a tiny plot, but I believe I have the right ballpark. They use a 3.5 GHz processor, so that’s 8 to 16 cycles per input byte of data.

Does it make sense?

I don’t have much experience processing lots of JSON, but I can use a library. RapidJSON is handy enough. If you have a JSON document in a memory buffer, all you need are a few lines:

rapidjson::Document d;
if(! d.ParseInsitu(buffer).HasParseError()) {
 // you are done parsing
}

This “ParseInsitu” approach modifies the input buffer (for faster handling of the strings), but is fastest. If you have a buffer that you do not want to modify, you can call “Parse” instead.
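
For illustration, here is a minimal self-contained program using the in-situ approach; the toy document and the printed field are mine, not the benchmark code.

#include "rapidjson/document.h"
#include <cstdio>

int main() {
  // ParseInsitu modifies the buffer, so it must be a writable array
  char buffer[] = "{\"Image\":{\"Width\":800,\"Height\":600,\"Animated\":false}}";
  rapidjson::Document d;
  if (d.ParseInsitu(buffer).HasParseError()) {
    std::puts("parse error");
    return 1;
  }
  std::printf("Width = %d\n", d["Image"]["Width"].GetInt());
  return 0;
}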

To run an example, I am parsing one sizeable “twitter.json” test document. I am using a Linux server with a Skylake processor. I parse the document 10 times and check that the minimum and the average timings are close.

ParseInsitu    4.7 cycles / byte
Parse          7.1 cycles / byte

This is the time needed to parse the whole document into a model. You can get even better performance if you use the streaming API that RapidJSON provides.

Though I admit that my numbers are preliminary and partial, they suggest to me that the Microsoft researchers might not have given RapidJSON a fair chance, since their numbers are closer to the “Parse” function, which is slower. It is possible that they do not consider it acceptable that the input buffer is modified, but I cannot find any documentation to this effect, nor any related rationale. Given that they did not provide their code, it is hard to tell what they did exactly with RapidJSON.

The Microsoft researchers report results roughly 10x better than RapidJSON, equivalent to a fraction of a cycle per input byte. The caveat is that they only selectively parse the document, extracting only subcomponents of the document. As far as I can tell, their software is not freely available.

How would they fare against an optimized application of the RapidJSON library? I am not sure. At a glance, it does not seem implausible that they might have underestimated the speed of RapidJSON by a factor of two.

In their paper, the Java-based JSON libraries (GSON and Jackson) are fast, often faster than RapidJSON even though RapidJSON is written in C++. Is that fair? I am not, in principle, surprised that Java can be faster than C++. And I am not very familiar with RapidJSON… but it looks like performance-oriented C++. C++ is not always faster than Java, but in the hands of the right people, I expect it to do well.

So I went looking for a credible performance benchmark that includes both C++ and Java JSON libraries and found nothing. Google is failing me.

In any case, to answer my own question, it seems that parsing JSON should take about 8 cycles per input byte on a recent Intel processor. Maybe less if you are clever. So you should expect to spend 2 or 3 seconds parsing one gigabyte of JSON data.

I make my code available.

Is software prefetching (__builtin_prefetch) useful for performance?

Many software performance problems have to do with data access. You could have the most powerful processor in the world: if the data is not available at the right time, the computation will be delayed.

It is intuitive. We used to locate libraries close to universities. In fact, universities were built around libraries. We also aggregate smart and knowledgeable people together.

Sadly, our computers are organized with processors on one side and memory on the other. To make matters worse, memory chips are simply unable to keep up with the processing power of processors. Thus we introduce caches. These are fast memory stores that are stuck to the processor. We have a whole hierarchy of CPU caches.

Happily, processor vendors like Intel have invested a lot of effort in making sure that the right data is made available at the right time. Thus, if you access data according to a predictable pattern, the processor will do a good job. Also, the processor can reorder its instructions and execute a load instruction earlier if it is useful. Simply put, a lot of engineering went into making sure that data-access problems are minimized. You might think that, as a programmer, you can do better… but the truth is, most times you couldn’t. At least, most of us, most of the time, couldn’t.

You can still “blind” your processor. Some expensive instructions, like division, will apparently prevent processors from fetching the data in time. At the very least, they can prevent some instruction reordering. Intuitively, the processor is so busy processing a monstrous instruction that it has no time to see what is coming next.

You can also thoroughly confuse the processor by doing many different things in an interleaved manner.

In a comment to one of my earlier blog posts, someone suggested that I could improve the performance of my code by using software prefetches. That is, processor vendors like Intel provide us with instructions that can serve as hints to the processor… effectively telling the processor ahead of time that some data will be needed. Using the GCC and Clang compilers, you can invoke the __builtin_prefetch intrinsic function to get this effect. I do not know whether other compilers, like that of Visual Studio, have this function.

The instructions generated by this function have a cost of their own, so they can definitely make your software slower. Still, when inserted in just the right manner in the right code, they can improve the performance.

Should you be using __builtin_prefetch? My experience says that you probably shouldn’t. I don’t like hard rules, but I have not yet seen a scenario where they are obviously a good idea. That is, I claim that if software prefetching is useful, that’s probably because your code could use some work.

I asked the commenter to provide an example where software prefetching was a good idea. He offered a scenario where we sum the values of bytes within an array…

for (int j = 0; j < ...; j++) {
    counter += bytes[p];
    p = (p + 513) & (NUMBYTES - 1);
}

This code, by itself, will not benefit from software prefetching. However, the commenter interleaved the sums…

for (int j = 0; j < ...; j++) {
  for(int i = 0; i < batchsize; i++) {
    counter[i] += bytes[p[i]];
    p[i] = (p[i] + 513) & (NUMBYTES - 1);
  }
}

If you pick batchsize large enough, you can confuse the processor and get some benefits from using the __builtin_prefetch function. I suppose that it is because the access pattern looks too unpredictable to the processor.
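
For concreteness, here is a sketch of where the hint could go in the interleaved loop. The prefetch distance of 16 strides and the overall function shape are my assumptions, not the commenter’s exact code.

#include <stddef.h>
#include <stdint.h>

#define NUMBYTES (1 << 24) /* must be a power of two */
#define BATCHSIZE 16

void interleaved_sums_with_prefetch(const uint8_t *bytes, uint64_t *counter,
                                    size_t *p, size_t iterations) {
  for (size_t j = 0; j < iterations; j++) {
    for (size_t i = 0; i < BATCHSIZE; i++) {
      // hint: the byte we will want 16 strides from now (read-only access)
      size_t future = (p[i] + 16 * 513) & (NUMBYTES - 1);
      __builtin_prefetch(&bytes[future], 0, 1);
      counter[i] += bytes[p[i]];
      p[i] = (p[i] + 513) & (NUMBYTES - 1);
    }
  }
}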

Let us look at the numbers I get…

sequential sums                              3 s
interleaved sums                             4.4 s
interleaved sums with __builtin_prefetch     4.0 s

The prefetching improves the performance of the interleaved sums by 10%, but you can get much better performance simply by doing the sums one by one. The resulting code will be simpler, easier to debug, more modular and faster.

To provide evidence that __builtin_prefetch is useful, you need to take code, optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performance.

My source code is available.

Credit: I’d like to thank Simon Hardy-Francis for his code contribution.

Science and Technology links (April 29th, 2018)

  1. Our heart regenerates very poorly. That is why many of us will die of a heart condition. Harvard researchers find that mice that exercise generate many more new heart cells. The researchers hint at the fact that you might be able to rejuvenate your heart by exercising.
  2. Cable TV is losing subscribers faster than anticipated. In related news, cable prices are rising at 4 times the rate of inflation. That is, just as cable TV is facing stiffer competition from Netflix and others, it is also raising its prices.
  3. The rich get richer while the poor get poorer. That’s called the Matthew effect. It exists in science: young scientists who get lucky early on tend to do much better funding-wise. But why is that? It seems to be mostly because people who get bad news early on give up or become less aggressive:

    Our results show that winners just above the funding threshold accumulate more than twice as much funding during the subsequent eight years as nonwinners with near-identical review scores that fall just below the threshold. This effect is partly caused by nonwinners ceasing to compete for other funding opportunities, revealing a “participation” mechanism driving the Matthew effect.

    It follows that a key to success is to take failure well. It is like a superpower.

  4. A dietary supplement, MitoQ, is believed to rejuvenate blood vessels in mice. A new study finds that the effect is similar in human beings. (I am not taking this supplement.)
  5. When health workers go on strike, mortality stays level or decreases.
  6. Student strikes negatively affect performance on mathematics tests, but not on language tests.
  7. Naked mole rats do not age biologically. That is, their mortality rate does not increase with age. It is a unique feature among mammals and we do not know exactly what makes it work. Lewis et al. found that naked mole rats have relatively low levels of some amino acids in their blood, levels resembling those of hibernating squirrels.
  8. Should you be eating more fibre? Maybe not:

    increasing fibre in a Western diet for two to eight years did not lower the risk of bowel cancer (…) after four years participants receiving dietary fibre had higher rates of bowel cancer.

    In concrete terms, this makes me question the belief that whole wheat bread is better for your health than white bread.

  9. It seems that Neandertals had boats of some sort, going from island to island in the Mediterranean.
  10. Years ago I produced a small podcast for my students, in the pre-iPhone era. Feedback from my students was not great: they complained that they simply did not have time for such things. I used to listen to podcasts myself, but the hassle of selecting, downloading and managing audio files was too annoying. I had since assumed that podcasts had gone out of favor. I have recently rediscovered podcasts. Apple reports having half a million shows on offer, and they have counted 50 billion downloads. It seems that podcasts are all the rage right now. I suspect that this has to do with the high quality of cellular network connections. I can listen to podcasts down in Montreal’s subways. Though the monthly fees for my smartphone are high, the service is great… usually superior to what I get with wifi at the office. This almost makes me want to resume my own podcasting, but I’d have to give up other activities to make time.
  11. Progeria is a terrible disease that appears like accelerated aging. Many kids suffering from progeria won’t live to become adults. In a small trial, a protein inhibitor was found to drastically reduce the mortality rate of patients affected with progeria. Is a cure on the way?
  12. My colleagues from other scientific fields often assume that software is just a form of applied mathematics. Surely we can automatically reason over software implementations and prove them to be correct. This turns out to be remarkably difficult. It is hard to prove correct even silly little functions, to say nothing of actual software. The fact is, building software is very much an empirical endeavour… which is why people like me always spend at least as much time testing the software as writing it. Moreover, I always assume that my software is, at best, empirically correct. There are always cases where it will fail, even if the hardware runs perfectly (which it does not).

    Hillel Wayne checks whether a particular programming style that is fashionable right now (functional programming) makes software correctness easier to verify. He finds the evidence weak. To me, the real take-away should be that it is ridiculously hard to prove that code is real-world correct, irrespective of the programming style.

    What is also interesting is the reaction of the functional-programming community to his analysis…

    A common critique of functional-programming communities is they have very aggressive people and this held true here.

    I have been on the receiving end of this aggression myself: a mere joke at functional programming’s expense is enough to get hate mail. That is, software is nothing like objective truth and pure ideas. It is filled with deeply held ideologies.

  13. In our society, it is widely accepted that men are freer than women sexually, even though we have contraception that should act as an equalizer. That is, a woman who openly likes sex is a “slut” or a “prostitute” whereas a man who likes sex merely has a healthy libido. People who care about gender fairness should be concerned with this obvious bias. But what is the cause? Baumeister and Twenge find that the view that men suppress female sexuality is flatly contradicted and the evidence favors the view that women have worked to stifle each other’s sexuality because sex is a limited resource that women use to negotiate with men, and scarcity gives women an advantage.

Why a touch of secrecy can help creative work

Though I am a long-time blogger and I spend most of my day talking or writing to other people… I am also quite secretive about the research that I am doing.

There are reasons to be secretive that are bogus. The primary one is that you are afraid others might steal your ideas. That’s overwhelmingly bogus because ideas are cheap. I can write out long lists of good ideas that are almost sure to work, but that I will never pursue for lack of time, energy, competence, and interest. Coming up with the idea is easy; executing is much harder. If you are running so low on new ideas that you need to take other people’s projects and compete with them, then you are doing something wrong.

If you have fierce competitors that might quickly adopt your ideas and run with them… then maybe you are working on the wrong project in the first place.

No. I am sometimes secretive for other reasons…

  • Sharing your project ideas is rewarding in itself. But it is not a good policy to reward yourself for work that has not been completed. Look around you at people who never finish anything… they are often quite open about all the marvelous projects they have going. It is so much easier to start something than to complete it. Don’t reward yourself early.
  • Most things I work on end up not working the way I expect. So I don’t want to overcommit to a narrative that might need changing. Once you have begun explaining what you are doing to others, it gets harder to change the project. You have to go back and say “you know what? I was wrong.” Having to admit you are wrong is fine, but it creates friction in your mind.
  • There is communication fatigue. If you keep announcing something that is not yet there, you risk eroding the interest of people who are expecting your work.
  • Early work often looks and sounds naive. It takes time to find the right way to present your work. You do not want to ruin your first impression… and the first impression does matter.
  • In the public eye, we tend to think like others… human beings are gregarious and we have a strong tendency to want to “fit in”. Isolating yourself from the state-of-the-art can be the best way to “ignore” the state-of-the-art, and that’s sometimes just what is needed. You don’t want people to tell you that what you are going after “has been done before”. You want to see what you can come up with as independently as possible… otherwise, you risk just being stuck thinking like everyone else.

Of course, I am not always secretive… far from it… I am extremely transparent… you can literally follow almost everything I do programming-wise on GitHub. I often post my research papers on arXiv months before I announce them.

Secrecy comes at a cost…

  • It is harder to recruit the right people to your cause if your work is done in secret. If you need more resources, it pays to drop secrecy.
  • Secrecy introduces friction of its own that might slow you down.
  • Early feedback, even critical feedback, can be extremely valuable.

So there is no hard rule about when to be secretive. You have to make judgment calls on a case-by-case basis. I tend to be secretive when…

  • I’m starting something that is “new” for me. Maybe it is a problem I am unfamiliar with.
  • Having more people would not obviously help. Maybe it is the case that I already have great collaborators.
  • I expect to be working for a long time still on the project.

I tend to be more transparent regarding work on problems that I understand well.