Fast software is a discipline, not a purpose

When people train, they usually don’t try to actually run faster or lift heavier weights. As a relatively healthy computer science professor, how fast I run or how much I can lift is of no practical relevance. However, whether I can walk the stairs without falling apart is a metric.

I am not an actor or a model. Who cares how much I weight? I care: it is a metric.

I could probably work in a dirty office without ill effect, but I just choose not to.

So when I see inefficient code, I cringe. I am being told that it does not matter. Who cares? We have plenty of CPU cycles. I think you should care, it is a matter of discipline.

Yes, only about 1% of all code being written really matters. Most people write code that may as well be thrown out.

But then, I dress cleanly every single day even if I stay at home. And you should too.

I do not care which programming language you use. It could be C, it could be JavaScript. If your code is ten times slower than it should, I think it shows that you do not care, not really. And it bothers me. It should bother you because it tells us something about your work. It is telling us that you do not care, not really.

Alexander Jay sent me a nice email. He reviewed some tricks he uses to write fast code. It inspired me these recommendations:

  • Avoid unnecessary memory allocations.
  • Avoid multiple passes over the data when one would do.
  • Avoid unnecessary runtime inferences and branches.
  • Avoid unnecessary performance-adverse abstraction.
  • Prefer simple value types when they suffice.
  • Learn how the data is actually represented in bits, and learn to dance with these bits when you need to.

Alexander asked me “At what point would you consider the focus on optimization a wasted effort? ” My answer: “At what point do you consider being fit and clean a wasted effort?”

There is a reason we don’t tend to hire people who show up to work in dirty t-shirts. It is not that we particularly care about dirty t-shirts, it is that we want people who care about their work.

If you want to show care for your software, you first have to make it clean, correct and fast. If you start caring about bugs and inefficiencies, you will write good software. It is that simple.

China is catching to the USA, while Japan is being left behind

I have previously reported that there has been a silent revolution in science where countries like China, India, South Korea… that previously contributed few research articles… have started to catch up and even exceed the productivity of the western world.

The National Science Foundation (NSF) in the US has published a report on Science and Engineering Publication Output Trends. Their subtitle is “Rise of Developing Country Output while Developed Countries Dominate Highly Cited Publications”.

The report echoes my earlier observations: lots of new countries are become scientific powers in terms of publication output, with China leading the charge. The major claim made by the report is that developed countries continue to dominate in terms of highly cited publications.

I should preface any further analysis by reminding us that such numbers should be interpreted with care. They count the number of research articles produced per country along with other metrics such as which fraction of those is highly cited. Counting citations and research papers is an imperfect way to measure scientific output. Bibliometrics is a field that is full of methodological issues. So we should not try to compare countries on this basis only. Furthermore, the report does not seem concerned with “per capita” analysis. For example, though the European Union appears to fare quite well according to this report&emdash;i.e., they indicate that it publishes more than the US—we know that it is no longer true once you factor in the number of citizens which is higher in the European Union (500 million people versus a bit over 300 million people in the USA). So it might be misleading. Generally, the report does not shy away from comparing small entities to large ones, and that is a problem.

Back to the main claim of the report: though China is rising in output, countries like the US still have a larger fraction of their research articles that are highly cited.

If you accept that Japan is a “developed country”, and you should, then you are forced to view it as an exception from this observation. What is surprising to me regarding the numbers is the fact that Japan seems to underperform compared to what we might expect. They have barely expanded their publication volume and their citation patterns seem to be closer to India than the US. Back in 2008, I already observed that Japan produces far fewer research articles per capita than other rich countries, so it is nothing new. It reminds me of an August 2017 article in “Nature: Budget cuts fuel frustration among Japan’s academics:

young researchers, facing unstable employment conditions and economic uncertainty, are forced to aim for results that can be accomplished in the short term and in which true originality and creativity are difficult to realize.”

Furthermore, the data presented in the paper clearly indicate that countries like China are bridging the gap with respect to how impactful their work is (see Fig. 2 in the report), on top of bridging the gap with respect to the total volume (see Table 1 in the report). It should be viewed as remarkable especially if you consider that most scientific papers are written in English and published by American or European publishers and that most of the most prestigious universities are in the US. And of course, China lags behind the US considerably in research spending, both in absolute dollars and in a percentage of GDP.

A better title for the NSF report might have been “China is catching to the USA, while Japan is being left behind”.

Does any of it matter? Many people believe, or assume, that great output in terms of research articles should cause economic prosperity and innovation. I have post entitled Does academic research cause economic growth? that makes the contrary point. That is, though China is catching up in terms of scientific output, this may be a consequence of their prosperity: they can now afford to have their very best minds work on producing research articles. It is much easier for rich countries to fund people so that they can publish in Nature. So being rich will allow you to catch up.

But Japan shows that you can be a very rich country and choose not to produce many great research articles. In the least, this establishes that you do not need to produce many great research articles to be prosperous.

Science and Technology links (November 11th, 2017)

Read the following quote from the New York Times:

Business is taking an interest in artificial intelligence, or A.I., and some professors, are forming or joining companies to capitalize on the expected boom. But the new move toward commercialization is disrupting the academic community and provoking fears that university research will be hurt.

Some researchers welcome the business interest. Others, however, complain that corporations are outbidding the campus for scarce personnel, and that work is being diverted from long-term research to short-term problems with immediate application.

It was written in 1982.

The Spectator has a piece arguing that moderate drinking is good for you and that this fact is being suppressed by health professionals. (My thoughts: I think it is far from certain that alcohol is good for you, but it is probably not what will kill you.)

We all know that human activity is wiping out some animal species. According to professor Chris Thomas… a different story can be told

The biological processes of evolutionary divergence and speciation have not been broken in the Anthropocene, they have gone into overdrive. Come back in a million years and we might be looking at several million new species whose existence can be attributed to humans. In the end, the Anthropocene biological revolution will almost certainly represent the sixth mass genesis of new biological diversity. It could be the fastest acceleration of evolutionary diversification in the last half-billion years.

In a short talk, the famous polymath Eric Weinstein argues that Physics is lost to theoretical nonesense… that it is stuck away from reality. He suggests that it might be time for outsiders to come in an disrupt the discipline.

You can spot cancer cells because they burn through sugar very fast, generating heat. Canadian researchers have designed a very low-cost device that cools the skin and then measure the temperature to reliably spot skin cancers. If not for regulations, this could be available cheaply at Amazon in a few years?

A boy was suffering from a terrible genetic skin disease. Doctors took skin samples, corrected the genetic anomaly, regrew the skin and put it back in place. It worked. The vast majority of his skin had to be replaced.

You may have heard that chocolate was a health food? According to a story in the Atlantic, there is some really bad research practices involved:

If you look at the most recent version of their clinical trials registry, it was published in January 2015, three months after they published their Nature Neuroscience article. “So they went back after article was published in Nature and changed their clinical trial registry. There is no mention of this in the trial report,” Drysdale added.

(…)

“The bigger concern is that people are trying to do a better job of selling the research itself and not just telling what the straight out answer is,” University of Toronto nutrition researcher Richard Bazinet said. This study only showed that over a period of three months, in a small group, according to a very narrow test that taps a very specific region of the brain, cocoa supplements enhanced cognition. That became “chocolate fights Alzheimer’s” — a message Mars surely appreciated.

As we grow older, we accumulate senescent cells. These are old cells that won’t replicate or function normally. It is strongly associated with several diseases of old age. Currently, there are a few initiatives to develop safe drugs that can kill senescent cells. Amazon’s Jeff Bezos is funding one such initiative. Unexpectedly, some researchers have done breakthrough work that shows we can possibly reverse the senescent state:

The researchers applied compounds called reversatrol analogues, chemicals based on a substance naturally found in red wine, dark chocolate, red grapes and blueberries, to cells in culture. The chemicals caused splicing factors, which are progressively switched off as we age to be switched back on. Within hours, the cells looked younger and started to rejuvenate, behaving like young cells and dividing.

The discovery has the potential to lead to therapies which could help people age better, without experiencing some of the degenerative effects of getting old. Most people by the age of 85 have experienced some kind of chronic illness, and as people get older they are more prone to stroke, heart disease, and cancer.

This week, I learned that molecules inside our cells move really, really fast:

You may wonder how things get around inside cells if they are so crowded. It turns out that molecules move unimaginably quickly due to thermal motion. (…) a molecule will collide with something billions of times a second and bounce off in a different direction. (…) As a result of all this random motion, a typical enzyme can collide with something to react with 500,000 times every second. (…) In addition, a typical protein is tumbling around, a million times per second. Imagine proteins crammed together, each rotating at 60 million RPM, with molecules slamming into them billions of times a second. (…) enzymes spin at up to 700 revolutions per second, which is faster than a jet engine.

Waymo, the Alphabet [i.e., Google] self-driving car company, now has cars driving on public roads in the Phoenix metropolitan area with no one in the driver’s seat.

In a timid study, a small amount of blood plasma from young people was transfused into people suffering from Alzheimer’s. The study failed to show massive benefits:

Nine patients with mild to moderate Alzheimer’s got four once-weekly infusions of either saline (as a placebo) or plasma from 18- to 30-year-old male donors. After a 6-week break, the infusions were switched so that the patients who had gotten plasma got saline, and the patients who had gotten saline received plasma. Another nine patients received young plasma only, and no placebo. (…) The remaining patients who completed the young plasma treatment performed no better overall on objective cognitive tests given by medical staff. However, on average their scores improved slightly—4.5 points on a 30-point scale—on a caregiver survey about whether they needed help with daily activities such as making a meal or traveling. The patients’ average scores also improved modestly on another survey that asks caregivers how well patients can perform simple tasks like getting dressed and shopping. (…) Wyss-Coray agrees that not much can be concluded from the small trial, but says, “It’s tempting to feel hopeful about the improvement in functional scores.” Because the treatment seemed safe, Alkahest now wants to launch another trial that will use just the fraction of the blood plasma that contains growth factors, but not coagulation factors and other components that may do more harm than good. In animals, this plasma fraction was more effective at improving cognition in the mice with an Alzheimer’s-like condition than whole plasma, Wyss-Coray says. Alkahest also wants to test a range of doses and include patients with more severe Alzheimer’s.

Recall that we do know that parabiosis, the transfer of blood between two animals, can age or rejuvenate, but the best evidence suggests that we simply acquire aging factors as we age: there are no magical youthful factors. So it might be a total waste of time to add a bit of young plasma in the blood of older individuals. Instead, we need to find a way to normalize the composition of the blood of older people. This problem may prove very hard with our current technology. Or not.

It is usually assumed that peer review, as practiced by science journal, improves research articles. It may not be so simple:

Peer reviewers fail to detect important deficiencies in reporting of the methods and results of randomised trials. The number of changes requested by peer reviewers was relatively small. Although most had a positive impact, some were inappropriate and could have a negative impact on reporting in the final publication.

Alzheimer’s disease may not start in the brain. Without kidding: there is some evidence that unhealthy teeth and gums could lead to Alzheimer’s.

A young girl that was doomed as a baby due to a spinal muscular atrophy was rescued due to gene therapy.

I always assumed that prestigious journals had better peer review. That might not be so:

If you take all journals and rank them according to prestige,the most prestigious journals publish the least reliable science (at least when you look at the available evidence from experimental fields).

(…)

In the field of genetics, it appears that errors in gene names (and accession numbers) introduced are more common in higher ranking journals

(…)

statistical power has been declining since the 1960s and that statistical power is negatively correlated with journal rank (i.e., a reproduction of the work above, with an even worse outcome). Moreover, the fraction of errors in calculating p-values is positively correlated with journal rank, both in terms of records and articles

What might explain this effect? Maybe high prestige attracts certain types of people with goals the differ from that of science.

A breast cancer drug appears to cure arteriosclerosis (a condition responsible for heart diseases) in mice.

Systematic reviews of the state-of-the-art are often produced in medical research to summarize the research. I always expected this work to be top-notch. It may not be so, as these studies fail to provide the necessary information to make their analysis reproducible:

the data needed to recreate all meta-analytic effect estimates (…) in only 65% of SRs. Only 30% of SRs mentioned access to datasets and statistical code used to perform analyses.

It is just not possible to check whether a review paper has done its work properly. You cannot double check the computations.

According to the New York Times, the best college majors according to wages are engineering, economics, computer science and nursing. These worst are education, social work, humanities, philosophy, liberal arts, psychology, English, and biology. Business and accounting are somewhere in the middle.

The hole is the ozone layer is shrinking.

Amazon’s Jeff Bezos has funded a company that is buidling a gigantic indoor farm in Seattle.

Lifting weights twice a week is the best routine to keep your muscles as you age. Lifting weights is better than cardio to stay fit as you age.

Sleep deprivation messes with your neurons in a bad way, according to Nature.

How should you build a high-performance column store for the 2020s?

Though most relational databases (like MySQL) are “row oriented”, in that they keep rows stored together… experience has taught us that pivoting the data arrow so that columns, and not rows, are stored together can be beneficial. This is an old observation that most experienced programmers know about as the array-of-struct versus struct-of-array choice. There is even a wikipedia page on the topic.

For example, in Java you might have a Point object containing three attributes (X, Y, and Z). You can store an array of Point objects. Or you could have three arrays (one for X, one for Y and one for Z). Which one is best depends on how you are using the data, you should consider both.

It is fairly common for databases to have tons of columns, and for analytical queries to bear only on a few columns. So it makes sense to keep the columns together.

There is actually a full spectrum of possibilities. For example, if you have a timestamp column, you might consider it as several separate columns that can be stored together (years, day, seconds). It is impossible in the abstract to tell how to best organize the data, you really need some domain knowledge about how you access the data.

In the data science world, I have been very interested in a new initiative called Apache Arrow. One way to view Apache Arrow is a common column-oriented database format that can be shared by many database engines.

Column-orientation is at its best when used with compression. Indeed, it is arguably easier and more efficient to compress columns stored together, than one they are interleaved with other columns.

Arrow relies on dictionary compression, a tried-and-true technique. It is a great technique that can enable really fast algorithm. The key idea is quite simple. Suppose that there are 256 distinct values in your column. Then you can represent them as numbers in [0,255], with a secondary lookup table to recover the actual values. That is, if you have N distinct values, you can store them using ceil(log(N)/log(2)) bits using binary packing (C code, Java code, Go code, C# code). Thus you might use just one byte per value instead of possibly much more. It is a great format that enables superb optimizations. For example you can accelerate dictionary decoding using SIMD instructions on most Intel and AMD processors released in the last 5 years.

In a widely read blog post, professor Abadi criticized Apache Arrow in these terms:

I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression).

I thought I would comment a bit on this objection because I spent a lot of time hacking these problems.

Let me first establish the terminology:

  1. Run-length encoding is the idea where you represent repeated values as the value being repeated followed by some count that tells you how often the value is repeated. So if you have 11111, you might store is as “value 1 repeated 5 times”.
  2. A bit-vector (or bitmap, or bit array, or bitset) is simply an array of bits. It is useful top represent sets of integer values. So to repeat the set {1,2,100}, you would create an array containing at least 100 bits (e.g., an array of two 64-bit words) and set only the bits at index 1, 2 and 100 to ‘true’, all other bits would be set to false. I gave a talk at the Spark Summit earlier this year on the topic.

    To make things complicated, many bit-vectors used in practice within indexes rely on some form of compression.

    You can build an entire data engine on top of bit-vectors: Pilosa is a great example.

Here are some considerations:

  1. To sort, or not to sort?

    Run-length encoding is very powerful when the data is sorted. But you cannot have all columns in sorted order. If you have 20 columns, you might lexicographically sort the data on column 2 and 3, and then column 2 and 3 will be very highly compressible by run-length encoding, but other columns might not benefit so much from your sorting efforts. That’s great if your queries focus on columns 2 and 3, but much less interesting if you have queries hitting all sorts of columns. You can duplicate your data, extracting small sets of overlapping columns, sorting the result, but that brings engineering overhead. You could try using sorting based on space-filling curves, but if you have lots of columns, that might be worse than useless. There are better alternatives such as Vortex or Multiple-List sort.

    Should you rely on sorting to improve compression? Certainly, many engines, like Druid, derive a lot of benefits from sorting. But it is more complicated than it seems. There are different ways to sort, and sorting is not always possible. Storing your data in distinct projection indexes like Vertica may or may not be a good option engineering-wise.

    The great thing about a simple dictionary encoding is that you do not need to ask these questions…

  2. There are more ways to compress than you know.

    I have been assuming (and I guess Adabi did the same) that dictionary coding was reliant on binary packing. But there are other techniques found in Oracle and SAP platforms. All of them assume that you divide your column data into blocks.

    • By using dictionary coding again within each small block, you can achieve great compression because most blocks will see very few distinct column values. That’s called indirect coding.
    • Another technique is called sparse coding, where you use a bit-vector to mark the occurrences of the most frequent value, followed by a list of the other values.
    • You also have prefix coding where you record the first value in the block and how often it is repeated consecutively, before storing the rest of the block the usual manner.

    (We review these techniques in Section 6.1.1 of one of our papers.)

    But that is not all! You can also use patched coding techniques, such as FastPFor (C++ code, Java code). You could even get a good mileage out of a super fast Stream VByte.

    There are many, many techniques to choose from, with different trade-offs between engineering complexity, compression rates, and decoding speeds.

  3. Know your hardware!

    And every single processor you might care about would support SIMD instructions that can process several values at once. And yes, this includes Java through the Panama project (there is a great talk by top Oracle engineers on this topic). So you absolutely want to make it easy to benefit from these instructions since they are only getting better over time. Intel and AMD will popularize AVX-512, ARM is coming with SVE.

    I think that if you are designing for the future, you want to take into account SIMD instructions in the same way you would take into account multi-core or cloud-based processing. But we know less about doing this well than we could hope for.

  4. Be open, or else…

    Almost every single major data-processing platform that has emerged in the last decade has been either open source, sometimes with proprietary layers (e.g., Apache Hive), or built substantially on open source software (e.g., Amazon Redshift). There are exceptions, of course… Google has its own things, Vertica has done well. On the whole, however, the business trade-offs involved when building a data processing platform make open source increasingly important. You are either part of the ecosystem, or your are not.

So how do we build a column store for the future? I don’t yet know, but my current bet is on Apache Arrow being our best foundation.

Advertisement. If you are a recent Ph.D. graduate and you are interested in coming to Montreal to work with me during post-doc to build a next-generation column store, please get in touch. The pay won’t be good, but the work will be fun.

Should computer scientists keep the Lena picture?

There is a very famous picture commonly used in image processing research and classes, that of Lena Söderberg. Söderberg posed nude for the Playboy magazine in the 1970s. A cropped version of the picture has been used ever since. The picture being used is “safe for work”, but if you Google long enough, you will find the complete version.

I worked for years on the picture, being totally ignorant of the origin story. It is many years later that I learned the story. (I am old enough that I did my Ph.D. in the pre-Google days).

To many women, the use of the Lena picture is offending. This puts me in the awkward position of choosing not to include the picture I refer to in my post. It is available online.

What should we do?

Richard Matthews from the University of Adelaide thinks that we should keep using the picture:

Should the field in general stop using the Lena image? My personal view is: no. The use of the Lena test image is a quirk of the industry that should be celebrated. That being said, it should be used alongside others equally.

I disagree with Matthews.

When I wrote my Ph.D. thesis, I think I was ignorant of the backstory behind Lena, and I used it abundantly in my tests because it is a very rich picture, with lots of different textures. Still, when I wrote my Ph.D. thesis, instead of going with Lena, I used a painting of some famous mathematician because I thought that the theme was more interesting.
There are good alternatives such as the Kodak image set that do not contain any objectionable image.

“Getting along” is how we prosper. There are times when offending people is warranted. But I don’t think that the offense caused by Lena serves a useful purpose.

We cannot remove the Lena picture from the articles and textbooks, and we should not seek to make the picture disappear… but in the interest of “getting along” we should use alternative pictures.

Getting along is a statistical concept. You can never please everyone or even most people. If you remove anything that could offend someone, somewhere, you will be forced to hide in a very dark hole. In the case of the Lena image, in my view, enough reasonable people are offended that it is worth taking it into consideration.

Of course, you can never win cleanly. It is entirely possible that other people would be offended at the thought that we might remove the picture from consideration. Hence, I use the term “statistical”. It is not black and white. Wrong and right.

I will conclude with a personal story. In the lecture notes of one of my courses, there is a picture of a prostitute by the side of the road. If you are not paying attention to the content of the course, it seems totally out of place and maybe sexist. At one point, we were paying a revisor who went through my notes. When she got to this picture, I think she was ready to throw her hands up and call the cops. Except that the cover in question is that of a feminist magazine who was one of the first to benefit from digital archiving. It was, historically, a very important feminist (I stress feminist) magazine. That particular issue had a piece on prostitution. I used this cover because that’s the cover the people who completed the digital conversion promoted; in the original announcement, there were famous feminists being photographed right by the cover in question. I could have removed the cover in question from my notes in the interest of “getting along”, but one should draw a line somewhere. My point is that we should not bend so far as to enter a 1984-like Orwellian world. We should not go so far as to forbid the Lena picture, or to clean the world of anything that might offend. Still, I would probably have chosen a different cover had I thought it could offend (needlessly).

Further reading: Should we discourage the use of Lenna?

Are universities an egalitarian force?

Many people are concerned about economic inequality. Many of these people work on university campuses. My experience is that the vast majority of the university professors, including those working at prestigious schools, believe that they are working toward a more egalitarian society.

I have reviewed this issue earlier this year in College and inequality. I believe that the case for colleges as agents of equality is much weaker than people expect, and the reverse case can be made with equal confidence. The tax breaks that Princeton and Harvard receives (tens of thousands of dollars per student per year) are not egalitarian.

This time, I want to examine more closely what the scholarship has to teach us.

In her book, Mettler (2014) argues that

our system of higher education not only fails to mitigate inequality but it exacerbates it.

Her book makes a comprehensive case. One of her argument is that well-off kids attend elite schools where they receive a much better deal. Elite schools are often much better funded, in part thanks to tax breaks and other forms of government favoritism. Poorer kids tend to attend lesser schools where they get a worse deal. This argument finds an echo in Carnevale and Strohl:

Selective colleges give bigger annual subsidies for each student’s educational program. For example, in 2006 private research universities provided $15,000 in non-tuition subsidy compared with a subsidy of $6,500 by community colleges. In order to hold or increase their prestige rankings, colleges compete by using their financial and capital resources to bid for students with the highest test scores. For example, economist Gordon C. Winston at Williams College reports that Williams had offered a college education that cost about $65,000 for a net price to students of about $20,000. The $45,000 subsidy for each student allows Williams to attract the student body with the highest test scores, thereby maintaining or increasing its prestige ranking and in turn maintaining or increasing ability to attract students with high test scores.

To this, one must add an important fact: if you are a kid from the top 1%, you are almost certain to graduate from college with a degree, even if you party all the time… kids of lesser families are much, much more likely to never graduate, and end up with leftover debts. See also the 2015 book by Armstrong and Hamilton, Paying for the Party: How College Maintains Inequality which I already covered.

Hershbein et al. (2015) say that you should not expect too much from education regarding inequality:

Increasing educational attainment will not significantly change overall earnings inequality. The reason is that a large share of earnings inequality is at the top of the earnings distribution, and changing college shares will not shrink those differences.

Keng et al. make an empirical observation that is not favorable to college as a force for equality of outcomes:

Between 1990 and 2014, Taiwan increased the college share of its labor force from 7 to 32 percent (…) Such a rapid surge in skill supply should suppress college wages and lower wage income inequality. Instead, wage inequality rose 7 percent since 1978. (…) The Taiwan case shows that increasing college access alone will not lower inequality.

What about the case for free tuition? Surely, that’s egalitarian? Maybe not:

(…) we study the English higher education system which has, in just two decades, moved from a free college system to one in which tuition fees are among the highest in the world. Our findings suggest that England’s shift has resulted (…) a narrowing of the participation gap between advantaged and disadvantaged students. (…) We conclude that tuition fees, at least in the English case supported their goals of increasing quality, quantity, and equity in higher education.

Science and Technology links (November 3rd, 2017)

A paper in the journal Intelligence reports that people with high intelligence are more at risk for psychological problems. As someone who works with lots of very smart people, I am not surprised.

Atherosclerosis is a medical condition, common with age, that makes you more at risk for heart diseases and strokes. Your arteries become damaged. Lipids (fat) play a key role in this phenomenon and it was believed until now that these lipids came from what you eat. Thus it was suggested that a low-fat diet might be good for your arteries. It turns out that these lipids may come from bacteria. That is, avoiding fats in your diet to prevent a stoke might be a waste of time.

The LA Times report on the growing evidence linking sugars and bread (carbohydrates) with cancer.

Malaria could hold the cure for cancer. Or not.

When you suffer from a heart attack, your heart is damaged. Sadly, heart cells (cardiomyocyte) that die are usually not regenerated though other cells are. Thus a heart attack diminishes you, permanently. However, we now have the technology to transform common cells (fibroblast) into almost any cell we want. Thus, in theory, we could regenerate new cardiomyocytes. But we don’t quite know how to do this reliably. Nature reports on breakthrough work on this problem. The end goal would be to spur regeneration of your cardiomyocytes, and get your heart to repair itself. There is also another related paper in PNAS about mathematical models to that could give us clear recipes to reprogram any cell into any other cell. What is surprising to me is how important mathematical models and software is to these endeavours.

The New York Times points out that there may not be much demand for science graduates:

Much of the public enthusiasm for STEM [science-technology-engineering-mathematics] education rests on the assumption that these fields are rich in job opportunity. (…) What recent studies have made increasingly apparent is that the greatest number of high-paying STEM jobs are in the “T” (specifically, computing). Earlier this year, Glassdoor, a jobs listing website, ranked the median base salary of workers in their first five years of employment by undergraduate major. Computer science topped the list ($70,000), followed by electrical engineering ($68,438). Biochemistry ($46,406) and biotechnology ($48,442) were among the lowest paying majors in the study. (…) At LinkedIn, researchers identified the skills most in demand. The top 10 last year were all computer skills, including expertise in cloud computing, data mining and statistical analysis, and writing smartphone applications (…) STEM advocates, often executives and lobbyists for technology companies, do a disservice when they raise the alarm that America is facing a worrying shortfall of STEM workers, based on shortages in a relative handful of fast-growing fields like data analytics, artificial intelligence, cloud computing and computer security.

What the article does not ask, and should… is why so little demand for people with skills in biotechnology?

In Why ‘Statistical Significance’ Is Often Insignificant, Noah Smith writes about the challenge we face in keeping scientists productive:

The real danger is that when each study represents only a very weak signal of scientific truth, science gets less and less productive. Ever more researchers and ever more studies are needed to confirm each result. This process might be one reason new ideas seem to be getting more expensive to find. (…) We need to change the incentive for researchers to prove themselves by publishing questionable studies that just end up wasting a lot of time and effort.

To sum up the problem, we reward people from writing papers about how we might cure cancer (in mice), rather than rewarding them for actually trying to cure cancer.

Amazon, the retailer, is hiring hundreds of Ph.D.s. More so than most universities.

The retail behemoth has hired nearly 500 Ph.D.s, former professors among them, since the beginning of this year to work in its applied-science and research-science units, according to company figures. The pace and scale of that hiring are far greater than those of any college or university in the country.

In Darwin’s aliens, the authors try to predict what aliens might look like and they conclude that complex aliens will be composed of a nested hierarchy of entities, with the conditions required to eliminate conflict at each of those levels. This tells us little about aliens, sadly.

Scientific American reviews genetic editing toolkits that anyone can order to start doing genetic modifications in their garage. There is speculation that it could help science:

It’s possible that people working in their garages or their kitchens will come up with a novel application or a solution to a problem that professionals just haven’t gotten around to.

That’s maybe how we get superheroes.

Daniel Lattier asks Why Professors Are Writing Crap That Nobody Reads?

Professors usually spend about 3-6 months (sometimes longer) researching and writing a 25-page article to submit an article to an academic journal. (…) 82 percent of articles published in the humanities are not even cited once. Of those articles that are cited, only 20 percent have actually been read. Half of academic papers are never read by anyone other than their authors, peer reviewers, and journal editors. So what’s the reason for this madness? Why does the world continue to be subjected to just under 2 million academic journal articles each year?

Drum writes about the end of work:

These are all harbingers, the way a dropping barometer signals a coming storm—not the possibility of a storm, but the inexorable reality. The two most important problems facing the human race right now are the need for widespread deployment of renewable energy and figuring out how to deal with the end of work. Everything else pales in comparison. Renewable energy already gets plenty of attention, even if half the country still denies that we really need it. It’s time for the end of work to start getting the same attention.

I wonder how well such opinion pieces will age?

There is a mysterious object that is going through our solar system at insane speeds. We don’t know what it is, though it is probably just a dumb piece of rock. Still, we have never seen anything like it: we only ever observe objects that are stuck in our solar system.

The dual-shotgun theorem of software engineering

There is a long-standing problem in software engineering: how does one recognize good code? It would be great if you could point a tool at a software base and get an objective measure of how good the code is. There are tools that pretend to do this, there are textbooks that pretend to guide you toward this objective, but nothing that I care to rely upon.

I have been arguing with folks on Twitter, including Brian Marick, about what makes good code. It all started from a quote by Alexis King retweeted by Eugene Wallingford :

“The best software is the software that is easy to change.”

Hear, hear.

People who never program always tend to assume that once you have code that does something “close” to what you need, then, surely, it is not much effort to bring the code to change a tiny bit its behavior.

Alas, a tiny change in functionality might require a full rewrite.

Not all code is equal in this manner, however. Some code is much easier to change.

You can pretty much recognize bad code without ever looking at it by asking the programmer to make a small change. How long does it take to add a new field to the user accounts… How long does it take to allow users to export their data in XML… How long does it take to add support for USB devices… and so forth.

We see companies collapsing under the weight of their software all the time. You have this company and it has not updated its software in the longest time… why is that? It might be that software updates are not a priority, but it could also very well be that they have 10 software engineers hard at work and achieving very little results because they are stuck in a pit of bad software.

As a practical matter, it is simply not convenient to ask for small changes constantly to measure the quality of your code. In any case, how do you recognize in the abstract a “small” change? Is getting your corporate website to work on mobile a small or large change?

The problem, also, is that really bad code grows out of sequences of small changes. Many small changes are bad ideas that should not get implemented.

It is like a Heisenberg effect: to measure the code quality through many small changes, you have to change it in a way that might lead to poor code. What you would need to do it is to ask for small changes, have experts verify that they are small changes, and then immediately retract the request and undo the small change. That would work beautifully, I think, to ensure that you always have great code but your programmers would quickly resign. Psychologically, it is untenable.

As an alternative, I have the following conjecture which I call the dual-shotgun theorem:

Code that is both correct and fast can be modified easily.

The proof? Code is not correct and fast on its own, you have to change it to make it correct and fast. If it is hard to change the code, it is hard to make it correct and fast.

Let me define my terms.

  • Code is correct if it does what we say it does. If you are not a programmer, or if you are a beginner, you might think that it is a trivial requirement… but it is actually very hard. Code is never perfectly correct. It is impossibly hard to do anything non-trivial without having some surprises from time to time. We call them bugs.
  • Code is fast if another engineer doing a rewrite can’t expect to produce equivalent software that is much faster without massive efforts. If you are programming alone or only starting out, you might think that your code is plenty fast… but if you work with great programmers, you might be surprised by how quickly they can delete whole pages of code to make it run 10x faster.

Both of these characteristics are actionable. Good programmers constantly measure how correct and how fast their code is. We have unit tests, benchmarks and so forth. These things can be automated.

They are not trivial pursuits. It is hard to know for certain how correct and how fast your code is, but you can productively work toward making your code more correct and faster.

It is a lot harder to work toward code that is easy to modify. One approach that people use deliberately for this goal is to use more abstract software. So you end up with layers upon layers of JavaScript frameworks. The result is slow code that fills up with weird bugs caused by leaky abstractions. Or they use higher-level programming languages like Visual Basic and end up in a mess of spaghetti code.

The irony is that code that is designed to be flexible is often neither correct nor fast, and usually hard to modify. Code designed to be flexible is nearly synonymous with “overengineered”. Overengineering is the source of all evils.

Yes, you have heard that early optimization was the source of all evils, but if you read Knuth in context, what he was objecting to was the production of code (with GOTOs) that is hard to read, hard to verify, and probably not much faster, if faster at all.

My theory is that if you focus on correctness and performance (in this order) then you will get code that is easy to modify when the time comes. It will work irrespective of the programming language, problem domain, and so forth.

Some might object that we have great tools to measure code complexity, and thus the difficulty of the code. Take cyclomatic complexity which essentially measure how many branches your code generates. But if you have many branches, it will be hard for your code to be correct. And frequent branches are also bad for performance. So cyclomatic complexity brings no new information. And, of course, overengineered code is often filled with what are effectively branches, although they may not come in the recognizable form of “if” statements.

What is nice about my conjecture is that it is falsifiable. That is, if you can come up with code that is fast and correct, but hard to modify, then you have falsified my conjecture. So it is scientific!

If you come up with a counterexample, that is code that is correct and fast, but hard to modify, you have to explain how the code came to be correct and fast if it is hard to modify.

A decade of using text-mining for citation function classification

Academic work is typically filled with references to previous work. Unfortunately, most of these references have, at best, a tangential relevance. Thus you cannot trust that a paper that cites another actually “builds on it”. A more likely scenario is that the authors of the latest paper did not even read the older paper, and they are citing the previous work for various unscientific reasons.

Some time ago, I helped co-author a paper that aims to identify the “influential” references (Zhu et al., Measuring academic influence: Not all citations are equal). When we worked on this paper, there wasn’t really much work at all on the subject of identifying automatically influential citations and, indeed, I think that the expression “influence” used with this technical meaning originated from our work.

Pride and Knoth published a superb paper on the matter: Incidental or influential? – A decade of using text-mining for citation function classification.

In their review, they confirm our own findings that few references are actually relevant:

It is extremely interesting to note that all studies which employed human annotators to judge citation influence reported a broadly similar ratio of positive examples. This ranged from 10.3%, through 14.3%, to 17.9%. This is an important finding as it gives a clear indication that only a relatively small percentage of all citations are actually influential at all. All of the studies find that the majority of citations are perfunctory at best. Negative citations are extremely rare (…)

They also confirm that the number of times you cite a reference is a strong indication of influence: e.g., if you mention a reference 10 times in your own paper, it is likely because this is an important reference for you.

Our results therefore confirm that the number of times a citation appears is a strong indicator of that citation’s influence.

However, they find something we had not discovered, namely that the similarity between abstracts is the most potent feature. That is, if you cite previous work using an abstract that looks like the previous work, it is likely that you have been genuinely influenced by the previous work. As far as I can tell, their result in this respect is a little at odd with previous work, so either they got it wrong or they have a new (and interesting) new result.

They failed to reproduce some of our results, and this might have to do with software issues:

These results show that ParsCit correctly identified the exact number of citations in only 40% of cases. GrobID was even less successful. It was exactly correct in only one case and missed a significant number of citations in many others. We argue that this demonstrates a potentially serious failing in current methodologies that rely on PDF extraction for calculation of number of citations.

The quality of the software you are using in research is often underrated. If we screwed up the data processing: “shame on us”.

Another issue is “author overlap”: you are more likely to be influenced by work that you have written yourself. In our own paper, we did not find this to be a strong predictor but as observed by Pride and Knoth, this might have to do with the nature of our dataset which was annotated by the authors themselves. Clearly, I am strongly influenced by my own previous work, so I do believe that it ought to be a strong predictor. I would argue, however, that there is a special place in hell for people who are only influenced by themselves.

In any case, time has come, I think, to take this field seriously. The likes of Google should classify citations. It is almost meaningless to say that a paper was cited 100 times if you don’t know how many of these citations are influential.

Further reading. Coincidentally, Nature has an editorial entitled Neutral citation is poor scholarship that states:

Neutral, flavorless or unexamined mention predominates, and we believe this to be an increasing problem for the integrity of scientific communication, whether it is used in the Introduction or in the Results and Discussion.

Synthesized hash functions in Swift 4.1 (and why Java programmers should be envious)

When programming, we often rely on maps from keys to values. This is most often implemented using a hash table, which implies that our keys must be “hashable”. That is, there must be a function from key objects to hash values. A hash value is a “random-looking integer” that is determined by the object value. Thus a given string (e.g., “232”) should always have the same hash value within a given context (typically within the program’s life). It should be “random” in the sense that it should be improbable that any two distinct keys have the same hash value (e.g., “232” and “231” should not hash to the same integer).

In many programming languages like Java and C++, you are expected to implement and design your own hash functions when coming up with your own classes or structs. In Java, if you do not provide a hash function (by overriding the hashCode function), then you inherit some hashCode function which is tied to the particular object instance. That’s problematic. Let me illustrate the problem. Suppose I create my own user identifier class:

public class UserID {
  int x;
  public UserID(int X) {x = X;}
}

That sounds reasonable, right?

Let me store some information corresponding to one such user identifier:

HashMap<Stupid,String> h = new HashMap<Stupid,String>();
h.put(new UserID(32321),"Your name is John");

What happens if I try to retrieve it?

System.out.println(h.get(new UserID(32321)));

Most likely, this will give a null, as in “we don’t have such a user identifier”.

So you have to remember to override hashCode properly or risk having buggy software. Annoying.

It is especially annoying because most people do not know how to design a hash function! You are lucky if you got a single class in college on the topic.

Swift, the new programming language designed by Apple is in the same boat. It is slightly better in that it will not generate a potentially misleading hash function by default. For your class or struct to be “hashable”, you need to provide a hash function.

However, Swift 4.1 fixes this annoyance by generating sensible default hash functions. These hash functions are probably not perfect but they are likely better than whatever bored programmers can conjure up.

Swift 4.1 has not yet been released, so I suppose the details could change, but here is the gist of it. Suppose that you define your own class:

struct Point:Hashable {
     var x : Int
     var y : Int
     public init(_ x:Int,_ y:Int) {
         self.x = x
         self.y = y
     }
 }

Notice that I declared that it is “Hashable” but I did not provide any hash-function implementation. On current Swift (e.g., Swift 4.0), this will not compile, but if you get early builds of Swift 4.1, it will work fine.

So what does it do? Well, the Swift compiler looks at the values stored in your class and struct, and it automagically hashes them together.

A point worth noting: the hash value only depends on the values being stored. So suppose you create a new class called MyPoint (instead of Point):

struct MyPoint:Hashable {
     var x : Int
     var y : Int
     public init(_ x:Int,_ y:Int) {
         self.x = x
         self.y = y
     }
 }

It will hash in the same manner so you can be sure that the following is always true:

Point(1,2).hashValue == MyPoint(1,2).hashValue

How does it compute the hash? It takes the hash value of each value in the struct or class (let me assume that they are hashable themselves) and it combines them using the function _mixForSynthesizedHashValue. This mysterious function could change but for the time being, it is just the following linear polynomial in simplified code:

function _mixForSynthesizedHashValue(x,y) {
  return 31 * x + y
}

It often called a compression function because it takes two values, and combines them into a single one (we go from two times 64 bits to 64 bits). For more “randomness”, the Swift compiler uses a _mixInt function which takes a 64-bit value and returns another 64-bit value with the “bits mixed” (it is just a function that appears to generate random outputs). Thus the following should print the same value twice:

print(Point(1,1).hashValue);
print(_mixInt(_mixForSynthesizedHashValue(1,1)))

What if you have more than two values in your struct or class? Let me consider a tridimensional point:

struct Point3D:Hashable {
     var x : Int
     var y : Int
     var z : Int
     public init(_ x:Int,_ y:Int, _ z:Int) {
         self.x = x
         self.y = y
         self.z = z
     }
 }

Then the result is similar except that we need to call the compression function twice, so that the following two lines will print the same value:

print(Point3D(32,45,66).hashValue);
print(_mixInt(_mixForSynthesizedHashValue(
  _mixForSynthesizedHashValue(32,45),66)))

I alluded to the fact that these automatically generated hash functions might not be perfect… In the current implementation, we get that tridimensional points can trivially collide with bidimensional points, the following being always true:

Point3D(0,32,45).hashValue == Point(32,45).hashValue

If this ends up being a problem for your application, you can always roll out your own hash functions, of course.

I did not elaborate on the _mixInt, but one nice thing that I noticed in the Swift’s source code is that they are planning for it to be randomized, that is, you will not get the same hash value for the same object or struct instance for every run of your project. This is an important security feature which I alluded to in an older blog post.

Indeed, the problem with the current implementation is that I can trivially construct lots of values that will collide (have the same hash value), thus rendering the performance of hash tables and other algorithms quite bad. So it is good to randomize the hash functions by default, to make the job of an attacker more difficult.

Further reading: If you are interested in the science of hashing, you might like the following papers…

Note about cryptography:
There is a whole different field called cryptographic hashing which seeks to map values to hash values in such a way that it is very difficult for you to ever guess what the original value was given the hash value and such that it is very hard to create a key that maps to a specific hash value. For example, passwords can be hashed so that if I only give you the hash value, you would need to work for years to find the matching password. Yet, as a system, it is enough for me to just store the hash value. But for what Swift does, this type of security is irrelevant because you are not trying to hide the information being stored, you are just trying to ensure that it is processed quickly.