Science and Technology links (November 17th, 2017)

Josiah Zayner, a biochemist who once worked for NASA, became the first person known to have edited his own genes (…) During a lecture about human genetic engineering that was streamed live on Facebook, Zayner whipped out a vial and a syringe, then injected himself. Now, following in his footsteps, other biohackers are getting ready to take the plunge and tinker with their own genes. Zayner’s experiment was intended to boost his strength by removing the gene for myostatin, which regulates muscle growth. A similar experiment in 2015 showed that this works in beagles whose genomes were edited at the embryo stage. He injected himself (…) to remove the gene. (Biohackers are using CRISPR on their DNA and we can’t stop it)

Human beings definitively do not have the largest brains:

We found that the long-finned pilot whale neocortex has approximately 37.2 × 10⁹ neurons, which is almost twice as many as humans, and 127 × 10⁹ glial cells. Thus, the absolute number of neurons in the human neocortex is not correlated with the superior cognitive abilities of humans (at least compared to cetaceans) as has previously been hypothesized.

We can make old mice smarter by tweaking just one gene, according to an article published in Nature:

This study demonstrates, for we believe the first time in vivo, that 6 months after a single injection of s-KL into the central nervous system, long-lasting and quantifiable enhancement of learning and memory capabilities is found. More importantly, cognitive improvement is also observable in 18-month-old mice treated once, at 12 months of age.

I stumbled on an older post (2015) by Marcel Weiher about his views on where software is headed:

(…) for most performance critical tasks, predictability is more important than average speed (…) Alas the idea that writing high-level code without any concessions to performance (often justified by misinterpreting or simply just misquoting Knuth) and then letting a sufficiently smart compiler fix it lives on. I don’t think this approach to performance is viable, more predictability is needed and a language with a hybrid nature and the ability for the programmer to specify behavior-preserving transformations that alter the performance characteristics of code is probably the way to go for high-performance, high-productivity systems.

He is arguing that engineers have made it hard to reason about performance and to design software for the performance they need. When we are stuck with unacceptable performance, we are often truly stuck… with no clear way to fix the problem.

We found a single gene that makes you live, on average, seven years longer than your peers.

Exercise increases the size of your brain.

Fast exact integer divisions using floating-point operations (ARM edition)

In my latest post, I explained how you could accelerate 32-bit integer divisions by transforming them into 64-bit floating-point divisions. Indeed, 64-bit floating-point numbers can accurately represent all 32-bit integers on most processors.

It is a strange result: Intel processors seem to do a lot better with floating-point divisions than integer divisions.

Recall the numbers that I got for the throughput of division operations:

64-bit integer division: 25 cycles
32-bit integer division (compile-time constant): 2+ cycles
32-bit integer division: 8 cycles
32-bit integer division via 64-bit float: 4 cycles

I decided to run the same test on a 64-bit ARM processor (AMD A1100):

64-bit integer division: 7 ns
32-bit integer division (compile-time constant): 2 ns
32-bit integer division: 6 ns
32-bit integer division via 64-bit float: 18 ns

These numbers are rough; my benchmark is naive (see code). Still, on this particular ARM processor, 64-bit floating-point divisions are not faster (in throughput) than 32-bit integer divisions. So ARM processors differ from Intel x64 processors quite a bit in this respect.

Fast exact integer divisions using floating-point operations

On current processors, integer division is slow. If you need to compute many quotients or remainders, you can be in trouble. You potentially need divisions when programming a circular buffer, a hash table, generating random numbers, shuffling data randomly, sampling from a set, and so forth.

There are many tricks to avoid performance penalties:

  • You can avoid dividing by an arbitrary integer and, instead, divide by a known power of two (see the sketch after this list).
  • You can use a divisor that is known to your compiler at compile-time. In these cases, most optimizing compilers will “optimize away” the division using magical algorithms that precompute a fast division routine.
  • If you have a divisor that is not known at compile time, but that you reuse often, you can make use of a library like libdivide to precompute a fast division routine.
  • You can reengineer your code to avoid needing a division in the first place, see my post A fast alternative to the modulo reduction.
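
To illustrate the first trick, here is a minimal sketch assuming a hypothetical circular buffer whose capacity is a power of two (the constant and function names are mine, for illustration): the quotient becomes a shift and the remainder a mask, and optimizing compilers do this automatically when the divisor is a compile-time power of two.

#include <stdint.h>

#define CAPACITY 1024 /* assumed: a power of two (2 to the 10th) */

/* x / CAPACITY compiles down to a right shift */
uint32_t index_quotient(uint32_t x) { return x >> 10; }

/* x % CAPACITY compiles down to a bit mask */
uint32_t index_remainder(uint32_t x) { return x & (CAPACITY - 1); }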

But sometimes, you are really stuck and need those divisions. The divisor is not frequently reused, and you have lots of divisions to do.

If you have 64-bit integers, and you need those 64 bits, then you might be in a bit of trouble. Those long 64-bit integers have a terribly slow division on most processors, and there may not be a trivial way to avoid the price.

However, if you have 32-bit integers, you might have a way out. Modern 64-bit processors have 64-bit floating-point numbers following the IEEE 754 standard. These 64-bit floating-point numbers can represent exactly all integers in the interval [0, 2⁵³). That means that you can safely cast your 32-bit unsigned integers as 64-bit floating-point numbers.

Furthermore, common x64 processors have fast floating-point divisions. And the division operation over floating-point numbers is certain to result in the closest number that the standard can represent. The division of an integer in [0, 2³²) by an integer in [1, 2³²) is sure to be in [0, 2³²). This means that you can almost replace the 32-bit integer division by a 64-bit floating-point division:

#include <stdint.h>

// Divide via 64-bit floating-point: both operands convert exactly to
// double, and truncating the quotient recovers the exact integer result.
// The caller must ensure that b is not zero.
uint32_t divide(uint32_t a, uint32_t b) {
  double da = (double) a;
  double db = (double) b;
  double q = da / db;
  return (uint32_t) q;
}

Sadly, if you try to divide by zero, you will not get a runtime error, but rather some nonsensical result. Still, if you can be trusted to not divide by zero, this provides a fast and exact integer division routine.

How much faster is it? I wrote a small program to measure the throughput:

64-bit integer division: 25 cycles
32-bit integer division (compile-time constant): 2+ cycles
32-bit integer division: 8 cycles
32-bit integer division via 64-bit float: 4 cycles

These numbers are rough, but we can estimate that we double the throughput.

I am not entirely sure why compilers fail to exploit this trick. Of course, they would need to handle the division by zero, but that does not seem like a significant barrier. There is also another downside to the floating-point approach: it generates many more instructions.

Signed integers work much the same, but you need extra care. For example, most processors rely on two’s complement notation, which implies that there is one negative number that cannot be represented as a positive number. Thus implementing “x / (-1)” can cause some headaches. You probably do not want to divide signed integers anyhow.
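
Here is a minimal sketch of what the extra care might look like; the wrapping behavior for the INT32_MIN / -1 pair is my own illustrative choice, not a standard routine:

#include <stdint.h>

/* Signed version of the trick: every 32-bit signed integer is exactly
   representable as a double, but INT32_MIN / -1 does not fit in an
   int32_t, so that pair must be handled separately (here we wrap, an
   arbitrary choice for illustration). Division by zero remains the
   caller's responsibility. */
int32_t signed_divide(int32_t a, int32_t b) {
  if (a == INT32_MIN && b == -1) return INT32_MIN; /* avoid overflow */
  return (int32_t) ((double) a / (double) b);
}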

I plan to come back to the scenario where you have lots of 64-bit integer divisions with a dynamic divisor.

This result is for current Intel x64 processors; what happens on ARM processors is quite different.

Fast software is a discipline, not a purpose

When people train, they usually do not need to actually run faster or lift heavier weights. As a relatively healthy computer science professor, how fast I run or how much I can lift is of no practical relevance. However, whether I can climb the stairs without falling apart is a metric.

I am not an actor or a model. Who cares how much I weigh? I care: it is a metric.

I could probably work in a dirty office without ill effect, but I just choose not to.

So when I see inefficient code, I cringe. I am being told that it does not matter. Who cares? We have plenty of CPU cycles. I think you should care; it is a matter of discipline.

Yes, only about 1% of all code being written really matters. Most people write code that may as well be thrown out.

But then, I dress cleanly every single day even if I stay at home. And you should too.

I do not care which programming language you use. It could be C, it could be JavaScript. If your code is ten times slower than it should be, I think it shows that you do not care, not really. And it bothers me. It should bother you because it tells us something about your work. It is telling us that you do not care, not really.

Alexander Jay sent me a nice email in which he reviewed some tricks he uses to write fast code. It inspired these recommendations:

  • Avoid unnecessary memory allocations.
  • Avoid multiple passes over the data when one would do (see the sketch after this list).
  • Avoid unnecessary runtime inferences and branches.
  • Avoid unnecessary performance-adverse abstraction.
  • Prefer simple value types when they suffice.
  • Learn how the data is actually represented in bits, and learn to dance with these bits when you need to.
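
To make the “one pass” recommendation concrete, here is a toy sketch of mine (not Alexander’s code): computing the minimum and maximum of an array in a single traversal instead of two.

#include <stddef.h>
#include <stdint.h>

/* One traversal computes both aggregates; calling separate min and max
   routines would read the array twice. Assumes n >= 1. */
void min_max(const uint32_t *data, size_t n, uint32_t *min, uint32_t *max) {
  uint32_t lo = data[0];
  uint32_t hi = data[0];
  for (size_t i = 1; i < n; i++) {
    if (data[i] < lo) lo = data[i];
    if (data[i] > hi) hi = data[i];
  }
  *min = lo;
  *max = hi;
}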

Alexander asked me “At what point would you consider the focus on optimization a wasted effort?” My answer: “At what point do you consider being fit and clean a wasted effort?”

There is a reason we don’t tend to hire people who show up to work in dirty t-shirts. It is not that we particularly care about dirty t-shirts, it is that we want people who care about their work.

If you want to show care for your software, you first have to make it clean, correct and fast. If you start caring about bugs and inefficiencies, you will write good software. It is that simple.

China is catching up to the USA, while Japan is being left behind

I have previously reported that there has been a silent revolution in science where countries like China, India, South Korea… that previously contributed few research articles… have started to catch up and even exceed the productivity of the western world.

The National Science Foundation (NSF) in the US has published a report on Science and Engineering Publication Output Trends. Their subtitle is “Rise of Developing Country Output while Developed Countries Dominate Highly Cited Publications”.

The report echoes my earlier observations: lots of new countries are becoming scientific powers in terms of publication output, with China leading the charge. The major claim made by the report is that developed countries continue to dominate in terms of highly cited publications.

I should preface any further analysis by reminding us that such numbers should be interpreted with care. They count the number of research articles produced per country along with other metrics such as which fraction of those is highly cited. Counting citations and research papers is an imperfect way to measure scientific output. Bibliometrics is a field that is full of methodological issues. So we should not try to compare countries on this basis only. Furthermore, the report does not seem concerned with “per capita” analysis. For example, though the European Union appears to fare quite well according to this report (it indicates that the European Union publishes more than the US), we know that this is no longer true once you factor in the number of citizens, which is higher in the European Union (500 million people versus a bit over 300 million people in the USA). So it might be misleading. Generally, the report does not shy away from comparing small entities to large ones, and that is a problem.

Back to the main claim of the report: though China is rising in output, countries like the US still have a larger fraction of their research articles that are highly cited.

If you accept that Japan is a “developed country”, and you should, then you are forced to view it as an exception to this observation. What is surprising to me regarding the numbers is the fact that Japan seems to underperform compared to what we might expect. They have barely expanded their publication volume, and their citation patterns seem to be closer to those of India than those of the US. Back in 2008, I already observed that Japan produces far fewer research articles per capita than other rich countries, so this is nothing new. It reminds me of an August 2017 article in Nature, “Budget cuts fuel frustration among Japan’s academics”:

young researchers, facing unstable employment conditions and economic uncertainty, are forced to aim for results that can be accomplished in the short term and in which true originality and creativity are difficult to realize.

Furthermore, the data presented in the report clearly indicate that countries like China are bridging the gap with respect to how impactful their work is (see Fig. 2 in the report), on top of bridging the gap with respect to the total volume (see Table 1 in the report). This should be viewed as remarkable, especially if you consider that most scientific papers are written in English and published by American or European publishers, and that most of the most prestigious universities are in the US. And of course, China lags behind the US considerably in research spending, both in absolute dollars and as a percentage of GDP.

A better title for the NSF report might have been “China is catching up to the USA, while Japan is being left behind”.

Does any of it matter? Many people believe, or assume, that great output in terms of research articles should cause economic prosperity and innovation. I have a post entitled Does academic research cause economic growth? that makes the contrary point. That is, though China is catching up in terms of scientific output, this may be a consequence of its prosperity: it can now afford to have its very best minds work on producing research articles. It is much easier for rich countries to fund people so that they can publish in Nature. So being rich will allow you to catch up.

But Japan shows that you can be a very rich country and choose not to produce many great research articles. At the very least, this establishes that you do not need to produce many great research articles to be prosperous.

Science and Technology links (November 11th, 2017)

Read the following quote from the New York Times:

Business is taking an interest in artificial intelligence, or A.I., and some professors are forming or joining companies to capitalize on the expected boom. But the new move toward commercialization is disrupting the academic community and provoking fears that university research will be hurt.

Some researchers welcome the business interest. Others, however, complain that corporations are outbidding the campus for scarce personnel, and that work is being diverted from long-term research to short-term problems with immediate application.

It was written in 1982.

The Spectator has a piece arguing that moderate drinking is good for you and that this fact is being suppressed by health professionals. (My thoughts: I think it is far from certain that alcohol is good for you, but it is probably not what will kill you.)

We all know that human activity is wiping out some animal species. According to professor Chris Thomas… a different story can be told:

The biological processes of evolutionary divergence and speciation have not been broken in the Anthropocene, they have gone into overdrive. Come back in a million years and we might be looking at several million new species whose existence can be attributed to humans. In the end, the Anthropocene biological revolution will almost certainly represent the sixth mass genesis of new biological diversity. It could be the fastest acceleration of evolutionary diversification in the last half-billion years.

In a short talk, the famous polymath Eric Weinstein argues that physics is lost to theoretical nonsense… that it is stuck, away from reality. He suggests that it might be time for outsiders to come in and disrupt the discipline.

You can spot cancer cells because they burn through sugar very fast, generating heat. Canadian researchers have designed a very low-cost device that cools the skin and then measures the temperature to reliably spot skin cancers. If not for regulations, this could be available cheaply on Amazon in a few years?

A boy was suffering from a terrible genetic skin disease. Doctors took skin samples, corrected the genetic anomaly, regrew the skin and put it back in place. It worked. The vast majority of his skin had to be replaced.

You may have heard that chocolate is a health food. According to a story in the Atlantic, there are some really bad research practices involved:

If you look at the most recent version of their clinical trials registry, it was published in January 2015, three months after they published their Nature Neuroscience article. “So they went back after the article was published in Nature and changed their clinical trial registry. There is no mention of this in the trial report,” Drysdale added.

(…)

“The bigger concern is that people are trying to do a better job of selling the research itself and not just telling what the straight out answer is,” University of Toronto nutrition researcher Richard Bazinet said. This study only showed that over a period of three months, in a small group, according to a very narrow test that taps a very specific region of the brain, cocoa supplements enhanced cognition. That became “chocolate fights Alzheimer’s” — a message Mars surely appreciated.

As we grow older, we accumulate senescent cells. These are old cells that won’t replicate or function normally. Their accumulation is strongly associated with several diseases of old age. Currently, there are a few initiatives to develop safe drugs that can kill senescent cells. Amazon’s Jeff Bezos is funding one such initiative. Unexpectedly, some researchers have done breakthrough work that shows we can possibly reverse the senescent state:

The researchers applied compounds called resveratrol analogues, chemicals based on a substance naturally found in red wine, dark chocolate, red grapes and blueberries, to cells in culture. The chemicals caused splicing factors, which are progressively switched off as we age, to be switched back on. Within hours, the cells looked younger and started to rejuvenate, behaving like young cells and dividing.

The discovery has the potential to lead to therapies which could help people age better, without experiencing some of the degenerative effects of getting old. Most people by the age of 85 have experienced some kind of chronic illness, and as people get older they are more prone to stroke, heart disease, and cancer.

This week, I learned that molecules inside our cells move really, really fast:

You may wonder how things get around inside cells if they are so crowded. It turns out that molecules move unimaginably quickly due to thermal motion. (…) a molecule will collide with something billions of times a second and bounce off in a different direction. (…) As a result of all this random motion, a typical enzyme can collide with something to react with 500,000 times every second. (…) In addition, a typical protein is tumbling around, a million times per second. Imagine proteins crammed together, each rotating at 60 million RPM, with molecules slamming into them billions of times a second. (…) enzymes spin at up to 700 revolutions per second, which is faster than a jet engine.

Waymo, the Alphabet [i.e., Google] self-driving car company, now has cars driving on public roads in the Phoenix metropolitan area with no one in the driver’s seat.

In a timid study, a small amount of blood plasma from young people was transfused into people suffering from Alzheimer’s. The study failed to show massive benefits:

Nine patients with mild to moderate Alzheimer’s got four once-weekly infusions of either saline (as a placebo) or plasma from 18- to 30-year-old male donors. After a 6-week break, the infusions were switched so that the patients who had gotten plasma got saline, and the patients who had gotten saline received plasma. Another nine patients received young plasma only, and no placebo. (…) The remaining patients who completed the young plasma treatment performed no better overall on objective cognitive tests given by medical staff. However, on average their scores improved slightly—4.5 points on a 30-point scale—on a caregiver survey about whether they needed help with daily activities such as making a meal or traveling. The patients’ average scores also improved modestly on another survey that asks caregivers how well patients can perform simple tasks like getting dressed and shopping. (…) Wyss-Coray agrees that not much can be concluded from the small trial, but says, “It’s tempting to feel hopeful about the improvement in functional scores.” Because the treatment seemed safe, Alkahest now wants to launch another trial that will use just the fraction of the blood plasma that contains growth factors, but not coagulation factors and other components that may do more harm than good. In animals, this plasma fraction was more effective at improving cognition in the mice with an Alzheimer’s-like condition than whole plasma, Wyss-Coray says. Alkahest also wants to test a range of doses and include patients with more severe Alzheimer’s.

Recall that we do know that parabiosis, the transfer of blood between two animals, can age or rejuvenate them, but the best evidence suggests that we simply acquire aging factors as we age: there are no magical youthful factors. So it might be a total waste of time to add a bit of young plasma to the blood of older individuals. Instead, we need to find a way to normalize the composition of the blood of older people. This problem may prove very hard with our current technology. Or not.

It is usually assumed that peer review, as practiced by science journals, improves research articles. It may not be so simple:

Peer reviewers fail to detect important deficiencies in reporting of the methods and results of randomised trials. The number of changes requested by peer reviewers was relatively small. Although most had a positive impact, some were inappropriate and could have a negative impact on reporting in the final publication.

Alzheimer’s disease may not start in the brain. Without kidding: there is some evidence that unhealthy teeth and gums could lead to Alzheimer’s.

A young girl who was doomed as a baby by spinal muscular atrophy was rescued thanks to gene therapy.

I always assumed that prestigious journals had better peer review. That might not be so:

If you take all journals and rank them according to prestige, the most prestigious journals publish the least reliable science (at least when you look at the available evidence from experimental fields).

(…)

In the field of genetics, it appears that errors introduced in gene names (and accession numbers) are more common in higher-ranking journals.

(…)

statistical power has been declining since the 1960s and that statistical power is negatively correlated with journal rank (i.e., a reproduction of the work above, with an even worse outcome). Moreover, the fraction of errors in calculating p-values is positively correlated with journal rank, both in terms of records and articles

What might explain this effect? Maybe high prestige attracts certain types of people with goals that differ from those of science.

A breast cancer drug appears to cure arteriosclerosis (a condition responsible for heart diseases) in mice.

Systematic reviews of the state of the art are often produced in medical research to summarize the evidence. I always expected this work to be top-notch. It may not be so, as these studies often fail to provide the information needed to make their analyses reproducible:

the data needed to recreate all meta-analytic effect estimates (…) in only 65% of SRs. Only 30% of SRs mentioned access to datasets and statistical code used to perform analyses.

It is just not possible to check whether a review paper has done its work properly. You cannot double-check the computations.

According to the New York Times, the best college majors according to wages are engineering, economics, computer science and nursing. The worst are education, social work, humanities, philosophy, liberal arts, psychology, English, and biology. Business and accounting are somewhere in the middle.

The hole in the ozone layer is shrinking.

Amazon’s Jeff Bezos has funded a company that is building a gigantic indoor farm in Seattle.

Lifting weights twice a week is the best routine to keep your muscles as you age. Lifting weights is better than cardio to stay fit as you age.

Sleep deprivation messes with your neurons in a bad way, according to Nature.

How should you build a high-performance column store for the 2020s?

Though most relational databases (like MySQL) are “row oriented”, in that they keep rows stored together… experience has taught us that pivoting the data so that columns, and not rows, are stored together can be beneficial. This is an old observation that most experienced programmers know about as the array-of-structs versus struct-of-arrays choice. There is even a Wikipedia page on the topic.

For example, in Java you might have a Point object containing three attributes (X, Y, and Z). You can store an array of Point objects. Or you could have three arrays (one for X, one for Y and one for Z). Which one is best depends on how you are using the data; you should consider both.
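
In C terms (matching the code used elsewhere in these posts; the names and sizes are made up for illustration), the two layouts look like this:

/* Array of structs: the x, y, z of each point are adjacent in memory. */
struct point { float x, y, z; };
struct point points_aos[1000];

/* Struct of arrays: all x values are adjacent, then all y, then all z. */
struct points_soa {
  float x[1000];
  float y[1000];
  float z[1000];
};

A query that touches only x benefits from the second layout, while code that always processes whole points may prefer the first.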

It is fairly common for databases to have tons of columns, and for analytical queries to bear only on a few columns. So it makes sense to keep the columns together.

There is actually a full spectrum of possibilities. For example, if you have a timestamp column, you might consider it as several separate columns that can be stored together (years, days, seconds). It is impossible in the abstract to tell how to best organize the data; you really need some domain knowledge about how you access the data.

In the data science world, I have been very interested in a new initiative called Apache Arrow. One way to view Apache Arrow is as a common column-oriented database format that can be shared by many database engines.

Column orientation is at its best when used with compression. Indeed, it is arguably easier and more efficient to compress columns stored together than when they are interleaved with other columns.

Arrow relies on dictionary compression, a tried-and-true technique. It is a great technique that can enable really fast algorithms. The key idea is quite simple. Suppose that there are 256 distinct values in your column. Then you can represent them as numbers in [0,255], with a secondary lookup table to recover the actual values. That is, if you have N distinct values, you can store them using ceil(log(N)/log(2)) bits using binary packing (C code, Java code, Go code, C# code). Thus you might use just one byte per value instead of possibly much more. It is a great format that enables superb optimizations. For example, you can accelerate dictionary decoding using SIMD instructions on most Intel and AMD processors released in the last 5 years.
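
Here is a rough sketch of dictionary decoding, assuming at most 256 distinct values so that one byte per code suffices (the function names are mine; a real engine would pack the codes into ceil(log(N)/log(2)) bits rather than whole bytes):

#include <stddef.h>
#include <stdint.h>

/* Each stored code is an index into a small dictionary of the
   distinct column values; decoding is a single table lookup. */
void dictionary_decode(const uint8_t *codes, size_t n,
                       const uint64_t *dictionary, uint64_t *out) {
  for (size_t i = 0; i < n; i++) {
    out[i] = dictionary[codes[i]];
  }
}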

In a widely read blog post, professor Abadi criticized Apache Arrow in these terms:

I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression).

I thought I would comment a bit on this objection because I spent a lot of time hacking these problems.

Let me first establish the terminology:

  1. Run-length encoding is the idea where you represent repeated values as the value being repeated, followed by some count that tells you how often the value is repeated. So if you have 11111, you might store it as “value 1 repeated 5 times” (see the sketch after this list).
  2. A bit-vector (or bitmap, or bit array, or bitset) is simply an array of bits. It is useful to represent sets of integer values. So to represent the set {1,2,100}, you would create an array containing at least 101 bits (e.g., an array of two 64-bit words) and set only the bits at indexes 1, 2 and 100 to ‘true’; all other bits would be set to false. I gave a talk at the Spark Summit earlier this year on the topic.

    To make things complicated, many bit-vectors used in practice within indexes rely on some form of compression.

    You can build an entire data engine on top of bit-vectors: Pilosa is a great example.
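
Here is a toy run-length encoder matching the description in point 1 (my own illustrative code, not what any particular engine ships):

#include <stddef.h>
#include <stdint.h>

/* Writes (value, run length) pairs; returns the number of runs.
   The column 1 1 1 1 1 becomes the single pair (1, 5). */
size_t rle_encode(const uint32_t *in, size_t n,
                  uint32_t *values, uint32_t *lengths) {
  size_t runs = 0;
  size_t i = 0;
  while (i < n) {
    size_t j = i + 1;
    while (j < n && in[j] == in[i]) j++;
    values[runs] = in[i];
    lengths[runs] = (uint32_t)(j - i);
    runs++;
    i = j;
  }
  return runs;
}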

Here are some considerations:

  1. To sort, or not to sort?

    Run-length encoding is very powerful when the data is sorted. But you cannot have all columns in sorted order. If you have 20 columns, you might lexicographically sort the data on columns 2 and 3, and then columns 2 and 3 will be very highly compressible by run-length encoding, but other columns might not benefit so much from your sorting efforts. That’s great if your queries focus on columns 2 and 3, but much less interesting if you have queries hitting all sorts of columns. You can duplicate your data, extracting small sets of overlapping columns and sorting the result, but that brings engineering overhead. You could try sorting based on space-filling curves, but if you have lots of columns, that might be worse than useless. There are better alternatives such as Vortex or Multiple-List sort.

    Should you rely on sorting to improve compression? Certainly, many engines, like Druid, derive a lot of benefits from sorting. But it is more complicated than it seems. There are different ways to sort, and sorting is not always possible. Storing your data in distinct projection indexes like Vertica may or may not be a good option engineering-wise.

    The great thing about a simple dictionary encoding is that you do not need to ask these questions…

  2. There are more ways to compress than you know.

    I have been assuming (and I guess Abadi did the same) that dictionary coding relies on binary packing. But there are other techniques found in Oracle and SAP platforms. All of them assume that you divide your column data into blocks.

    • By using dictionary coding again within each small block, you can achieve great compression because most blocks will see very few distinct column values. That’s called indirect coding.
    • Another technique is called sparse coding, where you use a bit-vector to mark the occurrences of the most frequent value, followed by a list of the other values.
    • You also have prefix coding, where you record the first value in the block and how often it is repeated consecutively, before storing the rest of the block in the usual manner (see the sketch after this list of considerations).

    (We review these techniques in Section 6.1.1 of one of our papers.)

    But that is not all! You can also use patched coding techniques, such as FastPFor (C++ code, Java code). You could even get good mileage out of the super fast Stream VByte.

    There are many, many techniques to choose from, with different trade-offs between engineering complexity, compression rates, and decoding speeds.

  3. Know your hardware!

    Every single processor you might care about supports SIMD instructions that can process several values at once. And yes, this includes Java, through the Panama project (there is a great talk by top Oracle engineers on this topic). So you absolutely want to make it easy to benefit from these instructions since they are only getting better over time. Intel and AMD will popularize AVX-512; ARM is coming with SVE.

    I think that if you are designing for the future, you want to take into account SIMD instructions in the same way you would take into account multi-core or cloud-based processing. But we know less about doing this well than we could hope for.

  4. Be open, or else…

    Almost every single major data-processing platform that has emerged in the last decade has been either open source, sometimes with proprietary layers (e.g., Apache Hive), or built substantially on open source software (e.g., Amazon Redshift). There are exceptions, of course… Google has its own things, and Vertica has done well. On the whole, however, the business trade-offs involved when building a data processing platform make open source increasingly important. You are either part of the ecosystem, or you are not.
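
Going back to the prefix coding idea from point 2, here is a minimal sketch (my own toy code, under the assumption that blocks are arrays of 32-bit values):

#include <stddef.h>
#include <stdint.h>

/* Prefix coding: store the first value and how many times it repeats
   consecutively, then the rest of the block in the usual manner.
   Returns the number of 32-bit words written to 'out'. Assumes n >= 1. */
size_t prefix_encode(const uint32_t *block, size_t n, uint32_t *out) {
  size_t run = 1;
  while (run < n && block[run] == block[0]) run++;
  out[0] = block[0];        /* the repeated first value */
  out[1] = (uint32_t) run;  /* length of the initial run */
  for (size_t i = run; i < n; i++) {
    out[2 + (i - run)] = block[i];
  }
  return 2 + (n - run);
}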

So how do we build a column store for the future? I don’t yet know, but my current bet is on Apache Arrow being our best foundation.

Advertisement. If you are a recent Ph.D. graduate and you are interested in coming to Montreal to work with me during a post-doc to build a next-generation column store, please get in touch. The pay won’t be good, but the work will be fun.

Should computer scientists keep the Lena picture?

There is a very famous picture commonly used in image processing research and classes, that of Lena Söderberg. Söderberg posed nude for Playboy magazine in the 1970s. A cropped version of the picture has been used ever since. The picture being used is “safe for work”, but if you Google long enough, you will find the complete version.

I worked with the picture for years, totally ignorant of the origin story. It was only many years later that I learned the story. (I am old enough that I did my Ph.D. in the pre-Google days.)

To many women, the use of the Lena picture is offensive. This puts me in the awkward position of choosing not to include the picture I refer to in my post. It is available online.

What should we do?

Richard Matthews from the University of Adelaide thinks that we should keep using the picture:

Should the field in general stop using the Lena image? My personal view is: no. The use of the Lena test image is a quirk of the industry that should be celebrated. That being said, it should be used alongside others equally.

I disagree with Matthews.

When I wrote my Ph.D. thesis, I think I was ignorant of the backstory behind Lena, and I used the picture abundantly in my tests because it is a very rich picture, with lots of different textures. Still, in the thesis itself, instead of going with Lena, I used a painting of some famous mathematician because I thought that the theme was more interesting.
There are good alternatives such as the Kodak image set that do not contain any objectionable image.

“Getting along” is how we prosper. There are times when offending people is warranted. But I don’t think that the offense caused by Lena serves a useful purpose.

We cannot remove the Lena picture from the articles and textbooks, and we should not seek to make the picture disappear… but in the interest of “getting along” we should use alternative pictures.

Getting along is a statistical concept. You can never please everyone or even most people. If you remove anything that could offend someone, somewhere, you will be forced to hide in a very dark hole. In the case of the Lena image, in my view, enough reasonable people are offended that it is worth taking it into consideration.

Of course, you can never win cleanly. It is entirely possible that other people would be offended at the thought that we might remove the picture from consideration. Hence, I use the term “statistical”. It is not black and white. Wrong and right.

I will conclude with a personal story. In the lecture notes of one of my courses, there is a picture of a prostitute by the side of the road. If you are not paying attention to the content of the course, it seems totally out of place and maybe sexist. At one point, we were paying a copy editor who went through my notes. When she got to this picture, I think she was ready to throw her hands up and call the cops. Except that the picture in question is the cover of a feminist magazine, one of the first to benefit from digital archiving. It was, historically, a very important feminist (I stress feminist) magazine. That particular issue had a piece on prostitution. I used this cover because that’s the cover the people who completed the digital conversion promoted; in the original announcement, there were famous feminists being photographed right by the cover in question. I could have removed the cover from my notes in the interest of “getting along”, but one should draw a line somewhere. My point is that we should not bend so far as to enter a 1984-like Orwellian world. We should not go so far as to forbid the Lena picture, or to clean the world of anything that might offend. Still, I would probably have chosen a different cover had I thought it could offend (needlessly).

Further reading: Should we discourage the use of Lenna?

Are universities an egalitarian force?

Many people are concerned about economic inequality. Many of these people work on university campuses. My experience is that the vast majority of the university professors, including those working at prestigious schools, believe that they are working toward a more egalitarian society.

I have reviewed this issue earlier this year in College and inequality. I believe that the case for colleges as agents of equality is much weaker than people expect, and the reverse case can be made with equal confidence. The tax breaks that Princeton and Harvard receive (tens of thousands of dollars per student per year) are not egalitarian.

This time, I want to examine more closely what the scholarship has to teach us.

In her book, Mettler (2014) argues that

our system of higher education not only fails to mitigate inequality but it exacerbates it.

Her book makes a comprehensive case. One of her arguments is that well-off kids attend elite schools where they receive a much better deal. Elite schools are often much better funded, in part thanks to tax breaks and other forms of government favoritism. Poorer kids tend to attend lesser schools where they get a worse deal. This argument finds an echo in Carnevale and Strohl:

Selective colleges give bigger annual subsidies for each student’s educational program. For example, in 2006 private research universities provided $15,000 in non-tuition subsidy compared with a subsidy of $6,500 by community colleges. In order to hold or increase their prestige rankings, colleges compete by using their financial and capital resources to bid for students with the highest test scores. For example, economist Gordon C. Winston at Williams College reports that Williams had offered a college education that cost about $65,000 for a net price to students of about $20,000. The $45,000 subsidy for each student allows Williams to attract the student body with the highest test scores, thereby maintaining or increasing its prestige ranking and in turn maintaining or increasing ability to attract students with high test scores.

To this, one must add an important fact: if you are a kid from the top 1%, you are almost certain to graduate from college with a degree, even if you party all the time… kids from less privileged families are much, much more likely to never graduate, and they end up with leftover debts. See also the 2015 book by Armstrong and Hamilton, Paying for the Party: How College Maintains Inequality, which I already covered.

Hershbein et al. (2015) say that you should not expect too much from education regarding inequality:

Increasing educational attainment will not significantly change overall earnings inequality. The reason is that a large share of earnings inequality is at the top of the earnings distribution, and changing college shares will not shrink those differences.

Keng et al. make an empirical observation that is not favorable to college as a force for equality of outcomes:

Between 1990 and 2014, Taiwan increased the college share of its labor force from 7 to 32 percent (…) Such a rapid surge in skill supply should suppress college wages and lower wage income inequality. Instead, wage inequality rose 7 percent since 1978. (…) The Taiwan case shows that increasing college access alone will not lower inequality.

What about the case for free tuition? Surely, that’s egalitarian? Maybe not:

(…) we study the English higher education system which has, in just two decades, moved from a free college system to one in which tuition fees are among the highest in the world. Our findings suggest that England’s shift has resulted (…) a narrowing of the participation gap between advantaged and disadvantaged students. (…) We conclude that tuition fees, at least in the English case, supported their goals of increasing quality, quantity, and equity in higher education.

Science and Technology links (November 3rd, 2017)

A paper in the journal Intelligence reports that people with high intelligence are more at risk for psychological problems. As someone who works with lots of very smart people, I am not surprised.

Atherosclerosis is a medical condition, common with age, that makes you more at risk for heart diseases and strokes. Your arteries become damaged. Lipids (fat) play a key role in this phenomenon, and it was believed until now that these lipids came from what you eat. Thus it was suggested that a low-fat diet might be good for your arteries. It turns out that these lipids may come from bacteria. That is, avoiding fats in your diet to prevent a stroke might be a waste of time.

The LA Times reports on the growing evidence linking sugars and bread (carbohydrates) with cancer.

Malaria could hold the cure for cancer. Or not.

When you suffer from a heart attack, your heart is damaged. Sadly, heart cells (cardiomyocytes) that die are usually not regenerated, though other cells are. Thus a heart attack diminishes you, permanently. However, we now have the technology to transform common cells (fibroblasts) into almost any cell we want. Thus, in theory, we could regenerate new cardiomyocytes. But we don’t quite know how to do this reliably. Nature reports on breakthrough work on this problem. The end goal would be to spur regeneration of your cardiomyocytes, and get your heart to repair itself. There is also another related paper in PNAS about mathematical models that could give us clear recipes to reprogram any cell into any other cell. What is surprising to me is how important mathematical models and software are to these endeavours.

The New York Times points out that there may not be much demand for science graduates:

Much of the public enthusiasm for STEM [science-technology-engineering-mathematics] education rests on the assumption that these fields are rich in job opportunity. (…) What recent studies have made increasingly apparent is that the greatest number of high-paying STEM jobs are in the “T” (specifically, computing). Earlier this year, Glassdoor, a jobs listing website, ranked the median base salary of workers in their first five years of employment by undergraduate major. Computer science topped the list ($70,000), followed by electrical engineering ($68,438). Biochemistry ($46,406) and biotechnology ($48,442) were among the lowest paying majors in the study. (…) At LinkedIn, researchers identified the skills most in demand. The top 10 last year were all computer skills, including expertise in cloud computing, data mining and statistical analysis, and writing smartphone applications (…) STEM advocates, often executives and lobbyists for technology companies, do a disservice when they raise the alarm that America is facing a worrying shortfall of STEM workers, based on shortages in a relative handful of fast-growing fields like data analytics, artificial intelligence, cloud computing and computer security.

What the article does not ask, and should… is why there is so little demand for people with skills in biotechnology.

In Why ‘Statistical Significance’ Is Often Insignificant, Noah Smith writes about the challenge we face in keeping scientists productive:

The real danger is that when each study represents only a very weak signal of scientific truth, science gets less and less productive. Ever more researchers and ever more studies are needed to confirm each result. This process might be one reason new ideas seem to be getting more expensive to find. (…) We need to change the incentive for researchers to prove themselves by publishing questionable studies that just end up wasting a lot of time and effort.

To sum up the problem, we reward people for writing papers about how we might cure cancer (in mice), rather than rewarding them for actually trying to cure cancer.

Amazon, the retailer, is hiring hundreds of Ph.D.s. More so than most universities.

The retail behemoth has hired nearly 500 Ph.D.s, former professors among them, since the beginning of this year to work in its applied-science and research-science units, according to company figures. The pace and scale of that hiring are far greater than those of any college or university in the country.

In Darwin’s aliens, the authors try to predict what aliens might look like and they conclude that complex aliens will be composed of a nested hierarchy of entities, with the conditions required to eliminate conflict at each of those levels. This tells us little about aliens, sadly.

Scientific American reviews genetic editing toolkits that anyone can order to start doing genetic modifications in their garage. There is speculation that it could help science:

It’s possible that people working in their garages or their kitchens will come up with a novel application or a solution to a problem that professionals just haven’t gotten around to.

That’s maybe how we get superheroes.

Daniel Lattier asks “Why Professors Are Writing Crap That Nobody Reads?”

Professors usually spend about 3-6 months (sometimes longer) researching and writing a 25-page article to submit an article to an academic journal. (…) 82 percent of articles published in the humanities are not even cited once. Of those articles that are cited, only 20 percent have actually been read. Half of academic papers are never read by anyone other than their authors, peer reviewers, and journal editors. So what’s the reason for this madness? Why does the world continue to be subjected to just under 2 million academic journal articles each year?

Kevin Drum writes about the end of work:

These are all harbingers, the way a dropping barometer signals a coming storm—not the possibility of a storm, but the inexorable reality. The two most important problems facing the human race right now are the need for widespread deployment of renewable energy and figuring out how to deal with the end of work. Everything else pales in comparison. Renewable energy already gets plenty of attention, even if half the country still denies that we really need it. It’s time for the end of work to start getting the same attention.

I wonder how well such opinion pieces will age?

There is a mysterious object that is going through our solar system at insane speeds. We don’t know what it is, though it is probably just a dumb piece of rock. Still, we have never seen anything like it: until now, we had only ever observed objects that are stuck in our solar system.