It is not uncommon that we need to represent an array of Boolean (true or false) values. There are multiple ways to do it.
The most natural way could be to construct an array of booleans (the native Java type). It is likely that when stored in an array, Java uses a byte per value.
boolean[] array = new boolean[listSize]; for(int k = 0; k < listSize; k++) { array[k] = ((k & 1) == 0) ? true : false; }
You may also use a byte type:
byte[] array = new byte[listSize]; for(int k = 0; k < listSize; k++) { array[k] = ((k & 1) == 0) ? (byte)1 : (byte)0; }
You can get more creative and you could do it using an array of strings:
String[] array = new String[listSize]; for(int k = 0; k < listSize; k++) { array[k] = ((k & 1) == 0) ? "Found" : "NotFound"; }
In theory, Java could optimize the array so that it requires only one bit per entry. In practice, each reference to a string value will use either 32 bits or 64 bits. The string values themselves use extra memory, but Java is probably smart enough not to store multiple times in memory the string “Found”. It might store it just once.
And then you can do it using a BitSet, effectively using about a bit per value:
BitSet bitset = new BitSet(listSize); for(int k = 0; k < listSize; k++) { if((k & 1) == 0) { bitset.set(k); } }
The BitSet has tremendous performance advantages: low memory usage, fancy algorithms that benefit from word-level parallelism, and so forth.
Typically, you do not just construct such an array, you also use it. But let us say that I just want to construct it as fast as possible, how do these techniques differ? I am going to use 64K array with OpenJDK 8 on an Apple M1 processor.
My source code is available. In my benchmark, the content of the arrays is known at compile time which is an optimistic case (the compiler could just precompute the results!). My results are as follow:
boolean | 23 us |
byte | 23 us |
String | 60 us |
BitSet | 50 us |
You may divide by 65536 to get the cost in nanoseconds per entry. You may further divide by 3.2GHz to get the number of cycles per entry.
We should not be surprised that the boolean (and byte) arrays are fastest. It may require just one instruction to set the value. The BitSet is about 3 times slower due to bit manipulations. It will also use 8 times less memory.
I was pleasantly surprised by the performance of the String approach. It will use between 4 and 8 times more memory than the simple array, and thus 32 to 64 times more memory than the BitSet approach, but it is reasonably competitive with the BitSet approach. But we should not be surprised. The string values are known at compile-time. Storing a reference to a string should be more computationally expensive than storing a byte value. These numbers tell us that Java can and will optimize String assignments.
I would still disapprove strongly of the use of String instances to store Boolean values. Java may not be able to always optimize away the computational overhead of class instances.
Furthermore, if you do not care about the extra functionality of the BitSet class, with its dense representation, then an array of boolean values (the native type) is probably quite sane.
Related is the use of a string constant to get runtime initialization cost of nothing – the string constant is stored directly in the class file. So your string might look something like “0100111101010” and you’d just mask off the low-order bit. This ought to have an initialization cost near 0.
private final byte[] FTW
Actually, I was thinking boolean[], lol.
The Java language spec does guarantee that two strings that are identical and originate from literals in the source, are represented by the same object at runtime. That is called interning in Java, see the documentation of java.lang.String.intern() and https://docs.oracle.com/javase/specs/jls/se16/jls16.pdf Chapter 12.5, first bullet point). If the same string is read from a file, that string may be a different object instance.
The newer garbage collectors of the JVM even try to find identical strings to merge them into one object.
Unfortunately not. They try to merge the underlying storage (morally
char[]
). They don’t try to merge the objects. This is not as good, because in many cases strings are short and the object overhead (14-20 bytes) dominates the size.Afaiu the jvm cannot merge the objects: The somewhat ill-considered java semantics allow to distinguish whether two strings have equal contents (.equals / .hashCode) or are identical (== / System.identityHashCode), also in context of synchronization.
So there would need to be some change in java semantics in order to permit the JVM to dedup objects. And java never breaks old promises from simpler times 🙁