First Class Goto

March 2014

Oort is an experimental programming language I have been working on, on and off (mostly off), since 2007. It is a statically typed, object-oriented, imperative language, where classes, functions and methods can be nested arbitrarily, and where functions and methods are full closures, ie., they can be stored in variables and returned from functions. The control structures are the usual ones: if, for, while, do, goto, etc.

It also has an unusual feature: goto labels are first class.

What does it mean for labels to be first class? It means two things: (1) they are lexically scoped so that they are visible from inside nested functions. This makes it possible to jump from any point in the program to any other location that is visible from that point, even if that location is in another function. And (2) labels can be used as values: They can be passed to and returned from functions and methods, and they can be stored in data structures.

As a simple example, consider a data structure with a “foreach” method that takes a callback function and calls it for every item in the data structure. In Oort this might look like this:

table: array[person_t];

table.foreach (fn (p: person_t) -> void {
        print p.name;
        print p.age;
    });

A note about syntax. In Oort, anonymous functions are defined like this:

fn (<arguments>) -> <return type> {
    ...;
}

and variables and arguments are declared like this:

<name>: <type>

so the code above defines an anonymous function that prints the name and the age of person and passes that function to the foreach method of the table.

What if we want to stop the iteration? You could have the callback return true to stop, or you could have it throw an exception. However, both methods are a little clumsy: The first because the return value might be useful for other purposes, the second because stopping the iteration isn’t really an exceptional situation.

With lexically scoped labels there is a direct solution – just use goto to jump out of the callback:

  table.foreach (fn (p: person_t) -> void {
          print p.name;
          print p.age;

          if (p.age > 50)
              goto done;
      });

@done:

Note what’s going on here: Once we find a person older than 50, we jump out of the anonymous callback and back into the enclosing function. The git tree has a running example.

Call/cc in terms of goto

In Scheme and some other languages there is a feature called call/cc, which is famous for being both powerful and mindbending. What it does is, it takes the concept of “where we are in the program” and packages it up as a function. This function, called the continuation, is then passed to another, user-defined, function. If the user-defined function calls the continuation, the program will resume from the point where call/cc was invoked. The mindbending part is that a continuation can be stored in data structures and called multiple times, which means the call/cc invocation can in effect return more than once.

Lexically scoped labels are at least as expressive as call/cc, because if you have them, you can write call/cc as a function:

call_cc (callback: fn (k: fn()->void)) -> void
{
    callback (fn() -> void { 
        goto current_continuation;
    });

@current_continuation:
}

Let’s see what’s going on here. A function called call_cc() is defined:

call_cc (...) -> void
{
}

This function takes another function as argument:

callback: fn (...) -> void

And that function takes the continuation as an argument:

k: fn()->void

The body of call/cc calls the callback:

callback (...);

passing an anonymous function (the continuation):

    fn() -> void {
        goto current_continuation;
    }

@current_continuation:

that just jumps to the point where call_cc returns. So when callback decides to invoke the continuation, execution will resume at the point where call_cc was invoked. Since there is nothing stopping callback from storing the continuation in a data structure or from invoking it multiple times, we have the full call/cc semantics.

Cooperative thread system

One of the examples on the Wikipedia page about call/cc is a cooperative thread system. With the call_cc function above, we could directly translate the Wikipedia code into Oort, but using the second aspect of the first-class-ness of labels – that they can be stored directly in data structures – makes it possible to write a more straightforward version:

run_list: list[label] = new list[label]();

thread_fork (child: fn() -> void)
{
    run_list.append (me);
    child();
    goto run_list.pop_head();
@me:
}

thread_yield()
{
    run_list.append (me);
    goto run_list.pop_head ();
@me:
}

thread_exit()
{
    if (!run_list.is_empty())
        goto run_list.pop_head();
    else
        process_exit();
}

The run_list variable is a list of labels containing the current positions of all the active threads. The keyword label in Oort is simply a type specifier similar to string.

To create a new thread, thread_fork first saves the position of the current thread on the list, and then it calls the child function. Similarly, thread_yield yields to another thread by saving the position of the current thread and jumping to the first label on the list. Exiting a thread consists of jumping to the first thread if there is one, and exiting the process if there isn’t.

The code above doesn’t actually run because the current Oort implementation doesn’t support genericity, but here is a somewhat uglier version that actually runs, while still demonstrating the principle.

Celebrities die 2.7183 at a time

June 2013

The claim that celebrities die in threes is usually dismissed as the result of the human propensity to see patterns where there are none. But celebrities don’t die at regularly spaced intervals either. It would be very weird if a celebrity predictably died on the 14th of every month. And once you deviate from a regularly spaced pattern, some amount of clustering is inevitable. Can we make this more precise?

Rather than trying to define exactly what constitutes a celebrity, I’ll simply assume that they die at a fixed rate and that they do so independently of each other (The Day the Music Died notwithstanding). It follows that celebrity deaths is a Poisson process with intensity $\lambda$ where $\lambda$ is the number of deaths that occur in some fixed time period.

As an example, suppose we define celebrityhood in such a way that twelve celebrities die each year on average. Then $\lambda = 12/\text{year}$, and because the time between events in a Poisson process is exponentially distributed with parameter $\lambda$, the average time between two deaths is $1/\lambda$ = 1/12th year, or one month.

What does it mean for celebrities to die $n$ at a time? We will simply say that two celebrities die together if the period between their deaths is shorter than expected. If the celebrity death rate is 12/year, then two celebrities died together if their deaths were less than one month apart. Similarly, three celebrities died together if the period between death 1 and death 2 and the period between death 2 and death 3 were both shorter than a month. In general, $k$ celebrities died together if the $k - 1$ periods between their deaths were all shorter than expected.

Here is a diagram of 10 years worth of randomly generated deaths with 12 deaths per year and clusters as defined above highlighted:

Average cluster size

Suppose a celebrity has just died after a longer than average wait. This death will start a new cluster, and we want to figure out what the size of it is.

In a Poisson process the waiting time between two events is exponentially distributed with parameter $\lambda$, so it can be modelled with a stochastic variable $W \sim Exp(\lambda)$. The cluster size itself is modelled with another stochastic variable, $C$, whose distribution is derived as follows.

The cluster size will be 1 when the waiting time for the next death is larger than or equal to the average (which is $1/\lambda$ for the exponential distribution):

$\text{P}(C = 1) = \text{P}(W > 1/\lambda)$

The probability that the cluster will have size 2 is the same as the probability that the next waiting time is shorter than average and the next one after that is longer:

$\text{P}(C = 2) = \text{P}(W \le 1/\lambda)\cdot \text{P}(W > 1/\lambda)$

For size three, it’s the probability that the next two waiting times are shorter and the third one longer:

$\text{P}(C = 3) = \text{P}(W \le 1/\lambda)^2\cdot \text{P}(W > 1/\lambda)$

In general, the probability that the next cluster will be size $k$ is:

$\text{P}(C = k) = \text{P}(W \le 1/\lambda)^{k - 1}\cdot \text{P}(W > 1/\lambda)$

So what’s the average size of a Celebrity Death Cluster? The expected value of $C$ is given by:

$\displaystyle \text{E}[C] = \sum_{k=1}^\infty k \cdot \text{P}(C = k) = \sum_{k=1}^\infty k\cdot \text{P}(W \le 1/\lambda)^{k - 1}\cdot \text{P}(W > 1/\lambda)$

Plugging in the distribution function for the exponential distribution, we get:

$ \begin{align*} \text{E}[C] &= \sum_{k=1}^\infty k \cdot (1 - e^{- \lambda \cdot (1/\lambda) })^{k - 1} \cdot (1 - (1 - e^{- \lambda \cdot (1 / \lambda)}))\\ &= \sum_{k=1}^\infty k \cdot (1 - e^{- 1})^{k - 1} \cdot e^{-1} \end{align*} $

It’s not hard to show that this infinite series has sum $e$ (Hint: Use the fact that $k x^{k - 1}$ is the derivative of $x^k$), so on average, celebrities die 2.7183 at a time.

The Radix Heap

May 2013

The Radix Heap is a priority queue that has better caching behavior than the well-known binary heap, but also two restrictions: (a) that all the keys in the heap are integers and (b) that you can never insert a new item that is smaller than all the other items currently in the heap.

These restrictions are not that severe. The Radix Heap still works in many algorithms that use heaps as a subroutine: Dijkstra’s shortest-path algorithm, Prim’s minimum spanning tree algorithm, various sweepline algorithms in computational geometry.

Here is how it works. If we assume that the keys are 32 bit integers, the radix heap will have 33 buckets, each one containing a list of items. We also maintain one global value last_deleted, which is initially MIN_INT and otherwise contains the last value extracted from the queue.

The invariant is this:

The items in bucket $k$ differ from last_deleted in bit $k - 1$, but not in bit $k$ or higher. The items in bucket 0 are equal to last_deleted.

For example, if we compare an item from bucket 10 to last_deleted, we will find that bits 31–10 are equal, bit 9 is different, and bits 8–0 may or may not be different.

Here is an example of a radix heap where the last extracted value was 7:

As an example, consider the item 13 in bucket 4. The bit pattern of 7 is 0111 and the bit pattern of 13 is 1101, so the highest bit that is different is bit number 3. Therefore the item 13 belongs in bucket $3 + 1 = 4$. Buckets 1, 2, and 3 are empty, but that’s because a number that differs from 7 in bits 0, 1, or 2 would be smaller than 7 and so isn’t allowed in the heap according to restriction (b).

Operations

When a new item is inserted, it has to be added to the correct bucket. How can we compute the bucket number? We have to find the highest bit where the new item differs from last_deleted. This is easily done by XORing them together and then finding the highest bit in the result. Adding one then gives the bucket number:

bucket_no = highest_bit (new_element XOR last_deleted) + 1

where highest_bit(x) is a function that returns the highest set bit of x, or $-1$ if x is 0.

Inserting the item clearly preserves the invariant because the new item will be in the correct bucket, and last_deleted didn’t change, so all the existing items are still in the right place.

Extracting the minimum involves first finding the minimal item by walking the lowest-numbered non-empty bucket and finding the minimal item in that bucket. Then that item is deleted and last_deleted is updated. Then the bucket is walked again and all the items are redistributed into new buckets according to the new last_deleted item.

The extracted item will be the minimal one in the data structure because we picked the minimal item in the redistributed bucket, and all the buckets with lower numbers are empty. And if there were a smaller item in one of the buckets with higher numbers, it would be differing from last_deleted in one of the more significant bits, say bit $k$. But since the items in the redistributed bucket are equal to last_deleted in bit $k$, the hypothetical smaller item would then have to also be smaller than last_deleted, which it can’t be because of restriction (b) mentioned in the introduction. Note that this argument also works for two-complement signed integers.

We have to be sure this doesn’t violate the invariant. First note that all the items that are being redistributed will satisfy the invariant because they are simply being inserted. The items in a bucket with a higher number $k$ were all different from the old last_deleted in the $(k-1)$th bit. This bit must then necessarily also be different from the $(k-1)$th bit in the new last_deleted, because if it weren’t, the new last_deleted would itself have belonged in bucket $k$. And finally, since the bucket being redistributed is the lowest-numbered non-empty one, there can’t be any items in a bucket with a lower number. So the invariant still holds.

In the example above, if we extract the two ‘7’s from bucket 0 and the ‘8’ from bucket 4, the new heap will look like this:

Notice that bucket 4, where the ‘8’ came from, is now empty.

Performance

Inserting into the radix heap takes constant time because all we have to do is add the new item to a list. Determining the highest set bit can be done in constant time with an instruction such as bsr.

The performance of extraction is dominated by the redistribution of items. When a bucket is redistributed, it ends up being empty. To see why, remember that all the items are different from last_deleted in the $(k - 1)$th bit. Because the new last_deleted comes from bucket $k$, the items are now all equal to last_deleted in the $(k - 1)th$ bit. Hence they will all be redistributed to a lower-numbered bucket.

Now consider the life-cycle of a single element. In the worst case it starts out being added to bucket 31 and every time it is redistributed, it moves to a lower-numbered bucket. When it reaches bucket 0, it will be next in line for extraction. It follows that the maximum number of redistributions that an element can experience is 31.

Since a redistribution takes constant time per element distributed, and since an element will only be redistributed $d$ times, where $d$ is the number of bits in the element, it follows that the amortized time complexity of extraction is $O(d)$. In practice we will often do better though, because most items will not move through all the buckets.

Caching performance

Some descriptions of the radix heap recommend implementing the buckets as doubly linked lists, but that would be a mistake because linked lists have terrible cache locality. It is better to implement them as dynamically growing arrays. If you do that, the top of the buckets will tend to be hot which means the per-item number of cache misses during redistribution of a bucket will tend to be $O(1/B)$, where $B$ is the number of integers in a cache line. This means the amortized cache-miss complexity of extraction will be closer to $O(d/B)$ than to $O(d)$.

In a regular binary heap, both insertion and extraction require $\Theta(\log n)$ swaps in the worst case, and each swap (except for those very close to the top of the heap) will cause a cache miss.

In other words, if $d = \Theta(\log n)$, extraction from a radix heap will tend to generate $\Theta(\log n / B)$ cache misses, where a binary heap will require $\Theta(\log n)$.

Porter/Duff Compositing and Blend Modes

March 2013

In the Porter/Duff compositing algebra, images are equipped with an alpha channel that determines on a per-pixel basis whether the image is there or not. When the alpha channel is 1, the image is fully there, when it is 0, the image isn’t there at all, and when it is in between, the image is partially there. In other words, the alpha channel describes the shape of the image, it does not describe opacity. The way to think of images with an alpha channel is as irregularly shaped pieces of cardboard, not as colored glass.

Consider these two images:

When we combine them, each pixel of the result can be divided into four regions:

One region where only the source is present, one where only the destination is present, one where both are present, and one where neither is present.

By deciding on what happens in each of the four regions, various effects can be generated. For example, if the destination-only region is treated as blank, the source-only region is filled with the source color, and the ‘both’ region is filled with the destination color like this:

The effect is as if the destination image is trimmed to match the source image, and then held up in front of it:

The Porter/Duff operator that does this is called “Dest Atop”.

There are twelve of these operators, each one characterized by its behavior in the three regions: source, destination and both. The ‘neither’ region is always blank. The source and destination regions can either be blank or filled with the source or destination colors respectively.

The formula for the operators is a linear combination of the contents of the four regions, where the weights are the areas of each region:

$A_\text{src} \cdot [s] + A_\text{dest} \cdot [d] + A_\text{both} \cdot [b]$

Where $[s]$ is either 0 or the color of the source pixel, $[d]$ either 0 or the color of the destination pixel, and $[b]$ is either 0, the color of the source pixel, or the color of the destination pixel. With the alpha channel being interpreted as coverage, the areas are given by these formulas:

$A_\text{src} = \alpha_\text{s} \cdot (1 - \alpha_\text{d})\\ A_\text{dst} = \alpha_\text{d} \cdot (1 - \alpha_\text{s})\\ A_\text{both} = \alpha_\text{s} \cdot \alpha_\text{d}$

The alpha channel of the result is computed in a similar way:

$A_\text{src} \cdot [\text{as}] + A_\text{dest} \cdot [\text{ad}] + A_\text{both} \cdot [\text{ab}]$

where $[\text{as}]$ and $[\text{ad}]$ are either 0 or 1 depending on whether the source and destination regions are present, and where $[\text{ab}]$ is 0 when the ‘both’ region is blank, and 1 otherwise.

Here is a table of all the Porter/Duff operators:

	$[\text{s}]$	$[\text{d}]$	$[\text{b}]$
Src	$s$	$0$	s
Atop	$0$	$d$	s
Over	$s$	$d$	s
In	$0$	$0$	s
Out	$s$	$0$	$0$
Dest	$0$	$d$	d
DestAtop	$s$	$0$	d
DestOver	$s$	$d$	d
DestIn	$0$	$0$	d
DestOut	$0$	$d$	$0$
Clear	$0$	$0$	$0$
Xor	$s$	$d$	$0$

And here is how they look:

Despite being referred to as alpha blending and despite alpha often being used to model opacity, in concept Porter/Duff is not a way to blend the source and destination shapes. It is way to overlay, combine and trim them as if they were pieces of cardboard. The only place where source and destination pixels are actually blended is along the antialiased edges.

Blending

Photoshop and the Gimp have a concept of layers which are images stacked on top of each other. In Porter/Duff, stacking images on top of each other is done with the “Over” operator, which is also what Photoshop/Gimp use by default to composite layers:

Conceptually, two pieces of cardboard are held up with one in front of the other. Neither shape is trimmed, and in places where both are present, only the top layer is visible.

A layer in these programs also has an associated Blend Mode which can be used to modify what happens in places where both are visible. For example, the ‘Color Dodge’ blend mode computes a mix of source and destination according to this formula:

$ \begin{equation*} B(s,d)= \begin{cases} 0 & \text{if $d=0$,} \\ 1 & \text{if $d \ge (1 - s)$,} \\ d / (1 - s) & \text{otherwise} \end{cases} \end{equation*} $

The result is this:

Unlike with the regular Over operator, in this case there is a substantial chunk of the output where the result is actually a mix of the source and destination.

Layers in Photoshop and Gimp are not tailored to each other (except for layer masks, which we will ignore here), so the compositing of the layer stack is done with the source-only and destination-only region set to source and destination respectively. However, there is nothing in principle stopping us from setting the source-only and destination-only regions to blank, but keeping the blend mode in the ‘both’ region, so that tailoring could be supported alongside blending. For example, we could set the ‘source’ region to blank, the ‘destination’ region to the destination color, and the ‘both’ region to ColorDodge:

Here are the four combinations that involve a ColorDodge blend mode:

In this model the original twelve Porter/Duff operators can be viewed as the results of three simple blend modes:

Source:	$B(s, d) = s$
Dest:	$B(s, d) = d$
Zero:	$B(s, d) = 0$

In this generalization of Porter/Duff the blend mode is chosen from a large set of formulas, and each formula gives rise to four new compositing operators characterized by whether the source and destination are blank or contain the corresponding pixel color.

Here is a table of the operators that are generated by various blend modes:

The general formula is still an area weighted average:

$A_\text{src} \cdot [s] + A_\text{dest} \cdot [d] + A_\text{both}\cdot B(s, d)$

where [s] and [d] are the source and destination colors respectively or 0, but where $B(s, d)$ is no longer restricted to one of $0$, $s$, and $d$, but can instead be chosen from a large set of formulas.

The output of the alpha channel is the same as before:

$A_\text{src} \cdot [\text{as}] + A_\text{dest} \cdot [\text{ad}] + A_\text{both} \cdot [\text{ab}]$

except that [ab] is now determined by the blend mode. For the Zero blend mode there is no coverage in the both region, so [ab] is 0; for most others, there is full coverage, so [ab] is 1.

Big-O Misconceptions

October 2012

In computer science and sometimes mathematics, big-O notation is used to talk about how quickly a function grows while disregarding multiplicative and additive constants. When classifying algorithms, big-O notation is useful because it lets us abstract away the differences between real computers as just multiplicative and additive constants.

Big-O is not a difficult concept at all, but it seems to be common even for people who should know better to misunderstand some aspects of it. The following is a list of misconceptions that I have seen in the wild.

But first a definition: We write

$f(n) = O(g(n))$

when $f(n) \le M g(n)$ for sufficiently large $n$, for some positive constant $M$.

Misconception 1: “The Equals Sign Means Equality”

The equals sign in

$f(n) = O(g(n))$

is a widespread travestry. If you take it at face value, you can deduce that since $5 n$ and $3 n$ are both equal to $O(n)$, then $3 n$ must be equal to $5 n$ and so $3 = 5$.

The expression $f(n) = O(g(n))$ doesn’t type check. The left-hand-side is a function, the right-hand-side is a … what, exactly? There is no help to be found in the definition. It just says “we write” without concerning itself with the fact that what “we write” is total nonsense.

The way to interpret the right-hand side is as a set of functions:

$ O(f) = \{ g \mid g(n) \le M f(n) \text{ for some $M > 0$ for large $n$}\}. $

With this definition, the world makes sense again: If $f(n) = 3 n$ and $g(n) = 5 n$, then $f \in O(n)$ and $g \in O(n)$, but there is no equality involved so we can’t make bogus deductions like $3=5$. We can however make the correct observation that $O(n) \subseteq O(n \log n)\subseteq O(n^2) \subseteq O(n^3)$, something that would be difficult to express with the equals sign.

Misconception 2: “Informally, Big-O Means ‘Approximately Equal’"

If an algorithm takes $5 n^2$ seconds to complete, that algorithm is $O(n^2)$ because for the constant $M=7$ and sufficiently large $n$, $5 n^2 \le 7 n^2$. But an algorithm that runs in constant time, say 3 seconds, is also $O(n^2)$ because for sufficiently large $n$, $3 \le n^2$.

So informally, big-O means approximately less than or equal, not approximately equal.

If someone says “Topological Sort, like other sorting algorithms, is $O(n \log n)$", then that is technically correct, but severely misleading, because Toplogical Sort is also $O(n)$ which is a subset of $O(n \log n)$. Chances are whoever said it meant something false.

If someone says “In the worst case, any comparison based sorting algorithm must make $O(n \log n)$ comparisons” that is not a correct statement. Translated into English it becomes:

“In the worst case, any comparison based sorting algorithm must make fewer than or equal to $M n \log (n)$ comparisons”

which is not true: You can easily come up with a comparison based sorting algorithm that makes more comparisons in the worst case.

To be precise about these things we have other types of notation at our disposal. Informally:

$O()$: Less than or equal, disregarding constants

$\Omega()$: Greater than or equal, disregarding constants

$o()$: Stricly less than, disregarding constants

$\Theta()$: Equal to, disregarding constants

and some more. The correct statement about lower bounds is this: “In the worst case, any comparison based sorting algorithm must make $\Omega(n \log n)$ comparisons”. In English that becomes:

“In the worst case, any comparison based sorting algorithm must make at least $M n \log (n)$ comparisons”

which is true. And a correct, non-misleading statement about Topological Sort is that it is $\Theta(n)$, because it has a lower bound of $\Omega(n)$ and an upper bound of $O(n)$.

Misconception 3: “Big-O is a Statement About Time”

Big-O is used for making statements about functions. The functions can measure time or space or cache misses or rabbits on an island or anything or nothing. Big-O notation doesn’t care.

In fact, when used for algorithms, big-O is almost never about time. It is about primitive operations.

When someone says that the time complexity of MergeSort is $O(n \log n)$, they usually mean that the number of comparisons that MergeSort makes is $O(n \log n)$. That in itself doesn’t tell us what the time complexity of any particular MergeSort might be because that would depend how much time it takes to make a comparison. In other words, the $O(n \log n)$ refers to comparisons as the primitive operation.

The important point here is that when big-O is applied to algorithms, there is always an underlying model of computation. The claim that the time complexity of MergeSort is $O(n \log n)$, is implicitly referencing a model of computation where a comparison takes constant time and everything else is free.

Which is fine as far as it goes. It lets us compare MergeSort to other comparison based sorts, such as QuickSort or ShellSort or BubbleSort, and in many real situations, comparing two sort keys really does take constant time.

However, it doesn’t allow us to compare MergeSort to RadixSort because RadixSort is not comparison based. It simply doesn’t ever make a comparison between two keys, so its time complexity in the comparison model is 0. The statement that RadixSort is $O(n)$ implicitly references a model in which the keys can be lexicographically picked apart in constant time. Which is also fine, because in many real situations, you actually can do that.

To compare RadixSort to MergeSort, we must first define a shared model of computation. If we are sorting strings that are $k$ bytes long, we might take “read a byte” as a primitive operation that takes constant time with everything else being free.

In this model, MergeSort makes $O(n \log n)$ string comparisons each of which makes $O(k)$ byte comparisons, so the time complexity is $O(k\cdot n \log n)$. One common implementation of RadixSort will make $k$ passes over the $n$ strings with each pass reading one byte, and so has time complexity $O(n k)$.

Misconception 4: “Big-O Is About Worst Case”

Big-O is often used to make statements about functions that measure the worst case behavior of an algorithm, but big-O notation doesn’t imply anything of the sort.

If someone is talking about the randomized QuickSort and says that it is $O(n \log n)$, they presumably mean that its expected running time is $O(n \log n)$. If they say that QuickSort is $O(n^2)$ they are probably talking about its worst case complexity. Both statements can be considered true depending on what type of running time the functions involved are measuring.

Sysprof 1.2.0

September 2012

A new stable release of Sysprof is now available. Download version 1.2.0.

Over is not Translucency

September 2011

The Porter/Duff Over operator, also known as the “Normal” blend mode in Photoshop, computes the amount of light that is reflected when a pixel partially covers another:

The fraction of bg that is covered is denoted alpha. This operator is the correct one to use when the foreground image is an opaque mask that partially covers the background:

Red mask on blue background

A photon that hits this image will be reflected back to your eyes by either the foreground or the background, but not both. For each foreground pixel, the alpha value tells us the probability of each:

$a \cdot \text{fg} + (1 - a) \cdot \text{bg}$

This is the definition of the Porter/Duff Over operator for non-premultiplied pixels.

But if alpha is interpreted as translucency, then the Over operator is not the correct one to use. The Over operator will act as if each pixel is partially covering the background:

Which is not how translucency works. A translucent material reflects some light and lets other light through. The light that is let through is reflected by the background and interacts with the foreground again.

Let’s look at this in more detail. Please follow along in the diagram to the right. First with probability $a$, the photon is reflected back towards the viewer:

$\displaystyle \begin{align*} &a \cdot \text{fg} \end{align*} $

With probability $(1 - a)$, it passes through the foreground, hits the background, and is reflected back out. The photon now hits the backside of the foreground pixel. With probability $(1 - a)$, the foreground pixel lets the photon back out to the viewer. The result so far:

$\displaystyle \begin{align*} a\cdot \text{fg} &+(1 - a) \cdot \text{bg} \cdot (1 - a) \end{align*} $

But we are not done yet, because with probability $a$ the foreground pixel reflects the photon once again back towards the background pixel. There it will be reflected, hit the backside of the foreground pixel again, which lets it through to our eyes with probability $(1 - a)$. We get another term where the final $(1 - a)$ is replaced with $a \cdot \text{fg} \cdot \text {bg} \cdot (1 - a)$:

$\displaystyle \begin{align*} a\cdot \text{fg} &+(1 - a) \cdot \text{bg} \cdot (1 - a)\\ &+(1 - a) \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot (1 - a) \end{align*} $

And so on. In each round, we gain another term which is identical to the previous one, except that it has an additional $a \cdot \text{fg} \cdot \text{bg}$ factor:

$\displaystyle \begin{align*} a\cdot \text{fg} &+(1 - a) \cdot \text{bg} \cdot (1 - a)\\ &+(1 - a) \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot (1 - a)\\ &+(1 - a) \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot (1 - a) \\ &+\cdots \end{align*} $

or more compactly:

$\displaystyle \begin{align*} &a \cdot \text{fg} + (1 - a)^2 \cdot \text{bg} \cdot \sum_{i=0}^\infty (a \cdot \text{fg} \cdot \text{bg})^i \end{align*} $

Because we are dealing with pixels, both $a$, $\text{fg}$, and $\text{bg}$ are less than 1, so the sum is a geometric series:

$\displaystyle \begin{align*} &\sum_{i=0}^\infty x^i = \frac{1}{1 - x} \end{align*} $

Putting them together, we get:

$\displaystyle \begin{align*} &a \cdot \text{fg} + \frac{(1 - a)^2 \cdot bg}{1 - a \cdot \text{fg} \cdot \text{bg}} \end{align*} $

I have sidestepped the issue of premultiplication by assuming that background alpha is 1. The calculations with premultipled colors are similar, and for the color components, the result is simply:

$\displaystyle \begin{align*} &r = \text{fg} + \frac{(1 - a_\text{fg})^2 \cdot \text{bg}}{1 - \text{fg}\cdot\text{bg}} \end{align*} $

The issue of destination alpha is more complicated. With the Over operator, both foreground and background are opaque masks, so the light that survives both has the same color as the input light. With translucency, the transmitted light has a different color, which means the resulting alpha value must in principle be different for each color component. But that’s not possible for ARGB pixels. A similar argument to the above shows that the resulting alpha value would be:

$\displaystyle \begin{align*} &r = 1 - \frac{(1 - a)\cdot (1 - b)}{1 - \text{fg} \cdot \text{bg}} \end{align*} $

where $b$ is the background alpha. The problem is the dependency on $\text{fg}$ and $\text{bg}$. If we simply assume for the purposes of the alpha computation that $\text{fg}$ and $\text{bg}$ are equal to $a$ and $b$, we get this:

$\displaystyle \begin{align*} &r = 1 - \frac{(1 - a)\cdot (1 - b)}{1 - a \cdot b} \end{align*} $

which is equal to

$\displaystyle \begin{align*} &a + \frac{(1 - a)^2 \cdot b}{1 - a \cdot b} \end{align*} $

Ie., exactly the same computation as the one for the color channels. So we can define the Translucency Operator as this:

$\displaystyle \begin{align*} r = \text{fg} + \frac{(1 - a)^2 \cdot \text{bg}}{1 - \text{fg} \cdot \text{bg}} \end{align*} $

for all four channels.

Here is an example of what the operator looks like. The image below is what you will get if you use the Over operator to implement a selection rectangle. Mouse over to see what it would look like if you used the Translucency operator.

Both were computed in linear RGB. Typical implementations will often compute the Over operator in sRGB, so that’s what see if you actually select some icons in Nautilus. If you want to compare all three, open these in tabs:

And for good measure, even though it makes zero sense to do this,

Translucency, in sRGB

Gamma Correction vs. Premultiplied Pixels

August 2011

Pixels with 8 bits per channel are normally sRGB encoded because that allocates more bits to darker colors where human vision is the most sensitive. (Actually, it’s really more of a historical accident, but sRGB nevertheless remains useful for this reason). The relationship between sRGB and linear RGB is that you get an sRGB pixel by raising each component of a linear pixel to the power of $1/2.2$.

It is common for graphics software to perform alpha blending directly on these sRGB pixels using alpha values that are linearly coded (ie., an alpha value of 0 means no coverage, 0.5 means half coverage, and 1 means full coverage). Because alpha blending is best done with premultiplied pixels, such systems store pixels in this format:

$\left[\,\alpha,\enspace\alpha \cdot \text{R}^{1/2.2},\enspace\alpha \cdot \text{G}^{1/2.2},\enspace\alpha \cdot \text{B}^{1/2.2}\,\right]$,

that is, the alpha channel is linearly coded, while the R, G, and B channels are first sRGB coded, then premultiplied with the linear alpha. This works well as long as you are happy with blending in sRGB. And if you discard the alpha channel of such pixels and display them directly on a monitor, it will look as if the pixels were alpha blended (in sRGB space) on top of a black background, which is the desired result.

But what if you want to blend in linear RGB? If you use the format above, some expensive conversions will be required. To convert to premultiplied linear, you have to first divide by alpha, then raise each color to 2.2, then multiply by alpha. To convert back, you must divide by alpha, raise to $1/2.2$, then multiply with alpha.

Those conversions can be avoided if you store the pixels linearly, ie., keeping the premultiplication, but coding red, green, and blue linearly instead of as sRGB:

$\left[\,\alpha,\enspace\alpha \cdot \text{R},\enspace\alpha \cdot \text{G},\enspace\alpha \cdot \text{B}\,\right]$.

This makes blending fast, but 8 bits per channel is no longer good enough. Without the sRBG encoding, too much resolution will be lost in darker tones. And to display these pixels on a monitor, they have to first be converted to sRGB, either manually, or, if the video card can scan them out directly, by setting the gamma ramp appropriately.

Can we get the best of both worlds? Yes. The format to use is this:

$\left[\,\alpha,\enspace (\alpha \cdot \text{R})^{1/2.2},\enspace (\alpha \cdot \text{G})^{1/2.2},\enspace \left(\alpha \cdot \text{B}\right)^{1/2.2}\,\right]$,

The alpha channel is stored linearly, and the color channels are first premultiplied with the linear alpha, then raised to $1/2.2$.

With this format, 8 bits per channel is sufficient. Discarding the alpha channel and displaying the pixels directly on a monitor will look as if the pixels were alpha blended (in linear space) against black, as desired.

You can convert to linear RGB simply by raising the R, G, and B components to 2.2, and back by raising to $1/2.2$. Or, if you feel like cheating, use an exponent of 2 so that the conversions become a multiplication and a square root respectively.

This is also the pixel format to use with texture samplers that implement the sRGB OpenGL extensions (textures and framebuffers). These extensions say precisely that the R, G, and B components are raised to 2.2 before texture filtering, and raised to 1/2.2 after the final raster operation.

Sysprof 1.1.8

July 2011

A new version 1.1.8 of Sysprof is out.

This is a release candidate for 1.2.0 and contains mainly bug fixes.

Fast Multiplication of Normalized 16 bit Numbers with SSE2

July 2011

If you are compositing pixels with 16 bits per component, you often need this computation:

uint16_t a, b, r;

r = (a * b + 0x7fff) / 65535;

There is a well-known way to do this quickly without a division:

uint32_t t;

t = a * b + 0x8000;

r = (t + (t >> 16)) >> 16;

Since we are compositing pixels we want to do this with SSE2 instructions, but because the code above uses 32 bit arithmetic, we can only do four operations at a time, even though SSE registers have room for eight 16 bit values. Here is a direct translation into SSE2:

a = punpcklwd (a, 0);
b = punpcklwd (b, 0);
a = pmulld (a, b);
a = paddd (a, 0x8000);
b = psrld (a, 16);
a = paddd (a, b);
a = psrld (a, 16);
a = packusdw (a, 0);

But there is another way that better matches SSE2:

uint16_t lo, hi, t, r;

hi = (a * b) >> 16;
lo = (a * b) & 0xffff;

t = lo >> 15;
hi += t;
t = hi ^ 0x7fff;

if ((int16_t)lo > (int16_t)t)
    lo = 0xffff;
else
    lo = 0x0000;

r = hi - lo;

This version is better because it avoids the unpacking to 32 bits. Here is the translation into SSE2:

t = pmulhuw (a, b);
a = pmullw (a, b);
b = psrlw (a, 15);
t = paddw (t, b);
b = pxor (t, 0x7fff);
a = pcmpgtw (a, b);
a = psubw (t, a);

This is not only shorter, it also makes use of the full width of the SSE registers, computing eight results at a time.

Unfortunately SSE2 doesn’t have 8-bit variants of pmulhuw, pmullw, and psrlw, so we can’t use this trick for the more common case where pixels have 8 bits per component.

Exercise: Why does the second version work?

Søren Sandmann Pedersen