Concurrency and Synchronization

Emil Sekerinski, McMaster University, Fall 2019


Specifying Concurrency

The assignment statement, sequential composition, conditional statement, and repetitive statement are universal in the sense that any sequential program can be expressed through those. For concurrent programs, we need to add means to express:

  • parallel composition
  • mutual exclusion
  • condition synchronization

When reasoning about concurrency, it is natural to think in terms of these. Programming languages commonly do not support parallel composition, mutual exclusion, and condition synchronization in full generality, but their constructs can be understood in terms of these notions.

Parallel Composition

Statements execute in a sequence of atomic steps. Such sequences are called traces. These can be either finite or infinite. The behaviour of a statement is given by the set of all traces.

The parallel composition S₁ ‖ S₂ executes S₁ and S₂ "in parallel" by interleaving (merging) their traces; formally, σ is a trace of S₁ ‖ S₂ if σ projected onto S₁ steps is a trace of S₁ and projected onto S₂ steps is a trace of S₂.

In state diagrams, parallel composition is visualized by a dashed line; transitions are atomic steps:

[state diagram of a parallel composition]

The composition S₁ ‖ S₂ terminates when both S₁ and S₂ terminate. For example:

| Statement | Behaviour |
| --- | --- |
| A | {➀ ➁} |
| B | {❶ ❷} |
| A ‖ B | {➀ ➁ ❶ ❷, ➀ ❶ ➁ ❷, ➀ ❶ ❷ ➁, ❶ ➀ ➁ ❷, ❶ ➀ ❷ ➁, ❶ ❷ ➀ ➁} |

Question: If ➀ = x := 1, ➁ = x := 2, ❶ = y := 1, ❷ = y := 2, what is the final state?


Answer: x = 2 ∧ y = 2

The statement fork P, where P is a procedure name, creates a new process that executes P concurrently with the process creating it. The statement join P waits for that process to terminate. For example:

fork P
x := 1
join P

procedure P
    y := 2
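
In Go, for instance, fork corresponds to the go statement; Go has no join, so it must be programmed, e.g. with a channel. A sketch of the example above (the done channel is an illustrative device, not part of the original):

```go
package main

import "fmt"

var x, y int

func main() {
    done := make(chan struct{}) // signals termination of P
    go func() {                 // fork P
        y = 2 // body of procedure P
        close(done)
    }()
    x = 1
    <-done            // join P
    fmt.Println(x, y) // 1 2
}
```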

Any ‖ statement can be equivalently expressed by fork and join:

S₁ ‖ … ‖ Sₙ

is equivalent to

fork P₁ ; … ; fork Pₙ
join P₁ ; … ; join Pₙ
procedure Pᵢ
    Sᵢ

Unlike the parallel composition, fork statements allow an arbitrary number of processes to be started, but introduce a parent-child relation among the processes.

The process declaration executes the declared process concurrently with any other declared processes.

process p
    S

Process declarations can be equivalently expressed with ‖ or with fork and join. For example

process setX
    x := 1
process setY
    y := 2

is equivalent to:

setX ‖ setY
procedure setX
    x := 1
procedure setY
    y := 2

The par statement is a generalization of ‖ in the same way as for is a generalization of ;:

par x = E to F do S

Here, x is a constant local to the par statement. For example, vectors a, b : array 1 .. N of integer can be added in parallel:

par i = 1 to N do
    b(i) := b(i) + a(i)

This can be equivalently expressed with fork and join:

for i = 1 to N do fork P(i)
for i = 1 to N do join P(i)
procedure P(i: 1 .. N)
    b(i) := b(i) + a(i)

Dedicated languages or language extensions for concurrent and for numeric computing have par statements (Matlab) or process declarations (called tasks in Ada); Python, Java, and C/Pthreads have only fork and join. Go has only fork (called go). None of these languages has ‖ as an operator.
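
To illustrate, here is a sketch of the par vector addition in Go, with one goroutine per element and a sync.WaitGroup standing in for the joins (the concrete arrays are illustrative):

```go
package main

import (
    "fmt"
    "sync"
)

func main() {
    a := []int{1, 2, 3, 4}
    b := []int{10, 20, 30, 40}
    var wg sync.WaitGroup
    for i := range b { // par i = 1 to N do ...
        wg.Add(1)
        go func(i int) { // fork P(i)
            defer wg.Done()
            b[i] = b[i] + a[i] // each process writes a distinct element
        }(i)
    }
    wg.Wait()      // join P(1) ; ... ; join P(N)
    fmt.Println(b) // [11 22 33 44]
}
```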

For a, b, c: array 0 .. N - 1, 0 .. N - 1 of float, sequential matrix multiplication is expressed by:

for i = 0 to N - 1 do
    for j = 0 to N - 1 do
        c(i, j) := 0.0
        for k = 0 to N - 1 do
            c(i, j) := c(i, j) + a(i, k) × b(k, j)

Question: Which of the three for can be turned into a par?

Answer: Only the outer two can; the iterations of the innermost for over k all update the same element c(i, j) and would interfere:

par i = 0 to N - 1 do
    par j = 0 to N - 1 do
        c(i, j) := 0.0
        for k = 0 to N - 1 do
            c(i, j) := c(i, j) + a(i, k) × b(k, j)

Aside: It would be more systematic to write seq instead of for for generalized sequential composition, as in occam, a predecessor of Go.

The previous program creates N² processes, which may be far more than there are processors; in some languages, the overhead of process creation may outweigh the benefit of parallelism. A solution is to use one worker process per strip.

Let P be the number of worker processes and N a multiple of P:

procedure worker(w: 1 .. P)
    const first = (w - 1) × N div P
    const last = first + N div P - 1
    for i = first to last do
        for j = 0 to N - 1 do
            c(i, j) := 0.0
            for k = 0 to N - 1 do
                c(i, j) := c(i, j) + a(i, k) × b(k, j)

par w = 1 to P do worker(w)

Matrix multiplication is an example of "embarrassing parallelism", as no synchronization is needed: all processes assign to different array elements.
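
A sketch of the strip scheme in Go, assuming as above that N is a multiple of P (the concrete values of N, P and the initialization are illustrative):

```go
package main

import (
    "fmt"
    "sync"
)

const N, P = 4, 2 // N must be a multiple of P

func main() {
    var a, b, c [N][N]float64
    for i := 0; i < N; i++ {
        for j := 0; j < N; j++ {
            a[i][j], b[i][j] = 1.0, float64(i+j)
        }
    }
    var wg sync.WaitGroup
    for w := 1; w <= P; w++ { // par w = 1 to P do worker(w)
        wg.Add(1)
        go func(w int) {
            defer wg.Done()
            first := (w - 1) * N / P // this worker's strip of rows
            last := first + N/P - 1
            for i := first; i <= last; i++ {
                for j := 0; j < N; j++ {
                    c[i][j] = 0.0
                    for k := 0; k < N; k++ {
                        c[i][j] += a[i][k] * b[k][j]
                    }
                }
            }
        }(w)
    }
    wg.Wait()
    fmt.Println(c)
}
```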

Mutual Exclusion

In algorithms, atomic steps are indicated by atomicity brackets. These can be placed around expressions and statements, for example:

⟨x := x + 1⟩ ‖ ⟨x := x + 2⟩
⟨x⟩ := ⟨x + 1⟩ ‖ ⟨x⟩ := ⟨x + 2⟩

Atomicity brackets cannot be nested. Auxiliary variables ("registers") are needed to express atomicity in state diagrams. The statements above are represented by:

[state diagrams]

When atomicity brackets are left out, implicitly only access to individual basic variables and array elements of basic types is atomic:

x := x + 1    =    ⟨x⟩ := ⟨x⟩ + ⟨1⟩

If processes run by multiplexing on a single processor, atomicity can be implemented by disabling and enabling interrupts: any switch to another process is caused by a timer interrupt at the end of a process' time slice or by an interrupt from an external event. Care is needed to ensure that the atomic regions don't last too long and other processes are not unduly delayed or external events lost.

If multiple processors (cores) are available, ensuring atomicity by suspending processors would be too drastic. Instead, the hardware guarantees that certain operations are atomic. Typically, reading and writing a word from and to memory is atomic. Programming languages reflect this: for example, in Java access to int, boolean, float, and pointers, which are all word-sized, is atomic, but not access to double and long. We continue to use atomicity brackets as a design notation and return to techniques for implementing atomic regions.
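
The difference between atomic word access and an atomic increment can be observed directly. In the Go sketch below (the iteration counts are illustrative), two processes increment a plain variable and an atomic one; the plain read-modify-write loses updates:

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var plain int64          // x := x + 1 on this is ⟨x⟩ := ⟨x⟩ + 1: several atomic steps
    var counted atomic.Int64 // Add is ⟨x := x + 1⟩: one atomic step
    var wg sync.WaitGroup
    for p := 0; p < 2; p++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 100000; i++ {
                plain++ // interleavings may lose updates (deliberately a data race in Go)
                counted.Add(1)
            }
        }()
    }
    wg.Wait()
    fmt.Println(plain, counted.Load()) // plain is likely below 200000; counted is exactly 200000
}
```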

Question: Which of these algorithms for computing the maximum of a: array 1 .. N of integer is correct?

m := a(1)
for i = 2 to N do if a(i) > m then m := a(i)

m := a(1)
par i = 2 to N do if ⟨a(i)⟩ > ⟨m⟩ then ⟨m⟩ := ⟨a(i)⟩

m := a(1)
par i = 2 to N do ⟨if a(i) > m then m := a(i)⟩

m := a(1)
par i = 2 to N do if ⟨a(i) > m⟩ then ⟨m := a(i)⟩

m := a(1)
par i = 2 to N do
    if a(i) > m then ⟨if a(i) > m then m := a(i)⟩

Answer: Correct, incorrect, correct (but less efficient than sequential version), incorrect, correct (and potentially faster than sequential version).
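
The last variant tests cheaply outside the atomic region and tests again inside it. A sketch in Go, using a mutex for the atomic region and an atomic variable for m so that the unprotected pre-test is race-free (the random test data is illustrative):

```go
package main

import (
    "fmt"
    "math/rand"
    "sync"
    "sync/atomic"
)

func main() {
    a := make([]int64, 100)
    for i := range a {
        a[i] = rand.Int63n(1000)
    }
    var m atomic.Int64
    m.Store(a[0])
    var mu sync.Mutex
    var wg sync.WaitGroup
    for i := 1; i < len(a); i++ { // par i = 2 to N do ...
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            if a[i] > m.Load() { // unprotected test: avoids locking in most cases
                mu.Lock()
                if a[i] > m.Load() { // ⟨if a(i) > m then m := a(i)⟩
                    m.Store(a[i])
                }
                mu.Unlock()
            }
        }(i)
    }
    wg.Wait()
    fmt.Println(m.Load())
}
```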

What can we say about the postcondition of a parallel composition? In the example below, the statements x := x + 1 and y := y + 2 are correct with respect to their pre- and postconditions:

[annotated state diagram]

Their parallel composition establishes the conjunction of their postconditions.

Compare this to the parallel composition of x := 1 and x := 2:

[annotated state diagram]

Their parallel composition establishes the disjunction of their postconditions.

Now consider the parallel composition of ⟨x := x + 1⟩ and ⟨x := x + 2⟩:

[state diagram]

Question: What are appropriate predicates P₁, P₂, Q₁, Q₂?

Answer: Here is a naive attempt:

[annotated state diagram]

If both processes start with x = 0 and ⟨x := x + 1⟩ is executed first, the precondition of ⟨x := x + 2⟩ is no longer x = 0. Likewise, if ⟨x := x + 2⟩ is executed first and establishes x = 2, then ⟨x := x + 1⟩ would invalidate that postcondition. The reason is interference. Additionally, from x = 1 and x = 2 the postcondition x = 3 cannot be inferred.

Question: How can the annotation be modified such that the parallel execution will not invalidate the predicates?

Answer: The annotation needs to be weakened.

[annotated state diagram]

Now each process does not interfere with the states of the other process.

As another example, consider the following two procedures operating on global variables a, b, t:

procedure P
    t, a := false, 0
    while ¬t do
        t, a := true, a + 1
procedure Q
    t, b := true, 0
    while t do
        t, b := false, b + 1

Question: Do P and Q terminate and what is their outcome? What are the loop invariants?

Answer:

  • P terminates with t ∧ a = 1; the invariant is (¬t ∧ a = 0) ∨ (t ∧ a = 1);
  • Q terminates with ¬t ∧ b = 1; the invariant is (t ∧ b = 0) ∨ (¬t ∧ b = 1).

Question: Does P ‖ Q terminate and what is the outcome?

Answer: P ‖ Q may terminate with a ≥ 0 ∧ b ≥ 0 or may not terminate. The state diagram shows how the annotation of P that is correct for sequential execution is invalidated by the concurrent execution of Q:

[annotated state diagram]

Question: How can the annotation be modified such that the parallel execution will not invalidate the predicates?

Answer: The loop invariant in P is a ≥ 0 and in Q is b ≥ 0.

Consider the following two parallel compositions:

x := y + z ‖ (y := 3 ; z := 5)
x := y + z ‖ y := 3

The assignment x := y + z is executed as ⟨x⟩ := ⟨y⟩ + ⟨z⟩. In the first composition, the value taken for y could be that before or after y := 3, and the value of z could be that before or after z := 5. However, in the second composition the effect of x := y + z is the same as ⟨x := y + z⟩, allowing a simpler state diagram:

[state diagram]

The reason is that in the second composition, only one of the variables of y + z is changed in the parallel process.

In a parallel composition

S₁ ‖ … ‖ Sₙ

an expression E in Sᵢ satisfies the at-most-once property if it refers at most once to a basic variable that may be changed in another process. An assignment x, y := E, F in Sᵢ satisfies the at-most-once property if it refers at most once to a basic variable (x, y, or in E, F) that may be changed in another process.

Expressions and assignment statements with the at-most-once property are evaluated or executed as if they were atomic. Thus in state diagrams they don't need to be broken up.

Question: Which of the following assignments satisfy the at-most-once property?

  • the two assignments in x := 3 ‖ x := x + 1
  • the three assignments in x := 3 ‖ y := 5 ‖ z := x + y
  • the assignment t, a := false, 0 in procedure P above when considering P ‖ Q
  • the assignment t, b := false, b + 1 in procedure Q above when considering P ‖ Q

Answer:

  • only x := 3
  • only x := 3 and y := 5
  • yes, as only t is changed in Q
  • yes, as only t is changed in P

This retrospectively justifies the state diagram for P ‖ Q.

An atomic statement A that starts in state P does not interfere with state Q if A preserves Q, formally:

{P ∧ Q}  A  {Q}

In state diagrams non-interference is visualized by a green dotted line. If A is a guarded assignment statement with guard B and assignment x, y := E, F, the condition for non-interference is:

P ∧ B ∧ Q ⇒ Q[x, y := E, F]

[state diagram]

Question: Consider again the state diagram for ⟨x := x + 1⟩ ‖ ⟨x := x + 2⟩. What are the conditions for non-interference? Give the proofs!

Answer: The conditions for x := x + 1 not interfering are:

  • (x = 0 ∨ x = 2) ∧ (x = 0 ∨ x = 1) ⇒ (x = 0 ∨ x = 1)[x := x + 1]
  • (x = 0 ∨ x = 2) ∧ (x = 2 ∨ x = 3) ⇒ (x = 2 ∨ x = 3)[x := x + 1]

The conditions for x := x + 2 not interfering are:

  • (x = 0 ∨ x = 1) ∧ (x = 0 ∨ x = 2) ⇒ (x = 0 ∨ x = 2)[x := x + 2]
  • (x = 0 ∨ x = 1) ∧ (x = 1 ∨ x = 3) ⇒ (x = 1 ∨ x = 3)[x := x + 2]

The proof of the first condition is:

    (x = 0 ∨ x = 2) ∧ (x = 0 ∨ x = 1) ⇒ (x = 0 ∨ x = 1)[x := x + 1]
≡        {by substitution, factoring out x = 0, simplification}
    x = 0 ⇒ (x + 1 = 0 ∨ x + 1 = 1)
≡        {arithmetic}
    x = 0 ⇒ (x = -1 ∨ x = 0)
≡        {weakening / strengthening}
    true

Rule for correctness of parallel composition: Assume that statements S₁, S₂ establish Q₁, Q₂ under P₁, P₂, respectively:

{P₁}  S₁  {Q₁}
{P₂}  S₂  {Q₂}

Their parallel composition will establish the conjunction of the postconditions under the conjunction of the preconditions,

{P₁ ∧ P₂}  S₁ ‖ S₂  {Q₁ ∧ Q₂}

provided that the atomic statements (transitions) of S₁ do not interfere with any state of S₂ and vice versa, visually:

[state diagram]

The rule generalizes to more than two processes.

Condition Synchronization

Condition synchronization delays one process until another establishes a certain condition. The fundamental mechanism is the await statement:

⟨await B then S⟩
[state diagram]

The await statement "waits" until Boolean expression B is true and then executes statement S atomically. Formally, this reduces the number of interleavings. The body S can also be empty:

⟨await B⟩ = ⟨await B then skip⟩

For example:

⟨await s > 0 then s := s - 1⟩

⟨await s > 0⟩ ; s := s - 1

Question: What is the difference between these two?

Answer: The first guarantees that s will not become negative; the second does not.

The await statement is at the very core of every synchronization mechanism, even though few programming languages support it in full generality. One way to implement await B is by spinning: the process keeps looping as long as B is false.
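
For example, ⟨await s > 0 then s := s - 1⟩ can be programmed in Go with sync.Cond rather than spinning; Wait releases the lock while blocked, and the loop re-tests the guard after every wakeup. A sketch (the Counter type is an illustrative device):

```go
package main

import (
    "fmt"
    "sync"
)

type Counter struct {
    mu sync.Mutex
    cv *sync.Cond
    s  int
}

func NewCounter(s int) *Counter {
    c := &Counter{s: s}
    c.cv = sync.NewCond(&c.mu)
    return c
}

// Dec implements ⟨await s > 0 then s := s - 1⟩.
func (c *Counter) Dec() {
    c.mu.Lock()
    for c.s <= 0 {
        c.cv.Wait() // releases the lock while blocked, re-acquires before returning
    }
    c.s = c.s - 1
    c.mu.Unlock()
}

// Inc implements ⟨s := s + 1⟩ and wakes up one waiting process.
func (c *Counter) Inc() {
    c.mu.Lock()
    c.s = c.s + 1
    c.cv.Signal()
    c.mu.Unlock()
}

func main() {
    c := NewCounter(0)
    go c.Inc()
    c.Dec() // waits until Inc has run
    fmt.Println("done, s =", c.s)
}
```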

Formally, the statement ⟨await B then S⟩ is just ⟨if B → S⟩, so the same correctness rule holds:

Rule for correctness of await:

{P}  ⟨await B then S⟩  {Q}

if P ⇒ ∆B
and {P ∧ B} S {Q}

[state diagram]

If B is false, the interpretation in concurrent programs is waiting until it becomes true, rather than staying in the initial state forever.

Producer and Consumer

A producer places objects into a (one-place) buffer and a consumer removes objects from the buffer. The producer has to wait until the buffer is empty before placing an object; the consumer has to wait until the buffer is not empty before removing an object. Otherwise, producer and consumer proceed at their own pace. Assume buf: T:

p := 0 ; c := 0
process Producer
    var a: array 0 .. N - 1 of T = …
    while p < N do
        ⟨await p = c⟩
        buf := a(p)
        p := p + 1
process Consumer
    var b: array 0 .. N - 1 of T
    while c < N do
        ⟨await p > c⟩
        b(c) := buf
        c := c + 1

Question: What is the relation among a, b, p, c?

The state diagram shows the annotations required to conclude that upon termination of Consumer, it has a copy of array a of Producer in its own local array b. Formally, the producer invariant PI is conjoined to every state of Producer, the consumer invariant CI is conjoined to every state of Consumer, and the global invariant PC is conjoined to every state:

[annotated state diagram]

Textually, with some simplifications this is equivalently expressed as:

p := 0 ; c := 0
{PC: 0 ≤ p ≤ c + 1 ∧ (p = c + 1 ⇒ buf = a(p - 1))}
process producer
    var a: array 0 .. N - 1 of T = …
    {PI: p ≤ N}
    {PC ∧ PI}
    while p < N do
        {PC ∧ p < N}
        ⟨await p = c⟩
        {PC ∧ p < N ∧ p = c} 
        buf := a(p)
        {PC ∧ p < N ∧ p = c ∧ buf = a(p)}
        p := p + 1
        {PC ∧ PI}
process consumer
    var b: array 0 .. N - 1 of T
    {CI: c ≤ N ∧ b[0..c-1] = a[0..c-1]}
    {PC ∧ CI} 
    while c < N do
        {PC ∧ CI ∧ c < N}
        ⟨await p > c⟩
        {PC ∧ CI ∧ c < N ∧ p > c} 
        b(c) := buf
        {PC ∧ CI ∧ c < N ∧ p > c ∧ b(c) = a(c)} 
        c := c + 1
        {PC ∧ CI}

Each of the black solid transitions has to be correct in isolation. The producer invariant PI does not contain variables that are modified by Consumer and likewise the consumer invariant CI does not contain variables that are modified by Producer, so these are always preserved by transitions of the other process. The conditions for Consumer are

  1. {PC ∧ CI ∧ c < N ∧ p > c}
     b(c) := buf
     {PC ∧ CI ∧ c < N ∧ p > c ∧ b(c) = a(c)}

  2. {PC ∧ CI ∧ c < N ∧ p > c ∧ b(c) = a(c)}
     c := c + 1
     {PC ∧ CI}

or equivalently:

  1. PC ∧ CI ∧ c < N ∧ p > c ⇒ (PC ∧ CI ∧ c < N ∧ p > c ∧ b(c) = a(c))[b := (b; c: buf)]

  2. PC ∧ CI ∧ c < N ∧ p > c ∧ b(c) = a(c) ⇒ (PC ∧ CI)[c := c + 1]

The green dotted arrows indicate which transitions of Consumer may interfere with the states of Producer. There are six non-interference conditions to check for the transition with b(c) := buf; of those the three with dotted arrows going to "empty" states of Producer are identical, leaving four conditions:

  1. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c}
     b(c) := buf
     {true}

  2. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c ∧ p < N}
     b(c) := buf
     {p < N}

  3. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c ∧ p < N ∧ p = c}
     b(c) := buf
     {p < N ∧ p = c}

  4. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c ∧ p < N ∧ p = c ∧ buf = a(p)}
     b(c) := buf
     {p < N ∧ p = c ∧ buf = a(p)}

These hold as b(c) := buf does not modify any of the variables in the states that need to be preserved.

Similarly, there are four distinct non-interference conditions for the transition with c := c + 1:

  1. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c}
     c := c + 1
     {true}

  2. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c ∧ p < N}
     c := c + 1
     {p < N}

  3. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c ∧ p < N ∧ p = c}
     c := c + 1
     {p < N ∧ p = c}

  4. {PC ∧ PI ∧ CI ∧ c < N ∧ p > c ∧ p < N ∧ p = c ∧ buf = a(p)}
     c := c + 1
     {p < N ∧ p = c ∧ buf = a(p)}

The first two hold as c := c + 1 does not modify any of the variables in the states that need to be preserved. The last two hold as the precondition (antecedent) is false: p = c and p > c cannot hold at the same time.

Formally, concurrent states P and Q exclude each other if ¬(P ∧ Q). For example, states p < N ∧ p = c and c < N ∧ p > c exclude each other.

The producer-consumer processes do not interfere with each other. In general, interference can be avoided by

  • having annotations only over local variables, i.e. not relying on any values of shared variables,
  • when shared variables are involved, making annotations sufficiently weak, as in the example with ⟨x := x + 1⟩ ‖ ⟨x := x + 2⟩,
  • using atomic regions where needed; in the example with variables a, b, t, that could be ⟨while ¬t do t := true ; a := a + 1⟩
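
Returning to the producer-consumer example: in Go, a channel of capacity 1 provides exactly the one-place buffer, with both await conditions built in. A sketch of the copying algorithm above (the concrete array is illustrative):

```go
package main

import "fmt"

func main() {
    const N = 5
    a := [N]int{10, 20, 30, 40, 50}
    var b [N]int
    buf := make(chan int, 1) // one-place buffer:
    // sending awaits p = c (buffer empty), receiving awaits p > c (buffer full)
    done := make(chan struct{})
    go func() { // process Producer
        for p := 0; p < N; p++ {
            buf <- a[p]
        }
    }()
    go func() { // process Consumer
        for c := 0; c < N; c++ {
            b[c] = <-buf
        }
        close(done)
    }()
    <-done
    fmt.Println(b) // b is a copy of a
}
```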

Fairness

Question: Does the following program terminate? What is the postcondition if it ever terminates?

n, t := 0, false
(while ¬t do n := n + 1) ‖ t := true

The minimal progress assumption requires only that some process makes progress, if progress is possible at all. Under minimal progress, the above program may not terminate.

An unfair scheduler may select one process and neglect others. We may require a scheduler to be (weakly) fair:

  • an unconditional atomic region is executed eventually;
  • a conditional atomic region whose guard is continuously true is executed eventually.

Under fairness, the above program does terminate; upon termination, n can be any value with n ≥ 0.

Question: Does the following program terminate under a (weakly) fair scheduler?

t, u := false, true
⟨await t then u := false⟩ ‖ while u do t := true ‖ while u do t := false

Under weak fairness, it does not necessarily terminate. For that, the scheduler would have to be strongly fair: a conditional atomic region whose guard is repeatedly (but not necessarily continuously) true is eventually executed.

Typically, schedulers of operating systems and programming languages are weakly fair. Strong fairness is more difficult to implement.

Safety and Liveness

Every functional property of a (sequential or concurrent) program can be formulated as either a safety or a liveness property.

Safety: nothing bad will ever happen; all states of all traces are "not bad". Safety is expressed by invariance properties.

Liveness: something good will eventually happen; all traces will eventually lead to a "good" state.

Termination is a liveness property: a final state will eventually be reached.

Formally, safety properties are properties of individual states of traces (or finite sequences of states), and liveness properties are properties of infinite traces.

Question: What are the safety and liveness properties of a traffic intersection?

Question: What is the outcome of P ‖ Q? Do P and Q interfere with each other with the given annotation?

procedure P
    {true}
    while n > 0 do n := n - 1
    a := 1
    {a = 1}
procedure Q
    {true}
    while n < 5 do n := n + 1
    b := 1
    {b = 1}

The parallel composition P ‖ Q does not always terminate, but if it does, it establishes both postconditions; P and Q do not interfere with each other.

Without interference, safety properties are preserved in a parallel composition, but liveness properties, including termination, are not!

Critical Section

Mutual exclusion is typically implemented by locks that protect critical sections.

process CSi
    while true do
        entry protocol
        critical section
        exit protocol
        noncritical section

Any solution has to satisfy three safety properties and one liveness property:

  1. Mutual exclusion: at most one process can be in its critical section
  2. No deadlock (livelock): if two or more processes are trying to enter their critical sections and no other process is in its critical section, one will succeed.
  3. No unnecessary delay: if a process is trying to enter its critical section and no other process is in its critical section, it is not prevented from entering.
  4. Eventual entry: a process trying to enter will eventually succeed.

Question: What are these properties at a traffic intersection? You may consider 4-way stops and traffic lights.

Answer:

  1. Mutual exclusion: at most one car enters the intersection
  2. No deadlock: if four cars arrive at a 4-way stop at the same time, they will not all wait, but one will enter
    No livelock: if four cars arrive at a 4-way stop at the same time and all drivers gesture to the car on the right to proceed, they will not keep doing that forever but one will enter
  3. No unnecessary delay: if a car arrives at a traffic light and there are no other cars, the traffic light will eventually turn green
  4. Eventual entry: all cars arriving at an intersection will eventually be able to enter it

Here is an abstract solution, i.e. with "coarse-grained" atomic regions; CS is the critical section invariant:

in1, in2 := false, false
{CS: ¬(in1 ∧ in2)}
process CS1
    while true do
        ⟨await ¬in2 then in1 := true⟩
        critical section
        in1 := false
        noncritical section
process CS2
    while true do
        ⟨await ¬in1 then in2 := true⟩
        critical section
        in2 := false
        noncritical section

Question: Do the four properties, mutual exclusion, no deadlock (livelock), no unnecessary delay, eventual entry hold?

Answer: Yes, yes, yes, no.

Exercise: Give the annotation that is required for the CS invariant in a state diagram! Check both the correctness of transitions as well as non-interference.

The solution can be easily generalized to more processes by observing that only one of in1, in2, ... can be true, and hence these can be replaced by a single variable lock. The relationship is lock = in1 ∨ in2 ∨ ...:

lock := false
process CS1
    while true do
        ⟨await ¬lock then lock := true⟩
        critical section
        lock := false
        noncritical section
process CS2
    while true do
        ⟨await ¬lock then lock := true⟩
        critical section
        lock := false
        noncritical section

Modern processors have an atomic test-and-set instruction or an equivalent instruction:

procedure TAS(lock: pointer to boolean) → (init: boolean)
    ⟨init := *lock ; *lock := true⟩

Here, *lock refers to the variable pointed to by lock; procedure TAS returns a boolean result init (in a processor, lock and init would be registers).

By employing a spin lock, eventual entry (property 4) is "more likely":

process CS1
    while true do
        repeat init ← TAS(lock) until ¬init
        critical section
        lock := false
        noncritical section

The loop repeat S until B is formally equivalent to S ; while ¬B do S.
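
Go's sync/atomic package provides an atomic exchange that can play the role of TAS: Bool.Swap stores true and returns the previous value. A spin-lock sketch (the SpinLock type is an illustrative device):

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

type SpinLock struct{ lock atomic.Bool }

// Enter spins: repeat init := TAS(lock) until ¬init.
func (l *SpinLock) Enter() {
    for l.lock.Swap(true) {
    }
}

// Exit releases: lock := false.
func (l *SpinLock) Exit() {
    l.lock.Store(false)
}

func main() {
    var l SpinLock
    var wg sync.WaitGroup
    x := 0
    for p := 0; p < 4; p++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                l.Enter()
                x = x + 1 // critical section
                l.Exit()
            }
        }()
    }
    wg.Wait()
    fmt.Println(x) // 4000
}
```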

The atomic region ⟨S⟩ can be implemented by

CSenter
S
CSexit

provided that access to all variables of S is protected with the same lock; atomic regions accessing disjoint variables can proceed in parallel.

The guarded atomic region ⟨await B then S⟩ can be implemented by

CSenter
while ¬B do CSexit ; CSenter
S
CSexit

provided that access to all variables of B and S is protected with the same lock.

This involves spinning and may lead to memory contention when accessing the variables of B. To reduce this, it is preferable to insert a delay:

CSenter
while ¬B do CSexit ; delay ; CSenter
S
CSexit

A delay can be randomly chosen from an interval that is doubled each time, known as the exponential back-off protocol.

The Tie-Breaker Algorithm (Peterson's Algorithm) obtains a fair solution to the critical sections problem by recording which process had the last turn; in1, in2 now mean that CS1, CS2 want to enter their critical section:

in1, in2, last := false, false, 1
process CS1
    while true do
        in1 := true ; last := 1
        ⟨await ¬in2 or last = 2⟩
        {in1 ∧ (¬in2 ∨ last = 2)}
        critical section
        in1 := false
        noncritical section
process CS2
    while true do
        in2 := true ; last := 2
        ⟨await ¬in1 or last = 1⟩
        {in2 ∧ (¬in1 ∨ last = 1)}
        critical section
        in2 := false
        noncritical section
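
Note that each process announces its interest (in1, in2) before updating last; with the opposite order, both processes could pass their await at the same time. A sketch in Go, using sync/atomic for the shared variables, since plain concurrent access would be a data race under Go's memory model:

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

var (
    in1, in2 atomic.Bool
    last     atomic.Int32
    n        int // shared; protected by the tie-breaker protocol
)

func cs1() {
    in1.Store(true) // announce interest first,
    last.Store(1)   // then yield the tie
    for in2.Load() && last.Load() != 2 { // await ¬in2 or last = 2
    }
    n = n + 1 // critical section
    in1.Store(false)
}

func cs2() {
    in2.Store(true)
    last.Store(2)
    for in1.Load() && last.Load() != 1 { // await ¬in1 or last = 1
    }
    n = n + 1 // critical section
    in2.Store(false)
}

func main() {
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            cs1()
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            cs2()
        }
    }()
    wg.Wait()
    fmt.Println(n) // 2000
}
```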

The drawback of the tie-breaker algorithm is that it is difficult to generalize to more than two processes. The idea of the ticket algorithm is that each process first gets a ticket with its position in the queue. Assume turn: array 0 .. N - 1 of integer:

number, next, turn := 1, 1, [0, ..., 0]
{next > 0 ∧
(∀ i ∈ 0 .. N - 1 • (CS(i) in critical section ⇒ turn(i) = next) ∧
    (turn(i) > 0 ⇒ (∀ j ∈ 0 .. N - 1 • j ≠ i ⇒ turn(i) ≠ turn(j))))}
process CS(i: 0 .. N - 1)
    while true do
        ⟨turn(i) := number ; number := number + 1⟩
        ⟨await turn(i) = next⟩
        critical section
        ⟨next := next + 1⟩
        noncritical section

The parameter of CS is a notation for process replication: N copies of CS are started, each with a different value of i.

The ticket algorithm requires fetching and incrementing an integer as one atomic operation. Some processors have a fetch-and-add instruction specifically for this purpose.

  • The x86 processor has an XADD instruction that can be used with the LOCK prefix to achieve atomicity,

    LOCK XADD dst, src   =   ⟨dst, src := dst + src, dst⟩

    where dst is a register or memory location and src is a register.

  • The ARM v7 processor splits atomic updates into two instructions, load-exclusive and store-exclusive, with the hardware monitoring what happens in between,

    LDREX dst, [src]      =   ⟨dst := Memory[src]⟩
    STREX R, src, [dst]   =   ⟨Memory[dst], R := src, 0 ⫿ R := 1⟩

    where dst, src, R are registers and ⫿ is nondeterministic choice: if the memory location was modified in the meantime, R is set to 1, otherwise to 0. An increment of Memory[ptr] is performed in a loop:

    L:  LDREX R1, [ptr]     ; load exclusive and monitor Memory[ptr]
        ADD R1, R1, #1      ; add 1
        STREX R2, R1, [ptr] ; attempt to store
        CMP R2, #1          ; check if store-exclusive failed
        BEQ L               ; failed, retry
  • The gcc compiler has built-in functions for atomic operations that are compiled to whatever the hardware supports:

    type __atomic_add_fetch (type *ptr, type val, int memorder)

    where type must be 1, 2, 4, or 8 bytes in length.

Question: Does the ARM code guarantee fairness among processes trying to increment Memory[ptr]?

Answer: No, it is not fair, but the chances of one process starving are negligible.

If a special instruction is not available, then an implementation of

⟨turn(i) := number ; number := number + 1⟩

would be

CSenter
turn(i) := number ; number := number + 1
CSexit

where CSenter and CSexit use a lock for variable number. Although the contention for accessing number may not be resolved in a fair way, it turns out that this implementation works sufficiently well in practice.
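
In Go, atomic.Int64.Add is such a fetch-and-add (it returns the new value, so the ticket is the result minus 1). A ticket-lock sketch, with tickets starting at 0 rather than 1 (the TicketLock type is an illustrative device):

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

type TicketLock struct {
    number atomic.Int64 // next ticket to be handed out
    next   atomic.Int64 // ticket currently being served
}

func (l *TicketLock) Enter() {
    turn := l.number.Add(1) - 1 // ⟨turn := number ; number := number + 1⟩
    for l.next.Load() != turn { // ⟨await turn = next⟩
    }
}

func (l *TicketLock) Exit() {
    l.next.Add(1) // ⟨next := next + 1⟩
}

func main() {
    var l TicketLock
    var wg sync.WaitGroup
    x := 0
    for p := 0; p < 4; p++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                l.Enter()
                x = x + 1 // critical section
                l.Exit()
            }
        }()
    }
    wg.Wait()
    fmt.Println(x) // 4000
}
```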

The ticket algorithm leads to memory contention if many processors are trying to access number. The bakery algorithm does not rely on a single global counter and does not need a special fetch-and-add instruction. Assuming turn: array 0 .. N - 1 of integer, a coarse-grained solution is:

turn := [0, ..., 0]
process CS(i: 0 .. N - 1)
    while true do
        ⟨turn(i) := max(turn) + 1⟩
        for j = 0 to N - 1 do
            if i ≠ j then ⟨await turn(j) = 0 or turn(i) < turn(j)⟩
        critical section
        turn(i) := 0
        noncritical section

Operation max(turn) returns the maximum value in array turn.

The max operation would need to be implemented by a loop, making entry inefficient. If we were to implement

turn(i) := max(turn) + 1

non-atomically, two processes, say i and j, may obtain the same value for turn(i) and turn(j), leading to a deadlock in the await statement.

This situation can be avoided by giving preference to one of the two or more processes with the same value of turn in some arbitrary way, for example by using the process number. The conditions of the await do not need to be evaluated atomically. Thus we can arrive at a solution that only relies on assignments being atomic:

turn := [0, ..., 0]
process CS(i: 0 .. N - 1)
    while true do
        turn(i) := max(turn) + 1
        for j = 0 to N - 1 do
            while turn(j) ≠ 0 and (turn(i) > turn(j) or (turn(i) = turn(j) and i > j)) do
                skip
        critical section
        turn(i) := 0
        noncritical section

Question: Does this guarantee fairness among processes trying to enter their critical section?

Answer: Yes, although processes with a lower process number may have an advantage.