May 27, 2025

Predictions About Computer Memory Models

William Collier
collier@acm.org

Outline: It may help you to follow the talk if you read and print out these notes beforehand.

1. Memory models (MM) will continue to evolve. Sequential Consistency (SC) (Lamport, 1979 [8]) says: do not visibly violate Program Order (PO). SCII (RAPA Plus): obey the rules of computation (CMP), program order (PO), and write atomicity (WA) [2]. Show how to unrelax relaxable rules. Scheurich's problem. Memory model by fiat [9].

2. Single copy write atomicity (SCWA) will be seen as simply part of the CMP rule. "Data-race free" will disappear from architecture manuals.

3. Long thread tests (LTT's) will almost entirely replace short thread tests (STT's).

4. Indistinguishable memory models. The fundamental goal of memory models. The WSisWA theorem.

5. The X=X program. What happens when two threads write into each other's instruction stream?

6. Making new entries in www.oeis.org be tables will result in finding many new integer sequences. This has nothing to do with memory models.

7. Work with Bill Rubin to see structures in LTT output.

==============================================

The memory model (MM) for a shared memory multiprocessor (SMMP) defines the SMMP's behavior when two or more processors read and/or write the same operands at the same time.

Some memory models require that all instructions be executed in the order defined by the program they are contained in. This is called the rule of program order (PO). This form of the rule is called "strong behavior." Other memory models allow "relaxed behavior"; that is, they allow some instructions to be executed out of order in order to increase performance.

Some memory models require that every store operation appear to become visible to every part of a machine at the same instant. This is the rule of multiple copy write atomicity (MCWA). This form of the rule is called "strong behavior."
Other memory models allow "relaxed behavior"; that is, they allow a change in the value of an operand to become visible at different times to different parts of the system in order to increase performance.

The term 'computer architecture' for a computer is used here to mean the programming interface for the computer. Out of habit I use the two phrases interchangeably.

The first memory model, called Sequential Consistency (SC), was defined by Leslie Lamport as requiring that violations of the rule of program order do not become visible: "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [8]

In the 1970's, efforts to understand the effects of multiprocessing led to many small test programs being created. The point of such tests was to show that if a program calculated a certain set of answers, then it was possible to deduce that certain events had happened before certain other events. In some cases it became possible to see events happening out of order or nonatomically. Collections of these programs can be found in [2, 10].

Here is a test program created by Ron Smith. The goal was to show that if read events were allowed to occur before logically preceding write events, then this behavior could become visible to users.

Execution 1. Initially, A = B = U = V = 0.

    P0            P1
    L0: A = 1;    L0: B = 1;
    L1: U = B;    L1: V = A;

Terminally, A = B = 1, U = V = 0.

Call Execution 1 a short thread test (STT). The logic showing the existence of a circuit in the data is:

    the write of A in A=1 comes before the read of B in U=B by program order.
    the read of B in U=B comes before the write into B in B=1 by the rule of computation.
    the write of B in B=1 comes before the read of A in V=A by program order.
    the read of A in V=A comes before the write of A in A=1 by the rule of computation.
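The circuit argument above can be mechanized. Below is a minimal sketch of my own (not from RAPA): each ordering claim becomes a directed edge, and a cycle among the edges shows that one of the rules used to justify them must have been visibly violated. The node labels and the `has_cycle` helper are my inventions for illustration.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True          # back edge: cycle found
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

# The four ordering claims for the outcome U = V = 0 of Execution 1:
edges = [
    ("write A (A=1)", "read B (U=B)"),   # program order on P0
    ("read B (U=B)",  "write B (B=1)"),  # CMP: U=0, so the read missed B=1
    ("write B (B=1)", "read A (V=A)"),   # program order on P1
    ("read A (V=A)",  "write A (A=1)"),  # CMP: V=0, so the read missed A=1
]
print(has_cycle(edges))  # True: the circuit exists, so some rule was relaxed
```

The same detector applies unchanged to the edges of any other short or long thread test.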
Note: the existence of the circuit reveals weak behavior of the machine, since one of the rules used in constructing the circuit must have been violated.

In [2] I speculated that Smith's STT above could be expanded into a long thread test (LTT) of the following form.

Execution 2. Initially, A = B = U[i] = V[i] = 0, 0 <= i <= N.

    P0                P1
    L11: A = 1;       L11: B = 1;
    L12: U[1] = B;    L12: V[1] = A;
    L21: A = 2;       L21: B = 2;
    L22: U[2] = B;    L22: V[2] = A;
    L31: A = 3;       L31: B = 3;
    L32: U[3] = B;    L32: V[3] = A;
    . . .             . . .
    LN1: A = N;       LN1: B = N;
    LN2: U[N] = B;    LN2: V[N] = A;

Seek, for example, U[3] = 2 and V[3] = 2, to see that a read comes before a write. The logic showing the existence of a circuit in the data is:

    the write of A in A=3 comes before the read of B in U[3]=B by program order.
    the read of B in U[3]=B comes before the write into B in B=3 by the rule of computation.
    the write of B in B=3 comes before the read of A in V[3]=A by program order.
    the read of A in V[3]=A comes before the write of A in A=3 by the rule of computation.

After [2] came out, the program in Execution 2 was run, and its output data confirmed that, as the System/390 architecture allowed, read operations could indeed be executed before logically preceding write operations. See Variation 1, p. 221, of RAPA. See Test 4, p. 224, of RAPA. See Test T400 in http://www.mpdiag.com/analysis.html#t04.

Execution 3. Scheurich's execution (1987): Initially, {A,B,X,Y} = {0,0,0,0}.

    P1           P2           P3
    L1: A = 1;   L1: B = A;   L1: X = B;
                              L2: Y = A;

Terminally, {A,B,X,Y} = {1,1,1,0}.

The logic for Execution 3 is:

    the write of A in A=1 comes before the read of A in B=A by the rule of computation.
    the read of A in B=A comes before the write of B in B=A by the rule of computation.
    the write of B in B=A comes before the read of B in X=B by the rule of computation.
    the read of B in X=B comes before the read of A in Y=A by program order.
    the read of A in Y=A comes before the write of A in A=1 by the rule of computation.
    the write of A in A=1 comes before the read of A in B=A by the rule of computation.

It appears that the execution violates either program order (P3 fetches A well before it fetches B) or write atomicity (P3 sees the old value of A well after it has seen the new value of A, via operand B).

Intel's response to Scheurich's execution at one point was to declare that it was impossible for it to occur [6]. This is memory modeling by fiat. It would be easy to laugh at this, but to do so would be to ignore the fact that the folks who wrote the fiat knew something which they could not express rationally. Perhaps it is something like this: signals take time to propagate. A memory model should allow one to express durations. Such a model would allow one to explain that P3 cannot see the new value of A and then, at a later time, still see the old value of A, in the way described in Execution 3 above. Hopefully this new memory model would still be simple enough for programmers to use in writing shared memory routines.

==============================================

Sequential consistency provided a valuable guide for writing SMMP programs for many years. However, it can now be seen to be missing vital elements. When SC was first formulated, the rule of computation seemed so obvious that it was not thought necessary to state it formally. But, as noted in [2], the events of an execution can always be arranged to obey any set of rules other than the rule of computation. In other words, without using the rule of computation, it is impossible to deduce the existence of a circuit among the events of an execution.

I propose that the definition of SC should be updated as follows:

The rule of computation (CMP). Strong form only. I do not believe that any weak form of CMP can be permitted.

The rule of program order (PO). PO in its strong form, or PO in a weak form plus fences to enable programmers to achieve strong forms of PO where needed.

The rule of write atomicity (WA).
WA in its strong form, or WA in a weak form plus atomic instructions to enable programmers to achieve strong forms of WA where needed.

Call this form of SC, SCII.

==============================================

From the short thread tests which were created at this time it became evident that the phrase "write atomicity" was used to describe two very different behaviors of a machine.

1. Single copy write atomicity (SCWA). If two threads both write to an operand, or one reads the operand and the other writes into it, the resultant value(s) should be one of the original values (not a value which is partly created from one operand and partly from another).

2. Multiple copy write atomicity (MCWA). If several copies of an operand exist in a system (say in distinct copies in several caches), then, if one copy changes value, all of the copies (appear to) change value at the same instant. We have already seen the MCWA rule in the discussion of the Scheurich execution above.

==============================================

Single copy write atomicity (SCWA) and data race freedom (Adve [1]).

The rules of computation require that for Execution 4 the terminal value of A can be either 1 or 2 and no other value:

Execution 4. Initially, A = 0.

    T0           T1
    L1: A = 1;   L2: A = 2;

Terminally, A can be either 1 or 2.

and for Execution 5 the terminal value of A can be either 0 or 1 and no other value:

Execution 5. Initially, A = X = 0.

    T0           T1
    L1: A = X;   L2: X = 1;

Terminally, A can be either 0 or 1.

These rules ensure that the system appears to be data race free.

How valuable is SCWA as a rule for reasoning about the behavior of threads? If a program needs to change a large object which can also be changed by other programs at the same time, then there is no question that a lock needs to be set to allow only one program at a time to access the object. If the object is small and can be changed by an atomic instruction, then the atomic instruction should be used to modify the object.
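The terminal-value claims for Executions 4 and 5 can be checked by brute force. The following is a small sketch of my own, not from the text: it enumerates every sequentially consistent interleaving of the two one-statement threads, treating each statement as atomic (i.e., assuming SCWA holds), and collects the terminal values of A. The helper `terminal_values` and the lambda encodings are my inventions for illustration.

```python
from itertools import permutations

def terminal_values(stmts, var="A"):
    """Run every interleaving of single-statement threads; return the set
    of terminal values of `var`. Each statement mutates a shared dict.
    (Valid only because each thread here has exactly one statement.)"""
    results = set()
    for order in permutations(range(len(stmts))):
        mem = {"A": 0, "X": 0}       # initial values from the executions
        for idx in order:
            stmts[idx](mem)
        results.add(mem[var])
    return results

# Execution 4: T0 does A = 1; T1 does A = 2.
ex4 = [lambda m: m.update(A=1), lambda m: m.update(A=2)]
print(terminal_values(ex4))   # {1, 2}: terminal A = 0 is impossible

# Execution 5: T0 does A = X; T1 does X = 1.
ex5 = [lambda m: m.update(A=m["X"]), lambda m: m.update(X=1)]
print(terminal_values(ex5))   # {0, 1}
```

Any other terminal value (a "torn" value built partly from each write) would be an SCWA violation.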
If the object is small and efforts to change it using nonatomic instructions produce values other than those allowed in Executions 4 and 5 above, the machine does not compute, and engineers should be asked to fix it. This may be a point of some contention.

==============================================

Here is an argument that data race free behavior can be achieved purely via programming means, that is, without any support from system hardware. Dijkstra [5] showed that the function of Test and Set can be achieved using only nonatomic BAL instructions. He used this capability to guard entrance to a critical section. Ron Smith and I [3] sharpened Dijkstra's result by requiring that all of the operands used in controlling access to a critical section be only one bit long. Since they are only one bit long, interactions on these operands are data race free. And there will be only one thread in the critical section at any one time, so its operations will also be data race free.

==============================================

3. Long thread tests will almost entirely replace short thread tests.

The STT's in [2, 10] will be recast as LTT's. In testing CPU-GPU interactions, Srivastava [11] found that system setup time in running one LTT was reduced by a factor of 15,000 over running a long sequence of STT's. (But note that there may be some places where a STT cannot be recast as a LTT.)

==============================================

4. Indistinguishable memory models.

Suppose that MM1 and MM2 are two memory models. Suppose further that any result which can be calculated on MM1 can also be calculated on MM2, and vice versa. Then we say that MM1 and MM2 are indistinguishable. The fundamental goal of computer memory models is to identify indistinguishable memory models. A system can then be presented to customers as embodying the best qualities of both MM's, and no program can be written which reveals the deception. Here are two examples.

Example 1.
If I remember correctly, on the early 360 machines the architecture required that (it appeared that) operands were fetched a word at a time. The rule improved the performance of large machines while hurting the performance of small machines. Then someone realized that the rule could be changed to say that operands were fetched a byte at a time. The small machines would be faster. The large machines could continue to fetch operands a word at a time, but no program could detect the violation of the memory model.

Example 2.

Let MM1 be the MM which obeys this rule: every pair of threads sees all changes in the values of operands in the same order. Call a MM which obeys this rule Write Synchronized (WS). Let MM2 be the MM which obeys this rule: every change in the value of an operand becomes visible (or appears to become visible) to all threads in the system at the same instant. Call a MM which obeys this rule Write Atomic (WA). In RAPA I proved the WSisWA theorem: if a system obeys the rule of WS, it obeys (or appears to obey) the rule of WA.

I submitted an invention disclosure on the WSisWA theorem. It was rated publish. After RAPA came out, I was told that if it had been rated file, the invention would have been worth as much to IBM as the channel patent was. Maybe. Maybe not. Achieving write atomicity is expensive, no matter how it is done.

Note. I recently took another look at the proof of the theorem. I believe the theorem is true, but the proof may need some work.

==============================================

5. The X=X program.

Is it reasonable to test a machine to see what it does when told to divide a number by zero? Of course it is. You need to know that the machine deals gracefully with a potential error situation. Is it reasonable to test a machine to see what it does when a thread modifies its own instruction stream? Yes, this behavior is explicitly allowed (on at least some machines).
Is it reasonable to test a machine to see what it does when two threads each modify the other's instruction stream? Presumably this behavior is not allowed. It is still valid to test the machine to see that it handles the situation gracefully, the same as in the divide by zero test.

If a machine stores into an operand exactly the value which was already in the operand, it has performed a nop-store operation. To software a nop-store appears to be a nop. To hardware it appears to be a store operation. Is it fair to test a machine to see how it handles two threads each doing nop-stores into the other's instruction stream? Of course it is fair, just as it is fair to test the machine for its handling of a divide by zero operation. For an additional dollop of piquancy in the test programs, the instruction streams might contain self-modifying instructions, such as the 370 instructions OI *,1 and XI *,1, which self-modify into each other.

==============================================

6. The Online Encyclopedia of Integer Sequences (OEIS).

This discussion has nothing to do with memory models. In 2018 Jeffery, et al. [7] created the following table. I independently created the same table a few weeks later [4].

    i\j | 0  1    2     3      4       5        6         7          8
    ----+---------------------------------------------------------------
      0 | 2  2    2     2      2       2        2         2          2
      1 | 2  3    7    18     47     123      322       843       2207
      2 | 2  4   14    52    194     724     2702     10084      37634
      3 | 2  5   23   110    527    2525    12098     57965     277727
      4 | 2  6   34   198   1154    6726    39202    228486    1331714
      5 | 2  7   47   322   2207   15127   103682    710647    4870847
      6 | 2  8   62   488   3842   30248   238142   1874888   14760962
      7 | 2  9   79   702   6239   55449   492802   4379769   38925119
      8 | 2 10   98   970   9602   95050   940898   9313930   92198402
      9 | 2 11  119  1298  14159  154451  1684802  18378371  200477279

Let a[i][j] be the element in row i and column j. Let the table be initialized as follows:

    a[i][0] = 2, i >= 0.
    a[i][1] = i + 2, i >= 0.

Then the values in each row are given by:

    a[i][j] = a[i][1] * a[i][j-1] - a[i][j-2], i >= 0, j >= 2.
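The row recurrence can be verified mechanically against the table. This is a sketch of my own; it also checks the column polynomials cited in the next section, using per-sequence index offsets (n = i + 2 for columns 2 and 5, n = i for column 3) that I inferred by matching the table values, not offsets stated in the text.

```python
def build_table(rows=10, cols=9):
    """a[i][0] = 2, a[i][1] = i + 2, a[i][j] = a[i][1]*a[i][j-1] - a[i][j-2]."""
    table = []
    for i in range(rows):
        row = [2, i + 2]
        for j in range(2, cols):
            row.append(row[1] * row[-1] - row[-2])
        table.append(row)
    return table

t = build_table()
print(t[1])      # [2, 3, 7, 18, 47, 123, 322, 843, 2207] -- matches row 1
print(t[9][8])   # 200477279 -- matches the last entry shown

# Column checks against the quoted polynomials; the offsets below are my
# assumption, chosen to match the table and the cited OEIS entries.
for i in range(10):
    n = i + 2
    assert t[i][2] == n**2 - 2                  # A008865
    assert t[i][5] == n**5 - 5*n**3 + 5*n       # A230586
    m = i
    assert t[i][3] == m**3 + 6*m**2 + 9*m + 2   # A058794
print("column polynomials check out")
```

Note that the index offset differs from column to column, following each OEIS entry's own offset convention.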
The values C[i][j] in each column are given by (for n > 0):

    C[0][n] = 2.
    C[1][n] = n + 2.
    C[2][n] = n^2 - 2, from A008865.
    C[3][n] = n^3 + 6*n^2 + 9*n + 2, from A058794.
    C[4][n] = ???
    C[5][n] = n^5 - 5*n^3 + 5*n, from A230586.
    C[6][n] = ???

The point is that the columns are defined by many rules, not just by the above polynomials. You can follow the links to my website to see some of the rules. I would really like to work with someone who'd like to take a look at this.

=============================================

Ideas not covered elsewhere.

Are there forms of data races other than those described in Executions 4 and 5 above?

======================

Single copy write atomicity (SCWA) will be seen as simply part of the CMP rule. "Data-race free" will disappear from architecture manuals.

======================

A test program cannot prove that a machine obeys a rule of architecture. The only thing a test program can do is to show that a machine fails to obey a rule of architecture.

======================

I used the BXLE instruction to test and advance a bit string, saving many bytes in an operating system which had to run in 2K bytes. 1965.

======================

I defined graph_sets and the reification of nondeterminism in order to prove the WSisWA theorem. Nondeterminism disappears. One is left instead to deal only with a very large set of chromatic graphs. 1987.

==============================================

7. Structure in the output from long thread tests.

I have talked about the circuits which Bill Rubin and I see in the output from the LTT's. We think we see more structure in the output data. We plan to share results later this summer. I have two hopes for this work. One, I hope it will enable people to see a little more deeply into the internal workings of a SMMP. Two, I hope the work will grow into a small but interesting subfield of number theory.

==============================================

References

1. S. Adve.
Designing Memory Consistency Models For Shared-Memory Multiprocessors (PhD thesis). 1993.

2. W. W. Collier. Reasoning About Parallel Architectures, Prentice-Hall, 1992.

3. W. W. Collier and R. M. Smith. Interactive Tie-Breaking System. 1992. https://patentimages.storage.googleapis.com/eb/52/ec/67e872a2ce4625/US3676860.pdf

4. W. W. Collier. Entry 299741 in the Online Encyclopedia of Integer Sequences. 2018. https://oeis.org/a299741.

5. E. W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, Volume 8, Issue 9, Page 569. 1965. https://dl.acm.org/doi/pdf/10.1145/365559.365617

6. Intel. At one point, on page 11 in section 2.5, "Stores are transitively visible", of the document http://developer.intel.com/products/processor/manuals/318147.pdf, Intel explicitly forbade Scheurich's test case from happening.

7. L. E. Jeffery, B. Selcoe, and A. Hone. Entry 298675 in the Online Encyclopedia of Integer Sequences. 2018. https://oeis.org/a298675.

8. L. Lamport. "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", IEEE Trans. Comput. C-28, 9 (Sept. 1979), 690-691.

9. M. Dubois, C. Scheurich, and F. A. Briggs. Memory Access Buffering in Multiprocessors. Proceedings of the Thirteenth Annual International Symposium on Computer Architecture, 14:2 (1986), 434-442.

10. P. Sewell. ARM Litmus Tests. https://www.cl.cam.ac.uk/~pes20/arm-supplemental/

11. S. Srivastava. Testing Memory Models of Heterogeneous CPU-GPU Systems, Master's thesis, University of California, Santa Cruz, June 2024. See https://escholarship.org/uc/item/5hm7q1jv and https://escholarship.org/content/qt5hm7q1jv/qt5hm7q1jv.pdf?t=sgthsf

==============================================