Nov 20, 2023
                                                    
                        The X=X test program
                      
                          William Collier
                          collier@acm.org

                             Abstract

  The X=X program is a new program to stress test a shared memory
multiprocessor (SMMP).  Multiple threads execute the same code.
Each thread reads from and writes into the shared code.  Each
such action of each thread is a NOP.  However, the interaction
among the threads may reveal real problems on real machines.

  1.  The purpose of the X=X test program.
  
  The X=X test program tries to uncover problems on a SMMP.

  One group ran it on an IBM z16 machine (1,000,000,000
iterations over an hour and a half).  The part of the machine
running the X=X program ran slowly.  No other effects were seen.

  No evidence exists at this time that the X=X program finds
problems on real machines.

  2.  Version 1 of the X=X program.

  There are several threads.  Each thread has access to a shared
data area (DA).  All threads execute the same code in a loop.

  In the loop each thread randomly selects a portion of the DA,
reads the data in the portion, and writes the data back into the
portion from which it was just read.  This is a NOP.  And so
logically the entire program is a NOP.
   
  The machine does not know it is performing NOPs, and therefore
it performs all the checks it performs when threads try for real
to read and write the same data at the same time.  Probably no
machine in today's world would fail Version 1 of the X=X test.
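
  The loop described above can be sketched in C roughly as
follows.  This is only a sketch under stated assumptions, not the
actual X=X program:  it assumes POSIX threads, and the names
xx_da, xx_thread and xx_run_version1, the DA size of 1000 words,
and the small random number generator are all invented for
illustration.  Strictly speaking, unsynchronized access to the
same words is a data race under the C11 rules; the sketch relies
on the compiler emitting volatile accesses as ordinary loads and
stores.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define XX_DA_WORDS 1000      /* size of the shared data area (assumed) */
#define XX_MAX_THREADS 16     /* upper bound on threads (assumed) */

static volatile int xx_da[XX_DA_WORDS];   /* the shared data area (DA) */

struct xx_worker {
    uint32_t seed;        /* per-thread pseudo-random state */
    long iterations;      /* number of portions to copy */
};

/* Small linear congruential generator; any per-thread source would do. */
static uint32_t xx_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Thread body: pick a random portion of the DA and copy it onto itself. */
static void *xx_thread(void *arg)
{
    struct xx_worker *w = arg;
    for (long n = 0; n < w->iterations; n++) {
        size_t start = xx_next(&w->seed) % XX_DA_WORDS;
        size_t len = 1 + xx_next(&w->seed) % (XX_DA_WORDS - start);
        for (size_t i = start; i < start + len; i++)
            xx_da[i] = xx_da[i];      /* read, then write back: a NOP */
    }
    return NULL;
}

/* Fill the DA, run the threads, and count the words whose value changed.
   Returns -1 on a bad argument or if a thread cannot be created. */
int xx_run_version1(int nthreads, long iterations)
{
    pthread_t tid[XX_MAX_THREADS];
    struct xx_worker w[XX_MAX_THREADS];
    int changed = 0;

    if (nthreads < 1 || nthreads > XX_MAX_THREADS)
        return -1;
    for (int i = 0; i < XX_DA_WORDS; i++)
        xx_da[i] = i * 7 + 1;
    for (int t = 0; t < nthreads; t++) {
        w[t].seed = 12345u + (uint32_t)t;
        w[t].iterations = iterations;
        if (pthread_create(&tid[t], NULL, xx_thread, &w[t]) != 0) {
            for (int j = 0; j < t; j++)
                pthread_join(tid[j], NULL);
            return -1;
        }
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    for (int i = 0; i < XX_DA_WORDS; i++)
        if (xx_da[i] != i * 7 + 1)
            changed++;
    return changed;
}
```

  A caller would invoke, say, xx_run_version1(4, 1000000) and
treat any nonzero count of changed words as something to
investigate.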

  3.  Version 2 of the X=X program.

  Version 2 differs from Version 1 in only one regard.  The DA is
the area of memory occupied by the program being executed.

  Each thread reads from and writes into its own instruction
stream.  Each thread also experiences other threads reading from
and writing into its instruction stream.

  Logically, the program is still a giant NOP, and so, logically,
it should run with no problem.

  4.  The data cache and the instruction cache
    
  Each processor in a SMMP has a bit of local memory, called a
data cache, associated with it.  The processor saves in its data
cache the operands which the processor has recently accessed and
the operands which the processor thinks the thread will soon
access.
 
  Each processor has another bit of local memory, called an
instruction cache, associated with it.  The processor saves in
its instruction cache the instructions which the processor is
currently executing and the instructions the processor thinks
the thread may soon execute.

  In the X=X test program, each thread reads and writes words in
the X=X program which that thread is executing.  Thus the same
words may be in both the data cache and the instruction cache of
a processor at the same time.  This is unusual, and it may reveal
problems.

  5.  Should the X=X program be written in basic assembler
language (BAL) or in a higher level language (HLL)?

  It is straightforward to write it in BAL, but the result is
good for evaluating only a specific machine.  If it is written in
a HLL, there are problems to deal with.
  
  a.  Find the beginning and the end of the program.

  b.  Ensure that no one but the X=X program changes the code in
the X=X program.
  
  c.  Prevent the compiler from recognizing that the program
consists of nothing but NOPs.
  
  Composite approach:  a HLL program creates threads and does
other housekeeping, and then calls subroutines, some written in
the HLL and others in BAL.
  
  To be continued.  Please share these notes with anyone you
think might find them interesting.
  
--------------------------------------------------  

  Bill Collier was a programmer in IBM Poughkeepsie from 1960 to
1993.  He wrote "Reasoning About Parallel Architectures"
(Prentice-Hall 1992).  He holds degrees from Harvard and Syracuse
University in math and computer science.  He joined ACM in 1960.

  7.  Write in assembler or in a higher level language?
  
  It is not clear which choice is preferable.
  
  If one writes the X=X program in BAL, one has complete control
over exactly which areas of memory are copied to themselves and
over the order in which the copies are performed.  Unfortunately
in this approach the program has to be written afresh in a
different BAL for each different machine type to be tested.

  If one writes the X=X program in a HLL, the program can be
easily compiled and run on different machines.  However,
compilers for HLLs sometimes rearrange code for the presumed
benefit of programmers, and the compilers for different machines
may differ significantly in the code they generate.  Thus the
compiler itself may contribute to results which can be
erroneously interpreted as evidence of machine malfunction.
  
  For a quick and dirty start to machine testing, consider coding
the X=X program in C/C++.  The variables in the DA which are
being copied to themselves are declared volatile.  This should
keep the compiler from deleting the copies or reordering them
relative to one another.  It does not constrain reordering by the
hardware.
  
  The variables themselves can be declared in a union of the
form:
  
  union
  {
    volatile int a[1000];
    volatile int b[1000];
    volatile int c[1000];
  }  ab, *ab_ptr;

  In Version 1 of the X=X program, the X=X program uses malloc()
to get an area of storage.  Then ab_ptr is set to address the
beginning of the malloc() area.
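
  As a minimal sketch of this setup, assuming a named union type
(xx_area) and two helper functions (xx_alloc_area and
xx_self_copy_is_nop) which are invented here for illustration,
the allocation and a single self-copy might look like this.  The
point it shows is that a, b and c overlay the same storage, so
copying b[i] into a[i] copies a word onto itself.

```c
#include <stdlib.h>

#define XX_WORDS 1000

/* The three members overlay the same storage, so a[i], b[i] and c[i]
   all name the same word. */
union xx_area {
    volatile int a[XX_WORDS];
    volatile int b[XX_WORDS];
    volatile int c[XX_WORDS];
};

/* Version 1 setup: obtain the DA from malloc() and fill it with a
   recognizable pattern.  Returns NULL if the allocation fails. */
union xx_area *xx_alloc_area(void)
{
    union xx_area *ab_ptr = malloc(sizeof *ab_ptr);
    if (ab_ptr == NULL)
        return NULL;
    for (int i = 0; i < XX_WORDS; i++)
        ab_ptr->a[i] = 3 * i + 5;
    return ab_ptr;
}

/* Copy word i onto itself through two different member names and
   report whether its value is unchanged (1) or not (0). */
int xx_self_copy_is_nop(union xx_area *ab_ptr, int i)
{
    int before = ab_ptr->c[i];
    ab_ptr->a[i] = ab_ptr->b[i];
    return ab_ptr->c[i] == before;
}
```

  The caller is responsible for releasing the area with free()
once all threads have finished.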

  In Version 2 of the X=X program, the X=X program (somehow) gets
the address of the beginning of the X=X program and sets ab_ptr
to that address.
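
  One possible way to get that address, offered only as a sketch,
is to use symbols which the linker supplies.  The sketch assumes
an ELF system such as Linux with GNU ld or a compatible linker,
which provides __executable_start and etext; other systems may
have nothing equivalent.  The names xx_range, xx_code_range and
xx_in_range are invented for illustration.

```c
#include <stdint.h>
#include <unistd.h>

/* Symbols supplied by the linker, not by this program (assumption:
   GNU ld or a compatible linker on an ELF system such as Linux). */
extern char __executable_start;   /* first byte of the loaded image */
extern char etext;                /* first byte past the program text */

struct xx_range {
    uintptr_t begin;   /* page-aligned start of the candidate DA */
    uintptr_t end;     /* page-aligned end (exclusive) */
};

/* Return the page-aligned span which covers the program's code. */
struct xx_range xx_code_range(void)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    struct xx_range r;
    r.begin = (uintptr_t)&__executable_start & ~(page - 1);
    r.end = ((uintptr_t)&etext + page - 1) & ~(page - 1);
    return r;
}

/* Report whether a function's entry point lies inside the span. */
int xx_in_range(struct xx_range r, void (*fn)(void))
{
    uintptr_t p = (uintptr_t)fn;
    return p >= r.begin && p < r.end;
}
```

  Knowing the span is only the first step.  Program text is
normally mapped read-only, so storing into it would presumably
also require a call such as mprotect() to make the pages
writable, and some systems refuse such a request.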
  
  Here is a sample X=X program. In both Version 1 and Version 2
of the X=X program, there are multiple threads, all
simultaneously executing these statements:

  int i;
  int temp[12] = {5,7,9,8,6,4,5,7,9,8,6,4};
  ab_ptr->a[17] = ab_ptr->b[17];                                    // nop
  for (i=0;i<1000;i++) ab_ptr->a[i] = ab_ptr->b[i];                 // nop
  for (i=220;i<800;i++) ab_ptr->c[i] = ab_ptr->b[i];                // nop
  for (i=800;i>220;i--) ab_ptr->c[i] = ab_ptr->b[i];                // nop
  for (i=0;i<1000;i+=3) ab_ptr->a[i] = ab_ptr->c[i];                // nop
  for (i=0;i<1000;i++) ab_ptr->a[i] = ab_ptr->b[i] | ab_ptr->c[i];  // nop
  for (i = 0; i < 12; i++) ab_ptr->a[temp[i]] = ab_ptr->b[temp[i]]; // nop
 
  Note:  In the last 24 hours I have spent time with two people
who know far more about C/C++ than I do.  We could not agree on
the language needed to accomplish the goals stated above.


  8.  Version 3 of the X=X program.

  One way a SMMP can deal with the contention created by the X=X
program is, I am told, for the machine to detect that a thread is
seeking to modify the code it is executing and then to enter a
super-cautious mode of execution where operations are performed
only one, very careful, step at a time.
    
  Here is a way for the program to avoid triggering super-
cautious mode while still provoking a lot of the contention of
Version 2 of the X=X program.
  
  Create two data areas, DA1 and DA2.  Each contains a copy of
the program being executed.

  Divide the threads into two groups, the even threads and the
odd threads.

  The even threads execute the code in DA2, but access the
code/data in DA1.

  The odd threads execute the code in DA1, but access the
code/data in DA2.

  Thus no thread reads or writes its own code, and so no thread
triggers super-cautious mode, but each thread does read and write
the code which other threads are executing.
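
  The assignment of threads to data areas can be written down as
a small C function.  This sketch covers only the bookkeeping; it
does not show how a copy of the program is placed in a data area
or made executable.  The names xx_area_id, xx_assignment and
xx_assign are invented for illustration.

```c
/* Identifiers for the two copies of the program (illustrative names). */
enum xx_area_id { XX_DA1 = 1, XX_DA2 = 2 };

struct xx_assignment {
    enum xx_area_id executes;   /* copy the thread runs its code from */
    enum xx_area_id accesses;   /* copy the thread reads and writes */
};

/* Even-numbered threads run from DA2 and touch DA1; odd-numbered
   threads run from DA1 and touch DA2. */
struct xx_assignment xx_assign(unsigned thread_index)
{
    struct xx_assignment s;
    if (thread_index % 2 == 0) {
        s.executes = XX_DA2;
        s.accesses = XX_DA1;
    } else {
        s.executes = XX_DA1;
        s.accesses = XX_DA2;
    }
    return s;
}
```

  For every thread index the two fields differ, which is the
property Version 3 depends on.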