#### Numerical Reproducibility for Parallel Stochastic Simulation "Exascale Ready"

















# HW/SW Codesign (Reliability)

- Can we identify at compile time certain critical regions which need stronger correctness guarantees?
- We are already generating terabytes to petabytes of state per second. At exascale we will be generating exabytes of state each second.
- A single wrong bit can vitiate the entire calculation.
- For many scientific calculations, we should be able to gracefully tolerate many kinds of bit errors, and also the loss of many kinds of local resources.
- For example, in many Monte Carlo simulations, the loss of a processor does not imply the inherent failure of the simulation.
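
The Monte Carlo point can be illustrated with a minimal sketch. It assumes a toy estimation of pi with independent replicates, one per processor; the seeds, the set of "lost" processors and the function names are all hypothetical. Consecutive integer seeds are used only for brevity and are not a recommended way to build parallel streams.

```python
import math
import random

def replicate(seed, n_samples=20_000):
    """One independent replicate: count points falling inside the quarter disc."""
    rng = random.Random(seed)  # each replicate owns its own generator
    hits = sum(
        1
        for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return hits, n_samples

def estimate_pi(results):
    """Combine whichever replicates survived into a single estimate."""
    hits = sum(h for h, _ in results)
    total = sum(n for _, n in results)
    return 4.0 * hits / total

seeds = range(10)   # hypothetical: one replicate per processor
lost = {3, 7}       # hypothetical: processors that failed during the run
survivors = [replicate(s) for s in seeds if s not in lost]
pi_hat = estimate_pi(survivors)
print(f"{len(survivors)} replicates survived, estimate = {pi_hat:.4f}")
```

Losing two replicates only widens the confidence interval; the estimate built from the survivors remains valid because the replicates are independent.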

## Checkpointing (Reliability)

- The limits of classical checkpointing will be reached: a fault every hour (or less) with current MTF, but an exascale checkpoint could last 30 minutes at 1 terabyte/s!
- Without a radical change, we are going to be much worse off than we are today.
- We have to build a much higher level of local checkpointing capability into our software and hardware systems.
- Parallel stochastic simulations could checkpoint much faster by saving only intermediate results and the state of every pseudorandom number generator.
- Using RAIDed non-volatile memory, we could checkpoint state very often by moving copies of the needed application state to nearest-neighbor nodes (since this memory only draws power when in use, the energy implications would be minimal).
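
The idea of checkpointing only intermediate results plus generator states can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the talk: it relies on Python's standard `random.Random` (a Mersenne Twister) and `pickle`, and the helper names `run`, `make_checkpoint` and `restore` are hypothetical.

```python
import pickle
import random

def run(rng, partial_sum, n_steps):
    """Toy stochastic simulation: accumulate n_steps random draws."""
    for _ in range(n_steps):
        partial_sum += rng.random()
    return partial_sum

def make_checkpoint(rng, partial_sum):
    """Serialize only the intermediate result and the generator state."""
    return pickle.dumps({"rng_state": rng.getstate(), "partial_sum": partial_sum})

def restore(blob):
    """Rebuild a generator and the intermediate result from a checkpoint."""
    data = pickle.loads(blob)
    rng = random.Random()
    rng.setstate(data["rng_state"])
    return rng, data["partial_sum"]

# Reference run without interruption.
reference = run(random.Random(2024), 0.0, 1000)

# Same run, interrupted after 400 steps and resumed from the checkpoint.
rng = random.Random(2024)
partial = run(rng, 0.0, 400)
blob = make_checkpoint(rng, partial)
resumed_rng, resumed_partial = restore(blob)
resumed = run(resumed_rng, resumed_partial, 600)

print(resumed == reference)
```

Because the generator state and the partial sum are restored exactly and the remaining draws are added in the same order, the resumed run is expected to match the uninterrupted one bit for bit.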

















# Numerical Reproducibility results

Errors found:

- for different hardware,
- different operating systems,
- different compilers.

Table 3: Testing of reproducibility for 7 different PRNGs (MT19937 with 2 versions, TinyMT with 2 versions, MRG32k3a, WELL512, MLFG64) performed on 5 different processors (Intel E5-2650v2, Intel E5-2687W, Core 2 Duo T7100, AMD 6272 Opteron, Core i7-4800MQ) with different compilers (gcc, icc, lcc, open64, MinGW, Cygwin).

| Generator  | E5-2650v2 gcc | E5-2650v2 icc | E5-2687W gcc | E5-2687W icc | Core 2 Duo T7100 gcc | Core 2 Duo T7100 open64 | AMD Opteron(TM) 6272 gcc | AMD Opteron(TM) 6272 open64 | Core i7-4800MQ Cygwin | Core i7-4800MQ MinGW | Core i7-4800MQ lcc (lc) | Core i7-4800MQ lcc (lc64) |
|------------|---------------|---------------|--------------|--------------|----------------------|-------------------------|--------------------------|-----------------------------|-----------------------|----------------------|-------------------------|---------------------------|
| MT19937    | Yes           | Yes           | Yes          | Yes          | Yes                  | Yes                     | Yes                      | Yes                         | Yes                   | Yes                  | Yes                     | Yes                       |
| MT19937_64 | Yes           | Yes           | Yes          | Yes          | Yes                  | Yes                     | Yes                      | Yes                         | Yes                   | Yes                  | Yes                     | Yes                       |
| TinyMT_32  | Yes           | Yes           | Yes          | Yes          | Yes                  | NO                      | Yes                      | Yes                         | Yes                   | Yes                  | Yes                     | Yes                       |
| TinyMT_64  | Yes           | Yes           | Yes          | Yes          | Yes                  | Yes                     | Yes                      | Yes                         | Yes                   | NO                   | NO                      | Yes                       |



# Numerical Reproducibility results 3/4

Errors found:

Problems encountered with 32-bit and 64-bit architectures for the same compiler (lcc compiler in 32 bits; OK in 64 bits).

Table 6: Results for the TinyMT 64 PRNG on a Core i7-4800MQ running Windows 7

| Expected results (CHECK64.OUT.TXT) | Results obtained with the lc 32-bit compiler |
|------------------------------------|----------------------------------------------|
| 0.125567123229521                  | 0.514472427354387                            |
| 1.437679237017648                  | 1.386730269781771                            |
| 0.231189305675805                  | 0.112526841009551                            |
| 0.777528512172794                  | 0.197121666699821                            |





### **Quick survey of random streams parallelization** (1) Using the same generator:

- The **Central Server** (CS) technique (to be avoided when flexible reproducibility is needed).

- The **Leap Frog** (LF) technique: partition a sequence $\{x_i, i = 0, 1, \ldots\}$ into $n$ sub-sequences, where the $j$-th sub-sequence is $\{x_{kn+j-1}, k = 0, 1, \ldots\}$, like a deck of cards dealt to card players.

- The **Sequence Splitting** (SS) technique, also called blocking or regular/fixed spacing: partition a sequence $\{x_i, i = 0, 1, \ldots\}$ into $n$ sub-sequences, where the $j$-th sub-sequence is $\{x_{k+(j-1)m}, k = 0, \ldots, m-1\}$ and $m$ is the length of each sub-sequence.

- The **Jump Ahead** technique (can be used for both Leap Frog and Sequence Splitting).

- The **Cycle Division** or **Jump Ahead** approach: analytical computation of the generator state in advance, after a huge number of cycles (generations).

- The **Indexed Sequences** (IS) technique, or random spacing: the generator is initialized with $n$ different seeds/statuses.
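
The Leap Frog, Sequence Splitting and Jump Ahead definitions above can be made concrete with a small sketch. It assumes a toy linear congruential generator whose parameters were picked for illustration only (not a generator recommended for simulation); the function names are hypothetical. Jump ahead is done here by composing the affine map $x \mapsto ax + c \pmod m$ with itself through repeated squaring.

```python
from itertools import islice

A, C, M = 1103515245, 12345, 2**31  # toy LCG parameters (illustrative only)

def lcg(state):
    """Yield successive states of the toy LCG."""
    while True:
        state = (A * state + C) % M
        yield state

def leap_frog(seq, n, j):
    """j-th of n sub-sequences (j = 1..n): x_{kn+j-1}, k = 0, 1, ..."""
    return seq[j - 1::n]

def sequence_split(seq, m, j):
    """j-th block of length m (j = 1, 2, ...): x_{k+(j-1)m}, k = 0..m-1."""
    return seq[(j - 1) * m: j * m]

def jump_ahead(state, k):
    """State reached after k steps, computed in O(log k) operations."""
    acc_a, acc_c = 1, 0      # identity map: zero steps taken so far
    step_a, step_c = A, C    # map advancing by 2^i steps (i = 0 initially)
    while k > 0:
        if k & 1:
            # compose the accumulated map with the current power-of-two map
            acc_a, acc_c = (step_a * acc_a) % M, (step_a * acc_c + step_c) % M
        # square the power-of-two map
        step_a, step_c = (step_a * step_a) % M, (step_a * step_c + step_c) % M
        k >>= 1
    return (acc_a * state + acc_c) % M

seed = 42
seq = list(islice(lcg(seed), 12))   # x_0 .. x_11

print(leap_frog(seq, 3, 1) == [seq[0], seq[3], seq[6], seq[9]])
print(sequence_split(seq, 4, 2) == seq[4:8])
print(jump_ahead(seed, 5) == seq[4])  # seq[4] is the state after 5 steps
```

With jump ahead, the start of each Sequence Splitting block can be reached directly instead of generating and discarding all the preceding numbers.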

### **Quick survey of random streams parallelization** (2) Using different generators:

#### **Parameterization:**

The same type of generator is used with different parameters for each processor, which means that we effectively produce different generators.

- In the case of linear congruential generators (LCG), this can rapidly lead to poor results even when the parameters are very carefully checked. (Example: Mascagni and Chi proposed that the modulus be a Mersenne or Sophie Germain prime number.)
- The Explicit Inversive Congruential Generator (EICG) with prime modulus has some very compelling properties for parallelization via parameterization. A recent paper describes an implementation of parallel random number sequences obtained by varying a set of parameters instead of splitting a single random sequence (Chi and Cao 2010).
- In 2000, Matsumoto et al. proposed a dynamic creation technique.





# Optimization for a single « hybrid » node (Intel E5-2650 & Xeon Phi 7120P)

Parallel stochastic simulation of muonic tomography

- Parallel programming model using pthreads
- One stochastic object for each muon
- Multiple streams using MRG32k3a<sup>1</sup>
- A billion threads handled by a single node
- Compiler flags set for maximum reproducibility

Table 3: Performance of a billion event simulation when parallelized on 1 Phi, 1 CPU, 2 CPUs

|         | Intel Xeon Phi 7120P | Intel Xeon E5-2650v2 | 2x Intel Xeon E5-2650v2 |
|---------|----------------------|----------------------|-------------------------|
| Time    | 48 h 49 min          | 36 h 32 min          | 18 h 17 min             |
| Speedup | 1                    | 1.34                 | 2.67                    |
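
As a quick arithmetic check, the speedups in the table can be recomputed from the reported times, taking the Xeon Phi run as the reference. The variable names below are illustrative.

```python
def minutes(hours, mins):
    """Convert an 'h min' duration to minutes."""
    return 60 * hours + mins

times = {
    "Intel Xeon Phi 7120P": minutes(48, 49),
    "Intel Xeon E5-2650v2": minutes(36, 32),
    "2x Intel Xeon E5-2650v2": minutes(18, 17),
}

reference = times["Intel Xeon Phi 7120P"]
speedups = {name: round(reference / t, 2) for name, t in times.items()}
print(speedups)
# {'Intel Xeon Phi 7120P': 1.0, 'Intel Xeon E5-2650v2': 1.34, '2x Intel Xeon E5-2650v2': 2.67}

# Scaling from one CPU to two CPUs
cpu_scaling = round(times["Intel Xeon E5-2650v2"] / times["2x Intel Xeon E5-2650v2"], 2)
print(cpu_scaling)  # 2.0
```

The ratio between the one-CPU and two-CPU times is very close to 2, which indicates near-linear scaling across the two sockets for this simulation.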

(1) P. L'Ecuyer, R. Simard, E. J. Chen, and W. D. Kelton, "An Object-Oriented Random-Number Package with Many Long Streams and Substreams", Operations Research, Vol. 50, no. 6 (2002), pp. 1073-1075.

# Bit for bit reproducibility

- Do not expect bit for bit reproducibility when working on Intel Phi vs. regular Intel processors.
- We observed bit for bit […] in double precision (a […]
- The relative differences […] (vs Phi) in double precision were analyzed.

Table 1: Relative CPU-Phi differences between the results and number of altered bits

| Difference ↓ / Result →                 | Position X | Position Z | Direction X | Direction Y | Direction Z |
|-----------------------------------------|------------|------------|-------------|-------------|-------------|
| 0 bit: bit for bit reproducibility      | 4922       | 4934       | 4896        | 4975        | 4913        |
| 1 bit: 1.11E-16 ≤ Δ < 2.22E-16          | 25         | 21         | 14          | 5           | 18          |
| 2 bits: 2.22E-16 ≤ Δ < 4.44E-16         | 21         | 18         | 52          | 4           | 31          |
| 3 bits: 4.44E-16 ≤ Δ < 8.88E-16         | 15         | 12         | 23          | 6           | 12          |

Relative difference (Phi vs E5)

The results on the two architectures are of the same order.

Both of them have the same sign and the same exponent (exceptions are theoretically possible, but they would be very rare).

The only bits that can differ between these results are the least significant bits of the significand.

For a given exponent $e$ and a result $r_1 = m \times 2^e$, the closest value greater than $r_1$ is $r_2 = (m + \epsilon_d) \times 2^e$, where $\epsilon_d$ is the value of the least significant bit of the significand: $\epsilon_d = 2^{-52} \approx 2.22 \times 10^{-16}$.
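
The relationship between the last bits of the significand and the relative difference can be checked with a short sketch. It assumes IEEE 754 double precision and defines the relative difference as $|a - b| / |a|$, which is an assumption since the slides do not state the exact formula; the helper names are hypothetical. The classification mirrors the intervals of Table 1 (1 bit for $2^{-53} \le \Delta < 2^{-52}$, 2 bits for $2^{-52} \le \Delta < 2^{-51}$, and so on).

```python
import math
import struct

def to_bits(x):
    """Reinterpret a double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(n):
    """Reinterpret a 64-bit unsigned integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", n))[0]

def ulp_distance(a, b):
    """Number of representable doubles separating a and b (same sign assumed)."""
    return abs(to_bits(a) - to_bits(b))

def altered_bits(a, b):
    """Classify the relative difference using the intervals of Table 1."""
    if a == b:
        return 0
    rel = abs(a - b) / abs(a)
    _, exponent = math.frexp(rel)   # rel = f * 2**exponent with 0.5 <= f < 1
    return exponent + 53            # 2**-53 <= rel < 2**-52 gives 1

eps_d = 2.0 ** -52
r1 = 1.5
r2 = from_bits(to_bits(r1) + 1)     # closest double greater than r1

print(r2 - r1 == eps_d)             # the last bit of the significand is worth 2**-52 here
print(ulp_distance(r1, r2))         # 1
print(altered_bits(r1, r2))         # 1
```

Comparing two result files value by value with such a function would produce a histogram of the kind shown in Table 1.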

Intel Compiler flags:

✓ "-fp-model precise -fp-model source -fimf-precision=high" for the compilation on the Xeon CPU.





