Alternate strategies for globalization in ConstantPropagation #288
Merged
Space-safety prohibits ConstantPropagation from globalizing all arrays that are allocated at most once by a program. In particular, because globals are live for the duration of the program, globalizing an `int list array` (for example) would not be safe-for-space: arbitrarily large lists may be stored in the array and never garbage collected (whereas, when the `int list array` is not globalized, it will be garbage collected when it is no longer live). On the other hand, globalizing an `int array` (with a constant length) is safe-for-space.

However, previously, the globalization of a `val a: t array = Array_alloc[t] (l)` was conditioned on the smallness of `t array`, and `Type.isSmall` returns `false` for arrays, so no array would be globalized. It is correct to globalize an array if `t` is small; note that to globalize `val a: t array = Array_alloc[t] (l)`, `l` (the length) must be globalized and must, therefore, be a constant, so the array is of constant size. (This is Stephen Weeks's relaxed notion of safe-for-space, where the constant-factor blowup can be chosen per program.) In practice, it may be better to limit globalization of arrays to ones with "small" length in addition to small element type. This commit allows a `val a: t array = Array_alloc[t] (l)` to be globalized if `-globalize-arrays true`, `t` is small, and `l` is globalized.
Use standard Trace.trace and prep recursive calls.
`-globalize-small-type 0` is the constant `false` function, which globalizes no references or arrays. `-globalize-small-type 1` is the previous `Type.isSmall` function, which returns false for all datatypes. `-globalize-small-type 9` is the constant `true` function, which is not safe-for-space.
`-globalize-small-type 2` treats datatypes with all nullary constructors as small.
`-globalize-small-type 3` treats datatypes that have all constructor arguments satisfying `#isSmallType o (mkIsSmallType 1)` as small.
Fixes a bug (triggered by the `simple` benchmark with `-globalize-arrays true`) where the rawness of an array value was not properly propagated. Previously, the rawness of an array value was implemented as `bool option ref`, but coercing `ref NONE` to `ref NONE` would not remember the coercion when the first raw became `ref (SOME true)`. Now, a proper "flat lattice"-like `structure Raw` is used.
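The "flat lattice" fix can be illustrated with a small sketch: cells that are coerced to one another must share their eventual value, which copying a plain `bool option ref` does not achieve. Below is a minimal C analogy using union-find; all names (`Cell`, `coerce`, `set_value`) are invented for illustration and this is not MLton's actual `structure Raw`:

```c
#include <assert.h>
#include <stdlib.h>

/* A flat-lattice cell for a boolean property: UNKNOWN (bottom) or a known
 * value. Cells that are coerced into one another are unified, so a later
 * update to one is seen by all (unlike two independent `ref NONE` cells). */
typedef enum { UNKNOWN, KNOWN_FALSE, KNOWN_TRUE } Value;

typedef struct Cell {
    struct Cell *parent;  /* union-find link; points to itself if representative */
    Value value;
} Cell;

static Cell *cell_new(void) {
    Cell *c = malloc(sizeof *c);
    c->parent = c;
    c->value = UNKNOWN;
    return c;
}

static Cell *find(Cell *c) {
    while (c->parent != c) c = c->parent;
    return c;
}

/* Coerce a into b: unify the two cells. A copy-based `bool option ref`
 * implementation would forget this link once both sides are still NONE. */
static void coerce(Cell *a, Cell *b) {
    Cell *ra = find(a), *rb = find(b);
    if (ra == rb) return;
    assert(ra->value == UNKNOWN || rb->value == UNKNOWN || ra->value == rb->value);
    if (rb->value == UNKNOWN) rb->value = ra->value;
    ra->parent = rb;
}

static void set_value(Cell *c, Value v) { find(c)->value = v; }
static Value get_value(Cell *c) { return find(c)->value; }
```

With this structure, coercing two unknown cells and later learning that one is raw propagates the fact to the other, which is exactly the propagation the `bool option ref` version missed.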
Technically, neither IntInf nor Thread should be considered a small type, as both can be of unbounded size (an IntInf can be represented by a sequence, while a Thread implicitly contains a stack). Note that an IntInf value can be globalized (as it is a constant), but an `IntInf.int ref` cannot.
Allow previous behavior of globalizing `IntInf.int ref` values.
`-globalize-small-type 4` uses RefFlatten's more precise notion of small/large types: build a graph of the dependencies between datatypes, force any (mutually) recursive datatypes to be large types, solve a fixed-point analysis of `Size.<=` constraints.
MatthewFluet added a commit that referenced this pull request on May 31, 2019:
Bounce variables around forced stack allocations in RSSA loops

See #218. In the translation from RSSA to Machine, each RSSA variable is assigned a location: either a stack slot or a register (i.e., local). An RSSA variable is assigned to a stack slot if it is live across a non-tail call or a `mayGC` ccall. Assigning an RSSA variable to a stack slot has a potential cost, because stack-slot accesses appear as memory reads and writes to the codegens and are not easily realized by hardware registers. This cost is multiplied when the stack-slot accesses are within a loop. For example, consider the tail-recursive `revapp`:

```sml
fun revapp (l, acc) =
  case l of
     [] => acc
   | h::t => revapp (t, h::acc)
```

This will be realized as an intraprocedural loop; however, due to the `h::acc` allocation in the loop, there will be a potential GC that forces `l` and `acc` to be assigned to stack slots (assuming the GC check occurs at the loop header). It would seem more efficient to assign `l` and `acc` to registers, moving them to and from stack slots at the GC, because a GC should occur on only a small fraction of the loop iterations. A new `BounceVars` RSSA pass attempts to split the live ranges of RSSA variables that are used within loops so that the within-loop instances of the variables are assigned registers (possibly being moved to and from stack slots around a GC). Unfortunately, it has not been easy to find good heuristics. The current defaults are biased towards not degrading performance.
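The intended effect of the pass can be sketched in C: the loop variable lives in a local (register) on the hot path and is spilled to / reloaded from its stack slot only around the rarely taken GC branch. All names here (`need_gc`, `do_gc`, `stack_slot`) are invented stand-ins, not MLton runtime functions:

```c
#include <stdbool.h>

static long stack_slot;            /* stands in for the variable's stack slot */
static int alloc_budget = 1000;    /* pretend allocation budget that triggers GC */

static bool need_gc(void) { return --alloc_budget == 0; }
static void do_gc(void) { alloc_budget = 1000; /* roots are read from stack slots */ }

/* Loop variable `acc` is "bounced": register-only accesses on the hot path,
 * moved to/from its stack slot only on the rarely taken GC branch. */
long sum_to(long n) {
    long acc = 0;
    for (long i = 0; i < n; i++) {
        acc += i;                  /* hot path: no memory traffic for acc */
        if (need_gc()) {           /* rarely taken */
            stack_slot = acc;      /* bounce out: make acc visible as a GC root */
            do_gc();
            acc = stack_slot;      /* bounce in: GC may have moved objects */
        }
    }
    return acc;
}
```

Without the bouncing, `acc` would be read from and written to `stack_slot` on every iteration; with it, the memory traffic is confined to the GC branch.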
## Benchmark results (cadmium)

Specs:
* 4 x AMD Opteron(tm) Processor 6172 (2.1GHz; 48 physical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

```text
config  command
C00  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64
C01  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c
C02  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm
C04  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64
C05  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c
C06  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm
```

### Run-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.050   1.004   0.9939  0.9724
boyer              1.020   0.9547  0.9826  0.9940
checksum           1.007   1.034   0.9795  1.003
count-graphs       0.9735  1.008   0.9826  0.9722
DLXSimulator       0.9397  0.9566  0.9553  0.8925
even-odd           0.9185  1.019   1.000   0.9754
fft                1.061   1.041   1.039   1.021
fib                1.013   1.046   1.040   0.9884
flat-array         1.064   0.9138  0.9901  0.9247
hamlet             0.9442  0.9713  0.9664  0.9955
imp-for            1.040   0.9223  0.8787  1.005
knuth-bendix       0.9967  1.007   0.9924  0.9744
lexgen             1.005   1.025   0.9892  0.9574
life               0.9805  0.9813  0.9779  1.012
logic              1.062   0.9721  1.002   0.9641
mandelbrot         0.9995  1.003   1.005   1.000
matrix-multiply    0.9424  1.028   1.040   0.9958
md5                0.9946  0.9913  0.9929  1.053
merge              0.9942  0.9397  0.9797  0.9558
mlyacc             0.9956  0.9558  1.021   1.026
model-elimination  0.9816  0.9842  1.000   0.9769
mpuz               0.6748  0.8899  0.9354  1.084
nucleic            0.9955  0.9901  1.008   0.9746
output1            0.9880  0.8836  0.8840  0.9090
peek               1.010   1.075   0.9669  1.000
pidigits           0.9778  1.055   0.9936  1.018
psdes-random       0.9905  0.9943  1.054   1.000
ratio-regions      1.030   1.071   0.9943  0.9448
ray                1.016   0.9944  0.9993  1.030
raytrace           1.024   1.009   0.9994  1.017
simple             0.9900  1.017   1.019   1.032
smith-normal-form  1.015   1.017   1.080   1.061
string-concat      0.9619  1.004   0.9863  1.173
tailfib            0.9867  1.056   1.001   0.9989
tailmerge          1.005   1.002   0.9844  0.9697
tak                0.9886  1.025   0.9999  1.013
tensor             1.015   0.9728  1.006   0.9927
tsp                1.028   0.9878  0.9956  0.9745
tyan               1.029   1.011   1.012   0.9404
vector32-concat    0.7653  0.6395  0.8281  1.003
vector64-concat    0.8517  0.7782  0.8571  0.9619
vector-rev         0.9514  0.9847  0.8541  0.9161
vliw               1.028   0.9368  1.017   1.011
wc-input1          0.9170  0.9157  0.9084  0.9240
wc-scanStream      0.9929  1.007   1.146   1.038
zebra              0.9984  1.008   1.008   0.9941
zern               1.026   0.9929  1.005   0.9942
MIN                0.6748  0.6395  0.8281  0.8925
GMEAN              0.9811  0.9771  0.9846  0.9912
MAX                1.064   1.075   1.146   1.173
```

Notes:
* Overall, there are fewer run-time improvements than were hoped for, although there are some "big" wins and no "big" losses.
* The LLVM codegen seems to benefit the least; it may be that LLVM is able to realize some stack-slot accesses as hardware registers (although it is unclear how it justifies doing so --- LLVM should not be able to "see" that accesses of stack slots and heap objects do not alias).
* The speedups with `vector32-concat` and `vector64-concat` may be related to the observation made in #288, where it was observed that moving a sequence from a register to a global had a significant slowdown (since the access through the global required an additional level of indirection). In this case, a sequence may be moving from a stack slot to a register, eliminating a level of indirection.
### Compile-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         0.9721  0.9760  0.9171  0.9393
boyer              0.9054  0.9675  0.9988  1.024
checksum           0.9636  0.9630  0.9753  1.016
count-graphs       0.9822  1.002   0.9862  0.9151
DLXSimulator       0.9205  0.9248  0.9340  0.9704
even-odd           0.9808  0.9600  0.9705  1.015
fft                0.9938  0.9644  0.9817  0.9656
fib                0.9670  0.9777  0.9750  0.9473
flat-array         0.9781  0.9758  1.013   0.9805
hamlet             0.9942  1.104   1.029   1.000
imp-for            0.9380  0.9764  0.9429  0.9979
knuth-bendix       0.9409  0.9584  0.8814  0.8520
lexgen             0.9317  0.8911  0.8933  0.9525
life               0.9598  0.9610  0.9812  0.9046
logic              0.9704  0.9620  0.9474  0.9856
mandelbrot         0.9370  0.9926  1.011   0.9694
matrix-multiply    0.9510  0.9425  0.9634  0.9497
md5                0.9568  0.9609  0.9473  0.9282
merge              0.9896  0.9517  0.9780  0.9987
mlyacc             1.124   1.076   1.161   1.150
model-elimination  0.9912  0.9493  1.093   0.9879
mpuz               0.9568  0.9690  1.006   0.9763
nucleic            0.9010  0.9169  0.9384  1.079
output1            0.9668  0.9896  0.9598  0.9086
peek               0.9713  0.9736  0.9840  0.9626
pidigits           0.9720  0.9686  0.9494  0.9379
psdes-random       0.9761  1.004   1.002   0.9435
ratio-regions      0.9844  0.9421  0.9679  0.9262
ray                0.9605  0.8728  0.8533  0.8406
raytrace           0.9130  0.9501  0.8719  1.029
simple             0.8696  0.9130  0.9537  0.9602
smith-normal-form  0.9000  0.9400  0.9632  1.167
string-concat      0.9667  0.9934  0.9466  1.000
tailfib            0.9807  0.9598  0.9779  0.9713
tailmerge          0.9848  0.9871  0.9878  0.9870
tak                0.9720  1.0000  0.9728  0.9968
tensor             0.9739  0.9348  0.9057  0.8996
tsp                0.9955  1.001   0.9283  0.9518
tyan               0.9734  0.9536  0.9052  0.9488
vector32-concat    0.9527  0.9997  0.9673  0.9471
vector64-concat    0.9617  0.9722  0.9921  0.9885
vector-rev         0.9809  0.9846  0.9625  0.9955
vliw               1.020   1.004   1.111   1.086
wc-input1          0.9345  1.002   0.9502  0.9549
wc-scanStream      0.9890  0.9830  1.027   1.028
zebra              0.9712  0.9204  1.101   0.9702
zern               0.9416  1.013   1.098   0.8998
MIN                0.8696  0.8728  0.8533  0.8406
GMEAN              0.9635  0.9691  0.9740  0.9727
MAX                1.124   1.104   1.161   1.167
```

Notes:
* One concern with assigning more RSSA variables to registers is that it increases the burden on the codegen (hardware) register allocator.
* With the exception of `mlyacc`, there is a general improvement in compile time. Note that configs C04, C05, C06, and C07 use an MLton compiled with `BounceVars` to compile programs with `BounceVars`. It isn't clear how much of the benefit is due to the effect of `BounceVars` on the compiled MLton and how much is due to the effect of `BounceVars` on the IR of the compiled programs. In any case, it seems that `BounceVars` "pays for itself", although the greatest improvements are not on the largest benchmark programs (`hamlet`, `mlyacc`, `model-elimination`, `nucleic`, `vliw`).

### Executable-Size Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.005   1.005   1.009   0.9982
boyer              1.002   0.9940  0.9967  0.9972
checksum           1.005   1.001   1.001   1.000
count-graphs       1.007   1.004   1.005   1.001
DLXSimulator       1.015   1.005   1.006   1.007
even-odd           1.005   1.001   0.9999  1.0000
fft                1.009   1.006   1.004   1.003
fib                1.005   1.001   0.9998  1.000
flat-array         1.005   1.001   1.001   0.9993
hamlet             1.002   1.003   1.004   0.9999
imp-for            1.005   1.001   1.001   0.9991
knuth-bendix       1.004   0.9990  1.003   1.001
lexgen             1.025   1.001   1.002   1.004
life               1.003   0.9999  1.000   0.9999
logic              1.004   1.003   1.001   1.002
mandelbrot         1.005   1.001   1.001   0.9994
matrix-multiply    1.005   1.000   1.002   0.9993
md5                1.005   1.002   1.002   1.007
merge              1.004   1.000   0.9994  0.9995
mlyacc             1.176   1.154   1.183   1.180
model-elimination  1.006   1.009   1.011   1.003
mpuz               1.006   0.9998  1.001   1.001
nucleic            1.000   0.9996  0.9998  0.9985
output1            1.006   1.002   1.002   1.006
peek               1.005   1.001   1.001   0.9996
pidigits           1.003   0.9998  1.001   1.005
psdes-random       1.005   1.001   1.002   0.9976
ratio-regions      1.020   1.008   1.007   1.009
ray                1.023   1.021   1.018   1.013
raytrace           1.008   1.003   1.003   1.011
simple             1.009   1.033   1.034   1.055
smith-normal-form  1.005   1.005   1.006   1.003
string-concat      1.004   0.9997  1.002   0.9980
tailfib            1.005   1.000   1.000   0.9998
tak                1.005   1.001   0.9997  0.9999
tensor             1.019   1.019   1.018   1.020
tsp                1.005   1.002   1.004   1.007
tyan               1.034   1.026   1.025   1.030
vector32-concat    1.005   1.000   1.001   0.9984
vector64-concat    1.005   1.000   1.001   0.9982
vector-rev         1.005   1.001   1.001   0.9988
vliw               1.033   1.010   1.012   1.027
wc-input1          1.012   1.009   1.008   1.010
wc-scanStream      1.011   1.006   1.007   1.007
zebra              1.004   1.001   1.002   1.005
zern               1.006   1.003   1.001   1.002
MIN                1.000   0.9940  0.9967  0.9972
GMEAN              1.012   1.007   1.008   1.008
MAX                1.176   1.154   1.183   1.180
```

Notes:
* Consistent with the increase in compile time for `mlyacc`, there is an increase in code size for `mlyacc`.
* Otherwise, there is little change in code size.

## Benchmark results (sulfur)

Specs:
* 2 x Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (8 physical cores; 16 logical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

```text
config  command
C00  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64
C01  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c
C02  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm
C04  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64
C05  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c
C06  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm
```

### Run-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.066   1.004   0.9963  0.8840
boyer              0.9890  0.9917  0.9902  0.9567
checksum           1.065   1.001   0.9935  0.9681
count-graphs       0.9580  0.9805  0.9665  0.9672
DLXSimulator       0.7266  0.7235  0.7374  0.7496
even-odd           1.009   0.9903  0.9873  0.9845
fft                0.9930  0.9983  0.9710  0.9682
fib                1.108   1.023   1.0000  0.9724
flat-array         0.9638  0.9581  0.9604  0.9620
hamlet             0.8991  0.9415  0.9295  0.9544
imp-for            0.9688  0.9359  0.9240  1.123
knuth-bendix       1.005   0.9915  1.025   1.005
lexgen             1.001   0.9954  0.9583  0.9974
life               0.9659  1.003   0.9652  0.9923
logic              0.9887  0.9938  0.9702  0.9919
mandelbrot         0.9989  1.015   1.024   1.008
matrix-multiply    1.017   0.9927  1.014   0.9881
md5                0.9846  1.006   0.9868  1.026
merge              1.071   1.077   1.039   1.074
mlyacc             0.9994  0.9986  1.012   1.011
model-elimination  0.9850  0.9656  0.9885  0.9879
mpuz               0.7695  0.8649  0.9525  1.018
nucleic            0.9267  0.9614  0.9359  0.9804
output1            0.9453  0.8869  0.8673  0.8782
peek               0.9988  1.000   1.011   0.9938
pidigits           0.9734  0.9961  1.003   0.9986
psdes-random       0.9902  0.9998  1.016   0.9956
ratio-regions      1.107   0.9887  0.9315  0.9852
ray                0.9905  0.9674  0.9930  0.9931
raytrace           1.008   1.032   1.005   1.057
simple             0.9659  0.9692  0.9842  0.9648
smith-normal-form  0.9858  0.9876  0.9712  0.9615
string-concat      0.9901  1.001   1.007   0.9731
tailfib            0.9968  1.015   1.000   0.9936
tailmerge          0.9893  0.9932  0.9810  0.8081
tak                0.9724  1.058   0.9762  0.9693
tensor             1.005   0.9982  0.9914  1.006
tsp                1.030   1.049   1.037   1.032
tyan               1.007   0.9887  0.9850  1.002
vector32-concat    0.3328  0.3212  0.6566  1.029
vector64-concat    0.3983  0.3806  0.7111  0.9922
vector-rev         0.9282  1.032   0.9473  0.9013
vliw               0.9068  0.8996  0.8662  0.9419
wc-input1          1.396   0.9463  1.120   1.348
wc-scanStream      0.9928  1.015   1.117   0.9642
zebra              0.9957  0.9966  0.9974  0.9967
zern               0.9958  1.013   1.000   0.9743
MIN                0.3328  0.3212  0.6566  0.7496
GMEAN              0.9468  0.9393  0.9640  0.9827
MAX                1.396   1.077   1.120   1.348
```

Notes:
* Again, there are some "big" wins and no "big" losses.
* Interestingly, compared to `cadmium`, `vector32-concat` and `vector64-concat` see greater speedups, though `mpuz` sees a smaller speedup.
### Compile-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         0.9401  0.9632  1.013   1.006
boyer              0.9659  1.000   0.9725  0.9629
checksum           1.022   0.9911  0.9966  0.9967
count-graphs       1.005   1.042   1.004   1.011
DLXSimulator       0.9910  0.9526  0.9833  0.9914
even-odd           0.9654  0.9591  1.009   0.9795
fft                0.9841  0.9713  0.9885  0.9923
fib                0.9417  0.9946  1.000   0.9976
flat-array         1.002   1.049   0.9716  1.012
hamlet             1.103   1.018   0.9917  0.9781
imp-for            0.9586  1.041   0.9824  1.032
knuth-bendix       0.9690  1.007   1.034   0.9347
lexgen             1.049   0.9718  0.9930  1.029
life               0.9490  0.9798  1.023   1.018
logic              0.9936  0.9303  0.9768  0.9715
mandelbrot         0.9959  0.9518  0.9339  0.9740
matrix-multiply    1.117   1.060   1.013   0.9176
md5                1.007   0.9454  0.9904  0.9973
merge              0.9922  0.9493  0.9793  1.015
mlyacc             1.113   1.125   1.136   1.143
model-elimination  0.9702  1.030   1.025   1.017
mpuz               1.015   0.9249  1.053   0.9401
nucleic            0.9813  1.007   0.9919  1.043
output1            0.9285  0.9746  0.9603  1.049
peek               0.9547  0.9915  0.9410  0.9543
pidigits           0.9569  0.9757  0.9745  0.9973
psdes-random       0.9687  0.9835  1.006   0.9596
ratio-regions      0.9640  1.021   1.019   1.036
ray                1.079   0.9156  0.9577  0.9714
raytrace           1.008   0.9028  1.010   1.052
simple             1.025   1.154   0.9570  1.068
smith-normal-form  0.9611  0.9399  0.9215  0.9196
string-concat      1.023   0.9968  0.9789  1.072
tailfib            0.9869  0.9748  0.9711  1.001
tailmerge          0.9812  0.9766  0.9731  0.9051
tak                0.9694  0.9863  0.9719  0.9681
tensor             0.9759  1.061   0.9847  1.030
tsp                0.9818  1.094   0.9497  1.066
tyan               1.018   1.004   0.9927  1.011
vector32-concat    0.9783  0.9755  0.9766  0.9792
vector64-concat    0.9820  0.9876  0.9830  0.9607
vector-rev         0.9648  0.9215  1.066   1.002
vliw               1.075   1.045   1.035   1.052
wc-input1          1.012   1.025   1.005   0.9931
wc-scanStream      0.9747  0.9883  0.9940  0.9761
zebra              0.9838  0.9624  0.9758  1.003
zern               0.8921  0.9604  0.9794  0.9880
MIN                0.8921  0.9028  0.9215  0.9051
GMEAN              0.9921  0.9920  0.9919  0.9985
MAX                1.117   1.154   1.136   1.143
```

Notes:
* Compared to `cadmium`, there is less general improvement in compile time, though (with the exception of `mlyacc`) no "across the board" slowdowns.

### Executable-Size Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.005   1.005   1.009   0.9983
boyer              1.002   0.9940  0.9967  0.9972
checksum           1.005   1.001   1.001   1.000
count-graphs       1.007   1.004   1.005   1.001
DLXSimulator       1.015   1.005   1.006   1.008
even-odd           1.005   1.001   0.9999  0.9999
fft                1.009   1.006   1.004   1.003
fib                1.005   1.001   0.9998  1.000
flat-array         1.005   1.001   1.001   0.9995
hamlet             1.002   1.003   1.004   0.9999
imp-for            1.005   1.001   1.001   0.9991
knuth-bendix       1.004   0.9989  1.003   1.000
lexgen             1.025   1.001   1.002   1.004
life               1.003   0.9998  1.000   0.9997
logic              1.004   1.003   1.001   0.9997
mandelbrot         1.005   1.001   1.001   0.9994
matrix-multiply    1.005   1.000   1.002   0.9996
md5                1.005   1.002   1.002   1.007
merge              1.004   1.000   0.9993  0.9994
mlyacc             1.176   1.154   1.183   1.179
model-elimination  1.006   1.009   1.011   1.003
mpuz               1.006   0.9998  1.001   1.001
nucleic            1.000   0.9996  0.9998  0.9985
output1            1.006   1.002   1.002   1.006
peek               1.005   1.001   1.001   0.9994
pidigits           1.003   0.9998  1.001   1.005
psdes-random       1.005   1.001   1.002   0.9978
ratio-regions      1.020   1.008   1.007   1.009
ray                1.023   1.021   1.018   1.013
raytrace           1.008   1.003   1.003   1.013
simple             1.009   1.033   1.034   1.055
smith-normal-form  1.005   1.005   1.006   1.003
string-concat      1.004   0.9995  1.002   0.9980
tailfib            1.005   1.000   1.000   0.9998
tailmerge          1.004   0.9995  1.001   0.9998
tak                1.005   1.001   0.9997  1.000
tensor             1.019   1.019   1.018   1.019
tsp                1.005   1.002   1.004   1.007
tyan               1.034   1.026   1.025   1.031
vector32-concat    1.005   1.000   1.001   0.9984
vector64-concat    1.005   1.000   1.001   0.9982
vector-rev         1.005   1.001   1.001   0.9987
vliw               1.033   1.010   1.012   1.026
wc-input1          1.012   1.009   1.008   1.010
wc-scanStream      1.011   1.006   1.007   1.007
zebra              1.004   1.001   1.002   1.005
zern               1.006   1.003   1.001   1.002
MIN                1.000   0.9940  0.9967  0.9972
GMEAN              1.011   1.007   1.008   1.008
MAX                1.176   1.154   1.183   1.179
```

Notes:
* As they should be (same versions of MLton, gcc, llvm), executable sizes are identical with `cadmium`.
MatthewFluet added a commit that referenced this pull request on Sep 20, 2019:
Static allocation/initialization of objects in backend

The main benefits are that code size and compile time are improved across the board, particularly for larger programs. Run time sometimes improves, but should only affect programs which had hot code accessing globals, as it removes one level of indirection. Garbage collections might be marginally faster, as globals are now mostly skipped.

Statically allocated and initialized objects are created in the main `.c` file, where they will be placed in the data segment of the executable:

```c
const struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[9];}
  static_20 = {(Word64)(0x0ull), (Word64)(0x9ull), (Word64)(0x7ull), "addrinuse"};
const struct {Word64 meta_0; Word32 data_0; Word32 data_1; Pointer data_2;}
  static_21 = {(Word64)(0x29ull), (Word32)(0x62ull), (Word32)(0x0ull), ((Pointer)(&static_20) + 24)};
```

Note that these are proper ML objects, with metadata and data. References to statically allocated objects are via pointers to the first data field (e.g., `&static_20 + 24`). Note also that `WordXVector`s (e.g., strings) are a special case of statically allocated and initialized objects. Statically allocated and initialized objects can be both immutable and mutable, although the latter should be restricted to objects with non-`Objptr` mutable fields.

A special case of statically allocated objects are arrays, whose contents will be dynamically initialized by the mutator. These are also created in the main `.c` file, but are placed in the bss segment of the executable (decreasing the size of the executable) and proper metadata is written by initialization code:

```c
struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[800000];} static_26;
struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[0];} static_31;
static void static_Init() {
  memcpy (&static_26, &((struct {Word64 meta_0; Word64 meta_1; Word64 meta_2;})
    {(Word64)(0x0ull), (Word64)(0x186A0ull), (Word64)(0x11ull)}), 24);
  memcpy (&static_31, &((struct {Word64 meta_0; Word64 meta_1; Word64 meta_2;})
    {(Word64)(0x0ull), (Word64)(0x0ull), (Word64)(0x13ull)}), 24);
};
```

Finally, dynamically allocated but statically initialized objects have their initialization data in the main `.c` file along with information to copy that data to the initial dynamic heap during `initWorld`:

```c
const static struct {Word64 meta_0; Word64 data_0;} static_9819 = {(Word64)(0x79Dull), (Word64)(0x1ull)};
const static struct {Word64 meta_0; Word64 data_0;} static_9820 = {(Word64)(0x79Dull), (Word64)(0x1ull)};
static struct GC_objectInit objectInits[] = {
  { 11, 8, 16, ((Pointer) &static_9819) },
  { 12, 8, 16, ((Pointer) &static_9820) },
  ...
};
```

By default (with `-static-init-objects staticAllocOnly`), no such objects are created. With `-static-init-objects all`, global objects with `Objptr` mutable fields would be dynamically allocated but statically initialized. But such global objects are rare. For example, an `(int * int) ref` could be space-safely globalized and would be an object with a mutable `Objptr` field; however, it is also likely that such a tuple would be `RefFlatten`ed. With `-globalize-small-type 4` (see #288 and 752467c), an `(int * int, int * int) either ref` could be globalized and represented as an object with a mutable `Objptr` field. Similarly, an `IntInf.int ref` can also be globalized and would be represented as an object with a mutable `Objptr` field.
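To make the pointer arithmetic concrete, here is a small compilable C sketch of the `static_20`-style layout above. The metadata field meanings (counter, length, header) are assumptions inferred from the initializers shown, and `obj_static_20` is an invented helper:

```c
#include <stdint.h>

typedef uint64_t Word64;
typedef unsigned char Word8;

/* Mirrors the shape of static_20: three metadata words followed by the
 * sequence data (the characters of "addrinuse", with no terminating NUL,
 * since ML strings carry their length in the metadata). */
static const struct {
    Word64 meta_0;   /* 0x0: e.g. a counter word (assumed meaning) */
    Word64 meta_1;   /* 0x9: the sequence length (assumed meaning) */
    Word64 meta_2;   /* 0x7: the object header (assumed meaning) */
    Word8  data[9];
} static_20 = { 0x0ull, 0x9ull, 0x7ull,
                {'a','d','d','r','i','n','u','s','e'} };

/* The ML-level object pointer skips the 24 metadata bytes and points at the
 * first data byte, matching `(Pointer)(&static_20) + 24` in the commit. */
static const Word8 *obj_static_20(void) {
    return (const Word8 *)&static_20 + 24;
}
```

Reading backwards from the object pointer recovers the metadata, which is how the runtime finds an object's header from its pointer.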
With `-static-alloc-objects false -static-init-objects all`, all global objects will be dynamically allocated but statically initialized (and no global objects will be statically allocated). Similarly, with `-static-alloc-wordvector-consts false`, string constants will be dynamically allocated but statically initialized; this corresponds to the previous MLton behavior with respect to string constants.

A number of controls have been added to control static allocation/initialization:

* `-static-alloc-internal-ptrs {static|all|none}`: controls which kinds of objects can be statically allocated:
  * `static`: only objects with all fields either immutable or non-`Objptr`
  * `none`: only objects with no fields
  * `all`: all objects

  The `all` setting is incompatible with the current GC for two reasons. First, statically allocated objects are not traced by the GC; a statically allocated object that is updated with an `Objptr` to an object in the heap should be considered a root. Second, a statically allocated object that is updated with an `Objptr` would trigger a card marking, but the address of a statically allocated object would not map to a valid card slot.
* `-static-alloc-wordvector-consts {true|false}`: controls whether or not `WordXVector` constants are converted to statics (with `ImmStatic` location) at `Ssa2ToRssa`.
* `-static-init-arrays {true|false}`: controls whether or not `Array_alloc` primitives are converted to statics (with `MutStatic` or `Heap` location) at `Ssa2ToRssa`.
* `-static-alloc-arrays {true|false}`: controls whether or not `Array_alloc` primitives that can be statically initialized are forced to `Heap` location.
* `-static-init-objects {none|staticAllocOnly|all}`: controls whether or not `Object` expressions are converted to statics at `Ssa2ToRssa`. If `staticAllocOnly`, then an object that would be converted to a static with `Heap` location is not converted to a static.
* `-static-alloc-objects {true|false}`: controls whether or not `Object` expressions that can be statically initialized are forced to `Heap` location.
Extend globalization aspect of ConstantPropagation to support
globalization of arrays and to support different "small type"
strategies.
Closes #206.
Space-safety prohibits ConstantPropagation from globalizing all arrays and refs that are allocated at most once by a program. In particular, because globals are live for the duration of the program, globalizing an `int list ref` (for example) would not be safe-for-space: an arbitrarily large list may be written to the reference and never be garbage collected (whereas, when the `int list ref` is not globalized, it will be garbage collected when it is no longer live). On the other hand, globalizing an `int ref` is safe-for-space.

However, MLton previously used only a very conservative estimation of space safety. Only "small" types may be globalized, where smallness is defined as:

Note that no `Datatype` is small; this is conservative (since a recursive datatype could represent unbounded data), but prevents globalizing `bool ref`. Also, no `Array` is small; this is correct (because an `int array ref` should not be globalized), but the globalization of a `val a: t array = Array_alloc[t] (l)` was conditioned on the smallness of `t array`, not the smallness of `t`. It is correct to globalize an array if `t` is small; note that to globalize `val a: t array = Array_alloc[t] (l)`, `l` (the length) must be globalized and must, therefore, be a constant, so the array is of constant size. (This is Stephen Weeks's relaxed notion of safe-for-space, where the constant-factor blowup can be chosen per program.)
This pull request adds support for alternate globalization strategies:

* `-globalize-arrays {false|true}`: globalize arrays
* `-globalize-refs {true|false}`: globalize refs
* `-globalize-small-int-inf {true|false}`: globalize `IntInf` as a small type
* `-globalize-small-type {1|0|2|3|4|9}`: strategies for classifying a type as "small":
  * `0`: constant `false` function (no types considered small)
  * `1`: no `Datatype` is considered small (original strategy)
  * `2`: `Datatype`s with all nullary constructors are considered small
  * `3`: `Datatype`s with all constructor arguments considered small according to strategy `2` are considered small
  * `4`: fixed-point analysis of `Datatype`s to determine smallness
  * `9`: constant `true` function (all types considered small; not safe-for-space)

The defaults correspond to the previous behavior.
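A toy classifier can make the numbered strategies concrete. This C sketch assumes an invented type representation (base types, and datatypes with constructors); it is not MLton's IR, and strategy `4`'s fixed-point analysis is omitted:

```c
#include <stdbool.h>

/* Invented model: a type is either a base type (int, word, real, ...) or a
 * datatype with constructors, each taking zero or more argument types. */
typedef enum { T_BASE, T_DATATYPE } TypeKind;

typedef struct Type Type;
typedef struct { int num_args; const Type **args; } Con;
struct Type { TypeKind kind; int num_cons; const Con *cons; };

static bool is_small(const Type *t, int strategy) {
    if (strategy == 0) return false;   /* constant false: nothing is small */
    if (strategy == 9) return true;    /* constant true: not safe-for-space */
    if (t->kind == T_BASE) return true;
    switch (strategy) {
    case 1:                            /* original: no datatype is small */
        return false;
    case 2:                            /* small iff all constructors are nullary */
        for (int i = 0; i < t->num_cons; i++)
            if (t->cons[i].num_args > 0) return false;
        return true;
    case 3:                            /* small iff every constructor argument
                                          is small under strategy 2 */
        for (int i = 0; i < t->num_cons; i++)
            for (int j = 0; j < t->cons[i].num_args; j++)
                if (!is_small(t->cons[i].args[j], 2)) return false;
        return true;
    default:
        return false;                  /* strategy 4 (fixed point) omitted */
    }
}
```

Under this model, `bool` becomes small at strategy `2`, something like `int option` at strategy `3`, while a recursive `int list` stays large under every safe-for-space strategy, matching the descriptions above.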
Unfortunately, additional globalization has little to no (positive) effect on benchmarks:

Note that `MLton0` and `MLton1` generate identical code (modulo the random magic number), so the slowdowns in `ray` and `raytrace` are noise, which also suggests that slowdowns/speedups of <= 3% are likely noise as well.

The slowdown in `flat-array` with `-globalize-arrays true` is explained as follows. The `flat-array` benchmark uses `Vector.tabulate` to allocate a vector that is used for all iterations of the benchmark. With `-globalize-arrays false`, the array is not globalized, and in SSA/SSA2, we have:

but with `-globalize-arrays true`, the array is globalized, and in SSA/SSA2, we have:

At RSSA, the `Array_toVector` becomes a header update and the array variable is cast/copy-propagated for the vector variable; with `-globalize-arrays false`, we have

but with `-globalize-arrays true`, we have

Finally, with `-globalize-arrays false`, `x_1212` becomes a local (because the loops to initialize and use the vector are non-allocating):

but with `-globalize-arrays true`:

The innermost loop of the benchmark goes from indexing a sequence stored in a local (`RP(0)`) to indexing a sequence stored in a global (`GP(1)`). All of the codegens should implement the former by using a hardware register for `RP(0)`, but will implement the latter with a memory read.

In light of the above, and related to #218, it may be beneficial to "deglobalize" object pointer globals; that is, RSSA functions that have multiple accesses through the same object pointer global (particularly within loops) could be translated to copy the global to a local.

The slowdown in `checksum` is less easily explained. The only new objects globalized with `-globalize-small-type 2` as compared to `-globalize-small-type 1` are two `bool ref` objects, corresponding to the `exiting` flag of `basis-library/mlton/exit.sml` and the `staticIsInUse` flag of `basis-library/util/one.sml` used by `Int.fmt`. That small change seems to lead to code layout and cache effects that result in the slowdown, because the assembly code is not substantially different. With `-enable-pass machineShuffle` and `-seed-rand <w>`, one can perturb the code layout and observe that the slowdowns are not universal:

Note that while `checksum` with MLton4 has a slowdown, `checksum` with MLton5 and MLton6 (which are identical up to shuffling of the functions and basic blocks at the Machine IR) does not have a slowdown. Similarly, `tak` with MLton0 and MLton1 has similar running times, but `tak` with MLton3 has a speedup. On the other hand, `flat-array`'s slowdowns with `-globalize-arrays true` are not due to code layout effects.

`hamlet` may have a slight speedup with `-globalize-arrays true`, but that is significantly outweighed by the slowdown in `flat-array`.

The conclusion is to leave the defaults corresponding to the original behavior.

Full benchmark results: