Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternate strategies for globalization in ConstantPropagation #288

Merged
merged 19 commits into from
Jan 16, 2019

Conversation

MatthewFluet
Copy link
Member

@MatthewFluet MatthewFluet commented Jan 16, 2019

Extend globalization aspect of ConstantPropagation to support
globalization of arrays and to support different "small type"
strategies.

Closes #206.

Space-safety prohibits ConstantPropagation from globalizing all arrays
and refs that are allocated at most once by a program. In particular,
because globals are live for the duration of the program, globalizing
an int list ref (for example) would not be safe-for-space: an
arbitrarily large list may be written to the reference and never be
garbage collected (whereas, when the int list ref is not globalized,
it will be garbage collected when it is no longer live). On the other
hand, globalizing an int ref is safe-for-space.

However, MLton previously used only a very conservative estimation for
space safety. Only "small" types may be globalized, where smallness is
defined as:

    fun isSmall t =
        case dest t of
           Array _ => false
         | Datatype _ => false
         | Ref t => isSmall t
         | Tuple ts => Vector.forall (ts, isSmall)
         | Vector _ => false
         | _ => true

Note that no Datatype is small; this is conservative (since a
recursive datatype could represent unbounded data), but prevents
globalizing bool ref. Also, no Array is small; this is correct
(because an int array ref should not be globalized), but the
globalization of a val a: t array = Array_alloc[t] (l) was
conditioned on the smallness of t array, not the smallness of
t. It is correct to globalize an array if t were small; note that
to globalize val a: t array = Array_alloc[t] (l), l (the length)
must be globalized and must, therefore, be a constant and the array is
of constant size. (This is Stephen Weeks's relaxed notion of
safe-for-space, where the constant factor blowup can be chosen per
program.)

This pull request adds support for alternate globalization strategies:

  • -globalize-arrays {false|true}: globalize arrays
  • -globalize-refs {true|false}: globalize refs
  • -globalize-small-int-inf {true|false}:globalize IntInf as a
    small type
  • -globalize-small-type {1|0|2|3|4|9}: strategies for classifying a
    type as "small":
    • 0: constant false function (no types considered small)
    • 1: no Datatype is considered small (original strategy)
    • 2: Datatypes with all nullary constructors are considered
      small
    • 3: Datatypes with all constructor arguments considered small
      according to strategy 2 are considered small
    • 4: Fixed-point analysis of Datatypes to determine smallness
    • 9: constant true function (all types considered small; not
      safe-for-space)

The defaults correspond to the previous behavior.

Unfortunately, additional globalization has little to no (positive)
effect on benchmarks:

MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton
MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1
MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2
MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3
MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4
MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1
MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2
MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3
MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4
run time ratio
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator        1.00   1.00   1.03   1.00   1.00   0.99   1.00   1.01   1.01
checksum            1.00   1.00   1.13   1.14   1.13   1.00   1.12   1.13   1.14
flat-array          1.00   1.00   1.00   1.00   1.01   1.19   1.19   1.19   1.19
hamlet              1.00   1.00   1.02   1.01   1.01   1.00   0.96   0.97   0.97
imp-for             1.00   1.00   1.06   1.05   1.06   1.00   1.05   1.06   1.06
knuth-bendix        1.00   1.00   1.03   1.03   1.03   1.00   1.03   1.03   1.03
lexgen              1.00   0.99   1.05   0.99   1.03   1.04   1.03   1.02   1.07
model-elimination   1.00   1.01   1.02   1.02   1.02   1.00   1.03   1.02   1.03
peek                1.00   1.00   1.03   1.03   1.04   1.00   1.04   1.04   1.04
ray                 1.00   1.03   0.99   0.99   1.01   0.99   0.99   0.99   0.98
raytrace            1.00   1.03   1.00   1.02   1.03   0.99   1.00   1.01   1.01
simple              1.00   0.98   0.99   1.00   0.99   0.97   0.97   0.96   0.97
tak                 1.00   1.00   1.04   1.03   1.10   1.01   1.02   1.08   1.00
wc-scanStream       1.00   1.00   1.06   1.06   1.06   1.00   1.04   1.06   1.07

Note that MLton0 and MLton1 generate identical code (modulo the
random magic number), so the slowdowns in ray and raytrace are
noise, which also suggests that slowdowns/speedups of <= 3% are also
likely noise.

The slowdown in flat-array with -globalize-array true is explained
as follows. The flat-array benchmark uses Vector.tabulate to
allocate a vector that is used for all iterations of the benchmark.
With -globalize-array false, the array is not globalized, and in
SSA/SSA2, we have:

    x_1212: ((word32, word32) tuple) array =
      prim Array_alloc((word32, word32) tuple) (global_138 (*0xF4240*))
...
    x_757: ((word32, word32) tuple) vector =
      prim Array_toVector((word32, word32) tuple) (x_1212)
...
    x_1287: (word32, word32) tuple =
      prim Vector_sub((word32, word32) tuple) (x_757, x_1283)

but with -globalize-array true, the array is globalized, and in
SSA/SSA2, we have:

global_490: ((word32, word32) tuple) array =
  prim Array_alloc((word32, word32) tuple) (global_138 (*0xF4240*))
...
    x_757: ((word32, word32) tuple) vector =
      prim Array_toVector((word32, word32) tuple) (global_490)
...
    x_1286: (word32, word32) tuple =
      prim Vector_sub((word32, word32) tuple) (x_757, x_1282)

At RSSA, the Array_toVector becomes a header update and the array
variable is cast/copy-propagated for the vector variable;
with -globalize-arrays false, we have

  L_531 (x_1212: Objptr (opt_11)) CReturn {func = {..., target = GC_sequenceAllocate}} =
...
    OW64 (x_1212, ~8): Word64 := opt_12
...
    x_1354: Word32 = XW32 (Cast (x_1212, Objptr (opt_12)), x_1283, 8, 0)
    x_1353: Word32 = XW32 (Cast (x_1212, Objptr (opt_12)), x_1283, 8, 4)

but with -globalize-arrays true, we have

  L_488 (global_490: Objptr (opt_7)) CReturn {func = {..., target = GC_sequenceAllocate}} =
...
    OW64 (global_490, ~8): Word64 := opt_12
...
    x_1353: Word32 = XW32 (Cast (global_490, Objptr (opt_12)), x_1282, 8, 0)
    x_1352: Word32 = XW32 (Cast (global_490, Objptr (opt_12)), x_1282, 8, 4)

Finally, with -globalize-arrays false, x_1212 becomes a local
(because the loops to initialize and use the vector are
non-allocating):

    RW32(2): Word32
     = XW32 (Cast (RP(0): Objptr (opt_11), Objptr (opt_12)),
             RW64(0): Word64, 8, 0): Word32
    RW32(3): Word32
     = XW32 (Cast (RP(0): Objptr (opt_11), Objptr (opt_12)),
             RW64(0): Word64, 8, 4): Word32

but with -globalize-arrays true:

    RW32(2): Word32
     = XW32 (Cast (glob {index = 1, isRoot = true, ty = Objptr (opt_7)},
                   Objptr (opt_12)),
             RW64(0): Word64, 8, 0): Word32
    RW32(3): Word32
     = XW32 (Cast (glob {index = 1, isRoot = true, ty = Objptr (opt_7)},
                   Objptr (opt_12)),
             RW64(0): Word64, 8, 4): Word32

The innermost loop of the benchmark goes from indexing a sequence
stored in a local (RP(0)) to indexing a sequence stored in a global
(GP(1)). All of the codegens should implement the former by using a
hardware register for RP(0), but will implement the latter with a
memory read.

In light of the above, and related to #218, it may be
beneficial to "deglobalize" object pointer globals; that is, in RSSA
functions that have multiple accesses through the same object pointer
global (particularly within loops) could be translated to copy the
global to a local.

The slowdown in checksum is less easily explained. The only new
objects globalized with -globalize-small-type 2 as compared to
-globalize-small-type 1 are two bool ref objects, corresponding to
the exiting flag of basis-library/mlton/exit.sml and the
staticIsInUse flag of basis-library/util/one.sml used by
Int.fmt. That small change seems to lead to code layout and cache
effects that result in the slowdown, because the assembly code is not
substantial different. With -enable-pass machineSuffle and
-seed-rand <w>, one can perturb the code layout and observe that the
slowdowns are not universal:

MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton
MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1
MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand 42424242
MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand deadbeef
MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2
MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand 42424242
MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand deadbeef
MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3
MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand 42424242
MLton9 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand deadbeef
MLton10 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4
MLton11 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand 42424242
MLton12 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand deadbeef
MLton13 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1
MLton14 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand 42424242
MLton15 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand deadbeef
MLton16 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2
MLton17 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand 42424242
MLton18 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand deadbeef
MLton19 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3
MLton20 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand 42424242
MLton21 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand deadbeef
MLton22 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4
MLton23 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand 42424242
MLton24 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand deadbeef
run time ratio
benchmark     MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11 MLton12 MLton13 MLton14 MLton15 MLton16 MLton17 MLton18 MLton19 MLton20 MLton21 MLton22 MLton23 MLton24
checksum        1.00   1.00   1.00   1.01   1.14   1.01   1.00   1.14   1.01   1.01    1.14    1.00    1.04    1.00    1.00    1.00    1.14    1.00    1.01    1.16    1.00    1.01    1.15    1.00    1.01
flat-array      1.00   1.01   1.02   1.00   1.00   1.01   1.01   1.03   1.01   1.01    1.01    1.01    1.01    1.23    1.19    1.19    1.19    1.20    1.19    1.20    1.20    1.19    1.20    1.19    1.20
hamlet          1.00   0.99   1.00   0.99   1.01   1.01   1.00   1.01   1.03   1.01    1.01    1.02    1.01    1.00    1.00    0.99    0.95    0.95    0.94    0.97    0.96    0.96    0.98    0.98    0.99
imp-for         1.00   1.00   1.05   1.05   1.05   1.02   1.00   1.05   0.99   1.00    1.05    1.00    1.01    1.00    1.05    1.05    1.05    0.99    1.00    1.05    1.00    1.00    1.05    1.00    0.99
lexgen          1.00   0.97   1.00   0.97   1.03   1.03   1.01   1.04   0.99   0.95    0.99    0.99    0.95    0.97    0.98    0.95    0.96    1.01    0.98    1.00    1.04    0.95    0.96    1.00    0.95
peek            1.00   1.00   1.00   1.01   1.03   1.01   1.04   1.03   1.01   1.04    1.03    1.01    1.03    1.00    1.00    1.01    1.04    1.01    1.05    1.03    1.01    1.04    1.04    1.00    1.04
simple          1.00   1.01   1.01   1.00   1.00   0.99   1.02   1.00   0.99   1.00    1.00    1.00    1.00    0.98    0.97    0.99    0.98    0.97    0.99    0.97    0.99    0.98    0.98    0.98    0.99
tak             1.00   0.99   0.90   1.00   1.05   0.90   0.99   1.02   0.89   0.99    1.04    0.90    1.00    0.99    0.89    0.99    0.99    0.90    0.99    1.01    0.90    1.00    1.00    0.90    1.00
wc-scanStream   1.00   1.01   1.01   1.03   1.06   1.02   1.03   1.05   1.02   1.02    1.04    1.01    1.00    1.01    1.01    1.00    1.06    1.01    1.00    1.07    1.01    1.00    1.05    1.03    1.02

Note that while checksum with MLton4 has a slowdown, checksum with
MLton5 and MLton6 (which are identical up to shuffling of the
functions and basic blocks at the MachineIR) do not have a slowdown.
Similarly tak with MLton0 and MLton1 have similar running time, but
tak with MLton3 has a speedup. On the other hand, flat-array's
slowdowns with -globalize-arrays true are not due to code layout
effects.

hamlet may have a slight speedup with -globalize-arrays true, but
that is significantly outweighed by the slowdown in flat-array.

The conclusion is to leave the defaults corresponding to the original
behavior.

Full benchmark results:

MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton
MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1
MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2
MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3
MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4
MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1
MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2
MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3
MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4
run time ratio
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator        1.00   1.00   1.03   1.00   1.00   0.99   1.00   1.01   1.01
barnes-hut          1.00   1.01   1.01   1.02   1.01   1.01   1.01   1.02   1.01
boyer               1.00   1.00   1.01   1.01   1.02   1.00   1.01   1.01   1.01
checksum            1.00   1.00   1.13   1.14   1.13   1.00   1.12   1.13   1.14
count-graphs        1.00   1.00   0.99   1.01   0.99   1.00   1.00   1.00   0.99
even-odd            1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
fft                 1.00   1.00   1.01   1.02   1.02   1.01   1.01   1.00   1.01
fib                 1.00   1.00   1.00   1.00   1.01   1.00   1.00   1.00   1.00
flat-array          1.00   1.00   1.00   1.00   1.01   1.19   1.19   1.19   1.19
hamlet              1.00   1.00   1.02   1.01   1.01   1.00   0.96   0.97   0.97
imp-for             1.00   1.00   1.06   1.05   1.06   1.00   1.05   1.06   1.06
knuth-bendix        1.00   1.00   1.03   1.03   1.03   1.00   1.03   1.03   1.03
lexgen              1.00   0.99   1.05   0.99   1.03   1.04   1.03   1.02   1.07
life                1.00   1.00   1.00   1.00   1.00   1.01   1.01   1.00   1.00
logic               1.00   1.00   0.98   0.99   0.99   1.00   0.98   0.99   1.00
mandelbrot          1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
matrix-multiply     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.01   1.00
md5                 1.00   1.00   0.99   0.99   0.99   1.00   0.99   0.99   0.99
merge               1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
mlyacc              1.00   1.01   1.01   1.01   1.01   1.00   0.99   0.99   1.01
model-elimination   1.00   1.01   1.02   1.02   1.02   1.00   1.03   1.02   1.03
mpuz                1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
nucleic             1.00   1.01   1.00   1.00   0.99   1.00   0.99   0.99   1.00
output1             1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
peek                1.00   1.00   1.03   1.03   1.04   1.00   1.04   1.04   1.04
psdes-random        1.00   1.00   1.00   1.00   1.00   1.00   1.01   1.00   1.00
ratio-regions       1.00   0.99   0.98   0.98   1.01   1.01   0.99   1.00   1.01
ray                 1.00   1.03   0.99   0.99   1.01   0.99   0.99   0.99   0.98
raytrace            1.00   1.03   1.00   1.02   1.03   0.99   1.00   1.01   1.01
simple              1.00   0.98   0.99   1.00   0.99   0.97   0.97   0.96   0.97
smith-normal-form   1.00   1.00   1.01   0.99   1.00   1.00   1.00   1.00   1.00
string-concat       1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
tailfib             1.00   1.00   1.00   0.99   1.00   1.00   1.00   1.00   1.00
tak                 1.00   1.00   1.04   1.03   1.10   1.01   1.02   1.08   1.00
tensor              1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
tsp                 1.00   1.00   1.00   1.00   1.00   0.99   1.00   1.00   1.00
tyan                1.00   1.00   1.00   1.01   1.01   1.01   1.01   1.01   1.01
vector-rev          1.00   0.99   0.99   0.98   0.98   0.98   0.99   0.98   0.98
vector32-concat     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
vector64-concat     1.00   1.00   1.00   1.00   1.00   0.99   1.00   1.00   0.99
vliw                1.00   0.98   1.00   1.01   1.00   1.00   0.98   0.98   0.98
wc-input1           1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
wc-scanStream       1.00   1.00   1.06   1.06   1.06   1.00   1.04   1.06   1.07
zebra               1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
zern                1.00   1.02   1.02   0.99   1.01   0.99   0.99   1.01   0.99
size
benchmark            MLton0    MLton1    MLton2    MLton3    MLton4    MLton5    MLton6    MLton7    MLton8
DLXSimulator        209,076   209,076   209,140   209,140   209,140   209,076   208,836   208,340   208,340
barnes-hut          176,199   176,199   176,071   176,071   176,071   176,199   176,071   176,071   176,071
boyer               243,369   243,369   243,289   243,289   243,289   243,369   243,289   243,289   243,289
checksum            117,561   117,561   117,433   117,433   117,433   117,561   117,433   117,433   117,433
count-graphs        145,065   145,065   145,017   145,017   145,017   145,065   144,937   144,937   144,937
even-odd            117,529   117,529   117,433   117,433   117,433   117,529   117,433   117,433   117,433
fft                 142,307   142,307   141,315   141,315   141,315   142,307   141,315   141,315   141,315
fib                 117,449   117,449   117,321   117,321   117,321   117,449   117,321   117,321   117,321
flat-array          117,177   117,177   117,049   117,049   117,049   117,193   117,081   117,081   117,081
hamlet            1,434,228 1,434,228 1,433,220 1,433,220 1,433,220 1,434,228 1,432,564 1,427,956 1,427,396
imp-for             117,241   117,241   117,145   117,145   117,145   117,241   117,145   117,145   117,145
knuth-bendix        186,116   186,116   186,212   186,212   186,212   186,116   186,212   186,212   186,212
lexgen              290,931   290,931   290,819   290,819   290,819   290,931   290,819   290,819   290,819
life                141,113   141,113   141,065   141,065   141,065   141,113   141,065   141,065   141,065
logic               197,417   197,417   197,273   197,273   197,273   197,417   197,273   197,273   197,273
mandelbrot          117,273   117,273   117,177   117,177   117,177   117,273   117,177   117,177   117,177
matrix-multiply     119,577   119,577   119,417   119,417   119,417   119,577   119,417   119,417   119,417
md5                 144,676   144,676   144,500   144,500   144,500   144,676   144,500   144,500   144,500
merge               118,953   118,953   118,857   118,857   118,857   118,953   118,857   118,857   118,857
mlyacc              643,555   643,555   643,651   643,651   643,651   643,555   643,651   643,651   643,475
model-elimination   796,054   796,054   793,958   793,798   793,798   796,054   794,166   792,246   792,246
mpuz                123,545   123,545   123,481   123,481   123,481   123,545   123,481   123,481   123,481
nucleic             297,249   297,249   297,233   297,233   297,233   297,249   297,233   297,233   297,233
output1             151,768   151,768   149,848   149,848   149,848   151,768   149,848   149,848   149,848
peek                150,164   150,164   150,132   150,132   150,132   150,164   150,132   150,132   150,132
psdes-random        121,545   121,545   121,401   121,401   121,401   121,545   121,401   121,401   121,401
ratio-regions       144,137   144,137   144,169   144,169   144,169   144,137   144,169   144,169   144,169
ray                 250,058   250,058   250,218   250,218   250,218   250,058   249,818   249,066   249,066
raytrace            368,988   368,988   368,108   368,108   368,108   368,956   367,868   367,468   367,468
simple              345,205   345,205   345,381   345,381   345,381   329,557   329,557   329,317   329,317
smith-normal-form   279,837   279,837   279,645   279,645   279,645   279,837   279,341   279,341   279,341
string-concat       119,129   119,129   119,033   119,033   119,033   119,209   119,033   119,033   119,033
tailfib             117,273   117,273   117,177   117,177   117,177   117,273   117,177   117,177   117,177
tak                 117,449   117,449   117,321   117,321   117,321   117,449   117,321   117,321   117,321
tensor              179,292   179,292   176,908   176,908   176,908   179,292   176,908   176,908   176,908
tsp                 158,860   158,860   158,668   158,668   158,668   158,860   158,668   158,668   158,668
tyan                223,588   223,588   223,044   223,044   223,044   223,588   223,044   223,044   223,044
vector-rev          118,105   118,105   118,009   118,009   118,009   118,153   117,977   117,977   117,977
vector32-concat     118,297   118,297   118,201   118,201   118,201   118,329   118,185   118,185   118,185
vector64-concat     118,329   118,329   118,169   118,169   118,169   118,329   118,217   118,217   118,217
vliw                505,509   505,509   503,013   503,013   503,013   505,637   500,917   497,957   497,957
wc-input1           179,051   179,051   178,923   178,923   178,923   179,051   178,923   178,923   178,923
wc-scanStream       188,155   188,155   188,027   188,027   188,027   188,155   188,027   188,027   188,027
zebra               225,364   225,364   225,220   225,220   225,220   225,364   225,220   225,220   225,220
zern                153,241   153,241   152,521   152,521   152,521   153,241   152,585   152,585   152,585
compile time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator        3.19   3.14   3.45   3.51   3.02   3.27   3.18   3.45   3.24
barnes-hut          2.93   2.94   2.96   2.97   3.06   2.92   2.96   2.93   2.96
boyer               3.36   3.49   3.48   3.32   3.42   3.40   3.52   3.53   3.49
checksum            2.53   2.56   2.47   2.56   2.57   2.46   2.56   2.54   2.62
count-graphs        2.68   2.69   2.71   2.80   2.80   2.79   2.67   2.69   2.76
even-odd            2.45   2.56   2.46   2.56   2.48   2.50   2.48   2.55   2.57
fft                 2.64   2.57   2.62   2.66   2.35   2.68   2.60   2.54   2.62
fib                 2.46   2.46   2.58   2.48   2.55   2.45   2.55   2.54   2.46
flat-array          2.52   2.56   2.55   2.56   2.55   2.54   2.53   2.56   2.55
hamlet             15.27  15.74  14.48  14.35  14.48  15.63  15.21  14.88  15.04
imp-for             2.48   2.56   2.52   2.55   2.55   2.34   2.42   2.48   2.54
knuth-bendix        2.89   2.90   2.90   3.01   3.02   2.93   3.01   3.13   2.99
lexgen              3.40   3.88   3.76   3.82   3.74   3.62   3.50   3.81   3.71
life                2.66   2.71   2.70   2.74   2.72   2.64   2.74   2.68   2.60
logic               3.06   3.05   3.11   3.12   3.00   3.13   3.14   2.92   2.83
mandelbrot          2.51   2.54   2.53   2.58   2.57   2.53   2.56   2.55   2.45
matrix-multiply     2.48   2.50   2.52   2.49   2.57   2.60   2.60   2.57   2.53
md5                 2.65   2.68   2.78   2.58   2.78   2.80   2.58   2.80   2.69
merge               2.47   2.49   2.58   2.57   2.55   2.49   2.57   2.50   2.52
mlyacc              7.85   7.94   8.00   7.98   7.90   7.69   8.05   8.09   7.54
model-elimination   7.08   7.98   7.62   7.11   7.40   8.24   7.84   7.96   8.15
mpuz                2.34   2.61   2.60   2.56   2.53   2.62   2.42   2.53   2.52
nucleic             4.06   4.17   4.07   4.05   4.08   4.06   4.28   4.33   4.14
output1             2.69   2.68   2.52   2.58   2.77   2.79   2.78   2.57   2.77
peek                2.74   2.78   2.58   2.80   2.80   2.73   2.70   2.68   2.72
psdes-random        2.56   2.53   2.49   2.48   2.50   2.48   2.53   2.64   2.50
ratio-regions       2.80   2.82   2.62   2.88   2.86   2.89   2.81   2.80   2.79
ray                 3.34   3.62   3.69   3.47   3.45   3.60   3.48   3.37   3.40
raytrace            4.55   4.88   4.47   4.32   4.64   4.30   4.48   4.43   4.50
simple              4.01   4.07   4.04   4.00   3.91   3.83   3.77   3.81   3.74
smith-normal-form   3.75   3.60   3.82   3.58   3.40   3.79   3.60   3.58   3.58
string-concat       2.46   2.66   2.54   2.45   2.57   2.56   2.46   2.51   2.49
tailfib             2.44   2.54   2.53   2.57   2.50   2.57   2.57   2.36   2.54
tak                 2.45   2.57   2.56   2.63   2.47   2.44   2.43   2.44   2.52
tensor              3.05   3.16   3.07   3.15   3.10   3.32   3.18   3.16   3.13
tsp                 2.81   2.79   2.57   2.74   2.76   2.75   2.84   2.84   2.82
tyan                3.27   3.06   3.38   3.26   3.35   3.23   3.22   3.22   3.24
vector-rev          2.49   2.57   2.56   2.33   2.61   2.40   2.52   2.49   2.54
vector32-concat     2.53   2.49   2.52   2.47   2.55   2.63   2.49   2.49   2.51
vector64-concat     2.48   2.46   2.50   2.52   2.48   2.54   2.46   2.47   2.48
vliw                5.63   5.63   5.64   6.14   5.43   6.26   5.78   6.17   6.16
wc-input1           3.06   2.96   2.95   2.98   2.91   2.87   2.90   2.97   2.96
wc-scanStream       3.01   2.97   3.02   2.95   3.06   3.04   3.03   3.01   3.09
zebra               3.30   3.37   3.36   3.26   3.34   3.51   3.31   3.26   3.34
zern                2.69   2.74   2.62   2.64   2.72   2.74   2.73   2.43   2.70
run time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator       32.67  32.69  33.65  32.71  32.53  32.50  32.63  32.84  32.92
barnes-hut         28.33  28.63  28.64  28.85  28.59  28.50  28.71  28.78  28.61
boyer              55.98  56.01  56.81  56.75  56.82  56.08  56.29  56.72  56.76
checksum           25.35  25.39  28.74  28.79  28.64  25.39  28.42  28.68  28.87
count-graphs       39.96  39.98  39.51  40.23  39.38  40.13  40.16  39.92  39.75
even-odd           39.09  39.07  39.09  39.06  39.08  39.14  39.10  39.09  39.09
fft                30.56  30.69  30.75  31.05  31.20  30.98  30.86  30.62  30.77
fib                17.70  17.75  17.77  17.77  17.83  17.77  17.74  17.67  17.77
flat-array         23.60  23.54  23.65  23.70  23.82  28.18  28.09  28.02  28.01
hamlet             39.69  39.77  40.32  40.20  40.20  39.72  38.00  38.70  38.68
imp-for            24.40  24.44  25.78  25.74  25.79  24.51  25.73  25.98  25.84
knuth-bendix       34.10  34.09  35.05  34.97  35.03  34.16  35.02  35.11  35.07
lexgen             34.22  33.77  35.88  33.88  35.27  35.64  35.09  34.93  36.52
life               38.70  38.80  38.76  38.76  38.89  38.90  38.91  38.74  38.84
logic              35.01  34.93  34.44  34.82  34.82  35.06  34.47  34.77  34.93
mandelbrot         35.78  35.77  35.79  35.80  35.81  35.82  35.79  35.78  35.81
matrix-multiply    29.74  29.80  29.82  29.89  29.79  29.72  29.84  29.90  29.86
md5                28.38  28.39  28.04  28.05  28.01  28.42  28.12  28.14  27.98
merge              32.55  32.45  32.41  32.44  32.59  32.48  32.41  32.56  32.40
mlyacc             32.76  33.01  32.94  33.14  33.23  32.83  32.52  32.49  33.11
model-elimination  38.05  38.26  38.89  38.72  38.80  38.09  39.11  38.97  39.36
mpuz               29.94  29.86  29.88  29.91  29.89  29.91  29.89  29.86  29.90
nucleic            33.73  33.91  33.60  33.68  33.35  33.74  33.40  33.47  33.58
output1            30.01  30.01  29.99  30.01  30.02  30.06  29.99  30.10  29.91
peek               33.58  33.60  34.75  34.63  34.78  33.61  34.79  34.78  34.86
psdes-random       33.84  33.91  33.83  33.86  33.91  33.88  34.16  33.91  33.91
ratio-regions      49.08  48.63  48.33  48.33  49.40  49.45  48.70  48.95  49.45
ray                37.55  38.58  37.18  37.06  37.80  37.04  37.19  37.08  36.65
raytrace           34.20  35.34  34.35  34.94  35.23  34.00  34.08  34.61  34.52
simple             29.73  29.25  29.47  29.64  29.51  28.96  28.84  28.57  28.86
smith-normal-form  39.88  39.79  40.38  39.56  39.99  39.92  39.93  39.85  39.88
string-concat      91.43  91.56  91.68  91.40  91.65  91.58  91.75  91.50  91.36
tailfib            38.06  38.06  37.93  37.85  37.90  38.12  38.00  37.98  37.87
tak                30.73  30.77  31.96  31.60  33.85  30.94  31.41  33.22  30.87
tensor             39.62  39.59  39.67  39.70  39.70  39.54  39.71  39.76  39.67
tsp                37.88  37.80  37.84  37.79  37.75  37.62  37.86  37.99  37.82
tyan               30.48  30.45  30.55  30.86  30.79  30.78  30.74  30.85  30.79
vector-rev         27.00  26.60  26.61  26.44  26.55  26.37  26.68  26.53  26.39
vector32-concat    82.46  82.52  82.56  82.72  82.43  82.56  82.37  82.30  82.22
vector64-concat    91.66  91.86  91.57  91.77  91.63  91.13  91.54  91.42  91.13
vliw               28.91  28.43  28.94  29.05  28.90  28.82  28.29  28.27  28.44
wc-input1          43.95  43.81  43.87  43.95  43.86  43.93  43.95  43.89  43.79
wc-scanStream      21.63  21.64  22.96  23.03  23.02  21.65  22.51  22.89  23.17
zebra              30.39  30.46  30.32  30.29  30.28  30.37  30.37  30.36  30.39
zern               32.37  33.09  33.05  32.00  32.71  31.99  32.04  32.83  31.96

Space-safety prohibits ConstantPropagation from globalizing all arrays
that are allocated at most once by a program.  In particular, because
globals are live for the duration of the program, globalizing an `int
list array` (for example) would not be safe-for-space: arbitrarily
large lists may be stored in the array and never garbage
collected (whereas, when the `int list array` is not globalized, it
will be garbage collected when it is no longer live).

On the other hand, globalizing an `int array` (with a constant length)
is safe-for-space.

However, previously, the globalization of a `val a: t array =
Array_alloc[t] (l)` was conditioned on the smallness of `t array`.
Furthermore, `Type.isSmall` returns `false` for arrays, so no array
would be globalized.

It is correct to globalize an array if `t` is small; note that to
globalize `val a: t array = Array_alloc[t] (l)`, `l` (the length) must
be globalized and must, therefore, be a constant and the array is of
constant size. (This is Stephen Weeks's relaxed notion of
safe-for-space, where the constant factor blowup can be chosen per
program.)  In practice, it may be better to limit globalization of
arrays to ones with "small" length in addition to small element type.

This commit allows a `val a: t array = Array_alloc[t] (l)` to be
globalized if `-globalize-arrays true`, `t` is small, and `l` is
globalized.
Use standard Trace.trace and prep recursive calls.
`-globalize-small-type 0` is the constant `false` function, which
globalizes no references or arrays.

`-globalize-small-type 1` is the previous `Type.isSmall` function,
which returns false for all datatypes.

`-globalize-small-type 9` is the constant `true` function, which is
not safe-for-space.
`-globalize-small-type 2` treats datatypes with all nullary
constructors as small.
`-globalize-small-type 3` treats datatypes that have all constructor
arguments satisfying `#isSmallType o (mkIsSmallType 1)` as small.
Fixes a bug (triggered by the `simple` benchmark with
`-globalize-arrays true`) where the rawness of an array value was not
properly propagated.

Previously, the rawness of an array value was implemented as `bool
option ref`, but coercing `ref NONE` to `ref NONE` would not remember
the coercion when the first raw became `ref (SOME true)`.

Now, a proper "flat lattice"-like `structure Raw` is used.
Technically, both IntInf and Thread should not be considered small
types, as they can be of unbounded size (an IntInf can be represented
by a sequence, while a Thread implicitly contains a stack).

Note, an IntInf value can be globalized (as it is a constant), but an
`IntInf.int ref` cannot.
Allow previous behavior of globalizing `IntInf.int ref` values.
`-globalize-small-type 4` uses RefFlatten's more precise notion of
small/large types: build a graph of the dependencies between
datatypes, force any (mutually) recursive datatypes to be large types,
solve a fixed-point analysis of `Size.<=` constraints.
@MatthewFluet MatthewFluet merged commit 752467c into MLton:master Jan 16, 2019
@MatthewFluet MatthewFluet deleted the globalize-opt branch January 16, 2019 20:58
MatthewFluet added a commit that referenced this pull request May 31, 2019
Bounce variables around forced stack allocations in RSSA loops

See #218.

In the translation from RSSA to Machine, each RSSA variable is
assigned a location: either a stack slot or a register (i.e., local).
An RSSA variable is assigned to a stack slot if it is live across a
non-tail call or a `mayGC` ccall.  Assigning an RSSA variable to a
stack slot has a potential cost, because stack-slot accesses appear as
memory reads and writes to the codegens and are not easily realized by
hardware registers.  This cost is multiplied when the stack-slot
accesses are within a loop.  For example, consider the tail recursive
`revapp`:

``` standard-ml
fun revapp (l, acc) =
  case l of
     [] => acc
   | h::t => revapp (t, h::acc)
```

This will be realized as an intraprocedural loop; however, due to the
`h::acc` allocation in the loop, there will be a potential GC that
forces `l` and `acc` to be assigned to stack slots (assuming the GC
check occurs at the loop header).  It would seem to be more efficient
to assign `l` and `acc` to registers, moving them to and from stack
slots at the GC, because a GC should occur on only a small fraction of
the loop iterations.

A new `BounceVars` RSSA pass attempts to split the live ranges of RSSA
variables that are used within loops so that the within-loop instances
of the variables are assigned registers (possibly being moved to and
from stack slots around a GC).

Unfortunately, it has not been easy to find good heuristics.  The
current defaults are biased towards not degrading performance.

## Benchmark results (cadmium)

Specs:
* 4 x AMD Opteron(tm) Processor 6172 (2.1GHz; 48 physical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

``` text
config command                                                               
C00    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64      
C01    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c          
C02    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm       
C04    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64      
C05    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c          
C06    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm       
```

### Run-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           1.050     1.004     0.9939    0.9724
boyer                1.020     0.9547    0.9826    0.9940
checksum             1.007     1.034     0.9795    1.003 
count-graphs         0.9735    1.008     0.9826    0.9722
DLXSimulator         0.9397    0.9566    0.9553    0.8925
even-odd             0.9185    1.019     1.000     0.9754
fft                  1.061     1.041     1.039     1.021 
fib                  1.013     1.046     1.040     0.9884
flat-array           1.064     0.9138    0.9901    0.9247
hamlet               0.9442    0.9713    0.9664    0.9955
imp-for              1.040     0.9223    0.8787    1.005 
knuth-bendix         0.9967    1.007     0.9924    0.9744
lexgen               1.005     1.025     0.9892    0.9574
life                 0.9805    0.9813    0.9779    1.012 
logic                1.062     0.9721    1.002     0.9641
mandelbrot           0.9995    1.003     1.005     1.000 
matrix-multiply      0.9424    1.028     1.040     0.9958
md5                  0.9946    0.9913    0.9929    1.053 
merge                0.9942    0.9397    0.9797    0.9558
mlyacc               0.9956    0.9558    1.021     1.026 
model-elimination    0.9816    0.9842    1.000     0.9769
mpuz                 0.6748    0.8899    0.9354    1.084 
nucleic              0.9955    0.9901    1.008     0.9746
output1              0.9880    0.8836    0.8840    0.9090
peek                 1.010     1.075     0.9669    1.000 
pidigits             0.9778    1.055     0.9936    1.018 
psdes-random         0.9905    0.9943    1.054     1.000 
ratio-regions        1.030     1.071     0.9943    0.9448
ray                  1.016     0.9944    0.9993    1.030 
raytrace             1.024     1.009     0.9994    1.017 
simple               0.9900    1.017     1.019     1.032 
smith-normal-form    1.015     1.017     1.080     1.061 
string-concat        0.9619    1.004     0.9863    1.173 
tailfib              0.9867    1.056     1.001     0.9989
tailmerge            1.005     1.002     0.9844    0.9697
tak                  0.9886    1.025     0.9999    1.013 
tensor               1.015     0.9728    1.006     0.9927
tsp                  1.028     0.9878    0.9956    0.9745
tyan                 1.029     1.011     1.012     0.9404
vector32-concat      0.7653    0.6395    0.8281    1.003 
vector64-concat      0.8517    0.7782    0.8571    0.9619
vector-rev           0.9514    0.9847    0.8541    0.9161
vliw                 1.028     0.9368    1.017     1.011 
wc-input1            0.9170    0.9157    0.9084    0.9240
wc-scanStream        0.9929    1.007     1.146     1.038 
zebra                0.9984    1.008     1.008     0.9941
zern                 1.026     0.9929    1.005     0.9942
MIN                  0.6748    0.6395    0.8281    0.8925
GMEAN                0.9811    0.9771    0.9846    0.9912
MAX                  1.064     1.075     1.146     1.173 
```

Notes:
* Overall, there is less run-time improvements than were hoped,
  although there are some "big" wins and no "big" losses.
* The LLVM codegen seems to benefit the least; it may be that LLVM is
  able to realizes some stack-slot accesses as hardware registers
  (although it is unclear how it justifies doing so --- LLVM should
  not be able to "see" that the access of stack slots and heap objects
  do not alias).
* The speedups with `vector32-concat` and `vector64-concat` may be
  related to the observation made in #288, where it was
  observed that moving a sequence from a register to a global had a
  significant slowdown (since the access through the global required
  an additional level of indirection).  In this case, a sequence may
  be moving from a stack slot to a register, eliminating a level of
  indirection.

### Compile-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           0.9721    0.9760    0.9171    0.9393
boyer                0.9054    0.9675    0.9988    1.024 
checksum             0.9636    0.9630    0.9753    1.016 
count-graphs         0.9822    1.002     0.9862    0.9151
DLXSimulator         0.9205    0.9248    0.9340    0.9704
even-odd             0.9808    0.9600    0.9705    1.015 
fft                  0.9938    0.9644    0.9817    0.9656
fib                  0.9670    0.9777    0.9750    0.9473
flat-array           0.9781    0.9758    1.013     0.9805
hamlet               0.9942    1.104     1.029     1.000 
imp-for              0.9380    0.9764    0.9429    0.9979
knuth-bendix         0.9409    0.9584    0.8814    0.8520
lexgen               0.9317    0.8911    0.8933    0.9525
life                 0.9598    0.9610    0.9812    0.9046
logic                0.9704    0.9620    0.9474    0.9856
mandelbrot           0.9370    0.9926    1.011     0.9694
matrix-multiply      0.9510    0.9425    0.9634    0.9497
md5                  0.9568    0.9609    0.9473    0.9282
merge                0.9896    0.9517    0.9780    0.9987
mlyacc               1.124     1.076     1.161     1.150 
model-elimination    0.9912    0.9493    1.093     0.9879
mpuz                 0.9568    0.9690    1.006     0.9763
nucleic              0.9010    0.9169    0.9384    1.079 
output1              0.9668    0.9896    0.9598    0.9086
peek                 0.9713    0.9736    0.9840    0.9626
pidigits             0.9720    0.9686    0.9494    0.9379
psdes-random         0.9761    1.004     1.002     0.9435
ratio-regions        0.9844    0.9421    0.9679    0.9262
ray                  0.9605    0.8728    0.8533    0.8406
raytrace             0.9130    0.9501    0.8719    1.029 
simple               0.8696    0.9130    0.9537    0.9602
smith-normal-form    0.9000    0.9400    0.9632    1.167 
string-concat        0.9667    0.9934    0.9466    1.000 
tailfib              0.9807    0.9598    0.9779    0.9713
tailmerge            0.9848    0.9871    0.9878    0.9870
tak                  0.9720    1.0000    0.9728    0.9968
tensor               0.9739    0.9348    0.9057    0.8996
tsp                  0.9955    1.001     0.9283    0.9518
tyan                 0.9734    0.9536    0.9052    0.9488
vector32-concat      0.9527    0.9997    0.9673    0.9471
vector64-concat      0.9617    0.9722    0.9921    0.9885
vector-rev           0.9809    0.9846    0.9625    0.9955
vliw                 1.020     1.004     1.111     1.086 
wc-input1            0.9345    1.002     0.9502    0.9549
wc-scanStream        0.9890    0.9830    1.027     1.028 
zebra                0.9712    0.9204    1.101     0.9702
zern                 0.9416    1.013     1.098     0.8998
MIN                  0.8696    0.8728    0.8533    0.8406
GMEAN                0.9635    0.9691    0.9740    0.9727
MAX                  1.124     1.104     1.161     1.167 
```

Notes:
* One concern with assigning more RSSA variables to registers is that
  increases the burden on the codegen (hardware) register allocator.
* With the exception of `mlyacc`, there is a general improvement in
  compile time.  Note that configs C04, C05, C06, and C07 are using a
  MLton compiled with `BounceVars` to compile programs with
  `BounceVars`.  It isn't clear how much of the benefit is due to the
  effect of `BounceVars` on the compiled MLton and how much is due to
  the effect of `BounceVars` on the IR of the compiled programs.  In
  any case, it seems that `BounceVars` "pays for itself", although the
  greatest improvements are not on the largest benchmark programs
  (`hamlet`, `mlyacc`, `model-elimination`, `nucleic`, `vliw`).

### Executable-Size Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut            1.005    1.005     1.009     0.9982
boyer                 1.002    0.9940    0.9967    0.9972
checksum              1.005    1.001     1.001     1.000 
count-graphs          1.007    1.004     1.005     1.001 
DLXSimulator          1.015    1.005     1.006     1.007 
even-odd              1.005    1.001     0.9999    1.0000
fft                   1.009    1.006     1.004     1.003 
fib                   1.005    1.001     0.9998    1.000 
flat-array            1.005    1.001     1.001     0.9993
hamlet                1.002    1.003     1.004     0.9999
imp-for               1.005    1.001     1.001     0.9991
knuth-bendix          1.004    0.9990    1.003     1.001 
lexgen                1.025    1.001     1.002     1.004 
life                  1.003    0.9999    1.000     0.9999
logic                 1.004    1.003     1.001     1.002 
mandelbrot            1.005    1.001     1.001     0.9994
matrix-multiply       1.005    1.000     1.002     0.9993
md5                   1.005    1.002     1.002     1.007 
merge                 1.004    1.000     0.9994    0.9995
mlyacc                1.176    1.154     1.183     1.180 
model-elimination     1.006    1.009     1.011     1.003 
mpuz                  1.006    0.9998    1.001     1.001 
nucleic               1.000    0.9996    0.9998    0.9985
output1               1.006    1.002     1.002     1.006 
peek                  1.005    1.001     1.001     0.9996
pidigits              1.003    0.9998    1.001     1.005 
psdes-random          1.005    1.001     1.002     0.9976
ratio-regions         1.020    1.008     1.007     1.009 
ray                   1.023    1.021     1.018     1.013 
raytrace              1.008    1.003     1.003     1.011 
simple                1.009    1.033     1.034     1.055 
smith-normal-form     1.005    1.005     1.006     1.003 
string-concat         1.004    0.9997    1.002     0.9980
tailfib               1.005    1.000     1.000     0.9998
tak                   1.005    1.001     0.9997    0.9999
tensor                1.019    1.019     1.018     1.020 
tsp                   1.005    1.002     1.004     1.007 
tyan                  1.034    1.026     1.025     1.030 
vector32-concat       1.005    1.000     1.001     0.9984
vector64-concat       1.005    1.000     1.001     0.9982
vector-rev            1.005    1.001     1.001     0.9988
vliw                  1.033    1.010     1.012     1.027 
wc-input1             1.012    1.009     1.008     1.010 
wc-scanStream         1.011    1.006     1.007     1.007 
zebra                 1.004    1.001     1.002     1.005 
zern                  1.006    1.003     1.001     1.002 
MIN                   1.000    0.9940    0.9967    0.9972
GMEAN                 1.012    1.007     1.008     1.008 
MAX                   1.176    1.154     1.183     1.180 
```

Notes:
* Consistent with the increase in compile-time for `mlyacc`, there is
  an increase in code size for `mlyacc`.
* Otherwise, there is little change in code size.


## Benchmark results (sulfur)

Specs:
* 2 x Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (8 physical cores; 16 logical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

``` text
config command                                                               
C00    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64      
C01    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c          
C02    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm       
C04    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64      
C05    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c          
C06    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm       
```

### Run-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           1.066     1.004     0.9963    0.8840
boyer                0.9890    0.9917    0.9902    0.9567
checksum             1.065     1.001     0.9935    0.9681
count-graphs         0.9580    0.9805    0.9665    0.9672
DLXSimulator         0.7266    0.7235    0.7374    0.7496
even-odd             1.009     0.9903    0.9873    0.9845
fft                  0.9930    0.9983    0.9710    0.9682
fib                  1.108     1.023     1.0000    0.9724
flat-array           0.9638    0.9581    0.9604    0.9620
hamlet               0.8991    0.9415    0.9295    0.9544
imp-for              0.9688    0.9359    0.9240    1.123 
knuth-bendix         1.005     0.9915    1.025     1.005 
lexgen               1.001     0.9954    0.9583    0.9974
life                 0.9659    1.003     0.9652    0.9923
logic                0.9887    0.9938    0.9702    0.9919
mandelbrot           0.9989    1.015     1.024     1.008 
matrix-multiply      1.017     0.9927    1.014     0.9881
md5                  0.9846    1.006     0.9868    1.026 
merge                1.071     1.077     1.039     1.074 
mlyacc               0.9994    0.9986    1.012     1.011 
model-elimination    0.9850    0.9656    0.9885    0.9879
mpuz                 0.7695    0.8649    0.9525    1.018 
nucleic              0.9267    0.9614    0.9359    0.9804
output1              0.9453    0.8869    0.8673    0.8782
peek                 0.9988    1.000     1.011     0.9938
pidigits             0.9734    0.9961    1.003     0.9986
psdes-random         0.9902    0.9998    1.016     0.9956
ratio-regions        1.107     0.9887    0.9315    0.9852
ray                  0.9905    0.9674    0.9930    0.9931
raytrace             1.008     1.032     1.005     1.057 
simple               0.9659    0.9692    0.9842    0.9648
smith-normal-form    0.9858    0.9876    0.9712    0.9615
string-concat        0.9901    1.001     1.007     0.9731
tailfib              0.9968    1.015     1.000     0.9936
tailmerge            0.9893    0.9932    0.9810    0.8081
tak                  0.9724    1.058     0.9762    0.9693
tensor               1.005     0.9982    0.9914    1.006 
tsp                  1.030     1.049     1.037     1.032 
tyan                 1.007     0.9887    0.9850    1.002 
vector32-concat      0.3328    0.3212    0.6566    1.029 
vector64-concat      0.3983    0.3806    0.7111    0.9922
vector-rev           0.9282    1.032     0.9473    0.9013
vliw                 0.9068    0.8996    0.8662    0.9419
wc-input1            1.396     0.9463    1.120     1.348 
wc-scanStream        0.9928    1.015     1.117     0.9642
zebra                0.9957    0.9966    0.9974    0.9967
zern                 0.9958    1.013     1.000     0.9743
MIN                  0.3328    0.3212    0.6566    0.7496
GMEAN                0.9468    0.9393    0.9640    0.9827
MAX                  1.396     1.077     1.120     1.348 
```

Notes:
* Again, there are some "big" wins and no "big" losses.
* Interestingly, compared to `cadmium`, `vector32-concat` and
  `vector64-concat` see greater speedups, though `mpuz` sees a smaller
  speedup.

### Compile-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           0.9401    0.9632    1.013     1.006 
boyer                0.9659    1.000     0.9725    0.9629
checksum             1.022     0.9911    0.9966    0.9967
count-graphs         1.005     1.042     1.004     1.011 
DLXSimulator         0.9910    0.9526    0.9833    0.9914
even-odd             0.9654    0.9591    1.009     0.9795
fft                  0.9841    0.9713    0.9885    0.9923
fib                  0.9417    0.9946    1.000     0.9976
flat-array           1.002     1.049     0.9716    1.012 
hamlet               1.103     1.018     0.9917    0.9781
imp-for              0.9586    1.041     0.9824    1.032 
knuth-bendix         0.9690    1.007     1.034     0.9347
lexgen               1.049     0.9718    0.9930    1.029 
life                 0.9490    0.9798    1.023     1.018 
logic                0.9936    0.9303    0.9768    0.9715
mandelbrot           0.9959    0.9518    0.9339    0.9740
matrix-multiply      1.117     1.060     1.013     0.9176
md5                  1.007     0.9454    0.9904    0.9973
merge                0.9922    0.9493    0.9793    1.015 
mlyacc               1.113     1.125     1.136     1.143 
model-elimination    0.9702    1.030     1.025     1.017 
mpuz                 1.015     0.9249    1.053     0.9401
nucleic              0.9813    1.007     0.9919    1.043 
output1              0.9285    0.9746    0.9603    1.049 
peek                 0.9547    0.9915    0.9410    0.9543
pidigits             0.9569    0.9757    0.9745    0.9973
psdes-random         0.9687    0.9835    1.006     0.9596
ratio-regions        0.9640    1.021     1.019     1.036 
ray                  1.079     0.9156    0.9577    0.9714
raytrace             1.008     0.9028    1.010     1.052 
simple               1.025     1.154     0.9570    1.068 
smith-normal-form    0.9611    0.9399    0.9215    0.9196
string-concat        1.023     0.9968    0.9789    1.072 
tailfib              0.9869    0.9748    0.9711    1.001 
tailmerge            0.9812    0.9766    0.9731    0.9051
tak                  0.9694    0.9863    0.9719    0.9681
tensor               0.9759    1.061     0.9847    1.030 
tsp                  0.9818    1.094     0.9497    1.066 
tyan                 1.018     1.004     0.9927    1.011 
vector32-concat      0.9783    0.9755    0.9766    0.9792
vector64-concat      0.9820    0.9876    0.9830    0.9607
vector-rev           0.9648    0.9215    1.066     1.002 
vliw                 1.075     1.045     1.035     1.052 
wc-input1            1.012     1.025     1.005     0.9931
wc-scanStream        0.9747    0.9883    0.9940    0.9761
zebra                0.9838    0.9624    0.9758    1.003 
zern                 0.8921    0.9604    0.9794    0.9880
MIN                  0.8921    0.9028    0.9215    0.9051
GMEAN                0.9921    0.9920    0.9919    0.9985
MAX                  1.117     1.154     1.136     1.143 
```

Notes:
* Compared to `cadmium`, there is less general improvement in compile
  time, though (with the exception of `mlyacc`), no "across the board"
  slowdowns.

### Executable-Size Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut            1.005    1.005     1.009     0.9983
boyer                 1.002    0.9940    0.9967    0.9972
checksum              1.005    1.001     1.001     1.000 
count-graphs          1.007    1.004     1.005     1.001 
DLXSimulator          1.015    1.005     1.006     1.008 
even-odd              1.005    1.001     0.9999    0.9999
fft                   1.009    1.006     1.004     1.003 
fib                   1.005    1.001     0.9998    1.000 
flat-array            1.005    1.001     1.001     0.9995
hamlet                1.002    1.003     1.004     0.9999
imp-for               1.005    1.001     1.001     0.9991
knuth-bendix          1.004    0.9989    1.003     1.000 
lexgen                1.025    1.001     1.002     1.004 
life                  1.003    0.9998    1.000     0.9997
logic                 1.004    1.003     1.001     0.9997
mandelbrot            1.005    1.001     1.001     0.9994
matrix-multiply       1.005    1.000     1.002     0.9996
md5                   1.005    1.002     1.002     1.007 
merge                 1.004    1.000     0.9993    0.9994
mlyacc                1.176    1.154     1.183     1.179 
model-elimination     1.006    1.009     1.011     1.003 
mpuz                  1.006    0.9998    1.001     1.001 
nucleic               1.000    0.9996    0.9998    0.9985
output1               1.006    1.002     1.002     1.006 
peek                  1.005    1.001     1.001     0.9994
pidigits              1.003    0.9998    1.001     1.005 
psdes-random          1.005    1.001     1.002     0.9978
ratio-regions         1.020    1.008     1.007     1.009 
ray                   1.023    1.021     1.018     1.013 
raytrace              1.008    1.003     1.003     1.013 
simple                1.009    1.033     1.034     1.055 
smith-normal-form     1.005    1.005     1.006     1.003 
string-concat         1.004    0.9995    1.002     0.9980
tailfib               1.005    1.000     1.000     0.9998
tailmerge             1.004    0.9995    1.001     0.9998
tak                   1.005    1.001     0.9997    1.000 
tensor                1.019    1.019     1.018     1.019 
tsp                   1.005    1.002     1.004     1.007 
tyan                  1.034    1.026     1.025     1.031 
vector32-concat       1.005    1.000     1.001     0.9984
vector64-concat       1.005    1.000     1.001     0.9982
vector-rev            1.005    1.001     1.001     0.9987
vliw                  1.033    1.010     1.012     1.026 
wc-input1             1.012    1.009     1.008     1.010 
wc-scanStream         1.011    1.006     1.007     1.007 
zebra                 1.004    1.001     1.002     1.005 
zern                  1.006    1.003     1.001     1.002 
MIN                   1.000    0.9940    0.9967    0.9972
GMEAN                 1.011    1.007     1.008     1.008 
MAX                   1.176    1.154     1.183     1.179 
```

Notes:
* As they should be (same versions of MLton, gcc, llvm), executable
  sizes are identical with `cadmium`.
MatthewFluet added a commit that referenced this pull request Sep 20, 2019
Static allocation/initialization of objects in backend

The main benefits are that code size and compile time are improved
across the board, particularly for larger programs.  Runtime sometimes
improves, but should only affect programs which had hot code accessing
globals, as it removes one level of indirection.  Garbage collections
might be marginally faster, as globals are now mostly skipped.

Statically allocated and initialized objects are created in the main
`.c` file, where they will be placed in the data segment of the
executable:

    const struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[9];}
    static_20 = {(Word64)(0x0ull), (Word64)(0x9ull), (Word64)(0x7ull), "addrinuse"};
    const struct {Word64 meta_0; Word32 data_0; Word32 data_1; Pointer data_2; }
    static_21 = {(Word64)(0x29ull), (Word32)(0x62ull), (Word32)(0x0ull), ((Pointer)(&static_20) + 24)};

Note that these are proper ML objects, with metadata and data.
References to statically allocated objects are via pointers to the
first data field (e.g., `&static20 + 24`).  Note also that
`WordXVector` (e.g., strings) are a special case of statically
allocated and initialized objects.  Statically allocated and
initialized objects can be both immutable and mutable, although the
latter should be restricted to objects with non-`Objptr` mutable
fields.

A special case of statically allocated objects are arrays, whose
contents will be dynamically initialized by the mutator.  These are
also created in the main `.c` file, but are placed in the bss segment
of the executable (decreasing the size of the executable) and proper
metadata is written by initialization code:

    struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[800000];}
    static_26;
    struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[0];}
    static_31;

    static void static_Init() {
        memcpy (&static_26, &((struct {Word64 meta_0; Word64 meta_1; Word64 meta_2}){(Word64)(0x0ull), (Word64)(0x186A0ull), (Word64)(0x11ull)}), 24);
        memcpy (&static_31, &((struct {Word64 meta_0; Word64 meta_1; Word64 meta_2}){(Word64)(0x0ull), (Word64)(0x0ull), (Word64)(0x13ull)}), 24);
    };

Finally, dynamically allocated but statically initialized objects have
their initialization data in the main `.c` file along with information
to copy that data to the initial dynamic heap during `initWorld`:

    const static struct {Word64 meta_0; Word64 data_0; }
    static_9819 = {(Word64)(0x79Dull), (Word64)(0x1ull)};
    const static struct {Word64 meta_0; Word64 data_0; }
    static_9820 = {(Word64)(0x79Dull), (Word64)(0x1ull)};

    static struct GC_objectInit objectInits[] = {
        { 11, 8, 16, ((Pointer) &static_9819) },
        { 12, 8, 16, ((Pointer) &static_9820) },
        ...
    }

By default (with `-static-init-objects staticAllocOnly`), no such
objects are created.

With `-static-init-objects all`, global objects with `Objptr` mutable
fields would be dynamically allocated but statically initialized.
But, such global objects are rare.  For example, a `(int * int) ref`
could be space-safely globalized and would be an object with a mutable
`Objptr` field.  However, it is also likely that such a tuple would be
`RefFlatten`ed.  With `-globalize-small-type 4` (see #288
and 752467c), an `(int * int, int * int) either ref` could be
globalized and represented as an object with a mutable `Objptr` field.
Similarly, an `IntInf.int ref` can also be globalized and would be
represented as an object with a mutable `Objptr` field.

With `-static-alloc-objects false -static-init-objects all`, all
global objects will be dynamically allocated but statically
initialized (and no global objects will be statically allocated).
Similarly, with `-static-alloc-wordvector-consts false`, string
constants will be dynamically allocated but statically initialized;
this corresponds to the previous MLton behavior with respect to string
constants.


A number of controls have been added to control static
allocation/initialization:

 * `-static-alloc-internal-ptrs {static|all|none}`

   Controls which kinds of objects can be statically allocated:
    * `static`: only objects with all fields either immutable or non-`Objptr`
    * `none`: only objects with no fields
    * `all`: all objects

   The `all` setting is incompatible with the current GC for two
   reasons.  First, statically allocated objects are not traced by the
   GC; a statically-allocated object that is updated with an `Objptr`
   to an object in the heap should be considered a root.  Second, a
   statically-allocated object that is updated with an `Objptr` would
   trigger a card marking, but the address of a statically-allocated
   object would not map to a valid card slot.

 * `-static-alloc-wordvector-consts {true|false}`

   Controls whether or not `WordXVector` constants are converted to
   statics (with `ImmStatic` location) at `Ssa2ToRssa`.

 * `-static-init-arrays {true|false}`

   Controls whether or not `Array_alloc` primitives are converted to
   statics (with `MutStatic` or `Heap` location) at `Ssa2ToRssa`.

 * `-static-alloc-arrays {true|false}`

   Controls whether or not `Array_alloc` primitives that can be
   statically initialized are forced to `Heap` location.

 * `-static-init-objects {none|staticAllocOnly|all}`

   Controls whether or not `Object` expressions are converted to
   statics at `Ssa2ToRssa`.  If `staticAllocOnly`, then an object that
   would be converted to a static with `Heap` location is not
   converted to a static.

 * `-static-alloc-objects {true|false}`

   Controls whether or not `Object` expressions that can be statically
   initialized are forced to `Heap` location.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant