Alternate strategies for globalization in ConstantPropagation #288
Merged
Space-safety prohibits ConstantPropagation from globalizing all arrays that are allocated at most once by a program. In particular, because globals are live for the duration of the program, globalizing an `int list array` (for example) would not be safe-for-space: arbitrarily large lists may be stored in the array and never garbage collected (whereas, when the `int list array` is not globalized, it will be garbage collected when it is no longer live). On the other hand, globalizing an `int array` (with a constant length) is safe-for-space.

However, previously, the globalization of a `val a: t array = Array_alloc[t] (l)` was conditioned on the smallness of `t array`, and `Type.isSmall` returns `false` for arrays, so no array would be globalized. It is correct to globalize an array if `t` is small; note that to globalize `val a: t array = Array_alloc[t] (l)`, `l` (the length) must be globalized and must, therefore, be a constant, so the array is of constant size. (This is Stephen Weeks's relaxed notion of safe-for-space, where the constant-factor blowup can be chosen per program.) In practice, it may be better to limit globalization of arrays to ones with "small" length in addition to small element type. This commit allows a `val a: t array = Array_alloc[t] (l)` to be globalized if `-globalize-arrays true`, `t` is small, and `l` is globalized.
Use standard Trace.trace and prep recursive calls.
`-globalize-small-type 0` is the constant `false` function, which globalizes no references or arrays. `-globalize-small-type 1` is the previous `Type.isSmall` function, which returns false for all datatypes. `-globalize-small-type 9` is the constant `true` function, which is not safe-for-space.
`-globalize-small-type 2` treats datatypes with all nullary constructors as small.
`-globalize-small-type 3` treats datatypes that have all constructor arguments satisfying `#isSmallType o (mkIsSmallType 1)` as small.
Fixes a bug (triggered by the `simple` benchmark with `-globalize-arrays true`) where the rawness of an array value was not properly propagated. Previously, the rawness of an array value was implemented as `bool option ref`, but coercing `ref NONE` to `ref NONE` would not remember the coercion when the first raw became `ref (SOME true)`. Now, a proper "flat lattice"-like `structure Raw` is used.
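The "flat lattice" fix can be illustrated with a small sketch: cells that are coerced to one another must share their eventual value, which copying a plain `bool option ref` does not achieve. Below is a minimal C analogy using union-find; all names (`Cell`, `coerce`, `set_value`) are invented for illustration and this is not MLton's actual `structure Raw`:

```c
#include <assert.h>
#include <stdlib.h>

/* A flat-lattice cell for a boolean property: UNKNOWN (bottom) or a known
 * value. Cells that are coerced into one another are unified, so a later
 * update to one is seen by all (unlike two independent `ref NONE` cells). */
typedef enum { UNKNOWN, KNOWN_FALSE, KNOWN_TRUE } Value;

typedef struct Cell {
    struct Cell *parent;  /* union-find link; points to itself if representative */
    Value value;
} Cell;

static Cell *cell_new(void) {
    Cell *c = malloc(sizeof *c);
    c->parent = c;
    c->value = UNKNOWN;
    return c;
}

static Cell *find(Cell *c) {
    while (c->parent != c) c = c->parent;
    return c;
}

/* Coerce a into b: unify the two cells. A copy-based `bool option ref`
 * implementation would forget this link once both sides are still NONE. */
static void coerce(Cell *a, Cell *b) {
    Cell *ra = find(a), *rb = find(b);
    if (ra == rb) return;
    assert(ra->value == UNKNOWN || rb->value == UNKNOWN || ra->value == rb->value);
    if (rb->value == UNKNOWN) rb->value = ra->value;
    ra->parent = rb;
}

static void set_value(Cell *c, Value v) { find(c)->value = v; }
static Value get_value(Cell *c) { return find(c)->value; }
```

With this structure, coercing two unknown cells and later learning that one is raw propagates the fact to the other, which is exactly the propagation the `bool option ref` version missed.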
Technically, neither IntInf nor Thread should be considered a small type, as both can be of unbounded size (an IntInf can be represented by a sequence, while a Thread implicitly contains a stack). Note that an IntInf value can be globalized (as it is a constant), but an `IntInf.int ref` cannot.
Allow previous behavior of globalizing `IntInf.int ref` values.
`-globalize-small-type 4` uses RefFlatten's more precise notion of small/large types: build a graph of the dependencies between datatypes, force any (mutually) recursive datatypes to be large types, solve a fixed-point analysis of `Size.<=` constraints.
MatthewFluet added a commit that referenced this pull request on May 31, 2019:
Bounce variables around forced stack allocations in RSSA loops

See #218. In the translation from RSSA to Machine, each RSSA variable is assigned a location: either a stack slot or a register (i.e., local). An RSSA variable is assigned to a stack slot if it is live across a non-tail call or a `mayGC` ccall. Assigning an RSSA variable to a stack slot has a potential cost, because stack-slot accesses appear as memory reads and writes to the codegens and are not easily realized by hardware registers. This cost is multiplied when the stack-slot accesses are within a loop. For example, consider the tail-recursive `revapp`:

```sml
fun revapp (l, acc) =
  case l of
     [] => acc
   | h::t => revapp (t, h::acc)
```

This will be realized as an intraprocedural loop; however, due to the `h::acc` allocation in the loop, there will be a potential GC that forces `l` and `acc` to be assigned to stack slots (assuming the GC check occurs at the loop header). It would seem more efficient to assign `l` and `acc` to registers, moving them to and from stack slots at the GC, because a GC should occur on only a small fraction of the loop iterations. A new `BounceVars` RSSA pass attempts to split the live ranges of RSSA variables that are used within loops so that the within-loop instances of the variables are assigned registers (possibly being moved to and from stack slots around a GC). Unfortunately, it has not been easy to find good heuristics. The current defaults are biased towards not degrading performance.
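The intended effect of the pass can be sketched in C: the loop variable lives in a local (register) on the hot path and is spilled to / reloaded from its stack slot only around the rarely taken GC branch. All names here (`need_gc`, `do_gc`, `stack_slot`) are invented stand-ins, not MLton runtime functions:

```c
#include <stdbool.h>

static long stack_slot;            /* stands in for the variable's stack slot */
static int alloc_budget = 1000;    /* pretend allocation budget that triggers GC */

static bool need_gc(void) { return --alloc_budget == 0; }
static void do_gc(void) { alloc_budget = 1000; /* roots are read from stack slots */ }

/* Loop variable `acc` is "bounced": register-only accesses on the hot path,
 * moved to/from its stack slot only on the rarely taken GC branch. */
long sum_to(long n) {
    long acc = 0;
    for (long i = 0; i < n; i++) {
        acc += i;                  /* hot path: no memory traffic for acc */
        if (need_gc()) {           /* rarely taken */
            stack_slot = acc;      /* bounce out: make acc visible as a GC root */
            do_gc();
            acc = stack_slot;      /* bounce in: GC may have moved objects */
        }
    }
    return acc;
}
```

Without the bouncing, `acc` would be read from and written to `stack_slot` on every iteration; with it, the memory traffic is confined to the GC branch.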
## Benchmark results (cadmium)

Specs:
* 4 x AMD Opteron(tm) Processor 6172 (2.1GHz; 48 physical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

```text
config  command
C00  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64
C01  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c
C02  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm
C04  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64
C05  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c
C06  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm
```

### Run-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.050   1.004   0.9939  0.9724
boyer              1.020   0.9547  0.9826  0.9940
checksum           1.007   1.034   0.9795  1.003
count-graphs       0.9735  1.008   0.9826  0.9722
DLXSimulator       0.9397  0.9566  0.9553  0.8925
even-odd           0.9185  1.019   1.000   0.9754
fft                1.061   1.041   1.039   1.021
fib                1.013   1.046   1.040   0.9884
flat-array         1.064   0.9138  0.9901  0.9247
hamlet             0.9442  0.9713  0.9664  0.9955
imp-for            1.040   0.9223  0.8787  1.005
knuth-bendix       0.9967  1.007   0.9924  0.9744
lexgen             1.005   1.025   0.9892  0.9574
life               0.9805  0.9813  0.9779  1.012
logic              1.062   0.9721  1.002   0.9641
mandelbrot         0.9995  1.003   1.005   1.000
matrix-multiply    0.9424  1.028   1.040   0.9958
md5                0.9946  0.9913  0.9929  1.053
merge              0.9942  0.9397  0.9797  0.9558
mlyacc             0.9956  0.9558  1.021   1.026
model-elimination  0.9816  0.9842  1.000   0.9769
mpuz               0.6748  0.8899  0.9354  1.084
nucleic            0.9955  0.9901  1.008   0.9746
output1            0.9880  0.8836  0.8840  0.9090
peek               1.010   1.075   0.9669  1.000
pidigits           0.9778  1.055   0.9936  1.018
psdes-random       0.9905  0.9943  1.054   1.000
ratio-regions      1.030   1.071   0.9943  0.9448
ray                1.016   0.9944  0.9993  1.030
raytrace           1.024   1.009   0.9994  1.017
simple             0.9900  1.017   1.019   1.032
smith-normal-form  1.015   1.017   1.080   1.061
string-concat      0.9619  1.004   0.9863  1.173
tailfib            0.9867  1.056   1.001   0.9989
tailmerge          1.005   1.002   0.9844  0.9697
tak                0.9886  1.025   0.9999  1.013
tensor             1.015   0.9728  1.006   0.9927
tsp                1.028   0.9878  0.9956  0.9745
tyan               1.029   1.011   1.012   0.9404
vector32-concat    0.7653  0.6395  0.8281  1.003
vector64-concat    0.8517  0.7782  0.8571  0.9619
vector-rev         0.9514  0.9847  0.8541  0.9161
vliw               1.028   0.9368  1.017   1.011
wc-input1          0.9170  0.9157  0.9084  0.9240
wc-scanStream      0.9929  1.007   1.146   1.038
zebra              0.9984  1.008   1.008   0.9941
zern               1.026   0.9929  1.005   0.9942
MIN                0.6748  0.6395  0.8281  0.8925
GMEAN              0.9811  0.9771  0.9846  0.9912
MAX                1.064   1.075   1.146   1.173
```

Notes:
* Overall, there are fewer run-time improvements than were hoped for, although there are some "big" wins and no "big" losses.
* The LLVM codegen seems to benefit the least; it may be that LLVM is able to realize some stack-slot accesses as hardware registers (although it is unclear how it justifies doing so --- LLVM should not be able to "see" that accesses of stack slots and heap objects do not alias).
* The speedups with `vector32-concat` and `vector64-concat` may be related to the observation made in #288, where it was observed that moving a sequence from a register to a global had a significant slowdown (since the access through the global required an additional level of indirection). In this case, a sequence may be moving from a stack slot to a register, eliminating a level of indirection.
### Compile-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         0.9721  0.9760  0.9171  0.9393
boyer              0.9054  0.9675  0.9988  1.024
checksum           0.9636  0.9630  0.9753  1.016
count-graphs       0.9822  1.002   0.9862  0.9151
DLXSimulator       0.9205  0.9248  0.9340  0.9704
even-odd           0.9808  0.9600  0.9705  1.015
fft                0.9938  0.9644  0.9817  0.9656
fib                0.9670  0.9777  0.9750  0.9473
flat-array         0.9781  0.9758  1.013   0.9805
hamlet             0.9942  1.104   1.029   1.000
imp-for            0.9380  0.9764  0.9429  0.9979
knuth-bendix       0.9409  0.9584  0.8814  0.8520
lexgen             0.9317  0.8911  0.8933  0.9525
life               0.9598  0.9610  0.9812  0.9046
logic              0.9704  0.9620  0.9474  0.9856
mandelbrot         0.9370  0.9926  1.011   0.9694
matrix-multiply    0.9510  0.9425  0.9634  0.9497
md5                0.9568  0.9609  0.9473  0.9282
merge              0.9896  0.9517  0.9780  0.9987
mlyacc             1.124   1.076   1.161   1.150
model-elimination  0.9912  0.9493  1.093   0.9879
mpuz               0.9568  0.9690  1.006   0.9763
nucleic            0.9010  0.9169  0.9384  1.079
output1            0.9668  0.9896  0.9598  0.9086
peek               0.9713  0.9736  0.9840  0.9626
pidigits           0.9720  0.9686  0.9494  0.9379
psdes-random       0.9761  1.004   1.002   0.9435
ratio-regions      0.9844  0.9421  0.9679  0.9262
ray                0.9605  0.8728  0.8533  0.8406
raytrace           0.9130  0.9501  0.8719  1.029
simple             0.8696  0.9130  0.9537  0.9602
smith-normal-form  0.9000  0.9400  0.9632  1.167
string-concat      0.9667  0.9934  0.9466  1.000
tailfib            0.9807  0.9598  0.9779  0.9713
tailmerge          0.9848  0.9871  0.9878  0.9870
tak                0.9720  1.0000  0.9728  0.9968
tensor             0.9739  0.9348  0.9057  0.8996
tsp                0.9955  1.001   0.9283  0.9518
tyan               0.9734  0.9536  0.9052  0.9488
vector32-concat    0.9527  0.9997  0.9673  0.9471
vector64-concat    0.9617  0.9722  0.9921  0.9885
vector-rev         0.9809  0.9846  0.9625  0.9955
vliw               1.020   1.004   1.111   1.086
wc-input1          0.9345  1.002   0.9502  0.9549
wc-scanStream      0.9890  0.9830  1.027   1.028
zebra              0.9712  0.9204  1.101   0.9702
zern               0.9416  1.013   1.098   0.8998
MIN                0.8696  0.8728  0.8533  0.8406
GMEAN              0.9635  0.9691  0.9740  0.9727
MAX                1.124   1.104   1.161   1.167
```

Notes:
* One concern with assigning more RSSA variables to registers is that it increases the burden on the codegen (hardware) register allocator.
* With the exception of `mlyacc`, there is a general improvement in compile time. Note that configs C04, C05, C06, and C07 use an MLton compiled with `BounceVars` to compile programs with `BounceVars`. It isn't clear how much of the benefit is due to the effect of `BounceVars` on the compiled MLton and how much is due to the effect of `BounceVars` on the IR of the compiled programs. In any case, it seems that `BounceVars` "pays for itself", although the greatest improvements are not on the largest benchmark programs (`hamlet`, `mlyacc`, `model-elimination`, `nucleic`, `vliw`).

### Executable-Size Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.005   1.005   1.009   0.9982
boyer              1.002   0.9940  0.9967  0.9972
checksum           1.005   1.001   1.001   1.000
count-graphs       1.007   1.004   1.005   1.001
DLXSimulator       1.015   1.005   1.006   1.007
even-odd           1.005   1.001   0.9999  1.0000
fft                1.009   1.006   1.004   1.003
fib                1.005   1.001   0.9998  1.000
flat-array         1.005   1.001   1.001   0.9993
hamlet             1.002   1.003   1.004   0.9999
imp-for            1.005   1.001   1.001   0.9991
knuth-bendix       1.004   0.9990  1.003   1.001
lexgen             1.025   1.001   1.002   1.004
life               1.003   0.9999  1.000   0.9999
logic              1.004   1.003   1.001   1.002
mandelbrot         1.005   1.001   1.001   0.9994
matrix-multiply    1.005   1.000   1.002   0.9993
md5                1.005   1.002   1.002   1.007
merge              1.004   1.000   0.9994  0.9995
mlyacc             1.176   1.154   1.183   1.180
model-elimination  1.006   1.009   1.011   1.003
mpuz               1.006   0.9998  1.001   1.001
nucleic            1.000   0.9996  0.9998  0.9985
output1            1.006   1.002   1.002   1.006
peek               1.005   1.001   1.001   0.9996
pidigits           1.003   0.9998  1.001   1.005
psdes-random       1.005   1.001   1.002   0.9976
ratio-regions      1.020   1.008   1.007   1.009
ray                1.023   1.021   1.018   1.013
raytrace           1.008   1.003   1.003   1.011
simple             1.009   1.033   1.034   1.055
smith-normal-form  1.005   1.005   1.006   1.003
string-concat      1.004   0.9997  1.002   0.9980
tailfib            1.005   1.000   1.000   0.9998
tak                1.005   1.001   0.9997  0.9999
tensor             1.019   1.019   1.018   1.020
tsp                1.005   1.002   1.004   1.007
tyan               1.034   1.026   1.025   1.030
vector32-concat    1.005   1.000   1.001   0.9984
vector64-concat    1.005   1.000   1.001   0.9982
vector-rev         1.005   1.001   1.001   0.9988
vliw               1.033   1.010   1.012   1.027
wc-input1          1.012   1.009   1.008   1.010
wc-scanStream      1.011   1.006   1.007   1.007
zebra              1.004   1.001   1.002   1.005
zern               1.006   1.003   1.001   1.002
MIN                1.000   0.9940  0.9967  0.9972
GMEAN              1.012   1.007   1.008   1.008
MAX                1.176   1.154   1.183   1.180
```

Notes:
* Consistent with the increase in compile time for `mlyacc`, there is an increase in code size for `mlyacc`.
* Otherwise, there is little change in code size.

## Benchmark results (sulfur)

Specs:
* 2 x Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (8 physical cores; 16 logical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

```text
config  command
C00  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64
C01  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c
C02  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03  /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm
C04  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64
C05  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c
C06  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07  /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm
```

### Run-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.066   1.004   0.9963  0.8840
boyer              0.9890  0.9917  0.9902  0.9567
checksum           1.065   1.001   0.9935  0.9681
count-graphs       0.9580  0.9805  0.9665  0.9672
DLXSimulator       0.7266  0.7235  0.7374  0.7496
even-odd           1.009   0.9903  0.9873  0.9845
fft                0.9930  0.9983  0.9710  0.9682
fib                1.108   1.023   1.0000  0.9724
flat-array         0.9638  0.9581  0.9604  0.9620
hamlet             0.8991  0.9415  0.9295  0.9544
imp-for            0.9688  0.9359  0.9240  1.123
knuth-bendix       1.005   0.9915  1.025   1.005
lexgen             1.001   0.9954  0.9583  0.9974
life               0.9659  1.003   0.9652  0.9923
logic              0.9887  0.9938  0.9702  0.9919
mandelbrot         0.9989  1.015   1.024   1.008
matrix-multiply    1.017   0.9927  1.014   0.9881
md5                0.9846  1.006   0.9868  1.026
merge              1.071   1.077   1.039   1.074
mlyacc             0.9994  0.9986  1.012   1.011
model-elimination  0.9850  0.9656  0.9885  0.9879
mpuz               0.7695  0.8649  0.9525  1.018
nucleic            0.9267  0.9614  0.9359  0.9804
output1            0.9453  0.8869  0.8673  0.8782
peek               0.9988  1.000   1.011   0.9938
pidigits           0.9734  0.9961  1.003   0.9986
psdes-random       0.9902  0.9998  1.016   0.9956
ratio-regions      1.107   0.9887  0.9315  0.9852
ray                0.9905  0.9674  0.9930  0.9931
raytrace           1.008   1.032   1.005   1.057
simple             0.9659  0.9692  0.9842  0.9648
smith-normal-form  0.9858  0.9876  0.9712  0.9615
string-concat      0.9901  1.001   1.007   0.9731
tailfib            0.9968  1.015   1.000   0.9936
tailmerge          0.9893  0.9932  0.9810  0.8081
tak                0.9724  1.058   0.9762  0.9693
tensor             1.005   0.9982  0.9914  1.006
tsp                1.030   1.049   1.037   1.032
tyan               1.007   0.9887  0.9850  1.002
vector32-concat    0.3328  0.3212  0.6566  1.029
vector64-concat    0.3983  0.3806  0.7111  0.9922
vector-rev         0.9282  1.032   0.9473  0.9013
vliw               0.9068  0.8996  0.8662  0.9419
wc-input1          1.396   0.9463  1.120   1.348
wc-scanStream      0.9928  1.015   1.117   0.9642
zebra              0.9957  0.9966  0.9974  0.9967
zern               0.9958  1.013   1.000   0.9743
MIN                0.3328  0.3212  0.6566  0.7496
GMEAN              0.9468  0.9393  0.9640  0.9827
MAX                1.396   1.077   1.120   1.348
```

Notes:
* Again, there are some "big" wins and no "big" losses.
* Interestingly, compared to `cadmium`, `vector32-concat` and `vector64-concat` see greater speedups, though `mpuz` sees a smaller speedup.
### Compile-Time Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         0.9401  0.9632  1.013   1.006
boyer              0.9659  1.000   0.9725  0.9629
checksum           1.022   0.9911  0.9966  0.9967
count-graphs       1.005   1.042   1.004   1.011
DLXSimulator       0.9910  0.9526  0.9833  0.9914
even-odd           0.9654  0.9591  1.009   0.9795
fft                0.9841  0.9713  0.9885  0.9923
fib                0.9417  0.9946  1.000   0.9976
flat-array         1.002   1.049   0.9716  1.012
hamlet             1.103   1.018   0.9917  0.9781
imp-for            0.9586  1.041   0.9824  1.032
knuth-bendix       0.9690  1.007   1.034   0.9347
lexgen             1.049   0.9718  0.9930  1.029
life               0.9490  0.9798  1.023   1.018
logic              0.9936  0.9303  0.9768  0.9715
mandelbrot         0.9959  0.9518  0.9339  0.9740
matrix-multiply    1.117   1.060   1.013   0.9176
md5                1.007   0.9454  0.9904  0.9973
merge              0.9922  0.9493  0.9793  1.015
mlyacc             1.113   1.125   1.136   1.143
model-elimination  0.9702  1.030   1.025   1.017
mpuz               1.015   0.9249  1.053   0.9401
nucleic            0.9813  1.007   0.9919  1.043
output1            0.9285  0.9746  0.9603  1.049
peek               0.9547  0.9915  0.9410  0.9543
pidigits           0.9569  0.9757  0.9745  0.9973
psdes-random       0.9687  0.9835  1.006   0.9596
ratio-regions      0.9640  1.021   1.019   1.036
ray                1.079   0.9156  0.9577  0.9714
raytrace           1.008   0.9028  1.010   1.052
simple             1.025   1.154   0.9570  1.068
smith-normal-form  0.9611  0.9399  0.9215  0.9196
string-concat      1.023   0.9968  0.9789  1.072
tailfib            0.9869  0.9748  0.9711  1.001
tailmerge          0.9812  0.9766  0.9731  0.9051
tak                0.9694  0.9863  0.9719  0.9681
tensor             0.9759  1.061   0.9847  1.030
tsp                0.9818  1.094   0.9497  1.066
tyan               1.018   1.004   0.9927  1.011
vector32-concat    0.9783  0.9755  0.9766  0.9792
vector64-concat    0.9820  0.9876  0.9830  0.9607
vector-rev         0.9648  0.9215  1.066   1.002
vliw               1.075   1.045   1.035   1.052
wc-input1          1.012   1.025   1.005   0.9931
wc-scanStream      0.9747  0.9883  0.9940  0.9761
zebra              0.9838  0.9624  0.9758  1.003
zern               0.8921  0.9604  0.9794  0.9880
MIN                0.8921  0.9028  0.9215  0.9051
GMEAN              0.9921  0.9920  0.9919  0.9985
MAX                1.117   1.154   1.136   1.143
```

Notes:
* Compared to `cadmium`, there is less general improvement in compile time, though (with the exception of `mlyacc`) no "across the board" slowdowns.

### Executable-Size Ratio

```text
program            `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut         1.005   1.005   1.009   0.9983
boyer              1.002   0.9940  0.9967  0.9972
checksum           1.005   1.001   1.001   1.000
count-graphs       1.007   1.004   1.005   1.001
DLXSimulator       1.015   1.005   1.006   1.008
even-odd           1.005   1.001   0.9999  0.9999
fft                1.009   1.006   1.004   1.003
fib                1.005   1.001   0.9998  1.000
flat-array         1.005   1.001   1.001   0.9995
hamlet             1.002   1.003   1.004   0.9999
imp-for            1.005   1.001   1.001   0.9991
knuth-bendix       1.004   0.9989  1.003   1.000
lexgen             1.025   1.001   1.002   1.004
life               1.003   0.9998  1.000   0.9997
logic              1.004   1.003   1.001   0.9997
mandelbrot         1.005   1.001   1.001   0.9994
matrix-multiply    1.005   1.000   1.002   0.9996
md5                1.005   1.002   1.002   1.007
merge              1.004   1.000   0.9993  0.9994
mlyacc             1.176   1.154   1.183   1.179
model-elimination  1.006   1.009   1.011   1.003
mpuz               1.006   0.9998  1.001   1.001
nucleic            1.000   0.9996  0.9998  0.9985
output1            1.006   1.002   1.002   1.006
peek               1.005   1.001   1.001   0.9994
pidigits           1.003   0.9998  1.001   1.005
psdes-random       1.005   1.001   1.002   0.9978
ratio-regions      1.020   1.008   1.007   1.009
ray                1.023   1.021   1.018   1.013
raytrace           1.008   1.003   1.003   1.013
simple             1.009   1.033   1.034   1.055
smith-normal-form  1.005   1.005   1.006   1.003
string-concat      1.004   0.9995  1.002   0.9980
tailfib            1.005   1.000   1.000   0.9998
tailmerge          1.004   0.9995  1.001   0.9998
tak                1.005   1.001   0.9997  1.000
tensor             1.019   1.019   1.018   1.019
tsp                1.005   1.002   1.004   1.007
tyan               1.034   1.026   1.025   1.031
vector32-concat    1.005   1.000   1.001   0.9984
vector64-concat    1.005   1.000   1.001   0.9982
vector-rev         1.005   1.001   1.001   0.9987
vliw               1.033   1.010   1.012   1.026
wc-input1          1.012   1.009   1.008   1.010
wc-scanStream      1.011   1.006   1.007   1.007
zebra              1.004   1.001   1.002   1.005
zern               1.006   1.003   1.001   1.002
MIN                1.000   0.9940  0.9967  0.9972
GMEAN              1.011   1.007   1.008   1.008
MAX                1.176   1.154   1.183   1.179
```

Notes:
* As they should be (same versions of MLton, gcc, llvm), executable sizes are identical with `cadmium`.
MatthewFluet added a commit that referenced this pull request on Sep 20, 2019:
Static allocation/initialization of objects in backend

The main benefits are that code size and compile time are improved across the board, particularly for larger programs. Run time sometimes improves, but should only affect programs which had hot code accessing globals, as it removes one level of indirection. Garbage collections might be marginally faster, as globals are now mostly skipped.

Statically allocated and initialized objects are created in the main `.c` file, where they will be placed in the data segment of the executable:

```c
const struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[9];}
  static_20 = {(Word64)(0x0ull), (Word64)(0x9ull), (Word64)(0x7ull), "addrinuse"};
const struct {Word64 meta_0; Word32 data_0; Word32 data_1; Pointer data_2;}
  static_21 = {(Word64)(0x29ull), (Word32)(0x62ull), (Word32)(0x0ull), ((Pointer)(&static_20) + 24)};
```

Note that these are proper ML objects, with metadata and data. References to statically allocated objects are via pointers to the first data field (e.g., `&static_20 + 24`). Note also that `WordXVector`s (e.g., strings) are a special case of statically allocated and initialized objects. Statically allocated and initialized objects can be both immutable and mutable, although the latter should be restricted to objects with non-`Objptr` mutable fields.

A special case of statically allocated objects are arrays, whose contents will be dynamically initialized by the mutator. These are also created in the main `.c` file, but are placed in the bss segment of the executable (decreasing the size of the executable) and proper metadata is written by initialization code:

```c
struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[800000];} static_26;
struct {Word64 meta_0; Word64 meta_1; Word64 meta_2; Word8 data[0];} static_31;
static void static_Init() {
  memcpy (&static_26, &((struct {Word64 meta_0; Word64 meta_1; Word64 meta_2;})
    {(Word64)(0x0ull), (Word64)(0x186A0ull), (Word64)(0x11ull)}), 24);
  memcpy (&static_31, &((struct {Word64 meta_0; Word64 meta_1; Word64 meta_2;})
    {(Word64)(0x0ull), (Word64)(0x0ull), (Word64)(0x13ull)}), 24);
};
```

Finally, dynamically allocated but statically initialized objects have their initialization data in the main `.c` file along with information to copy that data to the initial dynamic heap during `initWorld`:

```c
const static struct {Word64 meta_0; Word64 data_0;} static_9819 = {(Word64)(0x79Dull), (Word64)(0x1ull)};
const static struct {Word64 meta_0; Word64 data_0;} static_9820 = {(Word64)(0x79Dull), (Word64)(0x1ull)};
static struct GC_objectInit objectInits[] = {
  { 11, 8, 16, ((Pointer) &static_9819) },
  { 12, 8, 16, ((Pointer) &static_9820) },
  ...
};
```

By default (with `-static-init-objects staticAllocOnly`), no such objects are created. With `-static-init-objects all`, global objects with `Objptr` mutable fields would be dynamically allocated but statically initialized. But such global objects are rare. For example, an `(int * int) ref` could be space-safely globalized and would be an object with a mutable `Objptr` field; however, it is also likely that such a tuple would be `RefFlatten`ed. With `-globalize-small-type 4` (see #288 and 752467c), an `(int * int, int * int) either ref` could be globalized and represented as an object with a mutable `Objptr` field. Similarly, an `IntInf.int ref` can also be globalized and would be represented as an object with a mutable `Objptr` field.
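To make the pointer arithmetic concrete, here is a small compilable C sketch of the `static_20`-style layout above. The metadata field meanings (counter, length, header) are assumptions inferred from the initializers shown, and `obj_static_20` is an invented helper:

```c
#include <stdint.h>

typedef uint64_t Word64;
typedef unsigned char Word8;

/* Mirrors the shape of static_20: three metadata words followed by the
 * sequence data (the characters of "addrinuse", with no terminating NUL,
 * since ML strings carry their length in the metadata). */
static const struct {
    Word64 meta_0;   /* 0x0: e.g. a counter word (assumed meaning) */
    Word64 meta_1;   /* 0x9: the sequence length (assumed meaning) */
    Word64 meta_2;   /* 0x7: the object header (assumed meaning) */
    Word8  data[9];
} static_20 = { 0x0ull, 0x9ull, 0x7ull,
                {'a','d','d','r','i','n','u','s','e'} };

/* The ML-level object pointer skips the 24 metadata bytes and points at the
 * first data byte, matching `(Pointer)(&static_20) + 24` in the commit. */
static const Word8 *obj_static_20(void) {
    return (const Word8 *)&static_20 + 24;
}
```

Reading backwards from the object pointer recovers the metadata, which is how the runtime finds an object's header from its pointer.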
With `-static-alloc-objects false -static-init-objects all`, all global objects will be dynamically allocated but statically initialized (and no global objects will be statically allocated). Similarly, with `-static-alloc-wordvector-consts false`, string constants will be dynamically allocated but statically initialized; this corresponds to the previous MLton behavior with respect to string constants.

A number of controls have been added to control static allocation/initialization:

* `-static-alloc-internal-ptrs {static|all|none}`: controls which kinds of objects can be statically allocated:
  * `static`: only objects with all fields either immutable or non-`Objptr`
  * `none`: only objects with no fields
  * `all`: all objects

  The `all` setting is incompatible with the current GC for two reasons. First, statically allocated objects are not traced by the GC; a statically allocated object that is updated with an `Objptr` to an object in the heap should be considered a root. Second, a statically allocated object that is updated with an `Objptr` would trigger a card marking, but the address of a statically allocated object would not map to a valid card slot.
* `-static-alloc-wordvector-consts {true|false}`: controls whether or not `WordXVector` constants are converted to statics (with `ImmStatic` location) at `Ssa2ToRssa`.
* `-static-init-arrays {true|false}`: controls whether or not `Array_alloc` primitives are converted to statics (with `MutStatic` or `Heap` location) at `Ssa2ToRssa`.
* `-static-alloc-arrays {true|false}`: controls whether or not `Array_alloc` primitives that can be statically initialized are forced to `Heap` location.
* `-static-init-objects {none|staticAllocOnly|all}`: controls whether or not `Object` expressions are converted to statics at `Ssa2ToRssa`. If `staticAllocOnly`, then an object that would be converted to a static with `Heap` location is not converted to a static.
* `-static-alloc-objects {true|false}`: controls whether or not `Object` expressions that can be statically initialized are forced to `Heap` location.
Extend globalization aspect of ConstantPropagation to support
globalization of arrays and to support different "small type"
strategies.
Closes #206.
Space-safety prohibits ConstantPropagation from globalizing all arrays and refs that are allocated at most once by a program. In particular, because globals are live for the duration of the program, globalizing an `int list ref` (for example) would not be safe-for-space: an arbitrarily large list may be written to the reference and never be garbage collected (whereas, when the `int list ref` is not globalized, it will be garbage collected when it is no longer live). On the other hand, globalizing an `int ref` is safe-for-space.

However, MLton previously used only a very conservative estimation of space safety. Only "small" types may be globalized, where smallness is defined as:

Note that no `Datatype` is small; this is conservative (since a recursive datatype could represent unbounded data), but prevents globalizing `bool ref`. Also, no `Array` is small; this is correct (because an `int array ref` should not be globalized), but the globalization of a `val a: t array = Array_alloc[t] (l)` was conditioned on the smallness of `t array`, not the smallness of `t`. It is correct to globalize an array if `t` is small; note that to globalize `val a: t array = Array_alloc[t] (l)`, `l` (the length) must be globalized and must, therefore, be a constant, so the array is of constant size. (This is Stephen Weeks's relaxed notion of safe-for-space, where the constant-factor blowup can be chosen per program.)
This pull request adds support for alternate globalization strategies:

* `-globalize-arrays {false|true}`: globalize arrays
* `-globalize-refs {true|false}`: globalize refs
* `-globalize-small-int-inf {true|false}`: globalize `IntInf` as a small type
* `-globalize-small-type {1|0|2|3|4|9}`: strategies for classifying a type as "small":
  * `0`: constant `false` function (no types considered small)
  * `1`: no `Datatype` is considered small (original strategy)
  * `2`: `Datatype`s with all nullary constructors are considered small
  * `3`: `Datatype`s with all constructor arguments considered small according to strategy `2` are considered small
  * `4`: fixed-point analysis of `Datatype`s to determine smallness
  * `9`: constant `true` function (all types considered small; not safe-for-space)

The defaults correspond to the previous behavior.
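A toy classifier can make the numbered strategies concrete. This C sketch assumes an invented type representation (base types, and datatypes with constructors); it is not MLton's IR, and strategy `4`'s fixed-point analysis is omitted:

```c
#include <stdbool.h>

/* Invented model: a type is either a base type (int, word, real, ...) or a
 * datatype with constructors, each taking zero or more argument types. */
typedef enum { T_BASE, T_DATATYPE } TypeKind;

typedef struct Type Type;
typedef struct { int num_args; const Type **args; } Con;
struct Type { TypeKind kind; int num_cons; const Con *cons; };

static bool is_small(const Type *t, int strategy) {
    if (strategy == 0) return false;   /* constant false: nothing is small */
    if (strategy == 9) return true;    /* constant true: not safe-for-space */
    if (t->kind == T_BASE) return true;
    switch (strategy) {
    case 1:                            /* original: no datatype is small */
        return false;
    case 2:                            /* small iff all constructors are nullary */
        for (int i = 0; i < t->num_cons; i++)
            if (t->cons[i].num_args > 0) return false;
        return true;
    case 3:                            /* small iff every constructor argument
                                          is small under strategy 2 */
        for (int i = 0; i < t->num_cons; i++)
            for (int j = 0; j < t->cons[i].num_args; j++)
                if (!is_small(t->cons[i].args[j], 2)) return false;
        return true;
    default:
        return false;                  /* strategy 4 (fixed point) omitted */
    }
}
```

Under this model, `bool` becomes small at strategy `2`, something like `int option` at strategy `3`, while a recursive `int list` stays large under every safe-for-space strategy, matching the descriptions above.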
Unfortunately, additional globalization has little to no (positive) effect on benchmarks:

Note that `MLton0` and `MLton1` generate identical code (modulo the random magic number), so the slowdowns in `ray` and `raytrace` are noise, which also suggests that slowdowns/speedups of <= 3% are likely noise as well.

The slowdown in `flat-array` with `-globalize-arrays true` is explained as follows. The `flat-array` benchmark uses `Vector.tabulate` to allocate a vector that is used for all iterations of the benchmark. With `-globalize-arrays false`, the array is not globalized, and in SSA/SSA2, we have:

but with `-globalize-arrays true`, the array is globalized, and in SSA/SSA2, we have:

At RSSA, the `Array_toVector` becomes a header update and the array variable is cast/copy-propagated for the vector variable; with `-globalize-arrays false`, we have

but with `-globalize-arrays true`, we have

Finally, with `-globalize-arrays false`, `x_1212` becomes a local (because the loops to initialize and use the vector are non-allocating):

but with `-globalize-arrays true`:

The innermost loop of the benchmark goes from indexing a sequence stored in a local (`RP(0)`) to indexing a sequence stored in a global (`GP(1)`). All of the codegens should implement the former by using a hardware register for `RP(0)`, but will implement the latter with a memory read.

In light of the above, and related to #218, it may be beneficial to "deglobalize" object pointer globals; that is, RSSA functions that have multiple accesses through the same object pointer global (particularly within loops) could be translated to copy the global to a local.

The slowdown in `checksum` is less easily explained. The only new objects globalized with `-globalize-small-type 2` as compared to `-globalize-small-type 1` are two `bool ref` objects, corresponding to the `exiting` flag of `basis-library/mlton/exit.sml` and the `staticIsInUse` flag of `basis-library/util/one.sml` used by `Int.fmt`. That small change seems to lead to code layout and cache effects that result in the slowdown, because the assembly code is not substantially different. With `-enable-pass machineShuffle` and `-seed-rand <w>`, one can perturb the code layout and observe that the slowdowns are not universal:

Note that while `checksum` with MLton4 has a slowdown, `checksum` with MLton5 and MLton6 (which are identical up to shuffling of the functions and basic blocks at the Machine IR) does not have a slowdown. Similarly, `tak` with MLton0 and MLton1 has similar running times, but `tak` with MLton3 has a speedup. On the other hand, `flat-array`'s slowdowns with `-globalize-arrays true` are not due to code layout effects.

`hamlet` may have a slight speedup with `-globalize-arrays true`, but that is significantly outweighed by the slowdown in `flat-array`.

The conclusion is to leave the defaults corresponding to the original behavior.

Full benchmark results: