PATH WALK IV: Add 'git survey' command #1821

derrickstolee · 2024-10-29T02:38:05Z

WIP

Thanks, -Stolee

When storing output in test-results/, we usually give each numbered run in a --stress set its own output file. But we don't do that for storing LSan logs, so something like:

    ./t0003-attributes.sh --stress

will have many scripts simultaneously creating, writing to, and deleting the test-results/t0003-attributes.leak directory. This can cause logs from one run to be attributed to another, spurious failures when creation and deletion race, and so on.

This has always been broken, but nobody noticed because it's rare to do a --stress run with LSan (since the point is for the code to run quickly many times in order to hit races). But if you're trying to find a race in the leak sanitizing code, it makes sense to use these together.

We can fix it by using $TEST_RESULTS_BASE, which already incorporates the stress job suffix.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

This reverts commit 993d38a.

That commit was trying to solve a race between LSan setting up the threads stack and another thread calling exit(), by making sure that all pthread_create() calls have finished before doing any work that might trigger the exit().

But that isn't sufficient. The setup code actually runs in the individual threads themselves, not in the spawning thread's call to pthread_create(). So while it may have improved the race a bit, you can still trigger it pretty quickly with:

    make SANITIZE=leak
    cd t
    ./t5309-pack-delta-cycles.sh --stress

Let's back out that failed attempt so we can try again.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

One thread primitive we don't yet support is a barrier: it waits for all threads to reach a synchronization point before letting any of them continue. This would be useful for avoiding the LSan race we see in index-pack (and other places) by having all threads complete their initialization before any of them start to do real work.

POSIX introduced a pthread_barrier_t in 2004, which does what we want. But if we want to rely on it:

  1. Our Windows pthread emulation would need a new set of wrapper functions. There's a Synchronization Barrier primitive there, which was introduced in Windows 8 (which is old enough for us to depend on).

  2. macOS (and possibly other systems) has pthreads but not pthread_barrier_t. So there we'd have to implement our own barrier based on the mutex and cond primitives.

Those are do-able, but since we only care about avoiding races in our LSan builds, there's an easier way: make it a noop on systems without a native pthread barrier.

This patch introduces a "maybe_thread_barrier" API. The clunky name (rather than just using pthread_barrier directly) should hopefully clue people in that on some systems it will do nothing. It's wired to a Makefile knob which has to be triggered manually, and we enable it for the linux-leaks CI jobs (since we know we'll have it there).

There are some other possible options:

  - we could turn it on all the time for Linux systems based on uname. But we really only care about it for LSan builds, and there is no need to add extra code to regular builds.

  - we could turn it on only for LSan builds. But that would break builds on non-Linux platforms (like macOS) that otherwise should support sanitizers.

  - we could trigger only on the combination of Linux and LSan together. This isn't too hard to do, but the uname check isn't completely accurate. It is really about what your libc supports, and non-glibc systems might not have it (though at least musl seems to).

    So we'd risk breaking builds on those systems, which would need to add a new knob. Though the upside would be that running local "make SANITIZE=leak test" would be protected automatically.

And of course none of this protects LSan runs from races on systems without pthread barriers. It's probably OK in practice to protect only our CI jobs, though. The race is rare-ish and most leak-checking happens through CI.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
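To illustrate the "barrier that may do nothing" idea from the message above, here is a minimal Python sketch. It is not git's C implementation: the class name `MaybeThreadBarrier` and the `enabled` flag (standing in for the Makefile knob) are hypothetical names chosen for this example.

```python
import threading


class MaybeThreadBarrier:
    """Barrier that degrades to a no-op when native support is switched off.

    `enabled` is a hypothetical stand-in for a build-time knob; it is an
    assumption of this sketch, not part of any real API.
    """

    def __init__(self, parties: int, enabled: bool) -> None:
        # Only create a real barrier when we know the platform provides one.
        self._barrier = threading.Barrier(parties) if enabled else None

    def wait(self) -> None:
        # Without a native barrier, return immediately. Callers lose the
        # synchronization guarantee but otherwise behave the same.
        if self._barrier is not None:
            self._barrier.wait()
```

The point of this shape is that call sites stay identical on every platform, and only the construction step knows whether synchronization actually happens.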

We sometimes get false positives from our linux-leaks CI job because of a race in LSan itself. The problem is that one thread is still initializing its stack in LSan's code (and allocating memory to do so) while another thread calls die(), taking down the whole process and triggering a leak check.

The problem is described in more detail in 993d38a (index-pack: spawn threads atomically, 2024-01-05), which tried to fix it by pausing worker threads until all calls to pthread_create() had completed. But that's not enough to fix the problem, because the LSan setup code runs in the threads themselves. So even though pthread_create() has returned, we have no idea if all threads actually finished their setup before letting any of them do real work.

We can fix that by using a barrier inside the threads themselves, waiting for all of them to hit the start of their main function before any of them proceed.

You can test for the race by running:

    make SANITIZE=leak THREAD_BARRIER_PTHREAD=YesOnLinux
    cd t
    ./t5309-pack-delta-cycles.sh --stress

which fails quickly before this patch, and should run indefinitely with it.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

There's a race with LSan when spawning threads and one of the threads calls die(). We worked around one such problem with index-pack in the previous commit, but it exists in git-grep, too. You can see it with:

    make SANITIZE=leak THREAD_BARRIER_PTHREAD=YesOnLinux
    cd t
    ./t0003-attributes.sh --stress

which fails pretty quickly with:

    ==git==4096424==ERROR: LeakSanitizer: detected memory leaks

    Direct leak of 32 byte(s) in 1 object(s) allocated from:
        #0 0x7f906de14556 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
        #1 0x7f906dc9d2c1 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
        #2 0x7f906de2500d in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
        #3 0x7f906de25187 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:614
        #4 0x7f906de17d18 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:53
        #5 0x7f906de143a9 in ThreadStartFunc<false> ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:431
        #6 0x7f906dc9bf51 in start_thread nptl/pthread_create.c:447
        #7 0x7f906dd1a677 in __clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

As with the previous commit, we can fix this by inserting a barrier that makes sure all threads have finished their setup before continuing. But there's one twist in this case: the thread which calls die() is not one of the worker threads, but the main thread itself!

So we need the main thread to wait in the barrier, too, until all threads have gotten to it. And thus we initialize the barrier for num_threads+1, to account for all of the worker threads plus the main one.

If we then test as above, t0003 should run indefinitely.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
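The num_threads+1 arrangement described above can be sketched in Python as follows. This is only an illustration of the counting, not the git-grep code: `run_workers` and the event strings are hypothetical, and the "setup" step merely stands in for the per-thread initialization that LSan performs.

```python
import threading
from typing import List


def run_workers(num_threads: int) -> List[str]:
    events: List[str] = []
    lock = threading.Lock()
    # One party per worker thread, plus one for the main thread.
    barrier = threading.Barrier(num_threads + 1)

    def worker(idx: int) -> None:
        with lock:
            events.append(f"setup-{idx}")  # stands in for per-thread setup
        barrier.wait()  # nobody proceeds until every party has arrived
        with lock:
            events.append(f"work-{idx}")

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()

    # The main thread waits as well, so it cannot race ahead (and, in the
    # real scenario, exit) while a worker is still setting up.
    barrier.wait()
    with lock:
        events.append("main-continues")

    for t in threads:
        t.join()
    return events
```

If the barrier were sized for only `num_threads` parties, the workers could release each other while the main thread had never waited at all, which is exactly the case this commit needs to close.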

One of the tests in t5616 asserts that git-fetch(1) with `--refetch` triggers repository maintenance with the correct set of arguments. This test is flaky and causes us to fail sometimes: ++ git -c protocol.version=0 -c gc.autoPackLimit=0 -c maintenance.incremental-repack.auto=1234 -C pc1 fetch --refetch origin error: unable to open .git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack: No such file or directory fatal: unable to rename temporary file to '.git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack' fatal: could not finish pack-objects to repack local links fatal: index-pack failed error: last command exited with $?=128 The error message is quite confusing as it talks about trying to rename a temporary packfile. A first hunch would thus be that this packfile gets written by git-fetch(1), but removed by git-maintenance(1) while it hasn't yet been finalized, which shouldn't ever happen. And indeed, when looking closer one notices that the file that is supposedly of temporary nature does not have the typical `tmp_pack_` prefix. As it turns out, the "unable to rename temporary file" fatal error is a red herring and the real error is "unable to open". That error is raised by `check_collision()`, which is called by `finalize_object_file()` when moving the new packfile into place. Because t5616 re-fetches objects, we end up with the exact same pack as we already have in the repository. So when the concurrent git-maintenance(1) process rewrites the preexisting pack and unlinks it exactly at the point in time where git-fetch(1) wants to check the old and new packfiles for equality we will see ENOENT and thus `check_collision()` returns an error, which gets bubbled up by `finalize_object_file()` and is then handled by `rename_tmp_packfile()`. That function does not know about the exact root cause of the error and instead just claims that the rename has failed. 
This race is thus caused by b1b8dfd (finalize_object_file(): implement collision check, 2024-09-26), where we have newly introduced the collision check. By definition, two files cannot collide with each other when one of them has been removed. We can thus trivially fix the issue by ignoring ENOENT when opening either of the files we're about to check for collision. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
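The "a missing file cannot collide" rule can be sketched like this in Python. It is a simplified illustration, not the `check_collision()` code: the function name `files_collide` is hypothetical, and reading both files fully into memory is a shortcut taken for brevity.

```python
def files_collide(path_a: str, path_b: str) -> bool:
    """Return True only if both files exist and their contents differ."""
    try:
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            return fa.read() != fb.read()
    except FileNotFoundError:
        # ENOENT: one of the files vanished, for example because a
        # concurrent process unlinked it. A file that no longer exists
        # cannot collide with anything, so this is not an error.
        return False
```

Other I/O errors still propagate in this sketch, so only the specific "file is gone" case is treated as harmless.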

In 1b9e9be (csum-file.c: use unsafe SHA-1 implementation when available, 2024-09-26) we have converted our `struct hashfile` to use the unsafe SHA1 backend, which results in a significant speedup. One needs to be careful with how to use that structure now though because callers need to consistently use either the safe or unsafe variants of SHA1, as otherwise one can easily trigger corruption. As it turns out, we have one inconsistent usage in our tree because we directly initialize `struct hashfile_checkpoint::ctx` with the safe variant of SHA1, but end up writing to that context with the unsafe ones. This went unnoticed so far because our CI systems do not exercise different hash functions for these two backends, and consequently safe and unsafe variants are equivalent. But when using SHA1DC as safe and OpenSSL as unsafe backend this leads to a crash in t1050: ++ git -c core.compression=0 add large1 AddressSanitizer:DEADLYSIGNAL ================================================================= ==1367==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000040 (pc 0x7ffff7a01a99 bp 0x507000000db0 sp 0x7fffffff5690 T0) ==1367==The signal is caused by a READ memory access. ==1367==Hint: address points to the zero page. 
#0 0x7ffff7a01a99 in EVP_MD_CTX_copy_ex (/nix/store/h1ydpxkw9qhjdxjpic1pdc2nirggyy6f-openssl-3.3.2/lib/libcrypto.so.3+0x201a99) (BuildId: 41746a580d39075fc85e8c8065b6c07fb34e97d4) #1 0x555555ddde56 in openssl_SHA1_Clone ../sha1/openssl.h:40:2 #2 0x555555dce2fc in git_hash_sha1_clone_unsafe ../object-file.c:123:2 #3 0x555555c2d5f8 in hashfile_checkpoint ../csum-file.c:211:2 #4 0x555555b9905d in deflate_blob_to_pack ../bulk-checkin.c:286:4 #5 0x555555b98ae9 in index_blob_bulk_checkin ../bulk-checkin.c:362:15 #6 0x555555ddab62 in index_blob_stream ../object-file.c:2756:9 #7 0x555555dda420 in index_fd ../object-file.c:2778:9 #8 0x555555ddad76 in index_path ../object-file.c:2796:7 #9 0x555555e947f3 in add_to_index ../read-cache.c:771:7 #10 0x555555e954a4 in add_file_to_index ../read-cache.c:804:9 #11 0x5555558b5c39 in add_files ../builtin/add.c:355:7 #12 0x5555558b412e in cmd_add ../builtin/add.c:578:18 #13 0x555555b1f493 in run_builtin ../git.c:480:11 #14 0x555555b1bfef in handle_builtin ../git.c:740:9 #15 0x555555b1e6f4 in run_argv ../git.c:807:4 #16 0x555555b1b87a in cmd_main ../git.c:947:19 #17 0x5555561649e6 in main ../common-main.c:64:11 #18 0x7ffff742a1fb in __libc_start_call_main (/nix/store/65h17wjrrlsj2rj540igylrx7fqcd6vq-glibc-2.40-36/lib/libc.so.6+0x2a1fb) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4) #19 0x7ffff742a2b8 in __libc_start_main@GLIBC_2.2.5 (/nix/store/65h17wjrrlsj2rj540igylrx7fqcd6vq-glibc-2.40-36/lib/libc.so.6+0x2a2b8) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4) #20 0x555555772c84 in _start (git+0x21ec84) ==1367==Register values: rax = 0x0000511000001080 rbx = 0x0000000000000000 rcx = 0x000000000000000c rdx = 0x0000000000000000 rdi = 0x0000000000000000 rsi = 0x0000507000000db0 rbp = 0x0000507000000db0 rsp = 0x00007fffffff5690 r8 = 0x0000000000000000 r9 = 0x0000000000000000 r10 = 0x0000000000000000 r11 = 0x00007ffff7a01a30 r12 = 0x0000000000000000 r13 = 0x00007fffffff6b38 r14 = 0x00007ffff7ffd000 r15 = 0x00005555563b9910 
AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (/nix/store/h1ydpxkw9qhjdxjpic1pdc2nirggyy6f-openssl-3.3.2/lib/libcrypto.so.3+0x201a99) (BuildId: 41746a580d39075fc85e8c8065b6c07fb34e97d4) in EVP_MD_CTX_copy_ex ==1367==ABORTING ./test-lib.sh: line 1023: 1367 Aborted git $config add large1 error: last command exited with $?=134 not ok 4 - add with -c core.compression=0 Fix the issue by using the unsafe variant instead. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Same as with the preceding commit, git-fast-import(1) is using the safe variant to initialize a hashfile checkpoint. This leads to a segfault when passing the checkpoint into the hashfile subsystem because it would use the unsafe variants instead: ++ git --git-dir=R/.git fast-import --big-file-threshold=1 AddressSanitizer:DEADLYSIGNAL ================================================================= ==577126==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000040 (pc 0x7ffff7a01a99 bp 0x5070000009c0 sp 0x7fffffff5b30 T0) ==577126==The signal is caused by a READ memory access. ==577126==Hint: address points to the zero page. #0 0x7ffff7a01a99 in EVP_MD_CTX_copy_ex (/nix/store/h1ydpxkw9qhjdxjpic1pdc2nirggyy6f-openssl-3.3.2/lib/libcrypto.so.3+0x201a99) (BuildId: 41746a580d39075fc85e8c8065b6c07fb34e97d4) #1 0x555555ddde56 in openssl_SHA1_Clone ../sha1/openssl.h:40:2 #2 0x555555dce2fc in git_hash_sha1_clone_unsafe ../object-file.c:123:2 #3 0x555555c2d5f8 in hashfile_checkpoint ../csum-file.c:211:2 #4 0x5555559647d1 in stream_blob ../builtin/fast-import.c:1110:2 #5 0x55555596247b in parse_and_store_blob ../builtin/fast-import.c:2031:3 #6 0x555555967f91 in file_change_m ../builtin/fast-import.c:2408:5 #7 0x55555595d8a2 in parse_new_commit ../builtin/fast-import.c:2768:4 #8 0x55555595bb7a in cmd_fast_import ../builtin/fast-import.c:3614:4 #9 0x555555b1f493 in run_builtin ../git.c:480:11 #10 0x555555b1bfef in handle_builtin ../git.c:740:9 #11 0x555555b1e6f4 in run_argv ../git.c:807:4 #12 0x555555b1b87a in cmd_main ../git.c:947:19 #13 0x5555561649e6 in main ../common-main.c:64:11 #14 0x7ffff742a1fb in __libc_start_call_main (/nix/store/65h17wjrrlsj2rj540igylrx7fqcd6vq-glibc-2.40-36/lib/libc.so.6+0x2a1fb) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4) #15 0x7ffff742a2b8 in __libc_start_main@GLIBC_2.2.5 (/nix/store/65h17wjrrlsj2rj540igylrx7fqcd6vq-glibc-2.40-36/lib/libc.so.6+0x2a2b8) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4) #16 0x555555772c84 in _start 
(git+0x21ec84) ==577126==Register values: rax = 0x0000511000000cc0 rbx = 0x0000000000000000 rcx = 0x000000000000000c rdx = 0x0000000000000000 rdi = 0x0000000000000000 rsi = 0x00005070000009c0 rbp = 0x00005070000009c0 rsp = 0x00007fffffff5b30 r8 = 0x0000000000000000 r9 = 0x0000000000000000 r10 = 0x0000000000000000 r11 = 0x00007ffff7a01a30 r12 = 0x0000000000000000 r13 = 0x00007fffffff6b60 r14 = 0x00007ffff7ffd000 r15 = 0x00005555563b9910 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (/nix/store/h1ydpxkw9qhjdxjpic1pdc2nirggyy6f-openssl-3.3.2/lib/libcrypto.so.3+0x201a99) (BuildId: 41746a580d39075fc85e8c8065b6c07fb34e97d4) in EVP_MD_CTX_copy_ex ==577126==ABORTING ./test-lib.sh: line 1039: 577126 Aborted git --git-dir=R/.git fast-import --big-file-threshold=1 < input error: last command exited with $?=134 not ok 167 - R: blob bigger than threshold The segfault is only exposed in case the unsafe and safe backends are different from one another. Fix the issue by initializing the context with the unsafe SHA1 variant. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

In the preceding commit we have fixed a segfault when using an unsafe SHA1 backend that is different from the safe one. This segfault only went unnoticed because we never set up an unsafe backend in our CI systems. Fix this omission by setting `OPENSSL_SHA1_UNSAFE` in our TEST-vars job. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

…build * ps/weak-sha1-for-tail-sum-fix: ci: exercise unsafe OpenSSL backend builtin/fast-import: fix segfault with unsafe SHA1 backend bulk-checkin: fix segfault with unsafe SHA1 backend

The 'CommonCrypto' backend can be specified as HTTPS and SHA1 backends, but the value that one needs to use is inconsistent across those two build options. Unify it to 'CommonCrypto'. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

We've got a couple of repeated calls to `get_option()` for the SHA1 and SHA256 backend options. While not an issue, it makes the code needlessly verbose. Fix this by consistently using a local variable. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The Security framework is required when we use CommonCrypto either as HTTPS or SHA1 backend, but we only require it in case it is set up as HTTPS backend. Fix this. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The conditions used to figure out whether the Security framework or OpenSSL library is required are a bit convoluted because they can be pulled in via the HTTPS, SHA1 or SHA256 backends. Refactor them to be easier to read. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Most of our Meson build options end with a trailing dot, but those for our SHA1 and SHA256 backends don't. Add it. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Introduce a new API to visit objects in batches based on a common path, or by type. * ds/path-walk-1: path-walk: drop redundant parse_tree() call path-walk: reorder object visits path-walk: mark trees and blobs as UNINTERESTING path-walk: visit tags and cached objects path-walk: allow consumer to specify object types t6601: add helper for testing path-walk API test-lib-functions: add test_cmp_sorted path-walk: introduce an object walk by path

Doc updates. * ja/doc-commit-markup-updates: doc: migrate git-commit manpage secondary files to new format doc: convert git commit config to new format doc: make more direct explanations in git commit options doc: the mode param of -u of git commit is optional doc: apply new documentation guidelines to git commit

"git branch --sort=..." and "git for-each-ref --format=... --sort=..." did not work as expected with some atoms, which has been corrected. * rs/ref-fitler-used-atoms-value-fix: ref-filter: remove ref_format_clear() ref-filter: move is-base tip to used_atom ref-filter: move ahead-behind bases into used_atom

reflog entries for symbolic ref updates were broken, which has been corrected. * kn/reflog-symref-fix: refs: fix creation of reflog entries for symrefs

The trace2 code was not prepared to show a configuration variable that is set to true using the valueless true syntax, which has been corrected. * am/trace2-with-valueless-true: trace2: prevent segfault on config collection with valueless true

The "git refs migrate" command did not migrate the reflog for refs/stash, which is the contents of the stashes, which has been corrected. * ps/reflog-migration-with-logall-fix: refs: fix migration of reflogs respecting "core.logAllRefUpdates"

Signed-off-by: Junio C Hamano <[email protected]>

Doc mark-up updates. * ja/doc-restore-markup-update: doc: convert git-restore to new style format

Code clean-up. * sk/strlen-returns-size_t: date.c: Fix type missmatch warings from msvc

Doc mark-up updates. * ja/doc-notes-markup-updates: doc: convert git-notes to new documentation format

Doc and short-help text for "show-index" has been clarified to stress that the command reads its data from the standard input. * jc/show-index-h-update: show-index: the short help should say the command reads from its input

Signed-off-by: Junio C Hamano <[email protected]>

The API around choosing to use unsafe variant of SHA-1 implementation has been updated in an attempt to make it harder to abuse. * tb/unsafe-hash-cleanup: hash.h: drop unsafe_ function variants csum-file: introduce hashfile_checkpoint_init() t/helper/test-hash.c: use unsafe_hash_algo() csum-file.c: use unsafe_hash_algo() hash.h: introduce `unsafe_hash_algo()` csum-file.c: extract algop from hashfile_checksum_valid() csum-file: store the hash algorithm as a struct field t/helper/test-tool: implement sha1-unsafe helper

Code clean-up for code paths around combined diff. * jk/combine-diff-cleanup: tree-diff: make list tail-passing more explicit tree-diff: simplify emit_path() list management tree-diff: use the name "tail" to refer to list tail tree-diff: drop list-tail argument to diff_tree_paths() combine-diff: drop public declaration of combine_diff_path_size() tree-diff: inline path_appendnew() tree-diff: pass whole path string to path_appendnew() tree-diff: drop path_appendnew() alloc optimization run_diff_files(): de-mystify the size of combine_diff_path struct diff: add a comment about combine_diff_path.parent.path combine-diff: use pointer for parent paths tree-diff: clear parent array in path_appendnew() combine-diff: add combine_diff_path_new() run_diff_files(): delay allocation of combine_diff_path

Following the procedure we established to introduce breaking changes for Git 3.0, allow an early opt-in for removing support of $GIT_DIR/branches/ and $GIT_DIR/remotes/ directories to configure remotes. * ps/3.0-remote-deprecation: remote: announce removal of "branches/" and "remotes/" builtin/pack-redundant: remove subcommand with breaking changes ci: repurpose "linux-gcc" job for deprecations ci: merge linux-gcc-default into linux-gcc Makefile: wire up build option for deprecated features

More build fixes and enhancements on meson based build procedure. * ps/build-meson-fixes: ci: wire up Visual Studio build with Meson ci: raise error when Meson generates warnings meson: fix compilation with Visual Studio meson: make the CSPRNG backend configurable meson: wire up fuzzers meson: wire up generation of distribution archive meson: wire up development environments meson: fix dependencies for generated headers meson: populate project version via GIT-VERSION-GEN GIT-VERSION-GEN: allow running without input and output files GIT-VERSION-GEN: simplify computing the dirty marker

Code clean-up. * kn/pack-write-with-reduced-globals: pack-write: pass hash_algo to internal functions pack-write: pass hash_algo to `write_rev_file()` pack-write: pass hash_algo to `write_idx_file()` pack-write: pass repository to `index_pack_lockfile()` pack-write: pass hash_algo to `fixup_pack_header_footer()`

Fix bugs in an earlier attempt to fix "git refs migration". * kn/reflog-migration-fix-fix: refs/reftable: fix uninitialized memory access of `max_index` reftable: write correct max_update_index to header

Signed-off-by: Junio C Hamano <[email protected]>

Start work on a new 'git survey' command to scan the repository for monorepo performance and scaling problems. The goal is to measure the various known "dimensions of scale" and serve as a foundation for adding additional measurements as we learn more about Git monorepo scaling problems. The initial goal is to complement the scanning and analysis performed by the GO-based 'git-sizer' (https://github.com/github/git-sizer) tool. It is hoped that by creating a builtin command, we may be able to take advantage of internal Git data structures and code that is not accessible from GO to gain further insight into potential scaling problems. Co-authored-by: Derrick Stolee <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>

By default we will scan all references in "refs/heads/", "refs/tags/" and "refs/remotes/". Add command line opts to let the user ask for all refs or a subset of them and to include a detached HEAD. Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>

When 'git survey' provides information to the user, this will be presented in one of two formats: plaintext and JSON. The JSON implementation will be delayed until the functionality is complete for the plaintext format. The most important parts of the plaintext format are headers specifying the different sections of the report and tables providing concrete data. Create a custom table data structure that allows specifying a list of strings for the row values. When printing the table, check each column for the maximum width so we can create a table of the correct size from the start. The table structure is designed to be flexible to the different kinds of output that will be implemented in future changes. Signed-off-by: Derrick Stolee <[email protected]>
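The column-width approach described above (rows as lists of strings, one pass to find the widest cell per column, then print) might look roughly like this in Python. This is a sketch, not the survey builtin's C code: `format_table` is a hypothetical name, and left-aligning every cell is an assumption made for simplicity.

```python
from typing import List


def format_table(header: List[str], rows: List[List[str]]) -> str:
    # Pass 1: widest cell in each column, header included.
    all_rows = [header] + rows
    widths = [max(len(r[i]) for r in all_rows) for i in range(len(header))]

    def fmt(cells: List[str]) -> str:
        # Pad every cell to its column width; trim trailing padding.
        return " | ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    # Rule line between header and body, sized from the same widths.
    rule = "-+-".join("-" * w for w in widths)

    # Pass 2: emit header, rule, and body rows.
    return "\n".join([fmt(header), rule] + [fmt(r) for r in rows])
```

Computing the widths before printing anything is what lets the header and rule line come out at the right size without a second rendering pass.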

At the moment, nothing is obvious about the reason for the use of the path-walk API, but this will become more prevalent in future iterations. For now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

    REACHABLE OBJECT SUMMARY
    ========================
    Object Type |  Count
    ------------+-------
    Tags        |   1343
    Commits     | 179344
    Trees       | 314350
    Blobs       | 184030

Signed-off-by: Derrick Stolee <[email protected]>

Now that we have explored objects by count, we can expand that a bit more to summarize the data for the on-disk and inflated size of those objects. This information is helpful for diagnosing both why disk space (and perhaps clone or fetch times) is growing but also why certain operations are slow because the inflated size of the abstract objects that must be processed is so large. Signed-off-by: Derrick Stolee <[email protected]>

Signed-off-by: Derrick Stolee <[email protected]>

In future changes, we will make use of these methods. The intention is to keep track of the top contributors according to some metric. We don't want to store all of the entries and do a sort at the end, so track a constant-size table and remove rows that get pushed out depending on the chosen sorting algorithm. Co-authored-by: Jeff Hostetler <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
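One way to keep a constant-size "top contributors" table without storing every entry is sketched below in Python. The commit message does not say which data structure the C code uses; this example swaps in a min-heap, and the names `TopTable`, `insert`, and `rows` are hypothetical.

```python
import heapq
from typing import List, Tuple


class TopTable:
    """Keep only the `capacity` largest (metric, name) entries seen so far."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: List[Tuple[int, str]] = []  # min-heap, smallest on top

    def insert(self, name: str, metric: int) -> None:
        if self.capacity <= 0:
            return
        entry = (metric, name)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # The new entry beats the current smallest row, which is
            # pushed out of the table.
            heapq.heapreplace(self._heap, entry)

    def rows(self) -> List[Tuple[str, int]]:
        # Largest metric first, as a report would list them.
        return [(name, metric)
                for metric, name in sorted(self._heap, reverse=True)]
```

Memory stays bounded by `capacity` regardless of how many paths are visited, which is the property the commit message asks for.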

Since we are already walking our reachable objects using the path-walk API, let's now collect lists of the paths that contribute most to different metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the repository are causing the most growth, especially on-disk size. Different packing strategies might help compress data more efficiently, but the total inflated size is a representation of the raw size of all snapshots of those paths. Even when stored efficiently on disk, that size represents how much information must be processed to complete a command such as 'git blame'.

Since the on-disk size is likely to be fragile, stop testing the exact output of 'git survey' and check that the correct set of headers is output.

Signed-off-by: Derrick Stolee <[email protected]>

The 'git survey' builtin provides several detail tables, such as "top files by on-disk size". The size of these tables defaults to 100, currently. Allow the user to specify this number via a new --top=<N> option or the new survey.top config key. Signed-off-by: Derrick Stolee <[email protected]>
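A sketch of how the table size could be resolved from the option, the config key, and the default of 100 is shown below in Python. The precedence (command line over config over default) is an assumption based on common Git convention rather than something the commit message states; `resolve_top` and the dict standing in for config are hypothetical.

```python
import argparse
from typing import Dict, List

DEFAULT_TOP = 100  # default table size stated in the commit message


def resolve_top(argv: List[str], config: Dict[str, str]) -> int:
    parser = argparse.ArgumentParser(prog="survey-sketch")
    parser.add_argument("--top", type=int, default=None)
    args = parser.parse_args(argv)

    # Assumed precedence: explicit option, then config key, then default.
    if args.top is not None:
        return args.top
    if "survey.top" in config:
        return int(config["survey.top"])
    return DEFAULT_TOP
```

For example, `resolve_top(["--top=5"], {"survey.top": "25"})` picks the command-line value under this assumed ordering.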

While this command is definitely something we _want_, chances are that upstreaming this will require substantial changes. We still want to be able to experiment with this before that, to focus on what we need out of this command: To assist with diagnosing issues with large repositories, as well as to help monitor the growth and the associated pain points of such repositories. To that end, we are about to integrate this command into `microsoft/git`, to get the tool into the hands of users who need it most, with the idea to iterate in close collaboration between these users and the developers familiar with Git's internals. However, we will definitely want to avoid letting anybody have the impression that this command, its exact inner workings, as well as its output format, are anywhere close to stable. To make that fact utterly clear (and thereby protect the freedom to iterate and innovate freely before upstreaming the command), let's mark its output as experimental in all-caps, as the first thing we do. Signed-off-by: Johannes Schindelin <[email protected]>

derrickstolee self-assigned this Oct 29, 2024

derrickstolee force-pushed the survey-upstream branch from 95e5c93 to 01f3080 Compare October 30, 2024 20:06

derrickstolee force-pushed the api-upstream branch from 97d669a to 5252076 Compare October 30, 2024 20:07

derrickstolee force-pushed the survey-upstream branch from 01f3080 to 61b1397 Compare October 30, 2024 22:20

derrickstolee force-pushed the api-upstream branch from 5252076 to 0bb607e Compare October 30, 2024 22:20

derrickstolee mentioned this pull request Oct 31, 2024

PATH WALK I: The path-walk API #1818

Closed

derrickstolee force-pushed the survey-upstream branch from 61b1397 to 38e1168 Compare November 8, 2024 16:01

derrickstolee force-pushed the survey-upstream branch from 38e1168 to b5c4a74 Compare December 6, 2024 19:41

derrickstolee force-pushed the api-upstream branch from 0bb607e to e716672 Compare December 6, 2024 19:42

derrickstolee force-pushed the survey-upstream branch from b5c4a74 to ffec88a Compare December 18, 2024 15:14

derrickstolee force-pushed the api-upstream branch 2 times, most recently from 8a7b7e6 to 781b2ea Compare December 18, 2024 15:19

derrickstolee force-pushed the survey-upstream branch 2 times, most recently from 358e4e6 to 4d496d5 Compare December 18, 2024 16:13

derrickstolee force-pushed the api-upstream branch from 781b2ea to ef54342 Compare December 18, 2024 16:13

peff and others added 15 commits December 30, 2024 06:18

Merge branch 'ps/weak-sha1-for-tail-sum-fix' into ps/meson-weak-sha1-…

cade724

…build * ps/weak-sha1-for-tail-sum-fix: ci: exercise unsafe OpenSSL backend builtin/fast-import: fix segfault with unsafe SHA1 backend bulk-checkin: fix segfault with unsafe SHA1 backend

meson: add missing dots for build options

12068bd

Most of our Meson build options end with a trailing dot, but those for our SHA1 and SHA256 backends don't. Add it. Signed-off-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

gitster and others added 29 commits January 29, 2025 14:05

Merge branch 'kn/reflog-symref-fix'

d205f06

reflog entries for symbolic ref updates were broken, which has been corrected. * kn/reflog-symref-fix: refs: fix creation of reflog entries for symrefs

The fifth batch

3b0d05c

Signed-off-by: Junio C Hamano <[email protected]>

Merge branch 'ja/doc-restore-markup-update'

dccd9c5

Doc mark-up updates. * ja/doc-restore-markup-update: doc: convert git-restore to new style format

Merge branch 'sk/strlen-returns-size_t'

ecba2c1

Code clean-up. * sk/strlen-returns-size_t: date.c: Fix type missmatch warings from msvc

Merge branch 'ja/doc-notes-markup-updates'

bdd1988

Doc mark-up updates. * ja/doc-notes-markup-updates: doc: convert git-notes to new documentation format

Merge branch 'jc/show-index-h-update'

81309f4

Doc and short-help text for "show-index" has been clarified to stress that the command reads its data from the standard input. * jc/show-index-h-update: show-index: the short help should say the command reads from its input

The sixth batch

58b5801

Signed-off-by: Junio C Hamano <[email protected]>

Merge branch 'kn/reflog-migration-fix-fix'

1f124f3

Fix bugs in an earlier attempt to fix "git refs migration". * kn/reflog-migration-fix-fix: refs/reftable: fix uninitialized memory access of `max_index` reftable: write correct max_update_index to header

The seventh batch

bc204b7

Signed-off-by: Junio C Hamano <[email protected]>

survey: show progress during object walk

71ca390

Signed-off-by: Derrick Stolee <[email protected]>

derrickstolee force-pushed the survey-upstream branch from 4d496d5 to 2d7e24f Compare February 5, 2025 16:43
