aboutsummaryrefslogtreecommitdiffstats
path: root/packfile.c
AgeCommit message (Collapse)AuthorFilesLines
2024-03-28Merge branch 'eb/hash-transition'Junio C Hamano1-1/+2
Work to support a repository that work with both SHA-1 and SHA-256 hash algorithms has started. * eb/hash-transition: (30 commits) t1016-compatObjectFormat: add tests to verify the conversion between objects t1006: test oid compatibility with cat-file t1006: rename sha1 to oid test-lib: compute the compatibility hash so tests may use it builtin/ls-tree: let the oid determine the output algorithm object-file: handle compat objects in check_object_signature tree-walk: init_tree_desc take an oid to get the hash algorithm builtin/cat-file: let the oid determine the output algorithm rev-parse: add an --output-object-format parameter repository: implement extensions.compatObjectFormat object-file: update object_info_extended to reencode objects object-file-convert: convert commits that embed signed tags object-file-convert: convert commit objects when writing object-file-convert: don't leak when converting tag objects object-file-convert: convert tag objects when writing object-file-convert: add a function to convert trees between algorithms object: factor out parse_mode out of fast-import and tree-walk into in object.h cache: add a function to read an OID of a specific algorithm tag: sign both hashes commit: export add_header_signature to support handling signatures on tags ...
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren1-1/+0
Each of these were checked with gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE} to ensure that removing the direct inclusion of the header actually resulted in that header no longer being included at all (i.e. that no other header pulled it in transitively). ...except for a few cases where we verified that although the header was brought in transitively, nothing from it was directly used in that source file. These cases were: * builtin/credential-cache.c * builtin/pull.c * builtin/send-pack.c Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02tree-walk: init_tree_desc take an oid to get the hash algorithmEric W. Biederman1-1/+2
To make it possible for git ls-tree to display the tree encoded in the hash algorithm of the oid specified to git ls-tree, update init_tree_desc to take as a parameter the oid of the tree object. Update all callers of init_tree_desc and init_tree_desc_gently to pass the oid of the tree object. Use the oid of the tree object to discover the hash algorithm of the oid and store that hash algorithm in struct tree_desc. Use the hash algorithm in decode_tree_entry and update_tree_entry_internal to handle reading a tree object encoded in a hash algorithm that differs from the repositories hash algorithm. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-04Merge branch 'jc/retire-get-sha1-hex'Junio C Hamano1-1/+1
The implementation of "get_sha1_hex()" that reads a hexadecimal string that spells a full object name has been extended to cope with any hash function used in the repository, but the "sha1" in its name survived. Rename it to get_hash_hex(), a name that is more consistent within its friends like get_hash_hex_algop(). * jc/retire-get-sha1-hex: hex: retire get_sha1_hex()
2023-07-25Merge branch 'tb/object-access-overflow-protection'Junio C Hamano1-7/+8
Various offset computation in the code that accesses the packfiles and other data in the object layer has been hardened against arithmetic overflow, especially on 32-bit systems. * tb/object-access-overflow-protection: commit-graph.c: prevent overflow in `verify_commit_graph()` commit-graph.c: prevent overflow in `write_commit_graph()` commit-graph.c: prevent overflow in `merge_commit_graph()` commit-graph.c: prevent overflow in `split_graph_merge_strategy()` commit-graph.c: prevent overflow in `load_tree_for_commit()` commit-graph.c: prevent overflow in `fill_commit_in_graph()` commit-graph.c: prevent overflow in `fill_commit_graph_info()` commit-graph.c: prevent overflow in `load_oid_from_graph()` commit-graph.c: prevent overflow in add_graph_to_chain() commit-graph.c: prevent overflow in `write_commit_graph_file()` pack-bitmap.c: ensure that eindex lookups don't overflow midx.c: prevent overflow in `fill_included_packs_batch()` midx.c: prevent overflow in `write_midx_internal()` midx.c: store `nr`, `alloc` variables as `size_t`'s midx.c: prevent overflow in `nth_midxed_offset()` midx.c: prevent overflow in `nth_midxed_object_oid()` midx.c: use `size_t`'s for fanout nr and alloc packfile.c: use checked arithmetic in `nth_packed_object_offset()` packfile.c: prevent overflow in `load_idx()` packfile.c: prevent overflow in `nth_packed_object_id()`
2023-07-24hex: retire get_sha1_hex()Junio C Hamano1-1/+1
The naming convention around get_sha1_hex() and its friends is awkward these days, after "struct object_id" was introduced. There are three public functions around this area: * get_sha1_hex() - use the implied the_hash_algo, fill uchar * * get_oid_hex() - use the implied the_hash_algo, fill oid * * get_oid_hex_algop() - use the passed algop, fill oid * Between the latter two, the "_algop" suffix signals whether the the_hash_algo is used as the implied algorithm or the caller should pass an algorithm explicitly. That is very much understandable and is a good convention. Between the former two, however, the "SHA1" vs "OID" in the names differentiate in what type of variable the result is stored. We could argue that it makes sense to use "SHA1" to mean "flat byte buffer" to honor the historical practice in the days before "struct object_id" was invented, but the natural fourth friend of the above group would take an algop and fill a flat byte buffer, and it would be strange to name it get_sha1_hex_algop(). Do we use the passed in algo, or are we limited to SHA-1 ;-)? In fact, such a function exists, albeit as a private helper function used by the implementation of these functions, and is named a lot more sensibly: get_hash_hex_algop(). Correct the misnomer of get_sha1_hex() and use "hash", instead of "sha1", as "flat byte buffer that stores binary (as opposed to hexadecimal) representation of the hash". The four (2x2) friends now become: * get_hash_hex() - use the implied the_hash_algo, fill uchar * * get_oid_hex() - use the implied the_hash_algo, fill oid * * get_hash_hex_algop() - use the passed algop, fill uchar * * get_oid_hex_algop() - use the passed algop, fill oid * As there are only two remaining calls to get_sha1_hex() in the codebase right now, the blast radious of this change is fairly small. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14packfile.c: use checked arithmetic in `nth_packed_object_offset()`Taylor Blau1-4/+5
In a similar spirit as the previous commits, ensure that we use `st_add()` or `st_mult()` when computing values that may overflow the 32-bit unsigned limit. Note that in each of these instances, we prevent 32-bit overflow already since we have explicit casts to `size_t`. So this code is OK as-is, but let's clarify it by using the `st_xyz()` helpers to make it obvious that we are performing the relevant computations using 64 bits. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14packfile.c: prevent overflow in `load_idx()`Taylor Blau1-1/+1
Prevent an overflow when locating a pack's CRC offset when the number of packed items is greater than 2^32-1/hashsz by guarding the computation with an `st_mult()`. Note that to avoid truncating the result, the `crc_offset` member must itself become a `size_t`. The only usage of this variable (besides the assignment in `load_idx()`) is in `read_v2_anomalous_offsets()` in the index-pack code. There we use the `crc_offset` as a pointer offset, so we are already equipped to handle the type change. Helped-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-12packfile.c: prevent overflow in `nth_packed_object_id()`Taylor Blau1-2/+2
In 37fec86a83 (packfile: abstract away hash constant values, 2018-05-02), `nth_packed_object_id()` started using the variable `the_hash_algo->rawsz` instead of a fixed constant when trying to compute an offset into the ".idx" file for some object position. This can lead to surprising truncation when looking for an object towards the end of a large enough pack, like the following: (gdb) p hashsz $1 = 20 (gdb) p n $2 = 215043814 (gdb) p hashsz * n $3 = 5908984 , which is a debugger session broken on a known-bad call to the `nth_packed_object_id()` function. This behavior predates 37fec86a83, and is original to the v2 index format, via: 74e34e1fca (sha1_file.c: learn about index version 2, 2007-04-09). This is due to §6.4.4.1 of the C99 standard, which states that an untyped integer constant will take the first type in which the value can be accurately represented, among `int`, `long int`, and `long long int`. Since 20 can be represented as an `int`, and `n` is a 32-bit unsigned integer, the resulting computation is defined by §6.3.1.8, and the (signed) integer value representing `n` is converted to an unsigned type, meaning that `20 * n` (for `n` having type `uint32_t`) is equivalent to a multiplication between two unsigned 32-bit integers. When multiplying a sufficiently large `n`, the resulting value can exceed 2^32-1, wrapping around and producing an invalid result. Let's follow the example in f86f769550e (compute pack .idx byte offsets using size_t, 2020-11-13) and replace this computation with `st_mult()`, which will ensure that the computation is done using 64-bits. While here, guard the corresponding computation for packs with v1 indexes, too. Though the likelihood of seeing a bug there is much smaller, since (a) v1 indexes are generated far less frequently than v2 indexes, and (b) they all correspond to packs no larger than 2 GiB, so having enough objects to trigger this overflow is unlikely if not impossible. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05git-compat-util: move alloc macros to git-compat-util.hCalvin Wan1-1/+0
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] to allocation for Git objects and additionally allows us to remove inclusions to alloc.h from files that solely used the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05treewide: remove unnecessary includes for wrapper.hCalvin Wan1-1/+0
Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-29Merge branch 'en/header-split-cache-h-part-3'Junio C Hamano1-1/+1
Header files cleanup. * en/header-split-cache-h-part-3: (28 commits) fsmonitor-ll.h: split this header out of fsmonitor.h hash-ll, hashmap: move oidhash() to hash-ll object-store-ll.h: split this header out of object-store.h khash: name the structs that khash declares merge-ll: rename from ll-merge git-compat-util.h: remove unneccessary include of wildmatch.h builtin.h: remove unneccessary includes list-objects-filter-options.h: remove unneccessary include diff.h: remove unnecessary include of oidset.h repository: remove unnecessary include of path.h log-tree: replace include of revision.h with simple forward declaration cache.h: remove this no-longer-used header read-cache*.h: move declarations for read-cache.c functions from cache.h repository.h: move declaration of the_index from cache.h merge.h: move declarations for merge.c from cache.h diff.h: move declaration for global in diff.c from cache.h preload-index.h: move declarations for preload-index.c from elsewhere sparse-index.h: move declarations for sparse-index.c from cache.h name-hash.h: move declarations for name-hash.c from cache.h run-command.h: move declarations for run-command.c from cache.h ...
2023-06-29Merge branch 'ds/remove-idx-before-pack'Junio C Hamano1-1/+1
We create .pack and then .idx, we consider only packfiles that have .idx usable (those with only .pack are not ready yet), so we should remove .idx before removing .pack for consistency. * ds/remove-idx-before-pack: packfile: delete .idx files before .pack files
2023-06-21object-store-ll.h: split this header out of object-store.hElijah Newren1-1/+1
The vast majority of files including object-store.h did not need dir.h nor khash.h. Split the header into two files, and let most just depend upon object-store-ll.h, while letting the two callers that need it depend on the full object-store.h. After this patch: $ git grep -h include..object-store | sort | uniq -c 2 #include "object-store.h" 129 #include "object-store-ll.h" Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-20packfile: delete .idx files before .pack filesDerrick Stolee1-1/+1
When installing a packfile, we place the .pack file before the .idx file. The intention is that Git scans for .idx files in the pack directory and then loads the .pack files from that list. However, when we delete packfiles, we do not do this in the reverse order as we should. The unlink_pack_path() method deletes the .pack followed by the .idx. This creates a window where the process could be interrupted between the .pack deletion and the .idx deletion, leaving the repository in a state that looks strange, but isn't actually too problematic if we assume the pack was safe to delete. The .idx without a .pack will cause some overhead, but will not interrupt other Git processes. This ordering was introduced into the 'git repack' builtin by a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15), though we must be careful to track history through the code move in 8434e85d5f9 (repack: refactor pack deletion for future use, 2019-06-10) to see that. This became more important after 73320e49add (builtin/repack.c: only collect fully-formed packs, 2023-06-07) changed how 'git repack' scanned for packfiles for use in the cruft pack process. It previously looked for .pack files, but that was problematic due to the order that packs are installed: repacks between the creation of a .pack and the creation of its .idx would result in hard failures. There is an independent proposal about what to do in the case of a .idx without a .pack during this 'git repack' scenario, but this change is focused on deleting .pack files more safely. Modify the order to delete the .idx before the .pack. The rest of the modifiers on the .pack should still come after the .pack so we know all of the presumed properties of the packfile as long as it exists in the filesystem, in case we wish to reinstate it by re-indexing the .pack file. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-05-09Merge branch 'en/header-split-cache-h-part-2'Junio C Hamano1-1/+1
More header clean-up. * en/header-split-cache-h-part-2: (22 commits) reftable: ensure git-compat-util.h is the first (indirect) include diff.h: reduce unnecessary includes object-store.h: reduce unnecessary includes commit.h: reduce unnecessary includes fsmonitor: reduce includes of cache.h cache.h: remove unnecessary headers treewide: remove cache.h inclusion due to previous changes cache,tree: move basic name compare functions from read-cache to tree cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c hash-ll.h: split out of hash.h to remove dependency on repository.h tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h dir.h: move DTYPE defines from cache.h versioncmp.h: move declarations for versioncmp.c functions from cache.h ws.h: move declarations for ws.c functions from cache.h match-trees.h: move declarations for match-trees.c functions from cache.h pkt-line.h: move declarations for pkt-line.c functions from cache.h base85.h: move declarations for base85.c functions from cache.h copy.h: move declarations for copy.c functions from cache.h server-info.h: move declarations for server-info.c functions from cache.h packfile.h: move pack_window and pack_entry from cache.h ...
2023-04-27Merge branch 'tb/pack-revindex-on-disk'Junio C Hamano1-1/+1
The on-disk reverse index that allows mapping from the pack offset to the object name for the object stored at the offset has been enabled by default. * tb/pack-revindex-on-disk: t: invert `GIT_TEST_WRITE_REV_INDEX` config: enable `pack.writeReverseIndex` by default pack-revindex: introduce `pack.readReverseIndex` pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK pack-revindex: make `load_pack_revindex` take a repository t5325: mark as leak-free pack-write.c: plug a leak in stage_tmp_packfiles()
2023-04-24treewide: remove cache.h inclusion due to previous changesElijah Newren1-1/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-13pack-revindex: make `load_pack_revindex` take a repositoryTaylor Blau1-1/+1
In a future commit, we will introduce a `pack.readReverseIndex` configuration, which forces Git to generate the reverse index from scratch instead of loading it from disk. In order to avoid reading this configuration value more than once, we'll use the `repo_settings` struct to lazily load this value. In order to access the `struct repo_settings`, add a repository argument to `load_pack_revindex`, and update all callers to pass the correct instance (in all cases, `the_repository`). In certain instances, a new function-local variable is introduced to take the place of a `struct repository *` argument to the function itself to avoid propagating the new parameter even further throughout the tree. Co-authored-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11object-file.h: move declarations for object-file.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: be explicit about dependence on pack-revindex.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: be explicit about dependence on trace.h & trace2.hElijah Newren1-0/+1
Dozens of files made use of trace and trace2 functions, without explicitly including trace.h or trace2.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include trace.h or trace2.h if they are using them. Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-06Merge branch 'en/header-split-cleanup'Junio C Hamano1-1/+4
Split key function and data structure definitions out of cache.h to new header files and adjust the users. * en/header-split-cleanup: csum-file.h: remove unnecessary inclusion of cache.h write-or-die.h: move declarations for write-or-die.c functions from cache.h treewide: remove cache.h inclusion due to setup.h changes setup.h: move declarations for setup.c functions from cache.h treewide: remove cache.h inclusion due to environment.h changes environment.h: move declarations for environment.c functions from cache.h treewide: remove unnecessary includes of cache.h wrapper.h: move declarations for wrapper.c functions from cache.h path.h: move function declarations for path.c functions from cache.h cache.h: remove expand_user_path() abspath.h: move absolute path functions from cache.h environment: move comment_line_char from cache.h treewide: remove unnecessary cache.h inclusion from several sources treewide: remove unnecessary inclusion of gettext.h treewide: be explicit about dependence on gettext.h treewide: remove unnecessary cache.h inclusion from a few headers
2023-04-06Merge branch 'ab/remove-implicit-use-of-the-repository'Junio C Hamano1-1/+1
Code clean-up around the use of the_repository. * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"
2023-04-04Merge branch 'ab/remove-implicit-use-of-the-repository' into ↵Junio C Hamano1-1/+1
en/header-split-cache-h * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"
2023-03-28cocci: apply the "promisor-remote.h" part of "the_repository.pending"Ævar Arnfjörð Bjarmason1-1/+1
Apply the part of "the_repository.pending.cocci" pertaining to "promisor-remote.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21csum-file.h: remove unnecessary inclusion of cache.hElijah Newren1-1/+1
With the change in the last commit to move several functions to write-or-die.h, csum-file.h no longer needs to include cache.h. However, removing that include forces several other C files, which directly or indirectly dependend upon csum-file.h's inclusion of cache.h, to now be more explicit about their dependencies. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21environment.h: move declarations for environment.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21wrapper.h: move declarations for wrapper.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: be explicit about dependence on gettext.hElijah Newren1-0/+1
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-19Merge branch 'ds/reprepare-alternates-when-repreparing-packfiles'Junio C Hamano1-0/+10
Once we start running, we assumed that the list of alternate object databases would never change. Hook into the machinery used to update the list of packfiles during runtime to update this list as well. * ds/reprepare-alternates-when-repreparing-packfiles: object-file: reprepare alternates when necessary
2023-03-17Merge branch 'jk/unused-post-2.39-part2'Junio C Hamano1-2/+2
More work towards -Wunused. * jk/unused-post-2.39-part2: (21 commits) help: mark unused parameter in git_unknown_cmd_config() run_processes_parallel: mark unused callback parameters userformat_want_item(): mark unused parameter for_each_commit_graft(): mark unused callback parameter rewrite_parents(): mark unused callback parameter fetch-pack: mark unused parameter in callback function notes: mark unused callback parameters prio-queue: mark unused parameters in comparison functions for_each_object: mark unused callback parameters list-objects: mark unused callback parameters mark unused parameters in signal handlers run-command: mark error routine parameters as unused mark "pointless" data pointers in callbacks ref-filter: mark unused callback parameters http-backend: mark unused parameters in virtual functions http-backend: mark argc/argv unused object-name: mark unused parameters in disambiguate callbacks serve: mark unused parameters in virtual functions serve: use repository pointer to get config ls-refs: drop config caching ...
2023-03-09object-file: reprepare alternates when necessaryDerrick Stolee1-0/+10
When an object is not found in a repository's object store, we sometimes call reprepare_packed_git() to see if the object was temporarily moved into a new pack-file (and its old pack-file or loose object was deleted). This process does a scan of each pack directory within each odb, but does not reevaluate if the odb list needs updating. Extend reprepare_packed_git() to also reprepare the alternate odb list by setting loaded_alternates to zero and calling prepare_alt_odb(). This will add newly-discoverd odbs to the linked list, but will not duplicate existing ones nor will it remove existing ones that are no longer listed in the alternates file. Do this under the object read lock to avoid readers from interacting with a potentially incomplete odb being added to the odb list. If the alternates file was edited to _remove_ some alternates during the course of the Git process, Git will continue to see alternates that were ever valid for that repository. ODBs are not removed from the list, the same as the existing behavior before this change. Git already has protections against an alternate directory disappearing from the filesystem during the lifetime of a process, and those are still in effect. This change is specifically for concurrent changes to the repository, so it is difficult to create a test that guarantees this behavior is correct. I manually verified by introducing a reprepare_packed_git() call into get_revision() and stepped into that call in a debugger with a parent 'git log' process. Multiple runs of prepare_alt_odb() kept the_repository->objects->odb as a single-item chain until I added a .git/objects/info/alternates file in a different process. The next run added the new odb to the chain and subsequent runs did not add to the chain. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-24for_each_object: mark unused callback parametersJeff King1-2/+2
The for_each_{loose,packed}_object interface uses callback functions, but not every callback needs all of the parameters. Mark the unused ones to satisfy -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23cache.h: remove dependence on hex.h; make other files include it explicitlyElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23alloc.h: move ALLOC_GROW() functions from cache.hElijah Newren1-1/+2
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-08packfile: inline custom read_object()Jeff King1-17/+9
When the pack code was split into its own file[1], it got a copy of the static read_object() function. But there's only one caller here, so we could just inline it. And it's worth doing so, as the name read_object() invites comparisons to the public read_object_file(), but the two don't behave quite the same. [1] The move happened over several commits, but the relevant one here is f1d8130be0 (pack: move clear_delta_base_cache(), packed_object_info(), unpack_entry(), 2017-08-18). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-14Merge branch 'ab/unused-annotation'Junio C Hamano1-1/+1
Undoes 'jk/unused-annotation' topic and redoes it to work around Coccinelle rules misfiring false positives in unrelated codepaths. * ab/unused-annotation: git-compat-util.h: use "deprecated" for UNUSED variables git-compat-util.h: use "UNUSED", not "UNUSED(var)"
2022-09-14Merge branch 'jk/unused-annotation'Junio C Hamano1-1/+1
Annotate function parameters that are not used (but cannot be removed for structural reasons), to prepare us to later compile with -Wunused warning turned on. * jk/unused-annotation: is_path_owned_by_current_uid(): mark "report" parameter as unused run-command: mark unused async callback parameters mark unused read_tree_recursive() callback parameters hashmap: mark unused callback parameters config: mark unused callback parameters streaming: mark unused virtual method parameters transport: mark bundle transport_options as unused refs: mark unused virtual method parameters refs: mark unused reflog callback parameters refs: mark unused each_ref_fn parameters git-compat-util: add UNUSED macro
2022-09-01git-compat-util.h: use "UNUSED", not "UNUSED(var)"Ævar Arnfjörð Bjarmason1-1/+1
As reported in [1] the "UNUSED(var)" macro introduced in 2174b8c75de (Merge branch 'jk/unused-annotation' into next, 2022-08-24) breaks coccinelle's parsing of our sources in files where it occurs. Let's instead partially go with the approach suggested in [2] of making this not take an argument. As noted in [1] "coccinelle" will ignore such tokens in argument lists that it doesn't know about, and it's less of a surprise to syntax highlighters. This undoes the "help us notice when a parameter marked as unused is actually use" part of 9b240347543 (git-compat-util: add UNUSED macro, 2022-08-19), a subsequent commit will further tweak the macro to implement a replacement for that functionality. 1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/ 2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-25Merge branch 'jk/is-promisor-object-keep-tree-in-use'Junio C Hamano1-2/+13
An earlier optimization discarded a tree-object buffer that is still in use, which has been corrected. * jk/is-promisor-object-keep-tree-in-use: is_promisor_object(): fix use-after-free of tree buffer
2022-08-19hashmap: mark unused callback parametersJeff King1-1/+1
Hashmap comparison functions must conform to a particular callback interface, but many don't use all of their parameters. Especially the void cmp_data pointer, but some do not use keydata either (because they can easily form a full struct to pass when doing lookups). Let's mark these to make -Wunused-parameter happy. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-18Merge branch 'jk/fsck-tree-mode-bits-fix'Junio C Hamano1-1/+1
"git fsck" reads mode from tree objects but canonicalizes the mode before passing it to the logic to check object sanity, which has hid broken tree objects from the checking logic. This has been corrected, but to help exiting projects with broken tree objects that they cannot fix retroactively, the severity of anomalies this code detects has been demoted to "info" for now. * jk/fsck-tree-mode-bits-fix: fsck: downgrade tree badFilemode to "info" fsck: actually detect bad file modes in trees tree-walk: add a mechanism for getting non-canonicalized modes
2022-08-14is_promisor_object(): fix use-after-free of tree bufferJeff King1-2/+13
Since commit fcc07e980b (is_promisor_object(): free tree buffer after parsing, 2021-04-13), we'll always free the buffers attached to a "struct tree" after searching them for promisor links. But there's an important case where we don't want to do so: if somebody else is already using the tree! This can happen during a "rev-list --missing=allow-promisor" traversal in a partial clone that is missing one or more trees or blobs. The backtrace for the free looks like this: #1 free_tree_buffer tree.c:147 #2 add_promisor_object packfile.c:2250 #3 for_each_object_in_pack packfile.c:2190 #4 for_each_packed_object packfile.c:2215 #5 is_promisor_object packfile.c:2272 #6 finish_object__ma builtin/rev-list.c:245 #7 finish_object builtin/rev-list.c:261 #8 show_object builtin/rev-list.c:274 #9 process_blob list-objects.c:63 #10 process_tree_contents list-objects.c:145 #11 process_tree list-objects.c:201 #12 traverse_trees_and_blobs list-objects.c:344 [...] We're in the middle of walking through the entries of a tree object via process_tree_contents(). We see a blob (or it could even be another tree entry) that we don't have, so we call is_promisor_object() to check it. That function loops over all of the objects in the promisor packfile, including the tree we're currently walking. When we're done with it there, we free the tree buffer. But as we return to the walk in process_tree_contents(), it's still holding on to a pointer to that buffer, via its tree_desc iterator, and it accesses the freed memory. Even a trivial use of "--missing=allow-promisor" triggers this problem, as the included test demonstrates (it's just a vanilla --blob:none clone). We can detect this case by only freeing the tree buffer if it was allocated on our behalf. This is a little tricky since that happens inside parse_object(), and it doesn't tell us whether the object was already parsed, or whether it allocated the buffer itself. But by checking for an already-parsed tree beforehand, we can distinguish the two cases. That feels a little hacky, and does incur an extra lookup in the object-hash table. But that cost is fairly minimal compared to actually loading objects (and since we're iterating the whole pack here, we're likely to be loading most objects, rather than reusing cached results). It may also be a good direction for this function in general, as there are other possible optimizations that rely on doing some analysis before parsing: - we could detect blobs and avoid reading their contents; they can't link to other objects, but parse_object() doesn't know that we don't care about checking their hashes. - we could avoid allocating object structs entirely for most objects (since we really only need them in the oidset), which would save some memory. - promisor commits could use the commit-graph rather than loading the object from disk This commit doesn't do any of those optimizations, but I think it argues that this direction is reasonable, rather than relying on parse_object() and trying to teach it to give us more information about whether it parsed. The included test fails reliably under SANITIZE=address just when running "rev-list --missing=allow-promisor". Checking the output isn't strictly necessary to detect the bug, but it seems like a reasonable addition given the general lack of coverage for "allow-promisor" in the test suite. Reported-by: Andrew Olsen <andrew.olsen@koordinates.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-10tree-walk: add a mechanism for getting non-canonicalized modesJeff King1-1/+1
When using init_tree_desc() and tree_entry() to iterate over a tree, we always canonicalize the modes coming out of the tree. This is a good thing to prevent bugs or oddities in normal code paths, but it's counter-productive for tools like fsck that want to see the exact contents. We can address this by adding an option to avoid the extra canonicalization. A few notes on the implementation: - I've attached the new option to the tree_desc struct itself. The actual code change is in decode_tree_entry(), which is in turn called by the public update_tree_entry(), tree_entry(), and init_tree_desc() functions, plus their "gently" counterparts. By letting it ride along in the struct, we can avoid changing the signature of those functions, which are called many times. Plus it's conceptually simpler: you really want a particular iteration of a tree to be "raw" or not, rather than individual calls. - We still have to set the new option somewhere. The struct is initialized by init_tree_desc(). I added the new flags field only to the "gently" version. That avoids disturbing the much more numerous non-gentle callers, and it makes sense that anybody being careful about looking at raw modes would also be careful about bogus trees (i.e., the caller will be something like fsck in the first place). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-03Merge branch 'rs/mergesort'Junio C Hamano1-15/+3
Make our mergesort implementation type-safe. * rs/mergesort: mergesort: remove llist_mergesort() packfile: use DEFINE_LIST_SORT fetch-pack: use DEFINE_LIST_SORT commit: use DEFINE_LIST_SORT blame: use DEFINE_LIST_SORT test-mergesort: use DEFINE_LIST_SORT test-mergesort: use DEFINE_LIST_SORT_DEBUG mergesort: add macros for typed sort of linked lists mergesort: tighten merge loop mergesort: unify ranks loops
2022-07-17packfile: use DEFINE_LIST_SORTRené Scharfe1-15/+3
Build a typed sort function for packed_git lists using DEFINE_LIST_SORT instead of calling llist_mergesort(). This gets rid of the next pointer accessor functions and their calling overhead at the cost of slightly increased object text size. Before: __TEXT __DATA __OBJC others dec hex 20218 320 0 110936 131474 20192 packfile.o With this patch: __TEXT __DATA __OBJC others dec hex 20430 320 0 112619 133369 208f9 packfile.o Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-11Merge branch 'jk/optim-promisor-object-enumeration'Junio C Hamano1-1/+2
Collection of what is referenced by objects in promisor packs have been optimized to inspect these objects in the in-pack order. * jk/optim-promisor-object-enumeration: is_promisor_object(): walk promisor packs in pack-order
2022-06-16is_promisor_object(): walk promisor packs in pack-orderJeff King1-1/+2
When we generate the list of promisor objects, we walk every pack with a .promisor file and examine its objects for any links to other objects. By default, for_each_packed_object() will go in pack .idx order. This is the worst case with respect to our delta base cache. If we have a delta chain of A->B->C->D, then visiting A may require reconstructing both B and C, unless we also visited B recently, in which case we may have cached its value. Because .idx order is based on sha1, it's random with respect to the actual object contents and deltas, and thus we're unlikely to get many cache hits. If we instead traverse in pack order, then we get the optimal case: packs are written to keep delta families together, and to place bases before their children. Even on a modest repository like git.git, this has a noticeable speedup on p5600.4, which runs "fsck" on a partial clone with blob:none (so lots of trees which need to be walked, and which delta well): Test HEAD^ HEAD ------------------------------------------------------- 5600.4: 17.87(17.83+0.04) 15.42(15.35+0.06) -13.7% On a larger repository like linux.git, the speedup is even more pronounced: Test HEAD^ HEAD ----------------------------------------------------------- 5600.4: 322.47(322.01+0.42) 186.41(185.76+0.63) -42.2% Any other operations that call is_promisor_object(), like "rev-list --exclude-promisor-objects", would similarly benefit, but the invocations in p5600 don't actually trigger any such cases. Note that we may pay a small price to build a rev-index in-memory to do the pack-order traversal. But it's still a big net win, and even that small cost goes away if you are using pack.writeReverseIndex. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-03Merge branch 'tb/cruft-packs'Junio C Hamano1-2/+17
A mechanism to pack unreachable objects into a "cruft pack", instead of ejecting them into loose form to be reclaimed later, has been introduced. * tb/cruft-packs: sha1-file.c: don't freshen cruft packs builtin/gc.c: conditionally avoid pruning objects via loose builtin/repack.c: add cruft packs to MIDX during geometric repack builtin/repack.c: use named flags for existing_packs builtin/repack.c: allow configuring cruft pack generation builtin/repack.c: support generating a cruft pack builtin/pack-objects.c: --cruft with expiration reachable: report precise timestamps from objects in cruft packs reachable: add options to add_unseen_recent_objects_to_traversal builtin/pack-objects.c: --cruft without expiration builtin/pack-objects.c: return from create_object_entry() t/helper: add 'pack-mtimes' test-tool pack-mtimes: support writing pack .mtimes files chunk-format.h: extract oid_version() pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' pack-mtimes: support reading .mtimes files Documentation/technical: add cruft-packs.txt
2022-05-26pack-mtimes: support reading .mtimes filesTaylor Blau1-2/+17
To store the individual mtimes of objects in a cruft pack, introduce a new `.mtimes` format that can optionally accompany a single pack in the repository. The format is defined in Documentation/technical/pack-format.txt, and stores a 4-byte network order timestamp for each object in name (index) order. This patch prepares for cruft packs by defining the `.mtimes` format, and introducing a basic API that callers can use to read out individual mtimes. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-02Merge branch 'ep/maint-equals-null-cocci' for maint-2.35Junio C Hamano1-1/+1
* ep/maint-equals-null-cocci: tree-wide: apply equals-null.cocci contrib/coccinnelle: add equals-null.cocci
2022-05-02tree-wide: apply equals-null.cocciJunio C Hamano1-1/+1
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12Merge branch 'jt/pack-header-lshift-overflow'Junio C Hamano1-1/+1
* jt/pack-header-lshift-overflow: packfile: fix off-by-one error in decoding logic
2022-01-12packfile: fix off-by-one error in decoding logicJunio C Hamano1-1/+1
shift count being exactly at 7-bit smaller than the long is OK; on 32-bit architecture, shift count starts at 4 and goes through 11, 18 and 25, at which point the guard triggers one iteration too early. Reported-by: Marc Strapetz <marc.strapetz@syntevo.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-15Merge branch 'tb/pack-revindex-on-disk-cleanup'Junio C Hamano1-1/+2
Code clean-up. * tb/pack-revindex-on-disk-cleanup: packfile: make `close_pack_revindex()` static
2021-12-10Merge branch 'jt/pack-header-lshift-overflow'Junio C Hamano1-1/+1
The code to decode the length of packed object size has been corrected. * jt/pack-header-lshift-overflow: packfile: avoid overflowing shift during decode
2021-12-04packfile: make `close_pack_revindex()` staticTaylor Blau1-1/+2
Since its definition in 2f4ba2a867 (packfile: prepare for the existence of '*.rev' files, 2021-01-25), the only caller of `close_pack_revindex()` was within packfile.c. Thus there is no need for this to be exposed via packfile.h, and instead can remain static within packfile.c's compilation unit. While we're here, move the function's opening brace onto its own line. Noticed-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-29Merge branch 'mc/clean-smudge-with-llp64'Junio C Hamano1-3/+3
The clean/smudge conversion code path has been prepared to better work on platforms where ulong is narrower than size_t. * mc/clean-smudge-with-llp64: clean/smudge: allow clean filters to process extremely large files odb: guard against data loss checking out a huge file git-compat-util: introduce more size_t helpers odb: teach read_blob_entry to use size_t t1051: introduce a smudge filter test for extremely large files test-lib: add prerequisite for 64-bit platforms test-tool genzeros: generate large amounts of data more efficiently test-genzeros: allow more than 2G zeros in Windows
2021-11-11packfile: avoid overflowing shift during decodeJonathan Tan1-1/+1
unpack_object_header_buffer() attempts to protect against overflowing left shifts, but the limit of the shift amount should not be the size of the variable being shifted. It should be the size minus the size of its contents. Fix that accordingly. This was noticed at $DAYJOB by a fuzzer running internally. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-03odb: guard against data loss checking out a huge fileMatt Cooper1-3/+3
This introduces an additional guard for platforms where `unsigned long` and `size_t` are not of the same size. If the size of an object in the database would overflow `unsigned long`, instead we now exit with an error. A complete fix will have to update _many_ other functions throughout the codebase to use `size_t` instead of `unsigned long`. It will have to be implemented at some stage. This commit puts in a stop-gap for the time being. Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-03Merge branch 'rs/close-pack-leakfix'Junio C Hamano1-0/+1
Leakfix. * rs/close-pack-leakfix: packfile: release bad_objects in close_pack()
2021-09-24packfile: release bad_objects in close_pack()René Scharfe1-0/+1
Unusable entries of a damaged pack file are recorded in the oidset bad_objects. Release it when we're done with the pack. This doesn't affect intact packs because an empty oidset requires no allocation. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-23Merge branch 'rs/packfile-bad-object-list-in-oidset'Junio C Hamano1-27/+11
Replace a handcrafted data structure used to keep track of bad objects in the packfile API by an oidset. * rs/packfile-bad-object-list-in-oidset: packfile: use oidset for bad objects packfile: convert has_packed_and_bad() to object_id packfile: convert mark_bad_packed_object() to object_id midx: inline nth_midxed_pack_entry() oidset: make oidset_size() an inline function
2021-09-20Merge branch 'tb/multi-pack-bitmaps'Junio C Hamano1-1/+1
The reachability bitmap file used to be generated only for a single pack, but now we've learned to generate bitmaps for history that span across multiple packfiles. * tb/multi-pack-bitmaps: (29 commits) pack-bitmap: drop bitmap_index argument from try_partial_reuse() pack-bitmap: drop repository argument from prepare_midx_bitmap_git() p5326: perf tests for MIDX bitmaps p5310: extract full and partial bitmap tests midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' t7700: update to work with MIDX bitmap test knob t5319: don't write MIDX bitmaps in t5319 t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP t5326: test multi-pack bitmap behavior t/helper/test-read-midx.c: add --checksum mode t5310: move some tests to lib-bitmap.sh pack-bitmap: write multi-pack bitmaps pack-bitmap: read multi-pack bitmaps pack-bitmap.c: avoid redundant calls to try_partial_reuse pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' pack-bitmap.c: introduce 'nth_bitmap_object_oid()' pack-bitmap.c: introduce 'bitmap_num_objects()' midx: avoid opening multiple MIDXs when writing midx: close linked MIDXs, avoid leaking memory ...
2021-09-12packfile: use oidset for bad objectsRené Scharfe1-22/+6
Store the object ID of broken pack entries in an oidset instead of keeping only their hashes in an unsorted array. The resulting code is shorter and easier to read. It also handles the (hopefully) very rare case of having a high number of bad objects better. Helped-by: Jeff King <peff@peff.net> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12packfile: convert has_packed_and_bad() to object_idRené Scharfe1-2/+2
The single caller has a full object ID, so pass it on instead of just its hash member. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12packfile: convert mark_bad_packed_object() to object_idRené Scharfe1-6/+6
All callers have full object IDs, so pass them on instead of just their hash member. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap: read multi-pack bitmapsTaylor Blau1-1/+1
This prepares the code in pack-bitmap to interpret the new multi-pack bitmaps described in Documentation/technical/bitmap-format.txt, which mostly involves converting bit positions to accommodate looking them up in a MIDX. Note that there are currently no writers who write multi-pack bitmaps, and that this will be implemented in the subsequent commit. Note also that get_midx_checksum() and get_midx_filename() are made non-static so they can be called from pack-bitmap.c. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-29object-store.h: teach for_each_packed_object to ignore kept packsTaylor Blau1-0/+6
The next patch will reimplement a function that wants to iterate over packed objects while ignoring packs which are marked as kept (either in-core or on-disk). Teach for_each_packed_object() to ignore all objects from those packs by adding a new flag for each of the "kept" states that a pack can be in. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-29xmmap: inform Linux users of tuning knobs on ENOMEMEric Wong1-2/+2
Linux users may benefit from additional information on how to avoid ENOMEM from mmap despite the system having enough RAM to accomodate them. We can't reliably unmap pack windows to work around the issue since malloc and other library routines may mmap without our knowledge. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20Merge branch 'en/dir-traversal'Junio C Hamano1-4/+1
"git clean" and "git ls-files -i" had confusion around working on or showing ignored paths inside an ignored directory, which has been corrected. * en/dir-traversal: dir: introduce readdir_skip_dot_and_dotdot() helper dir: update stale description of treat_directory() dir: traverse into untracked directories if they may have ignored subfiles dir: avoid unnecessary traversal into ignored directory t3001, t7300: add testcase showcasing missed directory traversal t7300: add testcase showing unnecessary traversal into ignored directory ls-files: error out on -i unless -o or -c are specified dir: report number of visited directories and paths with trace2 dir: convert trace calls to trace2 equivalents
2021-05-13dir: introduce readdir_skip_dot_and_dotdot() helperElijah Newren1-4/+1
Many places in the code were doing while ((d = readdir(dir)) != NULL) { if (is_dot_or_dotdot(d->d_name)) continue; ...process d... } Introduce a readdir_skip_dot_and_dotdot() helper to make that a one-liner: while ((d = readdir_skip_dot_and_dotdot(dir)) != NULL) { ...process d... } This helper particularly simplifies checks for empty directories. Also use this helper in read_cached_dir() so that our statistics are consistent across platforms. (In other words, read_cached_dir() should have been using is_dot_or_dotdot() and skipping such entries, but did not and left it to treat_path() to detect and mark such entries as path_none.) Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-13is_promisor_object(): free tree buffer after parsingJeff King1-0/+1
To get the list of all promisor objects, we not only include all objects in promisor packs, but also parse each of those objects to see which objects they reference. After parsing a tree object, the tree->buffer field will remain populated until we explicitly free it. So in a partial clone of blob:none, for example, we are essentially reading every tree in the repository (since they're all in the initial promisor pack), and keeping all of their uncompressed contents in memory at once. This patch frees the tree buffers after we've finished marking all of their reachable objects. We shouldn't need to do this for any other object type. While we are using some extra memory to store the structs, no other object type stores the whole contents in its parsed form (we do sometimes hold on to commit buffers, but less so these days due to commit graphs, plus most commands which care about promisor objects turn off the save_commit_buffer global). Even for a moderate-sized repository like git.git, this patch drops the peak heap (as measured by massif) for git-fsck from ~1.7GB to ~138MB. Fsck is a good candidate for measuring here because it doesn't interact with the promisor code except to call is_promisor_object(), so we can isolate just this problem. The added perf test shows only a tiny improvement on my machine for git.git, since 1.7GB isn't enough to cause any real memory pressure: Test HEAD^ HEAD -------------------------------------------------------------------------------- 5600.4: fsck 21.26(20.90+0.35) 20.84(20.79+0.04) -2.0% With linux.git the absolute change is a bit bigger, though still a small percentage: Test HEAD^ HEAD ----------------------------------------------------------------------------- 5600.4: fsck 262.26(259.13+3.12) 254.92(254.62+0.29) -2.8% I didn't have the patience to run it under massif with linux.git, but it's probably on the order of about 14GB improvement, since that's the sum of the sizes of all of the uncompressed trees (but still isn't enough to create memory pressure on this particular machine, which has 64GB of RAM). Smaller machines would probably see a bigger effect on runtime (and sadly our perf suite does not measure peak heap). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-08Merge branch 'tb/reverse-midx'Junio C Hamano1-0/+3
An on-disk reverse-index to map the in-pack location of an object back to its object name across multiple packfiles is introduced. * tb/reverse-midx: midx.c: improve cache locality in midx_pack_order_cmp() pack-revindex: write multi-pack reverse indexes pack-write.c: extract 'write_rev_file_order' pack-revindex: read multi-pack reverse indexes Documentation/technical: describe multi-pack reverse indexes midx: make some functions non-static midx: keep track of the checksum midx: don't free midx_name early midx: allow marking a pack as preferred t/helper/test-read-midx.c: add '--show-objects' builtin/multi-pack-index.c: display usage on unrecognized command builtin/multi-pack-index.c: don't enter bogus cmd_mode builtin/multi-pack-index.c: split sub-commands builtin/multi-pack-index.c: define common usage with a macro builtin/multi-pack-index.c: don't handle 'progress' separately builtin/multi-pack-index.c: inline 'flags' with options
2021-04-01pack-revindex: read multi-pack reverse indexesTaylor Blau1-0/+3
Implement reading for multi-pack reverse indexes, as described in the previous patch. Note that these functions don't yet have any callers, and won't until multi-pack reachability bitmaps are introduced in a later patch series. In the meantime, this patch implements some of the infrastructure necessary to support multi-pack bitmaps. There are three new functions exposed by the revindex API: - load_midx_revindex(): loads the reverse index corresponding to the given multi-pack index. - midx_to_pack_pos() and pack_pos_to_midx(): these convert between the multi-pack index and pseudo-pack order. load_midx_revindex() and pack_pos_to_midx() are both relatively straightforward. load_midx_revindex() needs a few functions to be exposed from the midx API. One to get the checksum of a midx, and another to get the .rev's filename. Similar to recent changes in the packed_git struct, three new fields are added to the multi_pack_index struct: one to keep track of the size, one to keep track of the mmap'd pointer, and another to point past the header and at the reverse index's data. pack_pos_to_midx() simply reads the corresponding entry out of the table. midx_to_pack_pos() is the trickiest, since it needs to find an object's position in the psuedo-pack order, but that order can only be recovered in the .rev file itself. This mapping can be implemented with a binary search, but note that the thing we're binary searching over isn't an array of values, but rather a permuted order of those values. So, when comparing two items, it's helpful to keep in mind the difference. Instead of a traditional binary search, where you are comparing two things directly, here we're comparing a (pack, offset) tuple with an index into the multi-pack index. That index describes another (pack, offset) tuple, and it is _those_ two tuples that are compared. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-24Merge branch 'tb/geometric-repack'Junio C Hamano1-0/+67
"git repack" so far has been only capable of repacking everything under the sun into a single pack (or split by size). A cleverer strategy to reduce the cost of repacking a repository has been introduced. * tb/geometric-repack: builtin/pack-objects.c: ignore missing links with --stdin-packs builtin/repack.c: reword comment around pack-objects flags builtin/repack.c: be more conservative with unsigned overflows builtin/repack.c: assign pack split later t7703: test --geometric repack with loose objects builtin/repack.c: do not repack single packs with --geometric builtin/repack.c: add '--geometric' option packfile: add kept-pack cache for find_kept_pack_entry() builtin/pack-objects.c: rewrite honor-pack-keep logic p5303: measure time to repack with keep p5303: add missing &&-chains builtin/pack-objects.c: add '--stdin-packs' option revision: learn '--no-kept-objects' packfile: introduce 'find_kept_pack_entry()'
2021-03-13use CALLOC_ARRAYRené Scharfe1-1/+1
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-22packfile: add kept-pack cache for find_kept_pack_entry()Jeff King1-43/+56
In a recent patch we added a function 'find_kept_pack_entry()' to look for an object only among kept packs. While this function avoids doing any lookup work in non-kept packs, it is still linear in the number of packs, since we have to traverse the linked list of packs once per object. Let's cache a reduced version of that list to save us time. Note that this cache will last the lifetime of the program. We could invalidate it on reprepare_packed_git(), but there's not much point in being rigorous here: - we might already fail to notice new .keep packs showing up after the program starts. We only reprepare_packed_git() when we fail to find an object. But adding a new pack won't cause that to happen. Somebody repacking could add a new pack and delete an old one, but most of the time we'd have a descriptor or mmap open to the old pack anyway, so we might not even notice. - in pack-objects we already cache the .keep state at startup, since 56dfeb6263 (pack-objects: compute local/ignore_pack_keep early, 2016-07-29). So this is just extending that concept further. - we don't have to worry about any packed_git being removed; we always keep the old structs around, even after reprepare_packed_git() We do defensively invalidate the cache in case the set of kept packs being asked for changes (e.g., only in-core kept packs were cached, but suddenly the caller also wants on-disk kept packs, too). In theory we could build all three caches and switch between them, but it's not necessary, since this patch (and series) never changes the set of kept packs that it wants to inspect from the cache. So that "optimization" is more about being defensive in the face of future changes than it is about asking for multiple kinds of kept packs in this patch. Here are p5303 results (as always, measured against the kernel): Test HEAD^ HEAD ----------------------------------------------------------------------------------------------- 5303.5: repack (1) 57.34(54.66+10.88) 56.98(54.36+10.98) -0.6% 5303.6: repack with kept (1) 57.38(54.83+10.49) 57.17(54.97+10.26) -0.4% 5303.11: repack (50) 71.70(88.99+4.74) 71.62(88.48+5.08) -0.1% 5303.12: repack with kept (50) 72.58(89.61+4.78) 71.56(88.80+4.59) -1.4% 5303.17: repack (1000) 217.19(491.72+14.25) 217.31(490.82+14.53) +0.1% 5303.18: repack with kept (1000) 246.12(520.07+14.93) 217.08(490.37+15.10) -11.8% and the --stdin-packs case, which scales a little bit better (although not by that much even at 1,000 packs): 5303.7: repack with --stdin-packs (1) 0.00(0.00+0.00) 0.00(0.00+0.00) = 5303.13: repack with --stdin-packs (50) 3.43(11.75+0.24) 3.43(11.69+0.30) +0.0% 5303.19: repack with --stdin-packs (1000) 130.50(307.15+7.66) 125.13(301.36+8.04) -4.1% Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-22packfile: introduce 'find_kept_pack_entry()'Taylor Blau1-5/+59
Future callers will want a function to fill a 'struct pack_entry' for a given object id but _only_ from its position in any kept pack(s). In particular, an new 'git repack' mode which ensures the resulting packs form a geometric progress by object count will mark packs that it does not want to repack as "kept in-core", and it will want to halt a reachability traversal as soon as it visits an object in any of the kept packs. But, it does not want to halt the traversal at non-kept, or .keep packs. The obvious alternative is 'find_pack_entry()', but this doesn't quite suffice since it only returns the first pack it finds, which may or may not be kept (and the mru cache makes it unpredictable which one you'll get if there are options). Short of that, you could walk over all packs looking for the object in each one, but it scales with the number of packs, which may be prohibitive. Introduce 'find_kept_pack_entry()', a function which is like 'find_pack_entry()', but only fills in objects in the kept packs. Handle packs which have .keep files, as well as in-core kept packs separately, since certain callers will want to distinguish one from the other. (Though on-disk and in-core kept packs share the adjective "kept", it is best to think of the two sets as independent.) There is a gotcha when looking up objects that are duplicated in kept and non-kept packs, particularly when the MIDX stores the non-kept version and the caller asked for kept objects only. This could be resolved by teaching the MIDX to resolve duplicates by always favoring the kept pack (if one exists), but this breaks an assumption in existing MIDXs, and so it would require a format change. The benefit to changing the MIDX in this way is marginal, so we instead have a more thorough check here which is explained with a comment. Callers will be added in subsequent patches. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25packfile: prepare for the existence of '*.rev' filesTaylor Blau1-1/+12
Specify the format of the on-disk reverse index 'pack-*.rev' file, as well as prepare the code for the existence of such files. The reverse index maps from pack relative positions (i.e., an index into the array of object which is sorted by their offsets within the packfile) to their position within the 'pack-*.idx' file. Today, this is done by building up a list of (off_t, uint32_t) tuples for each object (the off_t corresponding to that object's offset, and the uint32_t corresponding to its position in the index). To convert between pack and index position quickly, this array of tuples is radix sorted based on its offset. This has two major drawbacks: First, the in-memory cost scales linearly with the number of objects in a pack. Each 'struct revindex_entry' is sizeof(off_t) + sizeof(uint32_t) + padding bytes for a total of 16. To observe this, force Git to load the reverse index by, for e.g., running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking for a single object in a fresh clone of the kernel, Git needs to allocate 120+ MB of memory in order to hold the reverse index in memory. Second, the cost to sort also scales with the size of the pack. Luckily, this is a linear function since 'load_pack_revindex()' uses a radix sort, but this cost still must be paid once per pack per process. As an example, it takes ~60x longer to print the _size_ of an object as it does to print that entire object's _contents_: Benchmark #1: git.compile cat-file --batch <obj Time (mean ± σ): 3.4 ms ± 0.1 ms [User: 3.3 ms, System: 2.1 ms] Range (min … max): 3.2 ms … 3.7 ms 726 runs Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj Time (mean ± σ): 210.3 ms ± 8.9 ms [User: 188.2 ms, System: 23.2 ms] Range (min … max): 193.7 ms … 224.4 ms 13 runs Instead, avoid computing and sorting the revindex once per process by writing it to a file when the pack itself is generated. The format is relatively straightforward. It contains an array of uint32_t's, the length of which is equal to the number of objects in the pack. The ith entry in this table contains the index position of the ith object in the pack, where "ith object in the pack" is determined by pack offset. One thing that the on-disk format does _not_ contain is the full (up to) eight-byte offset corresponding to each object. This is something that the in-memory revindex contains (it stores an off_t in 'struct revindex_entry' along with the same uint32_t that the on-disk format has). Omit it in the on-disk format, since knowing the index position for some object is sufficient to get a constant-time lookup in the pack-*.idx file to ask for an object's offset within the pack. This trades off between the on-disk size of the 'pack-*.rev' file for runtime to chase down the offset for some object. Even though the lookup is constant time, the constant is heavier, since it can potentially involve two pointer walks in v2 indexes (one to access the 4-byte offset table, and potentially a second to access the double wide offset table). Consider trying to map an object's pack offset to a relative position within that pack. In a cold-cache scenario, more page faults occur while switching between binary searching through the reverse index and searching through the *.idx file for an object's offset. Sure enough, with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after 'sync'ing), printing out the entire object's contents is still marginally faster than printing its size: Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null Time (mean ± σ): 22.6 ms ± 0.5 ms [User: 2.4 ms, System: 7.9 ms] Range (min … max): 21.4 ms … 23.5 ms 41 runs Benchmark #2: git.compile cat-file --batch <obj >/dev/null Time (mean ± σ): 17.2 ms ± 0.7 ms [User: 2.8 ms, System: 5.5 ms] Range (min … max): 15.6 ms … 18.2 ms 45 runs (Numbers taken in the kernel after cheating and using the next patch to generate a reverse index). There are a couple of approaches to improve cold cache performance not pursued here: - We could include the object offsets in the reverse index format. Predictably, this does result in fewer page faults, but it triples the size of the file, while simultaneously duplicating a ton of data already available in the .idx file. (This was the original way I implemented the format, and it did show `--batch-check='%(objectsize:disk)'` winning out against `--batch`.) On the other hand, this increase in size also results in a large block-cache footprint, which could potentially hurt other workloads. - We could store the mapping from pack to index position in more cache-friendly way, like constructing a binary search tree from the table and writing the values in breadth-first order. This would result in much better locality, but the price you pay is trading O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you can no longer directly index the table). So, neither of these approaches are taken here. (Thankfully, the format is versioned, so we are free to pursue these in the future.) But, cold cache performance likely isn't interesting outside of one-off cases like asking for the size of an object directly. In real-world usage, Git is often performing many operations in the revindex (i.e., asking about many objects rather than a single one). The trade-off is worth it, since we will avoid the vast majority of the cost of generating the revindex that the extra pointer chase will look like noise in the following patch's benchmarks. This patch describes the format and prepares callers (like in pack-revindex.c) to be able to read *.rev files once they exist. An implementation of the writer will appear in the next patch, and callers will gradually begin to start using the writer in the patches that follow after that. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25Merge branch 'tb/pack-revindex-api'Junio C Hamano1-24/+52
Abstract accesses to in-core revindex that allows enumerating objects stored in a packfile in the order they appear in the pack, in preparation for introducing an on-disk precomputed revindex. * tb/pack-revindex-api: (21 commits) for_each_object_in_pack(): clarify pack vs index ordering pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' pack-revindex: hide the definition of 'revindex_entry' pack-revindex: remove unused 'find_revindex_position()' pack-revindex: remove unused 'find_pack_revindex()' builtin/gc.c: guess the size of the revindex for_each_object_in_pack(): convert to new revindex API unpack_entry(): convert to new revindex API packed_object_info(): convert to new revindex API retry_bad_packed_offset(): convert to new revindex API get_delta_base_oid(): convert to new revindex API rebuild_existing_bitmaps(): convert to new revindex API try_partial_reuse(): convert to new revindex API get_size_by_pos(): convert to new revindex API show_objects_for_type(): convert to new revindex API bitmap_position_packfile(): convert to new revindex API check_object(): convert to new revindex API write_reused_pack_verbatim(): convert to new revindex API write_reused_pack_one(): convert to new revindex API write_reuse_object(): convert to new revindex API ...
2021-01-14for_each_object_in_pack(): clarify pack vs index orderingJeff King1-6/+18
We may return objects in one of two orders: how they appear in the .idx (sorted by object id) or how they appear in the packfile itself. To further complicate matters, we have two ordering variables, "i" and "pos", and it is not clear to which order they apply. Let's clarify this by using an unambiguous name where possible, and leaving a comment for the variable that does double-duty. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13for_each_object_in_pack(): convert to new revindex APITaylor Blau1-1/+1
Avoid looking at the 'revindex' pointer directly and instead call 'pack_pos_to_index()'. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13unpack_entry(): convert to new revindex APITaylor Blau1-8/+18
Remove direct manipulation of the 'struct revindex_entry' type as well as calls to the deprecated API in 'packfile.c:unpack_entry()'. Usual clean-up is performed (replacing '->nr' with calls to 'pack_pos_to_index()' and so on). Add an additional check to make sure that 'obj_offset()' points at a valid object. In the case this check is violated, we cannot call 'mark_bad_packed_object()' because we don't know the OID. At the top of the call stack is do_oid_object_info_extended() (via packed_object_info()), which does mark the object. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13packed_object_info(): convert to new revindex APITaylor Blau1-2/+9
Convert another call of 'find_pack_revindex()' to its replacement 'pack_pos_to_offset()'. Likewise: - Avoid manipulating `struct packed_git`'s `revindex` pointer directly by removing the pointer-as-array indexing. - Add an additional guard to check that the offset 'obj_offset()' points to a real object. This should be the case with well-behaved callers to 'packed_object_info()', but isn't guarenteed. Other blocks that fill in various other values from the 'struct object_info' request handle bad inputs by setting the type to 'OBJ_BAD' and jumping to 'out'. Do the same when given a bad offset here. The previous code would have segfaulted when given a bad 'obj_offset' value, since 'find_pack_revindex()' would return 'NULL', and then the line that fills 'oi->disk_sizep' would try to access 'NULL[1]' with a stride of 16 bytes (the width of 'struct revindex_entry)'. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13retry_bad_packed_offset(): convert to new revindex APITaylor Blau1-4/+3
Perform exactly the same conversion as in the previous commit to another caller within 'packfile.c'. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13get_delta_base_oid(): convert to new revindex APITaylor Blau1-4/+4
Replace direct accesses to the 'struct revindex' type with a call to 'pack_pos_to_index()'. Likewise drop the old-style 'find_pack_revindex()' with its replacement 'offset_to_pack_pos()' (while continuing to perform the same error checking). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04hash-lookup: rename from sha1-lookupMartin Ågren1-1/+1
Change all remnants of "sha1" in hash-lookup.c and .h and rename them to reflect that we're not just able to handle SHA-1 these days. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08Merge branch 'tb/idx-midx-race-fix'Junio C Hamano1-17/+2
Processes that access packdata while the .idx file gets removed (e.g. while repacking) did not fail or fall back gracefully as they could. * tb/idx-midx-race-fix: midx.c: protect against disappearing packs packfile.c: protect against disappearing indexes
2020-11-25packfile.c: protect against disappearing indexesTaylor Blau1-17/+2
In 17c35c8969 (packfile: skip loading index if in multi-pack-index, 2018-07-12) we stopped loading the .idx file for packs that are contained within a multi-pack index. This saves us the effort of loading an .idx and doing some lightweight validity checks by way of 'packfile.c:load_idx()', but introduces a race between processes that need to load the index (e.g., to generate a reverse index) and processes that can delete the index. For example, running the following in your shell: $ git init repo && cd repo $ git commit --allow-empty -m 'base' $ git repack -ad && git multi-pack-index write followed by: $ rm -f .git/objects/pack/pack-*.idx $ git rev-parse HEAD | git cat-file --batch-check='%(objectsize:disk)' will result in a segfault prior to this patch. What's happening here is that we notice that the pack is in the multi-pack index, and so don't check that it still has a .idx. When we then try and load that index to generate a reverse index, we don't have it, so the call to 'find_pack_revindex()' in 'packfile.c:packed_object_info()' returns NULL, and then dereferencing it causes a segfault. Of course, we don't ever expect someone to remove the index file by hand, or to be in a state where we never wrote it to begin with (yet find that pack in the multi-pack-index). But, this can happen in a timing race with 'git repack -ad', which removes all existing packs after writing a new pack containing all of their objects. Avoid this by reverting the hunk of 17c35c8969 which stops loading the index when the pack is contained in a MIDX. This makes the latter half of 17c35c8969 useless, since we'll always have a non-NULL 'p->index_data', in which case that if statement isn't guarding anything. These two together effectively revert 17c35c8969, and avoid the race explained above. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-16packfile: detect overflow in .idx file size checksJeff King1-3/+3
In load_idx(), we check that the .idx file is sized appropriately for the number of objects it claims to have. We recently fixed the case where the number of objects caused our expected size to overflow a 32-bit unsigned int, and we switched to size_t. On a 64-bit system, this is fine; our size_t covers any expected size. On a 32-bit system, though, it won't. The file may claim to have 2^31 objects, which will overflow even a size_t. This doesn't hurt us at all for a well-formed idx file. A 32-bit system would already have failed to mmap such a file, since it would be too big. But an .idx file which _claims_ to have 2^31 objects but is actually much smaller would fool our check. This is a broken file, and for the most part we don't care that much what happens. But: - it's a little friendlier to notice up front "woah, this file is broken" than it is to get nonsense results - later access of the data assumes that the loading function sanity-checked that we have at least enough bytes for the regular object-id table. A malformed .idx file could lead to an out-of-bounds read. So let's use our overflow-checking functions to make sure that we're not fooled by a malformed file. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-16use size_t to store pack .idx byte offsetsJeff King1-2/+2
We sometimes store the offset into a pack .idx file as an "unsigned long", but the mmap'd size of a pack .idx file can exceed 4GB. This is sufficient on LP64 systems like Linux, but will be too small on LLP64 systems like Windows, where "unsigned long" is still only 32 bits. Let's use size_t, which is a better type for an offset into a memory buffer. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-16compute pack .idx byte offsets using size_tJeff King1-6/+6
A pack and its matching .idx file are limited to 2^32 objects, because the pack format contains a 32-bit field to store the number of objects. Hence we use uint32_t in the code. But the byte count of even a .idx file can be much larger than that, because it stores at least a hash and an offset for each object. So using SHA-1, a v2 .idx file will cross the 4GB boundary at 153,391,650 objects. This confuses load_idx(), which computes the minimum size like this: unsigned long min_size = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz; Even though min_size will be big enough on most 64-bit platforms, the actual arithmetic is done as a uint32_t, resulting in a truncation. We actually exceed that min_size, but then we do: unsigned long max_size = min_size; if (nr) max_size += (nr - 1)*8; to account for the variable-sized table. That computation doesn't overflow quite so low, but with the truncation for min_size, we end up with a max_size that is much smaller than our actual size. So we complain that the idx is invalid, and can't find any of its objects. We can fix this case by casting "nr" to a size_t, which will do the multiplication in 64-bits (assuming you're on a 64-bit platform; this will never work on a 32-bit system since we couldn't map the whole .idx anyway). Likewise, we don't have to worry about further additions, because adding a smaller number to a size_t will convert the other side to a size_t. A few notes: - obviously we could just declare "nr" as a size_t in the first place (and likewise, packed_git.num_objects). But it's conceptually a uint32_t because of the on-disk format, and we correctly treat it that way in other contexts that don't need to compute byte offsets (e.g., iterating over the set of objects should and generally does use a uint32_t). Switching to size_t would make all of those other cases look wrong. - it could be argued that the proper type is off_t to represent the file offset. But in practice the .idx file must fit within memory, because we mmap the whole thing. And the rest of the code (including the idx_size variable we're comparing against) uses size_t. - we'll add the same cast to the max_size arithmetic line. Even though we're adding to a larger type, which will convert our result, the multiplication is still done as a 32-bit value and can itself overflow. I didn't check this with my test case, since it would need an even larger pack (~530M objects), but looking at compiler output shows that it works this way. The standard should agree, but I couldn't find anything explicit in 6.3.1.8 ("usual arithmetic conversions"). The case in load_idx() was the most immediate one that I was able to trigger. After fixing it, looking up actual objects (including the very last one in sha1 order) works in a test repo with 153,725,110 objects. That's because bsearch_hash() works with uint32_t entry indices, and the actual byte access: int cmp = hashcmp(table + mi * stride, sha1); is done with "stride" as a size_t, causing the uint32_t "mi" to be promoted to a size_t. This is the way most code will access the index data. However, I audited all of the other byte-wise accesses of packed_git.index_data, and many of the others are suspect (they are similar to the max_size one, where we are adding to a properly sized offset or directly to a pointer, but the multiplication in the sub-expression can overflow). I didn't trigger any of these in practice, but I believe they're potential problems, and certainly adding in the cast is not going to hurt anything here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-04Merge branch 'mt/delta-base-cache-races'Junio C Hamano1-19/+29
A race that leads to an access to a free'd data was corrected in the codepath that reads pack files. * mt/delta-base-cache-races: packfile: fix memory leak in add_delta_base_cache() packfile: fix race condition on unpack_entry()
2020-09-28packfile: fix memory leak in add_delta_base_cache()Matheus Tavares1-2/+5
When add_delta_base_cache() is called with a base that is already in the cache, no operation is performed. But the check is done after allocating space for a new entry, so we end up leaking memory on the early return. In addition, the caller never free()'s the base as it expects the function to take ownership of it. But the base is not released when we skip insertion, so it also gets leaked. To fix these problems, move the allocation of a new entry further down in add_delta_base_cache(), and free() the base on early return. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-28packfile: fix race condition on unpack_entry()Matheus Tavares1-17/+24
The third phase of unpack_entry() performs the following sequence in a loop, until all the deltas enumerated in phase one are applied and the entry is fully reconstructed: 1. Add the current base entry to the delta base cache 2. Unpack the next delta 3. Patch the unpacked delta on top of the base When the optional object reading lock is enabled, the above steps will be performed while holding the lock. However, step 2. momentarily releases it so that inflation can be performed in parallel for increased performance. Because the `base` buffer inserted in the cache at 1. is not duplicated, another thread can potentially free() it while the lock is released at 2. (e.g. when there is no space left in the cache to insert another entry). In this case, the later attempt to dereference `base` at 3. will cause a segmentation fault. This problem was observed during a multithreaded git-grep execution on a repository with large objects. To fix the race condition (and later segmentation fault), let's reorder the aforementioned steps so that `base` is only added to the cache at the end. This will prevent the buffer from being released by another thread while it is still in use. An alternative solution which would not require the reordering would be to duplicate `base` before inserting it in the cache. However, as Phil Hord mentioned, memcpy()'ing large bases can negatively affect performance: in his experiments, this alternative approach slowed git-grep down by 10% to 20%. Reported-by: Phil Hord <phil.hord@gmail.com> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-22Merge branch 'jk/dont-count-existing-objects-twice'Junio C Hamano1-0/+1
There is a logic to estimate how many objects are in the repository, which is mean to run once per process invocation, but it ran every time the estimated value was requested. * jk/dont-count-existing-objects-twice: packfile: actually set approximate_object_count_valid
2020-09-17packfile: actually set approximate_object_count_validJeff King1-0/+1
The approximate_object_count() function tries to compute the count only once per process. But ever since it was introduced in 8e3f52d778 (find_unique_abbrev: move logic out of get_short_sha1(), 2016-10-03), we failed to actually set the "valid" flag, meaning we'd compute it fresh on every call. This turns out not to be _too_ bad, because we're only iterating through the packed_git list, and not making any system calls. But since it may get called for every abbreviated hash we output, even this can add up if you have many packs. Here are before-and-after timings for a new perf test which just asks rev-list to abbreviate each commit hash (the test repo is linux.git, with commit-graphs): Test origin HEAD ---------------------------------------------------------------------------- 5303.3: rev-list (1) 28.91(28.46+0.44) 29.03(28.65+0.38) +0.4% 5303.4: abbrev-commit (1) 1.18(1.06+0.11) 1.17(1.02+0.14) -0.8% 5303.7: rev-list (50) 28.95(28.56+0.38) 29.50(29.17+0.32) +1.9% 5303.8: abbrev-commit (50) 3.67(3.56+0.10) 3.57(3.42+0.15) -2.7% 5303.11: rev-list (1000) 30.34(29.89+0.43) 30.82(30.35+0.46) +1.6% 5303.12: abbrev-commit (1000) 86.82(86.52+0.29) 77.82(77.59+0.22) -10.4% 5303.15: load 10,000 packs 0.08(0.02+0.05) 0.08(0.02+0.06) +0.0% It doesn't help at all when we have 1 pack (5303.4), but we get a 10% speedup when there are 1000 packs (5303.12). That's a modest speedup for a case that's already slow and we'd hope to avoid in general (note how slow it is even after, because we have to look in each of those packs for abbreviations). But it's a one-line change that clearly matches the original intent, so it seems worth doing. The included perf test may also be useful for keeping an eye on any regressions in the overall abbreviation code. Reported-by: Rasmus Villemoes <rv@rasmusvillemoes.dk> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-28midx: traverse the local MIDX firstTaylor Blau1-0/+11
When a repository has an alternate object directory configured, callers can traverse through each alternate's MIDX by walking the '->next' pointer. But, when 'prepare_multi_pack_index_one()' loads multiple MIDXs, it places the new ones at the front of this pointer chain, not at the end. This can be confusing for callers such as 'git repack -ad', causing test failures like in t7700.6 with 'GIT_TEST_MULTI_PACK_INDEX=1'. The occurs when dropping a pack known to the local MIDX with alternates configured that have their own MIDX. Since the alternate's MIDX is returned via 'get_multi_pack_index()', 'midx_contains_pack()' returns true (which is correct, since it traverses through the '->next' pointer to find the MIDX in the chain that does contain the requested object). But, we call 'clear_midx_file()' on 'the_repository', which drops the MIDX at the path of the first MIDX in the chain, which (in the case of t7700.6 is the one in the alternate). This patch addresses that by: - placing the local MIDX first in the chain when calling 'prepare_multi_pack_index_one()', and - introducing a new 'get_local_multi_pack_index()', which explicitly returns the repository-local MIDX, if any. Don't impose an additional order on the MIDX's '->next' pointer beyond that the first item in the chain must be local if one exists so that we avoid a quadratic insertion. Likewise, use 'get_local_multi_pack_index()' in 'remove_redundant_pack()' to fix the formerly broken t7700.6 when run with 'GIT_TEST_MULTI_PACK_INDEX=1'. Finally, note that the MIDX ordering invariant is only preserved by the insertion order in 'prepare_packed_git()', which traverses through the ODB's '->next' pointer, meaning we visit the local object store first. This fragility makes this an undesirable long-term solution if more callers are added, but it is acceptable for now since this is the only caller. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-27packfile: compute and use the index CRC offsetbrian m. carlson1-0/+1
Both v2 pack index files and the v3 format specified as part of the NewHash work have similar data starting at the CRC table. Much of the existing code wants to read either this table or the offset entries following it, and in doing so computes the offset each time. In order to share as much code between v2 and v3, compute the offset of the CRC table and store it when the pack is opened. Use this value to compute offsets to not only the CRC table, but to the offset entries beyond it. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24packfile: drop nth_packed_object_sha1()Jeff King1-16/+7
Once upon a time, nth_packed_object_sha1() was the primary way to get the oid of a packfile's index position. But these days we have the more type-safe nth_packed_object_id() wrapper, and all callers have been converted. Let's drop the "sha1" version (turning the safer wrapper into a single function) so that nobody is tempted to introduce new callers. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24packed_object_info(): use object_id internally for delta baseJeff King1-17/+15
The previous commit changed the public interface of packed_object_info() to return a struct object_id rather than a bare hash. That enables us to convert our internal helper, as well. We can use nth_packed_object_id() directly for OFS_DELTA, but we'll still have to use oidread() to pull the hash for a REF_DELTA out of the packfile. There should be no additional cost, since we're copying directly into the object_id the caller provided us (just as we did before; it's just happening now via nth_packed_object_id()). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24packed_object_info(): use object_id for returning delta baseJeff King1-3/+3
If a caller sets the object_info.delta_base_sha1 to a non-NULL pointer, we'll write the oid of the object's delta base to it. But we can increase our type safety by switching this to a real object_id struct. All of our callers are just pointing into the hash member of an object_id anyway, so there's no inconvenience. Note that we do still keep it as a pointer-to-struct, because the NULL sentinel value tells us whether the caller is even interested in the information. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24nth_packed_object_oid(): use customary integer returnJeff King1-9/+9
Our nth_packed_object_sha1() function returns NULL for error. So when we wrapped it with nth_packed_object_oid(), we kept the same semantics. But it's a bit funny, because the caller actually passes in an out parameter, and the pointer we return is just that same struct they passed to us (or NULL). It's not too terrible, but it does make the interface a little non-idiomatic. Let's switch to our usual "0 for success, negative for error" return value. Most callers either don't check it, or are trivially converted. The one that requires the biggest change is actually improved, as we can ditch an extra aliased pointer variable. Since we are changing the interface in a subtle way that the compiler wouldn't catch, let's also change the name to catch any topics in flight. We can drop the 'o' and make it nth_packed_object_id(). That's slightly shorter, but also less redundant since the 'o' stands for "object" already. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-14Merge branch 'mt/threaded-grep-in-object-store'Junio C Hamano1-0/+34
Traditionally, we avoided threaded grep while searching in objects (as opposed to files in the working tree) as accesses to the object layer is not thread-safe. This limitation is getting lifted. * mt/threaded-grep-in-object-store: grep: use no. of cores as the default no. of threads grep: move driver pre-load out of critical section grep: re-enable threads in non-worktree case grep: protect packed_git [re-]initialization grep: allow submodule functions to run in parallel submodule-config: add skip_if_read option to repo_read_gitmodules() grep: replace grep_read_mutex by internal obj read lock object-store: allow threaded access to object reading replace-object: make replace operations thread-safe grep: fix racy calls in grep_objects() grep: fix race conditions at grep_submodule() grep: fix race conditions on userdiff calls
2020-02-14Merge branch 'jk/packfile-reuse-cleanup'Junio C Hamano1-5/+5
The way "git pack-objects" reuses objects stored in existing pack to generate its result has been improved. * jk/packfile-reuse-cleanup: pack-bitmap: don't rely on bitmap_git->reuse_objects pack-objects: add checks for duplicate objects pack-objects: improve partial packfile reuse builtin/pack-objects: introduce obj_is_packed() pack-objects: introduce pack.allowPackReuse csum-file: introduce hashfile_total() pack-bitmap: simplify bitmap_has_oid_in_uninteresting() pack-bitmap: uninteresting oid can be outside bitmapped packfile pack-bitmap: introduce bitmap_walk_contains() ewah/bitmap: introduce bitmap_word_alloc() packfile: expose get_delta_base() builtin/pack-objects: report reused packfile objects
2020-01-17grep: protect packed_git [re-]initializationMatheus Tavares1-0/+2
Some fields in struct raw_object_store are lazy initialized by the thread-unsafe packfile.c:prepare_packed_git(). Although this function is present in the call stack of git-grep threads, all paths to it are currently protected by obj_read_lock() (and the main thread usually indirectly calls it before firing the worker threads, anyway). However, it's possible that future modifications add new unprotected paths to it, introducing a race condition. Because errors derived from it wouldn't happen often, it could be hard to detect. So to prevent future headaches, let's force eager initialization of packed_git when setting git-grep up. There'll be a small overhead in the cases where we didn't really need to prepare packed_git during execution but this shouldn't be very noticeable. Also, packed_git may be re-initialized by packfile.c:reprepare_packed_git(). Again, all paths to it in git-grep are already protected by obj_read_lock() but it may suffer from the same problem in the future. So let's also internally protect it with obj_read_lock() (which is a recursive mutex). Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-17object-store: allow threaded access to object readingMatheus Tavares1-0/+32
Allow object reading to be performed by multiple threads protecting it with an internal lock, the obj_read_mutex. The lock usage can be toggled with enable_obj_read_lock() and disable_obj_read_lock(). Currently, the functions which can be safely called in parallel are: read_object_file_extended(), repo_read_object_file(), read_object_file(), read_object_with_reference(), read_object(), oid_object_info() and oid_object_info_extended(). It's also possible to use obj_read_lock() and obj_read_unlock() to protect other sections that cannot execute in parallel with object reading. Probably there are many spots in the functions listed above that could be executed unlocked (and thus, in parallel). But, for now, we are most interested in allowing parallel access to zlib inflation. This is one of the sections where object reading spends most of the time in (e.g. up to one-third of git-grep's execution time in the chromium repo corresponds to inflation) and it's already thread-safe. So, to take advantage of that, the obj_read_mutex is released when calling git_inflate() and re-acquired right after, for every calling spot in oid_object_info_extended()'s call chain. We may refine this lock to also exploit other possible parallel spots in the future, but for now, threaded zlib inflation should already give great speedups for threaded object reading callers. Note that add_delta_base_cache() was also modified to skip adding already present entries to the cache. This wasn't possible before, but it would be now, with the parallel inflation. Take for example the following situation, where two threads - A and B - are executing the code at unpack_entry(): 1. Thread A is performing the decompression of a base O (which is not yet in the cache) at PHASE II. Thread B is simultaneously trying to unpack O, but just starting at PHASE I. 2. Since O is not yet in the cache, B will go to PHASE II to also perform the decompression. 3. When they finish decompressing, one of them will get the object reading mutex and go to PHASE III while the other waits for the mutex. Let’s say A got the mutex first. 4. Thread A will add O to the cache, go throughout the rest of PHASE III and return. 5. Thread B gets the mutex, also add O to the cache (if the check wasn't there) and returns. Finally, it is also important to highlight that the object reading lock can only ensure thread-safety in the mentioned functions thanks to two complementary mechanisms: the use of 'struct raw_object_store's replace_mutex, which guards sections in the object reading machinery that would otherwise be thread-unsafe; and the 'struct pack_window's inuse_cnt, which protects window reading operations (such as the one performed during the inflation of a packed object), allowing them to execute without the acquisition of the obj_read_mutex. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-06Merge branch 'ew/packfile-syscall-optim'Junio C Hamano1-14/+2
Code cleanup. * ew/packfile-syscall-optim: packfile: replace lseek+read with pread packfile: remove redundant fcntl F_GETFD/F_SETFD
2019-12-26packfile: replace lseek+read with preadEric Wong1-3/+2
We already have pread emulation for portability, so there's there's no reason to make two syscalls where one suffices. Furthermore, readers of the packfile will be using mmap (or pread to emulate mmap), anyways, so the file description offset does not matter in this case. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-26packfile: remove redundant fcntl F_GETFD/F_SETFDEric Wong1-11/+0
git_open sets close-on-exec since cd66ada06588f797 ("sha1_file: open window into packfiles with O_CLOEXEC"). There's no reason to keep using fcntl to set the close-on-exec flag, anymore. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-03packfile.c: speed up loading lots of packfilesColin Stolley1-9/+10
When loading packfiles on start-up, we traverse the internal packfile list once per file to avoid reloading packfiles that have already been loaded. This check runs in quadratic time, so for poorly maintained repos with a large number of packfiles, it can be pretty slow. Add a hashmap containing the packfile names as we load them so that the average runtime cost of checking for already-loaded packs becomes constant. Add a perf test to p5303 to show speed-up. The existing p5303 test runtimes are dominated by other factors and do not show an appreciable speed-up. The new test in p5303 clearly exposes a speed-up in bad cases. In this test we create 10,000 packfiles and measure the start-up time of git rev-parse, which does little else besides load in the packs. Here are the numbers for the new p5303 test: Test HEAD^ HEAD --------------------------------------------------------------------- 5303.12: load 10,000 packs 1.03(0.92+0.10) 0.12(0.02+0.09) -88.3% Signed-off-by: Colin Stolley <cstolley@runbox.com> Helped-by: Jeff King <peff@peff.net> [jc: squashed the change to call hashmap in install_packed_git() by peff] Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-15Merge branch 'ew/hashmap'Junio C Hamano1-8/+14
Code clean-up of the hashmap API, both users and implementation. * ew/hashmap: hashmap_entry: remove first member requirement from docs hashmap: remove type arg from hashmap_{get,put,remove}_entry OFFSETOF_VAR macro to simplify hashmap iterators hashmap: introduce hashmap_free_entries hashmap: hashmap_{put,remove} return hashmap_entry * hashmap: use *_entry APIs for iteration hashmap_cmp_fn takes hashmap_entry params hashmap_get{,_from_hash} return "struct hashmap_entry *" hashmap: use *_entry APIs to wrap container_of hashmap_get_next returns "struct hashmap_entry *" introduce container_of macro hashmap_put takes "struct hashmap_entry *" hashmap_remove takes "const struct hashmap_entry *" hashmap_get takes "const struct hashmap_entry *" hashmap_add takes "struct hashmap_entry *" hashmap_get_next takes "const struct hashmap_entry *" hashmap_entry_init takes "struct hashmap_entry *" packfile: use hashmap_entry in delta_base_cache_entry coccicheck: detect hashmap_entry.hash assignment diff: use hashmap_entry_init on moved_entry.ent
2019-10-11Merge branch 'rs/dedup-includes'Junio C Hamano1-1/+0
Code cleanup. * rs/dedup-includes: treewide: remove duplicate #include directives
2019-10-11Merge branch 'bc/object-id-part17'Junio C Hamano1-2/+2
Preparation for SHA-256 upgrade continues. * bc/object-id-part17: (26 commits) midx: switch to using the_hash_algo builtin/show-index: replace sha1_to_hex rerere: replace sha1_to_hex builtin/receive-pack: replace sha1_to_hex builtin/index-pack: replace sha1_to_hex packfile: replace sha1_to_hex wt-status: convert struct wt_status to object_id cache: remove null_sha1 builtin/worktree: switch null_sha1 to null_oid builtin/repack: write object IDs of the proper length pack-write: use hash_to_hex when writing checksums sequencer: convert to use the_hash_algo bisect: switch to using the_hash_algo sha1-lookup: switch hard-coded constants to the_hash_algo config: use the_hash_algo in abbrev comparison combine-diff: replace GIT_SHA1_HEXSZ with the_hash_algo bundle: switch to use the_hash_algo connected: switch GIT_SHA1_HEXSZ to the_hash_algo show-index: switch hard-coded constants to the_hash_algo blame: remove needless comparison with GIT_SHA1_HEXSZ ...
2019-10-07hashmap_cmp_fn takes hashmap_entry paramsEric Wong1-2/+7
Another step in eliminating the requirement of hashmap_entry being the first member of a struct. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_get{,_from_hash} return "struct hashmap_entry *"Eric Wong1-2/+3
Update callers to use hashmap_get_entry, hashmap_get_entry_from_hash or container_of as appropriate. This is another step towards eliminating the requirement of hashmap_entry being the first field in a struct. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_remove takes "const struct hashmap_entry *"Eric Wong1-1/+1
This is less error-prone than "const void *" as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_add takes "struct hashmap_entry *"Eric Wong1-1/+1
This is less error-prone than "void *" as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_entry_init takes "struct hashmap_entry *"Eric Wong1-1/+1
C compilers do type checking to make life easier for us. So rely on that and update all hashmap_entry_init callers to take "struct hashmap_entry *" to avoid future bugs while improving safety and readability. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07packfile: use hashmap_entry in delta_base_cache_entryEric Wong1-1/+1
This hashmap_entry_init function is intended to take a hashmap_entry struct pointer, not a hashmap struct pointer. This was not noticed because hashmap_entry_init takes a "void *" arg instead of "struct hashmap_entry *", and the hashmap struct is larger and can be cast into a hashmap_entry struct without data corruption. This has the beneficial side effect of reducing the size of a delta_base_cache_entry from 104 bytes to 72 bytes on 64-bit systems. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-04treewide: remove duplicate #include directivesRené Scharfe1-1/+0
Found with "git grep '^#include ' '*.c' | sort | uniq -d". Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-30Merge branch 'rs/get-tagged-oid'Junio C Hamano1-1/+1
Code cleanup. * rs/get-tagged-oid: use get_tagged_oid() tag: factor out get_tagged_oid()
2019-09-18Merge branch 'cc/multi-promisor'Junio C Hamano1-1/+2
Teach the lazy clone machinery that there can be more than one promisor remote and consult them in order when downloading missing objects on demand. * cc/multi-promisor: Move core_partial_clone_filter_default to promisor-remote.c Move repository_format_partial_clone to promisor-remote.c Remove fetch-object.{c,h} in favor of promisor-remote.{c,h} remote: add promisor and partial clone config to the doc partial-clone: add multiple remotes in the doc t0410: test fetching from many promisor remotes builtin/fetch: remove unique promisor remote limitation promisor-remote: parse remote.*.partialclonefilter Use promisor_remote_get_direct() and has_promisor_remote() promisor-remote: use repository_format_partial_clone promisor-remote: add promisor_remote_reinit() promisor-remote: implement promisor_remote_get_direct() Add initial support for many promisor remotes fetch-object: make functions return an error code t0410: remove pipes after git commands
2019-09-13packfile: expose get_delta_base()Jeff King1-5/+5
In a following commit get_delta_base() will be used outside packfile.c, so let's make it non static and declare it in packfile.h. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-05use get_tagged_oid()René Scharfe1-1/+1
Avoid derefencing ->tagged without checking for NULL by using the convenience wrapper for getting the ID of the tagged object. It die()s when encountering a broken tag instead of segfaulting. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-19packfile: replace sha1_to_hexbrian m. carlson1-2/+2
Replace a use of sha1_to_hex with hash_to_hex so that this code works with a hash algorithm other than SHA-1. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-13packfile: drop release_pack_memory()Jeff King1-18/+0
Long ago, in 97bfeb34df (Release pack windows before reporting out of memory., 2006-12-24), we taught xmalloc() and friends to try unmapping pack windows when malloc() failed. It's unlikely that his helps a lot in practice, and it has some downsides. First, the downsides: 1. It makes xmalloc() not thread-safe. We've worked around this in pack-objects.c, which installs its own locking version of the try_to_free_routine(). But other threaded code doesn't. 2. It makes the system as a whole harder to reason about. Functions which allocate heap memory under the hood may have farther-reaching effects than expected. That might be worth the tradeoff if there's a benefit. But in practice, it seems unlikely. We're generally dealing with mmap'd files, so the OS is going to do a much better job at responding to memory pressure by dropping individual pages (the exception is systems with NO_MMAP, but even there the OS can probably respond just as well with swapping). So the only thing we're really freeing is address space. On 64-bit systems, we have plenty of that to go around. On 32-bit systems, it could possibly help. But around the same time we made two other changes: 77ccc5bbd1 (Introduce new config option for mmap limit., 2006-12-23) and 60bb8b1453 (Fully activate the sliding window pack access., 2006-12-23). Together that means that a 32-bit system should have no more than 256MB total of packed-git mmaps at one time, split between a few 32MB windows. It's unlikely we have any address space problems since then, but we don't have any data since the features were all added at the same time. Likewise, xmmap() will try to free memory. At first glance, it seems like we'd need this (when we try to mmap a new window, we might need to close an old one to save address space on a 32-bit system). But we're saved again by core.packedGitLimit: if we're going to exceed our 256MB limit, we'll close an existing window before we even call mmap(). So it seems unlikely that this feature is actually doing anything useful. And while we don't have reports of it harming anything (probably because it rarely if ever kicks in), it would be nice to simplify the system overall. This patch drops the whole try_to_free system from xmalloc(), as well as the manual pack memory release in xmmap(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-29Merge branch 'ds/close-object-store' into maintJunio C Hamano1-1/+4
The commit-graph file is now part of the "files that the runtime may keep open file descriptors on, all of which would need to be closed when done with the object store", and the file descriptor to an existing commit-graph file now is closed before "gc" finalizes a new instance to replace it. * ds/close-object-store: packfile: rename close_all_packs to close_object_store packfile: close commit-graph in close_all_packs commit-graph: use raw_object_store when closing commit-graph: extract write_commit_graph_file() commit-graph: extract copy_oids_to_commits() commit-graph: extract count_distinct_commits() commit-graph: extract fill_oids_from_all_packs() commit-graph: extract fill_oids_from_commit_hex() commit-graph: extract fill_oids_from_packs() commit-graph: create write_commit_graph_context commit-graph: remove Future Work section commit-graph: collapse parameters into flags commit-graph: return with errors during write commit-graph: fix the_repository reference
2019-07-29Merge branch 'rs/copy-array' into maintJunio C Hamano1-3/+3
Code clean-up. * rs/copy-array: use COPY_ARRAY for copying arrays coccinelle: use COPY_ARRAY for copying arrays
2019-07-25Merge branch 'mh/import-transport-fd-fix' into maintJunio C Hamano1-1/+1
The ownership rule for the file descriptor to fast-import remote backend was mixed up, leading to unrelated file descriptor getting closed, which has been fixed. * mh/import-transport-fd-fix: Use xmmap_gently instead of xmmap in use_pack dup() the input fd for fast-import used for remote helpers
2019-07-19Merge branch 'ds/midx-expire-repack'Junio C Hamano1-0/+28
"git multi-pack-index" learned expire and repack subcommands. * ds/midx-expire-repack: t5319: use 'test-tool path-utils' instead of 'ls -l' t5319-multi-pack-index.sh: test batch size zero midx: add test that 'expire' respects .keep files multi-pack-index: test expire while adding packs midx: implement midx_repack() multi-pack-index: prepare 'repack' subcommand multi-pack-index: implement 'expire' subcommand midx: refactor permutation logic and pack sorting midx: simplify computation of pack name lengths multi-pack-index: prepare for 'expire' subcommand Docs: rearrange subcommands for multi-pack-index repack: refactor pack deletion for future use
2019-07-09Merge branch 'rs/copy-array'Junio C Hamano1-3/+3
Code clean-up. * rs/copy-array: use COPY_ARRAY for copying arrays coccinelle: use COPY_ARRAY for copying arrays
2019-07-09Merge branch 'ds/close-object-store'Junio C Hamano1-1/+4
The commit-graph file is now part of the "files that the runtime may keep open file descriptors on, all of which would need to be closed when done with the object store", and the file descriptor to an existing commit-graph file now is closed before "gc" finalizes a new instance to replace it. * ds/close-object-store: packfile: rename close_all_packs to close_object_store packfile: close commit-graph in close_all_packs commit-graph: use raw_object_store when closing
2019-06-25Use promisor_remote_get_direct() and has_promisor_remote()Christian Couder1-1/+2
Instead of using the repository_format_partial_clone global and fetch_objects() directly, let's use has_promisor_remote() and promisor_remote_get_direct(). This way all the configured promisor remotes will be taken into account, not only the one specified by extensions.partialClone. Also when cloning or fetching using a partial clone filter, remote.origin.promisor will be set to "true" instead of setting extensions.partialClone to "origin". This makes it possible to use many promisor remote just by fetching from them. Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-17use COPY_ARRAY for copying arraysRené Scharfe1-3/+3
Convert calls of memcpy(3) to use COPY_ARRAY, which shortens and simplifies the code a bit. Patch generated by Coccinelle and contrib/coccinelle/array.cocci. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-13Merge branch 'mh/import-transport-fd-fix'Junio C Hamano1-1/+1
The ownership rule for the file descriptor to fast-import remote backend was mixed up, leading to unrelated file descriptor getting closed, which has been fixed. * mh/import-transport-fd-fix: Use xmmap_gently instead of xmmap in use_pack dup() the input fd for fast-import used for remote helpers
2019-06-12packfile: rename close_all_packs to close_object_storeDerrick Stolee1-1/+1
The close_all_packs() method is now responsible for more than just pack-files. It also closes the commit-graph and the multi-pack-index. Rename the function to be more descriptive of its larger role. The name also fits because the input parameter is a raw_object_store. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12packfile: close commit-graph in close_all_packsDerrick Stolee1-0/+3
The close_all_packs() method is used to close all read handles to pack-files and the multi-pack-index before running 'git gc --auto'. This is particularly important on the Windows platform, where read handles block any writes to those files. Replacing one of these files with a rename() will fail in this situation. The commit-graph also performs a rename, so is susceptable to this problem. We are careful to close the commit-graph before writing, but that doesn't work when a 'git fetch' (or similar) process runs 'git gc --auto' which may write a commit-graph. Here, close the commit-graph as part of close_all_packs(). Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11repack: refactor pack deletion for future useDerrick Stolee1-0/+28
The repack builtin deletes redundant pack-files and their associated .idx, .promisor, .bitmap, and .keep files. We will want to re-use this logic in the future for other types of repack, so pull the logic into 'unlink_pack_path()' in packfile.c. The 'ignore_keep' parameter is enabled for the use in repack, but will be important for a future caller. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-19Merge branch 'ds/midx-too-many-packs'Junio C Hamano1-21/+9
The code to generate the multi-pack idx file was not prepared to see too many packfiles and ran out of open file descriptor, which has been corrected. * ds/midx-too-many-packs: midx: add packs to packed_git linked list midx: pass a repository pointer
2019-05-16Use xmmap_gently instead of xmmap in use_packMike Hommey1-1/+1
use_pack has its own error message on mmap error, but it can't be reached when using xmmap, which dies with its own error. Signed-off-by: Mike Hommey <mh@glandium.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-09Merge branch 'nd/sha1-name-c-wo-the-repository'Junio C Hamano1-7/+7
Further code clean-up to allow the lowest level of name-to-object mapping layer to work with a passed-in repository other than the default one. * nd/sha1-name-c-wo-the-repository: (34 commits) sha1-name.c: remove the_repo from get_oid_mb() sha1-name.c: remove the_repo from other get_oid_* sha1-name.c: remove the_repo from maybe_die_on_misspelt_object_name submodule-config.c: use repo_get_oid for reading .gitmodules sha1-name.c: add repo_get_oid() sha1-name.c: remove the_repo from get_oid_with_context_1() sha1-name.c: remove the_repo from resolve_relative_path() sha1-name.c: remove the_repo from diagnose_invalid_index_path() sha1-name.c: remove the_repo from handle_one_ref() sha1-name.c: remove the_repo from get_oid_1() sha1-name.c: remove the_repo from get_oid_basic() sha1-name.c: remove the_repo from get_describe_name() sha1-name.c: remove the_repo from get_oid_oneline() sha1-name.c: add repo_interpret_branch_name() sha1-name.c: remove the_repo from interpret_branch_mark() sha1-name.c: remove the_repo from interpret_nth_prior_checkout() sha1-name.c: remove the_repo from get_short_oid() sha1-name.c: add repo_for_each_abbrev() sha1-name.c: store and use repo in struct disambiguate_state sha1-name.c: add repo_find_unique_abbrev_r() ...
2019-05-07midx: add packs to packed_git linked listDerrick Stolee1-20/+8
The multi-pack-index allows searching for objects across multiple packs using one object list. The original design gains many of these performance benefits by keeping the packs in the multi-pack-index out of the packed_git list. Unfortunately, this has one major drawback. If the multi-pack-index covers thousands of packs, and a command loads many of those packs, then we can hit the limit for open file descriptors. The close_one_pack() method is used to limit this resource, but it only looks at the packed_git list, and uses an LRU cache to prevent thrashing. Instead of complicating this close_one_pack() logic to include direct references to the multi-pack-index, simply add the packs opened by the multi-pack-index to the packed_git list. This immediately solves the file-descriptor limit problem, but requires some extra steps to avoid performance issues or other problems: 1. Create a multi_pack_index bit in the packed_git struct that is one if and only if the pack was loaded from a multi-pack-index. 2. Skip packs with the multi_pack_index bit when doing object lookups and abbreviations. These algorithms already check the multi-pack-index before the packed_git struct. This has a very small performance hit, as we need to walk more packed_git structs. This is acceptable, since these operations run binary search on the other packs, so this walk-and-ignore logic is very fast by comparison. 3. When closing a multi-pack-index file, do not close its packs, as those packs will be closed using close_all_packs(). In some cases, such as 'git repack', we run 'close_midx()' without also closing the packs, so we need to un-set the multi_pack_index bit in those packs. This is necessary, and caught by running t6501-freshen-objects.sh with GIT_TEST_MULTI_PACK_INDEX=1. To manually test this change, I inserted trace2 logging into close_pack_fd() and set pack_max_fds to 10, then ran 'git rev-list --all --objects' on a copy of the Git repo with 300+ pack-files and a multi-pack-index. The logs verified the packs are closed as we read them beyond the file descriptor limit. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-07midx: pass a repository pointerDerrick Stolee1-2/+2
Much of the multi-pack-index code focuses on the multi_pack_index struct, and so we only pass a pointer to the current one. However, we will insert a dependency on the packed_git linked list in a future change, so we will need a repository reference. Inserting these parameters is a significant enough change to split out. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-25Merge branch 'bc/hash-transition-16'Junio C Hamano1-3/+3
Conversion from unsigned char[20] to struct object_id continues. * bc/hash-transition-16: (35 commits) gitweb: make hash size independent Git.pm: make hash size independent read-cache: read data in a hash-independent way dir: make untracked cache extension hash size independent builtin/difftool: use parse_oid_hex refspec: make hash size independent archive: convert struct archiver_args to object_id builtin/get-tar-commit-id: make hash size independent get-tar-commit-id: parse comment record hash: add a function to lookup hash algorithm by length remote-curl: make hash size independent http: replace sha1_to_hex http: compute hash of downloaded objects using the_hash_algo http: replace hard-coded constant with the_hash_algo http-walker: replace sha1_to_hex http-push: remove remaining uses of sha1_to_hex http-backend: allow 64-character hex names http-push: convert to use the_hash_algo builtin/pull: make hash-size independent builtin/am: make hash size independent ...
2019-04-25Merge branch 'jk/server-info-rabbit-hole'Junio C Hamano1-3/+15
Code clean-up around a much-less-important-than-it-used-to-be update_server_info() funtion. * jk/server-info-rabbit-hole: update_info_refs(): drop unused force parameter server-info: drop objdirlen pointer arithmetic server-info: drop nr_alloc struct member server-info: use strbuf to read old info/packs file server-info: simplify cleanup in parse_pack_def() server-info: fix blind pointer arithmetic http: simplify parsing of remote objects/info/packs packfile: fix pack basename computation midx: check both pack and index names for containment t5319: drop useless --buffer from cat-file t5319: fix bogus cat-file argument pack-revindex: open index if necessary packfile.h: drop extern from function declarations
2019-04-16packfile: fix pack basename computationJeff King1-1/+11
When we have a multi-pack-index that covers many packfiles, we try to avoid opening the .idx for those packfiles. To do that we feed the pack name to midx_contains_pack(). But that function wants to see only the basename, which we compute using strrchr() to find the final slash. But that leaves an extra "/" at the start of our string. We can fix this by incrementing the pointer. That also raises the question of what to do when the name does not have a '/' at all. This should generally not happen (we always find files in "pack/"), but it doesn't hurt to be defensive here. Let's wrap all of that up in a helper function and make it publicly available, since a later patch will need to use it, too. The tests don't notice because there's nothing about opening those .idx files that would cause us to give incorrect output. It's just a little slower. The new test checks this case by corrupting the covered .idx, and then making sure we don't complain about it. We also have to tweak t5570, which intentionally corrupts a .idx file and expects us to notice it. When run with GIT_TEST_MULTI_PACK_INDEX, this will fail since we now will (correctly) not bother opening the .idx at all. We can fix that by unconditionally dropping any midx that's there, which ensures we'll have to read the .idx. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-16pack-revindex: open index if necessaryJeff King1-2/+4
We can't create a pack revindex if we haven't actually looked at the index. Normally we would never get as far as creating a revindex without having already been looking in the pack, so this code never bothered to double-check that pack->index_data had been loaded. But with the new multi-pack-index feature, many code paths might not load the individual pack .idx at all (they'd find objects via the midx and then open the .pack, but not its index). This can't yet be triggered in practice, because a bug in the midx code means we accidentally open up the individual .idx files anyway. But in preparation for fixing that, let's have the revindex code check that everything it needs has been loaded. In most cases this will just be a quick noop. But note that this does introduce a possibility of error (if we have to open the index and it's corrupt), so load_pack_revindex() now returns a result code, and callers need to handle the error. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-08packfile.c: add repo_approximate_object_count()Nguyễn Thái Ngọc Duy1-7/+7
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01object-store: rename and expand packed_git's sha1 memberbrian m. carlson1-3/+3
This member is used to represent the pack checksum of the pack in question. Expand this member to be GIT_MAX_RAWSZ bytes in length so it works with longer hashes and rename it to be "hash" instead of "sha1". This transformation was made with a change to the definition and the following semantic patch: @@ struct packed_git *E1; @@ - E1->sha1 + E1->hash @@ struct packed_git E1; @@ - E1.sha1 + E1.hash Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-22midx: during verify group objects by packfile to speed verificationJeff Hostetler1-1/+1
Teach `multi-pack-index verify` to sort the set of object by packfile so that only one packfile needs to be open at a time. This is a performance improvement. Previously, objects were verified in OID order. This essentially requires all packfiles to be open at the same time. If the number of packfiles exceeds the open file limit, packfiles would be LRU-closed and re-opened many times. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-05Merge branch 'sb/more-repo-in-api'Junio C Hamano1-2/+3
The in-core repository instances are passed through more codepaths. * sb/more-repo-in-api: (23 commits) t/helper/test-repository: celebrate independence from the_repository path.h: make REPO_GIT_PATH_FUNC repository agnostic commit: prepare free_commit_buffer and release_commit_memory for any repo commit-graph: convert remaining functions to handle any repo submodule: don't add submodule as odb for push submodule: use submodule repos for object lookup pretty: prepare format_commit_message to handle arbitrary repositories commit: prepare logmsg_reencode to handle arbitrary repositories commit: prepare repo_unuse_commit_buffer to handle any repo commit: prepare get_commit_buffer to handle any repo commit-reach: prepare in_merge_bases[_many] to handle any repo commit-reach: prepare get_merge_bases to handle any repo commit-reach.c: allow get_merge_bases_many_0 to handle any repo commit-reach.c: allow remove_redundant to handle any repo commit-reach.c: allow merge_bases_many to handle any repo commit-reach.c: allow paint_down_to_common to handle any repo commit: allow parse_commit* to handle any repo object: parse_object to honor its repository argument object-store: prepare has_{sha1, object}_file to handle any repo object-store: prepare read_object_file to deal with any repo ...
2019-01-29Merge branch 'bc/tree-walk-oid'Junio C Hamano1-1/+1
The code to walk tree objects has been taught that we may be working with object names that are not computed with SHA-1. * bc/tree-walk-oid: cache: make oidcpy always copy GIT_MAX_RAWSZ bytes tree-walk: store object_id in a separate member match-trees: use hashcpy to splice trees match-trees: compute buffer offset correctly when splicing tree-walk: copy object ID before use
2019-01-15tree-walk: store object_id in a separate memberbrian m. carlson1-1/+1
When parsing a tree, we read the object ID directly out of the tree buffer. This is normally fine, but such an object ID cannot be used with oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1, there may not be that many bytes to copy. Instead, store the object ID in a separate struct member. Since we can no longer efficiently compute the path length, store that information as well in struct name_entry. Ensure we only copy the object ID into the new buffer if the path length is nonzero, as some callers will pass us an empty path with no object ID following it, and we will not want to read past the end of the buffer. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08object-store: factor out odb_clear_loose_cache()René Scharfe1-5/+2
Add and use a function for emptying the loose object cache, so callers don't have to know any of its implementation details. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-04Merge branch 'jk/loose-object-cache'Junio C Hamano1-6/+14
Code clean-up with optimization for the codepath that checks (non-)existence of loose objects. * jk/loose-object-cache: odb_load_loose_cache: fix strbuf leak fetch-pack: drop custom loose object cache sha1-file: use loose object cache for quick existence check object-store: provide helpers for loose_objects_cache sha1-file: use an object_directory for the main object dir handle alternates paths the same as the main object dir sha1_file_name(): overwrite buffer instead of appending rename "alternate_object_database" to "object_directory" submodule--helper: prefer strip_suffix() to ends_with() fsck: do not reuse child_process structs
2018-11-13Merge branch 'ds/test-multi-pack-index'Junio C Hamano1-0/+5
Tests for the recently introduced multi-pack index machinery. * ds/test-multi-pack-index: packfile: close multi-pack-index in close_all_packs multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX midx: close multi-pack-index on repack midx: fix broken free() in close_midx()
2018-11-13object-store: provide helpers for loose_objects_cacheJeff King1-0/+8
Our object_directory struct has a loose objects cache that all users of the struct can see. But the only one that knows how to load the cache is find_short_object_filename(). Let's extract that logic in to a reusable function. While we're at it, let's also reset the cache when we re-read the object directories. This shouldn't have an impact on performance, as re-reads are meant to be rare (and are already expensive, so we avoid them with things like OBJECT_INFO_QUICK). Since the cache is already meant to be an approximation, it's tempting to skip even this bit of safety. But it's necessary to allow more code to use it. For instance, fetch-pack explicitly re-reads the object directory after performing its fetch, and would be confused if we didn't clear the cache. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13sha1-file: use an object_directory for the main object dirJeff King1-5/+5
Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13rename "alternate_object_database" to "object_directory"Jeff King1-4/+4
In preparation for unifying the handling of alt odb's and the normal repo object directory, let's use a more neutral name. This patch is purely mechanical, swapping the type name, and converting any variables named "alt" to "odb". There should be no functional change, but it will reduce the noise in subsequent diffs. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-30Merge branch 'bc/hash-transition-part-15'Junio C Hamano1-2/+3
More codepaths are moving away from hardcoded hash sizes. * bc/hash-transition-part-15: rerere: convert to use the_hash_algo submodule: make zero-oid comparison hash function agnostic apply: rename new_sha1_prefix and old_sha1_prefix apply: replace hard-coded constants tag: express constant in terms of the_hash_algo transport: use parse_oid_hex instead of a constant upload-pack: express constants in terms of the_hash_algo refs/packed-backend: express constants using the_hash_algo packfile: express constants in terms of the_hash_algo pack-revindex: express constants in terms of the_hash_algo builtin/fetch-pack: remove constants with parse_oid_hex builtin/mktree: remove hard-coded constant builtin/repack: replace hard-coded constants pack-bitmap-write: use GIT_MAX_RAWSZ for allocation object_id.cocci: match only expressions of type 'struct object_id'
2018-10-26packfile: close multi-pack-index in close_all_packsDerrick Stolee1-0/+5
Whenever we delete pack-files from the object directory, we need to also delete the multi-pack-index that may refer to those objects. Sometimes, this connection is obvious, like during a repack. Other times, this is less obvious, like when gc calls a repack command and then does other actions on the objects, like write a commit-graph file. The pattern we use to avoid out-of-date in-memory packed_git structs is to call close_all_packs(). This should also call close_midx(). Since we already pass an object store to close_all_packs(), this is a nicely scoped operation. This fixes a test failure when running t6500-gc.sh with GIT_TEST_MULTI_PACK_INDEX=1. Reported-by: Szeder Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-19packfile: allow has_packed_and_bad to handle arbitrary repositoriesStefan Beller1-2/+3
has_packed_and_bad is not widely used, so just migrate it all at once. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-15fuzz: add fuzz testing for packfile indices.Josh Steadmon1-19/+25
Breaks the majority of check_packed_git_idx() into a separate function, load_idx(). The latter function operates on arbitrary buffers, which makes it suitable as a fuzzing test target. Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-15packfile: express constants in terms of the_hash_algobrian m. carlson1-2/+3
Replace uses of GIT_SHA1_RAWSZ with references to the_hash_algo to avoid dependence on a particular hash length. It's likely that in the future, we'll update the pack format to indicate what hash algorithm it uses, and then this code will change. However, at least on an interim basis, make it easier to develop on a pure SHA-256 Git by using the_hash_algo here. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17Merge branch 'jk/cocci'Junio C Hamano1-6/+6
spatch transformation to replace boolean uses of !hashcmp() to newly introduced oideq() is added, and applied, to regain performance lost due to support of multiple hash algorithms. * jk/cocci: show_dirstat: simplify same-content check read-cache: use oideq() in ce_compare functions convert hashmap comparison functions to oideq() convert "hashcmp() != 0" to "!hasheq()" convert "oidcmp() != 0" to "!oideq()" convert "hashcmp() == 0" to hasheq() convert "oidcmp() == 0" to oideq() introduce hasheq() and oideq() coccinelle: use <...> for function exclusion
2018-08-29convert "hashcmp() != 0" to "!hasheq()"Jeff King1-1/+1
This rounds out the previous three patches, covering the inequality logic for the "hash" variant of the functions. As with the previous three, the accompanying code changes are the mechanical result of applying the coccinelle patch; see those patches for more discussion. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29convert "hashcmp() == 0" to hasheq()Jeff King1-5/+5
This is the partner patch to the previous one, but covering the "hash" variants instead of "oid". Note that our coccinelle rule is slightly more complex to avoid triggering the call in hasheq(). I didn't bother to add a new rule to convert: - hasheq(E1->hash, E2->hash) + oideq(E1, E2) Since these are new functions, there won't be any such existing callers. And since most of the code is already using oideq, we're not likely to introduce new ones. We might still see "!hashcmp(E1->hash, E2->hash)" from topics in flight. But because our new rule comes after the existing ones, that should first get converted to "!oidcmp(E1, E2)" and then to "oideq(E1, E2)". Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20treewide: use get_all_packsDerrick Stolee1-1/+1
There are many places in the codebase that want to iterate over all packfiles known to Git. The purposes are wide-ranging, and those that can take advantage of the multi-pack-index already do. So, use get_all_packs() instead of get_packed_git() to be sure we are iterating over all packfiles. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20packfile: add all_packs listDerrick Stolee1-0/+27
If a repo contains a multi-pack-index, then the packed_git list does not contain the packfiles that are covered by the multi-pack-index. This is important for doing object lookups, abbreviations, and approximating object count. However, there are many operations that really want to iterate over all packfiles. Create a new 'all_packs' linked list that contains this list, starting with the packfiles in the multi-pack-index and then continuing along the packed_git linked list. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20midx: stop reporting garbageDerrick Stolee1-3/+4
When prepare_packed_git is called with the report_garbage method initialized, we report unexpected files in the objects directory as garbage. Stop reporting the multi-pack-index and the pack-files it covers as garbage. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20multi-pack-index: store local propertyDerrick Stolee1-2/+2
A pack-file is 'local' if it is stored within the usual object directory. If it is stored in an alternate, it is non-local. Pack-files are stored using a 'pack_local' member in the packed_git struct. Add a similar 'local' member to the multi_pack_index struct and 'local' parameters to the methods that load and prepare multi- pack-indexes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20Sync 'ds/multi-pack-index' to v2.19.0-rc0Junio C Hamano1-42/+127
* ds/multi-pack-index: (23 commits) midx: clear midx on repack packfile: skip loading index if in multi-pack-index midx: prevent duplicate packfile loads midx: use midx in approximate_object_count midx: use existing midx when writing new one midx: use midx in abbreviation calculations midx: read objects from multi-pack-index config: create core.multiPackIndex setting midx: write object offsets midx: write object id fanout chunk midx: write object ids in a chunk midx: sort and deduplicate objects from packfiles midx: read pack names into array multi-pack-index: write pack names in chunk multi-pack-index: read packfile list packfile: generalize pack directory list t5319: expand test data multi-pack-index: load into memory midx: write header information to lockfile multi-pack-index: add 'write' verb ...
2018-08-13for_each_packed_object: support iterating in pack-orderJeff King1-5/+16
We currently iterate over objects within a pack in .idx order, which uses the object hashes. That means that it is effectively random with respect to the location of the object within the pack. If you're going to access the actual object data, there are two reasons to move linearly through the pack itself: 1. It improves the locality of access in the packfile. In the cold-cache case, this may mean fewer disk seeks, or better usage of disk cache. 2. We store related deltas together in the packfile. Which means that the delta base cache can operate much more efficiently if we visit all of those related deltas in sequence, as the earlier items are likely to still be in the cache. Whereas if we visit the objects in random order, our cache entries are much more likely to have been evicted by unrelated deltas in the meantime. So in general, if you're going to access the object contents pack order is generally going to end up more efficient. But if you're simply generating a list of object names, or if you're going to end up sorting the result anyway, you're better off just using the .idx order, as finding the pack order means generating the in-memory pack-revindex. According to the numbers in 8b8dfd5132 (pack-revindex: radix-sort the revindex, 2013-07-11), that takes about 200ms for linux.git, and 20ms for git.git (those numbers are a few years old but are still a good ballpark). That makes it a good optimization for some cases (we can save tens of seconds in git.git by having good locality of delta access, for a 20ms cost), but a bad one for others (e.g., right now "cat-file --batch-all-objects --batch-check="%(objectname)" is 170ms in git.git, so adding 20ms to that is noticeable). Hence this patch makes it an optional flag. You can't actually do any interesting timings yet, as it's not plumbed through to any user-facing tools like cat-file. That will come in a later patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-13for_each_*_object: take flag arguments as enumJeff King1-1/+2
It's not wrong to pass our flags in an "unsigned", as we know it will be at least as large as the enum. However, using the enum in the declaration makes it more obvious where to find the list of flags. While we're here, let's also drop the "extern" noise-words from the declarations, per our modern coding style. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20packfile: skip loading index if in multi-pack-indexDerrick Stolee1-2/+17
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: prevent duplicate packfile loadsDerrick Stolee1-0/+9
The multi-pack-index, when present, tracks the existence of objects and their offsets within a list of packfiles. This allows us to use the multi-pack-index for object lookups, abbreviations, and object counts. When the multi-pack-index tracks a packfile, then we do not need to add that packfile to the packed_git linked list or the MRU list. We still need to load the packfiles that are not tracked by the multi-pack-index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: use midx in approximate_object_countDerrick Stolee1-0/+3
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: use midx in abbreviation calculationsDerrick Stolee1-0/+6
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: read objects from multi-pack-indexDerrick Stolee1-1/+7
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20config: create core.multiPackIndex settingDerrick Stolee1-1/+5
The core.multiPackIndex config setting controls the multi-pack- index (MIDX) feature. If false, the setting will disable all reads from the multi-pack-index file. Read this config setting in the new prepare_multi_pack_index_one() which is called during prepare_packed_git(). This check is run once per repository. Add comparison commands in t5319-multi-pack-index.sh to check typical Git behavior remains the same as the config setting is turned on and off. This currently includes 'git rev-list' and 'git log' commands to trigger several object database reads. Currently, these would only catch an error in the prepare_multi_pack_index_one(), but with later commits will catch errors in object lookups, abbreviations, and approximate object counts. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: sort and deduplicate objects from packfilesDerrick Stolee1-0/+17
Before writing a list of objects and their offsets to a multi-pack-index, we need to collect the list of objects contained in the packfiles. There may be multiple copies of some objects, so this list must be deduplicated. It is possible to artificially get into a state where there are many duplicate copies of objects. That can create high memory pressure if we are to create a list of all objects before de-duplication. To reduce this memory pressure without a significant performance drop, automatically group objects by the first byte of their object id. Use the IDX fanout tables to group the data, copy to a local array, then sort. Copy only the de-duplicated entries. Select the duplicate based on the most-recent modified time of a packfile containing the object. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20packfile: generalize pack directory listDerrick Stolee1-38/+63
In anticipation of sharing the pack directory listing with the multi-pack-index, generalize prepare_packed_git_one() into for_each_file_in_pack_dir(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29object: add repository argument to parse_objectStefan Beller1-1/+1
Add a repository argument to allow the callers of parse_object to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-18Merge branch 'jl/zlib-restore-nul-termination'Junio C Hamano1-0/+3
Make zlib inflate codepath more robust against versions of zlib that clobber unused portion of outbuf. * jl/zlib-restore-nul-termination: packfile: correct zlib buffer handling
2018-06-13packfile: correct zlib buffer handlingJeremy Linton1-0/+3
The buffer being passed to zlib includes a NUL terminator that git needs to keep in place. unpack_compressed_entry() attempts to detect the case that the source buffer hasn't been fully consumed by checking to see if the destination buffer has been over consumed. This causes a problem, that more recent zlib patches have been poisoning the unconsumed portions of the buffer which overwrites the NUL byte, while correctly returning length and status. Let's place the NUL at the end of the buffer after inflate returns to assure that it doesn't result in problems for git even if its been overwritten by zlib. Signed-off-by: Jeremy Linton <lintonrjeremy@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-30Merge branch 'bc/object-id'Junio C Hamano1-36/+43
Conversion from uchar[20] to struct object_id continues. * bc/object-id: (42 commits) merge-one-file: compute empty blob object ID add--interactive: compute the empty tree value Update shell scripts to compute empty tree object ID sha1_file: only expose empty object constants through git_hash_algo dir: use the_hash_algo for empty blob object ID sequencer: use the_hash_algo for empty tree object ID cache-tree: use is_empty_tree_oid sha1_file: convert cached object code to struct object_id builtin/reset: convert use of EMPTY_TREE_SHA1_BIN builtin/receive-pack: convert one use of EMPTY_TREE_SHA1_HEX wt-status: convert two uses of EMPTY_TREE_SHA1_HEX submodule: convert several uses of EMPTY_TREE_SHA1_HEX sequencer: convert one use of EMPTY_TREE_SHA1_HEX merge: convert empty tree constant to the_hash_algo builtin/merge: switch tree functions to use object_id builtin/am: convert uses of EMPTY_TREE_SHA1_BIN to the_hash_algo sha1-file: add functions for hex empty tree and blob OIDs builtin/receive-pack: avoid hard-coded constants for push certs diff: specify abbreviation size in terms of the_hash_algo upload-pack: replace use of several hard-coded constants ...
2018-05-30Merge branch 'js/use-bug-macro'Junio C Hamano1-3/+3
Developer support update, by using BUG() macro instead of die() to mark codepaths that should not happen more clearly. * js/use-bug-macro: BUG_exit_code: fix sparse "symbol not declared" warning Convert remaining die*(BUG) messages Replace all die("BUG: ...") calls by BUG() ones run-command: use BUG() to report bugs, not die() test-tool: help verifying BUG() code paths
2018-05-23Merge branch 'sb/oid-object-info'Junio C Hamano1-17/+23
The codepath around object-info API has been taught to take the repository object (which in turn tells the API which object store the objects are to be located). * sb/oid-object-info: cache.h: allow oid_object_info to handle arbitrary repositories packfile: add repository argument to cache_or_unpack_entry packfile: add repository argument to unpack_entry packfile: add repository argument to read_object packfile: add repository argument to packed_object_info packfile: add repository argument to packed_to_object_type packfile: add repository argument to retry_bad_packed_offset cache.h: add repository argument to oid_object_info cache.h: add repository argument to oid_object_info_extended
2018-05-23Merge branch 'ds/lazy-load-trees'Junio C Hamano1-1/+1
The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so. * ds/lazy-load-trees: coccinelle: avoid wrong transformation suggestions from commit.cocci commit-graph: lazy-load trees for commits treewide: replace maybe_tree with accessor methods commit: create get_commit_tree() method treewide: rename tree to maybe_tree
2018-05-08Merge branch 'ds/commit-graph'Junio C Hamano1-2/+2
Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. * ds/commit-graph: commit-graph: implement "--append" option commit-graph: build graph from starting commits commit-graph: read only from specific pack-indexes commit: integrate commit graph with commit parsing commit-graph: close under reachability commit-graph: add core.commitGraph setting commit-graph: implement git commit-graph read commit-graph: implement git-commit-graph write commit-graph: implement write_commit_graph() commit-graph: create git-commit-graph builtin graph: add commit graph design document commit-graph: add format document csum-file: refactor finalize_hashfile() method csum-file: rename hashclose() to finalize_hashfile()
2018-05-06Replace all die("BUG: ...") calls by BUG() onesJohannes Schindelin1-3/+3
In d8193743e08 (usage.c: add BUG() function, 2017-05-12), a new macro was introduced to use for reporting bugs instead of die(). It was then subsequently used to convert one single caller in 588a538ae55 (setup_git_env: convert die("BUG") to BUG(), 2017-05-12). The cover letter of the patch series containing this patch (cf 20170513032414.mfrwabt4hovujde2@sigill.intra.peff.net) is not terribly clear why only one call site was converted, or what the plan is for other, similar calls to die() to report bugs. Let's just convert all remaining ones in one fell swoop. This trick was performed by this invocation: sed -i 's/die("BUG: /BUG("/g' $(git grep -l 'die("BUG' \*.c) Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-02packfile: abstract away hash constant valuesbrian m. carlson1-29/+37
There are several instances of the constant 20 and 20-based values in the packfile code. Abstract away dependence on SHA-1 by using the values from the_hash_algo instead. Use unsigned values for temporary constants to provide the compiler with more information about what kinds of values it should expect. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-02packfile: convert find_pack_entry to object_idbrian m. carlson1-6/+6
Convert find_pack_entry and the static function fill_pack_entry to take pointers to struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-02packfile: convert has_sha1_pack to object_idbrian m. carlson1-2/+2
Convert this function to take a pointer to struct object_id and rename it has_object_pack for consistency with has_object_file. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-02packfile: remove unused member from struct pack_entrybrian m. carlson1-1/+0
The sha1 member in struct pack_entry is unused except for one instance in which we store a value in it. Since nobody ever reads this value, don't bother to compute it and remove the member from struct pack_entry. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-26cache.h: allow oid_object_info to handle arbitrary repositoriesStefan Beller1-30/+28
This involves also adapting oid_object_info_extended and a some internal functions that are used to implement these. It all has to happen in one patch, because of a single recursive chain of calls visits all these functions. oid_object_info_extended is also used in partial clones, which allow fetching missing objects. As this series will not add the repository struct to the transport code and fetch_object(), add a TODO note and omit fetching if a user tries to use a partial clone in a repository other than the_repository. Among the functions modified to handle arbitrary repositories, unpack_entry() is one of them. Note that it still references the globals "delta_base_cache" and "delta_base_cached", but those are safe to be referenced (the former is indexed partly by "struct packed_git *", which is repo-specific, and the latter is only used to limit the size of the former as an optimization). Helped-by: Brandon Williams <bmwill@google.com> Helped-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-26packfile: add repository argument to cache_or_unpack_entryStefan Beller1-2/+3
Add a repository argument to allow the callers of cache_or_unpack_entry to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>