, basename, NULL, MINIKIND, NULL, fileid_utf8, NULL,
    WHOLE NUMBER (* size *), NULL, packed stat, NULL, sha1|symlink target,
    {PARENT ROW}
PARENT ROW = NULL, revision_utf8, NULL, MINIKIND, NULL, dirname, NULL,
    basename, NULL, WHOLE NUMBER (* size *), NULL, "y" | "n", NULL, SHA1

PARENT ROWs are emitted for every parent that is not in the ghosts details
line. That is, if the parents are foo, bar and baz, and the ghosts are bar,
then each row will have a PARENT ROW for foo and baz, but not for bar.

In any tree, a kind of 'moved' indicates that the fingerprint field (which we
treat as opaque data specific to the 'kind' anyway) has the details for the id
of this row in that tree.

I'm strongly tempted to add an id->path index as well, but I think that where
we need an id->path mapping we also usually read the whole file, so I'm going
to skip that for the moment: we have the ability to locate via bisect any path
in any tree, and if we look things up by path, we can accumulate an id->path
mapping as we go, which will tend to match what we looked for.

I plan to implement this asap, so please speak up now to alter/tweak the
design - and once we stabilise on this, I'll update the wiki page for it.

The rationale for all this is that we want fast operations for the common case
(diff/status/commit/merge on all files), and extremely fast operations for the
less common but still frequent case (status/diff/commit on specific files).
Operations on specific files involve a scan for all the children of a path,
*in every involved tree*, which the current format did not accommodate.

----

Design priorities:
 1. Fast end-to-end use for bzr's top 5 use cases (commit/diff/status/merge/???).
 2. Fall back to the current object model as needed.
 3. Scale usably to the largest trees known today - say 50K entries.
    (mozilla is an example of this)

Locking:

Eventually we want to reuse dirstate objects across locks IFF the dirstate
file has not been modified, but that will require that we flush/ignore cached
stat-hit data, because we won't want to restat all files on disk just because
a lock was acquired, yet we cannot trust the data after the previous lock was
released.

Memory representation:

A vector of all directories, and a vector of the children in each directory,
i.e.:

 root_entries = (direntry for root, [parent_direntries_for_root])
 dirblocks = [
     ('', ['data for achild', 'data for bchild', 'data for cchild']),
     ('dir', ['achild', 'cchild', 'echild']),
 ]

 - A single bisect finds the N subtrees for a path spec.
 - In-order for serialisation - this is the 'dirblock' grouping.
 - Insertion of a file '/a' affects only the '/' child vector: inserting 10K
   elements from scratch does not generate O(N^2) memmoves of a single vector;
   each insertion touches only one per-directory vector, which tends to stay a
   manageable size. This will scale badly on trees with 10K entries in a
   single directory, though.
 - Compare with Inventory.InventoryDirectory, which has a dictionary for the
   children: no bisect capability, so we can only probe for exact matches, or
   grab all elements and sort.
 - What's the risk of error here? Once we have the base format being processed
   we should have a net win regardless of optimality, so we are going to go
   with what seems reasonable.

Open questions:

Maybe we should do a test profile of the core structure - 10K simulated
searches/lookups/etc?

Objects for each row? The lifetime of Dirstate objects is currently per lock,
but see above for possible extensions. The lifetime of a row from a dirstate
is expected to be very short in the optimistic case, which is what we are
optimising for.
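The dirblock layout described under "Memory representation" can be sketched in Python. This is a minimal illustration only: the helper names (`find_dirblock`, `insert_child`) and the placeholder child data are assumptions, not bzr's actual API.

```python
import bisect

# Sketch of the dirblock layout: a sorted vector of
# (dirname, sorted-vector-of-children) pairs.
dirblocks = [
    ('', ['achild', 'bchild', 'cchild']),      # children of the root
    ('dir', ['achild', 'cchild', 'echild']),   # children of 'dir'
]

def find_dirblock(dirblocks, dirname):
    """Locate the child vector for one directory with a single bisect."""
    # (dirname,) sorts immediately before ('dirname', children), so
    # bisect_left lands on the matching block if it exists.
    idx = bisect.bisect_left(dirblocks, (dirname,))
    if idx < len(dirblocks) and dirblocks[idx][0] == dirname:
        return dirblocks[idx][1]
    return None

def insert_child(dirblocks, dirname, child):
    """Insert touches only one per-directory vector, so building a tree
    of 10K entries avoids O(N^2) moves in one flat vector."""
    bisect.insort(find_dirblock(dirblocks, dirname), child)

insert_child(dirblocks, '', 'aardvark')
```

Looking up a directory costs one bisect over the directory vector, and subtree operations then read contiguous child vectors in order, which is also the serialisation order.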
For instance, subtree status will determine from analysis of the disk data
which rows need to be examined at all, and will be able to determine from a
single row whether that file has altered or not; so we are aiming to process
tens of thousands of entries each second within the dirstate context, before
exposing anything to the larger codebase. This suggests we want the time for
a single file comparison to be < 0.1 milliseconds. That would give us 10000
paths per second processed, and to scale to 100 thousand we'll need another
order of magnitude beyond that. Now, as the lifetime for all unchanged
entries is just the time to parse, stat the file on disk, and then
immediately discard, the overhead of object creation becomes a significant
cost.

Figures: creating a tuple from 3 elements was profiled at 0.0625
microseconds, whereas creating an object subclassed from tuple took 0.500
microseconds, and creating an object with 3 elements and slots took 3
microseconds. 0.1 milliseconds is 100 microseconds, and ideally we'll get
down to 10 microseconds for the total processing - having 33% of that be
object creation is a huge overhead. There is a potential cost in using
tuples within each row: the conditional code to do comparisons may be slower
than method invocation would be. But method invocation is known to be slow
due to stack frame creation, so avoiding methods in these tight inner loops
is unfortunately desirable. We can consider a pyrex version of this with
objects in future if desired.
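The creation-cost figures above can be re-measured with a sketch like the following. The class names are illustrative (not bzr's), and the absolute numbers depend entirely on the machine and interpreter; only the relative ordering - plain tuple cheapest, tuple subclass more expensive, slotted class most expensive - is the point.

```python
import timeit

class RowTuple(tuple):
    """A row modelled as a tuple subclass (hypothetical name)."""

class RowSlots(object):
    """A row modelled as a slotted object with 3 fields."""
    __slots__ = ('dirname', 'basename', 'minikind')
    def __init__(self, dirname, basename, minikind):
        self.dirname = dirname
        self.basename = basename
        self.minikind = minikind

# Use variables, not literals, so tuple creation isn't constant-folded away.
setup = "a, b, c = 'dir', 'base', 'f'"
n = 100000
t_tuple = timeit.timeit('(a, b, c)', setup=setup, number=n)
t_subclass = timeit.timeit('RowTuple((a, b, c))', setup=setup,
                           globals=globals(), number=n)
t_slots = timeit.timeit('RowSlots(a, b, c)', setup=setup,
                        globals=globals(), number=n)
print('tuple: %.4fs  subclass: %.4fs  slots: %.4fs'
      % (t_tuple, t_subclass, t_slots))
```

If per-row creation cost dominates at this scale, plain tuples are the only option among the three that keeps total per-file processing near the 10 microsecond target.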