* We have a few functions to do with reading a netint, stashing it
  somewhere, then moving into a different state.  Is it worth writing
  generic functions for that, or would it be too confusing?
* Optimisations and code cleanups;

  scoop.c: Scoop needs a major refactor.  Perhaps the API needs
  tweaking?

  mdfour.c: This code has a different API to the RSA code in libmd
  and is coupled with librsync in unhealthy ways (trace?).  Recommend
  changing to the RSA API?
* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name.  I think people expect that it will be
  something that behaves like a FILE* or C++ stream, and it really
  does not.  Also, the structure does not behave as an object: it's
  really just a shorthand for passing values in to the encoding
  routines, and so does not have a lot of identity of its own.
  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much we
  actually consumed and produced.
  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that need
  them, which are relatively deeply embedded.  I guess we could just
  stick them into the job structure, which is becoming a kind of
  catch-all "environment" for poor C programmers.
* Plot lengths of each function

* Some kind of statistics on delta each day

* Include a version in the signature and difference fields

* Remember to update them if we ever ship a buggy version (nah!) so
  that other parties can know not to trust the encoded data.
  In fact, we can vary on several different variables:

  * what signature format are we using?

  * what command protocol are we using?

  * what search algorithm are we using?

  * what implementation version are we?

  Some are more likely to change than others.  We need a chart
  showing which source files depend on which variable.
* Self-referential copy commands

  Suppose we have a file with repeating blocks.  The gdiff format
  allows for COPY commands to extend into the *output* file so that
  they can easily point this out.  By doing this, they get
  compression as well as differencing.
  It'd be pretty simple to implement this, I think: as we produce
  output, we'd also generate checksums (using the search block
  size), and add them to the sum set.  Then matches will fall out
  automatically, although we might have to specially allow for
  this case.
  However, I don't see many files which have repeated 1kB chunks,
  so I don't know if it would be worthwhile.
* Support compression of the difference stream.  Does this belong
  here, or should it be in the client, with librsync just providing
  an interface that lets it cleanly plug in?
  I think if we're going to just do plain gzip, rather than
  rsync-gzip, then it might as well be external.
  rsync-gzip: preload with the omitted text so as to get better
  compression.  Abo thinks this gets significantly better
  compression.  On the other hand we have to import and maintain
  our own zlib fork, at least until we can persuade the upstream to
  take the necessary patch.  Can that be done?
  It does get better compression, but at a price.  I actually
  think that getting the code to a point where a feature like
  this can be easily added or removed is more important than the
  feature itself.  Having generic pre- and post-processing layers
  for hit/miss data would be useful.  I would not like to see it
  added at all if it tangled and complicated the code.
  It also doesn't require a modified zlib... pysync uses the
  standard zlib to do it by compressing the data, then throwing
  it away.  I don't know how much benefit the rsync modifications
  to zlib actually are, but if I was implementing it I would
  stick to a stock zlib until a modified one proved significantly
  better.
  Will the GNU Lesser GPL work?  Specifically, will it be a problem
  in distributing this with Mozilla or Apache?
* Just more testing in general.

* Test broken pipes and that IO errors are handled properly.
* Test files >2GB, >4GB.  Presumably these must be done in streams
  so that the disk requirements to run the test suite are not too
  ridiculous.  I wonder if it will take too long to run these
  tests?  Probably, but perhaps we can afford to run just one
  carefully-chosen test.
* Fuzz instruction streams. <https:
* Generate random data; do random mutations.
* Tests should fail if they can't find their inputs, or have zero
  inputs: at present they tend to succeed by default.
* If this code were to read differences or sums from random machines
  on the network, then it's a security boundary.  Make sure that
  corrupt input data can't make the program crash or misbehave.
Description of input and output buffers.

    rs_result rs_job_drive(rs_job_t *job, rs_buffers_t *buf,
                           rs_driven_cb in_cb, void *in_opaque,
                           rs_driven_cb out_cb, void *out_opaque);

Actively process a job, by making callbacks to fill and empty the
buffers until the job is done...