I know I made a thread asking a statistics-related question a while back, but I think this is sufficiently unrelated that it deserves its own thread. Mods, if you see it differently, by all means merge them.

I have been keeping a diary for three years, of which the last half-year is typed, with each day's wordcount recorded in an Excel file. Over the course of the next year I plan to type up all of my old handwritten entries, in readiness for (perhaps) posting them as a sort of "blog from the past", so I will end up with about 1500 data points. Already I have enough data to notice a few trends:

As expected for a random walk, the cumulative wordcount over time is fractal in nature. Curiously, on various scales it also seems fairly close to piecewise linear once the finer detail is ignored. This could just be my perception, but if not, I'd be interested to know what the significance is, if any.
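One way I could test the piecewise-linear impression, rather than just eyeballing it, is to coarse-grain the cumulative series and look at the average daily rate over consecutive windows: if the curve really is close to piecewise linear, the window slopes should sit near a constant within each segment and jump between segments. A sketch in Python, using made-up daily counts in place of my real Excel column:

```python
import random

# Placeholder data: three stretches with different typical wordcounts,
# standing in for the real daily counts (an assumption, not my data).
random.seed(0)
daily_counts = (
    [random.randint(200, 600) for _ in range(180)]      # quieter stretch
    + [random.randint(800, 1400) for _ in range(180)]   # busier stretch
    + [random.randint(400, 1000) for _ in range(180)]   # middling stretch
)

# Cumulative wordcount: the running total whose graph looks piecewise linear.
cumulative = []
total = 0
for c in daily_counts:
    total += c
    cumulative.append(total)

def window_slopes(series, width):
    """Average daily growth rate over consecutive non-overlapping windows."""
    return [
        (series[i + width - 1] - series[i]) / (width - 1)
        for i in range(0, len(series) - width + 1, width)
    ]

# Near-constant runs in this list, separated by jumps, would support the
# piecewise-linear reading at the 30-day scale.
slopes = window_slopes(cumulative, 30)
print(slopes)
```

Repeating this with different window widths would show whether the impression holds across scales.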

The distribution of daily wordcounts (considered independently of time) seems, so far, surprisingly close to uniform up to about 1250 words (about 2/3 of the data), and fairly close to uniform but with a lower density from there up to about 2000 (about 1/12 of the data), with a couple of percent of entries longer than that and the rest empty, i.e. days with no entry written. Again, I'd be interested to know why this is; I would have expected a more bell-like distribution.

I'm also curious about how entry lengths on given days depend on those from previous days. In short, I'm interested in the various observations and predictions that could be made about the time series from the data I have so far.
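One standard way to quantify that day-to-day dependence is the sample autocorrelation at various lags: values near zero suggest entry lengths are roughly independent of entries written that many days earlier, while values well away from zero suggest dependence. A stdlib-only sketch (the series here is synthetic, not my data):

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag.

    Near 0: entry lengths look independent of entries `lag` days back.
    Near +1/-1: strong positive/negative dependence.
    """
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

# Illustrative check on synthetic data (an assumption, not the diary data):
# a strictly alternating short/long series is strongly negatively
# correlated at lag 1.
alternating = [500, 1500] * 50
print(autocorrelation(alternating, 1))   # -0.99, i.e. close to -1
```

Computing this for lags 1 through 7 on the real data would also reveal any weekly rhythm.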

As well as having the data in raw form, I've represented it with four graphs: three scatter graphs showing entries by actual wordcount, by percentage ranking and by logarithm of wordcount, plus a cumulative graph. The first three currently have moving averages to show trends, but I've been wondering whether there might be a better way to do this. It occurred to me to represent the trend as a function of time which minimizes the average distance from the trendline to the data points (using mean squared distance wouldn't work, as it would result in a constant function). Might this idea, or something like it, work?
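From what I've read, fitting by minimizing the average absolute distance (rather than squared distance) is known as least absolute deviations, or L1, regression. For a straight-line trend there is a convenient fact: some optimal L1 line passes through at least two of the data points, so for a modest number of points a brute-force search over pairs finds it. A sketch with synthetic data (the points are made up to show the idea):

```python
from itertools import combinations

def lad_line(points):
    """Least-absolute-deviations (L1) straight-line fit.

    Relies on the fact that some optimal L1 line passes through at
    least two data points, so we brute-force over pairs (fine for
    modest n). Returns (slope, intercept).
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, skip
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(y - (slope * x + intercept)) for x, y in points)
        if best is None or cost < best[0]:
            best = (cost, slope, intercept)
    return best[1], best[2]

# Synthetic data (an assumption): a clear linear trend plus one huge
# outlier. The L1 fit tracks the bulk of the points rather than being
# dragged toward the outlier, as a least-squares line would be.
points = [(0, 10), (1, 12), (2, 14), (3, 16), (4, 18), (5, 500)]
slope, intercept = lad_line(points)
print(slope, intercept)   # 2.0 10.0 -- the outlier is ignored
```

This robustness to outliers (a few very long entries wouldn't drag the trendline much) seems like exactly the property I'd want, though the pair search gets slow for 1500 points and a proper implementation would use linear programming or quantile regression instead.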