## Tuesday, 1 December 2015

### PDFs and Buses

The problem with what I've been doing in place of sea ice over much of this autumn is that I can't blog on it. You don't give away ways of doing things that you intend to make money out of; it's bad business. Then, while I was waiting at the bus stop over the weekend, it occurred to me that I often talk about probability distributions but haven't explained what they are.

As a regular bus user I've had time to see some details of the emergent behaviour of buses. The one I find most interesting is the behaviour of closely spaced buses at busy times of the day. The route into the nearest city runs services every 10 minutes, but at rush hour bunching occurs. It happens as follows: bus 1 sets off from the bus station and is followed by bus 2 ten minutes later. Bus 1 is slowed down by the number of passengers getting on it; bus 2 follows in its wake and speeds past empty bus stops until it catches up with bus 1. At that point, when bus 1 stops to take on yet more passengers, bus 2 overtakes it; bus 2 is then slowed in turn and is passed by bus 1. The two then engage in an oscillatory dance, each taking the lead one after the other. If you're unlucky enough to get to the bus stop at the wrong time you may then face a 20 minute wait for the next bus, despite it being a 10 minute interval bus route.

I have also spent a lot of time waiting at bus stops, and have pondered the issue of probabilities of the arrival of the bus. It struck me that it was a useful way of explaining probability density functions (PDFs).

Knowing the timetable can be of no use, as I found out last Sunday: like the online versions I've used before, the timetable at the bus station was wrong. So you have a timetable and good reason to think its timings are wrong, but you can still tell the frequency of the service from it. Take the simple case of two routes: one has a bus every hour, the other a bus every half hour. Without prior information about the actual time of arrival you need to assume that the previous bus had only just gone as you got to the stop, so the next bus could turn up at any point in the full interval. The PDFs are then rectangular in form, starting from when you get to the bus stop.

With the 30 minute service there is little way to tie down when the bus will actually get there, given good reason to expect that the timetable itself is wrong. So without prior information to constrain the likelihood, the probability is equally distributed across each minute of the 30 minutes from when you get to the bus stop. The result is a rectangular PDF, and this is usually the most conservative form of PDF to use when you have little prior information about the actual distribution of probability. Just for interest I have also plotted the PDF for a service of buses every 60 minutes; obviously, for each minute of waiting, the probability of a bus arriving on the 60 minute service is half that of the 30 minute service.
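For anyone who wants to reproduce the numbers, here is a minimal Python sketch of the two rectangular PDFs. The function name `rectangular_pdf` and the one-minute resolution are my own choices for illustration, not part of any library.

```python
def rectangular_pdf(minute, interval):
    """Probability of the bus arriving during a given minute of waiting.

    `minute` counts from when you reach the stop; `interval` is the
    service interval in minutes. Every minute inside the interval gets
    the same share of the probability, every minute outside gets none.
    """
    if 0 <= minute < interval:
        return 1.0 / interval
    return 0.0


for interval in (30, 60):
    per_minute = rectangular_pdf(10, interval)
    total = sum(rectangular_pdf(m, interval) for m in range(interval))
    print(f"{interval} minute service: {per_minute:.4f} per minute, total {total:.2f}")
```

The per-minute probability for the hourly service comes out at half that of the half-hourly service, and in both cases the probabilities over the whole interval add up to 1.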

The longer I wait for a bus, the more certain its arrival becomes. This is evident from the cumulative probability, which is the running sum of each minute's probability under the PDF.
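The running sum is easy to sketch in Python. This assumes the same rectangular PDFs and one-minute steps as above; `cumulative_probability` is a hypothetical helper name.

```python
from itertools import accumulate


def cumulative_probability(interval, horizon=60):
    """Running total of the per-minute probabilities of a rectangular PDF.

    Entry k of the returned list is the probability that the bus has
    arrived by the end of minute k + 1 of waiting.
    """
    per_minute = [1.0 / interval if m < interval else 0.0 for m in range(horizon)]
    return list(accumulate(per_minute))


cdf_30 = cumulative_probability(30)
cdf_60 = cumulative_probability(60)
for waited in (10, 20, 30):
    print(f"after {waited} minutes: "
          f"30 minute service {cdf_30[waited - 1]:.3f}, "
          f"60 minute service {cdf_60[waited - 1]:.3f}")
```

After 30 minutes the half-hourly service has reached a cumulative probability of 1, while the hourly service has only reached one half.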

So let's say you have guessed that this is probably a thirty minute route. By the time you have waited something over thirty minutes the bus should have arrived (the cumulative probability should be 1), so you start to ponder that the interval between buses might be one hour (or something else). Of course if you're drunkenly waiting there at 11:00 PM the probability is you'll have to wait until next morning.
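That pondering can be put into numbers. The sketch below is my own illustration, not something I actually work out at the bus stop: it assumes only two candidate intervals (30 and 60 minutes), rectangular PDFs for both, and a 50/50 starting guess, then updates that guess with Bayes' rule as the wait goes on with no bus in sight.

```python
def chance_no_bus_yet(waited, interval):
    """Probability that no bus has come after `waited` minutes (rectangular PDF)."""
    return max(0.0, 1.0 - waited / interval)


def belief_in_hourly(waited, prior_hourly=0.5):
    """Updated belief that the route is hourly rather than half-hourly.

    `prior_hourly` is an assumed starting guess; the update weights each
    hypothesis by how well it explains the bus not having arrived yet.
    """
    hourly = prior_hourly * chance_no_bus_yet(waited, 60)
    half_hourly = (1.0 - prior_hourly) * chance_no_bus_yet(waited, 30)
    if hourly + half_hourly == 0.0:
        raise ValueError("neither a 30 nor a 60 minute interval explains this wait")
    return hourly / (hourly + half_hourly)


for waited in (0, 10, 20, 30, 45):
    print(f"waited {waited:2d} minutes: belief in hourly service {belief_in_hourly(waited):.3f}")
```

The belief in the hourly service creeps up from one half as the wait lengthens, and once thirty minutes have passed the half-hourly guess is ruled out altogether.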

Sometimes we have enough prior information to tie down the PDF to something more precise than a rectangular distribution. Sticking with the example of waiting for a bus, say you're on a regular bus route, you've got there just after the hour, and you know that the next bus is scheduled at 20 minutes past. You might assume that the average time of arrival is the scheduled time of arrival, and that the actual times of arrival are normally distributed around that scheduled time. If the factors leading to the bus being a bit early or a bit late are truly random, then a number of such random factors combine to create a normal distribution; the theory behind this is the Central Limit Theorem.

If the standard deviation of the time of arrival is 2 minutes, then about 68% of the time buses arrive within +/- 2 minutes of the scheduled time. Using this assumption and applying a normal distribution, the PDF of the bus arrival is as follows.
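A minimal Python sketch of that PDF, using `NormalDist` from the standard library `statistics` module with the mean of 20 minutes past and the standard deviation of 2 minutes assumed above:

```python
from statistics import NormalDist

# Arrival time in minutes past the hour: scheduled at 20, standard deviation 2.
arrival = NormalDist(mu=20, sigma=2)

# Probability density at two-minute steps either side of the scheduled time.
for minute in range(14, 27, 2):
    print(f"{minute:2d} past: density {arrival.pdf(minute):.4f}")

# Share of buses arriving within one standard deviation of the schedule.
within_one_sd = arrival.cdf(22) - arrival.cdf(18)
print(f"within +/- 2 minutes: {within_one_sd:.3f}")
```

The last line comes out at roughly 0.683, which is the 68% figure quoted above.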

The cumulative probability distribution shows that arrival is equally likely before and after 20 minutes past. As is shown by the above graph, though, arrival five minutes early is very unlikely, and if the bus hasn't arrived by 25 minutes past it is likely that some external factor has interfered with the bus and that the normal distribution is not applicable.
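To put rough numbers on "very unlikely", here is a short sketch under the same assumptions (normal distribution, mean 20 minutes past, standard deviation 2 minutes). I read "five minutes early" as "five or more minutes early".

```python
from statistics import NormalDist

arrival = NormalDist(mu=20, sigma=2)

p_by_schedule = arrival.cdf(20)     # bus has arrived by 20 past
p_five_early = arrival.cdf(15)      # bus arrived at or before 15 past
p_after_25 = 1.0 - arrival.cdf(25)  # bus still not there at 25 past

print(f"arrived by 20 past:      {p_by_schedule:.3f}")
print(f"five or more mins early: {p_five_early:.4f}")
print(f"not arrived by 25 past:  {p_after_25:.4f}")
```

Both tails are 2.5 standard deviations from the mean, so each holds well under 1% of the probability. That is why a bus that still hasn't shown by 25 past is better explained by something outside the model than by bad luck within it.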

Take one such external factor: weather. Say you get to the bus stop and it's a 'pea souper' of very dense fog. You reasonably expect that the bus is more likely to be late than on time. You might be tempted to revert to a rectangular distribution, but the correct distribution may be a skewed variant of a normal distribution. The Central Limit Theorem suggests that by combining various random factors you end up with a normal distribution, but here the conditions for the Central Limit Theorem are not met because one factor (the fog) dominates, and the normal distribution is skewed by that dominant factor.

Turning now to the cumulative probability: the scheduled time of arrival is 20 minutes past the hour, but by that time the cumulative probability is only about 40%, so it is 60% likely that the bus will be late. In fact, even 10 minutes after the scheduled time of arrival the probability of the bus being that late is non-zero.
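One common way to model a skewed variant of the normal distribution is the skew normal distribution, and the sketch below uses it purely as an illustration. The parameters are my assumptions, not fitted to anything: the location is the scheduled time, the scale of 4 minutes is a guess at foggy-day spread, and the shape is picked so that about 40% of the probability falls at or before the scheduled time (for a skew normal, the cumulative probability at the location is 0.5 minus arctan(shape)/pi).

```python
import math
from statistics import NormalDist

STANDARD = NormalDist()  # standard normal: mean 0, standard deviation 1


def skew_normal_pdf(x, location, scale, shape):
    """Skew normal density; a positive shape stretches the late-side tail."""
    z = (x - location) / scale
    return 2.0 / scale * STANDARD.pdf(z) * STANDARD.cdf(shape * z)


def skew_normal_cdf(x, location, scale, shape, steps=4000):
    """Midpoint-rule integral of the density from far in the early tail up to x."""
    lower = location - 10.0 * scale
    width = (x - lower) / steps
    total = sum(
        skew_normal_pdf(lower + (i + 0.5) * width, location, scale, shape)
        for i in range(steps)
    )
    return total * width


# Assumed, illustrative parameters (see the paragraph above).
LOCATION = 20.0
SCALE = 4.0
SHAPE = math.tan(0.1 * math.pi)

p_by_schedule = skew_normal_cdf(20.0, LOCATION, SCALE, SHAPE)
p_later_than_30 = 1.0 - skew_normal_cdf(30.0, LOCATION, SCALE, SHAPE)
print(f"arrived by 20 past:     {p_by_schedule:.3f}")
print(f"not arrived by 30 past: {p_later_than_30:.4f}")
```

With these assumed values the cumulative probability at 20 past is about 0.4, and there is still a small but non-zero chance of the bus being more than 10 minutes late.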

As an engineer, when I work out probabilities I have to be very careful about dominant factors. Normally I'd combine uncertainties using the root of the sum of squares, under the assumption that the combined uncertainty follows a normal distribution. In the case of a dominant term I abandon the assumption of a normal distribution driven by the Central Limit Theorem and have to add the uncertainties instead, which gives a wider bound of uncertainty.
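The difference between the two ways of combining uncertainties is easy to see in a few lines of Python. The two uncertainty values here are made up purely for illustration and are assumed to be in the same units.

```python
import math


def root_sum_of_squares(uncertainties):
    """Combine independent uncertainties assuming a normal combined distribution."""
    return math.sqrt(sum(u * u for u in uncertainties))


def linear_sum(uncertainties):
    """Combine uncertainties by straight addition, the more cautious bound."""
    return sum(abs(u) for u in uncertainties)


terms = [3.0, 4.0]  # hypothetical uncertainties, same units
print(root_sum_of_squares(terms))  # 5.0
print(linear_sum(terms))           # 7.0
```

Straight addition can never give a smaller answer than the root of the sum of squares, which is why it is the safer choice when one term dominates.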