On all my analytical travels in the past several years I’m yet to come across a Data Scientist in the flesh. These people are the mythical figures who supposedly can turn data into oil.
When you think about the breadth of skills this role requires it’s not surprising these folks are rare in business.
There are two ways to close the analytical gap, develop the human resources and develop the technology.
We already have technology that is making analytics easier and I believe we’ll continue to see improvements in this space.
In this post I’ll demo how we can take a classical data scientist type problem and solve it by producing a model in trusty old Excel.
I was inspired to write this post after reading about the new Quantum (allegedly) Computer that D-Wave are selling. Quantum Computers are so different to existing computers it can be hard to define them and frankly when I read Quantum bits can be in two states simultaneously I refer you to those smart Quantum Mechanics at D-Wave.
One thing I do understand is that Quantum Computers will solve computational problems in different ways to today.
Take the classical Travelling Salesperson Problem (TSP). If you need a recap. The salesperson has to visit a number of customers once before returning home. The question we have to answer is: Which route does the salesperson take to minimise the distance travelled?
This problem crops up in many places in society such farmers cropping their field or where to place chips on a PCB board or where to send a team of field sales professionals.
It’s a really hard problem to solve as the more visits there are the more harder it is to solve. You can solve it by hand by trying out the different combinations but you’ll be here a very long time.
To solve 3 visits would take 1*2*3 = 6 attempts Not so bad right? Well what if we have 20 visits? 20! = 2,432,902,008,176,640,000 attempts. Gulp.
The practical use of the solution and the inherent difficulty has led to considerable research effort being ploughed into this problem.
To solve this with Excel we can’t try every permutation of 20 visits as it would take too long. We don’t really have the computational power to calculate the optimal route at a practical level, as far as I know (unless you have a D-Wave).
What we can do is calculate a ‘good route’. I can’t guarantee it will be the best, but in business, this compromise is generally acceptable, think ‘bang per buck’.
Okay, enough of the theory. I hope you’re still with me.
Before creating a model I always plan it using a simple framework.
1.State the problem
Calculate a good route for a sales person to travel ensuring they visit each destination once before returning home.
2.State the variables
3. Describe the inputs
The only inputs we have in this problem is the Post code of each address and the home post code for the sales person.
4. Develop an approach
We can calculate the distance between each combination of 2 post codes.
We can then use an evolutionary (pseudo random) approach to determine a good route.
5. Build it.
We’ll create a 2 dimension matrix of post codes and use Bing maps to calculate the distance at each inter-section of the matrix.
The evolutionary algorithm can be found with the Solver add-in in the 2010 version of Excel. You may need to install it and activate the add-in.
6. Test it.
It’s hard to test this. One way is to look at a gut feel route and compare this to the ‘good route’
7. Deploy it.
Easy with Excel. Publish to SharePoint if you have it.
Okay, with the planning done let’s crank up Excel.
If you want jump ahead, take a look at the finished model here.
Layout the worksheet as below.
Input the post codes or other address type into column H. You can copy/paste/transpose to get the same post codes in row 6.
The counter in column B is a simple counter 1 through 20. The 20th post code needs to the home post code.
The optimal order in column C will be where Solver loads the output i.e. route. For now input 1,2,3..20 for test purposes.
If you create a Named Range called Distance which covers the distances (excl. column/row headers) then it’s easier to make the formula which returns the distance in column D.
The first formulas in cell D7 is : =INDEX(Distance,C26,C7)
C26 is the start point. Of course, it’s also the end point, hence why we have it at the end of the list.
The formula in cell D8 is : =INDEX(Distance,C7,C8). This calculates the distance from the previous end point.
These formulas well return the distance from the matrix (once we run the model). The distance is aggregated to calculate the total distance.
In column E we have a simple VLOOKUP to return the post code using the Counter as a lookup value.
I have a counter to count the number of post codes in E4 and a simpe KM to miles figure in E3.
If you don’t want to follow the Bing Maps integration then you can add whatever distances you want in the matrix. However, if you want real distances then have a look at the VBA code in the demo file. If you know VBA it should be self explanatory.
We use an object to call the Bing API with two properties, the origin and destination (from the matrix row/col). Another object is then used to parse the XML and extract the Distance.
I go through Bing Maps integration in more detail in this post.
Once we have the distances loaded to the matrix we can set up the Solver algorithm.
The objective is to minimise the total distance travelled so select Min for cell D28.
Input the cells to change and set a constraint that these cells should be different. We can’t have duplicate visits in this scenario.
Select the Solver Method to be Evolutionary and set the options according to the screen below.
The evolutionary algorithm works by randomly selecting a collection of routes. Each route is checked to see which is the minimum. Another collection of routes is then made to see if a shorter route can be found. This process continues until no shorter distance is found for the duration we set.
On my machine a good route was found after 3 minutes at 99 miles compared with a starting distance of 316 miles. You may get different results due to the algorithm having random properties.
Well, there are some limits to Solver that is bundled freely with Excel. It only processes up to 200 variables, in our case 200 post codes. If you want to process more you have to buy the software from FrontLine Systems.
If you’re feeling up to it why not extend this model by plotting the journey on Bing Maps. Another improvement could be to have the salesperson travelling on different days.
Once the model is built I hope you agree that this solution can be produced without the help of a Data Scientist. I’m trying to demonstrate that with a little learning combined with the right software it’s possible to close the analytical gap and solve business problems.