I am real.
After years of experience in marketing, engineering, and sales consulting, I am learning programming and data science. The tools and methods I use in the project below are also real. The scenario, however, is a work of fiction. The company does not exist and is not based in any way on an actual company. While real companies face the problem of allocating salespeople to territories every day, I made up details for this exercise. This is my final project for the IBM Data Science Professional Certificate, an excellent series of nine classes offered by Coursera.
Why should you care? I created this project to demonstrate applications of Python programming with Pandas, machine learning, Folium graphics, data processing, and cleaning (so much cleaning …). One requirement was to use data from Foursquare, which (in its free plan) strictly limits the size and geographic scope of the data it provides. In this project, I demonstrate an efficient way to obtain large data sets through this limited portal while avoiding missed data and duplicated records. If you are interested in data science and need examples of any of these topics, read on. (See and execute the actual code in the project’s Jupyter Notebook.)
If you run a business you may also get some ideas on how you can define a customer base, gather data, and use machine learning to segment the base both geographically and by business needs.
All of the requirements and constraints for this scenario were defined before the project began and were not changed to fit the data or to improve the results. No data was ignored or “fudged”.
A company that manufactures equipment and supplies for Asian restaurants plans to expand into Georgia in the United States. It creates products of interest to all types of Asian restaurants. For example, it sells sushi display cases to Japanese restaurants and tandoor ovens to South Asian restaurants. The company wishes to define territories for its salespeople based on the type of restaurants and locations in the state.
I obtained a listing of all Asian restaurants in Georgia from Foursquare and analyzed that data using Pandas, scikit-learn, and NumPy. To better understand and communicate the results I employed Folium and Matplotlib.
Part 1: Obtaining a Lot of Data from Foursquare
Georgia is one of the largest eastern states in America, covering well over 150,000 square kilometers. There are thousands of Asian restaurants here. My goal was to build a database of all Asian restaurants including the sub-type (Japanese, Korean, Pakistani, etc.) and location. Class exercises and examples for the final project simply divided the desired location into roughly equal-sized areas and performed a Foursquare request on each one. This was impractical in my case because Foursquare will return at most 100 venues per request (at least for the type of account that I had). Some parts of Georgia have a high density of Asian restaurants, while other parts of the state only have a handful. (There are hundreds near my house in Duluth.) I later calculated that capturing all restaurants using equal-sized areas would take over 50,000 requests. This assumes that I knew the concentration of restaurants in the most highly populated area, which of course I did not.
I solved this problem using recursion in Python. (You can see all the code and the results in my Jupyter notebook.) I simply give the code the coordinates of a box surrounding the state, and it breaks the area into sub-boxes of a manageable size and requests data for each one. (Foursquare limits request areas to no more than about 10,000 square kilometers.) If a request returns 99 or fewer restaurants, they are recorded in my dataframe. If the request returns 100 restaurants, then the request has reached the Foursquare limit, and some restaurants in the area may be left out of the results. In that case, my Python code discards the results, breaks the box into four pieces, and makes new requests for each one. If any sub-box returns 100 results it breaks that box into four even smaller pieces and repeats the process until every part of the state is covered by a request that isn’t maxed out.
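The splitting logic can be sketched as follows. This is a simplified illustration, not the notebook's actual code: `fetch` stands in for the real Foursquare request (endpoint, credentials, and parameters live in the notebook), and the bounding-box format and function names are my own.

```python
MAX_RESULTS = 100  # Foursquare returns at most 100 venues per request

def collect_venues(box, fetch):
    """Recursively gather venues for a (south, west, north, east) box.

    If a request maxes out at MAX_RESULTS, some venues in the box may be
    missing from the response, so the results are discarded, the box is
    split into four quadrants, and each quadrant is queried separately.
    """
    venues = fetch(box)
    if len(venues) < MAX_RESULTS:
        return venues  # request did not max out; results are complete
    south, west, north, east = box
    mid_lat = (south + north) / 2
    mid_lon = (west + east) / 2
    quadrants = [
        (south, west, mid_lat, mid_lon),   # SW
        (south, mid_lon, mid_lat, east),   # SE
        (mid_lat, west, north, mid_lon),   # NW
        (mid_lat, mid_lon, north, east),   # NE
    ]
    results = []
    for quadrant in quadrants:
        results.extend(collect_venues(quadrant, fetch))
    return results
```

Because the quadrants partition the parent box exactly (half-open edges on the fetch side), every venue lands in exactly one request, which is why the method avoids both missed data and duplicates.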
This method required 168 requests and returned all results in less than 90 seconds. There were no duplicates in the data. Some results were returned from adjacent states, but these rows were easily dropped from the data frame.
The rest of the data cleaning took serious effort (as it often does). Over 20% of the data was misassigned (for example, to “Asian Restaurant” when a sub-category like “Japanese Restaurant” was more accurate, or vice versa), and some non-Asian restaurants were included (Applebee’s, anyone?). I focused on programmatic data cleaning and avoided manual effort. For example, I performed a word frequency analysis on the restaurant names in each sub-category and used the results to identify and reassign restaurants placed in the wrong categories. I then used Foursquare’s category hierarchy to assign all restaurants to categories that made sense for sales territory assignment.
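A minimal sketch of the word-frequency idea. The stopword and signal-word lists here are hypothetical placeholders; the real lists came from inspecting the actual Foursquare data, and the matching in the notebook is more careful than this naive substring check.

```python
from collections import Counter
import re

def top_words(names, stopwords=("restaurant", "the", "and"), n=10):
    """Return the n most common words across a list of restaurant names."""
    counts = Counter()
    for name in names:
        for word in re.findall(r"[a-z]+", name.lower()):
            if word not in stopwords:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]

# Hypothetical signal words: if a generic "Asian Restaurant" row has a name
# containing a word characteristic of a sub-category, suggest reassigning it.
SIGNAL_WORDS = {"sushi": "Japanese Restaurant", "pho": "Vietnamese Restaurant"}

def suggest_category(name, current, signals=SIGNAL_WORDS):
    """Suggest a more specific category for a generically labeled restaurant."""
    if current != "Asian Restaurant":
        return current  # already has a specific sub-category
    for word, category in signals.items():
        if word in name.lower():  # naive substring match, for illustration only
            return category
    return current
```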
After removing non-Asian results and assigning each restaurant to its best category, I was left with 3,305 potential customers for my restaurant equipment manufacturer.
Let’s look at the cleaned database before we move on to the analysis. A Folium heatmap gives some idea of the geographic distribution of the restaurants, but it can be misleading. The picture below shows the same map with the same data, zoom level, and settings. The left-hand image was generated in the Jupyter notebook. I then zoomed out and back in again and took another screengrab (on the right). Notice that the blobs are not the same. The left image would make you think that the highest concentration of Asian restaurants is in the Warner Robins / Macon area (near the center of the state). The right image might convince you that the highest concentration is near Augusta or perhaps Savannah (right side of the state). If you rerun the code or zoom out and back in again, the blobs are likely to shift more.
Fortunately, Folium has another way to visualize these results. Take a look at the cluster marker map below.
This gives a better feel for how restaurants are distributed. About three-fourths of the restaurant equipment manufacturer’s potential customers are in the greater Atlanta area. All types of Asian restaurants are concentrated in the Atlanta area, but the most common types (Chinese and Japanese) are more evenly distributed throughout the state. Click the link in the picture caption to try the interactive map. This combination of high concentrations in small areas and broad areas with low density will prove to be a challenge in the clustering step.
Part 2: The Analysis and Results
The restaurant database forms an unlabeled set, and the challenge is to create a number of restaurant clusters equal to the number of available salespeople. I chose k-means clustering for this project.
This problem requires segmenting a market both by geographic data (numeric) and by the restaurant’s cuisine (categorical data). Unfortunately, mixing data of different types in a single machine learning model is problematic. Categorical data can be made numeric (through one-hot encoding, for example), but how should the machine learning algorithm interpret combining this with X-Y location data? It is easy to imagine the algorithm grouping Indian restaurants that share a common latitude while their longitudes (and hence the physical distances between them) vary widely.
I first attempted to create territories using the cuisine data alone. I converted the type of restaurant (stored as text strings in a single column) into a separate column for each type containing a “1” or “0” (one-hot encoding) and used the k-means algorithm to perform the grouping. The number of clusters was set to the number of salespeople the customer planned to hire (in this case, five).
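The encoding-plus-clustering step looks roughly like this with Pandas and scikit-learn. The toy dataframe and column names below are placeholders standing in for the real 3,305-row dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the real dataframe, which holds one row per restaurant
# with a Foursquare category string.
df = pd.DataFrame({"Category": ["Japanese Restaurant", "Chinese Restaurant",
                                "Thai Restaurant", "Japanese Restaurant",
                                "Indian Restaurant", "Korean Restaurant"] * 20})

# One-hot encode: one 1/0 column per cuisine type.
onehot = pd.get_dummies(df["Category"]).astype(int)

# Five clusters, one per planned salesperson.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(onehot)
df["Territory"] = kmeans.labels_
```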
Grouping solely on restaurant type produced unwieldy territories. Most sales territories serve only one kind of restaurant, but every territory covers the entire state. Travel time would harm productivity in this case. This analysis also uncovered a charming attribute of k-means clustering: its tendency to produce groups of widely different sizes.
In the table below, note that the territories are named using the corresponding territory colors on the map.
Next, I attempted to create the territories using location data (latitude and longitude) alone. A one-degree change in latitude represents a different distance than a one-degree change in longitude. (In Georgia, a degree of longitude covers roughly 16% less ground than a degree of latitude.) I rescaled latitude and longitude using the average distances covered by each before performing the analysis.
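One simple way to put the two axes on a comparable footing is to shrink longitude by the cosine of a reference latitude. This is an approximation of the rescaling described above; the exact factors used in the notebook may differ.

```python
import math

def rescale(lat, lon, ref_lat=32.7):
    """Scale degrees so one unit covers a similar ground distance on both axes.

    A degree of longitude shrinks with latitude by cos(latitude), so
    multiplying longitude by cos(ref_lat) puts both axes on a comparable
    scale before clustering. ref_lat here is Georgia's approximate
    mid-latitude (an assumption for this sketch).
    """
    return lat, lon * math.cos(math.radians(ref_lat))
```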
Grouping the restaurants physically yielded sales territories that include all types of restaurants. This increases the load on the salespeople, as opportunities to specialize are lost. The high concentration of restaurants in the greater Atlanta area also ensures that the number of restaurants in each territory is uneven.
This grouping is clearly inadequate, but it does suggest a possible solution. The four territories outside of the Atlanta area have relatively few restaurants. These territories could be merged together to form two territories, freeing up two salespeople to move to the Atlanta area. I tried this, merging Yellow plus Blue (to take advantage of I-75) and Purple plus Red.
The two territories outside of the Atlanta area are larger and include all types of restaurants, thereby increasing the time the salespeople spend traveling and learning about the restaurant types. However, the number of restaurants is lower than average, helping to balance the load. Now three salespeople are available to work in the Atlanta area which can be grouped by restaurant type. These salespeople have a smaller physical territory and fewer restaurant types to learn about, but more restaurants to cover.
I used this approach to create two territories outside of Atlanta and ran the k-means grouping on just the restaurants in the Atlanta area.
The Atlanta territories (Yellow, Purple, and Green) were better but still uneven. The average across these three territories should be around 840 restaurants, and the territory consisting only of Chinese restaurants (Green) is very close to that mark. Because that territory contains a single cuisine, the other two territories must carry more restaurant types. I judged that moving another type of restaurant into the Green territory (thereby increasing its restaurant count by at least 100) would leave the territories more unbalanced when all factors are considered.
The other two territories are unbalanced in both the number of restaurants and the number of restaurant types in each. I considered balancing the other two Atlanta territories (Yellow and Purple) by moving about 270 restaurants from the largest Atlanta territory to the smallest. There were two obvious ways to do this:
1. Move Indo-Pak restaurants, or
2. Move Vietnamese and Thai restaurants
First, I decided to see if there were significant differences in the physical locations of restaurants in these two territories by mapping the centers of mass for each restaurant type.
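The center of mass for each restaurant type is just the mean coordinate per cuisine, which Pandas computes directly. The rows and column names below are hypothetical placeholders for the real Atlanta-area data.

```python
import pandas as pd

# Hypothetical rows; the real dataframe holds all Atlanta-area restaurants.
df = pd.DataFrame({
    "Category":  ["Japanese", "Japanese", "Indo-Pak", "Thai"],
    "Latitude":  [33.95, 33.91, 33.93, 34.02],
    "Longitude": [-84.13, -84.21, -84.15, -84.30],
})

# Center of mass per cuisine = mean latitude and longitude of its restaurants.
centers = df.groupby("Category")[["Latitude", "Longitude"]].mean()
```

Plotting these centers on the Folium map makes it easy to compare how far each cuisine's center sits from the others.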
The mean location of the Indo-Pak restaurants is closer to the Japanese center than either the Thai or Vietnamese centers. By this measure, it would have made sense to move Indo-Pak into the territory that contained only Japanese restaurants. However, the difference was only 3–4 miles in a territory that is about 150 miles across, and balancing the number of restaurants this way would have left the number of cuisines in each territory uneven.
I decided instead to move the Vietnamese and Thai restaurants. This produced more even distributions of both the number of restaurants and the number of cuisines in each territory.
How did I do?
We now have five unambiguous territories. As the company’s circumstances change (for example, hiring new salespeople in the future or expanding operations to other states), it can quickly repeat this analysis with new data and constraints. In the meantime, assigning new restaurants to the existing territories is straightforward.
However, in the densely populated Atlanta metro area location data was used outside of the machine learning algorithm. This was appropriate given the number of clusters (salespeople), but enhancements to this project could include methods to mix location and cuisine data into the same learning algorithm, perhaps using a custom algorithm.
The algorithm chosen for this project (k-means) tends to produce uneven groupings (particularly in cases where there are uneven concentrations of samples), and this algorithm could be modified or replaced to automatically produce clusters more appropriate for salespeople. Finally, domain knowledge was used to balance the number of cuisines in a territory with the number of restaurants. Future versions of this project could capture the relevant domain knowledge and perform this step automatically.
What do you think? All comments and ideas welcome.