Cluster analysis of Toronto neighborhoods to identify opportunities for local businesses

1. Introduction

1.1 Business Problem

For investors, it is useful to know what is likely to be popular in which neighborhood based on demographics such as age, education level, marital status, income, ethnic background, and other factors that explain and influence consumer behavior. Some restaurants, for example, open in a neighborhood at considerable sunk cost and close down after only a few months because they fail to attract enough customers. If investors understood the market better in the first place, such failures could possibly be avoided.

1.2 Background

With a population of close to 3 million, Toronto is an important financial center for Canada. As a cosmopolitan city with diverse demographics, it offers a wide range of restaurants, parks, spas, pubs, gyms, business services and so on. As in every major city, certain venues can be concentrated in particular areas, for example corporate offices in a financial district. Households with higher income and education levels can also be concentrated in certain neighborhoods. Using two datasets on Toronto neighborhoods, I will test the hypothesis that the mix of venues differs between neighborhoods in ways that demographics can explain. If different patterns do exist, I will attempt to explain why certain venues are more prevalent in particular neighborhoods. I will also recommend investment opportunities where there is little competition but potential to attract a significant number of customers from a particular group.

2. Data Sources

The data used in this research comes from Foursquare and the Toronto open data portal (https://open.toronto.ca/catalogue/). The Foursquare API provides the venues in each neighborhood along with their location coordinates and categories. The 2016 census for Toronto (https://open.toronto.ca/dataset/neighbourhood-profiles/) contains the aforementioned demographics for each neighborhood. By combining the two datasets, I plan to come up with recommendations for types of venues that are not yet saturated in selected neighborhoods.

With the use of clustering algorithms, it is possible to get an overview of the similarities or differences between the neighborhoods. Similar neighborhoods can be further analyzed to find out the concentration of certain venues, income levels, education, age, marital status, ethno-linguistic backgrounds and other demographic factors.

3. Methodology 

I used venue data from Foursquare to create a K-means clustering model and compared it with the clusters obtained from the census dataset. This approach helps me figure out whether the differences among neighborhoods can be explained by differences in demographics. The cluster labels from the two datasets can be joined on the “Neighborhood” index and compared using rank correlation methods such as Kendall’s tau. If the two sets of labels show a statistically significant association, we can infer that the demographics in the census data help explain the variation in venues among the neighborhoods.

3.1 Feature selection and data transformation

The census data for Toronto contains over 2300 columns, including information such as income taxes, languages spoken at home, and mobility, much of which is not very relevant to this study.

Figure 1 (Original Data Frame)

Feature selection was challenging due to the high dimensionality of this dataset. As a first attempt, I selected some of the features manually, which resulted in a data frame with 64 columns (variables). Afterwards, I cleaned the numerical values by removing white space, currency symbols, percentage signs and commas.
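
The cleaning step can be sketched roughly as follows. This is a minimal illustration, assuming pandas; the column names and sample values are made up and are not taken from the census file, and the helper name clean_numeric is hypothetical.

```python
import pandas as pd


def clean_numeric(series: pd.Series) -> pd.Series:
    """Strip white space, currency symbols, percent signs and commas, then parse as numbers."""
    cleaned = series.astype(str).str.replace(r"[\s$,%]", "", regex=True)
    # Anything that still cannot be parsed (e.g. "n/a") becomes NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical sample values for illustration only
raw = pd.DataFrame(
    {
        "Population": ["29,113", " 23,757 ", "12,054"],
        "Median income": ["$46,580", "$50,642", "$57,682"],
        "Share 25-54": ["41.5%", "38.2%", "n/a"],
    }
)

clean = raw.apply(clean_numeric)
print(clean)
```

Coercing unparseable values to NaN, rather than raising an error, makes it easy to count how many cells failed to convert before deciding how to handle them.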

Figure 2 (Transformed Data)

Although the following correlation matrix with selected columns shows collinearity between some variables, it is not very strong. High collinearity does not significantly affect K-means because the algorithm computes distances between the samples to create clusters. However, high dimensionality does not necessarily add more information and can be noisy. To avoid this, I decided to use principal component analysis to reduce dimensions before K-means. Therefore, the previous data frame with manually selected columns (variables) was discarded.

Figure 3 (Correlation Matrix)

3.2 Feature Extraction

As stated earlier, the original census data includes over 2300 variables and 140 neighborhoods. 

Figure 4 (Original Data Shape)

To perform Principal Component Analysis (PCA), I started again with the original dataset and repeated the data transformation steps described in section 3.1. Only the columns with numerical values were kept, and these were standardized with StandardScaler() in scikit-learn, which uses the mean and standard deviation of each column to produce a dataset with a mean of zero and a variance of 1.
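
As a rough sketch of the standardization step, assuming pandas and scikit-learn: the small data frame below is a hypothetical stand-in for the cleaned census data, with one text column to show how non-numeric columns are dropped.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned census frame (rows are neighborhoods)
census = pd.DataFrame(
    {
        "Population": [29113, 23757, 12054, 30526],
        "Median income": [46580, 50642, 57682, 41000],
        "Designation": ["NIA", "No Designation", "No Designation", "NIA"],
    },
    index=["A", "B", "C", "D"],
)

# Keep only the numeric columns, then scale each to mean 0 and variance 1
numeric = census.select_dtypes(include="number")
scaled = pd.DataFrame(
    StandardScaler().fit_transform(numeric),
    index=numeric.index,
    columns=numeric.columns,
)
print(scaled.round(2))
```

Wrapping the result back into a data frame keeps the neighborhood index, which is useful later when cluster labels need to be joined back to neighborhoods.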

Afterwards, the scaled data was fed into the PCA function. The following graph shows that the first 30 components explain about 90% of the variance in the data. In other words, these 30 components retain about 90% of the information in the original data frame.
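
One way to read the number of components off the cumulative explained variance is sketched below. It assumes scikit-learn and uses synthetic data with the same number of rows as the census data (140) but far fewer columns, so the component count it prints will not match the 30 reported here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 140 rows, 50 correlated columns built from 8 latent factors
latent = rng.normal(size=(140, 8))
X = latent @ rng.normal(size=(8, 50)) + 0.3 * rng.normal(size=(140, 50))
X_scaled = StandardScaler().fit_transform(X)

# Fit a full PCA and accumulate the explained variance ratio
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
components = PCA(n_components=n_components).fit_transform(X_scaled)
print(n_components, components.shape)
```

The resulting components array is what would then be passed on to K-means.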

Figure 5 (Principal Components)

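Figure 6 plots the first two principal components (PCA 1 and PCA 2). From this plot alone, it is not obvious whether the data separates into clear clusters. However, because these two are only part of the 30 retained components, we need to run K-means on all of them to find out.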

Figure 6 (PCA 1 and PCA 2)

After running the K-means algorithm on the principal components, I observed that there is no clear elbow point for choosing the number of clusters. Looking closely at the graph, 5 could be a reasonable choice, with a ‘little’ elbow.
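
The elbow search can be sketched as a loop over candidate cluster counts that records the K-means inertia (within-cluster sum of squares) for each. This assumes scikit-learn and uses three synthetic, well-separated blobs, so the elbow here is much clearer than the one in this study.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic data: three well-separated blobs of 40 points each
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(40, 2)) for c in centers])

# Fit K-means for k = 1..8 and record the inertia of each fit
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting inertia against k gives the elbow graph; the elbow is the point after which adding more clusters reduces inertia only slightly.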

Figure 7 (K-means elbow)

Therefore, I ran K-means again with 5 clusters. This seems to be a reasonable choice, as five clusters are visible in the following graph.

Figure 8 (Immigrant population vs Count of Working Age (25-54))

3.3 Separation of Outliers 

Unlike the DBSCAN algorithm, K-means is sensitive to outliers in the sample. Therefore, it is a good idea to separate the outliers and treat them as a topic of their own. In this study, I am especially interested in the outliers because they are different from other neighborhoods in Toronto and, in theory, can present unique challenges or opportunities for a business.

Demographic outliers such as a high number of immigrants from South Asia or people of age 18-35, for example, could indicate an opportunity for offering services to that particular group. To determine if such opportunities exist, I can use Foursquare data and find out the density of venues by category.

I put all observations that deviate by more than three standard deviations from the mean into a separate data frame. These observations were identified using the z-score.
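
A minimal sketch of this split, assuming pandas and assuming that a neighborhood counts as an outlier when any one of its features has an absolute z-score above 3 (the exact rule used in this study may differ). The data is synthetic, with two extreme values planted on purpose.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in: 140 neighborhoods, 5 numeric features
df = pd.DataFrame(
    rng.normal(size=(140, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)
# Plant two obvious outliers for illustration
df.iloc[0, 2] = 12.0
df.iloc[7, 4] = -10.0

# Column-wise z-scores: (value - column mean) / column standard deviation
z = (df - df.mean()) / df.std(ddof=0)

# Flag a row as an outlier if any feature lies more than 3 standard deviations away
is_outlier = (z.abs() > 3).any(axis=1)
outliers = df[is_outlier]
normal = df[~is_outlier]
print(len(outliers), len(normal))
```

With thousands of columns, the "any feature" rule flags a row as soon as a single column is extreme, so the share of rows flagged grows with the number of columns considered.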

In the following two graphs, we can see that the outliers on average have a larger immigrant population. In Figure 10 for the normal group, the average number of immigrants in each neighborhood is around 7000.

Figure 9 (Immigrant Population – Outliers)

Figure 10 (Immigrant Population – Normal Group)

The differences disappear when we look at proportions rather than the absolute numbers shown in the previous graphs.

Figure 11 (Population Stack – Outliers)

Figure 12 (Population Stack – normal group)

This is because the outlier group contains neighborhoods with larger populations. However, absolute numbers also matter when they pass the demand threshold at which a business can surpass its breakeven point and become profitable. For example, a neighborhood with 40% millennials and Gen Z may still not be very attractive for a youth fashion brand if its whole population is only 500.

Figure 13 (Total Population Boxplot – outliers vs normal)

In addition, there does not seem to be much difference when it comes to education levels. The pattern is quite similar between the outliers and the normal group (see Figure 14 and Figure 15).

Figure 14 (Education Levels – Outliers)

Figure 15 (Education Levels – Normal Group)

However, I found an interesting pattern when it comes to income groups. In Figures 16 and 17, you can observe that the high income group (100,000 and over) is more prevalent in the normal group. The outlier group contains fewer households with an income of 100,000 and over.

In Bridle Path-Sunnybrook-York Mills, the share of this high income group is especially large compared with other neighborhoods. It is not clear why three neighborhoods are combined into one in the census data; it is probably because all three combined have a population of only 9266. This wealthy neighborhood seems to be a suburban-style area with a low population density of 1040/sq km, while an average neighborhood in Toronto has 6261/sq km.

Comparing the outliers and the normal group on other features would likely reveal more differences between the two. This is because differences in income, even where social security systems may be exceptionally good, result in disparities in other aspects of life and thus in divergent consumer behaviors.

Figure 16 (Income groups in outlier neighborhoods)

Figure 17 (Income groups in the normal group)

Another important assumption is that, due to the large proportion of immigrants or ethnic minorities in outlier neighborhoods, these neighborhoods are qualitatively different from the ‘normal’ ones. An obvious example is the Chinatowns across the world, where the majority ethnic group is Chinese and consequently we see more Chinese shops and restaurants.

In the following boxplots, we can see that the number of people of South Asian origin is far higher in the outlier group than in the normal group.

Figure 18 (South Asian origins)

The difference holds not only in absolute numbers but also in proportion, as seen from the calculations below:

People of South Asian origins as percentage of total population:

outliers : 18.14%

normal group : 9.99%

It should be noted that the outlier group differs not only in immigrant population but also in other factors. I am using the immigrant population variable as an example because it is one of the salient features that set the two groups apart. I will examine other dimensions later. Given the differences outlined above, it is worth exploring the normal and outlier groups separately.

4. Limitations

In this report, I focus only on the outlier group, which is one of the limitations of this exercise. A more thorough analysis would compare the outlier and normal groups in more detail, as well as the neighborhoods within each group.

5. Results

Finally, the outlier group was selected for further analysis. Using Nominatim, I retrieved GPS coordinates for each neighborhood. However, some values were missing, and I compensated by entering them manually. There are 30 outliers.

The correlation matrix for the outliers reveals an interesting fact: French origin is highly correlated with jobs in arts and entertainment, as seen in Figure 19. It is worth looking more closely at this finding to see whether it arose by mere chance. This can be done by computing a p-value for statistical significance. For the moment, real-world information appears to support this finding, because many French-Canadians are employed in Quebec’s film industry.
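
A p-value for a single pairwise correlation could be obtained along the following lines. This sketch assumes SciPy and uses synthetic values for 30 neighborhoods; the variable names are illustrative and the numbers are not from the census data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
# Synthetic stand-ins for two census columns across 30 outlier neighborhoods
french_origin = rng.normal(size=30)
arts_jobs = 0.8 * french_origin + 0.3 * rng.normal(size=30)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(french_origin, arts_jobs)
print(f"r = {r:.3f}, p-value = {p_value:.4g}")
```

When one strong pair is picked out of a large correlation matrix, many pairs have implicitly been tested, so a single p-value is likely to overstate the evidence unless it is adjusted for multiple comparisons.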

Figure 19 (French Origin correlation)

5.1 Clustering with census data

To perform K-means clustering, the values for the outlier group obtained from Principal Component Analysis were standardized using StandardScaler().

Iterating over the number of clusters “n” again resulted in the following elbow pattern. In this case, 4 seems to be the optimal number of clusters.

Figure 21 (Elbow graph & Clusters – outliers)

Plotting the clusters on the map did not result in clear geographical clusters. This could indicate a problem with the data, or it could be that the number of clusters is not optimal.

Figure 22 (Cluster map- Census Data)

5.2 Clustering with Foursquare Data

The outliers were further analyzed with venue data obtained from the Foursquare API. In total, data for 661 venues in 172 unique categories were obtained for the 30 neighborhoods.

Figure 23 (Venues Foursquare Data)

In the next step, all categories in the data frame were one-hot encoded, and then the mean value for each neighborhood was computed. The mean values were then used to rank the most popular categories in each neighborhood.
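
The encoding and ranking step can be sketched as follows, assuming pandas. The venues below are made up, and the column names "Neighborhood" and "Venue Category" are assumptions about how the Foursquare results were stored.

```python
import pandas as pd

# Hypothetical venue list: one row per venue
venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
        "Venue Category": [
            "Coffee Shop", "Coffee Shop", "Coffee Shop", "Café", "Café", "Park",
            "Park", "Park", "Indian Restaurant",
        ],
    }
)

# One-hot encode the categories, then average by neighborhood.
# The mean of a 0/1 column is the share of venues in that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()

# Rank the two most common categories in each neighborhood
top = {
    hood: row.sort_values(ascending=False).head(2).index.tolist()
    for hood, row in freq.iterrows()
}
print(freq.round(2))
print(top)
```

Because each row of freq sums to 1, the values are comparable across neighborhoods with very different venue counts, although shares based on only a handful of venues are noisy.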

Figure 24 (Venues One-Hot Coding and Mean)

Figure 25 (Most popular venue categories by neighborhood)

In Figure 25 above, we can see that coffee shops, cafés, and restaurants are the most popular categories in the Bay Street area. This is not surprising, since Bay Street is a financial district with corporate offices.

Figure 26 (Elbow graph- venues clusters)

With venue data, the optimal number of clusters was 3. However, 4 clusters were needed for comparison with the previous K-means result for census data. Performing K-means with 4 clusters produced the clusters shown in the map.

Comparing the two maps below shows that the census data does not completely explain the differences between the clusters. However, there are still a few similarities, suggesting that some of the census variables, such as income categories, age and other demographic factors, could influence the clustering of venues.

Figure 27 (Cluster Map – Venues)

Figure 28 (Cluster Map – Census Data 2)

Figure 29 (comparing cluster labels)

Because the two sets of clusters represent the same entities, i.e. neighborhoods, the clusters should be roughly the same if the two datasets used for clustering are highly correlated with each other. At first glance, we can see on the map that this is not the case.

To check whether the association between the two sets of cluster labels is statistically significant, I computed Kendall’s tau. The results were as follows:

tau:0.2535211267605633, p-value:0.12559254742606987

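A tau of about 0.25 indicates only a weak positive association between the two sets of labels. With a p-value of 12.6%, above the standard 5% threshold, this association is not statistically significant, so I cannot rule out that the two sets of labels are unrelated.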
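
The comparison can be sketched as below, assuming pandas, SciPy and scikit-learn, with hypothetical labels for eight neighborhoods rather than the labels from this study. The sketch also computes the adjusted Rand index, which is not part of the analysis above; it is included only because K-means label numbers are arbitrary, and a rank correlation depends on how the clusters happen to be numbered.

```python
import pandas as pd
from scipy.stats import kendalltau
from sklearn.metrics import adjusted_rand_score

# Hypothetical census-based labels, indexed by neighborhood
census = pd.Series(
    [0, 0, 1, 1, 2, 2, 3, 3], index=list("ABCDEFGH"), name="census_cluster"
)
# Same grouping of neighborhoods, but numbered differently and stored in another order
venues = pd.Series(
    [1, 1, 3, 3, 0, 0, 2, 2], index=list("HGFEDCBA"), name="venue_cluster"
)

# Join on the neighborhood index so each row compares the same neighborhood
labels = census.to_frame().join(venues, how="inner")

tau, p_value = kendalltau(labels["census_cluster"], labels["venue_cluster"])
ari = adjusted_rand_score(labels["census_cluster"], labels["venue_cluster"])
print(f"tau = {tau:.3f}, p-value = {p_value:.3f}, adjusted Rand index = {ari:.3f}")
```

In this constructed example the two clusterings group the neighborhoods identically, so the adjusted Rand index is 1, while Kendall’s tau is 0 because the cluster numbers do not line up. This suggests treating the tau reported above with some caution.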

6. Discussion

What can be inferred from this result is that the principal components extracted from the census data may explain the differences between venue clusters only to a small degree, if at all. However, it is still arguable whether it is even meaningful to compare two sets of clusters that come from two entirely different datasets. I will not explore that question further in this report, as it is already lengthy. [I will do some digging and write a separate article on this matter.]

As seen in Figure 18 and the paragraph that follows it, the outlier group deserves further analysis because of its unique characteristics. Figure 18 and the accompanying calculation show that the South Asian population is higher in both relative and absolute numbers. However, from Figure 25, we can see that South Asian restaurants are not very common in those neighborhoods, which may present business opportunities.

Just like the South Asian restaurant example, we can use exploratory data analysis on other categories to uncover various opportunities where the market is not yet saturated for certain business types.

7. Conclusion

In this report, I attempted a solution for identifying business opportunities in a geographical area using census data and the Foursquare API. I first explored the Toronto census data to uncover the demographic characteristics of its neighborhoods. Exploratory data analysis and the separation of outliers revealed interesting patterns that can be used to draw valuable insights. To see how much the census data can explain the density of venues in neighborhoods, I applied K-means clustering to both datasets, compared the results visually, and used Kendall’s tau to test whether the two sets of cluster labels are significantly associated. To conclude, further analysis is necessary to ensure that the assumptions, statistical methods and machine learning algorithms used are indeed sound and appropriate. Further analysis should be done using methods such as the Kolmogorov-Smirnov test and Tukey’s HSD, and clustering algorithms such as DBSCAN and K-medoids.
