SHOPPING FOR A MOUNTAIN TRIP: ANALYSIS OF MULTIPLE RESPONSE SETS

Multiple response sets often form the basis for further detailed questions about opinions on selected products.

The analysis of existing data also makes extensive use of sets of multiple answers since it is commonplace that a customer may have several banking products, use many services, read several blog entries, etc.

The use of multiple response sets may cause some initial problems with the choice of the percentage base (the number of answers can exceed the number of respondents who answered the question). This topic was discussed in detail in a previous article. In this post I will present some suggestions for visualizing the results of the analysis of a multiple response set using analytical techniques and procedures available in PS IMAGO PRO.

To begin with, let's take a look at our data. As a basis we will use a dataset containing information about purchases of tourist equipment in an outdoor adventure store. Each record in the database corresponds to a single transaction. The purchase of each product is recorded as a dichotomous variable, where 1 means the product was purchased and 0 means that it was not. Below is a table summarizing the transactions made.

Table 1. Transaction summary - purchased products

Our analysis covers 100 transactions in which customers purchased 407 products, which results in 100 observations and 407 responses. From the table we can see that the most frequently purchased products were a sleeping bag (60 times) and a headlamp (58 times), and the least frequently purchased was a tent. Obviously, we do not want to stop at a simple statement of what is bought most often and what is bought least often. We can see, for example, that on average four products were purchased together in a single transaction. This then leads to more questions, for example, which products were most often placed in one basket during a transaction. Answers to these questions may be crucial in the context of building a system for recommending items to customers on the basis of their current choices. This in turn may drive up customer satisfaction and boost sales.
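To make the distinction between observations and responses concrete, here is a minimal Python sketch. It is not part of PS IMAGO PRO; the five toy transactions, the four product names and the function name `summarize` are made up for illustration and are not the store's data.

```python
# Hypothetical toy data: each row is one transaction, 1 = purchased, 0 = not.
PRODUCTS = ["tent", "sleeping bag", "headlamp", "penknife"]
transactions = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]


def summarize(rows, names):
    """Count observations (transactions) and responses (purchased products)."""
    n_obs = len(rows)
    counts = {name: sum(row[j] for row in rows) for j, name in enumerate(names)}
    n_resp = sum(counts.values())
    table = {}
    for name, count in counts.items():
        table[name] = {
            "count": count,
            # Percentage base = all responses (the column sums to 100%).
            "pct_of_responses": 100.0 * count / n_resp,
            # Percentage base = all observations (the column may exceed 100%).
            "pct_of_observations": 100.0 * count / n_obs,
        }
    return n_obs, n_resp, table


n_obs, n_resp, table = summarize(transactions, PRODUCTS)
print("observations:", n_obs, "responses:", n_resp)
print("average basket size:", n_resp / n_obs)
for name, stats in table.items():
    print(name, stats)
```

In the article's data the same arithmetic gives 407 responses over 100 observations, that is, roughly four products per transaction.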

Pairing of products - procedure for coloring the table

The basic tool for analyzing co-occurrence within a multiple response set is the table. The best way to perform such an analysis in PS IMAGO PRO is to use the Custom Tables module, in which we have previously defined the multiple response set. A previous post explains in detail how to do this. When defining the table we have to decide up-front which statistics we want displayed. If we want to find out which products were in the same basket as the analyzed product, we should display the row percentage. In this case, it does not matter whether we choose responses or observations as the percentage base, because our data does not contain information about the number of products purchased, only whether the product was purchased or not.

To color the table with a gradient according to the intensity of the value, use the table coloring procedure, which PS IMAGO PRO users will find in the "Predictive Solutions" menu in the "Reports" section. Of course, it is not a separate statistical procedure; it is only intended to make the results easier to interpret. In the procedure window, select the "Apply gradient coloring" option and choose the color palette (I chose "Fall"). Additionally, I checked the "Omit diagonal values" option, so that the values on the diagonal (100%) do not affect the color scale. The result of these steps is the table below. To make it easier to read, I removed the values from the diagonal using the table style wizard and rotated the column headers to the vertical position using the table editing mode.

Table 2. Co-purchase of products. Statistics in the table: row percentage of observations.

The table contains a lot of interesting information. For example, people who bought a tent most often decided to buy a sleeping bag during the same transaction (85.4% of transactions), a little less frequently a penknife (61.0%), a headlamp and a gas canister (58.5% each). By using coloring we can immediately identify pairs of most frequently occurring categories or quickly find another alternative for them.
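The row percentages behind such a co-purchase table can be sketched in a few lines of Python. This is only an illustration of the calculation, not what Custom Tables does internally; the toy transactions and the function name `row_percentages` are hypothetical.

```python
# Hypothetical toy data: each row is one transaction, 1 = purchased, 0 = not.
PRODUCTS = ["tent", "sleeping bag", "headlamp", "penknife"]
transactions = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]


def row_percentages(rows, names):
    """For each product A (table row), the share of transactions containing A
    that also contain product B (table column). The diagonal is skipped."""
    result = {}
    for i, a in enumerate(names):
        base = sum(row[i] for row in rows)  # transactions containing A
        result[a] = {}
        for j, b in enumerate(names):
            if i == j:
                continue
            both = sum(1 for row in rows if row[i] and row[j])
            result[a][b] = 100.0 * both / base if base else float("nan")
    return result


co_purchase = row_percentages(transactions, PRODUCTS)
for product, others in co_purchase.items():
    print(product, others)
```

Note that the table is not symmetric: in the toy data every tent buyer also bought a sleeping bag, but only half of the sleeping bag buyers bought a tent. This is why the choice between row and column percentages matters.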

Basket construction - dendrogram (cluster analysis)

Let's take a deeper look at our transactions. We are looking for an answer to the question of which combinations of products are most common in one basket, but we do not want to limit ourselves to comparing pairs of products. When we have dichotomous variables and want to group cases or variables in terms of their co-occurrence, hierarchical cluster analysis may be an interesting approach. As a result, we will obtain information not only about the membership of objects in individual groups, but also about the degree of similarity between groups. The most important visualization produced by this clustering technique is the dendrogram.

How do we perform hierarchical cluster analysis? In PS IMAGO PRO it is located in the "Analyze" menu in the "Classify" section. After selecting variables for analysis, we specify that we want to cluster variables (Cluster>Variables), and in the "Plots" window we request the dendrogram (Plots>Dendrogram). The key decision in cluster analysis is the choice of method, which we make in the "Method" window. Without going into detail, we will use the default agglomeration method (Cluster Method>Between-groups linkage), and as the measure we will select Jaccard's measure (Measure>Binary>Jaccard). This similarity measure takes the number of transactions in which both products were bought and relates it to the number of transactions in which at least one of them was bought. The higher the value, the "closer" the products are to each other. Transactions in which neither product was bought are not taken into account when calculating the measure.
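Jaccard's measure is easy to compute by hand. The sketch below is plain Python on hypothetical toy transactions (the data and the function name `jaccard` are assumptions for illustration); it also shows that transactions containing neither product leave the result unchanged.

```python
# Hypothetical toy data: columns are tent, sleeping bag, headlamp, penknife.
transactions = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]


def jaccard(rows, i, j):
    """Jaccard similarity of two dichotomous columns i and j:
    (transactions with both products) / (transactions with at least one)."""
    both = sum(1 for row in rows if row[i] and row[j])
    either = sum(1 for row in rows if row[i] or row[j])
    return both / either if either else 0.0


tent_bag = jaccard(transactions, 0, 1)
# Appending transactions that contain neither product does not change the value.
padded = transactions + [[0, 0, 0, 0]] * 10
tent_bag_padded = jaccard(padded, 0, 1)
print(tent_bag, tent_bag_padded)
```

A distance suitable for clustering can then be taken as 1 minus the similarity.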

Figure 1. Dendrogram showing co-purchase of products

The above dendrogram shows the process of joining successive objects (on the left), which are then grouped into larger and larger clusters until one large cluster covering all the analyzed objects is obtained (on the right). The earlier the analyzed products are joined, the greater their similarity to each other. Vertical lines mark the joining of groups. Horizontal lines reflect the distance between the joined objects and clusters - the longer they are, the more dissimilar the objects are.
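The merging process that a dendrogram depicts can be sketched as follows. This is a minimal illustration of average (between-groups) linkage in plain Python, not the PS IMAGO PRO implementation; the distances are invented (think of them as 1 minus Jaccard similarity), and the function name `average_linkage` is an assumption.

```python
from itertools import combinations

# Hypothetical pairwise distances between four products (made-up values).
distances = {
    frozenset(("tent", "sleeping bag")): 0.3,
    frozenset(("headlamp", "penknife")): 0.4,
    frozenset(("tent", "headlamp")): 0.8,
    frozenset(("tent", "penknife")): 0.7,
    frozenset(("sleeping bag", "headlamp")): 0.6,
    frozenset(("sleeping bag", "penknife")): 0.7,
}


def average_linkage(items, dist):
    """Repeatedly merge the two clusters with the smallest average
    pairwise distance; return the merge history (c1, c2, distance)."""
    clusters = [(name,) for name in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            pair_sum = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
            d = pair_sum / (len(c1) * len(c2))
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges


merges = average_linkage(["tent", "sleeping bag", "headlamp", "penknife"], distances)
for c1, c2, d in merges:
    print(c1, "+", c2, "joined at distance", round(d, 3))
```

Each entry in the merge history corresponds to one vertical line in the dendrogram, and the distance at which the merge happens determines how far to the right that line is drawn.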

Let's look at the effects of grouping. Items that were most often purchased together were a tent and a sleeping bag, followed by a headlamp + penknife, and then boots + jacket. The problem begins with thermal underwear and gas canister which seem to be out of step with other products. What could this mean? After a short look at Table 2, we see that the dilemma of belonging to a particular product group could have been the result of the fact that thermal underwear and gas canister are often chosen together with a headlamp or sleeping bag, but rarely with a tent. Eventually, the algorithm merged these two products into the headlamp + penknife group.

How many clusters should we select? The final decision is up to the analyst, and the dendrogram provides an important hint. The more groups we choose, the more homogeneous they will be, but the more difficult the interpretation can become. Conversely, the fewer groups we decide to keep in the solution, the easier it will be to interpret, but the objects within each group will be less similar to each other. On the basis of the dendrogram, it seems clear that there are two separate baskets: boots + jacket, and tent + sleeping bag. The headlamp and penknife are also frequently purchased together. If you don't want to analyze thermal underwear and gas canister separately, you can include them in one basket with the headlamp and penknife. In the end, I decided to focus on three baskets, knowing that the last one is the least homogeneous.

Hidden dimension - perceptual map (correspondence analysis)

The last statistical technique I would like to propose as a tool for analyzing a set of multiple responses is the perceptual map resulting from correspondence analysis. In PS IMAGO PRO this technique is available in the "Analyze" menu in the "Dimension Reduction" section. In contrast to the previous technique, correspondence analysis does not assign objects to groups; the analyst has to make this assignment. An important feature of the perceptual map is the ability to interpret the differences between objects. By evaluating how the objects are scattered, we can interpret individual dimensions as hidden variables - factors that shaped the layout of the analyzed objects. It should also be remembered that the map is a simplified representation of the variability in the analyzed table, as it presents only the two most important dimensions.

Correspondence analysis is a technique designed to analyze two nominal variables with many categories. In this case, however, we are dealing with quite specific data (eight dichotomous variables), so we cannot build a proper crosstab. Does this prevent us from carrying out the analysis, or force us to perform complex transformations? Not necessarily. It is enough to treat the whole analyzed dataset as a table, where the row variable is the transaction (each row is a category of the "transaction" variable) and the column variable is the product (each column is a category of the "purchased product" variable). Each cell of this table contains 0 if the product was not bought in the transaction, or 1 if it was.
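To show what "treating the dataset as a table" means numerically, here is a rough Python sketch of the first step of correspondence analysis: converting the 0/1 table to proportions, computing row and column masses, and deriving standardized residuals and total inertia. The toy data and the function name `ca_first_step` are hypothetical. The sketch stops before the singular value decomposition that produces the map coordinates, and it assumes no transaction and no product is entirely empty.

```python
from math import sqrt

# Hypothetical toy table: rows = transactions, columns = products (0/1).
transactions = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]


def ca_first_step(rows):
    """Return row masses, column masses, standardized residuals and
    total inertia of the table (assumes no all-zero row or column)."""
    n_cols = len(rows[0])
    total = sum(sum(row) for row in rows)
    row_mass = [sum(row) / total for row in rows]
    col_mass = [sum(row[j] for row in rows) / total for j in range(n_cols)]
    residuals = [
        [
            (rows[i][j] / total - row_mass[i] * col_mass[j])
            / sqrt(row_mass[i] * col_mass[j])
            for j in range(n_cols)
        ]
        for i in range(len(rows))
    ]
    # Total inertia equals the chi-square statistic divided by the table total.
    inertia = sum(value * value for row in residuals for value in row)
    return row_mass, col_mass, residuals, inertia


row_mass, col_mass, residuals, inertia = ca_first_step(transactions)
print("column masses:", [round(m, 3) for m in col_mass])
print("total inertia:", round(inertia, 3))
```

The perceptual map is obtained by decomposing the residual matrix into dimensions ordered by how much of this inertia each one explains; the map then shows the first two.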

In order for correspondence analysis to treat a dataset as a table, it is necessary to modify the CORRESPONDENCE command syntax with TABLE=ALL, as described in one of the previous posts. In parentheses, enter the number of rows and columns to be analyzed. Our table has 100 rows and 8 columns (see table fragment above). The map was made using column normalization (CPRINCIPAL), with column points (CPOINTS) only. The command and the perceptual map are shown below.

Figure 2. Perceptual map

As in the case of cluster analysis, we can see that the analyzed products can be combined into three groups: 1) jacket + boots; 2) tent + sleeping bag; 3) headlamp + thermal underwear + penknife + gas canister. The perceptual map also allows us to interpret the dimensions on which our products are mapped. The first, most important dimension (horizontal axis) differentiates products according to how essential they are in the mountains (you can imagine a trip without thermal underwear, but without boots or a jacket it is a risky business!). The second dimension (vertical axis) introduces an additional, distinct grouping of camping equipment (a night in a tent at altitude without a warm sleeping bag is not recommended!).