Recently, I talked with someone who was working to explore products that are often purchased together in the hopes of finding valuable patterns. This is usually termed association rule learning or market basket analysis. It is often used by businesses to see which items may be good to sell together, via bundling for discounts or upselling on an existing order. Amazon’s suggestions for items “Customers Also Shopped For” are a great example of this. It is also used in recommending books or music based on other ones people have liked.

I was interested in trying this myself, and found a dataset hosted by the University of California, Irvine. It contains transactions for a UK-based online retailer. Most of their items are “all-occasion gifts” and most of their customers are wholesalers.

I’m hoping to be able to find relations that are not obvious, but still useful. It’s possible the results end up telling us more obvious things like people purchasing salsa often purchase chips as well. This wouldn’t be overly ideal, but the exact correlation may still be valuable to know.

Data Preparation

The first few rows of the data look like this:

invoiceno  stockcode  description                          quantity  invoicedate   unitprice  customerid  country
536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         12/1/10 8:26  2.55       17850       United Kingdom
536365     71053      WHITE METAL LANTERN                  6         12/1/10 8:26  3.39       17850       United Kingdom
536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         12/1/10 8:26  2.75       17850       United Kingdom
536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         12/1/10 8:26  3.39       17850       United Kingdom
536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         12/1/10 8:26  3.39       17850       United Kingdom

There is a row for every item in every order. I’m mostly interested in the invoice number and item description for now, though the time and location of the order could yield associations that are lost in the bigger picture. While scrolling through the data, I noticed that some lines had a blank description. Exactly 132 of the 65,499 rows are missing the description, so I dropped those rows, as we have no way of knowing what those items are. There are also rows where the quantity is negative, which appear to be returns, so I dropped those 1119 rows as well.[1] This only removes the return rows themselves; the returned items still appear in their original purchase transactions. They could be removed from those purchases too without losing much of the dataset, but I’ve chosen to keep them, since our ultimate goal is analyzing which items are purchased together.
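The footnote describes doing this cleanup in vim. For anyone who would rather stay in R, the same two filters could be sketched roughly as below. The `orders` data frame and its values are made up for illustration (they are not rows from the retailer file), and treating whitespace-only descriptions as blank is my assumption.

```r
# Made-up stand-in for the retailer data; only the columns used here.
orders <- data.frame(
  invoiceno   = c("A1", "A1", "A2", "A3"),
  description = c("WHITE METAL LANTERN", "", "JAM JAR WITH PINK LID", "JAM JAR WITH PINK LID"),
  quantity    = c(6, 2, 4, -4),
  stringsAsFactors = FALSE
)

# Keep rows that have a non-blank description...
has_description <- !is.na(orders$description) & trimws(orders$description) != ""
# ...and a non-negative quantity (negative quantities are treated as returns).
not_negative <- orders$quantity >= 0

cleaned <- orders[has_description & not_negative, ]
print(cleaned)
```

On the real file, the data frame would come from something like `read.csv()` rather than being typed in by hand.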

Analysis

With the data in good shape, we’re ready to start seeing what we can do. Support, confidence, and lift are important terms when discussing association rule learning, and are defined well here. One of the more common algorithms for market basket analysis is Apriori. At a very simple level, it will run through and count how many times each different combination of items appears. For datasets with lots of different items appearing in lots of different combinations, it may take a long time to count each combination. To reduce the number of combinations being counted, a minimum level of support is usually specified. If an item or combination of items doesn’t appear often enough, it will be excluded. This also can help a bit in preventing rules being generated from a set of items that may only appear a few times.
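To make these terms a bit more concrete, here is a small base R sketch on a handful of made-up baskets (not the retailer data). It is not the arules implementation, just the counting idea: compute support for single items, prune anything under a minimum, count pairs only among the survivors, and then derive confidence and lift for one rule. The `min_support` value of 0.3 is an arbitrary choice for the toy data.

```r
# Made-up baskets for illustration only.
baskets <- list(
  c("chips", "salsa", "soda"),
  c("chips", "salsa"),
  c("chips", "soda"),
  c("salsa", "chips", "napkins"),
  c("soda"),
  c("chips", "salsa", "soda"),
  c("napkins"),
  c("soda", "chips")
)

# Support of an itemset: share of baskets that contain every item in the set.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Pass 1: support of each single item, dropping items under the minimum.
min_support <- 0.3
items <- sort(unique(unlist(baskets)))
item_support <- vapply(items, function(i) support(i), numeric(1))
frequent_items <- items[item_support >= min_support]

# Pass 2: only count pairs built from surviving items (the pruning idea).
pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(pairs, support, numeric(1))
names(pair_support) <- vapply(pairs, paste, character(1), collapse = " + ")
frequent_pairs <- pair_support[pair_support >= min_support]

# Metrics for the rule {salsa} => {chips}.
rule_support    <- support(c("salsa", "chips"))         # both items together
rule_confidence <- rule_support / support("salsa")      # P(chips | salsa)
rule_lift       <- rule_confidence / support("chips")   # vs. chips overall

print(item_support)
print(frequent_pairs)
print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

In this toy data napkins fall below the minimum support, so no pair containing napkins is ever counted. That is where the time savings come from on larger datasets.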

Finding the correct limit is very difficult, as it’s dependent on the number of transactions, how many items the transactions contain, and how many distinct items there are. I’ll be using R’s arules package, which provides lots of useful functions for exploring association rules (including an implementation of the Apriori algorithm).

# install arules package
install.packages("arules")
library(arules)

# load transactions
transactions <- read.transactions('transactions.csv', format='single', sep=',', cols=c('invoiceno','description'))

# see summary of transactions
summary(transactions)
# transactions as itemMatrix in sparse format with
#  1380 rows (elements/itemsets/transactions) and
#  3060 columns (items) and a density of 0.007959648
#
# most frequent items:
# WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER
#                                187                                123
#              HEART OF WICKER SMALL            JUMBO BAG RED RETROSPOT
#                                121                                107
#        HAND WARMER BABUSHKA DESIGN                            (Other)
#                                106                              32968
#
# element (itemset/transaction) length distribution:
# sizes
#   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
# 164  61  53  53  58  50  44  40  43  39  49  39  35  35  36  37  30  24  32  27
#  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
#  27  15  19  14  14  16  13  15  15  12  12  10  11  15  10   7  11  13   3  11
#  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
#   9   6   8   7   7   5   1  10   3   2   1   1   8   4   1   2   3   4   1   1
#  61  62  63  64  65  66  67  68  69  70  71  72  74  75  80  81  83  86  89  90
#   1   2   1   1   1   2   2   4   2   1   1   2   1   1   2   1   1   2   1   1
#  93  94  96  97 100 103 104 107 108 120 126 131 134 137 139 144 147 152 154 177
#   1   1   1   1   1   1   3   1   1   1   2   1   2   1   1   1   1   1   1   1
# 178 205 207 209 217 218 220 228 235 239 246 250 254 262 269 312 313 320 324 333
#   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
# 337 354 370 380 396 401 425 443 472
#   1   1   1   1   1   1   2   1   1
#
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00    5.00   12.00   24.36   25.00  472.00

Our dataset has 1380 transactions and 3060 different items purchased over the course of about 20 months, which works out to about 2.3 transactions per day. The largest transaction contained 472 items. In 164 transactions (almost 12% of the total), only one item was purchased. The transactions are represented as a matrix, where each row is a transaction and each column is an item. The summary tells us our transaction matrix has a density of 0.0079, which means 99.2% of the elements in the matrix are empty (which makes it a sparse matrix).
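As a sanity check, the density and the mean basket size can be recomputed from the numbers printed in the summary. This is just arithmetic on the reported values; the variable names are mine.

```r
# Values copied from the summary() output above.
n_transactions  <- 1380
n_items         <- 3060
top_item_counts <- c(187, 123, 121, 107, 106)
other_count     <- 32968

# Every (transaction, item) occurrence fills one cell of the matrix.
filled_cells <- sum(top_item_counts) + other_count

density           <- filled_cells / (n_transactions * n_items)
mean_basket_size  <- filled_cells / n_transactions
share_single_item <- 164 / n_transactions

print(c(density = density,
        mean_basket_size = mean_basket_size,
        share_single_item = share_single_item))
```

The recomputed density and mean line up with the 0.007959648 and 24.36 that arules reports.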

The arules package allows us to specify parameters for the minimum support and confidence, as well as how many items the rules it generates can contain. The defaults are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum length of 10 items.

Just for fun, we’ll try the default:

rules <- apriori(transactions)
# Apriori
#
# Parameter specification:
#  confidence minval smax arem  aval originalSupport maxtime support minlen
#         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
#  maxlen target   ext
#      10  rules FALSE
#
# Algorithmic control:
#  filter tree heap memopt load sort verbose
#     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#
# Absolute minimum support count: 138
#
# set item appearances ...[0 item(s)] done [0.00s].
# set transactions ...[3060 item(s), 1380 transaction(s)] done [0.01s].
# sorting and recoding items ... [1 item(s)] done [0.00s].
# creating transaction tree ... done [0.00s].
# checking subsets of size 1 done [0.00s].
# writing ... [0 rule(s)] done [0.00s].
# creating S4 object  ... done [0.00s].
summary(rules)
# set of 0 rules

It turns out that the defaults are too strict to generate any rules. A minimum support of 0.1 means an itemset must appear in at least 138 of the 1380 transactions, which over 20 months is roughly once every four days. To relax our restrictions a bit, let’s try a support level for itemsets purchased at least once a month on average. Our dataset covers 20 months, so our support would be 20/1380 ≈ 0.0145.
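The arithmetic behind these thresholds is easy to check in R. This sketch takes the 1380 transactions from the summary output and the 20-month span from the text, and assumes 30-day months as a rough approximation.

```r
n_transactions <- 1380   # from the summary() output
months_covered <- 20     # span of the data, as stated in the text
days_covered   <- months_covered * 30   # assumption: 30-day months

# Default support of 0.1 expressed as a transaction count and a frequency.
default_count <- 0.1 * n_transactions
days_between  <- days_covered / default_count

# "At least once a month on average" expressed as a support fraction.
monthly_support <- months_covered / n_transactions

print(c(default_count = default_count,
        days_between = days_between,
        monthly_support = monthly_support))
```

The `default_count` value matches the "Absolute minimum support count: 138" line that apriori prints.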

If we try again:

# default confidence is 0.8, but if we wanted to change it, this is how we'd specify it.
rules <- apriori(transactions, parameter = list(support = 0.0145, confidence = 0.8))

Good news, we have 25 rules! We could try playing with the parameters a bit, but we can work with this for now. Let’s take a look at the first few:

inspect(rules[1:10])
# [1]  {SET/6 RED SPOTTY PAPER CUPS}    => {SET/6 RED SPOTTY PAPER PLATES}      0.01594203  0.8148148 41.64609
# [2]  {SET/6 RED SPOTTY PAPER PLATES}  => {SET/6 RED SPOTTY PAPER CUPS}        0.01594203  0.8148148 41.64609
# [3]  {ALARM CLOCK BAKELIKE CHOCOLATE} => {ALARM CLOCK BAKELIKE RED }          0.01594203  0.8148148 16.78275
# [4]  {SET OF 6 HERB TINS SKETCHBOOK}  => {SET OF 6 SPICE TINS PANTRY DESIGN}  0.01521739  0.8400000 21.46667
# [5]  {JAM JAR WITH GREEN LID}         => {JAM JAR WITH PINK LID}              0.01884058  0.8125000 31.14583
# [6]  {ALARM CLOCK BAKELIKE ORANGE}    => {ALARM CLOCK BAKELIKE GREEN}         0.02536232  0.8139535 15.38707
# [7]  {RIBBON REEL CHRISTMAS PRESENT } => {RIBBON REEL CHRISTMAS SOCK BAUBLE}  0.01521739  0.8750000 30.18750
# [8]  {CAKES AND BOWS GIFT  TAPE}      => {STARS GIFT TAPE }                   0.01521739  0.8400000 36.22500
# [9]  {SWISS ROLL TOWEL, PINK  SPOTS}  => {SWISS ROLL TOWEL, CHOCOLATE  SPOTS} 0.01521739  0.8750000 43.12500
# [10] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION}    0.01956522  0.9000000 41.40000

At first glance, it seems many rules are for similar items with different designs. The other interesting part is that rules 1 and 2 contain the same items in a different order.

The rules also have their corresponding support, confidence and lift. The lift is worth investigating more, as it tells us how much more likely the right-hand item is to appear in a transaction that contains the left-hand items than in a transaction overall. For rule 3, this means a person purchasing an ALARM CLOCK BAKELIKE CHOCOLATE is almost 17 times more likely to also purchase an ALARM CLOCK BAKELIKE RED than a typical shopper. To see the rules with the greatest lift, we can sort before printing the top 10:

inspect(sort(rules, by = "lift")[1:10])
#      lhs                                      rhs                                      support confidence     lift
# [1]  {CHRISTMAS TREE DECORATION WITH BELL,
#       CHRISTMAS TREE STAR DECORATION}      => {CHRISTMAS TREE HEART DECORATION}     0.01811594  0.9615385 44.23077
# [2]  {CHRISTMAS TREE DECORATION WITH BELL,
#       CHRISTMAS TREE HEART DECORATION}     => {CHRISTMAS TREE STAR DECORATION}      0.01811594  0.9615385 44.23077
# [3]  {CHRISTMAS TREE HEART DECORATION,
#       CHRISTMAS TREE STAR DECORATION}      => {CHRISTMAS TREE DECORATION WITH BELL} 0.01811594  0.9259259 44.06130
# [4]  {SWISS ROLL TOWEL, PINK  SPOTS}       => {SWISS ROLL TOWEL, CHOCOLATE  SPOTS}  0.01521739  0.8750000 43.12500
# [5]  {SET/6 RED SPOTTY PAPER CUPS}         => {SET/6 RED SPOTTY PAPER PLATES}       0.01594203  0.8148148 41.64609
# [6]  {SET/6 RED SPOTTY PAPER PLATES}       => {SET/6 RED SPOTTY PAPER CUPS}         0.01594203  0.8148148 41.64609
# [7]  {CHRISTMAS TREE STAR DECORATION}      => {CHRISTMAS TREE HEART DECORATION}     0.01956522  0.9000000 41.40000
# [8]  {CHRISTMAS TREE HEART DECORATION}     => {CHRISTMAS TREE STAR DECORATION}      0.01956522  0.9000000 41.40000
# [9]  {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE STAR DECORATION}      0.01884058  0.8965517 41.24138
# [10] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION}     0.01884058  0.8965517 41.24138
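To see where a lift value comes from, rule 3 from the first listing ({ALARM CLOCK BAKELIKE CHOCOLATE} => {ALARM CLOCK BAKELIKE RED}) can be taken apart using only the printed numbers. The transaction counts below are backed out of the rounded output rather than read from the data, so treat them as an approximation.

```r
n_transactions  <- 1380
rule_support    <- 0.01594203   # share of transactions containing both clocks
rule_confidence <- 0.8148148    # share of CHOCOLATE transactions that also have RED
rule_lift       <- 16.78275

# lift = confidence / support(rhs), so the RED clock's overall support is:
rhs_support <- rule_confidence / rule_lift

# Back out approximate transaction counts from the rounded figures.
both_count <- round(rule_support * n_transactions)   # both clocks
lhs_count  <- round(both_count / rule_confidence)    # CHOCOLATE clock
rhs_count  <- round(rhs_support * n_transactions)    # RED clock

# Recompute lift from the counts as a check.
lift_check <- (both_count / lhs_count) / (rhs_count / n_transactions)

print(c(both = both_count, lhs = lhs_count, rhs = rhs_count, lift = lift_check))
```

Read this way, the RED clock shows up in roughly 5% of all transactions but in over 80% of the transactions that contain the CHOCOLATE clock, and the ratio of those two shares is the lift.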

As we saw before, a lot of the sets are variations of the same item. The associations in these rules are very strong, which tells us a bit about the shopping that happens on the site. The data showed that 75% of transactions have at least 5 items, so transactions aren’t small. Given that buyers are generally wholesalers, it is not surprising to see multiple varieties of the same item purchased together. It’s also possible the site is already doing something to encourage buyers to purchase these items together.

Wrapping Up

While we didn’t come across any very surprising relationships, the transactions do seem to reflect the purchasing behavior of wholesale buyers rather than individual consumers. We could try other algorithms that are more efficient in searching for patterns, especially when itemsets are longer, and see if they allow us to expand our search and find different rules.

Given the similarity between a lot of the items in a set, we could also consider grouping similar products (maybe using something like k-means on the description). We could then run Apriori across these item groups to see how buyers are purchasing across different kinds of items. I’ve posted my code and data here.
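As a rough idea of what that grouping step might look like, here is a base R sketch that turns a few descriptions from the rules above into a word-presence matrix and runs k-means on it. The choice of three clusters, the plain word matching, and the seed are all assumptions made for this tiny example. Real descriptions would likely need more careful text handling and a different cluster count.

```r
# A few item descriptions taken from the rules shown earlier.
descriptions <- c(
  "ALARM CLOCK BAKELIKE RED",
  "ALARM CLOCK BAKELIKE GREEN",
  "JAM JAR WITH GREEN LID",
  "JAM JAR WITH PINK LID",
  "CHRISTMAS TREE STAR DECORATION",
  "CHRISTMAS TREE HEART DECORATION"
)

# Build a binary matrix: one row per description, one column per word.
words <- strsplit(descriptions, " +")
vocab <- sort(unique(unlist(words)))
term_matrix <- t(vapply(words,
                        function(w) as.numeric(vocab %in% w),
                        numeric(length(vocab))))
dimnames(term_matrix) <- list(descriptions, vocab)

# Cluster the descriptions; several random starts make the result more stable.
set.seed(42)
fit <- kmeans(term_matrix, centers = 3, nstart = 25)

# Descriptions grouped by assigned cluster.
groups <- split(descriptions, fit$cluster)
print(groups)
```

Each description could then be replaced by its group label before building the transactions, so that Apriori runs over item groups instead of individual designs.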

If you’d like to work with me, or have questions/corrections on the above, I can be reached via Twitter or email.

1: The stock code precedes the description, so I ran :g/\d\{5},,/d in vim to delete any lines that have the 5 digit stock code and then two consecutive commas. In a similar manner, I ran :g/,-\d\+,/d to remove lines containing a negative quantity.