
1 Association Rule Mining Part 2 (under construction!) Introduction to Data Mining with Case Studies. Author: G. K. Gupta, Prentice Hall India, 2006.

2 Bigger Example

3 Frequency of Items

4 Frequent Items Assume 25% support. In 25 transactions, a frequent item must then occur in at least 7 transactions. The frequent 1-itemsets (L1) are given below. How many candidates are in C2? List them.
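
To make this counting step concrete, here is a minimal sketch of how L1 and the candidate set C2 could be computed. It assumes the transactions are available as a list of item sets and that min_count is the absolute support count (7 here); the function and variable names are illustrative, not from the slides.

```python
from collections import Counter
from itertools import combinations

def frequent_1_itemsets(transactions, min_count):
    """L1: items whose support count reaches the threshold."""
    counts = Counter(item for t in transactions for item in t)
    return {frozenset([item]) for item, c in counts.items() if c >= min_count}

def candidate_2_itemsets(l1):
    """C2: every unordered pair of frequent items."""
    items = sorted(item for s in l1 for item in s)
    return {frozenset(pair) for pair in combinations(items, 2)}

# With 25% support over 25 transactions the threshold is ceil(0.25 * 25) = 7:
# l1 = frequent_1_itemsets(transactions, min_count=7)
# c2 = candidate_2_itemsets(l1)   # |C2| = |L1| * (|L1| - 1) / 2
```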

5 L2 The following pairs are frequent. Now find C3, then L3, and the rules.

6 Rules The full set of rules is given below. Could some rules be removed? Comment: study the above rules carefully.
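
As a rough illustration of how such a rule list could be produced, the sketch below generates every X → Y rule from a single frequent itemset and keeps those meeting a confidence threshold. It assumes a support dictionary mapping frozensets to counts was built during the frequent-itemset passes; all names are illustrative.

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_confidence):
    """Emit (antecedent, consequent, confidence) rules for one frequent itemset."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            consequent = items - antecedent
            confidence = support[items] / support[antecedent]
            if confidence >= min_confidence:
                rules.append((set(antecedent), set(consequent), confidence))
    return rules
```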

7 Improving the Apriori Algorithm Many techniques for improving the efficiency have been proposed:
–Pruning (already mentioned)
–Hashing-based techniques
–Transaction reduction
–Partitioning
–Sampling
–Dynamic itemset counting

8 Pruning Pruning can reduce the size of the candidate set Ck before it is checked against the database. We want to reduce Ck to the set of frequent itemsets Lk. To cut down the checking work, we may use the rule that every subset of a frequent itemset must itself be frequent, so any candidate in Ck that has an infrequent subset can be removed.
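
A minimal sketch of this join-and-prune step, assuming Lk is held as a set of frozensets; apriori_gen and the argument names are illustrative, not from the slides.

```python
from itertools import combinations

def apriori_gen(l_k, k):
    """Join Lk with itself to build C(k+1), then apply the subset-based prune."""
    candidates = set()
    for a in l_k:
        for b in l_k:
            union = a | b
            if len(union) == k + 1:
                candidates.add(union)
    # Prune: a (k+1)-candidate survives only if all of its k-subsets are in Lk.
    return {c for c in candidates
            if all(frozenset(s) in l_k for s in combinations(c, k))}
```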

9 Example Suppose the items are A, B, C, D, E, F, ..., X, Y, Z. Suppose L1 is A, C, E, P, Q, S, T, V, W, X. Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}. Are you able to identify errors in the L2 list? What is C3? How can C3 be pruned? C3 is {A, C, P}, {E, P, V}, {Q, S, X}.

10 Hashing The direct hashing and pruning (DHP) algorithm attempts to generate large itemsets efficiently and to reduce the size of the transaction database. When generating L1, the algorithm also generates all the 2-itemsets of each transaction, hashes them to a hash table and keeps a count for each bucket.

11 Example Consider the transaction database in the first table below, used in an earlier example. The second table below shows all possible 2-itemsets for each transaction.

12 Hashing Example The possible 2-itemsets from the last table are now hashed to the hash table below. The last column of the table is not required in the hash table, but we have included it to explain the technique.

13 Hash Function Used
–For each pair, a numeric value is obtained by first representing B by 1, C by 2, E by 3, J by 4, M by 5 and Y by 6. Each pair can then be represented by a two-digit number, for example (B, E) by 13 and (C, M) by 26.
–The two-digit number is then reduced modulo 8 (divide by 8 and take the remainder). This is the bucket address.
–A count of the number of pairs hashed to each bucket is kept. Buckets whose count reaches the support value have their bit set to 1, otherwise 0.
–All pairs hashed to a bucket whose bit is 0 are removed.
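
The sketch below follows this description directly: it uses the stated item codes, forms the two-digit number, reduces it modulo 8, and builds the bit vector. The transactions argument and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

CODE = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}   # item codes from the slide

def bucket(pair):
    """Two-digit number for the pair, e.g. (B, E) -> 13, then modulo 8."""
    a, b = sorted(pair, key=CODE.get)
    return (10 * CODE[a] + CODE[b]) % 8

def dhp_bit_vector(transactions, min_support, table_size=8):
    """Hash every 2-itemset of every transaction; flag buckets meeting the support."""
    counts = Counter(bucket(p) for t in transactions for p in combinations(t, 2))
    return [1 if counts[i] >= min_support else 0 for i in range(table_size)]
```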

14 Find C2 The major aim of the algorithm is to reduce the size of C2. It is therefore essential that the hash table is large enough for collisions to be rare, since collisions reduce the effectiveness of the hash table. This is what happened in the example, where collisions in three of the eight rows of the hash table required us to work out which pair in each row was actually frequent.

15 Transaction Reduction As discussed earlier, any transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, and such a transaction may be marked or removed.

16 Example Frequent items (L1) are A, B, D, M, T. We cannot use these to eliminate any transactions, since every transaction contains at least one item of L1. The frequent pairs (L2) are {A, B} and {B, M}. How can we reduce the transactions using these?

TID  Items bought
001  B, M, T, Y
002  B, M
003  T, S, P
004  A, B, C, D
005  A, B
006  T, Y, E
007  A, B, M
008  B, C, D, T, P
009  D, T, S
010  A, B, M
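
A small sketch of the reduction step using the table above: a transaction that contains no frequent pair cannot contribute a frequent 3-itemset, so it can be dropped before the next pass. The variable names are illustrative.

```python
transactions = {
    "001": {"B", "M", "T", "Y"}, "002": {"B", "M"},    "003": {"T", "S", "P"},
    "004": {"A", "B", "C", "D"}, "005": {"A", "B"},    "006": {"T", "Y", "E"},
    "007": {"A", "B", "M"},      "008": {"B", "C", "D", "T", "P"},
    "009": {"D", "T", "S"},      "010": {"A", "B", "M"},
}
frequent_pairs = [{"A", "B"}, {"B", "M"}]

# Keep only transactions that contain at least one frequent pair.
reduced = {tid: items for tid, items in transactions.items()
           if any(pair <= items for pair in frequent_pairs)}
# Transactions 003, 006, 008 and 009 are removed; the other six remain.
```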

17 Partitioning The set of transactions may be divided into a number of disjoint subsets, and each partition is then searched for frequent itemsets. These are called local frequent itemsets. How can information about local itemsets be used in finding frequent itemsets of the global set of transactions? In the example on the next slide, we have divided a set of transactions into two partitions. Find the frequent itemsets for each partition. Are these local frequent itemsets useful?

18 Example [Slide shows the transactions, as lists of item numbers, split into two partitions; the table itself has not survived in this transcript.]

19 Partitioning Phase 1
–Divide the n transactions into m partitions
–Find the frequent itemsets in each partition
–Combine all the local frequent itemsets to form the candidate itemsets
Phase 2
–Find the global frequent itemsets
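
A rough sketch of the two phases, assuming find_frequent_itemsets is any single-partition miner (for example Apriori) that returns a set of frozensets; the names and the slicing scheme are assumptions for illustration.

```python
def partitioned_frequent_itemsets(transactions, num_partitions, min_support_ratio,
                                  find_frequent_itemsets):
    """Phase 1: mine each partition locally. Phase 2: recount candidates globally."""
    size = max(1, len(transactions) // num_partitions)
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase 1: a globally frequent itemset must be frequent in at least one partition.
    candidates = set()
    for part in partitions:
        local_min = max(1, int(min_support_ratio * len(part)))
        candidates |= find_frequent_itemsets(part, local_min)

    # Phase 2: one further scan counts the candidates over all transactions.
    global_min = min_support_ratio * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= global_min}
```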

20 Sampling A random sample (usually large enough to fit in main memory) may be drawn from the overall set of transactions and searched for frequent itemsets. These are called sample frequent itemsets. How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions?

21 Sampling This is not guaranteed to be accurate, but we sacrifice some accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets. The actual frequencies of the sample frequent itemsets are then obtained from the full set of transactions. More than one sample could be used to improve accuracy.
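
A minimal sketch of this idea; the 0.75 lowering factor, the names, and the use of find_frequent_itemsets as a black-box miner are all illustrative assumptions.

```python
import random

def sample_frequent_itemsets(transactions, sample_fraction, min_support_ratio,
                             find_frequent_itemsets, lowering=0.75):
    """Mine a random sample with a lowered threshold, then verify on all the data."""
    sample = random.sample(transactions, int(sample_fraction * len(transactions)))
    # Lower the threshold on the sample to reduce the chance of missing itemsets.
    sample_min = max(1, int(lowering * min_support_ratio * len(sample)))
    candidates = find_frequent_itemsets(sample, sample_min)

    # Check the candidates' actual frequencies against the full database.
    full_min = min_support_ratio * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= full_min}
```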

22 Problems with Association Rule Algorithms Users are overwhelmed by the number of rules identified ─ how can the number of rules be reduced to those relevant to the user's needs? The Apriori algorithm assumes sparsity, since the number of items on each record is quite small. Some applications produce dense data, which may also have:
–many frequently occurring items
–strong correlations
–many items on each record

23 Problems with Association Rules Also consider: AB → C (90% confidence) and A → C (92% confidence). Clearly the first rule is of no use. We should look for more complex rules only if they are better than the simpler rules.
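
One way to act on this observation is to drop any rule whose antecedent has a proper subset giving at least the same confidence for the same consequent. The sketch below illustrates that filter; it is not an algorithm from the slides, and the rule representation is assumed.

```python
def prune_redundant_rules(rules):
    """Keep X -> C only if no simpler rule X' -> C (X' a proper subset of X)
    is at least as confident; e.g. drop AB -> C (0.90) given A -> C (0.92)."""
    kept = []
    for antecedent, consequent, confidence in rules:
        simpler_better = any(a < antecedent and c == consequent and conf >= confidence
                             for a, c, conf in rules)
        if not simpler_better:
            kept.append((antecedent, consequent, confidence))
    return kept

# rules = [(frozenset("AB"), frozenset("C"), 0.90), (frozenset("A"), frozenset("C"), 0.92)]
# prune_redundant_rules(rules)   # keeps only A -> C
```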

24 Top Down Approach The algorithms considered so far were bottom-up, i.e. they started by looking at each frequent item, then each pair, and so on. Is it possible to design top-down algorithms that consider the largest groups of items first and then find the smaller groups? Let us first look at the itemset ABCD, which can be frequent only if all its subsets are frequent.

25 Subsets of ABCD
ABCD
ABC, ABD, ACD, BCD
AB, AC, AD, BC, BD, CD
A, B, C, D

26 Closed and Maximal Itemsets A frequent closed itemset is a frequent itemset X such that no proper superset of X has the same support count as X. A frequent itemset Y is maximal if it is not a proper subset of any other frequent itemset. Therefore a maximal itemset is always a closed itemset, but a closed itemset is not necessarily maximal.
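
As a small illustration of these two definitions, the sketch below takes a dictionary of support counts for all frequent itemsets (assumed to have been computed already; the names are illustrative) and separates the closed from the maximal ones.

```python
def closed_and_maximal(support):
    """`support` maps each frequent itemset (a frozenset) to its support count."""
    frequent = set(support)
    closed = {x for x in frequent
              if not any(x < y and support[y] == support[x] for y in frequent)}
    maximal = {x for x in frequent if not any(x < y for y in frequent)}
    return closed, maximal

# With the numbers from the later example, {C, D} (support 9) comes out closed,
# because its frequent superset {B, C, D} has a lower support (8), but not maximal.
```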

27 Closed and Maximal Itemsets Frequent maximal itemsets – the frequent maximal itemsets uniquely determine all frequent itemsets, since every frequent itemset is a subset of some maximal one. Therefore the aim of an association rule algorithm can be reduced to finding all maximal frequent itemsets.

28 Closed and Maximal Itemsets In the earlier example, we found that {B, D} and {B, C, D} had the same support of 8, while {C, D} had a support of 9. {C, D} is therefore a closed itemset but not maximal. On the other hand, {B, C} was frequent but no superset of the two items is frequent, so this pair is maximal as well as closed.

29 Closed and Maximal Itemsets [Venn diagram: Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets]

30 Performance Evaluation of Algorithms The FP-growth method was usually better than the best implementation of the Apriori algorithm. CHARM was also usually better than Apriori, and in some cases better than FP-growth. Apriori was generally better than the other algorithms when the required support was high, since high support leads to a smaller number of frequent itemsets, which suits the Apriori algorithm. At very low support the number of frequent itemsets became large, and none of the algorithms handled large frequent itemsets gracefully.

31 Bibliography
R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, in Proc. of the ACM SIGMOD, 1993, pp. 207-216.
R. Ramakrishnan and J. Gehrke, Database Management Systems, 2nd ed., McGraw-Hill, 2000.
M. J. A. Berry and G. Linoff, Mastering Data Mining: The Art and Science of Customer Relationship Management, Wiley, 2000.
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.

32 Bibliography
M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, Wiley, New York, 1997.
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
R. Agrawal, M. Mehta, J. Shafer, A. Arning, and T. Bollinger, The Quest Data Mining System, in Proc. 1996 Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), Portland, Oregon, pp. 244-249, Aug 1996.

