Commit
added indexes
carnivuth committed Sep 12, 2024
1 parent bb0616f commit 887bd19
Showing 46 changed files with 612 additions and 254 deletions.
4 changes: 1 addition & 3 deletions CLASSIFICATION.canvas
Original file line number Diff line number Diff line change
@@ -5,19 +5,17 @@
{"id":"5cfa5299ec6e5e42","type":"file","file":"pages/classification/LINEAR PERCEPTRON.md","x":880,"y":-1140,"width":560,"height":400},
{"id":"9f5fffddfbaf0c19","type":"file","file":"pages/classification/SVM.md","x":880,"y":-440,"width":560,"height":560},
{"id":"030e01b3f666f84d","type":"file","file":"pages/classification/RETI NEURALI.md","x":850,"y":360,"width":620,"height":520},
{"id":"7e5a119bbf27b16a","type":"file","file":"pages/VALUTARE UN CLASSIFICATORE PROBABILISTICO.md","x":105,"y":-460,"width":540,"height":600},
{"id":"5697c0e8d4fadc73","type":"file","file":"pages/classification/DECISION TREES.md","x":-405,"y":-1140,"width":400,"height":400},
{"id":"96f0383b7c3cdd2d","type":"file","file":"pages/classification/DECISION TREE PRUNING.md","x":-405,"y":-460,"width":400,"height":400},
{"id":"12269943a4445a5d","type":"file","file":"pages/classification/REGRESSION.md","x":-880,"y":-1140,"width":400,"height":400},
{"id":"fd16fde9c842e44d","type":"file","file":"pages/classification/TRAINING STRATEGIES.md","x":-605,"y":-1880,"width":400,"height":400},
{"id":"5d76ed9475dd6a2b","type":"file","file":"pages/classification/PERFORMANCE OF A CLASSIFIER.md","x":880,"y":-1880,"width":400,"height":400}
{"id":"5d76ed9475dd6a2b","type":"file","file":"pages/classification/PERFORMANCE_OF_A_CLASSIFIER.md","x":880,"y":-1880,"width":400,"height":400}
],
"edges":[
{"id":"deda8623490138f7","fromNode":"5697c0e8d4fadc73","fromSide":"bottom","toNode":"96f0383b7c3cdd2d","toSide":"top"},
{"id":"f9597acbc37cfe21","fromNode":"c0a3107194fb5ae7","fromSide":"bottom","toNode":"5697c0e8d4fadc73","toSide":"top"},
{"id":"e1a41272a88507d2","fromNode":"c0a3107194fb5ae7","fromSide":"left","toNode":"fd16fde9c842e44d","toSide":"right"},
{"id":"58b4fbfe6e88b23b","fromNode":"c0a3107194fb5ae7","fromSide":"bottom","toNode":"12269943a4445a5d","toSide":"top"},
{"id":"96c8250cafb56e33","fromNode":"5eb3a3b43f9ae782","fromSide":"bottom","toNode":"7e5a119bbf27b16a","toSide":"top"},
{"id":"a3c41bfd658ee9ff","fromNode":"c0a3107194fb5ae7","fromSide":"bottom","toNode":"5eb3a3b43f9ae782","toSide":"top"},
{"id":"0c7c37d6d6ad53ed","fromNode":"c0a3107194fb5ae7","fromSide":"right","toNode":"5d76ed9475dd6a2b","toSide":"left"},
{"id":"2828866a9b800c1e","fromNode":"5cfa5299ec6e5e42","fromSide":"bottom","toNode":"9f5fffddfbaf0c19","toSide":"top"},
30 changes: 15 additions & 15 deletions DATA PREPROCESSING.canvas
@@ -1,22 +1,22 @@
{
"nodes":[
{"id":"bd5cc2c186c6513e","type":"file","file":"pages/preprocessing/FEATURE SUBSET SELECTION.md","x":-576,"y":-1119,"width":639,"height":870},
{"id":"94839b4b2a1dec51","type":"file","file":"pages/preprocessing/SAMPLING.md","x":619,"y":400,"width":841,"height":1213},
{"id":"ff325faa12fdff6f","type":"file","file":"pages/preprocessing/SCALING.md","x":-160,"y":560,"width":640,"height":743},
{"id":"fbf4d8f8a68b0c3b","type":"file","file":"pages/preprocessing/FEATURE CREATION.md","x":1160,"y":-251,"width":400,"height":391},
{"id":"4048a2750f515f64","type":"file","file":"pages/preprocessing/DIMENSIONALITY REDUCTION.md","x":676,"y":-1000,"width":684,"height":471},
{"id":"d7262774c5d64ddb","type":"file","file":"pages/preprocessing/DATA PREPROCESSING.md","x":-680,"y":40,"width":343,"height":200},
{"id":"dd7768016b37b514","type":"file","file":"pages/preprocessing/TYPE CONVERSIONS.md","x":-1360,"y":0,"width":400,"height":400},
{"id":"7fd5b35f882fb209","x":-708,"y":870,"width":400,"height":400,"type":"file","file":"pages/preprocessing/SIMILARITY AND DISSIMILARITY.md"},
{"id":"9938201f0bcc2a77","x":-1320,"y":731,"width":400,"height":400,"type":"file","file":"pages/preprocessing/DISTANCES.md"},
{"id":"10b4c510401968cd","x":-823,"y":489,"width":225,"height":41,"type":"text","text":"# PROXIMITY"}
{"id":"dd7768016b37b514","type":"file","file":"pages/preprocessing/TYPE CONVERSIONS.md","x":-1600,"y":-60,"width":400,"height":400},
{"id":"d7262774c5d64ddb","type":"file","file":"pages/preprocessing/DATA PREPROCESSING.md","x":-680,"y":40,"width":360,"height":200},
{"id":"10b4c510401968cd","type":"text","text":"# PROXIMITY","x":-612,"y":440,"width":225,"height":50},
{"id":"7fd5b35f882fb209","type":"file","file":"pages/preprocessing/SIMILARITY AND DISSIMILARITY.md","x":-387,"y":732,"width":400,"height":400},
{"id":"9938201f0bcc2a77","type":"file","file":"pages/preprocessing/DISTANCES.md","x":-1012,"y":732,"width":400,"height":400},
{"id":"ff325faa12fdff6f","type":"file","file":"pages/preprocessing/SCALING.md","x":-680,"y":-1040,"width":360,"height":314},
{"id":"bd5cc2c186c6513e","type":"file","file":"pages/preprocessing/FEATURE SUBSET SELECTION.md","x":-1380,"y":-560,"width":360,"height":314},
{"id":"4048a2750f515f64","type":"file","file":"pages/preprocessing/DIMENSIONALITY REDUCTION.md","x":-1160,"y":-960,"width":360,"height":314},
{"id":"94839b4b2a1dec51","type":"file","file":"pages/preprocessing/SAMPLING.md","x":-200,"y":-960,"width":360,"height":314},
{"id":"fbf4d8f8a68b0c3b","type":"file","file":"pages/preprocessing/FEATURE CREATION.md","x":0,"y":-560,"width":360,"height":314}
],
"edges":[
{"id":"4c3e9cf3050d1b64","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"bd5cc2c186c6513e","toSide":"bottom"},
{"id":"bb18dee657ecb168","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"4048a2750f515f64","toSide":"bottom"},
{"id":"dc3291d798198289","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"fbf4d8f8a68b0c3b","toSide":"left"},
{"id":"f97a2fe55380871a","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"94839b4b2a1dec51","toSide":"top"},
{"id":"a596bf5f8481ccd2","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"ff325faa12fdff6f","toSide":"top"},
{"id":"4c3e9cf3050d1b64","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"bd5cc2c186c6513e","toSide":"bottom"},
{"id":"bb18dee657ecb168","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"4048a2750f515f64","toSide":"bottom"},
{"id":"dc3291d798198289","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"fbf4d8f8a68b0c3b","toSide":"bottom"},
{"id":"f97a2fe55380871a","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"94839b4b2a1dec51","toSide":"bottom"},
{"id":"a596bf5f8481ccd2","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"ff325faa12fdff6f","toSide":"bottom"},
{"id":"6362fa8ff5cd2943","fromNode":"d7262774c5d64ddb","fromSide":"left","toNode":"dd7768016b37b514","toSide":"right"},
{"id":"0a8f6608df3bf6a1","fromNode":"d7262774c5d64ddb","fromSide":"bottom","toNode":"10b4c510401968cd","toSide":"top"},
{"id":"94854710411d170b","fromNode":"10b4c510401968cd","fromSide":"bottom","toNode":"7fd5b35f882fb209","toSide":"top"},
8 changes: 7 additions & 1 deletion index.md
@@ -1 +1,7 @@
# DATAMINING
# Datamining
## CONTENTS
- [ASSOCIATION_RULES](pages/association_rules/ASSOCIATION_RULES.md)
- [CLASSIFICATION](pages/classification/CLASSIFICATION.md)
- [CLUSTERING](pages/clustering/CLUSTERING.md)
- [BUSINESS_INTELLIGENCE_AND_DATA_WAREHOUSE](pages/datamining_process/BUSINESS_INTELLIGENCE_AND_DATA_WAREHOUSE.md)
- [DATA_PREPROCESSING](pages/preprocessing/DATA_PREPROCESSING.md)
6 changes: 0 additions & 6 deletions pages/SCELTE DI PROGETTO.md

This file was deleted.

4 changes: 0 additions & 4 deletions pages/TIPI DI LEARNING.md

This file was deleted.

11 changes: 0 additions & 11 deletions pages/VALUTARE UN CLASSIFICATORE PROBABILISTICO.md

This file was deleted.

@@ -1,10 +1,23 @@
---
id: APRIORI_ALGORITHM
aliases: []
tags: []
---

---
id: APRIORI ALGORITHM
aliases: []
tags: []
index: 5
---

# APRIORI ALGORITHM

The apriori algorithm is a strategy to prune the tree of candidates of the [frequent item-set generation](FREQUENT%20ITEMSET%20GENERATION.md) phase; it is based on the apriori principle
The apriori algorithm is a strategy to prune the tree of candidates of the [frequent item-set generation](FREQUENT_ITEMSET_GENERATION.md) phase; it is based on the apriori principle

### APRIORI PRINCIPLE
If an itemset is frequent, then all of its subsets must also be frequent; conversely, if an itemset is infrequent, all of its supersets are infrequent.
We can see this principle as follows:

$$
\forall X,Y: (X \subset Y) \implies sup(X) \geq sup(Y)
@@ -26,4 +39,6 @@ flowchart TD
C-->|repeat until the current level is empty|A
```
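The level-wise loop in the flowchart above can be sketched in Python. This is a minimal illustration, not the notes' reference implementation: the transaction list and the `min_support` value are invented for the example.

```python
from itertools import combinations

# Hypothetical transaction database, invented for illustration.
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D"},
    {"A", "B", "C", "D"}, {"A", "B", "C"},
]

def apriori(transactions, min_support):
    """Level-wise search: extend frequent (k-1)-item-sets, prune, repeat."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    level = {frozenset({i}) for i in items
             if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(level)
    while level:
        # Candidate generation: join (k-1)-sets that differ by one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Apriori pruning: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, len(c) - 1))}
        # Support counting is done only on the surviving candidates.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= level
    return frequent

print(sorted(map(sorted, apriori(transactions, min_support=0.6))))
```

With these data, `{A, D}` is infrequent, so `{A, C, D}` is pruned before its support is ever counted, which is exactly the saving the principle provides.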

The $threshold$ value is an important tuning parameter for complexity and the trade-off element between the number of valid item-sets found and their quality

[PREVIOUS](FREQUENT_ITEMSET_GENERATION.md) [NEXT](FP-GROWTH.md)
@@ -1,3 +1,10 @@
---
id: ASSOCIATION RULES
aliases: []
tags: []
index: 1
---

# ASSOCIATION RULES

They are rules that describe situations where the presence of a given element $\{A\}$ or a combination of elements $\{A,B\}$ ensures the presence of a third element $\{C\}$; they are based on statistics.
@@ -10,7 +17,7 @@ They are rules that describes situation where the presence of a given element $\
- **SUPPORT** --> Fraction of transactions that contain an itemset.
- **FREQUENT ITEMSET** --> An itemset whose support is greater than or equal to a minsup threshold.

Association rules can be described by the form

$$
A \rightarrow C \quad \text{where} \quad A,C \in itemset
@@ -30,21 +37,23 @@

### CONFIDENCE $conf$

the number of times $C$ appears over the transactions that contain $A$

$$
conf = \frac{(A,C)}{A}
$$

#### CONFIDENCE FROM SUPPORT

confidence can also be computed from supports as

$$
conf = \frac{(A,C)}{A} =\frac{\frac{(A,C)}{N}}{\frac{A}{N}} = \frac{sup(A,C)}{sup(A)}
$$


support measures "how much" an occurrence can be considered a rule (there must be enough supporting transactions); a rule with low support can be generated by random associations

confidence measures how much a rule is represented in the transactions that contain it
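Both measures can be sketched directly from a transaction list. The item names and transactions below are invented for illustration; they are not part of these notes.

```python
# Hypothetical transaction database; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """sup(A,C) / sup(A): how often C appears when A does."""
    a = frozenset(antecedent)
    return support(a | frozenset(consequent), transactions) / support(a, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # ≈ 0.75
```

Note how the second computation mirrors the $\frac{sup(A,C)}{sup(A)}$ identity above: confidence never needs a second pass over the data once the supports are known.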

[NEXT](ASSOCIATION_RULES_MINING.md)
@@ -1,14 +1,23 @@
---
id: ASSOCIATION RULES MINING
aliases: []
tags: []
index: 2
---

# ASSOCIATION RULES MINING

The goal of this procedure is, given a list of $N$ item-sets, to find association rules that have $conf$ and $sup$ greater than some thresholds

## BRUTE-FORCE APPROACH

generate all possible combinations and compute $conf$ and $sup$; this approach is always possible but is too computationally expensive

## TWO STEP APPROACH

this approach is based on the fact that rules generated from the same item-set have the same $sup$

- **[frequent itemset generation](FREQUENT%20ITEMSET%20GENERATION.md)** -> in the first step all item-sets that have $sup \gt threshold$ are generated (**this step is still computationally expensive**)
- **[frequent itemset generation](FREQUENT_ITEMSET_GENERATION.md)** -> in the first step all item-sets that have $sup \gt threshold$ are generated (**this step is still computationally expensive**)
- **RULE GENERATION** -> in the second step rules with high confidence are generated from the previously generated item-sets
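A minimal sketch of the rule-generation step, assuming the support counts were already produced by the first step. The item names and counts below are made up for illustration.

```python
from itertools import combinations

# Hypothetical support counts, as if emitted by frequent item-set generation.
support_counts = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"A", "B"}): 3,
}

def rules_from_itemset(itemset, support_counts, min_conf):
    """Emit every rule antecedent -> consequent whose confidence passes min_conf.

    All rules from the same item-set share sup(itemset); only confidence differs,
    which is why support filtering can be done once, before this step.
    """
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, size)):
            conf = support_counts[itemset] / support_counts[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

print(rules_from_itemset({"A", "B"}, support_counts, min_conf=0.8))
# [({'B'}, {'A'}, 1.0)]
```

Here $\{A\}\rightarrow\{B\}$ has confidence $3/4$ and is rejected, while $\{B\}\rightarrow\{A\}$ has confidence $3/3$ and survives, even though both come from the same item-set.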

[PREVIOUS](ASSOCIATION_RULES.md) [NEXT](RULES_GENERATION.md)
28 changes: 21 additions & 7 deletions pages/association_rules/FP-GROWTH.md
@@ -1,18 +1,25 @@
---
id: FP-GROWTH
aliases: []
tags: []
index: 6
---

The Apriori algorithm needs to generate the candidate sets, whose number can be really high!
The FP-Growth algorithm consists in finding the shortest patterns to chain with suffixes.
FP-Growth uses a compact representation of the DB via an FP-Tree, on which a recursive approach is used, following the "divide and conquer" principle.

### HOW IT WORKS

1) Data are scanned in order to find the support of every single item. Non-frequent items are discarded. Frequent items are sorted by decreasing support.
2) A second scan is done to build the FP-Tree. When the first transaction is read, the A and B nodes are generated with a frequency count of 1.
![](Pasted%20image%2020231231173158.png)
![](Pasted_image_20231231173158.png)
3) When the second transaction is read, a new set containing the B, C and D nodes is created, each one with its relative path starting from the *null* node. Then, the subtree created from the first transaction is linked to the just-generated one. The two paths do not overlap because of their different prefixes.
![](Pasted%20image%2020231231173623.png)
![](Pasted_image_20231231173623.png)
4) If an overlapping path is found (one with the same prefix as an existing node, say A), the count of node A is increased by 1.
![](Pasted%20image%2020231231174113.png)
![](Pasted_image_20231231174113.png)
5) The algorithm continues until the last transaction.
![](Pasted%20image%2020231231174157.png)
![](Pasted_image_20231231174157.png)

The tree is often smaller than the dataset, but this depends on how the transactions are ordered.
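The two scans described above can be sketched as follows. This is a simplified tree without the header/pointer table used in the mining phase; the transactions and the minimum support count are assumptions for the example.

```python
from collections import Counter

class FPNode:
    """A node of the FP-Tree: an item, a frequency count, and its children."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item supports and discard non-frequent items.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    root = FPNode(None)  # the *null* node
    # Pass 2: insert each transaction with its items sorted by decreasing
    # support, so transactions with a common prefix share the same path.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1  # overlapping prefix: just bump the counter
    return root

root = build_fp_tree([{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C"}],
                     min_support_count=2)
print(root.children["B"].count)  # 3: all three transactions share prefix B
```

Sorting by decreasing support maximizes prefix sharing, which is what keeps the tree compact relative to the raw transaction list.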

@@ -21,4 +28,11 @@ The tree size is often lower than the dataset one, but it depends on the transac
Then, FP-Growth proceeds with a **bottom-up** strategy:
- The search procedure goes from the least frequent item to the most frequent one. FP-Growth scans the tree in search of itemsets ending with the desired search item (e.g. D).
- Then it looks only for the paths that contain the element D. This search is sped up with a pointer data structure.
- So the sub-problem that contains all the itemsets that end in D and that are frequent has to be built. The search is done by evaluating all the combinations found that include D and exceed $minSup$, in a divide-and-conquer logic, from the leaves to the root.

[PREVIOUS](APRIORI_ALGORITHM.md)
@@ -1,3 +1,10 @@
---
id: FREQUENT ITEMSET GENERATION
aliases: []
tags: []
index: 4
---

# FREQUENT ITEM-SET GENERATION

This step aims to generate all possible item-sets with a $sup \gt threshold$.
@@ -12,22 +19,24 @@ This approach, which has a complexity of $\mathcal{O}(NMW)$ where:
- $W$ -> average number of items within a transaction
- $M$ -> number of frequent item-set candidate.

it is extremely computationally expensive.

There are other strategies that aim to reduce the computational cost of this operation, such as:

- reducing the number of candidates by pruning ([apriori algorithm](APRIORI%20ALGORITHM.md))
- reducing the number of candidates by pruning ([apriori algorithm](APRIORI_ALGORITHM.md))
- reducing the number of comparisons $NM$

### BRUTE-FORCE APPROACH

The brute-force approach generates each item-set in the graph above. Then, it computes the *sup* and *conf* index values for every association rule generated by every item-set.
- Complexity (**EXPENSIVE**): $\mathcal{O}(NWM)$, with $N$, $W$, $M$ as defined above

#### Frequent item-set Generation Strategies
- Reduce the **number of candidates** M (Apriori Algorithm)
- Complete Search: $M=2^D$
- Use pruning techniques to reduce M
- Reduce the **number of comparisons** NM
- Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction

[PREVIOUS](RULES_GENERATION.md) [NEXT](APRIORI_ALGORITHM.md)
0 comments on commit 887bd19