Commit
added indexes
carnivuth committed Sep 12, 2024
1 parent bb0616f commit 887bd19
Showing 46 changed files with 612 additions and 254 deletions.
4 changes: 1 addition & 3 deletions CLASSIFICATION.canvas
Original file line number Diff line number Diff line change
@@ -5,19 +5,17 @@
{"id":"5cfa5299ec6e5e42","type":"file","file":"pages/classification/LINEAR PERCEPTRON.md","x":880,"y":-1140,"width":560,"height":400},
{"id":"9f5fffddfbaf0c19","type":"file","file":"pages/classification/SVM.md","x":880,"y":-440,"width":560,"height":560},
{"id":"030e01b3f666f84d","type":"file","file":"pages/classification/RETI NEURALI.md","x":850,"y":360,"width":620,"height":520},
{"id":"7e5a119bbf27b16a","type":"file","file":"pages/VALUTARE UN CLASSIFICATORE PROBABILISTICO.md","x":105,"y":-460,"width":540,"height":600},
{"id":"5697c0e8d4fadc73","type":"file","file":"pages/classification/DECISION TREES.md","x":-405,"y":-1140,"width":400,"height":400},
{"id":"96f0383b7c3cdd2d","type":"file","file":"pages/classification/DECISION TREE PRUNING.md","x":-405,"y":-460,"width":400,"height":400},
{"id":"12269943a4445a5d","type":"file","file":"pages/classification/REGRESSION.md","x":-880,"y":-1140,"width":400,"height":400},
{"id":"fd16fde9c842e44d","type":"file","file":"pages/classification/TRAINING STRATEGIES.md","x":-605,"y":-1880,"width":400,"height":400},
{"id":"5d76ed9475dd6a2b","type":"file","file":"pages/classification/PERFORMANCE OF A CLASSIFIER.md","x":880,"y":-1880,"width":400,"height":400}
{"id":"5d76ed9475dd6a2b","type":"file","file":"pages/classification/PERFORMANCE_OF_A_CLASSIFIER.md","x":880,"y":-1880,"width":400,"height":400}
],
"edges":[
{"id":"deda8623490138f7","fromNode":"5697c0e8d4fadc73","fromSide":"bottom","toNode":"96f0383b7c3cdd2d","toSide":"top"},
{"id":"f9597acbc37cfe21","fromNode":"c0a3107194fb5ae7","fromSide":"bottom","toNode":"5697c0e8d4fadc73","toSide":"top"},
{"id":"e1a41272a88507d2","fromNode":"c0a3107194fb5ae7","fromSide":"left","toNode":"fd16fde9c842e44d","toSide":"right"},
{"id":"58b4fbfe6e88b23b","fromNode":"c0a3107194fb5ae7","fromSide":"bottom","toNode":"12269943a4445a5d","toSide":"top"},
{"id":"96c8250cafb56e33","fromNode":"5eb3a3b43f9ae782","fromSide":"bottom","toNode":"7e5a119bbf27b16a","toSide":"top"},
{"id":"a3c41bfd658ee9ff","fromNode":"c0a3107194fb5ae7","fromSide":"bottom","toNode":"5eb3a3b43f9ae782","toSide":"top"},
{"id":"0c7c37d6d6ad53ed","fromNode":"c0a3107194fb5ae7","fromSide":"right","toNode":"5d76ed9475dd6a2b","toSide":"left"},
{"id":"2828866a9b800c1e","fromNode":"5cfa5299ec6e5e42","fromSide":"bottom","toNode":"9f5fffddfbaf0c19","toSide":"top"},
30 changes: 15 additions & 15 deletions DATA PREPROCESSING.canvas
@@ -1,22 +1,22 @@
{
"nodes":[
{"id":"bd5cc2c186c6513e","type":"file","file":"pages/preprocessing/FEATURE SUBSET SELECTION.md","x":-576,"y":-1119,"width":639,"height":870},
{"id":"94839b4b2a1dec51","type":"file","file":"pages/preprocessing/SAMPLING.md","x":619,"y":400,"width":841,"height":1213},
{"id":"ff325faa12fdff6f","type":"file","file":"pages/preprocessing/SCALING.md","x":-160,"y":560,"width":640,"height":743},
{"id":"fbf4d8f8a68b0c3b","type":"file","file":"pages/preprocessing/FEATURE CREATION.md","x":1160,"y":-251,"width":400,"height":391},
{"id":"4048a2750f515f64","type":"file","file":"pages/preprocessing/DIMENSIONALITY REDUCTION.md","x":676,"y":-1000,"width":684,"height":471},
{"id":"d7262774c5d64ddb","type":"file","file":"pages/preprocessing/DATA PREPROCESSING.md","x":-680,"y":40,"width":343,"height":200},
{"id":"dd7768016b37b514","type":"file","file":"pages/preprocessing/TYPE CONVERSIONS.md","x":-1360,"y":0,"width":400,"height":400},
{"id":"7fd5b35f882fb209","x":-708,"y":870,"width":400,"height":400,"type":"file","file":"pages/preprocessing/SIMILARITY AND DISSIMILARITY.md"},
{"id":"9938201f0bcc2a77","x":-1320,"y":731,"width":400,"height":400,"type":"file","file":"pages/preprocessing/DISTANCES.md"},
{"id":"10b4c510401968cd","x":-823,"y":489,"width":225,"height":41,"type":"text","text":"# PROXIMITY"}
{"id":"dd7768016b37b514","type":"file","file":"pages/preprocessing/TYPE CONVERSIONS.md","x":-1600,"y":-60,"width":400,"height":400},
{"id":"d7262774c5d64ddb","type":"file","file":"pages/preprocessing/DATA PREPROCESSING.md","x":-680,"y":40,"width":360,"height":200},
{"id":"10b4c510401968cd","type":"text","text":"# PROXIMITY","x":-612,"y":440,"width":225,"height":50},
{"id":"7fd5b35f882fb209","type":"file","file":"pages/preprocessing/SIMILARITY AND DISSIMILARITY.md","x":-387,"y":732,"width":400,"height":400},
{"id":"9938201f0bcc2a77","type":"file","file":"pages/preprocessing/DISTANCES.md","x":-1012,"y":732,"width":400,"height":400},
{"id":"ff325faa12fdff6f","type":"file","file":"pages/preprocessing/SCALING.md","x":-680,"y":-1040,"width":360,"height":314},
{"id":"bd5cc2c186c6513e","type":"file","file":"pages/preprocessing/FEATURE SUBSET SELECTION.md","x":-1380,"y":-560,"width":360,"height":314},
{"id":"4048a2750f515f64","type":"file","file":"pages/preprocessing/DIMENSIONALITY REDUCTION.md","x":-1160,"y":-960,"width":360,"height":314},
{"id":"94839b4b2a1dec51","type":"file","file":"pages/preprocessing/SAMPLING.md","x":-200,"y":-960,"width":360,"height":314},
{"id":"fbf4d8f8a68b0c3b","type":"file","file":"pages/preprocessing/FEATURE CREATION.md","x":0,"y":-560,"width":360,"height":314}
],
"edges":[
{"id":"4c3e9cf3050d1b64","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"bd5cc2c186c6513e","toSide":"bottom"},
{"id":"bb18dee657ecb168","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"4048a2750f515f64","toSide":"bottom"},
{"id":"dc3291d798198289","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"fbf4d8f8a68b0c3b","toSide":"left"},
{"id":"f97a2fe55380871a","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"94839b4b2a1dec51","toSide":"top"},
{"id":"a596bf5f8481ccd2","fromNode":"d7262774c5d64ddb","fromSide":"right","toNode":"ff325faa12fdff6f","toSide":"top"},
{"id":"4c3e9cf3050d1b64","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"bd5cc2c186c6513e","toSide":"bottom"},
{"id":"bb18dee657ecb168","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"4048a2750f515f64","toSide":"bottom"},
{"id":"dc3291d798198289","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"fbf4d8f8a68b0c3b","toSide":"bottom"},
{"id":"f97a2fe55380871a","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"94839b4b2a1dec51","toSide":"bottom"},
{"id":"a596bf5f8481ccd2","fromNode":"d7262774c5d64ddb","fromSide":"top","toNode":"ff325faa12fdff6f","toSide":"bottom"},
{"id":"6362fa8ff5cd2943","fromNode":"d7262774c5d64ddb","fromSide":"left","toNode":"dd7768016b37b514","toSide":"right"},
{"id":"0a8f6608df3bf6a1","fromNode":"d7262774c5d64ddb","fromSide":"bottom","toNode":"10b4c510401968cd","toSide":"top"},
{"id":"94854710411d170b","fromNode":"10b4c510401968cd","fromSide":"bottom","toNode":"7fd5b35f882fb209","toSide":"top"},
8 changes: 7 additions & 1 deletion index.md
@@ -1 +1,7 @@
# DATAMINING
# Datamining
## CONTENTS
- [ASSOCIATION_RULES](pages/association_rules/ASSOCIATION_RULES.md)
- [CLASSIFICATION](pages/classification/CLASSIFICATION.md)
- [CLUSTERING](pages/clustering/CLUSTERING.md)
- [BUSINESS_INTELLIGENCE_AND_DATA_WAREHOUSE](pages/datamining_process/BUSINESS_INTELLIGENCE_AND_DATA_WAREHOUSE.md)
- [DATA_PREPROCESSING](pages/preprocessing/DATA_PREPROCESSING.md)
6 changes: 0 additions & 6 deletions pages/SCELTE DI PROGETTO.md

This file was deleted.

4 changes: 0 additions & 4 deletions pages/TIPI DI LEARNING.md

This file was deleted.

11 changes: 0 additions & 11 deletions pages/VALUTARE UN CLASSIFICATORE PROBABILISTICO.md

This file was deleted.

@@ -1,10 +1,23 @@
---
id: APRIORI_ALGORITHM
aliases: []
tags: []
---

---
id: APRIORI ALGORITHM
aliases: []
tags: []
index: 5
---

# APRIORI ALGORITHM

The apriori algorithm is a strategy to prune the tree of candidates of the [frequent item-set generation](FREQUENT%20ITEMSET%20GENERATION.md) phase; it is based on the apriori principle
The apriori algorithm is a strategy to prune the tree of candidates of the [frequent item-set generation](FREQUENT_ITEMSET_GENERATION.md) phase; it is based on the apriori principle

### APRIORI PRINCIPLE
If an itemset is frequent, then all of its subsets must also be frequent; conversely, if an itemset is infrequent, all of its supersets are infrequent.
We can see this principle as follows:

$$
\forall X,Y: (X \subset Y) \implies sup(X) \geq sup(Y)
@@ -26,4 +39,6 @@ flowchart TD
C-->|repeat until the current level is empty|A
```
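The level-wise loop in the flowchart above can be sketched in Python. This is a minimal illustration, not the notes' reference implementation: the transaction list and the `min_support` value are invented for the example.

```python
from itertools import combinations

# Hypothetical transaction database, invented for illustration.
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D"},
    {"A", "B", "C", "D"}, {"A", "B", "C"},
]

def apriori(transactions, min_support):
    """Level-wise search: extend frequent (k-1)-item-sets, prune, repeat."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    level = {frozenset({i}) for i in items
             if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(level)
    while level:
        # Candidate generation: join (k-1)-sets that differ by one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Apriori pruning: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, len(c) - 1))}
        # Support counting is done only on the surviving candidates.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= level
    return frequent

print(sorted(map(sorted, apriori(transactions, min_support=0.6))))
```

With these data, `{A, D}` is infrequent, so `{A, C, D}` is pruned before its support is ever counted, which is exactly the saving the principle provides.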

The $threshold$ value is an important tuning parameter for complexity and the trade-off element between the number of valid item-sets found and their quality

[PREVIOUS](FREQUENT_ITEMSET_GENERATION.md) [NEXT](FP-GROWTH.md)
@@ -1,3 +1,10 @@
---
id: ASSOCIATION RULES
aliases: []
tags: []
index: 1
---

# ASSOCIATION RULES

They are rules that describe situations where the presence of a given element $\{A\}$ or a combination of elements $\{A,B\}$ ensures the presence of a third element $\{C\}$; they are based on statistics.
@@ -10,7 +17,7 @@ They are rules that describes situation where the presence of a given element $\
- **SUPPORT** --> Fraction of transactions that contain an itemset.
- **FREQUENT ITEMSET** --> An itemset whose support is greater than or equal to a minsup threshold.

Association rules can be described by the form

$$
A \rightarrow C \quad \text{where} \quad A,C \in itemset
@@ -30,21 +37,23 @@

### CONFIDENCE $conf$

the number of times $C$ appears over the transactions that contain $A$

$$
conf = \frac{(A,C)}{A}
$$

#### CONFIDENCE FROM SUPPORT

confidence can also be computed from supports as

$$
conf = \frac{(A,C)}{A} =\frac{\frac{(A,C)}{N}}{\frac{A}{N}} = \frac{sup(A,C)}{sup(A)}
$$


support measures "how much" an occurrence can be considered a rule (there must be enough supporting transactions); a rule with low support can be generated by random associations

confidence measures how much a rule is represented in the transactions that contain it
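Both measures can be sketched directly from a transaction list. The item names and transactions below are invented for illustration; they are not part of these notes.

```python
# Hypothetical transaction database; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """sup(A,C) / sup(A): how often C appears when A does."""
    a = frozenset(antecedent)
    return support(a | frozenset(consequent), transactions) / support(a, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # ≈ 0.75
```

Note how the second computation mirrors the $\frac{sup(A,C)}{sup(A)}$ identity above: confidence never needs a second pass over the data once the supports are known.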

[NEXT](ASSOCIATION_RULES_MINING.md)
@@ -1,14 +1,23 @@
---
id: ASSOCIATION RULES MINING
aliases: []
tags: []
index: 2
---

# ASSOCIATION RULES MINING

The goal of this procedure is, given a list of $N$ item-sets, to find association rules that have $conf$ and $sup$ greater than some thresholds

## BRUTE-FORCE APPROACH

generate all possible combinations and compute $conf$ and $sup$; this approach is always possible but is too computationally expensive

## TWO STEP APPROACH

this approach is based on the fact that rules generated from the same item-set have the same $sup$

- **[frequent itemset generation](FREQUENT%20ITEMSET%20GENERATION.md)** -> in the first step all item-sets that have $sup \gt threshold$ are generated (**this step is still computationally expensive**)
- **[frequent itemset generation](FREQUENT_ITEMSET_GENERATION.md)** -> in the first step all item-sets that have $sup \gt threshold$ are generated (**this step is still computationally expensive**)
- **RULE GENERATION** -> in the second step rules with high confidence are generated from the previously generated item-sets
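A minimal sketch of the rule-generation step, assuming the support counts were already produced by the first step. The item names and counts below are made up for illustration.

```python
from itertools import combinations

# Hypothetical support counts, as if emitted by frequent item-set generation.
support_counts = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"A", "B"}): 3,
}

def rules_from_itemset(itemset, support_counts, min_conf):
    """Emit every rule antecedent -> consequent whose confidence passes min_conf.

    All rules from the same item-set share sup(itemset); only confidence differs,
    which is why support filtering can be done once, before this step.
    """
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, size)):
            conf = support_counts[itemset] / support_counts[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

print(rules_from_itemset({"A", "B"}, support_counts, min_conf=0.8))
# [({'B'}, {'A'}, 1.0)]
```

Here $\{A\}\rightarrow\{B\}$ has confidence $3/4$ and is rejected, while $\{B\}\rightarrow\{A\}$ has confidence $3/3$ and survives, even though both come from the same item-set.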

[PREVIOUS](ASSOCIATION_RULES.md) [NEXT](RULES_GENERATION.md)
28 changes: 21 additions & 7 deletions pages/association_rules/FP-GROWTH.md
@@ -1,18 +1,25 @@
---
id: FP-GROWTH
aliases: []
tags: []
index: 6
---

The Apriori algorithm needs to generate the candidate sets, whose number can be really high!
The FP-Growth algorithm consists in finding the shortest patterns to chain with suffixes.
FP-Growth uses a compact representation of the DB via an FP-Tree, on which a recursive approach is used, following the "divide and conquer" principle.

### HOW IT WORKS

1) Data are scanned in order to find the support of every single item. Non-frequent items are discarded. Frequent items are sorted by decreasing support.
2) A second scan is done to build the FP-Tree. When the first transaction is read, the A and B nodes are generated with a frequency count of 1.
![](Pasted%20image%2020231231173158.png)
![](Pasted_image_20231231173158.png)
3) When the second transaction is read, a new set containing the B, C and D nodes is created, each one with its relative path starting from the *null* node. Then, the subtree created from the first transaction is linked to the just-generated one. The two paths do not overlap because of their different prefixes.
![](Pasted%20image%2020231231173623.png)
![](Pasted_image_20231231173623.png)
4) If an overlapping path is found (one with the same prefix as an existing node, say A), the count of node A is increased by 1.
![](Pasted%20image%2020231231174113.png)
![](Pasted_image_20231231174113.png)
5) The algorithm continues until the last transaction.
![](Pasted%20image%2020231231174157.png)
![](Pasted_image_20231231174157.png)

The tree is often smaller than the dataset, but this depends on how the transactions are ordered.
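The two scans described above can be sketched as follows. This is a simplified tree without the header/pointer table used in the mining phase; the transactions and the minimum support count are assumptions for the example.

```python
from collections import Counter

class FPNode:
    """A node of the FP-Tree: an item, a frequency count, and its children."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item supports and discard non-frequent items.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    root = FPNode(None)  # the *null* node
    # Pass 2: insert each transaction with its items sorted by decreasing
    # support, so transactions with a common prefix share the same path.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1  # overlapping prefix: just bump the counter
    return root

root = build_fp_tree([{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C"}],
                     min_support_count=2)
print(root.children["B"].count)  # 3: all three transactions share prefix B
```

Sorting by decreasing support maximizes prefix sharing, which is what keeps the tree compact relative to the raw transaction list.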

@@ -21,4 +28,11 @@ The tree size is often lower than the dataset one, but it depends on the transac
Then, FP-Growth proceeds with a **bottom-up** strategy:
- The search procedure goes from the least frequent item to the most frequent one. FP-Growth scans the tree in search of itemsets ending with the desired search item (e.g. D).
- Then it looks only for the paths that contain the element D. This search is sped up with a pointer data structure.
- So the sub-problem that contains all the itemsets that end in D and that are frequent has to be built. The search is done by evaluating all the combinations found that include D and exceed $minSup$, in a divide-and-conquer logic, from the leaves to the root.

[PREVIOUS](APRIORI_ALGORITHM.md)
@@ -1,3 +1,10 @@
---
id: FREQUENT ITEMSET GENERATION
aliases: []
tags: []
index: 4
---

# FREQUENT ITEM-SET GENERATION

This step aims to generate all possible item-sets with a $sup \gt threshold$.
@@ -12,22 +19,24 @@ This approach, which has a complexity of $\mathcal{O}(NMW)$ where:
- $W$ -> average number of items within a transaction
- $M$ -> number of frequent item-set candidate.

it is extremely computationally expensive.

There are other strategies that aim to reduce the computational cost of this operation, such as:

- reducing the number of candidates by pruning ([apriori algorithm](APRIORI%20ALGORITHM.md))
- reducing the number of candidates by pruning ([apriori algorithm](APRIORI_ALGORITHM.md))
- reducing the number of comparisons $NM$

### BRUTE-FORCE APPROACH

The brute-force approach generates each item-set in the graph above. Then, it computes the *sup* and *conf* index values for every association rule generated by every item-set.
- Complexity (**EXPENSIVE**): $\mathcal{O}(NWM)$, with $N$, $W$, $M$ as defined above

#### Frequent item-set Generation Strategies
- Reduce the **number of candidates** M (Apriori Algorithm)
- Complete Search: $M=2^D$
- Use pruning techniques to reduce M
- Reduce the **number of comparisons** NM
- Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction

[PREVIOUS](RULES_GENERATION.md) [NEXT](APRIORI_ALGORITHM.md)
0 comments on commit 887bd19