Implementation of KNN based on the Spark-ML-LSH #3

Victor0118 · 2017-12-07T19:56:01Z

Hash function: h_i(x) = floor(r_i.dot(x) / bucketLength), where r_i is the random projection vector of the i-th hash table
threshold = 2000
W = bucketLength
NHT = number of hash tables
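
For reference, the hash function above can be sketched in plain Python. This is only an illustration, not the Spark implementation: the helper names `make_projections` and `lsh_hash` are hypothetical, and drawing each r_i as a normalized Gaussian vector is an assumption.

```python
import math
import random

def make_projections(num_tables, dim, seed=0):
    """Draw one random unit vector r_i per hash table.

    Assumption: each r_i is a Gaussian vector scaled to unit length.
    """
    rng = random.Random(seed)
    projections = []
    for _ in range(num_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        projections.append([c / norm for c in v])
    return projections

def lsh_hash(x, projections, bucket_length):
    """Return [h_1(x), ..., h_NHT(x)] with h_i(x) = floor(r_i . x / W)."""
    hashes = []
    for r in projections:
        dot = sum(r_j * x_j for r_j, x_j in zip(r, x))
        hashes.append(math.floor(dot / bucket_length))
    return hashes

# Hypothetical usage: NHT = 3 tables, 4-dimensional input, W = 2
projections = make_projections(num_tables=3, dim=4)
signature = lsh_hash([0.5, 1.0, -2.0, 3.0], projections, bucket_length=2.0)
```

Two vectors become candidate neighbors when they share a bucket in at least one of the NHT tables, which is why a larger NHT or W tends to raise both accuracy and query time in the table below.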

The number of buckets will be (max L2 norm of input vectors) / bucketLength.
If the input vectors are normalized, 1 to 10 times pow(numRecords, -1/inputDim) would be a reasonable value for bucketLength.
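
The rule of thumb above is simple arithmetic; a minimal sketch follows. The function names and the example values for numRecords and inputDim are hypothetical, chosen only to show the calculation.

```python
def suggested_bucket_length_range(num_records, input_dim):
    """Return (low, high) = (1x, 10x) of pow(numRecords, -1/inputDim)."""
    base = num_records ** (-1.0 / input_dim)
    return base, 10.0 * base

def approx_num_buckets(max_l2_norm, bucket_length):
    """Number of buckets = (max L2 norm of input vectors) / bucketLength."""
    return max_l2_norm / bucket_length

# Hypothetical example: 10,000 normalized records in 100 dimensions
low, high = suggested_bucket_length_range(num_records=10_000, input_dim=100)
buckets_at_low = approx_num_buckets(max_l2_norm=1.0, bucket_length=low)
```

Note that this heuristic applies to normalized vectors; for unnormalized inputs, a suitable W depends on the scale of the data.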

k	NHT	W	Accuracy_train	Accuracy_test	T_index	T_query
1	3	2	-	0.9087	54	175848
5	3	2	-	0.893	54	174651
9	3	2	-	0.8808	54	155673
1	5	2	-	0.9291	29	251302
5	5	2	-	0.9137	29	275162
9	5	2	-	0.9036	29	367008
1	7	2	-	0.9372	34	523696
5	7	2	-	0.9238	34	460986
9	7	2	-	0.9145	34	485565
1	3	5	-	0.9357	30	367245
5	3	5	-	0.9263	30	340930
9	3	5	-	0.9171	30	341963
1	5	5	-	0.9459	41	596984
5	5	5	-	0.9401	41	559091
9	5	5	-	0.93	41	561646
1	7	5	-	0.9496	22	770659
5	7	5	-	0.9465	22	787571
9	7	5	-	0.9385	22	841044
1	3	8	-	0.9419	37	439672
5	3	8	-	0.9348	37	417642
9	3	8	-	0.9253	37	422822
1	5	8	-	0.9481	24	605899
5	5	8	-	0.9438	24	609686
9	5	8	-	0.9358	24	609061
1	7	8	-	0.9511	22	780209
5	7	8	-	0.9447	22	769710
9	7	8	-	0.9409	22	769710

Victor0118 mentioned this issue Dec 8, 2017

add parameter tunning #5

Merged