This is the final project of the Big Data Analysis course; the paper was implemented by Ali Mortazavi and Bita Jalali.
In this project, we explore a theoretical paper published at the NeurIPS 2017 conference, titled Learning Mixture of Gaussians with Streaming Data, by running experimental tests on the performance of its algorithm. You can find the original paper here.
The authors of the paper propose an algorithm with a theoretical guarantee for clustering streaming data points drawn from a mixture of Gaussians. In particular, they show that, with high probability, the algorithm eventually learns the mixture of Gaussians accurately. Moreover, they show that the learning rate is near-optimal, meaning that in each step the error decreases at an essentially optimal rate. Note that, since learning a general mixture of Gaussians is NP-hard, they use a simplifying assumption to make the problem tractable: the data comes from a mixture of well-separated Gaussians, i.e. the centers of any two clusters are far enough apart that the algorithm can easily decide, computationally, which cluster a point belongs to.
Although this assumption lets the algorithm perform well on the problem, one question arises:
Is the analysis of the paper tight?
One way to approach this question is to test the algorithm's performance with experiments. Of course, experiments only provide evidence; they are not a proof.
In the following, we first briefly explain the general setup and the data distribution. Then we state the theorems on the performance of the algorithm. Finally, we test the algorithm by generating data from a mixture of Gaussians and running our implementation of the paper's algorithm.
Data is drawn from a mixture of $k$ spherical Gaussian distributions, i.e. each sample satisfies
$$x \;\sim\; \sum_{i=1}^{k} w_i\, \mathcal{N}\!\left(\mu_i,\, \sigma_i^2 I_{d\times d}\right),$$
where the $w_i \ge 0$ are mixture weights with $\sum_{i=1}^{k} w_i = 1$ and the $\mu_i \in \mathbb{R}^d$ are the cluster centers.
For simplicity, we assume all $\sigma_i$ are equal to a common $\sigma$ (and all weights equal to $1/k$) in our experiments.
As we stated earlier, the paper uses a separation assumption on the data. In particular, any two centers $\mu_i, \mu_j$ with $i \neq j$ satisfy $\|\mu_i - \mu_j\| \ge C\sigma$.
We will explain more about the constant $C$ in the next sections.
The algorithm consists of two parts: Algorithm 1 (the initialization algorithm) and Algorithm 2 (the streaming Lloyd's algorithm).
In the initialization part, we try to find k initial centers for the k clusters. To do so, we first use SVD-style operations to project the d-dimensional data points into a k-dimensional space. This projection de-noises the data, making it easy for the algorithm to identify all k distinct clusters. After we find the k clusters in the projected space, we map the center of each cluster back to the original d-dimensional space. This way, we obtain a good initialization for the k centers.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ortho_group
from scipy import linalg
import random
from sklearn import neighbors
import sklearn.metrics.pairwise
from scipy.sparse import csgraph
import pandas as pd
# Defining Fixed Parameters
d = 10
k = 5
scaler = 10000
sigma = 100
num_of_samples = int(d*d*d*k*k*k*np.log(d*k)) #N0
w = np.ones(k) / k
## For initialization, we need C to be at least Omega((k log k)^(1/4)). So:
C = 10*((k*np.log(k))**(1/4))
C2 = (np.log(k)**(1/2))
def generate_distant_means(dimension, C, std):
    # Draw k candidate centers uniformly at random and redraw until the paper's
    # separation assumption holds: the minimum pairwise distance between centers
    # (ignoring the zero diagonal) must be at least C * std.
    mu = np.random.rand(dimension, k) * scaler
    dist = sklearn.metrics.pairwise.euclidean_distances(mu.T)
    np.fill_diagonal(dist, np.inf)
    min_dist = np.min(dist)
    while (min_dist < C * std):
        print ("close initial points occurred, start again")
        mu = np.random.rand(dimension, k) * scaler
        dist = sklearn.metrics.pairwise.euclidean_distances(mu.T)
        np.fill_diagonal(dist, np.inf)
        min_dist = np.min(dist)
    return mu
mu = generate_distant_means(d, C, sigma)
print ("True Centers for Clusters")
tmp = pd.DataFrame(mu)
tmp
True Centers for Clusters
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 330.866695 | 1592.683836 | 2944.320020 | 9962.937967 | 5853.346165 |
| 1 | 2132.275717 | 9237.097793 | 3265.570185 | 1562.701494 | 5699.554717 |
| 2 | 6060.283353 | 5569.666050 | 409.092062 | 2363.665258 | 2617.110061 |
| 3 | 6114.819306 | 1627.059121 | 3403.442057 | 6817.543043 | 3382.809773 |
| 4 | 3936.485869 | 5456.217746 | 2443.242682 | 5012.802796 | 9621.939483 |
| 5 | 6075.204441 | 3299.465330 | 8988.023537 | 2053.151819 | 8979.809488 |
| 6 | 4076.944947 | 9141.843753 | 5706.804996 | 7593.784282 | 6401.821169 |
| 7 | 3538.960768 | 4945.260117 | 5823.908468 | 4446.951617 | 3231.770582 |
| 8 | 8690.938024 | 7077.071907 | 7232.449403 | 3030.412195 | 2919.475121 |
| 9 | 2672.001358 | 4766.881424 | 5883.643620 | 4931.263377 | 8034.932406 |
class SampleGenerator:
def __init__(self, weights, mu, sigma, d):
self.weights = weights
self.mu = mu
self.sigma=sigma
self.d = d
    def draw_a_sample(self):
        # Pick a mixture component with probability proportional to its weight,
        # then draw one sample from that spherical Gaussian (covariance sigma * I).
        rand = random.uniform(0, 1)
        for i in range(len(self.weights)):
            rand -= self.weights[i]
            if rand <= 0:
                return np.random.multivariate_normal(self.mu[:, i], self.sigma * np.identity(self.d), 1).T
        # Guard against floating-point round-off in the cumulative weights.
        return np.random.multivariate_normal(self.mu[:, -1], self.sigma * np.identity(self.d), 1).T
    def generate_samples(self, sample_size):
        # Stack sample_size draws as columns of a d x sample_size array.
        samples = []
        for i in range(sample_size):
            if len(samples) == 0:
                samples = self.draw_a_sample()
            else:
                samples = np.append(samples, self.draw_a_sample(), axis=1)
        return samples
generator = SampleGenerator(w, mu, sigma, d)
In this part, we find a k-dimensional subspace to reduce the noise of each cluster. This way, all points from a single cluster become closer to each other, allowing us to use a nearest neighbor graph to identify each cluster easily.
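A brief justification for why this projection de-noises the data (our paraphrase of the standard argument, not the paper's exact proof): for the mixture above, the second-moment matrix of the data is
$$\mathbb{E}\left[x x^{\top}\right] \;=\; \sum_{i=1}^{k} w_i\, \mu_i \mu_i^{\top} \;+\; \sigma^2 I_d,$$
so its top-k eigenvectors span the space of the k centers (the $\sigma^2 I_d$ term shifts all eigenvalues equally). Projecting a sample onto this k-dimensional span preserves the distances between centers while shrinking the typical noise magnitude from about $\sigma\sqrt{d}$ to about $\sigma\sqrt{k}$.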
N0 = num_of_samples
S = np.zeros((d, d))
k_2 = int (10*k*np.log(k))        # number of samples reserved for the nearest neighbor step
B = int(d * np.log(d))            # block size between consecutive QR steps
# Streaming power iteration: accumulate the empirical second-moment matrix S over
# blocks of B samples and periodically orthonormalize S @ U with a QR step, so that
# U tracks the top eigenvectors of the second-moment matrix (which span the cluster means).
U = ortho_group.rvs(dim=d)[:, :k] # start from k random orthonormal vectors in R^d
for t in range(N0 - k_2):
    if t % B == B-1:
        # End of a block: one power-method step followed by re-orthonormalization.
        Q, R = linalg.qr(np.matmul(S, U))
        U = Q
        S = np.zeros((d, d))
    reshaped_sample = generator.draw_a_sample().reshape(-1, 1)
    S = S + np.matmul(reshaped_sample, reshaped_sample.T)
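Before using the learned subspace, it can help to sanity-check it. The following quick check is our addition (not part of the paper's procedure) and assumes `U`, `mu`, and `k` from the cells above are in scope: it measures how much of each true center falls outside the span of the first k columns of U.
# Optional sanity check (our addition): how much of each true center lies
# outside the learned k-dimensional subspace.
Uk = U[:, :k]
proj = Uk @ (Uk.T @ mu)            # project the true centers onto span(Uk)
rel_residual = np.linalg.norm(mu - proj, axis=0) / np.linalg.norm(mu, axis=0)
print("Relative residual of each true center outside span(U[:, :k]):")
print(rel_residual)                # values near 0 mean the subspace captures the centers well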
def generate_samples(sample_size):
    #TODO: Repeated np.append copies the whole array on each call (O(n) per call, O(n^2) overall); replace with a preallocated array in the future.
samples = []
for i in range(sample_size):
if len(samples) == 0:
samples = generator.draw_a_sample()
else:
samples = np.append(samples, generator.draw_a_sample(), axis=1)
return samples
100 points drawn from the distribution are depicted below. (Note that we show a 2-D subspace of the original d-dimensional space of the data points.)
You can see that there are k clusters.
temp = generate_samples(100)
plt.plot(temp[0, :], temp[1,:], 'x')
plt.show()
First, we project the points onto the k-dimensional space to reduce the noise; then, using each point as a node, we build a nearest neighbor graph. Each connected component of the graph represents a cluster.
X0 = generate_samples(k_2)
save_U = U
U = U[:, 0:k]
components = np.matmul(U.T, X0)
adj = neighbors.kneighbors_graph(np.transpose(components), 3)
n_components, labels = csgraph.connected_components(adj)
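# Defensive check (our addition, not in the paper's pseudocode): the mean computation
# below assumes the graph splits into exactly k connected components; warn if it does not.
if n_components != k:
    print("Warning: expected", k, "components but found", n_components,
          "- consider more initialization samples or a larger separation C.")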
fig, ax = plt.subplots()
ax.scatter(components[0], components[1], c=labels)
print ("THIS is k-dimentional subspace")
plt.show()
THIS is k-dimentional subspace
components = np.matmul(U, components)
fig, ax = plt.subplots()
ax.scatter(components[0], components[1], c=labels)
print ("This is Original Space!")
plt.show()
print (components.shape)
print (X0.shape)
This is Original Space!
components_mean = np.zeros((k, d))
components_by_labels = []
for c in range(k):
list_of_indices = np.where(labels == c)[0]
components_mean[c] = np.mean(components[:, list_of_indices], axis=1)
print("Components mean:")
pd.DataFrame(components_mean.T)
Components mean:
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1595.711986 | 9959.514702 | 5853.255765 | 329.540575 | 2937.549900 |
| 1 | 9236.328278 | 1557.875046 | 5702.571121 | 2132.529562 | 3261.914944 |
| 2 | 5577.287155 | 2361.796761 | 2614.630451 | 6057.967488 | 413.756027 |
| 3 | 1622.833396 | 6816.746048 | 3384.201598 | 6115.296677 | 3405.319851 |
| 4 | 5449.638000 | 5016.487603 | 9610.478273 | 3934.831259 | 2440.172973 |
| 5 | 3301.458048 | 2046.191988 | 8980.793006 | 6071.727241 | 8987.282351 |
| 6 | 9144.341475 | 7594.503771 | 6410.580483 | 4076.321658 | 5712.465819 |
| 7 | 4943.611725 | 4446.285555 | 3222.699492 | 3538.793273 | 5827.021373 |
| 8 | 7075.887503 | 3031.410281 | 2915.983306 | 8689.949603 | 7230.836730 |
| 9 | 4769.155757 | 4928.492969 | 8039.437913 | 2670.602884 | 5891.542394 |
print ("TRUE centers")
pd.DataFrame(mu)
TRUE centers
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 330.866695 | 1592.683836 | 2944.320020 | 9962.937967 | 5853.346165 |
| 1 | 2132.275717 | 9237.097793 | 3265.570185 | 1562.701494 | 5699.554717 |
| 2 | 6060.283353 | 5569.666050 | 409.092062 | 2363.665258 | 2617.110061 |
| 3 | 6114.819306 | 1627.059121 | 3403.442057 | 6817.543043 | 3382.809773 |
| 4 | 3936.485869 | 5456.217746 | 2443.242682 | 5012.802796 | 9621.939483 |
| 5 | 6075.204441 | 3299.465330 | 8988.023537 | 2053.151819 | 8979.809488 |
| 6 | 4076.944947 | 9141.843753 | 5706.804996 | 7593.784282 | 6401.821169 |
| 7 | 3538.960768 | 4945.260117 | 5823.908468 | 4446.951617 | 3231.770582 |
| 8 | 8690.938024 | 7077.071907 | 7232.449403 | 3030.412195 | 2919.475121 |
| 9 | 2672.001358 | 4766.881424 | 5883.643620 | 4931.263377 | 8034.932406 |
As you can see, the true means are very close to the initialization algorithm's output, as desired.
As stated in Theorem 3 of the paper, if every two centers are separated by at least $C\sigma$ (with $C = \Omega((k\log k)^{1/4})$), then with high probability the initialization returns centers $\mu_i^0$ satisfying $\|\mu_i^0 - \mu_i\| \le \frac{C\sigma}{20}$ for every $i$.
comparision_matrix = sklearn.metrics.pairwise.euclidean_distances(mu.T, components_mean)
print ("Distance between each cluster center and algorithm's output for that center:")
theorem3_bound= C*sigma/20
tmp = np.ones(k) * theorem3_bound
distance_tmp = np.min(comparision_matrix, axis=0)
DF = pd.DataFrame(np.min(comparision_matrix, axis=0))
DF.columns = ["Distance"]
DF
distance = np.sum(np.min(comparision_matrix, axis=0))
print(DF)
print(distance)
Distance between each cluster center and algorithm's output for that center:
Distance
0 12.185854
1 10.525660
2 18.466775
3 5.058859
4 14.183543
60.42069018957645
print ("Theorem 3 Bound: C/20 * sigma")
theorem3_bound
Theorem 3 Bound: C/20 * sigma
84.21337295817725
def distance_calculator(mu, mu_t):
    # For each estimated center (column of mu_t), take its distance to the
    # nearest true center (column of mu) and sum these distances.
    comparision_matrix = sklearn.metrics.pairwise.euclidean_distances(mu.T, mu_t.T)
    distance = np.sum(np.min(comparision_matrix, axis=0))
    return distance
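Note that the nearest-center matching in distance_calculator can, in principle, match two estimated centers to the same true center. As an optional, stricter check (our addition, not used in the paper), the error can be computed under a one-to-one matching via the Hungarian algorithm:
from scipy.optimize import linear_sum_assignment

def matched_distance(mu, mu_t):
    # Sum of center-to-center distances under an optimal one-to-one matching
    # (our addition; a stricter alternative to distance_calculator above).
    cost = sklearn.metrics.pairwise.euclidean_distances(mu.T, mu_t.T)
    row_ind, col_ind = linear_sum_assignment(cost)
    return cost[row_ind, col_ind].sum()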
from numpy import linalg as LA
import copy
mu01=components_mean.T
def streaming_kmeans(n, input_mu, num_of_logs):
    # Streaming Lloyd's algorithm (Algorithm 2): for each arriving sample,
    # assign it to the nearest current center and move that center towards
    # the sample with step size eta.
    mu0 = copy.deepcopy(input_mu)
    distance_log = []
    eta = 3*k*(np.log2(3*n))/n
    cut = int(n / num_of_logs)
    for t in range(n):
        if t % cut == 0:
            # Periodically log the current distance to the true centers.
            dist_to_true = distance_calculator(mu, mu0)
            distance_log.append(dist_to_true)
        x = generator.draw_a_sample()
        dis = LA.norm((x - mu0), axis=0)   # distances from x to all current centers
        i = np.argmin(dis)                 # index of the nearest center
        x = x.reshape(-1)
        mu0[:, i] = np.multiply((1 - eta), mu0[:, i]) + np.multiply(eta, x)
    return mu0, distance_log
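In formulas, the update implemented above is: when a sample $x^t$ arrives, the nearest current center is selected and moved towards it,
$$i_t = \arg\min_{j} \|x^t - \mu_j^t\|, \qquad \mu_{i_t}^{t+1} = (1-\eta)\,\mu_{i_t}^{t} + \eta\, x^t, \qquad \eta = \frac{3k\log_2(3N)}{N},$$
while all other centers are left unchanged.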
print("final mean:")
num_of_logs = 10
n=2*int(d*d*d*k*k*k*np.log(d*k))
final mean:
def theorem_1(mu, N, sigma, K, C, d):
    # Error bound of Theorem 1 (as we read it from the paper), evaluated after N samples.
    return (np.max(LA.norm(mu, axis=0)) / N) + (K**3 * (sigma**2 * d * np.log(N) / N) + np.exp(-C**2 / 8) * (C**2 + K) * sigma**2)
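For reference, the expression evaluated by theorem_1 above (our reading of the dominant terms of the paper's Theorem 1 bound) is
$$\frac{\max_i \|\mu_i\|}{N} \;+\; k^3\,\frac{\sigma^2\, d\, \log N}{N} \;+\; e^{-C^2/8}\,\bigl(C^2 + k\bigr)\,\sigma^2 .$$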
run_count = 2
error_matrix = np.zeros((run_count, num_of_logs))
Ns = []
n = 2*int(d*d*d*k*k*k*np.log(d*k))
for run in range(run_count):
mu_t, distance_log = streaming_kmeans(n, mu01, num_of_logs)
error_matrix[run] = distance_log[:num_of_logs]
print(error_matrix)
[[60.42069019 1.87186898 1.8527368 1.94149964 2.03626929 1.96715912
1.65601972 2.21425012 1.97792386 2.13415165]
[60.42069019 1.95681541 1.77998779 1.84137732 2.44125269 2.12873823
2.05096963 1.98234627 1.76737377 2.02814543]]
mean_error = np.mean(error_matrix, axis=0)
mean_error
theorem_1_log = []
Ns = []
for i in range(num_of_logs):
cut = int(n / num_of_logs)
theorem_1_log.append(theorem_1(mu, cut * (i+1), sigma, k, C, d))
Ns.append(cut * (i+1))
p1, =plt.plot(Ns, mean_error, label='Expected Error')
p2, =plt.plot(Ns, theorem_1_log, label='Theorem1 Bound')
plt.legend(loc='best')
plt.show()
theorem_2_log = []
for i in range(num_of_logs):
cut = int(n / num_of_logs)
theorem_2_log.append(theorem_1(mu, cut * (i+1), sigma, k, C2, d))
p1, =plt.plot(Ns, mean_error, label='Expected Error')
p2, =plt.plot(Ns, theorem_2_log, label='Theorem2 Bound')
plt.legend(loc='best')
plt.show()
# Switch to the symmetric two-component case (k = 2, centers mu and -mu) used for the streaming EM experiment.
eta = 3 * np.log(n) / n
k = 2
sigma = 100
num_of_samples = int(d*d*d*k*k*k*np.log(d*k)) #N0
w = np.ones(k) / k
# The two centers are mu and -mu, so their separation is 2*||mu||;
# redraw until the separation-to-sigma ratio exceeds 4.
while True:
    mu = np.random.rand(1, d) * scaler
    if 2 * LA.norm(mu) / sigma > 4:
        break
mu1 = mu.reshape(-1, 1)
mu = np.concatenate([mu1, -mu1], axis=1)  # columns: [mu, -mu]
generator = SampleGenerator(w, mu, sigma, d)
generator = SampleGenerator(w, mu, sigma, d)
N0 = num_of_samples
S = np.zeros((d, d))
k_2 = int (10*k*np.log(k))
B = int(d * np.log(d))
# Re-run the streaming power method with periodic QR steps (as above) for the new two-component mixture
U = ortho_group.rvs(dim=d)[:, :k] # Choosing k orthonormal vectors in Rd space
for t in range(N0 - k_2):
if t % B == B-1:
Q, R = linalg.qr(np.matmul(S, U))
U = Q
S = np.zeros((d, d))
reshaped_sample = generator.draw_a_sample().reshape(-1, 1)
S = S + np.matmul(reshaped_sample, reshaped_sample.T)
X0 = generator.generate_samples(int(k * np.log(k)) * 100)
save_U = U
U = U[:, 0:k]
components = np.matmul(U.T, X0)
adj = neighbors.kneighbors_graph(np.transpose(components), 3)
n_components, labels = csgraph.connected_components(adj)
components = np.matmul(U, components)
components_mean = np.zeros((k, d))
components_by_labels = []
for c in range(k):
list_of_indices = np.where(labels == c)[0]
components_mean[c] = np.mean(components[:, list_of_indices], axis=1)
print("Components mean:")
pd.DataFrame(components_mean.T)
Components mean:
|   | 0 | 1 |
|---|---|---|
| 0 | -6860.805228 | 6860.324109 |
| 1 | -3387.987545 | 3387.594833 |
| 2 | -6462.469029 | 6462.764062 |
| 3 | -2488.811675 | 2487.946125 |
| 4 | -3772.291355 | 3771.761091 |
| 5 | -6048.222240 | 6047.984681 |
| 6 | -528.199426 | 528.165493 |
| 7 | -5057.924539 | 5058.506777 |
| 8 | -4220.923949 | 4221.431446 |
| 9 | -4198.118630 | 4198.349011 |
comparision_matrix = sklearn.metrics.pairwise.euclidean_distances(mu.T, components_mean)
theorem3_bound= C*sigma/20
tmp = np.ones(k) * theorem3_bound
distance = np.min(comparision_matrix, axis=0)
print(mu)
print(theorem3_bound)
print(distance)
[[ 6862.11319977 -6862.11319977]
[ 3389.87020904 -3389.87020904]
[ 6461.58603634 -6461.58603634]
[ 2487.43770585 -2487.43770585]
[ 3772.86764979 -3772.86764979]
[ 6043.92763651 -6043.92763651]
[ 525.5027017 -525.5027017 ]
[ 5058.9659104 -5058.9659104 ]
[ 4218.52890698 -4218.52890698]
[ 4199.66037439 -4199.66037439]]
84.21337295817725
[6.57026576 6.71968489]
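The streaming EM update implemented in the next cell is, per sample $x^t$,
$$w_t = \frac{e^{-\|x^t - \mu_t\|^2/\sigma^2}}{e^{-\|x^t - \mu_t\|^2/\sigma^2} + e^{-\|x^t + \mu_t\|^2/\sigma^2}}, \qquad \mu_{t+1} = (1-\eta)\,\mu_t + \eta\,(2 w_t - 1)\, x^t,$$
where $w_t$ is the soft assignment of $x^t$ to the component centered at $\mu_t$ (the other component being centered at $-\mu_t$) and $\eta$ is the step size set earlier.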
def em(N):
    # Streaming EM for the symmetric two-component mixture {mu, -mu}: each sample
    # updates the current estimate meyou with step size eta, weighted by the soft
    # posterior probability of the positive component.
    meyou = copy.deepcopy(components_mean.T[:, 0].reshape(-1, 1))
    for t in range(N):
        x = generator.draw_a_sample()
        A = np.exp((-(LA.norm(x - meyou)) ** 2) / sigma ** 2)
        B = A + np.exp((-(LA.norm(x + meyou)) ** 2) / sigma ** 2)
        w = A / B   # soft assignment to the component centered at meyou
        meyou = np.multiply((1 - eta), meyou) + np.multiply(eta * (2 * w - 1), x)
    return meyou
def theorem_7(N, sigma, d):
    # Error bound of Theorem 7 (as we read it from the paper), evaluated after N samples.
    return (LA.norm(mu1) / N) + (((np.log(N)) / N) * d * sigma**2)
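The expression evaluated by theorem_7 above (our reading of the corresponding bound) is
$$\frac{\|\mu_1\|}{N} \;+\; \frac{\log N}{N}\, d\, \sigma^2 .$$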
run_count = 10
distance_vector = []
theorem7_vector = []
Ns = []
n = 2*int(d*d*d*k*k*k*np.log(d*k))
for run in range(run_count):
n = (run + 1) * 10000
meyou = em(n)
meyous = np.concatenate([meyou, -meyou], axis=1)
distance = distance_calculator(-mu.T, meyous.T)
print(distance)
distance_vector.append(distance)
theorem7_vector.append(theorem_7(n, sigma, d))
Ns.append(n)
p1, =plt.plot(Ns, distance_vector, label='EM Error')
p2, =plt.plot(Ns, theorem7_vector, label='Theorem7 Bound')
plt.legend(loc='best')
plt.show()