SOMPY package

Training based on a real data set

We have a set of real data, which are showing the measurements of different pollutants. What we expect in the first step is to see the visual correlation between different pollutants, forexample pm2.5 and pm10
In [6]:
%reset -f
from numpy import genfromtxt, savetxt
import numpy as np
#pollution data
Data = genfromtxt(open('data/pollution.csv','r'),dtype=float, delimiter=',')[1:]
Labels = Data[:,0]
Data = Data[:,1:]
header= genfromtxt(open('data/pollution.csv','r'),delimiter=',',dtype = None)[0]
header = header[1:]
header = header[np.newaxis,:]

print 'size of data set: ', Data.shape
# %reset -f
import sys
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt
msz0 = 50
msz1 = 50
cd = msz0*msz1*1*1
dlen = 100*1000*1*1*1#+224
dim = 3
size of data set:  (58410, 16)

Traditional Methods for high dimensional data analysis

In [34]:
from pandas.tools.plotting import scatter_matrix
from pandas import Series, DataFrame
import pandas as pd
df = DataFrame(data = Data[1:1000,:], columns= header.T)

fig = scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')

Initializing the Map

Linear analysis, based on first two PCA

we can either use random values for initializing the som weight vectors or we can linearly initialize them. For that, we first need to calculate eigenvectors of the correlation matrix of the original data. Then, we take two largest eigenvalues and their corresponding eigenvectors and initialize the weight vectors linealy alongside of the SOM map based on these two PCs. For large data size it is better to use Randomized PCA method as it is fast and the results are good. We use RandomizedPCA from scikitlearn package. As we could expect, the first two PCs show which features have mutual correlation, but certainly in a linear way. If SOM training starts based on this PCA based initial values, the training needs less number of iterations.
In [7]:
reload(sys.modules['sompylib.som_structure'])
sm = SOM.SOM('sm', Data, mapsize = [msz0, msz1],norm_method = 'var',initmethod='pca')
sm.init_map()
setattr(sm, 'compname', header)
sm.view_map(which_dim = 'all')

Training

The deafult training algorithm is Batch training. The learning parameters are selected autamatically. n_job is defining the number cores to be used in parallel. It uses Joblib multiprocessing library from scikitlearn by deafault, there is no shared memory. this means if you add 1 job it increases the amount of memory you need. Then, you can use shared memory based on numpy.memmap. But to be honest, I didn't get better performance and I guess I didn't implement it correctly. The current implementation looks to behave linearly scalable with the size of memory and number of cores, but I have this feeling that my memory allocation is not optimum.
In [8]:
sm.train(n_job = 1, shared_memory = 'no',verbose='on')
initialization method = pca, initializing..

initialization done in 0.210000 seconds

rough training...
radius_ini: 7.000000 , radius_final: 1.750000, trainlen: 1

epoch: 1 ---> elapsed time:  2.102000, quantization error: 2.186011 


finetune training...
radius_ini: 1.750000 , radius_final: 1.000000, trainlen: 2

epoch: 1 ---> elapsed time:  1.508000, quantization error: 1.888593 

epoch: 2 ---> elapsed time:  1.564000, quantization error: 1.392114 

Total time elapsed: 5.891000 secodns
final quantization error: 1.392114

SOM visualization

In [9]:
sm.view_map(which_dim = 'all')
In [13]:
sm.view_map(which_dim= 'all' , pack='Yes',text_size=6,save='No',save_dir='')
In [14]:
sm.hit_map()

A comparison on larger data sets using parallel-processing and shared memory

I used an specific data size, since it was used in a map-reduce based implementation of SOM: (http://www.hicomb.org/papers/HICOMB2011-01.pdf) The data size: 81920 X 256 on 50X50 SOM grid The best achieved results in this paper (using 1024 cores) is 4.02 minutes, which doesn't look acceptable, as the data is not that large 1-parallel-Map-reduce SOM: 241.2 seconds 2- Matlab: 69.621502 seconds 3- sompy: 26.1570 seconds
In [51]:
%reset -f
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt
import sys
msz0 = 50
msz1 = 50
cd = msz0*msz1*1*1
dlen = 81920*1*1*1*1#+224
# dlen = 200*1000
dim = 256
Data = np.random.randint(0,2,size = (dlen,dim)) 
In [52]:
reload(sys.modules['sompylib.som_structure'])
sm = SOM.SOM('sm', Data, mapsize = [msz0, msz1],norm_method = 'var')
sm.train()
initialization method = pca, initializing..

initialization done in 0.744000 seconds

rough training...
radius_ini: 7.000000 , radius_final: 1.750000, trainlen: 1

epoch: 1 ---> elapsed time:  6.392000, quantization error: 15.944344 


finetune training...
radius_ini: 1.750000 , radius_final: 1.000000, trainlen: 2

epoch: 1 ---> elapsed time:  6.824000, quantization error: 15.935956 

epoch: 2 ---> elapsed time:  7.408000, quantization error: 15.821423 


Total time elapsed: 21.913000 secodns
final quantization error: 15.821423
I compared my implementation with Matlab with larger data sizes. For example: 1- data: 200*1000 X 20 on 50X50 SOM grid 1-1- Matlab: 40.590471 seconds 1-2- sompy: 11.6 secodns 2- data: 400*1000 X 20 on 50X50 SOM grid 2-1- Matlab: 81.514382 2-2- sompy: 24.992000 3- data: 800*1000 X 20 on 50X50 3-1- Matlab: 166.681264 seconds 3-2- sompy: 49.350000 secodns 2- data: 200*1000 X 50 on 80X80 SOM grid 2-1- Matlab: 191.710211 seconds 2-2- sompy: 74.932 seconds, CPU gets blocked for a while since memory becomes full! this is the current problem. When I checked the CPU performance, it seems in the begining I have a peak of memory that I can't find it.

Next Steps

1- Clustering

1-1- Based on U-Matrix

1-2- Based on K-Means

2- Prediction and Function approximation

Testing with New Data

In [115]:
from pandas.tools.plotting import scatter_matrix
from pandas import Series, DataFrame
import pandas as pd


data = Data[48*1000:58*1000]
Target = 8
print 'Variable to predict: ', header[0][Target]
pred = sm.predict_by(data,Target, K =1)
real = data[:,Target]
accuracy = (1-np.abs((pred-real)/real))*100
print 'median accuracy', np.median(accuracy)
print 'mean accuracy', np.mean(accuracy)
print 'std accuracy', np.std(accuracy)
print 'min accuracy', np.min(accuracy)
print 'max accuracy', np.max(accuracy)
DF = DataFrame({'True Value': real[1:100], 'Predicted Value':pred[1:100]})
fig = plt.figure(); 
DF.plot(DF.index,DF.columns[:],label=header[0][Target],colormap='jet',x_compat=True,style='.-'); plt.legend(loc='best',bbox_to_anchor = (1.0, 1.0),fontsize = 'medium')
plt.ylabel('values')
font = {'size'   : 12}
plt.rc('font', **font)
fig.set_size_inches(10,10)
Variable to predict:  B-PM25-ug-m3
median accuracy 96.0835761262
mean accuracy 93.8074000134
std accuracy 8.90422108991
min accuracy -93.410584713
max accuracy 99.9989308307

2-2- based on associative maps

3-classification

Same as prediction, but just with binary values:

Y=F(X)

Y={y1,..,yn},

Binary classification: yi ={0,1}

X={x1,...,xm}

4-Time series prediction

It can be converted to a classification or prediction problem, easily:

Y=F(X)

Y=x(t+1)

X=x(t),...,x(t-n)

In [ ]: