#4 Code en Vrac : Neural Network with TensorFlow by RaspVor (Part 2)

This part is still in progress: I need more time to properly explain how to transform your database so that it can be used as input to a neural network. The network is built with TensorFlow, and we use the root mean square error (RMSE) to define the cost function.

The challenge here is to obtain a matrix of 0s and 1s and, for the continuous variables, to choose whether or not to create groups. To create groups, we use the k-means algorithm.

Here, I used a different database than the Titanic one.

You will find the code below. In 2-3 weeks, I think, I will provide the full explanation.
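Before the full code, here is a minimal sketch of the grouping idea on a toy continuous variable (the values and names are illustrative, not taken from the real database): k-means assigns each value to a cluster, and each cluster then becomes a 0/1 column.

import numpy as np
from sklearn.cluster import KMeans

#Toy continuous variable (illustrative values only)
ages = np.array([18., 22., 25., 40., 43., 61., 65.]).reshape(-1, 1)

#Group the values into 3 clusters with k-means
km = KMeans(n_clusters=3).fit(ages)
labels = km.labels_  #one cluster id per observation

#One-hot encode the cluster ids: one 0/1 column per cluster
one_hot = np.stack([(labels == c).astype(float) for c in range(3)], axis=1)
print(one_hot)  #shape (7, 3), a single 1 per row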

In [6]:
################################################
###############Variables Preparation############
################################################
In [7]:
#as_matrix(): converts a pandas Series to a NumPy array
df_toPred = df["Benefice net annuel"].as_matrix()
df_toPred = df_toPred.reshape(df_toPred.shape[0],1)

#We remove the variable to explain (the target) from the database
df_toUse = df.drop('Benefice net annuel', 1)
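Note that as_matrix() has since been deprecated and removed from pandas; on a recent pandas version (>= 0.24) the equivalent would be to_numpy() (a sketch, not what I ran here):

#Equivalent on pandas >= 0.24, where as_matrix() no longer exists
df_toPred = df["Benefice net annuel"].to_numpy()
df_toPred = df_toPred.reshape(df_toPred.shape[0], 1)
df_toUse = df.drop('Benefice net annuel', axis=1)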
In [9]:
#We want to transform the database into a matrix containing only 0s and 1s.
#First, we create groups for every continuous variable (the number of clusters per variable is set in kmean_Size below)
#We keep the possibility of not grouping the continuous variables we choose
#For variables which contain labels, each label will be replaced by a digit
from sklearn.cluster import KMeans
from scipy.stats import itemfreq

def matrix_Class(df_toUse, excluded, kmean_Size):
    xval=np.array([])
    index = 0
    excluded_index = np.array([])
    
    for i, col in enumerate(df_toUse.columns):
        
        name = np.array(col)
        x = np.array(df_toUse[col])
        
        #If the variable is discrete, each label becomes its own group, even if it belongs to the list "not to group" (^o^)
        if x.dtype == 'O':  
            _, xval0 = np.unique(x, return_inverse=True)
            xval0 = xval0.reshape(1,xval0.shape[0])
            
        #If the variable is continuous
        else:
            #If it is in the excluded list, do not group it: keep it continuous
            if (name in excluded):
                xval0 = x.reshape(1,x.shape[0])
                xval0 = (xval0-np.min(xval0))/(np.max(xval0)-np.min(xval0)) #normalization between 0 and 1
                excluded_index = np.append(excluded_index,[index])
            #Otherwise, group it with k-means (at most kmean_Size[index] clusters, capped by the number of distinct values)
            else:
                #The fitted model is stored globally (kmean0, kmean1, ...) so it can be reused later
                globals()['kmean%s' % i] = KMeans(n_clusters=min(kmean_Size[index], itemfreq(x)[:,0].shape[0])).fit(x.reshape(-1,1))
                xval0 = np.array([globals()['kmean%s' % i].labels_])

        if (xval.shape[0] == 0):
            xval = xval0
        else:
            xval = np.concatenate((xval,xval0),axis=0)
        
        index += 1 
        
    return xval, excluded_index.astype(int)

excluded = np.array(['Age', 'Coefficient bonus malus'])
#excluded=np.array([''])
kmean_Size = np.array([0,10,10,10,0,10,10,10,10,10,15,15])

xval, excluded_index = matrix_Class(df_toUse, excluded, kmean_Size)
xval.shape

#Test this if you want to see the 1st variable
#print(kmean1.predict([[40.]]))
#xval[1]
#excluded_index
Out[9]:
(12, 922)
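Side note: scipy.stats.itemfreq is deprecated in recent SciPy releases; the distinct values and their counts can be obtained with np.unique instead (a sketch):

#itemfreq(x)[:, 0] (the distinct values) is equivalent to:
values, counts = np.unique(x, return_counts=True)
#so itemfreq(x)[:, 0].shape[0] is simply values.shape[0]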
In [10]:
#Each modality of a variable becomes a column: it takes the value 1 if the observation has this modality and 0 otherwise.
df_nn=np.array([])
nb_var = xval.shape[0]

def matrix_Bin(nb_var, dt_nn, xval, excluded_index, name):
    print(name)
    
    for k in range(nb_var):
        
        #Grouped variables: one 0/1 column per modality
        if (k not in excluded_index):
            
            #The modalities are the integers 0..n-1, so we can compare directly against each distinct value
            for i in itemfreq(xval[k])[:,0].astype(int):
                dt_nn0 = np.where(xval[k] == i, 1., 0.)
                dt_nn0 = dt_nn0.reshape(1,dt_nn0.shape[0])
                
                if (dt_nn.shape[0] == 0):
                    dt_nn = dt_nn0
                else:
                    dt_nn = np.concatenate((dt_nn,dt_nn0 ),axis=0)
        #Excluded (non-grouped) variables are kept as a single normalized column
        else:
            
            dt_nn0 = xval[k]
            dt_nn0 = dt_nn0.reshape(1,dt_nn0.shape[0])
            if (dt_nn.shape[0] == 0):
                dt_nn = dt_nn0
            else:
                dt_nn = np.concatenate((dt_nn,dt_nn0 ),axis=0)
            
        print("#Variable : {0} & Nber SubVariable {1}".format(k,itemfreq(xval[k])[:,0].shape[0]))
    dt_nn = dt_nn.transpose()
      
    print("Shape : {0}".format(dt_nn.shape))
    
    return dt_nn

df_nn = matrix_Bin(nb_var , df_nn, xval, excluded_index, "DATABASE")
df_nn.shape

#Verif:
#df_nn[0]
DATABASE
#Variable : 0 & Nber SubVariable 71
#Variable : 1 & Nber SubVariable 10
#Variable : 2 & Nber SubVariable 5
#Variable : 3 & Nber SubVariable 10
#Variable : 4 & Nber SubVariable 84
#Variable : 5 & Nber SubVariable 4
#Variable : 6 & Nber SubVariable 10
#Variable : 7 & Nber SubVariable 10
#Variable : 8 & Nber SubVariable 6
#Variable : 9 & Nber SubVariable 10
#Variable : 10 & Nber SubVariable 15
#Variable : 11 & Nber SubVariable 15
Shape : (922, 97)
Out[10]:
(922, 97)
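For what it is worth, this binarization is a plain one-hot encoding; with the group ids in a pandas DataFrame it could also be done with pd.get_dummies (a sketch; df_groups is a hypothetical DataFrame, and the excluded continuous columns would have to be kept apart):

import pandas as pd

#Hypothetical DataFrame with one column of group ids per variable
df_groups = pd.DataFrame(xval.T, columns=df_toUse.columns)

#One 0/1 column per (variable, modality) pair
df_dummies = pd.get_dummies(df_groups, columns=list(df_toUse.columns))
df_dummies.shape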
In [11]:
#Creation of the train database and the test database
#x% of the observations will go into the train database
from random import sample

def train_test_creation(x, data, toPred):
    indices = sample(range(data.shape[0]),int(x * data.shape[0]))
    indices = np.sort(indices, axis=None) 
    index = np.arange(data.shape[0])  #use data.shape, not the global df_nn
    reverse_index = np.delete(index, indices,0)
    
    train_toUse = data[indices]
    train_toPred = toPred[indices]
    test_toUse = data[reverse_index]
    test_toPred = toPred[reverse_index]
        
    return train_toUse, train_toPred, test_toUse, test_toPred

df_train_toUse, df_train_toPred, df_test_toUse, df_test_toPred = train_test_creation(0.7, df_nn, df_toPred)
df_train_toPred.shape
Out[11]:
(645, 1)
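The same split can be obtained with scikit-learn's train_test_split (a sketch; note that it shuffles the rows instead of keeping their original order):

from sklearn.model_selection import train_test_split

df_train_toUse, df_test_toUse, df_train_toPred, df_test_toPred = train_test_split(
    df_nn, df_toPred, train_size=0.7)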
In [ ]:
################################################
###############Tensorflow############
################################################
In [12]:
import tensorflow as tf

learning_rate = 0.01
batch_size = 100
size_train_df = df_train_toUse.shape[1]

df_train_toUse.shape
Out[12]:
(645, 97)
In [13]:
def new_weights(shape):
    #returns random values drawn from a truncated normal distribution
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

def new_biases(length):
    #returns the constant value 0.05
    return tf.Variable(tf.constant(0.05, shape=[length]))
In [14]:
def new_fc_layer(input,          # The previous layer.
                 num_inputs,     # Num. inputs from prev. layer.
                 num_outputs,    # Num. outputs.
                 use_relu=False): # Use Rectified Linear Unit (ReLU)?

    # Create new weights and biases.
    weights = new_weights(shape=[num_inputs, num_outputs])
    biases = new_biases(length=num_outputs)

    # Calculate the layer as the matrix multiplication of
    # the input and weights, and then add the bias-values.
    layer = tf.matmul(input, weights) + biases

    # Use ReLU?
    if use_relu:
        layer = tf.nn.relu(layer)

    return layer
In [15]:
x = tf.placeholder("float", [None, size_train_df], name='x')
y_true = tf.placeholder("float", [None, 1], name='y_true')

layer_1 = new_fc_layer(input=x,
                         num_inputs=size_train_df,
                         num_outputs=size_train_df,
                         use_relu=False)

layer_2 = new_fc_layer(input=layer_1,
                         num_inputs=size_train_df,
                         num_outputs=1,
                         use_relu=False)
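If you prefer not to manage the weights and biases yourself, TensorFlow 1.x provides an equivalent in tf.layers.dense (a sketch of the same two layers, assuming TF >= 1.0):

#Equivalent using tf.layers (weights and biases are created internally)
layer_1 = tf.layers.dense(inputs=x, units=size_train_df, activation=None)
layer_2 = tf.layers.dense(inputs=layer_1, units=1, activation=None)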
In [16]:
y_pred = layer_2

#The cost is the root mean square error (RMSE); rmse is already a scalar
rmse = tf.sqrt(tf.reduce_mean(tf.squared_difference(y_pred, y_true)))
cost = rmse

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

#The "accuracy" reported below is the same RMSE (lower is better)
accuracy = rmse
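For reference, the cost evaluated here is RMSE = sqrt(mean((y_pred - y_true)^2)); a quick NumPy check on toy values (illustrative numbers, not the model's output):

import numpy as np

y_p = np.array([[22.1], [56.9]])  #toy predictions
y_t = np.array([[23.9], [58.0]])  #toy true values
print(np.sqrt(np.mean((y_p - y_t) ** 2)))  #about 1.49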
In [17]:
session = tf.Session()

def init_variables():
    session.run(tf.global_variables_initializer())
In [18]:
#function next_batch
def next_batch(num, data, labels):
    '''
    Return a total of `num` random samples and labels. 
    '''
    idx = np.arange(0 , len(data))
    np.random.shuffle(idx)
    idx = idx[:num]
    data_shuffle = [data[ i] for i in idx]
    labels_shuffle = [labels[ i] for i in idx]

    return np.asarray(data_shuffle), np.asarray(labels_shuffle)

#TEST
Xtr, Ytr = np.arange(0, 10), np.arange(0, 100).reshape(10, 10)
print(Xtr)
print(Ytr)

Xtr, Ytr = next_batch(5, Xtr, Ytr)
print('\n5 random samples')
print(Xtr)
print(Ytr)
[0 1 2 3 4 5 6 7 8 9]
[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]

5 random samples
[2 1 8 6 7]
[[20 21 22 23 24 25 26 27 28 29]
 [10 11 12 13 14 15 16 17 18 19]
 [80 81 82 83 84 85 86 87 88 89]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]]
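On newer TensorFlow versions (>= 1.4), the same shuffling and batching can be delegated to the tf.data API (a sketch, not used in this notebook):

#tf.data equivalent: shuffle, repeat and batch the training set
dataset = tf.data.Dataset.from_tensor_slices((df_train_toUse, df_train_toPred))
dataset = dataset.shuffle(buffer_size=1000).repeat().batch(batch_size)
iterator = dataset.make_one_shot_iterator()
next_x, next_y = iterator.get_next()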
In [19]:
batch_size_pred = 256

def predict_y(data, labels):
    num_data = len(data)
    
    #Predictions are continuous values, so store them as floats
    preds = np.zeros(shape=(num_data, 1))
    i = 0
    while i < num_data:
        j = min(i + batch_size_pred, num_data)
        feed_dict = {x : data[i:j, :],
                     y_true : labels[i:j, :]}
        
        preds[i:j] = session.run(y_pred, feed_dict = feed_dict)
        
        i = j
    
    #Prediction errors computed on the NumPy arrays (not on the TF placeholders)
    errors = preds - labels
    
    return preds, errors
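A quick usage sketch for this function (to be run after training, since it reads the current weights from the session):

preds, errors = predict_y(df_test_toUse, df_test_toPred)
print(preds[0:5])
print(np.sqrt(np.mean(errors ** 2)))  #should match the test RMSE printed during training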
In [20]:
import time
from datetime import timedelta
In [21]:
def optimize(num_iterations, X):
    global total_iterations
    
    start_time = time.time()
    
    for i in range(num_iterations):
            total_iterations += 1
            # Get a batch of training examples.
            # x_batch holds a batch of observations and
            # y_true_batch the corresponding true values.
            x_batch, y_true_batch = next_batch(batch_size, df_train_toUse, df_train_toPred)

            # Put the batch into a dict with the proper names
            # for placeholder variables in the TensorFlow graph.
            feed_dict_train = {x: x_batch,
                               y_true: y_true_batch}
            feed_dict_test = {x: df_test_toUse,
                               y_true: df_test_toPred}
            
            # Run the optimizer using this batch of training data.
            # TensorFlow assigns the variables in feed_dict_train
            # to the placeholder variables and then runs the optimizer.
            session.run(optimizer, feed_dict=feed_dict_train)
            
            # Print status every X iterations.
            if (total_iterations % X == 0) or (i == (num_iterations - 1)):
                # Calculate the RMSE on the training set and the test set.
                acc_train = session.run(accuracy, feed_dict=feed_dict_train)
                acc_test = session.run(accuracy, feed_dict=feed_dict_test)
                
                msg = "Iteration: {0:>6}, Training Accuracy: {1}, Test Accuracy: {2}"
                print(msg.format(total_iterations, acc_train, acc_test))
    
    # Ending time.
    end_time = time.time()

    # Difference between start and end-times.
    time_dif = end_time - start_time

    # Print the time-usage.
    print("Time usage: " + str(timedelta(seconds=int(round(time_dif)))))
In [22]:
init_variables()
total_iterations = 0
In [23]:
optimize(num_iterations=5000, X=100)
Iteration:    100, Training Accuracy: 26.39575958251953, Test Accuracy: 23.229934692382812
Iteration:    200, Training Accuracy: 22.067331314086914, Test Accuracy: 18.723020553588867
Iteration:    300, Training Accuracy: 17.888702392578125, Test Accuracy: 15.831459045410156
Iteration:    400, Training Accuracy: 14.719524383544922, Test Accuracy: 13.157492637634277
Iteration:    500, Training Accuracy: 9.221675872802734, Test Accuracy: 9.866209030151367
Iteration:    600, Training Accuracy: 8.029967308044434, Test Accuracy: 8.009196281433105
Iteration:    700, Training Accuracy: 6.801394939422607, Test Accuracy: 7.626732349395752
Iteration:    800, Training Accuracy: 7.260751247406006, Test Accuracy: 7.643255233764648
Iteration:    900, Training Accuracy: 7.03613805770874, Test Accuracy: 7.664236545562744
Iteration:   1000, Training Accuracy: 6.270528316497803, Test Accuracy: 7.686746120452881
Iteration:   1100, Training Accuracy: 6.793045520782471, Test Accuracy: 7.6691460609436035
Iteration:   1200, Training Accuracy: 6.486378192901611, Test Accuracy: 7.698347568511963
Iteration:   1300, Training Accuracy: 6.4585862159729, Test Accuracy: 7.667328834533691
Iteration:   1400, Training Accuracy: 6.804159164428711, Test Accuracy: 7.717947483062744
Iteration:   1500, Training Accuracy: 6.880958557128906, Test Accuracy: 7.688291072845459
Iteration:   1600, Training Accuracy: 5.893691539764404, Test Accuracy: 7.656177997589111
Iteration:   1700, Training Accuracy: 6.317097187042236, Test Accuracy: 7.747777462005615
Iteration:   1800, Training Accuracy: 6.184149265289307, Test Accuracy: 7.779155731201172
Iteration:   1900, Training Accuracy: 7.237087249755859, Test Accuracy: 7.6863603591918945
Iteration:   2000, Training Accuracy: 6.296756744384766, Test Accuracy: 7.7188801765441895
Iteration:   2100, Training Accuracy: 6.700366973876953, Test Accuracy: 7.6587300300598145
Iteration:   2200, Training Accuracy: 6.758767604827881, Test Accuracy: 7.646804332733154
Iteration:   2300, Training Accuracy: 5.723584175109863, Test Accuracy: 7.651748180389404
Iteration:   2400, Training Accuracy: 6.965391635894775, Test Accuracy: 7.6936540603637695
Iteration:   2500, Training Accuracy: 6.149250030517578, Test Accuracy: 7.647763252258301
Iteration:   2600, Training Accuracy: 6.262087821960449, Test Accuracy: 7.683591842651367
Iteration:   2700, Training Accuracy: 5.934797763824463, Test Accuracy: 7.6465559005737305
Iteration:   2800, Training Accuracy: 5.853394031524658, Test Accuracy: 7.697845935821533
Iteration:   2900, Training Accuracy: 5.578268051147461, Test Accuracy: 7.679025650024414
Iteration:   3000, Training Accuracy: 6.447246074676514, Test Accuracy: 7.647743225097656
Iteration:   3100, Training Accuracy: 6.100986957550049, Test Accuracy: 7.657148838043213
Iteration:   3200, Training Accuracy: 5.5977067947387695, Test Accuracy: 7.657994270324707
Iteration:   3300, Training Accuracy: 6.802379608154297, Test Accuracy: 7.707957744598389
Iteration:   3400, Training Accuracy: 6.740683555603027, Test Accuracy: 7.635491371154785
Iteration:   3500, Training Accuracy: 6.13804817199707, Test Accuracy: 7.633821964263916
Iteration:   3600, Training Accuracy: 6.483687400817871, Test Accuracy: 7.694761276245117
Iteration:   3700, Training Accuracy: 5.78458833694458, Test Accuracy: 7.648858547210693
Iteration:   3800, Training Accuracy: 6.592751979827881, Test Accuracy: 7.6709513664245605
Iteration:   3900, Training Accuracy: 6.447442054748535, Test Accuracy: 7.691772937774658
Iteration:   4000, Training Accuracy: 6.293623924255371, Test Accuracy: 7.694045543670654
Iteration:   4100, Training Accuracy: 6.495583534240723, Test Accuracy: 7.712837219238281
Iteration:   4200, Training Accuracy: 5.788488388061523, Test Accuracy: 7.669111728668213
Iteration:   4300, Training Accuracy: 5.933437824249268, Test Accuracy: 7.682697296142578
Iteration:   4400, Training Accuracy: 5.678636074066162, Test Accuracy: 7.625570774078369
Iteration:   4500, Training Accuracy: 5.405104637145996, Test Accuracy: 7.6390700340271
Iteration:   4600, Training Accuracy: 6.9394330978393555, Test Accuracy: 7.616691589355469
Iteration:   4700, Training Accuracy: 5.555428981781006, Test Accuracy: 7.671112060546875
Iteration:   4800, Training Accuracy: 6.040245056152344, Test Accuracy: 7.598418712615967
Iteration:   4900, Training Accuracy: 6.241304397583008, Test Accuracy: 7.5983757972717285
Iteration:   5000, Training Accuracy: 6.15482234954834, Test Accuracy: 7.632728576660156
Time usage: 0:00:13
In [24]:
optimize(num_iterations=100000, X=10000)
Iteration:  10000, Training Accuracy: 6.777585983276367, Test Accuracy: 7.712630271911621
Iteration:  20000, Training Accuracy: 6.07899808883667, Test Accuracy: 7.641021728515625
Iteration:  30000, Training Accuracy: 6.114434242248535, Test Accuracy: 7.658448219299316
Iteration:  40000, Training Accuracy: 5.824429035186768, Test Accuracy: 7.651169300079346
Iteration:  50000, Training Accuracy: 7.028199672698975, Test Accuracy: 7.708177089691162
Iteration:  60000, Training Accuracy: 5.787368297576904, Test Accuracy: 7.606222152709961
Iteration:  70000, Training Accuracy: 6.635541915893555, Test Accuracy: 7.651212692260742
Iteration:  80000, Training Accuracy: 5.677894592285156, Test Accuracy: 7.673105239868164
Iteration:  90000, Training Accuracy: 6.491291522979736, Test Accuracy: 7.603838920593262
Iteration: 100000, Training Accuracy: 6.575507164001465, Test Accuracy: 7.635202407836914
Iteration: 105000, Training Accuracy: 6.221234321594238, Test Accuracy: 7.616175174713135
Time usage: 0:04:03
In [25]:
#Compare the first true values with the model's predictions on the test set
feed_dict_test = {x: df_test_toUse,
                  y_true: df_test_toPred}

true_vals = session.run(y_true, feed_dict=feed_dict_test)
pred_vals = session.run(y_pred, feed_dict=feed_dict_test)
print("True : {0}, Predicted : {1}".format(true_vals[0:5], pred_vals[0:5]))
True : [[ 23.9758358 ]
 [ 58.07676315]
 [-15.9749918 ]
 [ 19.30395699]
 [ 13.03372002]], Predicted : [[ 22.1019516 ]
 [ 56.95559692]
 [-12.15175533]
 [ 15.96432686]
 [ 13.13252735]]