ANN - Medical Diagnosis

We will use an artificial neural network (ANN) to diagnose breast cancer from characteristics of the cell nuclei.

Dataset

Download the Dataset

We will use a breast cancer diagnosis dataset from OpenML.org.

%%capture
!wget https://www.openml.org/data/get_csv/5600/BNG_breast-w.arff

Explore the Dataset

import pandas as pd

df = pd.read_csv('/content/BNG_breast-w.arff')
df.head()

  • All the features are real numbers with different ranges, so they need to be scaled.
  • Class has to be converted to a number in {0, 1}.

Label Encoder

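We can use sklearn’s LabelEncoder to convert the class labels benign and malignant to {0, 1}. LabelEncoder assigns codes in sorted label order, so benign becomes 0 and malignant becomes 1.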

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
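# Assign the whole column so it takes the encoder's integer dtype
# (an in-place df.loc[:, 'Class'] = ... can keep the object dtype in newer pandas).
df['Class'] = label_encoder.fit_transform(df['Class'])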
df.head()
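
To check which label ends up as which number, the fitted encoder can be inspected. The following is a minimal sketch on a toy column (the four values are made up for illustration and are not rows from the dataset); it assumes the labels are the strings benign and malignant, as described above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'Class' column (hypothetical values, not the real data).
toy = pd.DataFrame({'Class': ['benign', 'malignant', 'benign', 'malignant']})

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy['Class'])

# classes_ holds the labels in sorted order; each label's index is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform turns integer codes back into the original labels.
print(encoder.inverse_transform([0, 1]).tolist())
```

Knowing that 1 stands for malignant matters later, because the sigmoid output of the model is then the predicted probability of the malignant class.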

Scaling Features

We will use sklearn’s MinMaxScaler to scale the features. It rescales each column to the range [0, 1] by default; a different target range can be set with the feature_range parameter.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df.loc[:,  df.columns != 'Class'] = scaler.fit_transform(df.loc[:,  df.columns != 'Class'])
df.head()
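
As a small illustration of what the scaler computes, the sketch below applies MinMaxScaler to a toy matrix (the values are invented for the example) and compares the result with the formula (x - min) / (max - min) applied per column. It also shows the feature_range parameter with an arbitrary choice of (-1, 1).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: two columns with very different ranges (hypothetical values).
X_toy = np.array([[1.0, 100.0],
                  [5.0, 300.0],
                  [10.0, 500.0]])

# Default target range is [0, 1].
scaled_default = MinMaxScaler().fit_transform(X_toy)

# The same result computed by hand, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# A different target range, chosen here only for illustration.
scaled_custom = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_toy)

print(scaled_default)
print(scaled_custom)
```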

Dataframes to Arrays

X = df.loc[:,  df.columns != 'Class'].values
y = df.loc[:, 'Class'].values

print(X.shape, y.shape)
(39366, 9) (39366,)

Train-Validation Split

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
(31492, 9) (7874, 9) (31492,) (7874,)
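
The split above is random, so the shapes are reproducible but the exact rows are not. A possible variation, not what this notebook does, is sketched below on synthetic data: it fixes random_state for repeatability, uses stratify to keep the class ratio equal in both parts, and fits the scaler on the training part only so that no validation statistics leak into preprocessing. The array sizes, the 70/30 class ratio and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 100 samples, 9 features, 30% positive class.
rng = np.random.default_rng(0)
X_toy = rng.uniform(1, 10, size=(100, 9))
y_toy = np.array([0] * 70 + [1] * 30)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, stratify=y_toy, random_state=42
)

# Fit the scaler on the training part only, then apply it to both parts.
toy_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
X_va_scaled = toy_scaler.transform(X_va)

print(X_tr.shape, X_va.shape)
print(y_tr.mean(), y_va.mean())
```

With this variation, validation values can fall slightly outside [0, 1] when they lie beyond the training minimum or maximum, which is expected.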

Model

import tensorflow as tf
from tensorflow import keras

tf.keras.backend.clear_session()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=50, input_shape=[9], kernel_regularizer=tf.keras.regularizers.l2(0.00001)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(units=50, kernel_regularizer=tf.keras.regularizers.l2(0.00001)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(units=1),
    tf.keras.layers.Activation('sigmoid')
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 50)                500       
_________________________________________________________________
activation (Activation)      (None, 50)                0         
_________________________________________________________________
dropout (Dropout)            (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                2550      
_________________________________________________________________
activation_1 (Activation)    (None, 50)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 51        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 3,101
Trainable params: 3,101
Non-trainable params: 0
_________________________________________________________________
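
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, while Activation and Dropout layers add no parameters. The helper dense_params below is a hypothetical name introduced only for this check.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a Dense layer: a weight matrix plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths of the model above: 9 inputs -> 50 -> 50 -> 1 output.
layer_sizes = [9, 50, 50, 1]

per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer)  # [500, 2550, 51]
print(total)      # 3101
```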

Training

optimizer = tf.keras.optimizers.Adam(lr=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
tf_history_dp = model.fit(X_train, y_train, batch_size=50, epochs=100, verbose=True, validation_data=(X_val, y_val))
Train on 31492 samples, validate on 7874 samples
Epoch 1/100
31492/31492 [==============================] - 3s 104us/sample - loss: 0.1383 - acc: 0.9645 - val_loss: 0.0470 - val_acc: 0.9830
Epoch 2/100
31492/31492 [==============================] - 3s 101us/sample - loss: 0.0569 - acc: 0.9796 - val_loss: 0.0431 - val_acc: 0.9850
.
.
Epoch 99/100
31492/31492 [==============================] - 3s 97us/sample - loss: 0.0402 - acc: 0.9857 - val_loss: 0.0356 - val_acc: 0.9892
Epoch 100/100
31492/31492 [==============================] - 3s 92us/sample - loss: 0.0408 - acc: 0.9858 - val_loss: 0.0345 - val_acc: 0.9886

import matplotlib.pyplot as plt

plt.figure(figsize=(20,7))

plt.subplot(1,2,1)
plt.plot(tf_history_dp.history['loss'], label='Training Loss')
plt.plot(tf_history_dp.history['val_loss'], label='Validation Loss')
plt.legend()

plt.subplot(1,2,2)
plt.plot(tf_history_dp.history['acc'], label='Training Accuracy')
plt.plot(tf_history_dp.history['val_acc'], label='Validation Accuracy')
plt.legend()
plt.show()

[Figure: training and validation loss (left) and training and validation accuracy (right) per epoch]

We reached an accuracy of 98.86% on the validation set after the final epoch. In this kind of medical diagnosis task even a 0.1% improvement is important: on a validation set of 7,874 samples, that is about 8 diagnoses.

Try to improve the performance further.
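
When comparing improvements, accuracy alone does not show which kind of error the model makes. The sketch below is one way to look at this, using made-up labels and probabilities rather than real model output: it thresholds sigmoid probabilities at 0.5 and reports the confusion matrix, recall and precision. It assumes class 1 means malignant, which follows from the sorted label order used by LabelEncoder. With the trained model, y_prob would be replaced by something like model.predict(X_val).ravel() and y_true by y_val.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predicted probabilities (not real model output).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.7])

# Turn sigmoid probabilities into class predictions with a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of malignant samples that were detected.
recall = recall_score(y_true, y_pred)
# Precision: share of malignant predictions that were correct.
precision = precision_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(recall, precision)
```

A false negative here is a malignant sample predicted as benign, so recall on class 1 is often the number to watch in a diagnosis setting.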
