ANN - Medical Diagnosis


Google ML Academy 2019

Instructor: Shangeth Rajaa


We will use an ANN to diagnose breast cancer from characteristics of the cell nuclei.

Dataset

Download the Dataset

We will use a breast cancer diagnosis dataset from OpenML.org. Despite the .arff filename, the get_csv endpoint serves plain CSV, so pandas can read it directly.

%%capture
!wget https://www.openml.org/data/get_csv/5600/BNG_breast-w.arff

Explore the Dataset

import pandas as pd

df = pd.read_csv('/content/BNG_breast-w.arff')
df.head()
Clump_Thickness Cell_Size_Uniformity Cell_Shape_Uniformity Marginal_Adhesion Single_Epi_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses Class
0 7.581819 9.745087 1.000000 4.503410 7.039930 10.0 4.412282 10.000000 5.055266 malignant
1 5.210921 8.169596 7.841875 6.033275 4.269619 10.0 4.236312 4.845350 1.000000 malignant
2 4.000000 4.594296 2.330380 2.000000 3.000000 1.0 10.701823 1.101305 1.000000 benign
3 2.428871 1.000000 1.000000 1.000000 4.099291 1.0 2.000000 1.000000 1.000000 benign
4 8.855971 2.697539 6.047068 3.301891 3.000000 1.0 5.297592 4.104791 3.115741 malignant
  • You can see that all the features are real numbers with different ranges, so they need to be scaled.
  • Class has to be changed to a number in {0, 1}.

Label Encoder

We can use sklearn's LabelEncoder to change malignant and benign to {0, 1}. It assigns integers to the classes in alphabetical order, so benign becomes 0 and malignant becomes 1.

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df.loc[:, 'Class'] = label_encoder.fit_transform(df.loc[:, 'Class'])
df.head()
Clump_Thickness Cell_Size_Uniformity Cell_Shape_Uniformity Marginal_Adhesion Single_Epi_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses Class
0 7.581819 9.745087 1.000000 4.503410 7.039930 10.0 4.412282 10.000000 5.055266 1
1 5.210921 8.169596 7.841875 6.033275 4.269619 10.0 4.236312 4.845350 1.000000 1
2 4.000000 4.594296 2.330380 2.000000 3.000000 1.0 10.701823 1.101305 1.000000 0
3 2.428871 1.000000 1.000000 1.000000 4.099291 1.0 2.000000 1.000000 1.000000 0
4 8.855971 2.697539 6.047068 3.301891 3.000000 1.0 5.297592 4.104791 3.115741 1

Scaling Features

We will use sklearn's MinMaxScaler to scale the features. It maps each column x to (x - min) / (max - min), which puts every value into the range [0, 1]. You can also specify a different target range via feature_range; the default is [0, 1] (an example follows the table below).

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df.loc[:,  df.columns != 'Class'] = scaler.fit_transform(df.loc[:,  df.columns != 'Class'])
df.head()
Clump_Thickness Cell_Size_Uniformity Cell_Shape_Uniformity Marginal_Adhesion Single_Epi_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses Class
0 0.527352 0.885428 0.000000 0.344875 0.450241 0.761964 0.310056 0.929549 0.367161 1
1 0.344729 0.733487 0.589599 0.495474 0.243731 0.761964 0.294066 0.411081 0.000000 1
2 0.251456 0.388683 0.114646 0.098440 0.149088 0.084182 0.881553 0.034496 0.000000 0
3 0.130438 0.042047 0.000000 0.000000 0.231034 0.084182 0.090865 0.024306 0.000000 0
4 0.625495 0.205758 0.434931 0.226597 0.149088 0.084182 0.390499 0.336594 0.191558 1
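If you want a different range, pass feature_range to the constructor. A minimal sketch, assuming it is applied to the unscaled feature columns:

alt_scaler = MinMaxScaler(feature_range=(-1, 1))
X_alt = alt_scaler.fit_transform(df.loc[:, df.columns != 'Class'])  # each column now spans [-1, 1]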

Dataframes to Arrays

X = df.loc[:,  df.columns != 'Class'].values
y = df.loc[:, 'Class'].values

print(X.shape, y.shape)
(39366, 9) (39366,)
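Since this is a diagnosis task, it is also worth checking how balanced the two classes are (a quick check; the counts are omitted here):

print(pd.Series(y).value_counts())  # number of benign (0) vs. malignant (1) samples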

Train-Validation Split

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
(31492, 9) (7874, 9) (31492,) (7874,)
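Note that we fitted the scaler on the full dataset before splitting, so the validation rows influence the min/max statistics. A leak-free variant (a minimal sketch, assuming X still holds the unscaled feature values) fits the scaler on the training split only:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # learn min/max from the training rows only
X_val = scaler.transform(X_val)          # reuse the training statistics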

Model

import tensorflow as tf
from tensorflow import keras

tf.keras.backend.clear_session()

model = tf.keras.Sequential([
                             tf.keras.layers.Dense(units=50, input_shape=[9], kernel_regularizer=tf.keras.regularizers.l2(0.00001)), 
                             tf.keras.layers.Activation('relu'),
                             tf.keras.layers.Dropout(0.2),
                             tf.keras.layers.Dense(units=50, kernel_regularizer=tf.keras.regularizers.l2(0.00001)),
                             tf.keras.layers.Activation('relu'),
                             tf.keras.layers.Dropout(0.2),
                             tf.keras.layers.Dense(units=1), 
                             tf.keras.layers.Activation('sigmoid')
                             ])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 50)                500       
_________________________________________________________________
activation (Activation)      (None, 50)                0         
_________________________________________________________________
dropout (Dropout)            (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                2550      
_________________________________________________________________
activation_1 (Activation)    (None, 50)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 51        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 3,101
Trainable params: 3,101
Non-trainable params: 0
_________________________________________________________________
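The parameter counts follow directly from weights plus biases in each Dense layer:

dense:   9 × 50 + 50 = 500
dense_1: 50 × 50 + 50 = 2,550
dense_2: 50 × 1 + 1 = 51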

Training

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
tf_history_dp = model.fit(X_train, y_train, batch_size=50, epochs=100, verbose=True, validation_data=(X_val, y_val))
Train on 31492 samples, validate on 7874 samples
Epoch 1/100
31492/31492 [==============================] - 3s 104us/sample - loss: 0.1383 - acc: 0.9645 - val_loss: 0.0470 - val_acc: 0.9830
Epoch 2/100
31492/31492 [==============================] - 3s 101us/sample - loss: 0.0569 - acc: 0.9796 - val_loss: 0.0431 - val_acc: 0.9850
...
Epoch 99/100
31492/31492 [==============================] - 3s 97us/sample - loss: 0.0402 - acc: 0.9857 - val_loss: 0.0356 - val_acc: 0.9892
Epoch 100/100
31492/31492 [==============================] - 3s 92us/sample - loss: 0.0408 - acc: 0.9858 - val_loss: 0.0345 - val_acc: 0.9886
import matplotlib.pyplot as plt

plt.figure(figsize=(20,7))

plt.subplot(1,2,1)
plt.plot(tf_history_dp.history['loss'], label='Training Loss')
plt.plot(tf_history_dp.history['val_loss'], label='Validation Loss')
plt.legend()

plt.subplot(1,2,2)
plt.plot(tf_history_dp.history['acc'], label='Training Accuracy')
plt.plot(tf_history_dp.history['val_acc'], label='Validation Accuracy')
plt.legend()
plt.show()

[Figure: training/validation loss (left) and training/validation accuracy (right) over 100 epochs]

We reached an accuracy of 98.8% on the validation set, but in medical diagnosis tasks even a 0.1% improvement is very important.
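Accuracy alone can also hide which kinds of errors the model makes; in diagnosis, false negatives (missed malignancies) and false positives carry very different costs. A quick check with sklearn, assuming the usual 0.5 threshold on the sigmoid output:

from sklearn.metrics import classification_report, confusion_matrix

# Threshold the predicted probabilities at 0.5 and flatten to a 1-D label vector
y_pred = (model.predict(X_val) > 0.5).astype(int).ravel()
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=['benign', 'malignant']))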

Try to improve the performance further.
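One common direction (a sketch, not tuned for this dataset) is to train longer with early stopping on the validation loss and a learning-rate schedule:

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5),
]
model.fit(X_train, y_train, batch_size=50, epochs=200, validation_data=(X_val, y_val), callbacks=callbacks)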
