Google ML Academy 2019
Instructor: Shangeth Rajaa
We will use ANNs to diagnose breast cancer using characteristics of the cell nuclei.
Dataset
Download the Dataset
We will use a breast cancer diagnosis dataset from OpenML.org.
%%capture
!wget https://www.openml.org/data/get_csv/5600/BNG_breast-w.arff
Explore the Dataset
import pandas as pd
df = pd.read_csv('/content/BNG_breast-w.arff')
df.head()
|   | Clump_Thickness | Cell_Size_Uniformity | Cell_Shape_Uniformity | Marginal_Adhesion | Single_Epi_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.581819 | 9.745087 | 1.000000 | 4.503410 | 7.039930 | 10.0 | 4.412282 | 10.000000 | 5.055266 | malignant |
| 1 | 5.210921 | 8.169596 | 7.841875 | 6.033275 | 4.269619 | 10.0 | 4.236312 | 4.845350 | 1.000000 | malignant |
| 2 | 4.000000 | 4.594296 | 2.330380 | 2.000000 | 3.000000 | 1.0 | 10.701823 | 1.101305 | 1.000000 | benign |
| 3 | 2.428871 | 1.000000 | 1.000000 | 1.000000 | 4.099291 | 1.0 | 2.000000 | 1.000000 | 1.000000 | benign |
| 4 | 8.855971 | 2.697539 | 6.047068 | 3.301891 | 3.000000 | 1.0 | 5.297592 | 4.104791 | 3.115741 | malignant |
- You can see that all the features are real numbers with different ranges, so they need to be scaled.
- Class has to be changed to a number in {0, 1}.
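Before preprocessing, it is also worth checking the class balance; if the dataset were heavily imbalanced, plain accuracy would be a misleading metric. A quick check:

# Count how many rows belong to each class.
print(df['Class'].value_counts())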
Label Encoder
We can use sklearn's LabelEncoder to map benign and malignant to {0, 1}.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df.loc[:, 'Class'] = label_encoder.fit_transform(df.loc[:, 'Class'])
df.head()
|   | Clump_Thickness | Cell_Size_Uniformity | Cell_Shape_Uniformity | Marginal_Adhesion | Single_Epi_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.581819 | 9.745087 | 1.000000 | 4.503410 | 7.039930 | 10.0 | 4.412282 | 10.000000 | 5.055266 | 1 |
| 1 | 5.210921 | 8.169596 | 7.841875 | 6.033275 | 4.269619 | 10.0 | 4.236312 | 4.845350 | 1.000000 | 1 |
| 2 | 4.000000 | 4.594296 | 2.330380 | 2.000000 | 3.000000 | 1.0 | 10.701823 | 1.101305 | 1.000000 | 0 |
| 3 | 2.428871 | 1.000000 | 1.000000 | 1.000000 | 4.099291 | 1.0 | 2.000000 | 1.000000 | 1.000000 | 0 |
| 4 | 8.855971 | 2.697539 | 6.047068 | 3.301891 | 3.000000 | 1.0 | 5.297592 | 4.104791 | 3.115741 | 1 |
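To confirm which label maps to which integer, you can inspect the encoder's classes_ attribute; LabelEncoder assigns indices in sorted order, so benign becomes 0 and malignant becomes 1.

print(label_encoder.classes_)  # ['benign' 'malignant']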
Scaling Features
We will use sklearn's MinMaxScaler to scale the features. By default it maps each column into the range [0, 1]; you can also specify a different target range (see the sketch after the table below).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df.loc[:, df.columns != 'Class'] = scaler.fit_transform(df.loc[:, df.columns != 'Class'])
df.head()
|   | Clump_Thickness | Cell_Size_Uniformity | Cell_Shape_Uniformity | Marginal_Adhesion | Single_Epi_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.527352 | 0.885428 | 0.000000 | 0.344875 | 0.450241 | 0.761964 | 0.310056 | 0.929549 | 0.367161 | 1 |
| 1 | 0.344729 | 0.733487 | 0.589599 | 0.495474 | 0.243731 | 0.761964 | 0.294066 | 0.411081 | 0.000000 | 1 |
| 2 | 0.251456 | 0.388683 | 0.114646 | 0.098440 | 0.149088 | 0.084182 | 0.881553 | 0.034496 | 0.000000 | 0 |
| 3 | 0.130438 | 0.042047 | 0.000000 | 0.000000 | 0.231034 | 0.084182 | 0.090865 | 0.024306 | 0.000000 | 0 |
| 4 | 0.625495 | 0.205758 | 0.434931 | 0.226597 | 0.149088 | 0.084182 | 0.390499 | 0.336594 | 0.191558 | 1 |
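As noted above, the target range is configurable via the feature_range argument. For example, to scale into [-1, 1] instead (not needed here, just illustrating the option):

scaler_pm1 = MinMaxScaler(feature_range=(-1, 1))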
Dataframes to Arrays
X = df.loc[:, df.columns != 'Class'].values
y = df.loc[:, 'Class'].values
print(X.shape, y.shape)
(39366, 9) (39366,)
Train-Validation Split
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
(31492, 9) (7874, 9) (31492,) (7874,)
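One optional refinement: passing stratify=y makes train_test_split preserve the benign/malignant ratio in both sets, which matters more when classes are imbalanced. A sketch:

# Same split as above, but keeping class proportions identical in train and validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y)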
Model
import tensorflow as tf
from tensorflow import keras
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
    # Hidden layer 1: 9 input features -> 50 units, with a small L2 penalty on the weights
    tf.keras.layers.Dense(units=50, input_shape=[9], kernel_regularizer=tf.keras.regularizers.l2(0.00001)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.2),
    # Hidden layer 2: 50 -> 50 units, same regularization
    tf.keras.layers.Dense(units=50, kernel_regularizer=tf.keras.regularizers.l2(0.00001)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.2),
    # Output layer: a single sigmoid unit giving the probability of malignant (class 1)
    tf.keras.layers.Dense(units=1),
    tf.keras.layers.Activation('sigmoid')
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 50) 500
_________________________________________________________________
activation (Activation) (None, 50) 0
_________________________________________________________________
dropout (Dropout) (None, 50) 0
_________________________________________________________________
dense_1 (Dense) (None, 50) 2550
_________________________________________________________________
activation_1 (Activation) (None, 50) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 50) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 51
_________________________________________________________________
activation_2 (Activation) (None, 1) 0
=================================================================
Total params: 3,101
Trainable params: 3,101
Non-trainable params: 0
_________________________________________________________________
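The parameter counts follow from inputs × units + units (biases) for each Dense layer: 9 × 50 + 50 = 500, 50 × 50 + 50 = 2550, and 50 × 1 + 1 = 51, for a total of 3,101.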
Training
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
tf_history_dp = model.fit(X_train, y_train, batch_size=50, epochs=100, verbose=True, validation_data=(X_val, y_val))
Train on 31492 samples, validate on 7874 samples
Epoch 1/100
31492/31492 [==============================] - 3s 104us/sample - loss: 0.1383 - acc: 0.9645 - val_loss: 0.0470 - val_acc: 0.9830
Epoch 2/100
31492/31492 [==============================] - 3s 101us/sample - loss: 0.0569 - acc: 0.9796 - val_loss: 0.0431 - val_acc: 0.9850
.
.
Epoch 99/100
31492/31492 [==============================] - 3s 97us/sample - loss: 0.0402 - acc: 0.9857 - val_loss: 0.0356 - val_acc: 0.9892
Epoch 100/100
31492/31492 [==============================] - 3s 92us/sample - loss: 0.0408 - acc: 0.9858 - val_loss: 0.0345 - val_acc: 0.9886
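Training for a fixed 100 epochs is a simple choice; alternatively, Keras's EarlyStopping callback stops training once validation loss stops improving (the patience value below is an arbitrary pick, not tuned):

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch validation loss
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True)   # keep the weights from the best epoch
# Pass callbacks=[early_stop] to model.fit() to enable it.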
import matplotlib.pyplot as plt
plt.figure(figsize=(20,7))
plt.subplot(1,2,1)
plt.plot(tf_history_dp.history['loss'], label='Training Loss')
plt.plot(tf_history_dp.history['val_loss'], label='Validation Loss')
plt.legend()
plt.subplot(1,2,2)
plt.plot(tf_history_dp.history['acc'], label='Training Accuracy')
plt.plot(tf_history_dp.history['val_acc'], label='Validation Accuracy')
plt.legend()
plt.show()
We were able to get an accuracy of about 98.8% on the validation set, but in medical diagnosis tasks like this, even a 0.1% improvement is significant. Try to improve the performance further.
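Accuracy alone can also hide where the model fails; in a diagnosis setting, false negatives (malignant cases predicted as benign) are especially costly. A sketch of per-class metrics on the validation set, thresholding the sigmoid output at an assumed cutoff of 0.5:

from sklearn.metrics import classification_report, confusion_matrix

# Threshold the sigmoid probabilities at 0.5 to get hard class labels.
y_pred = (model.predict(X_val) > 0.5).astype(int).ravel()

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=['benign', 'malignant']))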