Google ML Academy 2019
Instructor: Shangeth Rajaa
We will use ANNs to diagnose breast cancer using characteristics of the cell nuclei.
Dataset
Download the Dataset
We will use a breast cancer diagnosis dataset from OpenML.org.
%%capture
!wget https://www.openml.org/data/get_csv/5600/BNG_breast-w.arff
Explore the Dataset
import pandas as pd
df = pd.read_csv('/content/BNG_breast-w.arff')
df.head()
|   | Clump_Thickness | Cell_Size_Uniformity | Cell_Shape_Uniformity | Marginal_Adhesion | Single_Epi_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.581819 | 9.745087 | 1.000000 | 4.503410 | 7.039930 | 10.0 | 4.412282 | 10.000000 | 5.055266 | malignant |
| 1 | 5.210921 | 8.169596 | 7.841875 | 6.033275 | 4.269619 | 10.0 | 4.236312 | 4.845350 | 1.000000 | malignant |
| 2 | 4.000000 | 4.594296 | 2.330380 | 2.000000 | 3.000000 | 1.0 | 10.701823 | 1.101305 | 1.000000 | benign |
| 3 | 2.428871 | 1.000000 | 1.000000 | 1.000000 | 4.099291 | 1.0 | 2.000000 | 1.000000 | 1.000000 | benign |
| 4 | 8.855971 | 2.697539 | 6.047068 | 3.301891 | 3.000000 | 1.0 | 5.297592 | 4.104791 | 3.115741 | malignant |
- You can see that all the features are real numbers with different ranges, so they need to be scaled.
- Class has to be changed to a number in {0, 1}.
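Before preprocessing, it is also worth checking the class balance; if the dataset were heavily imbalanced, plain accuracy would be a misleading metric. A quick check:

# Count how many rows belong to each class.
print(df['Class'].value_counts())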
Label Encoder
We can use sklearn's LabelEncoder to map benign and malignant to {0, 1}.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df.loc[:, 'Class'] = label_encoder.fit_transform(df.loc[:, 'Class'])
df.head()
|   | Clump_Thickness | Cell_Size_Uniformity | Cell_Shape_Uniformity | Marginal_Adhesion | Single_Epi_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.581819 | 9.745087 | 1.000000 | 4.503410 | 7.039930 | 10.0 | 4.412282 | 10.000000 | 5.055266 | 1 |
| 1 | 5.210921 | 8.169596 | 7.841875 | 6.033275 | 4.269619 | 10.0 | 4.236312 | 4.845350 | 1.000000 | 1 |
| 2 | 4.000000 | 4.594296 | 2.330380 | 2.000000 | 3.000000 | 1.0 | 10.701823 | 1.101305 | 1.000000 | 0 |
| 3 | 2.428871 | 1.000000 | 1.000000 | 1.000000 | 4.099291 | 1.0 | 2.000000 | 1.000000 | 1.000000 | 0 |
| 4 | 8.855971 | 2.697539 | 6.047068 | 3.301891 | 3.000000 | 1.0 | 5.297592 | 4.104791 | 3.115741 | 1 |
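To confirm which label maps to which integer, you can inspect the encoder's classes_ attribute; LabelEncoder assigns indices in sorted order, so benign becomes 0 and malignant becomes 1.

print(label_encoder.classes_)  # ['benign' 'malignant']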
Scaling Features
We will use sklearn's MinMaxScaler to scale the features. By default it maps each column into the range [0, 1]; you can also specify a different target range (see the sketch after the table below).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df.loc[:, df.columns != 'Class'] = scaler.fit_transform(df.loc[:, df.columns != 'Class'])
df.head()
|   | Clump_Thickness | Cell_Size_Uniformity | Cell_Shape_Uniformity | Marginal_Adhesion | Single_Epi_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.527352 | 0.885428 | 0.000000 | 0.344875 | 0.450241 | 0.761964 | 0.310056 | 0.929549 | 0.367161 | 1 |
| 1 | 0.344729 | 0.733487 | 0.589599 | 0.495474 | 0.243731 | 0.761964 | 0.294066 | 0.411081 | 0.000000 | 1 |
| 2 | 0.251456 | 0.388683 | 0.114646 | 0.098440 | 0.149088 | 0.084182 | 0.881553 | 0.034496 | 0.000000 | 0 |
| 3 | 0.130438 | 0.042047 | 0.000000 | 0.000000 | 0.231034 | 0.084182 | 0.090865 | 0.024306 | 0.000000 | 0 |
| 4 | 0.625495 | 0.205758 | 0.434931 | 0.226597 | 0.149088 | 0.084182 | 0.390499 | 0.336594 | 0.191558 | 1 |
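As noted above, the target range is configurable via the feature_range argument. For example, to scale into [-1, 1] instead (not needed here, just illustrating the option):

scaler_pm1 = MinMaxScaler(feature_range=(-1, 1))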
Dataframes to Arrays
X = df.loc[:, df.columns != 'Class'].values
y = df.loc[:, 'Class'].values
print(X.shape, y.shape)
(39366, 9) (39366,)
Train-Validation Split
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
(31492, 9) (7874, 9) (31492,) (7874,)
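One optional refinement: passing stratify=y makes train_test_split preserve the benign/malignant ratio in both sets, which matters more when classes are imbalanced. A sketch:

# Same split as above, but keeping class proportions identical in train and validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y)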
Model
import tensorflow as tf
from tensorflow import keras
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
    # Hidden layer 1: 9 input features -> 50 units, with a small L2 penalty on the weights
    tf.keras.layers.Dense(units=50, input_shape=[9], kernel_regularizer=tf.keras.regularizers.l2(0.00001)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.2),
    # Hidden layer 2: 50 -> 50 units, same regularization
    tf.keras.layers.Dense(units=50, kernel_regularizer=tf.keras.regularizers.l2(0.00001)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.2),
    # Output layer: a single sigmoid unit giving the probability of malignant (class 1)
    tf.keras.layers.Dense(units=1),
    tf.keras.layers.Activation('sigmoid')
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 50) 500
_________________________________________________________________
activation (Activation) (None, 50) 0
_________________________________________________________________
dropout (Dropout) (None, 50) 0
_________________________________________________________________
dense_1 (Dense) (None, 50) 2550
_________________________________________________________________
activation_1 (Activation) (None, 50) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 50) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 51
_________________________________________________________________
activation_2 (Activation) (None, 1) 0
=================================================================
Total params: 3,101
Trainable params: 3,101
Non-trainable params: 0
_________________________________________________________________
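The parameter counts follow from inputs × units + units (biases) for each Dense layer: 9 × 50 + 50 = 500, 50 × 50 + 50 = 2550, and 50 × 1 + 1 = 51, for a total of 3,101.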
Training
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
tf_history_dp = model.fit(X_train, y_train, batch_size=50, epochs=100, verbose=True, validation_data=(X_val, y_val))
Train on 31492 samples, validate on 7874 samples
Epoch 1/100
31492/31492 [==============================] - 3s 104us/sample - loss: 0.1383 - acc: 0.9645 - val_loss: 0.0470 - val_acc: 0.9830
Epoch 2/100
31492/31492 [==============================] - 3s 101us/sample - loss: 0.0569 - acc: 0.9796 - val_loss: 0.0431 - val_acc: 0.9850
.
.
Epoch 99/100
31492/31492 [==============================] - 3s 97us/sample - loss: 0.0402 - acc: 0.9857 - val_loss: 0.0356 - val_acc: 0.9892
Epoch 100/100
31492/31492 [==============================] - 3s 92us/sample - loss: 0.0408 - acc: 0.9858 - val_loss: 0.0345 - val_acc: 0.9886
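Training for a fixed 100 epochs is a simple choice; alternatively, Keras's EarlyStopping callback stops training once validation loss stops improving (the patience value below is an arbitrary pick, not tuned):

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch validation loss
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True)   # keep the weights from the best epoch
# Pass callbacks=[early_stop] to model.fit() to enable it.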
import matplotlib.pyplot as plt
plt.figure(figsize=(20,7))
plt.subplot(1,2,1)
plt.plot(tf_history_dp.history['loss'], label='Training Loss')
plt.plot(tf_history_dp.history['val_loss'], label='Validation Loss')
plt.legend()
plt.subplot(1,2,2)
plt.plot(tf_history_dp.history['acc'], label='Training Accuracy')
plt.plot(tf_history_dp.history['val_acc'], label='Validation Accuracy')
plt.legend()
plt.show()
We were able to get an accuracy of about 98.8% on the validation set, but in medical diagnosis tasks like this, even a 0.1% improvement is significant. Try to improve the performance further.
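Accuracy alone can also hide where the model fails; in a diagnosis setting, false negatives (malignant cases predicted as benign) are especially costly. A sketch of per-class metrics on the validation set, thresholding the sigmoid output at an assumed cutoff of 0.5:

from sklearn.metrics import classification_report, confusion_matrix

# Threshold the sigmoid probabilities at 0.5 to get hard class labels.
y_pred = (model.predict(X_val) > 0.5).astype(int).ravel()

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=['benign', 'malignant']))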