Submitted in partial fulfilment
of the requirements of the degree of
of Noroff University College
Lewi Lie Uberg
Arendal, Norway
May 2022
I declare that the work presented for assessment in this submission is my own, that it
has not previously been presented for another assessment, and that work completed
by others has been appropriately acknowledged.
Name: Lewi Lie Uberg Date: May 24, 2022
An increasing incidence of deadly brain tumors in humans brings an increasing need for highly educated medical personnel, such as neurologists and radiologists, for diagnosis and treatment. Thus, to reduce the workload and the time from initial suspicion of disease to diagnosis and a suitable treatment plan, there is a need to implement a Computer-Aided-Disease-Diagnosis (CADD) system for brain tumor classification. By studying the types of tumors involved, how the convolutional neural network functions, the evolution of its pre-defined architectures, models using pre-trained weights, and their application in brain tumor classification, the likelihood of producing a promising CADD system increases considerably. The outcome of the research conducted in this project presents the starting point of an open-source project to further develop a CADD system for brain tumor classification with reliable results. The project includes all components of a working CADD system: the data preprocessing pipeline, the pipeline for defining and training CNN classification models, and a user interface in the form of an API as the backend and a website as the frontend. The project is intended to be open to the general public; however, its primary focus is on facilitating medical imaging researchers, medical students, radiologic technologists, and radiologists.
Keywords: Brain Tumor Classification, Magnetic Resonance Imaging, Convolutional Neural Networks, Machine Learning, Deep Learning.
I want to thank my friends over at for their great articles and general counsel on anything related to programming. I am also grateful to the people on Stack Overflow who share their knowledge and feedback, to Khan Academy for its straightforward explanations of math, to the content creators of YouTube, and to the open-source community, who openly share their code on GitHub.
Thanks to my classmate, friend, and sparring partner Zeljka Matic for all our discus-
sions and her help in previous demanding classes.
I am grateful for the general counsel I received from Maxine Brandal Vågnes on how to structure this report.
I thank my supervisor, Professor Seifedine Kadry, for the guidance, help, and feed-
back he has given me.
I am grateful for the manually labeled dataset provided by Dr. Saed Khawaldeh.
I want to thank my mother for all the nights she has helped out looking after the kids
when my wife is working, and an assignment is due.
Finally, I want to thank my wife and three sons for putting up with a husband and dad who studies while working more than full-time. The time-debt I now owe them will be paid back with interest after this final submission.
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Scope and Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Review 4
2.1 Brain Tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Evolution of CNN Architectures . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Brain Tumor Classification . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Data, Design and Implementation 17
3.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Model Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Architecture Selection . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.3 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 Model Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Results 31
4.1 Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Conclusion 42
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Summary of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 Research Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A Short Paper 49
B Source Code Repository 56
List of Figures
3.1 Custom CNN Flowchart with The AlexNet Architecture as Comparison. 24
3.2 Custom CNN Architecture. . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Training History of The MobileNet Architecture. . . . . . . . . . . . . . 34
4.2 Training History of The AlexNet Architecture. . . . . . . . . . . . . . . 34
4.3 Running The API using Uvicorn . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Expose Local Server To The Internet . . . . . . . . . . . . . . . . . . . 36
4.5 Interactive Documentation . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Running cURL Commands From Terminal . . . . . . . . . . . . . . . . 38
4.7 Home Page of the User Interface . . . . . . . . . . . . . . . . . . . . . 39
4.8 Normally Classified MRI Image . . . . . . . . . . . . . . . . . . . . . . 40
4.9 LGG Classified MRI Image . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Tables
3.1 Dataset Sample Distribution . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Manually Labeled Dataset Sample Distribution . . . . . . . . . . . . . 21
4.1 Concatenation of The Best Training Exploration Results . . . . . . . . 32
4.2 CNN Architecture Results . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.1 Problem Statement
Glioblastoma is a very aggressive form of cancer and the most common form of
glioma. Occurrences of glioblastoma increase consistently, and with a five-year mor-
tality rate of 97% (Korja et al. 2018), it is considered the most deadly form of cancer.
Glioblastoma originates from glial cells in the brain, a cell type that acts as support
cells for the central nervous system. There is no specific blood test (tumor marker)
that can reveal the presence of a glioblastoma; therefore, imaging examinations of
the brain like Magnetic Resonance Imaging (MRI) or computerized tomography (CT)
are central to the discovery of glioblastoma. After imaging examinations of the brain are performed, a diagnosis must be made, followed by a treatment plan. Highly educated specialists, such as radiologists, are needed to make a diagnosis. However, the availability of suitable personnel depends on the patient's location, and waiting lists can be significant. Implementing a Computer-Aided-Disease-Diagnosis (CADD) system for brain tumor classification can reduce the workload of radiologists and reduce the time from suspicion of disease to diagnosis and suitable treatment.
1.2 Research Objectives
The research project’s objective is to acquire knowledge about all the needed com-
ponents to develop and implement a working Computer-Aided-Disease-Diagnosis
(CADD) system for brain tumor classification.
- Research brain tumors themselves, as well as the conventions of their classification.
- Develop a good understanding of the needed data and how it is collected and processed.
- Research the convolutional neural network (CNN) to understand its structure and functionality, and thereby how to manipulate it.
- Research different pre-defined CNN architectures, their evolution, and how they are implemented.
- Research transfer learning to take advantage of previously gained knowledge, and whether it can be beneficial for classifying MRI images of brain tumors.
- Research examples of related work to provide ideas and comparisons.
- Research ways to make the project outcome usable to the general public.
- Make use of the research by implementing and deploying a fully working CADD system for brain tumor classification that is easy to maintain and build upon.
1.3 Scope and Limits
The research aims to explore the feasibility of a CADD system for brain tumor clas-
sification based on deep learning techniques such as CNNs in order to reduce the
workload of radiologists and reduce the time from suspicion of disease to diagnosis and suitable treatment. In addition, the CADD system must be easy to maintain and build upon, as well as usable by individuals with limited technical knowledge.
The research is limited to data comprised of labeled samples of non-tumorous, LGG, and HGG MRI images. It is also limited to the data available to the general public, since applying for research funding, and the administration needed to follow up a collaboration with radiologists, would be too time-consuming for this project.
1.4 Document Structure
The document is sectioned into chapters representing the significant sections of the
research and implementation process. Chapter 2 covers the research in the form of a
literature review. Chapter 3 covers the data collection and preprocessing, as well as
the system design and implementation. Chapter 4 covers the results of the research
and implementation of the system. Finally, Chapter 5 covers a closing summary of
the project.
All the source code used in this project is provided in appendix B.
Literature Review
2.1 Brain Tumors
Central nervous system tumors are defined, and their behavior predicted, by the cell of origin and the features found when examining the cell tissue: the histopathological characteristics (Louis et al. 2007). For example, cerebral gliomas are neuroepithelial tumors originating from the supporting glial cells of the central nervous system (Forst et al. 2014).
After meningiomas, a usually benign tumor originating from the meningeal tissue of the brain, gliomas are the most common primary brain tumor in adults overall, with a rate of 5 to 6 persons per 100,000 annually (Hu et al. 2020).
The World Health Organization (WHO) tissue classification system categorizes gliomas from lowest to highest, with grade I being the lowest grade and grade IV the highest. Low-grade gliomas (LGG) consist of grade I and grade II tumors (Forst et al. 2014), while high-grade gliomas (HGG) consist of grade III and grade IV tumors (Hu et al. 2020). Grade I tumors are the least malignant or benign, including Pilocytic Astrocytoma, Craniopharyngioma, Gangliocytoma, and Ganglioglioma. Grade II tumors are relatively slow-growing but may recur as a higher grade, including Astrocytoma, Pineocytoma, and Pure Oligodendroglioma. Grade III tumors are malignant and tend to recur as a higher grade, including Anaplastic Astrocytoma, Anaplastic Ependymoma, and Anaplastic Oligodendroglioma. Finally, grade IV tumors are the most malignant, aggressive, and prone to necrosis and recurrence, including the tumor types Glioblastoma Multiforme (GBM), Pineoblastoma, Medulloblastoma, and Ependymoblastoma (Louis et al. 2007, p. 107).
Often occurring in young, otherwise healthy patients, LGGs are a diverse group of
primary brain tumors. Generally, they have a relatively good prognosis and prolonged
survival rate (Forst et al. 2014). However, over 75% of gliomas are HGG, GBM being
the most common and aggressive, accounting for 56.1% of all gliomas. In addition,
HGGs, particularly GBM, can exhibit a distinct tumor cell population that confounds
clinical diagnosis and management. As a result, GBM has a grim prognosis, with
a median survival of 15 months, despite the best available treatments. Having a
relatively high occurrence frequency and being difficult to diagnose and treat has
made HGGs, and GBM, in particular, the subject of tremendous interest in neuro-
oncologic research (Hu et al. 2020).
Histopathologic examination, to study tissue morphology and to diagnose and grade brain tumors, is the gold standard (Forst et al. 2014). However, surgical resection for diagnosing a brain tumor is an invasive and risky method. Nevertheless, several non-invasive diagnostic methods exist, like neuroimaging. Neuroimaging techniques widely used by the medical community include Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and Plain-Film Radiography (PFR) (Luo et al. 2018).
Conventional MRI is the current imaging procedure of choice and identifies tumor
size and associated Peritumoral Edema (PTE), one of the main features of malignant
glioma (Wu et al. 2015), with or without contrast enhancement. Nevertheless, char-
acteristic MRI findings cannot determine the tumor grade alone (Forst et al. 2014).
Moreover, MRIs of HGG, GBM in particular, also lack the capability to resolve intratumoral heterogeneity that may be present. Nevertheless, more advanced imaging
procedures like PET offer a range of physiologic and biophysical image features that
can improve the accuracy of imaging diagnoses (Hu et al. 2020).
Studies comparing imaging features with tissue benchmarks have employed either
qualitative descriptions or have compared individual quantitative metrics in a uni-
variate fashion, which has produced solid correlations for certain clinical scenarios.
However, for other scenarios, the correlation of imaging and histopathological char-
acteristics may not be self-evident by visual inspection or sufficiently represented by
simple statistical features. Thus, in pursuit of brain tumor diagnosis without surgical
intervention, researchers have developed more advanced imaging methods such as
texture analysis, mechanic modeling, and machine learning (ML) that form a pre-
dictive multi-parametric image-based model. The application of ML models is an
emerging field in radiogenomics and represents a data-driven approach to identifying
meaningful patterns and correlations from often complex data sources. An ML model is trained by feeding the ML algorithm a substantial amount of pre-classified image data, such as MRI images, to learn which patterns belong to which classes. The resulting ML model uses these previously learned patterns to predict the appropriate class for a new instance. These ML models have enabled researchers to predict tumor cell density and, together with texture analysis, recognize regional and genetically distinct subpopulations coexisting within a single GBM tumor (Hu et al. 2020).
2.2 Convolutional Neural Network
The convolutional neural network (CNN) is a concept introduced by Fukushima (1980) as the Neocognitron, a model of the brain's visual cortex and an improvement of Fukushima's (1975) earlier model for visual pattern recognition. Furthermore, Lecun et al. (1998) significantly improved the Neocognitron into one of the most successful pattern recognition models, which has significantly impacted the field of computer vision.
The most common use case for CNNs is pattern detection in images. One or more hidden convolutional layers use filters to convolve, or scan, over an input matrix, such as a binary image. These filters closely resemble neurons in a dense layer of an artificial neural network. Here, a filter learns to detect a specific pattern, such as an edge or circle; adding more filters to a convolutional layer enables more features to be learned. The filter size is the size of the matrix convolving over the image matrix, and the stride is the number of pixel shifts over the input matrix. For each stride taken, the convolutional layer multiplies the filter element-wise with the image region it covers and sums the result. The resulting output is called a feature map. Two options are available when the filter does not fit the image. Either the part of the image matrix that does not fit the filter is dropped, which is called valid padding, or zeros are added to the edges of the image matrix so that the filter matrix fits the image matrix entirely, which is called zero-padding.
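As a rough illustration, the convolution and padding mechanics above can be sketched in a few lines of NumPy (the function and filter here are illustrative, not the project's code):

```python
import numpy as np

def conv2d(image, kernel, padding="valid"):
    """Convolve a 2-D image with a 2-D filter, stride 1."""
    k = kernel.shape[0]
    if padding == "zero":          # pad edges with zeros so the filter always fits
        image = np.pad(image, k // 2)
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the filter with the covered region, then sum
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.random.rand(8, 8)
edge_filter = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical-edge detector

print(conv2d(image, edge_filter, "valid").shape)  # (6, 6): border dropped
print(conv2d(image, edge_filter, "zero").shape)   # (8, 8): size preserved
```

With valid padding the 8x8 input shrinks to a 6x6 feature map (8 - 3 + 1 = 6), while zero-padding keeps the output at 8x8.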
In Fukushima's (1980) original paper, tanh is used as the activation function. However, after the Rectified Linear Unit (ReLU) was introduced as the activation function by Krizhevsky, Sutskever, and Hinton (2012) with AlexNet, it has become the most common activation function for a convolutional layer. ReLU is an almost linear function with a low computational cost. ReLU converges fast by transforming the input to the maximum of zero or the input value, meaning that the positive linear slope does not saturate or plateau when the input becomes large. Also, ReLU does not have a vanishing gradient problem like sigmoid or tanh. Hidden dense, alternatively called fully-connected, layers also tend to use ReLU as their activation function.
Spatial pooling, a form of subsampling or downsampling, can be applied to reduce the number of tunable parameters, that is, the dimensionality of a feature map. The most commonly used type of spatial pooling is Max-pooling. By using filters and a stride, the Max-pooling layer operates much like a convolutional layer. However, the Max-pooling layer takes the maximum value in its filter matrix as the output value. So, for example, a Max-pooling layer with an input matrix of 8x8 and a filter size of 2x2 would have an output of 4x4 containing the largest value for each region. Max-pooling's downsampling decreases the computational cost for the following layers in the network, while also concluding the feature extraction part of the CNN and initiating the feature learning part.
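The 8x8 example above can be sketched in NumPy (illustrative code, not from the project):

```python
import numpy as np

def max_pool(x, size=2):
    """Max-pooling with a size x size filter and a matching stride."""
    h, w = x.shape
    # split the matrix into non-overlapping size x size regions,
    # then keep only the largest value in each region
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.arange(64).reshape(8, 8)   # a stand-in 8x8 feature map
pooled = max_pool(fmap)

print(pooled.shape)   # (4, 4)
print(pooled[0, 0])   # 9: the maximum of the top-left 2x2 region {0, 1, 8, 9}
```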
Both convolutional and Max-pooling layers output matrices, while fully-connected layers only accept vectors. Adding a flattening layer reduces the last convolutional layer's output to the shape (-1, 1); that is, it transforms the matrix directly into a vector. This operation is a computationally cheap way of letting the network learn non-linear combinations of the higher-level feature representations output by the convolutional or Max-pooling layers.
The final layer, the output layer, is a dense or fully-connected layer with the same number of neurons as there are classes to classify. The activation function for the final layer depends strongly on the loss function. For example, a single sigmoid-activated neuron as the output, compiled with binary cross-entropy as the loss function, yields a result equivalent to two softmax-activated neurons in a network using categorical cross-entropy as the loss function; in other words, a binary classification.
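The claimed equivalence can be checked numerically. The following NumPy sketch (illustrative, not from the project) shows that a sigmoid applied to the difference of two logits equals the softmax probability of the second class:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

# Two-logit output head: class scores z0 and z1 (arbitrary values).
z0, z1 = 0.7, 2.1

# A single sigmoid neuron fed the score difference gives the same
# positive-class probability as the softmax over both logits.
p_sigmoid = sigmoid(z1 - z0)
p_softmax = softmax(np.array([z0, z1]))[1]

print(abs(p_sigmoid - p_softmax) < 1e-12)  # True: the two heads agree
```

This holds because softmax([z0, z1])[1] = 1 / (1 + exp(-(z1 - z0))), which is exactly the sigmoid of the logit difference.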
2.3 Evolution of CNN Architectures
LeNet-5's architecture of stacking convolutional layers, activation functions, and pooling layers, concluding with fully-connected layer(s), has become the common starting point when designing a CNN. With its two convolutional and three dense layers, LeNet-5 is one of the most straightforward CNN architectures. It was initially trained on 60,000 patterns and later on an additional 540,000 artificially generated patterns, produced by randomly distorting the original dataset, to support the authors' hypothesis that there was a strong correlation between training-set size and the train and test errors. While a breakthrough when introduced, the architecture is relatively shallow; therefore, it does not generalize or perform well with color images (Lecun et al. 1998).
AlexNet was the first to use ReLU as the activation function and the then recently developed dropout method to reduce overfitting in fully-connected layers. Built upon LeNet-5, AlexNet comprises five convolutional layers and three fully-connected layers, with 60 million parameters and 650,000 neurons (Krizhevsky, Sutskever, and Hinton 2012). With its increased size and new activation function, the architecture performs well with color images. However, compared to its successors, it struggles to learn a dataset's features due to its limited depth. Tandel et al. (2020) proposed using transfer learning on the AlexNet model for multiclass MRI brain tumor classification, with positive results.
The motivation for developing the VGG family of architectures was to improve on AlexNet's performance, which was done by significantly increasing the depth of the network. Several different configurations of the architecture were developed, for example, VGG-16. While the architecture significantly increases accuracy and speed, it also suffers from the vanishing gradient problem. VGG-16 comprises thirteen ReLU-activated convolutional layers, two ReLU-activated fully-connected layers, and finally one softmax-activated fully-connected layer, and it has 138 million parameters (Simonyan and Zisserman 2015). Belaid and Loudini (2020) proposed combining several CNNs based on pre-trained VGG-16s for MRI classification of three tumor types: meningioma, glioma, and pituitary tumor. VGG-16 was also the best performing of the three pre-trained networks Sevli (2021) used for a performance comparison on brain tumor classification.
Inception-v1 uses small neural networks within the main neural network, called inception modules (Lin, Chen, and Yan 2014). The inception modules use parallel towers of convolutions and filters, combining the outputs of the small networks instead of the linear approach seen in the previously mentioned architectures. In addition, auxiliary networks are added to the main network to increase discrimination and provide additional regularization; the output of these auxiliary networks is discarded. Inception-v1 has twenty-two layers in total. Nevertheless, the number of parameters is significantly reduced compared with the previously mentioned architectures, down to 5 million. However, the Inception architecture can be biased towards certain classes in an unbalanced dataset, and is prone to overfitting on smaller datasets (Szegedy, Liu, et al. 2015). While Irmak (2021) proposes a custom CNN model for brain tumor classification, the Inception-v1 model is used as one of the five models for comparison; however, it achieves the poorest accuracy results.
Built upon Inception-v1, Inception-v3 uses factorization of the convolutions. In addition, it adds batch normalization to the auxiliary layers in the auxiliary network, thereby stabilizing training and significantly reducing the number of epochs required to train the network. However, with its 48 layers, Inception-v3 significantly increases the number of parameters from its predecessor, now having 24 million. The large number of parameters does, however, make the network more prone to overfitting and adds computational cost (Szegedy, Vanhoucke, et al. 2016). Inception-v3 was also the least performing of the three pre-trained networks Sevli (2021) used for a performance comparison on brain tumor classification.
ResNet-50 was one of the first architectures to implement batch normalization, and it addresses the problem of saturated and rapidly degrading accuracy. The architecture is a quite deep neural network that allows gradients to flow from layer to layer using bridges or shortcuts called skip connections, thereby solving the problem of vanishing gradients. However, many layers need to be added to improve accuracy, thereby increasing the computational cost. With 50 layers, comprising 48 convolutional layers, a Max-pooling layer, and an average pooling layer, ResNet-50 has 26 million parameters (He et al. 2016). ResNet-50 was the second-best performing of the three pre-trained networks Sevli (2021) used for a performance comparison on brain tumor classification.
By entirely replacing the inception modules of Inception-v3 with depthwise separable convolutions, Xception separates the spatial dimensions of the image and kernel from the depth dimension, i.e., the number of channels, of each processed image. While the Xception architecture offers good memory usage and computational speed, it comes at the cost of accuracy. Xception is 71 layers deep and has 23 million parameters (Chollet 2017).
Inception-v4 modified the initial layers before the first inception module, also called the stem. Additionally, it adds more inception modules and uses the same filters for every inception module, increasing model size. However, the researchers found that with a filter number exceeding 1000, residual variants became unstable, and the network suddenly "died" early in training. Inception-v4 is 22 layers deep with 43 million parameters (Szegedy, Ioffe, et al. 2017).
Inception-ResNet-V2 was introduced in the same paper as Inception-v4. Inception-
ResNet-V2 adds more previously seen inception modules as well as some modified
inception modules. It also adds residual inception blocks, such that the output of
a layer is added to another layer deeper in the network. Residual inception blocks
allow information to flow from one layer to another without any gates in their skip
connection. However, increasing the size of the network also increases the required computational resources and the number of parameters, making it prone to overfitting on small datasets. Inception-ResNet-V2 is 164 layers deep and has 56 million parameters (Szegedy, Ioffe, et al. 2017).
ResNeXt-50 builds upon ResNet and the Inception family by adding parallel towers within the modules as a new dimension called cardinality. Increasing cardinality is a more efficient way of increasing accuracy than going deeper or wider, since the latter two give diminishing returns when expanded far. However, adapting ResNeXt-50 to a new dataset type is a significant task due to its many hyperparameters and computations. ResNeXt-50 is 50 layers deep, and the number of parameters is not given (Xie et al. 2017).
MobileNet is TensorFlow's first architecture designed for mobile applications. It is a simple but efficient general-purpose CNN architecture often used in object detection and fine-grained image classification. The MobileNet architecture uses depthwise separable convolutions, a combination of depth-wise and point-wise convolutions. Depth-wise convolutions apply a single filter to each input channel, as opposed to standard convolutions, which apply each filter across all the input channels. Since the depth-wise convolutions do not combine the filter outputs to produce a new feature, an additional layer called a point-wise convolution is added; it computes a linear combination of all the depth-wise convolution outputs to produce a new feature. The MobileNet architecture is designed to be as efficient as possible while still being easy to train (Howard et al. 2017).
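As a back-of-the-envelope illustration of why depthwise separable convolutions are cheap, the following sketch compares weight counts for a standard and a separable convolutional layer (biases ignored; the layer sizes are invented for illustration):

```python
def standard_conv_params(k, c_in, c_out):
    # every one of the c_out filters spans all c_in input channels
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1x1 convolution combining the channels
    return depthwise + pointwise

k, c_in, c_out = 3, 32, 64
print(standard_conv_params(k, c_in, c_out))   # 18432 weights
print(separable_conv_params(k, c_in, c_out))  # 2336 weights, roughly 8x fewer
```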
DenseNet is a CNN architecture where all layers with matching feature-map sizes are directly connected. The feed-forward nature is preserved by each layer receiving additional inputs from all preceding layers and passing its own feature maps on to succeeding layers. The DenseNet architecture alleviates the problem of vanishing gradients, increases feature propagation and feature reuse, and considerably reduces the number of parameters. On the other hand, memory usage increases as the inputs from previous layers are concatenated. DenseNet-121 is 121 layers deep with 8 million parameters (Huang et al. 2019).
2.4 Brain Tumor Classification
Kang, Ullah, and Gwak (2021) proposed a fully automatic hybrid solution for brain
tumor classification comprised of several steps. First, pre-process the brain MRI im-
ages by cropping, resizing, and augmenting. Second, use pre-trained CNN models
for feature extraction with better generalization. Third, select the top three perform-
ing features using fined-tuned ML classifiers and concatenate these features. Finally,
use the concatenated feature as input for the ML classifiers to predict the final output
for the brain tumor MRI.
The researchers selected three different publicly available brain tumor MRI datasets for experimentation and established a three-part naming convention: the type, the size, and the number of classes; i.e., a medium brain tumor dataset with three classes is named "BT-medium-3c". The first dataset, BT-small-2c, comprises 253 images: 155 classified as containing tumors and 98 classified as without tumors. The second dataset, BT-large-2c, comprises 3000 images: 1500 containing tumors and 1500 without. The third and final dataset, BT-large-4c, comprises 3064 images in four classes: non-tumorous, glioma tumor, meningioma tumor, and pituitary tumor. All the datasets follow the standard convention of subdividing into 80% for training and 20% for testing.
Most of the images in the datasets contain undesired spaces and areas. However,
cropping the image only to contain the relevant area for analysis can lead to better
classification performance. In addition, if a dataset is imbalanced or small, augmen-
tation may boost the learning capabilities. Augmentation creates multiple modified copies of the images, for example by mirroring, rotating, or adjusting the image brightness. In addition to dataset augmentation, the images are resized to fit the pre-trained CNNs' expected input dimensions: 224x224 px, except Inception V3, which expects 299x299 px.
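The augmentation operations described above can be sketched with plain NumPy (an illustrative toy, not the paper's preprocessing pipeline; real resizing and rotation by arbitrary angles would use an image library):

```python
import numpy as np

def augment(image, rng):
    """Return simple modified copies of one grayscale image in [0, 1]."""
    return [
        np.fliplr(image),                                   # mirror horizontally
        np.rot90(image),                                    # rotate 90 degrees
        np.clip(image * rng.uniform(0.8, 1.2), 0.0, 1.0),   # adjust brightness
    ]

rng = np.random.default_rng(0)
mri = rng.random((224, 224))      # stand-in for a preprocessed MRI slice
copies = augment(mri, rng)

print(len(copies))                                 # 3 augmented variants
print(all(c.shape == mri.shape for c in copies))   # True: dimensions preserved
```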
The proposed scheme uses a novel feature evaluation and selection mechanism,
an ensemble of 13 pre-trained CNNs, to extract robust and discriminative features
from the brain MRI images without human supervision. The CNN ensemble is comprised of ResNet-50, ResNet-101, DenseNet-121, DenseNet-169, VGG-16, VGG-19, AlexNet, Inception V3, ResNeXt-50, ResNeXt-101, ShuffleNet, MobileNet, and MnasNet. Since the researchers use fairly small datasets for training, they take a transfer learning-based approach, using fixed weights on the bottleneck layers of each CNN model pre-trained on the ImageNet dataset.
Using the features extracted from the CNN models, a synthetic feature is formed by
evaluating each feature from the CNN ensemble with an ensemble of nine different
ML classifiers and concatenating the top three features from the different CNNs.
Since different CNN architectures capture different aspects of the processed data,
the synthetic feature represents a more discriminative feature than features extracted
from a single CNN.
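A toy sketch of the top-three selection and concatenation step (the scores and feature sizes below are invented for illustration, not the paper's results):

```python
import numpy as np

# Hypothetical evaluation results: each extractor's feature vector and the
# best accuracy any ML classifier achieved on that feature.
features = {
    "ResNet-50":    (np.ones(2048) * 0.1, 0.93),
    "DenseNet-121": (np.ones(1024) * 0.2, 0.95),
    "VGG-16":       (np.ones(4096) * 0.3, 0.90),
    "MobileNet":    (np.ones(1024) * 0.4, 0.94),
}

# Keep the three extractors whose features scored highest...
top3 = sorted(features, key=lambda name: features[name][1], reverse=True)[:3]
# ...and concatenate their feature vectors into one synthetic feature.
synthetic = np.concatenate([features[name][0] for name in top3])

print(top3)             # ['DenseNet-121', 'MobileNet', 'ResNet-50']
print(synthetic.shape)  # (4096,): 1024 + 1024 + 2048
```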
The ML classifier ensemble, implemented using the scikit-learn library, is comprised of a fully-connected (FC) neural network (NN) layer, Gaussian Naïve Bayes (Gaussian NB), Adaptive Boosting (AdaBoost), K-Nearest Neighbors (k-NN), Random Forest (RF), Extreme Learning Machine (ELM), and Support Vector Machines (SVM) with three different kernels: linear, sigmoid, and radial basis function (RBF).
The first classifier uses the conventional CNN approach: a softmax-activated FC layer with a cross-entropy loss function, the most commonly used loss function for neural networks. This first classifier, with an initial learning rate of 0.001, uses Adaptive Moment Estimation (Adam) optimization of the layer weights and adaptively recalculates the learning rate, collecting the highest average accuracy per run for a total of 100 epochs.
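A minimal NumPy sketch of the softmax cross-entropy loss this classifier minimizes (the class labels and logit values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(logits, true_class):
    """Negative log-probability the softmax head assigns to the true class."""
    return -np.log(softmax(logits)[true_class])

# Three tumor classes, e.g. normal / LGG / HGG (labels are illustrative).
logits = np.array([0.0, 0.0, 0.0])             # an untrained, indifferent head
print(cross_entropy(logits, true_class=1))     # ln(3) ~ 1.0986

confident = np.array([0.1, 5.0, 0.2])          # head strongly favoring class 1
print(cross_entropy(confident, true_class=1))  # ~ 0.016: low loss when correct
```

Adam then adjusts the layer weights in the direction that lowers this loss over the training batches.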
The researchers also use the Gaussian variant of Naïve Bayes, which follows the Gaussian (normal) distribution and assumes no covariance between the attributes.
The next classifier, Adaptive Boosting (AdaBoost for short), is an ensemble learning
algorithm that combines multiple weak classifiers (decision trees with a single split,
called stumps) to improve performance. AdaBoost works iteratively, assigning
higher weights to the mislabeled instances in each round.
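A minimal sketch with scikit-learn, whose AdaBoostClassifier uses a depth-1 decision tree (a stump) as its default base learner; the dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Each boosting round reweights the training set so that previously
# misclassified samples receive more attention in the next stump.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(len(ada.estimators_))  # number of fitted stumps
```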
The following classifier is one of the simplest, k-Nearest Neighbors (k-NN). k-NN
does not train a model but calculates predictions directly from the data stored in
memory. Using Euclidean distance as the distance metric, the k-NN classifier finds
the k training instances closest to the given feature and assigns the most common
class label among those neighbors: the majority vote. The researchers set the
number of neighbors from 1 to 4 and selected the value with the highest accuracy.
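The selection over k = 1..4 can be sketched with scikit-learn's KNeighborsClassifier (whose default metric is the Euclidean distance); the data and the use of cross-validation for scoring are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Score each candidate k and keep the one with the highest accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y).mean()
          for k in range(1, 5)}
best_k = max(scores, key=scores.get)
print(best_k)
```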
Random Forest (RF) is a learning algorithm that creates multiple decision trees using
the bootstrap aggregation (bagging) method to classify features into a class, using
the Gini index as the cost function while growing the trees. RF selects n random
attributes or features to find the optimal split point, which reduces the correlation
among the trees and lowers the ensemble error rate. RF predicts by feeding
features into all the classification trees, counting the number of predictions for each
class, and choosing the class with the largest number of votes as the correct class
for the given feature. The researchers set the number of features considered at
each split to the square root of the total number of features, varied the number of
decision trees from 1 to 150, and selected the count with the highest accuracy.
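A condensed sketch of this sweep with scikit-learn (the data, the reduced set of tree counts, and cross-validated scoring are illustrative assumptions; the paper tries every count from 1 to 150):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=36, random_state=0)

# max_features="sqrt" considers sqrt(36) = 6 candidate features per
# split, decorrelating the trees; Gini impurity is the default criterion.
best_n, best_acc = 0, 0.0
for n_trees in (10, 50, 100, 150):
    acc = cross_val_score(
        RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                               random_state=0), X, y).mean()
    if acc > best_acc:
        best_n, best_acc = n_trees, acc
print(best_n)
```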
Extreme Learning Machine (ELM) is a learning algorithm for Single-Layer Feed-
Forward Neural Networks (SLFN) that provides good performance at a fast learning
speed. Unlike the iterative, gradient-based back-propagation algorithm used in
traditional SLFNs, ELM assigns the hidden-layer weights randomly and solves for
the output weights analytically, tuning them only once. The researchers tested 5000,
6000, 7000, 8000, 9000, and 10,000 hidden nodes and selected the configuration
with the highest accuracy.
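The one-shot training step can be sketched in a few lines of NumPy (a toy problem with a small hidden layer, not the paper's configuration): the input weights stay random and fixed, and the output weights come from a single least-squares solve via the Moore-Penrose pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 10 features, 2 classes (one-hot targets).
X = rng.standard_normal((200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
T = np.eye(2)[y]

n_hidden = 50  # the paper sweeps 5000 to 10,000 hidden nodes

# Random, fixed input weights; only the output weights are solved,
# once, with no gradient descent and no iterations.
W = rng.standard_normal((10, n_hidden))
b = rng.standard_normal(n_hidden)
H = np.tanh(X @ W + b)          # hidden-layer activations
beta = np.linalg.pinv(H) @ T    # least-squares output weights
pred = (H @ beta).argmax(axis=1)
print(round((pred == y).mean(), 2))
```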
The Support Vector Machine (SVM) uses a kernel function to transform the original
data space, defined by the number of features, into a higher-dimensional space, and
then aims to find a hyperplane in that space that distinctly classifies the given
features. The researchers use the three most common kernel functions: linear,
sigmoid, and radial basis function (RBF). In addition, the SVM has two hyper-
parameters. First, C, the soft-margin cost parameter that controls each support
vector's influence, set to 0.1, 1, 10, 100, 1000, and 10000. Second, gamma, which
decides the curvature of the decision boundaries, set to 0.00001, 0.0001, 0.001,
and 0.01. The hyper-parameter combination that yields the highest accuracy is
then selected.
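This exhaustive search maps directly onto scikit-learn's GridSearchCV over SVC; the grid below uses the paper's kernels and hyper-parameter values, while the data and fold count are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# C controls the soft-margin penalty; gamma controls the curvature of
# the decision boundary (gamma is ignored by the linear kernel).
param_grid = {
    "kernel": ["linear", "sigmoid", "rbf"],
    "C": [0.1, 1, 10, 100, 1000, 10000],
    "gamma": [0.00001, 0.0001, 0.001, 0.01],
}
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(search.best_params_["kernel"])
```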
Experimentation on the given datasets has two main tasks: first, compare the several
pre-trained CNN networks combined with the several ML classifiers; second, show
the effectiveness of concatenating the top two or three features with the best results
from the first experiment.
For example, the top three features on the BT-small-2c dataset are the DenseNet-
169, Inception V3, and ResNeXt-50 features. Then on the BT-large-2c dataset, the
DenseNet-121, ResNeXt-101, and MnasNet features are the top three. While on the
BT-large-4c dataset, the DenseNet-169, MnasNet, and ShuffleNet V2 features are
the top three.
Observations from the second experiment show that SVM with RBF kernel can find
a more effective and complex set of decision boundaries, outperforming the other
ML classifiers on the two most extensive datasets. However, this is not the case for
the smallest dataset, since SVM with RBF tends to underperform when the number of
training samples is smaller than the number of features per data point. Furthermore,
it is almost impossible that features extracted from the pre-trained CNNs are
entirely independent. Therefore, since Gaussian NB assumes that the features are
independent, it performs worst among the ML classifiers on all three datasets. On
the other hand, features extracted from the DenseNet architectures predict well on all
three datasets since they contain features of all complexity levels, giving smoother
decision boundaries that tend to predict especially well with insufficient training data.
In contrast, features from VGG, with its more basic architecture and no residual
blocks, yield the worst results. The effectiveness of the concatenated top 2 or 3
features is evident
for all ML classifiers on the two largest datasets. However, on the small dataset, it is
only shown when using AdaBoost and k-NN.
The FC, RF, and Gaussian NB classifiers have the shortest inference time. In com-
parison, k-NN has the longest since it is the only ML classifier that needs to evaluate
every data point during prediction. While the results from these experiments show
promise on large datasets, further research is needed, especially on model reduction
for real-time medical systems deployment.
2.5 User Interface
The initial technical aspect of the project will be implemented using the Python
programming language. Therefore, the user interface will be developed using
Python-based technologies. Such technologies evolve rapidly; therefore, the best
information source is often the official documentation of the given technology. Such
user interfaces are usually implemented as desktop applications but can also be
implemented as web applications. A web browser is the best option since there is
no need for the user to install any software or worry about platform compatibility.
The first component needed to be developed for a user interface is an API that will
be used to communicate with either a terminal or a web browser. Python has many
applicable frameworks for this purpose, such as Django, Flask, or FastAPI (2022).
Django is the most popular framework, but it can be a bit large and complicated to
use. Flask is a good option for this purpose. However, it does not have the same
amount of modern features as FastAPI.
Being built upon other frameworks, such as Starlette (2021), Pydantic (2022),
OpenAPI (2021), and JSON Schema (2020), FastAPI fully supports asynchronous
programming, type validation during runtime, and autogeneration of interactive doc-
umentation. FastAPI is used to build RESTful APIs, the most common API standard
for web applications.
Since FastAPI is built to work with both Gunicorn and Uvicorn (2022), it operates at
high speeds, and it is also easy to deploy the API to the webserver. Furthermore,
FastAPI supports Jinja2 (2022) templating out of the box, making it much easier for
developers who primarily use Python to build a web application, since knowledge of
JavaScript is no longer needed for the most basic functionality.
Data, Design and Implementation
3.1 Data Collection
One could assume that with the whole internet as the arena, the likelihood of finding
a relevant dataset for something as familiar as brain tumor MRIs is high. However,
when the need for a dataset with specific criteria is high, the availability seems to de-
crease exponentially. Furthermore, the most readily available datasets were different
variations of the same source, usually lacking quality. Therefore, the data-gathering
part of the research project starts with outlining some criteria that the data needs
to meet to ensure quality in the later steps of the project. The data must contain
three types of samples, non-tumorous, LGG, and HGG. It is also essential that the
amount of samples between the types is well balanced to make the classification
model generalize well. The balance is essential since too much data augmentation
on medical images can disrupt the features of interest, thereby resulting in incorrect
classifications.
With optimism, the search began on familiar places like Kaggle, Google Dataset
Search, GitHub, and paperswithcode. While some of the found sources looked
promising at first, none of them quite fit the bill. They were either too small, had
some but not all of the needed labels, or were very unbalanced between the labels.
However, the most significant problem is that most of them did not have the non-
tumorous label in the samples. This label is essential since the objective is to have
a classification system that can distinguish between non-tumorous, LGG, and HGG
MRIs. A combination of different datasets was considered for a period, but the idea
of not having a battle-tested dataset as the foundation did not sit well. With a good
foundation, each class can be supplemented with smaller batches of high-quality
data over time. So the search continued.
Finally, a well-suited candidate for this project was found at The Cancer Imaging
Archive (TCIA). The REMBRANDT (REpository for Molecular BRAin Neoplasia DaTa)
Dataset (Scarpace et al. 2019) seemed to have all the essential characteristics re-
quired by the project. In addition, the REMBRANDT dataset is one of the most
trusted publicly available datasets. The dataset comprises MRI scans from 130
subjects across three classes: non-tumorous, LGG, and HGG. Furthermore, the LGG
and HGG classes have subclasses that open the opportunity to make the classifi-
cation outcome more extensive in the future. For example, the LGG class includes
tumors of type Astrocytoma II and Oligodendroglioma II. On the other hand, the HGG
class includes tumors of type Astrocytoma III, Oligodendroglioma III, and Glioblas-
toma Multiforme IV. The dataset is a combination of metadata from various text and
spreadsheet files and the 110,020 images in the DICOM format, which also includes
a vast quantity of metadata.
3.2 Data Preprocessing
Due to the dataset size and the state of its metadata, preprocessing was a rather
large task. The preprocessing of the data uses the Visual Studio Code variant of the
Jupyter Notebook format “.ipynb”, which enables Markdown text cells and executable
Python code cells in the same file. Various helper functions are defined to aid in the
exploration of the data. The “Sample”, “Disease”, and “Grade” columns from the
metadata files are loaded into a pandas data frame; only these columns have value
for this particular task. The sample name in the metadata is compared to the paths
of the dataset. Differences like extra decimal points are manually removed; it is
important to ensure that future loading functions find the correct file.
The first step in preprocessing cleans the most common mistakes in any textual data.
Some of the data points have leading whitespace, which is removed. Data points
with missing values use differing conventions like “–” or “none”; to make them easier
to work with, all these values are changed to “NaN”. Next, all data points are
converted to lowercase, since both uppercase and lowercase are used. Values that
use “-” to separate words are normalized to a single convention. Finally, a new
column, label, is added to the data frame, to be populated later.
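The cleaning steps above can be sketched with pandas; the sample values below are hypothetical stand-ins for the raw metadata, not actual REMBRANDT entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sample": [" 900-00-0001", "900-00-0002 "],  # hypothetical IDs
    "Disease": ["ASTROCYTOMA", "--"],
    "Grade": ["II", "none"],
})

for col in df.columns:
    df[col] = (df[col].str.strip()                      # drop stray whitespace
                      .replace(["--", "none"], np.nan)  # unify missing markers
                      .str.lower())                     # one casing convention
df["label"] = np.nan  # populated in a later step
print(df["Disease"].tolist())  # ['astrocytoma', nan]
```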
The next step is performed after studying the metadata files for more specific infor-
mation. For example, the grade column for disease matching “gbm” is empty, and
from searching brain tumor grading conventions online, it is clear that the appropriate
grade is IV, which is inserted for those data points. Some data points in the disease
column have the value “mixed”. Since it is unclear which diseases are associated
with the data point, they are removed. Data points with missing values for disease
and grade are data points where the disease is unknown. Usually, these data points
would be removed. However, by studying the metadata, it is clear that in this case, it
means that no disease is associated with the data point. Therefore, these data points
are given the value “none” instead. In some data points, the disease is known, and
the grade is not known. Take “oligodendroglioma”, for example; it can be grade II or
III. In these cases, the data point is unusable and removed. Only diseases of type
“gbm” with missing grades can be correctly labeled, since they are always graded
as IV.
After verifying that there are no missing values remaining in the data frame, the
correct labels are assigned for all the data points. For example, diseases of type
“oligodendroglioma” and “astrocytoma” with grade II are labeled as “lgg”. On the
other hand, the same diseases with grade III are labeled “hgg”. Finally, all diseases
of type “gbm” with grade IV are labeled as “hgg”.
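The grade-filling and labeling rules above can be condensed into a short pandas sketch; the rows are illustrative examples, not actual metadata records:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Disease": ["gbm", "astrocytoma", "oligodendroglioma", np.nan],
    "Grade": [np.nan, "ii", "iii", np.nan],
})

# "gbm" is always grade IV, so its missing grades can be filled safely.
gbm = df["Disease"] == "gbm"
df.loc[gbm, "Grade"] = df.loc[gbm, "Grade"].fillna("iv")

# Rows with neither disease nor grade are non-tumorous in this dataset.
df.loc[df["Disease"].isna() & df["Grade"].isna(),
       ["Disease", "Grade"]] = "none"

# Grade II maps to "lgg"; grades III and IV map to "hgg".
df["label"] = df["Grade"].map({"ii": "lgg", "iii": "hgg", "iv": "hgg"})
print(df["label"].tolist()[:3])  # ['hgg', 'lgg', 'hgg']
```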
In order to make the files easier to use in the model training, an algorithm that finds all
files in all subfolders is implemented. This algorithm stores the file path for every file
in a “.csv” file. The content of the “.csv” file is then merged with the corresponding
samples of the data frame. Files discovered by the algorithm that are not linked to
any sample are removed since the information needed to label them correctly is not
present in the metadata.
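A sketch of the file-discovery step, using pathlib's recursive glob over a temporary stand-in for the DICOM archive (the folder layout and file names are hypothetical):

```python
import pathlib
import tempfile

import pandas as pd

# Hypothetical folder tree standing in for the dataset archive.
root = pathlib.Path(tempfile.mkdtemp())
(root / "sub1").mkdir()
(root / "sub1" / "slice1.dcm").touch()
(root / "slice2.dcm").touch()

# rglob walks every subfolder; the paths are stored in a .csv so they
# can later be merged with the corresponding samples of the data frame.
files = pd.DataFrame({"path": [str(p) for p in root.rglob("*.dcm")]})
files.to_csv(root / "paths.csv", index=False)
print(len(pd.read_csv(root / "paths.csv")))  # 2
```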
After completing the preprocessing, the dataset comprises 123 patients and
105,265 slides, distributed as shown in Table 3.1.
At this point, it was discovered that one key piece of information was missing: how to
separate the MRI slides that contained the tumorous cells from the ones that did not.
For every scan labeled tumorous, the tumor is only visible in 20–30% of the slides. If
the dataset had been used in this state, the model would learn to classify most
healthy tissue as tumorous; in other words, it would be useless.
Table 3.1: Dataset Sample Distribution
Disease Grade Label Unique Samples Sample Count
Astrocytoma II LGG 30 25286
Astrocytoma III HGG 17 16038
GBM IV HGG 43 32837
Non-Tumorous n/a n/a 15 17041
Oligodendroglioma II LGG 11 9335
Oligodendroglioma III HGG 7 4728
Total 123 105265
All of the source data were extensively reexamined, but the needed information was
never found. Therefore, alternative methods were explored to use the now prepro-
cessed dataset. For example, each patient has several MRI scans, and every scan
comprises a group of slides. By creating an animation of each group, the place-
ment of the tumors could be observed. The idea was to find some kind of pattern
for each label in order to filter out these particular slides and use only them during
model training and testing. Unfortunately, no usable patterns were discovered; also,
intuition waved a big red flag. Therefore the idea was discarded.
While searching online for how to find the needed key in the metadata or the DI-
COM files, one paper stood out (Khawaldeh et al. 2018). The paper used that same
dataset for a similar type of application. However, the number of slides used in this
paper was reduced from the original 105,265 to 4069 slides. By reading the dataset
section of the paper and evaluating the included tables and values, it became clear
that the authors of this paper had found a way to filter the data further.
Not finding this critical bit of information was becoming a significant problem for the
project and took a large portion of the allocated time. In order to learn how to repro-
duce the results of the found paper, a meeting was requested by reaching out to the
paper’s primary author, Dr. Khawaldeh.
While being grateful for Dr. Khawaldeh’s collaboration, some devastating news arose.
The needed key was not part of the publicly available dataset. After Dr. Khawaldeh’s
research team had cleaned the data to the same point as this project, they had em-
ployed help from neurologists to go through each slide manually and label them
correctly. Fortunately, Dr. Khawaldeh offered to share a dataset of labeled sam-
ples divided into Normal, LGG, and HGG. After some weeks, the dataset was re-
ceived—unfortunately, a much smaller sample than the 4069 slides described in
their paper. However, the received 736 correctly labeled slides were still a much
better candidate than anything publicly available.
The turn of events has made the work done on data preprocessing obsolete. How-
ever, it includes all