3000 images, 1500 images containing tumors, and 1500 im-
ages without tumors. The third and final dataset, BT-large-4c, comprises 3064 images spanning four classes: non-tumorous, glioma tumor, meningioma tumor, and pituitary tumor. All three datasets follow the standard convention of splitting into 80% for training and 20% for testing.
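For concreteness, the 80/20 split could be reproduced along the following lines; the array names and the use of stratification are assumptions, not details given by the researchers.

```python
# Illustrative 80/20 split; `images` and `labels` are assumed, pre-loaded arrays.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=0)
```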
Most of the images in the datasets contain undesired spaces and areas, so cropping each image to contain only the region relevant for analysis can lead to better classification performance. In addition, if a dataset is imbalanced or small, augmentation may boost the learning capabilities. Augmentation creates multiple copies of the images, modified in different ways, such as mirroring, rotating, or adjusting the brightness. In addition to augmentation, the images are resized to fit each pre-trained CNN's expected input dimensions: 224x224 px, except for Inception V3, which expects 299x299 px.
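A preprocessing pipeline of this kind could be sketched with torchvision transforms as below; the specific augmentation parameters are illustrative assumptions, and the crop to the relevant brain region is assumed to happen beforehand.

```python
# Illustrative augmentation and resizing pipeline (not the authors' exact code).
from torchvision import transforms

input_size = 224  # 299 for Inception V3

train_transforms = transforms.Compose([
    transforms.Resize((input_size, input_size)),          # match the CNN input size
    transforms.RandomHorizontalFlip(),                     # mirroring
    transforms.RandomRotation(degrees=15),                 # rotation
    transforms.ColorJitter(brightness=0.2),                # brightness adjustment
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # ImageNet statistics used
                         std=[0.229, 0.224, 0.225]),       # by the pre-trained CNNs
])
```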
The proposed scheme uses a novel feature evaluation and
selection mechanism, an ensemble of 13 pre-trained CNNs,
to extract robust and discriminative features from the brain
MRI images without human supervision. The CNN ensemble comprises ResNet-50, ResNet-101, DenseNet-121, DenseNet-169, VGG-16, VGG-19, AlexNet, Inception V3, ResNeXt-50, ResNeXt-101, ShuffleNet, MobileNet, and MnasNet. Since the researchers use fairly small datasets for training, they take a transfer-learning approach, keeping the ImageNet-pre-trained weights of each CNN's bottleneck layers fixed.
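The extraction step could look roughly as follows for one backbone; this is a sketch, not the authors' code, and it assumes a recent torchvision and an input batch `imgs` of shape (N, 3, 224, 224).

```python
# Frozen-feature extraction with one backbone (ResNet-50); the same pattern
# applies to the other twelve CNNs.
import torch
from torchvision import models

backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()   # drop the ImageNet classification head
backbone.eval()                     # bottleneck weights stay fixed (no fine-tuning)

with torch.no_grad():
    deep_features = backbone(imgs)  # (N, 2048) feature vectors for the ML classifiers
```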
Using the features extracted from the CNN models, a
synthetic feature is formed by evaluating each feature from
the CNN ensemble with an ensemble of nine different ML
classifiers and concatenating the top three features from the
different CNNs. Since different CNN architectures capture different aspects of the processed data, the resulting synthetic feature is more discriminative than a feature extracted from any single CNN.
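In outline, the selection could be sketched as below; `cnn_features` and `best_ensemble_accuracy` are hypothetical placeholder names, not the authors' identifiers.

```python
# Build the synthetic feature: score each CNN's features with the ML-classifier
# ensemble, keep the top three, and concatenate them.
import numpy as np

# cnn_features: dict mapping CNN name -> (N, d_i) feature matrix
scores = {name: best_ensemble_accuracy(X, y_train)   # best accuracy over the 9 classifiers
          for name, X in cnn_features.items()}

top3 = sorted(scores, key=scores.get, reverse=True)[:3]
synthetic_feature = np.concatenate([cnn_features[n] for n in top3], axis=1)
```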
The ML classifier ensemble, implemented using the scikit-learn library, comprises a fully-connected (FC) neural network (NN) layer, Gaussian Naïve Bayes (Gaussian NB), Adaptive Boosting (AdaBoost), k-Nearest Neighbors (k-NN), Random Forest (RF), Extreme Learning Machine (ELM), and Support Vector Machines (SVM) with three different kernels: linear, sigmoid, and radial basis function (RBF).
The first classifier uses the conventional CNN approach: a softmax-activated FC layer trained with a cross-entropy loss, the most commonly used loss function for neural networks. The layer weights are optimized with Adaptive Moment Estimation (Adam) at an initial learning rate of 0.001 that is adaptively recalculated, and the highest average accuracy per run over 100 epochs is recorded.
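A minimal PyTorch sketch of this classifier is given below; only the stated settings (Adam, learning rate 0.001, 100 epochs) come from the text, while the tensor names and the rest of the loop are illustrative.

```python
# FC softmax classifier on top of the extracted deep features.
import torch
import torch.nn as nn

fc = nn.Linear(deep_features.shape[1], num_classes)       # num_classes: 2 or 4
criterion = nn.CrossEntropyLoss()                         # softmax + cross-entropy
optimizer = torch.optim.Adam(fc.parameters(), lr=0.001)   # adaptive learning rate

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(fc(deep_features), labels)           # training features and labels
    loss.backward()
    optimizer.step()
```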
The researchers also use the Gaussian variant of Naïve Bayes, which assumes that the attributes within each class follow a Gaussian (normal) distribution with no covariance between them.
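With scikit-learn this amounts to a few lines; the data names are placeholders.

```python
# Gaussian Naive Bayes on the deep features (scikit-learn defaults).
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X_train, y_train)
accuracy = gnb.score(X_test, y_test)
```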
The next classifier, Adaptive Boosting, or AdaBoost for short, is an ensemble learning algorithm that combines multiple weak classifiers (decision trees with a single split, called stumps) to improve performance. AdaBoost works iteratively, assigning higher weights to misclassified instances.
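An illustrative scikit-learn version is shown below; the default base learner of AdaBoostClassifier is a single-split decision tree (stump), and all other settings are assumed.

```python
# AdaBoost with decision-stump base learners.
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier().fit(X_train, y_train)
accuracy = ada.score(X_test, y_test)
```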
The following classifier is one of the simplest classifiers, the
k-Nearest Neighbors (kNN). kNN does not train a model but
calculates predictions directly from the data currently stored
in memory. Using Euclidean distance as the distance metric, the kNN classifier finds the k training instances closest to the given feature and assigns the most common class label among those neighbors, i.e., the majority vote. The researchers vary the number of neighbors from 1 to 4 and select the value with the highest accuracy.
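The sweep over k could be sketched as below; evaluating on a held-out split is an assumption, as the text does not state how the best k is scored.

```python
# Sweep k from 1 to 4 and keep the best-performing value.
from sklearn.neighbors import KNeighborsClassifier

best_k, best_acc = None, 0.0
for k in range(1, 5):
    acc = KNeighborsClassifier(n_neighbors=k, metric="euclidean") \
              .fit(X_train, y_train).score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc
```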
Random Forest (RF) is a learning algorithm that creates multiple decision trees using the bootstrap aggregation (bagging) method to classify features into a class, using the Gini index as the cost function while growing the trees. At each split, RF considers a random subset of n features to find the optimal split point, which reduces the correlation among the trees and lowers the ensemble error rate. RF predicts by feeding a feature into all the classification trees, counting the votes for each class, and choosing the class with the largest number of votes. To find the optimal configuration, the researchers set the number of features considered per split to the square root of the total number of features and vary the number of decision trees from 1 to 150, selecting the count with the highest accuracy.
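A scikit-learn sketch of this sweep follows; the stated choices are sqrt feature sampling, the Gini criterion, and 1 to 150 trees, while the evaluation split and random seed are assumptions.

```python
# Random Forest: sweep the number of trees and keep the best.
from sklearn.ensemble import RandomForestClassifier

best_n, best_acc = None, 0.0
for n_trees in range(1, 151):
    rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                criterion="gini", random_state=0)
    acc = rf.fit(X_train, y_train).score(X_test, y_test)
    if acc > best_acc:
        best_n, best_acc = n_trees, acc
```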
Extreme Learning Machine (ELM) is a learning algorithm for Single-Layer Feed-Forward Neural Networks (SLFN) that provides good performance at a fast learning speed. Unlike the iterative back-propagation algorithm used to train traditional SLFNs, ELM assigns the hidden-layer weights randomly and solves for the output weights in a single step. The researchers tried 5000, 6000, 7000, 8000, 9000, and 10,000 hidden nodes and selected the configuration with the highest accuracy.
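ELM is not part of scikit-learn, so a minimal NumPy sketch is given below; 5000 hidden nodes is one of the values the researchers report trying, and the tanh activation and function names are assumptions.

```python
# Minimal ELM: random hidden weights, output weights solved by least squares.
import numpy as np

def elm_fit(X, Y_onehot, n_hidden=5000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input-to-hidden weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y_onehot           # output weights in one step
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```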
The Support Vector Machine (SVM) uses a kernel function to transform the original feature space into a higher-dimensional space and then aims to find a hyperplane in that space that distinctly separates the classes. The researchers use the three most common kernel functions: linear, sigmoid, and radial basis function (RBF). In addition, the SVM has two hyper-parameters: C, the soft-margin cost parameter that controls each support vector's influence, set to 0.1, 1, 10, 100, 1000, and 10000; and gamma, which controls the curvature of the decision boundary, set to 0.00001, 0.0001, 0.001, and 0.01. The hyper-parameter combination that yields the highest accuracy is then selected.
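The search over these values could be expressed as a grid search; using cross-validation for selection, as GridSearchCV does by default, is an assumption rather than a stated detail.

```python
# Grid search over the stated kernels, C, and gamma values.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "kernel": ["linear", "sigmoid", "rbf"],
    "C": [0.1, 1, 10, 100, 1000, 10000],
    "gamma": [0.00001, 0.0001, 0.001, 0.01],
}
best_svm = GridSearchCV(SVC(), param_grid).fit(X_train, y_train).best_estimator_
```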
Experimentation on the given datasets has two main tasks: first, to compare the pre-trained CNN features across the ML classifiers; second, to show the effectiveness of concatenating the top two or three features that produced the best results in the first experiment.
For example, the top three features on the BT-small-2c
dataset are the DenseNet-169, Inception V3, and ResNeXt-50
features. Then on the BT-large-2c dataset, the DenseNet-121,
ResNeXt-101, and MnasNet features are the top three. While