ABSTRACT
In Computer Vision, food image recognition is considered one of the most significant and promising applications of visual object recognition. Food image recognition is a more fine-grained visual recognition challenge than conventional image recognition: the huge diversity and variety of food available across the globe makes it very challenging. A deep learning algorithm, the Convolutional Neural Network (CNN), is implemented to recognize and classify Indian food images. CNNs are considered the best deep learning algorithms for image classification tasks because of their ability to automatically extract and learn features from the input images. The experimental study uses a dataset of 60,000 grayscale images of size 280×280 belonging to ten classes of Indian food. The performance evaluation is based on classification accuracy. A comparative study is carried out by training the model on a CPU and a GPU, and the running time is recorded. The model is also trained on only 10,000 images to study the importance of a large dataset for developing a good model with better accuracy. Graphical representations are provided: the loss and accuracy curves illustrate the importance of a large training dataset and show how additional epochs improve the classification accuracy. The algorithm achieves a fairly good classification accuracy of 96.95% on the images tested in the dataset in just one epoch.
Keywords – Supervised Machine Learning, Pattern Recognition, Convolutional Neural Network (ConvNet/CNN), Indian food dataset, food image classification.

Chapter 1
INTRODUCTION
Food is a necessity for human existence. Since time immemorial, humans have gravitated towards food that suits their taste buds. Every food has a unique shape, size, colour, texture, etc., which makes food image recognition and classification a challenging task. There is huge diversity in food, and even food belonging to the same class varies widely. The difficulty of food recognition increases with the way the food is presented. This raises the problem of recognizing and classifying food items.

Food image classification has numerous applications, ranging from dietary assessment to identifying the nutritional value, calorie content, and ingredients of food. Such applications promote healthy eating, prevent food wastage, and assist people with food allergies or diabetes. Hence, food image recognition and classification is very important.

Object recognition is an integral part of Computer Vision that identifies an object in a given image irrespective of background, occlusion, viewing angle, lighting, etc. Numerous Machine Learning techniques have been deployed for image recognition tasks, such as the Bag-of-Features (BoF) model and the Support Vector Machine (SVM), in which local features are extracted and these hand-crafted features are fed into the classifier. The conventional methods do not produce the desired accuracy because the features need to be hand-crafted from the images, which is a manual process.

Deep Learning is a subset of Machine Learning that has evolved rapidly in recent times. CNNs gained popularity after Krizhevsky et al. [1] won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. The CNN is a deep learning technique that has shown incredible performance on various machine learning and computer vision problems. Hence, CNNs are used for image recognition (identification) and image classification (categorization) problems in areas such as computer vision, handwriting recognition, voice recognition, and Natural Language Processing (NLP).

A CNN is a Deep Neural Network (DNN) consisting of additional repeated convolutional and pooling layers which automatically learn features from the given input image, instead of being fed hand-crafted features as in traditional approaches. Thus, the CNN is considered the state of the art for image classification problems. Image classification is the task of assigning an image a label from a fixed set of predefined categories. For a given input image, the CNN calculates a class probability for every class considered and labels the image with the class that has the highest probability.

Implementing a CNN model requires a significantly large number of training images to train the parameters of the network. There are two ways to deal with this problem. One approach is to pre-train the model using a dataset consisting of a large number of images of different categories; this approach is called transfer learning. The other approach expands the existing small dataset by applying affine transformations such as rotation, flipping (horizontal, vertical), and resizing.
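The second approach can be sketched with simple array operations. The following is a minimal NumPy illustration, not the project's actual pipeline; the toy 4×4 image and the particular set of transformations are assumptions chosen for demonstration:

```python
import numpy as np

def augment(image):
    """Expand one image into several variants using simple
    affine-style transformations (flips and 90-degree rotations)."""
    variants = [
        image,                 # original
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # rotate 90 degrees
        np.rot90(image, k=2),  # rotate 180 degrees
    ]
    return variants

# A toy 4x4 "grayscale image"
img = np.arange(16).reshape(4, 4)
augmented = augment(img)  # five training samples from one source image
```

Applied to every source image, such a scheme multiplies the size of a small dataset by the number of transformations used.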

1.1. Problem statement
Food image recognition and classification is a challenging visual object recognition problem in Computer Vision. Deep learning has shown fairly accurate results on this task. A deep learning CNN model is developed for food image recognition and classification, using a data augmentation technique to expand the existing food image dataset. A dataset consisting of 60,000 grayscale Indian food images is chosen. The food images are the most common Indian breakfast and snack items, belonging to 10 different classes. The task is to correctly assign one class per image, i.e., each image can belong to only one class.
1.2. Goals
The following are some of the goals the project aims to achieve:
Expand the small dataset of food images belonging to 10 different classes by applying affine transformations (scaling, rotation, etc.) to obtain 60,000 grayscale images. A larger dataset is important for training because the model learns more detailed features, which helps it correctly classify a given input image and thus increases the classification accuracy.

Design a five-layer CNN consisting of alternating convolutional and pooling layers followed by a fully connected layer. The hyperparameters are chosen carefully to improve the classification accuracy of the model and to train it in less time. The model is trained using only one epoch.

Evaluate the performance of the model based on classification accuracy. The model achieves a best classification accuracy of 95%, the highest recorded accuracy in testing the model so far.

Demonstrate the importance of GPUs, which suit deep learning models due to their high computational capability, by training the model on both a CPU and a GPU and recording the running times.

Demonstrate the importance of a large dataset and more epochs by training on only 10,000 images, evaluating the classification accuracy per epoch, and representing the results graphically.

Chapter 2
LITERATURE SURVEY
Significant work has been carried out on the recognition and classification of food images for various purposes, such as dietary assessment, recognizing the diverse foods available, analyzing people's calorie intake and eating habits, Amazon Go's grocery image detection, etc.
2.1.1. Using different Machine learning approaches
In [2], Abhishek Goswami and Haichen Lu performed a comparative study of food image classification using various deep learning models and provided a detailed description of the models and the obtained results.
They chose 26,984 colour images of 20 different classes from across the globe. The images were divided into three sets: training (18,927 images), validation (5,375 images) and testing (2,682 images). The images were preprocessed and resized to 32×32×3, which caused a loss of 10% of the images because they did not fit the specified dimensions. The experiments were carried out on various machine learning models, each evaluated by the accuracy it obtained. A description of each model is given below, together with its validation accuracy and test accuracy.

A. Using Raw Image Pixels with Linear Classifier

A linear classifier works as a template matcher, in which the weights learnt for each class act as a template for that class. The feature considered was the raw image pixels, classified with an SVM (Support Vector Machine) classifier and a Softmax classifier.

The learning rate was set to 1e-07 with a regularization strength of 2.5e+04. The best validation accuracy was obtained with the SVM classifier.

B. Raw Image Pixels with Fully Connected Neural Networks (NN)
The network architecture considered here consisted of six fully connected layers. The ReLU non-linearity was applied to all layers and the softmax loss function to the last layer. Raw image pixels were again used as the feature. The learning rate was set to 1e-03. The model was trained and validated over 20 epochs; the best classification accuracy obtained on the validation set was 0.19, and the test accuracy was 0.18. The Adam optimizer was used as the parameter update rule. One important observation was that batch normalization was useful for improving training performance.

C. Support Vector Machine (SVM) and Fully Connected NN on Image Features
Image features were extracted and a feature vector was formed by concatenating the Histogram of Oriented Gradients (HOG) and a colour histogram. Each image had 155 features.

Linear SVM classifier: with the learning rate set to 0.001 (1e-03) and a regularization strength of 1e+00, the validation accuracy obtained was 0.21. SGD (Stochastic Gradient Descent) was used as the update rule.

NN classifier: a two-layer fully connected Neural Network was used, which provided a validation accuracy of 0.26 and a test accuracy of 0.27. The learning rate was set to 0.9 with a regularization strength of 0, using the SGD update rule.

D. Convolutional Neural Networks (CNN)
The CNN architecture had five convolutional and max-pooling layers followed by two fully connected layers. The kernel size was fixed at 32×32, with 32 filters in each convolutional layer. A dropout of 0.75 was applied to each layer. The last layer used a softmax classifier with cross-entropy loss. The validation and test accuracy obtained was 0.4, using the Adam optimizer over 25 epochs with a learning rate of 1e-04.

E. Transferred Learning using VGG-16 Pre-Trained Model
To further improve on the accuracy obtained by the CNN model, a VGG-16 model [3] pre-trained on ImageNet was used. The VGG-16 model was modified by dropping the last fully connected layer and replacing it with a layer of 20 outputs. The last layer was trained for 10 epochs, then the whole network was trained for 10 more epochs.

The dataset was pre-processed and cropped to 255×255. The training images were horizontally flipped with one-half probability, and the VGG colour mean was subtracted from the entire dataset.

2.1.2. CNN with hand-crafted features, Fisher Vector with HoG and Colour Patch
Yoshiyuki Kawano and Keiji Yanai [3] tried to improve classification accuracy by combining Deep Convolutional Neural Networks with traditional hand-crafted features (Fisher Vectors with HoG and colour patches), following the work done by Chatfield et al. on generic datasets such as PASCAL VOC 2007 and Caltech-101/256.

The UEC-FOOD100 dataset, containing 100 classes with more than 100 food images per class, was used in their experiment. Each food photo also has a bounding box indicating the location of the food. A subset with 70 different food items was used for the experiment. The previously recorded accuracy on this dataset was 59.6%. The classification accuracy was evaluated for various features; the corresponding accuracies are given in the following table.

Table I: Feature selection and classification accuracy

Feature                          Accuracy (%)
RootHoG-FV                       50.14
Colour-FV                        53.04
RootHoG-FV + Colour-FV           65.32
DCNN                             57.87
RootHoG-FV + Colour-FV + DCNN    72.26 (top-1), 92.00 (top-5)
2.1.3. Using Pre-trained Convolutional Neural Networks and hyperparameter Fine-tuning
Keiji Yanai and Yoshiyuki Kawano further improved their accuracy by using 2,000 categories of food images on a pre-trained DCNN model [6].

The pre-trained model is a modified version of the AlexNet network structure, which acts as a feature extractor. Following the work done by Oquab et al. [7], the DCNN features were improved by extracting 1,000 food-related classes from the 21,000-category ImageNet and adding them to the ILSVRC 1000-class ImageNet, so as to pre-train the DCNN.

It took about one week to pre-train the DCNN on an NVIDIA GeForce TITAN BLACK GPU with 6 GB of memory. Caffe was used to pre-train the model.

The model was trained on the Japanese food datasets UEC-FOOD100 and UEC-FOOD256, which are openly available for public use. Each class consisted of more than 100 images. The best classification accuracy achieved on the fine-tuned UEC-FOOD100 and UEC-FOOD256 was 78.77% and 67.57% respectively.

2.1.4. CNN with parameter optimization and colour importance for feature extraction
Very interesting work has been done by Hokuto Kagaya, Kiyoharu Aizawa, and Makoto Ogawa, who developed a CNN model through parameter optimization [4].
Colour is considered a very important factor for the food image recognition task: they observed that the learned feature kernels rely heavily on colour. This means that colour images provide better understanding and better learning for the CNN model, helping it learn more detailed features and thereby improve the classification accuracy. The dataset was created from the publicly available Food Logging (FL) system and consisted of 170,000 images of everyday meals belonging to the 10 most commonly logged meal categories. The 80×80 images were cropped to 64×64 using the cuda-convnet Python module.
A comparative study was conducted to compare the CNN model with a traditional SVM on hand-crafted features. The SVM obtained 50-60% accuracy, whereas a two-layer CNN achieved more than 70%; a CNN with a 5×5 filter size and one-time normalization, evaluated with 6-fold cross-validation, provided a good accuracy of 73.70%. The colour features further enhanced the accuracy of the CNN. Thus, it was concluded that colour features dominate the food recognition process, which is in line with the work done by Bosch [5], where hand-crafted colour features were regarded as the best hand-crafted features.

2.1.5. CNN with Affine Data Transformation Technique
Yuzhen Lu [8] used a small dataset consisting of 5,822 colour images belonging to 10 different classes, sourced from ImageNet. The classification accuracy was evaluated for a Bag-of-Features (BoF) model integrated with an SVM and for a four-layer CNN. To increase the number of images, affine transformations were applied to the dataset.
Prior to training the model, the images were down-sampled to 128×128. The entire dataset was divided into a training set of 4,654 images and a testing set of 1,168 images.

In the experiment, the BoF model with SIFT (Scale Invariant Feature Transform) descriptors extracted features from the images, which were used as input to a linear SVM to classify them. The VLFeat library [9] was used to implement this procedure. The feature descriptors are unaffected by position, occlusion, illumination, viewing perspective, scaling, etc.

This conventional, powerful BoF model with a robust and popular feature descriptor obtained a classification accuracy of 68% on training images and 56% on test images.

The same dataset was used to train a four-layer CNN with three convolutional-pooling layers and one fully connected layer. The ReLU activation function was used throughout the CNN. SGD (Stochastic Gradient Descent) with cross-entropy loss was used to minimize the loss and improve the accuracy of the model. To prevent overfitting, a dropout of 0.2 was applied to the third convolution-pooling layer and 0.5 to the fully connected layer.

A decaying learning rate was used that was continuously updated by the exponential function α = α0 × exp(C), where α0 is set to 0.001 and C denotes the training loss. The model was trained on a GPU with the Keras library in the Spyder environment.
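Assuming the exponential form given above, the schedule can be sketched in a few lines (the sample loss values are illustrative): as the training loss C falls, exp(C) shrinks towards 1 and the learning rate decays towards α0.

```python
import math

def decayed_learning_rate(alpha0, loss):
    """Exponentially scaled learning rate: alpha = alpha0 * exp(C),
    where C is the current training loss."""
    return alpha0 * math.exp(loss)

alpha0 = 0.001
# The rate shrinks as the training loss C falls from 2.0 towards 0.0
rates = [decayed_learning_rate(alpha0, c) for c in (2.0, 1.0, 0.0)]
```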

The CNN was initially trained without data expansion for 100 epochs. The training accuracy obtained was 95% and the testing accuracy 74%. The model began overfitting after only 10 epochs, due to the limited number of images available for training.

To avoid overfitting and improve the model's accuracy, the model was retrained on the dataset expanded by transformations such as rotation and scaling (horizontal and vertical). The CNN showed better performance with the expanded dataset and became more generalized, i.e., the overfitting was eliminated. The test accuracy after 100 epochs was 87%. When the training was extended to 400 epochs, the test accuracy rose above 90%.

Chapter 3
CONVOLUTIONAL NEURAL NETWORKS
In Machine Learning [12], a Convolutional Neural Network (CNN, or ConvNet) is a class of deep, feedforward Artificial Neural Networks widely used for visual image recognition tasks. A CNN consists of an input layer with 'n' neurons, an output layer with the predefined classes, and a number of hidden layers depending on the architecture designed. The hidden layers of a CNN include convolutional layers, pooling (subsampling) layers, loss (dropout) layers, and fully connected (dense, Multi-Layer Perceptron) layers. A CNN is a multilayer Neural Network in which each layer is fed with small patches of the output of the previous layer.

Figure 3.1.: Convolutional Neural Network
The CNN consists of the following hidden layers in the order given below:
Convolutional layer [13]: the first layer in a CNN, which performs automatic feature extraction. A filter is an array of numbers (called weights or parameters) denoted by an n×n matrix, where n is smaller than the image size.

The filter (kernel/neuron) slides over the input image from the top-left corner towards the right, performing element-wise multiplication between the filter and the image patch it covers. This operation is called convolution. The sum of the products at each position is stored in an array; the matrix obtained after sliding the filter over the whole image is called the 'activation map' or 'feature map'.

The filters can be regarded as feature identifiers and the convolutional layer extracts the features to be learned.

Figure 3.2.: 5×5 input image and 3×3 filter. Source: [13]
After each convolution operation, a non-linear operation called ReLU (Rectified Linear Unit) is performed. This pixel-wise operation replaces negative values in the feature map with zero. The resulting feature map is called the 'rectified feature map'.
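The convolution and ReLU steps described above can be sketched in plain NumPy. This is a minimal illustration, not the project's implementation; the 5×5 image and the vertical-edge filter are made-up values:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding),
    summing element-wise products to build the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

def relu(feature_map):
    """Pixel-wise ReLU: negative values become zero."""
    return np.maximum(feature_map, 0)

image = np.array([[1, 0, 2, 1, 0],
                  [0, 1, 1, 0, 2],
                  [2, 1, 0, 1, 1],
                  [1, 0, 1, 2, 0],
                  [0, 2, 1, 0, 1]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # toy vertical-edge detector

feature_map = conv2d(image, kernel)  # 3x3 activation map
rectified = relu(feature_map)        # rectified feature map
```

Note how a 3×3 filter over a 5×5 image yields a 3×3 feature map, matching the figure.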

Figure 3.3.: The convolution operation. The convolved feature (feature map) is the output of the convolution operation. Source: [13]
Pooling layer: the pooling operation is also called downsampling or sub-sampling. The major function of the pooling layer is to shrink the size of its input (typically to a quarter of its original size for a 2×2 window with stride 2) while retaining the most important information. The output is produced by an aggregation performed over rectangular regions (windows); the most commonly used methods are average pooling and max pooling. This makes the output of the CNN invariant to small positional changes.

Figure 3.4.: The max-pooling operation. Source: [13]
There are several advantages of using pooling operation. Some of them are:
The dimensionality of the input is reduced, making it more manageable.

The number of trainable parameters and the number of computations performed in the network are reduced. This mitigates overfitting, one of the common issues with CNNs.

Pooling makes the network robust to small deformations, alterations, and translations in the input image.
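The max-pooling operation described above can be sketched as follows (the 4×4 feature-map values are illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling with stride 2: keeps the largest value in each
    window, shrinking the map to a quarter of its original area."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = feature_map[r*size:(r+1)*size, c*size:(c+1)*size]
            out[r, c] = window.max()
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 6, 0, 1]], dtype=float)
pooled = max_pool(fmap)  # 2x2 result: [[4, 5], [6, 3]]
```

The 4×4 map shrinks to 2×2, yet each window's strongest activation survives, which is why small translations of the input barely change the output.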

Fully connected layer: the final layer of the CNN, in which the features extracted by the convolutional and pooling layers are used to classify the input image into a particular class. The fully connected layer is a conventional Multi-Layer Perceptron in which each neuron is connected to every neuron in the previous layer.
The softmax activation function is used in the final output layer. The input to the softmax function is a vector of arbitrary real-valued scores; its output is a vector of values between zero and one that sum to one. [13] The softmax layer output is σ(Σi wixi + b), which gives, for each class i, P(yi = 1 | xi; w). For each sample x, the class i with the maximum softmax output is the predicted class for that sample. For an image classification problem, each unit of the final layer indicates the probability of a particular class.
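The softmax mapping from raw scores to class probabilities can be sketched in NumPy (the three raw scores are illustrative values, not outputs of a trained network):

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities in (0, 1) that sum
    to 1. Scores are shifted by their max for numerical stability."""
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

raw_scores = np.array([2.0, 1.0, 0.1])   # scores for 3 classes
probs = softmax(raw_scores)
predicted_class = int(np.argmax(probs))  # class with highest probability
```

The image is labelled with `predicted_class`, the class whose probability is largest.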
Chapter 4
SYSTEM REQUIREMENTS
4.1. EXISTING SYSTEM

From the literature survey it is apparent that the CNN is considered the cutting-edge technology for food image recognition and classification. In comparison with the various Supervised Machine Learning techniques, the CNN provides better classification accuracy, since it automatically learns the features that are most helpful for food image classification.

Various datasets have been used for food image classification. In [10], the Pittsburgh Fast-Food Image Dataset, consisting of American fast food, is used for food classification with a mobile phone for dietary assessment. In another experiment, Hoashi et al. [11] used 85 Japanese food items for classification and achieved 62.5% accuracy. Many other generic food items from around the world have also been chosen for recognition and classification.
Some observations made are as follows:
Chinese, Japanese, American fast food, etc., have been chosen for classification, but the essence of Indian food is missing from these datasets.

Various models have been deployed and tested to improve the accuracy of food image classification, and all conclude that the CNN is the best model for image classification, particularly for food. Food image recognition and classification is more fine-grained than other image recognition tasks because of the unique textures, shapes, sizes, colours, etc. of food. Even the same food may look different depending on how it is presented.

All the previous work focuses on improving classification accuracy so that the models can be used for dietary assessment and for identifying the calorie count, nutritional value, etc. of food.

The motivation for the proposed work is drawn from the work carried out by Yuzhen Lu [8]. In his work on food image classification using a CNN, Lu used a data expansion technique, applying affine transformations to the existing food image dataset to increase the number of images for training and testing. He chose 5,822 colour images of size 128×128 belonging to 10 different classes and designed a four-layer CNN trained with SGD. Dropout was applied to the last convolutional layer and the fully connected layer. He achieved a test accuracy above 90% after applying affine transformations to the colour image dataset and training the model for up to 400 epochs on a GPU with the Keras library in the Spyder environment.

Table II: Images per category used in the experiment by Yuzhen Lu

Food Item     Number of images
Apple                    1,050
Banana                     310
Broccoli                   327
Burger                     519
Egg                        626
French fry                 296
Hotdog                     639
Pizza                    1,248
Rice                       352
Strawberry                 455
4.2. PROPOSED SYSTEM
The proposed CNN architecture is designed with an input layer; five convolutional layers, each followed by a ReLU operation and max-pooling with a 5×5 window; a fully connected layer (a flatten layer followed by a dense layer) with ReLU and a dropout layer; and a final fully connected output layer with the softmax activation function. The learning rate is set to 1e-3 (i.e., 0.001).
The dataset consists of 60,000 grayscale images of size 280×280 belonging to 10 different Indian snacks. The training dataset consists of 50,000 images, of which 48,800 are used for training the network and 200 for validating it.
The classic SGD (Stochastic Gradient Descent) is replaced with the Adam optimizer [14], along with the categorical cross-entropy loss, to measure the error at the softmax layer and update the weights so as to minimize it.
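The categorical cross-entropy loss measured at the softmax layer can be illustrated in plain NumPy (the one-hot labels and predicted probabilities below are made up for demonstration; the project itself uses the loss built into Keras):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch: -sum(t * log(p)).
    y_true holds one-hot labels, y_pred the softmax probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return float(-np.sum(y_true * np.log(y_pred)) / y_true.shape[0])

# Two samples, three classes: a confident-correct prediction and an
# uncertain one. The uncertain sample contributes most of the loss.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]], dtype=float)
y_pred = np.array([[0.9, 0.05, 0.05],
                   [0.4, 0.4,  0.2]], dtype=float)
loss = categorical_cross_entropy(y_true, y_pred)
```

The optimizer (Adam here) adjusts the weights in the direction that reduces this loss.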
4.3. FEASIBILITY STUDY

Implementing deep learning models requires very powerful processors, since complex computations are performed at each step. At each step of a CNN, huge numbers of computations are performed, and the weight values change for every layer and every input. The output of one layer forms the input to the next, so the intermediate results must be saved to be fed into the next layer.

The basic factors to consider in designing a deep learning CNN model are processor speed and memory requirements. Depending on the dataset chosen and the architecture implemented, the memory and processor requirements vary.

Deep learning models can be developed on Intel HD Graphics (CPU). However, to speed up the computations, NVIDIA graphics cards (GPUs) are used. Small models with few layers can be trained on a local machine using a CPU or GPU. To train larger models, Amazon Web Services (AWS) cloud resources can be used.
Deep learning models involve a huge number of matrix multiplications and other operations that can be parallelized. GPUs work well for DNN computations because they offer more parallel resources and faster memory bandwidth, so DNN computations fit the GPU architecture well. Computational speed is extremely important because training Deep Neural Networks can take anywhere from hours to days to weeks. Much of the success of deep learning is due to the use of GPUs.

There is a wide variety of software available for free to develop deep learning models. The choice of software depends on the problem statement and the design choices of the developer. In this project, the free Anaconda suite is used, which has many built-in services and rich support for easily developing the CNN model.

The overall feasibility study concludes that the CNN model is technically, financially and socially feasible. Further, the free storage support from AWS makes it convenient to run the code anywhere and on any machine.

4.4. REQUIREMENTS ANALYSIS
Some of the hardware and software needed for the implementation and evaluation of the deep learning CNN model are described below. Due to their heavy computations, deep learning models require high-end processors and plenty of primary memory. Python is the basic language used for implementing the machine learning models.

4.4.1. HARDWARE REQUIREMENTS

The CNN model is implemented and tested on an Asus ROG GL553VE, which is suitable for running the model. This laptop has an Intel Core i7-7700HQ processor (CPU) with a standard 32 GB of DDR4 memory, NVIDIA GeForce GTX 1050 Ti graphics (GPU), a 1 TB Hard Disk Drive (HDD), and a 250 GB Solid State Drive (SSD). For the comparative study, the model is also run on an Aspire ES 15 with an Intel Core i3 processor (CPU), 4 GB of RAM, and a 1000 GB HDD.
These features provide faster speeds, increased memory space, and considerable energy efficiency, which are necessary when developing and testing machine learning models.
4.4.2. SOFTWARE REQUIREMENTS

Anaconda 3
Anaconda is a freely available, enterprise-ready Python distribution that can be used for data processing, scientific computation, and data analytics. Anaconda comes with built-in Python 2.7 or Python 3.4 along with more than 100 Python packages that are tested and optimized across platforms. All Python-based tools can be used with Anaconda. It enables one to create isolated custom environments combining the different available Python versions, and to switch between them, using the command "conda", an innovative multi-platform package manager for Python and other languages [14]. Anaconda can be used on Linux, macOS, and Windows.

Spyder 3 (Scientific Python Development EnviRonment)
The Anaconda suite includes a free Integrated Development Environment (IDE) that allows users to write Python code. Spyder is a powerful IDE providing advanced editing, interactive testing, debugging, and introspection features for scientific computing in Python. It comes with an editor for writing code, a console to evaluate it and see the results at any time, a variable explorer to inspect the variables defined during evaluation, and several other facilities that help scientists develop their programs effectively.
IPython (the enhanced interactive Python interpreter) and popular Python libraries such as NumPy (linear algebra), SciPy (signal and image processing) and matplotlib (interactive 2D/3D plotting) make it a capable numerical computing environment. The matplotlib/pylab integration in Spyder can be used to plot figures in the Python or IPython console.

Keras
Keras is an open-source neural network library written in Python. Keras runs on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or MXNet; one of these backends (e.g., TensorFlow or Theano) must be installed separately to use Keras.

Being user-friendly, modular, and extensible, Keras was designed to speed up experimentation with Deep Neural Networks. Keras was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System) by Google engineer François Chollet, who also maintains it.
Rather than being an independent machine learning framework, Keras is intended as an interface: it offers a more sophisticated, more intuitive set of abstractions that make it easy to build deep learning models regardless of the computational backend used.
Keras has abundant implementations of commonly used neural network building blocks such as optimizers (Adam, Adagrad, RMSProp), activation functions (ReLU, softmax), loss functions (e.g., MSE, cross-entropy), and layers, along with an assortment of tools for easily handling text and image data.

Keras lets users deploy deep learning models on smartphones (iOS and Android), on the web, or on the Java Virtual Machine (JVM).

Python
The most convenient way of installing Python is via the Anaconda scientific Python distribution. Anaconda includes the most frequently used Python packages, preconfigured and ready to use; the installation incorporates approximately 150 scientific packages.

Python is supported on Windows, Linux, and Macintosh operating systems. A typical Python machine learning pipeline uses some combination of the following tools:

NumPy, for matrix and vector manipulation.

Pandas for time series and R-like Data-Frame data structures.

The 2D plotting library matplotlib.

SciKit-Learn as a source for many machine learning algorithms and utilities.

Keras for neural networks and deep learning.

Python is a general-purpose programming language that is widely deployed in many areas, from web development to deep learning, and is currently ranked among the three most popular programming languages. Because of this popularity, Python has a flourishing open-source community: more than 80,000 free software packages for Python are available on the official Python Package Index (PyPI).

Chapter 5
SYSTEM DESIGN

5.1. FLOWCHART

Figure 5.1. : Deep Learning Flow Chart
The overall process of the CNN is summarized in the following steps:
Step 1: Iitialization
 Randomly initialize the filters, filter size and weights, stride etc.

Step 2: Score Function
 The training images are split into mini-batches and fed into the network through the input layer. The CNN performs feedforward propagation, extracting essential features from the image:
i.e., Conv → ReLU → Max-pooling → Fully connected layer.

The CNN then finds the output probabilities for each class in the training dataset using the score function, which maps the raw data to class scores.
The score function for a linear classifier [15] is f(xi, W, b) = Wxi + b,
where xi symbolizes the input image. The matrix W represents the weight associated with the neuron, and the vector b represents the bias related to the neuron. W and b signify the parameters of the score function.

For the image classification task, the input to the score function is an image xi, for which the score function calculates the vector f (xi,W) of the raw class scores denoted by ‘s’. Therefore, given an image xi, the predicted score for the jth class will be the jth element in ‘s’, given by sj = f(xi;W)j .
The class scores obtained from the training data will be used to compute the loss.
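The linear score function above can be sketched in NumPy. The dimensions here are illustrative; a flattened 280×280 image would give D = 78,400 input values:

```python
import numpy as np

np.random.seed(0)
D, K = 6, 10                      # input dimension, number of classes
x_i = np.random.rand(D)           # one flattened input image
W = 0.01 * np.random.randn(K, D)  # weights: one row per class
b = np.zeros(K)                   # biases: one per class

# Score function: maps the raw image to a vector of K class scores.
s = W.dot(x_i) + b                # s[j] is the raw score for the jth class
print(s.shape)                    # (10,)
```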

When the network is first trained, the weights are chosen randomly. As a result, the output probabilities are also random.

Step 3: Loss function
These initially random class scores are used to calculate the loss function, which measures how well the predicted scores match the ground-truth labels in the training data.

The loss function, also called the cost function or objective, expresses our dissatisfaction with the predicted scores output by the score function. Intuitively, if the predicted scores closely match the training data labels then the loss is low; otherwise the loss is high.

Softmax Classifier
The Softmax classifier employs the cross-entropy loss (also called the softmax loss). The softmax function can be represented as: fj(z) = e^{zj} / Σk e^{zk}
The input to the softmax function is a vector of real-valued scores (in z). It squashes the scores into a vector of values between zero and one whose total sum is one.
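The softmax function can be written as a short, numerically stable NumPy routine. Subtracting the maximum score before exponentiating does not change the result but avoids overflow:

```python
import numpy as np

def softmax(z):
    # e^(z - c) / sum(e^(z - c)) equals e^z / sum(e^z) for any constant c;
    # using c = max(z) keeps the exponentials from overflowing.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([3.2, 5.1, -1.7])   # raw class scores 'z'
p = softmax(scores)
print(p)            # values in (0, 1) that sum to one
```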

For the ith example in the data, the cross-entropy loss is given as:
Li = −log( e^{f_{yi}} / Σj e^{fj} )
where fj means the jth element of the vector of class scores f, and yi is the correct class.
The softmax function flattens the raw class scores denoted by ‘s’ into normalized positive values that add up to one on which the cross entropy loss can be applied.
The total error at the output layer, summed over all 10 classes considered in the problem, is given by:
 Total Error = Σ ½ (target probability − output probability)²
The complete loss for the dataset is the mean of Li over all training examples, together with a regularization term, R(W).

L = (1/N) Σi Li + λR(W)
where N corresponds to the total number of images in the training set, and λ is the regularization strength, which is a network hyperparameter.
The loss function makes it possible to evaluate the quality of any particular set of parameters used in the developed model. The aim is to reduce the loss so that the model achieves better accuracy.
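The complete loss can be computed as a short NumPy sketch. Taking R(W) as the sum of squared weights (L2 regularization) is an assumption of one common choice; the scores and labels below are toy values:

```python
import numpy as np

def total_loss(scores, y, W, lam):
    """Mean cross-entropy over N examples plus L2 regularization.

    scores : (N, K) raw class scores, one row per image
    y      : (N,) ground-truth class indices
    W      : weight matrix; R(W) is taken as the sum of squared weights
    lam    : the regularization strength (lambda) hyperparameter
    """
    shifted = scores - scores.max(axis=1, keepdims=True)  # stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)          # softmax rows
    N = scores.shape[0]
    Li = -np.log(probs[np.arange(N), y])                  # per-example loss
    return Li.mean() + lam * np.sum(W ** 2)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])   # two images, three classes
y = np.array([0, 1])                   # their correct classes
W = 0.01 * np.ones((3, 4))
loss = total_loss(scores, y, W, lam=1e-3)
print(loss)
```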

Step 4: Optimization
Optimization is the process of finding the set of model parameters that minimizes the total loss. The core principle of the optimization techniques is to calculate the gradient of the loss with respect to the parameters of the model. The gradient of a function gives the direction of steepest ascent. One means of computing the gradient efficiently is to compute it analytically by recursively applying the chain rule. This method is called "Backpropagation" and provides an efficient way to optimize arbitrary loss functions, which may arise from diverse classes of network architectures (e.g., fully connected neural networks, convolutional networks, etc.).
The backpropagation algorithm computes the gradients of the error with respect to all the weights in the network; gradient descent then uses these gradients to update all filter values/weights and parameter values so as to minimize the output error.

Step 5: Parameter Updates
After calculating the analytic gradient using backpropagation, the resulting gradients are used to perform the parameter update. Various approaches are available for performing the parameter update – SGD (Stochastic Gradient Descent), SGD with Momentum, Nesterov Momentum, Adagrad, RMSprop, Adam, etc.
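The update rules mentioned above can be sketched as plain NumPy functions. The hyperparameter values are illustrative defaults, not the project's settings:

```python
import numpy as np

def sgd(w, dw, lr):
    # Vanilla stochastic gradient descent step.
    return w - lr * dw

def sgd_momentum(w, dw, v, lr, mu=0.9):
    # Accumulate a velocity term that smooths successive updates.
    v = mu * v - lr * dw
    return w + v, v

def adam(w, dw, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    # Adam keeps running estimates of the first and second moments.
    m = b1 * m + (1 - b1) * dw
    v = b2 * v + (1 - b2) * dw ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy check: minimise f(w) = w^2 (gradient 2w) with plain SGD.
w = 5.0
for _ in range(100):
    w = sgd(w, 2 * w, lr=0.1)
print(w)   # very close to 0, the minimum
```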

The weights are adjusted in proportion to their contribution to the total error.

When the same image is input again, the output probabilities move closer to the target vector. This signifies that the network has learnt to classify that particular image correctly by fine-tuning its weights/filters to decrease the output error.

The filter values produced during the convolution operation and the selected weights are the two sets of parameters that get updated during the training process, while the number of filters, the filter size and the architecture remain unchanged.

One-hot encoding sets the value of the target class to 1 while all other class label values are set to 0. Since each image belongs to exactly one class, one-hot encoding is appropriate here.
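A minimal one-hot encoder for the ten class labels can be written as follows; the class-to-index mapping shown in the comment is hypothetical:

```python
import numpy as np

def one_hot(labels, num_classes):
    # Each row gets a 1 at its class index and 0 everywhere else.
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

# e.g. if Dosa were index 1, Idli index 2 and Tea index 9 of 10 classes:
print(one_hot([1, 2, 9], 10))
```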

Step 6: Repeat steps 2 to 4 for all images in the training set.

Step 7: Test the model with an unseen (new) image by giving input to the CNN and evaluate the model in terms of classification accuracy and Mean Square Error (MSE).

5.2. SYSTEM ARCHITECTURE
The proposed CNN architecture is designed with an input layer and 5 convolutional layers with ReLU operation and Max-pooling, followed by a fully connected layer (a combination of a flatten layer and a dense layer) with ReLU and a dropout layer. The final output layer is fully connected with a softmax activation function. The ReLU and Max-pooling operations are applied after each convolutional layer, with the Max-pooling window size kept at 5×5. The filter size is fixed at 5×5 and remains unchanged for all the convolutional layers. The learning rate is set to 1e-3 (i.e., 0.001).

Figure 5.2. : Proposed CNN Architecture with hyperparameters
The image is fed through the input layer. The first convolutional layer outputs 32 feature maps/activation maps, on which the ReLU operation is performed, followed by Max-pooling on the rectified feature maps. This output is fed as input to the second convolutional layer, which produces 64 feature maps. The max-pooled output is fed to the third convolutional layer, which produces 128 feature maps. Similarly, the fourth convolutional layer produces 64 feature maps, and the fifth convolutional layer outputs 32 feature maps. The last layer of the CNN is the fully connected layer, where all the features learned by the convolution-pooling layers are combined so that the model can recognize and classify the input image.
A dropout of 0.8 is applied in the fully connected layer. Dropout regularizes the learning of the CNN by setting some of the neurons to 0, which prevents the model from overfitting. The fully connected layer has 1024 neurons. The output layer has 10 neurons, each connected to all 1024 neurons in the previous layer. Given an input image, only one output neuron will be activated (one-hot encoding) based on the probability value produced by the softmax activation function.
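The described architecture can be expressed with the Keras Sequential API. This is a sketch, not the exact project code: the pooling padding and the dropout convention are assumptions (a dropout argument of 0.8 read as a keep-probability, as in TFLearn, corresponds to a Keras drop rate of 0.2):

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(280, 280, 1), num_classes=10):
    """Five conv blocks (32-64-128-64-32 feature maps, 5x5 filters),
    each followed by ReLU and 5x5 max-pooling, then a 1024-neuron
    dense layer with dropout and a 10-way softmax output."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128, 64, 32):
        model.add(layers.Conv2D(filters, (5, 5), padding='same',
                                activation='relu'))
        model.add(layers.MaxPooling2D(pool_size=(5, 5), padding='same'))
    model.add(layers.Flatten())
    model.add(layers.Dense(1024, activation='relu'))
    model.add(layers.Dropout(0.2))  # assumes the text's 0.8 is a keep-probability
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model

model = build_model()
print(model.output_shape)   # (None, 10)
```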

Chapter 6
IMPLEMENTATION
The project is implemented using Spyder IDE, which is a part of Anaconda 3 suite. The Anaconda suite provides all the necessary libraries and functions for developing the CNN model. The code is written in Python Programming language.

Some necessary libraries imported to run the code are:
CV : the OpenCV (Open Source Computer Vision Library) module for image processing and computer vision algorithms. OpenCV has C/C++ and Python interfaces and runs on Windows, Linux, Mac and Android.

CV is used to resize the images in the dataset so that all the images are the same size. In this project, the image size is fixed at 280 × 280 pixels.

NumPy : a common data structure, matrix manipulation, and linear algebra library for Python. In machine learning, the generated matrices and vectors are always handled by the NumPy library. NumPy offers very useful functionality for the numerous matrix manipulations and data structures, and it is optimized for speed. NumPy is the de facto standard for data input, storage, and output in the Python machine learning and data science community.
OS : this Python module provides operating-system-dependent functionality, i.e., functions that allow the user to interface with the underlying operating system that Python is running on – Windows, Mac or Linux.

SciPy : a Python-based ecosystem of open-source software for mathematics, science, and engineering. Some of the core packages in the SciPy ecosystem are Python itself, NumPy, Matplotlib, the SciPy library, IPython and Pandas. Built on these packages, the SciPy environment offers general and specialized tools for high-performance computing, data management and computation, and a productive working environment.
Matplotlib : the most commonly used 2D plotting library in Python. It generates publication-quality plots.

TensorFlow : an open-source deep learning library from Google. It performs numerical computations using data flow graphs: in the graph, the nodes denote mathematical operations and the edges denote the multidimensional data arrays (tensors) communicated between them. TensorFlow is comparatively new in contrast to the other available frameworks, but is gaining popularity due to its advantages. The TensorFlow backend, and significant technical capabilities available in TensorFlow, can be used along with Keras to develop neural networks in a more structured and modular manner.

Tqdm : wraps an iterator to instantly show a progress meter for loops.

Shuffle : the shuffle() function must be imported (from Python's random module) since it is not directly accessible. shuffle() randomizes the order of the items in a list. This is useful when training deep learning models, so that the model does not learn the ordering of the training data.
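A minimal illustration of shuffling (image, label) pairs with Python's random module; the file names are hypothetical:

```python
import random

# Hypothetical (image, label) pairs; shuffling keeps labels attached.
dataset = [('img_%d.jpg' % i, i % 10) for i in range(6)]
random.seed(42)          # seeded only so the sketch is reproducible
random.shuffle(dataset)  # shuffles the list in place (returns None)
print(dataset)           # same pairs, new order
```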

6.1. Dataset
In machine learning, the data to be evaluated will be represented and stored as matrices and vectors. The data represented as a matrix (2D array) is denoted by a bold upper case character, usually by X, and the label data is represented as a vector (1D array), denoted with a lower case bold character, y.

In a supervised machine learning problem, the standard intent is to predict the label for a given example; since the labels here are discrete class names, the task is a classification problem. The entire dataset is split into two major subsets: a training set denoted by Xtrain and a test set denoted by Xtest. The training set consists of labeled images and is used to train the model, producing a learned model; the test set consists of unlabeled images and is used to verify how well the model generalizes to unseen data.
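A simple train/test split can be sketched in NumPy; the 80/20 ratio and array sizes are illustrative, not the project's actual split:

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 4)         # 100 examples, 4 features each
y = np.random.randint(0, 10, 100)  # class labels 0..9

# Shuffle the indices, then take 80% for training and 20% for testing.
idx = np.random.permutation(len(X))
split = int(0.8 * len(X))
X_train, X_test = X[idx[:split]], X[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]

print(X_train.shape, X_test.shape)   # (80, 4) (20, 4)
```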

Table III Classes and the number of images chosen for Indian Snack dataset
Class name Number of images in each class
Aloo Paratha 5,000
Dosa 5,000
Idli 5,000
Jalebi 5,000
Kachori 4,952
Omelet 5,000
Paneer tikka 4,999
Poha 4,974
Samosa 4,999
Tea 5,000
Aloo Paratha
Dosa
Idli
Jalebi
Kachori
Omelet
Paneer Tikka
Poha
Samosa
Tea
Figure 6.1. : Classes chosen for Indian Food Image Dataset
The input layer accepts a batch of grayscale images. Each image is fed to the network from the input layer. The training dataset of 50,000 images is divided into 125 mini-batches of 400 images each (i.e., the batch size is 400). 49,800 images are used for training while the remaining 200 images are used for validation.
The training images are labeled while the testing images are only numbered. In the first convolutional layer, lines, points, etc. are learnt by the model through the convolution operation. The second layer learns edges, corners, etc. As we move deeper into the network, more distinct features such as textures and colours are learnt, enabling the model to distinctively identify and classify the image. The training and test accuracy of the model are recorded for the chosen number of epochs (training cycles). In this project the number of epochs is set to one, i.e., the weights are adjusted over only a single pass after being randomly initialized.
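The mini-batch arithmetic described above (50,000 training images at a batch size of 400 giving 125 mini-batches) can be checked with a small sketch:

```python
def make_batches(n_images, batch_size):
    # Split the indices 0 .. n_images-1 into consecutive mini-batches.
    return [list(range(start, min(start + batch_size, n_images)))
            for start in range(0, n_images, batch_size)]

batches = make_batches(50_000, 400)
print(len(batches))      # 125 mini-batches
print(len(batches[0]))   # 400 images per batch
```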
The ‘Adam’ optimizer [14] is used instead of classic SGD (Stochastic Gradient Descent) for iteratively updating the neuron weights during training. The Adam optimizer requires less tuning and less memory, and is also computationally efficient.

‘Categorical cross entropy loss’ is used along with the optimizer to measure the error at the softmax layer.
Chapter 7
RESULTS AND DISCUSSION
Once the model has learned the features of all the considered classes via training, it is tested for correct classification (labeling of images) by giving unlabelled images as input.

Figure 7.1: Classification results on test images

Figure 7.2. : Results after running the CNN model
The model is able to correctly classify (label) the given unlabeled test images with a test accuracy of 93.50% and a loss of 0.2341 on 10,000 images. The training accuracy of the model with 50,000 labeled images is 96.9% with a loss of 0.1439.
To study the importance of a large dataset and the impact of epochs in improving accuracy, the model is run on a GPU instance (GeForce 1050 Ti, 4 GB card) taking a small sample of 10,000 images from the 60,000-image dataset, out of which 9,800 images are used for training the CNN and the remaining 200 images are used for testing the learned model. The number of epochs is set to 10 with a batch size of 32. Training took 14 minutes on the GPU. The accuracy and loss per epoch are presented in tabular form, and a graphical representation is also included for study. The same dataset was run on a CPU (Intel Core i7 with 16 GB RAM) and the recorded running time was 9 hours.

The study clearly shows that a larger dataset contributes to better learning of the model, which in turn provides better classification accuracy. Due to the limited food images available, affine transformations had to be performed on the dataset to increase the number of images.

Table IV Accuracy at each epoch for 10,000 test images
Epoch Accuracy
1 0.6371428571428571
2 0.9492857142857143
3 0.9751020408163266
4 0.9775510204081632
5 0.9846938775510204
6 0.9964285714285714
7 0.9823469387755102
8 0.9878571428571429
9 0.9934693877551021
10 0.9972448979591837
Table V Loss at each epoch for 10,000 test images
Epoch Loss
1 1.0057004473649194
2 0.16050662153153394
3 0.08757065726063994
4 0.07583146373308929
5 0.061145501460212436
6 0.013505996499759152
7 0.065771930565896
8 0.05465052229055318
9 0.02561323014391041
10 0.010072417010189235

Figure 7.3: Model Accuracy on test and train images with 10 epochs

Figure 7.4. : Model loss on train and test data with 10 epochs
7.1. HURDLES IN THE DEVELOPMENT OF PROJECT
Some hurdles that were encountered during the development of the model are:
It is very difficult to develop a CNN model on a CPU because training is computationally expensive and consumes plenty of time for both training and testing the model. Therefore a GPU with more memory (RAM) is needed to perform the computations efficiently.

The Indian food image dataset has a limited number of images (60,000). More training images contribute to better learning and better classification, which enhances the classification accuracy. Due to the restricted number of images available in the dataset, affine transformations such as rotation, translation and scaling are performed on the dataset to increase the training data.
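The augmentation idea can be illustrated with elementary NumPy operations. These are simplified stand-ins: np.rot90 rotates only in 90° steps and np.roll translates with wrap-around; the project's exact affine transforms are not specified:

```python
import numpy as np

def augment(img):
    # Each transform yields one extra training image per original.
    rotated = np.rot90(img)                  # rotation (90-degree step)
    shifted = np.roll(img, shift=1, axis=1)  # translation (with wrap-around)
    flipped = np.fliplr(img)                 # horizontal reflection
    return [rotated, shifted, flipped]

img = np.arange(16, dtype=np.uint8).reshape(4, 4)  # tiny stand-in image
extra = augment(img)
print(len(extra))   # 3 additional images per original
```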

In [4], the importance of colour for the feature extraction process is studied. The use of grayscale images affects the learning accuracy; due to memory and computational constraints, grayscale images are nevertheless chosen for this experiment. Grayscale images have a single channel whereas colour images have 3 channels, one each for the Red, Green and Blue components.

Further enhancements can be made by increasing the number of training images, increasing the number of food classes, reducing image size, using colour images, increasing the number of training epochs, etc.
CONCLUSION AND FUTURE WORK
The application of CNN to Indian food image classification is briefly discussed, and the working of CNN is explained in an elaborate manner. So far, no work has been done to classify Indian food images. The proposed CNN model has provided a remarkable classification accuracy of 93.50% with only one epoch. The model is tested for its running time on both CPU and GPU, and it is found that GPUs are more suitable for developing deep learning models. The trained model is saved on the Amazon cloud and run on a CPU.

The importance of larger dataset for CNN training is also demonstrated taking only 10,000 images from the 60,000 image dataset. The accuracy is recorded. The importance of considering more epochs is also tested on the sample 10,000 image dataset by taking 10 epochs.
In future, the image classification can be extended to smartphones where a person from a foreign state/ nation will be able to identify the particular food. Also, food image recognition and classification can be used for training the robots to identify and classify a variety of foods so that they can be deployed as butlers for serving in hotels.

REFERENCES
[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.

[2] Abhishek Goswami and Haichen Liu, "Deep Dish: Deep Learning for Classifying Food Dishes". (http://cs231n.stanford.edu/reports/2017/pdfs/6.pdf)

[3] Y. Kawano and K. Yanai, "Food image recognition with deep convolutional features," in Proc. of ACM UbiComp Workshop, 2014. (http://ubicomp.org/ubicomp2014/proceedings/ubicomp_adjunct/workshops/CEA/p589-kawano.pdf)

[4] Makoto Ogawa, Kiyoharu Aizawa and Hokuto Kagaya, "Food Detection and Recognition Using Convolutional Neural Network". (https://www.researchgate.net/publication/266357771)

[5] M. Bosch, F. Zhu, N. Khanna, C. J. Boushey and E. J. Delp, "Combining global and local features for food identification in dietary assessment," in IEEE ICIP, pages 1789–1792, 2011.

[6] Y. Kawano and K. Yanai, "Food Image Recognition Using Deep Convolutional Network with Pre-Training and Fine-Tuning". (http://ieeexplore.ieee.org/document/7169816/)

[7] M. Oquab, L. Bottou, I. Laptev and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. of IEEE Computer Vision and Pattern Recognition, 2014.

[8] Yuzhen Lu, "Food Image Recognition by Using Convolutional Neural Networks (CNNs)". (https://arxiv.org/abs/1612.00983)

[9] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1469–1472.

[10] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar and J. Yang, "PFID: Pittsburgh fast-food image dataset," in IEEE ICIP, 2009. (http://ieeexplore.ieee.org/document/5413511/)

[11] H. Hoashi, T. Joutou and K. Yanai, "Image recognition of 85 food categories by feature fusion," in IEEE ISM, pages 296–301, 2010. (https://www.researchgate.net/publication/321070597_Food_photo_recognition_for_dietary_tracking_system_and_experiment)

[12] https://en.wikipedia.org/wiki/Convolutional_neural_network

[13] https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner’s-Guide-To-Understanding-Convolutional-Neural-Networks/

[14] http://continuum.io/anaconda

[15] Abhishek Goswami and Haichen Lu, "Deep Dish: Deep Learning for Classifying Food Dishes", cs231n.stanford.edu/reports/2017/pdfs/6.pdf

[16] http://github.com/NavinManaswi/IndianSnacks/tree/master/IndianSnacksdatasetand%20 code
APPENDIX – A
A paper entitled “Deep Indian Delicacy: Classification of Indian Food Images using Convolutional Neural Networks” is published in the journal International Journal for Research in Applied Science & Engineering Technology (IJRASET), ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 6.887, Volume 6, Issue III, March 2018, Pages 2653-2660.

A paper entitled “Bird’s Eye Review on Food Image Classification using Supervised Machine Learning” is published in the journal International Journal of Latest Technology in Engineering, Management & Applied Science (IJLTEMAS), Volume VII, Issue III, March 2018, ISSN 2278-2540, Pages 153-159.

A paper entitled “Classification of Indian Snacks using Supervised Machine Learning” is presented at Panchajanya, 8th State Level Technical Paper Presentation, Ekalavya Institute of Technology (EIT), Ummathur, Chamarajanagar Tq., and Dist., on March 28, 2018.