ISCTE-IUL Repository

Abstract: Different diseases can affect an individual's gait in different ways; therefore, gait analysis can provide important insights into an individual's health and well-being. Currently, most systems that perform gait analysis using 2D video are limited to a simple binary classification of gait as either normal or impaired. While some systems do classify gait across different pathologies, the reported results leave considerable room for improvement. This paper presents a novel system that classifies gait across different pathologies with considerably improved results. The system computes the walking individual's silhouettes from a 2D video sequence and combines them into a representation known as the gait energy image (GEI), which provides robustness against silhouette segmentation errors. In this work, instead of using a set of hand-crafted gait features, feature extraction is performed by the VGG-19 convolutional neural network. Using transfer learning, the network is fine-tuned to automatically extract the features that best represent gait pathologies. Transfer learning improves the classification accuracy while avoiding the need for a very large training set, since the network is pre-trained for generic image description; this also contributes to better generalization when the system is tested across different datasets. The proposed system performs the final classification using linear discriminant analysis (LDA). The results show that the proposed system outperforms the state of the art, achieving a classification accuracy of 95% on a dataset containing gait sequences affected by diplegia, hemiplegia, neuropathy and Parkinson's disease, together with normal gait sequences.


I. INTRODUCTION
Gait is a highly cognitive task that involves a coordinated, cyclic combination of movements resulting in human locomotion [1]. Analysis of gait can provide useful information about an individual's health, identity, gender or walking patterns, and has a wide range of applications in fields such as sports, biometrics and medicine. In medicine, gait analysis over time can help detect, or follow the development of, several types of pathologies, for instance those resulting from neurological or systemic disorders, diseases affecting an individual's gait, injuries or ageing [2]. Traditionally, gait was analysed by specialists, who observed a patient's walk to identify different gait pathologies. However, with recent developments in technology, various devices can now be used to acquire features that allow such analysis to be performed automatically or semi-automatically. Typically, biomechanical features such as speed, cadence, step length, stance time or swing time are used to identify different gait pathologies [3]. However, as presented in [4], features acquired from the gait energy image (GEI), a gait representation typically used for biometric recognition, can also be used. Although such systems are currently less effective than systems that rely on biomechanical features [1], their robustness to silhouette segmentation errors makes them advantageous in a daily life setting, where gait sequences cannot be captured under (near-)ideal conditions.

This paper presents a novel system that classifies gait across different pathologies, based on features learned by a deep convolutional neural network (CNN), VGG-19, after appropriately fine-tuning it to the problem addressed. The system operates on videos acquired by a single 2D camera and does not require placing any kind of marker on the individual's body, making it deployable in daily life settings, including clinics or even home environments.

A. State-of-the-art
The existing acquisition systems for gait analysis vary significantly and can be broadly classified into wearable and non-wearable systems [2]. Wearable systems include sensors such as accelerometers [5] and gyroscopes [6], which acquire motion signals that represent human gait. These sensors do not limit gait acquisition to a laboratory environment, allowing it to be performed in a daily life setting. However, fitting an individual with such sensors requires clinical professionals, as the sensors must be precisely positioned on the body.
Non-wearable systems can be further classified into floor based and vision based systems. Floor based systems use sensors such as force sensitive resistors [7] and pressure mats [8], which are set up on the floor to measure the forces exerted by individuals as they walk. However, such systems can only operate in a controlled environment, such as a laboratory, where the walking path is clearly defined. Vision based systems, on the other hand, acquire images using one or multiple optical sensors; these images are then processed to obtain gait features. The most widely used systems in medical environments, considered the gold standard, are marker based systems. One example is the optoelectronic motion capture system [9], which relies on multiple markers applied to various parts of the body, together with a setup of multiple calibrated optical sensors, to capture an individual's gait. However, due to the use of markers, such systems cannot operate outside laboratory environments. Also, the complex setup and the need for calibration limit their use to trained professionals. As an alternative, marker-less vision based systems use multiple cameras or depth sensing cameras to estimate gait features, such as joint angles [10] and joint positions [11], to analyse an individual's gait. However, such systems are not as accurate as marker based systems, and depth sensing cameras typically operate within a limited range, between 80 cm and 4 m.
Recently, a significant amount of work has also been done on capturing and analysing gait from a single 2D camera. Since the major articulations during a gait cycle occur in the sagittal plane [12], some vision based systems rely on a single side view video sequence of an individual to perform gait analysis. Such systems typically acquire biomechanical features, such as step length, leg angles, gait cycle time [13], cadence, speed and stride length [14], or the fraction of the stance and swing phases during a gait cycle [15], from the available side view body silhouettes. These features are then used to classify gait as either normal or impaired. Specific posture instabilities can also be identified using features such as lean and ramp angles [16], or axial ratio and change in velocity [17], obtained from the body silhouettes. Apart from biomechanical features, some vision based systems classify gait across different pathologies [4] using feature representations that perform well in biometric applications [18]. A drawback of 2D video based systems is that they do not have access to depth information, which limits their accuracy when compared to other sensor based systems. However, the features obtained from such systems can still be sufficient to identify pathologies and, since these systems are easier to install, they are suitable for operation in daily life settings.

B. Motivation and contribution
Most vision based systems that rely on 2D video only perform a binary classification of whether the observed gait is normal or impaired. While some systems, such as [1], can identify gait pathologies, their results depend heavily on the quality of the silhouettes used; in such cases, poor silhouette segmentation can significantly reduce the classification accuracy. Since gait representations used for biometric recognition, such as the GEI, are robust to such limitations [4], they provide opportunities for a more robust assessment of an individual's health based on gait analysis. This allows operation in less constrained setups where silhouettes are expected to present segmentation errors, for instance due to video acquisition against a dynamic background.
Recently, the use of deep CNNs, such as VGG-16 [19] or pose based temporal-spatial networks [20], has significantly improved the performance of silhouette based gait recognition systems. Similar improvements have also been seen in the medical domain, especially in detecting Alzheimer's disease [21]. Thus, it can be expected that such deep learning techniques will also improve the performance of gait pathology classification systems.
This paper presents a novel vision based system that relies on 2D video to classify gait across different gait pathologies. It uses the deep CNN VGG-19 [22] to extract gait features. The pre-trained VGG-19 model is fine-tuned to extract features that better describe gait pathologies, while also generalizing well to datasets other than the one used in the fine-tuning stage. Using a GEI-based gait representation as input to the VGG network makes the proposed system robust to silhouette segmentation errors. The VGG-extracted features are used as input to a classifier based on linear discriminant analysis (LDA). The use of a simple classifier highlights the effectiveness of the extracted features.

II. PROPOSED SYSTEM
As illustrated in Fig. 1, the proposed system operates in three steps: pre-processing, feature extraction and classification. During pre-processing, the system extracts binary silhouettes from a given video sequence and then transforms them into a GEI. Next, the resulting GEI is used as input to the feature extraction step, performed by the VGG-19 network. Although VGG-19 can be used as a classifier, in this paper the output of its first fully connected layer is used as a feature vector [19]. The quality of the feature vectors produced by VGG-19 can be further improved by transfer learning, where the final layers of the CNN are fine-tuned to better represent the different gait pathologies. The final classification step is performed using LDA, which assigns each feature vector to one of the pathology groups.

A. Preprocessing
The first step of the proposed system transforms a 2D video sequence into a GEI, as illustrated in Fig. 2.
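A GEI is commonly defined as the pixel-wise average of aligned binary silhouettes over one gait cycle. The minimal NumPy sketch below assumes that silhouettes have already been segmented, cropped and centred (those steps are not shown) and that gait-cycle boundaries are known; the function names are hypothetical, and nearest-neighbour resizing to the 224×224 VGG-19 input size is an illustrative choice, not necessarily the interpolation used by the authors.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average a stack of aligned binary silhouettes (T x H x W, values 0/1) into a GEI.

    Each GEI pixel is the fraction of frames in which that pixel belongs to the body,
    so a segmentation error in a single frame changes it by only 1/T.
    """
    stack = np.asarray(silhouettes, dtype=np.float64)
    if stack.ndim != 3:
        raise ValueError("expected a T x H x W array of silhouettes")
    return stack.mean(axis=0)

def resize_nearest(image, size=224):
    """Nearest-neighbour resize of a 2D image to size x size (illustrative choice)."""
    h, w = image.shape
    rows = (np.arange(size) * h / size).astype(int)  # source row for each output row
    cols = (np.arange(size) * w / size).astype(int)  # source column for each output column
    return image[rows[:, None], cols[None, :]]

def gei_per_cycle(silhouettes, cycle_bounds, size=224):
    """One resized GEI per gait cycle; cycle_bounds holds (start, end) frame indices."""
    return [resize_nearest(gait_energy_image(silhouettes[s:e]), size) for s, e in cycle_bounds]
```

Because every GEI pixel is an average over the cycle, a sequence with several gait cycles simply yields several GEIs, one per `(start, end)` pair.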

B. Feature extraction using VGG-19
VGG-19 is a 19-layer CNN [22]. It can be considered as a stack of convolutional layers, as illustrated in Fig. 3, with a filter size of 3×3 and a stride and padding of 1, interleaved with 2×2 max pooling layers with a stride of 2. The convolutional layers detect local features in the input GEI. The max pooling layers then reduce the spatial size of the resulting feature maps, thus reducing the computational complexity of the network. A series of such layers is followed by two 4096-dimensional fully connected layers that learn non-linear relationships among the local features. The final layer of the network is a fully connected layer with a softmax activation, which performs classification.
The VGG-19 model trained on ImageNet [22] classifies images into 1000 different image categories and was trained using over 1.3 million images. However, a dataset of such scale containing gait sequences affected by different pathologies is currently unavailable, and training VGG-19 on a small dataset is expected to result in problems such as overfitting to the small training set. This limitation can be addressed using transfer learning, a machine learning technique where a model trained to address one problem is re-purposed to address a second, related problem. The proposed system therefore uses the VGG-19 model pre-trained on ImageNet [22] and retrains part of the network with GEIs from an available gait dataset, to fine-tune the model parameters. In a deep CNN, the initial convolutional layers typically detect simple features, and features become more complex in subsequent layers, with the final layers capturing the most problem specific features. Thus, using transfer learning, the final layers of the network can be retrained so that they are fine-tuned for the classification of gait pathologies. The details of the fine-tuning process are discussed in Section III.A. Although the complete VGG-19 can be used as a classifier, the output of the first fully connected layer is more effective as a feature vector [19] when the training data is small. Thus, the proposed system uses the 4096-dimensional output of the first fully connected layer as the feature vector, which is then used to perform classification.
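The layer arithmetic described above can be checked with a short calculation. With 3×3 filters, stride 1 and padding 1, a convolution preserves the spatial size, since the output size is ⌊(n + 2p − k)/s⌋ + 1 = n, while each 2×2 max pooling with stride 2 halves it. The sketch below uses the standard VGG-19 configuration from the original VGG paper (16 convolutional layers in five blocks, followed by three fully connected layers, the last of which feeds the softmax); the helper names are hypothetical.

```python
# Standard VGG-19 convolutional configuration: numbers are output channels of 3x3 convs,
# "M" marks a 2x2 max pooling layer with stride 2.
VGG19_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
                512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a convolution or pooling window along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1

def trace_shapes(input_size=224, config=VGG19_CONFIG):
    """Return (height, width, channels) after every layer, plus the conv layer count."""
    size, channels, n_conv, shapes = input_size, 3, 0, []
    for layer in config:
        if layer == "M":
            size = conv_output_size(size, kernel=2, stride=2, pad=0)  # halves the size
        else:
            size = conv_output_size(size, kernel=3, stride=1, pad=1)  # preserves the size
            channels, n_conv = layer, n_conv + 1
        shapes.append((size, size, channels))
    return shapes, n_conv

shapes, n_conv = trace_shapes()
print(shapes[-1], n_conv)  # (7, 7, 512) 16
print(7 * 7 * 512)         # 25088 inputs to the first fully connected layer
```

Sixteen convolutional layers plus three fully connected layers give the 19 weight layers in the network's name; a 224×224 GEI reaches the first fully connected layer as a 25088-dimensional vector.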
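To make the distinction between the classifier output and the feature vector concrete, the toy sketch below builds a hypothetical, randomly initialized fully connected head with deliberately small dimensions; these are not the real VGG-19 weights or sizes, where the first fully connected layer maps 25088 inputs to 4096 outputs. Classification would use the softmax probabilities at the end of the head, whereas the proposed system keeps the ReLU activations of the first fully connected layer. In practice these activations would be read from the pre-trained, fine-tuned network through whichever deep learning framework is used.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy sizes standing in for VGG-19's 25088 -> 4096 -> 4096 -> n_classes head.
D_IN, D_FC1, D_FC2, N_CLASSES = 64, 16, 16, 4
W1, b1 = 0.1 * rng.standard_normal((D_IN, D_FC1)), np.zeros(D_FC1)
W2, b2 = 0.1 * rng.standard_normal((D_FC1, D_FC2)), np.zeros(D_FC2)
W3, b3 = 0.1 * rng.standard_normal((D_FC2, N_CLASSES)), np.zeros(N_CLASSES)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract the max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def fc1_features(conv_out):
    """Feature vector kept by the proposed system: ReLU output of the first FC layer."""
    return relu(conv_out @ W1 + b1)

def classify(conv_out):
    """Full head as used for classification (dropout is inactive at inference, so omitted)."""
    h2 = relu(fc1_features(conv_out) @ W2 + b2)
    return softmax(h2 @ W3 + b3)

x = rng.standard_normal((2, D_IN))  # stands in for the flattened conv output of 2 GEIs
print(fc1_features(x).shape)  # (2, 16)
print(classify(x).shape)      # (2, 4)
```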

C. Classification
To perform classification, the proposed system uses LDA, after reducing the dimensionality of the feature vectors using principal component analysis (PCA). In this paper, the principal components that together account for 95% of the total variance are retained. The application of PCA thus reduces the dimensionality and the computational complexity of the proposed system, while improving the classification results by discarding noisy components.
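Retaining the components that explain 95% of the total variance can be sketched with a singular value decomposition of the centred feature matrix. This is a minimal NumPy sketch under that reading of the text, with hypothetical function names; library implementations offer the same behaviour (for example, scikit-learn's `PCA` accepts a fraction such as `n_components=0.95`).

```python
import numpy as np

def fit_pca(X, variance_to_keep=0.95):
    """Fit PCA on rows of X (n_samples x n_features), keeping enough components for the target variance."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions; s**2 is proportional to their variance.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Smallest number of leading components whose cumulative explained variance reaches the target.
    n_components = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    n_components = min(n_components, len(s))
    return mean, Vt[:n_components].T

def transform_pca(X, mean, components):
    """Project feature vectors onto the retained principal components."""
    return (X - mean) @ components

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) * np.array([10.0, 1.0, 0.1])  # one dominant direction
mean, comps = fit_pca(X)
print(comps.shape)  # (3, 1)
```

With 4096-dimensional VGG features and only tens of training GEIs, the economy-size SVD yields at most as many components as there are training samples, so this step also keeps the subsequent LDA scatter matrices small.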
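The proposed system then applies LDA to the resulting feature vectors for data decorrelation and classification. Using Fisher's criterion, LDA identifies a projection onto a subspace that maximizes the ratio of between-class to within-class scatter. Given $C$ pathology groups, where $\mathcal{X}_k$ is the set of $N_k$ training feature vectors of group $k$ with mean $\boldsymbol{\mu}_k$, and $\boldsymbol{\mu}$ is the mean of all training feature vectors, the within-class scatter matrix $\Sigma_W$ and the between-class scatter matrix $\Sigma_B$ are

$$\Sigma_W = \sum_{k=1}^{C} \sum_{\mathbf{x} \in \mathcal{X}_k} (\mathbf{x}-\boldsymbol{\mu}_k)(\mathbf{x}-\boldsymbol{\mu}_k)^{\top}, \qquad \Sigma_B = \sum_{k=1}^{C} N_k (\boldsymbol{\mu}_k-\boldsymbol{\mu})(\boldsymbol{\mu}_k-\boldsymbol{\mu})^{\top}.$$

They are used to obtain the transformation matrix $W$ that maximizes the ratio of the between-class scatter to the within-class scatter:

$$W = \arg\max_{W} \frac{\left| W^{\top} \Sigma_B W \right|}{\left| W^{\top} \Sigma_W W \right|}.$$

Thus, given the feature vector $\mathbf{x}$ of a test GEI, it can be classified into one of the existing gait pathologies using a simple Euclidean distance metric:

$$\hat{k} = \arg\min_{k \in \{1,\dots,C\}} \left\| W^{\top}\mathbf{x} - \bar{\mathbf{y}}_k \right\|_2,$$

where $\bar{\mathbf{y}}_k = W^{\top}\boldsymbol{\mu}_k$ is the centroid of the $k$-th group in the projected space.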
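The Fisher criterion and nearest-centroid rule described above can be sketched in NumPy. This is an illustrative implementation, not the authors' code: the generalized eigenproblem Σ_B w = λ Σ_W w is solved by whitening with Σ_W^(-1/2), and a small ridge term `eps` (an assumption, not taken from the text) keeps Σ_W invertible when the feature dimension is large relative to the number of training samples.

```python
import numpy as np

def fit_lda(X, y, eps=1e-6):
    """Fisher LDA on rows of X with labels y.

    Returns the projection matrix W (n_features x (C-1)), the class labels and the
    class centroids in the projected space (W^T mu_k for each class k).
    """
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        diff = Xk - mu_k
        Sw += diff.T @ diff                # within-class scatter
        m = (mu_k - mu)[:, None]
        Sb += len(Xk) * (m @ m.T)          # between-class scatter
    # Whiten with Sw^(-1/2) so that Sb w = lambda Sw w becomes a symmetric eigenproblem.
    # eps is an assumed ridge term that keeps Sw invertible for small training sets.
    evals, evecs = np.linalg.eigh(Sw + eps * np.eye(d))
    Sw_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    m_evals, m_evecs = np.linalg.eigh(Sw_inv_sqrt @ Sb @ Sw_inv_sqrt)
    top = np.argsort(m_evals)[::-1][: len(classes) - 1]  # at most C-1 useful directions
    W = Sw_inv_sqrt @ m_evecs[:, top]
    centroids = np.stack([X[y == k].mean(axis=0) @ W for k in classes])
    return W, classes, centroids

def predict_lda(X, W, classes, centroids):
    """Assign each row of X to the class with the nearest projected centroid (Euclidean)."""
    Z = X @ W
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Synthetic, well-separated 3-class data standing in for PCA-reduced VGG features.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X = np.vstack([m + 0.3 * rng.standard_normal((30, 3)) for m in means])
y = np.repeat([0, 1, 2], 30)
W, classes, centroids = fit_lda(X, y)
print(W.shape, (predict_lda(X, W, classes, centroids) == y).mean())
```

Whitening turns the generalized problem into an ordinary symmetric one, so `np.linalg.eigh` returns real eigenvectors; at most C − 1 directions carry discriminative information because Σ_B has rank at most C − 1.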

III. RESULTS
In this paper, three different datasets are considered: INIT Gait Dataset [3], DAI Gait Dataset [15] and DAI Gait Dataset 2 [4]. The INIT Gait Dataset is used to fine-tune VGG-19. The DAI Gait Dataset is used to test the ability of the proposed system to classify gait as either normal or impaired. Finally, DAI Gait Dataset 2 is used to classify gait across different gait pathologies. All three datasets contain gait sequences captured from a lateral viewpoint, and the type of gait pathology is provided as ground truth. A summary of the three datasets is presented in Table I.
The INIT Gait Dataset contains binary silhouettes of 10 individuals simulating 4 different leg related gait types, one of which is normal gait. Each individual is recorded in 2 different sessions in a LABCOM [3] studio, at 30 fps, capturing multiple gait cycles in each session. The sequences are labelled as: restricted full body movement, restricted right leg movement, restricted left leg movement and normal gait.
The DAI Gait Dataset [15] contains 30 gait sequences divided into two groups. The first 15 sequences contain normal gait, while the remaining 15 contain impaired gait with randomly selected pathologies, simulated by 5 walking individuals. The individuals are captured walking over a distance of 3 m using the RGB camera of the Kinect sensor.
The third dataset, DAI Gait Dataset 2 [4], contains healthy subjects simulating gait affected by diplegia, hemiplegia, neuropathy and Parkinson's disease, together with normal gait sequences. Each individual is recorded 3 times walking a distance of 8 m, resulting in a total of 75 samples.

A. Fine-tuning VGG-19
With three relatively small datasets available, it was decided to use 60% of the largest one for fine-tuning the CNN weights and the remaining gait sequences of that dataset for validation. Tests are also conducted on the other two datasets, using the resulting CNN weights, to assess the generalization ability of the obtained model.
The INIT Gait Dataset was thus selected to fine-tune VGG-19, as it includes 20 video sequences for each of the four gait groups. Since each sequence captures at least two gait cycles, a GEI is generated for every gait cycle, increasing the total number of GEIs to 160. As mentioned above, the dataset is split into a training set, with 60% of the sequences, and a validation set, with the remaining 40%. To further increase the size of the training set and to make the network robust to minor changes such as flips, scale changes and translations, data augmentation is performed on the training set. Data augmentation allows the system to account for situations not foreseen in the original training set, such as walking in the direction opposite to that of the training sequences. The training set is thus augmented using small shifts, shear and zoom, as well as horizontal flipping, resulting in a total of 480 GEIs.
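Two of the augmentations listed above, horizontal flipping and small translations, can be sketched directly on GEI arrays; shear and zoom require interpolation and are omitted here. The function names, the number of variants per GEI and the shift range are hypothetical choices for illustration. Note that 60% of 160 GEIs is 96 training GEIs, and 480 / 96 = 5, which would be consistent with four augmented variants per original GEI, although the text does not state the exact scheme.

```python
import numpy as np

def shift_image(img, dx, dy):
    """Translate a 2D image by dx columns and dy rows, filling uncovered pixels with 0 (background)."""
    h, w = img.shape
    if abs(dx) >= w or abs(dy) >= h:
        raise ValueError("shift larger than the image")
    out = np.zeros_like(img)
    src_r, dst_r = (slice(0, h - dy), slice(dy, h)) if dy >= 0 else (slice(-dy, h), slice(0, h + dy))
    src_c, dst_c = (slice(0, w - dx), slice(dx, w)) if dx >= 0 else (slice(-dx, w), slice(0, w + dx))
    out[dst_r, dst_c] = img[src_r, src_c]
    return out

def augment_gei(gei, rng, n_variants=4, max_shift=8):
    """Return n_variants randomly flipped and shifted copies of a GEI (hypothetical scheme)."""
    variants = []
    for _ in range(n_variants):
        # A horizontal flip simulates walking in the opposite direction.
        img = np.fliplr(gei) if rng.random() < 0.5 else gei
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        variants.append(shift_image(img, int(dx), int(dy)))
    return variants

rng = np.random.default_rng(0)
train_geis = [rng.random((64, 64)) for _ in range(96)]  # small placeholder GEIs
augmented = train_geis + [v for g in train_geis for v in augment_gei(g, rng)]
print(len(augmented))  # 480
```

Filling uncovered pixels with zeros matches the GEI background, so shifted copies do not introduce spurious body pixels.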
The pre-trained VGG-19 model has been optimized to classify images across the 1000 ImageNet categories. To fine-tune the network for pathology classification, its final softmax layer is replaced with a new one that classifies only across the four INIT dataset groups. Fine-tuning is done using backpropagation with a learning rate of 0.001, as a higher learning rate may lead to convergence problems. The batch size is set to 34 to make optimal use of the available graphics card memory, and the number of training epochs is set to 150 to prevent underfitting. The remaining parameters, such as dropout regularisation and the loss function, keep their default settings.
For the fine-tuning of VGG-19, several alternatives are considered, notably by changing the set of layers whose weights are adjusted when running the backpropagation optimization. In a first experiment, only the fully connected (FC) layers are retrained. In the following experiments, the FC layers are retrained along with one or more convolutional (CONV) layers, retraining one extra CONV layer in each experiment. The classification accuracy over the training and validation sets is reported in Table II. The final column reports the results on the validation set after replacing the softmax layer with the proposed LDA classifier; in each experiment, the LDA classifier is trained using features extracted by the fine-tuned VGG-19 over the INIT training set. As shown in Table II, the best results are obtained by retraining the fully connected layers and convolutional layers 4 and 5. Retraining further layers reduces the accuracy on the validation set, indicating overfitting. This configuration was therefore selected for feature extraction. It can also be concluded from Table II that the fine-tuned VGG-19 is better suited for feature extraction than for classification of gait pathologies, as the LDA classifier outperforms the softmax classifier of the VGG-19 architecture.
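The fine-tuning hyperparameters stated above can be collected in a small configuration object. The class and field names are hypothetical, and the step count assumes that the 480 augmented GEIs form the training set and that the final, partial batch of each epoch is kept: with a batch size of 34, that gives ⌈480/34⌉ = 15 weight updates per epoch (14 full batches plus one batch of 4), or 2250 updates over 150 epochs.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported for fine-tuning VGG-19 on the INIT Gait Dataset."""
    num_classes: int = 4          # new softmax head replacing the 1000-way ImageNet layer
    learning_rate: float = 0.001  # higher rates reportedly led to convergence problems
    batch_size: int = 34          # chosen to fit the available graphics card memory
    epochs: int = 150             # chosen to avoid underfitting
    input_size: int = 224         # VGG-19 default input resolution

    def steps_per_epoch(self, n_train: int) -> int:
        # Assumes the last, smaller batch of each epoch is not dropped.
        return math.ceil(n_train / self.batch_size)

    def total_steps(self, n_train: int) -> int:
        return self.epochs * self.steps_per_epoch(n_train)

cfg = FineTuneConfig()
print(cfg.steps_per_epoch(480), cfg.total_steps(480))  # 15 2250
```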
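The sequence of fine-tuning experiments can be expressed as a schedule of trainable layers. This sketch assumes that "convolutional layers 4 and 5" refers to the last two convolutional blocks of VGG-19 (four convolutions each) and borrows Keras-style layer names (`block5_conv1`, `fc1`, ...) for readability; `new_softmax` is a hypothetical name for the replaced 4-class output layer. Layers outside the returned set would keep their ImageNet weights frozen.

```python
# Standard VGG-19 block structure: number of 3x3 conv layers in each of the 5 blocks.
CONVS_PER_BLOCK = {1: 2, 2: 2, 3: 4, 4: 4, 5: 4}
FC_LAYERS = ["fc1", "fc2", "new_softmax"]  # "new_softmax" is a hypothetical name

def all_layers():
    """All weight layers of VGG-19 with the replaced output layer (16 conv + 3 FC)."""
    convs = [f"block{b}_conv{i}" for b, n in CONVS_PER_BLOCK.items() for i in range(1, n + 1)]
    return convs + FC_LAYERS

def trainable_layers(n_conv_blocks):
    """Layers updated by backpropagation when the FC layers plus the last n_conv_blocks blocks are retrained."""
    blocks = range(6 - n_conv_blocks, 6)  # e.g. n_conv_blocks=2 -> blocks 4 and 5
    convs = [f"block{b}_conv{i}" for b in blocks for i in range(1, CONVS_PER_BLOCK[b] + 1)]
    return set(convs + FC_LAYERS)

for k in range(4):  # experiment 0: FC layers only; each later experiment adds one conv block
    print(k, len(trainable_layers(k)), "of", len(all_layers()), "layers retrained")
```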

B. Classification of gait pathologies
Once VGG-19 is fine-tuned, the proposed system uses it to obtain features for the classification step. Since VGG-19 was fine-tuned using the INIT dataset, the proposed system is tested on the other 2 datasets: DAI and DAI 2. These two datasets contain 2 and 5 different gait classes, respectively, as reported in Table I. Thus, the LDA classifier must be trained separately on each of the two datasets.
Both training and testing of the proposed system are performed using five-fold cross-validation, dividing the data into five mutually exclusive sets of individuals. The process is repeated 5 times, so that each time one of the five sets is used for testing and the other four sets are used for training. Finally, the average accuracy over the five folds is reported as the classification accuracy of the system.
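Splitting by individuals, rather than by sequences, ensures that no person appears in both the training and test sets of a fold. A minimal pure-Python sketch of such a subject-wise five-fold split follows; the function names and the round-robin assignment of individuals to folds are illustrative assumptions, since the text does not specify how individuals were grouped.

```python
def subject_folds(subject_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs such that each individual appears in exactly one test fold."""
    subjects = sorted(set(subject_ids))
    # Round-robin assignment of individuals to folds (illustrative choice).
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    for f in range(n_folds):
        test_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] == f]
        train_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] != f]
        yield train_idx, test_idx

def cross_validated_accuracy(fold_accuracies):
    """Average of the per-fold accuracies, as reported for the proposed system."""
    return sum(fold_accuracies) / len(fold_accuracies)

# Hypothetical example: 10 individuals, 3 sequences each.
ids = [f"p{p}" for p in range(10) for _ in range(3)]
for train_idx, test_idx in subject_folds(ids):
    train_subjects = {ids[i] for i in train_idx}
    test_subjects = {ids[i] for i in test_idx}
    assert not (train_subjects & test_subjects)  # no individual in both sets
```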
The proposed system is first tested using VGG-16 and VGG-19 trained on ImageNet, whose features are not specifically fine-tuned for classification of gait across different pathologies. Even without fine-tuning, the proposed architecture achieves a classification accuracy above 90% on both datasets. Table III shows that the accuracy of the system improves when VGG-16 is replaced with VGG-19 for the feature extraction step. Since the deeper VGG-19 network performs better than VGG-16, it is selected for the fine-tuning process. It should also be noted that although VGG-19 is fine-tuned on the INIT Gait Dataset, the resulting model improves the classification accuracy on the other two datasets. This is significant because the INIT Gait Dataset is captured in a LABCOM studio [3], which produces perfectly segmented silhouettes, whereas the other two datasets are captured in less constrained environments, where silhouette segmentation is far from perfect, often missing parts of the walking person's silhouette. Thus, it can be concluded that the proposed fine-tuning scheme generalizes well across datasets, even in the presence of silhouette segmentation errors, which affect the performance of most systems based on hand-crafted biomechanical features [1].
Table III also reports results for state-of-the-art methods. Among them, only the GEI method [4] operates on both test datasets, as it also uses the GEI, making it robust to silhouette segmentation errors. The GEI method [4] performs well on the binary classification problem of the DAI Gait Dataset, but its performance degrades significantly when the number of pathology groups increases. The biomechanical feature extraction method [1], which achieves a classification accuracy of 98% on the INIT Gait Dataset, cannot be applied to the DAI and DAI 2 datasets, since it relies on features such as the shift in centre of gravity, torso orientation and amount of movement, which cannot be reliably computed in the presence of silhouette segmentation errors. Finally, the leg angle method [15] is very effective in performing binary classification on the DAI dataset but, as reported in [4], it cannot be used to perform pathology specific classification.
In conclusion, although the proposed system does not achieve the best performance in all situations across both datasets, it performs consistently well under different conditions, even when its feature extraction module has been trained on a dataset different from the ones used for testing. It also provides the best results in the presence of silhouette segmentation errors, a challenging condition for current state-of-the-art 2D video based methods.
It should also be noted that the proposed system can distinguish between different gait pathologies with a high level of certainty, as illustrated by the confusion matrix in Table IV. Only gait affected by hemiplegia presents a classification accuracy lower than 90%, due to its similarities with gait affected by diplegia and neuropathy. This is a significant improvement over the GEI method [4], which fails in classifying diplegia, with a classification accuracy of only 40%, and also performs poorly in classifying gait affected by hemiplegia and neuropathy. Hence, the proposed system can be considered a step forward when compared to current state-of-the-art 2D video based systems.

IV. CONCLUSION
This paper presents a novel system to classify gait across different pathologies. These pathologies range from restrictions in leg movement to alterations in gait caused by neurological or systemic disorders such as diplegia, hemiplegia, neuropathy and Parkinson's disease. The proposed system is also capable of performing binary classification between normal and impaired gait with a high level of accuracy. The classification results are better than those of most state-of-the-art methods, and the proposed system operates even in situations where some state-of-the-art methods fail, such as in the presence of poorly segmented silhouettes. The proposed system tackles this problem by using a GEI for feature representation. To further improve the classification accuracy, the proposed system extracts the most discriminative features from the GEI using a fine-tuned VGG-19 deep neural network. The results indicate that VGG-19 fine-tuned for the classification of gait pathologies performs significantly better than VGG-16 and VGG-19 trained on ImageNet, while also generalizing well to other datasets.
Although the results look promising, the 3 datasets currently considered are relatively small, with the biggest among them containing only 20 sequences per gait group. Thus, future work will consider capturing a dataset with more individuals and different types of gait pathologies, which can then be used to fine-tune and test the system to obtain more significant results.

Fig. 1. Architecture of the proposed system

Fig. 2. Converting silhouettes into a GEI

Depending on the number of gait cycles present in a video sequence, multiple GEIs can be obtained. The resulting GEIs used for feature extraction are resized to 224×224 pixels, the default input size of VGG-19.

TABLE IV. CONFUSION MATRIX FOR THE PROPOSED SYSTEM OPERATING ON DAI GAIT DATASET 2