Sign Language Classification
Introduction
I am going to attempt to use Deep Learning to create a model that can recognize the American Sign Language alphabet. In this part, we will focus on model training; in the second part, we will build an application around the model we train here.
We are going to utilize Transfer Learning for this project, which is an important part of Deep Learning.
While I do not claim that this will be the best application out there for this particular problem, this small project could serve as motivation and could be expanded in the future into products that help people who rely on sign language to communicate.
Importing Packages
We are going to be using fastai, so let's import it:

from fastai.vision.all import *
The Data
The dataset we are going to be using is the American Sign Language Dataset from Kaggle. It contains 87,000 images, each of 200x200 pixels. It has 29 classes: 26 for the letters A-Z and 3 classes for space, delete, and nothing. We are going to use the Kaggle API to get the data.
!kaggle datasets download -d grassknoted/asl-alphabet
Downloading asl-alphabet.zip to /content
100% 1.03G/1.03G [00:05<00:00, 222MB/s]
Let’s unzip the data and get rid of the zip file:
!unzip *zip -d data && rm -rf *zip
We create a Pathlib object pointing to our data folder and look inside to see what it contains:

path = Path('data')
Path.BASE_PATH = path
path.ls()
(#2) [Path('asl_alphabet_test'),Path('asl_alphabet_train')]
Let’s peek into one of those folders:
(path/'asl_alphabet_train').ls()
(#29) [Path('asl_alphabet_train/X'),Path('asl_alphabet_train/G'),Path('asl_alphabet_train/V'),Path('asl_alphabet_train/I'),Path('asl_alphabet_train/space'),Path('asl_alphabet_train/N'),Path('asl_alphabet_train/W'),Path('asl_alphabet_train/P'),Path('asl_alphabet_train/H'),Path('asl_alphabet_train/Z')...]
We have 29 folders, as explained earlier.
Data Preprocessing
Now we are ready to create a DataBlock blueprint to hold our data. We use the fastai DataBlock API, which is a convenient way to define how to handle our data.
Since we don’t have validation data provided, we will split off 20% of the training images and use them as our validation data.
We then create a DataLoaders object from the DataBlock; we will use a batch size of 64.
signs = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(seed=42, valid_pct=0.2),
    item_tfms=Resize(200),
    batch_tfms=aug_transforms()
)

dls = signs.dataloaders(path/'asl_alphabet_train', bs=64)
Let’s look into one batch of the data:
dls.show_batch()
This is the number of steps we are going to take in an epoch:
len(dls.train)
1087
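As a quick sanity check, this number is roughly what we would expect (a back-of-the-envelope calculation, assuming the training DataLoader drops the last partial batch):

# 87,000 images with 20% held out for validation, batch size 64:
# 87,000 * 0.8 = 69,600 training images
# 69,600 / 64 = 1,087.5  ->  1,087 full batches per epoch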
Using Transfer Learning to Create A Model
Now we can create a model and use Transfer Learning
to train it on our data. Transfer Learning is important since it enables us to get good results with less training and data.
For those who wish to replicate this experiment, we use the resnet18 architecture, Cross Entropy Loss (since this is a classification task), and the Adam optimizer. We will output error rate and accuracy as our metrics to help us analyze how our model is doing.
We use the Learning Rate Finder provided by fastai, built on insights from Leslie Smith’s work, which enables us to find a good learning rate quickly instead of trying several learning rates experimentally and seeing what works.
If you are interested in reading more about the Learning Rate Finder, read this paper.
For our task, it looks like a learning rate of 1e-2 will work, so we fine-tune (transfer learn) for 4 epochs.
learn = cnn_learner(dls, resnet18, loss_func=CrossEntropyLossFlat(),
                    metrics=[error_rate, accuracy], opt_func=Adam)
learn.lr_find()
Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /root/.cache/torch/hub/checkpoints/resnet18-5c106cde.pth
SuggestedLRs(lr_min=0.017378008365631102, lr_steep=0.015848932787775993)
learn = cnn_learner(dls, resnet18, loss_func=CrossEntropyLossFlat(),
                    metrics=[error_rate, accuracy], opt_func=Adam)

learn.fine_tune(4, base_lr=1e-2)
epoch | train_loss | valid_loss | error_rate | accuracy | time |
---|---|---|---|---|---|
0 | 0.230984 | 0.065030 | 0.019310 | 0.980690 | 04:11 |
epoch | train_loss | valid_loss | error_rate | accuracy | time |
---|---|---|---|---|---|
0 | 0.160366 | 0.875754 | 0.117989 | 0.882011 | 05:24 |
1 | 0.038482 | 0.008822 | 0.002529 | 0.997471 | 05:26 |
2 | 0.008333 | 0.000855 | 0.000230 | 0.999770 | 05:25 |
3 | 0.002372 | 0.000346 | 0.000057 | 0.999943 | 05:25 |
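The two tables above come from the two phases of fine_tune: fastai first trains only the new head with the pretrained body frozen, then unfreezes everything and trains the whole network. A simplified sketch of what fine_tune(4, base_lr=1e-2) does under the hood (the exact learning-rate schedule is an assumption; see the fastai source for details):

learn.freeze()                 # pretrained body frozen, only the new head trains
learn.fit_one_cycle(1, 1e-2)   # produces the first, one-row table
learn.unfreeze()               # the whole network now trains, with discriminative learning rates
learn.fit_one_cycle(4)         # produces the second, four-row table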
We get very good accuracy after only 4 epochs.
Now we can tackle the small test set that comes with this dataset, although we will scale up to a better test dataset in a few moments.
Testing Our Model
test_images = (path/'asl_alphabet_test').ls()
test_images
(#28) [Path('asl_alphabet_test/U_test.jpg'),Path('asl_alphabet_test/space_test.jpg'),Path('asl_alphabet_test/N_test.jpg'),Path('asl_alphabet_test/R_test.jpg'),Path('asl_alphabet_test/H_test.jpg'),Path('asl_alphabet_test/P_test.jpg'),Path('asl_alphabet_test/T_test.jpg'),Path('asl_alphabet_test/C_test.jpg'),Path('asl_alphabet_test/X_test.jpg'),Path('asl_alphabet_test/V_test.jpg')...]
We have 28 images in this test set, one for almost every class in our data.
Let’s take the first two images and predict them using our model.
U = (path/'asl_alphabet_test'/'U_test.jpg')
space = (path/'asl_alphabet_test'/'space_test.jpg')
learn.predict(U)[0]
'U'
learn.predict(space)[0]
'space'
That looks like it's working well. To predict on all the images in the test set, we are going to need a way to get the labels of the images, so as to compare them with our predictions.
Let’s work with one image first:
u_test = test_images[0]
u_test
Path('asl_alphabet_test/U_test.jpg')
As you can see, the label of each test image is contained in its filename. So we are going to use regular expressions to extract the label from the filenames.
Here is a simple regular expression that does the job:
re.findall('(.+)_test.jpg$', u_test.name)[0]
'U'
And our prediction on that image:
learn.predict(u_test)[0]
'U'
And now, a way to compare our prediction to the true label of the test set:
re.findall('(.+)_test.jpg$', u_test.name)[0] == learn.predict(u_test)[0]
True
Let us write a function that is going to extract the labels, and store them in a list:
def get_test_names(images):
    labels = []
    for i in images:
        label = re.findall('(.+)_test.jpg$', i.name)[0]
        labels.append(label)
    return labels
We can now get all the labels for the 28 images in our test set:
test_labels = get_test_names(test_images)
len(test_labels)
28
Now we need a function to run inference on the images. It is going to take in the images, our model and the labels we just got as parameters and output the mean accuracy of our predictions:
def run_inference(images, model, labels):
    corrects = []
    # get the number of images to inference
    num_images = len(images)

    for i in range(num_images):
        # get the inference for an image
        prediction = model.predict(images[i])[0]
        # compare with the label for that image
        is_equal = (prediction == labels[i])
        # append result to the list
        corrects.append(is_equal)

    # convert the list of inferences to float Tensor
    corrects = torch.Tensor(corrects).float()
    # return the mean accuracy
    return corrects.mean().item()
We can use that function to get the mean accuracy of our model on the small test dataset.
test_accuracy = run_inference(test_images, learn, test_labels)
test_accuracy
1.0
We get 100% accuracy. That's suspicious: you should always be worried when your model achieves accuracy this high, since there could be data leakage. On top of that, this is a small test set.
Luckily, there is another dataset recommended to be used as a test set for this dataset. It contains 870 images, 30 images for each category.
Let us use the Kaggle API again to get this new dataset:
!kaggle datasets download -d danrasband/asl-alphabet-test
Downloading asl-alphabet-test.zip to /content
70% 17.0M/24.3M [00:00<00:00, 24.5MB/s]
100% 24.3M/24.3M [00:00<00:00, 33.3MB/s]
And unzip it to a test folder:
!unzip *zip -d test && rm -rf *zip
test_path = Path('test')
test_path.ls()
(#29) [Path('test/X'),Path('test/G'),Path('test/V'),Path('test/I'),Path('test/space'),Path('test/N'),Path('test/W'),Path('test/P'),Path('test/H'),Path('test/Z')...]
We use get_image_files to recursively get the images from the newly created test path. We get 870 images (29 classes x 30 images each), so that seems to be working fine.
test_files = get_image_files(test_path)
len(test_files)
870
To run inference, we are required to perform the same data preprocessing we performed on the training images. To make this easier, fastai suggests we use a test_dl, created using the following syntax:
test_dl = learn.dls.test_dl(test_files, with_labels=True)
test_dl.show_batch()
We can now get the predictions on all the test images easily using the get_preds function and store them in a test_preds variable.
test_preds = learn.get_preds(dl=test_dl)
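get_preds returns a tuple of (predictions, targets), with one row of class probabilities per test image. A quick way to inspect it (the shape shown is what we would expect here, as an assumption):

preds, targs = test_preds
preds.shape   # expected: torch.Size([870, 29]) -- 870 images, 29 class probabilities each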
This is the current vocab (the categories) of our data:
learn.dls.vocab
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'del', 'nothing', 'space']
To extract the true labels from the test images, we are again going to turn to regular expressions. But this time, we need to write a regular expression robust enough to handle these three cases, which represent how the rest of the images are named:
example_name = test_dl.items[370].name
example_name
'del0002_test.jpg'
re.findall(r'([A-Za-z]+)\d+_test.jpg$', example_name)[0]
'del'
example_name_2 = test_dl.items[690].name
example_name_2
'nothing0013_test.jpg'
re.findall(r'([A-Za-z]+)\d+_test.jpg$', example_name_2)[0]
'nothing'
example_name_3 = test_dl.items[0].name
example_name_3
'X0023_test.jpg'
re.findall(r'([A-Za-z]+)\d+_test.jpg$', example_name_3)[0]
'X'
Now that we have that robust expression, we can proceed to check the accuracy of the predictions we calculated:
We create a list to hold the results of comparing the predictions with the true labels, which we will use to calculate the final accuracy.
We also create a category_corrects dictionary to tally how many predictions we got correct per category, so that we can see how our model performs on each category individually.
# create a list to hold True or False when comparing
corrects = []

# count how many predictions we get correct per category
category_corrects = dict.fromkeys(learn.dls.vocab, 0)

# for each enumerated prediction
for index, item in enumerate(test_preds[0]):
    # get the predicted vocab
    prediction = learn.dls.categorize.decode(np.argmax(item))
    # get the confidence of the prediction
    confidence = max(item)
    confidence_percent = float(confidence)
    # get the true label for the image we are predicting
    image_name = test_dl.items[index].name
    label = re.findall(r'([A-Za-z]+)\d+_test.jpg$', image_name)[0]
    # get the comparison and append it to our corrects list
    is_correct = (prediction == label)
    corrects.append(is_correct)
    # if we got the prediction correct for that category,
    # increase the count by one
    if is_correct:
        category_corrects[prediction] += 1

# convert the list of inferences to float Tensor
corrects = torch.Tensor(corrects).float()

# print the mean accuracy
print(f'Accuracy on the test set: {corrects.mean().item():.4f}')
Accuracy on the test set: 0.6195
As you can see, using this better test set, the accuracy drops considerably.
Since this is out-of-domain data, my intuition is that a better, more varied dataset could be collected and used in the future. For now, however, let's work with these results.
Remember, the test set is used as a final measure, and we shouldn't use it to improve our model.
Let us check on the per-category prediction:
category_corrects
{'A': 15,
'B': 30,
'C': 17,
'D': 25,
'E': 19,
'F': 13,
'G': 13,
'H': 30,
'I': 29,
'J': 22,
'K': 21,
'L': 29,
'M': 22,
'N': 14,
'O': 19,
'P': 30,
'Q': 29,
'R': 17,
'S': 7,
'T': 0,
'U': 11,
'V': 0,
'W': 27,
'X': 4,
'Y': 28,
'Z': 11,
'del': 23,
'nothing': 14,
'space': 20}
A plot would be better to analyze the information:
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(*zip(*category_corrects.items()))
plt.show()
Our model performs really poorly on the letters T, V, and X! We will see if this is going to be a problem in the application we create in Part 2.
Making our Inference Model More Robust
Since I plan on using this model to create a Computer Vision application, I decided to retrain it, adding the new dataset in order to make it more robust, since that data varied more.
NOTE: This is not how it should be done; you should never use your test set to train the model. I only combined the two datasets into one in order to get a better model, since good Sign Language data is hard to come by and the test set we used isn't the official test set, just a recommended dataset that could have been used for training a different model. Also, I didn't use my new model to get a better prediction on the test set.
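The notebook does not show the step that merges the extra images into the training folder before the DataLoaders are rebuilt; a minimal sketch of how it could be done (assuming the class sub-folders in test/ match the ones in asl_alphabet_train/; the extra_ prefix is just a hypothetical way to avoid filename clashes):

import shutil

# copy every image from the extra dataset into the matching class folder
# of the training set (hypothetical merge step, not from the original notebook)
for img in get_image_files(test_path):
    dest = path/'asl_alphabet_train'/img.parent.name
    shutil.copy(img, dest/f'extra_{img.name}')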
signs = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(seed=42),
    item_tfms=Resize(200),
    batch_tfms=aug_transforms()
)

dls = signs.dataloaders(path/'asl_alphabet_train', bs=64)
len(dls.train)
1098
learn = cnn_learner(dls, resnet18, loss_func=CrossEntropyLossFlat(),
                    metrics=[error_rate, accuracy], opt_func=Adam)

learn.fine_tune(4, base_lr=1e-2)
epoch | train_loss | valid_loss | error_rate | accuracy | time |
---|---|---|---|---|---|
0 | 0.248037 | 0.073583 | 0.020769 | 0.979231 | 04:13 |
epoch | train_loss | valid_loss | error_rate | accuracy | time |
---|---|---|---|---|---|
0 | 0.121224 | 0.061025 | 0.016502 | 0.983498 | 05:27 |
1 | 0.036989 | 0.008055 | 0.001878 | 0.998122 | 05:27 |
2 | 0.007163 | 0.003641 | 0.001081 | 0.998919 | 05:27 |
3 | 0.003785 | 0.002509 | 0.000569 | 0.999431 | 05:27 |
I can now export my model, which I will use in the next part to create a Computer Vision application that predicts on new data. Stay tuned!
learn.export('sign_language.pkl')
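In Part 2, the exported file can be loaded back and used for inference with load_learner; a minimal sketch (the image filename is hypothetical):

from fastai.vision.all import load_learner

inf_learn = load_learner('sign_language.pkl')               # load the exported model
pred, pred_idx, probs = inf_learn.predict('some_sign.jpg')  # predict a single image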
Here is the link to the second part