Sign Language Inference

technical
project
computer-vision
Creating a real-time sign language interpreter.
Published: January 21, 2021

Introduction

After training our model in Part A, we are now going to develop an application that runs inference on new data.

I am going to use opencv to get live video from my webcam, run our model against each frame of the video, and get a prediction of which sign language letter I am holding up.

Here is an example of what the output will look like:

The whole code + training notebooks from Part A can be found in this github repo.

This tutorial assumes some basic understanding of the cv2 library and general understanding of how to run inference using a model.

The Full Code

Here is the full code for the App, in case you just want the code.

I will explain each part of the code and my thinking behind it in the next section.

from collections import deque, Counter

import cv2
from fastai.vision.all import *

print('Loading our Inference model...')
# load our inference model
inf_model = load_learner('model/sign_language.pkl')
print('Model Loaded')


# define a deque to get rolling average of predictions
# I go with the last 10 predictions
rolling_predictions = deque([], maxlen=10)

# get the most common item in the deque
def most_common(D):
    data = Counter(D)
    return data.most_common(1)[0][0]


def hand_area(img):
    # specify where hand should go
    hand = img[50:324, 50:324]
    # the model was trained on 200x200 pixel images
    hand = cv2.resize(hand, (200,200))
    return hand

# capture video on the webcam
cap = cv2.VideoCapture(0)


# get the dimensions on the frame
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# define codec and create our VideoWriter to save the video
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output/sign-language.mp4', fourcc, 12, (frame_width, frame_height))


# read video
while True:
    # capture each frame of the video
    ret, frame = cap.read()

    # flip frame to feel more 'natural' to webcam
    frame = cv2.flip(frame, flipCode = 1)

    # draw a blue rectangle where to place hand
    cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)

    # get the image
    inference_image = hand_area(frame)

    # get the current prediction on the hand
    pred = inf_model.predict(inference_image)
    # append the current prediction to our rolling predictions
    rolling_predictions.append(pred[0])

    # our prediction is going to be the most common letter
    # in our rolling predictions
    prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'

    # show predicted text
    cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)
    # show the frame
    cv2.imshow('frame', frame)
    # save the frames to out file
    out.write(frame)


    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# release VideoCapture()
cap.release()
# release out file
out.release()
# close all frames and video windows
cv2.destroyAllWindows()

Explaining the Code

Imports

Install fastai and opencv-python.

Next, these are the packages I use for this App. fastai is going to be used to run inference, cv2 is going to handle all the WebCam functionality, and we are going to use deque and Counter from collections to apply a nifty trick I am going to show you.

from collections import deque, Counter

import cv2
from fastai.vision.all import *

Loading our Inference Model

print('Loading our Inference model...')
# load our inference model
inf_model = load_learner('model/sign_language.pkl')
print('Model Loaded')

The next part of our code loads the model we pickled in Part A and prints status messages so we know when loading is done.
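
If you want to see what predict actually returns before wiring it into the video loop, you can run a quick check on a single image. This is just a sketch: ‘test_images/a.jpg’ is a hypothetical sample file, and the exact tensors you see will depend on how the learner was built in Part A.

import cv2
from fastai.vision.all import *

inf_model = load_learner('model/sign_language.pkl')

# 'test_images/a.jpg' is a hypothetical sample image used only for this check
img = cv2.imread('test_images/a.jpg')

# predict() returns a tuple: (decoded label, label index, probabilities tensor)
pred = inf_model.predict(img)
print(pred)     # e.g. ('A', tensor(0), tensor([0.98, 0.01, ...]))
print(pred[0])  # the decoded label is the only part we keep later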

Rolling Average Predictions

When I first made the App, I noticed one problem while using it: a slight movement of my hand changed the prediction. This is known as flickering. The video below shows how flickering affects our App:

{{< video https://youtu.be/aPAG39MjN68>}}

The video at the beginning shows how ‘stable’ our model is after using rolling predictions.

# define a deque to get rolling average of predictions
# I go with the last 10 predictions
rolling_predictions = deque([], maxlen=10)

# get the most common item in the deque
def most_common(D):
    data = Counter(D)
    return data.most_common(1)[0][0]

To solve this, I utilized a deque from collections. I set the deque’s maxlen to 10 since I wanted the App, when running inference, to output the most common prediction out of the last 10 predictions. This makes the output more stable than relying on the current prediction alone.

The function most_common will return the most common item in our deque.
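
To make the rolling-average trick concrete, here is a tiny standalone example with made-up letters (not real model output) showing how the deque and Counter behave together:

from collections import deque, Counter

# ten made-up predictions; a couple of 'flickers' ('B', 'C') sneak in
recent = deque(['A', 'A', 'B', 'A', 'C', 'A', 'A', 'B', 'A', 'A'], maxlen=10)
print(Counter(recent).most_common(1)[0][0])  # 'A' still wins

recent.append('B')   # at maxlen, the oldest item is dropped automatically
print(len(recent))   # still 10
print(list(recent))  # the first 'A' is gone and the new 'B' sits at the end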

Hand Area

def hand_area(img):
    # specify where hand should go
    hand = img[50:324, 50:324]
    # the model was trained on 200x200 pixel images
    hand = cv2.resize(hand, (200,200))
    return hand

Next, we define a function that tells our model which part of the frame to run inference on. We do not want to run inference on the whole frame, which would include our face! We will eventually draw a blue rectangle around this area so that you know where to place your hand.
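
If the slicing looks opaque, the sketch below uses a dummy frame (just zeros standing in for a real webcam frame) to show how the slice maps onto the rectangle we draw and what the resize does to the shape:

import numpy as np
import cv2

# OpenCV images are numpy arrays indexed as img[y1:y2, x1:x2]
frame = np.zeros((480, 640, 3), dtype=np.uint8)   # a dummy 480x640 frame

hand = frame[50:324, 50:324]                  # the same region the blue rectangle outlines
print(hand.shape)                             # (274, 274, 3)
print(cv2.resize(hand, (200, 200)).shape)     # (200, 200, 3), matching the training size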

Capture Video on the WebCam and Define Our Writer

# capture video on the webcam
cap = cv2.VideoCapture(0)

# get the dimensions on the frame
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# define codec and create our VideoWriter to save the video
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output/sign-language.mp4', fourcc, 12, (frame_width, frame_height))

Here, we define a VideoCapture that will record our video. The parameter 0 means capture on the first WebCam it finds. If you have multiple WebCams, this is the parameter you want to play around with until you find the correct one.
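
If 0 does not pick the webcam you want, a quick way to see which indices are available is to probe a few of them. A rough sketch (the range worth trying depends on your machine):

import cv2

# probe the first few camera indices and report which ones open successfully
for idx in range(3):
    test_cap = cv2.VideoCapture(idx)
    status = 'available' if test_cap.isOpened() else 'not available'
    print(f'camera index {idx}: {status}')
    test_cap.release()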

Next, we get the dimensions of the frame being recorded by the VideoCapture. We are going to use these dimensions when writing (outputting) the recorded video.

Finally, we create a VideoWriter that we are going to use to output the video and write it to our hard disk. To do that, opencv requires us to define a codec, so we create a VideoWriter_fourcc exactly for that purpose and use ‘mp4v’ with it.
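
If ‘mp4v’ gives you trouble on your platform, there are other common codec/container pairings you could try. This is only a sketch of alternatives, not something the App itself needs, and which ones work depends on your OpenCV build:

import cv2

fourcc_mp4 = cv2.VideoWriter_fourcc(*'mp4v')   # pairs with .mp4 (what we use here)
fourcc_avi = cv2.VideoWriter_fourcc(*'XVID')   # commonly paired with .avi
fourcc_mjpg = cv2.VideoWriter_fourcc(*'MJPG')  # also commonly paired with .avi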

In our writer, we first pass the name we want for the output file; here I use ‘output/sign-language.mp4’, which will be written to the output directory (the directory needs to exist, and you can change this location if you wish to). Next we pass in the codec. After that, you pass in your fps (frames per second). I found that 12 worked best with my configuration, but you probably want to play around with that until you find the value that works best for you. Finally, we pass in the frame dimensions, which we got earlier.
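
As a side note, the magic numbers 3 and 4 in cap.get() have named equivalents, and the capture device can also report its own frame rate, which can be a starting point when picking the writer’s fps. A small sketch (the values you get back depend on your camera, and some webcams report 0 for the fps):

import cv2

cap = cv2.VideoCapture(0)

# the same values as cap.get(3) and cap.get(4), but with readable names
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
camera_fps = cap.get(cv2.CAP_PROP_FPS)   # may be 0 on some webcams

print(frame_width, frame_height, camera_fps)
cap.release()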

The Main Video Loop

# read video
while True:
    # capture each frame of the video
    ret, frame = cap.read()

    # flip frame to feel more 'natural' to webcam
    frame = cv2.flip(frame, flipCode = 1)

    # draw a blue rectangle where to place hand
    cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)

    # get the image
    inference_image = hand_area(frame)

    # get the current prediction on the hand
    pred = inf_model.predict(inference_image)
    # append the current prediction to our rolling predictions
    rolling_predictions.append(pred[0])

    # our prediction is going to be the most common letter
    # in our rolling predictions
    prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'

    # show predicted text
    cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)
    # show the frame
    cv2.imshow('frame', frame)
    # save the frames to out file
    out.write(frame)


    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

This is a long piece of code, so let’s break it down bit by bit:

# read video
while True:
    # capture each frame of the video
    _ , frame = cap.read()

    # flip frame to feel more 'natural' to webcam
    frame = cv2.flip(frame, flipCode = 1)
    


    # ......
    # truncated code here
    # ......



    
    # show the frame
    cv2.imshow('frame', frame)
    # save the frames to out file
    out.write(frame)


    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

We create an infinite while loop that keeps running until the user presses the ‘q’ key on the keyboard, as defined by the if statement at the very bottom of the loop.

After that, we take the VideoCapture we created earlier and call cap.read() on it, which returns the current frame of the video along with a success flag that we are not going to use here.

A little intuition on how videos work: a frame is essentially a single static image. A video is just these single frames played one after the other quickly, typically 30-60 times per second, creating the illusion of continuous motion.

So for our App, we are going to take each frame, run it through our model (which expects its input to be an image, so this works), and get the current prediction. This is also why we decided to use rolling average predictions and not just the current prediction: it reduces the flickering that can occur because every frame we pass in is slightly different.
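
Tying this back to the fps we chose for the VideoWriter earlier: the writer’s frame rate should roughly match how fast this loop actually runs, and one rough way to pick a value is to time the loop yourself. A minimal sketch that only measures how quickly frames can be read (running the model inside the loop will slow things down further):

import time
import cv2

cap = cv2.VideoCapture(0)

# read 100 frames and estimate the achieved frames per second
start, frames = time.time(), 0
while frames < 100:
    ret, frame = cap.read()
    if not ret:
        break
    frames += 1

print(f'approx. {frames / (time.time() - start):.1f} fps')
cap.release()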

Next:

    frame = cv2.flip(frame, flipCode = 1)

This flips our frame to make it feel more natural. What I mean is, without flipping, the output image felt reversed: if I raised my left arm, it looked like I was raising my right. Try running the App with this line commented out and you’ll see what I mean.
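
For reference, flipCode controls the direction of the flip. A tiny sketch on a toy array (not a real frame) showing the three options:

import numpy as np
import cv2

frame = np.arange(6, dtype=np.uint8).reshape(2, 3)

print(cv2.flip(frame, 1))   # flipCode = 1: horizontal mirror, what the App uses
print(cv2.flip(frame, 0))   # flipCode = 0: vertical flip
print(cv2.flip(frame, -1))  # flipCode = -1: flip around both axes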

The following shows the frames one after the other, and out writes them to disk:

    cv2.imshow('frame', frame)
    # save the frames to out file
    out.write(frame)

# read video
while True:
    # ......
    # truncated code here
    # ......

    # draw a blue rectangle where to place hand
    cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)

    # get the image
    inference_image = hand_area(frame)

    # get the current prediction on the hand
    pred = inf_model.predict(inference_image)
    # append the current prediction to our rolling predictions
    rolling_predictions.append(pred[0])

    # our prediction is going to be the most common letter
    # in our rolling predictions
    prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'

    # show predicted text
    cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)


    # ......
    # truncated code here
    # ......


    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Next, we draw a blue rectangle where the user should place their hand. The first parameter is the image we want to draw on, so we pass in our current frame. The next two parameters are the top-left and bottom-right corners of the rectangle. Note that these coordinates are exactly the same as those in the hand_area function we created earlier; this is to make sure we are running inference on the correct area. Lastly, we pass in the color of the rectangle (in BGR format) and the thickness of the line (2).

cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)
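
One thing that trips people up: opencv expects colors as (Blue, Green, Red) tuples rather than RGB, which is why (255, 0, 0) comes out blue. A small standalone sketch you can use to convince yourself:

import numpy as np
import cv2

# opencv colors are (B, G, R), not (R, G, B)
BLUE = (255, 0, 0)
GREEN = (0, 255, 0)
RED = (0, 0, 255)

canvas = np.zeros((400, 400, 3), dtype=np.uint8)   # a blank image to draw on
cv2.rectangle(canvas, (50, 50), (324, 324), BLUE, 2)
print(canvas[50, 100])   # [255 0 0] -> only the blue channel is set on the border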

Next, from our whole frame, we extract just the hand area and store it. This is the image we are going to pass to our model.

inference_image = hand_area(frame)

Next, we pass our extracted image to our inference model, get the prediction, and append it to our rolling predictions deque. Remember that this deque only holds the most recent 10 predictions and discards older ones.

pred = inf_model.predict(inference_image)

rolling_predictions.append(pred[0])

We get the most common letter predicted in our deque and use opencv to write that prediction onto the frame. The parameters are similar to the rectangle call, with a slight variation since here we also pass in the font (Hershey Simplex) and the font scale (0.9).

prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'

cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)

The final part of the code releases the resources we acquired initially, the VideoCapture and the VideoWriter, and then destroys all the windows we created.

# release VideoCapture()
cap.release()
# release out file
out.release()
# close all frames and video windows
cv2.destroyAllWindows()
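
If you want this cleanup to happen even when something inside the loop throws an exception (for example the camera disconnecting mid-run), one option is to wrap the loop in a try/finally. A sketch of that variant, trimmed down to just the capture and write parts:

import cv2

cap = cv2.VideoCapture(0)
frame_width, frame_height = int(cap.get(3)), int(cap.get(4))
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
# written to the current directory in this sketch
out = cv2.VideoWriter('sign-language.mp4', fourcc, 12, (frame_width, frame_height))

try:
    while True:
        ret, frame = cap.read()
        if not ret:   # stop cleanly if no frame could be read
            break
        cv2.imshow('frame', frame)
        out.write(frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
finally:
    # these run whether we exit normally or because of an exception
    cap.release()
    out.release()
    cv2.destroyAllWindows()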

And that’s all for this project. I hope you enjoyed it.

In the future, I am going to look for ways to improve this system and make it genuinely useful.