Sign Language Inference
Introduction
After training our model in Part A, we are now going to develop an application that runs inference on new data.
I am going to be utilizing opencv to get live video from my webcam, then run our model against each frame of the video to predict which Sign Language letter I am holding up.
Here is an example of what the output will look like:
The whole code + training notebooks from Part A can be found in this GitHub repo.
This tutorial assumes a basic understanding of the cv2 library and a general understanding of how to run inference with a model.
The Full Code
Here is the full code for the App, if you just want the code.
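from collections import deque, Counter
import cv2
from fastai.vision.all import *

print('Loading our Inference model...')
# load our inference model
inf_model = load_learner('model/sign_language.pkl')
print('Model Loaded')

# define a deque to get rolling average of predictions
# I go with the last 10 predictions
rolling_predictions = deque([], maxlen=10)

# get the most common item in the deque
def most_common(D):
    data = Counter(D)
    return data.most_common(1)[0][0]

def hand_area(img):
    # specify where hand should go
    hand = img[50:324, 50:324]
    # the images in the model were trained on 200x200 pixels
    hand = cv2.resize(hand, (200, 200))
    return hand

# capture video on the webcam
cap = cv2.VideoCapture(0)

# get the dimensions of the frame
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# define codec and create our VideoWriter to save the video
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output/sign-language.mp4', fourcc, 12, (frame_width, frame_height))

# read video
while True:
    # capture each frame of the video
    ret, frame = cap.read()

    # flip frame to feel more 'natural' to webcam
    frame = cv2.flip(frame, flipCode=1)

    # draw a blue rectangle where to place hand
    cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)

    # get the image
    inference_image = hand_area(frame)

    # get the current prediction on the hand
    pred = inf_model.predict(inference_image)
    # append the current prediction to our rolling predictions
    rolling_predictions.append(pred[0])

    # our prediction is going to be the most common letter
    # in our rolling predictions
    prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'

    # show predicted text
    cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)
    # show the frame
    cv2.imshow('frame', frame)
    # save the frames to out file
    out.write(frame)
    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# release VideoCapture()
cap.release()
# release out file
out.release()
# close all frames and video windows
cv2.destroyAllWindows()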
I will explain each part of the code and the thinking behind it in the next section.
Explaining the Code
Imports
First, install fastai and opencv-python (for example with pip install fastai opencv-python).
These are the packages I utilize for this App: fastai is going to be used to run inference, cv2 handles all the WebCam functionality, and we use deque and Counter from collections to apply a nifty trick I am going to show you.
from collections import deque, Counter
import cv2
from fastai.vision.all import *
Loading our Inference Model
print('Loading our Inference model...')
# load our inference model
inf_model = load_learner('model/sign_language.pkl')
print('Model Loaded')
The next part of our code loads the model we pickled in Part A and prints some useful information.
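If you want to sanity-check what was loaded, you can print the class labels the learner knows about (a quick optional check, assuming the model was exported as a standard fastai classification Learner in Part A):

# optional sanity check: list the letters the loaded model can predict
print(inf_model.dls.vocab)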
Rolling Average Predictions
When I first made the App, I noticed one problem when using it: a slight movement of my hand changed the prediction. This is known as flickering. The video below shows how flickering affects our App:
{{< video https://youtu.be/aPAG39MjN68>}}
The video you saw at the beginning shows how ‘stable’ our model is after using rolling predictions.
# define a deque to get rolling average of predictions
# I go with the last 10 predictions
rolling_predictions = deque([], maxlen=10)

# get the most common item in the deque
def most_common(D):
    data = Counter(D)
    return data.most_common(1)[0][0]
To solve this, I utilized the deque from collections. I used 10 as the maxlen of the deque since I wanted the App, when running inference, to output the most common prediction out of the last 10 predictions. This makes it more stable than using only the current prediction. The function most_common returns the most common item in our deque.
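To make the trick concrete, here is a tiny throwaway example (not part of the App) showing how one flickered prediction gets outvoted:

demo = deque(['A', 'A', 'B', 'A', 'A'], maxlen=10)  # one stray 'B' caused by flickering
print(most_common(demo))  # prints 'A' -- the stray prediction is outvoted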
Hand Area
def hand_area(img):
    # specify where hand should go
    hand = img[50:324, 50:324]
    # the images in the model were trained on 200x200 pixels
    hand = cv2.resize(hand, (200, 200))
    return hand
Next, we define a function that tells our model which part of the frame to run inference on. We do not want to run inference on the whole frame, which would include our face! We will eventually draw a blue rectangle around this area so that you know where to place your hand.
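As a quick sanity check (using a made-up dummy frame, not part of the App), the slice is a 274x274 crop that comes back resized to the 200x200 input the model expects:

import numpy as np

dummy_frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a webcam frame
print(hand_area(dummy_frame).shape)  # (200, 200, 3)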
Capture Video on the WebCam and Define Our Writer
# capture video on the webcam
cap = cv2.VideoCapture(0)

# get the dimensions of the frame
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# define codec and create our VideoWriter to save the video
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('sign-language.mp4', fourcc, 12, (frame_width, frame_height))
Here, we define a VideoCapture that will record our video. The parameter 0 means capture on the first WebCam it finds. If you have multiple WebCams, this is the parameter you want to play around with until you find the correct one.
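If you are not sure which index your camera is on, a small throwaway helper like this (my own sketch, not part of the App) can find it by trying a few indices and reporting which ones open:

# probe the first few camera indices to find a working webcam
for index in range(4):
    test_cap = cv2.VideoCapture(index)
    if test_cap.isOpened():
        print(f'Found a camera at index {index}')
    test_cap.release()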
Next, we get the dimensions of the frame being recorded by the VideoCapture (properties 3 and 4 are the frame width and height). We are going to use these dimensions when writing (outputting) the recorded video.
Finally, we create a VideoWriter that we are going to use to output the video and write it to our hard disk. To do that, opencv requires us to define a codec, so we create a VideoWriter_fourcc exactly for that purpose and use 'mp4v' with it.
In our writer, we first pass the name we want for the output file; here I use 'sign-language.mp4', which will be written to the current directory. You can change this location if you wish. Next, we pass in the codec. After that, you pass in your fps (frames per second). I found that 12 worked best with my configuration, but you probably want to play around with that until you find the best value for your setup. Finally, we pass in the frame size, which we got earlier.
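If you want a starting point for that fps value instead of guessing, you can ask the capture device what it reports (a rough sketch; the reported number is not always reliable, and the loop also runs slower than the camera's native rate because every frame goes through the model, which is likely why a lower value like 12 ends up matching better):

# query the fps the webcam claims to deliver (may be 0 or inaccurate on some devices)
reported_fps = cap.get(cv2.CAP_PROP_FPS)
print(f'Webcam reports {reported_fps} fps')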
The Main Video Loop
# read video
while True:
    # capture each frame of the video
    ret, frame = cap.read()

    # flip frame to feel more 'natural' to webcam
    frame = cv2.flip(frame, flipCode=1)

    # draw a blue rectangle where to place hand
    cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)

    # get the image
    inference_image = hand_area(frame)

    # get the current prediction on the hand
    pred = inf_model.predict(inference_image)
    # append the current prediction to our rolling predictions
    rolling_predictions.append(pred[0])

    # our prediction is going to be the most common letter
    # in our rolling predictions
    prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'

    # show predicted text
    cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)
    # show the frame
    cv2.imshow('frame', frame)
    # save the frames to out file
    out.write(frame)
    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
This is a long piece of code, so let's break it down bit by bit:
# read video
while True:
    # capture each frame of the video
    _, frame = cap.read()

    # flip frame to feel more 'natural' to webcam
    frame = cv2.flip(frame, flipCode=1)

    # ......
    # truncated code here
    # ......

    # show the frame
    cv2.imshow('frame', frame)
    # save the frames to out file
    out.write(frame)
    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
We create an infinite while loop that keeps running until the user presses the 'q' key on the keyboard, as defined by the if statement at the very bottom of the loop.
Inside it, we use the reader we created earlier and call cap.read() on it, which returns the current frame of the video plus another variable that we are not going to use (a flag saying whether the frame was read successfully).
A little intuition on how video works: a frame is essentially a single static image. A video simply plays these single frames one after the other quickly, typically 30-60 frames per second, creating the illusion of continuous motion.
So for our App, we take each frame and run it through our model (which expects an image as input, so this works) to get the current prediction. This is also why we decided to use rolling average predictions rather than just the current one: to reduce the flickering that can occur when each new frame produces a slightly different prediction.
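In code terms, each frame that cap.read() hands back is just a NumPy array holding a single image, which is exactly the kind of input the model consumes (a quick illustrative check, not part of the App, assuming the webcam opened successfully):

ret, frame = cap.read()
print(type(frame), frame.shape)  # e.g. <class 'numpy.ndarray'> (480, 640, 3)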
Next:
frame = cv2.flip(frame, flipCode = 1)
This flips our frame horizontally (flipCode=1) to make it feel more natural. What I mean is, without flipping, the output image felt reversed: if I raised my left arm it looked like I was raising my right. Try running the App with this line commented out and you'll see what I mean.
These lines show the frames one after the other, and out writes them to disk:
cv2.imshow('frame', frame)
# save the frames to out file
out.write(frame)
# read video
while True:

    # ......
    # truncated code here
    # ......

    # draw a blue rectangle where to place hand
    cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)

    # get the image
    inference_image = hand_area(frame)

    # get the current prediction on the hand
    pred = inf_model.predict(inference_image)
    # append the current prediction to our rolling predictions
    rolling_predictions.append(pred[0])

    # our prediction is going to be the most common letter
    # in our rolling predictions
    prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'

    # show predicted text
    cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)

    # ......
    # truncated code here
    # ......

    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
Next, we draw a blue rectangle where the user should place their hand. The first parameter tells opencv to draw on our current frame. The next two parameters describe the corners of the area where we want our rectangle to be. Note that these coordinates are exactly the same as those in the hand_area function we created earlier; this is to make sure we are running inference on the correct area. Lastly, we pass in the color of the rectangle (in BGR format) and the thickness of the line (2).
cv2.rectangle(frame, (50, 50), (324, 324), (255, 0, 0), 2)
Next, from our whole frame, we extract just the hand area and store it. This is the image we are going to pass to our model:
inference_image = hand_area(frame)
Next, we pass our extracted image to our inference model, get the prediction, and append it to our rolling predictions deque. Remember that this deque only holds the most recent 10 predictions and discards everything older. The label itself is pred[0], as unpacked in the short sketch after the snippet.
pred = inf_model.predict(inference_image)
rolling_predictions.append(pred[0])
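For context, fastai's Learner.predict returns a three-element tuple, which is why we keep only the first item. Roughly:

# Learner.predict returns (decoded label, label index, probabilities)
label, label_idx, probs = inf_model.predict(inference_image)
print(label)                    # e.g. 'A' -- this is what pred[0] holds
print(float(probs[label_idx]))  # the model's confidence in that letter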
We get the most common letter predicted in our deque and use opencv to write that letter onto the video. The parameters are similar to the rectangle call, with a slight variation: here we also pass in the text position (10, 350), the font (Hershey Simplex) and the font scale (0.9).
prediction_output = f'The predicted letter is {most_common(rolling_predictions)}'
cv2.putText(frame, prediction_output, (10, 350), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)
The final part of the code just releases the resources we acquired initially, the video reader and the video writer, and then destroys all the windows we created.
# release VideoCapture()
cap.release()
# release out file
out.release()
# close all frames and video windows
cv2.destroyAllWindows()
And that's all for this project. I hope you enjoyed it.
In the future, I am going to look for ways to improve this system and to make it genuinely useful.