Detect Hand Sign Language with TensorFlow
Interested in learning how to use TensorFlow to detect sign language hand signs in your apps? By the end of this read, you will know how to implement TensorFlow in your application in a few simple steps. In today's example, we will be using Vue.
What is TensorFlow?
TensorFlow is an open-source, end-to-end platform _(meaning it covers the full Machine Learning workflow, from building and training a model to deploying it)_ for building Machine Learning applications. TensorFlow enables you to build dataflow graphs and structures that define how data moves through a graph, taking inputs as a multi-dimensional array called a Tensor. You can read more on TensorFlow here.
What is a Model?
A model is a function with learnable parameters that maps an input to an output. A well-trained model will provide an accurate mapping from the input to the desired output.
TensorFlow Models
TensorFlow offers a collection of pre-trained models in four defined categories:
- Vision: Analyze features in images and videos.
- Body: Detect key points and poses on the face, hands, and body with models from MediaPipe.
- Text: Enable NLP in your web app using the power of BERT and other Transformer encoder architectures.
- Audio: Classify audio to detect sounds.
If you want to go into more detail, check out TensorFlow Models.
All these models are broken down into subcategories, and in our case, we will use the Body category, which includes the hand pose detection we need in order to detect the hand signs.
Hand Pose Detection
This model uses 2D and 3D multi-dimensional arrays, which enable it to predict the keypoints of the hands.
An example of a 2D array is [[1,2],[3,5],[7,8],[20,44]], and of a 3D array, [[1,2,5],[3,5,8],[7,8,6],[20,44,100]].
This hand pose detection model comes from MediaPipe, as we established above, and it provides two model types: lite and full. Prediction accuracy increases from lite to full while inference speed decreases, i.e. the response time gets slower as the accuracy increases.
What do we need?
There are a few dependencies we need to get things working, and I also will be assuming that you have your project set up as well.
You will need to add these dependencies to the project:
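The original commands are not shown here, so as a sketch, assuming the packages used later in this article, the install step would look something like this:

```shell
# Assumed dependencies, based on the libraries used later in this article
npm install @tensorflow/tfjs @tensorflow-models/hand-pose-detection fingerpose
```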
Above, in the commands, you will notice we added fingerpose. Let's talk a little about what we need fingerpose for.
Fingerpose
Fingerpose is a gesture classifier for hand landmarks detected by the MediaPipe hand pose detection model. It also allows you to define your own hand gestures, which means that a gesture that signifies the letter Z could instead signify Hello, depending on your fingerpose data. We will see an example of what the data looks like in a bit. You can check out fingerpose for more details.
Get started
We are going to use Vue for this illustration. We will start by looking at the HTML first, and then we will cover the JavaScript.
Our template will be basic HTML with a video tag so we can show a video after getting access to the webcam.
Template
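A minimal sketch of such a template (the class name and the `video` ref name are assumptions; the ref is referenced from the script):

```html
<template>
  <div class="camera-container">
    <!-- The "video" ref is used by the script to attach the webcam stream -->
    <video ref="video" autoplay playsinline></video>
  </div>
</template>
```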
The snippet above shows a div and a video tag. The video element displays the stream once we gain access to the webcam.
We will now be writing the JS required to initialize the webcam.
Script
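A sketch of that script using Vue's Composition API; the `openCam` name follows the prose below, while the error handling is an assumption:

```javascript
import { onMounted, ref } from 'vue';

export default {
  setup() {
    // Reactive reference to the <video ref="video"> element in the template
    const video = ref(null);

    const openCam = () => {
      // Make sure the browser exposes the MediaDevices API
      if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
        navigator.mediaDevices
          .getUserMedia({ video: true })
          .then((stream) => {
            if ('srcObject' in video.value) {
              video.value.srcObject = stream;
            } else {
              // Fallback for older browsers: convert the stream to a URL
              video.value.src = window.URL.createObjectURL(stream);
            }
          })
          .catch((err) => console.error('Webcam access denied:', err));
      }
    };

    onMounted(openCam);

    return { video };
  },
};
```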
We imported two functions from Vue: onMounted and ref. onMounted runs when the component is fully mounted, while ref declares a reactive reference to the video element. If you look at the video tag in the template, you will notice a ref attribute. You can check out Template refs and the onMounted lifecycle hook.
In the openCam function, we first check whether mediaDevices is available on the browser's navigator object.
>The MediaDevices interface provides access to connected media input devices like cameras and microphones, as well as screen sharing. In essence, it lets you obtain access to any hardware source of media data.
MediaDevices has a getUserMedia method, which prompts the user for permission to use a media input. You can find all you need to know about getUserMedia here.
From the snippet, we can see that getUserMedia returns a promise, so we can get the media stream as a response using then(). We check whether the video element supports srcObject. If it does, we assign the media stream to srcObject; if not, we convert the media stream to a URL and assign it to the video element's src.
With this snippet and a little styling, you should have a video showing your awesome face!
Introducing Tensorflow and Hand Detection
Now that we have the webcam working, we will update the template and the script in order to detect, predict, and display a letter of the alphabet based on the hand sign prediction.
The updated HTML should now look like this:
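A sketch of the updated template; the alphabet class name follows the prose below, while the `alphabet` binding name is an assumption:

```html
<template>
  <div class="camera-container">
    <video ref="video" autoplay playsinline></video>
    <!-- Displays the letter predicted from the hand sign -->
    <div class="alphabet">{{ alphabet }}</div>
  </div>
</template>
```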
The div with the class name alphabet will display the letter predicted from the hand sign.
We will be introducing two new functions: createDetectionInstance and handleSignDetection.
First, let's begin with createDetectionInstance, which is an integral part of the hand sign detection; then we will introduce handleSignDetection, which predicts and displays the hand sign.
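A sketch of createDetectionInstance using the @tensorflow-models/hand-pose-detection API; the runtime, modelType, and maxHands values are assumptions (full trades inference speed for accuracy, as described earlier):

```javascript
import * as handPoseDetection from '@tensorflow-models/hand-pose-detection';

const createDetectionInstance = async () => {
  const model = handPoseDetection.SupportedModels.MediaPipeHands;
  const detectorConfig = {
    runtime: 'tfjs',    // or 'mediapipe'
    modelType: 'full',  // 'lite' is faster, 'full' is more accurate
    maxHands: 1,
  };
  // Resolves to a detector exposing estimateHands()
  return handPoseDetection.createDetector(model, detectorConfig);
};
```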
To be able to detect hand poses, we need to create an instance of the handpose detector, so we created an asynchronous function called createDetectionInstance.
You can check out this TensorFlow blog post to see more details.
Now that we have created an avenue to detect hand signs, let us start detecting the hand.
In that light, we will be adding a handleSignDetection function.
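A sketch of handleSignDetection; the 2-second interval follows the prose below, while the parameter names are assumptions:

```javascript
const handleSignDetection = (detector, videoEl) => {
  // Poll for hand signs every 2 seconds (arbitrary; tune as needed)
  setInterval(async () => {
    // Only run once the video element and detection instance both exist
    if (videoEl && detector) {
      const hands = await detector.estimateHands(videoEl);
      console.log(hands); // array of detected hands with keypoints
    }
  }, 2000);
};
```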
handleSignDetection runs after the detection instance is created. We have a setInterval that runs every 2 seconds (PS: _the 2-second timing is arbitrary and can be shorter or longer_) to check whether there is a hand sign. We also have a conditional statement to ensure that the video element exists and the detection instance was created accordingly.
The detector exposes an estimateHands method, which tries to predict the hand pose by returning keypoints with values in either 2D or 3D (multi-dimensional arrays).
If you check your console log, you will see an array of data if any hand pose is detected.
Now that we can detect hand poses, we will add fingerpose to help predict and display the letter that matches the hand sign.
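A sketch of the fingerpose matching step; the Handsigns import path and the minimum confidence value (7.5) are assumptions:

```javascript
import { GestureEstimator } from 'fingerpose';
import Handsigns from './handsigns'; // hypothetical folder of gesture descriptions

const GE = new GestureEstimator(Object.values(Handsigns));

const matchHandSign = (hand) => {
  // keypoints3D holds {x, y, z} landmarks; fingerpose expects [x, y, z] arrays
  const landmark = hand.keypoints3D.map((kp) => [kp.x, kp.y, kp.z]);
  // Returns { gestures: [{ name, score }], ... } for gestures above the
  // minimum confidence threshold
  return GE.estimate(landmark, 7.5);
};
```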
Assuming that our detector sensed a hand, it is time to match this value against the hand signs we created with fingerpose.
The landmark variable is a 3D array pulled from the hand result's keypoints3D property. There is also a keypoints property, which holds 2D values, and both will give the same result.
Now, using GE.estimate, we can generate the possible gestures that match the sign, and a score/confidence is assigned to each predicted gesture. The gesture with the highest score/confidence is selected, since it is estimated to be the closest to the hand sign among the fingerpose hand signs we created.
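The highest-score selection can be sketched as a small helper; the gesture names and scores below are made up for illustration:

```javascript
// Picks the highest-scoring gesture from a fingerpose-style estimate
// result of the shape { gestures: [{ name, score }, ...] }
function pickBestGesture(estimate) {
  if (!estimate.gestures || estimate.gestures.length === 0) return null;
  return estimate.gestures.reduce((best, g) => (g.score > best.score ? g : best));
}

const result = pickBestGesture({
  gestures: [
    { name: 'A', score: 6.2 },
    { name: 'B', score: 9.1 },
    { name: 'L', score: 7.4 },
  ],
});
// result.name === 'B'
```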
We also imported Handsigns, and its content looks like this:
You can also get the handsigns folder from the 100-ms-vue repository. Each definition is a GestureDescription instance that takes a string, A in this case, representing what the hand sign will stand for. So, it could be anything you want the hand sign to stand for.
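As a hedged sketch, one entry in such a folder might look like this, using fingerpose's GestureDescription API (the exact curls below are an assumption, roughly modelling the ASL letter A: thumb extended, other fingers fully curled):

```javascript
import { GestureDescription, Finger, FingerCurl } from 'fingerpose';

// "A" is the label this gesture will be reported as
const aSign = new GestureDescription('A');

// Thumb stays extended; the last argument is a confidence weight
aSign.addCurl(Finger.Thumb, FingerCurl.NoCurl, 1.0);

// All other fingers are fully curled into the palm
for (const finger of [Finger.Index, Finger.Middle, Finger.Ring, Finger.Pinky]) {
  aSign.addCurl(finger, FingerCurl.FullCurl, 1.0);
}

export default aSign;
```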
>onMounted is asynchronous because we need to ensure that our detection instance is created before we can detect the hand sign.
With the updated code, you should be able to display some letters.
Conclusion
Don't forget, you can see in detail how this was implemented in one of This Dot Labs' open-source projects, 100-ms-vue. Please note that what we did is just a basic implementation; a production-ready version would need a bigger model and more complex detection to reliably identify hand sign language.