
Detect Hand Sign Languages with Tensorflow

This article was written over 18 months ago and may contain information that is out of date. Some content may be relevant but please refer to the relevant official documentation or available resources for the latest information.

Interested in learning how to use Tensorflow to detect hand sign languages in your apps? By the end of this read, you will know how to implement Tensorflow in your application with very simple steps. In our example today, we will be using Vue.

What is Tensorflow?

TensorFlow is an open-source, end-to-end platform for building machine learning applications; "end-to-end" here means it covers the whole workflow, from developing a model to delivering it in a working application. TensorFlow lets you build dataflow graphs that define how data moves through a series of operations, taking its inputs as multi-dimensional arrays called tensors. You can read more on Tensorflow here.

What is a Model?

A model is a function with learnable parameters that maps an input to an output. A well-trained model will provide an accurate mapping from the input to the desired output.

Tensorflow Models

Tensorflow models are pre-trained models, and there are four defined categories of them:

  • Vision: Analyze features in images and videos.
  • Body: Detect key points and poses on the face, hands, and body with models from MediaPipe.
  • Text: Enable NLP in your web app using the power of BERT and other Transformer encoder architectures.
  • Audio: Classify audio to detect sounds.

If you want to go into more detail, check out Tensorflow Models.

Each of these categories is broken down into individual models. For our case, we will use the hand pose detection model from the Body category, which is what we need in order to detect hand signs.

Hand Pose Detection

This model predicts the keypoints of each hand and returns them as multi-dimensional arrays of 2D and 3D coordinates.

An example of 2D keypoints is [[1,2],[3,5],[7,8],[20,44]], and an example of 3D keypoints is [[1,2,5],[3,5,8],[7,8,6],[20,44,100]].

This hand pose detection model comes from MediaPipe, as we established above, and it provides two model types: lite and full. Prediction accuracy increases from lite to full while inference speed decreases, i.e. the response time gets slower as accuracy increases.

What do we need?

There are a few dependencies we need to get things working, and I will also be assuming that you already have your project set up.

You will need to add these dependencies to the project:


yarn add @tensorflow-models/hand-pose-detection

# Run the below commands if you want to use TF.js runtime.
yarn add @tensorflow/tfjs-core @tensorflow/tfjs-converter
yarn add @tensorflow/tfjs-backend-webgl

# Run the below commands if you want to use MediaPipe runtime.
yarn add @mediapipe/hands

yarn add fingerpose

Above, in the commands, you will notice we added fingerpose. Let's talk a little about what we need fingerpose for.

Fingerpose

Fingerpose is a gesture classifier for hand landmarks detected by MediaPipe hand pose detection. It also allows you to add your own hand gestures, so a gesture that signifies the letter Z could instead signify Hello, depending on your fingerpose data. We will see an example of how the data looks in a bit. You can check out fingerpose for more details.

Get started

We are going to use Vue for this illustration. We will start by looking at the HTML first, and then we will cover the JavaScript.

Our template will be basic HTML with a video tag, so we can show the video once we get access to the webcam.

Template

<template>
  <div class="wrapper">
    <video
      ref="videoCam"
      class="peer-video"
      preload="auto"
      autoPlay
      muted
      playsInline
    />
  </div>
</template>

The snippet above shows a div and a video tag. The video tag displays the stream once we gain access to the webcam.

We will now write the JS required to initialize the webcam.

Script


<script setup>
import { onMounted, ref } from "vue";

const videoCam = ref();
function openCam() {
  let all_mediaDevices = navigator.mediaDevices;

  if (!all_mediaDevices || !all_mediaDevices.getUserMedia) {
    console.log("getUserMedia() not supported.");
    return;
  }
  all_mediaDevices.getUserMedia({
    audio: true,
    video: true,

  })
    .then(function (vidStream) {
      if ("srcObject" in videoCam.value) {
        videoCam.value.srcObject = vidStream;
      } else {
        videoCam.value.src = window.URL.createObjectURL(vidStream);
      }
      videoCam.value.onloadedmetadata = function () {
        videoCam.value.play();
      };
    })
    .catch(function (e) {
      console.log(e.name + ": " + e.message);
    });
}
onMounted(() => {
  openCam();
});
</script>

We imported two functions from vue: onMounted and ref. onMounted runs a callback once the component has been mounted, while ref declares a reactive value that we use to reference the video element. If you look at the video tag in the template, you will notice a ref attribute with the same name. You can check out Template ref and onMounted lifecycle hook.

In the openCam function, we first check whether mediaDevices is available on the browser's navigator object.

The MediaDevices interface provides access to connected media input devices like cameras and microphones, as well as screen sharing. In essence, it lets you obtain access to any hardware source of media data.

The MediaDevices interface has a getUserMedia method, which prompts the user for permission to use a media input. You can find all you need to know about getUserMedia here.

From the snippet, we can see that getUserMedia returns a promise, so we can get the media stream in then(). We check whether the video element has a srcObject property. If it does, we assign the media stream to srcObject; if not, we convert the media stream to a URL and assign it to the src of the video element.
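
As a rough sketch of that fallback logic in isolation, the helper below attaches a stream to a video-like element. The names attachStream and createUrl are hypothetical and not part of the original code; createUrl stands in for window.URL.createObjectURL so the logic can be tried outside a browser.

```javascript
// Hypothetical helper (not from the original project): attach a media
// stream to a <video>-like element, preferring srcObject when available.
function attachStream(videoEl, stream, createUrl) {
  if ("srcObject" in videoEl) {
    videoEl.srcObject = stream;
    return "srcObject";
  }
  // Fallback for elements without srcObject: use a URL for the stream.
  videoEl.src = createUrl(stream);
  return "src";
}

// Plain objects stand in for a modern and a legacy video element.
const modernVideo = { srcObject: null };
const legacyVideo = { src: "" };
const fakeStream = { id: "demo-stream" };

const modernPath = attachStream(modernVideo, fakeStream, () => "blob:demo");
const legacyPath = attachStream(legacyVideo, fakeStream, () => "blob:demo");
console.log(modernPath, legacyPath);
```

In the component, videoCam.value would be passed as videoEl inside the then() callback.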

With this snippet and a few styles, you should have your video showing your awesome face!

webcam-works

Introducing Tensorflow and Hand Detection

Now that we have our webcam working, we will update the template and the script to detect, predict, and display a letter of the alphabet based on the hand sign prediction.

The updated HTML should now look like this:

<template>
  <div class="wrapper">
    <video
      ref="videoCam"
      class="peer-video"
      preload="auto"
      autoPlay
      muted
      playsInline
    />
    <div class="alphabet">{{ sign }}</div>
  </div>
</template>

The div with the class name alphabet will display the letter predicted from the hand sign.

We will be introducing two new functions, createDetectionInstance and handleSignDetection.

First, let's begin with createDetectionInstance, which is an integral part of the hand sign detection. Then we will introduce handleSignDetection, which predicts and displays the hand sign.


<script setup lang="ts">
import { onMounted, ref } from "vue";
import * as handPoseDetection from "@tensorflow-models/hand-pose-detection";

let detector;
const videoCam = ref();

function openCam() {
   …
}

const createDetectionInstance = async () => {
  const model = handPoseDetection.SupportedModels.MediaPipeHands;
  const detectorConfig = {
    runtime: "mediapipe",
    modelType: "lite",
    solutionPath: "https://cdn.jsdelivr.net/npm/@mediapipe/hands/",
  } as const;
  detector = await handPoseDetection.createDetector(model, detectorConfig);
};

onMounted(async () => {
  openCam();
  await createDetectionInstance();
});
</script>

To be able to detect hand poses, we need to create an instance of the hand pose detector. Here, we created an asynchronous function, createDetectionInstance, that does this.

You can check out this Tensorflow blog to see more details.
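
To make the lite/full trade-off easier to experiment with, the detector configuration can be built by a small function. This is only a sketch: buildDetectorConfig is a hypothetical name, and the runtime and solutionPath values are simply copied from the snippet above.

```javascript
// Hypothetical helper: builds the config object that the article passes
// to handPoseDetection.createDetector. Only modelType varies here.
function buildDetectorConfig(modelType = "lite") {
  const allowed = ["lite", "full"];
  if (!allowed.includes(modelType)) {
    throw new Error(`Unknown modelType "${modelType}", expected lite or full`);
  }
  return {
    runtime: "mediapipe",
    modelType,
    solutionPath: "https://cdn.jsdelivr.net/npm/@mediapipe/hands/",
  };
}

const liteConfig = buildDetectorConfig();
const fullConfig = buildDetectorConfig("full");
console.log(liteConfig.modelType, fullConfig.modelType);
```

The returned object would then be passed as the second argument to createDetector inside createDetectionInstance.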

Now that we have created an avenue to detect hand signs, let us start detecting the hand.

In that light, we will be adding a handleSignDetection function.


<script setup lang="ts">
import { onMounted, ref } from "vue";
import * as handPoseDetection from "@tensorflow-models/hand-pose-detection";

let detector;
const videoCam = ref();

function openCam() {
  …
}

const createDetectionInstance = async () => {
  const model = handPoseDetection.SupportedModels.MediaPipeHands;
  const detectorConfig = {
    runtime: "mediapipe",
    modelType: "lite",
    solutionPath: "https://cdn.jsdelivr.net/npm/@mediapipe/hands/",
  } as const;
  detector = await handPoseDetection.createDetector(model, detectorConfig);
};

const handleSignDetection = () => {
  if (!videoCam.value || !detector) return;
  setInterval(async () => {
    const hands = await detector.estimateHands(videoCam.value);
    if (hands.length > 0) {
      console.log(hands)
    }
  }, 2000);
};
onMounted(async () => {
  openCam();
  await createDetectionInstance();
  handleSignDetection();
});
</script>

handleSignDetection runs after the detection instance is created. It first checks that the video element exists and that the detection instance was created. It then starts a setInterval that runs every 2 seconds (the 2-second timing is arbitrary and can be shorter or longer) to check whether a hand is in view.
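
One thing to watch with setInterval and an async callback is that a new run starts even if the previous one has not finished, and the timer keeps going after the component is gone. A minimal sketch of an alternative, assuming a hypothetical startPolling helper that is not part of the original code, schedules the next run only after the current one completes and returns a function to stop it:

```javascript
// Hypothetical helper: run an async task repeatedly, waiting for each run
// to finish before scheduling the next one. Returns a stop function.
function startPolling(task, intervalMs) {
  let stopped = false;
  let timer = null;

  const tick = async () => {
    if (stopped) return;
    try {
      await task();
    } catch (err) {
      console.error("polling task failed:", err);
    }
    // Only schedule the next run if nobody stopped us in the meantime.
    if (!stopped) timer = setTimeout(tick, intervalMs);
  };

  timer = setTimeout(tick, intervalMs);

  return () => {
    stopped = true;
    clearTimeout(timer);
  };
}

// Demo with a stand-in task; in the component this would call estimateHands.
let demoRuns = 0;
const stopDemo = startPolling(async () => {
  demoRuns += 1;
}, 10);
setTimeout(stopDemo, 100);
```

In a Vue component, the returned stop function could be called from onUnmounted so the polling ends with the component.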

The detector has an estimateHands method, which predicts the hand pose and returns its keypoints as both 2D and 3D values (multi-dimensional arrays).

If you check your console log, you will see an array of data if any hand pose is detected.
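
For orientation, each entry in that array describes one detected hand. The sketch below uses made-up sample values (a real result has 21 keypoints per hand) and a hypothetical describeHands helper to summarize what was logged:

```javascript
// Illustrative sample in the shape returned by estimateHands.
// The numbers are invented and only two keypoints are listed.
const sampleHands = [
  {
    score: 0.8,
    handedness: "Right",
    keypoints: [
      { x: 105, y: 107, name: "wrist" },
      { x: 108, y: 160, name: "pinky_finger_tip" },
    ],
    keypoints3D: [
      { x: 0.0039, y: -0.0205, z: 0.0217, name: "wrist" },
      { x: -0.0251, y: -0.0255, z: -0.0051, name: "pinky_finger_tip" },
    ],
  },
];

// Hypothetical helper: one readable line per detected hand.
function describeHands(hands) {
  return hands.map(
    (hand) =>
      `${hand.handedness} hand, score ${hand.score.toFixed(2)}, ` +
      `${hand.keypoints.length} keypoints`
  );
}

console.log(describeHands(sampleHands));
```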

Now that we can detect hand poses, we will add fingerpose, which will help predict and display the letter based on the hand sign.


<script setup lang="ts">
import { onMounted, ref } from "vue";
import * as handPoseDetection from "@tensorflow-models/hand-pose-detection";
import * as fp from "fingerpose";
import Handsigns from "@/utils/handsigns";

let detector;
const videoCam = ref();
const sign = ref(null);

function openCam() {
  …
}

const createDetectionInstance = async () => {
  const model = handPoseDetection.SupportedModels.MediaPipeHands;
  const detectorConfig = {
    runtime: "mediapipe",
    modelType: "lite",
    solutionPath: "https://cdn.jsdelivr.net/npm/@mediapipe/hands/",
  } as const;
  detector = await handPoseDetection.createDetector(model, detectorConfig);
};

const handleSignDetection = () => {
  if (!videoCam.value || !detector) return;
  setInterval(async () => {
    const hands = await detector.estimateHands(videoCam.value);
    if (hands.length > 0) {
      const GE = new fp.GestureEstimator([
        fp.Gestures.ThumbsUpGesture,
        Handsigns.aSign,
        Handsigns.bSign,
        Handsigns.cSign,
        …
        Handsigns.zSign,
      ]);

      const landmark = hands[0].keypoints3D.map(
        (value) => [
          value.x,
          value.y,
          value.z,
        ]
      );
      const estimatedGestures = await GE.estimate(landmark, 6.5);

      if (estimatedGestures.gestures && estimatedGestures.gestures.length > 0) {
        const confidence = estimatedGestures.gestures.map((p) => p.score);
        const maxConfidence = confidence.indexOf(
          Math.max.apply(undefined, confidence)
        );

        sign.value = estimatedGestures.gestures[maxConfidence].name
      }
    }
  }, 2000);
};
onMounted(async () => {
  openCam();
  await createDetectionInstance();
  handleSignDetection();
});
</script>

Assuming that our detector sensed a hand, it is time to match the result against the hand signs we created with fingerpose.

The landmark variable is an array of [x, y, z] points built from the keypoints3D property of the hand result. There is also a keypoints property, which holds 2D values, and both will give the same result.
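
The conversion from keypoint objects to plain arrays can be pulled out into its own function. This is a sketch with a hypothetical toLandmarks helper and invented sample values; it returns an empty array when keypoints3D is missing:

```javascript
// Hypothetical helper: turn a hand's keypoints3D objects into the
// [x, y, z] arrays that are handed to the gesture estimator.
function toLandmarks(hand) {
  const points = (hand && hand.keypoints3D) || [];
  return points.map(({ x, y, z }) => [x, y, z]);
}

// Invented sample with two keypoints instead of a full hand.
const sampleHand = {
  keypoints3D: [
    { x: 0.1, y: 0.2, z: 0.3, name: "wrist" },
    { x: 0.4, y: 0.5, z: 0.6, name: "thumb_tip" },
  ],
};

console.log(toLandmarks(sampleHand));
```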

Now, using GE.estimate, we get a list of possible gestures that match the sign, each with a score/confidence. Only gestures that reach the minimum score passed as the second argument (6.5 here) are returned. The gesture with the highest score is then selected, since it is estimated to be the closest match among the fingerpose hand signs we created.
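
The same selection can be written with reduce. The sketch below assumes the gestures array has the shape used in the snippet above (objects with name and score); pickBestGesture is a hypothetical helper and the scores are invented:

```javascript
// Hypothetical helper: return the gesture with the highest score,
// or null when nothing was matched. On a tie, the first one wins.
function pickBestGesture(gestures) {
  if (!gestures || gestures.length === 0) return null;
  return gestures.reduce((best, gesture) =>
    gesture.score > best.score ? gesture : best
  );
}

// Invented scores for illustration.
const sampleGestures = [
  { name: "A", score: 7.1 },
  { name: "B", score: 9.3 },
  { name: "thumbs_up", score: 8.0 },
];

const bestGesture = pickBestGesture(sampleGestures);
console.log(bestGesture ? bestGesture.name : "no match");
```

Returning null for an empty list keeps the caller's check simple: sign.value is only updated when a gesture was actually found.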

We also imported Handsigns and its content looks like this:

Asign

You can also get the handsigns folder from the 100-ms-vue repository. Looking at the screenshot, there is a GestureDescription instance that takes a string, A, which is the name the hand sign will stand for. It could be anything you want the hand sign to stand for.

The onMounted callback is asynchronous because we need to wait until the detection instance is created before we start detecting hand signs.

With the updated code, you should be able to display some letters.

webcam-signs

Conclusion

Don't forget, you can see in detail how this was implemented in one of This Dot Labs' open source projects, 100-ms-vue. Please note that what we did is just a basic implementation. A production-ready version would need a bigger model and more complex detection to be able to identify hand sign language.
