A device to track attendance using facial and gesture recognition
In Fall 2022, our club began working on the Smart Attendance project. The goal is to build a device that tracks attendance at our club meetings using AI technologies such as facial and gesture recognition. Four teams are working on the project: front-end, back-end, database, and coding.
Users interact with the application through a stream of video frames from a webcam. Two components process that stream: hand gesture recognition and face recognition. Hand gesture recognition lets users control the application with a set of pre-defined gestures such as thumbs up and thumbs down, whereas face recognition uniquely identifies members to track attendance.
To develop the hand gesture recognition component, we use the MediaPipe library to convert each image frame into a set of 3D hand keypoints. MediaPipe is an open-source package developed by Google in 2019 that offers customizable machine learning solutions for streaming media. Once the hand keypoints are extracted from a video frame, we train our own classifier on them to identify the gestures that drive the user interface. For this step, we found that a simple support vector machine (SVM) is sufficient. Currently, the application can identify the gestures “one”, “two”, “three”, “thumbs up” and “thumbs down”, each of which is assigned a specific role in the interface. In the future, this set could easily be expanded by re-training the classifier with fewer than 50 labeled images for each new gesture.
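The classification step can be sketched roughly as follows. This is a minimal illustration and not our production code: it assumes scikit-learn's `SVC` as the SVM, uses randomly generated keypoints in place of real MediaPipe output, and the wrist-relative normalization in `to_feature_vector` is an assumption added for the example. The helper names (`to_feature_vector`, `make_synthetic_dataset`, `classify`) and the hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

NUM_LANDMARKS = 21  # MediaPipe Hands reports 21 landmarks per hand
GESTURES = ["one", "two", "three", "thumbs up", "thumbs down"]


def to_feature_vector(landmarks):
    """Flatten (21, 3) keypoints into a 63-D vector.

    Assumption for this sketch: coordinates are made relative to the wrist
    (landmark 0) and scaled so that hand position and size matter less.
    """
    pts = np.asarray(landmarks, dtype=float).reshape(NUM_LANDMARKS, 3)
    pts = pts - pts[0]
    scale = np.linalg.norm(pts, axis=1).max()
    return (pts / scale).ravel() if scale > 0 else pts.ravel()


def make_synthetic_dataset(samples_per_gesture=40, noise=0.02, seed=0):
    """Stand-in for labeled images: one random hand pose per gesture plus jitter."""
    rng = np.random.default_rng(seed)
    templates = rng.uniform(0.0, 1.0, size=(len(GESTURES), NUM_LANDMARKS, 3))
    features, labels = [], []
    for label, template in enumerate(templates):
        for _ in range(samples_per_gesture):
            noisy = template + rng.normal(0.0, noise, size=template.shape)
            features.append(to_feature_vector(noisy))
            labels.append(label)
    return np.array(features), np.array(labels), templates


X, y, templates = make_synthetic_dataset()

# Hypothetical hyperparameters; a real model would be tuned on real keypoints.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(X, y)


def classify(landmarks):
    """Return the gesture name predicted for one set of hand keypoints."""
    label = clf.predict([to_feature_vector(landmarks)])[0]
    return GESTURES[int(label)]
```

In a sketch like this, adding a gesture would mean appending its name to `GESTURES`, adding its labeled keypoint samples to the training set, and calling `fit` again.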
To develop the face recognition component, we have adopted the deep learning face model in Dlib, which has an accuracy of 99.38% on the standard “Labeled Faces in the Wild” benchmark. The model is a residual neural network (ResNet) with 29 convolutional layers, adapted from the well-known ResNet-34 by K. He et al. (2016). It takes in an image of a face and outputs an array of 128 floating-point numbers, which can be seen as an “encoded face” that is largely independent of how the picture was taken. In our application, we “encode” the face in each video frame into a 128D array. To check whether the faces appearing in two frames belong to the same person, we only have to compare the two 128D arrays, without analyzing the actual pixels of the original frames. In the few tests we ran, this approach was robust against lighting conditions, hairstyle, makeup, glasses, and so on, much as a human would be, and sometimes even better.
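The comparison of two encodings can be sketched as below. This is an illustrative sketch under stated assumptions, not a description of our exact matching code: it assumes similarity is measured by Euclidean distance with the 0.6 cutoff that Dlib's documentation suggests for this model, and it uses toy 128D vectors rather than real encodings. The names `face_distance`, `identify`, and `known_faces` are hypothetical.

```python
import math

# Assumption: Dlib's documentation suggests treating two encodings as the same
# person when their Euclidean distance is below 0.6. The cutoff used by the
# actual application may differ.
MATCH_THRESHOLD = 0.6


def face_distance(a, b):
    """Euclidean distance between two face encodings of equal length."""
    if len(a) != len(b):
        raise ValueError("encodings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(encoding, known_faces, threshold=MATCH_THRESHOLD):
    """Return the name of the closest known face, or None if none is close enough."""
    best_name, best_distance = None, float("inf")
    for name, known_encoding in known_faces.items():
        distance = face_distance(encoding, known_encoding)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance < threshold else None


# Toy 128-D vectors standing in for real encodings of two members.
known_faces = {
    "alice": [0.1] * 128,
    "bob": [-0.1] * 128,
}
probe = [0.1] * 127 + [0.15]   # very close to alice's encoding
stranger = [0.0] * 128         # not close to anyone in known_faces

match = identify(probe, known_faces)
no_match = identify(stranger, known_faces)
```

Because only 128 numbers per face are compared, a lookup like this stays cheap even when every frame is checked against all registered members.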
🤖 Try out the demo here: https://test.msuaiclub.com/