Eye Tracking Algorithm
The eye tracking algorithm is the core component of the system, responsible for accurately estimating the user's eye gaze and mapping it to the corresponding screen coordinates. This section delves into the details of facial landmark detection and eye gaze estimation techniques employed in the project.
Facial Landmark Detection
Facial landmark detection plays a crucial role in the eye tracking algorithm. It involves identifying and locating specific facial features, particularly the eyes, which serve as the basis for gaze estimation.
MediaPipe Library
The project utilizes the MediaPipe library, a powerful open-source framework developed by Google, for facial landmark detection. MediaPipe offers a pre-trained face mesh model that can accurately detect and track 468 facial landmarks in real time (478 when the optional iris refinement is enabled). The face mesh model is based on a deep learning architecture that employs a convolutional neural network (CNN) to analyze the input image and predict the positions of the facial landmarks. The model has been trained on a large dataset of diverse faces, enabling it to handle variations in facial features, expressions, and head poses.
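For reference, the following is a minimal sketch of how the face mesh model can be loaded and run on a single webcam frame with the Python MediaPipe API; the exact configuration values used in the project are assumptions.

import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
face_mesh = mp_face_mesh.FaceMesh(
    max_num_faces=1,           # the system tracks a single user
    refine_landmarks=True,     # adds the iris landmarks (indices 468-477)
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    # MediaPipe expects RGB input, while OpenCV captures BGR frames
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        landmarks = results.multi_face_landmarks[0].landmark
        print(len(landmarks))  # 478 with refine_landmarks=True
cap.release()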
Eye Landmark Extraction
Among the facial landmarks detected by the MediaPipe face mesh model, those corresponding to the eyes are of particular interest for eye gaze estimation. The project focuses on extracting the landmarks associated with the left and right eyes, identified by their indices in the face mesh. With iris refinement enabled, the model outputs 478 landmarks, and indices 468 and 473 correspond to the centers of the left and right irises, respectively. These landmarks provide the coordinates of the eye centers in image space.
By extracting the eye landmarks from the face mesh, the system obtains precise information about the position and orientation of the eyes, which forms the foundation for subsequent gaze estimation steps.
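A small sketch of this extraction step is shown below; the extract_eye_centers helper is hypothetical, and it assumes a landmark list produced by FaceMesh with refine_landmarks=True together with the frame width and height in pixels.

LEFT_IRIS_CENTER = 468    # center of the left iris in the refined face mesh
RIGHT_IRIS_CENTER = 473   # center of the right iris in the refined face mesh

def extract_eye_centers(landmarks, frame_width, frame_height):
    """Return the left and right iris centers in pixel coordinates."""
    left = landmarks[LEFT_IRIS_CENTER]
    right = landmarks[RIGHT_IRIS_CENTER]
    # MediaPipe landmark coordinates are normalized to [0, 1], so scale them
    # by the frame size to obtain image-space pixel coordinates.
    return ((left.x * frame_width, left.y * frame_height),
            (right.x * frame_width, right.y * frame_height))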
Eye Gaze Estimation
Once the eye landmarks are extracted, the next step is to estimate the user's eye gaze direction and map it to the corresponding screen coordinates. Eye gaze estimation involves analyzing the relative positions of the eye landmarks and applying mathematical transformations to determine the gaze point on the screen.
Average Eye Coordinate Calculation
To estimate the overall gaze direction, the system calculates the average coordinates of the left and right eye landmarks by taking the mean of their x and y coordinates separately. The average eye coordinates provide a more stable and robust representation of the gaze direction, as they smooth out slight variations or noise in the individual eye landmark positions. By considering both eyes together, the system can better handle cases where one eye is partially occluded or less accurately detected.
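As a sketch, the averaging step reduces to taking the midpoint of the two eye centers returned by the hypothetical extract_eye_centers helper above.

def average_eye_coordinates(left_eye, right_eye):
    """Mean of the x and y coordinates of the two eye centers."""
    avg_x = (left_eye[0] + right_eye[0]) / 2.0
    avg_y = (left_eye[1] + right_eye[1]) / 2.0
    return avg_x, avg_y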
Mapping Eye Coordinates to Screen Coordinates
The final step in eye gaze estimation is mapping the average eye coordinates from the image space to the corresponding screen coordinates. This mapping allows the system to determine the precise location on the screen where the user is looking. To achieve accurate mapping, the system relies on a calibration process (described in detail in the next section) that establishes a relationship between the eye coordinates and the screen coordinates. During calibration, the user is prompted to look at specific points on the screen, and the corresponding eye coordinates are recorded.
Using the collected calibration data, the system computes a projective transformation matrix that maps the eye coordinates to the screen coordinates. This transformation accounts for individual variations in eye position and for the geometric relationship between the camera and the screen. Once the transformation matrix is obtained, the system can apply it to the average eye coordinates in real time to estimate the gaze point on the screen. The resulting screen coordinates are then used to control the mouse cursor or to perform other desired actions based on the user's gaze. By combining facial landmark detection and eye gaze estimation techniques, the eye tracking algorithm enables accurate and responsive gaze-based interaction with the system.
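As a minimal sketch, the real-time application of the transformation can be expressed with OpenCV's perspectiveTransform function; here M is assumed to be the 3x3 matrix produced by the calibration step sketched in the next subsection.

import numpy as np
import cv2

def gaze_to_screen(avg_eye, M):
    """Map an averaged eye coordinate (image space) to screen coordinates."""
    # cv2.perspectiveTransform expects a float32 array of shape (N, 1, 2)
    src = np.float32([[avg_eye]])
    dst = cv2.perspectiveTransform(src, M)
    return float(dst[0, 0, 0]), float(dst[0, 0, 1])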
Projective Transformation
Once the calibration data is collected, the system computes a projective transformation matrix to map the eye coordinates to the screen coordinates. Projective transformation allows for accurate mapping even in the presence of perspective distortion caused by the camera-screen setup.
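As an illustration, the sketch below computes such a matrix from four calibration correspondences using OpenCV's getPerspectiveTransform; the eye-coordinate values are hypothetical, and a 1920x1080 screen is assumed.

import numpy as np
import cv2

# Averaged eye coordinates recorded while the user looked at the four screen
# corners (image pixels; illustrative values only).
eye_points = np.float32([[310, 222], [352, 220], [350, 248], [312, 250]])

# Corresponding screen-corner coordinates for an assumed 1920x1080 display.
screen_points = np.float32([[0, 0], [1920, 0], [1920, 1080], [0, 1080]])

# 3x3 projective (homography) matrix from eye space to screen space.
M = cv2.getPerspectiveTransform(eye_points, screen_points)

# Applying the matrix with the gaze_to_screen helper from the previous sketch
# maps a live eye coordinate to an on-screen position.
print(gaze_to_screen((330, 235), M))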
Four-Surface Projective Transformer
The system implements a four-surface projective transformer, which divides the screen into four quadrilateral regions. Each region is defined by its four corners, and a separate projective transformation matrix is computed for each region. The four-surface approach enhances the accuracy of the gaze mapping by accounting for variations in eye positions across different screen regions. It allows for more precise control and smoother transitions between regions. During the calibration process, the system associates the captured eye coordinates with their corresponding quadrilateral regions. The projective transformation matrices are then calculated using the OpenCV library's getPerspectiveTransform function, which takes the source (eye) and destination (screen) coordinates as input.
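The sketch below illustrates one possible form of the four-surface transformer, assuming a 3x3 calibration grid (four corners, four edge midpoints, and the center) whose eye and screen coordinates are stored as nested lists of (x, y) points; selecting the quadrant by comparing against the center calibration point is one simple scheme, and the project's exact region layout may differ.

import numpy as np
import cv2

def build_quadrant_matrices(eye_grid, screen_grid):
    """One projective matrix per screen quadrant, keyed by (row, col) in {0, 1}."""
    matrices = {}
    for r in range(2):
        for c in range(2):
            # Each quadrilateral is bounded by four neighboring grid points.
            src = np.float32([eye_grid[r][c], eye_grid[r][c + 1],
                              eye_grid[r + 1][c + 1], eye_grid[r + 1][c]])
            dst = np.float32([screen_grid[r][c], screen_grid[r][c + 1],
                              screen_grid[r + 1][c + 1], screen_grid[r + 1][c]])
            matrices[(r, c)] = cv2.getPerspectiveTransform(src, dst)
    return matrices

def map_with_quadrants(avg_eye, matrices, eye_center):
    """Select the quadrant by comparing the sample with the center calibration
    point, then apply that quadrant's projective transformation."""
    r = 1 if avg_eye[1] >= eye_center[1] else 0
    c = 1 if avg_eye[0] >= eye_center[0] else 0
    pt = np.float32([[avg_eye]])
    out = cv2.perspectiveTransform(pt, matrices[(r, c)])
    return float(out[0, 0, 0]), float(out[0, 0, 1])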
Outlier Removal and Averaging
To improve the robustness and stability of the calibration process, the system incorporates outlier removal and averaging techniques. Outlier removal mitigates the impact of erroneous or inconsistent eye coordinate samples captured during calibration. The system implements a simple approach based on distance from the mean: it calculates the mean eye coordinates for each calibration point and removes a fixed percentage (e.g., 30%) of the samples that deviate the most from that mean.

After outlier removal, the system calculates the average eye coordinates for each calibration point. Averaging smooths out any remaining noise or variation in the eye coordinates, resulting in a more reliable and stable calibration.

The final calibration result is a set of projective transformation matrices that map the eye coordinates to the screen coordinates in real time. These matrices are stored and applied during the gaze tracking phase to estimate the user's gaze point on the screen. By combining user guidance, projective transformation, outlier removal, and averaging, the calibration algorithm ensures accurate and reliable mapping of the user's gaze to the screen coordinates.
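A sketch of the outlier-removal and averaging step for a single calibration point is shown below; the 30% rejection rate follows the example given above, and the filter_and_average helper is hypothetical.

import numpy as np

def filter_and_average(samples, reject_fraction=0.3):
    """Drop the samples furthest from the mean, then average the remainder.

    samples: array-like of shape (N, 2) holding the raw eye coordinates
    captured while the user looked at one calibration target.
    """
    pts = np.asarray(samples, dtype=np.float32)
    mean = pts.mean(axis=0)
    distances = np.linalg.norm(pts - mean, axis=1)
    keep = max(1, int(round(len(pts) * (1.0 - reject_fraction))))
    kept = pts[np.argsort(distances)[:keep]]   # samples closest to the mean
    return kept.mean(axis=0)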
Distance Checking Algorithm
The distance checking algorithm is an essential component of the eye gaze tracking system that ensures the user maintains an appropriate distance from the camera for accurate gaze tracking. This section delves into the details of the triangle area calculation, distance estimation, and user feedback and guidance mechanisms employed in the project.
Triangle Area Calculation
The distance checking algorithm relies on the area of a triangle formed by specific facial landmarks to estimate the user's distance from the camera. The triangle is defined by the points corresponding to the left eye, the right eye, and the nose tip.

In the system, the DistanceChecker class implements the distance checking functionality. It uses the MediaPipe face mesh model to detect and extract the relevant facial landmarks from the input video frame. The check_distance method of the DistanceChecker class performs the triangle area calculation. It retrieves the coordinates of the left eye (landmark index 468), the right eye (landmark index 473), and the nose tip (landmark index 1) from the face mesh landmarks. The area of the triangle is calculated using the shoelace formula:

triangle_area = abs((left_x * (right_y - nose_y) + right_x * (nose_y - left_y) + nose_x * (left_y - right_y)) / 2)

where left_x, left_y, right_x, right_y, nose_x, and nose_y are the x and y coordinates of the left eye, right eye, and nose tip, respectively.

The triangle area serves as an indicator of the user's distance from the camera: as the user moves closer to the camera, the triangle area increases, and as the user moves away, the triangle area decreases.
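The computation performed by check_distance can be sketched as follows; the area thresholds are illustrative assumptions rather than the project's actual tuning values.

def triangle_area(left, right, nose):
    """Shoelace formula for the triangle spanned by the two eyes and the nose tip."""
    (left_x, left_y), (right_x, right_y), (nose_x, nose_y) = left, right, nose
    return abs(left_x * (right_y - nose_y)
               + right_x * (nose_y - left_y)
               + nose_x * (left_y - right_y)) / 2.0

def check_distance(left, right, nose, min_area=1500.0, max_area=6000.0):
    """Return a coarse verdict on the user's distance from the camera."""
    area = triangle_area(left, right, nose)
    if area > max_area:
        return "too_close"   # large triangle: the face fills more of the frame
    if area < min_area:
        return "too_far"     # small triangle: the face is far from the camera
    return "ok"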