Implementation

Pose Extraction

We use MediaPipe to extract the 3D skeletal animation of the player from videos of tennis play. For this, we extract the pose coordinates of 15 joints in world space (with the origin at the centre of the hips), measured in metres. The joints we consider are: head, neck, elbows, heels, wrists, hips, knees, shoulders, and torso. We output the pose data as a dictionary where the keys are the joint names and the values are arrays containing the 3D joint coordinates for each frame.
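A minimal sketch of this extraction step is shown below; it uses MediaPipe's pose_world_landmarks output, and the derivations of the neck, torso, and head joints from nearby landmarks are illustrative assumptions rather than our exact implementation.

```python
# Sketch of the pose extraction step (uses MediaPipe's world landmarks;
# the neck, torso, and head derivations below are illustrative assumptions).
import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose
L = mp_pose.PoseLandmark

# Joints taken directly from MediaPipe's 33 world landmarks.
DIRECT_JOINTS = {
    "left_shoulder": L.LEFT_SHOULDER, "right_shoulder": L.RIGHT_SHOULDER,
    "left_elbow": L.LEFT_ELBOW, "right_elbow": L.RIGHT_ELBOW,
    "left_wrist": L.LEFT_WRIST, "right_wrist": L.RIGHT_WRIST,
    "left_hip": L.LEFT_HIP, "right_hip": L.RIGHT_HIP,
    "left_knee": L.LEFT_KNEE, "right_knee": L.RIGHT_KNEE,
    "left_heel": L.LEFT_HEEL, "right_heel": L.RIGHT_HEEL,
}

def extract_pose(frames):
    """Return {joint_name: (num_frames, 3) array} in metres, hip-centred."""
    joints = {name: [] for name in DIRECT_JOINTS}
    joints.update({"head": [], "neck": [], "torso": []})
    with mp_pose.Pose() as pose:
        for frame in frames:
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_world_landmarks is None:
                continue  # in practice one might repeat the previous frame
            lm = result.pose_world_landmarks.landmark
            pts = {n: np.array([lm[i].x, lm[i].y, lm[i].z])
                   for n, i in DIRECT_JOINTS.items()}
            # Derived joints (assumption): neck = shoulder midpoint,
            # torso = midpoint of hips and shoulders, head = nose landmark.
            pts["neck"] = (pts["left_shoulder"] + pts["right_shoulder"]) / 2
            pts["torso"] = (pts["left_hip"] + pts["right_hip"]
                            + pts["left_shoulder"] + pts["right_shoulder"]) / 4
            pts["head"] = np.array([lm[L.NOSE].x, lm[L.NOSE].y, lm[L.NOSE].z])
            for name, p in pts.items():
                joints[name].append(p)
    return {name: np.array(coords) for name, coords in joints.items()}
```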


Shot Recognition Model

We use a Convolutional Neural Network (CNN) to classify tennis shots from 3D skeletal animations. Our model uses four classes: forehand, backhand, smash, and service. The input to the network is a skeletal animation represented as an RGB image, normalised to 15×100 pixels, so that each RGB pixel colour encodes the XYZ coordinates of one of the 15 skeleton joints at one of the 100 animation frames. We use TensorFlow to build and train the model in a Jupyter notebook.
Our training data are videos of tennis shots from the THETIS dataset. We pre-process these videos by extracting the skeletal animations with MediaPipe as described above and converting them into image format. The trained CNN model is saved as a single .h5 file.
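A minimal sketch of the image encoding and a network of this kind is shown below; the joint ordering, normalisation scheme, class ordering, and layer sizes are assumptions for illustration.

```python
# Sketch: encode a skeletal animation as a 15x100 RGB image and classify it.
# The normalisation scheme and network architecture are illustrative assumptions.
import numpy as np
import tensorflow as tf

JOINT_ORDER = ["head", "neck", "torso", "left_shoulder", "right_shoulder",
               "left_elbow", "right_elbow", "left_wrist", "right_wrist",
               "left_hip", "right_hip", "left_knee", "right_knee",
               "left_heel", "right_heel"]          # must match training order
CLASSES = ["backhand", "forehand", "service", "smash"]  # assumed ordering

def animation_to_image(pose, num_frames=100):
    """Pack XYZ coordinates into a (15, 100, 3) array scaled to [0, 1]."""
    img = np.zeros((len(JOINT_ORDER), num_frames, 3), dtype=np.float32)
    for row, name in enumerate(JOINT_ORDER):
        coords = np.asarray(pose[name])                       # (T, 3) in metres
        # Resample the animation to exactly num_frames frames.
        idx = np.linspace(0, len(coords) - 1, num_frames).astype(int)
        img[row] = coords[idx]
    # Assumed normalisation: map coordinates from roughly [-1, 1] m to [0, 1].
    return np.clip((img + 1.0) / 2.0, 0.0, 1.0)

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(15, 100, 3)),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D((1, 2)),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(images, labels, epochs=...); model.save("shot_classifier.h5")
```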


Shot Detection

To detect the frame intervals where shots occur, we consider the speeds of the left and right wrist joints. We take the swing speed at each frame to be the maximum of the left and right wrist speeds at that frame. We blur and downsample the swing speed signal to remove noise and form a coarser representation. We then find the frame intervals where the swing speed is greater than half a standard deviation above the median; we treat these intervals as shots. We discard detected shots that are too short (< 0.6 s) as invalid detections. We implement this data processing using NumPy and SciPy.
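A minimal sketch of this heuristic is shown below; the smoothing sigma and downsampling factor are assumed values.

```python
# Sketch of the shot detection heuristic. The smoothing sigma, downsampling
# factor, and interval extraction details are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def detect_shots(swing_speed, fps, min_duration=0.6, downsample=4, sigma=2.0):
    """Return (start_frame, end_frame) intervals where a shot occurs."""
    # Blur and downsample the per-frame swing speed to suppress noise.
    smoothed = gaussian_filter1d(np.asarray(swing_speed, dtype=float), sigma=sigma)
    coarse = smoothed[::downsample]
    # Threshold: half a standard deviation above the median.
    threshold = np.median(coarse) + 0.5 * np.std(coarse)
    active = coarse > threshold
    # Find the rising/falling edges of the above-threshold runs.
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    intervals = []
    for start, end in zip(edges[::2], edges[1::2]):
        start_frame, end_frame = start * downsample, end * downsample
        # Discard detections shorter than the minimum shot duration.
        if (end_frame - start_frame) / fps >= min_duration:
            intervals.append((start_frame, end_frame))
    return intervals
```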


Shot Analysis

We perform shot analysis on the pose data extracted from a video. We first calculate the speeds of the left and right wrist joints by computing the gradients of their 3D positions with NumPy and taking the magnitude. We then feed these speeds into our shot detection heuristic to find the shot intervals. For each shot interval, we take the corresponding sequence of skeletal animation frames of the player, convert these frames into image format (keeping the same joint ordering that our CNN recognition model is trained on), and feed them into the recognition model to obtain the shot classification.
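The wrist speed computation can be sketched as follows, assuming the pose dictionary layout from the extraction step:

```python
# Sketch: per-frame wrist speeds from the pose dictionary (assumed layout:
# pose["left_wrist"] is a (num_frames, 3) array in metres).
import numpy as np

def wrist_speeds(pose, fps):
    speeds = {}
    for side in ("left_wrist", "right_wrist"):
        velocity = np.gradient(pose[side], 1.0 / fps, axis=0)  # m/s per axis
        speeds[side] = np.linalg.norm(velocity, axis=1)        # scalar speed
    # Swing speed used by shot detection: per-frame max of the two wrists.
    swing_speed = np.maximum(speeds["left_wrist"], speeds["right_wrist"])
    return speeds, swing_speed
```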
We build a shot analysis dictionary where the keys are the names of the different analytics we consider and the values are arrays containing the corresponding entries for each detected shot. These include the start and end frame indices of the shot intervals, the shot classifications (with confidence scores), and the 3D joint coordinate frames for each shot. We also record the handedness of each shot by comparing the mean left and right wrist speeds over the shot interval, and we store the corresponding wrist speed in the dictionary as the shot speed.
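The resulting dictionary has roughly the following shape (the field names and values below are illustrative placeholders, not actual results):

```python
# Illustrative shape of the shot analysis dictionary; field names and values
# are placeholders. Each value is an array with one entry per detected shot.
analysis = {
    "start_frame":    [112, 398],              # first frame of each shot interval
    "end_frame":      [160, 451],              # last frame of each shot interval
    "classification": ["forehand", "service"], # CNN shot labels
    "confidence":     [0.93, 0.88],            # classifier confidence scores
    "handedness":     ["right", "right"],      # side with higher mean wrist speed
    "shot_speed":     [11.2, 14.7],            # corresponding wrist speed (m/s)
    "frames":         [...],                   # per-shot 3D joint coordinate frames
}
```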


Video Analysis

We use OpenCV to extract the frames of a video of tennis play, before feeding these frames into our pose extraction implementation. We then give the returned 3D pose data to our shot analysis implementation, which returns the analysis dictionary. We use this to create a .json file of the shot analysis. This contains an array of shot objects, where each shot object contains fields for each analytic. The .json file also contains the Frames-Per-Second (FPS) of the video.
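A minimal sketch of this driver is shown below; extract_pose and analyse_shots stand in for the pose extraction and shot analysis steps described above, and their names are assumptions.

```python
# Sketch of the video analysis driver: read frames with OpenCV, run the
# pipeline, and write the analysis JSON. Helper names are assumptions.
import cv2
import json

def analyse_video(video_path, output_path):
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
    capture.release()

    pose = extract_pose(frames)          # pose extraction (see sketch above)
    analysis = analyse_shots(pose, fps)  # assumed shot analysis helper; values
                                         # assumed JSON-serialisable (.tolist())
    # Transpose the per-analytic arrays into one object per detected shot.
    shots = [dict(zip(analysis, values)) for values in zip(*analysis.values())]
    with open(output_path, "w") as f:
        json.dump({"fps": fps, "shots": shots}, f)
```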
For each shot interval, we also use OpenCV to annotate the frames of the original video with the image-space joint landmarks extracted by MediaPipe, as well as the shot classification and confidence score. We save the full-length annotated video and the individual clips for each shot interval in .mp4 format.
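A sketch of the annotation step, using MediaPipe's drawing utilities; the label placement and codec choice are assumptions, and the per-frame MediaPipe results are assumed to be retained from the pose extraction step.

```python
# Sketch: annotate frames with MediaPipe landmarks and the shot label, then
# write the clip to .mp4. Label placement and codec are assumptions.
import cv2
import mediapipe as mp

drawing = mp.solutions.drawing_utils

def write_annotated_clip(frames, results, label, confidence, fps, out_path):
    height, width = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    for frame, result in zip(frames, results):
        if result.pose_landmarks:   # image-space landmarks from MediaPipe
            drawing.draw_landmarks(frame, result.pose_landmarks,
                                   mp.solutions.pose.POSE_CONNECTIONS)
        cv2.putText(frame, f"{label} ({confidence:.2f})", (20, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        writer.write(frame)
    writer.release()
```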


BVH Conversion

We convert the 3D pose data extracted by MediaPipe into the motion capture .bvh file format. This requires building a skeleton hierarchy based on joint connections and offsets. We consider the same 15 joints described previously and use the first frame of animation to find the joint offsets. We then use the 3D pose frames together with the joint hierarchy and offsets to compute the rotation of each joint for every frame. This allows us to animate the skeleton.
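As a rough illustration, the hierarchy and offsets could be written as in the sketch below; the parent-child connections are an assumed layout, and the per-frame rotation computation (the MOTION data) is omitted.

```python
# Sketch: write the HIERARCHY section of a .bvh file from first-frame offsets.
# The skeleton connectivity is an assumption, and the per-frame rotation
# computation (the MOTION section) is omitted here for brevity.
CHILDREN = {                     # assumed skeleton connectivity
    "torso": ["neck", "left_hip", "right_hip"],
    "neck": ["head", "left_shoulder", "right_shoulder"],
    "left_shoulder": ["left_elbow"], "left_elbow": ["left_wrist"],
    "right_shoulder": ["right_elbow"], "right_elbow": ["right_wrist"],
    "left_hip": ["left_knee"], "left_knee": ["left_heel"],
    "right_hip": ["right_knee"], "right_knee": ["right_heel"],
}

def write_joint(f, pose, name, parent_pos, depth):
    pos = pose[name][0]                       # first-frame position (metres)
    offset = pos - parent_pos
    pad = "  " * depth
    keyword = "ROOT" if depth == 0 else "JOINT"
    channels = ("6 Xposition Yposition Zposition Zrotation Xrotation Yrotation"
                if depth == 0 else "3 Zrotation Xrotation Yrotation")
    f.write(f"{pad}{keyword} {name}\n{pad}{{\n")
    f.write(f"{pad}  OFFSET {offset[0]:.4f} {offset[1]:.4f} {offset[2]:.4f}\n")
    f.write(f"{pad}  CHANNELS {channels}\n")
    children = CHILDREN.get(name, [])
    if children:
        for child in children:
            write_joint(f, pose, child, pos, depth + 1)
    else:
        f.write(f"{pad}  End Site\n{pad}  {{\n"
                f"{pad}    OFFSET 0.0 0.0 0.0\n{pad}  }}\n")
    f.write(f"{pad}}}\n")

def write_bvh(pose, fps, path):
    with open(path, "w") as f:
        f.write("HIERARCHY\n")
        write_joint(f, pose, "torso", pose["torso"][0] * 0, 0)
        f.write(f"MOTION\nFrames: {len(pose['torso'])}\n"
                f"Frame Time: {1.0 / fps:.6f}\n")
        # Per-frame channel values (root translation plus joint rotations)
        # would follow here; deriving the rotations from the joint positions
        # is the involved part and is not shown in this sketch.
```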


Flask Server

We deploy our analysis pipeline using a lightweight Flask server. It accepts video uploads via HTTP POST requests, where the allowed file formats are .mp4, .mov, and .avi, and hands the received videos over to our video analysis implementation. The server maintains a results directory in which the analysis for each video is stored in its own subdirectory containing the .json file, annotated videos, and .bvh files. Each subdirectory is named with a unique analysis ID generated by hashing the video filename concatenated with the current time; the hash is truncated to keep the ID short. The server also runs a concurrent process that deletes any analysis results more than 24 hours old, in accordance with our data policy.
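A minimal sketch of the upload route and ID generation is shown below; the route path, form field name, hash length, and helper names are assumptions, and the 24-hour cleanup process is not shown.

```python
# Sketch of the upload route and analysis ID generation; route path, form
# field name, hash length, and helper names are assumptions.
import hashlib
import os
import time
from flask import Flask, request, abort

app = Flask(__name__)
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".avi"}
RESULTS_DIR = "analysis_results"

def make_analysis_id(filename):
    # Hash the filename concatenated with the current time, then truncate.
    digest = hashlib.sha256(f"{filename}{time.time()}".encode()).hexdigest()
    return digest[:12]

@app.route("/upload", methods=["POST"])
def upload():
    video = request.files.get("video")
    if video is None:
        abort(400)
    extension = os.path.splitext(video.filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        abort(400)
    analysis_id = make_analysis_id(video.filename)
    out_dir = os.path.join(RESULTS_DIR, analysis_id)
    os.makedirs(out_dir, exist_ok=True)
    video_path = os.path.join(out_dir, "input" + extension)
    video.save(video_path)
    # Assumed helper from the video analysis sketch above.
    analyse_video(video_path, os.path.join(out_dir, "analysis.json"))
    return {"analysis_id": analysis_id}
```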
The server also serves the analysis dashboard: given a URL containing the unique analysis ID, it retrieves the results from the corresponding subdirectory and uses them to render a Jinja HTML template. It additionally handles requests to retrieve just the JSON analysis data.
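Continuing the sketch above (reusing app and RESULTS_DIR), the results routes might look as follows; the route paths, template name, and file names are assumptions.

```python
# Sketch of the results routes; route paths, template name, and file names
# are assumptions. Reuses app and RESULTS_DIR from the upload sketch.
import json
import os
from flask import abort, render_template, send_file

@app.route("/analysis/<analysis_id>")
def view_analysis(analysis_id):
    json_path = os.path.join(RESULTS_DIR, analysis_id, "analysis.json")
    if not os.path.exists(json_path):
        abort(404)
    with open(json_path) as f:
        analysis = json.load(f)
    # Render the Jinja template for the analysis dashboard.
    return render_template("dashboard.html", analysis=analysis,
                           analysis_id=analysis_id)

@app.route("/analysis/<analysis_id>/json")
def get_analysis_json(analysis_id):
    # Return just the raw JSON analysis data.
    return send_file(os.path.join(RESULTS_DIR, analysis_id, "analysis.json"))
```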


3D Shot Reconstruction

We use Three.js to render the 3D skeletal animations of shots in our webapp. We first retrieve the shot analysis JSON, which contains the 3D joint coordinate frames for each shot, via an AJAX request to the Flask server as described above. We linearly interpolate the 3D joint coordinates between consecutive frames and draw 3D line segments to visualise the joint connections. We use the FPS field from the JSON to correctly time the skeletal animation.