January 15, 2024

Mastering Volume Computation of Objects from Videos

Written by Simon Playe

Introduction to Volume Computation of Objects

Have you ever tried to compute the volume of objects from videos using AI? Did it turn out to be either irrelevant or very difficult? If you answered ‘yes’ to both questions, and you are interested in volume computation or AI, then this article is for you. Using two videos, I will show you how to compute the volume of objects.
Use cases mostly involve inventory management. As more and more companies try to reduce their environmental impact, such a tool could make supply chains more efficient and improve inventory management. It can also be used for furniture placement in interior design, event planning, construction, or renovation.

Volume computation of objects raises several challenges:

  • locating the objects within the video
  • defining their boundaries
  • scaling the video to convert a pixel distance to meters

This article will go through all the steps below to retrace how to perform volume computation of objects:

diagram volume computation process
Steps of Volume Computation Process

Underlying Set-Up and Assumptions

To perform volume computation of objects, I rely on a few assumptions:

  • The area filmed is a closed room.
  • This room is filmed twice from the same point of view: once without the objects and once with them.
  • The videos are identical, except for the objects.
  • The objects lie directly against a wall, without any space.

For simplicity, in the remainder of this article, the filmed room is empty, and the objects are rectangular boxes. Therefore, volume computation will be performed on these boxes. An example of a video sequence respecting these assumptions:

volume computation on two boxes
An example of a video sequence respecting these assumptions. Here the studied objects are the two boxes.

The most important tool for volume computation is the camera. A standard 2D camera only provides color values for each pixel filmed. The set-up in this article relies on a depth camera, the Intel RealSense D435. This camera embeds two infrared cameras that provide a 3D representation of what is filmed. A Python API, pyrealsense2, makes it possible to retrieve a 3D coordinate for every pixel. Thus, one can get the distance in meters from the camera along the x-axis (right to left), the y-axis (up to down), and the z-axis (close to far) for all pixels. Any other camera can be used as long as it provides such coordinates.
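To make the idea of "a 3D coordinate for every pixel" concrete, here is a minimal sketch of how a pixel and its depth can be mapped to metric coordinates with a distortion-free pinhole model. It is an illustration under assumptions, not the SDK's code: the intrinsics `FX`, `FY`, `CX`, `CY` are hypothetical values, and the sign convention of the axes is assumed. With a real camera, pyrealsense2 offers `rs2_deproject_pixel_to_point` for this step.

```python
def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with a depth in meters to camera-frame (x, y, z).

    Pinhole model without lens distortion. fx, fy are the focal lengths
    and (cx, cy) the principal point, all expressed in pixels.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 depth stream (assumed values).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A pixel at the principal point lies on the optical axis.
center = deproject_pixel(320, 240, 1.5, FX, FY, CX, CY)
# A pixel 200 columns away, seen at the same depth.
offset = deproject_pixel(520, 240, 1.5, FX, FY, CX, CY)
print(center, offset)
```

The useful point is that the same pixel offset corresponds to a larger metric offset when the surface is farther away, which is why per-pixel 3D coordinates are needed rather than a single pixel-to-meter ratio.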

A few words about pyrealsense2: it is the Python wrapper for Intel RealSense SDK 2.0, the library for Intel RealSense cameras. The SDK is normally accessed through librealsense, a C++ package. According to the GitHub page of librealsense, “The SDK allows depth and color streaming, and provides intrinsic and extrinsic calibration information. The library also offers synthetic streams (pointcloud, depth aligned to color and vise-versa), and a built-in support for record and playback of streaming sessions.” To provide these services, the SDK uses a deterministic model that relies on the input from the two infrared cameras.


Let’s now dig into the process of volume computation. Here are the raw depth images taken from the two videos:

volume computation raw depth image with boxes
Raw image with boxes
volume computation raw depth image without boxes
Raw image without boxes

These images illustrate a classic statistical problem, outliers:

  • Black areas on the images of the videos correspond to pixels without any coordinates (depth is set to 0).
  • Colorful areas are flawed coordinates (depth is 10 meters, while the wall is 1.5 meters away from the camera).

Addressing Outliers: Best Practices

First, let’s define more precisely what an outlier is. Here, an outlier is a pixel whose z-coordinate (the depth) is equal to 0 or higher than 1.6 meters. Since the wall is around 1.5 meters away from the camera, no pixel should have a z-value higher than 1.6 meters.

This rule also catches the pixels with aberrant x and y values. Indeed, a quick check shows that pixels with a z-coordinate of 0 or higher than 1.6 also have aberrant x- and y-coordinates. Conversely, no pixel was found with aberrant x- or y-coordinates but a z-coordinate between 0 and 1.6. Therefore, the z-coordinate alone determines whether a pixel is an outlier.
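As a minimal sketch, this rule can be written as a boolean mask over the depth values. The names `is_outlier`, `outlier_mask`, and `MAX_DEPTH_M` are illustrative, and the depth image is assumed to be a list of rows in meters. With real images, a NumPy array would be the more natural choice.

```python
MAX_DEPTH_M = 1.6  # the wall is about 1.5 m away, so nothing should be farther


def is_outlier(z, max_depth=MAX_DEPTH_M):
    """A pixel is an outlier if its depth is 0 or beyond the wall."""
    return z == 0 or z > max_depth


def outlier_mask(depth_rows, max_depth=MAX_DEPTH_M):
    """Return a grid of booleans, True where the pixel is an outlier."""
    return [[is_outlier(z, max_depth) for z in row] for row in depth_rows]


# Toy depth image: one missing value (0.0) and one flawed value (10.0).
depth = [
    [1.48, 0.0, 1.50],
    [1.49, 10.0, 1.20],
]
print(outlier_mask(depth))
```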

Let’s take one specific pixel of one image of one video and imagine that this pixel is an outlier.

The best way to replace the z-coordinate is to use the z-coordinate of the closest non-outlier pixel. Here, the “closest non-outlier pixel” is the pixel with the lowest Euclidean distance from our pixel.
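The replacement of z can be sketched with a brute-force nearest-neighbor search. This is an illustration under assumptions rather than the implementation behind the images in this article: distances are measured in pixel units (row and column indices), ties go to the first candidate in row-major order, and the function name `fill_depth_outliers` is made up for the example. The search is quadratic in the number of pixels, so a real image would call for a faster method.

```python
import math


def fill_depth_outliers(depth_rows, max_depth=1.6):
    """Replace each outlier depth by the depth of the closest valid pixel."""

    def is_outlier(z):
        return z == 0 or z > max_depth

    # Positions of all pixels that carry a usable depth.
    valid = [
        (r, c)
        for r, row in enumerate(depth_rows)
        for c, z in enumerate(row)
        if not is_outlier(z)
    ]
    if not valid:
        raise ValueError("the image contains no valid pixel")

    filled = [list(row) for row in depth_rows]
    for r, row in enumerate(depth_rows):
        for c, z in enumerate(row):
            if is_outlier(z):
                # Closest valid pixel by Euclidean distance in pixel units.
                nr, nc = min(valid, key=lambda p: math.hypot(p[0] - r, p[1] - c))
                filled[r][c] = depth_rows[nr][nc]
    return filled


depth = [
    [1.50, 0.0, 0.0, 1.40],
    [1.50, 0.0, 10.0, 1.40],
]
print(fill_depth_outliers(depth))
```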

The method used to retrieve x- and y-coordinates is more complicated. Focusing first on x, let’s use the grid below to visualize it. This grid gives the x-coordinates for pixels located on a 7x7 portion of the image. The red cells represent outliers; the green cells represent non-outliers. The x-coordinate value is written inside the non-outlier cells.

grid representation subset image with outliers
Subset of the image: red areas are outliers, green areas are pixels with values

Here, I want to compute the x-coordinate for the circled outlier located at (3,2). To do so, I first select the two points closest to this outlier (circled in the grid) that are not in the same column. The x-coordinate difference between these two pixels is 0.82 - 0.80 = 0.02, and they are one column away from each other. One can therefore estimate that moving by one column changes the x-value by 0.02. Since the outlier is two columns away from the circled pixel “0.82”, its estimated value is 0.82 - 2 × 0.02 = 0.78.

The same method is used to replace y-coordinate values for outliers.
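The arithmetic above amounts to a linear extrapolation along the columns. Here is a minimal sketch of it. The column indices used in the example (4 and 5 for the two reference pixels, 3 for the outlier) are assumptions made for illustration, since the grid is only available as a picture, and the function name is hypothetical. The same function applies to y by passing row indices and y-values instead.

```python
def extrapolate_coordinate(index_a, value_a, index_b, value_b, index_target):
    """Linearly extrapolate a coordinate from two reference pixels.

    index_* are column indices (for x) or row indices (for y);
    value_* are the metric coordinates of the two reference pixels.
    """
    if index_a == index_b:
        raise ValueError("reference pixels must differ in column (or row)")
    step = (value_b - value_a) / (index_b - index_a)  # change per column
    return value_a + step * (index_target - index_a)


# Reference pixels "0.82" and "0.80" are one column apart, and the
# outlier sits two columns away from "0.82" (column indices are assumed).
estimate = extrapolate_coordinate(5, 0.82, 4, 0.80, 3)
print(round(estimate, 2))  # 0.78
```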

These preprocessing methods yield a much better 3D representation of what the camera filmed (shown here for the depth):

volume computation processed depth image with boxes
Processed image with boxes
volume computation processed depth image without boxes
Processed image without boxes

Volume Computation Process

Technique to Perform Volume Computation Using Two Videos

Having videos with consistent 3D coordinates, I can perform volume computation. Reading this article, you might have wondered: “Why insist on recording a video without boxes when only the boxes’ volume is needed?” The trick is to compute the volume for one image of the video with boxes and subtract it from the volume of the same image without boxes. This difference corresponds to the computed volume of the boxes.

Compute Local Volumes Using Kernels

How can volume be computed on a single image? Looking at the image below, one can split it into smaller squares, called kernels:

volume computation processed depth image with boxes with kernels
Processed image with boxes with kernels

Let’s simplify this problem by first computing the areas of the kernels and then their corresponding volumes.

Having the 3D coordinates of the four pixels that make up one kernel, one can compute the area of the kernel in different ways. To be more precise, I approximated each kernel by a parallelogram. Then, I dropped the z-coordinates of the points and used this formula:

area of a parallelogram
Vector formula of a parallelogram area
square four points
A kernel

This formula ignores one point (here C), and the four points are considered to be in the same 2D plane, since the z-coordinates of each point are ignored.
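The vector formula for the area of a parallelogram can be sketched in a few lines. The labeling is an assumption, since the figure is not reproduced here: A is taken to be the corner adjacent to both B and D, so the area is the magnitude of the cross product of AB and AD once the z-coordinates are dropped, and C is never used.

```python
def parallelogram_area(a, b, d):
    """Area of the parallelogram spanned by AB and AD, ignoring z.

    a, b, d are (x, y, z) tuples; only x and y are used.
    """
    ab_x, ab_y = b[0] - a[0], b[1] - a[1]
    ad_x, ad_y = d[0] - a[0], d[1] - a[1]
    # Magnitude of the 2D cross product AB x AD.
    return abs(ab_x * ad_y - ab_y * ad_x)


# Toy kernel, coordinates in meters: 0.5 m wide and 0.25 m high.
A = (0.0, 0.0, 1.5)
B = (0.5, 0.0, 1.5)
D = (0.0, 0.25, 1.4)
print(parallelogram_area(A, B, D))  # 0.125
```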

Focusing on only one kernel, one can consider it as the base of a rectangular prism, the other base being identical and located at z=0. One way to picture it is to imagine that the camera lies on a plane perpendicular to the z-axis. The rectangular prism then has one base on this plane, and the other base is the kernel in the image. The height of this rectangular prism is the average of the z-coordinates of all the points inside the kernel, such as A, B, C, D, E, F, and G in the picture below:

square seven points
E, F, and G are also used to compute the average height
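Putting the two ingredients together, the volume of one kernel is its area multiplied by the mean depth of its points. The sketch below assumes the same labeling as the area formula (A adjacent to B and D) and takes the points of the kernel as a list of (x, y, z) tuples. The function name and the toy values are illustrative.

```python
def kernel_volume(a, b, d, points):
    """Volume of the prism between the plane z=0 and one kernel.

    a, b, d: corners used for the parallelogram area (z ignored).
    points: every (x, y, z) point inside the kernel, corners included;
            the prism height is the mean of their z-coordinates.
    """
    area = abs((b[0] - a[0]) * (d[1] - a[1]) - (b[1] - a[1]) * (d[0] - a[0]))
    height = sum(p[2] for p in points) / len(points)
    return area * height


# Toy kernel: four corners on the wall plus one closer interior point.
A = (0.0, 0.0, 1.5)
B = (0.5, 0.0, 1.5)
C = (0.5, 0.25, 1.5)
D = (0.0, 0.25, 1.5)
E = (0.25, 0.1, 1.0)
print(kernel_volume(A, B, D, [A, B, C, D]))
print(kernel_volume(A, B, D, [A, B, C, D, E]))
```

Including interior points such as E lowers the mean depth when part of the kernel lies on a closer surface, which is what makes the prism height sensitive to objects.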

Subtract Kernel Volumes

Accordingly, computing the area of a kernel and the height of the corresponding rectangular prism gives the volume for one kernel. By covering one image with kernels and summing their volumes, it is possible to get the volume of the full image. Here is a visual representation of the volume, where each pixel represents the volume computed with a 2x2-pixel kernel:

volume computation volume without boxes
Computed volume - image without boxes
Computed volume - image with boxes

In the images above, I added two squares: S1 and S2 on the image without boxes, and S1’ and S2’ on the image with boxes. The volumes of S1 and S1’ should be approximately the same, because they enclose the same area and have the same depth. Thus, the difference between S1 and S1’ should be close to 0. However, the volume of S2 should be greater than that of S2’, because S2’ is on a box: even though S2 and S2’ have the same area, the mean depth of the pixels in S2’ is lower than in S2. The difference between S2 and S2’ corresponds to the volume taken by the box there. Therefore, by covering the two images with squares and computing the difference in volume for each pair of corresponding squares, one can get the volume of the boxes.
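The whole subtraction can be sketched end to end on synthetic data. This is a simplified illustration, not the code behind the results reported here. It assumes that each image is a grid of (x, y, z) points, that a kernel is formed by four neighboring pixels, and that both images share the same x and y values, which ignores perspective effects. Summing the per-kernel differences is the same as subtracting the two totals, which is what the sketch does.

```python
def kernel_volume(a, b, c, d):
    """Prism volume for a 2x2 kernel with corners a, b, c, d in order."""
    area = abs((b[0] - a[0]) * (d[1] - a[1]) - (b[1] - a[1]) * (d[0] - a[0]))
    mean_depth = (a[2] + b[2] + c[2] + d[2]) / 4
    return area * mean_depth


def image_volume(points):
    """Sum the kernel volumes over a grid of (x, y, z) points."""
    total = 0.0
    for r in range(len(points) - 1):
        for col in range(len(points[r]) - 1):
            total += kernel_volume(
                points[r][col], points[r][col + 1],
                points[r + 1][col + 1], points[r + 1][col],
            )
    return total


def objects_volume(empty_points, full_points):
    """Volume of the objects: empty-room volume minus volume with objects."""
    return image_volume(empty_points) - image_volume(full_points)


def synthetic_grid(rows, cols, spacing, depth_at):
    """Build a regular grid of points, spacing in meters (toy data)."""
    return [
        [(col * spacing, r * spacing, depth_at(r, col)) for col in range(cols)]
        for r in range(rows)
    ]


# Wall at 1.5 m; a "box" brings six pixels 0.5 m closer to the camera.
empty_room = synthetic_grid(4, 5, 0.1, lambda r, col: 1.5)
with_box = synthetic_grid(
    4, 5, 0.1, lambda r, col: 1.0 if 1 <= r <= 2 and 1 <= col <= 3 else 1.5
)
print(round(objects_volume(empty_room, with_box), 4))  # 0.03
```

In this toy case the result matches six pixel cells of 0.1 m × 0.1 m, each 0.5 m deep, that is 0.03 cubic meters.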


To conclude, this technique gives pretty good estimates of the volume of objects. For two boxes, the error between the computed volume and the real volume is less than 1% in absolute value. For one and three boxes, this error is less than 3%. Despite the many assumptions about the room and the objects, this algorithm proves effective at computing objects’ volume. Furthermore, it provides a basis for more exciting Computer Vision applications, such as computing the volume of different kinds of objects using object detection algorithms.

Are you looking for experts in data and Computer Vision? Don’t hesitate to contact us!
