The Objective

One of the things we need in order to get our pipeline working is input processing.

At the moment we train our neural networks on silhouette images rendered from the dataset. Since our last post on data preparation, a few things have changed in the way we render these. The main difference is that the silhouette images are now scaled relative to the maximum stature present in the dataset: 2183mm. Because of this, they now contain information about the subject’s height.

The objective of our pipeline is to take regular photos as its input images. In order to use these photos, we have to find a way to convert them into silhouettes similar to the ones rendered from the 3D models.


A greenscreen would be an effective way to extract a person’s silhouette from a photo, but not every situation offers the possibility to use a greenscreen.

Therefore we landed on a concept called “background subtraction”, which is mostly used to extract moving objects from video.

Now the problem is that we are working with photos, rather than with video. This did however remind us of a similar functionality that had been in the “Photo Booth” application shipped with MacOS since 2005. This application allowed you to replicate a “greenscreen effect” by first taking a picture of the background before stepping into frame. It would then “subtract” this background from the video feed in order to simulate a greenscreen. We decided to take a similar approach and use the photo of the background and the photo of the person as two separate “frames” in the “video” to subtract.

One software package that offers this functionality is OpenCV, a computer vision library that supports many kinds of visual processing. This is what we started our experimentation with.

Language & Integration

As we had hoped to build our client application as a web application, the first thing we looked into was the OpenCV JavaScript library. However, we could not find much information about it, and many of the references to its source code were broken. We therefore decided against this route, and chose to do our processing in the backend instead.

The next way of integrating OpenCV we looked at was “node-opencv”, a JavaScript library for NodeJS that provides native bindings to the OpenCV library. As our current backend is written in NodeJS, this seemed like the way to go. However, we did not manage to get it to run properly and eventually started looking into alternatives.

Another popular integration for OpenCV was its Java library, “opencv-java”. However, this API depended on the native library as well. We followed several tutorials on how to properly set up a development environment for OpenCV with Java, but we kept running into obscure errors for which a Google search yielded no results.

The next solution we tried was a success: Directly interfacing with OpenCV via C++. Because we could now work with OpenCV, this is where our experimentation phase began.

Code can be found over at GitHub


We divided the process of extracting the silhouette into several steps. Before these steps we had to define the function inputs, which were as follows:

  • The Background Image (Excluding the person)
  • The Foreground Image (Including the person)
  • The scale
    This value represents the stature of the person in relation to the maximum possible stature (2183mm).
    Therefore the scale for a person with a height of 196.47 centimeters (1964.7mm) would be 0.9.
  • The subtraction threshold
    This value represents how much a pixel may differ between the two images before it is considered part of the person. Changing this value produces varying results, and the best value depends heavily on the lighting conditions and on how similar the person is to the background. In our testing, 512 seems to produce reasonable results, but this may vary. This value could later be exposed in the user interface, so that a human can verify the silhouette extraction and adjust it if needed.
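The scale input can be worked out from a person's height with a single division. The following is a minimal sketch of that arithmetic; the helper name `statureScale` and the choice to accept centimeters are assumptions made for illustration, not part of the original code.

```cpp
#include <cassert>

// Maximum stature present in the dataset, in millimeters (value from the post).
constexpr double kMaxStatureMm = 2183.0;

// Hypothetical helper: converts a height in centimeters to the scale input.
double statureScale(double heightCm) {
    const double heightMm = heightCm * 10.0;
    // The scale is only meaningful for heights up to the dataset maximum.
    assert(heightMm > 0.0 && heightMm <= kMaxStatureMm);
    return heightMm / kMaxStatureMm;
}
```

For the example above, `statureScale(196.47)` evaluates to 1964.7 / 2183, which is approximately 0.9.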

The source images:


Step 1. Resizing the image

As processing these images takes time and not all input images are the same size, all images are resized to fit within a maximum width and height while maintaining their aspect ratio. This allows any subsequent processing to happen faster.
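The target size for such a resize can be computed as in the sketch below. The `Size` struct and the `fitWithin` helper are hypothetical names, and the decision to never enlarge images that are already small enough is an assumption; the resulting size would then be handed to a resize routine such as OpenCV's resize.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Size {
    int width;
    int height;
};

// Hypothetical helper: returns the largest size that fits inside
// maxWidth x maxHeight while keeping the aspect ratio of src.
// Assumption: images that already fit are left at their original size.
Size fitWithin(Size src, int maxWidth, int maxHeight) {
    assert(src.width > 0 && src.height > 0 && maxWidth > 0 && maxHeight > 0);
    // Use the smaller of the two ratios so both dimensions stay within bounds.
    const double factor = std::min({1.0,
                                    static_cast<double>(maxWidth) / src.width,
                                    static_cast<double>(maxHeight) / src.height});
    return {std::max(1, static_cast<int>(std::lround(src.width * factor))),
            std::max(1, static_cast<int>(std::lround(src.height * factor)))};
}
```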

The resized image

Step 2. Person Detection & Crop

The next step was to cut away most of the unnecessary areas from the photo. As OpenCV allows you to detect persons in a photo, we used this technique as the basis for this step.

First we got OpenCV to detect any people within the image. This returns a number of areas. Sometimes OpenCV detects multiple overlapping areas for a single person, as shown in the following image by Adrian Rosebrock:

Before (Left) and After (Right) the application of non-maximum suppression.

Because of this, it is useful to reduce areas that overlap by more than a given amount to a single area.

For this, we used the “non-maximum suppression” algorithm. The specific C++ implementation we used was the one by Martin Kersner.
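To illustrate the idea, here is a simplified greedy variant of non-maximum suppression on plain rectangles. It is not the Kersner implementation: the `Box` struct, the function names, the use of intersection-over-union as the overlap measure, and the preference for larger boxes (since no detection scores are assumed) are all choices made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Box {
    int x, y, width, height;
};

int area(const Box& b) { return b.width * b.height; }

// Intersection-over-union of two boxes, in the range [0, 1].
double overlapRatio(const Box& a, const Box& b) {
    const int w = std::min(a.x + a.width, b.x + b.width) - std::max(a.x, b.x);
    const int h = std::min(a.y + a.height, b.y + b.height) - std::max(a.y, b.y);
    if (w <= 0 || h <= 0) return 0.0;  // the boxes do not intersect
    const int intersection = w * h;
    return static_cast<double>(intersection) / (area(a) + area(b) - intersection);
}

// Keeps a box only if it does not overlap an already kept box by more than
// maxOverlap. Assumption: larger boxes are preferred over smaller ones.
std::vector<Box> suppressOverlaps(std::vector<Box> boxes, double maxOverlap) {
    assert(maxOverlap >= 0.0 && maxOverlap <= 1.0);
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return area(a) > area(b); });
    std::vector<Box> kept;
    for (const Box& candidate : boxes) {
        const bool suppressed = std::any_of(
            kept.begin(), kept.end(), [&](const Box& k) {
                return overlapRatio(candidate, k) > maxOverlap;
            });
        if (!suppressed) kept.push_back(candidate);
    }
    return kept;
}
```

Two near-identical detections of the same person collapse into one box, while a detection elsewhere in the image is left untouched.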

Finally, we would crop the foreground and background images based on the bounds of the merged area:

The cropped image based on the person detection from OpenCV

Step 3. Background Subtraction

The next step was to subtract the background and produce a matte. This is where OpenCV’s BackgroundSubtractor came into play. This subtractor was created with the specified threshold. First, the background image would be inserted as the first frame, after which the foreground would be inserted as the second frame.

The result of the background subtractor would be written to a new black and white “Matte” image.
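The two-frame idea can be illustrated without OpenCV by a direct per-pixel comparison. The sketch below is a simplified stand-in, not OpenCV's BackgroundSubtractor, which maintains a statistical background model rather than a plain difference. The `Rgb` struct, the `differenceMatte` name, and the interpretation of the threshold as a squared color distance are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgb {
    std::uint8_t r, g, b;
};

// Hypothetical stand-in for the subtraction step: a pixel becomes white (255)
// in the matte when its squared color distance to the background pixel
// exceeds the threshold, and black (0) otherwise.
std::vector<std::uint8_t> differenceMatte(const std::vector<Rgb>& background,
                                          const std::vector<Rgb>& foreground,
                                          int threshold) {
    assert(background.size() == foreground.size());
    std::vector<std::uint8_t> matte(background.size(), 0);
    for (std::size_t i = 0; i < background.size(); ++i) {
        const int dr = static_cast<int>(foreground[i].r) - background[i].r;
        const int dg = static_cast<int>(foreground[i].g) - background[i].g;
        const int db = static_cast<int>(foreground[i].b) - background[i].b;
        if (dr * dr + dg * dg + db * db > threshold) {
            matte[i] = 255;
        }
    }
    return matte;
}
```

A plain difference like this is more sensitive to noise and small lighting changes than a model-based subtractor, which is one reason a stable camera and controlled lighting matter.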

The resulting matte obtained through background subtraction

Step 4. Applying contours

The edges of the resulting matte were still a bit rough. Luckily, OpenCV allows you to detect contours via the findContours function, which we could then draw onto the matte with the drawContours function to improve the quality of the edges.

The resulting matte after processing the contours

Step 5. Trimming

At this point there was still a lot of useless padding surrounding the matte, which we had to trim away. This is easily solved using OpenCV’s findNonZero function, which returns the locations of all non-zero pixels in the image. From these we can derive a bounding area, which is then used to crop the image.
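The bounding area of the non-zero pixels can be computed as in the following sketch, which works on a plain row-major single-channel buffer instead of an OpenCV matrix. The `Rect` struct and the `nonZeroBounds` name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rect {
    int x, y, width, height;
};

// Returns the smallest rectangle containing every non-zero pixel of a
// row-major single-channel matte, or an empty rectangle if all pixels are zero.
Rect nonZeroBounds(const std::vector<std::uint8_t>& matte, int width, int height) {
    assert(width > 0 && height > 0);
    assert(matte.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    int minX = width, minY = height, maxX = -1, maxY = -1;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (matte[static_cast<std::size_t>(y) * width + x] != 0) {
                minX = std::min(minX, x);
                maxX = std::max(maxX, x);
                minY = std::min(minY, y);
                maxY = std::max(maxY, y);
            }
        }
    }
    if (maxX < 0) return {0, 0, 0, 0};  // no foreground pixels found
    return {minX, minY, maxX - minX + 1, maxY - minY + 1};
}
```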

The trimmed result from detecting the bounding area and cropping the image

Step 6. Inverting

The current state of the matte is white on black. The silhouettes for our training data were black on white. Let’s invert the image.
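For an 8-bit matte, inverting means replacing every value v with 255 - v. A minimal sketch on a plain buffer follows (the `invertMatte` name is hypothetical); with OpenCV the same effect can be achieved with a bitwise NOT on the image.

```cpp
#include <cstdint>
#include <vector>

// Returns a copy of the matte in which white becomes black and vice versa.
std::vector<std::uint8_t> invertMatte(std::vector<std::uint8_t> matte) {
    for (std::uint8_t& value : matte) {
        value = static_cast<std::uint8_t>(255 - value);
    }
    return matte;
}
```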

The inverted matte

Step 7. Scaling

All the training silhouettes are scaled based on their height, relative to the maximum height in the dataset (2183mm). Using the scale parameter supplied to the silhouette extraction function, we can replicate this behaviour.
We do this by first creating a new white canvas whose height is 1 / scale times the height of the matte. We then draw the existing matte onto the new canvas, centered horizontally and aligned to the bottom. The matte is now scaled relative to its canvas.
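The canvas height and the position of the matte on it follow directly from the scale. The sketch below computes both; the `Placement` struct and the function name are hypothetical, and because the post does not state how the canvas width is chosen, it is taken as a parameter here.

```cpp
#include <cassert>
#include <cmath>

struct Placement {
    int canvasWidth;
    int canvasHeight;
    int offsetX;  // left edge of the matte on the canvas
    int offsetY;  // top edge of the matte on the canvas
};

// Computes a canvas whose height is matteHeight / scale and positions the
// matte centered horizontally and aligned to the bottom edge.
// Assumption: the canvas width is supplied by the caller.
Placement placeOnScaledCanvas(int matteWidth, int matteHeight, double scale, int canvasWidth) {
    assert(matteWidth > 0 && matteHeight > 0);
    assert(scale > 0.0 && scale <= 1.0);
    assert(canvasWidth >= matteWidth);
    const int canvasHeight = static_cast<int>(std::lround(matteHeight / scale));
    return {canvasWidth,
            canvasHeight,
            (canvasWidth - matteWidth) / 2,
            canvasHeight - matteHeight};
}
```

For a matte 900 pixels tall and a scale of 0.9, the canvas becomes 1000 pixels tall, so the person occupies 90% of the canvas height, matching the relation used for the training silhouettes.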

The scaled matte on its new canvas

Step 8. Resizing

The final step is to resize the canvas to the desired resolution. This is a simple operation that OpenCV provides through its resize function.

The final resized silhouette.

Final Thoughts

The final extracted silhouette is not perfect, and its quality depends heavily on the lighting conditions and the quality of the background. We think that with a properly controlled environment and, most of all, a stable camera, the results can be adequate to use as input for our prediction algorithm. Should this prove not to be the case, we’ll be forced to resort to other, possibly more limiting techniques such as greenscreening or depth extraction through multiple camera angles.

