How We Trained a Neural Network to Count People for AI-Powered Transportation, Retail, & Smart Cities

Head of Adaptive Computing Systems Unit at Promwad

Deep learning boosts productivity and reduces costs in various business applications. In this article, we will show how our engineers created a neural network for human counting. This solution can now be applied to public transport, smart buildings, retail and queue management.

Table of contents

Where neural networks are used

How head & face detection systems work

How the neural network is trained

Fine-tuning the deep learning model with an iteratively growing dataset

Deep learning is a type of machine learning in which neural networks are trained on large amounts of data. This technology allows for automating complex business processes and implementing analytics for data-driven decisions in various areas.

Over the past 5 years, the deep learning market has grown almost 22(!) times. At the end of 2022, it was estimated at USD 49.6 billion.

Where neural networks are used

Before the introduction of neural networks, human counting tasks were solved with the help of a human operator and surveillance cameras, as well as optical sensors in turnstiles (where available). Computer vision and deep learning technologies simplify and automate this process in many areas:

People are waiting for a public transport

Public Transportation. People counting solutions help control passenger traffic and analyse rush hours and traffic congestion. Using analytics data, city authorities can optimise public transport schedules: shorten intervals between vehicles, add extra trips on the route, etc.

People are working in an office

Smart Building. AI-based cameras make it easier to operate and maintain buildings. Here are a few ways to apply machine learning in smart buildings:

monitor the capacity of office space and meeting rooms in a hybrid work or co-working format;
count people and monitor occupancy and social distancing;
adjust heating, air-conditioning and lighting systems according to needs;
monitor lift occupancy and avoid overloading.

People do shopping

Retail. Machine learning in retail analytics optimises staff workloads. The data obtained helps to study customer interest in products and displays, capture the trajectories of people moving through the sales floor, and change the shop design, for example by placing sections in a convenient sequence or giving promotional materials more prominent positions.

Detecting people in a queue

Queue Management. Head recognition cameras help manage queues at airports, train stations and big shops by recording the number of people in the line and their waiting time. If the data shows that the queue is growing quickly and moving slowly, it is a signal to open additional check-in counters or checkouts, or to call in more staff.

Detecting people in the street

City Mobility. Smart cameras for video surveillance make the urban environment more accessible for people with disabilities or parents with prams. On public transport, such cameras help drivers identify passengers who need assistance, such as a ramp or a mechanical lift.

How head & face detection systems work

Let's look at how a neural network works when paired with a camera. This combination is used in most video surveillance solutions. Previously, we described how this works in our case study on a smart bike parking with the SONY Spresense AI camera.

Here are the key steps in the process of counting objects in the camera video:

Capture images from CCTV, IP, or specialised sensors covering the target area.

Improve image quality: filter, reduce noise, stabilise and sharpen images or video frames.

Detect objects using computer vision algorithms, including single-shot detection (SSD).

Count objects and analyse their characteristics: crowd density, queue time, flow patterns, etc.

Integrate and visualise data: display dashboards and transfer to control systems and applications.
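The counting and analysis step above can be illustrated with a minimal sketch. It assumes the detector has already returned a list of (box, confidence) pairs for one frame; the function names, the 0.5 threshold and the sample values are hypothetical and are not taken from the project.

```python
def count_heads(detections, min_confidence=0.5):
    """Count detections whose confidence meets the threshold.

    detections: list of (box, confidence) pairs for a single frame.
    """
    return sum(1 for _box, confidence in detections if confidence >= min_confidence)


def crowd_density(head_count, area_m2):
    """People per square metre for a monitored area of known size."""
    if area_m2 <= 0:
        raise ValueError("area_m2 must be positive")
    return head_count / area_m2


# Hypothetical detections for one frame: (x1, y1, x2, y2) box and confidence.
frame = [
    ((10, 10, 50, 50), 0.92),
    ((100, 40, 140, 80), 0.81),
    ((200, 30, 230, 60), 0.35),  # below the threshold, so it is not counted
]
heads = count_heads(frame)
density = crowd_density(heads, area_m2=20.0)
```

A confidence threshold is a common way to trade missed heads against false alarms; the right value depends on the camera and the scene.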

For such a system to work properly, the neural network must be trained; it is a complex but very exciting process, which we describe below. Promwad engineers are currently training a neural network for face and head detection as part of an ongoing machine learning project in the automotive industry.

How the neural network is trained

Head detection is a subset of object detection tasks. Why is it important to identify the head? In many applications, detecting a person's whole figure is either not sufficient or not possible: the figure may be partially hidden in a crowd, or the camera may be filming the crowd from above or from the side.

A neural network recognises heads and faces

^{A neural network trained to recognise heads and faces. Video source: "Stable multi-target tracking in real-time surveillance video" by Benfold, Ben and Reid, I. CVPR 2011}

To count people, a neural network must do several things:

detect the presence of a person in the frame;
find the area where the head is located;
determine the coordinates of the bounding box surrounding the head;
assign an identifier to the object so that it can be tracked from frame to frame.
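The last item in the list, assigning identifiers so that a head can be followed from frame to frame, can be sketched with a greedy matcher based on intersection over union (IoU). This is an illustrative simplification, not the tracker used in the project: the class name, the 0.3 threshold and the box values are assumptions, and a track is dropped as soon as it goes unmatched for one frame.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class GreedyIouTracker:
    """Gives each box the id of the best-overlapping box from the previous frame."""

    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}        # track id -> box seen in the previous frame
        self._ids = count(1)    # source of new identifiers

    def update(self, boxes):
        """Return {track_id: box} for the current frame."""
        assigned = {}
        unmatched = dict(self.tracks)
        for box in boxes:
            best_id, best_iou = None, self.iou_threshold
            for track_id, previous_box in unmatched.items():
                overlap = iou(box, previous_box)
                if overlap >= best_iou:
                    best_id, best_iou = track_id, overlap
            if best_id is None:
                best_id = next(self._ids)   # no good match: start a new track
            else:
                del unmatched[best_id]      # each track is matched at most once
            assigned[best_id] = box
        self.tracks = assigned
        return assigned


tracker = GreedyIouTracker()
frame_1 = tracker.update([(10, 10, 50, 50), (100, 100, 140, 140)])
frame_2 = tracker.update([(102, 101, 142, 141), (12, 11, 52, 51)])
frame_3 = tracker.update([(300, 300, 340, 340)])
```

Counting the distinct identifiers issued over a period gives the number of people seen, rather than the number of detections, which would count the same head once per frame.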

We use the SSD object detection architecture, one of the popular algorithms for this task.

In the first stage, we have a mathematical model with randomly initialised weights, which is useless until it has been trained. Training makes it functional: we pass thousands of preprocessed images through the model together with annotations giving the coordinates of the heads in each image, so that the model can "understand" that there is a target object in a given area. Each batch of images goes through the model (forward propagation), after which the loss function is calculated. It measures how far the model's predictions are from the annotations we fed into it.

The loss function is one of the essential parts of deep learning: it determines how neural networks learn. In its simplest case, it can be visualised as the sum of the deviations of our data from the model's predictions (see the graph below). However, the real picture is much more complex, as modern neural networks contain a huge number of trainable parameters.

^{The simplest case of a loss function in machine learning}
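To make the simplest case concrete, here is a minimal sketch of a loss computed as the sum of squared deviations between data points and a straight-line model. The squared form, the function name and the sample points are assumptions for illustration; the loss of an SSD detector is more involved and combines a box-localisation term with a confidence term.

```python
def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared deviations between the data and the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

perfect_fit = sum_squared_error(xs, ys, slope=2, intercept=1)  # no deviation at all
poor_fit = sum_squared_error(xs, ys, slope=1, intercept=1)     # deviations 0, 1, 2, 3
```

The better the parameters describe the data, the smaller the value, which is the property training relies on.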

After the loss function is calculated, backpropagation of errors begins. During this stage, we use an optimisation algorithm to adjust the weights and minimise the loss function. A small loss means that the model's predictions closely match the annotations we passed to it.
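The weight-correction idea can be shown on a toy model with two parameters. The sketch below uses plain gradient descent on a mean squared error; the optimiser, learning rate, step count and data are assumptions chosen for illustration and say nothing about the settings used to train the detector.

```python
def mean_squared_error(xs, ys, slope, intercept):
    """Average squared deviation between predictions and targets."""
    n = len(xs)
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_step(xs, ys, slope, intercept, learning_rate):
    """One gradient-descent update of both parameters."""
    n = len(xs)
    residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
    grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_intercept = 2.0 / n * sum(residuals)
    return (slope - learning_rate * grad_slope,
            intercept - learning_rate * grad_intercept)


# Hypothetical data generated by y = 2x + 1; the "weights" start at zero.
xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
slope, intercept = 0.0, 0.0
initial_loss = mean_squared_error(xs, ys, slope, intercept)

for _ in range(500):
    slope, intercept = gradient_step(xs, ys, slope, intercept, learning_rate=0.05)

final_loss = mean_squared_error(xs, ys, slope, intercept)
```

After the loop, the parameters are close to the values that generated the data and the loss is close to zero. In a neural network the same update is applied to every trainable weight, with backpropagation supplying the gradients.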

We use the model to detect faces in full-face and profile views, as well as heads seen from behind and from above. If an input image contains them, the corresponding area receives a numerical value (a confidence score).

At the beginning of training, the network's confidence that there is a face or head in an area is randomly distributed from 0 to 1. But as it learns, its confidence tends to 1 in areas that contain a head or face and to 0 in areas that do not.

During learning, the model's correct and erroneous predictions are identified, and it is "encouraged" or "punished" for the results. The model can make a mistake and output a value indicating that there is no face in an image that actually contains one. In this case, the loss will be much larger, and backpropagation will change the weights so that the model is more likely to detect this face in the future. Eventually, the model makes increasingly accurate predictions.
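The "punishment" for a missed face can be illustrated with binary cross-entropy, a common confidence loss. This is a simplified stand-in chosen for the example: the function name and the confidence values are hypothetical, and an SSD detector typically uses a multi-class confidence loss alongside a localisation loss.

```python
import math


def binary_cross_entropy(confidence, target, eps=1e-7):
    """Loss for one area: target is 1 if it contains a head or face, otherwise 0."""
    p = min(max(confidence, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


# The area really contains a face (target = 1).
confident_hit = binary_cross_entropy(0.9, target=1)  # model is fairly sure: small loss
missed_face = binary_cross_entropy(0.1, target=1)    # model says "no face": large loss
```

The miss produces a loss more than twenty times larger than the confident hit, so the corresponding weight update is much stronger.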

Fine-tuning the deep learning model with an iteratively growing dataset for face counting

When training a neural network, we face two important interrelated issues:

Where to get images and how many are needed for training in order to achieve acceptable accuracy of the neural network;
How many iterations should be done for the neural network to learn sufficiently.

The learning time and accuracy of the algorithm depend on how these problems are solved.

We have chosen an iterative approach to train the neural network, loading a new portion of images at each iteration. This approach solves a potential performance problem with the platform where the model is trained: we pick up more data and iteratively extend the existing dataset. The cycle "checking model performance → expanding the dataset → fine-tuning the model" is repeated until we get a good result.
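The cycle can be written down as a short loop. In this sketch, fine_tune and evaluate are hypothetical callables supplied by the caller, and fine-tuning on the whole accumulated dataset is an assumption; the stubs at the bottom only exercise the control flow and do not represent real training results.

```python
def iterative_fine_tuning(model, image_batches, fine_tune, evaluate, target_score):
    """Extend the dataset batch by batch until the validation score is good enough.

    fine_tune(model, dataset) -> model and evaluate(model) -> float are
    placeholders for the real training and validation routines.
    """
    dataset = []
    history = []
    for batch in image_batches:
        dataset.extend(batch)               # expand the dataset
        model = fine_tune(model, dataset)   # fine-tune on everything collected so far
        score = evaluate(model)             # check performance on validation data
        history.append(score)
        if score >= target_score:
            break                           # good enough: stop adding data
    return model, history


# Stubs that only demonstrate the control flow.
def stub_fine_tune(model, dataset):
    return model + 1        # pretend each round improves the model


def stub_evaluate(model):
    return model / 4        # pretend score grows by 0.25 per round


batches = [["img_a"], ["img_b"], ["img_c"], ["img_d"], ["img_e"]]
model, history = iterative_fine_tuning(0, batches, stub_fine_tune, stub_evaluate, 0.75)
```

Keeping the score history makes it easy to see when additional batches stop paying off.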

Training our neural network took 7 iterations: a zero iteration followed by six more. In the zero iteration, we trained the model on a dataset from free sources. It consisted of 2,405 images of heads and faces with different characteristics: angle, size, and density.

An example of how people can be detected

^{Images from a free dataset used to train the model in the zero iteration. Source: github.com}

The zero iteration showed an accuracy (mean average precision) of 20% on our validation subset. We continued to improve this value for another 6 iterations and used another set of images for this purpose.
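For readers unfamiliar with the metric, the sketch below computes average precision (AP) for one class as the area under the precision-recall curve; mean average precision is the mean of AP over classes. The input is assumed to be a list of detections already marked as true or false positives (usually by an IoU test against the annotations). The article does not state which mAP variant or IoU threshold was used, so this is one common formulation, and the sample detections are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve for a single class.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = false_pos = 0
    precisions, recalls = [], []
    for _confidence, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)
    # Replace each precision with the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Hypothetical example: two annotated heads, three detections, one of them wrong.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```

Here the false positive ranked between the two correct detections pulls AP down to about 0.83; a detector with no false positives and no misses scores 1.0.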

The first dataset contained a variety of photos with different scenes, brightness, and scale, while we needed more specific images for our problem. A model trained on these photos is helpful for facial recognition systems, e.g. for access control to restricted areas or secure buildings, access systems for carshare vehicles, or driver and passenger tracking platforms.

For our task, which focuses on head detection and tracking of a specific person in motion, the dataset must include faces in profile and full-face as well as images of heads from different angles: from behind, above and below. This lets the model detect both faces and heads regardless of the viewing angle.

For the following iterations, we used a dataset that we compiled ourselves. We installed several cameras in various locations in our engineering office. The cameras were placed at various tilt angles and in areas with varying lighting conditions. The images were annotated accordingly: we drew bounding boxes around the areas we wanted the model to detect.

It was a semi-automatic process: part of the labelling was done by our trained model, and a human then refined the labels and corrected errors.

For the whole training process, we collected ~3,400 images and divided them into 6 parts. Each part was divided into sets for training (~80%) and validation (~20%). Our main metric in object detection was mAP (mean Average Precision).
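With ~3,400 images in 6 parts, each part holds roughly 567 images, of which about 454 go to training and about 113 to validation. A minimal sketch of such a split is shown below; the shuffling, the fixed seed and the file names are assumptions, since the article does not describe how images were assigned to parts.

```python
import random


def split_into_parts(items, num_parts, train_fraction=0.8, seed=0):
    """Shuffle items, deal them into num_parts parts, and split each part.

    Returns a list of (train_items, validation_items) tuples, one per part.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    parts = [shuffled[i::num_parts] for i in range(num_parts)]
    splits = []
    for part in parts:
        cut = round(len(part) * train_fraction)
        splits.append((part[:cut], part[cut:]))
    return splits


# Hypothetical file names standing in for the collected images.
image_names = [f"frame_{i:04d}.jpg" for i in range(3400)]
splits = split_into_parts(image_names, num_parts=6)
sizes = [(len(train), len(val)) for train, val in splits]
```

Each part keeps its own validation subset, so accuracy can be measured after every iteration on images the model has not been trained on in that round.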

After the first iteration of training on the new data, the model produced a result of ~77%.

We continued training and loading new portions of photos. We reached an accuracy of ~92% at the sixth iteration and stopped there. The further the learning process goes, the smaller the increase in accuracy it gives. In some cases, accuracy improves over several iterations and then drops at the next one. This can happen because of internal failures or because unsuitable inputs were loaded.

Diagram of our neural network training for people counting

^{Iteration diagram of our neural network training for people counting}

Training a model in iterations lets us, on the one hand, extend the data on which it is trained and, on the other hand, monitor the accuracy and estimate how many more iterations are needed.

We have created a service for iterative training of our neural network, built from several open-source services and tools. Our iterative process involves several steps:

Training process monitoring. With a dedicated service, we log metrics such as loss and mAP; this is how we understand the model's behaviour during training.

Inference on new data. At this stage, we make predictions (output) on the newly selected data. Predictions in Common Objects in Context (COCO) format are needed for future annotation, that is, for creating ground-truth bounding boxes. This pre-labelling significantly reduces annotation time, and annotation is one of the most time-consuming stages in machine learning.
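As an illustration of what COCO-style predictions look like, the sketch below converts corner-format boxes into the COCO detection results layout, where each entry has an image_id, a category_id, a bbox given as [x, y, width, height] and a score. The function name, the category id and the sample prediction are assumptions; the exact file the service produces is not described in the article, and an annotation tool may require the full COCO dataset layout with images and categories sections rather than a bare results list.

```python
import json


def to_coco_results(predictions, category_id=1):
    """Convert (image_id, (x1, y1, x2, y2), score) tuples to COCO result entries."""
    results = []
    for image_id, (x1, y1, x2, y2), score in predictions:
        results.append({
            "image_id": image_id,
            "category_id": category_id,            # assumed id for the "head" class
            "bbox": [x1, y1, x2 - x1, y2 - y1],    # COCO uses top-left corner + size
            "score": score,
        })
    return results


# Hypothetical prediction: one head found in image 7.
coco_results = to_coco_results([(7, (10, 20, 50, 80), 0.9)])
coco_json = json.dumps(coco_results, indent=2)
```

The conversion from corner coordinates to width and height is the detail most often missed when exporting predictions for annotation.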

Annotation refinement. After inference, we have a file with predicted bounding boxes for the heads in the new batch of data. These predictions need to be refined because the model is not perfect. We load this file into Computer Vision Annotation Tool (CVAT) and manually refine the predictions.

As a result, we have a neural network that detects faces and heads from different angles with ~92% accuracy (mAP) and can be used in surveillance systems in public places, airports, train stations, large transportation hubs and retail analytics. We are ready to use this solution for your business, adapt it to the specific task and bring the recognition accuracy to the required level.

* * *

Human counting solutions help improve security in buildings and crowded areas. Based on deep learning algorithms, these solutions provide reliable and accurate results in real-time, even in complex scenarios.

Advances in technology and the development of more sophisticated sensors will encourage companies to develop face recognition software and conduct accurate training of neural networks. More efficient and intelligent systems will be created that meet the needs of business and society.

As a company specialising in machine learning development, we offer clients services, from consulting to developing business-specific solutions. Contact us, and we will tell you how our engineers can develop a people-counting solution tailored to your business.