Building a digital guide dog for railway passengers with impaired vision


 

Video: MS_Blindenhund_English (microsoft.com)

 

Background/Motivation

Figure 1: Training a young student to find the doors on the Munich subway, using a white cane.

 

Catching your train on time can be challenging under the best of circumstances. Trains typically stop for only a few minutes, leaving little room for mistakes. For example, at Munich Main station, around 240 express trains and 510 regional trains leave from 28 platforms every day. Some trains can also be quite long, up to 346 meters (1,135 ft) for express ICE trains. It is extremely important to quickly find the correct platform and platform section, and then to locate the door closest to a reserved seat. This already challenging task becomes even more difficult if a vision impairment forces a customer to rely exclusively on auditory or tactile feedback. When traveling autonomously, without assistance, it is common practice to walk along the outside of a train, continuously tapping it with a white cane, to discover open and closed doors (figure 1). While this works in principle, the practice has limitations in terms of both speed and reliability. We therefore partnered with DB Systel GmbH, the digital partner for all Deutsche Bahn Group companies, to build the Digital Guide Dog. This is a feasibility study based on an AI-powered smartphone application that uses computer vision together with auditory and haptic feedback to guide customers to the correct platform section and train car door. In this blog post, we are sharing some of the details and unique challenges that we encountered while building the AI model behind this application.

 

Approach

 

Figure 2: The application has found an open door and guides the customer towards it.

 

At its core, the application relies on an object detection model, which draws bounding boxes around every open and closed door in the camera image. Using the coordinates of the corners of these bounding boxes, the application can then guide customers in the correct direction (figure 2); a sketch of this guidance step follows below. Even though object detection is one of the most common and canonical applications of AI these days, a few unique challenges made this project interesting. First, it is important to select an AI model that considers the context around a door when deciding whether it is looking at an open door, a closed door, or something else entirely. Second, model errors can have detrimental, even fatal, consequences, because of the hazards that come with being inside a busy train station. Third, the model has to process video frames at a high rate, directly on the smartphone. In the following sections, we talk in more detail about how we tackled each of these challenges.
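To make the guidance step concrete, here is a minimal sketch of how bounding boxes could be turned into a steering cue for auditory or haptic feedback. The label names, confidence threshold, and normalized coordinates are illustrative assumptions for this example, not the application's actual on-device implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str        # e.g. "open_door" or "closed_door" (label names are assumptions)
    confidence: float
    x_min: float      # normalized corner coordinates in [0, 1]
    y_min: float
    x_max: float
    y_max: float

def guidance(detections, min_confidence=0.5):
    """Turn detections into a simple steering cue (illustrative sketch only)."""
    doors = [d for d in detections
             if d.label == "open_door" and d.confidence >= min_confidence]
    if not doors:
        return "no open door detected"

    # Pick the most confident open door and compare its horizontal center
    # with the center of the camera frame.
    best = max(doors, key=lambda d: d.confidence)
    center_x = (best.x_min + best.x_max) / 2
    if center_x < 0.4:
        return "turn slightly left"
    if center_x > 0.6:
        return "turn slightly right"
    return "door straight ahead"
```

In the real application, a cue like this would be rendered as spoken audio or vibration patterns rather than text.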

 

Considering the context of an object

Figure 3: Examples of closed (left) and opened (right) train doors on a light rail car.

It is important to select an AI model that considers the context around a door when deciding whether it is looking at an open door, a closed door, or something else entirely. For example, in figure 3, the model has to recognize the closed door on the left and the open door on the right. The tricky part about the open door is that it contains the same door leaves that would represent a closed door (except that they would be touching each other). This gets even trickier for doors that have only a single door leaf, pushed to one side: it would be a catastrophic failure if the model recognized that door leaf as a closed door. This situation would overwhelm many computer vision algorithms that treat object detection and classification as two separate problems. Many of them rely on approaches related to selective search (see this blog post, for example), in effect resizing and moving a window across an image and then classifying the objects contained in the window. We therefore chose YOLO (You Only Look Once) v5, because it reformulates object detection and classification as a single problem, taking the entire image into consideration. We used a model that had been pretrained on the ImageNet dataset and fine-tuned it using Azure Machine Learning, including hyperparameter sweeps with HyperDrive.
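As an illustration of what such a sweep can look like, below is a minimal sketch using the HyperDrive APIs of the Azure Machine Learning Python SDK (v1). The compute target name, environment, training script, search ranges, and reported metric name are assumptions for this example, not the exact configuration we used.

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, BanditPolicy,
    PrimaryMetricGoal, uniform, choice,
)

ws = Workspace.from_config()
env = Environment.get(ws, name="pytorch-training-env")  # placeholder: a curated or custom PyTorch environment

# Placeholder training script that wraps YOLOv5 training and logs mAP@0.5 back to Azure ML.
src = ScriptRunConfig(
    source_directory="training",
    script="train_yolov5.py",
    compute_target="gpu-cluster",   # placeholder compute cluster name
    environment=env,
)

# Sampled values are passed to the training script as command-line arguments.
param_sampling = RandomParameterSampling({
    "--lr0": uniform(0.001, 0.02),
    "--weight_decay": uniform(0.0001, 0.001),
    "--momentum": uniform(0.85, 0.95),
    "--img_size": choice(416, 640),
})

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=BanditPolicy(evaluation_interval=2, slack_factor=0.1),  # stop clearly lagging runs early
    primary_metric_name="mAP_0.5",     # assumed metric name logged by the script
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=40,
    max_concurrent_runs=4,
)

run = Experiment(ws, "door-detection-sweep").submit(hyperdrive_config)
```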

 

Error Analysis

The second challenge was that we had to ensure that the model could be trusted to guide customers to their doors. A train station contains many potential dangers, above all the risk of falling onto the tracks and being run over by a train. We therefore had to take great care in preparing the model for various potential scenarios and in understanding its limitations precisely, so that we could communicate them clearly to users. We carefully curated an annotated image dataset covering various train models and model years, diverse perspectives of doors, and diverse surroundings. In addition to training the model on the objects we were interested in, we also trained it to recognize objects that could be mistaken for doors (e.g., gaps between cars, and windows). We then partnered with a team at Microsoft Research to perform error analysis (released as open source in the form of Jupyter notebook widgets). In essence, this approach involves assigning features to images, such as train model and year, and distance and angle to doors, and then training a decision tree that aims to predict model errors based on these features.
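The sketch below illustrates the underlying idea with scikit-learn: fit a shallow, interpretable decision tree on per-image metadata features to predict whether the detector made an error on that image, so that cohorts with elevated error rates become visible. The feature names and file layout are hypothetical; the open-source widgets mentioned above provide an interactive version of this workflow.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical per-image feature table: one row per evaluation image, with
# metadata features and a binary flag marking whether the detector missed
# or mislabelled a door in that image.
df = pd.read_csv("evaluation_features.csv")
features = ["train_model", "model_year", "distance_m", "viewing_angle_deg", "lighting"]
X = pd.get_dummies(df[features])   # one-hot encode the categorical features
y = df["detector_error"]           # 1 = at least one door missed or misclassified

# A shallow tree keeps the resulting error cohorts interpretable.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20)
tree.fit(X, y)

# The printed rules surface cohorts with elevated error rates,
# e.g. a particular train model photographed at a steep angle.
print(export_text(tree, feature_names=list(X.columns)))
```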

 

CoreML

One remaining challenge was to convert the YOLO v5 model from PyTorch to CoreML, so that it could process camera images in real time on the smartphone. This was necessary to avoid the costs of transferring data between the phone and the cloud, to reduce processing latency, and, most importantly, to address privacy concerns by ensuring that camera images are never intercepted or stored (see this repository for how to anonymize images when creating a dataset). Model conversion to CoreML can be accomplished using Core ML Tools. To achieve a high enough image throughput, we made sure that all neural network operations are supported by the Neural Engine of the smartphone. This required us to explore various changes to the model architecture. We then used HyperDrive to combine a search over these changes with a search over common hyperparameters (e.g., learning rate, weight decay, momentum), to optimize both model speed and accuracy.
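For reference, here is a minimal sketch of the general PyTorch-to-CoreML conversion path with Core ML Tools. The checkpoint name, input resolution, and deployment target are placeholders, and the sketch assumes the checkpoint holds a traceable module; in practice, the YOLOv5 export tooling takes care of preparing the model and its detection head for tracing.

```python
import torch
import coremltools as ct

# Placeholder checkpoint; assumed to contain a traceable PyTorch module.
model = torch.load("door_detector.pt", map_location="cpu").eval()

# Trace at a fixed input resolution (assumed 640x640 RGB).
example = torch.rand(1, 3, 640, 640)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 640, 640), scale=1 / 255.0)],
    compute_units=ct.ComputeUnit.ALL,           # allow CPU, GPU, and the Neural Engine
    minimum_deployment_target=ct.target.iOS15,  # placeholder deployment target
)
mlmodel.save("DoorDetector.mlpackage")
```

Profiling the converted model on the device then reveals whether any layers fall back from the Neural Engine to the CPU or GPU, which is what motivated our architecture search.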

 

Conclusion

In this blog post, we shared our learnings about the unique challenges that we encountered while working on a project that initially appeared to be a canonical use case for an object detection model. In future work, we are planning to expand the scope of the application to further improve the autonomy of passengers with impaired vision. Please let us know your thoughts in the comments below.
