High-Resolution Network: A universal neural architecture for visual recognition

This post has been republished via RSS; it originally appeared at: Microsoft Research.


Figure 1: Milestone network architectures (2012 – present)

Since AlexNet was invented in 2012, convolutional neural network architectures in computer vision have developed rapidly. Representative architectures (Figure 1) include GoogleNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016), all of which were initially developed for image classification. It has long been a golden rule that the classification architecture serves as the backbone for other computer vision tasks.

What’s next for a new architecture that is broadly applicable to general computer vision tasks? Can we design a universal architecture from general computer vision tasks rather than from classification tasks?

We pursued these questions and developed HRNet, a network that comes from general vision tasks and wins on many fronts of computer vision, including semantic segmentation, human pose estimation, and object detection. We’ve also released the code for HRNet on GitHub, and the paper on an extension of HRNet, called “HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation,” has been published at CVPR 2020.

What’s essential for tasks beyond classification? Typical tasks, such as those mentioned above, require spatially fine representations. Before HRNet, most techniques extended classification networks: they either added an extra stage to raise the spatial granularity (Figure 2) or used dilated convolutions.


Figure 2: The structure of recovering high resolution from low resolution. (a) A low-resolution representation learning subnetwork (such as AlexNet, GoogleNet, VGGNet, ResNet, DenseNet), which is formed by connecting high-to-low convolutions in series. (b) A high-resolution representation recovering subnetwork, which is formed by connecting low-to-high convolutions in series. Representative examples include SegNet, DeconvNet, U-Net and Hourglass, encoder-decoder, and SimpleBaseline.

How does HRNet do this? HRNet is conceptually different from the classification architecture. It is designed from scratch rather than derived from a classification network, and it breaks the dominant design rule of connecting convolutions in series from high resolution to low resolution, a rule that goes back to LeNet-5 (LeCun et al., 1998).

High-Resolution Network: Design and its four stages

The HRNet maintains high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several (four in the current design) stages as depicted in Figure 3, and the nth stage contains n streams corresponding to n resolutions. We conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.
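As a rough illustration of those parallel streams, here is a minimal PyTorch sketch (module and variable names are ours, not the released code): stage n simply applies a convolution block to each of its n resolutions in parallel, with each new stream at half the resolution and double the width of the one above it.

```python
import torch
import torch.nn as nn

# Sketch of HRNet's parallel multi-resolution streams (our simplification).
# Stream i runs at 1/2^i of the top resolution with base_channels * 2^i channels.
class ParallelStreams(nn.Module):
    def __init__(self, base_channels=32, num_streams=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(base_channels * 2**i, base_channels * 2**i, 3, padding=1),
                nn.BatchNorm2d(base_channels * 2**i),
                nn.ReLU(inplace=True),
            )
            for i in range(num_streams)
        )

    def forward(self, xs):
        # xs: one feature map per resolution, processed in parallel
        return [branch(x) for branch, x in zip(self.branches, xs)]

streams = ParallelStreams(base_channels=32, num_streams=4)
# Feature maps at 1/4, 1/8, 1/16, 1/32 of a 256x256 input, widths 32/64/128/256
xs = [torch.randn(1, 32 * 2**i, 64 // 2**i, 64 // 2**i) for i in range(4)]
ys = streams(xs)
print([tuple(y.shape) for y in ys])
# [(1, 32, 64, 64), (1, 64, 32, 32), (1, 128, 16, 16), (1, 256, 8, 8)]
```

Note that every resolution is preserved end to end; nothing is recovered by upsampling afterwards.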

The high-resolution representations learned from HRNet are not only semantically strong, but also spatially precise. This comes from two aspects. First, our approach connects high-to-low resolution convolution streams in parallel rather than in series. Therefore, our approach is able to maintain the high resolution instead of recovering high resolution from low resolution, and the learned representation is spatially more precise accordingly. Second, most existing fusion schemes aggregate high-resolution low-level and upsampled low-resolution high-level representations. Instead, we repeat multi-resolution fusions to boost the high-resolution representations with the help of the low-resolution representations, and vice versa. As a result, all the high-to-low resolution representations are semantically stronger.
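A minimal two-stream sketch of one such fusion (our own simplified PyTorch, not the released code, which generalizes this exchange to all stream pairs): a 1×1 convolution plus bilinear upsampling carries the low-resolution stream up, and a strided 3×3 convolution carries the high-resolution stream down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified multi-resolution fusion between two streams: each output stream
# sums contributions from both input streams.
class Fuse(nn.Module):
    def __init__(self, channels=(32, 64)):
        super().__init__()
        c_high, c_low = channels
        self.up = nn.Conv2d(c_low, c_high, 1)                      # low -> high channels
        self.down = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)  # high -> low

    def forward(self, x_high, x_low):
        # High-res output: keep x_high, add the upsampled low-res stream
        y_high = x_high + F.interpolate(self.up(x_low), size=x_high.shape[2:],
                                        mode="bilinear", align_corners=False)
        # Low-res output: keep x_low, add the strided-conv downsampled high-res stream
        y_low = x_low + self.down(x_high)
        return y_high, y_low

fuse = Fuse()
y_high, y_low = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
print(tuple(y_high.shape), tuple(y_low.shape))  # (1, 32, 64, 64) (1, 64, 32, 32)
```

Repeating this exchange over and over is what lets the high-resolution stream absorb semantic context from the low-resolution streams, and vice versa.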

 


Figure 3: An example HRNet. Only the main body is illustrated, and the stem is not included. There are four stages. The 1st stage consists of high-resolution convolutions. The 2nd (3rd, 4th) stage repeats two-resolution (three-resolution, four-resolution) blocks several (that is, 1, 4, 3) times.

Applications

The HRNet is a universal architecture for visual recognition. It has become a standard for human pose estimation since the paper was published at CVPR 2019, and it has been receiving increasing attention in semantic segmentation due to its high performance. HRNet shows superior or competitive performance on a wide range of position-sensitive tasks, including object detection, face detection, and facial landmark detection, as elaborated in this paper published at IEEE TPAMI 2020. In our CVPR 2020 paper, we recently extended it to learn higher-resolution multi-scale representations for handling the scale diversity in bottom-up pose estimation and obtained state-of-the-art results.


Figure 4: (a) HRNetV1: output only the representation from the high-resolution convolution stream. (b) HRNetV2: concatenate the (upsampled) representations from all the resolutions. (c) HRNetV2p: form a feature pyramid from the HRNetV2 representation. The four-resolution representations at the bottom of each sub-figure are output from the network in Figure 3. The gray box indicates how the output representation is obtained from the input four-resolution representations.
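The three heads in Figure 4 can be sketched in a few lines of PyTorch (our simplification; for instance, we build the pyramid with average pooling, which may differ in detail from the released code):

```python
import torch
import torch.nn.functional as F

# Sketch of the three representation heads in Figure 4 (our simplification).
def hrnetv1_head(feats):
    # (a) keep only the highest-resolution representation
    return feats[0]

def hrnetv2_head(feats):
    # (b) upsample all representations to the highest resolution and concatenate
    size = feats[0].shape[2:]
    ups = [feats[0]] + [F.interpolate(f, size=size, mode="bilinear",
                                      align_corners=False) for f in feats[1:]]
    return torch.cat(ups, dim=1)

def hrnetv2p_head(feats, levels=4):
    # (c) downsample the HRNetV2 output into a feature pyramid
    v2 = hrnetv2_head(feats)
    return [F.avg_pool2d(v2, kernel_size=2**i) if i else v2 for i in range(levels)]

# Four-resolution features as produced by the network in Figure 3
feats = [torch.randn(1, 32 * 2**i, 64 // 2**i, 64 // 2**i) for i in range(4)]
print(tuple(hrnetv1_head(feats).shape))   # (1, 32, 64, 64)
print(tuple(hrnetv2_head(feats).shape))   # (1, 480, 64, 64): 32+64+128+256 channels
print([tuple(p.shape) for p in hrnetv2p_head(feats)])
```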

Human Pose Estimation

Human pose estimation, also known as keypoint detection, aims to detect the locations of keypoints or parts (for example, elbow or wrist) in an image. HRNet applied to human pose estimation uses the representation head shown in Figure 4(a), called HRNetV1. Visual example results are shown in Figure 5.
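HRNetV1 regresses one heatmap per keypoint, and at inference each keypoint location is read off its heatmap. A simplified decoding sketch (the usual sub-pixel refinement is omitted here, and all names are ours):

```python
import torch

# Decode keypoint coordinates from per-keypoint heatmaps via argmax.
def decode_keypoints(heatmaps):
    # heatmaps: (batch, num_keypoints, H, W)
    b, k, h, w = heatmaps.shape
    flat = heatmaps.reshape(b, k, -1)
    idx = flat.argmax(dim=-1)              # flattened argmax per keypoint
    ys, xs = idx // w, idx % w
    return torch.stack([xs, ys], dim=-1)   # (batch, num_keypoints, 2) as (x, y)

heatmaps = torch.zeros(1, 17, 64, 48)      # 17 COCO keypoints
heatmaps[0, 0, 10, 20] = 1.0               # place keypoint 0 at (x=20, y=10)
coords = decode_keypoints(heatmaps)
print(coords[0, 0].tolist())               # [20, 10]
```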

The comparison with ResNet-based methods is shown in Figure 6. We can see that HRNet outperforms ResNet in terms of estimation performance (AP), parameter complexity (#parameters), and computation complexity (GFLOPS). The detailed comparison is given in Table 1.


Figure 5: Qualitative COCO human pose estimation results on representative images with various human sizes, different poses, and cluttered backgrounds.

 

Mask R-CNN (ResNet): AP 63.1
CPN (ResNet): AP 72.1
CPN (ResNet, ensemble): AP 73.0
SimpleBaseline (ResNet): AP 73.7, #parameters 68.5M, GFLOPs 35.6
HRNetV1-W32: AP 74.9, #parameters 28.5M, GFLOPs 16.0
HRNetV1-W48: AP 75.5, #parameters 63.6M, GFLOPs 32.9

Figure 6: Comparison on COCO human pose estimation between ResNet and HRNet under the same setting. HRNet performs better in terms of AP, #parameters, and computation complexity. 32 (48) in W32 (48) is the width of the high-resolution convolution.

 

HRNetV1-W32: #parameters 28.5M, GFLOPs 16.0, AP 74.9, AR 80.1
HRNetV1-W48: #parameters 63.6M, GFLOPs 32.9, AP 75.5, AR 80.5
HRNetV1-W48 + extra data: #parameters 63.6M, GFLOPs 32.9, AP 77.0, AR 82.0

Table 1: Comparison with state-of-the-art methods on COCO test-dev.

Semantic Segmentation

Semantic segmentation is a problem of assigning a class label to each pixel. The HRNet applied to semantic segmentation uses the representation head shown in Figure 4(b), called HRNetV2. Some visual example results are given in Figure 7.
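A sketch of such a segmentation head, under our own simplifying assumptions (a plain 1×1 classifier on the concatenated HRNetV2 representation, bilinearly upsampled to the input size; channel widths are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Segmentation-head sketch: per-class logits from the HRNetV2 representation,
# upsampled to input resolution for per-pixel labeling.
num_classes = 19                          # Cityscapes has 19 evaluation classes
classifier = nn.Conv2d(480, num_classes, kernel_size=1)

v2_feats = torch.randn(1, 480, 128, 256)  # 1/4-resolution HRNetV2 representation
logits = F.interpolate(classifier(v2_feats), size=(512, 1024),
                       mode="bilinear", align_corners=False)
labels = logits.argmax(dim=1)             # per-pixel class prediction
print(tuple(logits.shape), tuple(labels.shape))  # (1, 19, 512, 1024) (1, 512, 1024)
```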


Figure 7: Qualitative segmentation on Cityscapes images.

Table 2 compares HRNet with the state-of-the-art methods U-Net++, DeepLab, and PSPNet on the Cityscapes validation data. We can see that HRNet achieves better results with even lower parameter and computation complexity. A comparison with existing state-of-the-art methods on the Cityscapes test set is provided in Table 3. Results on other datasets can be found in this IEEE TPAMI 2020 paper.

HRNetV2-W40: #params 45.2M, GFLOPs 493.2, mIoU 80.2
HRNetV2-W48: #params 65.9M, GFLOPs 747.3, mIoU 81.1
(Both outperform U-Net++, DeepLabv3(+), and PSPNet.)

Table 2: Comparison with representative segmentation methods on Cityscapes validation. HRNet performs superiorly in terms of parameter complexity, computation complexity, and segmentation quality.

HRNetV2-W48: mIoU 81.6
HRNetV2-W48 + OCR: mIoU 82.5
(HRNetV2 outperforms DeepLab (70.4), PSANet (80.1), and DenseASPP (80.6).)

Table 3: Comparison with existing state-of-the-art methods on the Cityscapes test set. OCR stands for the object-contextual representation we proposed.

Object Detection and Instance Segmentation

Object detection aims to identify the bounding box of each object instance in an image, and instance segmentation aims to identify the pixels belonging to each object instance. Examples are shown in Figure 8. We apply the multi-level representations, HRNetV2p, shown in Figure 4(c), to object detection and instance segmentation. The comparisons in Tables 4 and 5 show that HRNet outperforms ResNet and ResNeXt.


Figure 8: Qualitative examples for COCO object detection (left three images) and instance segmentation (right three images).

Faster R-CNN, HRNetV2p-W32: AP 41.1, APS 24.0, APM 43.1, APL 51.4
Faster R-CNN, HRNetV2p-W48: AP 42.4, APS 24.9, APM 44.6, APL 53.0
Cascade R-CNN, HRNetV2p-W32: AP 43.7, APS 25.5, APM 46.0, APL 55.3
(HRNet outperforms ResNet under both the Faster R-CNN and Cascade R-CNN frameworks.)

Table 4: Object detection comparison with ResNet and ResNeXt with similar parameter and computation complexities under the Faster R-CNN and Cascade R-CNN frameworks on COCO test-dev, without multi-scale training and testing. This shows that HRNet performs better than ResNet and ResNeXt.

 

HRNetV2p-W18: mask AP 35.3 (APS 16.9, APM 37.5, APL 51.8); bbox AP 39.2 (APS 23.7, APM 41.7, APL 51.0)
HRNetV2p-W32: mask AP 37.6 (APS 17.8, APM 40.0, APL 55.0); bbox AP 42.3 (APS 25.0, APM 45.4, APL 54.9)

Table 5: Object detection (bbox) and instance segmentation (mask) comparison with ResNet with similar parameter and computation complexities under the Mask R-CNN framework on COCO val, without multi-scale training and testing. This shows that HRNet performs better than ResNet.

Runtime Cost

What about runtime costs for HRNet? Is HRNet expensive in terms of memory and computation? The answer is an emphatic no. Table 6 gives the runtime cost comparison on the PyTorch 1.0 platform. In human pose estimation, HRNet achieves a superior estimation score with much lower training and inference memory cost and only slightly larger training and inference time cost. In semantic segmentation, HRNet outperforms PSPNet and DeepLabV3 on all the metrics, and its inference time is less than half that of PSPNet and DeepLabV3. In object detection, HRNet is also better than ResNet and ResNeXt.

HRNetV1-W32, human pose estimation: train memory 5.7G, inference memory/image 0.13G, train seconds/iteration 1.153, inference seconds/image 0.057 (0.015 on MXNet), AP 74.4
HRNetV2-W48, semantic segmentation: train memory 13.9G, inference memory/image 1.79G, train seconds/iteration 0.692, inference seconds/image 0.150, mIoU 81.1

Table 6.1: Memory and time cost for human pose estimation on COCO val and semantic segmentation on Cityscapes val.

We report inference time for pose estimation on MXNet 1.5.1, which supports static-graph inference, from which the multi-branch convolutions used in HRNet benefit. The numbers for training are obtained on a machine with 4 V100 GPU cards. During training, the input sizes are 256×192, 512×1024, and 800×1333, and the batch sizes are 128, 8, and 8 for pose estimation, segmentation, and detection, respectively. The numbers for inference are obtained on a single V100 GPU card. The input sizes are 256×192, 1024×2048, and 800×1333, respectively. The score means AP for pose estimation on COCO val and detection on COCO val, and mIoU for segmentation on Cityscapes val. PSPNet and DeepLabV3 use dilated ResNet-101 as the backbone. (See Tables 6.1 and 6.2.)

HRNetV2p-W32, object detection (Faster R-CNN): train memory 8.5G, inference memory/image 0.51G, train seconds/iteration 0.690, inference seconds/image 0.101, AP 40.9
HRNetV2p-W48, object detection (Faster R-CNN): train memory 11.3G, inference memory/image 0.79G, train seconds/iteration 0.965, inference seconds/image 0.116, AP 41.8

Table 6.2: Training and inference memory and time cost for object detection on COCO val.
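For readers who want to reproduce this kind of measurement, here is a small timing-and-memory harness in PyTorch (our own sketch, not the script used for Table 6): it times a forward pass and reads peak GPU memory, falling back to CPU timing when no GPU is available.

```python
import time
import torch

# Measure average inference seconds/image and peak GPU memory for a model.
def measure(model, input_size,
            device="cuda" if torch.cuda.is_available() else "cpu",
            warmup=3, iters=10):
    model = model.to(device).eval()
    x = torch.randn(1, 3, *input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels / caches
            model(x)
        if device == "cuda":
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()     # wait for queued GPU work
        seconds_per_image = (time.perf_counter() - start) / iters
    peak_gb = (torch.cuda.max_memory_allocated() / 1e9
               if device == "cuda" else None)
    return seconds_per_image, peak_gb

# Tiny stand-in model; an HRNet model would be measured the same way.
toy = torch.nn.Conv2d(3, 8, 3, padding=1)
secs, mem = measure(toy, (64, 64))
print(secs > 0)
```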

ImageNet Pretraining

We pretrain HRNet, augmented with a classification head, as shown in Figure 9. We do not aim to push the state-of-the-art result for ImageNet classification, so we do not utilize tricks to improve training. The pretraining results and the comparison with ResNet are given in Table 7. The results are similar to, and slightly better than, those of ResNet.


Figure 9: Representation head for ImageNet classification. The input to the box is the representations of the four resolutions.
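A simplified sketch of such a classification head (channel widths and the merge scheme are our assumptions, not the exact released design): higher-resolution features are progressively downsampled and merged into the lowest-resolution stream, then globally pooled and classified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Classification-head sketch: fold the four resolutions down into one vector.
class ClsHead(nn.Module):
    def __init__(self, widths=(32, 64, 128, 256), num_classes=1000):
        super().__init__()
        # Strided convs carry each stream down to the next resolution/width
        self.downs = nn.ModuleList(
            nn.Conv2d(widths[i], widths[i + 1], 3, stride=2, padding=1)
            for i in range(len(widths) - 1)
        )
        self.fc = nn.Linear(widths[-1], num_classes)

    def forward(self, feats):
        x = feats[0]
        for down, f in zip(self.downs, feats[1:]):
            x = down(x) + f                  # merge into the next lower stream
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)  # global average pooling
        return self.fc(x)

head = ClsHead()
feats = [torch.randn(1, 32 * 2**i, 64 // 2**i, 64 // 2**i) for i in range(4)]
print(tuple(head(feats).shape))   # (1, 1000)
```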

 

HRNet-W44-C: #params 21.9M, GFLOPs 3.90, top-1 err. 23.0%, top-5 err. 6.5%
HRNet-W76-C: #params 40.8M, GFLOPs 7.30, top-1 err. 21.5%, top-5 err. 5.8%
HRNet-W96-C: #params 57.5M, GFLOPs 10.2, top-1 err. 21.0%, top-5 err. 5.7%

Table 7: ImageNet classification results of HRNet and ResNet. The proposed networks are named HRNet-Wx-C, where x is the width.

Conclusions

The high-resolution network (HRNet) is a universal architecture for visual recognition. Its applications are not limited to those shown above; it is also suitable for other position-sensitive vision applications, such as face alignment, face detection, super-resolution, optical flow estimation, and depth estimation. There are already follow-up works using HRNet for image stylization, inpainting, image enhancement, image dehazing, temporal pose estimation, and drone object detection.

It is reported in this paper that a slightly modified HRNet combined with ASPP achieved the best single-model performance on Mapillary panoptic segmentation. In the COCO and Mapillary Joint Recognition Challenge Workshop at ICCV 2019, the COCO DensePose challenge winner and almost all the COCO keypoint detection challenge participants adopted HRNet. The OpenImage instance segmentation challenge winner (ICCV 2019) also used HRNet.

