
A Review on Deep Learning Techniques Applied to Semantic Segmentation

1. Research questions

This paper surveys semantic segmentation algorithms based on deep learning. Its main contributions are as follows:

  • Provides an extensive survey of existing datasets.
  • Deeply reviews deep-learning-based semantic segmentation algorithms.
  • Gives a comprehensive performance evaluation, collecting quantitative metrics such as accuracy, execution time, and memory footprint.
  • Offers an outlook on future research directions.

2. Terminology and background concepts

Scene understanding is a task that goes from coarse to fine: classification, detection or localization, semantic segmentation, instance segmentation, as shown in the figure below.
[Figure: scene understanding tasks from coarse to fine]
This post mainly summarizes semantic segmentation algorithms. Segmentation is not limited to two-dimensional images; it can also be extended to stereo data and hyperspectral semantic segmentation.

2.1 Common deep network architectures

2.1.1 AlexNet

ILSVRC-2012, top-5 test accuracy of 84.6%.
[Figure: AlexNet architecture]

2.1.2 VGG

ILSVRC-2014, top-5 test accuracy of 92.7%.
[Figure: VGG-16 architecture]

2.1.3 GoogLeNet

ILSVRC-2014, top-5 test accuracy of 93.3%.
[Figure: GoogLeNet architecture]

2.1.4 ResNet

ILSVRC-2015, top-5 test accuracy of 96.4%.
[Figure: ResNet architecture]

2.1.5 ReNet

Unlike the convolutional networks above, ReNet replaces convolution + pooling layers with four recurrent networks that sweep the image horizontally and vertically in both directions.
[Figure: ReNet architecture]

2.2 Transfer learning

One of the main transfer learning scenarios is to fine-tune the weights of a pre-trained network by continuing the training process.

Research has shown that pre-trained weights consistently work better than randomly initialized weights. Note that transfer learning generally reuses an existing architecture, fine-tuning usually targets the higher layers, and the learning rate should be set smaller.
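To make the recipe concrete, here is a minimal PyTorch sketch of this kind of fine-tuning (the backbone choice, the split index 17, and the learning rates are illustrative assumptions, not a prescription from the paper):

```python
import torch
import torchvision

# Load an ImageNet pre-trained backbone
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")

# Freeze the lower, more generic feature layers...
for p in model.features[:17].parameters():
    p.requires_grad = False

# ...and fine-tune only the higher layers, with small learning rates
optimizer = torch.optim.SGD(
    [
        {"params": model.features[17:].parameters(), "lr": 1e-4},
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ],
    momentum=0.9,
)
```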

2.3 Data preprocessing and augmentation

Data augmentation is often used to expand datasets, prevent overfitting, and provide regularization. Common augmentation methods include translation, rotation, warping, scaling, color space shifts, and cropping.
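For instance, a minimal torchvision sketch of such an augmentation pipeline (the specific parameter values are illustrative):

```python
from torchvision import transforms

# Translation/rotation/scaling via a random affine map, plus color-space
# perturbation, flipping, and cropping, applied on the fly during training
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(224, pad_if_needed=True),
    transforms.ToTensor(),
])
```

Note that for segmentation, the geometric transforms must be applied identically to the label mask, not just to the image.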

3. Datasets and challenges

The paper enumerates 2D (grayscale or RGB) datasets, 2.5D (RGB-D) datasets, and volumetric (3D) datasets.

[Table: overview of the reviewed datasets]

3.1 2D datasets

  • PASCAL Visual Object Classes (VOC): 21 classes in total (including objects and background); the training and validation sets contain 1464 and 1449 images respectively.
  • PASCAL Context: an extension of PASCAL VOC 2010 containing 540 classes, but for convenience of research only 59 classes are typically used, with all other classes relabeled as background.
  • PASCAL Part: an extension of PASCAL VOC 2010 that keeps the original VOC classes and introduces part-level labels; for example, bicycles are now decomposed into rear wheel, chain wheel, front wheel, handlebar, headlight, and saddle.
  • Semantic Boundaries Dataset (SBD): an extension of PASCAL VOC 2011 containing both semantic segmentation and instance segmentation labels, split into a training set (8498 images) and a validation set (2857 images). It can usually serve as an alternative to PASCAL VOC.
  • Microsoft Common Objects in Context (COCO): a large-scale dataset with 80 classes; the training set has 82,783 images, the validation set has 40,504 images, and the test set has about 80,000 images divided into four subsets of roughly 20,000 each.
  • SYNTHetic Collection of Imagery and Annotations (SYNTHIA): a large dataset of synthetic urban scenes with 11 classes and 13,407 training images in total. It is characterized by diversity in scene types (towns, cities, highways), dynamic objects, seasons, and weather.
  • Cityscapes: a large database focused on semantic understanding of urban street scenes. It provides semantic, instance, and dense pixel annotations for 30 classes grouped into 8 categories. The dataset consists of about 5000 finely annotated images and 20,000 coarsely annotated ones, and features many dynamic objects, varying scene layouts, and varying backgrounds.
  • CamVid: a road/driving scene understanding database with 32 classes. Later researchers split it into 367 training, 100 validation, and 233 test images; this split uses a subset of 11 class labels.
  • KITTI: a dataset for mobile robotics and autonomous driving. Despite its popularity, the dataset itself does not include ground-truth labels for semantic segmentation; however, various researchers have manually annotated parts of it to fit their needs.
  • Youtube-Objects: a database of videos collected from YouTube containing objects from 10 PASCAL VOC classes. The database has no per-pixel labels, but researchers manually annotated a subset of 126 sequences, for a total of 10,167 annotated frames at a resolution of 480×360 pixels.
  • Adobe's Portrait Segmentation: a dataset of 800×600-pixel portrait images collected from Flickr, consisting of 1500 training images and 300 images reserved for testing, both fully binarily annotated as person or background. The dataset is suited to foreground person segmentation applications.
  • Materials in Context (MINC): a dataset for patch material classification and full-scene material segmentation, providing segment annotations for 23 categories: 7061 labeled material segments for training, 5000 for testing, and 2500 for validation.
  • Densely-Annotated VIdeo Segmentation (DAVIS): a challenge aimed at video object segmentation. Its dataset consists of 50 high-definition sequences, with 4219 and 2023 frames for training and validation respectively. There are four object categories.
  • Stanford Background: a dataset of outdoor scene images imported from existing public datasets (LabelMe, MSRC, PASCAL VOC, and Geometric Context). It contains 715 images (approximately 320×240 pixels), each with at least one foreground object and the horizon located within the image.
  • SiftFlow: contains 2688 fully annotated images, a subset of the LabelMe database. Most images show one of 8 different outdoor scene types, including streets, mountains, fields, beaches, and buildings. Images are 256×256 pixels and belong to one of 33 semantic classes; unlabeled pixels, or pixels labeled as a different semantic class, are treated as unlabeled.

3.2 2.5D datasets

3.3 3D datasets

4. Methods

The fully convolutional network (FCN) is the cornerstone of semantic segmentation networks. It replaces the fully connected layers of common classification models with convolutional layers, so the output is a spatial heat map rather than classification scores; the network structure is shown below. The network also introduced deconvolution for upsampling, though this blogger recalls that later articles found deconvolution upsampling to add little. Skip connections were proposed as well.

[Figure: FCN architecture]
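A minimal sketch of the core FCN idea, converting a classifier's fully connected layers into convolutions and upsampling the coarse score map (layer sizes follow VGG-16, but this is an illustration, not the paper's exact model; the paper uses learned deconvolution rather than the bilinear upsampling used here):

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class TinyFCN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
        self.features = vgg.features          # conv layers, overall stride 32
        # FC layers re-expressed as convolutions so any input size works
        self.fc_conv = nn.Sequential(
            nn.Conv2d(512, 4096, 7, padding=3), nn.ReLU(),
            nn.Conv2d(4096, 4096, 1), nn.ReLU(),
        )
        self.score = nn.Conv2d(4096, num_classes, 1)   # spatial heat maps

    def forward(self, x):
        h, w = x.shape[-2:]
        x = self.score(self.fc_conv(self.features(x)))
        # Upsample the coarse score map back to the input resolution
        return F.interpolate(x, size=(h, w), mode="bilinear",
                             align_corners=False)
```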
Although the FCN model is powerful and flexible, it still has shortcomings that hinder its application to certain problems and scenarios: its inherent spatial invariance does not take useful global context information into account; it has no instance awareness by default; its efficiency is far from real-time at high resolutions; and it is not entirely suitable for unstructured data such as 3D point clouds or meshes.

Here are some representative papers reviewed in this paper .

[Table: representative papers reviewed]
The following figure gives an overview of deep-learning-based semantic segmentation methods.
[Figure: overview of deep-learning-based semantic segmentation methods]

4.1 Decoder variants

Besides the FCN architecture, there is also the encoder-decoder architecture, whose representative is SegNet, shown in the figure below. This network reuses the max-pooling indices from the encoder when upsampling in the corresponding decoder layers, which is known as max-unpooling.

[Figure: SegNet architecture]
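A minimal sketch of SegNet's max-unpooling trick: the encoder's pooling indices are stored and reused to place decoder activations (one encoder/decoder stage shown; tensor sizes are illustrative):

```python
import torch
import torch.nn as nn

pool   = nn.MaxPool2d(2, stride=2, return_indices=True)  # encoder pooling
unpool = nn.MaxUnpool2d(2, stride=2)                      # decoder upsampling

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)        # indices remember where each max came from
up = unpool(pooled, indices)     # non-maxima positions are filled with zeros
print(up.shape)                  # torch.Size([1, 64, 32, 32])
```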
The comparison between SegNet and FCN is shown below.

[Figure: comparison of SegNet and FCN]

4.2 Integrating contextual information

Integrating contextual information can resolve local ambiguities, so local and global information need to be balanced.

There are many ways to make a CNN aware of global information: conditional random fields (CRFs), dilated convolutions, multi-scale aggregation, or deferring context modeling to another kind of deep network, such as an RNN.

4.2.1 Conditional random fields (CRF)

As mentioned earlier, the inherent spatial invariance of CNNs limits the spatial accuracy of segmentation. Adding a CRF as a post-processing stage can improve the network's ability to capture fine-grained details. CRFs combine low-level image information (such as interactions between pixels) with the output of multi-class inference systems that produce per-pixel class scores. This combination is especially important for capturing long-range dependencies, which CNNs cannot model, as well as fine local details.

The DeepLab model uses a fully connected pairwise CRF as a separate post-processing step in its pipeline to refine the segmentation result. It models each pixel as a node in the field and defines one pairwise term for every pair of pixels, no matter how far apart they are (a model known as a dense or fully connected factor graph). With this model, both short-range and long-range interactions are taken into account, allowing the system to recover detailed structures that are lost due to the spatial invariance of the CNN. Although fully connected models are usually inefficient, this model can be efficiently approximated via probabilistic (mean-field) inference.
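A sketch of such fully connected CRF post-processing using the third-party pydensecrf package (assuming `probs` is the network's softmax output of shape (num_classes, H, W) and `img` the corresponding RGB image; the pairwise kernel parameters are illustrative):

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(img, probs, iters=5):
    # img: (H, W, 3) uint8; probs: (num_classes, H, W) softmax scores
    n_classes, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, n_classes)
    d.setUnaryEnergy(unary_from_softmax(probs))     # -log(prob) unaries
    # Short-range smoothness kernel (position only)
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Long-range appearance kernel (position + color): the "bilateral" term
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(img), compat=10)
    q = np.array(d.inference(iters))                # mean-field approximation
    return np.argmax(q, axis=0).reshape(H, W)       # refined label map
```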

The following figure shows the effect of this CRF-based post-processing on the score maps (before softmax) and belief maps (after softmax) produced by the DeepLab model.

[Figure: DeepLab score and belief maps before and after CRF post-processing]
Bell et al. used other CNN variants with a CRF as post-processing to segment materials (the MINC dataset).

Zheng et al. proposed CRFasRNN, which formulates the CRF as an RNN and integrates it into the CNN for end-to-end training; this is an important piece of work on CRFs.

4.2.2 Dilated convolution

Dilated (atrous) convolution expands the receptive field without adding extra parameters; in addition, it avoids the information loss caused by excessive pooling and downsampling. Ordinary convolution is 1-dilated convolution. Dilated convolution is shown in the figure below.

[Figure: dilated convolutions with increasing dilation rates]
Dilated convolution is equivalent to filling the convolution kernel with zeros between its elements before convolving; in this way, features at longer ranges can be extracted, as shown in the figure below.

[Figure: dilated convolution as a zero-expanded kernel]
Dilated convolution is often used in networks for multi-scale context aggregation.
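A small PyTorch sketch of the receptive-field effect (padding chosen to preserve spatial size; sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Ordinary 3x3 convolution (1-dilated): receptive field 3x3
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)

# 2-dilated 3x3 convolution: same parameter count, receptive field 5x5
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, dilated(x).shape)   # both keep the 56x56 resolution
print(sum(p.numel() for p in conv.parameters()) ==
      sum(p.numel() for p in dilated.parameters()))  # True: no extra params
```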

4.2.3 Multi-scale prediction

Every parameter in a CNN affects the scale of the generated feature maps, which means the network implicitly learns to detect features at a specific scale; a single-scale network is therefore hard to generalize to different scales. Hence multi-scale networks are used.

Raj et al. proposed a multi-scale version of a fully convolutional VGG-16.

Roy et al. proposed a multi-scale CNN composed of four networks, each a coarse-to-fine multi-scale network in the style of Eigen et al.

[Figure: multi-scale network architecture]
Another outstanding work is the multi-scale network of Bian et al., composed of n independent FCNs. Its main contribution is a two-stage learning process: first the scale-specific networks are trained independently, then they are combined, fused with an additional convolution layer, and fine-tuned.
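To make the multi-scale idea concrete, here is a sketch of simple multi-scale inference with one shared network (the scale set and fusion-by-averaging are assumptions for illustration, not the scheme of any particular paper above):

```python
import torch.nn.functional as F

def multiscale_predict(net, img, scales=(0.5, 1.0, 1.5)):
    # Run the same network on rescaled inputs and average the
    # upsampled score maps; img: (B, 3, H, W)
    H, W = img.shape[-2:]
    scores = 0
    for s in scales:
        x = F.interpolate(img, scale_factor=s, mode="bilinear",
                          align_corners=False)
        y = net(x)                               # (B, num_classes, h, w)
        scores = scores + F.interpolate(y, size=(H, W), mode="bilinear",
                                        align_corners=False)
    return scores / len(scales)
```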

4.2.4 Feature fusion

Feature fusion is another way to add context. There are two main architectures: skip-connection-like architectures and the ParseNet context module, shown in the two figures below.

[Figure: skip-connection-like architecture]
[Figure: ParseNet context module]
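A sketch of a ParseNet-style context module as this blogger understands it (global pooling, tiling back to the spatial grid, L2 normalization, then concatenation; simplified from the original):

```python
import torch
import torch.nn.functional as F

def parsenet_context(feat):
    # feat: (B, C, H, W) local feature map
    B, C, H, W = feat.shape
    glob = F.adaptive_avg_pool2d(feat, 1)   # (B, C, 1, 1) global context
    glob = glob.expand(-1, -1, H, W)        # "unpool": tile it everywhere
    # L2-normalize both paths so their scales are comparable before fusion
    return torch.cat([F.normalize(feat, dim=1),
                      F.normalize(glob, dim=1)], dim=1)   # (B, 2C, H, W)
```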

4.2.5 Recurrent neural networks

By linking pixel-level information with local information, RNNs can successfully model global context and improve semantic segmentation. However, an important problem is that images lack a natural sequential structure, while standard vanilla RNN architectures focus on one-dimensional inputs.

Based on ReNet, a method for image classification, Visin et al. proposed a semantic segmentation architecture called ReSeg, shown in the figure below. In this work, gated recurrent units (GRUs) were used because they strike a good balance between memory usage and computational power. LSTMs and GRUs can overcome the vanishing gradient problem that vanilla RNNs face when modeling long-term dependencies.

[Figure: ReSeg architecture]
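A minimal sketch of a ReNet/ReSeg-style layer: bidirectional GRUs first sweep every row, then every column, so each output position can see the whole image (the hidden size and data layout are assumptions):

```python
import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    def __init__(self, in_ch, h=64):
        super().__init__()
        self.row_rnn = nn.GRU(in_ch, h, batch_first=True, bidirectional=True)
        self.col_rnn = nn.GRU(2 * h, h, batch_first=True, bidirectional=True)

    def sweep(self, rnn, x):
        # x: (B, H, W, C) -> run the RNN along the W axis of every row
        B, H, W, C = x.shape
        out, _ = rnn(x.reshape(B * H, W, C))
        return out.reshape(B, H, W, -1)

    def forward(self, x):                 # x: (B, C, H, W)
        x = x.permute(0, 2, 3, 1)         # to (B, H, W, C)
        x = self.sweep(self.row_rnn, x)   # horizontal sweep
        x = self.sweep(self.col_rnn,      # vertical sweep via transpose
                       x.transpose(1, 2)).transpose(1, 2)
        return x.permute(0, 3, 1, 2)      # back to (B, 2h, H, W)
```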
There are many more advanced methods ...

4.3 Instance segmentation

Instance segmentation is considered the next step after semantic segmentation; its main purpose is to separate objects of the same class into different instances. Instance labels provide additional information that can be used to reason about occlusion, count the number of elements belonging to the same class, and detect specific objects for grasping in robotics tasks, among many other applications.

Since this blogger is not working on instance segmentation for the time being, it will not be covered in detail...

4.4 RGB-D data

So far we have used photometric data for semantic segmentation. With the low cost of RGB-D cameras, semantic segmentation based on RGB-D data has also received increasing attention. Since depth information carries rich structural information, segmentation accuracy can be improved by exploiting it.

Using depth images with methods designed for photometric data is not straightforward; the depth data needs to be encoded into three channels at each pixel, as if it were an RGB image (for example, with the HHA encoding).

Specific methods will not be introduced ...

4.5 3D data

3D geometric data (such as point clouds or polygon meshes) provides rich spatial information that can intuitively be exploited for segmentation. However, CNNs are designed to process structured data, while 3D data is unstructured.

Most researchers convert unstructured, unordered point clouds or meshes into regular representations, such as 3D voxel grids or projections, before feeding them into the network, and then map the predicted labels back onto the point cloud, as shown in the figure below.

[Figure: pipeline converting unstructured 3D data to a regular representation]
Although this approach has been applied successfully, it has drawbacks such as quantization artifacts, loss of spatial information, and unnecessarily large representations.

For this reason, various researchers have focused on deep architectures that work directly on unstructured 3D point sets or meshes.

PointNet is a pioneering work that proposes a deep neural network taking the raw point cloud as input, providing a unified framework for both classification and segmentation, as shown in the figure below.

[Figure: PointNet architecture]

PointNet is a deep network architecture that stands out because it is built on fully connected layers rather than convolutional layers. The architecture has two sub-networks: one for classification and one for segmentation. The classification sub-network takes the point cloud and applies a set of transformations and multi-layer perceptrons (MLPs) to generate per-point features, then uses max-pooling aggregation to produce a global feature describing the original input cloud. This global feature is classified by another MLP to produce output scores per class. The segmentation sub-network concatenates the global feature with the per-point features extracted by the classification network and applies two more MLPs to generate features and output scores for each point.
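A heavily simplified PointNet-style sketch of the segmentation path (the input/feature transform networks are omitted and the layer sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class TinyPointNetSeg(nn.Module):
    def __init__(self, num_classes, in_dim=3):
        super().__init__()
        # A "shared MLP" over points is a 1x1 convolution along the point axis
        self.local = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        self.seg = nn.Sequential(
            nn.Conv1d(128 + 128, 128, 1), nn.ReLU(),
            nn.Conv1d(128, num_classes, 1),
        )

    def forward(self, pts):                # pts: (B, 3, N) raw point cloud
        local = self.local(pts)            # per-point features (B, 128, N)
        glob = local.max(dim=2, keepdim=True).values   # global feature via max-pool
        glob = glob.expand(-1, -1, local.size(2))      # replicate to every point
        # Concatenate global context with per-point features, then score points
        return self.seg(torch.cat([local, glob], dim=1))  # (B, num_classes, N)
```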

4.6 Video sequences

When dealing with semantic segmentation of video sequences, the intuitive approach is to apply the previous methods frame by frame. However, this incurs a huge computational cost. Moreover, it ignores the temporal continuity hidden in video, which could be exploited to improve segmentation accuracy and reduce running time.

The most outstanding work is clockwork FCN, an adaptation of FCN that exploits temporal cues in video to reduce inference time while maintaining accuracy.

[Figure: clockwork FCN architecture]
Other methods will not be expanded in detail...

5. Discussion

This section quantitatively analyzes the segmentation algorithms. First, we describe the most popular evaluation metrics for measuring the performance of semantic segmentation systems, covering three aspects: execution time, memory usage, and accuracy. Next, we collect the results of the methods on the most representative datasets using these metrics. Then we summarize the results and draw conclusions. Finally, we list possible future research directions that we believe are of great significance to this field.

5.1 Evaluation metrics

5.1.1 Execution time

Speed, or running time, is a very valuable metric, because most systems must meet hard requirements on how much time inference may take. However, execution time strongly depends on the hardware, so this metric is not decisive on its own; it mainly serves as a reference telling researchers whether, on given equipment, an algorithm can reach the required speed.
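When reporting this metric, a minimal sketch of a fair GPU timing loop helps (warm-up and iteration counts are arbitrary; a CUDA device is assumed):

```python
import time
import torch

@torch.no_grad()
def timed_inference(net, x, warmup=10, iters=50):
    # Warm up to exclude one-off costs (allocator, cuDNN autotuning)
    for _ in range(warmup):
        net(x)
    # CUDA kernels launch asynchronously: synchronize around the clock reads
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        net(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters   # seconds per forward pass
```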

5.1.2 Memory footprint

Memory usage is another important factor in segmentation methods. Although it is arguably less restrictive than execution time, since expanding memory capacity is usually feasible, it can still be a limiting factor. Considering the same implementation-related aspects as for runtime, recording a method's peak and average memory usage together with a complete description of the execution conditions can be very helpful.
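A small sketch for recording peak GPU memory with PyTorch's built-in counters (again assuming a CUDA device):

```python
import torch

@torch.no_grad()
def peak_memory_mb(net, x):
    # Clear the running maximum, run one forward pass, then read the peak
    torch.cuda.reset_peak_memory_stats()
    net(x)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20   # MiB
```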

5.1.3 Accuracy

Many evaluation criteria have been proposed for assessing the accuracy of semantic segmentation techniques. These metrics are usually variations of pixel accuracy and IoU.

For the sake of explanation, we introduce the following notation: assume there are $k+1$ classes (from $L_0$ to $L_k$, including an empty class or background), and let $p_{ij}$ denote the number of pixels belonging to class $i$ that are predicted as class $j$. Therefore, we can define:

  • $p_{ii}$: the number of true positives.
  • $p_{ij}$: the number of false positives.
  • $p_{ji}$: the number of false negatives.

The accuracy metrics used in semantic segmentation are listed below:
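According to the original review, these are pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (MIoU), and frequency-weighted IoU (FWIoU):

$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$$

$$\mathrm{MPA} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$

$$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$

$$\mathrm{FWIoU} = \frac{1}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}\sum_{i=0}^{k}\frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$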

5.2 Results

5.2.ane RGB

[Tables: accuracy results of the reviewed methods on the most representative RGB datasets]

5.2.2 2.5D

5.2.3 3D

5.2.4 Sequences

5.3 Summary

DeepLab performs the best and most stably on RGB image segmentation.

5.4 Future research directions

6. Conclusion


Source: https://chowdera.com/2022/03/202203081117102120.html