Fundamental Research. 2023 Dec 23;5(4):1659–1667. doi: 10.1016/j.fmre.2023.11.013

SGPLane: Efficient lane detection via sampled grid points for autonomous driving

Xuewei Tang, Mengmeng Yang, Kun Jiang, Tuopu Wen, Benny Wijaya, Diange Yang
PMCID: PMC12327821  PMID: 40777802


Keywords: Autonomous driving, Autonomous vehicle navigation, Vehicle perception, Road scene understanding, Lane detection

Abstract

Lane detection is one of the critical tasks for autonomous driving. Earlier works revolved around semantic segmentation or object detection with special post-processing programs for lanes. However, most methods still suffer from unstable post-processing algorithms, which leaves a gap between the camera input and downstream applications. In this paper, we propose a novel detection representation for lanes and design a simple network that requires no complicated post-processing. Specifically, we use sampled grid points to express lane lines and construct a network for this special lane format, called SGPLane. The network learns a regression branch and a confidence branch to realize end-to-end lane detection by thresholding the confidence value. Our model is validated on a typical public dataset and on real-world driving scenes. Experiments on lane detection benchmarks show that our method outperforms previous methods, with an accuracy of 96.84% on the TuSimple dataset at high FPS and 76.85% on our real-world dataset.

1. Introduction

Lane detection is a critical challenge in the domain of autonomous driving, with its performance significantly influencing downstream applications [1]. For dynamic maps, positioning and decision making, high-quality vectorized lane lines are required. For other applications, such as crowdsourced mapping, low time cost is a crucial capability. With the rapid advancements in deep learning theories, a plethora of impressive methods have emerged, providing inspiration for robust lane extraction. Several works in lane detection have demonstrated significant progress by employing effective models.

From a representation perspective, lane detection methods can be roughly divided into three categories: point-based [2], [3], [4], [5], [6], line-based [7], and region-based [8], [9]. The point-based methods usually represent lanes as a set of sampled points [6]. The line-based methods usually describe lanes with a curve-fitting formula, or with anchors defined by a rotation angle and offsets [10]. The region-based methods usually consider each lane as a set of contiguous regions, processed by a classical semantic segmentation model [8]. The region-based representation is closest to the original lane pixels in the image, but it struggles to distinguish lanes at the instance level; it is also weak when multiple lanes are very close together. Line-based representations can export lanes directly. However, formula-based line methods have no direct connection to the shape in the image, so the results are seriously distorted when the parameters are far from the ground truth [11], and anchor-based line methods are usually bottlenecked by anchor shapes and anchor filtering [10]. We choose points to represent lane lines mainly because points are flexible enough to characterise line shapes with high accuracy, and points can be combined to form lane instances. However, previous point-based methods [2], [3], [4], [5], [6] are inefficient at learning the exact position of abundant points [12] and at associating points that belong to the same lane.

To overcome the above limitations, we formulate lane detection as a grid-sampled point regression problem and propose an end-to-end lane detection model. As shown in Fig. 1d, the y-axis of the image is sampled with a proportionally spaced grid, so that the spacing between near coordinates is large and the spacing between far coordinates is small. The grid is a set of fixed y values sampled from the y-axis, as shown in Fig. 1d, which can be formulated as Eqs. 2 and 3. Since the grid y values are preset, we only need to sample the points of the lane line on the grid and regress the corresponding x coordinates to obtain the lane line results. However, not all preset grid points have corresponding lane lines, so the classification value of non-existent sampled points is set to 0.

Fig. 1. Comparison of our method with previous work on lane detection based on sampled points [2], [3], [6]. (a) FOLOLane [3] predicts candidate keypoints, constructs associations between them, and then groups the keypoints into lane lines. (b) RCLane [2] models the keypoints of each lane as a relay chain: it predicts a starting point and then propagates up and down point by point. (c) GANet [6] directly regresses each key point to its associated lane by predicting the offset between each key point and the starting point of its corresponding lane line. (d) Our model samples the key points of each lane through a grid on the y-axis and regresses the x coordinates of the sampled key points.

This approach enables a detailed description of the shape changes of each lane through point sampling, even for distant lane lines. The advantage of fixed ordinates is that the network can learn the corresponding normalised abscissas in the order of ordinate sampling, so each group of network outputs corresponds to one lane line, and refined, instantiated lane line results can be obtained directly. The points belonging to a line are output on the same layer of the network, so the model directly yields instantiated lane lines, and coordinates are output directly through the sampled grid design. These two design choices avoid complex post-processing, so the model can be applied to real vehicles with high efficiency. During training, this ordered sampling design makes the receptive field of the network automatically focus on different positions of the lane line, which is equivalent to an adaptive point anchor. To control the occurrence of redundant points on a lane line, we add a confidence branch that determines, given a threshold, whether each point exists. This allows us to control the number of lane lines and sampling points in the output. To make this pipeline effective, we add Hungarian matching to the training process, which avoids complex post-processing such as NMS [13].

The main contributions of this work are summarized as follows: (1) This paper proposes a novel lane line representation based on sampled grid points, which simplifies lane detection to an x-coordinate regression problem. (2) We design an end-to-end CNN network model to extract lane lines without additional expensive post-processing.

The experimental results show that our model achieves good performance on TuSimple, with an accuracy of 96.84%, and is suitable for real-world corner cases that are rare in public datasets.

This paper is organized as follows: Section 2 summarizes the related work on lane detection. Section 3 explains the technical method at length. Section 4 describes the experimental details and shows the results on the open dataset. Finally, Section 5 concludes the paper.

2. Related work

As a classic and widely used computer vision task, lane detection has a long research history. Initially, the lane detection task was considered as a colour thresholding problem [14], an edge detection problem [15] or a Hough transform [16] using traditional digital image processing methods. With the development of deep learning and computing power, training a deep neural network with a large amount of lane data has become mainstream. Thanks to the impressive capabilities of deep learning frameworks, lane detection is becoming increasingly flexible. By designing flexible expression forms of lane lines and corresponding neural networks, it is possible to achieve increasingly accurate and efficient lane detection.

In this section, we will discuss these methods separately from the perspective of lane representation and lane organisation process.

Description of lanes: From the perspective of expression, lanes can be regarded as points, lines, or areas, and lane detection models are designed to match the chosen format. The description of lanes determines the output format of a deep network, which has a great impact on detection performance. Because of their narrow shape, lanes can be described in various ways. The point-based works model lanes as a set of sampled points, so the task can be considered a point regression or classification problem. FOLOLane [3], RCLane [2] and GANet [6] use a set of points to depict the shape of the lane line; the difference lies in how points belonging to the same lane line are combined, as shown in Fig. 1. PiNet [5] is a point-based method that estimates the points of lanes. UFAST [4] selects the location of sampled points row by row: grids are made to sample lanes, so the model only needs to classify the sampled points. In short, this kind of work gives discrete points to describe lanes. The line-based works model lanes as curves, and the frameworks usually learn the key parameters of the curves. LaneNet [8] designs a small CNN model to forecast the curve-fitting parameters of each lane. PolyLaneNet [7] models the lanes as polynomial curves. This kind of method regards lane detection as parameter regression, which reduces the number of regressed values compared with regressing points directly. The region-based works model lanes as connected domains; this description usually relies on semantic segmentation to discriminate the pixels of lanes. SCNN [9] is one of the most classic lane detection works: it designs a special convolution network that conducts segmentation with convolution in a certain direction. The first step of LaneNet [8] uses a CNN backbone to learn the binary segmentation mask of lanes, which is the representation of the lane region. IntRA-KD [17] treats the lanes as one of the classes and conducts semantic segmentation to distinguish the lane pixels. The post-processing of region-based methods is complex, inefficient and unstable, and line-based methods are greatly disturbed by parameter errors. Among these representations, key points are the most flexible and accurate form for fitting the shape of the lane line.

The procedure of lane detection: From the perspective of the detection procedure, lane detection frameworks can be roughly divided into two categories, top-down and bottom-up, corresponding to the classification of instance segmentation methods. Top-down methods have been a mainstream of lane detection. First, top-down works give a sketchy proposal of lanes; a set of offset values is then learned to refine the lane shape. The anchor-based method, with fixed or flexible anchors, is a typical type of top-down work. Inspired by the region anchors in Faster-RCNN [18], Line-CNN [19] first sets line anchors to detect lanes. This pipeline has gained popularity over recent years, and several works such as CurveLane-NAS [20], LaneATT [10], and SGNet [21] have pursued the framework. LaneATT [10] designs anchor-based feature pooling on top of Line-CNN [19]. SGNet [21] utilizes bounding boxes as anchors and learns the offsets to the edges of the bounding boxes to specify the lane shape. CondLaneNet [22] is another type of top-down framework: a proposal head first gives the line proposal, then a conditional shape head refines the shape of the lines with conditional convolution. However, anchor-based methods hit a bottleneck due to the limitation of anchors. Bottom-up methods first extract refined line elements and then group them into lines at the instance level. In the bottom-up pipeline, segmentation-based methods have been mainstream. LaneNet [8] introduces the bottom-up idea of instance segmentation to lane detection: it learns embedding values to represent lane instances, so that in the embedding space, pixels belonging to the same lane lie very close to each other, while pixels belonging to different lanes lie far apart. However, clustering pixels according to embeddings leads to huge post-processing costs. To overcome the shortcomings of clustering post-processing, LaneAF [23] designs affinity fields to help clustering. Another type of bottom-up method first extracts the maximum number of proposed lanes and uses NMS to filter out redundant lane proposals; SOLOLaneNet [24] is one of the NMS-based methods. Whether the extraction method is top-down or bottom-up, lane instance extraction and lane shape extraction need to be organised separately. Our SGPLane adopts a more efficient framework, which neither needs top-down screening of redundant lane instances, nor bottom-up clustering or combining of keypoints into lane lines. Each lane line instance is obtained through a preset network channel, and the key points belonging to the same lane line are sampled through the preset grid and represented by coordinate values.

To sum up, due to the performance bottlenecks of line-based and region-based lane representations, we select key points as an effective way to express the shape of the lane line. For this point representation, we design a one-stage end-to-end network, SGPLane, based on preset grid-sampled keypoints, avoiding complex lane shape characterization and instance extraction.

3. Methodology

The overall architecture of our proposed Sampled Grid Points Lane Detection Network (SGPLane) is illustrated in Fig. 2. Given a front-view camera image I ∈ R^{H×W×3} as input, where H and W denote the height and width of the input image, the goal of SGPLane is to predict a set of lanes in the form of sampled point coordinates. The input image is fed into a CNN backbone, which extracts deep features. We design two branches to get the final output without any post-processing: one for the regression of lane points, and the other for the confidence estimation of the corresponding lane points. Through the two sampled-point decoder branches, the deep features are translated into the coordinates and confidence of each lane point. In the following subsections, we introduce the crucial components of our method.

Fig. 2. Architecture of our SGPLane network. The image is fed into the encoder to obtain deep features, which are then fed into two decoding branches with the same structure. One branch learns the confidence of the grid sampling points; the other learns the normalised coordinate values of the grid sampling points. Finally, the lane line instance results are obtained directly by combining the outputs of the two branches.

3.1. Sampled points representation

To obtain the specific location of each lane line in the image directly at the instance level, and inspired by the idea of DETR [25], we design each column of the network output to carry the information of one lane line. Therefore, the sampled lane line point representation is the core of this work. Given a front-view camera image I ∈ R^{H×W×3}, we model the lanes in image I as a set of lines:

L = {l_1, l_2, …, l_N} (1)

where N is the preset number of lanes in the image.

As shown in Fig. 1d, the y-axis is sampled by a preset grid with a fixed start value, a fixed end value and fixed intervals. K is the preset number of grid samples over the whole image, and the sampling intervals are preset through experience. The fixed start value is set to H−1 because this is the y coordinate of the image bottom. The fixed end value is set in the range [0, H/2] as in Eq. 3; according to the distribution of lane lines in driving images, the end of the lane line basically falls within this interval, and we usually set it to H/4.

Then we set the end y value (the start y value is always the bottom edge of the image) and sample K−2 other fixed coordinates on the y-axis:

y = {y_1, y_2, …, y_K} (2)
y_1 = H − 1, y_K ∈ [0, H/2] (3)
y_j − y_{j+1} = I · (y_{j+1} − y_{j+2}) (4)

where y_j is the y coordinate of the jth sampled point on every lane and I is the interval coefficient, which is usually set to 1.1. We use a set of sampled points to represent a lane as:

l_i = {p_1^i, p_2^i, …, p_K^i}, i = 1, …, N (5)

where p_j^i is the jth sampled point on the ith lane line. The coordinates of p_j^i can be expressed as:

p_j^i = (x_j^i, y_j^i) (6)

where y_j^i is the y coordinate and x_j^i is the x coordinate of the jth sampled point on the ith lane line. Since we fix the sampled grid, we have:

y_j^i = y_j (7)

Therefore, the coordinates of p_j^i can be expressed as:

p_j^i = (x_j^i, y_j) (8)
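The grid construction of Eqs. 2–4 can be sketched in a few lines of Python. H, K, the interval coefficient and the end value follow the text, while the function name and the closed-form spacing derivation are our own:

```python
def build_y_grid(H, K, interval=1.1, y_end=None):
    """Sample K fixed y values per Eqs. 2-4: y_1 = H-1 at the image bottom,
    with consecutive intervals shrinking geometrically by `interval` toward
    the top, so near points are widely spaced and far points densely spaced."""
    if y_end is None:
        y_end = H / 4  # end value lies in [0, H/2]; H/4 is the usual choice
    span = (H - 1) - y_end
    # Intervals d, d/I, d/I^2, ... between consecutive grid values (Eq. 4);
    # scale d so the K-1 intervals exactly cover the span from y_1 to y_K.
    weights = [interval ** -m for m in range(K - 1)]
    d = span / sum(weights)
    ys = [float(H - 1)]
    for w in weights:
        ys.append(ys[-1] - d * w)
    return ys

grid = build_y_grid(H=480, K=32)  # grid[0] = 479.0, grid[-1] ≈ 120.0
```

Each consecutive pair of intervals satisfies (y_j − y_{j+1}) / (y_{j+1} − y_{j+2}) = I, as Eq. 4 requires.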

Not all lane lines exist at all sampled y grid values. For non-existent lane line sample points, the value must be set to zero through confidence threshold filtering, so that the structured lane line results with a fixed number of points conform to the real situation. θ is the confidence threshold used to decide whether a point exists.

Algorithm 1 judges the existence of points from the classification branch output, where Ĉ_j^i denotes the confidence value of the jth point on the ith lane from the classification branch. We set the x coordinate of non-existent points to zero.

Algorithm 1. Judgement of point existence.

Algorithm 2 sets the x value of non-existent points.

Algorithm 2. Distinguish the non-existent points.

Since the y-axis grid is preset and its values are known, a lane line can be simplified to its x values:

l_i = {x_1^i, x_2^i, …, x_K^i}, i = 1, …, N (9)

Through this representation, the coordinates of the sampled points correspond directly to the network output, so each branch of the network only needs to output a K×N matrix.

3.2. Network architecture

The whole framework is a traditional encoder-decoder network. We adopt an available backbone as the encoder module. To enhance point regression performance, we directly link the encoder layers and the two output branches, as in FPN [26]. We choose ERFNet [27] as our backbone to extract deep features from each input image.

Spatial position implicit coding. When we fix the y-axis grid values of the sampled points, the attention field is fixed to the longitudinal distribution of the lane line. Therefore, the features belonging to each grid position only need to attend to nearby values, and we do not need to associate global image information. The grid output form is equivalent to an implicit lane anchor. The visualization of all regression branch results is shown in Fig. 3: without the classification values, the N preset lanes act as N implicit lane anchors, and each implicit anchor only cares about its adjacent area.

Fig. 3. Visualization of regression branch results.

Regression head. The regression branch is designed to decode the coordinates of each sampled point. To make the network easier to converge, we normalize all x coordinates of the sampled points in this branch by the image width W.

Classification head. The classification branch is designed to estimate the confidence of each sampled point. The output of this branch represents the probability of point existence. Through the classification branch, our network can ignore empty lane predictions and filter out points that do not exist on the sampled y grid.

3.3. Hungarian matching and losses

To establish a unique relationship between the predicted lane point set and the ground truth point set, we adopt the Hungarian matching method used in DETR [25]. The Hungarian algorithm is the most common algorithm for bipartite graph matching: it finds a maximum matching of a bipartite graph by searching for augmenting paths. Given the cost matrix, Hungarian matching finds the optimal assignment of lane ground truths to predicted lanes; it minimises the total cost by finding the best pairing, ensuring that each ground truth lane is associated with exactly one prediction. Through this matching, the network converges reasonably rather than over-fitting, and redundant lane line predictions are removed during training. Furthermore, this method avoids NMS post-processing [13], enabling end-to-end performance.
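For illustration, the optimal assignment that Hungarian matching computes can be found by brute force on a tiny cost matrix; a practical implementation would use a polynomial-time routine such as scipy.optimize.linear_sum_assignment instead. The function name and the toy costs below are ours:

```python
from itertools import permutations

def optimal_assignment(cost):
    """Exhaustively find the ground-truth-to-prediction pairing with minimum
    total cost. This illustrates what Hungarian matching computes; the real
    algorithm reaches the same answer in polynomial time instead of O(n!).
    Assumes at least as many predictions as ground-truth lanes."""
    n_gt, n_pred = len(cost), len(cost[0])
    best, best_perm = float("inf"), None
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][perm[i]] for i in range(n_gt))
        if total < best:
            best, best_perm = total, perm
    return list(enumerate(best_perm))  # (gt index, matched prediction index)

# Toy cost = e.g. mean L1 distance between ground-truth and predicted x values
cost = [
    [0.9, 0.1, 0.5],   # gt lane 0 is closest to prediction 1
    [0.2, 0.8, 0.7],   # gt lane 1 is closest to prediction 0
]
matches = optimal_assignment(cost)  # [(0, 1), (1, 0)]
```

The unmatched prediction (index 2 here) corresponds to a redundant lane that the confidence branch learns to suppress.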

To train the proposed CNN network, we use one loss function for each of the two branches. Because the existing points are sparse relative to the preset K×N output matrix, we adopt focal loss for the confidence values of the classification branch. Focal loss [28] is a frequently-used loss function for keypoint estimation:

Loss_focal = −(1/(K×N)) Σ_{i=1}^{N} Σ_{j=1}^{K} {
    (1 − Ĉ_j^i)^α · log(Ĉ_j^i),                    if C_j^i = 1
    (1 − C_j^i)^β · (Ĉ_j^i)^α · log(1 − Ĉ_j^i),    otherwise
} (10)

where α and β are set to the same values as in CenterNet [29]. Ĉ_j^i denotes the predicted confidence of the jth point on the ith lane from the classification branch, and C_j^i denotes the corresponding ground truth, where 1 means the point exists.

Because there are more preset points than actually visible points, the ground truth matrix contains many more zeros than ones. Focal loss is designed for unbalanced samples, so using it speeds up convergence.

For the x regression branch, an L1 loss is used to regress the normalized x values. We only compute the loss on lanes matched through Hungarian matching:

Loss_L1 = (1/(K×N)) Σ_{i=1}^{N} Σ_{j=1}^{K} |X_j^i − X̂_j^i| (11)

where X_j^i denotes the ground truth normalized x value of the jth point on the ith lane line, and X̂_j^i denotes the corresponding prediction.

Then the total loss of the network is computed as:

Loss = λ_focal · Loss_focal + λ_L1 · Loss_L1 (12)

where λ_focal and λ_L1 are loss weight coefficients that balance the training of the two branches. We adjust the two values based on experimental results. λ_L1 is set larger because the L1 loss is only calculated on lane predictions matched to the ground truth, while the focal loss is calculated for every preset point.
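The two losses and their weighted sum (Eqs. 10–12) can be sketched in plain Python. Matrices are K×N lists of lists indexed [j][i] (point, lane); the matched-lane reordering from Hungarian matching is assumed to have been applied already, confidences are assumed strictly inside (0, 1), and the function names are ours:

```python
import math

def focal_loss(C_hat, C, alpha=2, beta=4):
    """Eq. 10: focal loss over the K x N point-existence confidence matrix.
    alpha and beta follow the CenterNet settings quoted in the text."""
    K, N = len(C_hat), len(C_hat[0])
    total = 0.0
    for j in range(K):
        for i in range(N):
            p, gt = C_hat[j][i], C[j][i]
            if gt == 1:
                total += (1 - p) ** alpha * math.log(p)
            else:
                total += (1 - gt) ** beta * p ** alpha * math.log(1 - p)
    return -total / (K * N)

def l1_loss(X_hat, X):
    """Eq. 11: mean L1 error over the normalized x coordinates."""
    K, N = len(X_hat), len(X_hat[0])
    return sum(abs(X[j][i] - X_hat[j][i])
               for j in range(K) for i in range(N)) / (K * N)

def total_loss(C_hat, C, X_hat, X, lam_focal=1.0, lam_l1=40.0):
    """Eq. 12: weighted sum; the default weights follow Section 4.2."""
    return lam_focal * focal_loss(C_hat, C) + lam_l1 * l1_loss(X_hat, X)
```

With ground truth in {0, 1}, the (1 − C_j^i)^β factor in the negative branch equals 1; it is kept to mirror Eq. 10 term by term.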

3.4. Synthesis of lane line instances

Through the two branches of SGPLane, the coordinate information and the confidence are known. We then use a threshold θ to binarize the confidence branch output, separating non-existent points from visible points. The post-processing is shown in Fig. 4.

T_j^i = 1, if Ĉ_j^i > θ; T_j^i = 0, otherwise (13)

where Ĉ_j^i is the value at row j and column i of the classification head output matrix Ĉ ∈ [0,1]^{K×N}, which denotes the confidence of the jth point on the ith lane.

Fig. 4. Post-processing pipeline of our SGPLane. (a) is the output of the regression branch. (b) is the output of the classification branch. (b) is filtered by the threshold and multiplied by (a); since (a) holds normalized values, the result is multiplied by the image width W to obtain the final sampled grid points (c). (d) is the visual diagram.

Then the final output value is formulated as:

x_j^i = W · X̂_j^i · T_j^i (14)

where X̂_j^i is the value at row j and column i of the regression head output matrix X̂ ∈ [0,1]^{K×N}, which denotes the normalized x coordinate of the jth point on the ith lane. Because the predicted value X̂ is normalized, it must be multiplied by W (the image width) to obtain the final result.
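Eqs. 13 and 14 amount to a threshold-and-scale step, sketched here in plain Python (the function name and toy values are ours):

```python
def synthesize_lanes(X_hat, C_hat, W, theta=0.4):
    """Combine the two branch outputs (Eqs. 13-14): threshold the confidence
    matrix into a 0/1 mask T, then scale the normalized x values by the image
    width W. Filtered (non-existent) points come out as x = 0.
    X_hat and C_hat are K x N lists indexed [j][i] (point, lane)."""
    K, N = len(X_hat), len(X_hat[0])
    lanes = []
    for i in range(N):  # one network column per lane instance
        xs = []
        for j in range(K):
            T = 1 if C_hat[j][i] > theta else 0
            xs.append(W * X_hat[j][i] * T)
        lanes.append(xs)
    return lanes

X_hat = [[0.25, 0.75], [0.30, 0.80]]   # K=2 points, N=2 lanes, normalized x
C_hat = [[0.9, 0.3], [0.8, 0.95]]      # per-point confidence
lanes = synthesize_lanes(X_hat, C_hat, W=768)
# lanes[0] ≈ [192.0, 230.4]; lanes[1][0] == 0 (confidence 0.3 < theta)
```

No clustering, NMS, or curve fitting follows this step, which is what keeps the pipeline end-to-end.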

4. Experiment

In this section, experiments are carried out on several datasets to verify the practicability of our work.

4.1. Dataset

The TuSimple dataset [30] is a widely used open dataset for demonstrating the feasibility of lane detection works. It has 3626 images for training, 358 for validation and 2782 for testing, under excellent visual conditions. The main evaluation metric of the TuSimple dataset is accuracy, formulated as:

accuracy = Σ_clip C_clip / Σ_clip S_clip (15)

where C_clip is the number of points correctly predicted by the model and S_clip is the total number of points in the clip. Since TuSimple is organized as sets of points, the quantification is based on point positions: a predicted point is regarded as correct only if it is within 20 pixels of the ground truth point, and a predicted lane with an accuracy greater than 85% is considered a true positive.
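The accuracy of Eq. 15 can be sketched as follows, assuming predictions and ground truth are sampled on the same y grid and that a missing predicted point is marked None (the function name and toy data are ours):

```python
def clip_accuracy(pred_lanes, gt_lanes, tol=20):
    """Eq. 15: fraction of ground-truth points whose matched prediction lies
    within `tol` pixels. Each lane is a list of x values on a shared y grid;
    None marks a point the prediction does not cover."""
    correct, total = 0, 0
    for pred, gt in zip(pred_lanes, gt_lanes):
        for x_pred, x_gt in zip(pred, gt):
            total += 1
            if x_pred is not None and abs(x_pred - x_gt) <= tol:
                correct += 1
    return correct / total

# One lane, three grid points: only the first prediction is within 20 px
acc = clip_accuracy([[100, 130, None]], [[105, 160, 200]])  # 1/3
```

The same routine with tol=8 reproduces the stricter Shougang criterion described below.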

We also conduct experiments on our private driving scene dataset, named the Shougang dataset. It consists of data collected by driving vehicles in Shougang and focuses on complex real-world scenes, including multiple lane lines close to each other, lane lines blocked by guardrails, and large differences in image brightness. The Shougang dataset has 4768 images for training, 402 for validation and 1927 for testing. Because the lanes are very close, a predicted point is regarded as correct only if it is within 8 pixels of the ground truth point; a predicted lane with an accuracy greater than 85% is considered a true positive. This evaluation metric is stricter than TuSimple's.

4.2. Implementation details

We choose ERFNet [27] as the backbone to form two different versions of SGPLane.

The size of input images is set to 480×768. The number of sampled points per lane is set to K = 32 on the Shougang dataset and K = 48 on the TuSimple dataset. The number of lane instances N is set to 40. The loss weight coefficient λ_focal is set to 1 and λ_L1 to 40. The hyper-parameters α and β in Eq. 10 are set to 2 and 4, respectively. For optimization, we use the Adam optimizer and poly learning rate decay with an initial learning rate of 0.001. We train 300 epochs for TuSimple and 500 epochs for Shougang, with a batch size of 32 per GPU. Training is performed on an NVIDIA GTX 1080 Ti; testing is performed on Tesla V100 GPUs.

In the test phase, we set the keypoint threshold θ to 0.4. The ground truth of the classification branch is 0 or 1: 0 means non-existent and 1 means visible. Owing to the focal loss, the distribution of the classification values is basically close to 0 or 1, so our model can easily filter out sampling points that do not exist in the actual image using an intermediate threshold. Clearly, as the threshold θ increases, more potential points are filtered out and FN increases; as the threshold θ decreases, more possibly erroneous points are kept, so FP increases. The value 0.4 balances the various indicators, as we prefer to detect as many lanes as possible rather than miss them. In fact, the choice of 0.4, 0.5 or 0.6 makes little difference.

4.3. Results

Some visualization results on TuSimple and our driving scene dataset are shown in Figs. 5 and 6. To evaluate performance, we use FP, FN, accuracy and the running speed (FPS) as quality indices. FP and FN are rates reflecting errors: FP counts lanes the model detects where no lane exists, and FN counts lanes that exist but the model misses. Accuracy is defined as:

Accuracy = N_pred / N_gt (16)

where N_pred is the number of correctly predicted points and N_gt is the number of ground-truth points.

Fig. 5. The visualization results on TuSimple.

Fig. 6. The visualization results of some real-world scenes.

FPS (frames per second) represents the speed of network computation and is defined as:

FPS = 1 / mT (17)

where mT is the average time the network requires to process one frame of image.
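Eq. 17 corresponds to timing the network over a batch of frames and inverting the mean per-frame time; a minimal sketch with a stand-in `process_frame` (all names are ours):

```python
import time

def measure_fps(process_frame, frames):
    """Eq. 17: FPS = 1 / mT, where mT is the mean per-frame processing time
    over the given list of frames."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    mT = (time.perf_counter() - start) / len(frames)
    return 1.0 / mT

# Stand-in workload; a real measurement would call the network's forward pass
fps = measure_fps(lambda f: sum(f), [[1, 2, 3]] * 100)
```

In practice the first few frames are usually discarded as warm-up before timing.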

The comparison results on the TuSimple dataset are shown in Table 1. Our SGPLane performs well and achieves the best accuracy of 96.84% with high FPS. The low FP value means SGPLane is rarely disturbed into producing wrong results. The high FPS mainly comes from the simple processing pipeline, which illustrates that our sampled grid points model is efficient and effective. Low FP combined with high FPS makes SGPLane very valuable for actual driving.

Table 1. Performance of different methods on TuSimple.

Method | Accuracy | FP | FN | FPS
Region-based
SCNN [9] | 96.53% | 0.0617 | 0.0180 | 7.5
LaneNet [8] | 96.38% | 0.0780 | 0.0244 | 52.6
LaneAF [23] | 96.49% | 0.0280 | 0.0418 | –
Line-based
LaneATT-ResNet34 [10] | 95.63% | 0.0353 | 0.0292 | 171
BezierLaneNet [11] | 95.65% | 0.0510 | 0.0390 | –
Point-based
GANet-M [6] | 95.87% | 0.0199 | 0.0264 | 127
RCLane-M [2] | 96.51% | 0.0224 | 0.0236 | 43.8
SGPLane (Ours) | 96.84% | 0.0183 | 0.0256 | 134

The comparison results on our Shougang dataset are shown in Table 2. Our SGPLane outperforms all other methods and achieves the best score among them. Because the Shougang dataset is composed of complex real-world lane line scenes, this state-of-the-art performance shows that our SGPLane network can extract lane lines with high accuracy, and is particularly effective for complex scenes such as dense lane lines. This is because the preset sampling grid forms an implicit anchor for each lane line instance during training, and the corresponding anchor positions attend more closely to nearby sampling points, making the convolution more effective. The method not only obtains point positions with high accuracy, but also distinguishes the lane lines directly through the implicit anchors, so there are fewer false detections in complex scenes. As shown in Fig. 7, other methods find it difficult to separate tightly connected lane lines through lane masks, resulting in multiple lane lines sticking together. Moreover, due to the instability of the pixel regions, there is also a lot of noise, and distant lane lines often have discontinuous pixels, which poses a great challenge to post-processing.

Table 2. Performance of different methods on the Shougang dataset.

Method | ACC | FP | FN | FPS
GANet [6] | 68.34% | 0.0576 | 0.0459 | 127
RCLane [2] | 73.96% | 0.0591 | 0.0563 | 43.8
SGPLane (Ours) | 76.85% | 0.0538 | 0.0442 | 134

Fig. 7. The visualization results of some error detection cases.

The visualization results of some complex lane line detection scenarios are shown in Fig. 8. The green points represent the predicted results, while the red points represent the ground truth. Complex situations include multiple closely spaced lane lines, lane lines obstructed by guardrails and closely connected to each other, bends, merging lane lines, and visibility reduced by foggy weather. The images show that our network achieves very good lane detection results even in these complex situations.

Fig. 8. The visualization results of some complex lane line detection scenarios.

5. Conclusion

In this paper, we propose a novel expression form for lane lines and construct an end-to-end two-branch network model matched to these special sampled grid points. Each key point is directly regressed to a normalized x coordinate, and the x coordinates of each lane line are combined into a sequential chain of lane line points in the network output format. Owing to this simple but effective model, we realize robust, accurate, end-to-end lane line extraction. Experimental results show that SGPLane reaches competitive performance on the public TuSimple benchmark. Moreover, SGPLane is superior to existing methods on our real driving dataset.

Declaration of competing interest

The authors declare that they have no conflicts of interest in this work.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (52102464, U22A20104), and was funded in part by Tsinghua University-DiDi Joint Research Center for Future Mobility (20230908006).

Biographies

Xuewei Tang received the BS degree from the Department of Automation, Tsinghua University, Beijing, China, in 2020. She is currently working toward the PhD degree at the School of Vehicle and Mobility of Tsinghua University, Beijing, China. Her research interests include computer vision and high definition map construction.

Mengmeng Yang received the PhD degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China in 2018. She is currently an assistant research professor at Tsinghua University, Beijing, China. Her research interests include autonomous vehicles, high precision digital maps, and sensor fusion.

Diange Yang received his PhD in automotive engineering from Tsinghua University in 2001. He is now a professor at the School of Vehicle and Mobility at Tsinghua University. He currently serves as the director of the Tsinghua University Development & Planning Division. His research interests include autonomous driving, environment perception, and HD map. He has more than 180 technical publications and 100 patents. He received the National Technology Invention Award and Science and Technology Progress Award in 2010, 2013, 2018, and the Special Prize for Progress in Science and Technology of China Automobile Industry in 2019.

Contributor Information

Xuewei Tang, Email: tangxw20@mails.tsinghua.edu.cn.

Mengmeng Yang, Email: yangmm_qh@tsinghua.edu.cn.

Diange Yang, Email: ydg@mail.tsinghua.edu.cn.

References

  • 1.Deo N., Rangesh A., Trivedi M.M. How would surround vehicles move? A unified framework for maneuver classification and motion prediction. arXiv e-prints. 2018 [Google Scholar]
  • 2.Xu S., Cai X., Zhao B., et al. Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVIII. Springer; 2022. Rclane: Relay chain prediction for lane detection; pp. 461–477. [Google Scholar]
  • 3.Qu Z., Jin H., Zhou Y., et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. Focus on local: Detecting lane marker from bottom up via key point; pp. 14122–14130. [Google Scholar]
  • 4.Qin Z., Wang H., Li X. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16. Springer; 2020. Ultra fast structure-aware deep lane detection; pp. 276–291. [Google Scholar]
  • 5.Ko Y., Lee Y., Azam S., et al. Key points estimation and point instance segmentation approach for lane detection. IEEE Trans. Intell. Transp. Syst. 2021;23(7):8949–8958. [Google Scholar]
  • 6.Wang J., Ma Y., Huang S., et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. A keypoint-based global association network for lane detection; pp. 1392–1401. [Google Scholar]
  • 7.Tabelini L., Berriel R., Paixao T.M., et al. 2020 25th International Conference on Pattern Recognition (ICPR) IEEE; 2021. PolyLaneNet: Lane estimation via deep polynomial regression; pp. 6150–6156. [Google Scholar]
  • 8.Neven D., De Brabandere B., Georgoulis S., et al. 2018 IEEE Intelligent Vehicles Symposium (IV) IEEE; 2018. Towards end-to-end lane detection: An instance segmentation approach; pp. 286–291. [Google Scholar]
  • 9.Pan X., Shi J., Luo P., et al. Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 2018. Spatial as deep: Spatial CNN for traffic scene understanding. [Google Scholar]
  • 10.Tabelini L., Berriel R., Paixao T.M., et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. Keep your eyes on the lane: Real-time attention-guided lane detection; pp. 294–302. [Google Scholar]
  • 11.Feng Z., Guo S., Tan X., et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. Rethinking efficient lane detection via curve modeling; pp. 17062–17070. [Google Scholar]
  • 12.Youjin T., Wei C., Xingguang L., et al. A robust lane detection method based on vanishing point estimation. Procedia Comput. Sci. 2018;131:354–360. [Google Scholar]
  • 13.Neubeck A., Van Gool L. 18th International Conference on Pattern Recognition (ICPR’06) Vol. 3. IEEE; 2006. Efficient non-maximum suppression; pp. 850–855. [Google Scholar]
  • 14.He Y., Wang H., Zhang B. Color-based road detection in urban traffic scenes. IEEE Trans. Intell. Transp. Syst. 2004;5(4):309–318. [Google Scholar]
  • 15.Canny J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986;PAMI-8(6):679–698. [PubMed] [Google Scholar]
  • 16.Duda R.O., Hart P.E. Use of the hough transformation to detect lines and curves in pictures. Commun. ACM. 1972;15(1):11–15. [Google Scholar]
  • 17.Hou Y., Ma Z., Liu C., et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. Inter-region affinity distillation for road marking segmentation; pp. 12486–12495. [Google Scholar]
  • 18.Ren S., He K., Girshick R., et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015;28 doi: 10.1109/TPAMI.2016.2577031. [DOI] [PubMed] [Google Scholar]
  • 19.Li X., Li J., Hu X., et al. Line-CNN: End-to-end traffic line detection with line proposal unit. IEEE Trans. Intell. Transp. Syst. 2019;21(1):248–258. [Google Scholar]
  • 20.Xu H., Wang S., Cai X., et al. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer; 2020. CurveLane-NAS: Unifying lane-sensitive architecture search and adaptive point blending; pp. 689–704. [Google Scholar]
  • 21.J. Su, C. Chen, K. Zhang et al., Structure guided lane detection, arXiv preprint arXiv:2105.05403 (2021).
  • 22.Liu L., Chen X., Zhu S., et al. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. CondLaneNet: A top-to-down lane detection framework based on conditional convolution; pp. 3773–3782. [Google Scholar]
  • 23.Abualsaud H., Liu S., Lu D.B., et al. LaneAF: Robust multi-lane detection with affinity fields. IEEE Rob. Autom. Lett. 2021;6(4):7477–7484. [Google Scholar]
  • 24.Zhang H., Gu Y., Wang X., et al. 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) IEEE; 2021. SOLOLaneNet: Instance segmentation-based lane detection method using locations; pp. 2725–2731. [Google Scholar]
  • 25.Carion N., Massa F., Synnaeve G., et al. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer; 2020. End-to-end object detection with transformers; pp. 213–229. [Google Scholar]
  • 26.Lin T.-Y., Dollár P., Girshick R., et al. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. Feature pyramid networks for object detection; pp. 2117–2125. [Google Scholar]
  • 27.Romera E., Alvarez J.M., Bergasa L.M., et al. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017;PP(1):1–10. [Google Scholar]
  • 28.Lin T.-Y., Goyal P., Girshick R., et al. Proceedings of the IEEE International Conference on Computer Vision. 2017. Focal loss for dense object detection; pp. 2980–2988. [Google Scholar]
  • 29.X. Zhou, D. Wang, P. Krähenbühl, Objects as points, arXiv preprint arXiv:1904.07850 (2019).
  • 30.Tusimple-benchmark, 2019, (https://github.com/TuSimple/tusimple-benchmark).

Articles from Fundamental Research are provided here courtesy of The Science Foundation of China Publication Department, The National Natural Science Foundation of China
