<!doctype html>
<html lang="en">
<!-- === Header Starts === -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>DETR3D</title>
<link href="./assets/bootstrap.min.css" rel="stylesheet">
<link href="./assets/font.css" rel="stylesheet" type="text/css">
<link href="./assets/style.css" rel="stylesheet" type="text/css">
<script src="./assets/jquery.min.js"></script>
<script type="text/javascript" src="assets/corpus.js"></script>
</head>
<!-- === Header Ends === -->
<script>
var lang_flag = 1;
</script>
<body>
<!-- === Home Section Starts === -->
<div class="section">
<!-- === Title Starts === -->
<div class="logo" align="center">
<!-- <a href="" target="_blank"> -->
<img style=" width: 400pt;" src="images/detr3d_logo.png">
<!-- </a> -->
</div>
<div class="header">
<div style="" class="title" id="lang">
<b>DETR3D</b>: 3D Object Detection from Multi-view Images via 3D-to-2D Queries
</div>
</div>
<!-- === Title Ends === -->
<div class="author" style="margin-top: -30pt">
<a href="https://people.csail.mit.edu/yuewang" target="_blank">Yue Wang</a><sup>1</sup>,
<a href="https://scholar.google.com.br/citations?user=UH9tP6QAAAAJ&hl=en" target="_blank">Vitor Guizilini</a><sup>2</sup>,
<a href="https://tianyuanzhang.com/" target="_blank">Tianyuan Zhang</a><sup>3</sup>,
<a href="https://scholar.google.com.hk/citations?hl=en&user=nUyTDosAAAAJ">Yilun Wang</a><sup>4</sup>,
    <a href="https://hangzhaomit.github.io/">Hang Zhao</a><sup>5</sup>,
    <a href="https://people.csail.mit.edu/jsolomon/">Justin Solomon</a><sup>1</sup>
</div>
<div class="institution">
<div><sup>1</sup>MIT,
<sup>2</sup>Toyota Research Institute,
<sup>3</sup>CMU,
</div>
<div>
<sup>4</sup>Li Auto,
<sup>5</sup>Tsinghua University
</div>
</div>
<div class="institution">
Conference on Robot Learning (CoRL 2021)
</div>
<table border="0" align="center">
<tr>
<td align="center" style="padding: 0pt 0 15pt 0">
<a class="bar" href="https://tsinghua-mars-lab.github.io/DETR3D/"><b>Webpage</b></a> |
<a class="bar" href="https://github.com/WangYueFt/detr3d"><b>Code</b></a> |
<a class="bar" href="https://arxiv.org/abs/2110.06922"><b>Paper</b></a>
</td>
</tr>
</table>
<p>
A multi-camera 3D object detection framework that does NOT require dense depth prediction or post-processing.
</p>
<div class="logo" style="" align="center">
<img style="width: 700pt;" src="images/detr3d_teaser.png">
</div>
</div>
<!-- === Home Section Ends === -->
<div class="section">
<div class="title" id="lang">Abstract</div>
<p>
We introduce a framework for multi-camera 3D object detection. In contrast to existing works, which estimate 3D bounding boxes directly from monocular images or use depth prediction networks to generate input for 3D object detection from 2D information, our method manipulates predictions directly in 3D space. Our architecture extracts 2D features from multiple camera images and then uses a sparse set of 3D object queries to index into these 2D features, linking 3D positions to multi-view images using camera transformation matrices. Finally, our model makes a bounding box prediction per object query, using a set-to-set loss to measure the discrepancy between the ground-truth and the prediction. This top-down approach outperforms its bottom-up counterpart in which object bounding box prediction follows per-pixel depth estimation, since it does not suffer from the compounding error introduced by a depth prediction model. Moreover, our method does not require post-processing such as non-maximum suppression, dramatically improving inference speed. We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
</p>
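The abstract describes linking 3D object queries to multi-view images through camera transformation matrices. A minimal sketch of that 3D-to-2D projection step, assuming simple 3x4 projection matrices; the function and variable names are hypothetical, not from the DETR3D codebase:

```python
# Illustrative sketch (not the official DETR3D code): projecting a 3D
# reference point into multiple camera views, the step that lets a 3D
# query index into each camera's 2D feature map.
import numpy as np

def project_to_views(point_3d, cam_matrices):
    """Project one 3D point (x, y, z) through each 3x4 camera matrix.

    Returns (u, v) image-plane coordinates per view and a mask marking
    views where the point lies in front of the camera (positive depth).
    """
    homo = np.append(point_3d, 1.0)            # homogeneous coordinates
    uvs, valid = [], []
    for P in cam_matrices:                     # P: 3x4 projection matrix
        u, v, w = P @ homo
        in_front = w > 1e-5
        valid.append(in_front)
        uvs.append((u / w, v / w) if in_front else (0.0, 0.0))
    return np.array(uvs), np.array(valid)

# Toy example: two cameras differing only by a translation along z.
P0 = np.hstack([np.eye(3), np.array([[0.0], [0.0], [0.0]])])
P1 = np.hstack([np.eye(3), np.array([[0.0], [0.0], [2.0]])])
uvs, valid = project_to_views(np.array([1.0, 2.0, 4.0]), [P0, P1])
```

In the full model, the resulting (u, v) locations are used to bilinearly sample multi-scale 2D features, which refine the query for the next decoder layer.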
</div>
<div class="section">
<div class="title" id="lang">Method</div>
<div class="logo" style="" align="center">
<img style="width: 700pt;" src="images/detr3d_model.png">
</div>
  <p>
    <b> A multi-camera 3D object detection framework.</b> DETR3D extracts image features with a 2D backbone, then uses a set of queries defined in 3D space to correlate 2D observations with 3D predictions. Finally, a set-to-set loss removes the need for post-processing such as non-maximum suppression.
  </p>
<ul>
    <li><b> 3D-aware.</b> We incorporate 3D information into intermediate computations within our architecture, rather than performing purely 2D computations in the image plane.
    </li>
<li><b> Sparse.</b> We do not estimate dense 3D scene geometry, avoiding associated reconstruction errors.</li>
<li><b> Post-processing free.</b> We avoid post-processing steps such as NMS.</li>
</ul>
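The post-processing-free property above comes from the set-to-set loss: a one-to-one assignment between predictions and ground-truth boxes means each object is claimed by exactly one query, so duplicate suppression (NMS) is unnecessary. A small stand-in sketch, using brute-force optimal matching over an L1 center cost; all names are hypothetical, and the paper's full matching cost also includes classification terms:

```python
# Illustrative sketch (not the official DETR3D implementation) of the
# one-to-one matching behind a set-to-set loss. Brute force is used for
# clarity; in practice the Hungarian algorithm solves this efficiently.
from itertools import permutations

def match_sets(pred_centers, gt_centers):
    """Optimal one-to-one matching on L1 center distance.

    Returns the matched (pred_idx, gt_idx) pairs and their total cost.
    """
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    best_pairs, best_cost = None, float("inf")
    # Try every way of assigning a distinct prediction to each ground truth.
    for perm in permutations(range(len(pred_centers)), len(gt_centers)):
        cost = sum(l1(pred_centers[p], gt_centers[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs, best_cost

# Three queries, two ground-truth boxes (2D centers for brevity):
preds = [(0.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
gts = [(5.1, 5.0), (0.2, 0.1)]
pairs, total = match_sets(preds, gts)
```

Unmatched queries (here the third prediction) are trained toward a "no object" class, which is how duplicates are suppressed during training rather than by NMS at inference.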
</div>
<div class="section">
<div class="title" id="lang">Related Projects on <a href="https://vcad-ai.github.io/">VCAD (Vision-Centric Autonomous Driving)</a></div>
<div class="col text-center">
<table width="100%" style="margin: 0pt 0pt; text-align: center;">
<tr>
<td>
BEV Mapping<br>
<a href="https://tsinghua-mars-lab.github.io/HDMapNet/" class="d-inline-block p-3"><img height="100"
src="images/hdmapnet_thumbnail.gif" style="border:1px solid" data-nothumb><br>HDMapNet</a>
</td>
<td>
BEV Vectorized Mapping<br>
<a href="https://tsinghua-mars-lab.github.io/vectormapnet/" class="d-inline-block p-3"><img height="100"
src="images/VectorMapNet_thumbnail.png" style="border:1px solid"
data-nothumb><br>VectorMapNet</a>
</td>
<td>
BEV Fusion<br>
<a href="https://tsinghua-mars-lab.github.io/futr3d/" class="d-inline-block p-3"><img height="100"
src="images/futr3d_thumbnail.png" style="border:1px solid"
data-nothumb><br>FUTR3D</a>
</td>
<td>
BEV Tracking<br>
<a href="https://tsinghua-mars-lab.github.io/mutr3d/" class="d-inline-block p-3"><img height="100"
src="images/mutr3d_thumbnail.png" style="border:1px solid"
data-nothumb><br>MUTR3D</a>
</td>
</tr>
</table>
</div>
</div>
<!-- === Reference Section Starts === -->
<div class="section">
<div class="bibtex">
<div class="title" id="lang">Reference</div>
</div>
<p>If you find our work useful in your research, please cite our paper:</p>
<pre>
@inproceedings{detr3d,
  title={DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries},
  author={Wang, Yue and Guizilini, Vitor and Zhang, Tianyuan and Wang, Yilun and Zhao, Hang and Solomon, Justin M.},
  booktitle={The Conference on Robot Learning ({CoRL})},
  year={2021}
}
</pre>
</div>
</body>
</html>