The main contributions of this work are the following:
- A novel factor graph formulation that tightly fuses tracked
features from any number of stereo and monocular
cameras, along with IMU measurements, in a single
consistent optimization process.
- A simple and effective method to track features across
cameras with overlapping FoVs to reduce duplicate
landmark tracking and improve accuracy.
- A submatrix feature selection (SFS) scheme that selects
the best landmarks for optimization with a fixed feature
budget. This bounds computation time while achieving the
same accuracy as using all available features.
- Extensive experimental evaluation across a range of
scenarios demonstrating superior robustness, particularly
when VIO with an individual camera fails.
The state of the sensor rig at time \(t_i\) is defined as follows,

\[
\mathbf{x}_i \triangleq [\mathbf{R}_i,\ \mathbf{p}_i,\ \mathbf{v}_i,\ \mathbf{b}^g_i,\ \mathbf{b}^a_i] \in SO(3) \times \mathbb{R}^{12} \tag{1}
\]

where \(\mathbf{R}_i\) is the orientation, \(\mathbf{p}_i\) is the position, \(\mathbf{v}_i\) is the linear velocity, and \(\mathbf{b}^g_i\), \(\mathbf{b}^a_i\) are, respectively, the usual IMU gyroscope and accelerometer biases. In addition to the states, we track point landmarks \(m_\ell\) as triangulated visual features.
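A minimal sketch of this state as a data structure (field names are mine, not from the paper):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class RigState:
    """State x_i in SO(3) x R^12; illustrative names, not from the paper."""
    R: np.ndarray = field(default_factory=lambda: np.eye(3))    # orientation R_i in SO(3)
    p: np.ndarray = field(default_factory=lambda: np.zeros(3))  # position p_i
    v: np.ndarray = field(default_factory=lambda: np.zeros(3))  # linear velocity v_i
    bg: np.ndarray = field(default_factory=lambda: np.zeros(3)) # gyroscope bias b^g_i
    ba: np.ndarray = field(default_factory=lambda: np.zeros(3)) # accelerometer bias b^a_i
```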
The location of the landmarks detected by the stereo camera
pair is initialized using stereo triangulation. For landmarks
detected in monocular cameras, we triangulate the feature
location over the last \(N_{obs}\) frames using the Direct Linear
Transform (DLT) algorithm from [20].
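A quick sketch of homogeneous DLT triangulation over the accumulated observations; this is the textbook formulation, not necessarily the exact variant used in [20]:

```python
import numpy as np

def triangulate_dlt(projections, observations):
    """Triangulate one landmark from >= 2 views via the Direct Linear Transform.

    projections:  list of 3x4 camera projection matrices P_k = K [R | t]
    observations: list of pixel observations (u_k, v_k)
    Returns the 3D point minimizing the algebraic error (SVD solution).
    """
    rows = []
    for P, (u, v) in zip(projections, observations):
        rows.append(u * P[2] - P[0])  # x-constraint: u * (p3 . X) = p1 . X
        rows.append(v * P[2] - P[1])  # y-constraint: v * (p3 . X) = p2 . X
    A = np.vstack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                        # null-space vector (homogeneous point)
    return X[:3] / X[3]               # dehomogenize
```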
VII-B. Initialization
We initialize the IMU biases by averaging the first 1 s of
data at system start up (assuming the IMU is stationary).
To solve the scale initialization problem, which is often
present in monocular visual odometry systems, we combine
preintegrated IMU measurements and depth from the stereo
camera pair. Notably, the CCFT method allows features from
the stereo camera to flow into the monocular camera, speeding
up the depth initialization process.
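A hedged sketch of the stationary bias initialization described above (the gravity handling assumes a roughly level rig, which the text does not spell out):

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, 9.81])  # assumed world gravity (m/s^2)

def init_imu_biases(gyro_samples, accel_samples):
    """Estimate gyro/accel biases from ~1 s of stationary IMU data.

    gyro_samples, accel_samples: (N, 3) arrays captured while the rig is still.
    A stationary gyro should read zero, so its mean is the bias. The
    accelerometer measures specific force (~ +g with z up); assuming the rig
    is roughly level, subtracting nominal gravity leaves the bias.
    """
    bg = gyro_samples.mean(axis=0)
    ba = accel_samples.mean(axis=0) - GRAVITY  # simplification: level rig, z up
    return bg, ba
```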
Personal Thoughts
Cross-camera feature tracking (matching) can be split into two steps (see the sketch after this list):
- Use reprojection and take the minimum-distance candidate as the match; however, if the extrinsics are inaccurate, this method will most likely fail.
- If the overlapping region between the cameras contains many feature points but only a few matches are found, that can be recognized as an extrinsics problem, and descriptor-based matching can be used instead.

In addition, descriptor-based matching could first be used to calibrate the extrinsics of the cameras so that the first step succeeds as often as possible, since the second step is only a fallback and is computationally expensive.
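A rough sketch of this two-step idea (all thresholds and names are illustrative, not from any paper):

```python
import numpy as np

def match_cross_camera(points_3d, K, T_ab, detections, descriptors_map,
                       descriptors_det, px_thresh=3.0, min_match_ratio=0.3):
    """Two-step cross-camera matching: reprojection first, descriptors as fallback.

    Step 1: project camera-A landmarks into camera B via extrinsics T_ab and
    accept the nearest detection within px_thresh pixels.
    Step 2: if too few matches survive (hinting at bad extrinsics), fall back
    to brute-force descriptor matching, which is slower but extrinsics-free.
    """
    matches = []
    for i, X in enumerate(points_3d):
        Xb = T_ab[:3, :3] @ X + T_ab[:3, 3]      # transform into camera-B frame
        if Xb[2] <= 0:
            continue                              # behind the camera
        uv = (K @ (Xb / Xb[2]))[:2]               # project with B's intrinsics
        d = np.linalg.norm(detections - uv, axis=1)
        j = int(np.argmin(d))
        if d[j] < px_thresh:
            matches.append((i, j))

    if len(matches) < min_match_ratio * len(points_3d):
        # Fallback: nearest-neighbor matching on descriptors (e.g., L2 distance)
        matches = []
        for i, desc in enumerate(descriptors_map):
            dist = np.linalg.norm(descriptors_det - desc, axis=1)
            matches.append((i, int(np.argmin(dist))))
    return matches
```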
CSLAM Front End
The CSLAM front end differs from a single-robot front end in that it additionally has map-based measurements:
\(z^{t_k \rightarrow t_l}_{i \rightarrow j}\)
Multi S-Graphs: An Efficient Distributed Semantic-Relational Collaborative SLAM
Paper: arXiv:2401.05152
What problem does it address?
- Most CSLAM techniques rely on raw sensor measurements or low-level features such as keyframe descriptors, which lack a deeper understanding of the environment and can lead to erroneous loop closure detections.
- Exchanging raw measurements and low-level features consumes considerable communication bandwidth, limiting system scalability.
What approach does it take?
- A distributed LiDAR-based collaborative SLAM algorithm.
- It takes advantage of the hierarchical semantic information in situational graphs (S-Graphs).
- It works in the multi-robot kidnapped problem: each robot knows neither where it started nor where the other robots are.

A local S-Graph runs on every robot; the distilled S-Graph and room descriptors are sent to the other robots. For room closure detection, the local S-Graph and the distilled S-Graphs from other robots are used to generate the collaborative S-Graph.
Room-Based Inter-Robot Loop Closure: In order to combine the raw
information of the point clouds and take advantage of the
high-level semantic information that each room contains,
we generate a hybrid descriptor for each room, a Room
Descriptor.
- What is the Distilled S-Graph?
  - We marginalize the local S-Graph generated by each robot to transmit only the semantic-relational information and not the low-level information stored on each keyframe. Each distilled graph includes a reference node that represents the origin of the local coordinate system of each robot. This distilled S-Graph is transmitted to the rest of the robots.
My thought: pose-semantic constraints can be represented as what a 3D semantic object looks like in the measurements, or which parts of the measurements belong to one semantic object.
Scan Context is sensitive to linear motion; to address this, we exploit the semantic and hierarchical information from the local S-Graph to obtain the room center and its boundaries, extract all keyframes obtained from within the room, and generate the corresponding Room-Centric point cloud.
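A sketch of the Room-Centric point cloud gathering, assuming an axis-aligned room boundary as a simplification of the S-Graph room model:

```python
import numpy as np

def room_centric_cloud(keyframes, room_min, room_max):
    """Collect points from all keyframes whose pose lies inside the room bounds.

    keyframes: list of (position (3,), point cloud (N, 3) in the map frame)
    room_min, room_max: axis-aligned room boundary corners from the local S-Graph
    Returns the merged Room-Centric point cloud.
    """
    clouds = []
    for position, cloud in keyframes:
        if np.all(position >= room_min) and np.all(position <= room_max):
            clouds.append(cloud)                   # keyframe captured inside the room
    return np.vstack(clouds) if clouds else np.empty((0, 3))
```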
What results does it achieve?
What limitations remain?
It is crucial for Room Descriptors to remain unique and constant over time for accurate matches, as identical rooms or changing environments can lead to erroneous matches.
D2 SLAM: Decentralized and Distributed Collaborative Visual-inertial SLAM System for Aerial Swarm
T-RO 2024 (submitted)
Paper: arXiv:2211.01538
D2SLAM uses map-based observations to correct relative drift; when features are sparse, it degrades to single-robot VIO. Using line features and wider viewpoints could address this problem.

Environments: In open environments with limited environmental features, such as grasslands or rough walls, sparse visual SLAM faces challenges in feature matching for relative localization and loop closure detection. In response to these limitations, our system is designed to downgrade to single-robot VIO, ensuring flight safety under such conditions.
In D2SLAM, to conserve communication bandwidth, complete keyframe information, such as SuperPoint features (feature descriptors & feature 3D positions) and global descriptors for each camera view, is broadcast to other UAVs only in the discover or near communication modes. On the other hand, compact keyframes, which include only the NetVLAD descriptors, are used in far mode to save bandwidth.
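A sketch of this mode-dependent broadcast policy (field and attribute names are illustrative):

```python
def pack_keyframe_message(kf, mode):
    """Select what to broadcast depending on the communication mode.

    In 'discover'/'near' modes the complete keyframe (SuperPoint descriptors,
    landmark 3D positions, per-camera NetVLAD descriptors) is sent; in 'far'
    mode only the compact NetVLAD descriptors go out to save bandwidth.
    """
    if mode in ("discover", "near"):
        return {
            "netvlad": kf.netvlad_descriptors,     # global descriptor per camera view
            "superpoint": kf.superpoint_features,  # descriptors + landmark 3D positions
        }
    return {"netvlad": kf.netvlad_descriptors}     # compact keyframe for 'far' mode
```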
V-E. Distributed Loop Closure Detection
Following [1], we utilize Faiss [52] for whole image searching, kNN for feature matching, Perspective-n-Point (PnP) for
relative pose extraction, and implement outlier rejection. Subsequently, loop edges are processed by the back end for PGO. As
illustrated in Fig. 8, D2SLAM adopts distributed loop closure
detection [2], [3] to minimize communication bandwidth in far
mode, and direct loop closure detection [1] in near and discover modes.
Unlike [1], in D2SLAM, we use 3-D positions of SuperPoint
landmarks, estimated by D2VINS from historical frames, along
with 2-D observations from the current frame for PnP computation. In addition, with quad fisheye cameras as input, solving a
multicamera PnP problem becomes necessary for relative pose
extraction and geometric verification. In practice, we initially
apply the UPnP RANSAC algorithm [53] for pose estimation
and outlier rejection. Subsequent inlier results are then used in a
BA problem for multicamera PnP [54], enhancing pose accuracy.
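A single-camera stand-in for this pipeline using OpenCV; the actual UPnP RANSAC [53] and multicamera BA [54] steps are not plain OpenCV calls, so `solvePnPRansac` plus LM refinement serves here as an analogue:

```python
import cv2
import numpy as np

def relative_pose_pnp(landmarks_3d, observations_2d, K):
    """PnP RANSAC for pose + outlier rejection, then refinement on inliers.

    landmarks_3d:    (N, 3) SuperPoint landmark positions from historical frames
    observations_2d: (N, 2) matched 2D observations in the current frame
    K:               3x3 camera intrinsics
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        landmarks_3d.astype(np.float64), observations_2d.astype(np.float64),
        K, None, reprojectionError=3.0)
    if not ok:
        return None
    # Refine on inliers only (stand-in for the multicamera BA step in [54])
    rvec, tvec = cv2.solvePnPRefineLM(
        landmarks_3d[inliers.ravel()], observations_2d[inliers.ravel()],
        K, None, rvec, tvec)
    return rvec, tvec
```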
VI-D. Sliding Window & Marginalization
1) Sliding Window: In D2VINS, each UAV independently
manages its keyframes within a sliding window, akin to the
approach in [12]. Upon adding a new frame, if it is not a
keyframe, the second newest frame is discarded; otherwise, the
oldest keyframe is removed. Following each update of the sliding
window, the latest information is shared with other UAVs in the
swarm.
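The keyframe policy in code, as I read it (a sketch of the VINS-style rule from [12], not D2VINS source):

```python
def update_sliding_window(window, new_frame, max_size):
    """Apply the sliding-window keyframe policy described above.

    If the new frame is not a keyframe, the second-newest frame (the previous
    last frame, once new_frame is in) is discarded; otherwise the oldest
    keyframe is removed once the window is full. Returns the removed frame
    (to be marginalized) or None.
    """
    removed = None
    if window and not new_frame.is_keyframe:
        removed = window.pop(-1)          # drop what becomes the second-newest frame
    elif len(window) >= max_size:
        removed = window.pop(0)           # drop (marginalize) the oldest keyframe
    window.append(new_frame)
    return removed
```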
2) Management of Remote Keyframes: Successful multirobot sparse feature matching, as detailed in Section V-D, results in the inclusion of the remote frame's pose \({}^{v_k}\hat{T}^t_j\) into the local state vector \(\mathcal{X}_i\) for optimization in Problem (8). Upon receiving updated sliding window data from other UAVs, D2SLAM removes any remote keyframes from UAV k that are no longer within its local sliding window. This removal, akin to local keyframe deletion, leads to marginalization, which will be discussed further.
3) Marginalization: When discarding old keyframes, each subproblem \(f^{vio}_i(\mathcal{X}_i)\) of Problem (8) is linearized, and a new set of priors \(\{(r_{p_0}, H_{p_0}), (r_{p_1}, H_{p_1}), \dots, (r_{p_{N-1}}, H_{p_{N-1}})\}\) is computed using the Schur complement to exclude old states. This process, known as marginalization [12], [60], treats each subproblem independently and is therefore executed distributively. Since Problem (8) involves states averaged across UAVs (11), the marginalization process maintains consistency across UAVs, meaning states are linearized at similar or identical points on different UAVs.
VI-E. Initialization
D2VINS initialization involves two phases: 1) local keyframe
initialization and 2) remote UAV initialization. For local
keyframes, utilizing stereo or multicamera setups, we employ
triangulation for landmark position initialization where measurements have a sufficient baseline (typically 0.05 cm) from
multiple cameras or due to motion. In addition, IMU predictions
initialize poses of subsequent local UAV Keyframes. To enhance stability, priors are added to the unobservable components
(x, y, z, yaw) of the first keyframe. In multi-UAV scenarios, PnP
RANSAC (or UPnP) is used to initialize the pose of remote
keyframes when their coordinate system references differ from
the local UAV.
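A sketch of the baseline gate for triangulation (the 0.05 m threshold is from the text above; the pairwise-maximum criterion is my assumption):

```python
import numpy as np

def sufficient_baseline(cam_centers, min_baseline=0.05):
    """Decide whether observations span enough baseline to triangulate.

    cam_centers: (N, 3) optical centers of the observing views, coming either
    from multiple cameras on the rig or from motion over time. Triangulate
    only if the largest pairwise distance exceeds min_baseline (in meters).
    """
    best = 0.0
    for i in range(len(cam_centers)):
        for j in range(i + 1, len(cam_centers)):
            best = max(best, float(np.linalg.norm(cam_centers[i] - cam_centers[j])))
    return best >= min_baseline
```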
XI-F. Communication
Effective communication is pivotal for successful aerial
swarm operations. To shed light on this aspect, Table VII details
the sizes and broadcast frequencies of key messages transmitted
by D2SLAM on each UAV. In addition, Table VIII presents a
comparison between D2SLAM’s communication strategy (as
detailed in Section IV-B) and a baseline method. This baseline
strategy involves broadcasting all measurements (represented as
complete keyframes in our context) without changing communication modes and is referenced in [1]. D2SLAM’s tailored strategy significantly reduces communication volume, especially
when UAVs are widely dispersed.
Transitioning from the broader communication strategy, we delve into specifics for a 5-UAV swarm. Here, the maximum value of \(p_k\) reaches 51, leading to a maximum D2VINS update size of 3.5 kB per timestep. The value of \(e_k\) varies depending on the map and environment; for example, in the TUM Corr 5 dataset, a typical \(e_k\) value is 1262, resulting in each D2PGO update being approximately 81 kB.
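A back-of-the-envelope check of these figures (the per-element sizes are inferred, not stated in the paper):

```python
# Inferred per-element message sizes from the figures quoted above.
d2vins_update_kb = 3.5          # max D2VINS update per timestep
p_k_max = 51                    # max number of shared states
print(d2vins_update_kb * 1024 / p_k_max)     # ~70 bytes per shared state

d2pgo_update_kb = 81            # typical D2PGO update on TUM Corr 5
e_k_typical = 1262              # typical number of edges
print(d2pgo_update_kb * 1024 / e_k_typical)  # ~66 bytes per edge
```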
Omni-swarm: A Decentralized Omnidirectional Visual-Inertial-UWB State Estimation System for Aerial Swarms
T-RO 2022
Paper: arXiv:2103.04131
The global descriptor and landmarks (descriptor and 3D position), together with the odometry and extrinsics, are packed into keyframe $\mathcal{F}^t_i$, which will be broadcast to the entire swarm later.
There are two separate visual databases on each drone: 1) a remote database \(D_r\) that stores keyframes from remote drones and 2) a local database \(D_l\) that stores keyframes from the local drone. Extracting the map-based measurement for a pair of keyframes from \(D_r\) is avoided to save computational resources, since all the extracted map-based measurements are broadcast to the whole swarm.
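A sketch of the two-database bookkeeping (class and method names are mine):

```python
class KeyframeDatabases:
    """Per-drone databases: D_l for local keyframes, D_r for remote ones.

    Map-based measurements are only extracted when at least one keyframe is
    local; remote-remote pairs are skipped because their measurements arrive
    via broadcast from the drones that own them.
    """
    def __init__(self):
        self.local = []   # D_l: keyframes from this drone
        self.remote = []  # D_r: keyframes received from other drones

    def add(self, keyframe, is_local):
        (self.local if is_local else self.remote).append(keyframe)

    def query_candidates(self, query_is_local):
        # A remote query is matched only against local keyframes, so remote
        # keyframes are never paired with each other on this drone.
        return self.local + self.remote if query_is_local else self.local
```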
Loop Closure
VINS-Mono
The relocalization process starts with a
loop detection module that identifies places that have already
been visited. Feature-level connections between loop closure
candidates and the current frame are then established. These
feature correspondences are tightly integrated into the monocular VIO module, resulting in drift-free state estimates with
minimum computation overhead. Multiple observations of
multiple features are directly used for relocalization, resulting
in higher accuracy and better state estimation smoothness. A
graphical illustration of the relocalization procedure is shown
in Fig. 9(a).
DBoW2 returns loop closure candidates after a temporal and geometrical consistency check. We keep all BRIEF descriptors for feature retrieval, but discard the raw images to reduce memory consumption.
The only difference is that the pose \((\hat{q}^w_v, \hat{p}^w_v)\) of the loop closure frame, which is taken from the pose graph (Sect. VIII), or directly from past odometry output (if this is the first relocalization), is treated as a constant.
Note
that the global optimization of past poses and loop closure
frames happens after relocalization, and will be discussed in
Sect. VIII.
After relocalization, the local sliding window shifts and
aligns with past poses. Utilizing the relocalization results, this
additional pose graph optimization step is developed to ensure
the set of past poses are registered into a globally consistent
configuration.
Since our visual-inertial setup renders roll and pitch angles
fully observable, the accumulated drift only occurs in four
degrees-of-freedom (x, y, z and yaw angle). To this end, we
ignore estimating the drift-free roll and pitch states, and only
perform 4-DOF pose graph optimization.
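A sketch of what a 4-DOF pose-graph edge residual can look like (simplified: the paper's use of the full drift-free roll/pitch rotation is reduced here to a yaw-only rotation):

```python
import numpy as np

def residual_4dof(p_i, yaw_i, p_j, yaw_j, dp_ij, dyaw_ij):
    """4-DOF pose-graph edge residual over (x, y, z, yaw).

    p_i, p_j:   absolute positions; yaw_i, yaw_j: absolute yaw angles (rad)
    dp_ij:      relative translation measured in frame i
    dyaw_ij:    measured relative yaw
    """
    c, s = np.cos(yaw_i), np.sin(yaw_i)
    R_yaw_i = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    r_p = R_yaw_i.T @ (p_j - p_i) - dp_ij              # translation residual in frame i
    r_yaw = (yaw_j - yaw_i - dyaw_ij + np.pi) % (2 * np.pi) - np.pi  # wrapped yaw residual
    return np.hstack([r_p, r_yaw])
```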
Pose graph optimization and relocalization (Sect. VII-C)
run asynchronously in two separate threads. This enables
immediate use of the most optimized pose graph for relocalization whenever it becomes available. Similarly, even if
the current pose graph optimization is not completed yet,
relocalization can still take place using the existing pose graph
configuration. This process is illustrated in Fig. 9(b).
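A threading sketch of this asynchronous arrangement (`optimize` and `match_fn` stand in for the actual PGO and relocalization routines):

```python
import threading

lock = threading.Lock()
pose_graph = {"poses": {}}          # most recently optimized pose graph (shared)

def pgo_loop(optimize):
    """Background thread body: repeatedly optimize and publish the pose graph."""
    while True:
        result = optimize()                # long-running global 4-DOF optimization
        with lock:
            pose_graph["poses"] = result   # swap in the newest configuration

def relocalize(frame, match_fn):
    """Relocalize against whatever pose-graph configuration exists right now,
    without waiting for any in-flight optimization to finish."""
    with lock:
        snapshot = dict(pose_graph["poses"])   # cheap copy taken under the lock
    return match_fn(frame, snapshot)
```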