longer95479@home:~$

[Paper Reading] Balancing the Budget: Feature Selection and Tracking for Multi-Camera Visual-Inertial Odometry using SE2(3) Based Exact IMU Pre-integration


The main contributions of this work are the following:

  • A novel factor graph formulation that tightly fuses tracked features from any number of stereo and monocular cameras, along with IMU measurements, in a single consistent optimization process.
  • A simple and effective method to track features across cameras with overlapping FoVs to reduce duplicate landmark tracking and improve accuracy.
  • A submatrix feature selection (SFS) scheme that selects the best landmarks for optimization with a fixed feature budget. This bounds computational time and achieves the same accuracy compared to using all available features.
  • Extensive experimental evaluation across a range of scenarios demonstrating superior robustness, particularly when VIO with an individual camera fails.

The state of the sensor rig at time ti is defined as follows, xi , [Ri , pi , vi , b g i b a i ] ∈ SO(3) × R 12 (1) where: Ri is the orientation, pi is the position, vi is the linear velocity, and b g i , b a i are, respectively, the usual IMU gyroscope and accelerometer biases. In addition to the states, we track point landmarks m` as triangulated visual features.

The location of the landmarks detected by the stereo camera pair is initialized using stereo triangulation. For landmarks detected in monocular cameras, we triangulate the feature location over the last Nobs frames using the Direct Linear Transform (DLT) algorithm from [20].

VII-B. Initialization We initialize the IMU biases by averaging the first 1 s of data at system start up (assuming the IMU is stationary). To solve the scale initialization problem, which is often present in monocular visual odometry systems, we combine preintegrated IMU measurements and depth from the stereo camera pair. Notably, the CCFT method allows features from the stereo camera to flow into the monocular camera, speeding up the depth initialization process.


个人想法

相机间的特征追踪(匹配)可以分成两步:

  • 使用重投影,最小距离的为其匹配者,但如果外参不准,则该方法大概率失效
  • 如果相机间重叠区特征点多,但匹配的数目却很少,则可以意识到是外参的问题,此时可以 使用描述子的方法进行匹配

此外也可以线用描述子的方法校准好各个相机的外参,从而使第一步尽可能地成功,因为第二步是用来兜底的,计算量比较大



CSLAM Front End Summary


CSLAM Front End

CSLAM 的前端与单机的前端不同在于多了基于地图的测量: \(z^{t_k \rarr t_l}_{i \rarr j}\)

Multi S-Graphs: An Efficient Distributed Semantic-Relational Collaborative SLAM

原文:arXiv:2401.05152

针对什么问题?

  • 大多数 CSLAM 技术依赖于原始传感器测量,或诸如关键帧描述符之类的低级特征,可能导致缺乏对环境的深入理解而导致错误的回环检测
  • 交换原始测量值和低级特征占用较多通信带宽,限制了系统的可扩展性

采用什么方法?

  • Distributed LiDAR-based collaborative SLAM algorithmm
  • Take advantage of the hierachical semantic information from situational graph(S-Graph)
  • Works in multi-robot kidnapped problem: Each robot does not know where it has started and where the other robots are.

a local S-Graph is executed in every robot,the distilled S-Graphs and room descriptors are sent to other robot

room closure detection, the local S-Graph and the distilled S-Graph from other robots are used to generate collaborative S-Graph

Room-Based Inter-Robot Loop Closure: In order to combine the raw information of the point clouds and take advantage of the high-level semantic information that each room contains, we generate a hybrid descriptor for each room, a Room Descriptor.

  • What is the Distilled S-Graphs?
    • We marginalize the local S-graph generated by each robot to only trans- mit semantic-relational information and not the low-level information stored on each keyframe. Each distilled graph includes a reference node that represents the origin of the local coordinated system of each robot. This distilled S-graph is transmitted to the rest of the robots.

My thought: Pose-semantic constrains can be represented as What a 3D semantic object look like in measurements, or which part of the measurements are belong to one semantic object

Scan-Context is linear motion sensity, to solve this problem, we exploit the semantic and hierarchical information from Local S-Graph to obtain the room center and its boundaries to extract all the keyframes obtained from within the room and generate its corresponding Room Centric point cloud.

达到什么效果?

存在什么不足?

It is crucial for Room Descriptors to remain unique and constant over time for ac- curate matches, as identical rooms or changing environments can lead to erroneous matches.

D2 SLAM: Decentralized and Distributed Collaborative Visual-inertial SLAM System for Aerial Swarm

T-RO 2024 已投稿
原文:arXiv:2211.01538

D2SLAM使用基于地图的观测来修正相对漂移,当特征稀少的时候, 就退化为了单机 VIO,而使用线特征和视角则可以解决这个问题 Environments: In open environments with limited environmental features, such as grasslands or rough walls, sparse visual SLAM faces challenges in feature matching for relative localization and loop closure detection. In response to these limitations, our system is designed to downgrade to single-robot VIO, ensuring flight safety under such conditions.

In D2SLAM, to conserve communication bandwidth, complete keyframe information, such as SuperPoint features(feature descriptors & feature 3D positions) and global descriptors for each camera view, is broadcasted to other UAVs only in discover or near communication modes. On the other hand, the compact keyframes, include only the NetVLAD descriptors, are used in far mode to save bandwidth.

V-E. Distributed Loop Closure Detection Following [1], we utilize Faiss [52] for whole image searching, kNN for feature matching, Perspective-n-Point (PnP) for relative pose extraction, and implement outlier rejection. Subsequently, loop edges are processed by the back end for PGO. As illustrated in Fig. 8, D2SLAM adopts distributed loop closure detection [2], [3] to minimize communication bandwidth in far mode, and direct loop clousure detection [1] in near and discover mode.
Unlike [1], in D2SLAM, we use 3-D positions of Superpoint landmarks, estimated by D2VINS from historical frames, along with 2-D observations from the current frame for PnP computation. In addition, with quad fisheye cameras as input, solving a multicamera PnP problem becomes necessary for relative pose extraction and geometric verification. In practice, we initially apply the UPnP RANSAC algorithm [53] for pose estimation and outlier rejection. Subsequent inlier results are then used in a BA problem for multicamera PnP [54], enhancing pose accuracy.

VI-D. Sliding Window & Marginalization 1) Sliding Window: In D2VINS, each UAV independently manages its keyframes within a sliding window, akin to the approach in [12]. Upon adding a new frame, if it is not a keyframe, the second newest frame is discarded; otherwise, the oldest keyframe is removed. Following each update of the sliding window, the latest information is shared with other UAVs in the swarm. 2) Management of Remote Keyframes: Successful multirobot sparse feature matching, as detailed in Section V-D, results in the inclusion of the remote frame’s pose vkTˆt j into the local state vector Xi for optimization in Problem (8). Upon receiving updated sliding window data from other UAVs, D2SLAM removes any remote keyframes from UAV k that are no longer within its local sliding window. This removal, akin to local keyframe deletion, leads to marginalization, which will be discussed further. 3) Marginalization: When discarding old keyframes, each subproblemfcvioi (X i) of Problem(8)is linearized, and a new set of priors {(rp0 , Hp0 ),(rp1 , Hp1 )···(rpN−1 , HpN−1 )} is computed using the Schur complement to exclude old states. This process, known as marginalization [12], [60], treats each subproblem independently and is therefore executed distributively. Since Problem (8) involves states averaged across UAVs (11), the marginalization process maintains consistency across UAVs, meaning states are linearized at similar or identical points on different UAVs.

VI-E. Initialization D2VINS initialization involves two phases: 1) local keyframe initialization and 2) remote UAV initialization. For local keyframes, utilizing stereo or multicamera setups, we employ triangulation for landmark position initialization where measurements have a sufficient baseline (typically 0.05 cm) from multiple cameras or due to motion. In addition, IMU predictions initialize poses of subsequent local UAV Keyframes. To enhance stability, priors are added to the unobservable components (x, y, z, yaw) of the first keyframe. In multi-UAV scenarios, PnP RANSAC (or UPnP) is used to initialize the pose of remote keyframes when their coordinate system references differ from the local UAV.

XI-F. Communication Effective communication is pivotal for successful aerial swarm operations. To shed light on this aspect, Table VII details the sizes and broadcast frequencies of key messages transmitted by D2SLAM on each UAV. In addition, Table VIII presents a comparison between D2SLAM’s communication strategy (as detailed in Section IV-B) and a baseline method. This baseline strategy involves broadcasting all measurements (represented as complete keyframes in our context) without changing communication modes and is referenced in [1]. D2SLAM’s tailored strategy significantly reduces communication volume, especially when UAVs are widely dispersed. Transitioning from the broader communication strategy, we delve into specifics for a 5-UAV swarm. Here, the maximum value of pk reaches 51, leading to a maximum D2VINS update size of 3.5 kB per timestep. The value of ek, which varies depending on the map and environment. For example, in TUM Corr 5 dataset, a typical ek value is 1262, resulting in each D2PGO update being approximately 81 kB.

Omni-swarm: A Decentralized Omnidirectional Visual-Inertial-UWB State Estimation System for Aerial Swarms

T-RO 2022
原文:arXiv:2103.04131

The global descriptor and landmarks(descriptor and 3D-position), together with the odometry and extrinsic, are packed into keyframe $\mathcal{F}^t_i$ , which will be broadcast to the entire swarm later.

There are two separate visual databases on each drone: 1) a remote database Dr that stores keyframes from remote drones and 2) a local database Dl that stores keyframes from the local drone. Extracting the map-based measurement for a pair of keyframes from Dr is avoided to save computational resources since all the extracted map-based measurements are broadcast to the whole swarm.

Loop Closure

VINS-Mono

The relocalization process starts with a loop detection module that identifies places that have already been visited. Feature-level connections between loop closure candidates and the current frame are then established. These feature correspondences are tightly integrated into the monocular VIO module, resulting in drift-free state estimates with minimum computation overhead. Multiple observations of multiple features are directly used for relocalization, resulting in higher accuracy and better state estimation smoothness. A graphical illustration of the relocalization procedure is shown in Fig. 9(a).

DBoW2 returns loop closure candidates after temporal and geometrical consistency check. We keep all BRIEF descriptors for feature retrieving, but discard the raw image to reduce memory consumption.

The only difference is that the pose (qˆ w v , pˆ w v ) of the loop closure frame, which is taken from the pose graph (Sect. VIII), or directly from past odometry output (if this is the first relocalization), is treated as a constant.

Note that the global optimization of past poses and loop closure frames happens after relocalization, and will be discussed in Sect. VIII.

After relocalization, the local sliding window shifts and aligns with past poses. Utilizing the relocalization results, this additional pose graph optimization step is developed to ensure the set of past poses are registered into a globally consistent configuration.

Since our visual-inertial setup renders roll and pitch angles fully observable, the accumulated drift only occurs in four degrees-of-freedom (x, y, z and yaw angle). To this end, we ignore estimating the drift-free roll and pitch states, and only perform 4-DOF pose graph optimization.

Pose graph optimization and relocalization (Sect. VII-C) runs asynchronously in two separate threads. This enables immediate use of the most optimized pose graph for relocalization whenever it becomes available. Similarly, even if the current pose graph optimization is not completed yet, relocalization can still take place using the existing pose graph configuration. This process is illustrated in Fig. 9(b).



[Paper Reading] MAVIS: Multi-Camera Augmented Visual-Inertial SLAM using SE2(3) Based Exact IMU Pre-integration


多相机系统的好处是,更大的FOV,对环境的全向观测,提高定位精度和弱纹理场景下的鲁棒性。 坏处则是更多的传感器带来了更多的计算消耗。The proper handling of measurements from all cameras is crucial for balancing accuracy, robustness, and computational efficiency.

[14] extended ORB-SLAM2 [15] to multi-camera setups, supporting various rigidly coupled multi-camera systems. [16] introduced an adaptive SLAM system design for arbitrary multi-camera setups, requiring no sensorspecific tuning. Several works [17]–[21] focus on utilizing a surround-view camera system, often with multiple nonoverlapping monocular cameras, or specializing in motion estimation for ground vehicles. While demonstrating advantages in robustness in complex environments, these methods exhibit limited performance in highly dynamic scenarios and minor accuracy improvements in real-world experiments.

Inspired by [22] and [9], we employ a localization strategy based on a local map, utilizing 2D extracted features and local 3D map points for precise pose estimation



Total views. cowboys. Hits