Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

With recent advances in computer vision and natural language processing, there has been growing interest in multimodal intelligent tasks that require concurrently understanding multiple forms of input data, such as images and text. Vision-and-language navigation (VLN) requires aligning and grounding these multimodal inputs to perceive the task status in real time from panoramic images and natural language instructions. This study proposes a novel deep neural network model (JMEBS) with joint multimodal embedding and backtracking search for VLN tasks.

The proposed JMEBS model uses a transformer-based joint multimodal embedding module that exploits both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path based on local and global scores for candidate actions.
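To make the embedding module concrete, below is a minimal sketch (not the authors' implementation) of a transformer that jointly encodes instruction tokens and panoramic view features, then fuses the result with a recurrent temporal context. The dimensions, the GRU-based temporal update, and all module names are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class JointMultimodalEmbedding(nn.Module):
    """Illustrative joint embedding of instruction tokens and panoramic views."""

    def __init__(self, vocab_size=1000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.view_proj = nn.Linear(2048, d_model)   # project panoramic view features
        self.type_emb = nn.Embedding(2, d_model)    # 0 = text token, 1 = visual token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # A GRU cell carries temporal context across navigation steps (assumed design).
        self.temporal = nn.GRUCell(d_model, d_model)

    def forward(self, instr_tokens, view_feats, h_prev):
        # instr_tokens: (B, L) word ids; view_feats: (B, V, 2048); h_prev: (B, d_model)
        txt = self.word_emb(instr_tokens) + self.type_emb.weight[0]
        vis = self.view_proj(view_feats) + self.type_emb.weight[1]
        joint = self.encoder(torch.cat([txt, vis], dim=1))  # cross-modal self-attention
        step_ctx = joint.mean(dim=1)                         # pooled multimodal context
        h_t = self.temporal(step_ctx, h_prev)                # fuse with temporal context
        return joint, h_t


if __name__ == "__main__":
    model = JointMultimodalEmbedding()
    tokens = torch.randint(0, 1000, (1, 12))
    views = torch.randn(1, 36, 2048)      # e.g., 36 discretized panoramic views
    h0 = torch.zeros(1, 256)
    joint, h1 = model(tokens, views, h0)
    print(joint.shape, h1.shape)
```

The joint token sequence lets attention relate instruction words directly to individual views, while the recurrent state preserves information from earlier steps of the trajectory.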

A novel global scoring method further improves performance by comparing the partial trajectories searched so far against multiple natural language instructions. The performance of the proposed model on various tasks was then experimentally demonstrated and compared with other models using the Matterport3D Simulator and the room-to-room (R2R) benchmark dataset.
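The sketch below abstracts how such a backtracking search can combine local and global scores; it is not the paper's BGLS algorithm, and the interface (an `expand` function over navigable candidates, a `local_score` for single actions, and a `global_score` that rates a partial trajectory against the instruction) is an assumption for illustration.

```python
import heapq
import itertools


def bgls_sketch(start, expand, local_score, global_score, is_goal, max_steps=30):
    """Greedy local search with backtracking (illustrative sketch).

    Greedily extends the trajectory with the best-scoring candidate action;
    when a previously stored alternative outscores the current branch,
    the agent backtracks and resumes the search from that alternative.
    """
    counter = itertools.count()        # tie-breaker so the heap never compares paths
    frontier = []                      # max-heap of (score, partial trajectory)
    path = [start]
    for _ in range(max_steps):
        node = path[-1]
        if is_goal(node, path):
            return path
        for cand in expand(node):      # navigable candidates from the current node
            score = local_score(node, cand) + global_score(path + [cand])
            heapq.heappush(frontier, (-score, next(counter), path + [cand]))
        if not frontier:
            break
        # Pop the best partial trajectory seen so far; if it extends the current
        # path this is a greedy step, otherwise it is a backtrack to another branch.
        _, _, path = heapq.heappop(frontier)
    return path
```

Because the global score evaluates an entire partial trajectory against the instruction, branches explored at different times remain directly comparable, which is what makes backtracking to an earlier alternative meaningful.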
