Learning Object Context for Dense Captioning
DOI: 10.1609/aaai.v33i01.33018650
Publication Date: 2019-08-20
ABSTRACT
Dense captioning is a challenging task which not only detects visual elements in images but also generates natural language sentences to describe them. Previous approaches do not leverage object information in images for this task. However, objects provide valuable cues to help predict the locations of caption regions, as caption regions often highly overlap with objects (i.e., caption regions are usually parts of objects or combinations of them). Meanwhile, objects also provide important information for describing a target caption region, as the corresponding description not only depicts its properties, but also involves its interactions with other objects in the image. In this work, we propose a novel scheme with an object context encoding Long Short-Term Memory (LSTM) network to automatically learn complementary object context for each caption region, transferring knowledge from objects to caption regions. All the contextual objects are arranged as a sequence and progressively fed into the context encoding module to obtain context features. Then both the learned object context features and the region features are used to predict the bounding box offsets and generate the descriptions. The context learning procedure is performed in conjunction with the optimization of both location prediction and caption generation, thus enabling the object context encoding LSTM to capture and aggregate useful object context. Experiments on benchmark datasets demonstrate the superiority of our proposed approach over state-of-the-art methods.
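The abstract outlines the model at a high level: contextual object features are fed sequentially into an LSTM, and the aggregated context is fused with the region feature to predict bounding box offsets and to drive caption generation. The following is a minimal, illustrative sketch of that wiring, not the authors' released code: it assumes PyTorch, and the module names, feature dimensions, concatenation-based fusion, and the toy decoding loop are all assumptions made for illustration.

import torch
import torch.nn as nn


class ObjectContextEncoder(nn.Module):
    """Encodes a sequence of contextual object features into one context feature."""

    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, object_feats):
        # object_feats: (batch, num_objects, feat_dim); contextual objects are
        # arranged as a sequence and fed progressively into the LSTM.
        _, (h_n, _) = self.lstm(object_feats)
        return h_n[-1]  # (batch, hidden_dim): aggregated object context


class DenseCaptionHead(nn.Module):
    """Fuses region and object-context features for localization and captioning."""

    def __init__(self, feat_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.context_encoder = ObjectContextEncoder(feat_dim, hidden_dim)
        fused_dim = feat_dim + hidden_dim
        self.bbox_offset = nn.Linear(fused_dim, 4)         # (dx, dy, dw, dh)
        self.decoder = nn.LSTMCell(fused_dim, hidden_dim)  # toy caption decoder
        self.word_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feat, object_feats, max_len=5):
        # region_feat: (batch, feat_dim); object_feats: (batch, num_objects, feat_dim)
        context = self.context_encoder(object_feats)
        fused = torch.cat([region_feat, context], dim=-1)
        offsets = self.bbox_offset(fused)                  # bounding box refinement

        # Toy unrolled decoder: the fused feature conditions every decoding step.
        h = fused.new_zeros(fused.size(0), self.word_proj.in_features)
        c = torch.zeros_like(h)
        word_logits = []
        for _ in range(max_len):
            h, c = self.decoder(fused, (h, c))
            word_logits.append(self.word_proj(h))
        return offsets, torch.stack(word_logits, dim=1)


if __name__ == "__main__":
    head = DenseCaptionHead()
    region = torch.randn(2, 512)      # features of 2 caption regions
    objects = torch.randn(2, 6, 512)  # 6 contextual objects per region
    offsets, logits = head(region, objects)
    print(offsets.shape, logits.shape)  # (2, 4) and (2, 5, 10000)

The joint optimization mentioned in the abstract would, in a sketch like this, simply mean back-propagating both the offset regression loss and the word prediction loss through the shared context encoder so it learns to aggregate object context useful for both tasks.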