In this review, automatic image description, its closely related problems, and the recent advances in the field are carefully studied. The related work is divided into three categories:
(i) Direct description generation from images,
(ii) Retrieval of images from a visual space, and
(iii) Retrieval of images from multimodal (joint visual and language-based) space.
In addition, a brief review of the existing literature and automatic evaluation measures is given, and some future directions for vision and language research are discussed. Compared to traditional keyword-based image annotation (using object recognition, attribute detection, scene labeling, etc.), automatic image description systems produce more human-like explanations of visual content, providing a more complete picture of the scene. Progress in this field could lead to more intelligent artificial vision systems, which can reason about scenes through the generated grounded image descriptions and therefore interact with their environment in a more natural manner. Such systems could also directly benefit technological applications for visually impaired people through more accessible interfaces.

Despite the impressive increase in the number of image description systems over the last few years, experimental results suggest that system performance still falls short of human performance. A similar challenge lies in the automatic evaluation of systems using reference descriptions. The measures and tools currently in use do not correlate strongly enough with human judgments, pointing to a need for measures that can adequately handle the complexity of the image description problem.