Large language models like GPT-4 are incredibly powerful, but they often struggle with basic tasks involving visual perception, like counting objects in a picture. Part of the difficulty may stem from how these models process high-resolution images.

Most current multimodal AI systems can only perceive images at a fixed low resolution, like 224×224 pixels. But real-world images come in all sizes and shapes. Simply resizing or cropping them results in distortion, blurriness, and loss of detail that stops the models from understanding fine-grained visual information.

Researchers from Tsinghua University, National University of Singapore and University of Chinese Academy of Sciences tackled this challenge by developing LLaVA-UHD (shown in Figure 4), a new method for building encoder-decoder models that can flexibly handle high-resolution images at any aspect ratio. But how does it actually work?

The core idea is to intelligently split up large images into smaller, variable-sized “slices” that don’t stray too far from the original training data for the visual encoder. Each slice is resized to suit the encoder while preserving its native aspect ratio. A shared “compression layer” then condenses the visual tokens for every slice to reduce the computational load on the language model.
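To make the slicing idea concrete, here is a minimal sketch of one plausible way to pick a slice grid. It is not necessarily the authors' exact procedure: the function names, the 336×336 encoder size used as a default, and the log-ratio scoring rule are all assumptions made for illustration. The sketch estimates how many encoder-sized tiles would cover the image, tries a few nearby slice counts, and keeps the grid whose slices are closest in aspect ratio to the encoder's native input.

```python
import math


def choose_slice_grid(img_w, img_h, enc_w=336, enc_h=336):
    """Return (cols, rows) for a grid whose slices resemble the encoder's input shape.

    Hypothetical helper: the scoring rule below is an assumption, not the paper's code.
    """
    # Roughly how many encoder-sized tiles are needed to cover the image area.
    ideal = max(1, math.ceil((img_w * img_h) / (enc_w * enc_h)))
    target = math.log(enc_w / enc_h)
    best = None
    # Try the ideal slice count and its neighbours, in a deterministic order.
    for n in sorted({max(1, ideal - 1), ideal, ideal + 1}):
        for cols in range(1, n + 1):
            if n % cols:
                continue  # only grids that use exactly n slices
            rows = n // cols
            # Aspect ratio of a single slice under this grid.
            slice_ratio = (img_w / cols) / (img_h / rows)
            # Log-space distance from the encoder's native aspect ratio.
            score = abs(math.log(slice_ratio) - target)
            candidate = (score, n, cols, rows)  # ties favour fewer slices
            if best is None or candidate < best:
                best = candidate
    return best[2], best[3]


def slice_boxes(img_w, img_h, cols, rows):
    """Return (left, top, right, bottom) pixel boxes in row-major order."""
    xs = [round(img_w * c / cols) for c in range(cols + 1)]
    ys = [round(img_h * r / rows) for r in range(rows + 1)]
    return [(xs[c], ys[r], xs[c + 1], ys[r + 1])
            for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    cols, rows = choose_slice_grid(672, 1088)
    print(cols, rows)
    print(slice_boxes(672, 1088, cols, rows))
```

Under these assumptions, a 672×1088 image is split into a grid of 2 columns and 3 rows, so each slice is close to square. Each box could then be cropped and resized to the encoder's input size with an image library of your choice.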

To give the language model spatial context for the slice layout, LLaVA-UHD uses a simple positional encoding scheme, with commas separating slices within a row and newlines separating rows. Clever, right? The net effect is that LLaVA-UHD can flexibly parse high-res images up to 672×1088 pixels using just 94% of the compute needed for low-res 336×336 images with previous models.
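The layout scheme is easy to picture with a small sketch. The function name and the `<slice_r_c>` placeholder strings below are hypothetical; in the real model each placeholder would stand for a slice's compressed visual tokens rather than literal text. The sketch only shows how commas and newlines can encode a 2D grid in a 1D sequence.

```python
def layout_prompt(rows, cols):
    """Arrange slice placeholders so the grid shape is readable from the sequence.

    Hypothetical helper: placeholder names are illustrative, not the model's tokens.
    """
    lines = []
    for r in range(rows):
        # Slices in the same row are separated by commas.
        lines.append(",".join(f"<slice_{r}_{c}>" for c in range(cols)))
    # Rows are separated by newlines.
    return "\n".join(lines)


if __name__ == "__main__":
    print(layout_prompt(rows=3, cols=2))
```

For a grid of 3 rows and 2 columns, this produces three lines with two placeholders each, so the language model can infer where each slice sits without any extra position embedding.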

The researchers put their method through its paces on 9 challenging multimodal benchmarks spanning visual question answering, optical character recognition, and more. Across the board, LLaVA-UHD outperformed prior models, all while using far less computing power during training. On the TextVQA benchmark testing OCR capabilities, it achieved a 6.4 point accuracy boost over the previous best, as shown in Table 1.

Why such a performance leap? By preserving fine-grained visual details at native high resolutions, LLaVA-UHD can simply understand images better than models squinting at low-res, blurry inputs. No more making best guesses: it gets the complete picture.

Of course, the work isn’t over. Even higher resolutions and more advanced tasks like object detection await. But LLaVA-UHD takes a vital step toward true visual intelligence for AI by letting language models perceive the world in vivid detail, just as we humans do.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
