Show HN: Tarsier – vision for text-only LLM web agents that beats GPT-4o


Tarsier, built by Reworkd and hosted on GitHub, provides vision utilities for web interaction agents. It helps LLMs automate web interactions by visually tagging the interactable elements on a page, so that a model can refer to a tagged element when issuing an action such as 'CLICK'. For text-only models, it uses OCR to turn webpage content into a whitespace-structured string that an LLM can read. It works with LLMs such as GPT-4 and can be installed with pip.
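To make the tagging idea concrete, here is a minimal sketch of how tagged elements could be mapped back to page locations when a model replies with an action. It is not Tarsier's code or API: the `Element` class, the `[n]` bracket notation, and the `tag_elements` and `resolve_action` helpers are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for an interactable page element."""
    xpath: str  # where the element lives in the page
    label: str  # visible text shown to the model


def tag_elements(elements):
    """Give each element a numeric tag; return (tagged_text, tag_to_xpath)."""
    tag_to_xpath = {}
    lines = []
    for tag_id, element in enumerate(elements):
        tag_to_xpath[tag_id] = element.xpath
        # Assumed notation: "[id] label". The real tag format may differ.
        lines.append(f"[{tag_id}] {element.label}")
    return "\n".join(lines), tag_to_xpath


def resolve_action(action, tag_to_xpath):
    """Parse a model reply such as 'CLICK [1]' into (verb, xpath)."""
    match = re.fullmatch(r"\s*([A-Z]+)\s*\[(\d+)\]\s*", action)
    if match is None:
        raise ValueError(f"unrecognised action: {action!r}")
    verb, tag_id = match.group(1), int(match.group(2))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"unknown tag id: {tag_id}")
    return verb, tag_to_xpath[tag_id]


elements = [Element("//a[1]", "Home"), Element("//button[1]", "Sign in")]
tagged_text, tag_to_xpath = tag_elements(elements)
print(tagged_text)
print(resolve_action("CLICK [1]", tag_to_xpath))
```

The point of the mapping is that the model only ever sees short tag IDs, while the agent keeps the lookup table needed to act on the real page.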

  • Tarsier provides a visual tagging system for web elements.
  • It aids LLMs in automating web tasks.
  • The tool includes a custom OCR algorithm.
  • It supports Google Cloud Vision OCR Service.
  • Reworkd plans to add more features and support.
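
The idea of a whitespace-structured string can be illustrated with a small sketch that places OCR words on a character grid according to their pixel positions. This is not Tarsier's custom algorithm. The `Word` class, the `layout_words` function, and the fixed `char_width` and `line_height` values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR result: a piece of text and its top-left corner."""
    text: str
    x: float  # left edge in pixels
    y: float  # top edge in pixels


def layout_words(words, char_width=10.0, line_height=20.0):
    """Place words on a character grid so spacing mirrors the page layout."""
    rows = {}
    for word in words:
        # Bucket words into text rows by vertical position.
        rows.setdefault(round(word.y / line_height), []).append(word)
    if not rows:
        return ""

    lines = []
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for word in sorted(rows.get(row, []), key=lambda w: w.x):
            column = round(word.x / char_width)
            # Keep at least one space between words that would collide.
            if line and column <= len(line):
                column = len(line) + 1
            line = line.ljust(column) + word.text
        lines.append(line)  # rows with no words become blank lines
    return "\n".join(lines)


words = [Word("Login", 0, 0), Word("Help", 100, 0), Word("Search", 20, 40)]
print(layout_words(words))
```

Padding with spaces and blank lines keeps horizontal and vertical gaps roughly proportional to the page, which is what lets a text-only model infer which items sit next to each other.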