How Generative AI will transform web accessibility

Here at Elastic, we're on a mission not only to build the best search-powered AI platform, but also to make sure that as many people as possible enjoy using it. We believe that improving accessibility not only opens our platform up to new users, but also drives better experiences for everyone. The problem is that traditional web accessibility, while well-intentioned, can feel like forcing a square peg into a round hole. This is particularly true with complex, visually dense applications like Kibana.

For users with visual impairments, navigating these sophisticated visual interfaces can be a bit like trying to solve a Rubik's Cube in the dark. Generative AI, however, holds the potential to radically improve accessibility, and users' interaction with IT in general.

In this post, I wanted to share a pet project with you: I've been playing around with generative AI to build a smarter, more intuitive screen reader.

This experiment is greatly inspired by the pioneering efforts of Be My Eyes, an application engineered to connect the visually impaired with sighted volunteers via video call, and OpenAI's groundbreaking work in creating a virtual volunteer. By harnessing similar ingenuity, the goal is to forge an AI-powered conduit that helps translate visual data into comprehensible auditory feedback.

What is it like trying to use a screen reader?

For individuals with sight impairments, the internet can represent a labyrinth of obstacles, with the potential for inclusivity often stifled by inaccessible website design. Screen readers have become essential tools in this space, enabling visually impaired users to access and interact with web content. By converting text to speech or braille output, these assistive technologies allow users to 'hear' or 'feel' what is displayed on the screen.

Screen readers function by parsing the code of a webpage and reading out the textual content in a logical sequence. They rely on web developers to use specific tools such as semantic HTML – marking up headings, lists, links, and form elements with proper tags that provide context and structure to the content. This well-organized markup is critical, as it directly influences the usability of a site with a screen reader.

Furthermore, the adoption of ARIA (Accessible Rich Internet Applications) landmarks plays a pivotal role. These landmarks indicate sections of a web page (like banners, navigation bars, and main content areas), helping users to navigate more efficiently within pages. When these tools are wielded correctly, a better experience is crafted for screen reader users.
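As a rough illustration of why that markup matters, here is a short TypeScript sketch of the kind of page outline that landmark and heading navigation rely on. The selectors and output format are illustrative only, not how any particular screen reader is implemented:

```typescript
// Build a simple outline of a page from its semantic structure, similar in
// spirit to the landmark/heading navigation that screen readers offer.
// Illustrative sketch, not a real screen reader implementation.

interface OutlineEntry {
  kind: "landmark" | "heading";
  label: string;
}

function buildPageOutline(doc: Document = document): OutlineEntry[] {
  const outline: OutlineEntry[] = [];

  // Landmarks come from semantic elements or explicit ARIA roles.
  const landmarkSelector =
    "header, nav, main, aside, footer, [role='banner'], [role='navigation'], " +
    "[role='main'], [role='complementary'], [role='contentinfo']";
  doc.querySelectorAll(landmarkSelector).forEach((el) => {
    const role = el.getAttribute("role") ?? el.tagName.toLowerCase();
    const label = el.getAttribute("aria-label") ?? role;
    outline.push({ kind: "landmark", label: `${role}: ${label}` });
  });

  // Headings give the document its logical reading order.
  doc.querySelectorAll("h1, h2, h3, h4, h5, h6").forEach((el) => {
    outline.push({
      kind: "heading",
      label: `${el.tagName.toLowerCase()}: ${el.textContent?.trim() ?? ""}`,
    });
  });

  return outline;
}

// A page with good markup yields a meaningful outline; a page built from
// anonymous <div>s yields almost nothing for assistive technology to work with.
console.log(buildPageOutline());
```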

However, the limitations of screen readers are readily felt when users encounter web applications that are heavily visual or dynamically interactive, such as data visualizations or complex dashboards like those presented by Kibana. These types of content often resist traditional screen reader interpretation because they do not translate neatly into text.

Moreover, issues arise with non-text content, such as images or graphics, which require alternative text descriptions to be of any use to someone using a screen reader. But alternative text can fall short for complex images like charts, making it hard for users to fully grasp the deeper insights that visuals can provide.

Interactivity presents another hurdle; dynamic content that updates without a page refresh (common in many modern web applications) can be problematic unless properly announced by the screen reader. Without explicit notifications from the web application to the screen reader software, users may miss critical updates or changes, leading to confusion and a disrupted experience.
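The conventional remedy is an ARIA live region: the application writes status updates into a dedicated element, and screen readers announce the change. A minimal TypeScript sketch (the element id and example message are my own):

```typescript
// Announce dynamic updates to screen readers using an ARIA live region.
// Without something like this, content that changes after page load can
// silently pass a screen reader user by.

function getLiveRegion(): HTMLElement {
  let region = document.getElementById("app-announcer");
  if (!region) {
    region = document.createElement("div");
    region.id = "app-announcer";
    // "polite" waits for the user to pause; "assertive" interrupts.
    region.setAttribute("aria-live", "polite");
    region.setAttribute("role", "status");
    // Visually hidden, but still exposed to assistive technology.
    region.style.position = "absolute";
    region.style.width = "1px";
    region.style.height = "1px";
    region.style.overflow = "hidden";
    region.style.clipPath = "inset(100%)";
    document.body.appendChild(region);
  }
  return region;
}

export function announce(message: string): void {
  // Replacing the text content triggers the screen reader announcement.
  getLiveRegion().textContent = message;
}

// Example: after a dashboard panel refreshes without a page reload.
announce("Dashboard updated: 38 new records loaded.");
```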

While screen readers provide a valuable service, we must acknowledge the pressing need to evolve these technologies. As web applications grow in complexity and visual content continues to dominate digital experiences, it's clear that the traditional approach of screen readers must adapt. The usefulness of these tools hinges on our ability to innovate and expand their capabilities to meet the diverse needs of all users in our technologically rich world.

What's it really like to use a screen reader even on a simple website? Here is a nice example from YouTube:

Be My Eyes

Be My Eyes Logo

'Be My Eyes' is a remarkable example of harnessing technology to empower the visually impaired community, enabling a level of independence that was previously more challenging to attain. This revolutionary mobile application connects individuals with sight impairments to sighted volunteers worldwide via live video calls. Launched in 2015, 'Be My Eyes' has since cultivated a vibrant support network, fostering a sense of solidarity that transcends geographical boundaries.

The premise of the app is simple yet profound: a user with a visual impairment requests assistance, and a sighted volunteer receives a notification to join a video call. Within moments, the volunteer can lend their eyesight to help the user with a variety of tasks that range from reading expiration dates on perishable items to navigating new surroundings. This instantaneous and personal support goes a long way in making everyday challenges more manageable, providing a much-needed bridge to a world designed predominantly for those without visual limitations.

I have registered as a volunteer myself and would highly recommend doing the same.

Recently, Be My Eyes announced a partnership with OpenAI to introduce a virtual assistant called Be My AI. This replaces the live volunteer with a virtual one powered by GPT-4 and, specifically, the new image analysis capabilities of GPT-4o. You can read their blog here.

This got me thinking: how could we quickly explore whether something similar might be possible in a web browser?

I will add the caveat that I do not use a screen reader, so I only have an outsider's perspective.

Experimenting with GPT-4o

The easiest way I could think of to get something working was to build a Google Chrome extension. Extensions let you run code in the browser and give you access to the site being viewed.

The goals for the experiment were:

  • Describe the current page
  • Be able to ask a specific question, even if it wasn't in the initial summary
  • Be able to use both speech-to-text and text-to-speech

Conceptually, the extension works like this:

Plugin Logical Flow
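To give a feel for the moving parts, here is a heavily simplified TypeScript sketch of that flow. It assumes the code runs from the extension's popup, with the permissions needed for chrome.tabs.captureVisibleTab and an OpenAI API key supplied by you; the prompts, model choices, and structure are illustrative rather than a copy of the actual extension code (linked below).

```typescript
// Sketch of the extension's core loop: screenshot -> GPT-4o description -> spoken audio.
// Assumes a Chrome extension popup with permission to capture the visible tab.

const OPENAI_KEY = "sk-...";  // supply your own key

async function describeCurrentTab(question: string): Promise<string> {
  // 1. Capture what the user is currently looking at as a PNG data URL.
  const screenshot = await chrome.tabs.captureVisibleTab({ format: "png" });

  // 2. Ask GPT-4o to describe the screenshot (or answer a question about it).
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${OPENAI_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [
        {
          role: "system",
          content:
            "You are a screen reader assistant. Describe web pages concisely for a visually impaired user.",
        },
        {
          role: "user",
          content: [
            { type: "text", text: question },
            { type: "image_url", image_url: { url: screenshot } },
          ],
        },
      ],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
}

async function speak(text: string): Promise<void> {
  // 3. Turn the answer into natural-sounding speech and play it.
  const response = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${OPENAI_KEY}`,
    },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
  });
  const audio = new Audio(URL.createObjectURL(await response.blob()));
  await audio.play();
}

// Usage: describe the page, then read the description aloud.
describeCurrentTab("Describe the current page.").then(speak);
```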

It's hard to present the results of such an interactive tool in a blog so I recorded a short video with some sample results.

If you can't view the video, I will walk through a couple of examples.

The first one is a web page featuring funny cat images but with very little visible text. I thought this would be a good test of how well the system could describe and contextualize images like this. The screen looked something like this:

Funny Cats

The LLM did a great job of providing a high-level summary of the page without getting lost in detail, while still highlighting elements like the cookie notice. Here is the transcript:

"you are on an adobe stock web page with a search for funny cats the page displays several images of cats in humorous situations from left to right top to bottom one a cat peeking out of a cardboard box two a cat wearing sunglasses with a colorful background three a cat sitting inside a plastic container four a cat holding a sign that says I hate people in front of a height measurement chart five a cat wearing sunglasses against a light blue background there is also a cookie consent notice at the bottom of the page with options to enable all or don't enable cookies and a cookie settings"

You can then ask follow-up questions. When I asked specifically about the cat against the height chart, I got a much more detailed description. You'll notice the LLM picks up not just that it's a height chart, but also the cultural reference implying the cat is a criminal. Here is the reply:

"the image you referred to shows a cat standing in front of a height chart typically used for police mug shots the cat has a Stern expression and is holding a sign that reads I hate people the background features height markers ranging from 40 to 65 adding to the humor effect of the mugshot concept the overall tone of the image is playful and funny playing on the idea of the cat being a criminal"

We then moved on to looking at a dashboard in Kibana.

Kibana Dashboard

Again, it provided a very thorough overview of the elements in the dashboard and what the dashboard was trying to achieve, including some details but not too many. It did, however, consistently struggle with one element that described a latency in ms; it's not clear why this caused an issue. The transcript of the description is:

"the page is a monitoring dashboard from elastic here is a simple description of its main elements one header shows options for settings saving sharing and switching to view mode there's a search bar to find apps content and more modeled for mat's monitoring cluster with unsaved changes indicated two dashboard title editing benefits bot monitoring is displayed prominently three main panel rag search app monitoring with a description of the dashboard's purpose monitoring log data sent from the elastic powered rag app four visualization screens multiple visual representations including bar chart charts and pie charts that display various data count of Records 38 median of transaction duration 4,634 mless contribution to data by user Alice 26.32% and Bob 73.68% bar charts showing records over time and top values of host IPS audit log lists queries with Tim stamp service name query details user full name and the corresponding replies this dashboard helps in monitoring application performance and analyzing specific queries and data distributions over time"

What I found really impressive, though, was the ability to ask direct questions about the tables and charts and get a direct answer. When I ask "what is the top Host IP?", it has to identify which chart I'm talking about and then interpret that chart. The reply I get is simply:

"The top Host IP is 107.178.238.211 as shown in the bar chart titled Top 5 Valies of Host IP on the dashboard"

I had a lot of fun asking it different things to discover its strengths and weaknesses, and my whole experiment cost me about $0.50 in OpenAI credits. If you want to try the extension for yourself, I've made the code available here:

https://github.com/m-adams/chrome-gpt-reader

Conclusions

During my testing, despite some small errors the model made, I was hugely impressed with what could be achieved so quickly with a relatively naive approach. The first thing that stood out was the difference modern text-to-speech technology makes to how pleasant a system is to use. A natural-sounding voice automatically makes the interaction more enjoyable and accessible.

The second observation was the immediate shift from iteratively reading individual elements on a page to getting a quick overall summary, in context. This is hugely powerful and would be nearly impossible without something like a large language model (LLM). The ability to ask specific questions and get direct answers worked extremely well and was responsive enough to facilitate an iterative conversation, even with multiple calls to the LLM for speech-to-text and text-to-speech. This included being able to ask questions about complex pages like Kibana dashboards with multiple charts.

If all of this is possible without any integration into the site being used, just imagine the possibilities if you augmented an app like Kibana to:

  • Provide a system prompt with much more specific context on the use cases.
  • Expose possible actions the user could take, e.g., "Set Time Range," "Search," "Set Filter."
  • Integrate navigation around the app.

Such changes could essentially allow users to have a conversation with Kibana via the LLM. But would this only solve an accessibility problem? Think more broadly: imagine you are an SRE who gets a 2 a.m. on-call alert. Imagine being able to ask Kibana some follow-up questions to determine whether immediate action is necessary or whether it can wait until morning, without even logging in. Of course, Elastic is already moving in this direction with the development of AI Assistants embedded in Kibana, such as the Observability Assistant.

While this simple project demonstrates the potential of AI in enhancing accessibility, it also shows that this technology is not confined to accessibility alone. It can revolutionize how all users interact with complex data-driven interfaces.

I believe even this simple project could be extended by using content scripts to employ function calling to take actions on the site being viewed. The possibilities are vast, and we are just scratching the surface of what generative AI can do in making web applications more accessible and user-friendly for everyone.
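As a rough illustration of that idea, OpenAI's chat completions API supports tool (function) calling: you describe the actions a page exposes, and the model can respond with a structured call for a content script to execute. The tool definition and the applyTimeRange helper below are hypothetical examples, not an existing Kibana integration:

```typescript
// Sketch: expose page actions to the model via OpenAI tool calling, so a spoken
// request like "show me the last 24 hours" becomes a concrete action on the page.
// The tool definition and handler here are hypothetical examples.

const tools = [
  {
    type: "function",
    function: {
      name: "set_time_range",
      description: "Set the dashboard time range",
      parameters: {
        type: "object",
        properties: {
          from: { type: "string", description: "Start, e.g. 'now-24h'" },
          to: { type: "string", description: "End, e.g. 'now'" },
        },
        required: ["from", "to"],
      },
    },
  },
];

async function handleUserRequest(request: string, apiKey: string): Promise<void> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: request }],
      tools,
    }),
  });
  const data = await response.json();

  // If the model chose to call a tool, a content script would carry it out.
  const toolCall = data.choices[0].message.tool_calls?.[0];
  if (toolCall?.function.name === "set_time_range") {
    const args = JSON.parse(toolCall.function.arguments);
    applyTimeRange(args.from, args.to);
  }
}

// Hypothetical: in a real extension this would manipulate the page's time picker.
function applyTimeRange(from: string, to: string): void {
  console.log(`Setting time range ${from} -> ${to}`);
}

// Usage: handleUserRequest("Show me the last 24 hours", OPENAI_KEY);
```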

By continuing to innovate and push the boundaries of what's possible, we can create more inclusive and powerful digital experiences. This experiment has shown me that the future of web accessibility is bright, driven by the interactive potential of Generative AI.

Ready to try this out on your own? Start a free trial.
Looking to build RAG into your apps? Want to try different LLMs with a vector database?
Check out our sample notebooks for LangChain, Cohere and more on GitHub, and join Elasticsearch Relevance Engine training now.