Research Directions


Accessibility of Multimodal Programming Tutorials And Lectures

Many methods have been conceived for recording text instead of visual screens. Most, like asciinema and ttyrec, are built around the abstraction of the UNIX terminal. We propose that the semantics of these programs still rest on an incorrect underlying assumption about the data: that it is a UNIX terminal.

Recording the difference in text between two points in time is a generalizable operation, and one that an open protocol could describe. This gives you the choice to coalesce updates down to whatever rate your application requires, while recording at effectively unlimited precision. Each recorded change would carry a relative time, whether it is an insertion or a deletion, its source (user, editor, other), the text itself, and its position; in practice this would be integrated as an editor plugin.
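
To make this concrete, here is a minimal sketch (in Rust) of what one recorded event might look like; the struct and variant names are hypothetical and not drawn from any existing protocol:

    // Hypothetical event format for recording text changes over time.
    // Field names and variants are illustrative, not an existing protocol.

    /// Where a change originated from.
    #[derive(Debug, Clone)]
    enum Source {
        User,
        Editor,
        Other(String),
    }

    /// A single edit, relative to the previous event.
    #[derive(Debug, Clone)]
    enum Edit {
        /// Insert `text` at byte offset `position`.
        Insert { position: usize, text: String },
        /// Delete `len` bytes starting at byte offset `position`.
        Delete { position: usize, len: usize },
    }

    /// One recorded change; a full session is just a Vec<Event>.
    #[derive(Debug, Clone)]
    struct Event {
        /// Milliseconds since the previous event (relative time).
        delta_ms: u64,
        source: Source,
        edit: Edit,
    }

    fn main() {
        // A tiny recorded session: type "fn main() {}", then delete the "{}".
        let session = vec![
            Event {
                delta_ms: 0,
                source: Source::User,
                edit: Edit::Insert { position: 0, text: "fn main() {}".into() },
            },
            Event {
                delta_ms: 350,
                source: Source::User,
                edit: Edit::Delete { position: 10, len: 2 },
            },
        ];
        // Coalescing to a coarser update rate is just folding adjacent events
        // whose accumulated delta_ms stays under the desired frame budget.
        for event in &session {
            println!("{:?}", event);
        }
    }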

This idea came out of my work as an alternate format transcriber: so much of the work was re-recording visual artifacts in non-visual formats. While progress in this area (LLMs) has added much utility, there are still no tools for recording and publishing effective programming tutorials. I have done some basic prototype work in this area with Semantigram, which aims to interactively unify semantics with visual and non-visual representations of the underlying data.

Are The Semantics of Data Inherent or Inferred By Context?

The semantics of data (what a given data set means, so to speak): is this something inherent in the data, or something explicitly conveyed by the writer of the data? Or, alternatively, are data semantics contextually inferred from the use case?

If it is the latter, could a system for interpreting data be built around different "shapes" of data? For example, trees have a defined, specific structure; could the concept of viewing data through a tree lens be written into a library of functions for visual, sonic, tactile, and semantic feedback? If so, data "lenses" would become the way in which data is interpreted.
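
As a rough sketch of what such a lens library might look like, the Rust below defines a hypothetical Lens trait and a single textual/semantic lens over tree-shaped data; sonic and tactile lenses would simply be additional implementations of the same trait. All names here are illustrative assumptions, not an existing library:

    // Hypothetical sketch of a "data lens" interface: one data shape (a tree),
    // several possible feedback channels. Names are illustrative only.

    /// A generic tree node; the data being viewed.
    struct TreeNode {
        label: String,
        children: Vec<TreeNode>,
    }

    /// A lens renders the same data for one feedback channel.
    trait Lens {
        /// A textual/semantic rendering (e.g. for a screen reader);
        /// sonic or tactile lenses would be other impls of this trait.
        fn describe(&self, node: &TreeNode, depth: usize) -> String;
    }

    /// One concrete lens: an indentation-based semantic outline.
    struct OutlineLens;

    impl Lens for OutlineLens {
        fn describe(&self, node: &TreeNode, depth: usize) -> String {
            let mut out = format!(
                "{}{} ({} children)\n",
                "  ".repeat(depth),
                node.label,
                node.children.len()
            );
            for child in &node.children {
                out.push_str(&self.describe(child, depth + 1));
            }
            out
        }
    }

    fn main() {
        let tree = TreeNode {
            label: "root".into(),
            children: vec![
                TreeNode { label: "left".into(), children: vec![] },
                TreeNode { label: "right".into(), children: vec![] },
            ],
        };
        print!("{}", OutlineLens.describe(&tree, 0));
    }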

Additionally, this opens other lines of research: how to identify which data lenses are most likely to be used, whether multiple simultaneous lenses provide useful information to the end user, and how the approach applies to assistive technology.

Based on the following works:

Text-To-Speech Resources for Screen Reader Users?

If it is true that the most passionate people in a field are those most impacted by its creation and use, then screen reader users (who are mostly blind or visually impaired) would benefit the most from the TTS field being accessible to them. Given that existing TTS resources are mostly visual, focusing primarily on how sound is created, manipulated, and stitched together, screen reader users would benefit greatly from accessible resources in this area.

There is a lot of pre-TTS knowledge required before pure-TTS resources can be digested. It crosses multiple domains, from computer science to mathematics and linguistics. All of this would be necessary prerequisite knowledge before general TTS resources are created.

Working with an existing provider of accessible materials like OpenStax would be the primary goal: making sure that base resources are in place first, then making an accessible version of Paul Taylor's textbook Text-to-Speech Synthesis available (even if paid).

How Does Screen Reader Latency Affect Cognitive Load?

Screen reader users navigate documents in ways similar to sighted readers, but instead of visually scanning for large or bold headers, they navigate structurally using their screen reader. Given that processing large documents can take some time, are there cases where screen reader users experience increased cognitive load during high-latency actions? What are those actions? Why are they so slow? Can they be made more performant (at least within the threshold latency for human perception)?

The Structure And Latency of Accessibility Trees

Most assistive technology APIs use the abstraction of a "tree" to describe how elements of the system relate to each other. Large documents often slow down screen reader software, causing P99 latencies that affect cognitive load (this would need the cognitive load study above to be run first, or to be argued from the general HCI literature). Screen readers take advantage of the tree structure by providing functionality like "structural navigation", where the user can jump forward or backward to the next element with a given role; this is another way in which screen readers are often (citation needed) performance strained, especially if the user attempts to navigate to a role that is not present at all. That case causes (confirmed on Linux) extremely long latencies, as every node in the entire tree is searched.
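
The sketch below illustrates why that worst case is so expensive, assuming structural navigation is implemented as a naive depth-first search in document order (a simplification; real screen readers and accessibility APIs differ in the details). When the requested role does not exist, the search only gives up after visiting every node:

    // Illustrative sketch (not any real accessibility API): structural
    // navigation as a depth-first search for the next node with a given role.

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum Role {
        Heading,
        Link,
        Table,
        Paragraph,
    }

    struct Node {
        id: u64,
        role: Role,
        children: Vec<Node>,
    }

    /// Find the first node after `current_id` (in document order) with `role`.
    /// If no such node exists, this walks the *entire* tree before giving up,
    /// which is where the worst-case latency comes from.
    fn next_with_role(root: &Node, current_id: u64, role: Role) -> Option<u64> {
        let mut past_current = false;
        // Depth-first, pushing children in reverse to keep document order.
        let mut stack = vec![root];
        while let Some(node) = stack.pop() {
            if past_current && node.role == role {
                return Some(node.id);
            }
            if node.id == current_id {
                past_current = true;
            }
            for child in node.children.iter().rev() {
                stack.push(child);
            }
        }
        // Role not present after the current node: every node was visited.
        None
    }

    fn main() {
        let root = Node {
            id: 0,
            role: Role::Paragraph,
            children: vec![
                Node { id: 1, role: Role::Heading, children: vec![] },
                Node { id: 2, role: Role::Link, children: vec![] },
            ],
        };
        // Jumping to the next Table visits every node and finds nothing.
        assert_eq!(next_with_role(&root, 1, Role::Table), None);
        println!("no table found; the whole tree was searched");
    }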

This study aims to record the shape of the accessibility tree on 100 popular websites across two platforms (Windows and Linux). Finally, based on the functionality screen readers actually provide, we would determine ways to reduce the latency of jumps between elements of a page, with options ranging from caching to alternate structures and accessibility APIs.

Caches for Assistive Technology: Are They Used or Effective?

Conduct a study of caching among screen readers. Is a caching system used to hold temporarily fresh state, and can stale cache entries be marked as such? How is caching done in the open-source screen readers (NVDA, Orca, TalkBack)?
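
As a hypothetical sketch only (not how NVDA, Orca, or TalkBack actually implement their caches), marking entries stale might look something like this: change events flip a flag, and reads fall back to the accessibility API whenever the flag is set:

    // Hypothetical sketch of a screen reader cache with explicit staleness
    // marking; names and structure are assumptions for illustration.
    use std::collections::HashMap;

    #[derive(Debug, Clone)]
    struct CachedNode {
        name: String,
        role: String,
        /// Set when an accessibility event reports this node changed; the
        /// next read re-fetches from the accessibility API instead of
        /// trusting the cached value.
        stale: bool,
    }

    #[derive(Default)]
    struct Cache {
        nodes: HashMap<u64, CachedNode>,
    }

    impl Cache {
        fn insert(&mut self, id: u64, name: &str, role: &str) {
            self.nodes.insert(
                id,
                CachedNode { name: name.into(), role: role.into(), stale: false },
            );
        }
        /// Called when a change event arrives for `id`.
        fn mark_stale(&mut self, id: u64) {
            if let Some(node) = self.nodes.get_mut(&id) {
                node.stale = true;
            }
        }
        /// Returns a hit only if the entry is present *and* fresh.
        fn get_fresh(&self, id: u64) -> Option<&CachedNode> {
            self.nodes.get(&id).filter(|n| !n.stale)
        }
    }

    fn main() {
        let mut cache = Cache::default();
        cache.insert(7, "Submit", "button");
        println!("fresh hit: {:?}", cache.get_fresh(7));
        cache.mark_stale(7); // e.g. a name-changed event arrived
        assert!(cache.get_fresh(7).is_none()); // forces a re-fetch
        println!("stale entry correctly bypassed");
    }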

This curiosity came out of my development of Odilia, where I wanted to make a screen reader with lower latency (other options on Linux at the time had performance issues) and ran into the complexities of implementing a cache in a highly concurrent, asynchronous system.

Partially-Materialized Stateful Viewing as a Parallel Screen Reader Technology?

Most screen readers effectively work off the assumption of a defined tree hierarchy exposed via the system accessibility API. Likewise, screen reader actions and reactions need to be synchronized to prevent overlapping sounds; therefore, screen readers tend to be implemented in a serial manner. Serial processing of commands can cause "jamming" when a single command has extremely high latency.
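
A toy illustration of that jamming effect, with entirely hypothetical command names and simulated latencies: a cheap command queued behind an expensive one cannot run until the expensive one finishes:

    // Illustrative sketch of "jamming" in a serial command loop: one slow
    // command delays everything queued behind it. Names are hypothetical.
    use std::collections::VecDeque;
    use std::thread::sleep;
    use std::time::{Duration, Instant};

    enum Command {
        SpeakLine,   // cheap: read the current line
        NextHeading, // potentially expensive: search a huge tree
    }

    fn handle(cmd: &Command) {
        match cmd {
            Command::SpeakLine => sleep(Duration::from_millis(1)),
            // Simulate a structural-navigation search over a huge document.
            Command::NextHeading => sleep(Duration::from_millis(500)),
        }
    }

    fn main() {
        let mut queue: VecDeque<Command> =
            VecDeque::from(vec![Command::NextHeading, Command::SpeakLine]);
        let start = Instant::now();
        // Serial processing: the cheap SpeakLine cannot run until the slow
        // NextHeading in front of it finishes.
        while let Some(cmd) = queue.pop_front() {
            handle(&cmd);
            println!("finished a command at {:?}", start.elapsed());
        }
    }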

This research would investigate the possibility of structuring a screen reader as a set of parallel, reactive, partially-materialized views that take the accessibility tree as input and produce the desired text for the TTS engine as output. It builds on Jon Gjengset's PhD thesis on partially-materialized database views, but applied to assistive technology. The primary motivation for partially-materialized database views is massively parallel reads: there is effectively no overhead in reading the data, and no caching is required "in front" of the database. Our goals are somewhat different: to find ways in which reactive state changes can produce TTS output, acting as a screen reader would. Some screen reader queries are made in reaction to user input as well; these would run effectively instantly, since the only work is to read the value from the view.
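
A minimal sketch of the idea, maintaining the view incrementally by hand rather than with a real dataflow engine like the one in Gjengset's work; the types and the spoken-text format are assumptions for illustration. All computation happens on the write path as tree updates arrive, so a user-triggered query is a plain lookup:

    // Minimal sketch of a "materialized view" over the accessibility tree,
    // maintained incrementally by hand. Names are hypothetical.
    use std::collections::HashMap;

    /// An update to the accessibility tree (the view's input stream).
    enum TreeUpdate {
        Upsert { id: u64, role: String, name: String },
        Remove { id: u64 },
    }

    /// The view: node id -> the exact text the TTS engine should speak.
    /// Reads are plain lookups, so a user query has effectively no cost;
    /// all work happens on the write path, as each update arrives.
    #[derive(Default)]
    struct SpokenTextView {
        spoken: HashMap<u64, String>,
    }

    impl SpokenTextView {
        /// Write path: incrementally maintain the view on each tree update.
        fn apply(&mut self, update: TreeUpdate) {
            match update {
                TreeUpdate::Upsert { id, role, name } => {
                    self.spoken.insert(id, format!("{name}, {role}"));
                }
                TreeUpdate::Remove { id } => {
                    self.spoken.remove(&id);
                }
            }
        }
        /// Read path: what a user-triggered query does.
        fn speak(&self, id: u64) -> Option<&str> {
            self.spoken.get(&id).map(String::as_str)
        }
    }

    fn main() {
        let mut view = SpokenTextView::default();
        view.apply(TreeUpdate::Upsert { id: 1, role: "button".into(), name: "Submit".into() });
        view.apply(TreeUpdate::Upsert { id: 1, role: "button".into(), name: "Submitting".into() });
        // The user presses a key: the answer is already materialized.
        assert_eq!(view.speak(1), Some("Submitting, button"));
        println!("{}", view.speak(1).unwrap());
    }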

Although I suspect the overhead of this would be quite large, at least it would shift that overhead onto CPU and memory. Importantly, it theoretically should not impact the latency of the screen reader at all, unless the sheer quantity of updates sent to the engine overwhelms the view calculations.

Parallel Screen Reader Technology?

As described in the previous section, screen readers assume a defined tree hierarchy exposed via the system accessibility API, and tend to be implemented serially so that speech does not overlap; a single high-latency command can therefore "jam" everything queued behind it.

This research would investigate the possibility of a screen reader as a set of parallel scripts that take the full accessibility tree as input and produce spoken text as output. How often should the scripts be run? And could they be made reactive to change? That question leads to the specialization described in the previous section: partially-materialized state views reacting to changes in accessibility tree state.