tait.tech

From Software Noob To Linux Accessibility Master


Latest edit on: January 12th, 2021.

Introduction

Here are some interesting problems I have faced when working with DBus, AT-SPI (Assistive Technology Service Provider Interface) and the Rust programming language. I realize these are fairly unique constraints, and this information is likely only relevant for a select few, but I thought the experience might be worthwhile to write down: for my own sanity when I inevitably experience these same issues later, and for others who may want to contribute to our new screen reader project, Odilia.

DBus

DBus is a cool API! Well, it’s not an API, but rather a mechanism to share messages across processes in Linux; this is generally called IPC or Inter-Process Communication. DBus can be used to send and receive desktop notifications, shut down your computer and, for my purposes, get accessibility events.

Inner Workings

DBus is an object-oriented approach to IPC. It is split up into 5 main components that work together:

  1. Objects
  2. Interfaces
  3. Methods
  4. Properties
  5. Buses

Objects, Methods & Properties

Objects are just like the objects you learned about in your CS classes: an object is a structure which contains attributes, along with methods which can be called on it. DBus’ objects are very similar, except that attributes are called properties.

Most DBus libraries provide a way for you to use “native objects” (i.e., a Python object, a C++ object, a Rust structure + implementation, etc.); this allows access to DBus methods using the language features available to you. So for example, in Python you might write:

obj = get_a_dbus_object()
print(obj.get_text()) # using a method
print(obj.locale) # using a property

This would print out whatever may be returned from the object’s GetText method and what is found in the locale property. Notice that DBus method names are conventionally written in Pascal case (i.e., every word starts with a capital letter), even though the Python binding exposes them in snake case.
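To make the naming difference concrete, here is a minimal sketch of how a binding could turn a Pascal case DBus name into a snake case one. The function name pascal_to_snake is made up for this illustration, and it assumes plain ASCII names without runs of capitals; real bindings may well do this differently.

```rust
/// Convert a Pascal case DBus member name (e.g. "GetText") into the
/// snake case form a native binding might expose (e.g. "get_text").
/// Assumption: ASCII names with no runs of capitals such as "GetURI".
fn pascal_to_snake(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    for (i, ch) in name.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            // Every capital after the first starts a new word.
            if i != 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}
```

Calling pascal_to_snake("GetText") gives back "get_text", which is the spelling used in the Python snippet above.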

Interfaces

A DBus interface (not to be confused with a Java interface, or a Rust trait) is a definition of a collection of methods and properties. For example, the “Text” interface may have a property like “Length” or a method like “GetText”. So the interface “Text” is just a list of methods and properties all wrapped up together. That’s it! That simple!

This will come in handy later when we need to check if an object implements a method; this way we can check for an entire interface of methods and properties instead of checking for each individually.
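As a rough sketch of that idea, an object can be modeled as a set of interface names, and "does it support this method?" becomes a single lookup. The AccessibleObject type and the short interface names below are stand-ins for illustration, not the API of any real DBus library.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an object we learned about over DBus.
struct AccessibleObject {
    /// Names of the interfaces this object says it implements.
    interfaces: HashSet<String>,
}

impl AccessibleObject {
    fn new(interfaces: &[&str]) -> Self {
        Self {
            interfaces: interfaces.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// One check covers every method and property of the interface.
    fn implements(&self, interface: &str) -> bool {
        self.interfaces.contains(interface)
    }
}
```

An object created with AccessibleObject::new(&["Accessible", "Text"]) would answer true for implements("Text") and false for implements("Table"), so the caller knows up front whether calling GetText makes sense.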

Buses

A bus’ closest equivalent in standard computer science terms would be an IP address. A bus address looks like “:1.39”; think of this like a raw IP address. Some addresses have names associated with them like “org.a11y.Bus”; think of this like a DNS A record pointing at an IP (bus) address. So a bus is just a place to send IPC requests, just like you’d send HTTP requests to a web server at a specific IP/port combination.
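The DNS analogy can be sketched as a small lookup table. This is only a toy model under my own assumptions: the NameRegistry type is invented for the example, and the pairing of “org.a11y.Bus” with “:1.39” is illustrative rather than something you should expect on your machine.

```rust
use std::collections::HashMap;

/// Toy model of name resolution: human-readable names point at raw addresses.
struct NameRegistry {
    names: HashMap<String, String>,
}

impl NameRegistry {
    fn new() -> Self {
        Self { names: HashMap::new() }
    }

    /// Associate a readable name with a raw address (like adding an A record).
    fn register(&mut self, name: &str, address: &str) {
        self.names.insert(name.to_string(), address.to_string());
    }

    /// Raw addresses (starting with ':') resolve to themselves;
    /// readable names are looked up in the table.
    fn resolve<'a>(&'a self, name: &'a str) -> Option<&'a str> {
        if name.starts_with(':') {
            Some(name)
        } else {
            self.names.get(name).map(|address| address.as_str())
        }
    }
}
```

After register("org.a11y.Bus", ":1.39"), resolving either the name or the raw address lands on the same place, and an unknown name resolves to nothing.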

How does this relate to a screenreader?

Accessibility Events & Information

Let’s assume for a moment that you cannot see anything. You are blind. If you try to read an article you obviously cannot see what is on your screen, so you need something to read it to you. This technology that reads your screen to you is, uncreatively, called a screenreader, sometimes abbreviated “SR”. Well, how does a screenreader know what is on the screen? How does it know what a button is? And a link? How does it know if content has changed or if an alert has been sent?

The first three questions are about accessibility information (i.e., this button contains a certain string of text); the last one is about accessibility events (an aria-live region has been updated, or an alert box has been displayed).

DBus can send these events and information to your process, if you ask for them. This is what you want if you are to create anything like a screenreader.

AT-SPI & Rust

The specification that is used to send this information to our DBus connection is called AT-SPI: Assistive Technology Service Provider Interface. To clarify: DBus is the general IPC mechanism for processes in Linux; AT-SPI is a standard for how to send accessibility information/events over the DBus protocol.

AT-SPI is a set of XML files that specify how to send data across DBus for accessibility events. I’m going to be honest: at first this system seems very convoluted and unnecessarily complex. Over time though, this system has grown on me as I start to see its “complexities” as a sort of after-effect of the core principle of simplicity used within DBus and the specifications which utilize it.

I have explained previously that DBus has objects and methods just like a native object in Python, C++ or JavaScript. So let’s say we want to implement the most basic thing a screenreader can do: read text. Let’s suppose we already have an item we want to get the text of. Now to get the text of it, we call a method on the interface and pass the path. This is abstracted away for us, generally speaking, when using any kind of language-specific DBus binding, but it’s better to be explicit in this case.

No problem! We call item.get_text() and that’s it, right? No. This is where, again, this “complexity” comes in. Again, it starts out this way, but it will grow on anyone who enjoys the idea of the UNIX principles with time and understanding.

So what happens if we do obj.get_text()? Let’s try it on the first list item on my website’s homepage:

Here is the excerpt as it is written on the day of writing this article:

I have three goals in my software development career:

  1. Strong adherence to the UNIX principles of software design.
  2. Security, privacy and anonymity of the internet.
  3. Accessibility of technology to the visually impaired.

What would you expect to receive if you ran get_text() on the first list item there? If you, like me, were a little brainlette, you probably guessed “1. Strong adherence to the UNIX Principles of software design.” Let’s find out if this is correct (note I am only using code snippets to avoid complexity):

let text = obj.get_text().await.unwrap();
println!("TEXT: \"{}\"", text);

$ cargo run
TEXT: "1. Strong adherence to the ￼ of software design."

If you read that carefully, you’ll see there are what look like three spaces where the UNIX principles link should go. This is extremely deceptive for two reasons:

  1. One of those is NOT a space. It’s an object replacement character, aka Unicode code point U+FFFC.
  2. It looks like it has just dropped a piece of text without telling us! And without a way to get it back! Gasp! Oh the horror!

This is what I thought too. But allow me to defend this for a minute.

What if you had something complex like a table, a block quote, an image or even something like a MathML equation inside the block of text (in our case, inside a list item, but this applies to any piece of text inside another)? If you had a table, would you want to read it out? With MathML, you might want to say everything upfront, but MathML would also need some amount of processing before it is readable (or speakable) as text. And even with a humble link, there is a reason for this object replacement character:

If you can see perfectly fine and browse the web like anyone else, with your eyes, you can tell visited links from unvisited ones by their color. A darker color generally indicates a visited link, whereas a lighter color generally indicates an unvisited link. When a screenreader gets info about a piece of text, it would need to relay that information to its user, like “UNIX principles…link” or “UNIX principles…visited link”. So if I get the text of some item which contains some sub items, should it include all sub items? What about just links? Should it tell you if the link is visited or not? Should you make that an option to the GetText call?

All these questions above would introduce additional complexity to the GetText call. This has given me pause in my youthful “the system is broken” angst that generally plagues my thinking; instead I see this is a very sober-minded and UNIX-y design principle that I think makes much more sense than the alternative. Here are some major advantages of this method:

  1. It allows optional processing of sub-elements; maybe you don’t care what is underneath the element: this saves processing power and complexity.
  2. It allows custom processing of sub-elements; you do not have to rely on AT-SPI to tell you what information you want. Perhaps you only need the role of the sub element, not the entire text of it: again, this saves CPU cycles and code complexity.
  3. Allows arbitrary data to be inside any other structural element.

In essence, it puts the developer in greater control of what the screenreader knows about the page!
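Here is a minimal sketch of what "optional processing" could look like in practice: scan the text for the object replacement character and only bother with the children if it shows up. Both helper names are my own, and counting positions in characters (rather than bytes) is an assumption about how offsets get compared later.

```rust
/// The object replacement character, U+FFFC.
const OBJECT_REPLACEMENT: char = '\u{FFFC}';

/// Character positions (not byte offsets) of every placeholder in `text`.
fn placeholder_positions(text: &str) -> Vec<usize> {
    text.chars()
        .enumerate()
        .filter(|(_, ch)| *ch == OBJECT_REPLACEMENT)
        .map(|(index, _)| index)
        .collect()
}

/// Cheap check: if this is false there is nothing to look up in the children.
fn has_embedded_children(text: &str) -> bool {
    text.contains(OBJECT_REPLACEMENT)
}
```

For the list item text from earlier (without the "1. " prefix), placeholder_positions reports a single placeholder at character 24, right where the link should be.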

My next question is: “If AT-SPI uses the object replacement character so it can replace the children, then what happens if the object replacement character is actually in the text itself?” Well, with some processing you can actually find out where each child goes, or if the object replacement character is actually written in the text itself. How so?

First off, let’s get a list of children. We can do this with obj.get_children().

// the Rust way of awaiting and not caring about an error case is: .await.unwrap()
println!("CHILDREN: {:?}", obj.get_children().await.unwrap());

$ cargo run
CHILDREN: [(":1.7", Path("/org/a11y/atspi/accessible/193\u{0}")), (":1.7", Path("/org/a11y/atspi/accessible/194\u{0}"))]

You’ll notice that the children are merely a list of tuples; each tuple only contains, at its core, two strings:

  • Sender: A string describing which application has sent the information.
  • Path: A string describing which element is being sent.
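A small sketch of holding on to one of these pairs is below. The ObjectRef type and object_ref function are invented for this example. Stripping the trailing "\u{0}" rests on my assumption that it is a leftover C string terminator and not a meaningful part of the path.

```rust
/// Hypothetical holder for one (sender, path) pair from the children list.
#[derive(Debug, PartialEq)]
struct ObjectRef {
    /// Bus address of the application that owns the object.
    sender: String,
    /// Object path within that application.
    path: String,
}

/// Build an ObjectRef, dropping any trailing NUL characters from the path.
/// Assumption: the trailing "\u{0}" is a stray terminator and can be removed.
fn object_ref(sender: &str, raw_path: &str) -> ObjectRef {
    ObjectRef {
        sender: sender.to_string(),
        path: raw_path.trim_end_matches('\0').to_string(),
    }
}
```

With the first child from the output above, this yields the sender ":1.7" and the path "/org/a11y/atspi/accessible/193".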

The sender, you will notice, looks suspiciously like a bus address. This is actually what it is. Each process has a bus address, and it is letting you know where it’s coming from. The path is a path to a new object, which we can ask about through DBus if we want more information. Like so (Rust is weird with all its unwrap()s, but stick with me here):

// keep the list of children alive while we borrow its first entry
let children = obj.get_children().await.unwrap();
let child1_base = children.get(0).unwrap();
let child1 = Proxy::new(
  Arc::clone(connection), // some previously initiated connection
  child1_base.sender,
  child1_base.path
);
println!("CHILD1: {}", child1.get_text().await.unwrap());

$ cargo run
CHILD1: UNIX principles

This code looks a little terse, but I assure you it makes sense:

  • A proxy object is a way to represent a DBus object as a native language object (in this case, Rust).
  • The connection variable is some previously defined variable that you would need to start a DBus connection anyway.
  • Arc::clone(x) makes another handle to an atomically reference-counted value so it may be used additional times. Don’t worry about the details of this; it has something to do with how Rust as a language handles passing thread-safe variables. A bit out of scope for what we’re really talking about here.

Okay, now back to what I was saying about being able to grab information about children to find out if we need to replace the object replacement characters or not.

// .await.unwrap() is a Rust-ism, ignore it for now
let c1_pos = child1.start_index().await.unwrap();
println!("Position of child #1: {}", c1_pos);
let text = obj.get_text().await.unwrap();
// assume we have already created a function for get_first_of
let full_text = if c1_pos == text.get_first_of("\u{FFFC}") {
  // the placeholder sits where the child starts, so swap in the child's text
  text.replace("\u{FFFC}", &child1.get_text().await.unwrap())
} else {
  // ignore the .clone(); again, a Rust-ism
  text.clone()
};
println!("FULL TEXT: {}", full_text);

$ cargo run
FULL TEXT: Strong adherence to the UNIX principles of software design.

Here’s what this code does: there is an interface (we talked about interfaces earlier) called Hyperlink; the Hyperlink interface can actually tell us the cursor position of the child element within the parent. Some objects we get over DBus will not support this, but the vast majority will. I dislike the fact that it is called Hyperlink. Even though I can see that links are the primary use case, I think it’s reasonable to say that StartIndex and EndIndex are not exactly unique to hyperlinks (<a> tags); they apply to any nestable element with a different semantic meaning (in HTML). Minor criticism aside, there is an opportunity here to match against the parent and find out if and where the child belongs. You can see how this is done in a very primitive way above; here is how it would work in more complex cases:

If we get the position of every occurrence of the object replacement character from the parent, and check each child to see if its StartIndex matches the position of an object replacement character, then any time it matches, that is where the child belongs. Then we replace the object replacement character in-place with the text of that element, or sometimes just the role of an element; for example, something may be spoken like this (* indicates an audio indicator notifying the user that the contained text is screenreader information and not page text):

“…Einstein’s theory of relativity, *unvisited link*, shows us that there is more to time than just “seconds”: *table* in the above table, we can see how time dilation may be caused by high speeds.”

Obviously, this is not a great example; why would anybody put a table within a paragraph? I’m not sure, but it illustrates the point I’m making: that the screen reader will have a very controlled ability to decide what is said through these AT-SPI methods.
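The matching procedure described above can be sketched without any DBus at all, assuming the children’s StartIndex values and texts have already been fetched. The Child struct and expand_children function are made up for this illustration, and positions are counted in characters, which is my assumption about how StartIndex is reported.

```rust
/// The object replacement character, U+FFFC.
const OBJECT_REPLACEMENT: char = '\u{FFFC}';

/// Hypothetical, already-fetched view of a child object.
struct Child {
    /// Value of StartIndex from the Hyperlink interface (assumed: in characters).
    start_index: usize,
    /// Text of the child, e.g. the result of its own GetText call.
    text: String,
}

/// Replace each placeholder whose position matches a child's StartIndex.
/// A placeholder with no matching child is kept, since it was presumably
/// written in the text itself.
fn expand_children(parent_text: &str, children: &[Child]) -> String {
    let mut out = String::new();
    for (pos, ch) in parent_text.chars().enumerate() {
        if ch == OBJECT_REPLACEMENT {
            if let Some(child) = children.iter().find(|c| c.start_index == pos) {
                out.push_str(&child.text);
                continue;
            }
        }
        out.push(ch);
    }
    out
}
```

Swapping child.text for the child’s role (or appending the role after the text) is the same loop with a different string pushed, which is where the screenreader gets to decide what is actually spoken.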

There is another use for grabbing the cursor index of children that I would like to point out. I think this is a reasonable case for seeing it pulled into its own interface: structural navigation.

Structural Navigation

People who use screenreaders have some special abilities I actually wish browsers implemented by default: the ability to jump through the document by specific tags and attributes. It’s not sophisticated: a depth-first search forward or backward, looking for the closest heading, link, button, table, etc. This is so ingrained in screenreader users that when a page finishes loading, it is customary for the screenreader to announce (speak out loud to the user) the number of tables, headings, visited links and unvisited links that are on the page in front of them.

If I want to look for the next heading in an HTML document, however, I cannot start by just checking all children, because it is fairly common to have various tags embedded in your current tag. I need to know which children are after and which are before my caret.
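A minimal sketch of the forward search follows, assuming the document has already been read into a tree. The Node type, the role strings, and the idea of using a node’s position in document order as a stand-in for the caret are all simplifications of my own; a real screenreader would compare actual caret offsets.

```rust
/// Hypothetical, simplified accessibility tree node.
struct Node {
    role: &'static str,
    name: &'static str,
    children: Vec<Node>,
}

/// Collect nodes in document order (depth-first, parents before children).
fn flatten<'a>(node: &'a Node, out: &mut Vec<&'a Node>) {
    out.push(node);
    for child in &node.children {
        flatten(child, out);
    }
}

/// First node with `role` that comes strictly after position `current`
/// in document order. `current` stands in for where the caret is.
fn next_with_role<'a>(root: &'a Node, current: usize, role: &str) -> Option<&'a Node> {
    let mut order = Vec::new();
    flatten(root, &mut order);
    order.into_iter().skip(current + 1).find(|node| node.role == role)
}
```

Searching backward is the same idea over the nodes before `current`, taken in reverse. Flattening the whole tree on every key press is wasteful, so treat this as an illustration of the ordering rather than something to ship.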

The Caret 🥕🐇

The caret is the same as your cursor in an input box. Type right here and watch as your cursor (aka caret) moves with your typing; move it left and right with your arrow keys:

The caret, or cursor, is something that most people are only used to seeing in the context of editable text, but screenreader users enable a special mode in their browser (usually activated with F7) called “caret browsing”. Caret browsing allows you to navigate through a webpage using a cursor even when the text is not editable. This is awesome! I cannot overstate how useful this is to me, just for keyboard-driven simplicity’s sake and for eliminating the mouse as much as possible.

Try it now! You can always turn it off with F7, just the same as enabling it.

This caret can be moved around just like in any run-of-the-mill WYSIWYG (What You See Is What You Get) editor like Word or LibreOffice Writer. This is how a screenreader user navigates the web: with a cursor. They use it to read one character at a time (with the left and right arrows), a word at a time (Ctrl+left or right arrow) or entire lines of text (using the up and down arrows). This becomes, in essence, the active focus of the user: it is always on the cursor (a.k.a. caret).

Keyboard Input

Keyboard input with accessible applications follows a very complex path, which can be a serious buzzkill for attempting high-performance screenreaders. Let me show you what the issues are; the accessible technology (screenreader, in this case) will be written as “AT” in this diagram:

Wayland: Kernel -> libinput -> DE/WM -> accessible application -> AT
X11: Kernel -> Xorg -> DE/WM -> accessible application -> AT

What happens in the case of an inaccessible application? It doesn’t work, at all. A key press which is sent to an inaccessible application will not be sent to an AT application (i.e., a screenreader). This is a serious problem that I don’t think should exist at all. Perhaps there is some mechanism I am missing for intercepting these keys before they pass all the way to an application, where we can only hope the GUI is accessible; supposing there is not, we need a system that intercepts the keys before they are sent all the way down the stack and sends them to the screenreader instead. This is needed for two reasons: 1) performance: it doesn’t make sense to send keys that far down the stack just to hope the application implements accessibility correctly; we should be able to intercept key presses before they get to the application. 2) control: it is best to be able to control things regardless of whether an application is responding or not. Under a system where an application must be accessible to send us keystrokes, a non-responsive application will not send us keystrokes either. To have full control and maximum performance, we need to intercept the keys at their source.

rdev

rdev is a Rust crate which can (with the “unstable_grab” feature enabled) grab keys from the Linux kernel before they are passed any further down the stack. It allows us to consume events if we do not want to also do the default action; for example, in “Browse Mode” a screenreader user will use the letter h to jump between headings within a page; normally this would type the letter h, so to stop this from happening we can consume (or “eat”) the event so that it isn’t sent any further at all.
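To show the "eat the event" idea without pulling in the crate, here is a small model of the decision. The Key, Mode and Action types are invented for this sketch (the "Focus" mode in particular is my own placeholder for "not Browse Mode"). As I understand it, rdev’s grab callback works on a similar principle, returning an optional event where nothing means "consumed", but check the crate’s documentation before relying on that.

```rust
/// Hypothetical key type; a real one would come from the input library.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Key {
    Char(char),
    Tab,
}

/// Hypothetical screenreader modes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Browse,
    Focus,
}

/// What the screenreader should do in response to a consumed key.
#[derive(Debug, PartialEq)]
enum Action {
    NextHeading,
}

/// Decide what happens to a key press.
/// First element: the key to pass further down the stack (None = eaten).
/// Second element: an action for the screenreader itself, if any.
fn handle_key(mode: Mode, key: Key) -> (Option<Key>, Option<Action>) {
    match (mode, key) {
        // In Browse Mode, 'h' jumps to the next heading and is not typed.
        (Mode::Browse, Key::Char('h')) => (None, Some(Action::NextHeading)),
        // Everything else goes through untouched.
        _ => (Some(key), None),
    }
}
```

In Browse Mode the letter h never reaches the application, while in any other mode the very same key press is passed along as ordinary typing.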

With all I’ve covered here so far, let’s see if I can wrap it up.

Pulling It All Together

All this information (which I gained mostly from asking TheFakeVIP questions) has been pooled together in a new screenreader project named Odilia. Most of the core work has been done by others, but I occasionally contribute to it as well, and I want blind individuals to have access to a blazingly fast screen-reading experience on Linux.

Happy a11y hacking!