Voice-Activated Anxiety
by Mollie Pyne
A few months ago, I awoke in the middle of the night to discover that my right arm was numb. This happened on four consecutive nights, by which point I was panicking, pleading with my body—not tonight. It didn’t cooperate. Why was this happening? On the fifth morning, I confided in my dad. He turned his head: ‘Alexa... search reasons for sleep paralysis’.
It was Freud and Breuer who described physical symptoms without physical cause (including paralysis of the right arm) as manifestations of neurosis, specifically hysteria—the process of ‘unconscious symptom formation’. That week I was staying with my dad, who had recently bought an Amazon Echo. This was my first interaction with a voice-activated assistant (I don’t use Siri) and I felt deeply uncomfortable, even leaving the room if any member of my family started speaking to it. I didn’t know why I felt this way; I just knew that whenever I heard ‘Alexa...’, I felt anxious.
Anxiety is our nervous system responding to stress or fear. Was my sleep paralysis a physical manifestation of this anxiety? Was I afraid of Alexa? Or afraid of what Alexa might make redundant: the use of our hands? Or was it an unfortunate coincidence? I have learned that to cope with anxiety, I usually need to embrace it, and so, eventually, I also turned to the device: ‘Alexa…’.
**********
In 1917, Franz Kafka published his short story, A Report to an Academy, in which an ape learns to behave like a human. Despite adopting human mannerisms and behaviours, it is only when he acquires speech that he perceives himself as a member of ‘the community of human beings’. Here, Kafka presents speech, the ability to learn and articulate language, as the measure of human identity. This instinct for language, which linguist Noam Chomsky theorised as ‘universal grammar’ in the 1960s, is what makes us human. So, why give a wholly human quality to machines?
Machines capable of speech recognition were invented in the 1950s—although we have been interacting with passive speaking machines since the 1700s, such as Professor Christian Kratzenstein’s acoustic resonators (1773) and Wolfgang von Kempelen’s acoustic-mechanical machine (1791). The speech-recognition era known as ‘baby talk’, spanning the 1950s and 1960s, began with the birth of Audrey, the Automatic Digit Recogniser and the first speech-recognition system. The machine, designed by Bell Laboratories, understood only numbers and had to be tuned to each individual speaker. In 1962, IBM demonstrated its Shoebox machine, which could comprehend 16 words in addition to numbers. In their infancy, speech-recognition machines were impractical: they were large, expensive and required a lot of energy—these babies weren’t a threat.
Thanks to financial investment, the 1970s marked the ‘take-off’ of speech recognition. The U.S. Department of Defense funded DARPA’s Speech Understanding Research (SUR) programme and Carnegie Mellon’s Harpy, which could recognise 1,011 words with 95% accuracy. The achievement of accuracy in speech recognition was a scientific milestone, but for those outside the military, the question remained—what would make it marketable? Consumable? How could this advanced speech-recognition software be useful in daily life?
The development of the hidden Markov model (HMM) framework in the 1980s—a statistical method that estimates the probability that an unknown sound pattern is a particular word, allowing for a much wider vocabulary—triggered a huge shift in speech-recognition technology and its commercial viability. While adults purchased Tangora, the voice-activated transcription typewriter with a 20,000-word vocabulary, children were given Julie, Worlds of Wonder’s first interactive talking doll. Both seemed impressive, but were a pain: users had to pause between each word for their speech to be recognisable to the machine. Technology isn’t supposed to slow our lives down. Still, these early efforts had provided a teaser and the public wanted more.
In the 1990s, we got it. Following the development of continuous speech-recognition software, technology was assuming human attributes. Dragon NaturallySpeaking, widely considered the first speech-recognition product for consumers, could transcribe up to 100 words a minute. Over the next two decades, speech recognition progressed into voice recognition through speaker-independent technology, and the software was integrated into phones and computers. There was Yahoo’s oneSearch, Microsoft’s TellMe app and Google Voice Search for iPhone; then, in 2010, Google released its personalised-recognition Voice Search for Android, and brought it to Chrome in 2011. Shortly after, we got Siri.
Today, we have voice recognition, virtual reality, augmented reality, AI, and companies such as Chirp working on gesture recognition. Things have progressed pretty quickly. It is a cycle: voice recognition is a product of humanity, in the sense that we made it. And humanity is a product of what it creates—politically, economically, culturally, technologically. The growth of voice-user interfaces owes much to our conditioning as consumers; it is the cycle of planned obsolescence and hyperbolic discounting working hand in hand. Our strongest desire is for more. We want more, so we can do less.
‘Do’ is physical; it is labour of the body. ‘To do’ is active; it is ‘to live’. ‘Man-made’ is commonly attributed to things that make our lives easier—i.e. technology—and yet, in another way, these inventions also complicate our existence, as most of us don’t fully understand how they function. A paradox of desire and anxiety; of subjugation and autonomy; of excitement and frustration. Human beings want to be served and entertained, and voice-recognition technology such as Siri—especially the iOS 11 version—and Alexa can do both. The more human-like ‘they’ become, the less we have to do, or think.
***********
In a piece titled Pitching the Voice: Programming Women and Machines, Kat Sinclair writes about the woman-machine. She identifies the parallels between our existential fears of robots ‘toppling us from our organic thrones’ and the patriarchy’s oppression of women. The tech world wants to make robots desirable, fuckable, loveable. They are one of two things: of service or waiting to be. Gadgets are essentially big boys’ toys; and ‘robot’ is a more scientific—so acceptable?—word for doll. But these dolls aren’t passive; they talk back.
For Amazon’s Alexa, Microsoft’s Cortana and Apple’s Siri, a female voice is the default setting. All are nondescript, with a soft tone and middle pitch, which is irritating—not the voice itself, but what it represents. The female voice—our manner, tone or even the amount we speak—has always been openly critiqued. A group of women don’t plainly talk; instead they gossip, or whine, or shriek. Alexa sounds like me—a human, a woman—yet is entirely unfamiliar, uncanny. Alexa is neutral enough to be whatever we want it to be: informant, entertainer, storyteller, friend, girlfriend, bitch, witch, even mother. (I have heard it being called all of these.)
For example, an advert on Amazon Echo’s shopping page depicts a white, middle-class nuclear family using Alexa. While cooking dinner, Dad continuously requests information from Alexa, commenting, ‘not bad’. His stamp of approval. At one point, after asking his daughter if she liked the food, he comments, ‘Mum will be pleased’, and then immediately addresses Alexa. Upon the mention of Mum we wait for her to appear, but she never does. The woman’s absence is filled by Alexa.
Alexa is always ready to be assigned a role, dependent upon our desires. But these desires also impact how we interact with it. Considering Sinclair’s point, will these default female voices actually encourage sexism? Will ‘robot women’—Siri, Alexa—reinforce gender distinctions and perpetuate expectations of ‘robotic women’? The answer to that question is entirely dependent on us.
**********
In Her (2013), the operating system Samantha’s (Scarlett Johansson’s) cognitive intelligence allows a more human-like—arguably human 2.0—verbal interaction with Theodore (Joaquin Phoenix). Today’s intelligent assistants aren’t quite there yet: their speech is concise and uncomplicated, with discourse following a question-and-answer format, whereas human language is complex. Speech is a skill and conversation an art. We can be obscure: using overcomplicated language, pausing mid-sentence as we search for the right word, because we know words have meaning and so we weigh them in context. Or we can speak with little to no thought at all—emotionally unfiltered and ugly. This is where Alexa, Siri, Cortana and the rest remain ‘other’.
If you ask someone to impersonate a robot, they are more likely to sound like WALL-E than Alexa. The problem with voice-activated assistants is that they are human-like: neither fully robot nor human, but ‘other’, leaving us unsure of how to interact appropriately. In baby talk, we dumb down our speech and make strange noises because we are at a loss as to how to communicate orally. Tech talk is similar. In podcasts and YouTube videos demonstrating various voice-activated devices, in Amazon’s Alexa video, as well as in my own and others’ interactions with voice-activated tech, our speech shares a similar pattern: small, unnatural gaps form between words; we adopt an authoritative tone that starts on a high note and often flatlines at the end. We ask question after question, filling silence. It is the rhythm of small talk—performative, but lacking emotion.
In an interview with USA Today, Rohit Prasad, Amazon’s head scientist for Alexa, revealed new developer tools that allow Alexa to ‘whisper, show emotion, pause naturally’ and substitute or bleep out words. He also unveiled plans for Alexa to take part in a 20-minute conversation in the near future by selecting a topic of popular interest—technology, fashion, politics, sport, current affairs—and then actually discussing it. While Prasad admits it will be ‘a long journey’, this is the destination nonetheless. Do my current 20-minute conversations lack historical knowledge, expansive references or indisputable facts? Maybe. And I doubt I’ll be waiting long to find out. In a world that doesn’t stop talking, I wonder whether we will still make time to be silent. Will we talk too much, endlessly switching between humans and machines? Will we listen too little?
If a child grows up with human-machine conversation as the norm, their speech and language development will surely change. Which will they learn more from—human or machine? Today’s infants will become the post-‘post-alphabetical’ generation. (A grouping theorist Franco ‘Bifo’ Berardi described as ‘the first generation to learn more words from the machine than from the mother’. Then again, isn’t the machine ‘Mother’, too?) However, this could be beneficial. Voice-activated assistants are near perfect in their speech: they use words and syntax correctly, they don’t repeatedly say ‘like’ or ‘you know’, and they cannot lie. Intelligent, voice-activated assistants could also help people cope with memory loss, loneliness or disability. And it is worth considering the use of the devices in talking therapies—that is, assuming they remain objective.
**********
One thing remains certain—speech-recognition software will continue to become smarter. As voice-activated assistants become more adept at interacting with us on a human level, speech, as we understand it, will no longer be solely ours. We are extending our bodies with technology and, at the same time, giving parts of ourselves to technology. Through the absence of one thing, another can become powerfully present. In Her, for example, Theodore’s speech-based relationship with Samantha gives human touch more emotional weight by comparison; at points, the characters seem to linger a second longer in an embrace, or extend a hand a second time.
Elizabeth A. Behnke, creator of The Study Project in Phenomenology of the Body, said: ‘There is deep wisdom within our very flesh, if we can only come to our senses and feel it.’ As our voices begin to replace our hands, and speech becomes our go-to tool, maybe what will make us human—flesh and blood—will be exactly that: the wetness of sweat, the movement of breath, the touch of a warm hand. And hopefully, that corporeal language will remain one of connection, as well as communication.
***
Photograph from Her (2013) / Annapurna Pictures