POTS and the VS1053 Audio board for speech synthesis - Part 1

Knowing I like old pieces of engineering, my family gifted me a fully functioning old rotary phone, identical to the one we had at our home growing up (for many years). It is a Siemens-Halske model built in Argentina probably in the early 1970’s.

I disassembled it many times when I was young. It was super easy as everything was secured with nuts and bolts. I remember marveling at it’s clean clock-like mechanics and all the clever switching and electric setup that allowed all these functions to work on just two twisted copper wires:

Dial
Ring
Hang-up
Receive audio
Send audio

I thought it would make a great project to marry that last century technology with the new era of AI, so I started a new project!

Basically I want to use the same setup I have for e-paper displays but instead of printing a quote, it would be read and play it on the phone’s speaker.

Here’s the high level requirements:

Whenever a quote is available for the phone, it will ring 3 times.
If you pick up the receiver, the quote will be played on the speaker.
If you pick up the receiver and the quote was already read, it will just play a tone; waiting for dialing to happen. a. If you dial “1”, it will play the latest quote sent to the device. b. If you dial “2”, it will retrieve a new quote from the back-end.

This will be my MVP, with plenty of stuff to work around and solve.

The Phone

To begin with, I need a way of activating the electromagnetic ringer. This will require higher voltage power supply. 3.3V will definitely not do it. Most likely a motor driver of some kind, able to reverse current.

I will also need a way to decode the dialer signals. Based on my initial research I need to detect 2 different source of signals: 1. The “start/stop of dialing” 2. The pulses associated with each number.

This is relatively straightforward as these pulses are simply “digital” inputs for a micro controller.

Text-To-Speech

The other challenge will be to generate “audio” out of text quotes. I’ve done experiments with both OpenAI and Deepgram. Both are super easy to use, and straight-forward APIs.

It is actually quite incredible how easy it is to generate high quality, synthetic voices, in any language, with tone inflections, accents. The systems even simulate the breathing pauses to achieve realism.

My OpenAI implementation looks like this:

exports.tts = (text, done) => {
  const openai = createOpenAI();
  openai.audio.speech.create({
    model: "tts-1",
    voice: process.env.OPENAI_TTS_VOICE || "shimmer",
    input: text
  }).then((data) => {
    data.arrayBuffer().then((buffer) => {
      done(null, buffer);
    });
  }).catch((error) => { 
    done(error);
  });
};

The createOpenAI() simply creates the object passing the API credentials.

And Deepgram is equally simple (no module required in this case, just using straight HTTP requests):

const url = 'https://api.deepgram.com/v1/speak?model=aura-athena-en';

exports.tts = (text, done) => {
    
    const options = {
        method: 'POST',
        headers: {
          accept: 'text/plain', 
              'Content-Type': 'application/json',
              Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`
          },
        body: JSON.stringify({text: text})
      };
    
      fetch(url, options)
        .then(res => res.arrayBuffer())
        .then(buffer => {
            done(null, buffer);
        })
        .catch(error => {
            done(error);
        });
  };

I am not quite an async guy… I actually like callbacks, so my code uses mostly that and promises. Also, I implemented these two as a simple module so my code can call either easily. Changing voices, gender, accents is super easy, check their documentation. It is fun to experiment with various.

Now for playing back the generated MP3 file, I am using the equally outstanding VS1053 Audio board. It is a simple stack-able board on a Feather M0, and includes and SD card with plenty of storage for audio files. The audio board is very versatile and can play back a number of different formats: MP3, OGG WAV, etc.

The Feather board also comes with an SD card slot, so playing back sounds is as simple as:

  ...
  Adafruit_VS1053_FilePlayer audioPlayer(VS1053_RESET, VS1053_CS, VS1053_DCS, VS1053_DREQ, CARDCS);

  if(!audioPlayer.begin()){
    Serial.println("Error initializing audio board");
      return;
  }
     
  audioPlayer.setVolume(0, 0);

  audioPlayer.playFullFile("music.mp3"); 
  ...

So, all I need to do is:

Get a quote MP3 to the SD card
Ring
When someone picks up, play the MP3 file
Make sure I have all error conditions considered (in my experience production systems are 99% error handling and 1% functionality)

My go-to approach for these kinds of systems is the well proven FSM.

I’ll start with this and report progress soon. Stay tuned for Part II.