Build a Voice Notes App — Part 1, Text to Speech Pipeline

David Guan
3 min read · Aug 25, 2019

The idea of building an app that stores text notes and transforms them into audio for later listening came to me recently.
There might already be apps that handle this well, but I decided to DIY as a fun challenge :)

DEMO

As the image above shows, I can now listen to the text notes from the GitHub repo (give the voices a try if you are interested).

At the time of writing, whenever notes are published to the GitHub repo, the build script is triggered, which generates audio files and uploads them to storage. The audio can then be pulled by mobile devices via a web page.
(Update: the audio is now generated via a Web API, see Part 2)

The flow diagram for the MVP

Transforming text to speech

I chose Amazon Polly for generating the audio; it not only produces natural-sounding speech but also supports multiple languages.
(It even supports Speech Synthesis Markup Language: imagine HTML, but for speech.)

If you have AWS CLI setup, you can give this service a quick try via your terminal:

aws polly synthesize-speech \
--output-format mp3 \
--voice-id Matthew \
--text "Hello World" \
sound.mp3

If things went well, a response like the one below appears and sound.mp3 is ready for you to listen to :)

{
    "ContentType": "audio/mpeg",
    "RequestCharacters": 11
}

The synthesize-speech API has a text length limit (3,000 characters at the time of writing); for longer text, the start-speech-synthesis-task API has to be used.
The two interfaces are almost the same, except that an S3 bucket name has to be provided for the latter, as the generated audio is uploaded to that S3 bucket.
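For example, the same "Hello World" request with the asynchronous API looks like this (the bucket name here is a placeholder; the task writes the MP3 into that bucket instead of a local file):

```shell
aws polly start-speech-synthesis-task \
--output-format mp3 \
--output-s3-bucket-name my-voice-notes \
--voice-id Matthew \
--text "Hello World"
```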

AWS provides SDKs for several languages; the code below is an example of interacting with the Polly service via the JavaScript SDK:

The pipeline

I also used AWS services for the build pipeline, specifically CodePipeline and CodeBuild.
They could be replaced by other services, of course; I picked them mainly to explore AWS and to avoid storing secret tokens for accessing Polly and S3 (e.g., by assigning the CodeBuild project a role that can interact with those services).

As the image below shows, the build pipeline currently has just two steps. Step 1, “Source”, is the trigger, which is itself fired by push events on the GitHub master branch.

The CodePipeline configuration

Step 2, “Build”, is handled by CodeBuild, which is configured via the buildspec.yml file in the source code.

version: 0.2
phases:
  install:
    runtime-versions:
      nodejs: 10
  build:
    commands:
      - npm install
      - node build.js

version indicates the configuration schema’s version; the rest is saying “hey, please run these two commands in a Node.js 10 environment”.

The code below is the build script, which identifies the notes that haven’t been processed yet and starts the transformation via the AWS SDK.

This is probably not an ideal use of a build pipeline, but for now it gets the job done.
Later (especially once the notes are no longer managed in the Git repo), I may change this part to only upload notes to S3, and let S3 notify other services to run the transformation, for better error handling and so on.

The web page

As the code below shows, the web page is an HTML file of around 70 lines; it pulls the list of audio files and leverages the native audio element to play them.
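A stripped-down sketch of such a page; the notes.json manifest name and its shape are assumptions about what the build step could publish:

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8" />
  <title>Voice Notes</title>
</head>
<body>
  <ul id="notes"></ul>
  <script>
    // Fetch the list of generated audio files, then render one
    // native <audio> element per note.
    fetch('notes.json') // hypothetical manifest produced by the build
      .then((res) => res.json())
      .then((notes) => {
        const list = document.getElementById('notes');
        for (const { title, url } of notes) {
          const item = document.createElement('li');
          item.textContent = title;
          const audio = document.createElement('audio');
          audio.controls = true;
          audio.src = url;
          item.appendChild(audio);
          list.appendChild(item);
        }
      });
  </script>
</body>
</html>
```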

What’s next

The user system (e.g., authentication) will probably be next, so that I can put some private notes there 😃

Give the source code repo a look if you’re interested.
