Kalan's Blog

Software Engineer / Taiwanese / Life in Fukuoka

Amazon Polly is an Amazon Web Services (AWS) service that converts text into speech (text-to-speech). Text-to-speech is not a new concept, and even Google Translate can do it, but Polly aims to produce more natural-sounding voices from the input text, which is a great advantage for language learners. Polly also has a wide range of applications, such as converting subtitles into speech, generating audio for scripts, narrations, and dialogues, and even recording podcasts directly. Readers who want to hear the results can try it on the Amazon Polly page.

Common Language Testing

In terms of usage scenarios, I would like to test Chinese, English, Japanese, and Korean. Here are a few samples of voices generated using Polly:

  • Japanese
  • Chinese (Mandarin)
  • English (American)
  • English (British)
  • Korean

English needs no introduction: it is well supported and offers several voices with American and British accents. The voices sound quite natural, and if you don't listen carefully, you might mistake them for real human voices. The Chinese voice, however, sounds a bit unnatural and doesn't seem to be in either Taiwanese Mandarin or Standard Chinese. The level of support for Japanese exceeded my expectations. The sentences sound very smooth, and when English words are mixed in, Polly even pronounces them with a Japanese accent. For example, in the sentence "この件についてはbug ticket必要でしょうか?" (Do we need a bug ticket for this issue?), Polly reads "bug ticket" the way a Japanese speaker would.

Although it cannot vary its delivery from scene to scene the way a voice actor can, it is already a very useful tool for me.

Polly offers two options: "Neural" and "Standard." The "Neural" option aims to generate voices that sound as natural and close to human as possible. The "Standard" option already sounds quite natural but still retains a slightly mechanical tone. At the time of writing, only certain languages support the "Neural" option. Among Chinese, English, Japanese, and Korean, it is available for English and Japanese.
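In the API, this choice is made with the Engine parameter of synthesizeSpeech, which accepts "standard" or "neural". Here is a minimal sketch of how the parameters could be built. buildSpeechParams is a hypothetical helper name, and the voice (Takumi) is only an assumption carried over from the integration example later in this post. Whether a given voice supports the neural engine should be checked in the Polly documentation.

```javascript
// Hypothetical helper: builds the parameter object for polly.synthesizeSpeech.
// The voice and language are assumptions for illustration only.
function buildSpeechParams(text, { neural = false } = {}) {
  return {
    Text: text,
    TextType: "text",
    VoiceId: "Takumi",
    LanguageCode: "ja-JP",
    OutputFormat: "mp3",
    // "neural" asks for the more natural-sounding engine, "standard" for the classic one.
    Engine: neural ? "neural" : "standard",
  };
}

const standardParams = buildSpeechParams("おはようございます");
const neuralParams = buildSpeechParams("おはようございます", { neural: true });
console.log(standardParams.Engine, neuralParams.Engine);
```

The resulting object can be passed as the first argument of polly.synthesizeSpeech.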

Polly supports SSML (Speech Synthesis Markup Language) [1], which allows for marking specific sentences with pauses or adjusting the tone of the voice based on different scenes, making the generated speech more immersive.
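As a rough illustration, a pause between sentences can be written with the SSML break tag. The sketch below assembles such a document from plain sentences. toSsml and escapeXml are hypothetical helper names, and the 500 ms default pause is an arbitrary choice.

```javascript
// Escape characters that have a special meaning in XML so that
// plain text can be embedded safely in an SSML document.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: joins sentences with a pause of pauseMs milliseconds
// and wraps the result in the <speak> root element that SSML requires.
function toSsml(sentences, pauseMs = 500) {
  const body = sentences.map(escapeXml).join(`<break time="${pauseMs}ms"/>`);
  return `<speak>${body}</speak>`;
}

const ssml = toSsml(["おはようございます", "今日もよろしくお願いします"]);
console.log(ssml);
```

To have Polly interpret the markup, the string would be sent as Text with TextType set to "ssml" instead of "text".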

Pricing

For pricing details, you can refer to the official website. The cost is 4.00 USD per million characters for standard voices and 16.00 USD per million characters for neural voices. Unless a product requires a large volume of text-to-speech conversions, the cost should be quite affordable for general auxiliary usage, even for independent developers.

You pay monthly based on the number of characters processed by Amazon Polly for speech or speech mark requests. Amazon Polly standard voices are billed at 4.00 USD per 1 million characters (after the free tier). Amazon Polly neural voices are billed at 16.00 USD per 1 million characters (after the free tier).
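To get a feel for these rates, the arithmetic can be sketched as a small function. This is a rough estimate only: it uses the two per-million rates quoted above, ignores the free tier, and estimateCostUsd is a hypothetical helper name, not part of any SDK.

```javascript
// Rates in USD per 1 million characters, taken from the figures quoted above.
const RATE_PER_MILLION_USD = { standard: 4.0, neural: 16.0 };

// Hypothetical helper: estimates the bill for a given number of characters.
// The free tier is deliberately ignored, so this is an upper bound.
function estimateCostUsd(characters, engine = "standard") {
  const rate = RATE_PER_MILLION_USD[engine];
  if (rate === undefined) {
    throw new Error(`Unknown engine: ${engine}`);
  }
  return (characters / 1_000_000) * rate;
}

console.log(estimateCostUsd(250_000, "neural")); // 4
console.log(estimateCostUsd(250_000, "standard")); // 1
```

In other words, 250,000 characters cost about 4 USD with a neural voice and about 1 USD with a standard voice.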

Integration (Using Node.js as an Example)

Integrating with Polly is straightforward. You can use the aws-sdk to interact with the service. Here is an example code snippet:

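const fs = require("fs");
const AWS = require("aws-sdk");

// The region is an assumption; use the one that fits your account.
// Credentials are expected to be configured in the environment.
const polly = new AWS.Polly({ region: "ap-northeast-1" });

polly.synthesizeSpeech(
  {
    Text: "おはようございます",
    TextType: "text",
    VoiceId: "Takumi",
    LanguageCode: "ja-JP",
    OutputFormat: "mp3",
  },
  (err, data) => {
    if (err) {
      console.error(err);
      return; // data is not available when the request fails
    }
    // AudioStream holds the synthesized audio as a Buffer.
    fs.writeFileSync("./result.mp3", data.AudioStream);
  }
);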

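This code saves the converted speech to the file result.mp3. For longer texts, the SDK also offers polly.startSpeechSynthesisTask(), which runs the synthesis asynchronously and stores the result in an S3 bucket.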
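The callback-based call above could also be wrapped in a small promise-based helper. This is only a sketch, assuming aws-sdk v2, whose request objects expose a .promise() method. synthesizeToFile is a hypothetical name, and the Polly client is passed in as an argument so that the helper stays independent of how the client is configured.

```javascript
const fs = require("fs");

// Hypothetical helper: `polly` is expected to be an AWS.Polly client (aws-sdk v2),
// `params` the same object that would be passed to synthesizeSpeech,
// and `path` the file to write the audio to.
async function synthesizeToFile(polly, params, path) {
  // .promise() turns the aws-sdk v2 request into a Promise.
  const data = await polly.synthesizeSpeech(params).promise();
  await fs.promises.writeFile(path, data.AudioStream);
  return path;
}
```

With a configured client, it could be called as synthesizeToFile(polly, params, "./result.mp3"), with errors handled through try/catch or .catch() instead of a callback argument.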

Conclusion

Polly is a user-friendly and cost-effective service that can enhance the richness of content in various applications. Personally, I find it useful for language learning, as it allows me to input text and instantly hear near-realistic pronunciations.

For Chinese users, although the voices are acceptable, they are not in a familiar accent, which may lead to some resistance. It would be great to have support for Taiwanese Mandarin or other local accents in the future.

Footnotes

  1. https://docs.aws.amazon.com/polly/latest/dg/ssml.html

Author

愷開 | Kalan

Hi, I'm Kai. I'm Taiwanese and moved to Japan in 2019 for work. I'm currently settled in Fukuoka. In addition to being familiar with frontend development, I also have experience in IoT, app development, backend, and electronics. Recently, I started playing electric guitar! Feel free to contact me via email for consultations, collaborations, or music! I hope to connect with more people through this blog.