Amazon Polly: Remarkably Effective Text-to-Speech

Written by Kalan, Dec 7, 2021

Table of Contents

  1. Common Language Tests
  2. Pricing
  3. Integration (Example with Node.js)
  4. Conclusion

This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.

Polly is an Amazon Web Services (AWS) service that converts text to speech. Text-to-speech technology isn't new (Google Translate can do the same thing), but Polly aims to deliver a more natural-sounding voice, which is a significant advantage for language learners. Its applications are also broad: converting subtitles to audio, voicing scripts, narrations, and dialogues, and even recording podcasts directly with Polly. Readers who want to try it for themselves can visit Amazon Polly.

Common Language Tests

In terms of language support, I would like to test Chinese, English, Japanese, and Korean. Here are a few audio samples I converted using Polly:

  • Japanese
  • Chinese (Mandarin)
  • English (American)
  • English (British)
  • Korean

There's no need to elaborate on English support; it's quite extensive, offering choices between American and British accents, along with various voice options that sound remarkably natural—so much so that it’s easy to mistake it for a real person if you’re not paying close attention. The Chinese voice, on the other hand, sounds a bit unnatural; it doesn't quite match Taiwanese Mandarin but isn't exactly standard Mandarin either. The level of support for Japanese exceeded my expectations; not only do the sentences sound very fluent, but if English is mixed in, Polly will even read it out with a Japanese accent. For example, the phrase: "この件についてはbug ticket必要でしょうか?(Does this issue require a bug ticket?)" is pronounced by Polly as follows:

While it can't replicate the rich vocal variations of a voice actor based on the scene, I still find it to be a very practical tool.

Polly offers two options: one is "Neural Voice," which aims to produce the most natural and human-like sound possible; the other is "Standard," which sounds fairly natural but still has a mechanical quality. Currently, only some languages support "Neural Voice." Among Chinese, English, Japanese, and Korean, both English and Japanese support "Neural."
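In the API, the choice between the two shows up as the Engine request parameter. Here is a minimal sketch, assuming the aws-sdk v2 parameter names used in the integration example later in this post. buildParams is a hypothetical helper, and whether a given voice supports the neural engine should be checked against the official voice list:

```javascript
// Hypothetical helper: builds synthesizeSpeech parameters for a given engine.
function buildParams(text, engine = "standard") {
  if (engine !== "standard" && engine !== "neural") {
    throw new Error(`Unknown engine: ${engine}`);
  }
  return {
    Text: text,
    TextType: "text",
    VoiceId: "Takumi", // assumed to support the requested engine
    LanguageCode: "ja-JP",
    OutputFormat: "mp3",
    Engine: engine,
  };
}

console.log(buildParams("おはようございます", "neural").Engine);
```

The returned object can be passed to polly.synthesizeSpeech() as-is.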

Polly also supports SSML (Speech Synthesis Markup Language) [1], allowing you to add pauses for specific sentences or adjust the tone of the voice based on the scene, enhancing the overall auditory experience.
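As a rough illustration of adding pauses, the sketch below wraps sentences in a `<speak>` element and inserts a `<break>` tag between them. Both tags are part of SSML. The helper names escapeXml and toSsml are hypothetical, and the 500 ms pause is just an example value:

```javascript
// Hypothetical helper: escapes characters that are reserved in XML.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: joins sentences with a fixed pause between them.
function toSsml(sentences, pauseMs = 500) {
  const body = sentences
    .map((sentence) => escapeXml(sentence))
    .join(`<break time="${pauseMs}ms"/>`);
  return `<speak>${body}</speak>`;
}

const ssml = toSsml(["おはようございます", "今日もよろしくお願いします"]);
console.log(ssml);
// <speak>おはようございます<break time="500ms"/>今日もよろしくお願いします</speak>
```

To send it to Polly, pass the string as Text and set TextType to "ssml" instead of "text".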

Pricing

You can refer to the official website for pricing details. It's $4 per million characters for standard voice, and $16 per million characters for neural voice. Unless your product requires extensive text-to-speech functionality, this pricing is very affordable for general auxiliary use, making it accessible for independent developers as well.

Payment is required monthly based on the number of characters processed. Amazon Polly standard voice requests are billed at $4.00 per million characters (after exceeding the free tier). Amazon Polly neural voice requests are billed at $16.00 per million characters (after exceeding the free tier).
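To get a feel for these numbers, here is a small estimate based on the two rates quoted above. estimateMonthlyCost is a hypothetical helper, and it ignores the free tier, so treat the result as an upper bound:

```javascript
// Prices in USD per one million characters, taken from the figures above.
const PRICE_PER_MILLION = { standard: 4, neural: 16 };

// Hypothetical helper: estimates the monthly bill, ignoring the free tier.
function estimateMonthlyCost(characters, engine = "standard") {
  if (!Object.keys(PRICE_PER_MILLION).includes(engine)) {
    throw new Error(`Unknown engine: ${engine}`);
  }
  return (characters / 1_000_000) * PRICE_PER_MILLION[engine];
}

console.log(estimateMonthlyCost(250_000, "standard")); // 1
console.log(estimateMonthlyCost(250_000, "neural")); // 4
```

In other words, converting 250,000 characters a month would cost about $1 with a standard voice and about $4 with a neural voice.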

Integration (Example with Node.js)

Integrating Polly is straightforward; you can use the aws-sdk. Below is a sample code snippet:

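const fs = require("fs");
const AWS = require("aws-sdk");

// The region is an assumption; use whichever region you run Polly in.
const polly = new AWS.Polly({ region: "ap-northeast-1" });

polly.synthesizeSpeech(
  {
    Text: "おはようございます",
    TextType: "text",
    VoiceId: "Takumi",
    LanguageCode: "ja-JP",
    OutputFormat: "mp3",
  },
  (err, data) => {
    if (err) {
      // Stop here: data is undefined when the request fails.
      console.error(err);
      return;
    }
    // AudioStream is a Buffer holding the MP3 bytes.
    fs.writeFileSync("./result.mp3", data.AudioStream);
  }
);

// For longer text, polly.startSpeechSynthesisTask() runs asynchronously
// and writes the audio to an S3 bucket instead of returning it directly.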

This code will save the converted audio into result.mp3.
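If you prefer async/await over callbacks, the same call can be wrapped as in the sketch below. saveSpeech is a hypothetical helper, and it assumes the client exposes synthesizeSpeech(params).promise(), as aws-sdk v2 clients do. Passing the client in as an argument also makes it easy to substitute a fake one when testing without AWS credentials:

```javascript
const fs = require("fs");

// Hypothetical helper: synthesizes text with the given client and saves the MP3.
// `client` is assumed to expose synthesizeSpeech(params).promise() (aws-sdk v2 style).
async function saveSpeech(client, text, filePath) {
  const data = await client
    .synthesizeSpeech({
      Text: text,
      TextType: "text",
      VoiceId: "Takumi",
      LanguageCode: "ja-JP",
      OutputFormat: "mp3",
    })
    .promise();
  await fs.promises.writeFile(filePath, data.AudioStream);
  return filePath;
}

// Usage with the real SDK (requires AWS credentials; the region is an assumption):
// const AWS = require("aws-sdk");
// saveSpeech(new AWS.Polly({ region: "ap-northeast-1" }), "おはようございます", "./result.mp3")
//   .catch(console.error);
```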

Conclusion

Polly is a handy and affordable service that can be applied to many use cases to enhance content richness. Personally, I plan to use it for language learning, as it allows me to hear immediate and realistic pronunciations after inputting text, which is incredibly convenient.

For Chinese users, while the voice is acceptable, it doesn't match the familiar tone of Taiwanese Mandarin, which can lead to some resistance. It's unfortunate, and I hope they will offer a local Taiwanese accent in the future.

Footnotes

  1. https://docs.aws.amazon.com/polly/latest/dg/ssml.html
