If you have any questions or feedback, please fill out this form.
This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.
Polly is an Amazon Web Services (AWS) service that converts text to speech. Text-to-speech technology isn't new (Google Translate can easily do the same), but Polly aims to deliver a more natural-sounding voice, which is a significant advantage for language learners. Its applications are also broad: converting subtitles to audio, producing scripts, narrations, and dialogues, and even recording podcasts directly with Polly. Readers interested in experiencing its capabilities can visit Amazon Polly.
Common Language Tests
For language support, I tested Chinese, English, Japanese, and Korean. Here are a few audio samples I converted using Polly:
- Japanese
- Chinese (Mandarin)
- English (American)
- English (British)
- Korean
There's no need to elaborate on English support; it's quite extensive, offering choices between American and British accents, along with various voice options that sound remarkably natural—so much so that it’s easy to mistake it for a real person if you’re not paying close attention. The Chinese voice, on the other hand, sounds a bit unnatural; it doesn't quite match Taiwanese Mandarin but isn't exactly standard Mandarin either. The level of support for Japanese exceeded my expectations; not only do the sentences sound very fluent, but if English is mixed in, Polly will even read it out with a Japanese accent. For example, the phrase: "この件についてはbug ticket必要でしょうか?(Does this issue require a bug ticket?)" is pronounced by Polly as follows:
While it can't replicate the rich vocal variations of a voice actor based on the scene, I still find it to be a very practical tool.
Polly offers two options: one is "Neural Voice," which aims to produce the most natural and human-like sound possible; the other is "Standard," which sounds fairly natural but still has a mechanical quality. Currently, only some languages support "Neural Voice." Among Chinese, English, Japanese, and Korean, both English and Japanese support "Neural."
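In the Node.js SDK, the choice between the two is made per request through the `Engine` parameter of `synthesizeSpeech`. Here is a minimal sketch, assuming a hypothetical helper named `buildSpeechParams` (it is not part of the SDK) and a voice that actually supports the requested engine:

```javascript
// Hypothetical helper (not part of aws-sdk): builds the parameter object
// for polly.synthesizeSpeech(). Engine accepts "standard" or "neural".
function buildSpeechParams(text, voiceId, useNeural) {
  return {
    Text: text,
    TextType: "text",
    VoiceId: voiceId,
    OutputFormat: "mp3",
    Engine: useNeural ? "neural" : "standard",
  };
}

// Example: request the neural variant of the Japanese voice "Takumi".
const params = buildSpeechParams("おはようございます", "Takumi", true);
console.log(params.Engine);
```

Keeping the engine as an explicit flag makes it easy to fall back to the standard engine for languages that have no neural voice.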
Polly also supports SSML (Speech Synthesis Markup Language), allowing you to add pauses for specific sentences or adjust the tone of the voice based on the scene, enhancing the overall auditory experience.
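As a rough illustration, an SSML document wraps the text in `<speak>` and uses tags such as `<break>` for pauses and `<prosody>` for speaking rate. The sketch below assumes two hypothetical helpers, `escapeXml` and `buildSsml`, which are not part of any SDK; the pause length and the `slow` rate are arbitrary example values:

```javascript
// Escape characters that are reserved in XML so plain text cannot break the markup.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: joins sentences with a pause and slows the whole passage down.
function buildSsml(sentences, pauseMs) {
  const body = sentences
    .map((sentence) => escapeXml(sentence))
    .join(`<break time="${pauseMs}ms"/>`);
  return `<speak><prosody rate="slow">${body}</prosody></speak>`;
}

const ssml = buildSsml(["おはようございます", "今日もよろしくお願いします"], 500);
console.log(ssml);
// To synthesize it, pass the string as Text and set TextType to "ssml":
// { Text: ssml, TextType: "ssml", VoiceId: "Takumi", OutputFormat: "mp3" }
```

Escaping the input first matters because a stray `&` or `<` in the text would make the SSML invalid.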
Pricing
You can refer to the official website for pricing details. Standard voices cost $4.00 per million characters and neural voices cost $16.00 per million characters. Unless your product requires extensive text-to-speech functionality, this pricing is very affordable for general auxiliary use, making it accessible for independent developers as well.
Billing is monthly, based on the number of characters processed, and these rates apply once you exceed the free tier.
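The billing rule comes down to simple arithmetic: characters beyond the free allowance, divided by one million, multiplied by the per-million rate. Here is a minimal sketch, assuming a hypothetical helper named `estimateMonthlyCost`. The free allowance of 1 million characters in the example is an illustrative assumption, so check the pricing page for the figures that apply to your account:

```javascript
// Hypothetical helper: estimates a monthly bill in USD.
// ratePerMillion and freeCharacters are inputs supplied by the caller,
// not values read from AWS.
function estimateMonthlyCost(characters, ratePerMillion, freeCharacters = 0) {
  const billable = Math.max(0, characters - freeCharacters);
  return (billable / 1e6) * ratePerMillion;
}

// 2.5 million characters at 16 USD per million, with an assumed
// free allowance of 1 million characters: 1.5 * 16 = 24 USD.
console.log(estimateMonthlyCost(2.5e6, 16, 1e6)); // 24
```

Usage below the free allowance comes out to zero, which is why occasional personal use costs little or nothing.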
Integration (Example with Node.js)
Integrating Polly is straightforward; you can use the aws-sdk. Below is a sample code snippet:
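const fs = require("fs");
const AWS = require("aws-sdk");

// The region is an assumption; use the one configured for your account.
const polly = new AWS.Polly({ region: "us-east-1" });

polly.synthesizeSpeech(
  {
    Text: "おはようございます",
    TextType: "text",
    VoiceId: "Takumi",
    LanguageCode: "ja-JP",
    OutputFormat: "mp3",
  },
  (err, data) => {
    if (err) {
      console.error(err);
      return; // data is undefined on failure, so skip the write
    }
    // AudioStream holds the MP3 bytes as a Buffer
    fs.writeFileSync("./result.mp3", data.AudioStream);
  }
);

// For longer texts, polly.startSpeechSynthesisTask() runs the synthesis
// asynchronously and writes the output to an S3 bucket instead.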
This code will save the converted audio into result.mp3.
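The same call can also be written with async/await, since request objects in aws-sdk v2 expose a `.promise()` method. The sketch below assumes a hypothetical helper named `synthesizeToFile` that receives the Polly client as an argument, so it can be tried with a stub before real credentials are wired in:

```javascript
const { writeFile } = require("fs").promises;

// Hypothetical helper: `polly` is expected to be an AWS.Polly instance from
// aws-sdk v2, whose request objects expose .promise().
async function synthesizeToFile(polly, text, voiceId, outPath) {
  const data = await polly
    .synthesizeSpeech({
      Text: text,
      TextType: "text",
      VoiceId: voiceId,
      OutputFormat: "mp3",
    })
    .promise();
  // AudioStream holds the MP3 bytes; write them to disk as-is.
  await writeFile(outPath, data.AudioStream);
  return outPath;
}

// Usage (requires aws-sdk plus configured credentials and region):
// const AWS = require("aws-sdk");
// synthesizeToFile(new AWS.Polly(), "おはようございます", "Takumi", "./result.mp3")
//   .catch(console.error);
```

Errors from Polly or from the file write reject the returned promise, so a single `.catch` covers both.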
Conclusion
Polly is a handy and affordable service that can be applied to many use cases to enhance content richness. Personally, I plan to use it for language learning, as it allows me to hear immediate and realistic pronunciations after inputting text, which is incredibly convenient.
For Chinese users, while the voice is acceptable, it doesn't match the familiar tone of Taiwanese Mandarin, which can lead to some resistance. It's unfortunate, and I hope they will offer a local Taiwanese accent in the future.