Polly is an Amazon Web Services (AWS) service that converts text into speech, also known as text-to-speech. Text-to-speech is not a new concept, and even Google Translate can do it easily, but Polly aims to produce more natural-sounding voices from the input text, which is a great advantage for language learners. Polly also has a wide range of applications, such as turning subtitles into speech, voicing scripts, narrations, and dialogues, and even producing podcasts directly. Readers who want to hear the results can try it on the Amazon Polly site.
Common Language Testing
For my own use cases, I wanted to test Chinese, English, Japanese, and Korean. Here are a few voice samples generated with Polly:
- Japanese
- Chinese (Mandarin)
- English (American)
- English (British)
- Korean
English needs no introduction: it is well supported and offers several American and British voices. The voices sound quite natural, and if you don't listen carefully, you might mistake them for real human voices. The Chinese voice, however, sounds a bit unnatural and doesn't seem to be either Taiwanese Mandarin or Standard Chinese. The level of support for Japanese exceeded my expectations. The sentences sound very smooth, and when English words are mixed in, Polly pronounces them with a Japanese accent. One sentence I tried was "この件についてはbug ticket必要でしょうか?" (Do we need a bug ticket for this issue?), in which "bug ticket" is read the way a Japanese speaker would say it.
Although it cannot vary its delivery from scene to scene the way a voice actor can, it is already a very useful tool for me.
Polly offers two options: "Neural" and "Standard." The "Neural" option aims to generate voices that sound as natural and close to human as possible. The "Standard" option already sounds quite natural but still retains a slightly mechanical tone. Currently, only certain languages support the "Neural" option. Among Chinese, English, Japanese, and Korean, the "Neural" option is available for English and Japanese.
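As a rough sketch of how this choice shows up in code: the SynthesizeSpeech request accepts an `Engine` field whose values include "neural" and "standard". The helper name `buildSpeechParams` and the voice "Joanna" are assumptions chosen for illustration, and the snippet only builds the request object, without calling AWS.

```javascript
// Build parameters for Polly's SynthesizeSpeech call, choosing the engine explicitly.
// buildSpeechParams is a hypothetical helper, not part of the AWS SDK.
function buildSpeechParams(text, voiceId, useNeural) {
  return {
    Text: text,
    TextType: "text",
    VoiceId: voiceId,
    OutputFormat: "mp3",
    // "neural" and "standard" correspond to the two options described above.
    Engine: useNeural ? "neural" : "standard",
  };
}

const neuralParams = buildSpeechParams("Good morning", "Joanna", true);
const standardParams = buildSpeechParams("Good morning", "Joanna", false);
console.log(neuralParams.Engine, standardParams.Engine);
```

Either object could then be passed to `polly.synthesizeSpeech`. If the chosen voice does not support the requested engine, the request is expected to fail, so it is worth checking the voice list first.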
Polly supports SSML (Speech Synthesis Markup Language) [1], which lets you insert pauses at specific points or adjust how the voice sounds for different scenes, making the generated speech more immersive.
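To make the SSML idea concrete, here is a minimal sketch that joins sentences with a `<break>` pause and slows the speaking rate with `<prosody>`. The helpers `escapeForSsml` and `toSsml` are hypothetical names, and the set of tags a given voice accepts may be narrower than what is shown, particularly for neural voices.

```javascript
// Escape characters that are reserved in XML so plain text can be embedded in SSML.
function escapeForSsml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Join sentences with a pause between them and slow the overall speaking rate.
// toSsml is a hypothetical helper for illustration.
function toSsml(sentences, pauseMs) {
  const body = sentences
    .map((sentence) => escapeForSsml(sentence))
    .join(`<break time="${pauseMs}ms"/>`);
  return `<speak><prosody rate="slow">${body}</prosody></speak>`;
}

const ssml = toSsml(["Hello.", "Tom & Jerry are here."], 500);
const ssmlParams = {
  Text: ssml,
  TextType: "ssml", // tells Polly to interpret Text as SSML rather than plain text
  VoiceId: "Joanna", // assumed voice, for illustration only
  OutputFormat: "mp3",
};
console.log(ssml);
```

The important switch is `TextType: "ssml"`. With `TextType: "text"`, the markup would be treated as plain text instead of being interpreted.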
Pricing
For pricing details, you can refer to the official website. The cost is 16.00 USD per million characters for neural voices. Unless a product requires a large volume of text-to-speech conversions, the cost should be quite affordable for general auxiliary usage, even for independent developers.
You pay monthly based on the number of characters processed by Amazon Polly for speech or speech mark requests. Amazon Polly neural voices are billed at 16.00 USD per 1 million characters (after the free tier).
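For a sense of scale, the arithmetic can be sketched as follows, using the 16.00 USD per million characters figure quoted above for neural voices. The helper name and the 50,000-character script size are assumptions for illustration, and the free tier is ignored.

```javascript
// Price quoted above for neural voices, in USD per one million characters.
const NEURAL_USD_PER_MILLION_CHARS = 16.0;

// estimateNeuralCostUsd is a hypothetical helper; it ignores the free tier.
function estimateNeuralCostUsd(characterCount) {
  return (characterCount * NEURAL_USD_PER_MILLION_CHARS) / 1000000;
}

// A 50,000-character script (an assumed size, for illustration).
console.log(estimateNeuralCostUsd(50000).toFixed(2)); // prints "0.80"
```

At this rate a fairly long script costs well under a dollar, which is why the price is manageable for occasional or auxiliary use.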
Integration (Using Node.js as an Example)
Integrating with Polly is straightforward. You can use the aws-sdk to interact with the service. Here is an example code snippet:
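    // Assumes the v2 "aws-sdk" package and AWS credentials available in the environment.
    const AWS = require("aws-sdk");
    const fs = require("fs");

    // The region is an assumption; use the one configured for your account.
    const polly = new AWS.Polly({ region: "us-east-1" });

    polly.synthesizeSpeech(
      {
        Text: "おはようございます",
        TextType: "text",
        VoiceId: "Takumi",
        LanguageCode: "ja-JP",
        OutputFormat: "mp3",
      },
      (err, data) => {
        if (err) {
          console.error(err);
          return; // data is undefined on failure, so do not try to write it
        }
        // With the v2 SDK in Node.js, AudioStream is a Buffer.
        fs.writeFileSync("./result.mp3", data.AudioStream);
      }
    );

This code will save the converted speech into the file result.mp3.
The SDK also provides polly.startSpeechSynthesisTask(), an asynchronous variant that writes its output to an S3 bucket instead of returning the audio directly.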
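The `startSpeechSynthesisTask` call named alongside the example takes a similar request, plus an S3 bucket to write the audio into. The following is only a sketch of that request object: `buildTaskParams` and the bucket name are hypothetical, and nothing is sent to AWS.

```javascript
// Build parameters for an asynchronous synthesis task.
// buildTaskParams and the bucket name below are hypothetical.
function buildTaskParams(text, bucketName) {
  return {
    Text: text,
    TextType: "text",
    VoiceId: "Takumi",
    LanguageCode: "ja-JP",
    OutputFormat: "mp3",
    OutputS3BucketName: bucketName, // the task writes the audio file to this bucket
  };
}

const taskParams = buildTaskParams("おはようございます", "my-example-audio-bucket");
console.log(taskParams);
```

With a configured client, this object would be passed as `polly.startSpeechSynthesisTask(taskParams, callback)`. The response describes the task rather than containing the audio, so the file has to be fetched from the bucket once the task completes.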
Conclusion
Polly is a user-friendly and cost-effective service that can enhance the richness of content in various applications. Personally, I find it useful for language learning, as it allows me to input text and instantly hear near-realistic pronunciations.
For Chinese users, although the voices are acceptable, they are not in a familiar accent, which may lead to some resistance. It would be great to have support for Taiwanese Mandarin or other local accents in the future.