OGCAT[tts]S

Orthographically-Generated Concatenated Audio Transforms Text-To-Speech Synthesis

Developing Natural Human Interfacing (NHI) applications often requires generated responses or referenced materials to be read aloud by the computer: for accessibility for the sight-impaired, for example, or in the field, where a repair technician may not be able to look at the screen during a particular tool manipulation. In interactive applications and games there is often a need for NPCs to vocalize, or for an avatar that can be chatted with or act as an interface to an underlying set of methods in the application.

There is great discrepancy in how well TTS "just works" across platforms, and much of it sits behind networked clouds, which does not bode well for in-field, out-of-signal-range work requiring NHI. TTS solutions can also be used for NPCs and characters in games, but the restriction here is that developers are stuck with a specific set of voices, often made specifically for B2B and B2C interactions or narrative reading styles, so they tend not to be exotic.

We wanted a cross-platform solution that ran natively inside Unity, with no Python console processes and similar cruft that make a distributed game or business application nonviable: wonky install locations, repos and libraries coming from unverified sources, and so on, which can become an IT nightmare. We have nicknamed such setups “flying sandcastles”.

Unity-Native TTS Requirements

So, we needed to define what we would like as features of such a native audio TTS for Unity:

  • Should have a minimal training regimen
  • Able to mimic the voice it was trained on
  • Not graphics card or OS-specific
  • Lightweight and portable
  • Variable pitch, speed, speaking rhythm
  • International Phonetic Alphabet (IPA)-based
  • Mel-spectrogram-based recording and synthesis

Populating The TTS Model

To begin with, the training corpus requires custom lists of words that incorporate each of the variant pronunciations in the International Phonetic Alphabet, as well as examples of how each phoneme may fit within a word, and of how it may blend into, or come to a full stop before or after, a different phoneme. In this manner you can extract the audio features representing the various phonemes, and you also obtain several variants of certain phonemes from which a median or averaged sample can be produced.
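As a minimal sketch of that first step (in C#, since the target is Unity), one might check a candidate word list against an IPA phoneme inventory as below; the Lexicon lookup and the inventory contents are placeholder assumptions, not part of the actual corpus tooling:

```csharp
// Sketch: which phonemes of the target inventory does a candidate word list still miss?
using System;
using System.Collections.Generic;
using System.Linq;

static class CorpusCoverage
{
    // Hypothetical pronunciation lookup: word -> IPA phoneme sequence.
    static readonly Dictionary<string, string[]> Lexicon = new()
    {
        { "ship",  new[] { "ʃ", "ɪ", "p" } },
        { "sheep", new[] { "ʃ", "iː", "p" } },
        { "thin",  new[] { "θ", "ɪ", "n" } },
    };

    // Phonemes not yet covered by the word list, so more words can be added
    // until every variant pronunciation gets recorded at least once.
    static IEnumerable<string> MissingPhonemes(IEnumerable<string> words, IEnumerable<string> inventory)
    {
        var covered = new HashSet<string>(
            words.Where(Lexicon.ContainsKey).SelectMany(w => Lexicon[w]));
        return inventory.Where(p => !covered.Contains(p));
    }

    static void Main()
    {
        var inventory = new[] { "ʃ", "ɪ", "iː", "p", "θ", "n", "ŋ" };
        foreach (var p in MissingPhonemes(new[] { "ship", "sheep" }, inventory))
            Console.WriteLine($"still need a word containing /{p}/");
    }
}
```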

The phoneme features are catalogued into a wavetable in which each orthographic symbol has a matched audio sample extracted from the recorded word list. To properly mimic a human speaking, especially if the voice is to be synthesized for theatrical purposes or needs a range of emotive styles, some further parameters need tracking. For example, does a variant of a phoneme or syllable rise or fall in pitch, or does it keep the same relative pitch and tonality across the sample?
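A rough sketch of what one wavetable record and its lookup structure could look like; the field names and the PitchTrend enum are illustrative, not the actual OGCAT[tts]S schema:

```csharp
// Sketch: a wavetable of recorded phoneme variants keyed by IPA symbol.
using System;
using System.Collections.Generic;

public enum PitchTrend { Rising, Falling, Flat }

public sealed class WavetableEntry
{
    public string IpaSymbol = "";                      // orthographic IPA symbol, e.g. "ʃ"
    public string SourceWord = "";                     // word the sample was extracted from
    public float[] Samples = Array.Empty<float>();     // mono PCM slice of the recording
    public float[,] MelSpectrogram = new float[0, 0];  // frames x mel bands
    public PitchTrend Trend = PitchTrend.Flat;         // does the sample rise, fall or stay flat?
    public float RelativePitch;                        // pitch relative to the speaker's baseline
}

public sealed class Wavetable
{
    // Several recorded variants per symbol, from which a median or averaged sample can be chosen.
    readonly Dictionary<string, List<WavetableEntry>> _variants = new();

    public void Add(WavetableEntry e)
    {
        if (!_variants.TryGetValue(e.IpaSymbol, out var list))
            _variants[e.IpaSymbol] = list = new List<WavetableEntry>();
        list.Add(e);
    }

    public IReadOnlyList<WavetableEntry> Variants(string ipaSymbol) =>
        _variants.TryGetValue(ipaSymbol, out var list)
            ? (IReadOnlyList<WavetableEntry>)list
            : Array.Empty<WavetableEntry>();
}
```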

To capture secondary characteristics of a voice, sentences were composed to be read questioningly, excitedly, drolly, nervously and in a narrative style. Pitch, syllabic rhythm, pause lengths, transient peaks and tonality should also be tracked at various window sizes to obtain graphs that can be applied to the synthesized voice based on tags or punctuation.
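To make the windowed tracking concrete, here is a small sketch of a loudness curve and pause-length extraction at one window size; pitch and tonality would be tracked over the same windows, and the silence threshold is just an assumed default:

```csharp
// Sketch: windowed loudness and pause-length tracking over a mono recording.
using System;
using System.Collections.Generic;

static class ProsodyFeatures
{
    // RMS loudness per non-overlapping window.
    public static float[] LoudnessCurve(float[] samples, int windowSize)
    {
        int frames = samples.Length / windowSize;
        var curve = new float[frames];
        for (int f = 0; f < frames; f++)
        {
            double sum = 0;
            for (int i = 0; i < windowSize; i++)
            {
                float s = samples[f * windowSize + i];
                sum += s * s;
            }
            curve[f] = (float)Math.Sqrt(sum / windowSize);
        }
        return curve;
    }

    // Pause lengths in seconds: runs of consecutive windows below a silence threshold.
    public static List<float> PauseLengths(float[] loudness, int windowSize, int sampleRate, float threshold = 0.01f)
    {
        var pauses = new List<float>();
        int run = 0;
        foreach (float level in loudness)
        {
            if (level < threshold) { run++; continue; }
            if (run > 0) pauses.Add(run * windowSize / (float)sampleRate);
            run = 0;
        }
        if (run > 0) pauses.Add(run * windowSize / (float)sampleRate);
        return pauses;
    }
}
```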

Finally, a few paragraphs are read, and the same analysis and extraction of features via MFCC (mel-frequency cepstral coefficients) is executed over these audio files. The training corpus described above should not take more than 15 minutes to read off with auto-recording and logging. At this point a wavetable of the orthographic variants of the IPA has been constructed, with matching mel spectrograms. What we want to do is stitch these mel spectrograms back together to synthesize speech as close as possible to the original. The most successful stitching parameters are what we will call our model.

Training & Classification

A Hypergraph is constructed with an Input Layer, Hypothesis Layers, a Ground Truth Test Layer, a Recursion Layer and an Output. It is somewhat similar to a Neural Net, but with no black-boxing of functionality, weights or biases.

The input takes an audio file with accompanying text and a matching IPA pronunciation. The first Hypothesis Layer takes phonemes from the wavetable just constructed and concatenates them in sequence according to the accompanying text. Each component on the Hypothesis Layer has its own blend and abutment methods and references an orthographic sieve to decide which variant to use for its hypothesis.
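As a sketch of a single Hypothesis component, a parameter set plus a simple linear crossfade can stand in for the blend and abutment methods; the class names, the crossfade scheme and the gap handling are illustrative assumptions:

```csharp
// Sketch: one Hypothesis component's stitching parameters and concatenation pass.
using System;
using System.Collections.Generic;

public sealed class StitchParameters
{
    public float CrossfadeSeconds = 0.02f;  // overlap blended between adjacent phonemes
    public float GapSeconds = 0f;           // optional silence inserted at a hard abutment
    public int VariantIndex = 0;            // which wavetable variant the orthographic sieve chose
}

public static class HypothesisComponent
{
    // Concatenate phoneme samples in text order according to this component's parameters.
    public static float[] Stitch(IReadOnlyList<float[]> phonemes, StitchParameters p, int sampleRate)
    {
        int fade = (int)(p.CrossfadeSeconds * sampleRate);
        var output = new List<float>(phonemes[0]);
        for (int i = 1; i < phonemes.Count; i++)
        {
            float[] next = phonemes[i];
            int overlap = Math.Min(fade, Math.Min(output.Count, next.Length));
            if (overlap == 0 && p.GapSeconds > 0)        // hard abutment: insert a short gap of silence
                output.AddRange(new float[(int)(p.GapSeconds * sampleRate)]);
            int start = output.Count - overlap;
            for (int j = 0; j < overlap; j++)            // linear crossfade across the overlap region
            {
                float t = (j + 1) / (float)(overlap + 1);
                output[start + j] = output[start + j] * (1 - t) + next[j] * t;
            }
            for (int j = overlap; j < next.Length; j++)  // then append the rest of the next phoneme
                output.Add(next[j]);
        }
        return output.ToArray();
    }
}
```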

For example, say we have a three-syllable/three-phoneme word for which we want to model the best stitch parameters based on the orthographic wavetable extracted from the training corpus recordings. The original single-word recording is 0.35 seconds long once silence is stripped from the beginning and end of the clip. It has transient peaks at 0.01 and 0.22 seconds. The pitch rises from the first to the second phoneme and falls from the second to the third. This becomes what we are trying to mimic, and each hypothesis is tested against it after its mel spectrograms are stitched.
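Written out as data, the Ground Truth for that example word might be recorded along these lines; the record type and field names are illustrative, while the values are the ones quoted above:

```csharp
// Sketch: the Ground Truth feature record the test layer compares hypotheses against.
using System;

public sealed class GroundTruthFeatures
{
    public float ClipLengthSeconds;                          // length once silence is stripped
    public float[] TransientPeaksSeconds = Array.Empty<float>();
    public string[] PitchContour = Array.Empty<string>();    // trend between consecutive phonemes
}

public static class ExampleTargets
{
    public static readonly GroundTruthFeatures ThreePhonemeWord = new()
    {
        ClipLengthSeconds = 0.35f,
        TransientPeaksSeconds = new[] { 0.01f, 0.22f },
        PitchContour = new[] { "rising", "falling" },
    };
}
```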

Hypothesis Layer

The Hypothesis components each hold a variant of the blending and timing of the three phonemes. There may be 10, 25, 50 or 100 Hypothesis variants. Regardless of count, each produces a result which can be tested and weighted based on how close it is to the original Ground Truth clip and spectrogram.

The result of each Hypothesis component is passed to the Ground Truth Test Layer, and a weighting is assigned based on how close parameters such as transient peak positions, loudness curve, pitch graph, clip length and mel spectrogram are to the original. The parameters that yielded the result closest to the Ground Truth are stored, and the filters on the Hypothesis Layer are reset to match more closely those that scored as the best match.
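A minimal sketch of how such a weighting could be collapsed into a single 0–1 score, assuming each feature distance has already been normalized to the 0–1 range; the particular mix of weights is an assumption, not the scoring rule actually used:

```csharp
// Sketch: combining normalized feature distances into one Ground Truth Test weight.
using System;

public static class GroundTruthTest
{
    // Each error is 0 (identical to the Ground Truth) .. 1 (completely off);
    // the returned weight is 1 minus their weighted average, clamped to 0..1.
    public static float Weight(
        float transientPositionError,
        float loudnessCurveError,
        float pitchGraphError,
        float clipLengthError,
        float melSpectrogramError)
    {
        float error =
            0.20f * transientPositionError +
            0.20f * loudnessCurveError +
            0.20f * pitchGraphError +
            0.15f * clipLengthError +
            0.25f * melSpectrogramError;
        return Math.Clamp(1f - error, 0f, 1f);
    }
}
```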

The hypothesis and ground truth testing can run as many times as needed to obtain congruence between the synthesized voice and the original voice. For example, say we have already run our three-syllable word through our 50 Hypothesis components. The normalized weighting on the Ground Truth Test Layer yields 30 components with weights below 0.5 and 20 above 0.5. Of those 20, there are 7 above 0.75 and one at 0.85. The Recursion Layer receives the weighting results and decides how to reset the Hypothesis components whose parameters yielded a low test weight.

We immediately discard the Hypothesis parameters that yielded weights below 0.5. We can either take the parameter sets weighted above 0.75, rejigger the parameters with a bias towards the highest-weighted sets, or simply take the single highest-weighted set. In any case, we replace the parameters on the Hypothesis Layer components with the rejiggered parameters and run the hypothesis results through the Ground Truth Test Layer again. If the weighting of a hypothesis reaches 1.0, it is a match, and those parameters are stored as the best stitching transforms that the phoneme sequence can use to mimic the ground truth during synthesis.
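A sketch of that Recursion Layer pass, assuming the scored hypotheses arrive as (parameters, weight) pairs; the jitter scheme and the bias towards the top half of the survivors are stand-ins for the "rejiggering" described above:

```csharp
// Sketch: discard low-weighted parameter sets and re-seed the Hypothesis Layer.
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class StitchParameters   // mirrors the stitching-parameter sketch above
{
    public float CrossfadeSeconds;
    public float GapSeconds;
}

public static class RecursionLayer
{
    static readonly Random Rng = new();

    // Assumes at least one scored hypothesis is supplied.
    public static List<StitchParameters> Reseed(
        IReadOnlyList<(StitchParameters Params, float Weight)> scored, int populationSize)
    {
        // Keep only hypotheses weighted 0.5 or above, best first.
        var survivors = scored.Where(s => s.Weight >= 0.5f)
                              .OrderByDescending(s => s.Weight)
                              .ToList();
        if (survivors.Count == 0)      // nothing cleared 0.5: fall back to the single best set
            survivors = scored.OrderByDescending(s => s.Weight).Take(1).ToList();

        var next = new List<StitchParameters>();
        while (next.Count < populationSize)
        {
            // Bias towards the highest-weighted survivors by sampling from the top half.
            var parent = survivors[Rng.Next(Math.Max(1, survivors.Count / 2))].Params;
            next.Add(new StitchParameters
            {
                CrossfadeSeconds = Jitter(parent.CrossfadeSeconds, 0.005f),
                GapSeconds = Jitter(parent.GapSeconds, 0.005f),
            });
        }
        return next;
    }

    static float Jitter(float value, float spread) =>
        value + (float)(Rng.NextDouble() * 2 - 1) * spread;
}
```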

Similarly, this form of Hypergraph, with a Hypothesis Components Layer and a Ground Truth Test Layer, can be utilized on sentences and paragraphs. In this case, where the best stitching transforms per word have already been parameterized, the Hypothesis components instead vary timing and rhythm, pause length, pitch rise and fall, and tonality. The parameters with the highest weighting during Ground Truth testing comprise the model for synthesizing phrasings, sentences and paragraphs.
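At the phrase level, a Hypothesis component's parameter set might look more like the following; the names and units are illustrative only:

```csharp
// Sketch: prosody-level parameters varied by sentence- and paragraph-level hypotheses.
public sealed class ProsodyParameters
{
    public float TempoScale = 1.0f;      // overall syllabic rhythm / speaking rate
    public float PauseScale = 1.0f;      // stretch or shrink pause lengths at punctuation
    public float[] WordPitchOffsets = System.Array.Empty<float>();  // per-word pitch rise/fall
    public float TonalityTilt = 0f;      // spectral tilt applied across the phrase
}
```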

The Resulting TTS Model

The final trained voice synthesis module comprises a wavetable of variants on the orthographic phonemes of the International Phonetic Alphabet. Sets of parameters for stitching the possible variants together into words make up the model's phoneme/word base layer. These are fed to a secondary layer comprising pitch graphs, rhythmic and timing graphs, and parameters which stitch the individual words together in the style of the training corpus speaker.
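Put together, the overall shape of the trained model might be summarized as below, reusing the ideas from the earlier sketches; the concrete field types are assumptions:

```csharp
// Sketch: the layered result of training, from wavetable to word stitching to prosody.
using System.Collections.Generic;

public sealed class TtsModel
{
    // Base layer: IPA symbol -> recorded variant samples (the wavetable).
    public Dictionary<string, List<float[]>> Wavetable = new();

    // Word layer: IPA sequence -> best stitching parameters found during training.
    public Dictionary<string, float[]> WordStitchParameters = new();

    // Secondary layer: pitch, rhythm and timing graphs for joining words
    // in the style of the training-corpus speaker.
    public Dictionary<string, float[]> ProsodyGraphs = new();
}
```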

This yields a lightweight, cross-platform, Unity-based solution for synthesizing voices for characters, for narrators, or for Human Machine Communications using Natural Human Interfacing. It also remains a good base for applying subsequent filters: taking a female voice, for example, shifting the dominant frequency bands towards the male range, adding some extra bass tonality and grit, and reproducing a male voice from the same base orthographic wavetable.

The advantages of this approach are that regional accents and essential characterization remain, and that it can all be done in Unity using its API and audio system. Because the mel spectrograms are visual, the bitmaps can be altered with various effects: bands can be shifted up or down, accentuated, sharpened, blurred, etc., which may yield some creative surprises or allow transformations that would be algorithmically impossible strictly in the audio-bytes domain.
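As one concrete example of that bitmap-style manipulation, shifting the mel bands of a spectrogram up or down is just a row shift on the frames × bands array; a real pipeline would still have to resynthesize audio from the edited spectrogram afterwards:

```csharp
// Sketch: band-shifting a mel spectrogram treated as a frames x bands bitmap.
public static class SpectrogramEffects
{
    // Shift every mel band by `offset` rows; bands shifted in from outside the image are zeroed.
    public static float[,] ShiftBands(float[,] mel, int offset)
    {
        int frames = mel.GetLength(0), bands = mel.GetLength(1);
        var shifted = new float[frames, bands];
        for (int f = 0; f < frames; f++)
            for (int b = 0; b < bands; b++)
            {
                int src = b - offset;
                shifted[f, b] = (src >= 0 && src < bands) ? mel[f, src] : 0f;
            }
        return shifted;
    }
}
```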

References:

https://en.wikipedia.org/wiki/International_Phonetic_Alphabet

https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

https://en.wikipedia.org/wiki/Speech_synthesis