- Google’s research division has launched AudioLM, a framework for creating high-quality audio that retains consistency across time.
- The most striking part is that it does so without being trained on transcripts or annotations, yet the generated speech is syntactically and semantically plausible.
- Furthermore, it keeps the speaker’s identity and prosody to the point that the listener cannot discern which portion of the audio is genuine and which was generated by artificial intelligence.
- AudioLM’s most important feature is that a single model can handle several tasks at once, not just continue speech and melodies.
- AudioLM is not yet publicly accessible; for now it is only a language model that could be built into a variety of applications.
We showed them chess games, and they quickly became unbeatable opponents; we let them read our texts, and they soon began to write. They also learned to paint and edit photos. Did anyone doubt that artificial intelligence could do the same with speech and music?
Google’s AudioLM performs miracles with both speech and music
Google’s research group has launched AudioLM, a framework for producing high-quality audio that maintains consistency across time. Starting from a recording just a few seconds long, it can extend the audio naturally and logically.
The most impressive aspect is that it does so without being trained on transcripts or annotations, yet the generated speech is syntactically and semantically plausible. Furthermore, it preserves the speaker’s identity and prosody to the point that the listener is unable to determine which part of the audio is genuine and which was created by artificial intelligence.
The results are astounding. The model not only mimics articulation, pitch, timbre, and intensity, but also reproduces the sound of the speaker’s breathing and forms understandable phrases. If the prompt comes not from a studio but from a recording with background noise, AudioLM reproduces that noise to preserve continuity. More examples are available on the AudioLM website.
AudioLM was trained in semantics and acoustics
Generating audio or music is not a new idea. What is new is the approach Google’s researchers take to the problem. Two kinds of markers are extracted from each recording: semantic markers, which encode high-level structure (phonemes, lexicon, semantics…), and acoustic markers, which capture the details of the sound itself (speaker identity, recording quality, background noise…).
With this data already processed into a form the AI can work with, AudioLM builds a hierarchy: it predicts the semantic markers first, and these are then used as constraints when predicting the acoustic markers. The acoustic markers are used once more at the end, to turn the bits into something we can hear.
This separation of semantics from acoustics, and the hierarchy between them, is not only useful for training language models that generate speech. According to the researchers, it also works better for continuing piano compositions, as demonstrated on their website, where it outperforms models trained exclusively on acoustic markers.
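To make the hierarchy more concrete, here is a minimal Python sketch of the three steps: semantic markers first, acoustic markers conditioned on them, then decoding to sound. It is only an illustration under loud assumptions. Every function name, vocabulary size, and formula below is hypothetical, and the toy arithmetic merely stands in for the trained neural networks that AudioLM actually uses.

```python
import random
from typing import List


def predict_semantic_tokens(prompt: List[int], n_new: int,
                            vocab: int = 16, seed: int = 0) -> List[int]:
    """Stage 1 (toy stand-in): extend the semantic markers one at a time.

    Assumes a non-empty prompt. A real model would sample each marker from
    a learned distribution; here a seeded random step is used instead.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append((tokens[-1] + rng.randint(0, 2)) % vocab)
    return tokens


def predict_acoustic_tokens(semantic: List[int], per_semantic: int = 2,
                            vocab: int = 64) -> List[int]:
    """Stage 2 (toy stand-in): acoustic markers constrained by semantic ones.

    Each acoustic marker depends on the current semantic marker and on the
    previous acoustic marker, mirroring the conditioning described above.
    """
    acoustic: List[int] = []
    for s in semantic:
        for k in range(per_semantic):
            prev = acoustic[-1] if acoustic else 0
            acoustic.append((s * 3 + k + prev) % vocab)
    return acoustic


def decode_to_waveform(acoustic: List[int], samples_per_token: int = 4,
                       vocab: int = 64) -> List[float]:
    """Stage 3 (toy stand-in): map acoustic markers to samples in [-1, 1]."""
    wave: List[float] = []
    for a in acoustic:
        level = 2.0 * a / (vocab - 1) - 1.0
        wave.extend([level] * samples_per_token)
    return wave


def continue_audio(prompt_semantic: List[int], n_new: int) -> List[float]:
    """Run the three stages in order: semantic, then acoustic, then audio."""
    semantic = predict_semantic_tokens(prompt_semantic, n_new)
    acoustic = predict_acoustic_tokens(semantic)
    return decode_to_waveform(acoustic)


if __name__ == "__main__":
    wave = continue_audio([1, 2, 3], n_new=5)
    print(len(wave), "samples generated")
```

The point of the sketch is the ordering: the acoustic stage never runs on its own, it always receives the semantic markers as input, which is the constraint the researchers describe.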
The most important aspect of AudioLM is that a single model can handle many tasks, not just continue speech and melodies. That one language model could be used for text-to-speech (a robot might read entire novels and replace professional voice actors) or to let any gadget speak with humans in a familiar voice. Amazon has already explored using the voices of loved ones on its Alexa devices.
Is AI becoming more dangerous by the day?
Programs like DALL-E 2 and Stable Diffusion are excellent tools for quickly sketching ideas or generating creative materials. Audio may be much more significant, and one may see firms using an announcer’s voice on demand. The voices of departed actors might even be used in dubbing films.
You may be thinking that this idea, while thrilling, is also risky. Any audio recording could be tampered with for political, legal, or judicial purposes. According to Google, although people have difficulty distinguishing what comes from a human from what comes from artificial intelligence, a computer can detect whether the audio is natural or synthetic. So not only might machines replace us, but another machine will be needed to check their work.
AudioLM is not yet available to the public; for now it is only a language model that could be integrated into various applications. However, this example, along with OpenAI’s Jukebox music software, highlights how swiftly we’re entering a new world where no one will ever know, or care, whether that photo was shot by a person or whether there’s someone on the other end of the phone.