Whisper v3: Revolutionizing speech recognition and beyond

Imagine a speech recognition model that not only understands multiple languages but also seamlessly translates and identifies them. Whisper v3 is the embodiment of this vision. It’s not just a model; it’s a dynamic force that reshapes the boundaries of audio data comprehension.

The ability to transcribe, translate, and identify languages in spoken words has long been a holy grail of technology, and OpenAI has just moved it within reach.

Whisper v3 can be accessed through both command-line and Python interfaces, making it accessible to a wide range of users, from developers to researchers and novices (Image credit)

Whisper v3’s multifaceted audio revolution

Whisper v3 is a highly advanced and versatile speech recognition model developed by OpenAI. It is a part of the Whisper family of models, and it brings significant improvements and capabilities to the table. Let’s dive into the details of Whisper v3:

  • General-purpose speech recognition model: Whisper v3, like its predecessors, is a general-purpose speech recognition model. It is designed to transcribe spoken language into text, making it an invaluable tool for a wide range of applications, including transcription services, voice assistants, and more.
  • Multitasking capabilities: One of the standout features of Whisper v3 is its multitasking capabilities. It can perform a variety of speech-related tasks, which include:
    • Multilingual speech recognition: Whisper v3 can recognize speech in multiple languages, making it suitable for diverse linguistic contexts.
    • Speech translation: It can not only transcribe speech but also translate it into different languages.
    • Language identification: The model has the ability to identify the language being spoken in the provided audio.
    • Voice activity detection: Whisper v3 can determine when speech is present in audio data, making it useful for applications like voice command detection in voice assistants.

Whisper v3 is built on a state-of-the-art Transformer sequence-to-sequence model. In this model, a sequence of tokens representing the audio data is processed and decoded to produce the desired output. This architecture enables Whisper v3 to replace several stages of a traditional speech processing pipeline, simplifying the overall process.
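
To make this concrete, here is a minimal sketch of what a transcription call looks like with the open-source whisper Python package; the checkpoint name "large-v3" and the file path are illustrative assumptions rather than an excerpt from OpenAI's documentation.

```python
# Minimal sketch: end-to-end transcription with the open-source whisper package.
# The checkpoint name "large-v3" and the input file are illustrative assumptions.
import whisper

model = whisper.load_model("large-v3")      # downloads the weights on first use
result = model.transcribe("audio.mp3")      # audio loading, features, and decoding in one call

print(result["language"])                   # detected language code, e.g. "en"
print(result["text"])                       # the full transcription
```

A single call covers resampling, feature extraction, decoding, and timestamping, which is the pipeline consolidation described above.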

Released under the MIT License, Whisper v3 encourages innovation and collaboration

To perform various tasks, v3 uses special tokens that serve as task specifiers or classification targets. These tokens guide the model in understanding the specific task it needs to perform.
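
To make the role of these special tokens more tangible, the sketch below (again an illustrative use of the open-source whisper package, not OpenAI's documented recipe) runs language identification first and then requests the translation task explicitly:

```python
# Hedged sketch: language identification followed by speech translation.
# Model name, file name, and parameters are placeholders for illustration.
import whisper

model = whisper.load_model("large-v3")
audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))   # 30-second window
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

_, probs = model.detect_language(mel)                 # language identification
print("Detected language:", max(probs, key=probs.get))

# task="translate" asks the decoder to emit English text regardless of the input language
result = model.transcribe("speech.wav", task="translate")
print(result["text"])
```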




Available models and languages

Whisper v3 offers a range of model sizes, with four of them having English-only versions. These models differ in terms of the trade-off between speed and accuracy. The available models and their approximate memory requirements and relative inference speeds compared to the large model are as follows:

  • Tiny: 39 million parameters, ~32x faster than the large model, and requires around 1 GB of VRAM.
  • Base: 74 million parameters, ~16x faster, and also requires about 1 GB of VRAM.
  • Small: 244 million parameters, ~6x faster, and needs around 2 GB of VRAM.
  • Medium: 769 million parameters, ~2x faster, and requires about 5 GB of VRAM.
  • Large: 1550 million parameters, which serves as the baseline, and needs approximately 10 GB of VRAM.

The English-only models, particularly the tiny.en and base.en versions, tend to perform better on English speech, with the difference becoming less significant for the small.en and medium.en models.

Whisper v3’s performance can vary significantly depending on the language being transcribed or translated. Word Error Rates (WERs) and Character Error Rates (CERs) are used to evaluate performance on different datasets, and the figures published alongside the model offer insight into how well it handles various languages and tasks.
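
For readers who want to compute comparable numbers on their own data, WER and CER can be measured with a few lines of code; the sketch below uses the third-party jiwer package purely as an illustration, since the article does not specify OpenAI's evaluation tooling.

```python
# Illustrative WER/CER computation with the jiwer package (an assumption,
# not the evaluation stack used for the published Whisper figures).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))      # word error rate
print("CER:", jiwer.cer(reference, hypothesis))      # character error rate
```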

How to use Whisper v3

To use Whisper v3 effectively, it’s important to set up the necessary environment. The model was developed using Python 3.9.9 and PyTorch 1.10.1. However, it is expected to be compatible with a range of Python versions, from 3.8 to 3.11, as well as recent PyTorch versions.

Additionally, it relies on various Python packages, including OpenAI’s tiktoken for efficient tokenization. Installation of Whisper v3 can be done using the provided pip commands. It is important to note that the model’s setup also requires the installation of ffmpeg, a command-line tool used for audio processing. Depending on the operating system, various package managers can be used to install it.
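
As a quick sanity check of that environment, something like the following can be run before installing the model; the version bounds are the ones mentioned above, not constraints enforced by the package itself.

```python
# Hedged environment check for the setup described above; the version range
# reflects the article's guidance, not a hard requirement of the package.
import shutil
import sys

import torch

assert (3, 8) <= sys.version_info[:2] <= (3, 11), "Python 3.8-3.11 expected"
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# ffmpeg is a separate command-line dependency used to decode audio files
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH; install it via your OS package manager")
```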

For more detailed information, click here.

Whisper v3 is a state-of-the-art speech recognition model with multitasking capabilities (Image credit)

Conclusion

Whisper v3 is a versatile speech recognition model by OpenAI. It offers multilingual speech recognition, translation, language identification, and voice activity detection. Built on a Transformer model, it simplifies audio processing. Whisper v3 is compatible with various Python versions, has different model sizes, and is accessible via command-line and Python interfaces. Released under the MIT License, it encourages innovation and empowers users to extract knowledge from spoken language, transcending language barriers.

Featured image credit: Andrew Neel/Pexels

Study finds that even the best speech recognition systems exhibit bias

This article originally appeared on VentureBeat and is reproduced with permission.

Even state-of-the-art automatic speech recognition (ASR) algorithms struggle to recognize the accents of people from certain regions of the world. That’s the top-line finding of a new study published by researchers at the University of Amsterdam, the Netherlands Cancer Institute, and the Delft University of Technology, which found that an ASR system for the Dutch language recognized speakers of specific age groups, genders, and countries of origin better than others.

Speech recognition has come a long way since IBM’s Shoebox machine and Worlds of Wonder’s Julie doll. But despite progress made possible by AI, voice recognition systems today are at best imperfect, and at worst discriminatory. In a study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born users. More recently, the Algorithmic Justice League’s Voice Erasure project found that speech recognition systems from Apple, Amazon, Google, IBM, and Microsoft collectively achieve word error rates of 35% for African American voices versus 19% for white voices.

The coauthors of this latest research set out to investigate how well an ASR system for Dutch recognizes speech from different groups of speakers. In a series of experiments, they observed whether the ASR system could contend with diversity in speech along the dimensions of gender, age, and accent.

The researchers began by having an ASR system ingest sample data from CGN, an annotated corpus used to train AI language models to recognize the Dutch language. CGN contains recordings spoken by people ranging in age from 18 to 65 years old from the Netherlands and the Flanders region of Belgium, covering speaking styles including broadcast news and telephone conversations.

CGN has a whopping 483 hours of speech spoken by 1,185 women and 1,678 men. But to make the system even more robust, the coauthors applied data augmentation techniques to increase the total hours of training data “ninefold.”
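
The researchers do not spell out which augmentations they applied, but common speech augmentations look roughly like the hedged sketch below (tempo perturbation plus additive noise, using librosa and NumPy); the specific rates and noise levels are illustrative, not taken from the paper.

```python
# Illustrative audio augmentation: tempo perturbation plus additive white noise.
# The rates and SNR values are placeholders, not the paper's actual settings.
import numpy as np
import librosa


def augment(waveform: np.ndarray) -> list[np.ndarray]:
    variants = []
    for rate in (0.9, 1.0, 1.1):                             # mild tempo changes
        stretched = librosa.effects.time_stretch(waveform, rate=rate)
        for snr_db in (20, 30):                              # add low-level noise
            noise = np.random.randn(len(stretched))
            scale = np.sqrt(np.mean(stretched ** 2) / (10 ** (snr_db / 10)))
            variants.append(stretched + scale * noise)
    return variants
```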

When the researchers ran the trained ASR system through a test set derived from the CGN, they found that it recognized female speech more reliably than male speech regardless of speaking style. Moreover, the system struggled to recognize speech from older people compared with younger speakers, potentially because the older group articulated less clearly. And it had an easier time detecting speech from native speakers versus non-native speakers. Indeed, the worst-recognized native speech (that of Dutch children) had a word error rate around 20% better than that of the best non-native age group.

In general, the results suggest that teenagers’ speech was most accurately interpreted by the system, followed by seniors’ (over the age of 65) and children’s. This held even for non-native speakers who were highly proficient in Dutch vocabulary and grammar.

As the researchers point out, while the bias that creeps into datasets can never be fully removed, one solution is mitigating it at the algorithmic level.

“[We recommend] framing the problem, developing the team composition and the implementation process from a point of anticipating, proactively spotting, and developing mitigation strategies for affective prejudice [to address bias in ASR systems],” the researchers wrote in a paper detailing their work. “A direct bias mitigation strategy concerns diversifying and aiming for a balanced representation in the dataset. An indirect bias mitigation strategy deals with diverse team composition: the variety in age, regions, gender, and more provides additional lenses of spotting potential bias in design. Together, they can help ensure a more inclusive developmental environment for ASR.”

IBM Watson Adds Five New Services Including Image, Speech & Tradeoff Analytics

The IBM Watson developer cloud just got five more free beta services added to its inventory, which will expand IBM Watson capabilities to images, speech, and analytics, the IT giant reported Wednesday.


The five new services now available on Bluemix are:

  • Speech to Text: This is a cloud-based, real-time service that uses low-latency speech recognition to convert speech into text for voice-controlled mobile applications, transcription services, and more. Based on more than 50 years of speech research at IBM, its use cases include voice control of apps and real-time transcription of meetings and conference calls (a rough usage sketch follows this list).
  • Text to Speech: This service converts textual input into speech and offers a choice of three voices in English or Spanish, including the American English voice used by Watson in the 2011 Jeopardy! match. It generates synthesized audio output complete with appropriate cadence and intonation, and could find use in assisting the visually impaired.
  • Visual Recognition: This service analyzes the visual appearance of images or video frames to understand their content. It can help organize and ingest large collections of digital images, draw semantic associations between images from multiple sources, and provide insight into consumer shopping behavior based on image queries.
  • Concept Insights: This service handles text conceptually, delivering a search capability that surfaces insights traditional keyword searches miss and helping improve search queries with more intuitive results.
  • Tradeoff Analytics: This service helps users make better choices by weighing multiple, often conflicting goals. It can be used by retailers and manufacturers to determine product mix, by consumers to compare and contrast competing products or services, and by physicians to decide on optimal treatment methods.
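
As a rough, hedged illustration of how a cloud speech-to-text service of this kind is typically consumed, the Python sketch below posts an audio file to a REST endpoint; the URL, credentials, and response shape are placeholders rather than IBM's documented API.

```python
# Illustrative only: the endpoint, credentials, and response fields below are
# placeholders, not IBM's documented Speech to Text API.
import requests

ENDPOINT = "https://example-watson-host/speech-to-text/api/v1/recognize"  # placeholder URL

with open("meeting.wav", "rb") as audio_file:
    response = requests.post(
        ENDPOINT,
        auth=("username", "password"),          # placeholder credentials
        headers={"Content-Type": "audio/wav"},
        data=audio_file,
    )

response.raise_for_status()
print(response.json())                          # transcription returned as JSON
```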

This is the second such release, following the offering of eight beta Watson services for the developer community to test and develop against by harnessing Watson’s capabilities, in order to “harden each service as we prepare them for general availability,” explains IBM.

From Language Identification and Machine Translation to Visualization Rendering and User Modeling, the beta services are being embedded into a new class of cognitive apps. So far over 5,000 partners, developers, data hobbyists, entrepreneurs, students and others have contributed to building 6,000+ apps infused with Watson’s cognitive computing capabilities, claims IBM.

Read IBM’s full blogpost here.


(Image credit: Anirudh Koul, via Flickr)

 

Facebook’s Speech Recognition Cause Gets Boost with Acquisition of Wit.ai

Wit.ai, the Y Combinator startup that has been building an open and extensible natural language platform to help developers create applications and devices that turn speech into actionable data, announced earlier this week that it has been acquired by Facebook.

“It is an incredible acceleration in the execution of our vision. Facebook has the resources and talent to help us take the next step. Facebook’s mission is to connect everyone and build amazing experiences for the over 1.3 billion people on the platform – technology that understands natural language is a big part of that, and we think we can help,” said a blog post making the announcement.

Wit.ai’s expertise could bolster Facebook’s strategy for voice control development tools alongside its Parse development platform, assisting “with voice-to-text input for Messenger” and helping “improve Facebook’s understanding of the semantic meaning of voice, and create a Facebook app you can navigate through speech,” points out TechCrunch.

Founded 18 months ago, Wit.ai already has more than 6,000 developers on its platform who have built hundreds of apps and devices. The platform is also reported to remain open and free for everyone.

Read more here.


(Image credit: wit.ai)

The Trouble With Siri

The advances in the field of speech recognition over the past couple of years have been huge. Since Microsoft began working with deep learning neural networks in 2009, the way algorithms detect your dialogue has improved enormously. Since then, many other big names like Google and IBM have adopted deep learning techniques, with dramatic improvements of their own. But there’s one crucial name missing from that list: Apple. Their Siri product was the cornerstone of the iPhone 4S launch, but it has been subject to widespread criticism ever since due to its inaccuracy.

This week brought another crushing blow for Siri. Apple are in the midst of a patent dispute with Zhizhen Network Technology in China, who claim Siri is overly similar to their Xiao i Robot offering. Apple tried to have Xiao i Robot’s patent invalidated, a move which was rejected this week. Zhizhen are now trying to have Siri-based Apple offerings blocked in China, claiming the app infringes their rights.

“Apple believes deeply in protecting innovation, and we take intellectual property rights very seriously,” an Apple spokesman has stated. “Apple created Siri to provide customers with their own personal assistant by using their voice. Unfortunately, we were not aware of Zhizhen’s patent before we introduced Siri, and we do not believe we are using this patent. While a separate court considers this question, we remain open to reasonable discussions with Zhizhen.”

The news comes just weeks after Apple announced they would be moving Siri development in-house, away from their current outsourcing arrangement with Nuance Technologies. They’ll then be aiming to incorporate cutting-edge deep learning techniques into Siri, with good reason: introducing deep learning led to a 25% increase in accuracy (in a field where an increase of a couple of percent is a breakthrough). Microsoft are now in the process of rolling out Skype Translate, a futuristic service which can translate your speech as you talk.

Perhaps altering the makeup of their technology will aid Apple in their patent disputes. But considering highly-advanced deep learning technologies have been around for such a long time, it almost beggars belief that an innovative industry leader like Apple have only now decided to jump on the bandwagon.

Read more here.
(Image credit: Flickr)



