In a previous blogpost we talked about the Opus codec, which offers very low bitrates. Another codec seeking to achieve even lower bitrates is Codec 2.
Codec 2 is designed for use with speech only, and although the bitrates are impressive, the results aren’t as clear as Opus, as you can hear in the following audio examples. However, there is some interesting work being done with Codec 2 in combination with neural networks (WaveNet) that is yielding great results.
Very interesting work, but even the best-sounding examples (with the WaveNet decoder) interact badly with my tinnitus, which makes them unpleasant to listen to.
I would be curious to know if other people with tinnitus have a similar reaction.
Very promising. I hope they can keep improving using these techniques.
I have tinnitus, and the audio samples sound OK to me… (but perhaps our conditions are somewhat different; my tinnitus is typically masked by ambient noise / music (why I listen practically all the time to something in the background); I “only” can’t experience silence… )
The Amiga had the so called “narrator.device”, a software to read textfiles and output them via the soundchip Paula.
It sounded quite similar to the low bitrate examples here … but I guess the pure ASCII encoding is even more compact.
The speech synthesis of the Amiga (aka SPEECH.TOS on the Atari) was only relevant for English and used phonemes that made the rendering rather robotic. Sure, the idea was there: you only had to type text (and possibly add “intonation” marks). But this is far from what this codec wants to achieve, which is compression of spoken language in any language.
Btw, the Opus codec is already pretty impressive, too. Too bad it isn’t more widespread.
Not quite right. If you feed the narrator pure text you get a very robotic output, but there are ways to improve the voice output:
See:
http://amigadev.elowar.com/read/ADCD_2.1/Devices_Manual_guide/node0…
http://amigadev.elowar.com/read/ADCD_2.1/Devices_Manual_guide/node0…
http://amigadev.elowar.com/read/ADCD_2.1/Includes_and_Autodocs_2._g…
Wow, it is amazing! Maybe similar techniques could render pleasing output from a 64kbps audio stream, and perfect, or close, at 128kbps. And maybe produce pleasing images from heavily compressed video instead of blocky garbage…
I saw not long ago an article about some AI technique being applied to images that would fill in the details as the image was blown up, for a dumb CSI-like infinite image zoom. The details that were added in were fictional, there is no escaping the laws of linear systems, but the results looked much better than colored square pixels: no good for zooming security cam footage into iris recognition of a criminal three blocks away, but great for watching from the couch.
That won’t stop the general public, who have proven to be easily duped by the badly photoshopped images, from believing such “filled in” footage.
There are codecs which do just that, aoTuV Ogg Vorbis, HE-AAC+ or Opus for example.
Or does Opus also have a speech-only mode?
Opus does speech and music dynamically. So Skype will use it, Youtube already uses it and Spotify will probably start to use it more and more.
This is really a robot voice trying to emulate your speech.
More or less. The wavenet decoder is a robot reconstructing the sound of your voice. But it sometimes gets it wrong – it converts “of” to a sound more like “in” in the male voice sample.
It’s interesting but not 100% there yet.
https://www.youtube.com/watch?v=artEQkifGG0
https://csdb.dk/release/?id=94453
Released in 2010, this demo shows advanced (at least in C64 terms) audio processing and includes about 2 minutes of an a cappella version of Tom’s Diner by Suzanne Vega, all in 46kB, running on a computer with a 0.99MHz processor and 64kB of RAM.
And there have been advancements in that field on the C64 since then, so now we have productions like Wilde that pack audio and video into a 1MB cartridge:
https://www.youtube.com/watch?v=FT-LJ2pqTWo
https://csdb.dk/release/?id=165392
or SSDPCM2 V3, which streams audio from a 170kB 1541 floppy disk:
https://www.youtube.com/watch?v=O7fihHkF4wY
https://csdb.dk/release/?id=162796
So, doing something similar on a machine with, by comparison, unlimited resources maybe isn’t all that revolutionary. I’m not saying that Codec 2 is bad, on the contrary, but similar things were done before.
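For a rough sense of scale, the 46kB, roughly 2-minute demo mentioned above can be turned into an average bitrate. This is only a back-of-the-envelope sketch: it assumes 1 kB = 1024 bytes, a clip of exactly 120 seconds, and that the whole 46kB is audio data (it also contains the player code, so the real audio bitrate would be somewhat lower). The function name is made up for illustration.

```python
def average_bitrate_bps(size_bytes: int, duration_s: float) -> float:
    """Average bitrate in bits per second for size_bytes played over duration_s."""
    return size_bytes * 8 / duration_s


# Assumptions (not from the demo's documentation): 1 kB = 1024 bytes,
# the full 46 kB is audio payload, and the clip lasts exactly 120 s.
demo_bps = average_bitrate_bps(46 * 1024, 120.0)
print(f"{demo_bps:.0f} bit/s")  # prints "3140 bit/s"
```

That lands at roughly 3 kbit/s under these assumptions, which is at least in low-bitrate speech codec territory.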
Edited 2018-06-25 17:31 UTC
Fooking around with codecs on C64 is fun.. if you are very bored, and regressed into smut I guess
https://www.youtube.com/watch?v=HtRxD5Zkanw
I tested a little codec theory myself as well, doing 4-bit resolution audio with high-frequency dither, so that lowpass filtering on decode averages it back to a higher resolution. It was not that bad. Personally I think it sounded better than mp3. Probably good if you are just looking for that type of size reduction.
But of course I would really rather have 24-bit streaming on streaming services, lol.
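The exact scheme isn’t described in the comment, so the following is only a minimal sketch of the general idea, not the commenter’s code. It assumes plain white-noise dither of ±0.5 LSB (rather than specifically high-frequency dither) and a simple centred moving average standing in for the decode-side lowpass filter. All function names and parameters here are hypothetical.

```python
import math
import random

LEVELS = 16  # 4 bits per sample


def encode_4bit(samples, rng):
    """Quantize floats in [-1, 1] to 4-bit codes (0..15), adding +/-0.5 LSB dither first."""
    codes = []
    for x in samples:
        scaled = (x + 1.0) / 2.0 * (LEVELS - 1)      # map [-1, 1] onto [0, 15]
        dithered = scaled + rng.uniform(-0.5, 0.5)   # dither turns rounding error into noise
        codes.append(min(LEVELS - 1, max(0, round(dithered))))
    return codes


def decode_4bit(codes, taps=9):
    """Map codes back to [-1, 1], then smooth with a centred moving average (crude lowpass)."""
    raw = [c / (LEVELS - 1) * 2.0 - 1.0 for c in codes]
    half = taps // 2
    out = []
    for i in range(len(raw)):
        window = raw[max(0, i - half): i + half + 1]  # shorter window at the edges
        out.append(sum(window) / len(window))
    return out


def rms_error(a, b):
    """Root-mean-square difference between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


rate = 44100
# Test signal (an assumption for the demo): 0.1 s of a 440 Hz sine at 0.9 amplitude.
tone = [0.9 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
codes = encode_4bit(tone, random.Random(0))
decoded = decode_4bit(codes)
step = 2.0 / (LEVELS - 1)
print(f"RMS error: {rms_error(tone, decoded):.4f} (one 4-bit step is {step:.4f})")
```

The averaging trades bandwidth for resolution: the wider the filter, the more dither noise is averaged out, but the more high-frequency content is lost along with it.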
Edited 2018-06-25 20:21 UTC
I find that unlikely from what you described / especially since mp3 encoded by LAME presets is excellent. Did you ABX it?
No, mp3 is actually rather poor; it was probably used because they did not know better at the time. This preserves phase and the original waveshape. It is a bit lowpass-filtered, but so is mp3. It could of course also be emphasised to compensate for the lowpass filtering. But you need some room for the high-frequency pulses to average out in the filtered output, so expect to lose at least 15 kHz and upwards.
PS: It needs 80-bit calculation for the best result. I also suggest a BSD 3-Clause licence for source-available endeavours.
Peace (Go With Théé)
Edited 2018-06-28 15:06 UTC
PS. I forgot to quote your “Personally I think it sounded better than mp3.” in my post above; that’s what it referred to. And it seems slightly like lossyWAV.
Anyway, I wouldn’t call LAME-encoded mp3s “poor” – ABX listening tests show they are transparent at quite reasonable (<200kbps) bitrates. “Preserves waveshape” doesn’t mean much; audio is for listening, not for looking at waveshapes…
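For anyone who does run an ABX test, the usual way to read the result is a one-sided binomial test against pure guessing. The sketch below shows that calculation; the function name and the 12-out-of-16 session are made up for illustration, not results from any actual test.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` answers right out of `trials` by guessing (p = 0.5)."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


# Hypothetical session: 12 correct identifications out of 16 trials.
print(f"p = {abx_p_value(12, 16):.4f}")  # prints "p = 0.0384"
```

A small p-value suggests the listener really can tell the two versions apart; a large one means the result is consistent with guessing.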