What does it do?

Call someone through Skype. You will hear the other side's voice, but they will hear neither your voice nor the sounds of your environment. When you speak in English, your speech is transcribed to text. That text is then synthesized back to speech, so the other side hears the voice of either IBM's Watson or Stephen Hawking's doppelganger (using eSpeak). It can also translate what you just said, of course! :-)

Okay, so what's the point?

Voice modulation devices are not really that secure. Most of them work by lowering the pitch of the voice, either uniformly or more or less randomly every couple of seconds. That's really a joke: the voice will sound very deep and nothing like your own, but anyone can record that audio and shift the pitch back to recover the original voice. Even if you alter more than the pitch of your voice, you can do almost nothing about your accent, word stress, intonation, or unintentional environmental noises that might indicate who or where you are. WoIP eliminates all of that: instead of hearing your voice, the other side hears something that by definition cannot be attributed to any human (i.e. it's computer generated, with a standard accent, stress and intonation, and zero environmental noise).
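
To make the "shift the pitch back" point concrete, here is a minimal arithmetic sketch. It assumes a uniform shift measured in semitones and a made-up speaker frequency of 180 Hz; real modulators and real voices are more complex, so treat this as an illustration only.

```python
import math

def shift_pitch(freq_hz: float, semitones: float) -> float:
    """Return the frequency after shifting by the given number of semitones."""
    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    return freq_hz * 2.0 ** (semitones / 12.0)

original = 180.0                        # assumed fundamental frequency, in Hz
disguised = shift_pitch(original, -7)   # the modulator lowers the voice by 7 semitones
recovered = shift_pitch(disguised, +7)  # a listener raises the recording by 7 semitones

# The two shifts cancel, so the original frequency comes back.
print(math.isclose(recovered, original))
```

Because a uniform shift is just multiplication by a constant, whoever holds the recording only has to find that one constant to undo it.
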
You can process the text in between, whether it comes from speech recognition or from typing what you want to say instead of saying it. You could translate what you want to say into Spanish, French, or any other language and have it synthesized for the other side to hear. The process feels quite natural for simple enough communications (which is what most people would probably have with someone in an emergency), especially with IBM's Watson as your high-quality speech synthesizer.
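
The "process the text in between" idea can be sketched as a chain of text stages followed by a synthesizer. Everything named here (`relay`, `toy_translate`, `toy_synthesize`, `PHRASEBOOK`) is hypothetical and stands in for the real translation and speech engines; this is not the project's actual code.

```python
from typing import Callable, List

TextStage = Callable[[str], str]

# Hypothetical stand-in for a translation service: a one-entry phrasebook.
PHRASEBOOK = {"i need help": "necesito ayuda"}

def toy_translate(text: str) -> str:
    """Translate known phrases; pass everything else through unchanged."""
    return PHRASEBOOK.get(text.lower(), text)

def toy_synthesize(text: str) -> bytes:
    """Placeholder synthesizer: a real engine would return audio samples."""
    return ("<audio:" + text + ">").encode("utf-8")

def relay(text: str, stages: List[TextStage],
          synthesize: Callable[[str], bytes]) -> bytes:
    """Run recognized or typed text through each stage, then synthesize it."""
    for stage in stages:
        text = stage(text)
    return synthesize(text)

# The input may come from speech recognition or from the keyboard.
audio = relay("I need help", [toy_translate], toy_synthesize)
print(audio)
```

Keeping the stages as plain functions means translation can be added or dropped, or one synthesizer swapped for another, without touching the rest of the chain.
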
With this technique, you could pass basically anything through a phone call instead of what you say. Playing music through Skype (or any other messenger) is just a trivial example.
All the changes happen in real time: you may use Watson with or without translation, switch back to eSpeak, talk or type, or even encrypt a message with AES-128 (or decrypt one) and then slowly recite the ciphertext to the other side, which works on the assumption that both sides know the key.
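
Reciting ciphertext over a call needs some speakable encoding of raw bytes. The sketch below is one possible scheme, not necessarily the one used here: each hex digit becomes a word (digits by name, a to f as NATO-style code words). The names `to_spoken` and `from_spoken` are hypothetical, and the 16 bytes are a placeholder for one AES block, not real ciphertext; the encryption itself is assumed to happen elsewhere.

```python
# One word per hex digit, indexed by the digit's value (0-15).
WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "alfa", "bravo", "charlie", "delta", "echo", "foxtrot",
]

def to_spoken(data: bytes) -> str:
    """Turn bytes into a space-separated string of words, two per byte."""
    return " ".join(WORDS[int(digit, 16)] for digit in data.hex())

def from_spoken(spoken: str) -> bytes:
    """Rebuild the bytes from the words produced by to_spoken."""
    digits = "".join(format(WORDS.index(word), "x") for word in spoken.split())
    return bytes.fromhex(digits)

block = bytes(range(16))  # placeholder for a 16-byte AES block
spoken = to_spoken(block)
print(spoken)
print(from_spoken(spoken) == block)
```

The receiving side writes the words down, converts them back to bytes, and decrypts with the shared key.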