Oct 10, 2021
Laura Berrill

Five Minutes With

Technology Magazine speaks to Jesse Shemen, CEO and Co-Founder, Papercup

Tell me about Papercup

Papercup enables companies to scale high-quality dubbing through speech translation technology. Our pioneering tech translates voices into different languages with output sound that is indistinguishable from human speech, reflecting the characteristics of the original speaker. The speech is then fed into a human-in-the-loop editing interface (think Unbabel or Verbit) which allows for the output to be customised based on a customer’s specifications. 

Why did we build this? Well, the majority of the globe’s media content is shackled to a single language: billions of hours of videos on YouTube, millions of podcast episodes, hundreds of thousands of corporate training videos, tens of thousands of classes on Coursera, and thousands of hours of content on streaming sites. Content owners are scrambling to go international, yet there is no simple and cost-effective way to translate content beyond subtitling. This is what we set out to tackle with Papercup.

The goal of our technology is simple. We want to make all voice-based content, whether a Netflix series or a YouTube video, consumable in any and every language. Today, we work with the BBC, Sky News, Business Insider and Yoga with Adriene to improve their global reach.

Where did the idea come from and what challenge do you look to solve?

Papercup emerged from the scenario I describe above. It was such a clear problem that needed solving - it was more a question of whether we could build the tech to help chip away at the issue.

Every minute, 500 hours of content is uploaded to YouTube - that gives you a sense of the sheer amount of video created every single day. The video landscape is constantly evolving, but most video content is stuck in a single language, making it inaccessible to audiences around the world. This means that the video viewing experience is not ideal for the six billion non-English speakers.

While subtitles are great, people prefer to watch content in their own languages - we also know from our own studies that viewer retention is far greater when people can hear/watch the content instead of having to read subtitles. We saw this huge opportunity to launch a voice translation tool for video across media and enterprise. Our automated technology translates voices into a variety of languages to increase accessibility for audiences no matter what language they happen to speak.

How do you steer the ship, in what direction and why?

I steer the ship with the help of an ambitious, curious, humble crew that spans machine learning, product and customer. These are people we’ve carefully recruited, who believe in what we’re doing and bring their expertise to the voyage. I make sure that I’m listening to and acting on what they’re telling me. We're very intentional about who we bring on board - much to the dismay of the team, it takes us a long time to recruit for any role. It's so tempting to succumb to the short-term pain of missing a crucial person, but we make sure to stand our ground and wait until we find someone who fits the bill.

Our total addressable market is massive – essentially the ambition is to make any video in the world watchable in any language. My job is to define the short- and medium-term goals and constantly help the teams prioritize and focus wholly on the things that will move the needle. Maintaining a clear focus and keeping multiple teams on track requires discipline and foresight from everyone on the team, otherwise it's so easy for scope to expand to a level we just can't tackle with the team size we have.

Maintaining our current focus allows us to capture the most pressing demand for our solution – media companies and corporates. For the former, we are helping them reach audiences they previously couldn't access, like Business Insider with Spanish-speaking audiences. The latter, corporates, can suddenly communicate more effectively with their non-native English speakers.

Highs and lows of the day job

There have been lots of highs – our technology has enabled some of the world’s biggest media companies like Business Insider and Sky to distribute viral content in new languages. Our world-class machine learning team has had research papers accepted into prestigious conferences this year. The content translated with our technology has reached over 150 million people this year alone - a staggering number. 

On the lows front, let’s call them expected challenges! It's never as rosy as you see on LinkedIn. In a fast-growing startup, it’s about maintaining momentum and ensuring the teams are buoyed and feel engaged with what we’re trying to achieve. You're bound to walk into problems literally every day. For me, the difficulty is not spending the time I want on the things I know will move the needle that day. There are always distractions that need my attention in the moment and it's not exactly the simplest task figuring out not what to prioritize, but instead what you know will have to be sacrificed.

Biggest mistake you’ve made in business so far and what did you learn from it?

In the beginning we held off on recruiting more staff until we really needed them. This really delayed progress, but it has taught us to be more proactive with hiring and to try to forecast the roles we’ll need in the future. Now we constantly build relationships in relevant fields even when we don’t have open roles, knowing that we will be able to call on these people in the future – whether that’s for recommendations or in case they’re thinking of their next career move.

Plans for the future

From the start, we’ve been on a mission to make all videos across the world accessible in every language. We're doing this in two ways. First, through fundamental research in machine learning which improves the expressivity and naturalness of our synthetic voices while we roll out new languages. Second, by educating a whole new market on dubbing or localization. Many had not considered this before because it was too expensive, but now we’re tackling more regions, content types and languages to make video content accessible to global audiences.

Eventually we want to be able to extend our state-of-the-art technology to any form of human dialogue – allowing any two people to engage in a conversation regardless of what language they happen to speak. In other words – your voice, in another language.

Anything exciting coming your business’ way?

I think one of the more exciting shifts we’re seeing is the maturation of the FAST market - free ad-supported streaming. It’s not premium streaming like Netflix or Hulu, but instead is another home for catalogues of content that don’t fit the bill for Netflix and would be underutilized on something like YouTube. We’re already working with content partners distributing on platforms such as Pluto - I’m excited to see where we can take this. 

Lessons learned from past experiences

We’re creating a new category and launching a new product that is unfamiliar to people. What’s become clear over time, the more we speak to prospective clients, is that educating the market is a huge part of launching cutting-edge technology. Showing people what the technology can do (how it can rapidly give companies a global presence or improve their employee training, for instance) is now integral to the conversations we’re having with media companies and enterprises from the very beginning.

People are mainly accustomed to traditional translation services, so the concept of using AI to translate video content at scale requires education. How do we generate highly natural voices? Can we ensure we hit a certain quality level? Where should I distribute the content? These are the questions we now address when we have initial conversations with prospective clients and we find that they tend to be amazed by the quality of the output and the impact it can have on business.

 
