AVATAR IMAGE ANIMATION USING TRANSLATION VECTORS

US 2019/0172243 A1

Assignee
Affectiva, Inc.
Inventors
Taniya Mishra, George Alexander Reichenbach, Rana el Kaliouby
Filing date
November 30 2018
Publication date
June 6 2019
Classifications
CPC: G06K9/00302, G06K9/6256, G06K9/6262, G06T13/40
IPC: G06K9/00, G06K9/62, G06T13/40

Techniques are described for image generation for avatar image animation using translation vectors. An avatar image is obtained for representation on a first computing device. An autoencoder is trained, on a second computing device comprising an artificial neural network, to generate synthetic emotive faces. A plurality of translation vectors is identified corresponding to a plurality of emotion metrics, based on the training. A bottleneck layer within the autoencoder is used to identify the plurality of translation vectors. A subset of the plurality of translation vectors is applied to the avatar image, wherein the subset represents an emotion metric input. The emotion metric input is obtained from facial analysis of an individual. An animated avatar image is generated for the first computing device, based on the applying, wherein the animated avatar image is reflective of the emotion metric input and the avatar image includes vocalizations.


Claims

1. A computer-implemented method for image generation comprising:
obtaining an avatar image for representation on a first computing device;
training an autoencoder, on a second computing device comprising an artificial neural network, to generate synthetic emotive faces;
identifying a plurality of translation vectors corresponding to a plurality of emotion metrics, based on the training;
applying a subset of the plurality of translation vectors to the avatar image, wherein the subset represents an emotion metric input; and
generating an animated avatar image for the first computing device, based on the applying, wherein the animated avatar image is reflective of the emotion metric input.

2-8. (dependent claims not reproduced)

9-10. (canceled)

11-20. (dependent claims not reproduced)

21-23. (canceled)

24-28. (dependent claims not reproduced)

29. A computer program product embodied in a non-transitory computer readable medium for image generation, the computer program product comprising code which causes one or more processors to perform operations of:
obtaining an avatar image for representation on a first computing device;
training an autoencoder, on a second computing device comprising an artificial neural network, to generate synthetic emotive faces;
identifying a plurality of translation vectors corresponding to a plurality of emotion metrics, based on the training;
applying a subset of the plurality of translation vectors to the avatar image, wherein the subset represents an emotion metric input; and
generating an animated avatar image for the first computing device, based on the applying, wherein the animated avatar image is reflective of the emotion metric input.
30. A computer system for image generation comprising:
a memory which stores instructions;
one or more processors attached to the memory wherein the one or more processors, when executing the instructions which are stored, are configured to:
obtain an avatar image for representation on a first computing device;
train an autoencoder, on a second computing device comprising an artificial neural network, to generate synthetic emotive faces;
identify a plurality of translation vectors corresponding to a plurality of emotion metrics, based on the training;
apply a subset of the plurality of translation vectors to the avatar image, wherein the subset represents an emotion metric input; and
generate an animated avatar image for the first computing device, based on the applying, wherein the animated avatar image is reflective of the emotion metric input.

Description

This application claims the benefit of U.S. provisional patent applications "Avatar Image Animation Using Translation Vectors," Ser. No. 62/593,440, filed Dec. 1, 2017, and "Speech Analysis for Cross-Language Mental State Identification," Ser. No. 62/593,449, filed Dec. 1, 2017.

Each of the foregoing applications is hereby incorporated by reference in its entirety.

FIELD OF INVENTION

This application relates generally to image generation and more particularly to avatar image animation using translation vectors.

BACKGROUND

People engage with their various personal electronic devices and computers in order to consume many types of online content. In addition, people use their devices and computers to participate in social networks and social media. The online content includes news, sports, politics, educational information, cute puppy videos, and much, much more. The social networks support sharing, discussions, and commentary, among many other social activities. Users of social networks engage these online platforms or digital soapboxes to boast about their accomplishments; share photographs of their pets, children, and vacations; rant about politics at the local, national, and global levels; and partake in other popular social activities. The social networks can enable a feeling of connectedness, albeit through a screen, since the networks enable friends, family, and followers to keep in touch, even over great geographic distances. While the social networks are not completely sufficient replacements for face-to-face interactions, the online interactions supported by the personal electronic devices are often believed by the users to be quite close. Social networks are very effective at conveying messages that the message authors want to share. The social networks can also collect information from the participants to learn about the participants, to suggest content that might be of interest to the participants, and to track the types of information that are emerging and/or popular on the social networks. This last type of data is used in order to determine the social media content that is trending. The trending information is used to track political activity, the spread of disease throughout a population, and the latest celebrity gossip, among many other possibilities.

An important element of social media is a representation of a person that an individual or other users of social media will associate with the person. The representation can take the form of a profile picture of the person, or it can be some kind of abstraction of the person's character, such as an avatar. An avatar, chosen by the person, can be a powerful tool in representing a person to himself and/or other persons on social media or other digital platforms. Avatars can range from a simple emoji, such as a smiley face, to an abstraction of a person's profile picture, such as an Instagram filter.

SUMMARY

Image generation is used for avatar image animation. The avatar image animation uses translation vectors. An avatar image is obtained for representation on a first computing device. The avatar image can include an emoji, an animated emoji, a cartoon, a video clip, a morphed version of an image of a user, and so on. The computing device can include a laptop and desktop computer; a personal computing device such as a smartphone, a personal digital assistant, a web-enabled e-book reader, and a tablet computer; a wearable computing device such as a smart watch and smart glasses; etc. An autoencoder is trained, on a second computing device comprising an artificial neural network, to generate synthetic emotive faces. An autoencoder can include a feedforward, non-recurrent neural network such as a convolutional neural network, a deep neural network, etc. An autoencoder can include unsupervised learning models. An autoencoder can include multiple layers in the neural network. The second computing device can be a device similar to the first computing device. The second computing device can be a device different from the first computing device, such as a server. A plurality of translation vectors which correspond to a plurality of emotion metrics is identified based on the training. The translation vectors can be used to map an avatar image with one facial expression to an avatar image with a different facial expression. The different expression can include a smile, frown, smirk, laugh, etc. A subset of the plurality of translation vectors is applied to the avatar image, wherein the subset represents an emotion metric input. Based on the emotion metric input, the translation vectors can be applied to map an avatar image with a neutral expression to an avatar image with a different expression. An animated avatar image is generated for the first computing device, based on the applying, wherein the animated avatar image is reflective of the emotion metric input.
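By way of a non-limiting illustration, the overall flow can be pictured in a few lines of code. The encoder, decoder, and per-emotion translation vectors below are hypothetical placeholders standing in for the trained autoencoder components described above; this is a sketch of the idea, not the disclosed implementation.

    # Sketch only: encode(), decode(), and translation_vectors are assumed to come
    # from a trained autoencoder; they are placeholders, not disclosed components.
    def animate_avatar(avatar_image, emotion_metric, translation_vectors, encode, decode):
        """Shift the avatar's latent code by the translation vector for one emotion."""
        latent = encode(avatar_image)                 # bottleneck representation of the avatar
        name, intensity = emotion_metric              # e.g. ("smile", 0.8)
        shifted = latent + intensity * translation_vectors[name]
        return decode(shifted)                        # emotive avatar frame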

The emotion metric input can be obtained from facial analysis of an individual. The facial analysis can be based on using classifiers, using a deep neural network, and so on. The animated avatar can represent facial expressions of the individual. The animated emoji, cartoon, morphed image, etc. can represent a smile, a smirk, a frown, a laugh, a yawn, etc. The facial expression can be identified using a software development kit (SDK). The software development kit can be provided by a vendor, obtained as shareware, and so on. The animated avatar can represent an empathetic mirroring of the individual. In embodiments, the empathetic mirroring can cause the avatar to have a similar expression to the individual. The similar expression can include a smile in reaction to a smile, a smirk in reaction to a smirk, and so on. In other embodiments, the empathetic mirroring can cause the avatar to have a complementary expression to the individual. The complementary expression can include a sad expression in reaction to crying, a thinking expression in response to anger, etc.

Various features, aspects, and advantages of numerous embodiments will become more apparent from the following description.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description of certain embodiments may be understood by reference to the following figures wherein:

FIG. 1 is a flow diagram for an emotion-enabled animation.

FIG. 2 is a flow diagram for generating vocalizations.

FIG. 3 is a flow diagram for identifying translation vectors.

FIG. 4 illustrates an empathy avatar.

FIG. 5 shows an autoencoder.

FIG. 6 is an example illustrating translation vectors.

FIG. 7 is a diagram showing image and audio collection including multiple mobile devices.

FIG. 8 illustrates live streaming of social video and audio.

FIG. 9 shows data collection including devices and locations.

FIG. 10 is an example showing a convolutional neural network.

FIG. 11 shows a bottleneck layer.

FIG. 12 is a flow diagram for detecting expressions.

FIG. 13 is a flow diagram for the large-scale clustering of events.

FIG. 14 is an example illustrating unsupervised clustering of features and characterizations of cluster profiles.

FIG. 15 is a diagram of a system for an emotion-enabled avatar for animating emotions.

DETAILED DESCRIPTION

Individuals can interact with friends, family, followers, like-minded people, and others, using a variety of social networking platforms. The social networking platforms are readily accessible using any of a constellation of electronic devices such as smartphones, personal digital assistants, tablets, laptops, and so on. The devices enable the users to share content and to view and interact with websites and the contents of the websites such as streaming media, social media, newsfeeds, information channels, and numerous other channels. Viewing and interacting with various channels can induce emotions, moods, and mental states in the individuals. The channels can inform, amuse, entertain, annoy, anger, bore, etc., those who view the channels. As a result, the emotion or emotions of a given individual can be directly impacted by interacting with the shared channels. The emotion or emotions of the given individual can be shared and displayed by friends, followers, and those whom they follow, etc. The sharing of an emotion can be realized using an avatar image animation. The emotion that is displayed using the animated avatar can represent an empathetic mirroring of the individual. That is, the avatar can have an expression similar to that of the individual, and the avatar can have an expression complementary to that of the individual. For a similar expression, the avatar can display a smile while the user is smiling, a frown when the user is frowning, etc. For a complementary expression, the avatar can display a sad face while the individual is crying, a thoughtful face when the individual is angry, and so on.

The use of the social networks has become widespread among many types of users. The social networks enable written exchanges, sharing of photos and videos, and support of audio and video interactions. While the social networks are not completely sufficient replacements for face-to-face interactions, the online interactions supported by the personal electronic devices are often believed to be quite close and intimate. Social networks are very effective at conveying messages that the message authors want to share. Yet, as those familiar with communication formats such as text, audio, and video chat, teleconferences, and videoconferences will confirm, not all information exchanged between participants is verbal or visual. The principal reason for a face-to-face interaction is to be able to observe non-verbal information such as body language, eye contact, facial expression, non-verbal vocalizations, and so on. These additional communication modes can greatly influence emotions of all members in an exchange. Various emotion metrics of an individual can be reflected by applying a subset of a plurality of translation vectors to an avatar image. The emotion metrics can be determined from the interaction of an individual with a social network site or other site. The translation vectors can be identified based on training an autoencoder, where the autoencoder can be based on an artificial neural network. The artificial neural network can include a convolutional neural network, a deep neural network, and so on.

In disclosed techniques, image generation is used for avatar image animation. The avatar image animation uses translation vectors. An avatar image is obtained for representation on a first computing device. An autoencoder is trained, on a second computing device comprising an artificial neural network, to generate synthetic emotive faces. A plurality of translation vectors corresponding to a plurality of emotion metrics is identified, based on the training. A subset of the plurality of translation vectors is applied to the avatar image, wherein the subset represents an emotion metric input. An animated avatar image is generated for the first computing device, based on the applying, wherein the animated avatar image is reflective of the emotion metric input.

The animated avatar image can represent a mirroring of emotions. For example, in response to a person smiling, the animated avatar image can smile back. In response to a person laughing, the animated avatar image can laugh back, which includes both visual and vocal animation. The mirroring can take the form of empathetic mirroring or complementary mirroring, in which the animated avatar image response can include an animation to show empathy for or complement a person's emotion. For example, a sad face with red eyes analyzed from an image of a person can result in the animated avatar image shedding tears in empathy. Likewise, a sad face with a description of being offended by someone can result in the animated avatar image shaking its head in empathetic disbelief. Mirroring can include mirrored vocal responses and mirroring of a person's gestures, as well. It should be understood that emotions in this context can be a proxy for what are normally considered emotions, such as happiness or sadness, as well as other forms of personal and mental condition such as mental states, cognitive states, and so on. For example, a mental state may include concentration, and a cognitive state may include distractedness. In some contexts, mental state or cognitive state may be the proxy term of preference. Thus, in some embodiments, the avatar image includes vocal mirroring. And in other embodiments, the avatar image includes complementary emotions. And in yet other embodiments, the avatar image includes empathetic mirroring.
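One simple way to picture the choice between empathetic and complementary mirroring is a lookup from the detected state to an avatar response. The pairings below are illustrative examples drawn from the passage above, not an exhaustive or required mapping.

    # Illustrative mapping of detected states to avatar responses (examples from the text above).
    EMPATHETIC = {
        "smiling": "smile",
        "laughing": "laugh",          # both visual and vocal animation
        "frowning": "frown",
    }
    COMPLEMENTARY = {
        "crying": "shed_tears",
        "offended": "shake_head",
        "angry": "thoughtful_face",
    }

    def avatar_response(detected_state, mode="empathetic"):
        table = EMPATHETIC if mode == "empathetic" else COMPLEMENTARY
        return table.get(detected_state, "neutral")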

The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile from the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus, some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated.

FIG. 1 is a flow diagram for an emotion-enabled animation. Various disclosed techniques include image generation for avatar image animation using translation vectors. The flow 100 includes obtaining an avatar image 110 for representation on a first computing device. The avatar image can be based on one or more images of a person, a morphed image, and the like. The avatar image can be based on an emoji, an animated emoji, a cartoon, and so on. In embodiments, the avatar image can include a humanoid face. The humanoid face can be a simulated face, a cartoon face, a character face, and so on. In embodiments, the avatar image includes vocalizations. The vocalizations can include speech vocalizations, non-speech vocalizations, etc. The first computing device can include a personal electronic device such as a smartphone, a personal digital assistant (PDA), and a tablet computer. The first computing device can include a wearable device such as a smart watch, smart glasses, a smart garment, etc. The first computing device can be a laptop computer, a desktop computer, etc. The flow 100 includes training an autoencoder 120, on a second computing device comprising an artificial neural network, to generate synthetic emotive faces. The artificial neural network can include a convolutional neural network, a deep neural network, and so on. The second computing device can be similar to the first computing device or can be different from the first computing device. The second computing device can be a local server, a remote server, a blade server, a distributed server, a cloud server, and so on. Various types of autoencoders can be used. In embodiments, the training the autoencoder can include using a variational autoencoder 122. In other embodiments, the training the autoencoder can include using a generative autoencoder 124. In embodiments, the training is based on a plurality of facial videos of pre-catalogued facial emotion expressions.
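As a minimal sketch of such training, the code below builds a small dense autoencoder with an explicit bottleneck layer and trains it to reconstruct face crops; the 64x64 grayscale input size, layer widths, and training loop are assumptions for illustration, not the particular network of the disclosure.

    import torch
    import torch.nn as nn

    class FaceAutoencoder(nn.Module):
        """Small dense autoencoder; the narrow bottleneck layer holds the latent code."""
        def __init__(self, image_dim=64 * 64, bottleneck_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(image_dim, 512), nn.ReLU(),
                nn.Linear(512, bottleneck_dim),           # bottleneck layer
            )
            self.decoder = nn.Sequential(
                nn.Linear(bottleneck_dim, 512), nn.ReLU(),
                nn.Linear(512, image_dim), nn.Sigmoid(),  # pixel values in [0, 1]
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    def train(model, face_batches, epochs=10, lr=1e-3):
        """Reconstruction training on faces from pre-catalogued expression videos."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for faces in face_batches:                    # (batch, 64*64) float tensors
                opt.zero_grad()
                loss = loss_fn(model(faces), faces)
                loss.backward()
                opt.step()
        return model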

The flow 100 includes identifying a plurality of translation vectors corresponding to a plurality of emotion metrics 130, based on the training. The translation vectors can be used to translate an avatar image, including a humanoid face, from one expression of an emotion to another expression of the same emotion or to a different emotion. The translation vectors can correspond to emotion metrics, where the emotion metrics can be used to determine one or more emotions, an intensity of an emotion, a duration of an emotion, and so on. The emotions can include happy, sad, angry, bored, and so on. In embodiments, the emotion metric input is obtained from facial analysis of an individual. The facial analysis can be based on one or more images captured from the individual. In embodiments, the facial expression is identified using a software development kit (SDK). The software development kit can be obtained from the provider of the animated avatar, from a third party, from shareware, and so on. In embodiments, the identifying the plurality of translation vectors uses a bottleneck layer 132 within the autoencoder. The bottleneck layer can include fewer nodes than the one or more preceding hidden layers in an artificial neural network. The bottleneck layer can create a constriction in the artificial neural network. The bottleneck layer can force information that is pertinent to a classification, for example, into a low-dimensional representation. The flow 100 can further include generating a first set of bottleneck layer parameters, from the bottleneck layer, learned for a neutral face 134. The first set of bottleneck layer parameters can be used to identify characteristics of the neutral face. The characteristics of the neutral face can include lip position, eyelid position, and so on. The neutral face can be the humanoid face, a cartoon face, and so on. The flow 100 further includes generating a second set of bottleneck layer parameters for an emotional face 136. The second set of bottleneck layer parameters can be used for determining the one or more emotions of the emotional face. The second set of bottleneck layer parameters can be used to identify emotions based on non-speech vocalizations such as laughter, cries, sighs, squeals, yawns, grunts, clicks, filled pauses, unfilled pauses, and so on. The flow 100 further includes subtracting the first set of bottleneck layer parameters from the second set of bottleneck layer parameters 138 for use in the translation vectors. The subtracting the first set of bottleneck layer parameters from the second set of bottleneck layer parameters can be used to map the transition from the face with the neutral expression to the face with the emotional expression. The mapping can include intermediate steps between the neutral face and the emotional face so that the avatar animation can show the onset of the emotional face, variations of the emotional face such as head movement and blinking eyes, the decay of the emotional face, and so on.
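A hedged sketch of the subtraction step follows. Here encode() stands for the bottleneck-layer activations of a trained autoencoder such as the one sketched above, and averaging over several frames is an added assumption for stability rather than a step recited in the flow.

    import numpy as np

    def translation_vector(encode, neutral_faces, emotional_faces):
        """Translation vector = emotional bottleneck parameters minus neutral parameters."""
        neutral_params = np.mean([encode(f) for f in neutral_faces], axis=0)      # first set
        emotional_params = np.mean([encode(f) for f in emotional_faces], axis=0)  # second set
        return emotional_params - neutral_params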

The flow 100 includes applying a subset of the plurality of translation vectors to the avatar image 140, wherein the subset represents an emotion metric input. Many translation vectors can be identified in order to translate a neutral avatar face such as a humanoid face to an emotional avatar face. The emotional face can be derived from the neutral face by using a subset of the translation vectors. A happy face can result from using a subset of the translation vectors, a laughing face can result from using a subset of the translation vectors, and so on. The subsets of translation vectors may overlap or may not overlap, depending on the desired emotional face. The flow 100 includes reinforcing learning 142 of one or more bottleneck layers. Feedback can be provided, either manually or automatically, to further train a bottleneck layer based on responses from a person to a currently displayed avatar image.
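One way to picture applying only a subset of the vectors is sketched below, under the assumption that vectors for individual emotions can be scaled by an intensity and summed in the latent space; the dictionary layout of the emotion metric input is hypothetical.

    import numpy as np

    def apply_translation_subset(neutral_latent, translation_vectors, emotion_metrics):
        """Shift a neutral avatar latent code by only the vectors named in the input.

        emotion_metrics: mapping of emotion name to intensity in [0, 1],
        e.g. {"happy": 0.7} or {"happy": 0.5, "surprise": 0.3}.
        """
        latent = np.array(neutral_latent, dtype=float)
        for name, intensity in emotion_metrics.items():   # only the requested subset
            latent += intensity * translation_vectors[name]
        return latent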

The flow 100 includes generating an animated avatar image 150 for the first computing device, based on the applying, wherein the animated avatar image is reflective of the emotion metric input. The generated animated avatar image can be rendered on a screen coupled to the first computing device. The generated animated avatar image can be rendered in a program, an app, a webpage displayed in a web browser, and so on. The animated avatar can represent facial expressions of an individual. The individual can be the user of the first computing device. In embodiments, the avatar image includes body language. The body language can include body position, body orientation, body movement, and so on. In embodiments, the generating further includes vocalizations 152 based on the emotion metric input. The vocalizations can include speech vocalizations, non-speech vocalizations, etc. Various steps in the flow 100 may be changed in order, repeated, omitted, or the like without departing from the disclosed concepts. Various embodiments of the flow 100 can be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
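A sketch of generating animation frames by ramping the latent shift in, holding it, and letting it decay is shown below; the onset/hold/decay envelope and frame counts are illustrative assumptions consistent with the onset and decay described above.

    import numpy as np

    def animate(decode, neutral_latent, emotional_latent, onset=5, hold=10, decay=5):
        """Decode frames that show the onset, hold, and decay of the emotional face."""
        weights = np.concatenate([
            np.linspace(0.0, 1.0, onset),    # onset of the emotional face
            np.ones(hold),                   # hold (head movement, blinking omitted)
            np.linspace(1.0, 0.0, decay),    # decay back toward the neutral face
        ])
        return [decode((1.0 - w) * neutral_latent + w * emotional_latent) for w in weights]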

FIG. 2 is a flow diagram for generating vocalizations. Image generation is performed for avatar image animation using translation vectors. The flow 200 includes collecting audio data 210. The audio data can be collected using a microphone, audio transducer, or other audio capture apparatus coupled to a first computing device. In embodiments, the audio capture apparatus, such as a microphone, can be positioned so that it can be used to capture speech, non-speech vocalizations, other vocalizations, etc. The flow 200 includes training an autoencoder 220, on a second computing device comprising an artificial neural network, to generate synthetic emotive voices. The autoencoder can include a variational autoencoder, a generative autoencoder, etc. The second computing device can be a device similar to the first computing device, a local server, a remote server, a cloud server, a distributed computing device, and so on. The artificial neural network can include a convolutional neural network, a deep neural network, and the like. The generated synthetic emotive voice can be based on a selected voice 222. The selected voice can include a neutral voice. The synthetic emotive voice can be selected based on gender, race, age, user preference, and so on. The selected voice can be based on selection of an avatar, where the avatar image includes vocalizations. In embodiments, the avatar includes non-speech vocalizations.

The flow 200 includes discriminating non-speech vocalizations 230 from speech vocalizations within the collected audio data. The collected audio data can include vocalizations, where the vocalizations can include non-speech vocalizations. The non-speech vocalizations can include grunts, whistles, clicks, groans, and so on. In embodiments, the non-speech vocalizations include sighs, laughter, or yawns. The discriminating non-speech vocalizations includes differentiating sighs, laughter, or yawns 232 from the other non-speech vocalizations within the captured audio data. The non-speech vocalizations can be differentiated using an algorithm, a heuristic, a code segment, a code library, and so on. In embodiments, laughter can be identified in the non-speech vocalizations using a laughter classifier within a software development kit (SDK) 234. The software development kit can be supplied by the provider of a social network platform, by a third party, through shareware, etc. In embodiments, the non-speech vocalizations can include laughter, cries, sighs, squeals, yawns, grunts, filled pauses, and unfilled pauses.
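A minimal sketch of the discrimination step is given below. The speech and laughter classifiers are hypothetical stand-ins for the classifiers described above (for example, a laughter classifier provided within an SDK); no particular SDK interface is implied.

    def discriminate_vocalizations(segments, speech_classifier, laughter_classifier):
        """Split audio segments into speech and non-speech, then tag laughter.

        Each classifier is assumed to take a 1-D audio array and return a probability.
        """
        speech, non_speech = [], []
        for seg in segments:
            if speech_classifier(seg) > 0.5:
                speech.append(seg)
            else:
                label = "laughter" if laughter_classifier(seg) > 0.5 else "other"
                non_speech.append((label, seg))
        return speech, non_speech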

The flow 200 includes identifying a plurality of translation vectors 240 corresponding to a plurality of emotion metrics, based on the training. The translation vectors can translate the neutral voice to another voice based on one or more emotion metrics. The other voice can include a happy voice, a contented voice, a bored voice, a whining voice, an angry voice, and so on. The translation vectors can include translating the neutral voice or a non-speech vocalization to another non-speech vocalization. The flow 200 includes preprocessing the selected voice 250. The selected voice can be preprocessed to produce other voices based on the emotion metrics. The selected voice can be preprocessed to produce non-speech vocalizations based on the emotion metrics. The flow 200 includes generating vocalizations 260 based on the emotion metric input. The generating vocalizations can include generating speech vocalizations, non-speech vocalizations, and so on. In embodiments, the vocalizations are based on preprocessing a voice used with the animated avatar 262. The preprocessing the voice used with the animated avatar can include matching the voice to demographic or other characteristics such as age, gender, preferred accent, speech rate, and so on. Various steps in the flow 200 may be changed in order, repeated, omitted, or the like without departing from the disclosed concepts. Various embodiments of the flow 200 can be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
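By analogy with the image case, a hedged sketch of emotive-voice generation in a latent space follows; voice_encode, voice_decode, and the per-emotion voice translation vectors are assumed components for illustration, not interfaces recited in the disclosure.

    def generate_vocalization(selected_voice_features, emotion_metrics,
                              voice_translation_vectors, voice_encode, voice_decode):
        """Shift the selected (neutral) voice toward the requested emotions, then synthesize.

        selected_voice_features: e.g. spectrogram frames of the preprocessed selected voice.
        """
        latent = voice_encode(selected_voice_features)
        for name, intensity in emotion_metrics.items():
            latent = latent + intensity * voice_translation_vectors[name]
        return voice_decode(latent)   # emotive speech or non-speech vocalization audio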

FIG. 3 is a flow diagram for identifying translation vectors. Data including image data and audio data can be collected from a person interacting with a computing device. The data that is collected can be used for image generation for avatar image animation using translation vectors. The plurality of translation vectors can be identified using a bottleneck layer within an autoencoder such as a variational autoencoder and a generative autoencoder. The flow 300 includes obtaining an emotional face 310. The emotional face can be obtained using a computing device with which a person is interacting, a webcam, and so on. The emotional face can be an avatar image that represents a given emotion. The emotion can be based on a facial expression such as a smile, frown, yawn, smirk, laugh, and so on. The flow 300 includes obtaining a neutral face 312. The neutral face can be obtained using the computing device, the webcam, etc. The flow 300 includes learning parameters 320. The parameters can be related to a layer within a convolutional neural network, a deep neural network, and so on. The learning can include generating parameters for the layers of the convolutional neural network. In embodiments, the generating includes generating a first set of bottleneck layer parameters, from the bottleneck layer, learned for a neutral face. The parameters can be used for identifying the neutral face in collected video data. In other embodiments, the generating includes generating a second set of bottleneck layer parameters for an emotional face.
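Pulling the FIG. 3 steps together, one hedged sketch of learning bottleneck parameters for the neutral face and for each catalogued emotional face, and keeping one translation vector per emotion, is shown below; the per-emotion dictionary layout is an assumption for illustration.

    import numpy as np

    def learn_translation_vectors(encode, neutral_faces, emotional_faces_by_label):
        """Learn neutral and emotional bottleneck parameters, then keep their differences."""
        neutral_params = np.mean([encode(f) for f in neutral_faces], axis=0)
        vectors = {}
        for label, faces in emotional_faces_by_label.items():   # e.g. "happy", "sad", ...
            emotional_params = np.mean([encode(f) for f in faces], axis=0)
            vectors[label] = emotional_params - neutral_params
        return vectors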
