A long exegesis on the nature of the animation system, specifically going through how to make a Hierarchical Fuzzy State Machine (HFuSM, pronounced "Hi-Fuzzy" for simplicity's sake) that would procedurally create body language from basic primitives.
The approach is to isolate behavior into compound states and primitives.
The events that control the gesture/animation system can come from a number of different places:
Movement
When the user moves, the animation system is required to keep track of the user's physical state and issue the relevant events. Looking, walking forward, turning around, and ducking all cause either animation data to play or a targeted real-time IK solution to be calculated.
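A rough sketch of that dispatch, in Python purely for illustration (the event names and return values are invented, not tied to any engine):

```python
# Hypothetical sketch: routing movement events to either canned animation
# data or a real-time IK request. All names are illustrative only.
from enum import Enum, auto

class MovementEvent(Enum):
    LOOK = auto()
    WALK_FORWARD = auto()
    TURN_AROUND = auto()
    DUCK = auto()

# Events that are satisfied by playing stored animation data.
CANNED = {MovementEvent.WALK_FORWARD, MovementEvent.TURN_AROUND, MovementEvent.DUCK}

def handle_movement(event, target_position=None):
    """Issue either an animation playback or a targeted IK solve."""
    if event in CANNED:
        return ("play_animation", event.name.lower())
    # Looking (and similar aimed actions) need a targeted real-time IK solution.
    return ("solve_ik", target_position)

print(handle_movement(MovementEvent.DUCK))
print(handle_movement(MovementEvent.LOOK, target_position=(1.0, 2.0, 0.5)))
```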
Replication
Another obvious place where gestures can come
from is the replication of events from other users via the server.
See the replication design document for further details on replication
issues and methodology.
Items, Scripts and other triggers
Depending on the rules of a space, an item or script should be able to induce another avatar to change state. This seems most interesting in the case of emotional state. For example, fear might be as simple as having an "if state == fear do x" clause in the finite state machine, which could be activated either by someone clicking fear or by an event from the event system that triggers a fearful attitude. These could eventually be used to establish more than just a facial change: the fear state could last x amount of time and change y variables during that time. By doing this we facilitate role-playing, or force role-playing, in appropriate environments.
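A minimal sketch of that clause, assuming hypothetical state names and a couple of invented avatar variables:

```python
# Hypothetical sketch of an item/script-induced emotional state.
import time

class Avatar:
    def __init__(self):
        self.state = "neutral"
        self.state_expires = None   # when the induced state wears off
        self.walk_speed = 1.0
        self.posture = "upright"

    def trigger(self, event):
        # "if state == fear do x": an item, script, or UI click can induce fear.
        if event == "fear":
            self.state = "fear"
            self.state_expires = time.time() + 10.0   # lasts x seconds
            self.walk_speed = 1.5                      # change y variables...
            self.posture = "cowering"                  # ...for the duration

    def update(self):
        if self.state_expires and time.time() >= self.state_expires:
            self.state, self.state_expires = "neutral", None
            self.walk_speed, self.posture = 1.0, "upright"

a = Avatar()
a.trigger("fear")
print(a.state, a.walk_speed, a.posture)
```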
Mirror
Shows users the face of their character avatar, allowing the user to see as well as change their current emotional state. The
Mirror actually consists of two components. One is a docked UI plug-in that
shows the user’s face and current facial expression. The UI plug-in will also
have a menu system that allows the user to change their current mood. The second
component is an autonomous agent that automates some of the expressions of an
avatar. If an individual specifies their character as being “happy”, this
agent would then attempt to combine (over time) the basic primitives that would
constitute a “happy” avatar. This both simplifies the user’s interface in
expressing emotions and makes the real-life simulation much more elaborate.
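A sketch of how that agent might drift primitive weights toward a "happy" mix over time; the primitive names and target weights here are placeholders, not a real tuning:

```python
# Hypothetical: the Mirror's autonomous agent nudging primitive weights
# toward a target mix for the declared mood.

HAPPY_TARGET = {"smile": 0.8, "raise_eyebrows": 0.3, "openmouth": 0.1}

def step_toward_mood(current, target, rate=0.1):
    """Move each primitive weight a fraction of the way toward its target."""
    out = dict(current)
    for primitive, goal in target.items():
        value = out.get(primitive, 0.0)
        out[primitive] = value + rate * (goal - value)
    return out

weights = {"smile": 0.0}
for _ in range(5):                      # called once per agent tick
    weights = step_toward_mood(weights, HAPPY_TARGET)
print(weights)
```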
In the first column, the first row represents the various moods as well as the ability to change them. The second row represents specific gestures; these could represent facial expressions, body expressions, or a combination of the two. The last two rows are fairly arbitrary at the moment: the first icon represents an attack mode, while the second icon represents a personal teleporter.
Flow chart for a few of the mood states
Obviously the logic for the change-state machinery can be simpler for a user avatar than for an AI, since the mirror has a person sitting behind the state machine controlling it. We can rely on their memory and their attitude to control the avatar in a logical manner; this is based on the Eliza principle. However, we will still need a FuSM for the lower-level moods, i.e. happy, sad, or fear, to create variety and automate away the tedium.
(GP Note: this is the autonomous agent part, and this is effectively what Juanita was studying in Snow Crash)
There are a lot of hidden cues that an individual is not consciously aware of: things like how often an individual blinks when at rest versus talking, and at what point while talking an individual is more likely to blink. These will probably turn into a fairly fixed probabilistic formula for most avatars.
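For example, blink timing could collapse into a per-tick probability, with a higher rate while talking; the rates below are rough placeholders, not measured values:

```python
# Hypothetical fixed probabilistic blink formula; rates are placeholders.
import random

BLINKS_PER_SECOND = {"at_rest": 15 / 60.0, "talking": 25 / 60.0}

def should_blink(state, dt):
    """Return True if the avatar blinks during this dt-second tick."""
    return random.random() < BLINKS_PER_SECOND[state] * dt

blinks = sum(should_blink("talking", 1 / 30.0) for _ in range(30 * 60))
print("blinks in one simulated minute of talking:", blinks)
```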
Lower Level State Machines
Altiverse will probably need to hire an expert on human facial expressions and emotional state for this section of the project. These variables will probably be used as weights in the HFuSM. A simulation that this would be appropriate for is flirting, which raises the question: what do people do while flirting? Some of the states would be:
make eye contact
bat eyelashes
smile
toss hair
fidget
footsie (this would require a bit more of an immersive hardware setup than we currently foresee)
When a user creates their avatar, we will be able to give the user the ability to customize some of the weights or turn them off completely.
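A sketch of how those per-user weights might drive a fuzzy choice of sub-action; the default weights are invented, and any of them could be edited or zeroed at avatar creation:

```python
# Hypothetical fuzzy-weighted selection of flirting sub-actions; weights
# are per-user and can be edited or zeroed out at avatar creation.
import random

DEFAULT_FLIRT_WEIGHTS = {
    "make_eye_contact": 0.35,
    "bat_eyelashes": 0.10,
    "smile": 0.30,
    "toss_hair": 0.15,
    "fidget": 0.10,
}

def pick_sub_action(weights):
    """Pick one sub-action with probability proportional to its weight."""
    actions = [a for a, w in weights.items() if w > 0.0]
    return random.choices(actions, weights=[weights[a] for a in actions])[0]

user_weights = dict(DEFAULT_FLIRT_WEIGHTS)
user_weights["bat_eyelashes"] = 0.0        # user turned this one off
print(pick_sub_action(user_weights))
```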
Task-based animations are another way of establishing a logical hierarchy for animations to play within. For example, picking up an object could be broken down into its component sub-tasks/animations (a rough sketch follows the list):
Walk to proximity of the object.
Get into the right position to pick up the object.
Pick up the object.
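Here is the rough sketch: a task expands into an ordered list of sub-tasks, each of which the animation layer turns into a canned animation or an IK request (names invented):

```python
# Hypothetical task-based decomposition: a high-level task expands into
# ordered sub-tasks, each of which maps onto an animation or IK request.

PICK_UP = [
    ("walk_to", "proximity of object"),
    ("position", "align for grasp"),
    ("grasp", "play pick-up animation / IK the hand onto the handle"),
]

def run_task(subtasks):
    for name, description in subtasks:
        # In the real system each step would wait for the animation or
        # IK solve to finish before the planner moves on.
        print(f"executing {name}: {description}")

run_task(PICK_UP)
```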
Motivate actually excels at this type of task-based animation and planning; UnrealScript also has facilities for doing this, but probably not as advanced. Task-based animations are actually more necessary for the automation of AIs than they are for a human-driven actor. The actual animation is just the visual representation of the artificial intelligence planning and executing a task.
The traditional paradigm for working with objects in three dimensions is most often seen in 3D shooters. The controls are not extremely intuitive but can be accessed by the keyboard and any traditional input device. Interacting with objects is usually accomplished by running over them; the object is then either immediately used or, in more advanced programs, stored in an inventory that can be accessed later.
More recently we've seen Trespasser by DreamWorks Interactive, which was DWI's attempt to be much more elaborate in its object interaction. They failed miserably. (See the postmortem in Gamasutra for a complete rundown.) What they attempted to do was make a visible avatar with a user-operated hand and a visible upper body; the character was female, so that amounted to a pair of breasts obstructing the view of the ground. The hand's IK was done very badly. It flailed about and was difficult to control even while standing still. While moving, it acted as an inverse kinematic chain attached to the player, so the motions of the player propagated in a fashion that was aesthetically and gameplay-wise unappealing. Emulating Trespasser's mistakes does not really strike me as being the best thing to do.
However, in the case of picking up an object off a table or examining something on a bookshelf, what would be acceptable inside an FPS (running over the object) would not look right. Beyond that, I do not know how evolved graphics engines are at this point. Would it be possible to have someone sit down or open a door? Obviously collision has been done, but we will probably need something much more sophisticated for our purposes: being able to take a target and play an animation based on the location of your avatar relative to the target. This animation would have the avatar interact with the target based upon that information.
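A sketch of what that relative-location lookup might amount to, with made-up distance and height thresholds:

```python
# Hypothetical: choose an interaction animation from the avatar's position
# relative to the target (distance and height), then hand off to playback.
import math

def choose_interaction(avatar_pos, target_pos):
    dx, dy, dz = (t - a for a, t in zip(avatar_pos, target_pos))
    distance = math.sqrt(dx * dx + dy * dy)
    if distance > 1.0:
        return "walk_to_target"          # not close enough to interact yet
    if dz > 1.2:
        return "reach_up"                # object on a bookshelf
    if dz < -0.5:
        return "bend_down"               # object on the floor
    return "reach_forward"               # object at table height

print(choose_interaction((0.0, 0.0, 0.0), (0.3, 0.2, 1.5)))
```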
For actions such as grasping, sitting, picking up, etc., a product named Motivate, made by Motion Factory, seemed to promise a certain amount of these facilities. It couples adding basic information to objects that individuals will interact with, such as the shape of a handle, with a real-time inverse kinematics solution. In all honesty this is really Motivate's domain; I have not really seen any other program do this in a general manner. The main reason for this is that it is a relatively simple problem to solve with a robust skeletal system, which Motivate has. However, the industry has only just started moving to skeletal systems, and until that happens the solution will be more akin to just allowing a weapon to be held and not much else. (Note: quite a bit has changed since this document was started. Unreal now has a skeletal animation system, and Intel and RadGameTools are both selling skeletal system SDKs. Rune has a Softimage-based one. Jeff Lander's articles in Game Developer are also a good resource to look at. Additionally, the 3DSMax open source website has an app called dejaview that is a stand-alone 3D skeletal exporter.)
Another issue with targeting is that you are typically interacting with someone, hence you would need a target for eye contact or smiling. (Note: this is not necessarily true, for example in the case of a very drunk individual. For now, however, the simulation of extremely drunk individuals will be left out of the system.)
A person could wind up flirting with the wall, which in my opinion would be extremely funny but would definitely hurt the verisimilitude of the simulation. (Janak: Why prevent this? Hell, we can do it in real life. Hell, we do it in real life – in front of a glass mirror! To rehearse! In fact, we should have glass mirrors in the system in the "practice" environment so people become more familiarized with the mirror interface and its effects.) This once again brings up the features of Motivate, or at least the need for a more robust animation system.
To steal Motivate's categorization:
Locomotion skills – move from one place to another
Manipulation skills – manipulating an object via an abstract handle. I've generalized this even further into a target-based animation, where the target might just be the general location the action is directed towards. I should point out that in a situation such as climbing a ladder or a rope, the distinction between manipulation and locomotion starts getting blurry. (A small data-layout sketch follows this list.)
Basic skill - what we are calling a gesture
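A possible data layout for these three categories, with manipulation generalized to an optional target location as described above (everything here is illustrative):

```python
# Hypothetical data layout for the three skill categories, with the
# manipulation case generalized to an arbitrary target location.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Skill:
    name: str
    category: str                      # "locomotion", "manipulation", "gesture"
    target: Optional[Tuple[float, float, float]] = None  # where the action is aimed

walk = Skill("walk", "locomotion")
open_door = Skill("open_door", "manipulation", target=(2.0, 0.0, 1.1))
wave = Skill("wave", "gesture")
print(open_door)
```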
Approx. 22 muscles for the face
Approx. 17 joints for the body, not counting fingers and toes
The fact that we are going to be using a hierarchical skeletal system should facilitate our ability to make more complex animations at runtime. We should be able to superpose data both in a numeric sense, by literally adding one set of contraction/angle vectors to another, and in a joint sense, allowing the lower-level bones to play different data than the main skeleton. In addition, we have structured animation data, so it should be possible to store an animation such as "wave" not just as a full-body animation but as just an arm animation. This would allow not exactly superposition but the replacement of one animation's arm data with another animation's arm data at runtime. For example, we could combine two dissimilar animations, achieving a combined "walk" with a "wave" as well as a "smile" at the same time. A few problems with this: what if the person isn't walking? We don't really want them to be perfectly still except for the wave. Also, by separating out the hierarchy and playing different animations on different parts of the skeleton, we increase the risk of animations where different parts of the avatar start interfering with each other and cause intra-avatar collisions, lowering the realism of the simulation.
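A sketch of the arm-replacement case: the "wave" pose overrides the "walk" pose only on the arm bones, with invented bone names and angles:

```python
# Hypothetical per-bone channel replacement: play "wave" on the right arm
# while the rest of the skeleton keeps playing "walk". Bone names invented.

WALK_POSE = {"spine": 5.0, "left_leg": 30.0, "right_leg": -30.0,
             "left_arm": -10.0, "right_arm": 10.0}
WAVE_POSE = {"right_arm": 120.0, "right_forearm": 45.0}

RIGHT_ARM_BONES = {"right_arm", "right_forearm"}

def combine(base_pose, overlay_pose, overlay_bones):
    """Overlay replaces base only on the listed bones; base wins elsewhere."""
    out = dict(base_pose)
    for bone in overlay_bones:
        if bone in overlay_pose:
            out[bone] = overlay_pose[bone]
    return out

print(combine(WALK_POSE, WAVE_POSE, RIGHT_ARM_BONES))
```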
Note: some of this has been supplanted by the implemented facial animation system.
The facial animation system will have to be able to overlay the visual representation of the phonemes with the emotional state of the avatar at any given time. This should be doable via straight superposition of the data as long as there are well-defined constraint limiters on the muscles. Moreover, by restricting to one facial expression and one phoneme expression we can allow for a wide variety of facial animations yet at the same time keep the system tractable. The intuition behind this is that mood and speech largely determine facial state.
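A sketch of that superposition, adding a mood contraction vector to a phoneme contraction vector and clamping each muscle to its constraint limit (muscle names and values are illustrative):

```python
# Hypothetical straight superposition of a mood vector and a phoneme vector
# over the facial muscles, clamped by per-muscle constraint limits.

HAPPY = {"zygomatic": 0.7, "orbicularis_oculi": 0.3}
PHONEME_OH = {"orbicularis_oris": 0.8, "jaw": 0.5}
LIMITS = {"zygomatic": 1.0, "orbicularis_oculi": 1.0,
          "orbicularis_oris": 1.0, "jaw": 1.0}

def superpose(mood, phoneme):
    """Add contraction vectors, then clamp each muscle to its limit."""
    muscles = set(mood) | set(phoneme)
    return {m: min(mood.get(m, 0.0) + phoneme.get(m, 0.0), LIMITS.get(m, 1.0))
            for m in muscles}

print(superpose(HAPPY, PHONEME_OH))
```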
We will probably offer the ability to either slow down or speed up the playback of a gesture by changing the rate of interpolation of the data, as well as the option to loop the gesture. Note I'm talking about a simple fixed gesture, one that consists of a fixed-length set of biped-based key frames. Janak: Hmmm… are we assuming all gestures/moods have the same duration rules? Maybe we should just come up with two or three duration rules – i.e. fixed for X time, decay over Y time, and immediate (though the last could even be fixed for a {VERY SMALL} time). Then we can enumerate sample gestures and moods and give examples that fit inside our duration structure. Scripting can simplify a lot of this.
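A sketch of those three duration rules, plus the rate and loop options, as one small gesture record (field names are invented):

```python
# Hypothetical encoding of the three duration rules for a gesture:
# fixed for X seconds, decay over Y seconds, or immediate (very small X).
from dataclasses import dataclass

@dataclass
class Gesture:
    name: str
    rule: str          # "fixed", "decay", or "immediate"
    duration: float    # X or Y in seconds; tiny for "immediate"
    rate: float = 1.0  # playback speed multiplier for the keyframe data
    loop: bool = False

def weight_at(gesture, t):
    """Blend weight of the gesture t seconds after it was triggered."""
    if gesture.rule == "decay":
        return max(0.0, 1.0 - t / gesture.duration)
    return 1.0 if t < gesture.duration else 0.0   # fixed and immediate

smile = Gesture("smile", "decay", duration=3.0)
print(weight_at(smile, 1.5))
```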
Another obvious problem with this is that people act very differently when they are doing high-level complex behaviors. Generalizing these actions into one prototypical behavior will be problematic. Back to our high-level task of flirting (bit transfixed, aren't we): when the user selects "flirting" and a target, the avatar would automatically enter into a flirty mood. However, what one person describes as flirting may not be remotely what another describes as flirting – me versus the guys in the village. A way to solve this would be to customize the FuSM, so that the user could select which actions constitute flirting or adjust the probabilities of different sub-actions. Or just have a few prototypical FuSMs, a few for men and a few for women. (GP Note: This is a problem that also emerges in the lower-level motion capture (mocap) data; you don't want things to look the same. I currently see no solution to this except for real-time video analysis.)
Currently,
animations that are responses to events are called gestures. But
since I can think of all of the animations as being responses to events, I’m
having a problem not lumping them together.
Each of these has a muscle contraction vector that defines the state.
Basic expressions: Smile, Frown, Fear, Look angry, Disgust, mirrorDisgust, openmouth, Left_Blink, Right_Blink, close_eyes, Pout, Raise Eyebrows, Suspicious, Surprise
Phonemes (based on LIPSinc): bump, cage, church, earth, eat, fave, if, new, oat, ox, Roar, size, though, told, wet
Head motions: turnleft, turnright, tiltright, tiltleft
Compound expressions: no, skeptical_no, laugh, look_around, yawn, whistle, wahha, roll_head
Moods: Anger, Boredom, Confusion, Contempt, Disgust, Afraid, Friendly, Happy, Pleading, Sadness, Sexy, Skepticism, Stoic, Surprise, Suspicion, Tired, Triumph, Others?
The major question is: how many of these are discrete expressions vs. moods that are conveyed over time?
GNP: Actually I meant it in a much more philosophical sense, so I stared at a mirror (QuickCam) for a while.
Janak: Well – maybe there's a way we can combine them somewhat – discrete expressions can be states lasting a time of zero. The system would automatically interpret that as being an immediate wham-bam thing. We could also have a time of –1, which would mean the expression is presented only if the mouse is clicked. Something like that.
Animations that we see as necessary in the standard library for basic and extended motion and the like:
Aim/ targeting
Turn head - left/right/up/down
Run/Walk forwards/backwards
Side step - left/right
Turn
Jump: forward/back/up/right/left
Crouch
Pick up - right hand/left hand
Die
Basic street fighting punch - left/right arms
Basic street fighting kick - left/right legs
Throw
Climb (with knees bent, like stairs)
Climb (with hands and feet, like a rope)
Bend down (at knees or at back)
Sit: flat/bent surface
Looking at something
Looking at someone
Holding something in hand - left/right
Holding something out in front of you - one hand/both hands
Lie down
Wave
Janak:
Lots and lots of them – we have to get experts to do this – people who are
familiar with human physiology.