Technical Approach - Visual and Acoustic Surveillance and Monitoring
- Site-model based image stabilization.
Prior research in collaboration with ARL on image stabilization
for automatic target acquisition (ATA) revealed
that high-accuracy image alignment is
very important for subsequent detection and tracking tasks. For some applications,
even subpixel misalignment between image frames may
cause ATA algorithms to fail.
In wide-area surveillance,
a person or small vehicle in a surveillance image may be only a few pixels across, and
the camera platform may vibrate due to wind and/or strong impacts on
the ground. We expect a high-accuracy camera stabilization capability
to be even more critical in this situation. Additionally, in surveillance systems in which the
visual sensors are truly mobile, stabilization is again a critical step in the
detection of people and vehicles.
We have developed several image stabilization algorithms over the past
three years.
As part of the DARPA UGV/RSTA project, we developed
a 2D feature-based multi-resolution camera motion estimation
technique.
These stabilization algorithms will be extended through the use of site models for
surveillance applications. Using a 3D site model for image stabilization has the
following advantages:
- Camera location and
orientation can be accurately determined by camera resection based on the
image domain locations of several control points whose 3D coordinates
with respect to the site are known.
- Task-dependent information (e.g., monitoring a building entrance for
activity) can be used to choose the points and/or surfaces stabilized
by the algorithm.
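The camera resection step named in the first advantage can be illustrated with a minimal sketch. It assumes an ideal pinhole camera, noise-free correspondences, and at least six non-coplanar control points, and it uses the textbook direct linear transform (DLT) with the last entry of the projection matrix fixed to 1. This is a stand-in chosen for illustration, not necessarily the method the proposed system will use; the names solve_linear, resect_camera, and project are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate control point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def resect_camera(points_3d, points_2d):
    """Estimate a 3x4 projection matrix P (with P[2][3] = 1) from 3D-2D pairs."""
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        # Each control point contributes two linear equations in 11 unknowns.
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        rhs.append(u)
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        rhs.append(v)
    n = 11
    # Least-squares solution via the normal equations (A^T A) p = A^T b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(n)]
    p = solve_linear(ata, atb) + [1.0]
    return [p[0:4], p[4:8], p[8:12]]


def project(P, point):
    """Project a 3D site point into the image with projection matrix P."""
    X, Y, Z = point
    h = [row[0] * X + row[1] * Y + row[2] * Z + row[3] for row in P]
    return (h[0] / h[2], h[1] / h[2])
```

With real, noisy measurements one would normally normalize the coordinates and refine the result by minimizing reprojection error; both steps are omitted here for brevity.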
In the surveillance system, the acquired video sequence will be registered to
the site model. Task-specific control points whose 3D coordinates are known from the
site model will be tracked from frame to frame and used in the registration
of the video sequence to the site model. We expect the registration
procedure to achieve high accuracy. We will develop a real-time,
high-accuracy feature tracking algorithm for video sequence to site model
registration.
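Frame-to-frame tracking of a control point can be sketched as a local template search. The example below is a minimal illustration, assuming grayscale images stored as lists of rows and a small search radius around the previous location; it minimizes the sum of squared differences (SSD) and returns whole-pixel positions only. The subpixel refinement and real-time performance that the proposed tracker requires are not addressed, and the names ssd and track_patch are hypothetical.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and an image window."""
    return sum((image[top + r][left + c] - template[r][c]) ** 2
               for r in range(len(template))
               for c in range(len(template[0])))


def track_patch(image, template, prev_top, prev_left, radius=2):
    """Find the window near (prev_top, prev_left) that best matches the template."""
    th, tw = len(template), len(template[0])
    max_top = min(len(image) - th, prev_top + radius)
    max_left = min(len(image[0]) - tw, prev_left + radius)
    best = None
    for top in range(max(0, prev_top - radius), max_top + 1):
        for left in range(max(0, prev_left - radius), max_left + 1):
            score = ssd(image, template, top, left)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best[1], best[2]
```

The tracked image locations, paired with the known 3D coordinates of the control points, would then serve as input to the registration step.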
- Wide area detection of people and vehicles using
visual and acoustic sensors.
Detection, recognition, and motion analysis of people and vehicles form the core
tasks of wide-area surveillance and monitoring.
For the detection task, we plan to use
- changes between video frames
to detect moving objects,
- changes between the acquired video and
reference images stored in the site model to detect slow-motion
intruders, and
- unexpected acoustic signals to detect potential
targets.
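The three detection cues listed above can be sketched as simple threshold tests. This is a minimal illustration, assuming stabilized grayscale frames stored as lists of rows, a reference image rendered from the site model, and a known background sound level; every threshold and function name here is a hypothetical placeholder, not a value from the proposed system.

```python
import math


def changed_pixels(image_a, image_b, threshold):
    """Count pixels whose absolute intensity difference exceeds the threshold."""
    return sum(1
               for row_a, row_b in zip(image_a, image_b)
               for a, b in zip(row_a, row_b)
               if abs(a - b) > threshold)


def acoustic_alarm(samples, background_rms, factor=3.0):
    """Flag a sound window whose RMS level is well above the background level."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > factor * background_rms


def detect(prev_frame, frame, reference, samples, background_rms,
           pixel_threshold=20, min_pixels=2):
    """Combine the three cues into a dictionary of detection flags."""
    fast = changed_pixels(prev_frame, frame, pixel_threshold)
    slow = changed_pixels(reference, frame, pixel_threshold)
    return {
        # Frame-to-frame change indicates a moving object.
        "moving_object": fast >= min_pixels,
        # Change against the reference without frame-to-frame change
        # indicates something that moved in slowly.
        "slow_intruder": slow >= min_pixels and fast < min_pixels,
        "acoustic": acoustic_alarm(samples, background_rms),
    }
```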
Recognition will involve fusing cues from the site model with information from visual and acoustic
features of the target(s), while motion analysis of detected people and vehicles will
use 3D cues from the site model combined with feature tracking results
to estimate location, heading, and speed.
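As an illustration of estimating location, heading, and speed, the sketch below fits a constant-velocity model to a short track. It assumes the tracked image features have already been mapped to ground-plane coordinates (in meters) using the site model, and that heading is measured in degrees counter-clockwise from the site's +x axis; both conventions and the name motion_from_track are assumptions made for this example.

```python
import math


def motion_from_track(track):
    """Fit constant-velocity motion to a track of (t, x, y) samples.

    Returns ((x, y) at the last timestamp, heading in degrees, speed).
    """
    n = len(track)
    t_mean = sum(t for t, _, _ in track) / n
    x_mean = sum(x for _, x, _ in track) / n
    y_mean = sum(y for _, _, y in track) / n
    denom = sum((t - t_mean) ** 2 for t, _, _ in track)
    if denom == 0:
        raise ValueError("need samples at two or more distinct times")
    # Least-squares slopes of x(t) and y(t) give the velocity components.
    vx = sum((t - t_mean) * (x - x_mean) for t, x, _ in track) / denom
    vy = sum((t - t_mean) * (y - y_mean) for t, _, y in track) / denom
    speed = math.hypot(vx, vy)
    heading = math.degrees(math.atan2(vy, vx)) % 360.0
    t_last = track[-1][0]
    location = (x_mean + vx * (t_last - t_mean),
                y_mean + vy * (t_last - t_mean))
    return location, heading, speed
```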
- Narrow-area recognition of activities.
Our research on narrow-area recognition of human activities
addresses the problems of:
- Detection, segmentation and tracking of people (and their
body parts) in color video and
IR imagery.
- Developing structural models for single-person and multi-person
activities, with emphasis on entering, exiting, carrying and
exchanging activities.
Our proposed research on detection and tracking of people is based
on segmenting that part of each image predicted to contain the moving
person, and finding as many natural body parts as possible using a combination
of motion-based tracking, and shape and color analysis.
We plan to employ a novel hierarchical, region-based background subtraction
method to focus on that part of the image predicted to be the
person. Current versions of
these programs operate at 10-15 frames per second on a PC system.
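The general idea of a hierarchical, region-based test can be sketched with a two-level coarse-to-fine scheme. This is a generic stand-in, not the novel method referred to above: it assumes a static grayscale background image, compares block means first, and examines individual pixels only inside blocks that changed. The block size, thresholds, and function names are all hypothetical.

```python
def block_mean(image, top, left, size):
    """Mean intensity of a square block (clipped at the image border)."""
    values = [v for row in image[top:top + size] for v in row[left:left + size]]
    return sum(values) / len(values)


def coarse_to_fine_foreground(image, background, block=2,
                              block_threshold=10, pixel_threshold=20):
    """Return a binary foreground mask using a block test, then a pixel test."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for top in range(0, height, block):
        for left in range(0, width, block):
            diff = abs(block_mean(image, top, left, block)
                       - block_mean(background, top, left, block))
            if diff < block_threshold:
                continue  # Region unchanged at the coarse level; skip its pixels.
            for r in range(top, min(top + block, height)):
                for c in range(left, min(left + block, width)):
                    if abs(image[r][c] - background[r][c]) > pixel_threshold:
                        mask[r][c] = 1
    return mask
```

Skipping unchanged regions at the coarse level is what saves computation, at the cost of missing isolated changes too small to shift a block mean.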
Our proposed research involves integration of this segmentation process into a tracking
framework that employs a generic model of the human body and its parts to find (through
limited searches over the combinatorial space defined by the hierarchical segmentations)
body parts and to track them both through the image sequence and, when possible, in 3-D.
Tracking will be based on identifying short-term stable features on the surfaces of the
visible body parts, dynamically updating this set as the sequence progresses. Tracking will
employ models of the dynamics of the motion being observed (walking, bending,
reaching, etc.) under control of the high-level system described below.
- Recognizing human activity.
We propose to develop a mixed statistical and structural approach to the
representation
of human actions.
The models will be grounded in statistical primitive action
models, in which the movements of individual body parts in a body-centered
frame of reference are associated with ``primitive'' body part actions.
These primitive action models are augmented with a theory of
attachment that is used to determine how and when a person is moving with
an object (as opposed to a chance visual coincidence between the
instantaneous views of a human and an object in the scene). We will
specifically be concerned with three types of attachment corresponding,
roughly, to objects with handles (briefcases, suitcases) that are carried
in one hand, bags and boxes that are carried with both arms and hands, and
stick-like objects that are carried in one or two hands. The theory will
also model how such objects are ``picked up'' and ``put down.''
Structural models describe composite activities that involve the
coordinated action of many body parts and impose sequential constraints on
primitive or constituent composite body actions.
Primitive body actions will be recognized using a robust estimation
algorithm for indexing into databases of linear models of these actions and
simultaneously extrapolating motion descriptions to previously unencountered
viewpoints.
The theory of attachment, used to recognize carrying activities,
will be based both on constraints on body part
motion that result from different types of carrying activity, and on the
recognition of ``objects'' in a time-varying image that move with a person
and whose position and motion are consistent with the hypothesized type of
carrying activity.
Our computational approach will
be strongly motion- and body-pose-based, and will not rely on any general
vision capabilities for finding and tracking objects that humans
might be carrying. Instead, we will hypothesize, from body pose and motion,
the type of carrying action, and then search in the image sequence for collections
of features (regions, markings, etc.) that move with the body in a manner
consistent with the hypothesized carrying action.
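The search for features that move with the body can be illustrated with a minimal consistency test. The sketch assumes 2D image tracks sampled at the same frames for a body part (for example, a hand) and for each candidate feature. It accepts a feature when its offset from the body part stays nearly constant while the body part actually travels, which rules out a stationary object that merely appears next to a stationary person. The thresholds and the names path_length, moves_with, and attached_features are hypothetical.

```python
import math


def path_length(track):
    """Total distance travelled along a track of (x, y) positions."""
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(track, track[1:]))


def moves_with(body_track, feature_track, max_spread=2.0, min_travel=5.0):
    """True if the feature keeps a near-constant offset from a moving body part."""
    if path_length(body_track) < min_travel:
        return False  # No motion observed, so co-location proves nothing.
    offsets = [(fx - bx, fy - by)
               for (bx, by), (fx, fy) in zip(body_track, feature_track)]
    mean_dx = sum(dx for dx, _ in offsets) / len(offsets)
    mean_dy = sum(dy for _, dy in offsets) / len(offsets)
    spread = max(math.hypot(dx - mean_dx, dy - mean_dy) for dx, dy in offsets)
    return spread <= max_spread


def attached_features(body_track, candidates, **kwargs):
    """Names of the candidate features that move with the body part."""
    return sorted(name for name, track in candidates.items()
                  if moves_with(body_track, track, **kwargs))
```

A fuller treatment would make the expected offset depend on the hypothesized carrying type, for example below the hand for a handled object and in front of the torso for a box.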
- High-level system for control and operator interaction.
A wide variety of control knowledge should be represented in a general way
in order to control both the activities of the surveillance system and the interactions
between the surveillance system and human operators. This control knowledge
will, generally, make reference to both spatial and temporal attributes of the
surveillance site being monitored. We propose to represent this control knowledge
using temporal logic programs, taking advantage of the general database capabilities of
logic programming to
- support the insertion of ancillary data that can
be integrated into situation assessments by the surveillance system,
and
- specify, through queries posed to the temporal logic programming
database, the conditions
under which control passes from wide-area surveillance to narrow-area surveillance and
on to requests for human assistance.
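The handoff conditions can be illustrated with a small sketch. It is written in Python rather than as a temporal logic program, and only mimics the idea of querying a database of timestamped facts over a time window. The predicate names, window lengths, and the functions holds_within and next_mode are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    WIDE_AREA = "wide-area surveillance"
    NARROW_AREA = "narrow-area surveillance"
    OPERATOR = "request for human assistance"


def holds_within(facts, predicate, now, window):
    """True if the predicate was asserted at some time in [now - window, now]."""
    return any(p == predicate and now - window <= t <= now for t, p in facts)


def next_mode(mode, facts, now):
    """Apply the control rules to a list of (timestamp, predicate) facts."""
    if mode is Mode.WIDE_AREA and holds_within(
            facts, "person_near_entrance", now, 5.0):
        return Mode.NARROW_AREA
    if mode is Mode.NARROW_AREA and holds_within(
            facts, "activity_unrecognized", now, 10.0):
        return Mode.OPERATOR
    return mode
```

Ancillary data inserted by an operator would simply be additional timestamped facts that such rules can refer to.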
This high-level system will also draw upon the research being conducted under
a DARPA AASERT grant.