Tactile Sensing and Dexterous Manipulation — Robots That Feel with Their Fingertips

Introduction
Why Touch
Two Streams of Tactile Sensors
- Vision-Based Tactile Sensors (the GelSight family)
- Electronic Skin (electrical tactile arrays)
What Touch Tells You
Learning from Human Touch
Grasp Stability: How Hard to Grip
Representation of Tactile Data
Visual-Tactile Fusion
Active Tactile Exploration
Recognizing Objects by Touch Alone
The Basics of Force Control
Before and After Contact: The Handoff of Senses
The Challenge of Multi-Fingered Hands
Dexterous In-Hand Manipulation
Learning Methods: Putting Touch into the Policy
- Reinforcement-Learning-Based
- Imitation-Learning-Based
Tactile Sim2Real
Sensor Families: A Deeper Comparison
An Actual Workflow Example: Connector Insertion
Benchmarks and Data
Applications
Pitfalls and Limits
Touch Meets VLA
A Learning-Perspective Summary
Narrowing Tactile Sim2Real: Three Branches
Real-Time Behavior and Latency
New Directions
Closing
References

Introduction

Even with our eyes closed, we tell keys from coins inside a pocket, modulate force so a cup does not spill, and instantly catch an object slipping between our fingers. What makes all of this possible is touch.

Robotic manipulation has long relied mostly on the "eyes." A camera sees an object, its position and pose are estimated, and the gripper is sent there. But the moment the hand touches the object, vision can no longer see the contact between hand and object (occlusion). Whether it is slipping, how hard it is being gripped, whether the surface is smooth or rough — only the sense at the fingertips knows.

This article covers the tactile sensing that lets a robot feel the world with its fingertips, and the dexterous manipulation that uses that sense. We look at which sensors exist, what information touch actually provides, how it fuses with vision, how in-hand manipulation is learned, and the tactile-specific sim2real problem.

Why Touch

Vision and touch are each good at different things. Vision excels at grasping the whole scene from a distance, but it misses the fine physics at the moment of contact. Touch is the opposite.

Sense	Good at	Bad at
Vision	Long-range perception, overall layout, rough shape	Contact force, slip, state behind occlusion
Touch	Contact force, slip detection, local texture/shape	Long-range perception, pre-contact information

In dexterous manipulation, the decisive moments mostly occur after contact. Turning a screw with the fingers, picking up a thin card, moving an egg without cracking it — these are very hard without real-time feedback of contact force. This is why touch is needed.

Two Streams of Tactile Sensors

Robotic tactile sensors can be broadly split into two families.

Vision-Based Tactile Sensors (the GelSight family)

The representative one is the GelSight family. The principle is astonishingly simple yet powerful. A thin reflective membrane is coated on the surface of a transparent gel (a jelly-like elastomer), and that gel is filmed from the inside with a camera. When an object presses the gel, the gel surface deforms, and under internal LED lighting the camera captures that deformation as an image.

   object presses the gel surface
        │
        ▼
   ┌──────────────────────────────┐
   │  elastic gel (reflective     │  ← deforms to the object shape
   │  coating)                    │
   ├──────────────────────────────┤
   │  transparent support layer   │
   │        ↑ LED lighting        │
   │      [ camera ]  ──▶ image   │  ← captures deformation in high resolution
   └──────────────────────────────┘
        │
        ▼
   acquire contact shape, fine texture, and force distribution as an "image"

The big advantage of this approach is that it turns contact into an image. Fine texture of the contact surface, small lettering, and even surface asperities can be obtained as a high-resolution image, and the amount of gel deformation reveals the force distribution and even signs of slip. Another large benefit is that mature computer-vision and deep-learning tools can be applied directly.

Electronic Skin (electrical tactile arrays)

Another stream is electronic skin (e-skin). Elements that respond to pressure and deformation are laid out in a grid over a wide area, and the force at each point is read as an electrical signal. There are various schemes — capacitive, resistive, piezoelectric, and more.

Electronic skin is well suited to covering wide areas with a thin layer, and its strength is sensing contact over broad regions like arms and torso. On the other hand, it is comparatively hard to obtain the ultra-high-resolution texture image that vision-based sensors give. Rather than competing, the two families naturally take different roles: fingertips (high resolution) and broad surfaces (wide coverage).

What Touch Tells You

Let us organize what information can actually be extracted from tactile sensors.

Contact force and direction: How hard the contact presses, and in which direction. Essential when handling fragile objects such as eggs.
Slip detection: Early signs that an object is starting to slip from the hand. On detecting slip, you can immediately raise the grip force so as not to drop it.
Local shape/texture: Asperities, edges, and surface roughness of the contact surface. Similar to finding a keyhole by hand in the dark.
Contact location: Which point of the finger the object touches. It becomes a clue for estimating the object's pose in in-hand manipulation.

Slip detection is especially important. People feel the very early signals that an object is starting to slip (vibration, contact-surface shift) and unconsciously apply more force. If a robot catches these signals through touch, it becomes capable of exquisite force modulation — neither gripping so hard it breaks the object nor so weakly it drops it.

Learning from Human Touch

Robot tactile research often references the human fingertip. Human skin has several kinds of mechanoreceptors, each specialized for different stimuli.

Receptor type	Good at sensing	Robot analog
Fast-adapting	Vibration, onset of slip	Time derivative of contact change
Slow-adapting	Sustained pressure, shape	Static force distribution

The key insight is that "touch is not a single signal but a combination of several channels." A channel that measures sustained pressure and a channel that measures instantaneous change must both exist for stable grasping. This is also why vision-based tactile sensors are powerful: they can extract both the static deformation of the gel surface (pressure) and the temporal change of that deformation (signs of slip) from one image stream.
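
As a minimal sketch of the two-channel idea, assume the sensor output has already been reduced to one pressure value per time step (a simplification for illustration). The slow channel is approximated here by an exponential moving average and the fast channel by a finite-difference derivative. The function name `split_channels`, the smoothing factor, and the sample trace are all hypothetical.

```python
def split_channels(pressure, dt, alpha=0.2):
    """Split a pressure time series into a slow and a fast channel.

    slow: exponential moving average (sustained pressure)
    fast: finite-difference time derivative (sudden change, e.g. slip onset)
    """
    slow, fast = [], []
    smoothed = pressure[0]
    previous = pressure[0]
    for p in pressure:
        smoothed = alpha * p + (1.0 - alpha) * smoothed
        slow.append(smoothed)
        fast.append((p - previous) / dt)
        previous = p
    return slow, fast


# Hypothetical trace: a steady 2.0 hold, then a sudden drop as the object starts to slip.
trace = [2.0] * 5 + [1.2] * 5
slow, fast = split_channels(trace, dt=0.01)

# The sample with the most negative derivative marks the candidate slip onset.
slip_index = min(range(len(fast)), key=lambda i: fast[i])
print(slip_index, round(slow[slip_index], 2))
```

The fast channel spikes at the moment of the drop while the slow channel only drifts, which is the division of labor the table describes.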

Grasp Stability: How Hard to Grip

One fundamental question of manipulation is "how hard to grip." Too weak and it drops; too hard and it breaks or strains the joints. People strike this balance unconsciously, but a robot needs an explicit strategy.

   grip force ────────────────────────────────────▶
   too weak         proper range        too hard
   ┌──────────┬─────────────────┬──────────────┐
   │  slip     │  stable grasp   │  object      │
   │  drop     │  (held by touch)│  damage      │
   └──────────┴─────────────────┴──────────────┘
        ▲            ▲                 ▲
     too little    touch feedback    too much
                   keeps it here

Tactile feedback makes it possible to stay in this "proper range." If a sign of slip is detected, raise the force a little; if the grasp is stable, keep only the minimum force. This way, a fragile object like an egg and a heavy tool can be handled with the same hand. Modulating grip force in real time by touch, instead of fixing it to a constant, is the starting point of dexterous manipulation.
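
The rule "raise the force a little on slip, otherwise keep only the minimum" can be sketched as a single control tick. This is an illustrative sketch, not a tuned controller: the function name `update_grip_force`, the force limits, and the step sizes are assumed values, and the slip flag is taken as given from some upstream detector.

```python
def update_grip_force(force, slip_detected, f_min=1.0, f_max=10.0,
                      step_up=0.5, relax=0.05):
    """One control tick: tighten on slip, otherwise relax toward the minimum."""
    if slip_detected:
        force += step_up      # slip sign: grip a little harder
    else:
        force -= relax        # stable: slowly back off to avoid crushing
    # Clamp to the allowed range (assumed hardware and safety limits).
    return max(f_min, min(f_max, force))


force = 2.0
slip_signal = [False, False, True, True, False, False]  # hypothetical detector output
history = []
for slip in slip_signal:
    force = update_grip_force(force, slip)
    history.append(round(force, 2))
print(history)
```

Relaxing slowly and tightening quickly is a deliberate asymmetry: dropping the object is usually worse than holding it slightly too hard for a moment.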

Representation of Tactile Data

To use touch in learning, you must represent the sensor output in a form the policy can handle. There are two broad branches.

Image representation: A vision-based sensor gives contact directly as an image. It is convenient to reuse vision networks like a CNN as-is.
Low-dimensional signal representation: Represent it as summarized numbers like whether there is contact, contact location, and force vector. Light and favorable for sim2real, but fine texture information is lost.

Which representation is good depends on the task. If fine texture matters, like picking up a thin card, the image representation may be favorable; if force management matters, like not dropping a heavy object, the low-dimensional representation may be. In practice, the two are sometimes mixed.
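
To make the relationship between the two representations concrete, the sketch below reduces an image-like pressure map to the low-dimensional summary mentioned above (whether there is contact, contact location, and a force proxy). It assumes the tactile image has already been converted to a grid of per-pixel pressure values; the name `summarize_contact` and the threshold are hypothetical.

```python
def summarize_contact(pressure_map, threshold=0.1):
    """Reduce a 2-D pressure map (list of rows) to (in_contact, centroid, total)."""
    total = 0.0
    row_moment = 0.0
    col_moment = 0.0
    for r, row in enumerate(pressure_map):
        for c, value in enumerate(row):
            if value > threshold:          # ignore sensor noise below the threshold
                total += value
                row_moment += r * value
                col_moment += c * value
    if total == 0.0:
        return False, None, 0.0
    # Pressure-weighted centroid approximates the contact location.
    return True, (row_moment / total, col_moment / total), total


tactile_image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(summarize_contact(tactile_image))  # (True, (1.5, 1.5), 4.0)
```

The summary is three numbers instead of sixteen pixels, which is why it is light to process, and also why fine texture no longer survives in it.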

Visual-Tactile Fusion

The most powerful manipulation systems use vision and touch together, because each covers the other's weakness.

   ┌──────────── vision (camera) ────────────┐
   │  before contact: rough object pos/shape  │
   │  plan the path as the hand approaches    │
   └──────────────────┬───────────────────────┘
                      │  at contact, the hand occludes the object
                      ▼
   ┌──────────── touch (fingertip sensor) ────┐
   │  after contact: force, slip, local shape │
   │  real-time force control, pose fine-tune │
   └──────────────────┬───────────────────────┘
                      │
                      ▼
        ┌──── integrated visuotactile policy ────┐
        │  take both senses as input and         │
        │  decide the next action                │
        └─────────────────────────────────────────┘

A typical flow is this. Before contact, vision guides the approach to the object; from the moment of contact, touch takes the lead, modulating force and preventing slip. In learning-based systems, the camera image and the tactile image (or signal) are fed together into a neural network to learn a single policy that integrates both senses.
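
The simplest form of such an integrated policy concatenates the features of both senses and maps them to an action. The sketch below uses one linear layer in place of a neural network and hand-set weights in place of learned ones. All names, feature meanings, and numbers are assumptions for illustration.

```python
def fused_policy(vision_features, tactile_features, weights, bias):
    """Feature-level fusion: concatenate both vectors, then apply one linear layer."""
    x = list(vision_features) + list(tactile_features)
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


# Hypothetical features: vision = (dx, dy) offset to the object,
# touch = (contact flag, slip signal).
vision = [0.2, 0.0]
touch = [1.0, 1.0]

# Hand-set weights for illustration; in a learned system they come from training.
# Output 0: approach velocity driven by the visual offset.
# Output 1: grip-force increase driven by the slip signal.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
bias = [0.0, 0.0]

action = fused_policy(vision, touch, weights, bias)
print(action)  # [0.2, 2.0]
```

A real policy would replace the linear layer with image encoders and a deeper network, but the structure stays the same: both senses in, one action out.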

Active Tactile Exploration

Touch is not a sense that only "feels" passively. To get to know an object, a person moves the fingers actively: rubbing a surface to learn its roughness, pressing to learn its hardness, following the outline to grasp its shape. This is called active tactile exploration.

   rubbing        ──▶ texture, roughness
   pressing       ──▶ hardness, elasticity
   contour-following ──▶ shape, edges
   lifting        ──▶ weight, center of mass estimate

A robot can likewise choose its mode of contact to gather information. When it is unsure what object it is touching, it moves its fingers to collect more tactile information. This turns manipulation from mere execution into an exploratory process where sensing and action intertwine. If information is lacking, feel more; if it is sufficient, execute. This loop is the core of active touch.
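
One way to sketch the "feel more while unsure, execute once sure" loop is to keep a belief over object hypotheses, measure its uncertainty with entropy, and stop exploring when the entropy falls below a threshold. This is a swapped-in illustration using a Bayesian update, not a method stated in the text. The hypotheses, likelihood values, and threshold are made up.

```python
import math


def entropy(belief):
    """Shannon entropy in bits; high when the robot is unsure."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def bayes_update(belief, likelihood):
    """Combine the prior belief with the likelihood of one tactile observation."""
    posterior = {k: belief[k] * likelihood[k] for k in belief}
    norm = sum(posterior.values())
    return {k: v / norm for k, v in posterior.items()}


def explore_until_confident(belief, observations, max_entropy=0.5):
    """Feel more while uncertain; stop as soon as the belief is sharp enough."""
    touches = 0
    for likelihood in observations:
        if entropy(belief) <= max_entropy:
            break
        belief = bayes_update(belief, likelihood)
        touches += 1
    return belief, touches


prior = {"key": 0.5, "coin": 0.5}
# Hypothetical likelihoods produced by rubbing, contour-following, and pressing.
observations = [
    {"key": 0.8, "coin": 0.4},
    {"key": 0.9, "coin": 0.1},
    {"key": 0.7, "coin": 0.3},
]
final_belief, touches = explore_until_confident(prior, observations)
print(touches, max(final_belief, key=final_belief.get))
```

Here the third exploratory motion is never executed because two touches already make the belief sharp enough.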

Recognizing Objects by Touch Alone

One interesting application is recognizing an object by touch alone — giving a robot the ability to tell a key from a coin in a pocket with eyes closed. Because a vision-based tactile sensor gives contact as a high-resolution image, an object can be distinguished fairly well from the texture and shape of the contact surface alone.

   object A (smooth cylinder)   contact image ──▶ [classifier] ──▶ "pen"
   object B (bumpy surface)     contact image ──▶ [classifier] ──▶ "coin"
   object C (soft cloth)        contact image ──▶ [classifier] ──▶ "cloth"

This ability is useful because even when the view is blocked, the robot can know what it is touching right now. Rummaging in a bag by hand to find a wanted item, or telling parts apart in the dark, becomes possible. Using vision and touch together, it can also serve as a "confirmation" step, touching an object seen roughly from afar to confirm it.
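
As a deliberately simple stand-in for the classifier in the diagram, the sketch below assigns a contact to the nearest stored prototype. It assumes the contact image has already been summarized as three features (roughness, curvature, softness); a real system would more likely run a learned network on the image itself. The prototype values are invented for illustration.

```python
import math

# Hypothetical prototypes: (roughness, curvature, softness), each scaled to 0..1.
PROTOTYPES = {
    "pen": (0.1, 0.8, 0.1),
    "coin": (0.7, 0.1, 0.0),
    "cloth": (0.4, 0.0, 0.9),
}


def classify(features, prototypes=PROTOTYPES):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda name: math.dist(features, prototypes[name]))


print(classify((0.65, 0.15, 0.05)))  # coin
print(classify((0.35, 0.05, 0.80)))  # cloth
```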

The Basics of Force Control

To use touch well, a robot must be able to control force. A robot that only controls position follows only the command "go here," but a robot that controls force can follow "press with this much force."

   position-only:  move to target position (risk of excess force on contact)
        │
        ▼
   force/compliance control:  contact while holding a target force
        │  (presses softly even when touching a hard wall)
        ▼
   combined with tactile feedback:  adjust target in real time with measured force

The concept of compliance is especially important. A compliant robot reacts softly to external force, so it does not harm parts or people even on unexpected contact. If you measure contact force with a tactile sensor and move the robot compliantly with that value, delicate manipulation impossible with stiff position control becomes possible. Touch and force control are practically a pair.
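
The difference between "go here" and "press with this much force" can be sketched in one dimension. The loop below nudges the commanded position in proportion to the force error, which is one simple way to hold a target force. The spring-like wall model, the gain, and all numbers are assumptions chosen so the toy loop is stable; they are not values from any real robot.

```python
def measured_force(x, wall_position=0.10, wall_stiffness=1000.0):
    """Toy contact model (assumption): the wall pushes back like a spring."""
    return wall_stiffness * max(0.0, x - wall_position)


def force_control_step(x, target_force, gain=0.0005):
    """Move the commanded position in proportion to the force error."""
    return x + gain * (target_force - measured_force(x))


x = 0.0
forces = []
for _ in range(200):
    x = force_control_step(x, target_force=2.0)
    forces.append(measured_force(x))

print(round(forces[-1], 3))             # settles at the 2.0 target force
print(round(measured_force(0.105), 3))  # position-only move 5 mm into the wall: 5.0
```

In free space the force error is constant, so the finger advances steadily; once it touches the wall the error shrinks and the motion stops by itself at the target force. A position-only command that overshoots the wall by a few millimeters instead produces whatever force the wall stiffness dictates.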

Before and After Contact: The Handoff of Senses

Unfolding manipulation along the time axis makes it clear how the lead of the senses changes.

   time ────────────────────────────────────────────▶

   [ approach ]     [ contact ]      [ manipulate ]  [ release ]
   vision leads     transition       touch leads      vision checks
   move to object   first contact    manage force,    check result
      │              │                slip             │
   camera leads   touch begins       touch leads     camera again

The key point of this picture is that the leading sense is not fixed. When approaching, vision leads; on contact, touch leads; when the task is done, vision takes the lead again. A well-built manipulation system handles this handoff smoothly. It detects the moment of contact precisely to switch to touch, and when manipulation ends, it checks the result with vision. The natural collaboration of the two senses is the foundation of dexterous manipulation.
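
The handoff can be written down as a small state machine in which a contact-force threshold triggers the switch from vision to touch. The phase names follow the timeline above; the threshold, the `task_done` flag, and the sample sequence are hypothetical.

```python
# Which sense leads in each phase, following the timeline.
LEAD_SENSE = {
    "approach": "vision",
    "contact": "touch",
    "manipulate": "touch",
    "release": "vision",
}


def next_phase(phase, contact_force, task_done, contact_threshold=0.2):
    """Advance the phase based on the measured contact force and task status."""
    if phase == "approach" and contact_force > contact_threshold:
        return "contact"        # first contact detected: hand over to touch
    if phase == "contact":
        return "manipulate"     # contact confirmed: start force and slip management
    if phase == "manipulate" and task_done:
        return "release"        # done: vision checks the result
    return phase


phase = "approach"
trace = []
# Hypothetical (contact force, task done) readings over five ticks.
for force, done in [(0.0, False), (0.5, False), (0.6, False), (0.6, True), (0.0, False)]:
    phase = next_phase(phase, force, done)
    trace.append((phase, LEAD_SENSE[phase]))
print(trace)
```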

The Challenge of Multi-Fingered Hands

Dexterous manipulation usually requires multiple fingers. A two-finger gripper is largely limited to grasping, but with several fingers, as on a human hand, an object can be repositioned freely within the hand. However, control difficulty rises sharply as fingers are added.

Degree-of-freedom explosion: Each finger has several joints, so there are many variables to control.
Coordination: The forces of several fingers must harmonize so the object stays stable.
More tactile channels: All the fingertip touches must be integrated, so the sensing-processing burden is large too.

So multi-fingered manipulation generally relies heavily on learning. It is very hard to handle many fingers and joints with rules a person tunes one by one. Putting touch into the observation and learning coordination by reinforcement or imitation learning is the practical approach.

Dexterous In-Hand Manipulation

In-hand manipulation is the ability to roll, rotate, and re-pose a held object within the hand, for example turning a cube so a desired face points up, or rotating a bolt between the fingers. It is counted among the hardest problems in robotic manipulation.

The reasons it is hard are these.

Contact keeps changing: While rolling the object, some fingers lift off and others newly touch. The contact state changes moment to moment.
Severe occlusion: An object in the hand is occluded by the fingers and hard for a camera to see. This is why touch is especially important.
Fine force balance: The forces that several fingers apply to the object must balance so it can be moved as desired without being dropped.

In this problem, touch is the key sense that reports what pose the object currently has in the hand and which finger is touching it, where, and how hard. It fills in, through the fingertips, the information that vision cannot provide because of occlusion.

Learning Methods: Putting Touch into the Policy

Dexterous manipulation is often approached by learning rather than hand-coding every rule. Let us look at two representative branches.

Reinforcement-Learning-Based

In simulation, a hand and an object are placed together, and the policy is trained by trial and error with "reward for rotating to the desired pose." The observation includes joint state along with tactile signals (whether there is contact, contact location, force, etc.). Once touch is included in the observation, the policy can estimate the object's state and manipulate it even under occlusion.

  ┌───────────── in-hand manipulation learning (concept) ─────────────┐
  │                                                                   │
  │  observation = joint state + tactile signals (contact/force/loc)  │
  │        │                                                          │
  │        ▼                                                          │
  │   [policy network] ──▶ per-finger joint targets                   │
  │        │                                                          │
  │        ▼                                                          │
  │   observe object pose change in simulator ──▶ reward              │
  │        │  (+ closer to target pose, - if dropped)                 │
  │        └──▶ update policy, repeat                                 │
  └───────────────────────────────────────────────────────────────────┘
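
The observation and reward in this loop can be sketched as two small functions. The reward follows the rule in the diagram (positive for getting closer to the target pose, negative if the object is dropped). The function names, the scalar angle error, and the penalty size are assumptions, and the simulator itself is not shown.

```python
def build_observation(joint_positions, contact_flags, contact_forces):
    """Flatten joint state and per-finger tactile signals into one vector."""
    return (list(joint_positions)
            + [1.0 if c else 0.0 for c in contact_flags]
            + list(contact_forces))


def in_hand_reward(error_before, error_after, dropped,
                   progress_scale=1.0, drop_penalty=10.0):
    """Reward progress toward the target pose; penalize dropping the object."""
    reward = progress_scale * (error_before - error_after)
    if dropped:
        reward -= drop_penalty
    return reward


# Hypothetical three-finger hand with one joint per finger, for brevity.
obs = build_observation([0.1, 0.4, 0.3], [True, False, True], [0.8, 0.0, 0.5])
print(len(obs))                          # 9
print(in_hand_reward(1.0, 0.8, False))   # positive: the pose error shrank
print(in_hand_reward(1.0, 0.8, True))    # negative: the object was dropped
```

Rewarding the change in error rather than the error itself is one common shaping choice; it gives the policy a signal on every step instead of only at the goal.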

Imitation-Learning-Based

A person shows the manipulation by teleoperation or demonstration, and the visual, tactile, and action data at that time are collected so the policy imitates it. Recently, there are increasing attempts to add touch as one input modality to policies that handle vision, language, and action together (the VLA trend). That said, standardizing tactile data and collecting it at scale is still a developing area.

Tactile Sim2Real

The sim2real problem, transferring a policy learned in simulation to real hardware, is especially tricky for touch. Contact physics (friction, deformation, elasticity) is a prime example of what simulators struggle to approximate.

  sim touch (approximated contact physics)   real touch (complex friction/deform/noise)
        contact computed with a simple    vs   gel deformation, slip, sensor noise
        model                                  actually occur
              │                                        │
              └────── techniques to narrow the gap ─────┘
                 · domain randomization (friction/stiffness/noise)
                 · calibrate sim rendering with measured contact images
                 · simplify touch to abstract signals (contact point, force)

The approach is similar to sim2real for locomotion. With domain randomization, friction coefficients, gel stiffness, and sensor noise are randomized during training so the policy does not overfit to particular values. For vision-based sensors, effort also goes into rendering gel deformation in simulation so that it looks as close as possible to real images. In addition, if touch is summarized into more abstract signals like "whether there is contact, contact location, rough force" instead of raw images, the sim-to-real gap tends to shrink.
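
Domain randomization amounts to drawing a fresh set of physical parameters for every training episode. The sketch below samples the three quantities named above and applies the sampled noise level to a force reading. The ranges and names are placeholders, not recommended values, and the call that would hand these parameters to a simulator is omitted because it depends on the simulator used.

```python
import random

# Hypothetical ranges; real ranges would be chosen around measured hardware values.
RANGES = {
    "friction": (0.4, 1.2),        # friction coefficient
    "gel_stiffness": (0.5, 2.0),   # relative to a nominal gel
    "noise_std": (0.0, 0.05),      # standard deviation of sensor noise
}


def sample_episode_params(rng, ranges=RANGES):
    """Draw one set of physics parameters for a training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def noisy_reading(true_force, noise_std, rng):
    """Corrupt a simulated force reading with Gaussian sensor noise."""
    return true_force + rng.gauss(0.0, noise_std)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
episodes = [sample_episode_params(rng) for _ in range(3)]
for params in episodes:
    reading = noisy_reading(1.0, params["noise_std"], rng)
    print({k: round(v, 3) for k, v in params.items()}, round(reading, 3))
```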

Sensor Families: A Deeper Comparison

Let us look a little deeper at the difference between the two sensor families. The choice is always a trade-off.

Aspect	Vision-based (GelSight family)	Electronic skin (e-skin)
Spatial resolution	Very high (image level)	Relatively low (depends on grid density)
Coverage area	Small area like fingertip	Favorable for wide areas
Thickness	Somewhat thick with camera and gel	Favorable for making thin
Output form	Image	Array of electrical signals
Durability	Vulnerable to gel wear	Varies by scheme
Tool reuse	Vision tools reusable as-is	Needs dedicated processing

In sum, vision-based sensors suit precise fingertip manipulation, and electronic skin suits contact sensing over wide regions like arms and torso. A hybrid design using both is natural too: fingertips sense precisely with high-resolution images, wide surfaces sense roughly with electronic skin.

An Actual Workflow Example: Connector Insertion

To see how touch is actually used, take inserting a connector into a socket as an example. It is a representative task that is very hard to do by vision alone.

   1) approach   roughly locate socket by vision, move connector nearby
      │
      ▼
   2) search     lightly place connector tip around the socket, sense contact by touch
      │        (not in yet → adjust position finely)
      ▼
   3) align      estimate the misaligned angle from contact force direction, correct pose
      │        (feels caught → align to force direction)
      ▼
   4) insert     when aligned, push in gently, monitor insertion force by touch
      │        (excess force detected → stop immediately, prevent damage)
      ▼
   5) confirm    judge insertion complete by the "click" contact signal

Here vision is of almost no help from step 2 onward, because the connector and socket are hidden by the hand and the part itself. Just as a person plugs in a plug by hand feel alone in the dark, a robot does this fine alignment and insertion by touch. This is a typical scene where touch shows its true worth.
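
The five steps above can be sketched as a state machine that reads one set of sensor values per tick and returns the next state plus an action. The field names, thresholds, and action strings are hypothetical, and the confirm step is folded into the transition from "insert" to "done".

```python
def reading(**overrides):
    """Build one tick of sensor readings with neutral defaults."""
    base = {"near_socket": False, "contact": False, "lateral_force": 0.0,
            "insertion_force": 0.0, "click": False}
    base.update(overrides)
    return base


def insertion_step(state, r, align_tol=0.1, force_limit=5.0):
    """Return (next_state, action) for one tick of the insertion workflow."""
    if state == "approach":      # 1) vision brings the connector near the socket
        return ("search", "lower tip") if r["near_socket"] else ("approach", "move by vision")
    if state == "search":        # 2) probe until touch reports contact
        return ("align", "hold") if r["contact"] else ("search", "shift position")
    if state == "align":         # 3) lateral force indicates misalignment
        if abs(r["lateral_force"]) <= align_tol:
            return ("insert", "push gently")
        return ("align", "correct pose along force direction")
    if state == "insert":        # 4) push while monitoring insertion force
        if r["insertion_force"] > force_limit:
            return ("stopped", "stop immediately")
        if r["click"]:           # 5) the click signal confirms completion
            return ("done", "release")
        return ("insert", "push gently")
    return (state, "none")


state = "approach"
states = []
ticks = [reading(), reading(near_socket=True), reading(), reading(contact=True),
         reading(lateral_force=0.5), reading(lateral_force=0.05),
         reading(insertion_force=2.0), reading(insertion_force=2.0, click=True)]
for r in ticks:
    state, action = insertion_step(state, r)
    states.append(state)
print(states)
```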

Benchmarks and Data

For tactile research to mature, a fair basis for comparison is needed. But there are structural difficulties here.

Sensor diversity: Each sensor has a different output format, so data gathered with one sensor is hard to use directly with another.
Reproducibility: Contact physics is sensitive to fine conditions (friction, temperature, gel state), making the same experiment tricky to reproduce.
Absence of standard tasks: There is still no agreed set of standard tasks on which everyone measures success rate.

So efforts continue to build shared datasets and evaluation protocols that span several sensors and tasks. Considering how fast the vision and language fields advanced thanks to large-scale benchmarks, such a shared foundation is also crucial for the growth of tactile research. That said, this area is still developing, and the scale or composition of a particular dataset may differ depending on when you look.

Applications

Precision assembly: Tasks needing fine force feedback, such as connector insertion and screw fastening.
Handling fragile objects: Objects where force control is vital, like eggs, fruit, and glass.
Thin, flexible objects: Objects hard to handle by vision alone, like cloth, paper, and cables.
Dark or occluded environments: Finding and assembling parts by hand feel alone where the view is blocked.
Medical/service robots: Situations that must safely contact people or soft objects.

Pitfalls and Limits

Sensor durability: Soft gel can be vulnerable to repeated contact and wear, requiring replacement and protection design.
Absence of data standards: Each tactile sensor has a different output format, so large-scale shared datasets and benchmarks are not yet mature.
Difficulty of contact physics: Friction and deformation are hard to model accurately, which leads to sim2real and reproducibility problems.
Processing latency: Processing high-resolution tactile images in real time carries a compute burden.
Generalization: A policy that works well on specific objects/conditions may degrade when moved to unfamiliar objects/surfaces.
Evaluation criteria: Standard metrics to fairly compare "how well it handles" are still being established.

Touch Meets VLA

We mentioned robot foundation models and the VLA (Vision-Language-Action) trend earlier. One interesting recent question is "what happens if you add touch here." One can imagine an extended policy that handles touch on top of vision and language.

   existing VLA:  vision + language ──▶ action
                    │
                    ▼  add tactile modality
   extended:      vision + language + touch ──▶ action
                    │
                    ▼
        follow language instruction even at contact, manipulate precisely

But this is not easy. Vision and language data are vast on the web, but tactile data comes only from actual contact and has a different format per sensor, making it hard to gather at scale. Still, the attempt to integrate touch as one sensory channel is seen as a natural next step for dexterous manipulation. How fast this direction matures is still an open question.

A Learning-Perspective Summary

Let us summarize the methods of putting touch into a policy in one table.

Method	Data source	Strength	Weakness
Reinforcement learning	Simulation trial and error	Cheap data, exploratory	sim2real gap
Imitation learning	Human demonstration	Directly transfers human skill	Cost of collecting demos
Hybrid	Sim + a little real	Balances robustness and efficiency	Complex pipeline

No method is a panacea. In practice, one learns basics in simulation, fine-tunes with a small amount of real data, and adds human demonstration if needed, combining several methods. The scarcer the data — as with touch — the more important a data-efficient combination strategy becomes.

Narrowing Tactile Sim2Real: Three Branches

We said tactile sim2real is especially tricky. Efforts to narrow it can be split into three broad branches.

Make physics more accurate: Simulate contact, friction, and deformation more realistically to bring the physics of sim and real closer.
Make the policy more robust: Let it experience varied physics via domain randomization to obtain a policy that withstands a wide range of real conditions.
Make the representation more abstract: Summarize touch into abstract signals like contact point and force instead of raw images, using a representation that sim and real can share easily.

These three are not exclusive and are used together. You improve physics, make the policy robust, and choose the representation carefully. In the end the goal is one: learn a lot cheaply in simulation, yet make the result work on hardware too. Because of touch's complex contact physics, this goal is harder than for locomotion, but for that reason it is an actively researched area.

Real-Time Behavior and Latency

In manipulation, the faster the tactile loop, the better. If the latency from detecting slip to raising force is large, the object may already have dropped. A person's slip reflex is very fast, and a robot needs a tactile feedback loop of comparable speed.

   contact occurs ──▶ sensor measures ──▶ signal processing ──▶ decide ──▶ adjust force
      │                 │                    │                   │           │
      └── the latency of each stage sums into the total reaction time ───────┘

   goal: close this whole loop fast enough (with short latency)

A vision-based sensor gives a high-resolution image but carries the compute burden of processing that image. A low-dimensional signal, in contrast, is light to process. So in real-time manipulation you must balance "what and how precisely to sense" against "how fast to react." One approach is to divide roles: a reflex where speed is vital, like slip prevention, runs fast on a light signal, while recognition that needs precision, like identifying texture, runs more slowly on an image.
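
A short worked example shows why the stage latencies matter. It sums the stages of the loop for a light-signal pipeline and an image pipeline, then estimates how far an object slides before the force correction arrives. Every number here is invented for illustration; real stage latencies and slip speeds have to be measured.

```python
# Hypothetical per-stage latencies in milliseconds.
LIGHT_SIGNAL_MS = {"sensing": 5.0, "processing": 1.0, "decision": 2.0, "actuation": 10.0}
IMAGE_MS = {"sensing": 5.0, "processing": 30.0, "decision": 2.0, "actuation": 10.0}


def total_latency_ms(stages):
    """The loop's reaction time is the sum of its stage latencies."""
    return sum(stages.values())


def slip_distance_mm(latency_ms, slip_speed_mm_s):
    """Distance the object slides before the correction takes effect."""
    return slip_speed_mm_s * latency_ms / 1000.0


slip_speed = 40.0  # mm/s, assumed
for name, stages in [("light signal", LIGHT_SIGNAL_MS), ("image", IMAGE_MS)]:
    latency = total_latency_ms(stages)
    print(name, latency, "ms ->", round(slip_distance_mm(latency, slip_speed), 2), "mm")
```

Under these assumed numbers the image pipeline lets the object slide more than twice as far, which is the argument for running the slip reflex on the light signal.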

New Directions

Tactile manipulation is a rapidly widening field. Let us touch on a few trends.

Cheaper and tougher sensors: For mass deployment, cheap and wear-resistant sensors are needed.
Whole-body touch: Sensing widely beyond fingertips to arms and torso, toward robots that contact people safely.
Advances in tactile simulation: Simulating contact physics more accurately shrinks the sim2real gap.
Deeper sensory fusion: Integrating vision, touch, and even sound (contact sound) to understand objects and situations.

The common goal of these trends is one: to make robots contact the physical world more delicately and safely. Contact is the essence of manipulation, and touch is the sense that understands that contact.

Closing

Giving a robot the sense of its fingertips transforms manipulation from a "see and roughly guess" problem into a "feel and modulate in real time" problem. Vision-based tactile sensors turned contact into images so that mature vision tools can be used directly, and electronic skin senses contact over broad surfaces. We use vision and touch together, learn in-hand manipulation, and narrow the gap to reality with sim2real techniques.

Just as a person finds a key in a pocket with eyes closed, a robot may one day handle the world with its fingertips alone. Tactile research is moving in that direction, quietly but steadily.

References

GelSight (introduction): https://www.gelsight.com/
Open X-Embodiment (arXiv, large-scale robot data context): https://arxiv.org/abs/2310.08864
OpenVLA (arXiv, vision-language-action policy): https://arxiv.org/abs/2406.09246
RT-2 (arXiv, VLA): https://arxiv.org/abs/2307.15818
Physical Intelligence: https://www.physicalintelligence.company/
NVIDIA Isaac (contact simulation context): https://developer.nvidia.com/isaac
Boston Dynamics (manipulation research context): https://bostondynamics.com/
Hacker News (robotics/tactile discussion): https://news.ycombinator.com/