- Published on
Tactile Sensing and Dexterous Manipulation — Robots That Feel with Their Fingertips
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction
- Why Touch
- Two Streams of Tactile Sensors
- What Touch Tells You
- Learning from Human Touch
- Grasp Stability: How Hard to Grip
- Representation of Tactile Data
- Visual-Tactile Fusion
- Active Tactile Exploration
- Recognizing Objects by Touch Alone
- The Basics of Force Control
- Before and After Contact: The Handoff of Senses
- The Challenge of Multi-Fingered Hands
- Dexterous In-Hand Manipulation
- Learning Methods: Putting Touch into the Policy
- Tactile Sim2Real
- Sensor Families: A Deeper Comparison
- An Actual Workflow Example: Connector Insertion
- Benchmarks and Data
- Applications
- Pitfalls and Limits
- Touch Meets VLA
- A Learning-Perspective Summary
- Narrowing Tactile Sim2Real: Three Branches
- Real-Time Behavior and Latency
- New Directions
- Closing
- References
Introduction
Even with our eyes closed, we tell keys from coins inside a pocket, modulate force so a cup does not spill, and instantly catch an object slipping between our fingers. What makes all of this possible is touch.
Robotic manipulation has long relied mostly on the "eyes." A camera sees an object, its position and pose are estimated, and the gripper is sent there. But the moment the hand touches the object, vision can no longer see the contact between hand and object (occlusion). Whether it is slipping, how hard it is being gripped, whether the surface is smooth or rough — only the sense at the fingertips knows.
This article covers the tactile sensing that lets a robot feel the world with its fingertips, and the dexterous manipulation that uses that sense. We look at which sensors exist, what information touch actually provides, how it fuses with vision, how in-hand manipulation is learned, and the tactile-specific sim2real problem.
Why Touch
Vision and touch are each good at different things. Vision excels at grasping the whole scene from a distance, but it misses the fine physics at the moment of contact. Touch is the opposite.
| Sense | Good at | Bad at |
|---|---|---|
| Vision | Long-range perception, overall layout, rough shape | Contact force, slip, state behind occlusion |
| Touch | Contact force, slip detection, local texture/shape | Long-range perception, pre-contact information |
In dexterous manipulation, the decisive moments mostly occur after contact. Turning a screw with the fingers, picking up a thin card, moving an egg without cracking it — these are very hard without real-time feedback of contact force. This is why touch is needed.
Two Streams of Tactile Sensors
Robotic tactile sensors can be broadly split into two families.
Vision-Based Tactile Sensors (the GelSight family)
The representative one is the GelSight family. The principle is astonishingly simple yet powerful. A thin reflective membrane is coated on the surface of a transparent gel (a jelly-like elastomer), and that gel is filmed from the inside with a camera. When an object presses the gel, the gel surface deforms, and under internal LED lighting the camera captures that deformation as an image.
object presses the gel surface
│
▼
┌─────────────────────────────┐
│ elastic gel (reflective │ ← deforms to the object shape
│ coating) │
├─────────────────────────────┤
│ transparent support layer │
│ ↑ LED lighting (multi │
│ [ camera ] ──▶ image │ ← captures deformation in high res
└─────────────────────────────┘
│
▼
acquire contact shape, fine texture, and force distribution as an "image"
The big advantage of this approach is that it turns contact into an image. Fine texture of the contact surface, small lettering, and even surface asperities can be captured as a high-resolution image, and the amount of gel deformation reveals the force distribution and even signs of slip. A further benefit is that mature computer-vision and deep-learning tools can be applied directly.
Electronic Skin (electrical tactile arrays)
Another stream is electronic skin (e-skin). Elements that respond to pressure and deformation are laid out in a grid over a wide area, and the force at each point is read as an electrical signal. There are various schemes — capacitive, resistive, piezoelectric, and more.
Electronic skin can be made thin and can cover wide areas, so it is well suited to sensing contact over broad regions like the arms and torso. On the other hand, it is comparatively harder to obtain the ultra-high-resolution texture image that vision-based sensors give. Rather than competing, the two families are best seen as filling different roles: fingertips (high resolution) and broad surfaces (wide coverage).
What Touch Tells You
Let us organize what information can actually be extracted from tactile sensors.
- Contact force and direction: How hard and in which direction the surface is being pressed. Essential when handling fragile objects such as eggs.
- Slip detection: Early signs that an object is starting to slip from the hand. On detecting slip, you can immediately raise the grip force so as not to drop it.
- Local shape/texture: Asperities, edges, and surface roughness of the contact surface. Similar to finding a keyhole by hand in the dark.
- Contact location: Which point of the finger the object touches. It becomes a clue for estimating the object's pose in in-hand manipulation.
Slip detection is especially important. People feel the very early signals that an object is starting to slip (vibration, contact-surface shift) and unconsciously apply more force. If a robot catches these signals through touch, it becomes capable of exquisite force modulation — neither gripping so hard it breaks the object nor so weakly it drops it.
Learning from Human Touch
Robot tactile research often references the human fingertip. Human skin has several kinds of mechanoreceptors, each specialized for different stimuli.
| Receptor type | Good at sensing | Robot analog |
|---|---|---|
| Fast-adapting | Vibration, onset of slip | Time derivative of contact change |
| Slow-adapting | Sustained pressure, shape | Static force distribution |
The key insight is that "touch is not a single signal but a combination of several channels." A channel that measures sustained pressure and a channel that measures instantaneous change must both exist for stable grasping. This is also why vision-based tactile sensors are powerful: they can extract both the static deformation of the gel surface (pressure) and the temporal change of that deformation (signs of slip) from one image stream.
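The two channels in the table can be sketched in a few lines. This is a minimal illustration, assuming the sensor output has already been reduced to a small grid of pressure values. The `Frame` type and the function names `static_channel` and `dynamic_channel` are hypothetical and do not come from any sensor's API.

```python
from typing import List, Tuple

# Hypothetical representation: a 2D grid of per-cell pressure values.
Frame = List[List[float]]


def static_channel(frame: Frame) -> float:
    """Slow-adapting analog: total sustained pressure in the current frame."""
    return sum(sum(row) for row in frame)


def dynamic_channel(prev: Frame, curr: Frame, dt: float) -> float:
    """Fast-adapting analog: mean absolute rate of change between two frames."""
    cell_count = len(curr) * len(curr[0])
    total_change = sum(
        abs(c - p)
        for prev_row, curr_row in zip(prev, curr)
        for p, c in zip(prev_row, curr_row)
    )
    return total_change / (cell_count * dt)


def tactile_channels(prev: Frame, curr: Frame, dt: float) -> Tuple[float, float]:
    """Return (static, dynamic) readings computed from one stream of frames."""
    return static_channel(curr), dynamic_channel(prev, curr, dt)
```

Both values come from the same stream of frames, which mirrors the point about a vision-based sensor providing pressure and change information from one image stream.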
Grasp Stability: How Hard to Grip
One fundamental question of manipulation is "how hard to grip." Too weak and it drops; too hard and it breaks or strains the joints. People strike this balance unconsciously, but a robot needs an explicit strategy.
grip force ────────────────────────────────────▶
too weak proper range too hard
┌──────────┬─────────────────┬──────────────┐
│ slip │ stable grasp │ object │
│ drop │ (held by touch)│ damage │
└──────────┴─────────────────┴──────────────┘
▲ ▲ ▲
too little touch feedback too much
keeps it here
Tactile feedback makes it possible to stay in this "proper range." If a sign of slip is detected, raise the force a little; if the grasp is stable, keep only the minimum force. This way, a fragile object like an egg and a heavy tool can be handled with the same hand. Modulating grip force in real time by touch, instead of fixing it to a constant, is the starting point of dexterous manipulation.
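As a rough sketch of this rule, the update below raises the force when slip is flagged and otherwise relaxes slowly toward a minimum. The function name, the force limits, and the step sizes are all illustrative assumptions, not values from any real gripper.

```python
def update_grip_force(
    force: float,
    slip_detected: bool,
    f_min: float = 1.0,    # assumed minimum holding force (arbitrary units)
    f_max: float = 20.0,   # assumed damage threshold (arbitrary units)
    step_up: float = 0.5,  # increase applied when slip is detected
    relax: float = 0.05,   # slow decrease applied while the grasp is stable
) -> float:
    """One control tick: tighten on slip, otherwise relax toward the minimum."""
    if slip_detected:
        force += step_up
    else:
        force -= relax
    # Clamp so the grip never leaves the [f_min, f_max] band.
    return max(f_min, min(f_max, force))
```

Called once per control tick, for example `force = update_grip_force(force, slip_flag)`, this keeps the grip inside the band in the diagram. The relaxation step is deliberately smaller than the tightening step so that a slip event is corrected faster than it was provoked.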
Representation of Tactile Data
To use touch in learning, you must represent the sensor output in a form the policy can handle. There are two broad branches.
- Image representation: A vision-based sensor gives contact directly as an image. It is convenient to reuse vision networks like a CNN as-is.
- Low-dimensional signal representation: Represent it as summarized numbers like whether there is contact, contact location, and force vector. Light and favorable for sim2real, but fine texture information is lost.
Which representation is good depends on the task. If fine texture matters, like picking up a thin card, the image representation may be favorable; if force management matters, like not dropping a heavy object, the low-dimensional representation may be. In practice, the two are sometimes mixed.
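To make the low-dimensional branch concrete, here is one possible way to collapse a tactile frame into a contact flag, a contact location, and a total force. It assumes the frame is a small grid of pressure values, and the `summarize` function and its `threshold` parameter are hypothetical.

```python
from typing import Dict, List


def summarize(frame: List[List[float]], threshold: float = 0.1) -> Dict[str, object]:
    """Reduce a pressure grid to contact flag, pressure-weighted centroid, total force."""
    total = 0.0
    row_moment = 0.0
    col_moment = 0.0
    for r, row in enumerate(frame):
        for c, pressure in enumerate(row):
            if pressure > threshold:  # ignore cells below the noise floor
                total += pressure
                row_moment += r * pressure
                col_moment += c * pressure
    if total == 0.0:
        return {"contact": False, "location": None, "force": 0.0}
    return {
        "contact": True,
        "location": (row_moment / total, col_moment / total),
        "force": total,
    }
```

The summary is only three quantities, which is why it is light to process, and it is also why fine texture is lost: two very different surfaces can produce the same centroid and total force.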
Visual-Tactile Fusion
The most powerful manipulation systems use vision and touch together, because each covers the other's weakness.
┌──────────── vision (camera) ────────────┐
│ before contact: rough object pos/shape │
│ plan the path as the hand approaches │
└──────────────────┬───────────────────────┘
│ at contact, the hand occludes the object
▼
┌──────────── touch (fingertip sensor) ────┐
│ after contact: force, slip, local shape │
│ real-time force control, pose fine-tune │
└──────────────────┬───────────────────────┘
│
▼
┌──── integrated visuotactile policy ────┐
│ take both senses as input and │
│ decide the next action │
└─────────────────────────────────────────┘
A typical flow is this. Before contact, vision guides the approach to the object; from the moment of contact, touch takes the lead, modulating force and preventing slip. In learning-based systems, the camera image and the tactile image (or signal) are fed together into a neural network to learn a single policy that integrates both senses.
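The simplest form of such an integrated policy is late fusion: encode each sense separately, concatenate the features, and map them to an action. The sketch below shows only that concatenation step with a single linear layer. The feature vectors, weights, and function names are placeholders, and a real system would use learned encoders and a deeper network.

```python
from typing import List, Sequence


def linear(x: Sequence[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Plain linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def fused_policy(
    vision_feat: Sequence[float],
    touch_feat: Sequence[float],
    weights: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Late fusion: concatenate both feature vectors, then apply one policy head."""
    fused = list(vision_feat) + list(touch_feat)
    return linear(fused, weights, bias)
```

Because both senses share one policy head, the network can learn for itself when to rely on the camera features and when to rely on the tactile ones.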
Active Tactile Exploration
Touch is not a sense that only "feels" passively. To know an object, a person moves the fingers actively. Rubbing a surface to know its roughness, pressing to know hardness, following the outline to grasp shape. This is called active tactile exploration.
rubbing ──▶ texture, roughness
pressing ──▶ hardness, elasticity
contour-following ──▶ shape, edges
lifting ──▶ weight, center of mass estimate
A robot, likewise, can choose the mode of contact on its own to gather information. If it is unsure what the object is, it moves the fingers to gather more tactile information. This turns manipulation from mere execution into an exploratory process where sensing and action intertwine. If information is lacking, feel more; if there is enough, execute. This loop is the core of active touch.
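One simple way to formalize "feel more until sure" is a Bayesian belief over candidate objects that is updated after each exploratory touch. This is an illustrative choice, not a method the article attributes to any particular system. The likelihood values are assumed to come from some hypothetical tactile model.

```python
from typing import Dict, Iterable, Tuple

Belief = Dict[str, float]


def update_belief(belief: Belief, likelihood: Belief) -> Belief:
    """Bayes rule: multiply prior by likelihood, then renormalize."""
    unnormalized = {k: belief[k] * likelihood[k] for k in belief}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


def explore_until_confident(
    belief: Belief,
    observations: Iterable[Belief],
    threshold: float = 0.9,
) -> Tuple[Belief, int]:
    """Keep touching until one hypothesis exceeds the confidence threshold."""
    touches = 0
    for likelihood in observations:
        if max(belief.values()) >= threshold:
            break  # enough information: stop exploring and execute
        belief = update_belief(belief, likelihood)
        touches += 1
    return belief, touches
```

The loop stops spending time on exploration as soon as the belief is sharp enough, which is the "if there is enough, execute" half of the rule.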
Recognizing Objects by Touch Alone
One interesting application is recognizing an object by touch alone — giving a robot the ability to tell a key from a coin in a pocket with eyes closed. Because a vision-based tactile sensor gives contact as a high-resolution image, an object can be distinguished fairly well from the texture and shape of the contact surface alone.
object A (smooth cylinder) contact image ──▶ [classifier] ──▶ "pen"
object B (bumpy surface) contact image ──▶ [classifier] ──▶ "coin"
object C (soft cloth) contact image ──▶ [classifier] ──▶ "cloth"
This ability is useful because even when the view is blocked, the robot can know what it is touching right now. Rummaging in a bag by hand to find a wanted item, or telling parts apart in the dark, becomes possible. Using vision and touch together, it can also serve as a "confirmation" step, touching an object seen roughly from afar to confirm it.
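The classifier box in the diagram would normally be a trained network over contact images. As a much smaller stand-in, the sketch below uses a nearest-prototype rule over three hand-picked features. The feature names (roughness, curvature, softness) and the prototype values are invented for illustration only.

```python
import math
from typing import Dict, Tuple

Features = Tuple[float, float, float]  # (roughness, curvature, softness), all hypothetical

# Made-up prototype feature vectors, one per object class.
PROTOTYPES: Dict[str, Features] = {
    "pen": (0.1, 0.8, 0.1),
    "coin": (0.7, 0.2, 0.0),
    "cloth": (0.4, 0.0, 0.9),
}


def classify(features: Features) -> str:
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(PROTOTYPES, key=lambda label: math.dist(features, PROTOTYPES[label]))
```

For example, `classify((0.65, 0.25, 0.05))` lands nearest the "coin" prototype. A learned classifier replaces the hand-picked features with ones extracted from the contact image, but the decision structure is the same.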
The Basics of Force Control
To use touch well, a robot must be able to control force. A robot that only controls position follows only the command "go here," but a robot that controls force can follow "press with this much force."
position-only: move to target position (risk of excess force on contact)
│
▼
force/compliance control: contact while holding a target force
│ (presses softly even when touching a hard wall)
▼
combined with tactile feedback: adjust target in real time with measured force
The concept of compliance is especially important. A compliant robot reacts softly to external force, so it does not harm parts or people even on unexpected contact. If you measure contact force with a tactile sensor and move the robot compliantly with that value, delicate manipulation impossible with stiff position control becomes possible. Touch and force control are practically a pair.
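A one-dimensional toy makes the difference visible. In the sketch below the wall is modeled as a spring and the robot moves in proportion to the force error, which is a basic admittance-style rule. The wall stiffness, the gain, and all function names are assumptions chosen for the example.

```python
def measured_force(x: float, wall_x: float = 0.0, stiffness: float = 1000.0) -> float:
    """Hypothetical contact model: the wall pushes back like a spring once x > wall_x."""
    return max(0.0, (x - wall_x) * stiffness)


def admittance_step(x: float, f_target: float, f_meas: float, gain: float = 0.0005) -> float:
    """Move along the pressing direction in proportion to the force error.

    For this toy model the update is stable only while gain < 2 / stiffness.
    """
    return x + gain * (f_target - f_meas)


def press(f_target: float = 5.0, steps: int = 200) -> float:
    """Start in free space, approach the wall, and settle at the target force."""
    x = -0.01
    for _ in range(steps):
        x = admittance_step(x, f_target, measured_force(x))
    return measured_force(x)
```

With these numbers the contact force settles at the 5.0 target no matter where the wall is, whereas a position command to `x = 0.01` would produce twice that force against the same wall. That insensitivity to the wall's exact location is what makes the compliant version safe on unexpected contact.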
Before and After Contact: The Handoff of Senses
Unfolding manipulation along the time axis makes it clear how the lead of the senses changes.
| Phase (in time order) | Lead sense | What happens |
|---|---|---|
| Approach | Vision leads (camera) | Move to the object |
| Contact | Transition (touch begins) | First contact |
| Manipulate | Touch leads | Manage force and slip |
| Release | Vision checks (camera again) | Check the result |
The key point of this timeline is that "the senses are not fixed." When approaching, vision leads; on contact, touch takes over; when the task is done, vision takes the lead again. A well-built manipulation system handles this handoff smoothly. It detects the moment of contact precisely to switch to touch, and when manipulation ends, it checks the result with vision. The natural collaboration of the two senses is the foundation of dexterous manipulation.
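The handoff can be written down as a small state machine. This is a sketch under the assumption that a contact flag and a task-done flag are available from elsewhere in the system. The `Phase` enum, `next_phase`, and `LEAD_SENSE` are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    APPROACH = "approach"
    CONTACT = "contact"
    MANIPULATE = "manipulate"
    RELEASE = "release"


# Which sense drives decisions in each phase.
LEAD_SENSE = {
    Phase.APPROACH: "vision",
    Phase.CONTACT: "touch",
    Phase.MANIPULATE: "touch",
    Phase.RELEASE: "vision",
}


def next_phase(phase: Phase, in_contact: bool, task_done: bool) -> Phase:
    """Advance one step along the timeline when the triggering signal appears."""
    if phase is Phase.APPROACH and in_contact:
        return Phase.CONTACT      # first touch detected: hand over to touch
    if phase is Phase.CONTACT and in_contact:
        return Phase.MANIPULATE   # contact confirmed: start force management
    if phase is Phase.MANIPULATE and task_done:
        return Phase.RELEASE      # finished: vision checks the result
    return phase
```

The transition out of `APPROACH` depends entirely on the contact flag, which is why precise contact detection matters so much for a smooth handoff.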
The Challenge of Multi-Fingered Hands
Dexterous manipulation usually requires multiple fingers. A two-finger gripper is limited to grasping, but with several fingers like a human hand, an object can be handled freely within the hand. However, control becomes sharply harder as the number of fingers grows.
- Degree-of-freedom explosion: Each finger has several joints, so there are many variables to control.
- Coordination: The forces of several fingers must harmonize so the object stays stable.
- More tactile channels: All the fingertip touches must be integrated, so the sensing-processing burden is large too.
So multi-fingered manipulation generally relies heavily on learning. It is very hard to handle many fingers and joints with rules a person tunes one by one. Putting touch into the observation and learning coordination by reinforcement or imitation learning is the practical approach.
Dexterous In-Hand Manipulation
In-hand manipulation means the ability to roll, rotate, and re-pose a held object within the hand. For example, turning a cube in the hand so a desired face points up, or rotating a bolt between the fingers. This is counted among the hardest problems in robotic manipulation.
The reasons it is hard are these.
- Contact keeps changing: While rolling the object, some fingers lift off and others newly touch. The contact state changes moment to moment.
- Severe occlusion: An object in the hand is occluded by the fingers and hard for a camera to see. This is why touch is especially important.
- Fine force balance: The forces that several fingers apply to the object must balance so it can be moved as desired without being dropped.
In this problem, touch is the key sense that tells you what pose the object currently has in the hand and which fingers are touching it, and how. It fills in, through the fingertips, the information that vision alone cannot provide because of occlusion.
Learning Methods: Putting Touch into the Policy
Dexterous manipulation is often approached by learning rather than hand-coding every rule. Let us look at two representative branches.
Reinforcement-Learning-Based
In simulation, a hand and an object are placed together, and the policy is trained by trial and error with "reward for rotating to the desired pose." The observation includes joint state along with tactile signals (whether there is contact, contact location, force, etc.). Once touch is included in the observation, the policy can estimate the object's state and manipulate it even under occlusion.
┌───────────── in-hand manipulation learning (concept) ─────────────┐
│ │
│ observation = joint state + tactile signals (contact/force/loc) │
│ │ │
│ ▼ │
│ [policy network] ──▶ per-finger joint targets │
│ │ │
│ ▼ │
│ observe object pose change in simulator ──▶ reward │
│ │ (+ closer to target pose, - if dropped) │
│ └───────────────────────────────────────────────────────────┘
└────────────────────────────────────────────────────────────────────┘
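The observation and reward in the diagram can be sketched as two small functions. This is an assumed shaping (progress toward the target pose, with a fixed penalty for dropping), not the reward of any specific published system. The names and the penalty value are hypothetical.

```python
from typing import List, Sequence


def build_observation(
    joint_angles: Sequence[float],
    contact_flags: Sequence[bool],
    contact_forces: Sequence[float],
) -> List[float]:
    """Concatenate joint state and per-finger tactile signals into one vector."""
    return (
        list(joint_angles)
        + [1.0 if flag else 0.0 for flag in contact_flags]
        + list(contact_forces)
    )


def reward(
    angle_error_prev: float,
    angle_error_now: float,
    dropped: bool,
    drop_penalty: float = 10.0,  # assumed value
) -> float:
    """Positive when the object moved closer to the target pose, negative if dropped."""
    if dropped:
        return -drop_penalty
    return angle_error_prev - angle_error_now
```

Rewarding the change in error rather than the error itself means the policy earns reward for each bit of progress, while the drop penalty discourages risky finger motions that lose contact.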
Imitation-Learning-Based
A person shows the manipulation by teleoperation or demonstration, and the visual, tactile, and action data at that time are collected so the policy imitates it. Recently, there are increasing attempts to add touch as one input modality to policies that handle vision, language, and action together (the VLA trend). That said, standardizing tactile data and collecting it at scale is still a developing area.
Tactile Sim2Real
The sim2real problem of transferring a policy learned in simulation to real hardware is especially tricky for touch. Contact physics (friction, deformation, elasticity) is a prime example of something simulators struggle to approximate.
sim touch: approximated contact physics (contact computed with a simple model)
vs
real touch: complex friction/deformation/noise (gel deformation, slip, and sensor noise actually occur)
│
▼
techniques to narrow the gap
· domain randomization (friction/stiffness/noise)
· calibrate sim rendering with measured contact images
· simplify touch to abstract signals (contact point, force)
The approach is similar to sim2real for locomotion. With domain randomization, friction coefficients, gel stiffness, and sensor noise are shaken during training so the policy does not overfit to particular values. For vision-based sensors, effort is also made to render gel deformation in simulation to look as close as possible to real images. Also, if touch is summarized into more abstract signals like "whether there is contact, contact location, rough force" instead of raw images, the sim-to-real gap tends to shrink.
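Domain randomization itself is mechanically simple: draw fresh physical parameters at the start of every training episode. The sketch below shows that sampling step. The parameter names and ranges are illustrative assumptions and would need to be chosen for a specific simulator and sensor.

```python
import random
from typing import Dict, Tuple

# Hypothetical randomization ranges (low, high) for each simulated parameter.
RANGES: Dict[str, Tuple[float, float]] = {
    "friction": (0.4, 1.2),
    "gel_stiffness": (0.5, 2.0),
    "sensor_noise_std": (0.0, 0.05),
}


def sample_episode_params(rng: random.Random) -> Dict[str, float]:
    """Draw one set of physics parameters, uniformly within each range."""
    return {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}


def noisy_reading(true_force: float, noise_std: float, rng: random.Random) -> float:
    """Corrupt a simulated force reading with zero-mean Gaussian noise."""
    return true_force + rng.gauss(0.0, noise_std)
```

A training loop would call `sample_episode_params(rng)` once per episode and pass the result to the simulator, so the policy never sees the same friction or stiffness twice in a row.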
Sensor Families: A Deeper Comparison
Let us look a little deeper at the difference between the two sensor families. The choice is always a trade-off.
| Aspect | Vision-based (GelSight family) | Electronic skin (e-skin) |
|---|---|---|
| Spatial resolution | Very high (image level) | Relatively low (depends on grid density) |
| Coverage area | Small area like fingertip | Favorable for wide areas |
| Thickness | Somewhat thick with camera and gel | Favorable for making thin |
| Output form | Image | Array of electrical signals |
| Durability | Vulnerable to gel wear | Varies by scheme |
| Tool reuse | Vision tools reusable as-is | Needs dedicated processing |
In sum, vision-based sensors suit precise fingertip manipulation, and electronic skin suits contact sensing over wide regions like arms and torso. A hybrid design using both is natural too: fingertips sense precisely with high-resolution images, wide surfaces sense roughly with electronic skin.
An Actual Workflow Example: Connector Insertion
To see how touch is actually used, take inserting a connector into a socket as an example. This is a representative task that is very hard to do by vision alone.
1) approach roughly locate socket by vision, move connector nearby
│
▼
2) search lightly place connector tip around the socket, sense contact by touch
│ (not in yet → adjust position finely)
▼
3) align estimate the misaligned angle from contact force direction, correct pose
│ (feels caught → align to force direction)
▼
4) insert when aligned, push in gently, monitor insertion force by touch
│ (excess force detected → stop immediately, prevent damage)
▼
5) confirm judge insertion complete by the "click" contact signal
Here vision is of almost no help after step 2, because the connector and socket are hidden by the hand and the part itself. Just as a person plugs in a connector by feel alone in the dark, a robot does this fine alignment and insertion by touch. This is a typical scene where touch shows its true worth.
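Steps 3 and 4 of the workflow above can be sketched as a per-tick decision. The code assumes the sensor reports an axial force and a lateral force pair, and that moving along the lateral force relieves the jam. The sign convention, the thresholds, and the function names are all assumptions.

```python
import math
from typing import Tuple


def alignment_correction(
    fx: float, fy: float, gain: float = 0.001, deadband: float = 0.2
) -> Tuple[float, float]:
    """Step 3: small sideways move along the lateral force; zero inside the deadband."""
    if math.hypot(fx, fy) < deadband:
        return (0.0, 0.0)
    return (gain * fx, gain * fy)


def insertion_action(
    axial_force: float, lateral: Tuple[float, float], force_limit: float = 15.0
) -> str:
    """Step 4: choose what to do this tick based on the measured forces."""
    if axial_force > force_limit:
        return "stop"   # excess force: halt immediately to prevent damage
    if alignment_correction(*lateral) != (0.0, 0.0):
        return "align"  # feels caught: correct the pose before pushing further
    return "push"       # aligned and within limits: keep inserting gently
```

The force limit is checked first on purpose, so that a jammed connector stops the motion even when an alignment correction would otherwise be attempted.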
Benchmarks and Data
For tactile research to mature, a fair basis for comparison is needed. But there are structural difficulties here.
- Sensor diversity: Each sensor has a different output format, so data gathered with one sensor is hard to use directly with another.
- Reproducibility: Contact physics is sensitive to fine conditions (friction, temperature, gel state), making the same experiment tricky to reproduce.
- Absence of standard tasks: There is still a lack of agreed standards for "let us measure success rate with this task."
So recently, efforts to build shared datasets and evaluation protocols spanning several sensors and tasks have continued. Considering how quickly the vision and language fields advanced thanks to large-scale benchmarks, such a shared foundation is also crucial to the growth of tactile research. That said, this area is still developing, and the scale or composition of any particular dataset may change over time.
Applications
- Precision assembly: Tasks needing fine force feedback, such as connector insertion and screw fastening.
- Handling fragile objects: Objects where force control is vital, like eggs, fruit, and glass.
- Thin, flexible objects: Objects hard to handle by vision alone, like cloth, paper, and cables.
- Dark or occluded environments: Finding and assembling parts by hand feel alone where the view is blocked.
- Medical/service robots: Situations that must safely contact people or soft objects.
Pitfalls and Limits
- Sensor durability: Soft gel can be vulnerable to repeated contact and wear, requiring replacement and protection design.
- Absence of data standards: Each tactile sensor has a different output format, so large-scale shared datasets and benchmarks are not yet mature.
- Difficulty of contact physics: Friction and deformation are hard to model accurately, which leads to sim2real and reproducibility problems.
- Processing latency: Processing high-resolution tactile images in real time carries a compute burden.
- Generalization: A policy that works well on specific objects/conditions may degrade when moved to unfamiliar objects/surfaces.
- Evaluation criteria: Standard metrics to fairly compare "how well it handles" are still being established.
Touch Meets VLA
We mentioned robot foundation models and the VLA (Vision-Language-Action) trend earlier. One interesting recent question is "what happens if you add touch here." One can imagine an extended policy that handles touch on top of vision and language.
existing VLA: vision + language ──▶ action
│
▼ add tactile modality
extended: vision + language + touch ──▶ action
│
▼
follow language instruction even at contact, manipulate precisely
But this is not easy. Vision and language data are vast on the web, but tactile data comes only from actual contact and has a different format per sensor, making it hard to gather at scale. Still, the attempt to integrate touch as one sensory channel is seen as a natural next step for dexterous manipulation. How fast this direction matures is still an open question.
A Learning-Perspective Summary
Let us summarize the methods of putting touch into a policy in one table.
| Method | Data source | Strength | Weakness |
|---|---|---|---|
| Reinforcement learning | Simulation trial and error | Cheap data, exploratory | sim2real gap |
| Imitation learning | Human demonstration | Directly transfers human skill | Cost of collecting demos |
| Hybrid | Sim + a little real | Balances robustness and efficiency | Complex pipeline |
No method is a panacea. In practice, one learns basics in simulation, fine-tunes with a small amount of real data, and adds human demonstration if needed, combining several methods. The scarcer the data — as with touch — the more important a data-efficient combination strategy becomes.
Narrowing Tactile Sim2Real: Three Branches
We said tactile sim2real is especially tricky. Efforts to narrow it can be split into three broad branches.
- Make physics more accurate: Simulate contact, friction, and deformation more realistically to bring the physics of sim and real closer.
- Make the policy more robust: Let it experience varied physics via domain randomization so that the policy tolerates a wide range of real conditions.
- Make the representation more abstract: Summarize touch into abstract signals like contact point and force instead of raw images, using a representation that sim and real can share easily.
These three are not exclusive and are used together. You improve physics, make the policy robust, and choose the representation carefully. In the end the goal is one: learn a lot cheaply in simulation, yet make the result work on hardware too. Because of touch's complex contact physics, this goal is harder than for locomotion, but for that reason it is an actively researched area.
Real-Time Behavior and Latency
In manipulation, the faster the tactile loop, the better. If the latency from detecting slip to raising force is large, the object may already have been dropped. A person's slip reflex is very fast, and a robot needs a comparably fast tactile feedback loop.
contact occurs ──▶ sensor measures ──▶ signal processing ──▶ decide ──▶ adjust force
│ │ │ │ │
└── the latency of each stage sums into the total reaction time ───────┘
goal: close this whole loop fast enough (with short latency)
A vision-based sensor gives a high-resolution image but carries the compute burden of processing that image. A low-dimensional signal, in contrast, is light to process. So in real-time manipulation you must balance "what and how precisely to sense" against "how fast to react." A reflex where speed is vital, like slip prevention, is done fast with a light signal; recognition needing precision, like grasping texture, is done slowly with an image — dividing roles this way is one approach.
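The trade-off can be made explicit as a latency budget: add up the stages of the loop and compare the total against a reaction deadline. All stage names and millisecond figures below are made-up placeholders for illustration, not measurements of any sensor or robot.

```python
from typing import Dict


def loop_latency_ms(stages: Dict[str, int]) -> int:
    """Total reaction time is the sum of every stage in the loop."""
    return sum(stages.values())


def meets_deadline(stages: Dict[str, int], deadline_ms: int) -> bool:
    """True if the whole loop closes within the reaction deadline."""
    return loop_latency_ms(stages) <= deadline_ms


# Placeholder budgets (milliseconds) for the two kinds of pipeline.
IMAGE_PATH = {"sense": 16, "process": 30, "decide": 5, "actuate": 10}
LIGHT_PATH = {"sense": 1, "process": 1, "decide": 1, "actuate": 10}
```

With these placeholder numbers the image path totals 61 ms and the light path 13 ms, so under an assumed 20 ms slip deadline only the light path qualifies. That is the role division described above in numeric form.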
New Directions
Tactile manipulation is a rapidly widening field. Let us touch on a few trends.
- Cheaper and tougher sensors: For mass deployment, cheap and wear-resistant sensors are needed.
- Whole-body touch: Sensing widely beyond fingertips to arms and torso, toward robots that contact people safely.
- Advances in tactile simulation: Simulating contact physics more accurately shrinks the sim2real gap.
- Deeper sensory fusion: Integrating vision, touch, and even sound (contact sound) to understand objects and situations.
The common goal of these trends is one: to make robots contact the physical world more delicately and safely. Contact is the essence of manipulation, and touch is the sense that understands that contact.
Closing
Giving a robot the sense of its fingertips transforms manipulation from a "see and roughly guess" problem into a "feel and modulate in real time" problem. Vision-based tactile sensors turned contact into images so that mature vision tools can be used directly, and electronic skin senses contact over broad surfaces. We use vision and touch together, learn in-hand manipulation, and narrow the gap to reality with sim2real techniques.
Just as a person finds a key in a pocket with eyes closed, the day a robot handles the world with fingertips alone — in that direction, tactile research is moving quietly but steadily.
References
- GelSight (introduction): https://www.gelsight.com/
- Open X-Embodiment (arXiv, large-scale robot data context): https://arxiv.org/abs/2310.08864
- OpenVLA (arXiv, vision-language-action policy): https://arxiv.org/abs/2406.09246
- RT-2 (arXiv, VLA): https://arxiv.org/abs/2307.15818
- Physical Intelligence: https://www.physicalintelligence.company/
- NVIDIA Isaac (contact simulation context): https://developer.nvidia.com/isaac
- Boston Dynamics (manipulation research context): https://bostondynamics.com/
- Hacker News (robotics/tactile discussion): https://news.ycombinator.com/