Speaking Practice — Watch the Mouth and Speak; If You Cannot Speak It, You Cannot Hear It
Author: Youngju Kim (@fjvbn20031)

- Opening — I can hear everything, but my mouth will not move
- Core insight — language is a protocol that connects you to the world
- Going deeper 1 — production builds the circuit
- Going deeper 2 — how the circuit gets laid: motor memory and automation
- Going deeper 3 — understanding the speech organs: a map of tongue and lips
- Going deeper 4 — watch the mouth, imitate the sound
- A 30-day plan to fully finish one piece of shadowing
- English and Japanese — same principle, different details
- A collection of concrete prompts for real conversation with AI
- A case — from an awkward meeting remark to a natural one
- Input-centered learner vs. production-centered learner
- Practice — daily routines that move the mouth
- Handling fear — clumsy production beats silence
- A real check — questions to see whether your learning is production-centered
- Traps — the ones people commonly fall into
- FAQ
- The long view — keeping the production habit going past a year
- Closing — only when the mouth moves do you finally hear
- References
Opening — I can hear everything, but my mouth will not move
Even after studying English for a fairly long time, one problem tormented me for a while. When I watched an American show without subtitles, I caught about 80 percent. I knew the words and I followed the context. On the listening side, at least, I had little trouble.
But then, in a video call, when a foreign colleague would ask "so how are you going to handle this issue?", I had the answer in my head yet my mouth would not move. They were words I clearly knew, but they would not assemble into a sentence and come out.
My years at LINE were especially like this. Over Slack text I could go back and forth in English or Japanese just fine, but the moment I joined a voice meeting my mouth froze. In chat I wrote long, accurate sentences, while out loud I only repeated "uh... um... yeah, it's... okay."
At first I thought I was short on vocabulary, so I memorized more words. It still did not work. I studied more grammar. Again, nothing.
Only much later did I realize: I had done plenty of input but almost no output. I had built only the listening and reading muscles, left the speaking muscle neglected, and then wondered "why won't the words come out?"
This post is about that realization. There are two core claims. First, language is not knowledge but a skill, so it improves only when you produce output. Second, surprisingly, speaking practice lifts even your listening.
The title "if you cannot speak it, you cannot hear it" is not hyperbole but a conclusion I reached in my own body. When sounds I had moved my mouth to make started ringing clearly in my ears, I understood that this order was not backwards at all.
Core insight — language is a protocol that connects you to the world
First, let me propose a way of seeing language. Language is not an exam subject; it is a protocol that connects you to the world.
Let me use an analogy familiar to developers. For two systems to communicate, they must speak the same protocol. No matter how good your data is, if you cannot emit it in a format the other side understands, the communication fails.
People are the same. Even with a good thought in your head, if you cannot produce it as sounds the other side understands, that thought is not conveyed.
Learning a language is, then, installing this input-output protocol into your own body. And a protocol does not install just from reading. The connection completes only after you actually exchange data.
Yet we usually learn only the "read" half of the protocol. We practice decoding (listening and reading) but never encoding (speaking and writing). So we end up with half a connection.
It is a one-sided link that processes what comes in but cannot send anything out. Picture a server that receives data and parses it but cannot serialize a response and send it back. Such a server is useless as a communication partner.
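For readers who think in code, the half-connected server can be sketched as follows. This is only an illustration of the analogy under my own assumptions: the class names, the `handle` function, and the JSON message shape are hypothetical and not part of any real system.

```python
import json
from typing import Optional


class HalfConnectedPeer:
    """Hypothetical peer that practiced only decoding (listening, reading)."""

    def decode(self, raw: bytes) -> dict:
        # Parsing incoming data works fine.
        return json.loads(raw.decode("utf-8"))

    def encode(self, message: dict) -> bytes:
        # The outgoing half of the protocol was never practiced.
        raise NotImplementedError("encoding was never practiced")


class FullPeer(HalfConnectedPeer):
    """Hypothetical peer that also practiced encoding (speaking, writing)."""

    def encode(self, message: dict) -> bytes:
        return json.dumps(message).encode("utf-8")


def handle(peer: HalfConnectedPeer, raw: bytes) -> Optional[bytes]:
    """Understand the question, then try to send an answer back."""
    request = peer.decode(raw)
    reply = {"re": request["question"], "answer": "by Thursday"}
    try:
        return peer.encode(reply)
    except NotImplementedError:
        return None  # the answer exists internally but never leaves


raw = b'{"question": "when will this be done?"}'
print(handle(HalfConnectedPeer(), raw))  # None
print(handle(FullPeer(), raw))
```

Both peers understand the question equally well; only the one that has an `encode` path gets an answer out.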
Linguistics has the Output Hypothesis, proposed by Merrill Swain. It holds that comprehensible input alone is not enough; only when learners strain to produce the language themselves do they discover the gaps in their own knowledge and internalize the grammar.
The moment you try to produce, you confront your limits — "wait, how do I say this?" That confrontation pushes learning forward. The blank you skipped right over while only doing input finally appears before your eyes the moment you try to speak.
On the other side is Stephen Krashen's Input Hypothesis (Comprehensible Input). The claim is that if you receive enough comprehensible input, language is acquired naturally.
I see the two hypotheses not as opposed but as a pair. You fill up the raw material with good input, and you drive that material into your own circuit through output.
To put it in an analogy, input is grocery shopping and output is cooking. No matter how many ingredients you buy, your cooking skill does not improve unless you actually cook. And of course you cannot cook with no ingredients either. The two are not a sequence but a pair.
Going deeper 1 — production builds the circuit
Why does listening alone fail to bring out speech? Because, from the brain's point of view, listening and speaking are separate circuits.
Listening is recognition. You identify a pattern that already exists, so the load is light. It is similar to recognizing the right option on a multiple-choice question.
Speaking, by contrast, is production. You have to arrange the meaning in your head into the right word order, choose the words, and convert all of that into precise movements of tongue and lips. This is an entirely different motor task. It is closer to writing an open-ended answer from a blank page.
No matter how much you listen, this production circuit is barely trained. The recognition circuit and the production circuit run along different paths.
The analogy to sports makes it clear. If you watched 1,000 hours of basketball, could you reliably sink free throws? You could not. Watching and doing are different things.
I play table tennis. I have watched hundreds of hours of Ma Long and Zhang Jike matches on YouTube. I know the trajectory of a forehand drive in my head. Yet the moment I pick up the paddle, my body will not reproduce it, because the watching circuit and the hitting circuit run separately. In the end you have to swing thousands of times yourself before the motion is burned into the body.
Speaking is exactly the same. Production improves only by producing. You have to repeat that motion of moving the mouth to lay down the circuit. What the head knows and what the mouth does are different dimensions.
Here is one important fact. Production practice does not just improve speaking — it lifts listening too. Sounds you have pronounced yourself become far easier to hear.
A sound you have never produced with your own mouth, say English linking or a reduced vowel, leaves you unsure where to break it even when it reaches your ear. When "what do you want to do?" arrives smeared together as "wadayawannado", it sounds like one lump and you cannot get a handle on it.
But once you pronounce that linking yourself as "wanna do", from then on you hear it clearly. Because your mouth has made the pattern, your ear recognizes the pattern. Production aids recognition.
This is what lies behind "if you cannot speak it, you cannot hear it." If you have no inventory of sounds you have produced yourself, your ear has no standard to compare against, so the sound comes in and slips away.
Let me give one more concrete example of sounds. English "gonna", "wanna", "gotta" look like separate words when you learn them in writing, but in reality they come out fused into a single lump.
Until I had pronounced "I'm gonna check it" fused together myself, something like "amgonnacheckit", I could not pin down where the word boundaries were even when the phrase came up in a show. But once I had built it as one lump with my own mouth, I recognized the same sound instantly whenever it appeared.
Japanese is the same. "Ja nai desu ka" is long on the page, but in real speech it gets fused fast into something like "janaissuka". After I had fused and said it myself, I clearly heard a colleague using this expression in meetings.
In the end, listening and speaking are not two unrelated subjects but the input and output ends of one linked system. Train one side and the other rises along with it. That is why production practice is also the most efficient listening practice.
Going deeper 2 — how the circuit gets laid: motor memory and automation
Let me go a little further into exactly what it means for a production circuit to "get laid down."
The first time you put a new expression in your mouth, everything is conscious. You call up each word, check the word order, mind the pronunciation, and speak haltingly. Your brain is running at full tilt, so you tire quickly and have no spare capacity for anything else.
In table tennis terms, this is when you first learn the backhand. If you check foot position, paddle angle, and timing one by one in your head as you swing, the ball flies off somewhere strange. When the conscious mind controls every detail of the motion, the motion actually breaks up.
But repeat the same motion hundreds of times and at some point the conscious mind drops out. The body does it on its own. This is automation. Only then do you finally get the spare capacity to attend to other information — where the ball is coming, how your opponent is standing.
Automation in speaking is exactly the same. Glue one sentence like "could you walk me through this part?" to your mouth a few dozen times, and later it pops out whole without conscious effort.
For the circuit to be laid down means, in the end, reaching the state where "what you used to assemble consciously, the unconscious now pulls out whole." And this automation is built only through repeated production. It does not get laid down by reading or by listening. The mouth has to repeat the motion enough.
So the conclusion of the learning strategy is simple. Repeat even a small amount until it is fully automated. Ten sentences that are completely glued to your mouth are far stronger in the field than a hundred sentences you half-know.
An automated sentence carries almost no cognitive load. So while you use it in the field, you can pour all your leftover attention into the content and the other person's reaction. A sentence that is not automated, by contrast, eats up your whole mind each time, robbing you of the room to focus on what actually matters.
This is the real reason behind the principle "completeness over quantity." A small number of high-completeness sentences create mental room in the field, and that room makes better communication possible.
Going deeper 3 — understanding the speech organs: a map of tongue and lips
The reason a sound will not come is not weak willpower. It is that you do not know, or have never done, the motion of mouth and tongue that makes that sound. Pronunciation is ultimately motion, and motion requires knowing the exact position.
Let me point out a few sounds that do not exist in Korean. The key is to explain not with abstract phrases like "roll your tongue" but with what touches where.
The English "th" sound (think, this) is made by lightly catching the tip of the tongue between the upper and lower teeth and letting air out. Koreans have a habit of replacing this with a "ss" or "d" sound, so think becomes "ssingk" and this becomes "diss". Look in a mirror and check whether the tongue tip pokes slightly out between the teeth, and it corrects quickly.
English "r" and "l" are notorious for Koreans. For "l", you press the tongue tip firmly against the gum ridge just behind the upper teeth. For "r", you touch nothing. You curl the tongue slightly up toward the roof of the mouth but keep it floating, not touching. The difference between rice and lice, right and light, is decided right here. Just being conscious of "does the tongue touch or not" solves half of it.
English "f" and "v" are made by lightly catching the lower lip with the upper teeth. You must not bring the two lips together as in "p" or "b". You have to make five not as "pa-i-beu" but with the motion of the upper teeth touching the lower lip.
Move to Japanese and there are different traps. "Tsu" is not the same as the Korean "ch". You keep the tongue tip near the gum ridge and burst out a short "ts" sound. Saying "chu" is awkward; you have to keep the "t" feel alive in "tsu" for it to sound natural.
The Japanese geminate consonant (the small tsu) is not a sound you make but a one-beat pause. "Kitte" (stamp) is "kit-te", but you are not pronouncing an extra sound there: you close the mouth between "ki" and "te" and rest for one beat. Drop that one beat of pause and it gets confused with "kite" (come).
The Japanese long vowel is also a motion. The difference between "obasan" (auntie) and "obaasan" (grandmother) is whether you end "ba" in one beat or draw it out over two. On the page it is a difference of a single character, but in speech the mouth has to know the motion of dragging one extra beat.
All of this is consciously drawing a map of the mouth and tongue. "What touches where, where it releases, where it pauses." Once you draw this map, your imitation from then on becomes far more accurate.
At first I strongly recommend using a mirror. When you pronounce "th" in front of a mirror, check with your eyes whether the tongue tip shows between the teeth; when you make "f", check whether the upper teeth meet the lower lip. Sound is hard to confirm by ear alone, but motion you can confirm with your eyes.
Filming your mouth with a phone camera and placing it side by side with a native speaker's video is also good. The mouth shape you think you make and your actual mouth shape are often different. An objective comparison corrects you far faster than a vague sense.
One more thing. When you learn a new sound, it works well to practice that sound alone in an exaggerated way and then put it into words. Say just "r" ten times and just "l" ten times on their own, then move on to right and light, and the difference between the two sounds is burned more clearly into your mouth.
Going deeper 4 — watch the mouth, imitate the sound
So how should you actually do production practice? The key is to watch the mouth shape and copy it exactly.
Pronunciation is ultimately a matter of mouth and tongue position. The reason the sound will not come no matter how you mimic with a Korean mouth shape is that the motion itself is different. So you have to watch the native speaker's mouth shape with your eyes, be conscious of where the tongue rests, and make the same shape.
The most powerful drill is shadowing. While listening to a native voice, you repeat it almost simultaneously, like a shadow. Do not fixate on word meaning; imitate the sound, intonation, rhythm, and break points wholesale.
At first your mouth cannot keep up and you stumble. This stumbling is the very proof that your production circuit was lacking, and the signal that it is now being trained. Not stumbling means you can already do it; the spot where you stumble is exactly the spot that needs practice.
I practiced English this way. I picked one short line from a show I liked and, watching the actor's mouth shape on screen, copied it ten times, twenty times.
At first it felt awkward and I could only do it when alone, but once that one line was fully glued to my mouth, similar sentences strangely popped out automatically in meetings. A sentence I had pre-built with my mouth worked like a ready-made part in the field.
Let me give a concrete example. When I had meetings with Japanese colleagues at LINE, I would always get stuck at the start, going "uh... ano... kore wa...". So I picked a few opening phrases I often used in meetings and shadowed them whole.
My awkward version: "e... ano... watashi wa omou, kore ga..." (halting, piling up the subject first and then jamming).
The natural version: "sou desu ne, kore ni tsuite wa..." ("well now, as for this...", an opener that buys a little time first).
After I had glued "sou desu ne, kore ni tsuite wa" to my mouth about a hundred times, this opener came out automatically whenever I got a question in a meeting. That bought me time to assemble the body of the answer in my head. Once the halting silence vanished, meetings got comfortable.
I did the same for English. I glued meeting openers like "that's a good point, but...", "let me get back to you on that", and "just to make sure I understand..." to my mouth whole. The content differs each time, but when the opener comes out automatically, continuing afterward is far easier.
Stages of shadowing
- Listen: Listen to a short sentence (one line) several times until you fully catch the meaning.
- Watch the mouth: If it is a video, watch the speaker's mouth — where the lips round and where the tongue touches.
- Speak in overlap: Repeat almost overlapping with the audio. Imitate sound, intonation, and rhythm wholesale.
- Record and compare: Record yourself and compare with the original. Mark the points that differ.
- Repeat: Repeat until that one line is completely glued to your mouth. Completeness over quantity.
A 30-day plan to fully finish one piece of shadowing
Even after you resolve to do shadowing, once you start it is easy to drift, swapping material over and over. So I set the goal of "fully finishing one piece." Whether it is a song or a single drama scene, that means gluing one chunk of up to three minutes completely to your mouth within 30 days.
There is a reason to dig one piece all the way through. If you keep changing material, you only repeat the initial halting stage every time. Take the same material to the end and you can see the halting sections shrink, so you directly feel the process of automation happening.
Pick material under three minutes that is not too fast. A scene from a show you like, a song you like, or a clip from a podcast you enjoy will do. The material has to be one you like to survive 30 days.
| Week | Goal | Daily task |
|---|---|---|
| Week 1 | Get familiar with the whole | Keep listening, read the lyrics or script and grasp the meaning |
| Week 2 | Glue your mouth to it | Speak in overlap while watching, repeat only the halting sections separately |
| Week 3 | Stop looking at the script | Listen to the sound only and follow without the script, start recording |
| Week 4 | Check completeness | Compare the original with your recording, focus-correct the sections that differ |
One tip. For the one or two phrases that just will not come, pull them out separately and do a "loop drill." Repeat that phrase alone ten, twenty times until it is burned in, then return to the whole. If you always run the whole thing from the start, the hard sections stay hard forever.
After 30 days you can follow that three minutes almost like the original. And the pronunciation, linking, intonation, and expressions inside it begin to leak out into other sentences too. That is why the experience of properly finishing one piece is worth more than drifting through a hundred.
English and Japanese — same principle, different details
Even with the same production-centered principle, the wall you hit differs by language. Let me sort out the differences I felt while studying both English and Japanese.
English's biggest wall is rhythm and linking. English is a stress-timed language, so stressed syllables go long and strong while the rest are blurred and fast.
Pronouncing every word crisply actually makes you harder to follow. Pronounce "I want to go to the store" with each word exact and you sound like a machine. In reality it comes out like "aiwannagotothestore", with only a few stresses surviving and the rest weakened.
So English production practice should focus on putting the rhythm of the chunk on whole, rather than on individual words. The key is the practice of deliberately weakening the sounds that reduce.
Japanese, by contrast, is a mora-timed language, so you have to produce each sound crisply in even beats for it to sound natural. Koreans tend to let their guard down because Japanese looks easy.
But the awkwardness shows in long versus short vowels (obasan versus obaasan), the one beat of the geminate consonant (the small tsu), and pitch accent. The same "hashi" can mean bridge or chopsticks depending on the pitch. Japanese production practice goes well when you keep the beat with claps or a metronome.
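The beat counting can be made concrete with a small sketch. It is a deliberately simplified counter of my own, not a linguistic tool: it assumes plain lowercase romaji with long vowels written as doubled letters (no macrons or apostrophes), treats a doubled consonant as the small tsu, and ignores spellings such as "tch". The function name `count_morae` is hypothetical.

```python
VOWELS = set("aeiou")


def count_morae(romaji: str) -> int:
    """Count beats (morae) in a romaji word under the simplified rules above."""
    word = romaji.lower()
    beats = 0
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in VOWELS:
            # Every vowel letter is one beat, so a doubled "aa" is two beats.
            beats += 1
        elif ch == "n":
            # "n" is its own beat unless it opens a syllable (na, nya, ...).
            if nxt not in VOWELS and nxt != "y":
                beats += 1
        elif ch.isalpha() and ch == nxt:
            # A doubled consonant stands for the small tsu: one silent beat.
            beats += 1
    return beats


for word in ["kite", "kitte", "obasan", "obaasan"]:
    print(word, count_morae(word))
# kite 2
# kitte 3
# obasan 4
# obaasan 5
```

Each pair differs by exactly one beat, which is the beat the mouth has to hold when you clap or follow a metronome.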
NHK's Japanese learning materials and announcer news broadcasts are good models of standard intonation. Shadow a news clip in time with the beat and you develop a sense for the pitch accent.
The commonality is clear too. Both improve only after you produce enough with the mouth, and in both, listening rises once you produce. One principle, only the details differ.
| Aspect | English | Japanese |
|---|---|---|
| Timing type | Stress-timed | Mora-timed |
| Core challenge | Linking, stress, reduced vowels | Long/short vowels, geminate, pitch accent |
| Sounds hard for Koreans | th, r/l, f/v | tsu, geminate (one-beat pause), long vowel |
| Production point | Put on the chunk rhythm whole | Even, crisp beats |
| Good model material | Shows, BBC, podcasts | NHK news, announcer speech |
| Typical pitfall for Koreans | Pronouncing everything crisply | Relaxing on beat/pitch because it looks easy |
I gathered the pronunciations Koreans get wrong most often into a separate table. Knowing where you slip makes correction faster.
| Sound | Common mistake | Correct motion |
|---|---|---|
| English th (think) | Replacing with ss, like "ssingk" | Catch the tongue tip between the teeth and blow |
| English r / l | Merging both into the single Korean r/l sound | l touches the gum ridge, r touches nothing |
| English f / v | Replacing with p or b, like "pa-i-beu" | Catch the lower lip with the upper teeth |
| English reduced vowel | Pronouncing every vowel crisply | Weaken and blur vowels outside the stress |
| Japanese tsu | Pronouncing it like "chu" | Tongue tip near the gum ridge, short ts |
| Japanese geminate | Dropping the pause | Close the mouth and rest one beat |
| Japanese long vowel | Cutting it short with no beat | Drag one extra beat |
A collection of concrete prompts for real conversation with AI
In the past, the biggest barrier to production practice was "no partner." There was no one to talk to, and no one to correct you when you got it wrong. But now the environment has completely changed.
These days you can speak to an AI like ChatGPT by voice and get an answer back. You have gained an infinite, pressure-free conversation partner.
Mistakes you would be too embarrassed to make in front of a person, you make freely in front of an AI. Since you improve faster the more mistakes you make, an environment where the embarrassment is gone is itself a great asset.
That said, just saying "let's talk in English" vaguely is less effective. You have to specify the role and the correction method concretely. Let me share a few prompts I actually use.
Role-play plus correction prompt. "From now on you are my foreign colleague. Let's role-play a sprint meeting in English. When I say something awkward, first fix it into natural phrasing, then continue the conversation. Fix only one sentence at a time."
Pronunciation feedback prompt. "Point out the pronunciation spots Koreans often get wrong in the sentence I just said. Explain which sound to make and how, in terms of mouth and tongue position."
Expression upgrade prompt. "This sentence I wrote gets the meaning across but is too textbook. Rewrite it into three more natural expressions a native speaker would actually use in a meeting."
Opener practice prompt. "Tell me three natural English openers each for agreeing, disagreeing, buying time, and asking back, in a meeting. When I repeat them, check my pronunciation."
Japanese business prompt. "Let's role-play a business Japanese email reply. When I speak too casually, fix it into polite phrasing and point out my honorific mistakes."
Here is the most important principle. If you only talk to AI in text, the production circuit (mouth motion) does not improve. You must speak aloud. Chatting with only your fingers moving is reading and writing practice, not speaking practice.
Also, rather than trusting AI pronunciation blindly, you should run input alongside it by shadowing real native voices (shows, news, podcasts). AI is a partner that removes the burden of production practice; it is not a textbook that guarantees the quality of your input.
A case — from an awkward meeting remark to a natural one
The abstract principle alone may not land, so let me lay out my actual process of change as a dialogue example. In my LINE years, I often had to explain schedule delays in English sprint meetings.
The early me was like this. When the manager asked "when will this task be done?", I would get stuck trying to build a perfect sentence in my head.
The awkward version: "uh... I think... this task... um... maybe not this week. very difficult. sorry." (laying out words one chopped piece at a time and ending in an apology).
The problem with this remark was not my English ability. I had the answer in my head but no parts glued to my mouth, so each time I stacked a sentence from bare ground and it collapsed. And ending every remark with "sorry" left it powerless.
So I picked the chunks of expression I often used for explaining schedules and shadowed them whole. I glued expressions like "it's taking longer than expected because...", "I should have it ready by Thursday", and "let me get back to you with a firm date" to my mouth a few dozen times each.
A few weeks later, when I got the same question, it came out like this.
The natural version: "it's taking a bit longer than expected because of the API change. I should have it ready by Thursday, and I'll let you know if anything changes." (joining together three parts I had glued in advance).
The sentence did not get fancy. I just connected parts I had glued to my mouth in advance to fit the situation. Once the halting silence and the needless apology vanished, the same content sounded far more confident.
This is the key. Do not try to make it up on the spot in the field; pull out parts you have glued to your mouth in advance. A meeting is not a stage for improvisation but a place to assemble prepared parts. And only the parts you have produced with your mouth in advance come out in the field.
Input-centered learner vs. production-centered learner
Let me sort out why results split even with the same amount of time spent, by comparing two types of learner. The past me was a textbook input-centered learner.
| Aspect | Input-centered learner | Production-centered learner |
|---|---|---|
| Mostly does | Listening, reading, memorizing words | Shadowing, self-talk, conversation |
| Time with mouth open | Almost none | Sets aside time daily |
| Attitude to mistakes | Avoids out of fear of being wrong | Sees mistakes as fuel |
| Listening skill | Improves somewhat | Improves faster while producing |
| Speaking skill | Hardly improves | Improves steadily |
| Common state | Hears everything but the mouth won't move | Speech comes out, even if rough |
The key difference is just one thing: "do you open your mouth?" The input-centered learner is content to stack material; the production-centered learner pulls that material out with the mouth. Study the same amount and the one who opens the mouth goes further in the end.
Do not misunderstand. This is not to say input is bad. Input is the material and it is essential. It is only that input alone is not enough, and that without production you stop at the halfway point. A good learner does both together.
Practice — daily routines that move the mouth
- One line of shadowing daily: Pick one short line and copy it until it is fully glued to your mouth. Completeness over quantity. Five minutes a day is enough.
- Record and listen back: Record your pronunciation and compare with the original. Even if you hate hearing it, you have to see the gap to fix it.
- Talk to yourself in English/Japanese: On the commute, mutter today's tasks in the target language. Production practice with no listener.
- AI role-play three times a week: Set a scene — a meeting, ordering at a restaurant, small talk — and converse aloud with AI and get corrected.
- Be conscious of mouth shape: When you meet a new sound, consciously observe where the mouth and tongue are and imitate it.
- Stock up on openers: Glue meeting opener expressions to your mouth whole. They guard against the moment speech jams.
- Frequency over fear: Ten clumsy tries beat one perfect one. The production circuit is laid down by repetition.
Example week
| Day | Action |
|---|---|
| Daily | 5 min of one-line shadowing + commute self-talk |
| Mon/Wed/Fri | 10 min of AI role-play conversation |
| Tue/Thu | Record and compare yesterday's shadowed sentence |
| Saturday | Shadow one whole scene from a favorite video |
| Sunday | Review the expressions that got glued to your mouth that week |
The heart of this routine is not scale but daily repetition. Even at five minutes a day, the days on which you open your mouth add up to a gap that days of silence can never close.
Table tennis, too, improves far faster with 20 minutes daily than with two hours once a week. Motor memory feeds on frequency and grows. Just as your hands stiffen after a few days off, speaking too gets halting again if it stops.
Handling fear — clumsy production beats silence
The biggest enemy that blocks production-centered learning is neither vocabulary nor grammar but fear. You do not open your mouth out of fear of being wrong, of looking foolish, of awkward pronunciation. That silence stops the learning.
I was like that for a long time too. In a LINE meeting, while I polished a sentence ten times in my head to say one thing in English, the topic moved on. There were countless meetings where, waiting for the perfect sentence, I ended up saying nothing at all.
There was an event that changed my thinking. One day I watched a non-native colleague boldly press his point in grammatically messy English. The sentences were wrong, but the message got across perfectly, and the meeting flowed his way. Meanwhile I, holding more accurate English in my head yet never opening my mouth, was as good as absent from the meeting.
That is when I realized: language is not an exam that scores you but a tool for conveying intent. A message delivered, however clumsily, always beats a message kept perfectly polished inside the head.
Let me sort out a few realistic ways to reduce the fear.
- Start in a pressure-free setting: Begin where there is no evaluator — when alone, in front of an AI, muttering to yourself on the commute.
- Set a mistake goal: Set a goal like "make five mistakes on purpose in today's conversation," and a mistake becomes an achievement rather than a failure.
- Start small: Instead of a long remark, open your mouth with a short reply ("yes, I agree") first. A small success makes the next attempt easier.
- Objectify with recording: Fear is largest when it is vague. Record and actually listen back and it is often better than you thought.
Fear is not erased by willpower but worn down by frequency. The more often you open your mouth, the more ordinary opening it becomes, and once it is ordinary, it is no longer scary.
A real check — questions to see whether your learning is production-centered
Here is a list of questions to check whether you have fallen into the input trap. Answer honestly and you will quickly see what to fix.
| Question | If yes | If no |
|---|---|---|
| Did you say something aloud in the target language today | The production circuit is running | You may be doing input only |
| Have you ever recorded and listened to your own pronunciation | You are seeing yourself objectively | Risk of errors hardening without your noticing |
| Do you have ten sentences fully glued to your mouth | Automated assets are piling up | You know the words but they will not come out |
| Have you stocked up on meeting openers | You are ready for jams | Risk of freezing when asked |
| Have you ever dug one piece all the way through | You have a completion experience | You may be drifting between materials |
If two or fewer of the five questions are "yes," your learning is tilted toward input. Just change the easiest one first. Usually "say something aloud today" is the starting point.
I recommend doing this check once a month. Left alone, learning drifts toward the comfortable side, namely input. Listening and reading work even when you sit still, but production requires consciously opening your mouth. A regular check corrects that drift.
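For readers who like to keep the monthly check in a script, the rule of thumb above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `CHECKLIST`, `assess`, and `first_gap` are hypothetical, and suggesting the first unanswered item in list order is an assumption (the text only says to change the easiest one first).

```python
from typing import List, Optional

# The five questions from the table, in the article's order.
CHECKLIST = [
    "Said something aloud in the target language today",
    "Recorded and listened to my own pronunciation",
    "Have ten sentences fully glued to my mouth",
    "Stocked up on meeting openers",
    "Dug one piece of material all the way through",
]


def assess(answers: List[bool]) -> str:
    """Apply the rule: two or fewer 'yes' answers means input-tilted."""
    if len(answers) != len(CHECKLIST):
        raise ValueError("expected one answer per checklist question")
    yes_count = sum(answers)
    # The article only defines the "two or fewer" case; anything above
    # that is simply reported as not flagged (an assumption of this sketch).
    return "input-tilted" if yes_count <= 2 else "not flagged"


def first_gap(answers: List[bool]) -> Optional[str]:
    """Return the first question answered 'no', or None if all are 'yes'."""
    for question, answered_yes in zip(CHECKLIST, answers):
        if not answered_yes:
            return question
    return None


if __name__ == "__main__":
    this_month = [False, True, False, True, False]
    print(assess(this_month))
    print("Start with:", first_gap(this_month))
```

Because "say something aloud today" comes first in the list, it is also the first suggestion whenever it is unanswered, which matches the usual starting point.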
Traps — the ones people commonly fall into
- Endlessly increasing input only: Hoarding listening material while the mouth stays still. With no production, speech does not improve. For 30 minutes of input, slot in 15 of production.
- Caring only about meaning, ignoring sound: If you mind only meaning while shadowing, pronunciation and rhythm do not improve. At first set the meaning down for a moment and imitate the sound wholesale.
- Just switching material: If you move to new material before finishing one piece, you keep repeating only the halting early stage. Dig one piece all the way through.
- A mouth blocked by perfectionism: If fear of being wrong keeps you from speaking, you never improve. Clumsy production is a hundred times better than silence.
- Relying on AI text chat: If only the fingers move, mouth motion is zero. You must speak aloud.
- Ignoring language differences: Apply English rhythm to Japanese, or Japanese beats to English, and it sounds awkward. The principle is the same, but the details differ by language.
- Taking the muscle metaphor literally: "Mouth muscle" is just a metaphor, not a medical claim. The point is that repeated production automates the motor pattern.
FAQ
Q. Shouldn't I finish listening first and then move to speaking? Do not see it as sequential. The two go together. In fact, speaking practice tends to lift listening, so running both from the start is the efficient choice. The two wheels — filling material with input and driving it in with output — have to turn together.
Q. If I shadow alone, won't I miss that my pronunciation is wrong? That is why recording and comparing are essential. Hear your voice next to the original and the gap shows. At first listening to your own recording is painful, but that pain is the starting point of correction. Adding AI voice feedback makes it even better.
Q. I am too embarrassed to make sound. Most people are. So start in pressure-free settings — when alone, in front of an AI, muttering to yourself on the commute. Embarrassment wears away with frequency. I, too, started by muttering alone in the bathroom.
Q. Isn't memorizing more vocabulary the priority? Vocabulary is the material; production is the assembly. Pile up material without assembling it and no speech comes out. Practicing making sentences from words you already know and gluing them to your mouth pays off first. Vocabulary researchers like Paul Nation also treat knowing a word (receptive) and using it (productive) as different abilities.
Q. How much per day before I improve? Frequency matters more than time. Even five minutes a day, opening your mouth every day, beats cramming two hours on the weekend. Motor memory settles in only when stimulated often.
Q. My mind goes blank when I'm suddenly asked a question in a meeting. Stock up on opener expressions. Glue something like "that's a good question, let me think" to your mouth whole, and it comes out automatically first, buying you time to assemble the body in the meantime. It is a safety device that fills the silence.
Q. Can I do English and Japanese at the same time? You can. The principle is the same, so the production-centered habit applies as is. But the pronunciation details differ, so be careful not to drag the beat sense of one language straight into the other. Splitting the time slots so the two languages do not mix is also a method.
The long view — keeping the production habit going past a year
Speaking does not improve explosively in a short stretch. It is a matter of accumulating motor memory, so a few days of intense effort show little. That is why you need devices to keep it going over the long haul.
First, leave your progress in records. Shadow the same material and record it every month, and compared with a recording from three months ago, you can hear your own growth. Nothing sustains motivation longer than visible progress.
Second, tie the learning to your life. Slot it into routines you already have — commute self-talk, simulating today's meeting in the shower, one line of shadowing before bed — and it rolls without setting aside separate time. Lean on habit, not willpower, and it lasts a year.
Third, deliberately create real occasions to speak. Say one more thing in a meeting, grab a chance to chat with a foreigner, converse with AI three times a week. Practice with nowhere real to use it lets motivation cool. One experience of it working in the field motivates the next round of practice.
Fourth, accept slumps as normal. A stretch comes where skill that was clearly improving stalls. It is easy here to switch material or just rest, but it is usually only a plateau right before automation. Push the same material just a little more and you climb another step.
Language learning is a marathon. The one who opens the mouth a little every day, rather than the one who blazes fast, goes farther in the end. For me too, across my years at LINE, it was not an explosive leap but daily small production piling up that opened my mouth.
Closing — only when the mouth moves do you finally hear
In those days when I could hear everything but my mouth would not move, I believed words would eventually burst out if I just kept doing input. They never burst out.
Speech improved only through production. Only after the hours of gluing short lines to my mouth, talking to myself, and speaking clumsily despite the embarrassment piled up did my mouth open.
And strangely, once my mouth opened, my ears grew sharper. Sounds I had produced with my own mouth came through clearly.
"If you cannot speak it, you cannot hear it" also means, in reverse, that once you can speak, you hear better. Production and recognition were two sides of one circuit.
Language is a protocol that connects you to the world. Speaking practice is the work of turning a half-connection that only receives into a full one that both gives and receives.
The moment that connection completes, language is no longer an object of study but a passage that reaches the world. Real conversation with colleagues from other countries, enjoying content you love without subtitles, mingling with people in an unfamiliar city. All of it starts from opening your mouth.
The start is not grand. Today, pick one favorite line, watch the mouth shape on screen, and copy it just ten times. That one line will come out again through your mouth — in the next meeting, on the next trip.
References
- Swain, M. (1985). Concept summary related to the Output Hypothesis. https://en.wikipedia.org/wiki/Comprehensible_output
- Stephen Krashen — Comprehensible Input and the Input Hypothesis. https://www.sdkrashen.com/content/articles/the_comprehension_hypothesis_extended.pdf
- Paul Nation — research on vocabulary learning and the receptive/productive vocabulary distinction. https://www.wgtn.ac.nz/lals/about/staff/paul-nation
- Shadowing technique explained — Fluent Forever. https://blog.fluent-forever.com/shadowing/
- Stress-timed vs. mora-timed languages — linguistics overview. https://en.wikipedia.org/wiki/Isochrony
- BBC Learning English — pronunciation and linking materials. https://www.bbc.co.uk/learningenglish/english/features/pronunciation
- NHK Japanese pronunciation/intonation learning materials. https://www.nhk.or.jp/lesson/