29 Apr 2026 — Tyler Wright

Voice is no longer a reliable feedback loop

Deepfake audio has broken the oldest trust signal in business communication. Australian organisations need segregated verification channels — and a culture that makes using them normal.

For most of recorded commercial history, voice was the check. If you received a written instruction to move money, you called the person who sent it. You heard them confirm it. That was the loop. It was imperfect but it worked because cloning a voice in real time — well enough to deceive someone who knew the speaker — was not something a criminal could do.

That assumption is gone.

The attack pattern that has now produced losses across multiple continents follows a consistent template: an employee receives an instruction via email or message, joins a video or voice call that appears to involve familiar senior figures, and authorises a transfer because everything — the face, the voice, the context — matches what they would expect. The ASD's ACSC called out this pattern directly in its Annual Cyber Threat Report 2023–24, referencing the Hong Kong incident in which all meeting attendees except the targeted employee were deepfake recreations. The employee had initially suspected a phishing attempt. The video call removed that suspicion. That is the point.

The attack succeeds not in spite of the verification instinct, but because of it. The employee did what they were trained to do: they looked for additional confirmation. The attacker simply occupied the confirmation channel.

What this attack actually requires

The tooling to run this attack has matured to the point where technical barriers are not the limiting factor. Generating a real-time voice replica requires a short sample of existing audio — the kind that executives routinely produce in earnings calls, media appearances, podcast interviews, and LinkedIn videos. The synthesis runs locally on standard hardware. No specialised infrastructure. No vendor relationship. No meaningful cost.

In the Asia-Pacific region, AI-related fraud attempts increased 194% in 2024 compared to the prior year. Australia is not a peripheral target. ASD's ACSC confirmed in its Annual Cyber Threat Report 2023–24 that Business Email Compromise — increasingly AI-augmented — was one of the biggest cyber security threats to Australian businesses, with almost $84 million lost in FY2023–24 alone.

The attack does not require the attacker to perfectly clone an executive for an extended conversation. It requires plausibility for long enough to get an authorisation. Urgency is manufactured as part of the script. The target is under time pressure, the instruction is coming from a voice they recognise, and there is no available mechanism to verify through any other channel in the moment.

The instructive detail in the Hong Kong case is not that controls were absent. It is that they failed while functioning correctly. Video calls were supposed to be more secure than voice calls. The employee saw multiple faces, not just heard a single voice. The distributed authorisation model was specifically designed to prevent unilateral fraud. When attackers simply recreated the entire meeting, the multi-person setup became an expanded attack surface rather than a protection.

The single-channel problem

Most Australian organisations have one de facto confirmation mechanism for high-value instructions: voice. Email arrives; someone calls to confirm. A Zoom invite drops; attendees join and the call proceeds. The problem is not that voice was ever perfectly reliable — it is that voice has historically been difficult enough to fake that organisations never needed a second layer to sit behind it.

That calculus no longer holds. Gartner predicts that by 2026, driven by AI-generated deepfake attacks, 30% of enterprises will no longer consider identity verification and authentication solutions to be reliable in isolation. That shift in posture needs to reach beyond contact-centre authentication into the operational workflows where financial transfers are actually approved.

The structure to fix this is not technically complex. What it requires is that organisations deliberately design their verification workflows so that no single channel — voice, video, or messaging — can stand alone as the confirmation mechanism for consequential actions. Out-of-band verification means that if a request arrives via voice or video, it must be confirmed via a separate channel: an encrypted message or push notification to a trusted device. Dual authorisation means that transfers above a threshold require sign-off from two separately authenticated personnel. And eliminating the "executive override" means that protocols must strictly prohibit bypassing security checks regardless of who is purportedly on the line.
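The three controls above compose into a single policy check. The sketch below is illustrative only: the threshold, channel names, and `TransferRequest` shape are assumptions, not a reference to any real system. The point it encodes is that the initiating channel can never confirm itself, and that no role grants an override.

```python
from dataclasses import dataclass, field

# Hypothetical threshold above which dual authorisation applies.
APPROVAL_THRESHOLD_AUD = 50_000

@dataclass
class TransferRequest:
    amount_aud: float
    initiating_channel: str                  # e.g. "video_call", "email"
    confirmations: set[str] = field(default_factory=set)  # channels that confirmed
    approvers: set[str] = field(default_factory=set)      # authenticated staff IDs

def may_execute(req: TransferRequest) -> bool:
    """Out-of-band rule: at least one confirmation must arrive on a channel
    other than the one that initiated the request, so voice or video alone
    can never authorise. Dual authorisation applies above the threshold.
    Deliberately, no parameter exists for seniority: there is no override."""
    out_of_band = any(c != req.initiating_channel for c in req.confirmations)
    required_approvers = 2 if req.amount_aud >= APPROVAL_THRESHOLD_AUD else 1
    return out_of_band and len(req.approvers) >= required_approvers
```

Note what is absent: there is no code path in which a sufficiently senior requester skips the check, which is exactly how the "executive override" is eliminated.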

None of those controls depends on detecting deepfakes. They are designed so that the question of whether a voice is synthetic does not matter — because the voice alone cannot authorise anything.

Channel segregation in practice

The specific channels matter less than the segregation between them. The threat model is that an attacker controls one communication channel — typically the one initiating the request. The control is that any authorisation requires confirmation through at least one channel the attacker cannot simultaneously control.

In practice, this means keeping a small set of pre-verified, out-of-band communication paths whose integrity is known and maintained. A dedicated Signal or encrypted messaging group for finance approvals. A hardware token or authenticator app confirmation alongside a voice call. A pre-agreed verbal code stored in the company's system before the call, never obtained from the caller. For callbacks, end the current session and return the call on a number that already exists in a secure directory, not a number provided during the call itself; hanging up and dialling the pre-verified number is the act that breaks the attacker's channel control.
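The callback rule can be made mechanical rather than left to judgment. This is a minimal sketch under stated assumptions: the directory contents and role names are placeholders, and `verified_callback` stands in for whatever secure lookup the organisation actually runs. The design point is that a caller-supplied number is never an input to the lookup at all.

```python
# Hypothetical pre-verified directory. In a real deployment this would be
# a secured, change-controlled store; the entries here are placeholders.
SECURE_DIRECTORY: dict[str, str] = {
    "cfo": "+61 2 9000 0001",
    "finance_manager": "+61 2 9000 0002",
}

def verified_callback(role: str) -> str:
    """Return the number to call back on for a given role.

    Deliberately takes no caller-supplied number: the directory is the
    only source, so a number offered during the suspect call can never
    be routed back into the verification path."""
    if role not in SECURE_DIRECTORY:
        raise LookupError(f"no pre-verified number on file for {role!r}")
    return SECURE_DIRECTORY[role]
```

If a role has no entry in the directory, the lookup fails loudly rather than falling back to whatever number the caller offered, which is the failure mode that matters.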

What undermines all of this is scope creep on the trusted channel. If the encrypted messaging group that exists for finance confirmations gradually gets used for general team communication, its value as a segregated verification channel is gone. If the pre-verified callback number is stored in a system the caller can manipulate, it provides no assurance. Channel integrity requires that the verification path be maintained as distinct from the communication path — and that people understand why.

The culture problem is harder than the process problem

Designing the controls is straightforward. Getting people to use them when doing so feels like creating friction with a superior is not.

Deepfake voice attacks work partly because organisations have implicitly trained their staff that urgency from senior figures overrides normal process. An executive calls with an urgent transfer and the instinct is to move, not to say "I need to verify this through our secondary channel before I can proceed." In a healthy organisation, that response should be normalised and supported — not treated as an implicit accusation of fraud against the person calling.

Teaching employees to engage in out-of-band verification — confirming a request through a second secure communication channel — is one of the most direct precautions organisations can take, alongside training staff to slow down when urgency is applied. The urgency itself should be the trigger for additional scrutiny, not the reason to skip it.

This requires explicit management endorsement. The policy needs to state that any employee can and should apply the secondary verification requirement to any instruction, regardless of the apparent seniority of the requester. And it needs to be tested — not just written down. If the first time an employee encounters a secondary verification scenario is during an actual attack, the trained instinct will not be there.

In Australia, the Australian Standard AS 8001:2021 Fraud and Corruption Control provides a framework for building out fraud risk management that covers exactly this kind of procedural hardening. Organisations with APRA obligations should also consider how APRA Prudential Standard CPS 234 Information Security applies to the integrity of internal communication channels used in financial approval workflows.

What detection tools can and cannot do

Technical detection of deepfake audio is an active research area, but it is not a complete answer at the point of a live call. Detection tools face a ceiling: as the underlying generative models improve, the artefacts that detection systems target become less reliable as indicators. Early deepfake detection, focused largely on video, worked by identifying visual artefacts: unnatural blinking patterns, edge distortions around faces, and audio-visual sync delays. Those signals are becoming less consistent as model quality improves.

Detection is worth layering in where it can be operationalised — particularly in contact-centre environments that handle large volumes of inbound calls. Multifactor authentication, real-time liveness detection, and risk scoring are the tools that security teams should be evaluating to secure high-value interactions. But detection at the point of a live executive-to-finance call is not where most Australian organisations will have tooling in place. Procedural controls — segregated channels, dual authorisation, callback protocols — do not depend on detection accuracy. They work regardless of whether the voice is real or synthetic.

The ASD's ACSC framing

ASD's ACSC has published guidance on social engineering risks associated with messaging applications and noted in successive Annual Cyber Threat Reports that AI is augmenting social engineering attacks across Australian targets. The FY2024–25 threat report documented over 84,700 cybercrime reports through ReportCyber — with ASD's ACSC assessing that the vast majority of cybercrime continues to go unreported. The procedural response ASD consistently recommends maps directly to the controls above: verify through separate channels, do not act on urgency alone, and report suspicious activity to ReportCyber or via the cyber security hotline 1300 CYBER1.

None of that advice is exotic. The gap is that it has not been translated into operational workflow design for most organisations. An awareness email telling staff to be careful is not a verification protocol. It does not tell them what to do when the CFO calls asking for a transfer in the next ten minutes.

Putting this into your own environment

If your organisation's current answer to "how do we confirm a high-value financial instruction delivered by phone or video" is "we listen carefully and look for anything suspicious," the exposure is real. The attack is specifically designed to produce an interaction where nothing feels suspicious.

Artificer Cyber works with Australian organisations to assess where social engineering controls have gaps and to design procedural and technical mitigations that fit how the organisation actually operates. That includes reviewing approval workflows for financial transfers, mapping out the communication channels that currently serve as verification mechanisms, and identifying where single-channel exposure exists. If this is a current concern, reach out via our contact page.

Artificer Cyber Live brief

Threat actors don't pause while you find a firm.

Artificer Cyber maintains active readiness across DFIR, legal privilege, and threat intelligence. When something happens, we're already briefed — and we can be engaged within the hour.