
When you send a prompt to an LLM, something physical happens. Transistors switch. Current flows. Heat is generated. Cooling systems activate. Electricity meters turn. In a data center somewhere — most likely in Iowa, Virginia, or the Netherlands — a rack of NVIDIA H100s draws somewhere between 500 and 900 watts per GPU to process your request, while chillers, power distribution units, and network switches add another 10–50% on top.
AI companies report their infrastructure in teraflops and petaflop-seconds. Your electricity provider reports your bill in kilowatt-hours. The carbon researchers tracking AI's environmental footprint work in grams of CO₂ equivalent. All three unit systems describe the same physical reality: the energy consumed to answer your question. Yet the chain of conversions connecting them is rarely written out in one place.
This article is that chain. Every unit conversion is shown. Every assumption is stated explicitly. By the end, you will be able to estimate the kilowatt-hour cost and carbon footprint of any AI inference request — from a one-line lookup query to a 128,000-token document analysis — and compare it against the household energy benchmarks that make the numbers intuitive.
Step 1 — What a Teraflop Actually Measures
A floating-point operation (FLOP) is a single arithmetic calculation (an addition, subtraction, multiplication, or division) performed on numbers in floating-point representation. GPUs and TPUs perform billions to trillions of these operations per second. Their performance is rated in:
FLOP/s (floating-point operations per second) — the hardware throughput metric.
One teraFLOP/s = 10¹² operations per second.
One petaFLOP/s = 10¹⁵ operations per second.
The most important current GPU for LLM inference, the NVIDIA H100 SXM, is rated at a peak of 3,958 TFLOP/s (roughly 4 petaFLOP/s) on its Tensor Cores. That headline number is the FP8 figure with structured sparsity; the dense FP16 (16-bit floating point) rating is 989 TFLOP/s. Its thermal design power (TDP) is 700 watts. These are peak theoretical throughputs. Real inference workloads rarely reach them, for reasons we will get to shortly.
FLOPs vs. FLOP/s: The Distinction That Matters
FLOPs (with a lowercase s) is a cumulative count of operations: it describes the total computational work done by a task, not a rate. Running a 175-billion-parameter model to generate 1,000 tokens requires a specific total number of FLOPs, the same way driving 50 miles burns a specific amount of fuel, regardless of how fast you drove.
The energy cost of a task depends on both:
- The total FLOPs required (how much computation)
- The FLOP/s the hardware delivers per watt (how efficiently it runs)
Energy (Joules) = (FLOPs ÷ Effective FLOP/s) × Power (Watts)
This is the central conversion formula. Everything else is substitution.
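As a minimal sketch in Python (the function and variable names are ours, not from any provider API), the formula is a two-step substitution: work divided by rate gives time, and time multiplied by power gives energy.

```python
def inference_energy_joules(total_flops: float,
                            effective_flops_per_sec: float,
                            power_watts: float) -> float:
    """Energy (J) = (FLOPs / effective FLOP/s) * power (W)."""
    compute_time_sec = total_flops / effective_flops_per_sec  # work / rate = time
    return power_watts * compute_time_sec                     # power * time = energy
```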
Estimating FLOPs Per Token
For transformer-based LLMs, a well-established approximation from ML systems research gives the FLOPs required for a single forward pass (generating one output token):
FLOPs per token ≈ 2 × Model Parameter Count
This approximation holds for dense transformer models at inference time, where each parameter is involved in approximately two floating-point multiply-accumulate operations per forward pass. For a 70-billion-parameter model, that is approximately 140 × 10⁹ FLOPs = 140 GFLOP per output token.
For mixture-of-experts (MoE) architectures — which GPT-4, Claude 3, and Gemini Ultra are believed to use — the active parameter count per token is a fraction of total parameters, making inference substantially cheaper per token than the total parameter count implies. The 2× rule applies to active parameters, not total parameters.
| Model Family | Est. Active Params | FLOPs per Token | 1,000-Token Query |
|---|---|---|---|
| GPT-3.5 / Llama 3 8B | 8B (dense) | 16 GFLOP | 16 TFLOP |
| Llama 3 70B | 70B (dense) | 140 GFLOP | 140 TFLOP |
| GPT-4 class (MoE, est.) | ~55B active | ~110 GFLOP | 110 TFLOP |
| Claude 3 Sonnet class | ~50B active est. | ~100 GFLOP | 100 TFLOP |
| Claude 3 Opus class | ~85B active est. | ~170 GFLOP | 170 TFLOP |
| Gemini Ultra class | ~90B active est. | ~180 GFLOP | 180 TFLOP |
These are estimates based on published model architectures, benchmark performance, and inference speed data. Exact figures for proprietary models are not publicly disclosed, but the order-of-magnitude accuracy is sufficient for energy estimation purposes — and that is all the calculation requires.
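The 2× active-parameter rule translates directly into code. A small sketch, where the per-model figures are the same published-architecture estimates as the table above (the dictionary keys are our labels; proprietary parameter counts are not disclosed):

```python
# Estimated active parameter counts per token (estimates, as in the table above).
ACTIVE_PARAMS = {
    "llama-3-8b":      8e9,    # dense
    "llama-3-70b":     70e9,   # dense
    "gpt-4-class":     55e9,   # MoE, active params (estimate)
    "claude-3-sonnet": 50e9,   # estimate
    "claude-3-opus":   85e9,   # estimate
    "gemini-ultra":    90e9,   # estimate
}

def flops_per_token(active_params: float) -> float:
    """Dense-transformer inference: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params

print(flops_per_token(ACTIVE_PARAMS["llama-3-70b"]) / 1e9)  # 140.0 GFLOP per token
```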
Step 2 — Hardware Efficiency: Converting FLOPs to Watts
Once you have an estimated FLOPs-per-token figure, the next step is the hardware efficiency ratio: how many watts does the GPU consume per effective TFLOP/s of inference throughput?
The Hardware Efficiency Ratio
Watts per TFLOP/s = GPU TDP (W) ÷ Effective Inference TFLOP/s
| GPU | Peak Tensor TFLOP/s (vendor spec) | TDP (W) | Peak W/TFLOP/s |
|---|---|---|---|
| NVIDIA H100 SXM | 3,958 | 700 | 0.177 |
| NVIDIA H200 SXM | ~4,000 | 700 | 0.175 |
| NVIDIA A100 SXM | 312 | 400 | 1.282 |
| NVIDIA L40S | 362 | 350 | 0.967 |
| Google TPU v5e | ~197 | 170 | 0.863 |
| AMD MI300X | 1,307 | 750 | 0.574 |
At peak spec, the H100 is extraordinarily efficient: less than 0.18 watts per teraFLOP/s. But peak spec is not real-world inference.
Why Real Inference Is Far Less Efficient Than Peak Spec
LLM inference — specifically the autoregressive decoding phase, where the model generates one token at a time — is fundamentally memory-bandwidth limited, not compute-limited. The GPU must load the model's full set of weight matrices from HBM (High Bandwidth Memory) for each token generated, regardless of how many arithmetic operations those weights participate in.
For a 70B-parameter model in FP16, the weight matrices occupy approximately 140 GB of GPU memory. The H100's HBM3 memory bandwidth is 3.35 TB/s. Loading all weights for one forward pass takes approximately 140 GB ÷ 3,350 GB/s ≈ 0.042 seconds — which means the maximum autoregressive decoding throughput is roughly 1 ÷ 0.042 ≈ 24 tokens per second, regardless of the GPU's peak FLOP/s.
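A quick sanity check of that ceiling in Python, using the figures above (single request, batch size 1; batching, discussed later, changes this picture):

```python
params = 70e9
bytes_per_param = 2                            # FP16
weights_gb = params * bytes_per_param / 1e9    # 140 GB of weights
hbm_bandwidth_gb_s = 3350.0                    # H100 HBM3: 3.35 TB/s

seconds_per_token = weights_gb / hbm_bandwidth_gb_s      # ~0.042 s per token
print(f"{1 / seconds_per_token:.0f} tokens/s ceiling")   # ~24 tokens/s
```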
Throughout this bandwidth-stalled decoding, the GPU runs at roughly 10–30% of its peak FLOP/s, spending most of its time waiting on memory reads rather than performing computation. The effective watts-per-FLOP/s during inference is therefore 3–10× worse than the peak spec table suggests.
A conservative but realistic estimate for H100-class hardware in single-batch inference: effective utilization of ~20%, giving a real-world energy cost of approximately 0.9 watts per effective TFLOP/s (vs. the 0.177W peak spec).
This single factor — GPU memory bandwidth saturation during autoregressive decoding — is the reason AI inference energy costs are routinely 5–10× higher per token than a naive FLOP-count calculation would suggest.
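In code, the utilization discount is a single multiplication, and it reproduces the ~0.9 W figure above (the 20% utilization is this article's assumption, not a measured value):

```python
peak_tflops = 3958.0   # H100 SXM vendor peak spec
tdp_watts = 700.0
utilization = 0.20     # assumed single-batch decoding utilization

effective_tflops = peak_tflops * utilization      # 791.6 TFLOP/s
watts_per_tflops = tdp_watts / effective_tflops   # ~0.88 W per effective TFLOP/s
print(f"{watts_per_tflops:.2f} W per effective TFLOP/s vs 0.18 at peak")
```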
Step 3 — The PUE Multiplier: Cooling and Infrastructure
The GPU's power draw is not the total energy cost of your inference request. Every watt the GPU consumes generates heat that the data center must remove — via cooling towers, chillers, computer room air handlers, and liquid cooling loops. Power also flows through transformers, UPS systems, and distribution units before reaching the GPU, with resistive losses at each stage.
The industry standard metric for this overhead is Power Usage Effectiveness (PUE):
PUE = Total Data Center Power ÷ IT Equipment Power
A PUE of 1.0 would mean zero overhead — every watt drawn from the grid goes directly into compute. That is physically impossible. A PUE of 2.0 means overhead equals compute power. Modern hyperscale data centers achieve:
| Operator | Reported PUE | Notes |
|---|---|---|
| Google (global avg.) | 1.10 | Best-in-class, free-air cooling |
| Meta (global avg.) | 1.08 | Custom OCP hardware |
| Microsoft Azure | 1.18 | Global average across all regions |
| Amazon AWS | 1.20 | Global average |
| Typical colocation | 1.40–1.60 | Older infrastructure |
| Industry average (IEA) | 1.55 | All data centers globally |
For calculations involving major AI providers (OpenAI on Azure, Anthropic on AWS/GCP, Google's own models), a PUE of 1.10–1.20 is a reasonable assumption. For the worked example, we will use 1.15.
Total Facility Energy = IT Equipment Energy × PUE
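As code, the PUE step is one multiplier applied to IT-side energy (the 1.15 default is this article's assumption for hyperscale AI providers):

```python
def facility_energy_joules(it_energy_joules: float, pue: float = 1.15) -> float:
    """Gross facility energy: IT equipment energy scaled up by PUE."""
    return it_energy_joules * pue

print(facility_energy_joules(100.0))        # 115.0 J at hyperscale PUE 1.15
print(facility_energy_joules(100.0, 1.55))  # 155.0 J at the IEA industry average
```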
Step 4 — From Joules to Kilowatt-Hours
Physics measures energy in joules. Your electricity bill uses kilowatt-hours. The conversion is exact:
1 kWh = 3,600,000 joules = 3.6 megajoules
kWh = Joules ÷ 3,600,000
Or, working directly from watts and seconds:
kWh = Watts × Seconds ÷ 3,600,000
This is the bridge between the compute world and the household energy world. A GPU drawing 700 watts for 1 second consumes 700 joules = 0.000194 kWh = 0.194 watt-hours. The same GPU running for one full hour consumes 700 watt-hours = 0.7 kWh.
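The same conversions in code, reproducing the 700 W figures:

```python
JOULES_PER_KWH = 3_600_000  # exact definition: 1 kWh = 3.6 MJ

def watts_to_kwh(watts: float, seconds: float) -> float:
    """Energy in kWh for a device drawing `watts` for `seconds`."""
    return watts * seconds / JOULES_PER_KWH

print(watts_to_kwh(700, 1))     # 0.000194 kWh (0.194 Wh) for one second
print(watts_to_kwh(700, 3600))  # 0.7 kWh for one full hour
```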
Step 5 — Carbon Intensity: The Geography Problem
One kilowatt-hour of electricity carries a different carbon cost depending entirely on where and when it is consumed. The carbon intensity of the grid, measured in grams or kilograms of CO₂ equivalent per kWh, varies roughly six-fold across the U.S. states shown below and by more than 25× globally.
| Grid / Region | CO₂e per kWh | Notes |
|---|---|---|
| Iceland | 28 g | Geothermal and hydro |
| Norway | 29 g | Near-100% hydro |
| France | 85 g | ~70% nuclear |
| Washington State, USA | 116 g | Columbia River hydro |
| California, USA | 202 g | Mix with significant solar |
| EU Average | 233 g | European Green Deal progress |
| UK | 238 g | North Sea wind contribution |
| US National Average | 386 g | EPA eGRID 2023 |
| Texas (ERCOT) | 432 g | Gas-heavy grid |
| Australia | 510 g | Coal-heavy NEM grid |
| West Virginia | 680 g | Predominantly coal |
| India | 708 g | Coal-dominant generation |
| Poland | 722 g | High coal dependency |
Data centers are not randomly distributed across these grids. AWS's US-East-1 (Northern Virginia) is one of the largest AI inference regions in the world. Virginia's grid carbon intensity runs approximately 320–350 g CO₂e/kWh. Google's data centers cluster in Iowa (wind-heavy, ~380 g), Oregon (~130 g, Columbia River hydro), and Finland (~140 g). The exact grid mix a given inference request lands on is not disclosed by providers — which is why carbon estimates for AI inference carry an unavoidable ±40% uncertainty from grid variability alone.
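Converting kWh to grams of CO₂e is one multiplication. The intensities below copy the table's point estimates, which inherit that ±40% uncertainty (the dictionary keys are our labels):

```python
# Grid carbon intensity in g CO2e per kWh (point estimates from the table above).
GRID_G_PER_KWH = {
    "iceland": 28, "norway": 29, "france": 85,
    "us_washington": 116, "us_california": 202, "us_average": 386,
    "us_texas": 432, "australia": 510, "us_west_virginia": 680,
    "india": 708, "poland": 722,
}

def co2e_grams(kwh: float, grid: str = "us_average") -> float:
    return kwh * GRID_G_PER_KWH[grid]

print(co2e_grams(1.0, "france"), co2e_grams(1.0, "poland"))  # 85.0 vs 722.0 g
```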
The Full Conversion Chain: Worked Example
Let us run a complete calculation for a realistic AI prompt: a 1,000-token query to a GPT-4 class model (MoE architecture, approximately 55B active parameters per token), served on H100 hardware operating at 20% utilization efficiency, in a data center with PUE 1.15, on a grid with 386 g CO₂e/kWh (US average).
Stage 1 — Total FLOPs for the query:
FLOPs per token ≈ 2 × 55 × 10⁹ = 110 × 10⁹ = 110 GFLOP
1,000 tokens × 110 GFLOP = 110,000 GFLOP = 110 TFLOP
Stage 2 — Convert FLOPs to GPU compute time:
H100 peak FLOP/s = 3,958 TFLOP/s
Effective FLOP/s at 20% utilization = 3,958 × 0.20 = 791.6 TFLOP/s
Compute time = 110 TFLOP ÷ 791.6 TFLOP/s = 0.139 seconds
Stage 3 — GPU energy consumption:
H100 TDP = 700 W
GPU energy = 700 W × 0.139 s = 97.3 joules
Stage 4 — Apply PUE overhead:
Total facility energy = 97.3 J × 1.15 = 111.9 joules
Stage 5 — Convert to kWh:
kWh = 111.9 ÷ 3,600,000 = 0.0000311 kWh = 0.031 Wh
Stage 6 — Convert to carbon:
CO₂e = 0.0000311 kWh × 386 g/kWh = 0.012 g CO₂e
Summary table for the worked example:
| Stage | Value | Unit |
|---|---|---|
| FLOPs required | 110,000 | GFLOP (110 TFLOP) |
| Effective FLOP/s (20% util.) | 791.6 | TFLOP/s |
| GPU compute time | 0.139 | seconds |
| Raw GPU energy | 97.3 | joules |
| Total facility energy (PUE 1.15) | 111.9 | joules |
| Total facility energy | 0.031 | watt-hours |
| Total facility energy | 0.0000311 | kWh |
| Carbon emission (US avg. grid) | 0.012 | g CO₂e |
For a single 1,000-token query: approximately 0.031 Wh and 0.012 grams of CO₂e. That is individually trivial. At OpenAI's reported 10 million daily ChatGPT users generating an average of 5 queries each, the daily total across all queries is approximately 1.35 MWh of GPU energy, or about 1.55 MWh of facility energy once the 1.15 PUE is included, just from user inference, just for that one product. Per year: roughly 570 MWh, equivalent to the annual electricity consumption of approximately 55 average U.S. homes.
Your Prompt in Household Energy Equivalents
The absolute numbers — milliwatt-hours, hundredths of a gram of CO₂ — are hard to grasp. The comparison to familiar household energy uses makes them concrete.
| Activity | Energy (Wh) | CO₂e (g, US avg.) |
|---|---|---|
| AI prompt, 500 tokens (small model) | 0.008 Wh | 0.003 g |
| AI prompt, 1,000 tokens (mid model) | 0.031 Wh | 0.012 g |
| AI prompt, 1,000 tokens (large model) | 0.080 Wh | 0.031 g |
| AI reasoning query (128K context) | 0.85–2.4 Wh | 0.33–0.93 g |
| Google Search (1 query) | 0.30 Wh | 0.116 g |
| Sending one email | 0.04 Wh | 0.015 g |
| Charging an iPhone 16 (full) | 18.6 Wh | 7.2 g |
| LED bulb running 1 hour | 10 Wh | 3.9 g |
| Laptop running 1 hour | 45 Wh | 17.4 g |
| Dishwasher cycle | 1,100 Wh | 425 g |
| EV charge (25 miles range) | 7,500 Wh | 2,898 g |
| Transatlantic flight (per passenger) | ~530,000 Wh | ~205,000 g |
The comparison that recalibrates most people's intuition: a standard 1,000-token mid-model AI query uses approximately one-tenth the energy of a Google search. This surprises people who assume AI must be vastly more expensive than search. At the per-query level for simple inference, it is not — though AI inference scales to far higher token counts than search ever does, and the cumulative footprint of running many long-context or reasoning queries is meaningfully larger.
The 128K context reasoning query — sending a long document for deep analysis using a model with extended thinking enabled — sits in a different category entirely: 0.85–2.4 Wh per query, equivalent to running an LED bulb for 5–15 minutes. At scale, these queries represent the highest per-call energy cost in the commercial AI stack today.
Convert Your Prompt's Energy Cost
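The whole chain condenses into a single function. Here is a minimal calculator under this article's assumptions (2× active-parameter FLOPs, 20% utilization, TDP-level power draw, PUE 1.15, US-average grid); the function name and defaults are ours, and the outputs are estimates, not measured provider data:

```python
JOULES_PER_KWH = 3_600_000

def prompt_footprint(tokens: int,
                     active_params: float = 55e9,   # GPT-4-class estimate
                     peak_tflops: float = 3958.0,   # H100 SXM vendor peak
                     utilization: float = 0.20,     # single-batch decoding
                     tdp_watts: float = 700.0,
                     pue: float = 1.15,
                     grid_g_per_kwh: float = 386.0  # US average grid
                     ) -> dict:
    """Estimate facility energy (Wh) and carbon (g CO2e) for one request."""
    total_flops = tokens * 2.0 * active_params                  # Stage 1
    seconds = total_flops / (peak_tflops * 1e12 * utilization)  # Stage 2
    gpu_joules = tdp_watts * seconds                            # Stage 3
    facility_joules = gpu_joules * pue                          # Stage 4
    kwh = facility_joules / JOULES_PER_KWH                      # Stage 5
    return {
        "gpu_seconds": round(seconds, 4),
        "wh": round(kwh * 1000, 4),
        "co2e_g": round(kwh * grid_g_per_kwh, 4),               # Stage 6
    }

print(prompt_footprint(1_000))
# {'gpu_seconds': 0.139, 'wh': 0.0311, 'co2e_g': 0.012}  (the worked example)
print(prompt_footprint(1_000, grid_g_per_kwh=722))  # same query in Poland, ~0.022 g
```

The defaults reproduce the worked example above; swap in any grid intensity from the Step 5 table to see the geography effect.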
How Prompt Design Changes the Carbon Math
Prompt length directly controls the token count, which directly controls the FLOPs required, which directly controls energy consumption. To first order the relationship is linear: double the tokens, double the energy. (At very long context lengths the cost grows faster than linear, because attention scales quadratically with context; more on that in the final section.)
| Prompt Type | Tokens In + Out | Relative Energy | Relative CO₂e |
|---|---|---|---|
| One-line lookup | ~150 tokens | 1× baseline | 1× |
| Typical chat query | ~500 tokens | 3.3× | 3.3× |
| Standard document summary | ~3,000 tokens | 20× | 20× |
| Full code review (mid codebase) | ~15,000 tokens | 100× | 100× |
| Long-context document analysis | ~80,000 tokens | 533× | 533× |
| Extended reasoning + 128K context | ~130,000 tokens | 867× | 867× |
A developer habit of sending the entire codebase into a reasoning model context window — a pattern that has become increasingly common with 128K+ context models — costs approximately 800× more energy per query than a targeted one-line lookup. The model capability gains are real, but so is the energy arithmetic.
The practical implication: targeted, scoped prompts that retrieve only what the model needs (RAG, chunking, summarization pipelines) are not just more token-efficient for your API bill — they are meaningfully more carbon-efficient for exactly the same reason.
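Because the scaling is linear at fixed model and hardware, the table's ratios reduce to a one-line function:

```python
def relative_energy(tokens: int, baseline_tokens: int = 150) -> float:
    """Energy relative to a ~150-token one-line lookup (linear in tokens)."""
    return tokens / baseline_tokens

for t in (150, 500, 3_000, 15_000, 80_000, 130_000):
    print(f"{t:>7} tokens -> {relative_energy(t):6.1f}x baseline")
```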
The Renewable Energy Accounting Problem
Major AI providers report ambitious renewable energy commitments. Google claims to match 100% of its annual electricity consumption with renewable energy certificates. Microsoft targets 100% renewable by 2025. These statements are true in a specific and limited accounting sense — and misleading in the physics sense that actually matters for carbon.
Renewable energy matching works through Renewable Energy Certificates (RECs) or Power Purchase Agreements (PPAs): a company buys certificates representing renewable generation that occurred somewhere on the grid, in the same calendar year as their consumption. The certificates retire, and the company claims renewable matching. The electrons that actually powered the data center at 2 AM on a Tuesday when Virginia wind was low came from whatever generators were running in that moment — which may have been gas peakers.
The carbon impact that matters is marginal emissions intensity — the carbon cost of the additional electron drawn from the grid at the specific time and location of the inference request. For a data center in Virginia running heavy inference loads at night, the marginal generator is often a natural gas plant, not a solar array in Texas whose certificates the company bought in Q3.
This accounting gap does not make renewable matching meaningless — building demand for renewable generation is genuinely positive. But it does mean that provider claims of "100% renewable AI" should be understood as accounting statements, not physical statements. The actual carbon footprint of a query processed at midnight on a coal-heavy grid is not zero, regardless of certificate purchases.
The only physically meaningful carbon claim is 24/7 carbon-free energy (CFE) — matching renewable generation to consumption on an hourly basis, in the same grid region. Google is the furthest along on this path, reporting 72% CFE globally in 2024. No major AI provider has reached 100% 24/7 CFE at their inference data center locations.
What You Can Actually Control
As a developer or product team running AI inference at scale, four levers directly reduce your carbon exposure:
Route by complexity. Use small, efficient models for simple tasks (classification, extraction, lookup) and large models only for tasks that genuinely require them. A Llama 3 8B model running on a single A100 uses roughly one-seventh the energy per token of a 70B model. For tasks where either model produces equivalent output, the quality difference is moot.
Constrain context aggressively. Every token in the context window is billed in compute. A RAG pipeline that retrieves 2 precise chunks instead of 8 fuzzy ones does not just save API cost — it reduces the FLOPs required for every attention operation across all transformer layers. Attention scales quadratically with context length: doubling context length quadruples the attention computation.
Batch where latency allows. Batched inference dramatically improves GPU utilization by moving the operation from memory-bandwidth-bound to compute-bound. At batch size 16–32, effective GPU utilization rises from ~20% to ~60–70%, reducing the watts-per-effective-TFLOP/s ratio by roughly 3× and improving the energy cost per token proportionally (see the sketch after this list).
Prefer providers with high CFE scores for carbon-sensitive workloads. Google Cloud (72% 24/7 CFE, with best performance in Oregon and Finland regions), Microsoft Azure (Norway and Sweden regions with near-100% hydro), and AWS (Oregon and Canada regions) offer meaningfully lower marginal carbon intensity than US-East or Asia-Pacific regions for the same inference workload.
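A rough sketch of the batching lever under the article's utilization assumptions: raising utilization shortens GPU time per generated token, which lowers energy per token proportionally. The 65% figure for batch size 16–32 is the assumption stated above, not a measurement.

```python
def wh_per_1k_tokens(utilization: float,
                     active_params: float = 55e9,   # GPT-4-class estimate
                     peak_tflops: float = 3958.0,   # H100 vendor peak
                     tdp_watts: float = 700.0,
                     pue: float = 1.15) -> float:
    """Facility watt-hours per 1,000 generated tokens at a given utilization."""
    flops = 1_000 * 2.0 * active_params
    seconds = flops / (peak_tflops * 1e12 * utilization)
    return tdp_watts * seconds * pue / 3_600   # joules -> watt-hours

print(f"{wh_per_1k_tokens(0.20):.3f} Wh")  # ~0.031 Wh, single-request decoding
print(f"{wh_per_1k_tokens(0.65):.3f} Wh")  # ~0.010 Wh per request at batch 16-32
```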
The teraflop and the kilowatt-hour were never as far apart as the jargon made them seem. One is a measure of work performed; the other is a measure of the energy required to perform it. The conversion chain is fixed physics — every step is a unit substitution, not an estimate. The uncertainty is not in the math but in the inputs: model architecture, GPU utilization, data center efficiency, and grid carbon intensity. State your assumptions clearly, run the numbers, and the household energy equivalent of your prompt is a calculation, not a mystery.