
When you send a prompt to an LLM, something physical takes place. Transistors switch, current flows, heat is generated, cooling systems activate, power meters turn. In a data center somewhere (most likely in Iowa, Virginia, or the Netherlands) a rack of NVIDIA H100s draws somewhere between 500 and 900 watts per GPU to process your request, while chillers, power distribution units, and network switches add another 10–50% on top.
AI companies report their infrastructure in teraflops and petaflop-seconds. Your electricity provider states your bill in kilowatt-hours. The carbon researchers tracking AI's environmental footprint work in grams of CO₂ equivalent. These three unit systems describe the same physical reality, the energy consumed to answer your question, but no one has written down the complete chain of conversions that connects them.
This article is that chain. Every unit conversion is verified. Every assumption is stated explicitly. By the end, you will be able to estimate the kilowatt-hour cost and carbon footprint of any AI inference request, from a one-line lookup question to a 128,000-token document analysis, and check it against the household energy benchmarks that make the numbers intuitive.
Step 1 — What a Teraflop Actually Measures
A floating-point operation (FLOP) is a single arithmetic calculation performed on decimal numbers: an addition, subtraction, multiplication, or division of two floating-point values. GPUs and TPUs perform billions to trillions of these operations per second. Their performance is rated in:
FLOP/s (floating-point operations per second): the hardware throughput metric.
One teraFLOP/s = 10¹² operations per second.
One petaFLOP/s = 10¹⁵ operations per second.
The most important current GPU for LLM inference, the NVIDIA H100 SXM, is rated at 3,958 TFLOP/s (roughly 4 petaFLOP/s) in FP16 (16-bit floating point) precision, at a thermal design power (TDP) of 700 watts. That is the peak theoretical throughput. Real inference workloads rarely reach it, for reasons we will get to shortly.
FLOPs vs. FLOP/s: The Distinction That Matters
FLOPs (note the lowercase s) is a cumulative count of operations. It describes the total computational work performed by a task, not a rate. Running a 175-billion-parameter model to generate 1,000 tokens requires a specific number of FLOPs of total computation, the same way driving 50 miles requires a specific number of fuel-burn events, no matter how fast you drove.
The energy cost of a task depends on both:
- The total FLOPs required (how much computation)
- The FLOP/s the hardware delivers per watt (how efficiently it runs)
Energy (Joules) = FLOPs ÷ Effective FLOP/s × Power (Watts)
That is the key conversion formula. Everything else is substitution.
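The formula is short enough to write as a function. The sketch below is illustrative only: the function name `energy_joules` and the sample inputs are assumptions chosen to keep the arithmetic easy to follow, not figures from any real system.

```python
def energy_joules(total_flops: float, effective_flops_per_s: float, power_watts: float) -> float:
    """Energy (J) = FLOPs / effective FLOP/s * power (W)."""
    if effective_flops_per_s <= 0:
        raise ValueError("effective_flops_per_s must be positive")
    # FLOPs divided by FLOP/s gives the time the hardware is busy, in seconds.
    seconds = total_flops / effective_flops_per_s
    # Watts multiplied by seconds gives joules.
    return seconds * power_watts


# Made-up inputs: 1e12 FLOPs on hardware sustaining 1e11 FLOP/s while drawing 100 W.
example = energy_joules(1e12, 1e11, 100.0)
print(f"{example:.1f} J")
```

With these inputs the hardware is busy for 10 seconds at 100 W, so the result is 1,000 joules.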
Estimating FLOPs per Token
For transformer-based LLMs, a well-established approximation from ML systems research gives the FLOPs required for a single forward pass (generating one output token):
FLOPs per token ≈ 2 × Model Parameter Count
This approximation holds for dense transformer models at inference time, where each parameter is involved in about one multiply-accumulate (two floating-point operations) per forward pass. For a 70-billion-parameter model, that is approximately 140 × 10⁹ FLOPs = 140 GFLOP per output token.
For mixture-of-experts (MoE) architectures, which GPT-4, Claude 3, and Gemini Ultra are believed to use, the active parameter count per token is a fraction of total parameters, making inference considerably cheaper per token than the total parameter count implies. The 2× rule applies to active parameters, not total parameters.
| Model Family | Est. Active Params | FLOPs per Token | 1,000-Token Query |
|---|---|---|---|
| GPT-3.5 / Llama 3 8B | 8B (dense) | 16 GFLOP | 16 TFLOP |
| Llama 3 70B | 70B (dense) | 140 GFLOP | 140 TFLOP |
| GPT-4 class (MoE, est.) | ~55B active | ~110 GFLOP | 110 TFLOP |
| Claude 3 Sonnet class | ~50B active est. | ~100 GFLOP | 100 TFLOP |
| Claude 3 Opus class | ~85B active est. | ~170 GFLOP | 170 TFLOP |
| Gemini Ultra class | ~90B active est. | ~180 GFLOP | 180 TFLOP |
These are estimates based on published model architectures, benchmark performance, and inference speed data. Exact figures for proprietary models are not publicly disclosed, but order-of-magnitude accuracy is enough for energy estimation purposes, and that is all the calculation requires.
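The 2 × parameters rule can be checked against a few rows of the table with a short sketch. The function names are illustrative, and the active-parameter figures for proprietary models are the article's estimates, not disclosed values.

```python
def flops_per_token(active_params: float) -> float:
    """Approximate FLOPs for one forward pass: 2 x active parameters."""
    return 2.0 * active_params


def flops_per_query(active_params: float, tokens: int) -> float:
    """Total FLOPs for a query of the given token count."""
    return flops_per_token(active_params) * tokens


# Active-parameter counts taken from the table (estimates for proprietary models).
MODELS = {
    "Llama 3 8B (dense)": 8e9,
    "Llama 3 70B (dense)": 70e9,
    "GPT-4 class (MoE, est.)": 55e9,
}

for name, params in MODELS.items():
    per_token_gflop = flops_per_token(params) / 1e9
    per_query_tflop = flops_per_query(params, 1000) / 1e12
    print(f"{name}: {per_token_gflop:.0f} GFLOP/token, {per_query_tflop:.0f} TFLOP per 1,000 tokens")
```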
Step 2 — Hardware Efficiency: Converting FLOPs to Watts
Once you have an estimated FLOPs-per-token figure, the next step is the hardware efficiency ratio: how many watts does the GPU consume per effective TFLOP/s of inference throughput?
The Hardware Efficiency Ratio
Watts per TFLOP/s = GPU TDP (W) ÷ Effective Inference TFLOP/s
| GPU | FP16 Peak TFLOP/s | TDP (W) | Peak W per TFLOP/s |
|---|---|---|---|
| NVIDIA H100 SXM | 3,958 | 700 | 0.177 |
| NVIDIA H200 SXM | ~4,000 | 700 | 0.175 |
| NVIDIA A100 SXM | 312 | 400 | 1.282 |
| NVIDIA L40S | 362 | 350 | 0.967 |
| Google TPU v5e | ~197 | 170 | 0.863 |
| AMD MI300X | 1,307 | 750 | 0.574 |
At peak spec, the H100 is remarkably efficient: less than 0.18 watts per teraFLOP/s. But peak spec is not real-world inference.
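The last column of the table is a single division per row. A minimal sketch that reproduces it, using the table's own peak and TDP figures (the dictionary and function names are illustrative):

```python
# (peak TFLOP/s, TDP in watts), copied from the table above.
GPU_SPECS = {
    "NVIDIA H100 SXM": (3958.0, 700.0),
    "NVIDIA H200 SXM": (4000.0, 700.0),
    "NVIDIA A100 SXM": (312.0, 400.0),
    "NVIDIA L40S": (362.0, 350.0),
    "Google TPU v5e": (197.0, 170.0),
    "AMD MI300X": (1307.0, 750.0),
}


def watts_per_tflops(peak_tflops: float, tdp_watts: float) -> float:
    """Peak-spec efficiency ratio: watts drawn per TFLOP/s of rated throughput."""
    return tdp_watts / peak_tflops


ratios = {gpu: watts_per_tflops(*spec) for gpu, spec in GPU_SPECS.items()}
for gpu, ratio in ratios.items():
    print(f"{gpu}: {ratio:.3f} W per TFLOP/s")
```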
Why Real Inference Is Far Less Efficient Than Peak Spec
LLM inference, especially the autoregressive decoding phase in which the model generates one token at a time, is fundamentally memory-bandwidth constrained, not compute-constrained. The GPU must load the model's entire set of weight matrices from HBM (High Bandwidth Memory) for every token generated, regardless of how many arithmetic operations those weights participate in.
For a 70B-parameter model in FP16, the weight matrices occupy about 140 GB of GPU memory. The H100's HBM3 memory bandwidth is 3.35 TB/s. Loading all weights for one forward pass takes approximately 140 GB ÷ 3,350 GB/s ≈ 0.042 seconds, which means the maximum autoregressive decoding throughput is roughly 1 ÷ 0.042 ≈ 24 tokens per second, regardless of the GPU's peak FLOP/s.
During this time, the GPU is running at roughly 10–30% of its peak FLOP/s, spending most of its time waiting for memory reads, not performing computation. This means the effective watts-per-FLOP/s during inference is 3–10× worse than the peak spec table indicates.
A conservative but realistic estimate for H100-class hardware in single-batch inference: effective utilization of ~20%, giving a real-world power cost of approximately 0.9 watts per effective TFLOP/s (vs. the 0.177 W peak spec).
This single factor, GPU memory bandwidth saturation during autoregressive decoding, is the reason AI inference energy costs are routinely 5–10× higher per token than a naive FLOP-count calculation would suggest.
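Both numbers in this section, the bandwidth-limited decoding ceiling and the utilization-adjusted efficiency ratio, follow from one division each. The sketch below uses the 70B FP16 and H100 figures quoted above. The function names are illustrative, and it assumes, as the text does, that every weight is read once per generated token.

```python
def max_decode_tokens_per_s(params: float, bytes_per_param: float,
                            mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decoding speed when every weight is read once per token."""
    weight_bytes = params * bytes_per_param
    seconds_per_token = weight_bytes / mem_bandwidth_bytes_per_s
    return 1.0 / seconds_per_token


def effective_watts_per_tflops(peak_watts_per_tflops: float, utilization: float) -> float:
    """Scale the peak-spec ratio by the fraction of peak FLOP/s actually achieved."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return peak_watts_per_tflops / utilization


# 70B parameters, 2 bytes each (FP16), 3.35 TB/s of HBM3 bandwidth.
ceiling = max_decode_tokens_per_s(70e9, 2, 3.35e12)
# H100 peak ratio of 0.177 W per TFLOP/s at the assumed 20% utilization.
effective = effective_watts_per_tflops(0.177, 0.20)

print(f"Decoding ceiling: about {ceiling:.0f} tokens/s")
print(f"Effective ratio: about {effective:.2f} W per effective TFLOP/s")
```

Changing `utilization` to 0.10 or 0.30 shows how sensitive the final energy figure is to this one assumption.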
Step 3 — The PUE Multiplier: Cooling and Infrastructure
The GPU's power draw is not the full energy cost of your inference request. Every watt the GPU consumes generates heat that the data center must remove, via cooling towers, chillers, computer room air handlers, and liquid cooling loops. Power also flows through transformers, UPS systems, and distribution units before reaching the GPU, with resistive losses at each stage.
The industry-wide metric for this overhead is Power Usage Effectiveness (PUE):
PUE = Total Data Center Energy ÷ IT Equipment Energy
A PUE of 1.0 would mean zero overhead: every watt drawn from the grid goes directly into compute. That is physically impossible. A PUE of 2.0 means overhead equals compute energy. Modern hyperscale data centers achieve:
| Operator | Reported PUE | Notes |
|---|---|---|
| Google (global avg.) | 1.10 | Best-in-class, free-air cooling |
| Meta (global avg.) | 1.08 | Custom OCP hardware |
| Microsoft Azure | 1.18 | Global average across all regions |
| Amazon AWS | 1.20 | Global average |
| Typical colocation | 1.40–1.60 | Older infrastructure |
| Industry average (IEA) | 1.55 | All data centers globally |
For calculations involving most major AI companies (OpenAI on Azure, Anthropic on AWS/GCP, Google's own models), a PUE of 1.10–1.20 is a reasonable assumption. For the worked example, we will use 1.15.
Total Facility Energy = IT Equipment Energy × PUE
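As a sketch, the PUE definition and its use as a multiplier look like this. The function names and the 100 J input are illustrative assumptions.

```python
def pue(total_facility_energy: float, it_equipment_energy: float) -> float:
    """PUE = total data center energy / IT equipment energy."""
    if it_equipment_energy <= 0:
        raise ValueError("it_equipment_energy must be positive")
    return total_facility_energy / it_equipment_energy


def total_facility_energy(it_equipment_energy: float, pue_value: float) -> float:
    """Scale IT energy up to facility energy; a PUE below 1.0 is not physical."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_equipment_energy * pue_value


# Illustrative: 100 J of GPU energy at the PUE of 1.15 used in the worked example.
facility = total_facility_energy(100.0, 1.15)
overhead = facility - 100.0
print(f"{facility:.1f} J total, of which {overhead:.1f} J is cooling and distribution overhead")
```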
Step 4 — From Joules to Kilowatt-Hours
Physics measures energy in joules. Your electricity bill uses kilowatt-hours. The conversion is exact:
1 kWh = 3,600,000 joules = 3.6 megajoules
kWh = Joules ÷ 3,600,000
Or, operating at once from watts and seconds:
kWh = Watts × Seconds ÷ 3,600,000
This is the bridge between the compute world and the household energy world. A GPU drawing 700 watts for 1 second consumes 700 joules = 0.000194 kWh = 0.194 watt-hours. The same GPU running for one full hour consumes 700 watt-hours = 0.7 kWh.
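The two conversions can be written directly from the definitions. The helper names below are illustrative; the constant is exact (1,000 W × 3,600 s).

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1,000 W x 3,600 s


def joules_to_kwh(joules: float) -> float:
    return joules / JOULES_PER_KWH


def watt_seconds_to_kwh(watts: float, seconds: float) -> float:
    # Watts x seconds is joules, so this is the same conversion in one step.
    return joules_to_kwh(watts * seconds)


one_second = watt_seconds_to_kwh(700, 1)    # a 700 W GPU for one second
one_hour = watt_seconds_to_kwh(700, 3600)   # the same GPU for one hour

print(f"1 second: {one_second:.6f} kWh ({one_second * 1000:.3f} Wh)")
print(f"1 hour:   {one_hour:.1f} kWh")
```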
Step 5 — Carbon Intensity: The Geography Problem
One kilowatt-hour of electricity carries a different carbon cost depending entirely on where and when it is consumed. The carbon intensity of the grid, measured in grams or kilograms of CO₂ equivalent per kWh, varies by a factor of 8× across U.S. states and more than 10× globally.
| Grid / Region | CO₂e per kWh | Notes |
|---|---|---|
| Iceland | 28 g | Geothermal and hydro |
| Norway | 29 g | Near-100% hydro |
| France | 85 g | ~70% nuclear |
| Washington State, USA | 116 g | Columbia River hydro |
| California, USA | 202 g | Mix with substantial solar |
| EU average | 233 g | EU Green Deal progress |
| United Kingdom | 238 g | North Sea wind contribution |
| US national average | 386 g | EPA eGRID 2023 |
| Texas (ERCOT) | 432 g | Gas-heavy grid |
| West Virginia | 680 g | Predominantly coal |
| Australia | 510 g | Coal-heavy NEM grid |
| India | 708 g | Coal-dominant generation |
| Poland | 722 g | Heavy coal dependency |
Data centers are not randomly distributed across these grids. AWS's US-East-1 (Northern Virginia) is one of the largest AI inference regions in the world. Virginia's grid carbon intensity runs approximately 320–350 g CO₂e/kWh. Google's data centers cluster in Iowa (wind-heavy, ~380 g), Oregon (~130 g, Columbia River hydro), and Finland (~140 g). The exact grid mix a given inference request lands on is not disclosed by providers, which is why carbon estimates for AI inference carry an unavoidable ±40% uncertainty from grid variability alone.
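The carbon step is a single multiplication, so most of the uncertainty comes from which grid figure you plug in. The sketch below uses a few intensities from the table and applies the ±40% band as a simple symmetric range. That symmetric treatment is an assumption made here for illustration, and the function names are hypothetical.

```python
# Grid intensities in g CO2e per kWh, taken from the table above.
GRID_G_PER_KWH = {
    "Norway": 29.0,
    "France": 85.0,
    "US national average": 386.0,
    "Poland": 722.0,
}


def co2e_grams(energy_kwh: float, grid_g_per_kwh: float) -> float:
    return energy_kwh * grid_g_per_kwh


def co2e_range(energy_kwh: float, grid_g_per_kwh: float, uncertainty: float = 0.40):
    """Return (low, high) grams, treating the grid uncertainty as a symmetric band."""
    central = co2e_grams(energy_kwh, grid_g_per_kwh)
    return central * (1.0 - uncertainty), central * (1.0 + uncertainty)


energy_kwh = 1.0  # illustrative: one kilowatt-hour of facility energy
for region, intensity in GRID_G_PER_KWH.items():
    print(f"{region}: {co2e_grams(energy_kwh, intensity):.0f} g CO2e")

low, high = co2e_range(energy_kwh, GRID_G_PER_KWH["US national average"])
print(f"US average with +/-40% band: {low:.0f} to {high:.0f} g CO2e")
```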
The Full Conversion Chain: Worked Example
Let us run a complete calculation for a realistic AI prompt: a 1,000-token query to a GPT-4 class model (MoE architecture, approximately 55B active parameters per token), served on H100 hardware running at 20% utilization efficiency, in a data center with PUE 1.15, on a grid with 386 g CO₂e/kWh (US average).
Stage 1 — Total FLOPs for the query:
FLOPs per token ≈ 2 × 55 × 10⁹ = 110 × 10⁹ = 110 GFLOP
1,000 tokens × 110 GFLOP = 110,000 GFLOP = 110 TFLOP
Stage 2 — Convert FLOPs to GPU compute time:
H100 peak FLOP/s = 3,958 TFLOP/s
Effective FLOP/s at 20% utilization = 3,958 × 0.20 = 791.6 TFLOP/s
Compute time = 110 TFLOP ÷ 791.6 TFLOP/s = 0.139 seconds
Stage 3 — GPU energy consumption:
H100 TDP = 700 W
GPU energy = 700 W × 0.139 s = 97.3 joules
Stage 4 — Apply PUE overhead:
Total facility energy = 97.3 J × 1.15 = 111.9 joules
Stage 5 — Convert to kWh:
kWh = 111.9 ÷ 3,600,000 = 0.0000311 kWh = 0.031 Wh
Stage 6 — Convert to carbon:
CO₂e = 0.0000311 kWh × 386 g/kWh = 0.012 g CO₂e
Summary table for the worked example:
| Stage | Value | Unit |
|---|---|---|
| FLOPs required | 110,000 | GFLOP (110 TFLOP) |
| Effective FLOP/s (20% util.) | 791.6 | TFLOP/s |
| GPU compute time | 0.139 | seconds |
| Raw GPU energy | 97.3 | joules |
| Total facility energy (PUE 1.15) | 111.9 | joules |
| Total facility energy | 0.031 | watt-hours |
| Total facility energy | 0.0000311 | kWh |
| Carbon emission (US avg. grid) | 0.012 | g CO₂e |
For a single 1,000-token query: about 0.031 Wh and 0.012 grams of CO₂e. That is individually trivial. At OpenAI's reported 10 million daily ChatGPT users generating an average of 5 queries each, the daily total across all 50 million queries is about 1,350 kWh in GPU energy, or about 1.55 MWh accounting for PUE, purely from consumer inference, for that one product alone. Per year: roughly 570 MWh, equivalent to the annual electricity consumption of about 50 average U.S. homes.
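The six stages of the worked example chain together into one function. This is a sketch under the same stated assumptions (55B active parameters, 3,958 TFLOP/s peak, 20% utilization, 700 W, PUE 1.15, 386 g/kWh). The function name and the returned dictionary keys are illustrative. Because the code does not round between stages, its joule figures differ slightly from the hand calculation (about 97.27 J and 111.86 J instead of 97.3 J and 111.9 J).

```python
def estimate_query(active_params: float, tokens: int, peak_tflops: float,
                   utilization: float, tdp_watts: float, pue: float,
                   grid_g_per_kwh: float) -> dict:
    """Run the FLOPs -> seconds -> joules -> kWh -> g CO2e chain for one query."""
    total_flops = 2.0 * active_params * tokens                  # Stage 1
    effective_flops_per_s = peak_tflops * 1e12 * utilization
    seconds = total_flops / effective_flops_per_s               # Stage 2
    gpu_joules = tdp_watts * seconds                            # Stage 3
    facility_joules = gpu_joules * pue                          # Stage 4
    kwh = facility_joules / 3_600_000                           # Stage 5
    co2e_g = kwh * grid_g_per_kwh                               # Stage 6
    return {
        "total_tflop": total_flops / 1e12,
        "seconds": seconds,
        "gpu_joules": gpu_joules,
        "facility_joules": facility_joules,
        "wh": kwh * 1000.0,
        "kwh": kwh,
        "co2e_g": co2e_g,
    }


result = estimate_query(
    active_params=55e9, tokens=1000, peak_tflops=3958.0,
    utilization=0.20, tdp_watts=700.0, pue=1.15, grid_g_per_kwh=386.0,
)
for key, value in result.items():
    print(f"{key}: {value:.6g}")
```

Swapping in a different parameter count, utilization, PUE, or grid intensity gives the estimate for any other scenario without redoing the arithmetic by hand.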
Your Prompt in Household Energy Equivalents
By themselves, the numbers (milliwatt-hours, hundredths of a gram of CO₂) are hard to grasp. Comparison to familiar household energy uses makes them concrete.
| Activity | Energy (Wh) | CO₂e (g, US avg.) |
|---|---|---|
| AI prompt, 500 tokens (small model) | 0.008 Wh | 0.003 g |
| AI prompt, 1,000 tokens (mid model) | 0.031 Wh | 0.012 g |
| AI prompt, 1,000 tokens (large model) | 0.080 Wh | 0.031 g |
| AI reasoning query (128K context) | 0.85–2.4 Wh | 0.33–0.93 g |
| Google search (1 query) | 0.30 Wh | 0.116 g |
| Sending one email | 0.04 Wh | 0.015 g |
| Charging an iPhone 16 (full) | 18.6 Wh | 7.2 g |
| LED bulb running 1 hour | 10 Wh | 3.9 g |
| Laptop running 1 hour | 45 Wh | 17.4 g |
| Dishwasher cycle | 1,100 Wh | 425 g |
| EV charge (25 miles range) | 7,500 Wh | 2,898 g |
| Transatlantic flight (per passenger) | ~530,000 Wh | ~205,000 g |
The comparison that recalibrates most people's intuition: a typical 1,000-token mid-model AI query uses approximately one-tenth the energy of a Google search. This surprises people who assume AI must be vastly more expensive than search. At the per-query level for simple inference, it is not. But AI inference scales to far higher token counts than search ever does, and the cumulative footprint of running many long-context or reasoning queries is meaningfully larger.
The 128K-context reasoning query (sending a long document for deep analysis using a model with extended thinking enabled) sits in a different class entirely: 0.85–2.4 Wh per query, equivalent to running an LED bulb for 5–15 minutes. At scale, these queries represent the highest per-call energy cost in the enterprise AI stack today.
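The household comparisons are simple ratios, sketched below. The 10 W LED figure comes from the table row (10 Wh over one hour), and the helper names are illustrative.

```python
def minutes_of_device(energy_wh: float, device_watts: float) -> float:
    """How long a device of the given wattage runs on this much energy."""
    return energy_wh / device_watts * 60.0


def prompts_per_activity(activity_wh: float, prompt_wh: float) -> float:
    """How many prompts use the same energy as one household activity."""
    return activity_wh / prompt_wh


LED_WATTS = 10.0          # from the table: 10 Wh per hour of use
MID_PROMPT_WH = 0.031     # 1,000-token mid-model prompt from the worked example

low_minutes = minutes_of_device(0.85, LED_WATTS)
high_minutes = minutes_of_device(2.4, LED_WATTS)
per_search = prompts_per_activity(0.30, MID_PROMPT_WH)
per_phone_charge = prompts_per_activity(18.6, MID_PROMPT_WH)

print(f"128K reasoning query: {low_minutes:.1f} to {high_minutes:.1f} minutes of LED light")
print(f"One Google search is roughly {per_search:.0f} mid-model prompts")
print(f"One full iPhone charge is roughly {per_phone_charge:.0f} mid-model prompts")
```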
How Prompt Design Changes the Carbon Math
Prompt length directly controls the token count, which directly controls the FLOPs required, which directly controls energy consumption. This is not a marginal relationship; under the 2 × parameters approximation it is linear. Double the tokens, double the energy.
| Prompt Type | Tokens In + Out | Relative Energy | Relative CO₂e |
|---|---|---|---|
| One-line lookup | ~150 tokens | 1× baseline | 1× |
| Standard chat query | ~500 tokens | 3.3× | 3.3× |
| Typical document summary | ~3,000 tokens | 20× | 20× |
| Full code review (mid codebase) | ~15,000 tokens | 100× | 100× |
| Long-context document analysis | ~80,000 tokens | 533× | 533× |
| Extended reasoning + 128K context | ~130,000 tokens | 867× | 867× |
A developer habit of sending the entire codebase into a reasoning model's context window, a pattern that has become increasingly common with 128K+ context models, costs about 800× more energy per query than a focused one-line lookup. The model capability gains are real, but so is the energy arithmetic.
The practical implication: focused, scoped prompts that retrieve only what the model needs (RAG, chunking, summarization pipelines) are not just more token-efficient on your API bill. They are meaningfully more carbon-efficient for exactly the same reason.
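The relative-energy column in the table is each token count divided by the 150-token baseline. A short sketch that reproduces it (the names are illustrative, and it assumes the linear tokens-to-energy relationship described above):

```python
BASELINE_TOKENS = 150  # the one-line baseline prompt from the table

PROMPT_TOKENS = {
    "Standard chat query": 500,
    "Document summary": 3_000,
    "Full code review": 15_000,
    "Long-context document analysis": 80_000,
    "Extended reasoning + 128K context": 130_000,
}


def relative_energy(tokens: int, baseline_tokens: int = BASELINE_TOKENS) -> float:
    """Energy relative to the baseline prompt, assuming energy is linear in tokens."""
    return tokens / baseline_tokens


multipliers = {name: relative_energy(tokens) for name, tokens in PROMPT_TOKENS.items()}
for name, factor in multipliers.items():
    print(f"{name}: {factor:.1f}x baseline")
```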
The Renewable Energy Accounting Problem
Major AI providers report ambitious renewable energy commitments. Google claims to match 100% of its annual electricity consumption with renewable energy certificates. Microsoft targets 100% renewable by 2025. These statements are true in a specific and limited accounting sense, and misleading in the physical sense that actually matters for carbon.
Renewable energy matching works through Renewable Energy Certificates (RECs) or Power Purchase Agreements (PPAs): a company buys certificates representing renewable generation that occurred somewhere on the grid, in the same calendar year as its consumption. The certificates are retired, and the company claims renewable matching. The electrons that actually powered the data center at 2 AM on a Tuesday when Virginia wind was low came from whatever generators were running at that moment, which may well have been gas peakers.
The carbon impact that matters is marginal emissions intensity: the carbon cost of the additional electron drawn from the grid at the exact time and place of the inference request. For a data center in Virginia running heavy inference loads at night, the marginal generator is often a natural gas plant, not a solar array in Texas whose certificates the company bought in Q3.
This accounting gap does not make renewable matching meaningless; building demand for renewable generation is genuinely valuable. But it does mean that company claims of "100% renewable AI" should be understood as accounting statements, not physical statements. The actual carbon footprint of a query processed at midnight on a coal-heavy grid is not zero, regardless of certificate purchases.
The only physically meaningful carbon claim is 24/7 carbon-free energy (CFE): matching renewable generation to consumption on an hourly basis, in the same grid region. Google is the furthest along in this direction, reporting 72% CFE globally in 2024. No major AI provider has reached 100% 24/7 CFE at its inference data center locations.
What You Can Actually Control
As a developer or product team running AI inference at scale, four levers directly reduce your carbon exposure:
Route by complexity. Use small, efficient models for simple tasks (classification, extraction, lookup) and large models only for tasks that genuinely require them. A Llama 3 8B model running on a single A100 uses about 7× less energy per token than a 70B model. The quality difference is irrelevant for tasks where both models produce the same output.
Constrain context aggressively. Every token in the context window is billed in compute. A RAG pipeline that retrieves 2 precise chunks instead of 8 fuzzy ones does not just save API cost; it reduces the FLOPs required for every attention operation across all transformer layers. Attention scales quadratically with context length: doubling context length quadruples the attention computation.
Batch where latency permits. Batched inference dramatically improves GPU utilization by shifting the operation from memory-bandwidth-bound to compute-bound. At batch size 16–32, effective GPU utilization rises from ~20% to ~60–70%, reducing the watts-per-effective-TFLOP/s ratio by about 3× and improving the energy cost per token proportionally.
Choose providers with high CFE scores for carbon-sensitive workloads. Google Cloud (72% 24/7 CFE, with the best performance in Oregon and Finland regions), Microsoft Azure (Norway and Sweden regions with near-100% hydro), and AWS (Oregon and Canada regions) offer meaningfully lower marginal carbon intensity than US-East or Asia-Pacific regions for the same inference workload.
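Two of these levers reduce to one-line ratios. The sketch below covers the quadratic attention term and the batching gain. Both functions are simplified illustrations with hypothetical names: the batching factor assumes energy per token is inversely proportional to effective utilization, which is the relationship the efficiency ratio above implies.

```python
def attention_cost_factor(context_ratio: float) -> float:
    """Relative attention compute when context length is scaled by context_ratio."""
    return context_ratio ** 2


def batching_energy_factor(util_before: float, util_after: float) -> float:
    """Relative energy per token after batching, assuming energy scales as 1 / utilization."""
    if util_before <= 0 or util_after <= 0:
        raise ValueError("utilization values must be positive")
    return util_before / util_after


doubled_context = attention_cost_factor(2.0)
batched_low = batching_energy_factor(0.20, 0.60)
batched_high = batching_energy_factor(0.20, 0.70)

print(f"Doubling context multiplies attention compute by {doubled_context:.0f}x")
print(f"Batching to 60% utilization: {1 / batched_low:.1f}x less energy per token")
print(f"Batching to 70% utilization: {1 / batched_high:.1f}x less energy per token")
```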
The teraflop and the kilowatt-hour were never as far apart as the jargon made them seem. One is a measure of work done; the other is a measure of the energy required to do it. The conversion chain is straightforward physics: each step is a unit substitution, not an estimate. The uncertainty is not in the math but in the inputs: model architecture, GPU utilization, data center efficiency, and grid carbon intensity. State your assumptions clearly, run the numbers, and the household energy equivalent of your prompt is a calculation, not a mystery.


