
The Real Carbon Cost of an AI Token

Cam Pedersen

CTO & Co-founder

Every time you prompt an AI and it generates a word, there's an invisible cost paid in carbon.

We usually think in dollars per token, but what about CO₂ per token? A single ChatGPT request emits roughly 4.32 g CO₂ on average, more than twenty times a Google search at ≈ 0.2 g CO₂. A few grams seem trivial until millions of users ask millions of questions: the grams quickly add up to tonnes.
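To see how fast the grams compound, here is a back-of-the-envelope calculation; the per-request figure is the one quoted above, while the daily request volume is an assumed round number for illustration only.

```python
# Back-of-the-envelope: per-request grams scale to tonnes per day.
GRAMS_PER_REQUEST = 4.32          # average ChatGPT request, figure quoted above
REQUESTS_PER_DAY = 100_000_000    # assumed volume, for illustration only

tonnes_per_day = GRAMS_PER_REQUEST * REQUESTS_PER_DAY / 1_000_000  # 1 t = 1,000,000 g
print(f"{tonnes_per_day:,.0f} t CO2 per day")  # -> 432 t CO2 per day at this volume
```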

Below, we dissect the carbon footprint of today's large-language models (LLMs) – covering both one-time training and day-to-day inference – and finish with a checklist you can apply right now to slash emissions without sacrificing quality.

Training

| Model (release) | Size (parameters) | Reported / estimated CO₂ emitted in training |
|---|---|---|
| OpenAI GPT-3 (2020) | 175 B | ≈ 552 t CO₂ (Patterson et al.) |
| OpenAI GPT-4 (2023) | undisclosed (≫ 1 T) | 12,000–15,000 t CO₂ (estimate via analysis of leaked compute) |
| OpenAI GPT-o3 (2024) | undisclosed | High compute per task: 684 kg CO₂e for a single ARC-AGI benchmark task (analysis by Salesforce) |
| Google Gemini (2024) | undisclosed, multimodal | Not disclosed; Google claims 100 % renewable energy procurement (Google ESG) |
| Google Gemini 2.5 Pro (2025) | undisclosed, mixture-of-experts | Not disclosed; likely significant given the 1M+ token context window and multimodal capabilities |
| Google Gemma 3 (2025) | 1 B–27 B, optimized for efficiency | Not disclosed; designed for lower resource consumption on single GPUs/TPUs |
| Meta LLaMA-2 (2023) | 7 B / 13 B / 70 B | 539 t CO₂, 100 % offset (model card) |
| Meta LLaMA-3 (2024) | 8 B / 70 B | Not yet disclosed |
| Meta Llama 4 (2025) | 17 B (Scout/Maverick) / 288 B (Behemoth) | 1,999 t CO₂ for Scout and Maverick; Meta claims net-zero via renewable energy |
| Anthropic Claude 2 (2023) | 50–100 B (estimate) | Not disclosed |
| Mistral-7B (2023) | 7 B | No public number; extrapolation ≈ 10 t CO₂ |

For scale, 550 t CO₂ ≈ the emissions of 30 round-trip flights NYC ↔ London. Training matters—but, as we'll see, inference dominates long-term impact.

Inference

Analyses by Meta AI, AWS SageMaker and Google show that 60–90 % of an LLM's life-cycle emissions come from inference, not training (Google Green-AI audit). Power-hungry GPUs (or TPUs) sit on-call 24 × 7, generating answers token-by-token.

How much CO₂ per token?

| Model scale | Hardware + precision | Carbon per output token |
|---|---|---|
| 288+ B params | NVIDIA H100 @ FP16 | ~30 mg CO₂ (estimated for models like GPT-o3 and Llama 4 Behemoth) |
| 70 B params | NVIDIA A100 @ FP16 | ~15 mg CO₂ |
| 70 B params | NVIDIA H100 @ FP8 | ~7.5 mg CO₂ (≈ 2× better; see NVIDIA H100 deep-dive) |
| 70 B params | Google TPU v5e @ INT8 | ~3 mg CO₂ (TPU v5e launch) |
| 13–27 B params | NVIDIA A100 @ FP16 | ~3 mg CO₂ (applicable to models like Gemma 3 and LLaMA variants) |
| 2 B params | NVIDIA A100 @ FP16 | ~0.5 mg CO₂ |

Figures synthesise measurements from "From Words to Watts: Benchmarking the Energy Costs of LLM Inference" with vendor-reported perf/W data. Note that specific CO₂ figures depend heavily on the electricity grid's carbon intensity where the computation occurs. Generating ~350 tokens (≈ 500 words) on an H100 draws only 0.008 kWh – yet at global ChatGPT scale that's tens of tonnes per day.
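If you want to sanity-check figures like those in the table, the conversion is straightforward: energy per token multiplied by the grid's carbon intensity. A minimal sketch, where both input values are assumptions chosen only to illustrate the arithmetic:

```python
# Rough conversion: energy per token x grid carbon intensity -> mg CO2 per token.
# Both inputs below are illustrative assumptions, not measured values.
JOULES_PER_TOKEN = 80.0        # assumed energy for one output token on a 70 B model
GRID_G_CO2_PER_KWH = 400.0     # assumed grid carbon intensity; varies widely by region

kwh_per_token = JOULES_PER_TOKEN / 3_600_000                   # 1 kWh = 3.6e6 J
mg_co2_per_token = kwh_per_token * GRID_G_CO2_PER_KWH * 1_000  # g -> mg

print(f"{mg_co2_per_token:.1f} mg CO2 per output token")  # -> ~8.9 mg with these inputs
```

Swap in your provider's measured Wh per token and your region's grid data to get a number you can actually act on.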

A separate empirical study put a 1,000-token, image-enhanced ChatGPT request at 8.3 g CO₂ – roughly the footprint of charging a smartphone ten times. For perspective, a single task on the ARC-AGI benchmark for GPT-o3 consumes approximately 1,785 kWh of energy, equivalent to two months of an average U.S. household's electricity use.

Choosing a low-carbon model (without wrecking quality)

Rule #1: Use the smallest model that meets the task's quality bar.

Below are common tasks and evidence that "smaller" often suffices:

| Use-case | Low-carbon alternative | Evidence |
|---|---|---|
| Summarisation | Fine-tuned 7 B–13 B LLaMA-2/3 or Gemma 3 | Matches GPT-3.5 on CNN/DailyMail with 10× lower energy (Watts paper) |
| Retrieval-Augmented Generation (RAG) | LLaMA-2-13B, Llama 4 Scout, or Gemma 3 + vector DB | Comparable factual accuracy to GPT-3.5-turbo at ~10 % of the cost/emissions (MyScale RAG benchmark) |
| Structured extraction | Classic BERT-large or INT8-quantised 7 B model | Near-perfect F1 while using < 1 % of the energy of GPT-4 (Responsible-AI survey) |
| Casual chat | Distilled models like Llama 4 Scout or Gemma 3 (1 B–7 B) | High performance for everyday tasks at a fraction of the energy cost of larger models |
| Complex reasoning | Mixture-of-experts models (Gemini 2.5 Pro, Llama 4 Maverick) | MoE architecture activates only the relevant experts per token, offering better efficiency than monolithic models of similar capability |

Fine-tuning, prompt engineering and RAG let smaller models "punch above their weight," delivering orders-of-magnitude greener inference (Luccioni et al.). The newest generation of efficient models like Llama 4 Scout (17B parameters) and Gemma 3 demonstrate that smaller doesn't necessarily mean less capable.
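One practical way to apply "use the smallest model that clears the bar" is a cascade: send every request to a small model first and escalate only when a cheap quality check fails. A minimal sketch; the two generate functions and the quality gate are hypothetical placeholders, not a specific vendor API:

```python
# Cascade: try a small (low-carbon) model first, escalate only when a cheap
# quality gate fails. The generate_* functions are hypothetical placeholders.

def generate_small(prompt: str) -> str:
    # Placeholder: call your fine-tuned 7 B-13 B model here.
    return f"[small-model answer to: {prompt}]"

def generate_large(prompt: str) -> str:
    # Placeholder: call your large fallback model here.
    return f"[large-model answer to: {prompt}]"

def passes_quality_gate(answer: str) -> bool:
    # Cheap heuristic gate; replace with schema validation, a regex,
    # or a small critic model suited to your task.
    return len(answer.strip()) > 20

def cascade(prompt: str) -> str:
    answer = generate_small(prompt)
    if passes_quality_gate(answer):
        return answer              # most requests stop here: far fewer mg CO2 per token
    return generate_large(prompt)  # pay the large-model cost only when needed

print(cascade("Summarise this paragraph in one sentence: ..."))
```

If, say, 80 % of traffic never escalates, the blended per-token footprint falls toward the small model's, without capping quality on the hard 20 %.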

Checklist: ten immediate wins for greener LLM apps

  1. Right-size the model. Benchmark a 7 B or 13 B alternative before defaulting to GPT-4 or GPT-o3.
  2. Fine-tune or distil. A domain-specific 7 B often beats a generic 70 B.
  3. Quantise aggressively. INT8 / FP8 cuts energy 2–4 × with negligible quality loss (TensorRT-LLM case study); see the sketch after this list.
  4. Pick efficient accelerators. H100 or TPU v5e deliver > 2 × tokens/W versus A100.
  5. Consider mixture-of-experts models. Models like Gemini 2.5 Pro and Llama 4 Maverick activate only relevant parameters, improving efficiency.
  6. Batch and stream smartly. Full GPU utilisation slashes joules per token.
  7. Trim prompts & max-tokens. Don't encode or generate text you'll discard.
  8. Cache recurring answers. Stop paying (in dollars and CO₂) for repeat queries.
  9. Choose green regions. Oregon's hydro-heavy grid is far cleaner than Virginia's fossil-heavier mix.
  10. Schedule maintenance jobs for clean-grid hours. Solar-rich midday or windy nights.
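As a concrete illustration of item 3, here is a minimal INT8 loading sketch assuming Hugging Face transformers, accelerate and bitsandbytes are installed and a CUDA GPU is available; the model ID is just an example, and exact flags may vary between library versions:

```python
# Minimal INT8 loading sketch (assumes transformers, accelerate and bitsandbytes;
# the model ID below is only an example -- swap in the model you benchmarked).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # 8-bit weights: roughly half the memory of FP16
    device_map="auto",
)

inputs = tokenizer("Summarise the key points of these meeting notes:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

FP8 on H100-class hardware needs an inference stack such as TensorRT-LLM, but the principle is the same: fewer bits moved per token, fewer joules per token.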

Real-world teams have achieved > 10 × CO₂ reduction by combining just a few of the above.

A call for transparency

Meta publishes full emissions for LLaMA-2 and has continued this practice with Llama 4, reporting 1,999 tons CO₂eq for training Scout and Maverick models. Hugging Face now displays per-model estimates on the Open LLM Leaderboard.

By contrast, OpenAI provides limited environmental data for GPT-o3, with third-party analysis suggesting extremely high per-task energy consumption. Google has not published specific carbon footprint data for Gemini 2.5 Pro or Gemma 3, though they emphasize Gemma's efficiency focus. Anthropic still provides no model-specific carbon data. If a carmaker hid its MPG you'd balk; why tolerate opacity from AI vendors whose flagship training runs may exceed 15,000 t CO₂?

We need:

  • Training-emission disclosure (compute hours, PUE, grid mix, offsets).
  • Standardised inference-efficiency reporting (Wh or g CO₂ per 100 tokens).
  • Clear offset accounting (quality and permanence of removals).

Competition on efficiency is healthy; secrecy delays progress.

What's next?

The age of foundation models has arrived – and with it a surge in computing's climate cost. Yet the data show that smarter choices – smaller models, quantisation, efficient hardware and clever system design – can cut emissions > 10 × with little or no quality loss.

The emergence of mixture-of-experts architectures in models like Gemini 2.5 Pro and Llama 4 Maverick represents a promising direction, allowing selective activation of parameters rather than running entire trillion-parameter models for every query. Similarly, efficiency-focused models like Gemma 3 demonstrate that high-quality results don't always require massive computation.

The next time you marvel at an AI-generated poem or bug-fix, ask yourself: "How many milligrams of CO₂ did that cost?" By making that question normal, we drive the entire ecosystem toward models that are not only powerful, but planet-friendly.

Let's build AI we can be proud of – technically and environmentally.