Every time you prompt an AI and it generates a word, there's an invisible cost paid in carbon.
We usually think in dollars per token, but what about CO₂ per token? A single ChatGPT request emits roughly 4.32 g CO₂ on average – an order of magnitude more than a Google search at ≈ 0.2 g CO₂. A few grams seem trivial until millions of users ask millions of questions: the grams quickly add up to tonnes.
Below, we dissect the carbon footprint of today's large-language models (LLMs) – covering both one-time training and day-to-day inference – and finish with a checklist you can apply right now to slash emissions without sacrificing quality.
Training
Model (release) | Size (parameters) | Reported / estimated CO₂ emitted in training |
---|---|---|
OpenAI GPT-3 (2020) | 175 B | ≈ 552 t CO₂ (Patterson et al.) |
OpenAI GPT-4 (2023) | undisclosed (≫ 1 T) | 12 000–15 000 t CO₂ (estimate via analysis of leaked compute) |
OpenAI o3 (2024) | undisclosed | High compute per task: 684 kg CO₂e for a single ARC-AGI benchmark task (analysis by Salesforce) |
Google Gemini (2024) | undisclosed, multimodal | not disclosed; Google claims 100 % renewable energy procurement (Google ESG) |
Google Gemini 2.5 Pro (2025) | undisclosed, mixture-of-experts | not disclosed; likely significant due to 1M+ token context window and multimodal capabilities |
Google Gemma 3 (2025) | 1 B–27 B, optimized for efficiency | not disclosed; designed for lower resource consumption on single GPUs/TPUs |
Meta LLaMA-2 (2023) | 7 B / 13 B / 70 B | 539 t CO₂, 100 % offset (model card) |
Meta LLaMA-3 (2024) | 8 B / 70 B | not yet disclosed |
Meta Llama 4 (2025) | 17 B active (Scout/Maverick) / 288 B active (Behemoth) | 1,999 t CO₂ for Scout and Maverick; Meta claims net-zero via renewable energy |
Anthropic Claude 2 (2023) | 50–100 B (estimate) | not disclosed |
Mistral-7B (2023) | 7 B | no public number; extrapolation ≈ 10 t CO₂ |
For scale, 550 t CO₂ is roughly the per-passenger emissions of several hundred round-trip flights NYC ↔ London. Training matters, but as we'll see, inference dominates long-term impact.
Inference
Analyses by Meta AI, AWS (SageMaker) and Google show that 60–90 % of an LLM's life-cycle emissions come from inference, not training (Google Green-AI audit). Power-hungry GPUs (or TPUs) sit on call 24/7, generating answers token by token.
How much CO₂ per token?
Model scale | Hardware + precision | Carbon per output token |
---|---|---|
288+ B params | NVIDIA H100 @ FP16 | ~30 mg CO₂ (estimated for models like o3 and Llama 4 Behemoth) |
70 B params | NVIDIA A100 @ FP16 | ~15 mg CO₂ |
70 B params | NVIDIA H100 @ FP8 | ~7.5 mg CO₂ (≈ 2 × better; see NVIDIA H100 deep-dive) |
70 B params | Google TPU v5e @ INT8 | ~3 mg CO₂ (TPU v5e launch) |
13-27 B params | NVIDIA A100 @ FP16 | ~3 mg CO₂ (applicable to models like Gemma 3 and LLaMA variants) |
2 B params | NVIDIA A100 @ FP16 | ~0.5 mg CO₂ |
Figures synthesise measurements from "From Words to Watts: Benchmarking the Energy Costs of LLM Inference" with vendor-reported perf/W data. Note that specific CO₂ figures depend heavily on the carbon intensity of the electricity grid where the computation occurs. Generating ~350 tokens (≈ 260 words) on an H100 draws only about 0.008 kWh – yet at global ChatGPT scale that adds up to tens of tonnes per day.
A separate empirical study put a 1 000-token, image-enhanced ChatGPT request at 8.3 g CO₂ – roughly the footprint of fully charging a smartphone. For perspective, a single high-compute task on the ARC-AGI benchmark for o3 consumes approximately 1,785 kWh of energy, equivalent to about two months of an average U.S. household's electricity use.
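To make these per-token figures tangible, here is a minimal back-of-the-envelope estimator in Python. The constants are illustrative assumptions derived from the table and text above plus typical grid averages, not measured values for any particular deployment.

```python
# Rough per-request CO2 estimator. All constants are illustrative assumptions
# drawn from the figures above and public grid averages, not measured values
# for any specific model or datacenter.

# Assumed grid carbon intensity, grams CO2 per kWh.
GRID_G_CO2_PER_KWH = {
    "world_average": 475,
    "hydro_heavy_region": 120,
    "fossil_heavy_region": 600,
}

# Assumed server-side energy per generated token, in kWh
# (~0.008 kWh per ~350 tokens on an H100-class deployment, per the text above).
KWH_PER_TOKEN = 0.008 / 350


def request_co2_grams(output_tokens: int, grid: str = "world_average") -> float:
    """Estimated grams of CO2 emitted to generate `output_tokens` tokens."""
    energy_kwh = output_tokens * KWH_PER_TOKEN
    return energy_kwh * GRID_G_CO2_PER_KWH[grid]


# A 350-token answer on the average grid vs. a hydro-heavy grid:
print(round(request_co2_grams(350), 2))                        # ~3.8 g
print(round(request_co2_grams(350, "hydro_heavy_region"), 2))  # ~0.96 g
```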
Choosing a low-carbon model (without wrecking quality)
Rule #1: Use the smallest model that meets the task's quality bar.
Below are common tasks and evidence that "smaller" often suffices:
Use-case | Low-carbon alternative | Evidence |
---|---|---|
Summarisation | fine-tuned 7 B–13 B LLaMA-2/3 or Gemma 3 | Matches GPT-3.5 on CNN/DailyMail with 10 × lower energy (Watts paper) |
Retrieval-Augmented Generation (RAG) | LLaMA-2-13B, Llama 4 Scout, or Gemma 3 + vector DB | Shows comparable factual accuracy to GPT-3.5-turbo at ~10 % cost/emissions (MyScale RAG benchmark) |
Structured extraction | classic BERT-large or INT8-quantised 7 B model | Near-perfect F1 while using < 1 % the energy of GPT-4 (Responsible-AI survey) |
Casual chat | distilled or compact models such as Llama 4 Scout or small Gemma 3 variants (1 B–4 B) | Small and distilled models handle everyday conversation well at a fraction of the energy cost of frontier models |
Complex reasoning | Mixture-of-experts models (Gemini 2.5 Pro, Llama 4 Maverick) | The mixture-of-experts (MoE) architecture activates only the relevant experts per token, offering better efficiency than dense models of similar capability |
Fine-tuning, prompt engineering and RAG let smaller models "punch above their weight," delivering orders-of-magnitude greener inference (Luccioni et al.). The newest generation of efficient models, such as Llama 4 Scout (17 B active parameters) and Gemma 3, demonstrates that smaller doesn't necessarily mean less capable.
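In practice, Rule #1 often takes the shape of a cascade: answer with the small model first and escalate only when a cheap quality check fails. The sketch below illustrates the pattern; `small_generate`, `large_generate` and `passes_quality_bar` are hypothetical placeholders for whatever models and evaluation logic you actually use.

```python
from typing import Callable

def cascade_answer(
    prompt: str,
    small_generate: Callable[[str], str],
    large_generate: Callable[[str], str],
    passes_quality_bar: Callable[[str, str], bool],
) -> str:
    """Serve most requests from the small model; escalate only the hard ones."""
    draft = small_generate(prompt)           # e.g. a quantised 7 B model
    if passes_quality_bar(prompt, draft):    # e.g. heuristics, a reward model, or a self-check
        return draft                         # the common case: far less energy per answer
    return large_generate(prompt)            # escalate the genuinely hard minority
```

The more traffic the small model can absorb, the closer overall emissions get to those of the small model alone.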
Checklist: ten immediate wins for greener LLM apps
- Right-size the model. Benchmark a 7 B or 13 B alternative before defaulting to GPT-4 or o3.
- Fine-tune or distil. A domain-specific 7 B often beats a generic 70 B.
- Quantise aggressively. INT8 / FP8 cuts energy 2–4 × with negligible quality loss (TensorRT-LLM case study).
- Pick efficient accelerators. H100 or TPU v5e deliver > 2 × tokens/W versus A100.
- Consider mixture-of-experts models. Models like Gemini 2.5 Pro and Llama 4 Maverick activate only relevant parameters, improving efficiency.
- Batch and stream smartly. Full GPU utilisation slashes joules per token.
- Trim prompts & max-tokens. Don't encode or generate text you'll discard.
- Cache recurring answers. Stop paying (in dollars and CO₂) for repeat queries – see the caching sketch after this list.
- Choose green regions. Oregon's hydro-heavy grid beats Virginia's fossil-heavier grid – see the region-selection sketch below.
- Schedule maintenance jobs for clean-grid hours. Solar-rich midday or windy nights.
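Caching is often the cheapest win of all: identical questions shouldn't pay the energy bill twice. A minimal sketch, with `generate` standing in for whatever model call you actually make and deliberately naive prompt normalisation:

```python
import hashlib

# In-memory response cache: repeated (normalised-identical) prompts are served
# from memory instead of re-running the model, saving dollars and CO2 alike.
_cache: dict[str, str] = {}

def _cache_key(prompt: str) -> str:
    # Trivial normalisation (collapse whitespace, lowercase) so near-duplicate
    # prompts still hit the cache; real systems often use semantic similarity.
    normalised = " ".join(prompt.split()).casefold()
    return hashlib.sha256(normalised.encode()).hexdigest()

def cached_generate(prompt: str, generate) -> str:
    key = _cache_key(prompt)
    if key not in _cache:
        _cache[key] = generate(prompt)   # pay the energy cost only once
    return _cache[key]
```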
Real-world teams have achieved > 10 × CO₂ reduction by combining just a few of the above.
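The last two checklist items boil down to the same move: run flexible work where, and when, the grid is cleanest. A minimal sketch follows; the intensity numbers are placeholder assumptions, and a real system would query a live carbon-intensity feed rather than a static table.

```python
# Carbon-aware placement sketch. The intensities below are assumed placeholder
# values; in practice, query a live grid-carbon-intensity signal instead.
ASSUMED_G_CO2_PER_KWH = {
    "us-west-2 (Oregon)": 120,     # hydro-heavy (assumed)
    "eu-north-1 (Sweden)": 40,     # hydro/nuclear-heavy (assumed)
    "us-east-1 (Virginia)": 380,   # fossil-heavier (assumed)
}

def pick_greenest_region(intensity_by_region: dict[str, float]) -> str:
    """Return the candidate region with the lowest grams of CO2 per kWh."""
    return min(intensity_by_region, key=intensity_by_region.get)

print(pick_greenest_region(ASSUMED_G_CO2_PER_KWH))  # "eu-north-1 (Sweden)"
```

The same comparison over hourly forecasts tells you when to launch a batch job, not just where.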
A call for transparency
Meta publishes full training emissions for LLaMA-2 and has continued the practice with Llama 4, reporting 1,999 t CO₂eq for training the Scout and Maverick models. Hugging Face now displays per-model estimates on the Open LLM Leaderboard.
By contrast, OpenAI provides limited environmental data for o3, with third-party analysis suggesting extremely high per-task energy consumption. Google has not published specific carbon-footprint data for Gemini 2.5 Pro or Gemma 3, though it emphasizes Gemma's efficiency focus. Anthropic still provides no model-specific carbon data. If a carmaker hid its MPG you'd balk; why tolerate opacity from AI vendors whose flagship training runs may exceed 15 000 t CO₂?
We need:
- Training-emission disclosure (compute hours, PUE, grid mix, offsets).
- Standardised inference-efficiency reporting (Wh or g CO₂ per 100 tokens) – see the sketch after this list.
- Clear offset accounting (quality and permanence of removals).
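To make the ask concrete, a disclosure could be as small as a machine-readable record attached to every model card. A sketch of what that might look like, with every value an invented placeholder rather than a real figure for any model:

```python
from dataclasses import dataclass

@dataclass
class EmissionsDisclosure:
    """Hypothetical machine-readable emissions record for a model card."""
    model_name: str
    training_accelerator_hours: float   # total GPU/TPU hours
    datacenter_pue: float               # power usage effectiveness
    grid_g_co2_per_kwh: float           # average grid mix during training
    training_t_co2e: float              # resulting tonnes CO2-equivalent
    offsets_t_co2e: float               # offsets or removals claimed
    inference_wh_per_100_tokens: float  # standardised serving efficiency

# Entirely invented example values, for illustration only.
example = EmissionsDisclosure(
    model_name="example-13b",
    training_accelerator_hours=250_000,
    datacenter_pue=1.1,
    grid_g_co2_per_kwh=250,
    training_t_co2e=45.0,
    offsets_t_co2e=45.0,
    inference_wh_per_100_tokens=1.2,
)
```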
Competition on efficiency is healthy; secrecy delays progress.
What's next?
The age of foundation models has arrived – and with it a surge in computing's climate cost. Yet the data show that smarter choices – smaller models, quantisation, efficient hardware and clever system design – can cut emissions > 10 × with little or no quality loss.
The emergence of mixture-of-experts architectures in models like Gemini 2.5 Pro and Llama 4 Maverick represents a promising direction, allowing selective activation of parameters rather than running entire trillion-parameter models for every query. Similarly, efficiency-focused models like Gemma 3 demonstrate that high-quality results don't always require massive computation.
The next time you marvel at an AI-generated poem or bug-fix, ask yourself: "How many milligrams of CO₂ did that cost?" By making that question normal, we drive the entire ecosystem toward models that are not only powerful, but planet-friendly.
Let's build AI we can be proud of – technically and environmentally.