AI tools like ChatGPT have changed our personal and professional worlds, with around 52% of American adults now regularly using a large language model (LLM). A new study details the immense environmental costs of our prompts, and it might make you think twice about which chatbot you use and how you use it.
Researchers from Germany’s Hochschule München University of Applied Sciences (HM) looked at 14 different LLMs, ranging from basic models to those with more complex knowledge bases, and posed the same 1,000 “benchmark” questions to each. The number of “tokens” each LLM generated in answering could then be translated into greenhouse gas emissions.
“The environmental impact of questioning trained LLMs is strongly determined by their reasoning approach, with explicit reasoning processes significantly driving up energy consumption and carbon emissions,” said first author Maximilian Dauner, a researcher at HM. “We found that reasoning-enabled models produced up to 50 times more CO2 emissions than concise response models.”
To understand how LLMs work and why they end up so environmentally costly, it’s important to look at tokens and parameters. When we type a prompt – be it a question or an instruction – we generate tokens, which represent pieces of our prompt. The LLM then generates many more of these as it gets to work, and LLMs with more advanced reasoning capabilities create even more tokens. Tokens are essentially the computing (searching, linking, assessing), and computing requires power. That power use results in CO2 emissions.
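To get a feel for what a token actually is, here’s a minimal sketch using OpenAI’s open-source tiktoken tokenizer. (This tokenizer is an assumption for illustration – the models in the study each use their own, so the exact counts will differ.)

```python
# Minimal tokenization sketch using OpenAI's open-source tiktoken
# library. The study's models use their own tokenizers, so counts
# here are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Explain the environmental cost of large language models."
tokens = enc.encode(prompt)

print(len(tokens))                        # how many tokens the prompt became
print(tokens)                             # the integer IDs the model actually sees
print([enc.decode([t]) for t in tokens])  # the text piece each ID stands for
```

Every one of those IDs, plus every token the model generates in reply, corresponds to a slice of GPU computation – which is where the power draw comes from.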
When an LLM is trained, it “learns” by adjusting parameters – numbers inside a neural network that control how the model predicts one token after another. A model with fewer parameters is simpler: it has fewer “weights” (numbers that tell the AI how important something is when it’s processing information) and will generate fewer tokens, but it might not be as accurate. On the flip side, a model with a high number of parameters will also have a high number of weights – and should have higher accuracy, but that’s not always the case.
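As a toy illustration of what “more parameters” means in practice (this is not code from the study, and no real LLM is this simple), here’s how quickly the weight count of even a single fully connected layer grows with its size:

```python
# Toy illustration: counting the parameters (weights plus biases)
# of one fully connected neural network layer. Scaling layer sizes
# like this is what takes models from millions to billions of
# parameters overall.
import numpy as np

def layer_params(n_in: int, n_out: int) -> int:
    weights = np.zeros((n_in, n_out))  # one weight per connection
    biases = np.zeros(n_out)           # one bias per output unit
    return weights.size + biases.size

print(layer_params(256, 256))    # 65,792 parameters
print(layer_params(4096, 4096))  # 16,781,312 parameters for a single layer
```

A 70-billion-parameter model like those tested in the study stacks dozens of layers far larger than either of these, and every parameter takes part in predicting each token.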
Unfortunately, the most complex and accurate LLMs are also the most energy intensive.
The scientists used an NVIDIA A100 GPU and the Perun framework (which analyzes LLM performance and the power required) to gauge energy consumption, applying an average emission factor of 480 gCO₂/kWh. They then had each of the 14 models answer 1,000 quiz questions covering philosophy, world history, international law, abstract algebra, and high school math. The LLMs tested were a mix of text-only and reasoning models from Meta, Alibaba, Deep Cogito and DeepSeek.
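The conversion from measured energy to emissions is simple arithmetic. Here’s a minimal sketch of that step using the study’s 480 gCO₂/kWh factor – the energy figure in the example is invented for illustration, not a measurement from the paper:

```python
# Sketch of the study's energy-to-emissions conversion: measured GPU
# energy in kWh multiplied by an average grid emission factor of
# 480 g CO2-equivalent per kWh. The 4.25 kWh input is a made-up
# illustration, not a number from the paper.
EMISSION_FACTOR_G_PER_KWH = 480

def emissions_grams(energy_kwh: float) -> float:
    """Convert measured energy use into grams of CO2-equivalent."""
    return energy_kwh * EMISSION_FACTOR_G_PER_KWH

print(emissions_grams(4.25))  # 2040.0 g CO2eq - roughly the R1 70B total cited below
```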
“The analysis of combined CO2eq [CO2-equivalent] emissions, accuracy, and token generation across all 1,000 questions reveals clear trends and trade-offs between model scale, reasoning complexity, and environmental impact,” the researchers wrote. “As model size increases, accuracy tends to improve. However, this gain is also linked to substantial growth in both CO2eq emissions and the number of generated tokens.”
They found that the reasoning models created an average of 543.5 “thinking” tokens per quiz question, while the text-only models averaged around 37.7 tokens for the same prompt. However, while more tokens mean more emissions, the researchers found that they didn’t also mean the LLM was more accurate – just more verbose.
The most accurate model was one of the reasoning LLMs tested, Deep Cogito 70B – 70 billion parameters – with an accuracy rate of 84.9%. It produced three times the emissions of similar-sized LLMs that returned more basic answers.
“Currently, we see a clear accuracy-sustainability trade-off inherent in LLM technologies,” said Dauner. “None of the models that kept emissions below 500 grams of CO2 equivalent achieved higher than 80% accuracy on answering the 1,000 questions correctly.”
DeepSeek’s R1 70B reasoning model was the most energy-expensive, producing 2,042 g of CO2-equivalent emissions, roughly the same as a 9-mile (15-km) trip in a gas vehicle. While this might not seem like a lot on a small scale, it’s worth remembering that more than 130 million Americans use some AI model regularly. And this DeepSeek model wasn’t the most accurate, either, with a 78.9% accuracy rate. The researchers noted that having this model answer 600,000 questions would create CO2 emissions equal to a London-New York return flight.
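That scale-up is easy to verify with back-of-envelope arithmetic:

```python
# Back-of-envelope check of the researchers' scale-up: 2,042 g of
# CO2-equivalent for 1,000 questions, extrapolated to 600,000 questions.
g_per_1000_questions = 2_042
questions = 600_000

total_tonnes = g_per_1000_questions * (questions / 1_000) / 1_000_000
print(total_tonnes)  # ~1.23 tonnes CO2eq - the flight-scale figure the study cites
```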
Alibaba’s Qwen 7B model was the most energy efficient (27.7 g CO2eq emissions), but it only managed 31.9% accuracy.
“On average, reasoning-enabled models required significantly more tokens in both testing phases,” the researchers noted in the study. “Particularly in the multiple-choice phase, reasoning models frequently struggled to produce concise answers, despite explicit prompts instructing them to only return the choice index. For instance, Deepseek-R1 7B generated up to 14,187 tokens on a single mathematical question, while standard models consistently produced minimal single-token responses.”
Energy consumption also varied depending on the prompt, with abstract algebra and philosophy questions requiring more reasoning than more straightforward subjects. It’s also worth noting that this study looked at only a sample of the LLMs we now have access to, and didn’t cover some of the big players, including OpenAI’s ChatGPT, Google’s Gemini, X’s Grok and Anthropic’s Claude.
While LLMs are certainly here to stay, and are likely to be further incorporated into our lives, the researchers hope their study can help users make better choices, switching between models depending on the task at hand. They hope it’ll also draw attention to the need for more energy-efficient reasoning models in the future.
“Users can significantly reduce emissions by prompting AI to generate concise answers or limiting the use of high-capacity models to tasks that genuinely require that power,” Dauner said. “If users know the exact CO2 cost of their AI-generated outputs, such as casually turning themselves into an action figure, they might be more selective and thoughtful about when and how they use these technologies.”
The research was published in the journal Frontiers in Communication.