Increases visibility
Glencore monitors more than 100 applications with Elastic Observability, in both development and production.
Accelerates root cause analysis
- Glencore applies filters to search millions of records and retrieve data in just a few seconds.
- Elastic's AIOps capabilities allow Glencore to detect anomalies and correlate them with root causes across applications, services, and infrastructure.
Mitigates revenue loss
With Elastic Observability, Glencore is better able to proactively detect and respond to potential issues with its business-critical trading applications before revenue is impacted.
Leading energy trading organization deploys Elastic Observability to minimize outages, proactively fix business-critical trading applications, and accelerate innovation.
Founded in the 1970s as a trading company, Glencore is now a major producer and marketer of commodities that employs 135,000 people around the world. Its oil division, including its traders, employs 1,500 people worldwide and generates roughly 10% of annual revenue.
Based out of London, New York, and Singapore, Glencore uses a mixture of in-house and commercial off-the-shelf (COTS) applications to trade, optimize, and schedule oil, gas, and power transactions across multiple markets.
James Lamb, Head of DevOps at Glencore UK Ltd, and his colleague Dave Johnson, Application Support Team Lead, are responsible for trading system availability and performance. "In the fast-paced world of energy trading, an unavailable or underperforming application can have a serious impact on revenues," says Johnson. "Our previous event-tracking tool was slow and difficult to use. This made it harder to resolve issues for the business."
As such, moving to a more flexible and performant observability platform was a priority for Lamb. "Our business requires reliable and responsive systems, plus the ability to deploy new features, but the old event-tracking tool was holding back the building, testing, and deployment of new applications," he says.
Start small, scale fast
Lamb first explored Elastic's potential in 2020, using the open-source version to test search, analytics, and monitoring features. Following a successful proof of concept, Glencore began rolling out Elastic Observability on Elastic Cloud, hosted on Microsoft Azure. "Elastic enables you to start small but grow fast. We began with a small Elastic Cloud cluster and built a logging library for our Energy Trading and Risk Management (ETRM) solution. We made some adjustments, fine-tuned the solution, and successfully rolled it out across the business," Lamb comments.
Today, Glencore runs two Elastic Cloud clusters. The first supports its non-production estate, taking in development, testing, quality assurance, and user acceptance testing. The second is entirely dedicated to production and business applications that require a segregated, business-critical monitoring environment.
Elasticsearch ingests millions of logs every week, equivalent to more than 8 TB of data. "We hold the data in hot, warm, and cold tiers. Data is also cost-effectively held in the frozen tier for up to three months so that when we need it, our operations team can use it to identify performance trends or bottlenecks," says Lamb.
Glencore has also deployed Elastic APM, which enables the team to quickly identify and resolve the root causes of issues with the trading applications. "We've deployed Elastic APM across more than 100 different applications," Lamb comments. "We can now track measurements like transaction latency and throughput, and see how they correlate with the infrastructure view. This helps us spot outliers and identify abnormal behavior."
"The front office team is probably the most advanced, with about 12 dashboards monitoring different applications. Most are related to their data platform, so they can see when things are failing in both non-production and production environments," says Lamb. Glencore's development team also builds dashboards for its own applications in the non-production estate and uses them to monitor a wide range of metrics.
Boosting visibility, accelerating root cause analysis
Elastic has significantly improved visibility into performance and availability issues, such as error handling, especially in production.
Adds Johnson, "During an outage, Elastic makes the problem analysis phase much more straightforward. You can constantly refine your search or switch to a different endpoint and update your query accordingly."
The ability to establish segregated Elastic environments between development and production drives performance for each team and improves communication when issues arise. "Elastic gives the support team the data, analytics, and dashboards to identify problems quickly and accurately," says Johnson. "In the past we would have a rough idea that something wasn't right. Now we can show our colleagues in development the specific issue and how they can help."
Elastic also simplifies diagnosing new issues and helps the team set up alerts to catch similar events proactively. "If there's an outage and nothing was recorded, Elastic enables us to quickly establish where we need to enhance our logging and then create an alert so that we get an early warning next time," says Lamb.
As Glencore expands the use of Elastic across the organization, Lamb is looking forward to taking an increasingly proactive stance. "Elastic gives our DevOps team the speed and visibility to fix issues while maximizing our trading application performance. It means we can more proactively respond to general service issues, but also lower-level difficulties where reference data is missing. It's a significant step up in our support capabilities," he concludes.