Case Study

Put some magic in your DevOps life

At Société Générale, they built an in-house PaaS for deploying containers using Vault, Consul, and several other tools. Hear how they did it.

Having a custom PaaS that runs containers changed the way Société Générale scales and deploys software. It also made life a lot easier for operations and developers. Stéphan Dechoux, a DevOps Architect at Société Générale, shares the secrets of how they built this platform using a few plugins, a little glue and open source tools like Vault, Consul, and Fabio.

Speakers

  • Stéphan Dechoux, DevOps Architect, Société Générale

Transcript

Thank you for being here with me. The name of my talk is "Put Some Magic in DevOps Life," and I will explain it to you. I'm very happy to be here to share with you some experiences we had at SG. We built something we call the platform, and I will explain why we built it and, for some parts, how we built it.

Qui je suis? In French, because I'm a French guy. Who am I? So I'm Stéphan Dechoux, as already said. I have worked for Société Générale for five years now, and I have 15 years of experience in IT. I'm a solution architect and the product owner of the platform. Before that, I was the architect of the platform, so I know it pretty well.

I'm very sorry, I have to leave just after my talk to catch my train. So if you have questions and want to send me an email or something like that, you can take a picture, and I will respond as best I can. I get about 500 emails per day, so it can be difficult. It's my professional address, so you can use it.

A quick agenda: a quick introduction of Société Générale (SG is shorter, so I will say SG from now on) and our activity in IT. After that, what is a platform? The definition of a platform at Société Générale. Then our journey: how we thought about it, how we built it, how we integrated some HashiCorp products into it and why, and the future of the platform and the features to come.

Who is Société Générale?

An introduction to SG: We are a bank, one of the biggest European banks, and the origin of the bank is French, so it's still a French bank. We have four main activities. The first one is retail banking, for people like you and me, to handle your bank account. Then international banking (banking everywhere), investment banking (the trading part, all the trading stuff), and insurance, for your car, your home, your life, your health, etc.

We are 160,000 employees worldwide, and about 24,000 employees just for IT. As you can see, we are a pretty old company, more than 150 years old. And this pretty old lady is in good shape: if you check our financial results last year, 3.8 billion euros. So not so bad, grandma.

IT. Now we will talk about IT, because that's what we are here for. Just a quick question; picture it. If we piled up all our data center equipment in one giant stack, do you have an idea of the height of this tower? Just a hint: think in Eiffel Towers, not in meters. Any idea? No? Eight times the height of the Eiffel Tower. So, not so bad: 2.7 kilometers.

Our network: We have enough optical fiber to cover the Tour de France race. It's about 3,000 kilometers of optical fiber.

Storage: We handle 40 petabytes of data, more than 200 years of HD video. And we'll double our storage in three years.

Our power: We have a lot of computing, about 100,000 cores, CPU and GPU mixed, so we can forecast the weather faster than Météo France. Météo France is the official organization in France for weather forecasting, and we can do it faster than them. So this is my playground, not so bad. Every day I work with that.

The digital transformation at Société Générale

What is a platform? Like other big and major companies, we started a digital transformation. I think a lot of you are doing the same thing right now. We embraced agile, new ways of doing things, DevOps, SecOps, the whole package.

But to continue our transformation, we need more tools. And why do we need more tools? Because we are in a very competitive world, with UBS for example, and French competitors too. We need to do things smarter, faster, and better, and always in a secure way. We need to provide new services to clients, and sometimes we need to do it before they think about it or need it. It's very important: time to market is crucial for us.

So we created what we call the platform. And we did something crazy: we talked to people and we asked them, "What do you want, and what do you need?" Maybe it seems obvious to a lot of you, but in a big company like Société Générale, it's not so obvious.

So we talked to people and said, "Okay, we have a digital transformation, we need to do some things, we want to try new things in a new way. What do you need?" We talked with a lot of people: devs, ops, security people, etc. This is a summary of the feedback and the expectations they had.

The devs do not care about plumbing. They do not care about how it works. They just want to deploy, to develop code easily, and to use the tools they like and love. They don't want to change everything.

For the ops, different topics. They want generic components, and everything must be automated, with APIs for self-service, for example. So if you want to create a database, you have an API to do that. You call the API and everything is done for you, with generic components and reproducible behavior.

So when five projects ask for a database, five databases will be installed in the same way. It's easier to manage, easier to debug, and easier to update. So it's very important.

Building an integrated, efficient, and secure platform

Integrated. Something totally important for us. This platform, which I will describe in a few minutes, has a lot of products running in it. It's best of breed: we use a lot of products, HashiCorp products and others. Those products must integrate with each other; they must be able to connect and talk to each other. If that's not possible, we create some glue to do it. But everything must be seamlessly integrated. And more importantly, everything must be integrated with your development framework.

I will use Spring Boot as an example. We have two stacks at SG, which I will describe later, for .NET and for Java, and we put a lot of things in the code, so you can interact with a service on the platform directly from your framework. It's very important for us. You need to have a very, very good developer experience, because we want to do things faster.

Everything must be resilient, in an active-active way, multi-data center, with data replicated from one data center to another. You cannot lose data or anything like that. So it's very important: every application that runs on this platform must be able to run in a clustered and distributed way.

We want to improve our infrastructure efficiency. We checked our infrastructure: we have about 10,000 VMs running right now, and the average usage of a VM is between 5 and 10%, in CPU or memory. So we want to improve efficiency. We want densification, massification, and to mutualize some services, to avoid having a lot of VMs that do nothing most of the year.

We are a bank, so everything must be secure: no data leakage. Everything is secret, everything is encrypted, the flows, the data, everything. We use mutual TLS, a strong RBAC model, MFA, etc., with DMZs, a lot of DMZs. So everything must be secure by design.

Questions about our platform

In one sentence, the platform: use it and do not think about it. It's very important.

Our journey. We decided to use containers for this platform. It's a logical choice, the choice we have right now. So everything will be containers. Everything regarding the applications (not the other services, but the applications) will run in containers.

When you start to work with containers, some things change, because you are not doing the same things as before. I have eight examples. People asked us: how do I access my containers? If I scale up or scale down, the IP changes, and I don't know where my container is. I cannot call the DNS API to put a CNAME on it. So they asked us: how do I publish my services, how do I expose them, how do I handle my configuration? Before, I had property files on my VM; I started my Java process and it worked. Now I don't know where my container will run, so how do I handle my configuration?

How do I handle my secrets? Same thing. Before, we had encrypted files on a VM that you could use for secrets, so now how do I do that? How do I manage and store my secrets in a secure and easy way?

How do I register my services? Some applications are already in a microservices architecture and are using Consul right now, in a dedicated cluster with dedicated Consul agents. When they run on the platform, how do they use Consul there? It is different than before: do I have to handle the agent or not? Will my Consul agent be a container or not? These were questions we had to answer.

How do I handle my certificates? Every time you start a container, or every time you scale, you will have a new IP, a new container ID, etc. So you have to create certificates dynamically to keep a full TLS security chain. They asked us, "I cannot do that, because for the moment I have to create a ticket to get a certificate, so how do I do that with a container?"

General questions: how do I monitor and log my containers, how do I check my logs? Again, before, they had a VM: they connected to the VM over SSH, read the log file, and could debug. Same thing for monitoring: do we have a solution for monitoring, or do they use their own solution to monitor the infrastructure or the application itself?

So, do I need to change my monitoring tools? What do I have to do to monitor and retrieve my logs? And obviously: do I need to modify my code? Do I have to throw everything away to run in a container, or not? Good question. Can I use the same frameworks I used before, can I use the same tools? Etc. There were more and more questions, but this is just to sum up.

So we'll talk about the platform and how we integrated Vault, Consul, and Fabio. Because we are at HashiCorp's conference, I will talk about those three.

The Société Générale platform

This is our platform. It's based on Docker Enterprise Edition, with three main components. On the left, the UCP, the Universal Control Plane: the manager of the platform that schedules tasks, creates resources, etc., on the engines. We call them engines or workers.

We have a lot of nodes. Each one is a VM with a Docker engine running on it, and everything is scheduled onto them, so all your application containers run on these workers. Nothing runs as an application on the UCP nodes, only UCP itself. On the bottom, you have the DTR, the Docker Trusted Registry: the component where you store your Docker images.

On the platform, we integrated some services, or we use some existing services. For Jenkins, when you onboard on the platform, we create a Jenkins master as a container; you can see the Docker logo on the bottom right. When you build or deploy something, it pops up a container as a Jenkins slave. So everything is a container where Jenkins is concerned, and we have, I think, 120 Jenkins masters running on the platform right now, for about 500 projects.

Everything is integrated with our internal GitHub and our artifact repositories; we are using Nexus and Artifactory. So when you want to build your application and your container image, Jenkins starts, retrieves everything from GitHub (the sources), compiles it, and pulls the dependencies from our artifact repository. It pushes the artifacts you build to Nexus, and after that you create your image in your pipeline and push it to the DTR, the Docker Trusted Registry. After that you can deploy it and launch some unit tests, anything you want. In fact, you are the admin of the Jenkins master and you can install all the plugins you want, and we created or very deeply modified plugins to get this kind of behavior. A rough sketch of those pipeline stages is shown below.
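A minimal sketch of those stages as plain shell, under stated assumptions: the hostnames and project names (github.sg.local, nexus.sg.local, dtr.sg.local, abc-app) are hypothetical placeholders, and the real pipeline runs as Jenkins jobs rather than one script.

```bash
# Hypothetical pipeline stages; hostnames and project names are placeholders.
git clone https://github.sg.local/abc/abc-app.git && cd abc-app

# Build and publish the artifact to the internal repository (Nexus/Artifactory)
mvn deploy -DaltDeploymentRepository=nexus::default::https://nexus.sg.local/repository/releases

# Build the image and push it to the Docker Trusted Registry (DTR)
docker build -t dtr.sg.local/abc/abc-app:1.0.0 .
docker push dtr.sg.local/abc/abc-app:1.0.0

# Deploy to the platform and run the tests
docker stack deploy -c docker-compose.yml abc-dev
```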

We are using Sysdig for monitoring and alerting, and in a few weeks we will provide some dashboards for the applications: generic dashboards with a restricted scope, so that applications, with authentication, can only see their own containers, their own services, their own stack, etc.

For persistent storage (some containers need to store data, for example the Jenkins masters), we use two things. The first is a NetApp plugin, Trident, which we modified, again for security purposes. This plugin can create dynamic volumes on a NAS, a NetApp ONTAP NAS, and mount them directly in a container. So when you start your container, you say, "Okay, I need a volume of ten gigabytes," for example. It will be created on the NAS and mounted directly in your container.

If something fails, your host fails or your container fails, it will be restarted elsewhere, and the mount will be done again for this container. So you retrieve your volume, you retrieve your data, and, like everything else, it is replicated between data centers.

The second plugin is ContainX's cifs plugin, which we modified too, security purposes again. It lets you mount an existing Samba or CIFS share in a container: if you already have a CIFS or Samba share, you can mount it directly in your container with this plugin.
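For illustration, here is roughly what consuming those two plugins looks like from the Docker CLI. This is a sketch assuming the stock plugins' conventions (Trident's netappdvp-style driver options and the ContainX netshare cifs driver); SG runs modified forks, so the exact flags on their platform may differ, and the volume names are hypothetical.

```bash
# Ask the (stock) NetApp driver for a 10 GB volume, carved out of the ONTAP NAS
# on demand, then mount it in a container. If the task is rescheduled, the same
# volume is simply remounted on the new node.
docker volume create --driver netapp --name jenkins-home -o size=10g
docker run -d -v jenkins-home:/var/jenkins_home jenkins/jenkins:lts

# Mount an existing CIFS/Samba share with the (stock) ContainX netshare plugin;
# in the SG fork, the share credentials come from Vault instead of the CLI.
docker volume create --driver cifs --name fileserver/teamshare
docker run --rm -v fileserver/teamshare:/data alpine ls /data
```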

After that, every log and every metric is sent to our data lake. We created some dependencies, some libraries, and you have to use them to be able to send every log and every metric to our monitoring services. We are using Hortonworks for storage; Elasticsearch, Kibana, and Logstash for indexing and for our dashboards; and Zipkin to be able to track every call and the response times of our microservices. When you start your application, you have a dependency to add, and everything is done for you. You get a dedicated Zipkin dashboard; everything is automated.

Now for the main part: Vault, Consul, and Fabio. We are using Vault for secret management and certificate storage. We are using Consul as a service registry, for discovery, and as a key-value store for configuration, and Fabio, running as a container, for dynamic L7 load balancing.

Consul at SG

An important thing: I talked about Spring Boot, and Spring Boot can automatically register your service in Consul. We modified it, because it was missing some options and some fine-tuning. So we created a bootstrap and some starters in the Spring Boot framework to improve it and to be able to do new things with Consul, which I will explain later.

The Consul integration with our stack: when you start your container, it is automatically registered in Consul. The service name, the service ID, the health check, everything is done for you. You just provide the name of your application, the kind of environment (development, for example), and the script needed for the health check, and everything is done for you.
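Roughly, this is what a project supplies on the Spring Boot side; a sketch assuming stock Spring Cloud Consul property names (SG's modified starter adds its own options on top), with a hypothetical application name.

```bash
# Hypothetical bootstrap.yml; the platform's starter fills in the container ID
# and the Docker-type health check described later in the talk.
cat > src/main/resources/bootstrap.yml <<'EOF'
spring:
  application:
    name: abc-dev                       # trigram + environment (hypothetical)
  cloud:
    consul:
      host: consul-agent.paas.local     # resolved per-node via DNSMasq (see below)
      port: 8500
      discovery:
        instance-id: ${spring.application.name}-${random.value}
        health-check-interval: 10s
EOF
```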

The configuration of the application is retrieved directly from the KV store. You can use Consul Template before starting your containers to retrieve some configuration, or you can retrieve the same thing directly in your code. Since everything is in Consul, you can enable or disable a feature directly from your code. We have events, and we watch what happens in Consul; it's provided directly in our stack.
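As a sketch of the Consul Template path: render a property file from the KV store before the application starts. The key paths are hypothetical and follow the ABC/DEV prefix convention described in the ACL section below.

```bash
# Template reading two hypothetical keys from the KV store
cat > config.ctmpl <<'EOF'
feature.x.enabled={{ key "ABC/DEV/feature.x.enabled" }}
db.url={{ key "ABC/DEV/db.url" }}
EOF

# Render once at container startup, talking to the local agent
consul-template -once \
  -consul-addr consul-agent.paas.local:8500 \
  -template "config.ctmpl:application.properties"
```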

For Vault, we created a plugin. I'm pretty sure you have already used Docker secrets: when you create a Docker secret, you retrieve it in your Docker container as a tmpfs file under /run/secrets. So we did the same thing. We created a Vault plugin: every secret is stored in Vault, and when you start your application, you declare the secrets in your Docker Compose file, as a source and a target, but everything is stored in Vault and injected into your container as Docker secrets. So in your container you will find, under /run/secrets, "my-secret" for example, but everything is actually stored in Vault and not in Docker secrets, to avoid a dependency on the Docker EE solution. When you have to upgrade or build a new platform, everything is in Vault; it's simpler to migrate.
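A hedged sketch of the developer-facing side: the secret is written to Vault, then declared in the Compose file as usual. The paths, names, and the exact way SG's custom plugin intercepts the external secret are assumptions for illustration.

```bash
# Store the secret in Vault (hypothetical path)
vault kv put secret/ABC/DEV/db password='s3cret'

# Declare it in the Compose file; at startup it appears in the container as
# a tmpfs file, /run/secrets/db_password, exactly like a native Docker secret.
cat > docker-compose.yml <<'EOF'
version: "3.1"
services:
  app:
    image: dtr.sg.local/abc/abc-app:1.0.0
    secrets:
      - db_password
secrets:
  db_password:
    external: true    # resolved from Vault by the SG plugin, not by Docker EE
EOF
```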

We also created an API to generate and regenerate certificates dynamically, and everything is stored in Vault too. It's very important, because when you start your application, you retrieve the certificate, you create your truststore or keystore, etc., with a good certificate, and you can start with it, and everything is signed by our own certificate authority.
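SG's API sits on top of Vault, but the underlying mechanics are close to Vault's stock PKI secrets engine. A minimal sketch with hypothetical role and domain names:

```bash
# One-time setup: mount the PKI engine, create a CA and a role
vault secrets enable pki
vault write pki/root/generate/internal common_name="paas.local" ttl=87600h
vault write pki/roles/abc-dev allowed_domains="paas.local" \
  allow_subdomains=true max_ttl=72h

# At container startup: issue a short-lived certificate for the new instance
vault write pki/issue/abc-dev common_name="abc-dev.paas.local" ttl=24h
```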

Regarding the Vault and Consul infrastructure, we have two clusters: one for development and a second for production, with the same infrastructure and the same number of nodes, but totally separated. We have five nodes for Consul, for a resilient infrastructure, and for Vault we have five nodes too, plus five Consul nodes dedicated to Vault. And we have the same thing in AMER and in Hong Kong. So we have this infrastructure for dev and prod in AMER, Hong Kong, and Paris. They are not joined right now; in a few weeks, that will be the case.

Using Consul as a service

So, now, some technical details. We decided to offer Consul as a service on the platform, but to do that, we had a lot of questions. For example: will my Consul agent be a container or not? Do I need to provide a Consul agent for each project, or just one per VM? Because we want to keep the way Consul works: you have one Consul agent on your VM, you register locally, and everything goes to the servers after that, but you have just one Consul agent locally on your server. We don't want remote agents or anything like that.

How do I reach my Consul agent, with localhost or not? How do I handle multi-tenancy and service isolation? Because this platform is multi-tenant, and that's very important for us. How do we handle network isolation? How do I health check my services? How do I register and deregister my services? And do I have access to the Consul UI?

So, we made some choices. Is running the Consul agent as a container okay, with only one per server? Then how do I use localhost? In my container, if I try to reach localhost, I cannot reach the agent container, so it's not possible. Or I can use another network; it's possible, but my issue with that is that every time a new project arrives and creates a new network, I have to attach it to the Consul agent container, and disconnect it when the network disappears. It's complicated. Or I can create a mutualized overlay: one big overlay network, everything connected to it, so you can reach your Consul agent. But if I do that, I lose the network isolation.

So, we decided to run the Consul agent as a process, just one process per node. But then, how do I use localhost? You cannot, so we use DNSMasq on each host. On each node, we define a DNS entry called consul-agent.paas.local, and this DNS entry returns the host IP, in fact the IP where the Consul agent itself listens.

DNSMasq listens on the Docker bridge interface, and we add the IP of the Docker bridge as a DNS server in the Docker daemon. So if you do an nslookup or a ping, your container will try to reach consul-agent.paas.local, get the local IP back, and reach the local Consul agent just as if you were on a VM, outside the container world. With this kind of feature, we keep the original way Consul works; we don't want to twist the design.
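A sketch of that per-node setup, with hypothetical addresses: 172.17.0.1 is Docker's default docker0 bridge IP, and HOST_IP stands for the node's own address where the Consul agent listens.

```bash
HOST_IP=10.0.0.12   # hypothetical node IP, where the local Consul agent listens

# Any container resolving consul-agent.paas.local gets its own node's agent
cat > /etc/dnsmasq.d/consul.conf <<EOF
address=/consul-agent.paas.local/${HOST_IP}
listen-address=172.17.0.1
EOF
systemctl restart dnsmasq

# Point the Docker daemon at DNSMasq on the bridge so containers use it as DNS
cat > /etc/docker/daemon.json <<'EOF'
{ "dns": ["172.17.0.1"] }
EOF
systemctl restart docker
```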

With this solution, how do I health check? Everything runs on overlay networks, to have network isolation, so my Consul agents can't reach my containers if they are running on an overlay. I cannot do an HTTP health check; it doesn't work. So we use a Docker exec check instead. You have multiple kinds of health checks in Consul: TTL, TCP, HTTP, script, and Docker exec. We modified Spring Boot to do that: when you start your container, it registers a health check for you with the script you gave as a parameter. It takes the ID of the container and creates a health check of the Docker type for you, directly. The Consul agent does a docker exec, executes the script, and if the script returns zero, it's okay; if it's not zero, you have a service failure. So it's pretty handy.
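On the Consul side, this is roughly what such a registration looks like, using Consul's stock agent API for Docker-type checks; the container ID, service ID, and script path are hypothetical.

```bash
# The agent itself runs `docker exec <id> /bin/sh /app/healthcheck.sh`,
# so no network path into the overlay is needed.
curl -s -X PUT http://consul-agent.paas.local:8500/v1/agent/check/register -d '{
  "Name": "abc-dev-check",
  "ServiceID": "abc-dev-1",
  "DockerContainerID": "f972c95ebf0e",
  "Shell": "/bin/sh",
  "Args": ["/app/healthcheck.sh"],
  "Interval": "10s"
}'
```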

How do you protect your Consul agent from the outside? We have some applications that already run Consul, and their teams might say, "Okay, I don't want to run my own agent anymore; I will use the platform's Consul agent remotely." We don't want that, so we created iptables rules to block connections from the outside to the Consul agent, and we put an Nginx on top of it: one Nginx per node, and a load balancer on top of that. And we added some filtering rules on the Nginx to only allow GET, for the verb, on the catalog of services, so from outside you can only read services. You cannot register yourself or modify services. For the KV store, we allow everything, because sometimes it's easier to modify key-values through the UI. We give access to the UI with the same filtering.
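A hedged sketch of what such an Nginx filter could look like; the port and the exact location rules are assumptions, not SG's actual configuration.

```bash
cat > /etc/nginx/conf.d/consul.conf <<'EOF'
server {
    listen 8501;

    location /v1/catalog/ {
        limit_except GET { deny all; }    # catalog is read-only from outside
        proxy_pass http://127.0.0.1:8500;
    }
    location /v1/kv/ {
        proxy_pass http://127.0.0.1:8500; # KV store left open for convenience
    }
    location /ui/ {
        proxy_pass http://127.0.0.1:8500; # UI, behind the same filtering
    }
    location / {
        deny all;                         # everything else (agent API, etc.)
    }
}
EOF
```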

We have some use cases, for example, clients that use Consul from the desktop to discover services and backends. We needed this kind of access because we don't want to install a Consul agent on the desktops. We built this for that purpose.

Now, how we handle the multi-tenancy of this Consul. We are using ACLs (pronounce it how you like; we will say ACL). Everything you create in Consul, in the KV store for example, or in services, is prefixed by the name of your service and your environment. For example, we have a trigram naming convention for applications, for example "ABC". We have an API to create the ACLs for you: if you want to onboard on Consul, you call an API, and you receive a set of ACLs. And we create a directory in Consul with the name of your application, "ABC/", and your environment, for example "/DEV". With this ACL, you only have the right to write in that directory. Same thing for the services: you can only create or read services named after your application and your environment, "ABC" plus your environment. This is how we handle multi-tenancy with Consul.
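For illustration, in the legacy Consul ACL syntax of that era (rules match by prefix), the per-application token could be created along these lines; the token name and exact rules are assumptions:

```bash
# Legacy ACL API (Consul < 1.4): one token per application and environment,
# allowed to write only under its own KV prefix and its own service names.
curl -s -X PUT http://localhost:8500/v1/acl/create \
  -H "X-Consul-Token: ${MASTER_TOKEN}" -d '{
  "Name": "abc-dev",
  "Type": "client",
  "Rules": "key \"ABC/DEV/\" { policy = \"write\" } service \"ABC-DEV\" { policy = \"write\" }"
}'
```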

Of course, if you need to share some things with other people or other projects, if you need to read the services of, I don't know, the "DEF" application or something, we can create an ACL rule for that; it's possible. But by design, we don't want that.

We are using Fabio as a dynamic L7 proxy. When you onboard on the platform, again, we create two or four dedicated Fabio instances for you, as Docker containers, across our two data centers, and we create a load balancer that points to these instances (in fact, multiple load balancers, to have some HA). We create an entry in Vault for the certificates. So when you want to expose something to the outside, you just have to connect to this network and publish your service in Consul, and that's done. We create a dedicated set of ACLs for Fabio too, to only read and write your services. It's pretty simple, because you have auto-registration in your code: you start your Spring Boot app, you're already registered in Consul, and you are served by Fabio directly. And if you create some certificates during startup or something like that, Fabio is connected to Vault, detects the change in Vault, and retrieves the certificate immediately.
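Fabio needs no configuration of its own: it watches Consul and builds its routing table from urlprefix- tags on services, which is its stock mechanism. A sketch with a hypothetical host name; the Vault certificate source shown in the comment is also stock Fabio, though SG's exact flags are not given in the talk.

```bash
# What the auto-registration adds for a service that should be exposed:
# a urlprefix- tag that Fabio turns into a route.
curl -s -X PUT http://consul-agent.paas.local:8500/v1/agent/service/register -d '{
  "Name": "abc-dev",
  "Port": 8080,
  "Tags": ["urlprefix-abc-dev.paas.local/"]
}'

# Fabio itself starts roughly like this, reading certificates from Vault:
#   fabio -registry.consul.addr consul-agent.paas.local:8500 \
#         -proxy.cs 'cs=sg;type=vault;cert=secret/fabio/certs'
```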

So it's pretty easy for the developer, and everything is secure by default, because you have to use HTTPS; it's not possible to do anything else. And we dedicate a Fabio instance and a load balancer per application to be able to fine-tune each one.

Vault at SG

Vault. We are using Vault in many ways. The first is PKI as a service, the API to generate certificates. Everything is stored in Vault, in the dedicated secret path of the application. Fabio uses it to retrieve certificates dynamically. And we use it for the secrets plugin.

We are also using it with the cifs plugin, because when you have to mount a CIFS share, you need credentials, and we store those credentials directly in Vault. When you mount the share, the cifs plugin retrieves the credentials from Vault, connects to the share, and mounts it in your containers. So everything is automated.

A typical workflow when you start a container: You retrieve your configuration from the Consul KV store; you can use Consul Template to generate a property file, for example. You retrieve your secrets from Vault; they are stored in tmpfs, like Docker secrets. We generate or retrieve your certificates directly from Vault. We start your application. Everything is registered in Consul, and you can retrieve more configuration if needed, directly in your application.
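Condensed into one hedged entrypoint sketch (every path and name here is hypothetical, and the keystore-building step is elided):

```bash
#!/bin/sh
# 1. Configuration from the Consul KV store
consul-template -once -consul-addr consul-agent.paas.local:8500 \
  -template "config.ctmpl:application.properties"

# 2. Secrets were already injected by the Vault plugin as tmpfs files
DB_PASSWORD="$(cat /run/secrets/db_password)"; export DB_PASSWORD

# 3. Certificate from Vault (extract the fields and build the keystore here)
vault write -format=json pki/issue/abc-dev \
  common_name="abc-dev.paas.local" ttl=24h > /tmp/cert.json

# 4. Start the app; Spring Boot registers it in Consul, and Fabio
#    starts routing to it once its health check passes.
exec java -jar /app/abc-app.jar
```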

Fabio configures itself, because it watches what happens in Vault and Consul and modifies its configuration accordingly. If you scale up or scale down, it modifies its configuration dynamically, so you don't lose connections, because it doesn't restart itself. And when all the health checks are okay, Fabio serves your requests.

This is a typical worker: Docker EE is installed, plus a lot of other pieces, like the cifs plugin, the NetApp plugin, DNSMasq, Syslog, Fabio, the Sysdig agent, the Consul agent of course, and HRM, the HTTP Routing Mesh for Docker EE, for dedicated solutions. So this is a typical worker we have at SG, with all the services connected to it.

We use other HashiCorp products too, like Vagrant: we're using it to deploy development environments on desktops. We're using Terraform as infrastructure as code to deploy resources in Amazon and Azure. And Packer: it's our OS bakery for the cloud service providers. We're using Packer in an OS bakery to create an AMI for Amazon, for example. We have a framework to do that; it's open source.

What's next for SG

What's next? A lot of things. We want to leverage the public cloud and managed services. We want to do things faster and smarter, and Amazon is faster and smarter than us on many services; we cannot compete, so we're going to use their services. I think we will have some integration between on-premises and the cloud. I'm not really sure, but maybe; in any case, we have to use their services.

Not everything will run in the public cloud, so we have to have a next-generation data center on-premises. We need SDS and SDN, software-defined storage and software-defined networking, because these containers are very dynamic, so we need a new way of doing things at the lower levels of the infrastructure, like storage and network.

So what else? We want to have a solution for doing lambda functions on-premises. So I cannot talk too much about it because I will do a demo in two weeks. I am working with a team in America, they are doing incredible work. I am pretty proud of what they did, it's pretty amazing. I think I will demo it in two weeks privately. So maybe in a few months I can talk about it, but for now I cannot.

Open banking, because we provide services outside of SG, to the internet. We need to find a way to deploy a service in a very easy and fast way, because when you want to do that right now, you have to create VMs in multiple DMZs and open the firewalls, so it takes months. We've worked on a new design for doing that for over a year now, and we are putting it in place right now. And it will not be weeks, months, or days. No, it will be minutes. We will be able to expose things to the internet directly, within minutes, in a secure way.

For the security part, we want to enforce security with global rules. We want to make security transparent for the developer; everything must be secure by default. We could use Consul Connect, for example, and it would be a good candidate for that, or Sentinel, or tools like Cilium, Aqua, or Twistlock to check what happens in the containers and that they are secure.

So that's all from me. Team Spirit, that's our motto at Société Générale. Thank you.
