GKE Architecture Patterns for Retail and Financial Services (Cloud Next ’18)


[MUSIC PLAYING] ALLAN NAIM: Good morning. How’s everybody doing today? Awesome. So we have a full talk today. It’s very packed. We’re going to cover some
very interesting topics. So I’m Allan Naim. I work on Google Kubernetes
Engine at Google. I’m part of the product
management team there. And I am joined today
by Wei Dang, who is VP of product at StackRox. So StackRox provides
container security. It is one of the options that’s
available when you’re running Google Kubernetes Engine. And we’re also joined by
Alex van Boxel, who is a Cloud Architect at Vente-Exclusive. And then Connor
Gilbert, with StackRox, will actually be wrapping
up our talk today with a really, really cool demo. So really looking
forward to that. So let me frame the conversation
before we dive into things. The talk is called GKE
Architecture Patterns for Retail and
Financial Services. And I bet you’re asking, why
retail, why financial services? What is the similarity here? Why are we talking about
these two industries together? And the reason for that is,
from an executive priority standpoint, there
are similarities in terms of what these
two industries are really trying to achieve. Although the regulatory challenges you face in financial services versus retail are very different, from an executive priority standpoint some of the commonalities include this: your data is your most important asset in both industries. And the two industries are
really transforming as well. Retailers, whether it’s brick
and mortar or e-commerce are becoming software
companies in many ways. The need to be able to
deliver software quickly, push that software to the edges, and transform that customer experience in real time is crucial– really crucial
to competing. And both industries are
really in the midst of this. With the advent of open
source technologies, in particular containerization
from several years ago, it’s presented
a way for companies within these verticals to really
be an internet scale company and figure out ways
to really bring the experience to the user
as efficiently and quickly as possible. And a lot of it is software. How do you build that software
to enable that and make that happen? Our customers are becoming
smarter and smarter. There are so many signals
associated with customers that you need to keep
track of and be aware of in terms of tailoring that
experience for your customers. At the same time,
this is a study that was done very recently,
75% of customers out there expect a
consistent experience in terms of dealing with
institutions, whether it’s mobile, whether it’s
online, whether it’s going in to the brick and mortar
location, or a banking center. At the same time,
64% of consumers expect that organizations
will respond and interact with them in real-time. We live in a society where
people really have no patience. You want response
immediately, and you want that response to be continuous. And then two thirds of
consumers will actually switch brands if they’re
treated as numbers instead of individuals. So you have to tailor that
experience for that customer. And how do you do that? Again, using software. So just a show of hands, how
many people here are actually running containers
in production? Very cool. How about using Kubernetes? And using Kubernetes
on Google Cloud– Kubernetes Engine? That’s fantastic. And then, how many
developers here? Great. This is a great composition. So the technology impact
is really transforming how people do things today. And a lot of it is really being
pushed by containerization and by efficiencies around
continuous integration and continuous delivery, as
well as machine learning that’s really bringing a facet in
terms of how you make decisions, and bringing the cloud
into a lot of the workflow. And for companies that are
really in the midst of this today– and in particular, I'm seeing this a lot with retail and financial services– there are only two
ways that you can go. At the end of the
day, we all want to get to that top right corner. But it’s either up or out. And really, the
up path is taking what you have today in your
data centers and modernizing it. A lot of companies
that I talk to are actually building
PaaSes on top of their existing
infrastructure, because they want that
speed, that agility. On the other hand, you
can move to the cloud– lift and shift–
and then modernize when you’re in the cloud. So we see both of these
trends that are happening. But eventually, we all want to
get to that top right quadrant, and there’s always going to
be a component that’s sitting, whether it’s on premises,
or multi-cloud– but you want to be able to
manage these investments in a consistent fashion. So from an internet-scale
application architecture, at the end of the
day, we all want to be able to push releases
n number of times each night. We want to be able to
build applications that are inherently resilient
and scale really well without having to
worry about the underlying infrastructure. But as a developer, all you
really want to care about is your application, right? You really don’t want to
care about the underlying mechanisms, like your
networking, your load balancing, your health checks, how you
implement that resilient architecture. Of course, it’s
important, but you really want to focus on the application
and then let the underlying framework really do a
lot of the heavy lifting to abstract away
the complexities for that application
underneath that application. And really, that’s
where Kubernetes came into the picture. Containers solve a
very important problem in terms of how do you
package up an application, make it portable, run it
in such a way where you’re taking all these
various dependencies and you have a really
predictable way of running that application
across different environments. But then it introduces
a lot of complexities, such as, how do you
scale these applications, how do you run these containers
across a pool of computers? Containers are ephemeral–
how do you mount storage to these containers
and ensure that you can run not just
12-factor stateless applications, but stateful
applications as well? How do you do that in
a consistent fashion? What about day two type models
around operations, monitoring, and logging? How do you expose containers
to the outside world so that they can be consumed
by external applications? Those are the kinds of
things that Kubernetes really helps solve. And Google has had a
managed Kubernetes offering for the last three years. It went GA– General
Availability– three years ago, called Google
Kubernetes Engine, which is a managed service
around Kubernetes that basically makes the cluster
creation process an API call. You specify how many nodes you want, where you want your nodes to live, and the type of resiliency and highly available infrastructure that you want underneath your cluster, and then you're in business– you've got an API and a command line interface that you can now use to deploy and run your applications.
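As a sketch, that API-call-style cluster creation looks like this with the gcloud CLI (the cluster name, zone, and node count below are illustrative, not values from the talk):

```shell
# Illustrative only: create a three-node GKE cluster with
# auto-repair and auto-upgrade enabled, then fetch credentials
# so kubectl can deploy applications to it.
gcloud container clusters create demo-cluster \
    --zone europe-west1-b \
    --num-nodes 3 \
    --enable-autorepair \
    --enable-autoupgrade

gcloud container clusters get-credentials demo-cluster --zone europe-west1-b
```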
So what I wanted to do now, rather than sit here and explain all the various great things about Google Kubernetes Engine and Kubernetes, is let you hear it directly from our customers and our partners, because I feel that would be tremendously more valuable. So the first thing
I’m going to do is I’m going to invite Alex to
come and talk about how Google Kubernetes Engine actually
enabled them to become more agile, and move to a
microservice architecture, and all of the
challenges that they faced going down that journey. And then, we’re going
to have Wei bring a different perspective. As a technology partner,
they offer a solution that actually helps
customers really build secure environments. I mentioned the important thing
is ensuring your data is safe. Well, Wei is going to bring
a different perspective around what are
some of the patterns that you should be
implementing to ensure that your environment–
the container environment– is protected and secured. And then we’ll show
you a demo around that. So without any further ado,
Alex, the floor is yours. ALEX VAN BOXEL:
Thank you, Allan. So I'm Alex. I'm a cloud architect. In a company of our country and size, that means you also have to get your hands dirty, and you also code. But let me first present Vente-Exclusive. It's a French name, so it's pronounced Vente-Exclusive. Vente-Exclusive is a special kind of e-commerce business: flash sales. What does that mean? Every day we have around 100 sales opening and closing. On average, a sale is
open for around a week. So as soon as that
sale closes, the goods are ordered with the suppliers
and shipped to our customers. So it’s not the next
day delivery in general. So about two years ago, we were
acquired by a bigger company. And that acquisition actually
brought some special things. We could operate
in northern Europe. It was a French company
that acquired us, so that means
everything above France. And that acquisition brought
some special cases for us, because we were
normally operating in a small part of Europe. And we were running a monolith. And going to the
north of Europe, that means more countries
with special business needs, more languages, and
other currencies. Not all of Europe
is running the euro. So with that, we
said we’re going to develop the
platform of the future and go for microservices. Running a monolith for 10 years
brought some extra advantages. We actually knew our
business quite well. So we put silos in
our architecture– the shop,
the warehouse, data. And within those
silos, everybody could make their
microservices on a certain use case or domain. The technology that
could be used there could be really
tailored to the use case or the competence of the team. As we were mainly a .NET company, most of our services were created in .NET. But in the data team
with data scientists, Python is more in
their skill sets. So there it turned out that
we had a lot of Python code and even Go if we really
needed to have performance. For the databases, too, the choice was whatever best fit the use case. So it could be relational
or non-relational. It didn’t matter. They had their own choice,
but a very important thing is that each microservice
is data separated. So a very good
isolation between them. And we started re-platforming more than a year ago. And why did we pick Kubernetes? Well, until Kubernetes came along, and Docker came along, containerization was really a technical thing. And Kubernetes was the first orchestrator that actually brought some concepts that a human could understand. And luckily we were already on
Google with our data platforms, so we actually tried out GKE. We also asked for an offer on
our current hosting provider, but by the time that they
came with a real offer we were already running
some stuff on GKE. But going to microservices
brings some extra challenges. So it’s more than
three, but I picked three of my favorite ones: the developer
experience, observability, and also the data silos. Developer experience–
so if you’re used to creating a
monolith and then have to go to make
very big blocks, you have to make sure the
developers don’t lose speed. So in the beginning, we
did a lot of small things, and made sure that
the development speed didn’t go too low. So the first thing
was access control. Luckily, Google had
the Google accounts that we really could sync
with our on-premises Active Directory. And with just a snap, they had access to the cluster. And with those Google
accounts, they also have access to services
like logging, tracing, and even if they
needed BigQuery, they could have
access to BigQuery. Another thing that
is very important and helped us in
the beginning, was to make sure that you have a
set of guidelines between teams. It’s very important. One of them is, make sure
that your microservice has a single name. You could have some silly names like boreporting– or is it back-office reporting? But the thing is that it really helps with automating the process. So with conventions, when there was an API at the back, then we automatically exposed that to the outside world. Was it a batch? Then it was run as a batch system. It really helps to just make the developer flow go well.
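A convention like that is easy to enforce in the deployment pipeline. Here is a minimal sketch of the idea– the naming rule and the derived hostname pattern below are hypothetical, not Vente-Exclusive's actual conventions:

```python
import re

# Hypothetical convention: lowercase words joined by single dashes,
# e.g. "bo-reporting" rather than "boreporting" or "BOReporting".
NAME_RE = re.compile(r"^[a-z]+(-[a-z]+)*$")

def validate_service_name(name: str) -> bool:
    """Return True if the microservice name follows the convention."""
    return bool(NAME_RE.match(name))

def external_host(name: str, domain: str = "example.com") -> str:
    """Derive the externally exposed hostname from the service name,
    so an API can be published automatically by convention."""
    if not validate_service_name(name):
        raise ValueError(f"service name {name!r} violates the convention")
    return f"{name}.api.{domain}"

print(external_host("bo-reporting"))  # bo-reporting.api.example.com
```

With a rule like this in place, the pipeline can decide mechanically whether a service gets an external endpoint, which is the automation benefit described above.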
Another thing is, make sure that you have contracts defined between the microservices. We actually learned the hard way that going code first is not the best approach. So we are now slowly moving to contract first, and with that to gRPC. And with some cool tools like Cloud Endpoints and probably [INAUDIBLE] as well, it eases the way, because they can expose your gRPC interfaces as REST as well. Then, don't forget this– continuous deployment. Start in the beginning. All those conventions,
and so on, really help automate
the process. But make sure that all those automation processes are centered around the continuous deployment pipeline. We chose GitLab because it is a
nice combination with git repos and deployment pipelines. And it has some nice tools
for code reviews as well. We are now going
to the next level. And with the evolution
of Kubernetes, with the custom
resource definitions, we’re going for
the GitOps approach, where everything– the desired state– is in git itself. Next, observability. As everything is
scattered around, make sure that your
system is observable. In a single application
it’s a bit simpler. But luckily, Google has
a Stackdriver suite. You probably saw it yesterday. There were some nice announcements here. So that's very good. So for every microservice– for us, we had the
Cloud Endpoints. But now with Istio, probably
you would go to Istio. But we have been running this
for a year on Cloud Endpoints. It really helped us
to make it observable. Because having a proxy just
in front of your microservice makes sure that everything
is logged and traced. And that really helps. Those logs and traces are aggregated with everything in the Google ecosystem. So everything that comes
in the network and even your Kubernetes cluster,
all come together in the same logging system. And it really helps dive deep. But don't forget, make sure that you have special log entry points, trace points, and metrics as well in your application. An example of a metric is active carts, for example. And through the Stackdriver suite, it all merges the operational logs and your custom metrics together to have a deep dive. And if you want to be an observability superstar, make sure to export all your logs to BigQuery. I find it quite strange that not a lot of people do that, but it's just a click away.
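Outside the console, that export is one command; a sketch with the gcloud CLI follows, where the sink name, project, dataset, and log filter are all placeholders (and the exact `resource.type` value for GKE container logs depends on your Stackdriver/Cloud Logging version):

```shell
# Illustrative: route GKE container logs into a BigQuery dataset.
gcloud logging sinks create gke-logs-to-bq \
    bigquery.googleapis.com/projects/my-project/datasets/gke_logs \
    --log-filter='resource.type="k8s_container"'
```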
And now is the time to look at Istio. We were on Cloud Endpoints because we started very early, but make sure to look at Istio. And also make sure to look at OpenCensus. With that you get nice dashboards. Now, with an announcement from yesterday, you also have service graphs added on top of it. And that's very
important if you want to dig deep if you have
troubles in your system. And the last, but this
is my favorite one, is how to get the flow of
your data in your system. So everybody knows the typical use case of sync calls, where microservice A calls B, and B calls C. But make sure to go for an async approach as well, where you actually put everything on Pub/Sub. Pub/Sub is also a good Google technology, where you can just publish your entities. And that makes it possible for other microservices to subscribe there. An example is when a cart is complete: your logistics process could listen on those events, and trigger the slower flows. And with the added value, if you do it consistently over all your microservices, you could listen on all those topics in your data lake and combine everything into BigQuery, Bigtable, or Elastic.
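A sketch of what publishing such an entity event might look like– the topic layout and envelope fields here are made up for illustration, and the actual publish call via the google-cloud-pubsub client is only shown in a comment:

```python
import json
from datetime import datetime, timezone

def cart_completed_event(cart_id: str, customer_id: str, items: list) -> bytes:
    """Build a self-describing event envelope, so both the logistics
    service and the data lake can consume the same message."""
    envelope = {
        "event_type": "cart.completed",   # one topic per entity/event type
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": {
            "cart_id": cart_id,
            "customer_id": customer_id,
            "items": items,
        },
    }
    return json.dumps(envelope).encode("utf-8")

data = cart_completed_event("cart-42", "cust-7", [{"sku": "SKU-1", "qty": 2}])
# With the google-cloud-pubsub client this payload would then go out as:
#   publisher.publish(topic_path, data)
print(json.loads(data)["event_type"])  # cart.completed
```

Because the envelope is self-describing, a data-lake subscriber can route every event type into BigQuery without knowing each service's internals, which is the silo-breaking effect described above.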
Or whatever technology you choose. And that's a way where you actually break the barriers between those silos. And then your data analysts and your marketeers really have one tool to query over everything, and that's BigQuery. And it combines– you have some nice dashboarding tools, like Tableau, where you can have a back-end view on the business. So in hindsight, going
to Kubernetes Engine was really a lifesaver, because then we could focus on those problems. Because having a managed service is not just throwing Kubernetes on a few VMs. No, it brings autoscaling, auto-repair, monitoring, all out of the box. And with the three-month release cycle of Kubernetes, the newest version of Kubernetes is just a click away. I do an upgrade of the cluster
to a new version in a meeting. So lessons learned– make sure
to have conventions, contracts, and automation in place. That will help you
accelerate your developers. If you don’t have
those things, you could be up for
some [INAUDIBLE]. And embrace eventual
consistency through Pub/Sub, where you publish
every entity in there, and listen there in your data
lake and put it in BigQuery. And make your system observable. Certainly with the nice
news announced yesterday this will be great. So that’s it for me. Wei, it’s up to you. WEI LIEN DANG: Thank you. My name is Wei Lien Dang. I am VP of Product at StackRox. StackRox is a Bay
Area based startup that delivers cloud native
security for the Global 2000. Our product is a platform
that secures containers and microservices across
their entire lifecycle from build to runtime. We’re pleased to be a
Google Cloud Partner and jointly working with
some of the world’s most recognizable enterprises across
technology, media, government, and financial services. This morning I’ll share
about our collaboration with financial organizations,
a collaboration that has enabled them to modernize
their infrastructure and app architectures atop
Kubernetes and GKE. And one of the primary
challenges that these customers face when moving to GKE
is thinking through how to effectively leverage a shared
security model, which requires considering what controls
do you need to implement on a standalone
basis versus where you need to augment existing
controls that are provided by GKE. And our customers tend to hone
in on the following concerns. The first is, can strong
security boundaries be established to
isolate and segregate containerized workloads? And isolation needs to extend
across pods, and services, nodes and clusters, and so on. Second, what is needed to ensure
that the container software supply chain, which
generates the images used to instantiate
containers, remains under the customer’s control? You need continuous
verification, attestation, and auditability. And finally, this
customer segment has to tackle regulatory
compliance in a context where requirements
for containers specifically are not
yet that well defined. And so, relevant certifications
and data protection requirements are going to vary
across workloads of course. But what we tend to see is
that financial customers take conservative interpretations
of existing requirements, map them to container
environments, and also increasingly adhere to
guidance on container security that has been published
by organizations such as NIST and the Center for
Internet Security. I’m excited this morning to
share three examples of how the combination of
GKE and StackRox together has successfully
enabled enterprises to navigate these concerns. In the first example, we’ve
worked with a Fortune 100 bank. That bank has a line
of business that’s leading the bank’s
digital transformation efforts to re-imagine
the consumer experience. And to do that, they’ve built
an entirely new investment management application that runs
on hundreds of nodes in GKE. In the second example, a Fortune
50 financial services firm runs containerized
analytics services on dozens of nodes in GKE
on-demand to better leverage data that has already been
warehoused in GCP, thereby driving greater ROI. And in the third example,
a Global 200 enterprise has re-architected its SaaS e-commerce platform, a platform that serves
small and medium sized businesses by providing
them with everything from a storefront to billing
and order processing functions. They’ve re-architected that
platform atop Kubernetes. They run it on-premises
today, with a plan to move it into
Google Cloud, and it’s designed to scale to
thousands of nodes across dozens of clusters. Now, Alex talked a lot about
infrastructure modernization. So at the same time
these businesses selected GKE and Kubernetes
to drive that modernization, they also had to figure out how
to address the unique security challenges that were
introduced by containers. Existing security tools
lack container awareness and are rendered obsolete. The attack surface expands
beyond just containers themselves to encompass layers
that include the container runtime, the orchestrator,
namely Kubernetes, the registry, and so on. And so there’s the emergence
of new threat vectors that need to be
protected against. And then, finally, the
short, ephemeral lifetimes of containers make
necessary functions such as auditing
response and forensics, in particular,
much, much harder. And so, to solve
these challenges, requires modernizing
security as well. And there are a
lot of great talks going on this week
about security for Kubernetes and GKE. I encourage you to
check out those talks. What stands out
is how Kubernetes is modernizing security
with the breadth of baseline capabilities that are being
built into the platform itself. There’s firewalling
for pods and services, there’s TLS throughout
the cluster, there’s access control,
and audit logging, and secrets management,
and storage and master. GKE takes all of this,
operationalizes it for you, and augments it by giving you
a minimal container optimized operating system by
managing the master for you and also integrating with
container analysis and image scanning functionality that’s
part of Google Container Registry. And so, where a
customer previously had to turn to several
standalone security tools to implement all
these controls, they can now benefit from a
single infrastructure platform that makes
security fully programmable. But our customers, particularly
those in financial space, still require more. The move to containers
does not obviate the need for security functions
like vulnerability management, configuration monitoring,
compliance, policy enforcement, threat detection,
and incident response. These customers look to standalone
look to standalone solutions like StackRox to
solve for these use cases. And so, what StackRox
does is unify this set of capabilities
into a single platform that allows our customers to run
on GKE with high assurance and confidence. I’m now going to
run through some of the common
architectural patterns that we’ve seen among
our financial customers adopting GKE. And so in the first example,
that Fortune 100 bank chose to build a
new cloud native, mobile-first application
on top of GKE. And what that looks like, is
there’s a highly scalable web tier made up of microservices
that sits behind a GCP load balancer and a reverse proxy. There’s a separate
set of microservices that handles brokerage
functions, order processing, reconciliation, and other tasks. And that tier
interfaces directly with data that is
persisted off-cluster and is non-containerized. Everything that runs within
the cluster is stateless. The tier also
interfaces with a set of internal custodial services,
as well as third party APIs. And one particular
concern of this customer was that an attacker
could potentially gain a foothold
somewhere in the web tier and then subsequently move
laterally from one service to another to gain access
to the brokerage tier, which would then give
that person access to the data or other
non-containerized bank services. And so, the ability
to detect and respond to that type of threat vector
is one of the main use cases that StackRox has helped
solve for that customer.
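One building block for containing that kind of lateral movement is a Kubernetes NetworkPolicy that only admits web-tier traffic into the brokerage tier. The namespaces, labels, and port below are hypothetical, not the bank's actual configuration:

```yaml
# Illustrative: only pods labeled tier=web (in namespaces labeled
# name=web) may connect to brokerage-tier pods, and only on port 8443.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: brokerage-allow-web-only
  namespace: brokerage
spec:
  podSelector:
    matchLabels:
      tier: brokerage
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: web
      podSelector:
        matchLabels:
          tier: web
    ports:
    - protocol: TCP
      port: 8443
```

A policy like this reduces the blast radius of a web-tier compromise; detecting an attacker who still finds a path through is where runtime tooling comes in.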
In the second example, that Fortune 50 financial services firm has already been using
BigQuery and Cloud Data Flow to set up a highly
efficient data pipeline that handles transformation, batch
processing, and warehousing. And what they’ve added
on is a GKE environment that allows their developers
to spin-up containerized Spark applications. And those Spark
applications can be used to process the data
that’s already in BigQuery, or analyze it, or query
and manipulate it. And GKE makes BigQuery
integration extremely easy, literally via a drop-down
selection in the console. And because those
Spark services run on-demand for limited
periods of time, understanding what is
running, what has already run, and how policies are enforced
on services that are processing different data sets with
varying degrees of sensitivity becomes a very hard challenge. And so that is one
of the main areas that StackRox has helped
solve for them as well. And then finally, that
global 200 business that has containerized its
existing e-commerce platform on-premises. This is really an
effort designed to drive application
portability and agility, and is part of a plan to
eventually move to Google Cloud and GKE as part of a broader
lift and shift migration. Now, what stands out here
is that these are not microservices, these are legacy
services– think TomCat, think WebLogic. These are not stripped
down services built from minimal base OS images. And so vulnerability
management, and handling those configurations
and compliance become key critical use cases
that the customer and us initially focused on. Separate services are run to
handle logging and monitoring. These are off-cluster. These are non-containerized. But as part of the
organization’s move to GKE, they anticipate adopting
the Stackdriver [INAUDIBLE] which Alex described
for those functions. As a proud partner
of Google Cloud, we’ve invested in delivering
a seamless experience across StackRox and GKE. We have native integration
that automatically sends alert information out to
Cloud Security Command Center so that you can view
threat information about your containers alongside
all of your other security data about all your other
cloud resources. And also, to re-emphasize,
for a moment, the importance and
significance of discovery and gaining visibility
when migrating to GKE, it is the starting
point for nearly all of the financial customers
that we engage with. Being able to understand what
is running, what has run, and to do so in a context
that abstracts away potentially thousands and
thousands of containers. Security operators do not
want to immediately know what is going on with each
and every single container, rather they need to be made
aware of whether there’s been a particular incident or
threat on a given application. And one of the unique
ways that we provide comprehensive visibility
for customers, is we interface with every key
layer of the full container stack. And so, we interface with the
operating system, the runtime, the orchestrator, the
registry, and so on, so that we help build a
bigger picture and deeper picture of how your
GKE environment is changing over time. So what have we gleaned
from the past several years of working with the
financial services community? Here are some common
practices that we see nearly all adhere to. The first, is they start
with hardening their cluster infrastructure,
utilizing controls that are built into
communities and GKE, such as role-based access
control, node constraints, service account privileges,
network policies, metadata concealment, and so on. It’s a very long
list, which speaks to the richness of
security capabilities that are being built
into the platform itself. Next, they focus on ensuring
that their images are securely built
and secure to use, that they meet
best practices, and that the runtime
environment is configured against known benchmarks. Automation is key here. Automation and programmability,
those are the same principles that are driving
infrastructure modernization, also have to extend to security. And finally, they then focus on
analyzing container activity. Once your containers
are running, you can actually
analyze their behavior to better identify
threats, because the declarative and immutable
attributes of containers make it easier to spot when
an attacker is potentially looking to gain a foothold,
escalate privileges, persist within the environment, move
laterally across services, or achieve some type of
objectives, such as crypto mining, or exfiltrating data. And regardless of
where your organization is on its container journey,
here are a set of key lessons from our customer
success stories that you can act on today. The first is that the
security controls that are built into
Kubernetes and GKE provide the starting point
for any comprehensive security program. Identify what else is
needed, where, to then solve for additional use
cases, such as threat detection and response. Apply automation, fast iteration, and closer collaboration between
security and DevOps teams as part of your organizational
workflows to better prevent, detect, and respond to threats. And then, finally, tune
your security policies based on the rich
context about containers that can be gathered
via declarative metadata and runtime configuration. Collectively, these lessons
have helped our customers and can also help you to better
harness all the advantages that Kubernetes and
GKE have to offer. We’re now excited to show you
how this all comes together. I’d like to invite
Connor on stage, and he’s going to
demo and walk you through three examples of
how GKE and StackRox together help protect against a set
of threats in a live GKE environment. CONNOR GILBERT: Thanks, Wei. I'm Connor. I'm excited to show you a live
GKE cluster, as Wei mentioned, and demo three real-world
scenarios of container security. I could talk to you
about this for hours, especially about how much
container technology, Kubernetes, and the integrations
with Google products can really help you get
all the right visibility in the right places. But I have about five minutes,
so I’m going to dive right in. So I’m going to show you
these three scenarios. They’re based on exposures
at Tesla, Shopify, and inspired by
the Equifax attack. I don’t think they
were using GKE. So on the screen you’ve got
a StackRox portal running in the left hand
side and a terminal that I’ll be using on the right. So let’s start with
the Tesla example. So this is an
example where there was an operational mistake,
or a cluster misconfiguration, and that led to a cryptojacking
attack discovered by a security firm who was just scanning
the internet for Kubernetes dashboards. So first we’re going to
expose the dashboard here. This Kubernetes
dashboard service, it can be helpful
for debugging, but it can be a security exposure. It doesn’t run with
a publicly trusted cert, so I'm just going
to click through. And you can see, you can log
in with a kubeconfig or a token if you’re going to use your
privileges and proxy to the API. But you can also skip. And if you do skip, it's going
to use the default service account that the dashboard
service is running under. Now, you can see there
all these errors. You were trying to debug
something, and now your screen is just
full of errors. You were supposed to be fixing
these, not getting more. So you Google this,
you find out, OK, this user account cannot list
persistent volume claims. And you’ll find some YAMLs
and some instructions about how to give that service
account cluster admin access. And this is actually,
I believe, what happened in that Tesla example. So I'll go ahead and elevate permissions for that dashboard service account.
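The kind of YAML those instructions hand out looks roughly like this– shown for illustration only, because this is exactly the over-broad grant the demo is warning about:

```yaml
# DON'T do this in production: binds the dashboard's service account
# to the cluster-admin ClusterRole, the misconfiguration described above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubernetes-dashboard-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: kubernetes-dashboard
  namespace: kube-system
```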
I can see everything. I can debug my problem. And then I go away, and
I forget to take it off. So now, anyone who goes
to this node port address will be able to come
in and use this helpful create button that is up in
the right hand corner, where I can then create deployments. So I’ll just go ahead
and create a YAML– create a deployment. It is running a
crypto mining service. And upload, and there we go. I’m going to head
back to the dashboard and show you that we
will have detected a new service coming up, and
then the crypto miner being run. Wei talked about
software, supply chain, and what’s in your
images and everything, but sometimes you
do need to know what is actually happening. Because I just
introduced an image from a publicly
available registry that wasn’t the one that
you were using, so you wouldn’t have any sort
of scanned data or anything. You can see there’s an alert– minerd running inside that process. All the metadata that you’d need
to go respond to that issue. So I’m going to continue on. The countermeasure here
is really, don’t do this. It’s easy to make
these mistakes, and the defaults
are getting better, but you might have
an old cluster, and it’s really
important to audit who has cluster admin
and the exposure of the important
services, especially ones out of the kube-system namespace. You can even disable
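(That audit can be sketched as a one-liner, assuming kubectl access and the jq tool; the gcloud command and cluster name are illustrative:)

```shell
# List every subject bound to the cluster-admin role.
kubectl get clusterrolebindings -o json | jq -r '
  .items[]
  | select(.roleRef.name == "cluster-admin")
  | .metadata.name as $binding
  | .subjects[]?
  | "\($binding): \(.kind) \(.namespace // "-")/\(.name)"'

# Disable the dashboard add-on on a GKE cluster entirely.
gcloud container clusters update my-cluster \
  --update-addons=KubernetesDashboard=DISABLED
```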
the dashboard entirely in recent GKE releases. All right, we’ll move
on to Shopify now. So I just first want to
give credit to Shopify for helping educate
the community about this kind of issue. They posted this as a
bug bounty on HackerOne, and made it publicly accessible
so that everyone can learn from what happened to them. So let’s move on to
that demonstration. The vulnerability
there was that they had a service that was
vulnerable to server side request forgery. You could ask it to
make network requests. And those network requests
could include internal resources and sensitive things, like the
cloud provider metadata server that’s running on every node. So I’m going to start by just
deploying an application that’s sort of trivially vulnerable to
server side request forgery. And the first thing
I’ll do is have it just check what IP it’s running on. So you can see, first
it’ll show my laptop IP, and then we should be able to
see the IP that the container cluster is actually
running under. So there’s my laptop. And then this is actually
the container node that the container is
going to egress from. So you can tell
they’re different. So then, that’s not super
interesting, but what is is starting to use the
orchestrator against itself. So you can, if you make
this kind of request, talk to the local
Kubelet read-only API, and enumerate things about
the cluster that you’re in. So, for instance, I’ve just gone
and enumerated the images that are running on this machine. And this actually
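(The request shape behind that enumeration is roughly this– a sketch with a placeholder node IP; in the attack scenario it would be issued through the SSRF-vulnerable app rather than run directly:)

```shell
# Kubelet read-only API (port 10255): list the pods, and therefore the
# images, running on the node. 10.0.0.2 stands in for an internal node IP.
curl "http://10.0.0.2:10255/pods"
```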
highlights the importance of segmentation and separating
sensitive workloads that was mentioned earlier. You can use built-in
Kubernetes features that are fully available in
GKE to make separate node pools for sensitive data,
apply taints and tolerations. So you taint a node and then
tolerate it in the workload. And you can say that only
sensitive things can schedule on these sets of
nodes, so that nothing like my dumb, little app
can be scheduled next to your sensitive data and
then be used to steal it. And so that’s an example
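(A minimal sketch of that taint-and-tolerate setup, with hypothetical node and key names:)

```shell
# Keep ordinary workloads off the sensitive pool's nodes.
kubectl taint nodes sensitive-pool-node-1 workload=sensitive:NoSchedule

# In the sensitive workload's pod spec, tolerate the taint so it can
# still schedule there:
#
#   tolerations:
#   - key: "workload"
#     operator: "Equal"
#     value: "sensitive"
#     effect: "NoSchedule"
```

In practice you’d pair the toleration with node affinity so sensitive pods are also pulled toward that pool, not merely allowed on it.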
of using the Kubelet API against the cluster. The other example,
specifically with Shopify, was using the kube-env metadata attribute. So you can see there’s
a bunch of stuff here. Not all of it
immediately decipherable. But eventually the bug
report includes the fact that there’s a key
and a certificate that can be used for TLS
bootstrapping with the Kubelet. This is another thing where
GKE has a countermeasure, but it wasn’t enabled. So there’s a feature called metadata concealment, and there are long-term plans to distribute cluster metadata more securely. But for now you just need to opt in
to that kind of concealment to stop this kind of attack. You can see on the left,
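(Concretely, concealment blocks requests like the first one below; the gcloud flag was the beta opt-in at the time, and the pool and cluster names are hypothetical:)

```shell
# The kube-env attribute that the Shopify report extracted TLS bootstrap
# credentials from, via the node's metadata server:
curl -H "Metadata-Flavor: Google" \
  "http://169.254.169.254/computeMetadata/v1/instance/attributes/kube-env"

# Opt in to metadata concealment on a new node pool (beta flag, circa 2018):
gcloud beta container node-pools create concealed-pool \
  --cluster=my-cluster \
  --workload-metadata-from-node=SECURE
```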
we’re monitoring for things like cloud metadata access. So this container ended up
contacting the cloud metadata server. And that is not in its normal
behavior, and so we’ve alerted. I’m going to move on now
to the Equifax attack– this is inspired by it, but
not an exact reproduction. I don’t think they
were using containers, and I don’t think
they were using GKE. But it’s an example of where
application vulnerabilities really can hit you even
though you’re in containers. Containers have all these
rich configuration options– running as different users,
running with read-only rootfs, abstracting away the
access to secret things, abstracting storage
away from you. But you can still
write an app that gives people access and not have configured the security boundaries well enough
that they can keep going. So I’m going to
deploy an app that is vulnerable to a Struts CVE
that allows remote command execution. So that’s created. And I’ve created a
sensitive database service that I’ll then contact. So I’m going to go ahead
and just make a crafted HTTP request that injects a command,
and connects to that database, and dumps some
information about it– the databases and the tables inside. So the port for it is still– I have to do this live. So we get to poll the pods. Now, I’m going to start again. So you can see the
data leak here. So we’re able to make this
large, exciting, crafted HTTP request that included
some MySQL commands. And then the output
was all the databases on the server, and then all the tables. And if there were
more stuff in there, you could have executed
commands like that also. Then you can see on
the left-hand side, we’ve alerted on a
lateral movement, because these services don’t
typically talk in this manner. So just an example
of runtime monitoring of those connectivity graphs
to catch the lateral movement. Once they’ve gotten
the initial foothold, they’re trying to
get the objectives. They’re trying to get
through to your data. They’re trying to get the
things that they want. So I’m going to hand it
back to Allan after this. I’m excited to have shown you
three real-world scenarios in a GKE environment,
and the way that both GKE defaults and
StackRox-type protection can help secure your workloads. Thanks. [APPLAUSE] ALLAN NAIM: So call to action– we covered a lot
of topics today. And there’s a whole
bunch of sessions that are going on
at NEXT that go really deep into various areas. Yesterday there was
actually a great session one of our PMs delivered around
Kubernetes and enterprise security requirements. It’s all going to be
available on YouTube. And then today there’s
a bunch of sessions, one on containers and
Kubernetes Engine, that really goes deep into
some of the differentiators with Google Kubernetes Engine. And then later on
in the afternoon there is a session on best
practices from Google SRE. I highly recommend
checking that out. It’s going to be really
a combination of both GKE and Istio. And then there is a session
later in the afternoon really around running managed
Kubernetes both from a GKE perspective and from a
multi-cloud standpoint. So for those of you that were
in the keynote yesterday, you probably heard
the announcement around GKE and the
on-prem component. So it’s going to get
really interesting in terms of providing
that consistent experience for managing your applications
in a hybrid fashion. So with that, thank you
very much for your time, really appreciate it. And I wanted to thank
the speakers, Wei, Alex, and Connor. Very nicely done. We’ll take a few questions. So if you do have questions,
please walk over to the podium and ask away. Thank you. [MUSIC PLAYING]
