The Good, the Bad, and the Ugly of ML for Networked Systems

>>Welcome everyone to this session on the
good, the bad, and the ugly of ML in networked systems. Applying ML to a problem needs no real motivation
or push in 2018. We’re seeing a huge surge
in developing ML techniques for a lot of
large-scale network systems. With the huge amount of data and computing
power available, this actually makes
it quite feasible. Of course, Microsoft has been at the forefront of applying data at huge scales to
large-scale systems issues. However, there is
a growing voice, maybe even the silent majority of voices in the systems and
networking community, that is clamoring for
more principled solutions. They’re concerned
that people are building solutions without
quite understanding what they actually do. This session intends to
dig into this topic. What are the networked systems
for which ML is appropriate? Given that our systems are built
with abstractions, does that make them more amenable to black-box
ML-based techniques? Finally, do we even need
full understanding of the solutions as the so-called
traditionalists insist? To discuss all these
questions and more, we have a panel of
researchers in this space. We’ll start with Balaji
Prabhakar from Stanford. Balaji is a professor
at Stanford. His research is largely in the space of
self-programming networks, particularly in the context of cloud computing platforms
and data centers. Balaji has a long list
of awards. I won’t list all of them, but I would
just note that he has won the Erlang Prize from
the Applied Probability Society and the Rollo Davidson Prize from the University of Cambridge. He’s a Sloan Fellow, and an IEEE and ACM Fellow as well.
>>Thank you. Hi, good
afternoon everybody. I’m going to take a position
and my main thesis is that, on the good, bad, ugly scale,
I’m going to say it’s good. What’s ugly and sometimes
can be bad is the hype. In and of themselves, there’s nothing wrong with the ideas; it’s how we end up
using them or, worse, how we end up making
a bigger deal of them. So, some of my background
is theoretical, and the scientific
modeling process always walks a knife edge. There are a few theorems
that give universality, but other than that, assumptions tend to be made
for tractability. We make some models and
they work by and large. There are some well-known
combinations: linear systems go with
Gaussian noise really well, and that goes with quadratic
cost functions really well. A fairly large amount
of optimal control theory, stochastic control, runs on
those three assumptions. People have tried relaxing any one of these and
it’s not been easy. Nobody’s heard from those people who’ve tried to relax
these assumptions. They’ve gone off into
the woods and never come back. Similarly, in networking
we know there is again the Markovian assumption,
whether it’s Poisson arrivals and exponential
services or small variations on those, but by and large, that’s a successful
modeling class. What happens when we
leave this world, when the modeling assumptions
don’t quite work or we want to capture
something in more exact detail? That is where lots of
data these days, when combined with neural networks and
ML, can actually help. So, it helps us to capture
something that is actual, an actual system where the modeling assumptions
have gone some distance, but now, we want to be
more exact than that. What that does is it
certainly leaves the burden of
understanding upon us, and I’m okay if we first get success
and then understanding. One usually prefers
understanding first and then success; then we can
declare success correctly. But people have had observations for which theories came
later, and that’s fine. I’m just sharing an anecdote about Niels Bohr,
or involving him. He had a scientific
friend visit him at a summer home that he had in
the mountains somewhere, and hanging on the door
frame of his house there was a horseshoe, which is well known
as a good luck charm. So, the scientific friend
said, “Surely, Professor Bohr, you don’t
believe in this superstition.” He said, “Well, I
don’t believe in it, but they say it works, whether you believe
in it or not.” It’s not a bad thing
to keep in mind for the topic of this panel. I want to take two specific uses
for neural networks. First is so-called
function approximation. They’re well known to
have the property of approximating functions. Two-layer neural
networks are universal approximators; that’s a long-ago theorem
of Andrew Barron’s. Second is that, when combined
with hardware of the GPU or TPU type, this ability to approximate a function
can vastly speed up computations, and that was very useful
for us in networking. I’m going to give
you some examples. One example supporting both is from self-programming
networks. So, I’ll tell you what that is, and briefly also tell you about some work I’ve been doing with
Sachin Katti, his student Mani Kotaru, and also
Gordon Wetzstein, who’s a colleague who specializes
in VR and computer vision. Finally, I was at a Simons Institute program
in Berkeley last spring, the one
that is run by, and we had a physicist from Caltech who’s involved in
elementary particle physics, the Large Hadron Collider, and they also use neural networks in pretty much the same way
that I’m going to describe, and for
the same reasons. So, self-programming networks is a research program at Stanford. Mendel Rosenblum and I are working on it together
with a bunch of students, and the infrared learners are blue because that’s where we’ve used some neural networks. So, again, what are self-programming networks? Think of a self-driving car. You take a plain old car. If you add sensing
and control to it, then it becomes Stanley,
the self-driving car. This is Sebastian Thrun,
the trunk, and under the hood. Similarly, we would take
a plain old data center, and our focus is on data centers. So, in other words,
I’m not advocating what we’re doing for the wider Internet yet, because there are some particular
assumptions we made. You add sensing and control to it, and we want to use that to form a self-programming
network. I don’t have time
to tell you more about where we sense and how we control; it’s mostly
a NIC-based approach. What that allows us to do is run some new applications on
plain old data centers, because it provides
additional functionality, like a very accurate timestamping
service and so on. So, in this world, neural networks, and I
want to get to it, come in the following
way. We’ve taken a tomography-type approach
to measurement. We look at ingress and egress
times of packets at NICs. When you have this for
a whole bunch of packets, you’re in the world of solving simultaneous linear equations to determine from the total time how much time was
spent in each box. The algorithm we use for this is
LASSO. So, you’re looking
at the total delay, D. A is the adjacency matrix that encodes what path
the packet took, Q is the queuing delays
of the switches it’s visiting, and N is the noise. Noise is variable wireless, NIC timestamping inaccuracy,
various things like this. We worked on this and tested
it in Google’s 40G testbed, the internal production network,
and various other places, our Internet data
center at Stanford, but this is the algorithm. So, given the end-to-end delays, estimate individual queuing
delays over a certain interval. This works, but it’s slow: the LASSO algorithm scales poorly with the size
of the network. There are faster implementations
of LASSO, but even so, we wanted to see if we could just teach a neural
network to do this. What that means is, take
an algorithm like LASSO. It has its input-output
map, which defines a function. Take
the input-output pairs, lots and lots and lots of them, and train a neural
network with them. You have to remember, after all, we’re just inverting noisy
linear equations. If there’s no noise and
the equations are well determined, it’s just another
linear system. So, you expect a priori that neural networks
should have a good chance. It turns out just two layers, about as
shallow as you can get, a neural network of the basic
ReLU type, is good enough. So, what do the results look like? On the top is
Reconstruction Accuracy. This first chart. You can have different
size networks. LASSO has a root-mean-squared
error which you can think of in packets: roughly three full-sized,
1,500-byte packets. That’s the error. The second column is
neural networks trained with ground truth, when
you have it available. The third is a
neural network trained with the output of LASSO itself. So, as you can see, it’s pretty accurate; there
is not much loss of accuracy. But the killer is
the second thing. LASSO scales poorly
as I said before. Thirty milliseconds’ worth of data takes you more
than a millisecond; depending on the size
of the network, it can get as bad as a second. But if we’re using a trained
neural network, we can still be tracking in real time. So, that’s an example of function
approximation capability as well as hardware acceleration. So, that’s my first example. The second one, again very briefly, even
briefer than this, is work with Mani Kotaru, who’s Sachin’s student, as I said. Sachin and his students
have had a theme of using Wi-Fi signals
for various things; in this case it’s localization. There’s a VR headset
that’s moving, and Wi-Fi signals are captured. The goal is to localize
based on the signals. There’s a well-known
mid-seventies algorithm called MUSIC that does it. It cancels the indirect paths,
etcetera. But there is a 60-centimeter error when MUSIC does the localization. This work took Mani more than a year and a
half, because it’s really not like
off-the-shelf ML; you have
to work at it and wrestle with it. I actually thought he
wouldn’t even succeed, that maybe this is not
an easy function to approximate, but he did. There’s a submission
on this. It’s a four-centimeter error, and it’s also really fast by
comparison with MUSIC. So, this
is the second example, and the last one is just, I’m going to, I really encourage you to read
it because I was quite impressed with the way the
physics folks are doing it. This is from CERN. I’m happy to share the URL with
folks who want it. So basically, what happens in
the Large Hadron Collider is, it has got some
elementary particles; something is held over here, something really fast comes and bangs into it, and then
stuff like muons, gluons, all these guys come out, and then they trace
the trajectories of these particles.
That’s their business. But the data generated is so huge that they have to censor it; they can’t store everything. The article starts by
saying that, in the ’60s, people used to pore over
these images and reconstruct the trajectories to
find out if they’d found a new elementary
particle, whether Higgs boson type
things were showing up or not. Then the learning revolution
has just begun and it was based on grid computing that was
introduced in the 2000s. What they’re now doing is exactly reconstruction. So, my problem is an inverse problem: given the overall
delay, find me the queue sizes. That’s called
an inverse problem. That is, the forward equation
is D equals AQ plus N; the inverse problem
is, given D, find Q. Inverse problems also show up in MRI. All of those are inverse problems, and this is another
inverse problem, and the most recent work
has been about training these neural networks to just do the inverse problem
and find the trajectories. So, the equivalent
of LASSO for them is the classical and quantum
mechanics laws, and then they train on them. I think hardware acceleration plus all that is
beneficially used in this case as well. So,
with that I’ll stop.
>>The format we’re
hoping for is that, if there are one or two burning
questions, we could bring them up now, but if you have broader questions, we ask that you keep them
for the panel at the end. Any quick questions while
Bruce gets set up? Okay. We’ll try my laptop. Yes, while Bruce gets set up: Bruce is from Duke University. His research interests are in large-scale
networks. Prior to that, Bruce was at Carnegie Mellon University,
where he helped launch the Akamai CDN service, and he was the first
vice president of research and development for Akamai,
where he continues to hold a role as
vice president for research. Bruce’s work has won the SIGCOMM Networking
Systems Award, as well as the IEEE
Cybersecurity Innovation Award.
>>We’re going to go with VGA.
>>Got it hooked up there?
>>Yes.
>>There you go.
>>Thank you, Ganesh.
It’s nice to be here, I have to say that I am giving this presentation with
some trepidation, because I don’t really use
ML very much in my work. I tried to think of
the best justification I could find for myself as a speaker and it occurred to me that I have ML
in my middle name. So, I’m a traditional
systems designer. I like to design big systems, the biggest example
is the Akamai CDN, large parts of which I designed. I like to try to
optimize these systems, and I like to fix them. One of the things I’ve discovered over and over is that whenever you take measurements and look at how your
system’s performing, you find out that it’s broken: either it’s not
functioning correctly, or it’s just performing
very poorly. There’s a methodology
that I’ve used many times to try to
either improve systems, or design systems with
new capabilities, that I call data-driven design
of network systems. So, the idea is that, you look around for
existing systems that are already deployed
on a large scale, and from which you can
collect a lot of data, possibly data of different types, and then you think about, how could we marry
these datasets, to help us evaluate
new system designs? Maybe for a system that
does something different, or that does the same thing
but in a very different way. So, one example of this, maybe the first one I
was involved in was, there was a new system
that had been developed at Carnegie Mellon around 2000 or so for purely
peer-to-peer delivery of live or real-time video
called End System Multicast. This was not my work; it was Hui Zhang, Sanjay Rao, and several others. They won the SIGMETRICS Test of Time Award for their paper describing
End System Multicast. But they didn’t have
a very good way of evaluating whether or
not this system could really hold up under
real workloads. So, one of the things we did was, we gathered all the logs from Akamai’s real-time
video delivery system, and then we used them to drive simulations of ESM,
and the kinds of questions we were
interested in were, well, when there’s
a massive flash crowd at the beginning of
some scheduled event, and peers are joining
and leaving rapidly, is the system stable? Most of the time, we
found that, in fact, even without the
dedicated infrastructure that Akamai had deployed for its real-time
streaming network, ESM was still
mostly successful. Let me give you another
example of this kind of work. So, here’s a fact that
maybe people don’t know, but TLS certificate
revocation checking is broken in a bad way,
and here’s how. Your mobile device doesn’t check for certificate revocation. So, what this means is, if you go to some website where you
provide your password, and maybe you use the same password
at a lot of websites, and somebody has stolen that
site’s private key somehow, maybe through a bug like the Heartbleed bug, and somehow they have interfered with DNS and you’ve
gone to an impostor site, then despite the fact
that the true owner of that certificate has reported
it as no longer valid and put it on a revocation list, your device will
not check for that. I know this because I’ve tested all the mobile devices,
Android devices, iPhones. I couldn’t find any example
in which any device ever checked for
certificate revocation. The reason they don’t is
that it’s too expensive. One of the things
I found a little frustrating this morning going to talks about formally verifying
security properties, making sure your
implementations of the protocols are
correct and so on, is that there are
these other economic reasons why whole parts of the protocol
just aren’t implemented. So, on mobile devices, they don’t do certificate
revocation checking because it would take too much
bandwidth to download the whole list of
revoked certificates. Or it would take
too long to contact an OCSP server to ask whether
this certificate is valid. Or there are newer protocols
like OCSP stapling, but only two or three percent of servers support that. So, it’s just not done, and this is a crying shame. So, we started thinking about, is there data out
there that would make it possible to
solve this problem? It is not difficult to find the list of all the
revoked certificates, because they’re on
the certificate revocation lists. But what was harder to find was a list of all the valid
certificates, until recently, when Google was so annoyed that Symantec had issued
some fake certificates that they launched something called Certificate Transparency, and now they’re basically pressuring
certificate authorities to put into, just to
spice up my talk, I’m going to say
a blockchain type of log, every certificate
that they issue. In addition to that, there are
some research projects that have been going around scanning the whole
IPv4 address space, pulling every certificate
they can from port 443. So, suddenly we have a new dataset we didn’t
have before, which is all the valid certificates. So, we had the invalid
ones before, and now we have the valid ones. By the way, it’s a lot
of certificates. We’re talking about
tens of millions. This suggests a way to represent in
a very compact data structure, complete information
about every certificate indicating whether it’s
been revoked or not. The structure is called
a filter cascade. So, the idea is that you take all the
revoked certificates and you insert them
into a Bloom filter. Now, that would be
fine as a solution, except that it’d be really unfair if there was
a false positive and some website’s certificate
were marked as revoked just because there was
a bad collision. So, then you take all the
non-revoked certificates, and you ask, “Are there
any false positives?” For all the false positives, you create a second Bloom
filter to hold those. There are a lot more non-revoked
certificates than revoked ones, so it’s important
that at the first level you insert
the revoked ones, so you can have a small table, but with a really small
false positive rate. Then you say, “What about revoked certificates
that are false positives in that second filter, meaning they would
be considered non-revoked?” You
make a third, smaller filter for those, and so on. So, you get this
geometric decrease, and it turns out that
this filter cascade has 43 levels, but you only expect to do ten hashes to look up
the status of any certificate, and the whole thing only takes 1.1 megabytes to represent the precise status
of every certificate. There’s an information-theoretic lower
bound of 0.77 megabytes, so you’re not going
to do much better. So basically, today there are
60 million valid certificates. That number is
about 40 million higher than it was two years ago, maybe 50 million, because
of Let’s Encrypt. In September we had 850 thousand revoked
certificates, and so that would take 1.1
megabytes, but actually, the list changes very slowly, so you can just publish diffs, and those are
only 40 kilobytes per day. So, this is an example of
data-driven system design: what’s out there, and how
can we exploit it? Now, I want to say a little bit
about machine learning. The application of
machine learning in computer networking is old. So, here’s
a survey paper from 2008 that’s been cited a thousand times,
on traffic classification. This is a thing people
have done for a long time: trying to figure out,
is this an attack? What kind of application
is running here? This is an old application. I was thinking a little bit
about places where you might use Machine Learning in the context of
content delivery networks. So, one of the things
that Akamai tries to do is predict, for each client
on the Internet, a billion clients, which of our servers would
give the best performance. This is an example where you have multiple datasets
of different character, you have past performance, you have network measurements, you have network
topology information, you’re trying to
make a prediction, and here you could try
to use Machine Learning. Another application is
trying to figure out, is this a legitimate client? We have a lot of
trouble with botnets trying to break into
banking customers. So, we want to develop
a reputation for a client, or estimate the physical
location of a client. This is easy if
there’s some way for a client to click on
the “Allow” button to give up their location. But Akamai doesn’t
touch clients the same way that, say,
a content provider does. These kinds of things work best when the penalty for
false positives is low. So, it’s okay if whatever algorithm you’re
using does poorly, as long as it doesn’t hurt that much when you
make the wrong decision. So, when we decide
a client is suspicious, we don’t ban it, but
we rate-limit it. Or, if we decide a participant in our peer-to-peer
content delivery system is suspicious (Akamai has tens of
millions of clients in a system that does
this), we just kick it out and say, “You have to download
from the CDN.” Okay, I’ll finish there.
>>Any quick questions?
>>[inaudible] kind
of like trying to say data-driven design and
machine learning are a good thing for networking, but I see there’s
something different between you and
the first talk. The first talk says the catalyst
of the success of machine learning in networking is that we have more expressive, more complex models, like
neural networks as an example. But you’re talking about
the catalyst being that all of a sudden
we have more data. Right? We have
more data from Akamai, or we have data from Google,
and it’s this explosion of data that is kind of
the driving force behind this. So, is it true that
you think having more data is more important than having
a more complex model?
>>Okay. So, I’m not sure that was exactly the
point I wanted to make, but I do like the way you summarized it: that
the fact that we have more data is giving us
new opportunities, new ways of designing systems, and perhaps ML is a tool that
can be used to help do that.
>>That’s a good point
to follow up on in the panel as well: more data
or model complexity.
>>Next up we have David Maltz, who’s a distinguished
engineer at Microsoft. He leads Azure’s Physical
Networking Team. His team builds the software and network devices that connect Microsoft’s largest
services, including the Azure public cloud
and Bing. Prior to that, David was at Microsoft Autopilot, and before that he
spent five years at MSR. David has won the SIGCOMM Test of Time Award as well as
the ACM system software award.
>>Great, thanks guys. So, I’m here talking about the work of the Azure
Physical Networking Team. We’re called the
Physical Networking Team because, as you might
imagine, we’re in charge of actually
getting packets between all the servers inside
of our data centers. I want to talk to you
a little bit about the environment in which we work. So, we’re building
a really large network: hundreds of thousands of links
in each data center, hundreds of data centers
around the world. You know, when you talk about
large-scale, large data sets, the network is very good at generating those kinds of things. And just to be clear, the number one job for
my team is high availability. It’s making sure that
every packet that gets sent by every server
running in our network actually makes it
across that network and is received appropriately by
the destination servers. And the challenge we have is
that at this large scale, the law of large numbers
is not your friend. Yes, we see lots of
nice median behavior, but it turns out
nobody cares about what happens at
the 50th percentile. People care about
what happens at the strange edge conditions: what is the very high
nines percentile of your behavior, of packet
loss and everything else? So, instead of Occam’s Razor, where the simplest explanation
is the most likely, you get Murphy’s Law. There is something crazy happening in some
part of the network that’s going to generate
really weird behavior, and my team’s job is to go figure out what that
is and stop it. So, we take a network (this is actually sort of a
rendering of one small part of our network
in one small data center, and you can see all the
interconnectivity) and go find that proverbial
needle in a haystack. We’re actually
pretty good at it, but sadly it does often take
us quite a bit of time. Again the time to mitigate
these problems is also very important because
our customers are suffering. They are bleeding
money, reputation, and everything while we go and try to find these
problems and resolve them. So, how can machine
learning help us with this? So, I’m going to have
sort of a mixed message. I’m going to show you a couple of problems where I
think ML probably has some things to offer us and we’ve been
getting some benefits. I’m also going to try to characterize
some things where ML is probably not
the best approach, and I hope I have enough time
to end with a little bit of commentary from
a practitioner’s point of view: if you’re a normal person
and want to work with system scientists or
research scientists, if you want to work with ML people, how can you cross the knowledge
gap between those two? I’ll talk about the three examples, and I’ll come back to the commentary text on
the side as we go through. Let me start with
a problem called Layer One Topology Generation. So, the basic problem here is we want to keep
our network from failing. We want to keep
our optical network from failing, to keep various data centers
or regions of our network from being partitioned from
the rest of the network due to fiber cuts
or other fiber failures. Now, you’re like, “Hey guys,
this is an easy problem. You tell me the availability distributions, time between
failures and time to repair, for your fiber, and we can
predict what the availability of any particular
region should be.” The problem is the following. Fiber providers (a) don’t really know this information,
and (b) even if they did, they’re going to lie; they’re not really
going to give us, someone who’s trying to
acquire fiber to build our network, accurate information. So the question is, given the fact that we have to acquire new fiber paths all the time to continue to build and
extend our network, and we need to acquire
enough fiber paths, and the right kinds of fiber paths, to meet the availability
targets that we want, how can we actually predict
what these time-between-failure and time-to-repair
distributions are going to be for a new fiber path? Now, we have a large bunch of fiber paths already
in our inventory. So, if we could take our experience that we’ve
actually measured from existing fiber paths
and figure out how to apply that to estimate the
characteristics for new fiber, we could do a pretty good job. Challenge here is
there are lots and lots and lots of
relevant dimensions. There’s where the fiber goes, there’s how many construction
sites it goes through, there’s how many people
live along that fiber, there’s how many kilometers
are aerial fiber, how many kilometers
are buried fiber. Lots and lots of other variables that can be extracted
for each of these paths. Unfortunately, we don’t
want topology decisions or fiber path selection decisions to be made based
on human emotion: “Oh, we bought
some fiber from this company last time and
they really were nice to us. We like them, so we
should buy more fiber from them,” or, “I know this person who worked
on that project and they’re very
reputable, so let’s buy.” Right? That’s not the right
approach to this. So, we’ve actually had some pretty good initial results in sort of this
overall problem space. This is data from not exactly the problem
I just described, but a very related
one of trying to estimate what the availability
of particular portions of our topology is
going to be based on our estimated and
then simulated data, versus the actual availability. So, the green bars represent our actual
observed availability, the blue is the average of our simulated and estimated
availability, and you can see in some cases we undershoot and in some
cases we overshoot. We’re doing a lot better
than random chance here. So, again that gives us hope
that this is the kind of thing where classification
algorithms can extract dimensionality
that’s not obvious to a human being that is still going to give us
good business value. Is this ML, is it statistics, should we be using PCA or other kinds of
straight stats versus harder-to-explain
classification algorithms? That’s a really interesting
conversation to have; maybe we’ll talk about it
during the panel. Another problem: optical
performance optimization. So, we have what we
call Open Line Systems. Microsoft has
actually been one of the companies out there
driving towards this. Rather than buying a highly
integrated system from one of the standard optical
vendors where you have to buy all the
equipment from them, we’re trying to get to
a disaggregated ecosystem where you can buy line systems and amplifiers from
one company and transponders from another, and mix
and match them together. One of the reasons is
because these systems, when you open them up, actually have lots
of parameters that we can tune to maximize the number of bits per
second that can actually be delivered across
any individual piece of glass. These are things
like the transmit power, the received power, how many amplifiers you
put along your path, all of this kind of stuff. Is this a good machine
learning problem? Okay, we’ve got a bunch of parameters and we want to
set them appropriately. This is a control system problem in anyone’s sense. I’m going to argue,
no, this isn’t. Why? One, availability is
job number one, right? So, I can’t actually tolerate many of the errors that a
machine learning system might introduce through
the misclassification or misestimation
of the parameters. Two, I have a really good model. I have a physics textbook
that more or less describes exactly the
plant model of how this optical line system
and all the amplifiers and power generation and noise perturbations
are going to work. So, I can use much more
standard control theory, because I have that forward
plant model to say, “If I use this set of parameters, what’s actually going to happen, what’s going to
happen to my signal?” And I can then spin that loop
as often as needed, and only when needed, to try and tune
the system, okay? If you’ve got something where the physical layer
is well understood, I don’t quite see
why we’re going to use ML to try and do this, even though I do see a lot of people out there trying to use ML kinds of approaches for
this sort of a problem. Third problem,
network availability. Hopefully, by now, you can
see where the thinking is. Essentially, if you’ve
got a really big dataset, you’ve got the potential for
a machine learning problem. If what’s underlying
that big dataset is a bunch of physical laws or properties that are
well understood, and you understand the interactions between all the variables, you’re probably better off using standard control theory and some forward plant model
to drive your system. If you’re in a world
where you don’t even know which of these variables
are actually salient and most important, you don’t know which
ones are going to be most descriptive in
predicting the outcome, that’s where a bunch of machine learning types
of algorithms can come in and actually help surface a much better model than any human being is
going to come up with. Let’s look at this last one,
network availability. So, again, let’s talk
about preventing failures. Now, it turns out
when you have a big complicated network like this, I have lots of redundant paths. If any device or a link fails and actually shuts itself
off when it fails, I am totally fine. The problem we have is what
we call gray switch failures, where some piece of
equipment actually starts dropping some small fraction of the packets that
go through it, but not enough that
it actually fail-stops and takes itself
out of service. When it stays in service
while dropping packets, now that’s a problem
my team has to go through and find the device
and shut it down, and there’s going
to be customer pain and impact until we do so. This really is the needle
in a haystack problem. Again, we have lots of
different data sources. In part, because
we’ve tried many, many different approaches
to this problem. So, for example, we’ve
set up Pingmesh. We have all of our servers
sending ping packets through the network and we’re measuring latency and failure rates. We have targeted probe systems, which are actually sending
different types of probe packets of
different kinds to try and simulate different behaviors, and we can target more probes
on the links that we think might be dropping packets. We’ve got error messages, basically debugging prints,
that are coming back from the switches that may or may not indicate a problem on
the switches themselves. We’ve got these end-to-end
service health metrics, all the services running across the network
which are telling us, “I’m happy, I’m not happy, I’m getting worse,
I’m getting better.” Essentially, the problem
we now have is, how do we combine
all these things to localize the problem? We have a whole bunch
of noise coming out of the best classification
and recognition systems that we were able to
build individually. So, this is now one of
those meta-level things. One set of techniques for
this would be boosting: how can we use boosting as
a way to pull more signal out of the noise being
generated by each of these other smaller
classification systems? We’ve also used
stochastic gradient descent and other kinds of algorithms, in a fault propagation model, to try and work
backwards to what is the most likely explanation for this set of symptoms that
we’re currently observing. But this, again, is a really
rich and important area, and something where, again,
we have these large datasets. And sometimes they’re
pretty well labeled: we know whether the services
are healthy or not. Coming back to that final point about the practitioner’s guide: I’m not a machine
learning person. I am a networking researcher
as well at heart, and the thing I’ve seen the most is what we call
the Knowledge Gap. There are some people who
know machine-learning, know the algorithms there
really, really well. I’ve got a bunch of
network scientists who know how ASICs work and how ASICs fail, and how to look at service data. The problem we have is when
people are more interested in their own field than they
are in solving the problem,
talk to each other, or they just fail to
talk to each other. So, for the data science people who want to come
work with systems people, I think the key thing is to become more interested in solving the problem and
learning more about the specific subject material than in even knowing the classification algorithms, in order to make forward progress: to basically establish
what are the metrics, what are the salient dimensions, to be able to work through
the data and sort through it, to come up with better outcomes. I’ve experienced some
positive examples of this. I’ve experienced some negative
examples of this, and I’m glad to be part of the conversation here and
talk more. So, thank you.>>Okay, questions? Okay,
next up we have Keith. Keith is from Stanford, where he is an
Assistant Professor. His research is in
Network Systems and it mostly cuts across traditional
abstraction boundaries and uses statistical techniques. Keith has won the SIGCOMM
dissertation award. Prior to being a professor, he was a staff reporter at
the Wall Street Journal. He later worked at Ksplice, which was acquired by
Oracle where he was the Vice President of Product Management
and Business Development, and according to his webpage, he was also responsible
for cleaning the bathroom. Yes. Welcome, Keith.>>Thank you. I don’t
know which input I’m on. I think he’s working on it. Perfect. All right,
thank you very much. I’m not an expert in anything, but I was invited
to be on this panel so I’m going to do my best. I think I’m going
to go with Balaji: machine learning is good. One thing I want to highlight is that when people say
machine learning, they mean a bunch of
different things. I think there are maybe, at least three paradigms in
which we talk about using machine learning
in the context of deploying an operational system. So, Paradigm 1 is what I’ll
call learn-then-deploy. So you learn something, machine learning
produces an artifact: someone runs in a data center
on a million-core GPU, all kinds of stuff, we
produce an awesome artifact, a learned network, and then
we deploy that network. That’s Paradigm 1. Paradigm 2 is about learning in real life. We deploy something and
then we learn over time. Something’s in
the field, we learn. Paradigm 3, I’m going to call, we learn from the machines. Not the computer learning,
but machine-learning teaches us about
our own thinking. Like these are
sort of three different ways we could talk about machine learning in the context
of network systems. I guess I have sort of
three different views. Paradigm 1, which I
think is currently the dominant one
people talk about. In my opinion, it’s often
harder than we expect, that’s not obvious but
it’s often tricky. Paradigm 2, this
deploy-and-learn, continuously
learning in the field, I think is quite valuable and I think we should research it, but it’s hard because of
the nature of network systems. I think there are other domains where this is probably easier, but in networks it’s not so easy. I think we should be
working on those problems. Paradigm 3, this idea that we can learn from the experience of teaching a machine to
match us in some capability, is sort of an old-fashioned, like 1960s, view, but I think
it’s still true today. I think we should talk
more in these terms. So, I’m going to try and talk
about it in that sense. So, Paradigm 1,
learn then deploy. Why do I say it’s often
harder than we expect? Well, I have been bit by this. So, here’s a paper
from me at NSDI 2013 about a machine-learning
congestion control algorithm that uses Bayesian reasoning for congestion control,
the Sprout algorithm. You can see on this graph, the best algorithms have the lowest delay and
the best throughput. So, the best algorithm
is our algorithm, Sprout, up and to the right;
we’re very proud of that. Now, what happens in real life? So we deploy this around
the world, and five years later, here’s the result that was taken yesterday on the T-Mobile
network in California. Same kind of graph.
Here’s Sprout. So, still sort of
up and to the right, but it’s not really
as dominant as we showed in our paper
from five years ago. There’s a whole bunch of
other algorithms that are on this Pareto frontier,
that’s in America. This is the same place we
measured when we did our paper. Here’s a result that was taken literally this morning in New Delhi, India, on the Airtel network. Here’s my algorithm, it
is terrible, it sucks. Can I say that here? I don’t know where,
okay, thank you. So, yes, this learn-then-deploy:
I learned, I wrote a paper about it, the paper won an award, I got a professorship. Now, we learned in the field
that it doesn’t actually work, at least not outside America. This is not a good paradigm
for anyone other than me, I’m the beneficiary,
but not everyone else. So, that’s my algorithm. Here’s a different
algorithm, Vivace, and this was published
at NSDI 2018. Here’s the publication again. We see really good
performance here: it shares the link as
flows enter and leave. This is in the paper.
Here’s the result, again, this morning in Brazil. Same algorithm but again it’s quite different
from the paper. We see that the first
flow dominates, the second flow much less,
the third flow much less. So, there’s the difference
again between the learn and then the deploy. Here’s a different paper,
this is the Pensieve paper that was at SIGCOMM, the
most recent SIGCOMM. This is a learned
scheme for deciding which video bit rate to
send a user over a network, where there’s unpredictability
about the network, and whether they’re
going to need to re-buffer and you want to have good quality video but you don’t want to
run out of buffer. So this is in the paper. If you take the same
learned artifact, and you deploy it
over similar networks, as some students in my class tried, they got these results. So, again, you learn it, it
works when you learn it, but later when you deploy
it, something’s different. It doesn’t work anymore. The only way to know
this is to teach a networking class where
you can compel students to reproduce papers, and
then they have to do it. It’s not actually easy to
learn this information. We put a lot of
effort into deploying this global infrastructure
whose conclusion is that I was wrong, but if we hadn’t done that we
would never know; I would be going around blissfully
talking about how great I was. Here’s another one: BBR. This is an algorithm from Google, published in ACM Queue, and they say BBR converges towards a fair share of
the bottleneck bandwidth, and worst-case, the oppositional algorithms might grab more than their fair share. BBR can have less
than its fair share. That’s the claim in the paper. Here is a slide
presentation from RIPE 76 just a few months ago, and the gentleman from APNIC
finds that’s just not true. Here’s a CUBIC flow, and suddenly when
the BBR flow starts, it crushes the CUBIC flow. So, there’s learn,
Google learned it, but then when you deploy it, you get something quite different. So, what they find is that BBR is not a scalable approach: it works great when just a few users use it,
like one big company, but not when lots
of people use it. Is BBR a failure? Not necessarily, because now
there’s going to be a BBR 2.0; we’ll see what happens. Here’s another example.
So, network systems is a broad field; let’s talk
about search engines. There’s a lot of interest
in using the data from search engines to learn
things about the real world. In 2008, Google had a paper
about learning about the flu, from the search queries
that they got. The idea here is to learn a model to predict
the real-world flu incidence by correlating search engine queries against government flu data, which is released on
a two-to-three-week time lag. They learned the model, and then they deployed it in
real time to predict flu in advance of the government, to help direct vaccines more efficiently. So, this was on the front page of The New York Times,
“Aches, a Sneeze, a Google Search”; it was in Nature; it was on Charlie Rose; and Google put this plot
on their website, and you can see the predictor which is in blue
does really well against the real-life data
which is in yellow. This is what they put
on their website. What they don’t show you is this: the vast majority of this plot is training data. So, this is what it
is, but it’s actually only a tiny bit here that
it was not trained on. So this is hard. So
they learned it on this training data here, I annotated the figure, but then, when they
actually deploy it (this is not deployment; they call it
historical estimates, but it wasn’t deployed
historically), when they actually deploy, this is what happens.
It goes crazy. We don’t know why. Google kept
trying to fix this and kept screwing up, and they
eventually gave up on it. We don’t know why this happens; maybe someone in
the audience knows why, I don’t know why. Here’s another example,
spam filtering, another sort of broadly construed network
systems problem. SpamAssassin is a spam
filtering engine that has a very clever
community approach. The idea is that
anyone can propose a spam filtering decision rule. It’s okay because a central party learns the best weights
for each of these rules, sort of like a panel of experts. So, anyone can contribute
a putative expert. People learn, essentially
they learn the best weights. So, if your expert is nonsense they’ll just give it
a weight of zero. If it happens to be very predictive they’ll give
it a strong weight. Those weights are then
deployed in the field. So, in 2007 people
proposed various rules, and someone proposed a rule that looked at whether the year in the date matched 200 and then any character. It turned out that
caught a lot of spam, because a lot of spam is sent
with a date in the future, like 2012, which was
in the future, so that would’ve been spam. So, this had an extremely low false positive rate
when they learned it. People can see why,
because it was in 2007 that they were learning
it? So, what happened? On January 1st, 2010, SpamAssassin started calling
every message spam. This is a very easy
case to see where learn-then-deploy just
falls down in the real world. Because they had this idea that the algorithms
were black boxes, they didn’t have to
know how they worked, they just looked at empirically
how well they worked, and they weighted them according
to real performance, that did not work
when time evolved. Time is a very clear case here
if you depend on the date. As soon as you hit 2010
time is going to evolve. So, what was the fix for this? How do you fix this problem?>>You’ve got to change the dates?>>What was that?>>You’ve got to change the dates?>>Well, they just
changed it to 2020. That’s their new decision rule. Yes. So, I think
the lesson here is that it is very easy
to fool yourself, and I think Feynman
says, “You are the easiest person to fool”. This learn-then-deploy is a challenging pattern, and empirically,
very, very smart people, and very, very smart
megacorporations, embarrass themselves
going on television with premature
declarations of success. So, I think we have
to keep that in mind. So, paradigm two is okay let’s deploy and learn over time. We’ll build systems that
learn continuously, and we’ll directly observe the operational figure
of merit over time. That way we can react quickly
to real-world changes. The real world changes,
and we just change. So, some people are
really good at this. So, here’s a paper from Google, the QUIC paper at
SIGCOMM last year. We see Google directly observes every day how
well QUIC is doing, and if there’s a problem,
they can immediately fix it. Whether it’s automatic
or it’s them, they can fix it. This is really a world that
I would want to live in. I would be very comfortable
with deploying something crazy because you can see immediately how well it’s doing. Not just in congestion control, but in the real world we
do learn over time, if I’m driving a car, and something bad happens to me, I’m going to learn,
I’m going to say, “Wow, I’m definitely not going to get that close to
a Ferrari next time”. You want to learn over time, and do it continuously; you don’t want to just
have a fixed mindset. If you build a robot
to change a diaper, and the baby wiggles out of it, you want to learn
from that experience; you don’t want to be like, “Okay, I’m going to stay fixed”. So, this continuous learning has some intuitive appeal,
but it’s hard. It’s hard on a network. There’s some classic results
that make it hard in cases where information
is distributed. So, learning is
much easier when all the information
comes to one place. When information is distributed, it’s really hard, and
there are some old results, which are therefore profound, that say that in some cases
it’s actually impossible: “Impossibility of Bayesian Group
Decision Making with Separate Aggregation
of Beliefs and Values”. So, if you keep
the information apart, it’s really hard to learn
something consistent. If you bring all
the information to one place, you can do it. If you’re keeping
information apart, which is what we’d like
to do on a network, we like to be decentralized,
it’s hard to do. Also, if agents are adversarial, it’s very
hard to handle this. How do you handle
competing agents? One guy says, “Oh, I’ve learned this is best”, another says,
“I learned this is best”. These are hard problems. They
are hard network problems. They are problems
we should work on, but they’re not easy. Michael Shapiro and I actually have
a conjecture that it’s impossible for a decentralized
congestion control scheme that greedily optimizes
an objective function to be globally
asymptotically stable. That’s an unproved conjecture, but it may be hard in
these adversarial cases. How about in neural networks? Do we have to just
bring all the data to the Cloud and learn there,
or just learn at the edge? Wouldn’t it be nice (my student John Evans is working
on this project) if the data and the compute
are in different places, to still be able to learn
without having to bring all the compute to the edge
or all the data to the Cloud? It’d be nice to split
the problem somehow, and even learn continuously
between data at the edge and compute in the Cloud,
and learn between them. Unfortunately, so
far you can’t do it. There’s a green region we want to be in and
we’re not in there. So, lesson two is
deploy-and-learn can be great, but it’s hard on a network, when data’s in different places, compute’s in different places,
and people are adversarial. So, to wrap up, I think what
I’m most excited about is this old-fashioned view that the real benefit of teaching
something to a machine, is that you learn about
that thing yourself. We can tell the machine,
“Design a system: here are my requirements, here are my goals.” The machine comes back and
says, “How about this?” It’s better if it looks weird, if we think, “That’s crazy, but huh, it does meet
the requirements, and it does maximize
the objective. That’s very interesting; let
me clarify my thinking.” That “huh” is like the most
exciting thing in science. So, teaching something is
the best way to learn anything, and the dumber the student, the
better the teacher will learn, because you have to explain
yourself so clearly. Machines are very dumb. So, teaching machines to learn to design systems
is the best way for us to learn about what it really means and what
we really care about. So, to wrap up here,
I think Paradigm one, it gets a lot of hype but it
is harder than we expect, and it is so easy
to fool yourself. Paradigm two, I think we
should be working on, and I certainly am working on it, but the problems of
network systems, the decentralization of data
and compute, make it hard. And Paradigm three, I’m
quite sympathetic to, and I think even
if we never deploy anything that’s the product
of Machine Learning, we should still do
Machine Learning because we learn so
much about the problem.>>So, we had a set of really interesting
perspectives on the topic of this panel. I’d like to thank
all the speakers. I think what I heard was that for problems of
an inference nature, I think Balaji, you
talked about this, and I think Dave, you
also talked about it, where we are trying to figure
out where the fault is in the network and things like that, it’s obviously something that lends itself well to
machine learning. I think those who talked
about using data broadly, and Keith, you had very interesting perspectives on different styles of
applying machine learning. I’d like to just start off with a question and then we’ll open the floor for the audience
to ask questions. I think networking people, by training, sort of
like things to be predictable. The RFCs specify
what protocols must do, and if it’s not a must
it’s at least a should. If such a message comes,
this is what you should do. If there’s a packet loss, this is what you
should do and so on. So, is there a cultural issue in adapting to a world
where you don’t have that predictability
around the data, and an algorithm that
you don’t understand tells you something and you want to make decisions based on that? So, do you think there
is an inherent tension there, and that there’s
a little hump that people need to cross to get
comfortable applying machine learning for
critical networking decisions?>>So, I mean, for the examples I talked about, there are well-known actual
algorithms that do the job, LASSO, MUSIC, and physics guys. I was saying there may be some, for example in
the networking case, the neural network
doesn’t need to know the adjacency matrix
because it’s just a map. It’s a signal.
That’s an advantage. It doesn’t need to
know the nature of the noise. It’s
going to learn. So, I think it’s
mostly that there’s a certain fact, that the function
is learnable, and really most of
the work goes into establishing that fact, and then you try using a neural
network to learn that function. In the case of MUSIC, it
wasn’t straightforward. You really had to
work quite hard. So, off-the-shelf stuff
doesn’t work. But there wasn’t much doubt
that the function is learnable, because we know what the function is and it’s
got continuous parts and lends itself just
visually to being learnable. So, I think that’s one thing
about this question of whether you want data models: the way you train these models, frankly, is you need data
just to begin with. Much of what we did, or anybody’s doing, just requires it. Whenever you’re trying to
learn a function y equals f of x, and you have a lot of (x_i, y_i) pairs, then the function
has to be learnable. So, that’s really what I’ve seen. There may be some cases
where the function has some very bad sort of Lipschitz continuity
properties, and then it may be difficult to learn those bits, but otherwise so far
it’s pretty decent.>>I can give you
function approximation example. But I was asking
the question more broadly about making
networking decisions.>>Yes, I understand.>>So, I heard the question
as one of, geez, if I have a system
that I don’t know exactly how it works, how do I have confidence
that I could actually take action based on
its recommendations and not make things worse? And I’ll go
back to the adage: people have a tendency
to screw things up, but to really make a big mess
it takes a computer. So, I am trying to hook up essentially machine
learning classification systems that are looking at data to automatic mitigation
mechanisms that will go in and change our network
to try and fix problems. What that means is that if
that system runs amok, you could actually go
create a much bigger problem for our customers
than what actually existed there organically
in the network. So, for us it’s been
this combination of guardrails around the behavior of this feedback loop, essentially,
so that no matter what the classifier says it thinks it should turn off in order to mitigate a problem. Remember that gray
failure example? If I can find the switch
I think is bad, I turn it off,
everything gets good. But if I turn off
all the switches, that’s bad. So putting guardrails
around this so we don’t get kind of runaway behavior
and that if it looks like things are getting worse
and worse then we apply the brakes to the system
and actually start rolling back some of
the changes that we made. In my domain, at least, we have the ability
to sort of create a closed feedback loop and that’s how we’ve approached it. Certainly our system
has recommended things to us that made no sense, and sometimes it’s right, sometimes
it’s wrong.>>But one of the things I like about Dave’s adage, “To err is human, but to really foul
things up requires a computer”, is that it
comes from the 1970’s. They already knew back then. But when you talk about concerns about whether we have a system whose behavior we
don’t really understand, I got to tell you
you’re deluded if you think we understand
how the systems we’ve been building work. Okay. We already don’t
understand them. It’s just a question of degree.>>I think it’s
a question of illusion. That is, we thought we understood this, so that
we felt comfortable.>>But going back a little bit to the specifics of your question, there’s this cultural question like “Do network
engineers want to give up control to something they don’t understand,
to a machine?” But on the other hand,
if you actually look at RFCs that say
should and must, that kind of provides the framework, and the
more of that that’s in the RFC, probably the better off you are, because it says
your learning algorithm is only going to
have a few knobs that can change and you have some hope you’re going
to understand it. So, that goes both ways.>>I think it’s also
this composition. You can probably understand
something at a single level. It’s like when you
have packets here and then there’s protocols that are driving these packets
and then they rely on the arrival of acknowledgements
and that doesn’t happen. Now, something at
this level triggers something at one level above
that triggers something. It’s almost like an f of g of h of k sort of composition now. You only see the first thing and the last thing and you don’t
know where it’s breaking. That sort of stuff
is really hard. You can’t isolate the fault very easily and that’s what
frustrates us networking folks. We design with understanding
but unfortunately we compose things now and they
become big and complex.>>I think there’s sort of a sense where if something has only a small number of parameters we thought that we understood it. Take the AIMD algorithm in TCP. You additively increase by
one, divide by two. There’s two parameters and
we wrote some nice proofs. And to then discover that
actually the flock behavior of a decentralized network
of computers running TCP, even with those
two parameters, is incredibly complicated. It has all kinds of fun
things that we discovered. It’s actually
chaotic over drop-tail queues, and there are problems like incast that have to
do with the sort of packet-by-packet behavior that doesn’t show up
in fluid models. So, even something with
just two parameters, when run in a decentralized
network situation, is so fun but also so buggy. That’s a sort of fun and
also humbling discovery. I think there’s an appropriate
caution about deploying something with 100 million
learned parameters. If we can’t even get the AI and the MD, understand those dynamics, I think we should
be appropriately humble about our ability to reason about a system
with a million parameters. Doing machine learning to
optimize tail behavior or worst-case behavior
especially in the presence of adversarial input is
an unsolved problem. It’s one thing to optimize average-case behavior of an
image recognition program. And we know that
with adversarial input in image recognition, actually the adversary
has the better hand there. You can change
one pixel and suddenly a puppy turns into
the Sears Tower. So, I think there’s an appropriate caution about increasing the number
of parameters. Defining a guardrail, which I
think is such a good point, I don’t think we
know how to define the guardrails for
many decentralized scenarios. I mean, how do you
define stability? How do you define
a guardrail when you only observe part
of the situation? These are fun problems that we should be working on but I don’t think we have the answers.>>Right. I think
an interesting question is whether adversarial attacks become more likely or less likely with a machine learning system versus a hand-crafted system. But I think let’s hold
onto that question. I’d like to open the floor for questions
from the audience. Yes. Do we have
microphones? Okay.>>Thanks. Is this on? Okay. Thanks for a really
interesting set of talks. My question is sort of more
broadly about the use of training datasets for machine learning
which is fundamental. An obvious problem with
such training datasets is that rare events are
by definition rare, and the kind of control systems that we need to put into
place for dealing with such events are not going to be perhaps even available
in the training dataset. So, there’s a
critical dependency on the training dataset, and the fact that
these datasets may not include the cases that
you care about in practice makes reliance on machine learning only as your control system probably bad.>>I think that’s a fair point. So, the kinds of
things that are in my examples that we’re
trying to learn do actually happen a couple of
times a day across the fleet. So, they’re rare but they happen often enough
that we actually have positive examples
and we can define a false-positive
false-negative rate on it. I still have to have
human engineers on call 24/7, 365, because
things will break in ways that have never been
seen before and that any of the automatic
classification systems just don’t know what to do with. Then we’re back to
good old human troubleshooting. I mean, we’ve toyed with the idea of trying to actually inject failures: when we have a partial
forward plant model and can predict, if
this thing broke, what it would look like in terms of the symptoms we observe, trying to inject more, essentially,
synthetic failures into our training data. Mixed success.>>Any other questions?>>I guess a question maybe
following up on Keith’s talk. Do you think there’s
a prospect for designing systems to be machine learnable?>>Yes.>>And tell us how.>>Oh.>>I mean,>>Okay. Slow systems
without a lot of high frequency response
are much easier to train a system for than
ones that are very twitchy. So, I think stability, slow evolution, and
behavior over time. Good network architecture,
good system architecture, whether you’re trying to have
humans understand what’s going on or an AI system
understand what’s going on, they’re all easier to control.>>We’ve understood
for a long time how to build congestion
control protocols that don’t need machine
learning to operate well. We just don’t build
those systems. For whatever reason, we use more complicated
mechanisms for coordination than we
really needed to, just because Van invented an algorithm that we all have
to live with now. So, there is
this question, I think, of: if we have design
for testability, or modular design for
agility or whatever, do we need a set of
design principles for telling engineers how to build systems that can
actually be predictable?>>Sure. I think, they involve taking away the things that make
the problem hard. So, get rid of decentralization, get rid of adversaries, and the problem becomes much easier. So, a congestion control scheme where there is a central
dictator that governs the allocation for
everybody is much easier to reason about than
a Van Jacobson-style scheme where there’s a partially observable Markov
decision process that’s solved in
decentralized fashion. That is literally
an undecidable problem in general in computer science. So, that’s why congestion control is so hard if you
do it decentralized. If you bring all
the information in one place and have a dictator
make the decision, it becomes much easier.>>But maybe an observation is that there are many of
these kinds of tussles. For another one, it’s
not centralized versus decentralized, but time to convergence, or efficiency
and optimality. Sometimes giving up on some of those dimensions can actually get you a system
that’s going to be far easier to understand
or far simpler. But it is a trade-off.>>I agree. So, maybe there’s
an intuitive feeling that the more you squeeze the
optimization level in one place, all the other things
are going to get worse. So, if you squeeze for
great average-case performance, probably your tails
and your reliability and your adversarial robustness, adversary in form of loosener
are going to get worse. So, maybe if we stop squeezing
for the thing that gets you promoted, then
we’ll all be happier. I mean, being less glib.
Maybe if we somehow value these harder to define metrics instead of just your
average-case performance or 95th percentile performance and stop squeezing so
hard on that one, maybe understandability
and stability, if we can quantify those, we might be able to
optimize for those instead.>>Also, in the world of algorithms, there’s a better chance
than in systems. There’s more chance of this
being very helpful just because it’s very clean and you’ve got an easier definition of what you want: if this input arrives, I want this desirable output. I think there you
can see success.>>It seems to me one
of the requirements is a good telemetry system
that actually gives you the data that you need
from actual deployment. I think search engines did
a great job because they had all the data, and so they know exactly what it is that
we’re clicking on and so on. Whereas in networking systems, I don’t think
a lot of data comes back because it’s all over the place and it’s
expensive to bring it back. So, I think that’s the challenge. I think we have
just a couple more minutes, so I want to ask
a final question.>>I think it’s fair to say that overall the whole systems and networking
community has a cautious approach to
infusing ML into the whole mix. It’s fair. A lot of reasons were brought up
in all the presentations; we can outline those as well. My question to the panelists is, do you see any
sub-areas of systems and networking where we would actually be better off if
we lowered the caution bar, so that the systems could evolve faster and for the better? I understand it’s not
all the sub-areas but any specific sub-areas
you could think of?>>How about the scheduler or the branch predictor?
Those kinds of things.>>Why is it, do you
think, that lowering the caution bar in those is
better than in, say, congestion control?>>I think those are
examples perhaps where, when there’s a misprediction, the cost isn’t so high. Now, David, he’s worried about individual
packets being lost. That’s not the right case, then; you might
not want to do it. But if it’s just a
question of, well, it’s just going to
cost us a little more but we get the same performance, or it’s a microscopic change
in performance, you don’t need to be as cautious. That’s the low-hanging fruit. And especially, you may
not need to know why it made the decision
because if it’s wrong, it didn’t cost you that much.>>There’s totally
no way that I could use quirks of our branch
prediction algorithm and speculative execution to steal information out of
a process or anything. That would totally never work.>>Yes. I’m not sure it’d be worse if you used a neural network, though. I mean, how about
the query planner in a database? I think for these kinds of things, we’re already using sort
of folklore methods, so why not use a neural network?>>Okay. I think
we need to close, but I just want you to
take 30 seconds each. Usually, we ask people
to predict the future, but I’m going to change
the question here. If you had that machine
learning technology and horsepower that we have
today in the ’80s, when congestion control took shape or in the ’90s
when BGP was designed or maybe the 2000s when firewalls and deep packet inspection and so on
came into the fold, how would things have been
done differently that would perhaps make
us better off today? If you had to
redesign things with machine learning technologies
available to you, what would have happened
differently? Just quick.>>Okay, I can go first. I’m just using the example as music thing. It learns the environment. So you come off
the shelf with an algorithm that works in many environments fairly well but is not optimized for any of them, and then you
deploy in a particular one, like your indoor home area, where you’ve got some VR headset. That thing really
works differently in the basement room or
which one each work. I think that’s an example
of something where it’s adaptive to the environment.>>I think that if we’d had
this horsepower in the ’80s, I would’ve been able to say, “Okay, Google, why is
the network so slow?” It would say, “I’m sorry. I
can’t help you with that.” That’s the kind of
thing it’s good at. It’s recognizing patterns. It can understand my question, but it doesn’t know the answer.>>Anyone else?
Congestion control.>>Okay, fine.
Congestion control. The truth is that people have been talking about
congestion control in the context of machine
learning since it started. I mean, Keshav’s paper on
packet pair talks about the then-prevailing paradigm
of machine learning, which was fuzzy logic.
That’s right, right? Yes, I mean, it’s all about
fuzzy logic, which was the über-hyped thing
in the early ’90s. I bet if deep neural networks
had been around then, you would have read
the exact same paper, but with fuzzy logic subbed
out for neural networks, and nothing would have changed about the algorithm. So, I don’t know.>>Okay. The last one.>>Yes. I’m not really
coming up with one either. The one thing I do
know is whenever Skynet tries to take
over the world, it will probably crash because
of a networking problem.>>Okay. On that note, let’s
thank our panel again.
