The brave new world of cloud-scale systems and networking with Dr. Lidong Zhou

This post has been republished via RSS; it originally appeared at: Microsoft Research.

Dr. Lidong Zhou

Episode 82, June 26, 2019

If you’re like me, you’re no longer amazed by how all your technologies can work for you. Rather, you’ve begun to take for granted that they simply should work for you. Instantly. All together. All the time. The fact that you’re not amazed is a testimony to the work that people like Dr. Lidong Zhou, Assistant Managing Director of Microsoft Research Asia, do every day. He oversees some of the cutting-edge systems and networking research that goes on behind the scenes to make sure you’re not amazed when your technologies work together seamlessly but rather, can continue to take it for granted that they will!

Today, Dr. Zhou talks about systems and networking research in an era of unprecedented systems complexity and what happens when old assumptions don’t apply to new systems, explains how projects like CloudBrain are taking aim at real-time troubleshooting to address cloud-scale, network-related problems like “gray failure,” and tells us why he believes now is the most exciting time to be a systems and networking researcher.



Lidong Zhou: We have seen a lot of advances in, for example, machine learning and deep learning. So, one thing that we have been looking into is how we can leverage all those new technologies in machine learning and deep learning and apply it to deal with the complexity in systems.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: If you’re like me, you’re no longer amazed by how all your technologies can work for you. Rather, you’ve begun to take for granted that they simply should work for you. Instantly. All together. All the time. The fact that you’re not amazed is a testimony to the work that people like Dr. Lidong Zhou, Assistant Managing Director of Microsoft Research Asia, do every day. He oversees some of the cutting-edge systems and networking research that goes on behind the scenes to make sure you’re not amazed when your technologies work together seamlessly but rather, can continue to take it for granted that they will!

Today, Dr. Zhou talks about systems and networking research in an era of unprecedented systems complexity and what happens when old assumptions don’t apply to new systems, explains how projects like CloudBrain are taking aim at real-time troubleshooting to address cloud-scale, network-related problems like “gray failure,” and tells us why he believes now is the most exciting time to be a systems and networking researcher. That and much more on this episode of the Microsoft Research Podcast.

Host: Lidong Zhou, welcome to the podcast.

Lidong Zhou: Yes. It’s great to be here.

Host: As the Assistant Managing Director of MSR Asia, you are, among other things, responsible for overseeing research in systems and networking, and I know you’ve done a lot of research in systems and networking over the course of your career as well. So, in broad strokes, what do you do and why do you do it? What gets you up in the morning?

Lidong Zhou: Yeah, I think, you know, this is one of the most exciting times to do research in systems and networking. And we have already seen how advances in, you know, systems and networking have been pushing the envelope in many technologies. We’ve seen the internet, the web, web search, big data, and all the way to the artificial intelligence and cloud computing that, you know, everybody kind of relies on these days.

Host: Yeah.

Lidong Zhou: All those advances have created challenges of unprecedented complexity, scale and a lot of dynamism. So, my understanding, you know, of systems is always, you know, a system is about bringing order to chaos, right? The chaotic situation. So, we are actually in a very chaotic situation where things change so fast and there are a lot of, you know, new technologies coming. And so, when we talk about systems research, it’s really about transforming all those unorganized pieces into a unified whole, right? That’s why, you know, we’re very excited about all those challenges. And also, we realized over the years that it’s actually not just the typical systems expertise – when we talk about distributed systems, operating systems or networking – that’s actually not enough to address the challenges we’re facing. Like, you have to actually also master other fields like, you know, database systems and programming languages, compilers, hardware, and also in artificial intelligence and machine learning and deep learning. And what I do at Microsoft Research Asia, is to put together a team with a diverse set of expertise and inspire the team to take on those big challenges together by, you know, working together, and, you know, that’s a very exciting job to have.

Host: I love the “order out of chaos” representation… if you’ve ever been involved in software code writing, you write this here and someone else is writing that there, and it has to work together, and you’ve got ten other people writing… and we all just take for granted, on my end, it’s going to work. And if it doesn’t, I curse my computer!

Lidong Zhou: Yes, that’s our problem!

Host: Well, I had Hsiao-Wuen Hon on the podcast in November for the 20th anniversary of the lab there, and he talked about the mission to, in essence, both advance the theory and practice of computing, in general. Your own nearly twenty-year career has been about advancing the theory and practice of distributed systems, particularly. So, talk about some of the initiatives you’ve been part of and technical contributions you’ve made to distributed systems over the years. You’ve just come off the heels of talking about the complexities. Now, how have you seen it evolve over those years?

Lidong Zhou: You know, I think we are getting into the year of distributed systems. Being a distributed systems person, we always believe, you know, what we’re working on is the most important piece. You know, I think Microsoft Research is really a great place to connect theory and practice, because we are constantly exposed to very difficult technical challenges from the product teams. They’re tackling very difficult problems, and we also have the luxury of stepping back and thinking deeply about the problems we’re facing and thinking about what kinds of new theories we want to develop, what new methodologies we can develop to address those problems. I remember, you know, in early 2000, when Microsoft started doing web search, and we had a meeting with the dev manager, who was actually in charge of architecting the web search system. And so, we had a, you know, very interesting discussion. We talked about, you know, how we were doing research in distributed systems, how we had to deal with, you know, a lot of problems when services fail. So, we have to make sure that the whole service actually stays correct in the face of all kinds of problems that you can see in a distributed system. I remember at that time, we had Roy Levin, Leslie Lamport, you know, a lot of colleagues, and we talked about protocols. And, at the beginning, the dev manager basically said, oh yeah, I know, you know, it’s complicated to deal with all these failures, but it’s actually under control. And a couple months later, he came back and said, oh, you know, there’s so many corner cases. It’s just beyond our capability of reasoning about the correctness. And we need the protocols that we were talking about. But it’s also interesting that, you know, in developing those protocols, we tend to make some assumptions. Say, okay, you know, we can tolerate a certain number of failures. 
And one question that the general manager asked was, you know, what happens if we have more than that number of failures in the system, right? And from a practical point of view, you have to deal with those kinds of situations. In theory, when you work on theory, then, you know, you can say, okay, let’s make an assumption and let’s just work under that assumption. So, we see that there’s a difference between theory and practice. The nice thing about working at Microsoft Research is you can actually get exposed to those real problems and keep you honest about what assumptions are reasonable, what assumptions are not reasonable. And then you think about, you know, what is the best way of solving those problems in a more general sense rather than just solving a particular problem?
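Editor's note: the "certain number of failures" assumption can be made concrete with a little arithmetic. The sketch below assumes a majority-quorum replication protocol, which is one common design; the conversation does not say which protocols were actually discussed, and the function names are illustrative only.

```python
def majority_quorum(n_replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return n_replicas // 2 + 1


def max_tolerated_crashes(n_replicas: int) -> int:
    """Largest number of crashed replicas that still leaves a majority alive."""
    return (n_replicas - 1) // 2


def can_make_progress(n_replicas: int, crashed: int) -> bool:
    """True if the surviving replicas can still form a majority quorum."""
    return n_replicas - crashed >= majority_quorum(n_replicas)


if __name__ == "__main__":
    n = 5
    print("tolerated crashes:", max_tolerated_crashes(n))
    for crashed in range(n + 1):
        print(crashed, "crashed ->", can_make_progress(n, crashed))
```

With five replicas the protocol keeps working through two crashes. The manager's question is about the third crash: under this assumption the service stops making progress, and the protocol alone does not say what operators should do next.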

Host: Your work in networked computer systems is somewhat analogous to another passion of yours that I’m going to call “networked human systems.” In other words, your desire to build community among systems researchers. How are you going about that? I’m particularly interested in your Asia Pacific Systems workshop and the results you’ve seen come out of that.

Lidong Zhou: So, I moved to Microsoft Research Asia in late 2008, and, when I was in the United States, clearly there is a very strong systems community. And, over the years, we’ve also seen that community sort of expanding into Europe. So, the European systems community sort of started the systems workshop, and eventually it evolved into a conference called EuroSys, and very successfully. And you know we see a lot of people getting into systems and networking because of the community, because of the influence of those conferences. And the workshop has been very successful in gathering momentum in the region. And so, in 2010, I remember it was Chandu Thekkath and Rama Kotla who were my colleagues at Microsoft Research, and they basically had this idea that maybe we should start something also in the Asia Pacific region. At that time, I was already working in Beijing, and I thought, you know, this is also part of my obligation. So, in 2010, we started the first Asia Pacific systems workshop. And it was a humble beginning. We had probably about thirty submissions and accepted probably a dozen. It was a good workshop, but it was a very humble beginning, as I said. But what happened after that was really beyond our expectation. It’s like, you know, we just planted a seed, and the community sort of picked it up and grew with it. And, you know, it’s very satisfying to see that we’re actually going to have the tenth workshop in Hangzhou in August. If you look at the organizing committee, they are really you know all world-class researchers from all over the world. It’s not just from a particular region, but you know really, all the experts across the world contributed to the success of this workshop over the last, you know, almost ten years now. And the impact that this workshop has is actually pretty tremendous.

Host: What would you attribute it to?

Lidong Zhou: I think it’s really, first of all, this is the natural trend, right? You go from… the U.S. was leading in systems research and, and then expanded to Europe. And it’s just a natural trajectory to expand further to Asia Pacific given, you know, a lot of, you know, technological advances are happening in Asia. And the other, you know, reason is because the community really came together. There are a lot of top systems researchers that originally, just like me, came from the Asia Pacific region. So, we have a lot of incentives and commitment to give back.

Host: Right.

Lidong Zhou: And all that enthusiasm, passion, and willingness to help young researchers in the region, I mean, those actually contributed to the success of the workshop, in my view.

Host: Well, you were recently involved in hosting another interesting workshop, or conference: The Symposium on Operating Systems Principles, right?

Lidong Zhou: Right.

Host: SOSP?

Lidong Zhou: SOSP.

Host: And this was in Shanghai in 2017. It’s the premier conference for computer systems technology. And as I understand, it’s about as hard to win the bid for as the Olympics!

Lidong Zhou: Yes, almost.

Host: So why was it important to host this conference for you, and how do you think it will help broaden the reach of the systems community worldwide?

Lidong Zhou: So, SOSP is one of the most important systems conferences and traditionally, it has been held in the U.S. and later on, they started rotating into Europe. And it was really a very interesting journey that we went through, along with Professor Haibo Chen who is from Shanghai Jiao Tong University. We started pitching for having SOSP in the Asia Pacific region in 2011. That was like six years before we actually succeeded! We pitched three times. But overall, even for the first time, the community was very supportive in many ways, so that we’d be very careful to make sure that the first one is going to be a success. And in 2017, when Haibo and I opened the conference, I was actually very happy that I didn’t have to be there to make another pitch! I was essentially opening the conference. And it was very successful in the sense that we had a record number of attendees, over eight hundred people…

Host: Wow.

Lidong Zhou: …and we had almost the same number, if not a little bit more, from the U.S. and Europe. And we had, you know, many more people from the region, which was what we intended.

Host: Mm-hmm.

Lidong Zhou: And having the conference in the Asia Pacific is actually very significant to the region. We’re seeing more and more high-quality work and papers in those top conferences from the Asia Pacific region, you know, from Korea, India, China, and many other countries.

Host: Right.

Lidong Zhou: And I’d like to believe that what we have done sort of helped a little bit in those regards.

(music plays)

Host: Let’s talk about the broader topic of education for a minute. This is really, really important for the systems talent pipeline around the world. And perhaps the biggest challenge is expanding and improving university-level education for this talent pipeline. MSRA has been hosting a systems education workshop for the past three years. The fourth is coming up this summer, and none other than Turing Award winner John Hopcroft has praised it as “a step toward improving education and cultivating world-class talent.” And he also said a fifth of the world’s talent is in the Asia Pacific region, so we’d better get over there. Tell us about this ongoing workshop.

Lidong Zhou: Yeah, actually John really inspired us to get this started I think more than three years ago.

Host: Mm-hmm.

Lidong Zhou: And I think we’re seeing a need to improve, you know, systems education. But more importantly, I think, for MSR Asia, one of the things that we’re very proud of doing is connecting educators and researchers from all over the world, especially connecting people from, you know, the U.S. and Europe with those in the Asia Pacific region. And the other thing that we are also very proud of doing is cultivating the next generation of computer scientists. And certainly, as you said, you know, the most important thing is education. And during the process, what we found, is that there are a lot of professors who share the same passion. And we’re talking about, you know, a couple of professors, Lorenzo Alvisi from Cornell and Robbert van Renesse from Cornell and Geoff Voelker from UCSD… they actually came all the way from the U.S. just to be at the workshop, talking to all the systems professors from all over the country in China. And so, I attended those workshops myself. The first one was five days, and the next two were, like, three days. It’s a huge time commitment.

Host: Yeah.

Lidong Zhou: But you see all the passion from those professors. They’re really into improving teaching. They’re trying to figure out, you know, how to make students more engaged, how to get them excited about systems, even how to design experiments, all those aspects. And, you know, we’re really optimistic that with those passionate professors, we’re going to see a very strong new generation of systems researchers. And this is, you know, I think the kind of impact we really want to see from a perspective of, you know, Microsoft Research Asia. It’s not just about making the lab successful, but, if we can make an impact in the community in terms of talent, in terms of the quality of education, that’s much more satisfying.

Host: Before we get into specific work, I’d like you to talk about what you’d referred to as a fundamental shift in the way we need to design systems – and by we, I mean you – in the era of cloud computing and AI. You’ve suggested that things have changed enough that the older methodologies and principles aren’t valid anymore. So, unpack that for us. What’s changed and what needs to happen to build next-gen systems?

Lidong Zhou: Yeah, that’s a great question. I’ll continue with the story about building fault-tolerant systems. So, in the last thirty years, we have been working on systems reliability, and we have developed a lot of techniques, a lot of protocols, and we think it will solve all the problems. But if you look at how this thread of work started, it really started in the late seventies when we were looking at the reliability of airplanes, and so on. Of course, you know, there are assumptions we make about the kinds of failures in those kinds of systems. And we sort of generalize those protocols so that it can be applicable up until now. But if you look at the cloud, it’s much more complicated, in many dimensions. And the system also evolves very quickly. And a lot of assumptions we make actually start to break. And even though we have applied all these well-known techniques, that’s just not enough. So, that’s one aspect. The other aspect is, it used to be that, you know, the system we build, we can sort of understand how it works, right? And now, the complexity has already gone beyond our own understanding, you know. We can’t reason about how the system behaves. On the other hand, we have seen a lot of advances in, for example, machine learning and deep learning. So, one thing that we have been looking into is how we can leverage all those new technologies in machine learning and deep learning and apply it to deal with the complexity in systems. And that’s, you know, another very fascinating area that we’re looking into as well.

Host: Yeah. Well, let’s get specific now. Another super interesting area of research deals with exceptions and failures in the cloud-scale era and how you’re dealing with what you call “gray failure.” And you’ve also called it the gray swan (which I want you to explain) or the Achilles heel of cloud-scale systems. So how did you handle exceptions and failures in a somewhat less complex, pre-cloud era and what new methodologies are you trying to implement now?

Lidong Zhou: Right. So, as I mentioned, in the older days, we were targeting those systems with assumptions about failures, right? Like crash failures, you know, a component can fail… when it fails, it crashes. It stops working. And nowadays, we realize, you know, this kind of assumption no longer holds. So, this is why we defined a new type of failure called gray failure. So, we were thinking about what kind of name to give to this very interesting new line of research that we’re starting, and we called it gray swan. People already know about black swan or gray rhino. So first of all, because we’re talking about the cloud, we want something not as heavy as a rhino!

Host: Right.

Lidong Zhou: We want something that can fly. And the reason we call it gray is because, you know, a system component is no longer just black or white. It could be in a weird state where, to some of the observers, it’s actually behaving correctly, but to the others, it’s actually not. And that turns out to be behind many of the major problems that we’re seeing in the cloud. And it has sort of some components of black swan in the sense that some of the assumptions we’re making break. So that’s why everything we build on top of that assumption starts to break down. So, for example, I mentioned the assumption about failure, right? If you think that it either crashed or it’s correct, then it’s a very simple kind of world, right? But if that’s not the case, then all the protocols that work under that assumption will cease to work. It also has this connection with gray rhino because gray rhino is this problem that everybody sort of sees coming, and it’s a very major problem, but people tend to ignore it for the wrong reason. And in our case, we know that, for the cloud, all those service disruptions happen all the time, and there are actually failures all over the place. It’s just very hard to figure out which ones are important. But we know something big is going to happen at some point, right? So, we try to use this notion of gray swan to describe this new line of thinking where, you know, we really think about failures that are not just crash failures or not even, you know, Byzantine failures, which are essentially arbitrary failures. But there’s something in between that we should reason about, and then use that to reason about the correctness of the whole service.
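Editor's note: the idea that a component can look correct to some observers and faulty to others can be sketched in a few lines. This is a simplified illustration, not a definition from the research: the observer names are hypothetical, and treating "every observer sees a failure" as a crash is an assumption made to keep the example small.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    CRASHED = "crashed"
    GRAY = "gray"


def classify(observations: dict[str, bool]) -> Verdict:
    """Classify one component from per-observer health reports.

    `observations` maps a (hypothetical) observer name to True if that
    observer currently sees the component behaving correctly.
    """
    if not observations:
        raise ValueError("need at least one observation")
    seen = set(observations.values())
    if seen == {True}:
        return Verdict.HEALTHY
    if seen == {False}:
        # Simplification: unanimous failure is treated as a crash.
        return Verdict.CRASHED
    # Observers disagree: the component is neither black nor white.
    return Verdict.GRAY


if __name__ == "__main__":
    reports = {"heartbeat_monitor": True, "client_app": False}
    print(classify(reports).value)
```

A protocol that only distinguishes "healthy" from "crashed" has no branch for the third case, which is the gap the gray-failure line of work is pointing at.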

Host: So, does the word catastrophic enter into this at all? Or is it…

Lidong Zhou: Yes! That could be catastrophic. Eventually.

Host: How does that kind of thinking play into what you’re doing?

Lidong Zhou: If you look at the cloud system, it’s like a rhino sort of charging towards you, and before it hits you, there is a lot of dust and, you know, noise and other things. But you just don’t know when and how something bad is going to happen, right? And it could be catastrophic. It has actually happened a couple of times already. And so, one of the things we try to do is to figure out when and how bad things could happen, to prevent catastrophic failures…

Host: Right.

Lidong Zhou: …from all the dust and maybe, you know, other signals we have in the system. There are signals. It’s just we don’t know how to leverage them.

Host: Part of your approach to coping with gray failures is a line of research you call CloudBrain.

Lidong Zhou: Right.

Host: And it’s all about automatic troubleshooting for the cloud. It’s actually a huge issue because of the remarkable complexity of the systems. So, tell us how CloudBrain, and what you call DeepView, is actually helping operators – the people that have to deal with it on the ground – simplify how they write troubleshooting algorithms.

Lidong Zhou: Mm-hmm. So, I think CloudBrain is one of the efforts that we have to deal with gray failures. And remember, you know, we talked about the challenges that come from the complexity of the system or the scale of the system. It really has, you know, a huge number of components interacting with each other. But on the other hand, we can really leverage the scale of the system to help us in terms of, you know, diagnosis, detecting problems, even figuring out where the problem is. And this is the premise of the CloudBrain project. So, it has actually three components, three ideas. The first one is really the notion of near-real-time monitoring. And so instead of trying to look at the logs after the fact and then analyze what happened, we try to have a pulse on what the system is doing, how it’s doing, and so on. So that’s the first component. And the second component is we really want to form a global view. So, it’s not just one observation we make about a system, but really observations from all over the system combined, so we can actually understand how a system is behaving and which part is actually having a problem. And then, the third part is, once you have, you know, all these global observations that come in real time, then we can use statistical methods to really reason about, you know, what’s abnormal and so on. So, this is where we really leverage the scale, the huge amount of data…

Host: Right.

Lidong Zhou: …that used to be a challenge and now it becomes an opportunity for us to actually come up with new solutions to handle the complexity of the system.
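Editor's note: the second and third ideas (a global view, then statistics over it) can be illustrated with a small sketch. It is not CloudBrain's method, which the conversation does not detail. It assumes a single snapshot of one latency reading per component and uses a median-based outlier score; the component names, the 3.5 threshold, and the 0.6745 scaling constant (the usual one for the modified z-score) are choices made for this example.

```python
from statistics import median


def find_outliers(latency_ms: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return components whose latest latency is far from the fleet-wide median.

    Uses the modified z-score built on the median absolute deviation (MAD),
    which is distorted less by the outliers themselves than mean and
    standard deviation would be.
    """
    values = list(latency_ms.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the readings are identical: flag anything that differs.
        return sorted(name for name, v in latency_ms.items() if v != center)
    return sorted(
        name
        for name, v in latency_ms.items()
        if 0.6745 * abs(v - center) / mad > threshold
    )


if __name__ == "__main__":
    snapshot = {"tor-01": 1.1, "tor-02": 0.9, "tor-03": 1.0, "tor-04": 1.2, "tor-05": 9.5}
    print(find_outliers(snapshot))
```

The point of the sketch is the one made above: a single reading of 9.5 ms means little on its own, but set against thousands of peers measured at the same moment, it stands out without anyone having to hand-tune a per-component limit.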

Host: So how does that help an operator simplify writing an algorithm?

Lidong Zhou: Right, so now, the operator actually has all the data in near real time. And, you know, you can write this very simple algorithm that operates on the data sort of like a SQL query.

Host: Right.

Lidong Zhou: Right? And then it can emit signals and, you know, tell people that something’s wrong or something’s correct, or maybe we have to pay attention to part of the system that seems to have some problems.
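Editor's note: to give a feel for a troubleshooting rule written "sort of like a SQL query," here is a minimal sketch using Python's built-in sqlite3 module. The `probes` table, its columns, and the thresholds are all hypothetical; the interview does not describe the actual data model or query language used in production.

```python
import sqlite3

# Hypothetical rule: flag any destination that has received enough probes
# and lost more than a given fraction of them.
RULE = """
    SELECT dst, 1.0 * SUM(1 - ok) / COUNT(*) AS loss_rate
    FROM probes
    GROUP BY dst
    HAVING COUNT(*) >= :min_probes
       AND 1.0 * SUM(1 - ok) / COUNT(*) > :max_loss
    ORDER BY loss_rate DESC
"""


def load_probes(rows):
    """Build an in-memory table with one row per probe: (src, dst, ok)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE probes (src TEXT, dst TEXT, ok INTEGER)")
    conn.executemany("INSERT INTO probes VALUES (?, ?, ?)", rows)
    return conn


def suspects(conn, min_probes=3, max_loss=0.5):
    """Run the rule and return (destination, loss_rate) pairs worth a look."""
    params = {"min_probes": min_probes, "max_loss": max_loss}
    return conn.execute(RULE, params).fetchall()


if __name__ == "__main__":
    conn = load_probes([
        ("a", "b", 1), ("a", "c", 0),
        ("b", "a", 1), ("b", "c", 0),
        ("c", "a", 1), ("c", "b", 1),
        ("a", "b", 1), ("b", "c", 0),
    ])
    for dst, loss in suspects(conn):
        print(f"signal: {dst} is dropping {loss:.0%} of probes")
```

The operator writes only the declarative rule; collecting the observations and keeping them fresh is the platform's job, which is what makes the rule itself so short.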

Host: So where is this gray failure research, with all its pieces and parts, in the pipeline for production?

Lidong Zhou: Overall, we are not at the stage where we have solved all the problems, but we have pieces of the technology that we developed to solve some specific problems. DeepView and CloudBrain are, you know, two projects that have already been incorporated in Azure to deal with network-related problems, for example.

Host: Mm-hmm.

Lidong Zhou: But, you know, we are far from solving the problem. It’s really sort of a research agenda that we set out probably for years to come. And one idea that we have been working on, which is actually very interesting, is that we really have to change how we view programs. In the past, for defensive programming, we have been trained to handle exceptions, and it turns out that handling exceptions in a large, complex system is not enough. So, one of the ideas that we’ve been thinking about is changing exception handling into exception or error reporting. So, you start to collect all those signals. We talked about, you know, the dust when the…

Host: Right.

Lidong Zhou: …rhino comes charging at you. So, you have to really collect that dust in one place so that you can actually reason about the behavior of the system. And that’s, you know, one of those major shifts…

Host: Yeah.

Lidong Zhou: …that, you know, we see coming even in how we develop systems.

Host: Right.

Lidong Zhou: Not just, you know, after the fact, we already have this beast and now we need to understand what’s going on.

Host: Right.

Lidong Zhou: So those methodologies, I think, are where we’re pushing. You know, it’s not just solving a specific problem. We have an incident; we try to solve this problem. Yeah, we can do that. But more importantly… this goes back to the theory meets practice…

Host: Right.

Lidong Zhou: …so, we need to come out of looking at the specific instances, but think about, you know, what methodologies we should adopt to change the status completely.

Host: So how do you implement, then, a brand-new thing? I mean, we talked about the beast that already exists, and is growing. What are you proposing with your research?

Lidong Zhou: Right, so, this is always a hard problem. We already have something running, and it has to keep running, and now it has a lot of problems we need to solve. So, one of the ways we deal with those challenges is trying to solve the current problems. You know, like CloudBrain and DeepView sort of try to fit into the current practice. But for some other projects, what we do is like, you know, what I talked about, changing from exception handling to error reporting – we actually built a system that can automatically transform a piece of code that does error handling in the traditional way into a piece of code that does error reporting in the way that we desire.

Host: Right.

Lidong Zhou: And that helps because we don’t want everybody to rewrite the whole code base.

Host: No.

Lidong Zhou: It’s just not possible. So, we have to find ways to help developers sort of do the transformation and also live with the current boundaries of the system. And hopefully, gradually, we’ll move in the right direction.
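Editor's note: the shift from exception handling to error reporting can be sketched as follows. This is not the automated code transformation described in the conversation, whose details are not given. It is a hand-written decorator that approximates the same effect at run time, and `REPORTS`, `report_errors`, and the storage example are hypothetical names invented for the illustration.

```python
import functools
import time
from collections import Counter

# Stand-in for a central collection point; a real system would ship these
# records to a monitoring service instead of keeping them in memory.
REPORTS: list[dict] = []


def report_errors(component: str):
    """Decorator: record every exception raised by the wrapped function.

    The exception is re-raised unchanged, so any existing handling in the
    caller keeps working; the only new behavior is the report.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                REPORTS.append({
                    "time": time.time(),
                    "component": component,
                    "operation": func.__name__,
                    "error": type(exc).__name__,
                })
                raise
        return wrapper
    return decorate


@report_errors("storage-frontend")
def read_block(block_id: int) -> bytes:
    """Toy operation: odd block ids time out."""
    if block_id % 2:
        raise TimeoutError(f"block {block_id} timed out")
    return b"data"


def read_with_fallback(block_id: int) -> bytes:
    """Traditional local handling: swallow the error and fall back."""
    try:
        return read_block(block_id)
    except TimeoutError:
        return b""


def error_counts() -> Counter:
    """Aggregate the collected reports by (component, error type)."""
    return Counter((r["component"], r["error"]) for r in REPORTS)


if __name__ == "__main__":
    for block in range(4):
        read_with_fallback(block)
    print(error_counts())
```

The caller's fallback still hides each timeout from the user, as before. What changes is that the "dust" no longer disappears inside the handler: every swallowed error leaves a record that can be combined with reports from the rest of the system.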

Host: Yeah, I think you see that just about every place software exists: there’s a legacy system. You’ve got to retrofit some stuff that added complexity to it.

Lidong Zhou: That’s right.

Host: But you can’t just make everyone throw out what they’re already using. So, this is a big challenge. I’m glad you’re on the job.

(music plays)

Host: Well, we talked about what gets you up in the morning and all the work you’re doing to make sure that everything goes right… that is basically what you’re doing, is trying to make everything go right…

Lidong Zhou: Right.

Host: …but as we know – as you know more than I know – something always goes wrong!

Lidong Zhou: Right, unfortunately.

Host: The rhino… So, given what you see in your work every day, is there anything that keeps you up at night?

Lidong Zhou: Yes, I think we’re realizing that the kinds of distributed systems we’re designing, or building, are becoming more and more important. They’re becoming part of the sort of critical infrastructure of our society. And that puts a lot of burden on us to make sure that whatever we’re building can be mission critical.

Host: Right.

Lidong Zhou: And, you know, we have a lot of researchers working on formal methods, verification, just to make sure that the core of the system can be verifiable, which will give some assurance that it’s actually working correctly. And, you know, we talked about applying machine learning and deep learning mechanisms, but it’s statistical. So sometimes – actually, naturally – there are cases where it breaks. So how can we safeguard this kind of system from what you call catastrophic issues? This is also another thing that we have been putting a lot of thought into. And we’re not short of challenges, especially on making the cloud infrastructure really, you know, mission critical!

Host: Lidong, tell us your story. How did you end up at Microsoft Research, and how did you develop your path to the positions you hold right now?

Lidong Zhou: Yeah, looking back, I remember when I finished my PhD, I started job hunting and I got, you know, a couple of offers, and I talked to my advisor. Of course, that’s what you do when you’re a graduate student. And he basically gave me a very simple piece of advice. He basically said, well, just go where you can find the best colleagues, the colleagues with maybe, you know, Turing-Award caliber. So, I ended up going to Microsoft Research Lab where, at that time, we didn’t have a Turing Award winner, but within ten years, we had two! So that was how things started. Looking back, what’s really important is the quality of colleagues you have, especially in the early stages of my career. I learned how to do research in some sense. It’s not about getting papers published. It’s internal passion that drives research and I think the first phase of my career is more on personal development. I remember being pushed by my manager at the time, Roy Levin, to get out of my comfort zone. We started as a sort of technical contributor, but then, I was pushed to lead a project and there are always new challenges that you face. And you get a lot of support from your colleagues to get to the next stage, and that’s very satisfying. And then I went to MSR Asia, where I later became a manager of a research group, and I think that’s sort of the second phase of my career, where it’s not about my personal career development. It’s also about building a team and how you can contribute to other people’s success. And that turns out to be even more satisfying to see the impact you can have on other people’s careers and their success. And also, during that period of time, I also realized that it’s not just about your own team. You know, we can build the best systems research team in Asia Pacific, but it’s more satisfying if you can contribute to the community. 
And we talked about starting the workshop and getting the conference into Asia Pacific, and, you know, a lot of other things that we do to contribute to society, including, you know, the talent fostering and many other things. And those, in my mind, are becoming even more critical as we move on in our career.

Host: Yeah.

Lidong Zhou: So, I view this as sort of the three stages of my career. It started with personal development, learning, you know, what it means to love what you do and do what you love. And then you think about how you can contribute to other people’s success and increase your ability to influence others and impact others positively. And finally, what you can contribute to society, to the community. And I’ve been very fortunate to have been working with a lot of great, you know, leaders and colleagues, and I’ve learned a lot along the way. And I remember, you know, I worked with a lot of product teams as well. And they also offered a lot of career advice and support. So, this is just, you know, my story, I guess.

Host: You know, it sounds to me like almost a metaphor. You know, you start with yourself, you grow and mature outwards to others, and then the broader community impact that ultimately a mature person wants to see happen, right?

Lidong Zhou: I hope so!

Host: I get the sense that it is!

Lidong Zhou: It’s just about seeking the truth. It’s not about, you know, getting papers published. It’s not about, you know, chasing fame or, you know, all those things that we start to lose sight of, you know, what the true meaning of research is. It’s not about all these results that we try to get, but truly, it’s about finding the truth and enjoying the process along the way.

Host: At the end of each podcast, I ask my guests to give some parting advice to our listeners. What big, unsolved problems do you see on the horizon for researchers who may just be getting their feet wet with systems and networking research?

Lidong Zhou: Well, I think they are very fortunate to be young researchers in systems and networking now. I remember I was talking to Butler Lampson when I started my career in 2003, and he said, you know, he was feeling lucky that he was doing all the work in the late seventies and early eighties because it was the right time to see a paradigm shift. And I think, now, we are at the point that we’re going to see another major paradigm shift, just like, you know, folks in Xerox PARC. What they did was, essentially, to define computing for the next thirty years. Even now, we’re sort of living in the world that they defined, looking at the PC, even with the phone. I mean, that’s just a different form factor, right? They sort of defined the mouse, the laser printer, all the things that we know about, and the user interface. And the reason that happened at that time was because computing was becoming, you know, more powerful, from supercomputers now to personal computing, because…

Host: Right.

Lidong Zhou: …you know, we can pack so much computation power into a small machine. And now, I think, you know, the computation power has reached another milestone where computing capability is going to be everywhere. And we’re going to have intelligence everywhere around us. The boundary between sort of the virtual world in computers and our physical world will disappear. And that will lead to really paradigm-shifting opportunities where we figure out, you know, what computing really means in the next, you know, ten years, twenty years. And this is what I would encourage everyone to focus on rather than just incremental improvements to the protocols and so on. Because we are really seeing a lot of assumptions being invalidated. And we really have to look at the world in a very different way, from, you know, how we interact with sort of the computing capability to how we expose computing capability to do what we need to do. And it’s not just doing computing in front of a computer but, you know, doing everything with sort of the computing capability around us. And that’s just exciting to imagine. And I can’t even describe what the future will look like, but it’s up to our young researchers to really make it a reality.

Host: Lidong Zhou, it’s been an absolute pleasure. Thanks for joining us in the booth today.

Lidong Zhou: Thank you, Gretchen. Really a pleasure.

(music plays)

To learn more about Dr. Lidong Zhou and how researchers are working to bring order out of systems and networking chaos, visit


The post The brave new world of cloud-scale systems and networking with Dr. Lidong Zhou appeared first on Microsoft Research.
