Hacking our way towards ML-first Jupyter Notebooks
Speaker: Matt Dupree
Summary
In this Navigate 2023 talk, Matt Dupree discusses common challenges in data and machine learning work and proposes potential solutions. He highlights how static analysis can address the painful refactoring of data notebooks for production. Dupree emphasizes the need for automated testing workflows built on IPython's hooks and profiles to catch mistakes and missed opportunities. Finally, he suggests building Jupyter bridge plugins to relieve the repetitive typing of code and simplify the interaction between Python and JavaScript. The aim is to move away from inefficient approaches and empower practitioners to overcome these challenges more effectively.
Transcription
Does anybody recognize this scene? Hold on, something crazy happened. I was just in another talk, one of the other last talks of the day, and the other speaker also had a Batman motif for his talk. I don't know, what are the odds? It's crazy. But okay, that's a hint. It's from Batman. Which one? Does anybody recognize this scene? Which Batman is it from? Just yell it out, it's intimate... The Dark Knight? That's it, that's awesome.
Yes, it's from The Dark Knight. There's this scene where Batman is interrogating The Joker. He wants some information from him, right? And Batman is doing his thing that he usually does: he's pummeling people. He's beating The Joker up to get the information he wants. And the Joker just laughs at him. He's like, "You got nothing on me," right? The line is something like, "You have nothing, nothing to threaten me with, nothing to do with all your strength," right? Just a beautiful line. Guys, this is being recorded; if you haven't seen this movie, I would recommend leaving and going to watch it. You can watch this recording later, it's going to be better. Okay, so it's a great scene.
Alright, now, what is this? Wait, what are we talking about here? What's the title of the talk? This is going somewhere. For a long time before Kubernetes, I think there was something of this dynamic going on between devs and operations people trying to wrangle prod and infrastructure. We're powerful, we're superheroes within our organizations, we have a lot of abilities. We're capable, but prod and infrastructure still kind of laughed at us. They were hard to wrangle; there were these inevitable challenges that we couldn't surmount despite our superpowers.
Thankfully, we got Kubernetes, and things are starting to feel a little more manageable. That's what a lot of this conference has been about, what's possible when you have something like Kubernetes. It kind of gets us out of that Batman-Joker dynamic. But I want to talk about a different kind of Batman-Joker dynamic that's more related to machine learning and data work.
This is how I had been feeling when I was doing data and machine learning work at my last employer, and it's the reason why I started DataChimp. I felt like Batman in that scene, right? You're trying to wrangle this problem, and you're applying the tools that have always worked in the past, but it isn't really working.
So let's talk about that. What are the tools that we have as software people when we're trying to solve a problem? Abstraction, right? Classes and functions: whenever there's some problem that we're trying to solve as programmers, those are the two main things that we throw at it. There's that saying that every problem in computer science can be solved with another layer of indirection, right? So these are the tools that we're used to using, and these were the tools that were failing me when I was doing my data and machine learning work.
What was the problem? So even though I was using these tools that had always served me well as a programmer, it was hard for me to move from prototyping in a notebook to production. There was this painful refactoring process that I had to go through in order to do that. The next problem: I found myself worrying about mistakes or missed opportunities.
I'd be trying to improve a machine learning model, and no matter what sorts of classes or functions I created, it didn't really solve that problem for me. And then the last problem: I found myself typing the same code over and over again in my data notebooks. These are the things that functions and classes, my standard tools as a programmer, didn't really help me solve.
Okay, so, let's do a little pop quiz. Don't worry, this is relevant. People know Fernando Perez created IPython, which is now Jupyter, right? This is a little bit of history. Okay, why did he create IPython? Is it A) he was procrastinating on his PhD thesis, B) he wanted a tool to make it easier to explore data with code, or C) he wanted a tool that made machine learning easier? Does anybody want to venture a guess? Again, it's intimate, just shout it out. Hey, oh, what you got? You guys ruined it, you know your history.
Yeah, so A is definitely a part of the history. He was in fact procrastinating on his thesis; that's part of the history of data notebooks. B is also true, to be fair. He has an article where he recounts the history, and B was also true. But importantly, C was not true. He did not have machine learning explicitly in mind when he created IPython, which became Jupyter, and that explains why we find ourselves in that Batman-Joker situation, why we have problems that don't seem to go away: the tool that we're using was not designed for machine learning.
So, what would a tool that makes machine learning easy look like? Well, let's go back to something that Steve Wozniak said yesterday. How many people were there for that? All right, good, most people were there. So you remember, he said something interesting about mainframe computers. He said, look, before the personal computer, we had these big mainframe computers, and the only people who understood how to operate them were the geeks. It wasn't really an approachable thing. And that's kind of what building machine learning systems is like now. We're in this stage where we have these mainframe-like tools that are unapproachable, difficult to use, and only understood by the geeks. And we want to move towards something that's approachable and usable.
Steve talked about it in terms of the human side of things, right? Something like the Apple II. So that's, abstractly, where we're headed. More concretely, what we want out of a tool that makes machine learning easy is for these problems to go away. We don't want painful refactoring to move from prototype to production. We don't want to worry about mistakes or missed opportunities. We don't want to be typing the same code over and over again.
So, how do we do that? That's what I want to talk about: the open source Lego blocks that help us achieve those things. The three main things I want to talk about are Python's AST module (AST stands for abstract syntax tree; we'll talk about that in a second), IPython's hooks and profiles (another open source Lego block we can use), and the Jupyter protocol and the MIME types that are part of it. These are the Lego pieces that we can use to build the solution that will help us escape that Batman-Joker situation. Sound good? It's about to get technical, y'all. There's going to be code. All right, buckle up.
Here we go.
So let's start with the painful refactoring for production. How can we make that better? Okay, first, let's talk for a second about why the refactoring is painful in the first place. This is a comment from the creator of IPython/Jupyter, and it makes a really interesting observation about the nature of data work, the kind of work we do in our notebooks. He says we don't tend to program with a well-predefined objective when we're working in a notebook; we use programming languages in an interactive discovery process. This is the mode that we're in when we're working in a notebook: it's very much exploratory, it's prototyping. And that's a different mode from engineering a production thing. This mismatch between the two modes has actually led to a bit of a tiff within the data science community about whether notebooks are even the right tool for data work.
How many people are familiar with this talk? Has anybody seen this before? Okay, all right, we got a couple of nods. Quick history: Joel Grus gave this talk, you can see it's from 2018. He said, I actually think notebooks are bad. And you can see he's got the Squidward slide, because he thinks people aren't going to like it. A lot of people were like, well, actually, yeah, there are some good points here. And so it kind of created this debate within the community about whether notebooks are really the right tool for data work.
Now, I think they are the right tool for data work. We just have to recognize the two different modalities of the work we're doing when we're building machine learning models: exploration versus engineering. And so the question becomes, how can we make the transition between those two modes easier?
So, let's make that really concrete. Let's say you're working in a notebook. You have some code: you've got some imports, you're getting some data, and then finally you're doing some data preparation. Now, if you look at this third cell here, that cell is referencing this dataframe variable that's defined in a previous cell. It's also referencing the train_test_split function from scikit-learn. What we really want when we're shifting from prototyping or exploratory mode to production mode is to take a cell like this, with these free variables that are bound in previous cells rather than in the cell itself, and move from that cell to a function like this.
Right, if you look at things like Kubeflow or Flyte, they're asking you to write functions that you can decorate, and then they get translated into some DSL. So you need to move from the notebook to a function, and the function needs to acknowledge the inputs, those free variables that I mentioned: in this case, the dataframe variable that was defined previously, and also the train_test_split function. Those things need to be included in the function that we create. And right now, the process for creating that function is manual. You look at the cell, and you're like, "Okay, I see that there's this other thing defined here," and you manually copy these things over into a script. This is the painful refactoring process that I'm talking about. So, how can we not do that? Because it's kind of a bummer to have to manually and painfully refactor these things into scripts that can be executed by these pipelines.
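A minimal sketch of the cell-to-function move being described; the function name `prep` and the 0.2 split size are illustrative, not from the talk's slides:

```python
# The notebook cell looked something like this; `df` and
# `train_test_split` are free variables, bound in earlier cells:
#
#     train, test = train_test_split(df, test_size=0.2)
#
# Refactored for a pipeline, the free variables become an explicit
# import and an explicit parameter:
from sklearn.model_selection import train_test_split

def prep(df):
    train, test = train_test_split(df, test_size=0.2)
    return train, test
```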
So, we can use static analysis. We're very used to static analysis in developer tools in other contexts, but we haven't really applied it well to the data science context yet.
So Python has a great AST module; AST stands for Abstract Syntax Tree. An abstract syntax tree, in case you're unfamiliar with the concept, is basically this: when you type code into your script file and Python executes it, Python translates that string into a tree-like structure so that it can understand the code you've written in a predictable and manageable way. So if you take this "Hello World" statement, parse it, and dump it out in IPython, you get this tree-like structure, and then you can work with that structure to manipulate code. That's what we're going to do here to ease the process of refactoring.
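A minimal version of the parse-and-dump demo (the exact output formatting varies a bit by Python version; `indent=` needs Python 3.9+):

```python
import ast

tree = ast.parse("print('Hello World')")
print(ast.dump(tree, indent=2))
# Prints roughly:
# Module(
#   body=[
#     Expr(
#       value=Call(
#         func=Name(id='print', ctx=Load()),
#         args=[
#           Constant(value='Hello World')],
#         keywords=[]))],
#   type_ignores=[])
```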
One of the key functions for doing this is called walk. It takes an abstract syntax tree and, as the name suggests, walks through that tree and yields every single node within it, so that you can analyze it or perform some sort of manipulation on the code. And you can do that in a structured way, so that you're not dealing with regexes and stuff. There's that kind of silly quote: "You have a problem, so you use a regex; now you have two problems." You want to avoid that. You want to use something like an abstract syntax tree so you can manage the code that you're working with.
With something like ast.walk, it's surprisingly trivial to get the set of unbound variables within a Jupyter notebook cell. This is all the code that you really need. I'm hand-waving a little bit here, but this is a set comprehension. We're using the ast.walk that I mentioned, and there's an if-check within the set comprehension where we're walking through every node and saying: if the node is a name (basically a variable) and the node has not already been bound, then we want to include it in the set. So now what we have is a set of free variables. These are the things that we're going to need to move into our script from the notebook cell, and we need to move them in as imports or function parameters. Let's quickly look at this video and see what it looks like. So you have this cell here, and imagine adding it to a workflow. Maybe you want to call this step 'prep', and there you go. Now you have a function that, importantly, has the df dataframe parameter and the import statement. It's already been moved for you. You've done the analysis statically using the AST module; you understand what the dependencies of this task are. You don't have to do it manually like a savage. You know what I'm saying? We have technology here. Let's use it and move faster.
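A simplified sketch along the lines of what's described here; it ignores scoping edge cases (comprehension scopes, names loaded before they're stored, globals) that a production tool would need to handle:

```python
import ast
import builtins

def free_variables(cell_source: str) -> set[str]:
    """Names a cell loads but never binds: they must come from earlier
    cells or imports, so a generated function needs them as parameters
    or import statements."""
    tree = ast.parse(cell_source)
    bound = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id not in bound
        and not hasattr(builtins, node.id)  # skip built-ins like len, print
    }

# e.g. free_variables("train, test = train_test_split(df, test_size=0.2)")
# returns {'train_test_split', 'df'}
```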
I'm going to stop there for a second. I'll make sure you guys are still with me. We're at the end of the day. Are we tracking? Do we have any questions? Anybody want to shout anything out? Is anybody bored? Are we good? We're hanging out. Awesome, all right. Oh yeah, I already clicked those things.
So that's painful refactoring to production. That's how we can make that better with the AST module. We're doing pretty good on time. So, what about the second problem? About worrying about mistakes or missed opportunities to improve our model or clean up our data? What sort of open source tools can we use to solve this problem?
Before we answer that, let's look at this example. I brought this up in my workshop yesterday. This is a study that came out of Andrew Ng's research group. This guy is the Jesus of machine learning, right? You can't do better than Andrew Ng, in my opinion. I don't think that's that controversial; the guy's well-known. And Santiago here is pointing out on Twitter that there's actually a problem with how they trained their model: there's a sort of data leakage happening here. So, if Andrew Ng can make that kind of mistake, it really suggests that we're in a bad spot as an industry. We really need better tools.
Andrew Ng knows this, and this is the impetus for data-centric AI. He has this great quote: "For decades, individuals have been looking for data problems and fixing them on their own. It's often been the skill or luck of an individual engineer that determines whether it gets done well. Making this more systematic through principles and the use of tools will help a lot of teams build more AI systems." So we need more tools for this; he gets that.
Right now, we're at the beginning of this age of machine learning. Think about when chainsaws first came out: it was very easy to hurt yourself, very easy to make a mistake. This chainsaw is from the '40s; it's very easy to wind up in the hospital with it. Compare that chainsaw with this one: completely different. Look at these two triggers. You cannot operate this chainsaw in an unsafe way; it will shut off if you take a hand off either trigger. It's designed in a way that leads you to be safe. That's what building machine learning systems needs to be like. We don't want a situation where the best of the best can easily cut their hands off or make a mistake. We want to be in this situation.
So, how do we do that? We can take a cue from software engineering here. I'm using pytest and pytest-watch, a lesser-known related library. On the left, we have a script; we're just adding two things. On the right, we have the test for it. We run pytest-watch, we make a change to our script, and immediately we see that we've broken our code. There's nothing like this in a Jupyter notebook yet, right? The software engineers have it; data scientists and machine learning engineers need it too. What does it look like? Let's work through a quick data science example.
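A hedged reconstruction of the kind of two-file setup on the slide; the actual demo code wasn't shown in full, so the file and function names here are stand-ins:

```python
# adder.py -- the script under test
def add(a, b):
    return a + b


# test_adder.py -- pytest discovers test_* functions automatically.
# Running `ptw` (pytest-watch's CLI) in this directory re-runs the
# suite on every file save, so breaking `add` shows up immediately.
from adder import add

def test_add():
    assert add(2, 3) == 5
```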
So, this is from the Iris dataset, the 'Hello World' of machine learning. Let's say we have features like petal length and petal width, and we want to predict the variety of the flower. This is a pretty standard machine learning 'Hello World' type problem. We expect the petal width and petal length to be measured in centimeters. We train our model, we find out the model's not really doing that well, and we're trying to debug it. What's going on here? We go to Slack and we say, "Hey guys, who collected the data? Were we using centimeters?" And then... oh, you know what, missed opportunity there. You know what I'm talking about, the Anakin meme. Can you just imagine it in the slide deck for me? If you don't know what the Anakin meme is, it's like: you know you're going to change the world for the better, right? For the better, right? That's Padme. All right, whatever, just Google it. I should have had that here.
Anyway, let's say there's this mistake with the data collection: somebody's measuring with a different unit. There's not really a way to detect this ahead of time, and it's easy to make this mistake. So, what would it look like if we could detect this sort of thing within a notebook? Here's a quick demo, and then we'll talk about how it works.
So, we have a dataframe, and we get our results. Let me pause this here. On the left, we have our standard notebook. On the right, we have a kind of automated test where we're asserting bounds on the petal length in our dataframe; we're making an assertion about the shape of our data. Let's say we've spoken to a domain expert and they say, "Look, there are no flowers with a 30-centimeter petal length. That's nuts for this species." We've encoded that domain expertise into an automated test.
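A sketch of the kind of assertion the demo shows; the column name `petal_length` is assumed from the Iris dataset, and the 30 cm threshold comes from the hypothetical domain expert above:

```python
def test_petal_length_is_plausible(df):
    # Domain expert: no iris of these species has a petal anywhere near
    # 30 cm, so anything that large is a data-collection error
    # (e.g. someone measuring in millimeters).
    bad_rows = df[df["petal_length"] >= 30]
    assert bad_rows.empty, f"Implausible petal lengths:\n{bad_rows}"
```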
So let's say we have that set up and we're working in our notebook. Now we move from a good Iris dataset to a bad one, and immediately we see at the bottom that there's a mistake; we see examples of all the rows that don't meet that assertion. That's the kind of workflow we want. How do we build it? We can use IPython's event system.
Within a notebook, you can do this: you can say get_ipython().events.register, and after a cell is run, run_some_tests. What does the run_some_tests function look like? Well, we get some sort of result object after we've executed the cell. We can get the source code of that cell from it; that's part of the API. Maybe we want to do some static analysis on it, I'm being a little hand-wavy here, we can talk about that more later if that's interesting. Then we can execute some functions that will run the tests.
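A minimal sketch of that registration; 'post_run_cell' is IPython's real event name for "after a cell is run", while the body of run_some_tests is hypothetical:

```python
from IPython import get_ipython

def run_some_tests(result):
    # `result` is IPython's ExecutionResult; the source of the cell
    # that just ran is part of the API.
    cell_source = result.info.raw_cell
    # Hypothetical from here down: statically analyze the source to
    # work out which dataframes the cell touched, then run the matching
    # assertion functions against the live objects in the namespace.
    ...

# Fire run_some_tests after every cell execution.
get_ipython().events.register('post_run_cell', run_some_tests)
```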
I'm sketching how this would work. It's not that hard to set this up in a way where we get real-time feedback on the correctness of our wrangling code using IPython's event system. Now, this is okay, but really, we don't want to have to even think about this; we don't even want to have to run this registration code ourselves. So we can take it a step further with IPython profiles. These are Python files that run automatically when you start a new IPython or Jupyter session.
So, you can stick this code inside a path like this on Mac (on Linux it's different). Put it in that path, and it'll run automatically when IPython starts. Then you're getting this automated-test-type workflow as you work in your notebook. That's what we want, I think. Does that seem interesting? Cool, people are like, "Yeah, yeah." Awesome.
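A sketch of what such a startup file might look like, assuming the default IPython profile (where the startup directory lives at ~/.ipython/profile_default/startup/); the my_data_tests module is hypothetical:

```python
# ~/.ipython/profile_default/startup/99-data-tests.py
# Files in this startup directory run automatically, in lexicographic
# order, whenever an IPython or Jupyter session starts.
from IPython import get_ipython
from my_data_tests import run_some_tests  # hypothetical module

get_ipython().events.register('post_run_cell', run_some_tests)
```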
All right, so hopefully that helps us worry less about the mistakes and missed opportunities we make. We've got the double-trigger chainsaw now; we've got tools helping us work in a smarter way. What about this last thing, typing the same code over and over again?
So, this is an interesting one, so let's see why. I'm actually curious: how many people here do machine learning or work with data? All right, cool, so we've got a couple of folks. For the people who have done machine learning for a little while, this has got to look really familiar, right? You've written code like this so many times, you could write it in your sleep. You just use Matplotlib and scatter plot. This is muscle memory.
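The kind of plotting boilerplate being gestured at; the file and column names are illustrative, continuing the Iris example:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("iris.csv")  # any dataframe with these columns

fig, ax = plt.subplots()
ax.scatter(df["petal_length"], df["petal_width"])
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
plt.show()
```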
I remember when I first started programming, I had a mentor. I was doing a lot of stuff in the terminal and I was running into a problem. He said, "Well, let me see what you're doing." And so, I'm typing out all these commands in the terminal and he says, "Wow, you got all those commands memorized, huh?" I was feeling kind of proud of myself. I'm like, "Yeah, dude, I got these memorized." He's like, "That's bad. You don't want to have that stuff memorized. It suggests that you're missing an abstraction, you're missing some automation."
Yet here we are as data scientists and machine learning engineers: this stuff is muscle memory, this stuff is so memorized. There's something missing here. Now, I'm going to wax philosophical a little, because I've got time. I want to speculate about why we haven't figured out that this is a problem. This is one of the most insidious problems with machine learning, I think, because we don't really realize that it's a problem.
Think about the psychological experience of doing data analysis. We write a small amount of code, we wait a few seconds, and on the other side of that code there's the possibility of some insight, or some new hypothesis about how to engineer a feature that can make a model better. So we actually have a dynamic that's kind of like this. Does anybody recognize this? Yes, and it's not just any casino; this one's nearby, it's the Hard Rock.
By the way, I'd never been to a casino until recently; I went to the Hard Rock here. Holy smokes, it's crazy. It's designed to appeal to the lizard brain back here, right? You can just feel it. Working in a data notebook is like this: it has the same psychological dynamic, the same variable reward schedule that powers the slot machine. You do a small amount of work, you write a little line of code, and there's that possibility of some bombshell insight or, again, some idea for feature engineering to make your model better.
So, this problem of typing the same code over and over again doesn't even feel like a problem. It's kind of fun, you know? That's what makes it insidious: it disguises itself as not being a problem. So, what's the solution? I think we can take a cue from Emacs. Does anybody know the history of Emacs, where it got started?
What's that? "Eight megabytes and constantly swapping"? Yeah, okay, well, can we say a little more about the context here, the background? Does anybody know where it came from? How about the institution?
Well, Emacs came out of MIT, and interestingly, it came out of MIT's AI lab. That's kind of interesting. And there's another thing that came out of MIT. Does anybody recognize this symbol? What's that, the one on the right? Yeah, the one on the right.
It's double lambdas for sure, but it's associated with a programming language. Does anybody know which one? Scheme? Yeah, that's it. It's a Lisp. Okay, so Lisp also came out of MIT, in the '50s, right?
What's really interesting about the relationship between the two... how many people here use Emacs? Do we have any Emacs people? All right, yes, good. Okay, so if you've used Emacs, you know that it is the eminently hackable editor. People joke about building Tetris in Emacs, or they say it's an operating system. This thing is crazy hackable. The hackability of Emacs really inspired the extensibility of the more modern editors that we use, like VS Code or Atom. No offense to people who still use Emacs; I think it's great.
But one of the things that made Emacs so hackable is that the editor is written in the language that is the lingua franca of its users. Which is a fancy way of saying: Emacs is written in Lisp, and the people who were using Emacs were at MIT and were already used to working in Lisp. So if they wanted to extend Emacs, it was trivial; they were already working in the language.
That, unfortunately, is not true of Jupyter. You can do some extension of Jupyter in Python, but you've got to do some JavaScript too. So you don't have that smooth transition when you want to extend Jupyter; it's not that hackable. You've got this language barrier, and being able to hack on the editor, I think, is the key to solving this repetitive-code problem. I'll show you what I mean.
So what we need is something more like this. We're Python people; if we're working in a Jupyter notebook, we want to be able to extend the notebook with just Python. We don't want to learn another language. So, what does this look like for visualization?
Uh, yeah, I think I might need to pause this. Let me stop. Okay, so on the far left, we have a dataframe... or, sorry, on the far right... no, it is the left, it is the left. It's late, y'all. Okay: on the far left, we have our dataframe; we're getting the data. In the middle, we have a cell describing some code that we want to be automatically executed once we've executed the code on the far left. Here we're using the previous Lego blocks: the IPython hooks and the static analysis.
We've written this code that basically says, "Look, this is how I want to visualize this dataframe." That's what the cell in the middle says; the details aren't that important. So when we execute the code on the far left, we get a lot of visualizations for free. I'm not writing any Matplotlib code on the far left; I'm not writing any Seaborn code. I get the visualizations for free with IPython hooks, with static analysis, and with the ability to hack VS Code a little bit. And I can do it all with straight Python; I can specify what I want to happen on cell execution without having to learn JavaScript.
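A hedged sketch of how the two earlier Lego blocks could combine into this kind of auto-visualization; this is not DataChimp's actual implementation, and the histogram stands in for whatever visualization spec the middle cell contains:

```python
import ast
import matplotlib.pyplot as plt
import pandas as pd
from IPython import get_ipython

def auto_visualize(result):
    """After each cell runs, plot any dataframe the cell assigned."""
    try:
        tree = ast.parse(result.info.raw_cell)
    except SyntaxError:  # e.g. cells containing magics
        return
    assigned = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    user_ns = get_ipython().user_ns
    for name in assigned:
        value = user_ns.get(name)
        if isinstance(value, pd.DataFrame):
            value.hist()          # stand-in for the user's visualization spec
            plt.suptitle(name)
            plt.show()

get_ipython().events.register('post_run_cell', auto_visualize)
```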
So, how does this work? What's the Lego block that powers this? It's a little more complicated, but there's a magic in Jupyter notebooks called '%connect_info'. This is to Fernando's credit: he built IPython to be multi-client from the beginning. Part of his vision was that there would be a kernel, a backend, that actually executes the Python code, and a frontend that shows the results, and there could be multiple frontends. He built it to support that.
So, this '%connect_info' magic is a reflection of that vision. When you run it, you get a JSON object with all the information you need to connect to the same Jupyter kernel that the notebook is using. Once you have that information, you can spy on a currently running Jupyter notebook session. We had that ability already with the IPython hooks, but because we're trying to sidestep the JavaScript problem, we use this magic.
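For reference, the magic prints something like the following (the port numbers and key below are made up; the real output also includes the path of the connection file holding the same JSON):

```python
# In a notebook cell:
%connect_info
# {
#   "shell_port": 52700,
#   "iopub_port": 52701,
#   "stdin_port": 52702,
#   "control_port": 52703,
#   "hb_port": 52704,
#   "ip": "127.0.0.1",
#   "key": "c0ffee00-aaaabbbbccccdddd",
#   "transport": "tcp",
#   "signature_scheme": "hmac-sha256",
#   "kernel_name": ""
# }
```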
The next piece we can use is an open source package within JupyterLab: the services package, which makes it easy to interface with the Jupyter kernel from JavaScript. It's used by the Jupyter notebook itself, and it's a block they expose that we can use to improve our experience working with data.
So, what does that look like? It's actually pretty nice, not too bad. In just a few lines of code, we can connect to a kernel, request that it execute some Python code, and then get the result of that code back.
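The demo does this with JupyterLab's services package from JavaScript; to keep everything on this page in Python, here is roughly the same round trip sketched with the jupyter_client package. The connection-file path is hypothetical, and `df` is assumed to already exist in the notebook kernel:

```python
from jupyter_client import BlockingKernelClient

client = BlockingKernelClient()
# Path reported by %connect_info; this filename is hypothetical.
client.load_connection_file("/path/to/kernel-12345.json")
client.start_channels()

# Ask the *existing* notebook kernel to run some code.
msg_id = client.execute("df.head()")

# Watch the IOPub channel for the result of our specific request.
while True:
    msg = client.get_iopub_msg(timeout=10)
    if msg["parent_header"].get("msg_id") != msg_id:
        continue  # chatter from other clients or cells
    if msg["msg_type"] == "execute_result":
        data = msg["content"]["data"]  # the MIME-type-keyed dictionary
        print(sorted(data))            # e.g. ['text/html', 'text/plain']
        break
```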
Something interesting about the result of that Python code: it's already designed to be displayed in multiple contexts. If you look at the reply we get here, the important part is right here: the reply to our execution request is keyed by MIME type. Here we've asked the kernel to grab some dataframe, and this is built into the protocol: when you get a response for executing some Python code, you get this MIME-type-keyed dictionary. In this case, there are two keys, 'text/html' and 'text/plain'. These are just two different representations of the same dataframe.
But once you have these representations, it's not difficult to stick them into a notebook or some sort of VS Code plugin. You can leverage the multi-client nature of the Jupyter protocol to make IPython and Jupyter feel more hackable for Python users: they write a little bit of Python code, and you show them results. You just have to write a little bit of glue code, a little bit of bridge code that sits between the Python and the JavaScript, relaying that HTML response into some sort of display.
Alright, so that's the third problem, typing the same code repeatedly. It can go away if we make our editor, our IDE, our Jupyter notebook more hackable: easier to extend for people who are used to working in Python.
Just to sum up: the building block for the first problem is static analysis, using Python's AST module. There's so much of this in the software engineering world: static analysis, linters, formatters. If you use IntelliJ's products, they make a lot of refactoring crazy easy. VS Code is still catching up, but we're nowhere near that for data notebooks. We really need to lean more into static analysis.
For the next problem, worrying about mistakes or missed opportunities, we really want more automated testing workflows, and we can use IPython's hooks and profiles to get them.
For the last thing, typing the same code over and over again, we can build Jupyter bridge plugins that abstract the JavaScript away from the Python user and show them the results they need to see, based on the code they're working with.
Those are the Lego blocks. Now, I'm trying to use those Lego blocks to build a solution, but I'm not here to try and sell you on it. You can use DataChimp or not. The important thing to me is that we're not in this situation anymore as a field. We've got to stop standing there like Batman, standing there like an idiot, trying to wrangle these things with tools that aren't going to get the job done.
Use DataChimp or not, but don't stand there like that. Find the building blocks you need to get the workflow we want, so that we can really fulfill the promise of data and machine learning, and stop accidentally cutting our hands off, or whatever.
So yeah, that's my talk.