Data Science in Notebooks vs. IDEs
Pros and Cons and When to use What?
Computer programming in the context of software development has a long history spanning several decades, from the early high-level languages of the 1950s to the vast palette of languages and tools we see today. Over time, best practices and standard design patterns for coding have evolved and, as a direct consequence, plenty of products supporting this way of working have found their way to the market. One important group of such tools is the various types of IDEs, that is, Integrated Development Environments.
An IDE, loosely speaking, is a program with a user interface in which all development of a coding project is done. It typically consists of a code editor, a compiler, an interpreter and a debugger, but usually also offers many other features that make coding more effective. In the modern IT community, most programmers spend much of their time working in IDEs of various kinds. According to download statistics from April 2022, the most popular alternatives seem to be Visual Studio, Eclipse, VS Code and PyCharm.
Another tradition of computer programming has its roots in the universities, especially in the mathematics and physics departments: mathematical and statistical modelling. In this context, the structured design patterns of software development and object-oriented programming have had less influence on the tools created. Instead, concerns like pedagogy, visualisation capabilities, easy and flexible script execution, symbolic computation and so on have driven this style of working with code. One of the earliest and most prominent software packages for modelling is Wolfram Mathematica, released in 1988, alongside other popular and widely used programs such as Maple.
One of the many core features of Mathematica, making it a true success story, is the concept of a notebook environment, described by founder Stephen Wolfram as “… an interactive document that freely mixes code, results, graphics, text and everything else.” A notebook environment is partitioned into separate blocks of code called cells. Each cell can be executed independently of the others, although interdependencies between code blocks are very common. Since most notebook-based software packages support Markdown, it is easy to tell the story of the script by including headers and explanatory text describing what is happening and why. To further enhance the storytelling capabilities, it is also possible to present figures, plots and graphs between cells, making the notebook suitable for live demos and presentations to a wide audience. Interactive notebooks gained traction in the Python community with the IPython project and especially its spin-off, Project Jupyter, in 2014. Some popular environments for AI and machine learning are Jupyter Notebook, Google Colab, Deepnote and Databricks.
A data science production project is typically more involved than both a classic software development project and a one-off modelling task, since code, models and data have to be tied together and keep working over time. One has to consider version control, job scheduling, error handling, and testing and builds that involve more than just code. Thus, the emerging field of MLOps has sparked interest by addressing the problem of having a combined “DevOps + DataOps + ModelOps” way of working with AI and machine learning.
Up until a few years ago, the majority of data science production projects were either fully developed in an IDE framework or in a combined IDE/notebook setup, using notebooks just for one-off model training. However, some notebook-based cloud solutions have put in a great effort and created tools capable of supporting all the best practices of MLOps. Databricks in particular has made major progress, and its open-source tools, especially Delta Lake and MLflow, are crucial centrepieces for DataOps and ModelOps respectively. It is no longer self-evident that a development scheme should be based primarily on IDEs rather than on interactive notebooks.
Pros of Databricks notebooks
- Pedagogy – As stated above, one of the true advantages of doing data science in a notebook is the possibility to explain your results directly, in an accessible manner, not only to people who share your technical background but also to a more general audience. A well-written notebook can in fact sometimes serve as a substitute for a PowerPoint presentation.
- Interactive collaboration – A Databricks notebook lets several people work on the very same code simultaneously. Although this can be seriously problematic due to the lack of control, it can also be of great value in the exploratory phase of a project. The cells in which someone else is currently working are clearly highlighted, and their coding can be followed live. It is also possible to leave comments attached to cells, much like comments in Microsoft Word.
- Preconfigured MLOps tools – When launching a new Databricks instance, both Delta Lake and MLflow come preconfigured and ready for use, making version control of both data and models immediately available. The autologging feature of MLflow recognises what data was used during a training run and saves a reference to it (a sketch of such a training cell follows after this list).
- GUI for job scheduling – Running jobs on a regular basis is an easy task in Databricks. The job scheduling GUI allows several notebooks to be executed in a sequence with interdependencies, so that a notebook in the sequence runs only if the previous one finished successfully. The scheduler can also send status updates and notifications from the job by email.
- Easy access to the Spark stack – Databricks is built upon Apache Spark and therefore comes with massive capabilities for distributed computing by default. This is a big advantage when working with huge amounts of data (a minimal PySpark sketch follows after this list).
- Analysis close to the data – One of the cornerstone architectural principles of data management is to bring the analysis to the data rather than the other way around, both because of the security aspects involved in relocating data and because of the risk of degrading data quality. Databricks sits in the cloud, typically next to the source data lake and within the same security perimeter.
- Multilingual environment – Although not something every project needs, the Databricks notebook environment lets the user switch languages between cells. The framework currently supports Python, SQL, Scala and R (a sketch of the cell magics follows after this list).
- Meant for doing data science – The interactive notebook, and Databricks as a whole, is a constrained environment built specifically for data science work rather than general software development. The price is that the end user does not get full control: some admin features that would normally be taken for granted are simply locked, while others can only be reached through more or less hacky workarounds.
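To make the MLflow and Delta Lake point concrete, here is a minimal sketch of what a training cell could look like in a Databricks notebook. It assumes an existing Spark DataFrame df with feature columns and a numeric label column; the table and run names are hypothetical.

```python
# Minimal sketch of using the preconfigured MLflow and Delta Lake in a Databricks notebook.
# Assumes an existing Spark DataFrame `df` with feature columns and a numeric `label` column;
# `spark` is the session object that Databricks notebooks provide out of the box.
import mlflow
from sklearn.linear_model import LinearRegression

# Persist the training data as a Delta table (Delta is the default table format in Databricks),
# which keeps a versioned history of the data.
df.write.format("delta").mode("overwrite").saveAsTable("training_data")

# Enable autologging: parameters, metrics and the model artefact are logged automatically.
mlflow.autolog()

pdf = spark.table("training_data").toPandas()
X, y = pdf.drop(columns=["label"]), pdf["label"]

with mlflow.start_run(run_name="baseline_regression"):
    model = LinearRegression()
    model.fit(X, y)  # this fit, and a reference to the data it was trained on, ends up in MLflow
```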
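To illustrate the Spark point, here is a minimal sketch of a distributed aggregation in PySpark; the table and column names are hypothetical.

```python
# Minimal PySpark sketch: a distributed aggregation over a (potentially huge) Delta table.
# The table and column names are hypothetical; `spark` is predefined in Databricks notebooks.
from pyspark.sql import functions as F

events = spark.table("sales.events")          # lazily reads a distributed table

daily_revenue = (
    events
    .filter(F.col("status") == "completed")
    .groupBy("country", "order_date")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("customer_id").alias("customers"))
)

daily_revenue.show(10)                        # triggers distributed execution on the cluster
```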
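And a sketch of the language switching: in Databricks, the first line of a cell can be a magic command such as %sql or %scala, which overrides the notebook's default language for that cell. The table name is hypothetical, and the three snippets below represent three separate cells.

```
# Cell 1 – the notebook's default language (Python here)
events = spark.table("sales.events")
print(events.count())

%sql
-- Cell 2 – switched to SQL for this cell only
SELECT country, COUNT(*) AS n FROM sales.events GROUP BY country ORDER BY n DESC

%scala
// Cell 3 – switched to Scala for this cell only
val events = spark.table("sales.events")
println(events.count())
```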
Pros of IDEs
- Maximal control – An IDE gives the user full control over the project structure, the external tools being used and how they are configured. The system is designed to be as supportive as possible without limiting the freedom of the developer.
- Environment control – In contrast to the interactive notebooks in Databricks, an IDE project does not come with a Python environment, unless you choose to use the one installed globally. Instead, it is common practice to set up a project-specific virtual environment using either Pipenv or Conda. The environment configuration file, e.g. a Pipfile or an environment.yml, governs which libraries and which versions get installed. Dependencies are resolved and locked, making sure the final environment stays healthy and reproducible (a Pipfile example follows after this list).
- Environment variables – In some projects, even in pure data science, it is necessary to set up connections to external resources such as databases and APIs. Credentials, endpoints and secrets should never be exposed as hardcoded strings in your codebase, and this is where environment variables come in handy. Using Pipenv, for example, a collection of predefined environment variables is loaded automatically from a .env file in the project, which should be listed in .gitignore (a sketch follows after this list). It is possible to work with secrets in Databricks as well, but in a less transparent manner, by entering them one by one through the Databricks CLI.
- Syntax highlighting – The richer syntax highlighting in an IDE is a small but very useful feature that makes coding noticeably more pleasant than in the Databricks notebook editor. It increases readability and reduces friction in any project involving more than one person.
- Smooth access to version control – Version control is one of the most important practices in any modern development or data science project. With an IDE, one can easily set up a Git repository for versioning the code, controlling what is sent to the remote with .gitignore. The same holds for versioning data with DVC and models with, for instance, MLflow (a sketch of the Git/DVC setup follows after this list). All of these tools can also be used in Databricks, but with some restrictions, especially for Git.
- Debugging support – Adding to the freedom an IDE provides, debugging is one of the core features that sets this way of working apart from the notebook. When working on more complex development involving classes, inheritance, large project structures and dependencies, this functionality is a true blessing. Making mistakes when coding is practically inevitable, and finding the source of a bug can be really challenging, especially for logic errors.
- Project organisation – As already stated, IDEs allow for a much more versatile organisation of your project structure compared to the more or less linear structure imposed by a notebook project.
- Allows for test-driven development – Any software project should include a rich test suite with unit tests covering all parts of the codebase, to ensure that good quality is maintained over time. For a Python project, this is commonly done with the comprehensive test framework pytest, which integrates seamlessly with popular IDEs like PyCharm and VS Code (a minimal pytest example follows after this list). In Databricks notebooks, testing is much less straightforward, and the officially recommended solution is actually to do it in an IDE via Databricks Connect. There are hacky workarounds for making it happen in the notebook environment, but they are certainly not recommended for any serious project meant for production.
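As an illustration of the environment control described above, a minimal Pipfile could look roughly like this; the chosen packages and version constraints are only examples.

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pandas = "~=1.5"
scikit-learn = "~=1.1"
mlflow = "*"

[dev-packages]
pytest = "*"

[requires]
python_version = "3.10"
```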
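A sketch of the environment variable pattern, assuming Pipenv (which loads a .env file automatically when commands are run through pipenv run or pipenv shell); the variable names and connection string are hypothetical.

```python
# .env (kept out of version control via .gitignore), for example:
#   DB_HOST=mydb.example.com
#   DB_USER=svc_datascience
#   DB_PASSWORD=********
#
# Pipenv loads the file automatically; os.environ then exposes the values at runtime.
import os

db_host = os.environ["DB_HOST"]
db_user = os.environ["DB_USER"]
db_password = os.environ["DB_PASSWORD"]

connection_string = f"postgresql://{db_user}:{db_password}@{db_host}:5432/analytics"
# ... pass connection_string to your database client of choice
```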
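A rough sketch of the version control setup, using Git for code and DVC for data; the remote URL and file names are placeholders.

```bash
git init
dvc init                                             # adds DVC metafiles, tracked by Git

dvc remote add -d storage s3://my-bucket/dvc-store   # placeholder remote for the data itself
dvc add data/training_set.csv                        # data goes to the DVC cache, not Git

git add data/training_set.csv.dvc data/.gitignore .dvc
git commit -m "Track training data with DVC"

dvc push                                             # upload the data to the configured remote
```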
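And as an example of the kind of unit test referred to above, a minimal pytest module for a hypothetical preprocessing helper might look like this.

```python
# tests/test_preprocessing.py
# A minimal pytest example; drop_constant_columns is a hypothetical helper in the project codebase.
import pandas as pd
import pytest

from myproject.preprocessing import drop_constant_columns  # hypothetical function


def test_drop_constant_columns_removes_constants():
    df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7]})
    result = drop_constant_columns(df)
    assert list(result.columns) == ["a"]


def test_drop_constant_columns_rejects_empty_frame():
    with pytest.raises(ValueError):
        drop_constant_columns(pd.DataFrame())
```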
Some guidelines for choosing a development framework
Providing precise and general guidelines for when to choose one of the above coding regimes over the other is impossible. There is simply no clear-cut decision boundary, and in many cases both frameworks work just fine. In addition, Databricks projects can be run in basically any IDE via Databricks Connect, and many IDEs include notebook support. How to set up your project optimally also depends on external factors, not only the nature of the technical task itself. The mix of experience in the team and the resources available must also be taken into account. For a junior group of only a handful of individuals, the Databricks solution will probably be the more tractable choice. For a big and experienced team, on the other hand, a more advanced non-managed solution based on IDE programming might be the way to go. It is definitely possible to set up a fully open-source, Spark-compatible and cloud-agnostic MLOps stack replicating all of the good functionality in Databricks, while maintaining full control over the development environment.
As a general guideline, it is probably fair to say that a project involving huge amounts of data and an “off-the-shelf” modelling scheme (sklearn, XGBoost, PyTorch etc.) can often benefit a lot from a Databricks-based solution. Likewise, for exploratory analysis meant for gaining knowledge and for presentation purposes, Databricks is fantastic. For more complex projects requiring one or more of custom-made models, many-model orchestration, dependencies on external resources or a rigorous test suite, the IDE regime will probably be more appropriate.