Themisto: The future of code documentation

A human-centred AI system to assist Data Science code documentation in computational notebooks

I want to preface the following anecdote by saying pirating games is unethical and should not be done.

Imagine this: you want to play a game on your PC, but it is not available in your country, so you do what any unassuming, unethical child would do and look for a torrent link to the game.

You search and search, clicking through various fishy links and downloading and deleting numerous shady-looking zip files, inviting a plethora of viruses to go ham on your system and steal your entire family’s history. After searching far and wide on the internet, you finally stumble upon a valid torrent link that lets you download the game without any hiccups. Like the pirate you are, you greedily extract the treasures in the zip file, only to see that the setup is in Russian and there is some weird music playing in the background that you can't seem to stop. You are pulling your hair out in frustration. You were so close to greatness. If only there were some sort of text file that would have guided you through the process…

This is a basic example of the importance of documentation. You might not think of it this way, but a good example of code documentation is a README file. It answers several important questions about a project: how you can include it in your own project, how to install it, and how to run its tests.

The same applies when you are writing code. Documenting code has several benefits: it helps the future developers or data scientists who work on your code to maintain it, and it can help you learn from your own mistakes. Contrary to popular belief, documentation means more time spent coding and less time spent explaining things to other people. It works like a manual, and it results in fewer bugs because the code is easier to understand for everyone involved in the process.

I know you are the senior data scientist of your team and everyone should know that the database is named after your ginger cat Mittens without an explanation, but you are making the newly hired data science intern lose his mind and re-write the entire code from scratch because of your quirky variable naming tendencies.

Writing clean code is elegant and much sought after, but clean code is not necessarily informative on its own. For example, a college professor might write a beautiful 8-to-10-line function, put it on a test, and ask you to explain what it does, and lo and behold, over half the class fails. The code was not obfuscated, but the shortcuts and abstractions that come from years of experience confused people.
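To make that concrete, here is a small made-up example (not from the Themisto paper). Both functions below do exactly the same thing. The first is short and arguably "clean", but the reader has to reverse-engineer it. The second adds names and a docstring that say what the code is for. The function names and the sample data are purely illustrative.

```python
# Terse version: correct and compact, but what are d and k?
def f(d, k=3):
    return sorted(d, key=d.get, reverse=True)[:k]


# Documented version: same behavior, but the intent is written down.
def top_keys(scores: dict, limit: int = 3) -> list:
    """Return the `limit` keys with the highest scores, best first.

    `scores` maps a name (for example, a feature or a model) to a number.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    model_accuracy = {"baseline": 0.71, "forest": 0.84, "boosted": 0.88}
    print(f(model_accuracy, 2))
    print(top_keys(model_accuracy, 2))
```

Neither version is wrong. The second one just does not require a pop quiz to understand.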

Computational notebooks are interactive, web-based documents that allow users to mix text, code, and visualizations in a single document. They are popular in data science because they provide a flexible and efficient platform for performing data analysis, exploring datasets, and prototyping models.

In a computational notebook, users can write and run code, visualize results, and document their work in a single place. This makes it easy for users to keep track of their progress, share their work with others, and reproduce their results.

The popularity of computational notebooks in data science can be attributed to several factors. Firstly, they provide a user-friendly interface for data analysis and modeling. Secondly, they allow users to easily switch between writing and executing code, making it easier to experiment and iterate. Finally, the fact that computational notebooks are web-based makes it easy for users to share their work and collaborate with others.

Some of the most popular computational notebooks in data science include Jupyter Notebooks, Google Colab, and Microsoft Azure Notebooks. They are widely used in industry, academia, and research, and are a critical tool for data scientists and data engineers.

But the lack of proper documentation in computational notebooks is a common problem in data science. Despite the many benefits of computational notebooks, they can also be a source of frustration for users who are trying to understand or reproduce someone else's work.

One of the main reasons for the lack of proper documentation in computational notebooks is that they were designed for rapid experimentation and iteration, rather than long-term collaboration or reproducibility. As a result, many users focus on writing code and generating results, rather than documenting their work clearly and concisely.

There is also a lack of incentives for users to document their work. Data scientists and data engineers are often under tight deadlines and are focused on delivering results rather than writing detailed documentation. Additionally, the fast pace of data science work can make it difficult to keep up with documenting every step of the process.
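If you are curious how under-documented one of your own notebooks is, a rough check is easy to sketch. A Jupyter notebook file is JSON with a list of cells, each marked as "code" or "markdown". The snippet below flags every code cell that is not directly preceded by a markdown cell. That rule is only an assumed heuristic for illustration, not a measure used by Themisto, and the function names and sample notebook are hypothetical.

```python
import json


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON, into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def undocumented_code_cells(notebook: dict) -> list:
    """Return indexes of code cells not directly preceded by a markdown cell.

    Assumed heuristic: a markdown cell right above a code cell counts
    as that cell's documentation.
    """
    missing = []
    previous_type = None
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") == "code" and previous_type != "markdown":
            missing.append(index)
        previous_type = cell.get("cell_type")
    return missing


if __name__ == "__main__":
    # A tiny in-memory notebook, so no file is needed to try this out.
    example = {
        "cells": [
            {"cell_type": "markdown", "source": ["Load the loan data"]},
            {"cell_type": "code", "source": ["df = load_loans()"]},
            {"cell_type": "code", "source": ["df = df.dropna()"]},
        ]
    }
    print(undocumented_code_cells(example))  # prints [2]
```

A long list of flagged cells does not prove the notebook is bad, but it is a decent hint that the next reader will be guessing.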

So how do we tackle this problem? Some IBM researchers developed Themisto, a human-centered AI system that assists with data science code documentation in computational notebooks. It is an automatic documentation generation system that supports data scientists in writing better-documented computational narratives.

So how does it work? There are some prerequisites. It can’t just tell someone who is writing code, “Oh, I wouldn't have written it that way,” or tell someone who is trying to understand what’s written, “Maybe you should’ve studied more if you can’t understand this much.” Although that might be a fun project to undertake: an AI that, rather than helping you with code documentation, just judges your code.

Some of the design implications include:

The system should support more than one type of documentation generation. Obviously, not everyone is trying to print “hello world”; some people are also trying to work through difficult mathematical equations (insert quick math meme).

Certain types of documentation are irrelevant to the code, like writing “//The weather is nice” in code that develops a loan approval model. The AI needs to take that into account.

External resources such as Uniform Resource Locators (URLs) and the official API descriptions may also be useful, for the times when the AI goes, “I don’t know, bro, let me google that real quick.”

The Themisto system has two components: the client-side User Interface (UI) is implemented as a Jupyter Notebook plugin using TypeScript code, and the server-side backend is implemented as a server using Python and Flask.
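To picture the split between the two halves, here is a minimal sketch of what the server side of such a setup could look like. Everything in it is an assumption for illustration: the JSON field names `code` and `markdown`, the `suggest_documentation` helper, and the single-endpoint shape are hypothetical and are not Themisto's actual API. To stay dependency-free, the sketch uses a plain WSGI callable from the Python standard library instead of Flask. Flask applications are WSGI applications too, so the request and response flow is similar in spirit.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def suggest_documentation(code: str) -> str:
    """Hypothetical placeholder for the real documentation generator."""
    lines = code.strip().splitlines()
    if not lines:
        return "TODO: this cell is empty."
    return f"TODO: describe what `{lines[0]}` does."


def app(environ, start_response):
    """WSGI app: accept JSON {"code": ...}, reply with JSON {"markdown": ...}."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps(
        {"markdown": suggest_documentation(payload.get("code", ""))}
    ).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


def call_app(code: str):
    """Play the role of the notebook plugin by calling the app in-process."""
    raw = json.dumps({"code": code}).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update(
        {
            "REQUEST_METHOD": "POST",
            "CONTENT_LENGTH": str(len(raw)),
            "wsgi.input": io.BytesIO(raw),
        }
    )
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    response = b"".join(app(environ, start_response))
    return captured["status"], json.loads(response)


if __name__ == "__main__":
    print(call_app("df = df.dropna()"))
```

In a real deployment, the plugin would send the focused cell's code over HTTP, and the placeholder helper would be replaced by the actual generation logic.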

The user interface of Themisto is a Jupyter Notebook plugin. (a) Each time the user changes their focus to a code cell, because they may be inspecting or working on that cell, the plugin is triggered and a light bulb icon appears next to the code cell, indicating that there are recommended markdown cells for the selected code cell. When the user clicks on the light bulb icon, Themisto offers three options in a dropdown menu:

(b) a deep-learning-based approach, for when it has a clear answer for you;

(c) a query-based approach that googles the documentation; and, my favorite,

(d) a user prompt approach that snarkily tells the user to write their own documentation.
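The three menu options can be pictured as three interchangeable strategies behind one entry point. The sketch below is only an illustration of that idea: the function names, the option keys, and the tiny lookup table are hypothetical, the "deep learning" function is a stand-in that does not run any model, and none of this is taken from Themisto's source code.

```python
# Hypothetical lookup table standing in for a search of official API docs.
KNOWN_API_DOCS = {
    "read_csv": "loads a CSV file into a table.",
    "dropna": "removes rows with missing values.",
}


def deep_learning_suggestion(code: str) -> str:
    """Stand-in only: a real system would call a trained model here."""
    line_count = len(code.strip().splitlines())
    return f"(model-generated summary of {line_count} line(s) of code)"


def query_suggestion(code: str) -> str:
    """Look for a known API name in the code and return its description."""
    for name, description in KNOWN_API_DOCS.items():
        if name in code:
            return f"`{name}` {description}"
    return "No matching API description found."


def prompt_suggestion(code: str) -> str:
    """Nudge the user to write the documentation themselves."""
    return "Describe what this cell does and why."


# One entry per dropdown option; the keys are assumed names.
OPTIONS = {
    "deep-learning": deep_learning_suggestion,
    "query": query_suggestion,
    "prompt": prompt_suggestion,
}


def recommend(option: str, code: str) -> str:
    """Return the suggested markdown text for the option the user picked."""
    if option not in OPTIONS:
        raise ValueError(f"Unknown option: {option!r}")
    return OPTIONS[option](code)


if __name__ == "__main__":
    cell = "df = pd.read_csv('loans.csv')"
    for option in OPTIONS:
        print(option, "->", recommend(option, cell))
```

Keeping the strategies behind one function means the interface only needs to know the option name, and new approaches can be added to the table without touching the caller.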

But Themisto isn't the be-all and end-all of code documentation. At its best, it’s a useful documentation-generating AI system that saves loads of time and addresses the common lack of proper documentation in computational notebooks, making it easier for data scientists and data engineers to understand, collaborate on, and reproduce each other's work. At its worst, it’s just a reminder that you need to provide context for your code cells. Either way, it encourages clear guidelines and best practices for data science documentation. This will help ensure that computational notebooks are used in a manner that promotes transparency, collaboration, and reproducibility, which are critical to the success of data science projects. Thanks for reading.

For more information about this topic, read "Documentation Matters: Human-Centered AI System to Assist Data Science Code Documentation in Computational Notebooks": https://dl.acm.org/doi/10.1145/3489465