Recently I wrote an unconventional article about exposing analytics use cases in virtual reality. Though it was just a hackathon project, it pushed me to think about what APIs (and in which form) should be exposed by headless BI platforms.
When we talk about front-end development, we usually talk about JavaScript/TypeScript libraries. This was the case with the VR demo mentioned above. But, especially in data analytics, Python has become extremely popular not only on the back end but also on the front end. One of the most popular frameworks today is Streamlit.
An idea popped into my head: create a data application utilizing a full set of APIs, which should be provided by headless BI platforms.
Currently, one of the most feature-rich data applications is the one allowing users to build reports (visualizations/charts/insights), so I decided to create such an application using Streamlit and our Python SDK.
This article is backed by an open-sourced demo. It contains not only the Streamlit app but also a corresponding end-to-end data pipeline. It is worth mentioning that the demo allows you to create a single pull request to deliver everything consistently:
- Extract from data sources and load to the data warehouse (Meltano)
- Data transformations (dbt models)
- Declarative definitions of analytics (GoodData)
- Data applications (VR demo, Streamlit)
Why Headless BI?
We describe it here.
In particular, you can connect Streamlit directly to data warehouses or even to files, but headless BI offers more:
- Declare a semantic model just once (logical data model, metrics, reports, …)
- Connect any clients (including Streamlit), while relying on a single source of truth
- Provide low-enough latency to end users (scalability, caching)
- Prevent data warehouses from becoming performance bottlenecks or being too costly
Solution
Let me spoil it here and show you the full picture first. This is a screenshot of the final application:
What can you see in the picture? What am I going to talk about in the following chapters?
Use cases in self-service analytics!
Briefly:
- Semantic model: presented in the left panel. Users build reports by selecting business names. No SQL!
- Reports: presented in the main canvas. Various visualization types.
- Interactivity: filters, sorting
- Context awareness: the catalog is filtered based on the entities already in the report
- Multi-tenancy: switch between multiple isolated workspaces
- Caching: both Streamlit and GoodData caching
If you want to start immediately with a hands-on experience instead of preparing the whole ecosystem on your laptop, you can try it here.
Otherwise, start with the top-level README to prepare data and analytics, then follow it with the README for the Streamlit app to start the app locally.
Semantic model
The demo repository contains all the information about how the semantic model is generated.
We want to expose the model to end users in the Streamlit data application. The Python SDK provides various functions for this purpose. You can list each type of entity (attributes, facts, metrics, etc.), and there is also a function that returns the full catalog.
Moreover, the SDK provides a function to filter the model by an already existing report. What does that mean? The entities you have already put into a report can limit which other entities you can combine with them. The model consists of datasets connected by relations. Not all datasets must be connected, and even when they are, the direction of the connection can affect whether the entities can be combined.
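To make the idea of direction-aware filtering concrete, here is a minimal sketch. It is a simplified model, not GoodData's actual algorithm: the dataset names, the `REFERENCES` map, and `sliceable_datasets` are all hypothetical. It assumes a fact can be sliced only by attributes from datasets that are reachable by following references outward from the fact's own dataset.

```python
from collections import deque

# Hypothetical model: each dataset lists the datasets it references
# (think foreign keys pointing from a fact table to its dimensions).
REFERENCES = {
    "order_lines": ["customers", "products"],
    "customers": ["regions"],
    "products": [],
    "regions": [],
    "campaigns": [],  # not connected to anything
}


def sliceable_datasets(start: str, references: dict[str, list[str]]) -> set[str]:
    """Return all datasets reachable from `start` by following references."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in references.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# A fact in order_lines can be sliced by regions (via customers) ...
print(sorted(sliceable_datasets("order_lines", REFERENCES)))
# ... but nothing in regions can reach order_lines, because the
# references only point in one direction.
print(sorted(sliceable_datasets("regions", REFERENCES)))
```

In this toy model, once a fact from `order_lines` is in the report, attributes from `campaigns` would be hidden from the catalog because that dataset is unreachable.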
Finally, we want to cache the catalog so we do not call the backend with every page refresh.
For instance, here is the function collecting the whole semantic model (catalog):
Then, a Streamlit component like “multiselect” can be populated by catalog entities:
Helper functions are used here to extract IDs and titles. Also, the Streamlit state is utilized here to set the selected values.
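As an illustration of what such helpers might look like, here is a small sketch. `CatalogEntity`, `ids_of`, and `title_lookup` are hypothetical names, not the ones used in the demo, and the entity class is a stand-in for whatever objects the SDK returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntity:
    """Hypothetical stand-in for a catalog object (attribute, fact, metric)."""
    obj_id: str
    title: str


def ids_of(entities: list[CatalogEntity]) -> list[str]:
    """IDs are what the widget stores as its selected values."""
    return [entity.obj_id for entity in entities]


def title_lookup(entities: list[CatalogEntity]) -> dict[str, str]:
    """Map each ID to the business name shown to the user."""
    return {entity.obj_id: entity.title for entity in entities}


metrics = [
    CatalogEntity("revenue", "Revenue"),
    CatalogEntity("order_count", "# of Orders"),
]
options = ids_of(metrics)
titles = title_lookup(metrics)

# In the app, these would feed a widget roughly like this (assumed usage):
#   st.multiselect("Metrics", options, format_func=titles.get, key="selected_metrics")
print([titles[option] for option in options])
```

Keeping IDs as the widget values and titles only for display means the selection stored in the state can be passed on to the report definition unchanged.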
Report executions
The Python SDK offers several ways to execute reports. Because we are building a Python application, it makes sense to use the Pandas extension, which returns Pandas data frames. They can be displayed as-is in Streamlit or passed directly to the visualization libraries that Streamlit supports. In this case, I use the Altair and Folium libraries.
We need to collect all the selected catalog entities and fill them into a report definition.
Every unique request is cached by Streamlit. It is possible to clear the cache by using a dedicated button in the left panel.
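The sketch below shows one way per-request caching can work, assuming the selected entities are collected into a hashable definition. `ReportDefinition`, `execute_report`, and the call counter are hypothetical, and `functools.lru_cache` stands in for Streamlit's cache so that the example runs without a Streamlit server or a backend.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class ReportDefinition:
    """Hypothetical, hashable description of one report request."""
    workspace_id: str
    metrics: tuple[str, ...]
    view_by: tuple[str, ...] = ()
    filters: tuple[tuple[str, tuple[str, ...]], ...] = ()


backend_calls = 0


@lru_cache(maxsize=128)
def execute_report(definition: ReportDefinition) -> str:
    """Pretend to call the backend; runs once per distinct definition."""
    global backend_calls
    backend_calls += 1
    return f"result for {definition.metrics} by {definition.view_by}"


report = ReportDefinition("demo", ("revenue",), ("region",))
execute_report(report)
# An equal definition hashes the same, so this is served from the cache.
execute_report(ReportDefinition("demo", ("revenue",), ("region",)))
print(backend_calls)  # 1

execute_report.cache_clear()  # what a "clear cache" button would trigger
execute_report(report)
print(backend_calls)  # 2
```

In a Streamlit app, the same roles would be played by `@st.cache_data` on the function and `st.cache_data.clear()` behind the button.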
Metrics
Although GoodData provides an editor for creating metrics in its own MAQL language (which is far easier to use than SQL), users often just want to create very simple metrics like SUM(fact) or COUNT(attribute). The Streamlit application supports this: users pick a fact or attribute as a metric and specify an analytics function (SUM, COUNT, …) for each.
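A minimal sketch of how such a pick could be validated and labeled follows. The rule set in `ALLOWED_FUNCTIONS` and the `simple_metric` helper are assumptions made for illustration; they do not reproduce the demo's code or the SDK's metric classes.

```python
# Hypothetical rules: which functions are offered for which entity type.
ALLOWED_FUNCTIONS = {
    "fact": ("SUM", "AVG", "MIN", "MAX"),
    "attribute": ("COUNT",),
}


def simple_metric(entity_type: str, entity_id: str, function: str) -> dict[str, str]:
    """Describe a simple metric built from a fact or attribute."""
    function = function.upper()
    if function not in ALLOWED_FUNCTIONS.get(entity_type, ()):
        raise ValueError(f"{function} is not offered for a {entity_type}")
    return {
        "local_id": f"{function.lower()}_{entity_id}",
        "title": f"{function}({entity_id})",
        "entity_type": entity_type,
        "entity_id": entity_id,
        "function": function,
    }


print(simple_metric("fact", "price", "sum")["title"])
print(simple_metric("attribute", "customer_id", "count")["title"])
```

Rejecting invalid combinations early keeps the function picker in the UI and the request sent to the backend consistent with each other.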
Filters
The application provides an option to pick an attribute as a filter. It is possible to list all the available values for each attribute and display them in the Streamlit “multiselect” component.
Here is how the attribute values can be collected from the server:
Though I implemented only positive attribute filters (the attribute value must be one of the selected values), GoodData, through the Python SDK, provides many other types of filters out of the box, e.g. negative filters, metric value filters, date filters, etc.
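To show the difference between positive and negative attribute filters, the sketch below applies both to a few local rows. This only illustrates the semantics; in a real setup the filter would be evaluated by the backend, and `apply_attribute_filter` is a hypothetical helper.

```python
def apply_attribute_filter(rows, attribute, values, negative=False):
    """Keep rows whose attribute value is (positive) or is not (negative) selected."""
    selected = set(values)
    return [row for row in rows if (row[attribute] in selected) != negative]


rows = [
    {"region": "East", "revenue": 120},
    {"region": "West", "revenue": 340},
    {"region": "North", "revenue": 90},
]

positive = apply_attribute_filter(rows, "region", ["East", "West"])
negative = apply_attribute_filter(rows, "region", ["East", "West"], negative=True)

print([row["region"] for row in positive])
print([row["region"] for row in negative])
```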
Sorting, paging
I decided to apply sorting and paging in the Streamlit application, on the full result set (data frame). However, GoodData supports sorting and paging out of the box. In the future, I would like to extend the current solution to use it.
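Client-side sorting and paging on a full result set can be sketched as follows. To stay dependency-free, the example uses a plain list of rows instead of a data frame, and `sort_and_page` and `page_count` are hypothetical helpers.

```python
import math


def sort_and_page(rows, sort_by, descending=False, page=1, page_size=10):
    """Sort the full result set, then return one 1-based page of it."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    ordered = sorted(rows, key=lambda row: row[sort_by], reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]


def page_count(total_rows, page_size):
    """Number of pages needed to show all rows."""
    return math.ceil(total_rows / page_size)


rows = [
    {"region": region, "revenue": revenue}
    for region, revenue in [
        ("East", 120), ("West", 340), ("North", 90), ("South", 210), ("Central", 150),
    ]
]

top = sort_and_page(rows, "revenue", descending=True, page=1, page_size=2)
print([row["region"] for row in top])
print(page_count(len(rows), 2))
```

The trade-off is that the whole result set has to be transferred and kept in memory before the first page can be shown, which is why pushing sorting and paging down to the backend is attractive for larger reports.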
Multi-tenancy
GoodData provides an option to create isolated workspaces. It is easy to support this in the Streamlit app: we just list the available workspaces, populate a dedicated “selectbox” with them, and let users pick the workspace they want to explore.
Why Streamlit Rocks?
Getting started is really easy. Many building blocks are already implemented and easy to use, e.g. checkbox, multiselect, inputbox (textarea), etc.
Streamlit offers first-class support for state management. It is easy to persist even more complex variables to state and access them (after page reload) using dict or the property syntax.
It is possible to cache even very complex structures. You simply apply the @st.cache_data decorator, and the result of the decorated function is cached for each combination of argument values.
Finally, Streamlit provides a good cloud offering. Developers must register, and then they can create apps and bind them to GitHub repositories. Any merge to the repository redeploys the app with zero downtime. Cool! Moreover, once the app is displayed in the browser, it provides a developer console containing logs, settings, etc.
Where Streamlit Fails?
Although state management is powerful and easy to use, it is sometimes tricky, especially when you need to refresh components based on changes in other components, which is the case with catalog filtering. For example, picking an attribute in “View by” can limit the list of available metrics. The most robust solution I found is to specify the “key” property of selectbox/multiselect components. But sometimes it did not work as expected, and I spent hours finding a workaround. That is why the code is full of “debug” calls, btw 😉
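One possible way to keep dependent components consistent, not necessarily the one used in the demo, is to drop stale selections whenever the list of valid options shrinks. In the sketch below, a plain dict stands in for the Streamlit session state, and `prune_selection` is a hypothetical helper.

```python
def prune_selection(selected, valid_ids):
    """Keep only the selected items that are still valid, preserving order."""
    valid = set(valid_ids)
    return [item for item in selected if item in valid]


# A plain dict stands in for the Streamlit session state here.
state = {"selected_metrics": ["revenue", "campaign_spend"]}

# Assume picking a "View by" attribute reduced the valid metrics to these.
valid_metrics = ["revenue", "order_count"]

state["selected_metrics"] = prune_selection(state["selected_metrics"], valid_metrics)
print(state["selected_metrics"])
```

In Streamlit itself, a value stored under a widget's key has to be updated before that widget is created in the current run, so a step like this would belong near the top of the script.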
Regarding cache management: the @st.cache_data decorator can be put on class methods, but it does not work. I contributed to the corresponding thread on the Streamlit forum.
There is a big difference between JavaScript/TypeScript apps and Streamlit apps: page reloading. Every action in Streamlit triggers a full rerun of the app script, which behaves like a full reload of the page. Sometimes it’s handy, but often it’s not, because it performs poorly. This is a general limitation of the Streamlit architecture, where everything runs on the Streamlit server, not in the user’s browser.
With rising latency between the Streamlit application and GoodData, the application starts behaving weirdly during the page reload, e.g. the same selectbox is displayed twice, once active and once inactive.
Custom page design is quite hard to achieve. In my case, for instance, I wanted to create a top bar containing e.g. the workspace picker, but I did not find a solution for it. A corresponding issue has been open for years.
Moreover, a typical self-service analytics application provides a drag-and-drop experience. However, implementing this feature with standard Streamlit building blocks seems impossible. Fortunately, my colleague successfully overcame this limitation by implementing a separate React application. This React application can easily be integrated with a native Streamlit app. I plan to write about the integration in a follow-up article.
Finally, I was sad that GitLab is not supported. What a pity! My pipeline benefits from GitLab a lot. To test the cloud deployment, I finally pushed from my local repository to a GitHub “clone” repo, and it worked as expected. Personally, I would appreciate it a lot if it were possible to trigger the deployment from the pipeline, even before the merge, to create a DEV environment that can be used as part of the code review. It would be perfect if the URL of such a DEV deployment could be posted to the pull request as a comment 😉
So, Should You Use Streamlit?
Short answer: definitely yes.
Long answer: definitely yes, if you are OK with the limitations described in the previous chapter. Apart from those, Streamlit (and Python in general) provides a wealth of functionality and libraries in the area of data analytics and data science. Personally, I am most excited by the idea of mixing the demo app I described here with an embedded Jupyter notebook (a library exists), providing a mixed experience for data analysts and data scientists.
Try Headless BI for Yourself
Ready to experience the power of headless BI? Start your 30-day free trial today.