Since its release in 2008, pandas, a Python library for data manipulation and analysis, has held a prominent position in data analytics and has become an industry favorite among data scientists. The pandas library offers robust data structures and an arsenal of manipulation functions that, when combined with Python’s simplicity and versatility, revolutionized code-based data analytics. However, with this flexibility comes some limitations. pandas can be restricted to static, unscalable, and code-heavy applications, which may affect performance. As a comparison, we introduce Atoti.
Outcomes and findings
This piece outlines the outcomes of a comparative analysis of Atoti and pandas. The analysis is illustrated through a simple Value at Risk (VaR) use case implementation. The outcomes highlight how Atoti can enable more advanced analytics and shows how Atoti natively addresses the potential limitations of the pandas approach, such as:
- Atoti supports data modeling, which systematically infers a semantic dimension to the dataset. This facilitates its transformation into attributes and aggregable metrics.
- Atoti improves the response time by up to 99% for performing complex, non-linear VaR calculations, while aggregating them at different levels.
- Atoti drastically reduces memory usage by up to 97%, thanks to its embedded compression and memory-optimization, and due to its elimination of redundancy and duplicate data structures when compared with pandas.
- Atoti introduces the concept of generic metrics, which are configured once and computed at any level of aggregation. In contrast, pandas requires explicit coding for each aggregation level, resulting in redundant, code-heavy scripts.
- Atoti UI and Atoti’s JupyterLab extension offer dynamic, no-code visualization widgets that are flexible and readily available. These widgets natively connect to the data model, providing advanced visual analyses of components. To achieve similar results with pandas, other libraries like matplotlib or plotly are necessary. However, these libraries also introduce limitations around heavy, repetitive coding, and their results are often static and lack interactivity.
- Atoti pushes the boundaries of advanced analytics by leveraging its branching capabilities to perform simulations and What-if analyses, which data structure duplication in pandas can only partially achieve. It also facilitates the creation of semantically dependent metrics, such as marginal and incremental VaR.
Building the notebooks for advanced VaR analytics
In order to set the right context for a comparative analysis, we developed two notebooks in parallel. The first notebook uses Atoti Python API (version 0.9.1) while the second notebook uses pandas (version 2.2.2). These notebooks provide VaR analytics, such as Value at Risk (VaR) and Expected Shortfall (ES) calculations at different confidence levels, and aggregates/decomposes them at different granularities.
We used a rather large PnL dataset of ~5.5GB. This dataset includes different trades, which are our most granular level. It also includes their respective PnL vectors. These vectors detail the trades’ Profits and Losses over 250 past dates. The VaR use case requires computing the Value at Risk and is done by applying a quantile function with a specific confidence level to the PnL vectors after first aggregating them (through summation) at the granularity level of interest.
The necessity of vector summation before the application of the quantile function makes this calculation non-linear. Combined with its vectorial aspect (array data type), this makes up most of the complexity of this use case.
In order to evaluate the performance differences between Atoti Python API and pandas, we chose to:
- Compute VaR and ES:
- At two different confidence levels: 95% and 99%
- At three different granularities: top-of-house (global value for entire financial institution), combination of book and trade, and combination of all attributes
- And proceed to track, for each of those queries, the:
- Response time
- Memory usage
Constructing and querying the VaR metrics
In Atoti, we load data into tables, which we then join to form our data model. On top of this model, we build an OLAP cube designed and optimized for multidimensional querying. Once established, our dataset components form a semantic layer and carry inherent analytical meaning. Most qualitative attributes become analysis dimensions or axes, which we then use to aggregate and decompose our quantitative variables (i.e. measures). We can also further enrich the semantic layer with more complex and specific measures. Thanks to this semantic layer, we only define and configure a measure once, and can subsequently use a measure to display and analyze at any level of granularity, with any combination of analysis hierarchies.
The above code is all we need in Atoti. From there, we can use Atoti UI widgets to produce no-code queries dynamically by simply dragging and dropping measures and analysis hierarchies. For comparison, we will also run queries against the cube that resemble pandas and return a DataFrame. However, this is not necessary for performing advanced analytics in Atoti. The UI automatically generates these queries and reflects them in visualizations, without needing extra code.
On the other hand, with pandas we would use the groupby function to first aggregate our PnL vectors up to the right level. Then, apply the quantile function on the grouped vectors to get our desired metric. However, the calculation would only be viable and performed at the specific level of granularity specified within the groupby call. This means we would need a separate block of code, with a separate function call, to execute for each of the desired aggregation levels (in our case, the 3 aforementioned granularities). Not only is this repetitive and code-heavy, but also memory-demanding, as each aggregated set is stored in its own DataFrame.
The code above only yields our metrics at the top-of-house level. To get the remaining two levels of aggregation, we had to write similar code but configure it to statically adjust the level of aggregation, as seen below, for the example of the Book/Trade level.
Comparing response time performance and memory footprint
We tracked query response times and incremental memory usage (beyond the load of the initial dataset) for each of the three aggregation levels.
Using Atoti drastically decreases both variables, as detailed in the table below, and thus allows for faster, flexible, but also more efficient analytics.
Library | Aggregation Level | pandas | Atoti Python API | % Decrease |
CPU Time (seconds) | Top-of-house | 8.41 | 0.0156 | 99.81 |
Book/Trade | 115 | 0.0312 | 99.97 | |
Trade | 382 | 0.219 | 99.94 | |
Wall Time (seconds) | Top-of-house | 10.1 | 0.0517 | 99.49 |
Book/Trade | 165 | 5.96 | 96.39 | |
Trade | 512 | 11.2 | 97.81 | |
Peak Memory (MB) | Top-of-house | 7740.5 | 163.41 | 97.89 |
Book/Trade | 10486.4 | 226.64 | 97.84 | |
Trade | 10728.24 | 857.36 | 92 |
💡 Note: These performance measures may vary depending on the specs and state of the used machine. We have run these tests on a personal computer of 32GB RAM and 20 logical cores.
Atoti also offers a large array of further performance optimization tools that would decrease these pandas inefficiencies even more. An example of this is the aggregate providers that additionally improve query response time, as detailed in this notebook.
Heightened analytics with improved performance and reduced overhead
From the performance metrics above, the use of Atoti Python API expands upon the analytical functionalities that pandas offers. Response times also improved by 99% and memory footprint reduced by up to 97%.
We invite you to explore these two notebooks that accomplish similar outcomes but highlight the optimizations and progress Atoti brings to advanced analytics.