Topological Data Analysis: An Introduction for Non-Data Scientists

The purpose of this article is to introduce topological data analysis (TDA) to non-practitioners of data science, an underserved population in the information marketplace.  Ideally, these readers have some tie to data science without practicing it themselves.  The motivation is that most introductions are too technical, in some cases even for data scientists.  The challenge is that TDA originates in topology, a subfield of mathematics concerned with the concept of shape along with many other technical concepts, all of which make it difficult to follow traditionally good starting points like the usually reliable Wikipedia article.  YouTube has similar and sometimes better starting points, including this one by our founder, Gunnar Carlsson, but there is still room to offer some conceptual simplification.

TDA is a new category of AI that focuses on unlocking the information hidden in complex data.  Complex data is not necessarily the same thing as Big Data.  While the volume, variety, velocity and veracity of data are issues for anyone working with complex data, they aren’t defining aspects of it.  Complex data arises wherever the underlying phenomenon being measured is itself complex, as with logs, human behavior or medical treatments, or wherever the data is represented with a large number of columns.  ‘Columns’ is a spreadsheet term for what data scientists call features, mathematicians call dimensions and everyone else calls variables.  It’s all roughly the same thing.  As the number of rows in your dataset grows, it may eventually reach a point where it is appropriate to call it Big Data, but if it doesn’t have a large number of columns, it is relatively simple data even if Big.  TDA is purpose-built to capture the complexity of many-column data and make the information within it accessible.

TDA is a data science methodology.  To understand what that means, it helps to remind ourselves what scientific methodology means.  Science is a quest to discover patterns in nature.  An example pattern is that water freezes at 32°F.  You can discover that phenomenon with a thermometer and a glass of water on a cold day.  It’s a repeatable pattern that anyone can validate under ordinary conditions.  Science is built upon discovering repeatable patterns in reality.  Data science is the same pursuit, focused on data and the patterns found in it.

As in science generally, grasping a repeatable pattern gives a person predictive power over their world.  The power to predict is a subset of the power to control.  Being able to predict is one of the key motivating goals of data science.  Weather, transportation schedules, prices and financial matters are all examples where the power to predict is significant.

TDA as a data science methodology, then, is a new approach to identifying patterns, usually for the purpose of generating predictors.  It is focused on complex datasets and surfaces hitherto unseen patterns in them.  How does this work?  First, TDA doesn’t remove the need for clean data, although it can occasionally help when working with datasets that have data-quality issues.  Second, TDA focuses on what makes data points similar to each other (e.g. row-by-row comparisons) and uses this to compress the information into an interactive graphical representation of the whole dataset.  This graphical representation contains groups, many previously unseen, that occur naturally in the data.
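The row-by-row similarity idea can be illustrated with a minimal Python sketch.  It is not the TDA algorithm itself and is not taken from any particular product: it simply measures the straight-line (Euclidean) distance between every pair of rows, links rows that are closer than a cutoff, and reports the linked groups.  The function name `similarity_groups`, the sample rows and the `threshold` value are all assumptions made up for illustration.

```python
import math
from itertools import combinations


def similarity_groups(rows, threshold):
    """Group row indices that are linked, directly or through other rows,
    by a Euclidean distance of at most `threshold`.

    Illustrative only: a simplified stand-in for similarity-based grouping,
    not a full TDA method.
    """
    n = len(rows)

    # Compare every pair of rows and record which ones count as "similar".
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if math.dist(rows[i], rows[j]) <= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Walk the links to collect each connected group of rows.
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups


# Hypothetical dataset: each tuple is one row with two columns.
rows = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
groups = similarity_groups(rows, threshold=1.0)
print(groups)  # [[0, 1], [2, 3], [4]]
```

With these made-up rows, the first two rows land in one group, the next two in another, and the last row stands alone.  Raising or lowering `threshold` changes the grouping, which is one reason different similarity choices can surface different patterns in the same data.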

TDA relies on generating rich graphical representations of complex data.  Ideally, the representations are not static but allow the analyst to interrogate or manipulate a graphic-driven interface to identify unseen groups in the data.  Identifying groupings in a dataset is based on the concept of similarity, and different algorithms produce varying results.  In traditional data science, clustering and principal component analysis are established ways of organizing data by similarity.  TDA goes beyond these approaches and, as a result, it is the subject of a growing body of scientific publications as its power becomes better known.  Over 50 scientific publications can be found here.

The graphical nature of TDA and its ability to clearly capture underlying relationships and groups lead to explainable AI, which is largely absent from traditional machine learning methodologies.  The benefit of explainable AI over non-explainable or difficult-to-explain AI is that, with it, you can model faster and with better accuracy.  In other words, via TDA you can grasp a dataset at a more meaningful level.  This allows you to produce more models across more datasets in a shorter time.  Tweaking, correcting or simply sustaining existing models also becomes easier.  Explainability also has a positive effect on security and, more generally, on risk management.  Due to its value, explainability as a benefit of TDA warrants further exploration and will be the focus of the next installment of this TDA introduction.

Here is a list of resources to help anyone better grasp TDA in a variety of domains:

Generating Risk Models using TDA
TDA in Professional Basketball
Introduction to TDA
3 min intro for Math types
Using TDA to locate Cyber Unknown Unknowns
