Sort, Filter, and Segment Data

13.12.2019 | 0

Episode #7 of the course Data fundamentals by Colby Schrauth and Serge LeBlanc

Welcome back!

Yesterday, we shared the second critical area for practically working with and understanding data: joining data together. Relational data models are powerful. Each dataset lives in its own domain, yet an architected link provides the power to analyze holistically.

Today, we’ll share the third and final area for practically working with and understanding data: sort, filter, and segment. Let’s jump right in!

Sort

The beauty of sorting data is that it’s a fairly simple concept to grasp and can provide insights incredibly fast. Sorting is often referred to as ordering, and there are two main concepts to internalize here:

Ascending sort is when things go from low to high. Here are a few examples:

• 1, 2, 3, 4, 5

• Amanda, Colby, Marissa, Timber (i.e., alphabetical)

• 2019-01-01, 2019-02-01, 2019-03-01 (i.e., old dates to new)

Descending sort is when things go from high to low. Here are the same examples, but now in descending sort:

• 5, 4, 3, 2, 1

• Timber, Marissa, Colby, Amanda

• 2019-03-01, 2019-02-01, 2019-01-01

Sorting is a quick way to see the highs and lows in a dataset. In spreadsheet tools, there’s a feature dedicated to sorting. In SQL, this is done by the ORDER BY operator. In BI Software, sorting is typically performed by clicking on the field of interest.

Filter

Filtering data is the act of purposefully choosing to include and/or exclude certain data points. Here are a few examples of filtering data:

• You have 24 months of customer support data but only care to see the prior 12 months. You’d thus want to exclude the oldest months via a filter.

• You have customer support data by country but only want to analyze interactions in North America. You’d thus want to exclude all other countries via a filter.

Filtering, like sorting, is a fairly straightforward concept (i.e., to include or exclude). The concept we see most folks stumble on is filtering for multiple criteria. When looking to join multiple filters together, you have two forms of logic:

AND: Data must meet all filter criteria to be included/excluded. Carrying on with our example above, we want to see the prior twelve months AND requests in North America (i.e., only the data that meets both conditions).

OR: Data must meet at least one filter criteria to be included/excluded. Carrying on with our example above, we want to see the prior twelve months OR requests in North America (i.e., we’ll see North America requests older than twelve months OR other country requests within the prior twelve months).

In spreadsheet tools, there’s a feature dedicated to filtering, along with formulas for more complex AND/OR logic. In SQL, filtering is done via a WHERE clause and use of the AND/OR operators. In BI software, filtering is typically a built-in feature directly within the UI.

Segment

Datasets typically contain a great deal of variability and depth, making them hard to process. Segmentation is powerful because it enables you to categorize/label data points to their core theme, thus making your dataset easier to interpret. Let’s jump into an example:

Pretend that you work for a business that sells a product to every country in the world (as of this writing, there are 195 countries in the world). It would be a difficult exercise to analyze and extract insights for all 195 countries. However, we could segment each country to its appropriate continent and then have only seven areas to analyze.

We’ve used segmentation to help our companies understand the:

• time to customer purchase (one day, two to three days, four to seven days, etc.)

• volume of help tickets by type (billing, how to, product bug, etc.)

• style of device (mobile, desktop, tablet, etc.)

In spreadsheet tools, segmentation is best done via pivot tables. In SQL, segmentation is done via a GROUP BY clause. In BI software, segmentation is typically a built-in feature directly within the UI (many times titled “Pivot”).

Sorting, filtering, and segmenting data are three powerful techniques for simplifying a dataset and making it easier to understand the things that matter most. Tomorrow, we’ll change gears and tackle our first data measurement principle: central tendency!

—Colby and Serge

Recommended book

The Visual Display of Quantitative Information by Edward R. Tufte

Share with friends