Episode #3 of the course Data fundamentals by Colby Schrauth and Serge LeBlanc
Yesterday, we explored the three main areas of the data profession:
• data analytics
• data science
• data engineering
Today, we’ll cover the different data tools needed to start working with data. Let’s get started!
Where to Start?
The vast majority of data analysis projects can be tackled by using a combination of three types of data tools:
• spreadsheets (Excel, Google Sheets, etc.)
• databases and SQL (MySQL, PostgreSQL, etc.)
• business intelligence (BI) software (Looker, Tableau, etc.)
To help us better understand why we only need these three tools, it’s useful to cover the different capabilities of data tools.
Data Tool Capabilities
Each data tool can be classified by a combination of five different capabilities:
1. Storage: Does the tool have the capability to keep data in it?
2. Structure: Does the tool automatically provide a level of organization to data?
3. Access: Does the tool allow you to view and/or extract data?
4. Manipulate: Does the tool enable you to build calculations, edit values, and more?
5. Sharing: Does the tool have the ability to build reports, graphs, and more?
Each data tool has its own area of specialization, and certain ones combine multiple capabilities in a single tool.
Why Only Three Data Tools?
We strongly believe that it’s best to only focus on learning spreadsheets, databases and SQL, and a BI tool when first starting to work with data for three reasons. First, as previously mentioned, the combination of these three tools can help you solve the vast majority of data analysis problems.
Second, because of their capabilities, they are the most popular tools used in data analysis, so learning these is critical for a career in data, and you can find a large amount of information by searching online.
Finally, these tools make it easy to learn key principles of data analysis, without being distracted by the complexities of other tools or programming languages, like Python and R. Now let’s take a closer look at each type of tool and their range of capabilities.
Spreadsheets. Spreadsheets are the most popular and arguably the most powerful type of data tool in the industry because they offer the entire spectrum of capabilities in a single tool. Using Excel as an example:
• Storage: Data is contained within the file (the more data, the larger the file).
• Structure: Data is structured into rows, columns, and individual cells.
• Access: Data can be accessed by opening the file.
• Manipulate: Excel provides numerous mathematical and formatting functions.
• Sharing: Visualizations can communicate insights and the file itself can be shared.
Databases and Structured Query Language (SQL). Most of the world’s data is stored in a database, and the most common way to access this data is through writing SQL. Storing data in a database and writing SQL to access it unlocks an entire spectrum of capabilities, with the exception of Sharing:
• Storage: Data is stored in a local database (e.g., laptop) or on a cloud server.
• Structure: Data is structured into tables and schemas.
• Access: Data is accessed via SQL code.
• Manipulate: SQL can apply mathematical functions, transforms, etc.
Business Intelligence Software. Business intelligence tools are software applications that are designed primarily to analyze and report on data stored in a database. Metaphorically, BI Software sits atop a database and unlocks the sharing capability (e.g., dashboards, scheduling reports, etc.). Behind the scenes, the BI tool is writing SQL to access a database, and returning the data in its graphical user interface. BI software alone has the following capabilities:
• Access: This connects directly to a database.
• Manipulate: This provides drag-and-drop functionality to data analysis.
• Sharing: This makes it easy to create beautiful visualizations.
But because BI software connects to a database, the entire capability spectrum is unlocked, with the added advantage of being online and capable of working with large datasets.
In summary, for tasks where data is available in a CSV or Excel file, a spreadsheet tool is all you need for your analysis. For data sets that live in a database, the combination of SQL and a BI tool should be your method of choice.
Tomorrow, we’ll take a look at different data files and data types, and the impact they have on data analysis.
—Colby and Serge
Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman
Share with friends