Question, Preparation, Discovery, and Action.

Basic Steps for Successful Data Analysis and Data Science

Taylor J. Segell
Feb 15, 2022 · 12 min read

I entered the world of Data Science roughly six years ago, and it is safe to say that I am obsessed. The entire process has captivated me and is now what I enjoy spending my free time doing. Yes, even Data Cleaning. By no means am I an expert, and there is still plenty for me to learn, which I look forward to. However, I thought I would put on paper what I have gathered to be the essential steps to completing a successful project and uncovering the truth from what seems to be a group of unrelated strings, integers, and booleans.

Step 1: Question

Motivation and Planning

Knowing what your business goal is when undertaking a data project, and understanding the business or activity that your data project supports, is the first step in the Data Analysis lifecycle. Address a clearly defined organizational need to ensure the team you are working with is on the same wavelength. It is even more important to consult with the people whose processes or businesses you want to improve before you even consider implementing a plan of action. Once everyone is on the same page, establish a timeline and key performance indicators (KPIs).

While creating the project’s timeline, recognize how long the insights will be required and how frequently the analyses will be performed. A project that is required to answer a single question will have a different functional process and performance indicators than one that will conduct analyses on a monthly or weekly basis, or that will generate conclusions continuously based on real-time data.

Implementing deadlines is crucial to keep the project moving forward; however, if it is clear a deadline could be missed, it is the team's duty to inform the stakeholders with an updated timeline. I personally would shy away from excuses; I still have not met anyone who cares for them. Another aspect to stay aware of is “Scope Creep.” Throughout the course of the project, the objectives can slowly start to change if you are not bearing them in mind. While recalibrating a plan is acceptable, it is important to ensure that the change is intentional and that unnecessary analysis is avoided.

When embarking on a data science project, keep the following guidelines in mind:

  • Identify and meet with all stakeholders as early as possible in the process and on a regular basis thereafter.
  • Unless you are working on a purely exploratory research project, identify the questions that the project must address as soon as possible.
  • Determine all data sources and recipients.
  • Bear in mind that communication is just as critical as the raw data and analyses themselves.
  • Be prepared to modify a project if data becomes outdated, unavailable, or unsuitable for answering the questions posed.
  • When a bump in the road is encountered, simply inform the stakeholders so they have the proper expectations.

Data Collection and Extraction

Following the determination of your objective, the next step is to locate your data. The success of a data project depends on combining data from as many reliable and unbiased (or at least known-bias) sources as possible. Some of the methods that can be used to gather the required information include the following (a short Python sketch follows the list of tools below):

Internal Databases:

  • Connect to a database that has been provided by stakeholders or to which team members have made reference.

External Databases:

  • Look for open datasets that are relevant to the business goal; however, make certain that the data comes from a reputable source.

APIs:

  • Use APIs when streaming data is required, or when they are advantageous because the data they serve is always up to date.

Web-scraping:

  • Web-scraping is used when there is no central database or repository and the data is distributed across a large number of websites all over the internet.

Helpful Tools:

  • SQL
  • Python
  • R
  • RDBMS
  • MongoDB
  • Hadoop
  • Selenium, Scrapy, BeautifulSoup
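
As referenced above, here is a minimal Python sketch, assuming Pandas, SQLAlchemy, and Requests are available, of pulling data from an internal database and an external API. The connection string, table, and endpoint are hypothetical placeholders rather than a prescription for any particular system.

```python
# Minimal collection sketch: one internal source (SQL) and one external source (API).
# Connection string, table name, and endpoint are hypothetical placeholders.
import pandas as pd
import requests
from sqlalchemy import create_engine

# Internal database: query the table the stakeholders pointed us to.
engine = create_engine("postgresql://user:password@internal-host:5432/sales")  # hypothetical
orders = pd.read_sql("SELECT * FROM orders WHERE order_date >= '2022-01-01'", engine)

# External API: many APIs return JSON that maps cleanly onto a DataFrame.
response = requests.get("https://api.example.com/v1/products", timeout=30)  # hypothetical endpoint
response.raise_for_status()
products = pd.DataFrame(response.json())

print(orders.shape, products.shape)
```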

Step 2: Preparation


Cleaning and Wrangling

The next step is easily everyone’s favorite part, right? Probably not. This is the phase where we make sure the data is ready to work with; Cleaning and Wrangling are what I consider to be the most important steps in the process. Insufficient preparation of the data for analysis will prevent even the most talented Data Scientists and Analysts from uncovering meaningful insights. Although that is a somewhat moot point, because the best players would never put themselves in this position.

According to Mode.com, “Data cleaning is a process by which inaccurate, poorly formatted, or otherwise messy data is organized and corrected.” This helps ensure you have valid results at the end of your analysis. Data Cleaning is also helpful in terms of productivity, organization, and improved mapping. It can also cut unnecessary costs down the road, because you will have little to correct during analysis and visualization.

The primary steps of Data Cleaning and Wrangling are as follows (a short Pandas sketch follows the list):

  • Remove unnecessary features and data points that are irrelevant to the objective.
  • Do simple housekeeping of the data structure, such as fixing typos, incorrect data types, extra spaces, etc.
  • Remove outliers that could skew the results of the analysis.
  • Drop duplicate data points to ensure no record carries extra weight.
  • Reconcile missing data by either removing or imputing it, depending on the needs of the project.
  • Standardize the data and correct incorrect data types and syntax.
  • Validate the data, essentially double-checking that the dataset is in the best form possible.
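
As promised above, here is a minimal Pandas sketch of these cleaning steps. The file and column names (customer_id, age, signup_date, revenue) are hypothetical, and the thresholds are illustrative rather than recommendations.

```python
# A rough Pandas walk-through of the cleaning steps listed above.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("raw_data.csv")

# Remove features irrelevant to the objective.
df = df.drop(columns=["internal_notes"], errors="ignore")

# Housekeeping: strip extra spaces and fix data types.
df["customer_id"] = df["customer_id"].astype(str).str.strip()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Drop duplicates so no record carries extra weight.
df = df.drop_duplicates(subset="customer_id")

# Reconcile missing data: impute numeric gaps, drop rows missing the target.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["revenue"])

# Remove outliers that could skew the analysis (a simple IQR rule here).
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Validate: a quick sanity check of the cleaned dataset.
print(df.info())
```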

In order to achieve the best results, it is critical to ensure that the data is homogeneous, free of errors, and that only data points relevant to the business goal are present. While this housekeeping phase is in progress, it is also critical that your data and project remain compliant with data privacy regulations throughout the process. Along with compliance, Data Governance is something that should never be ignored or taken lightly, and it should always be regarded as a top priority by every organization.

A final aspect to consider is the need to be on the lookout for data bias, which can be either intentional or accidental. A particular concern with Machine Learning and artificial intelligence (AI) is that not only can AI have difficulty detecting bias, it can also create it on its own. Unfortunately, humans are biased creatures, and because we are the ones providing the training data, it is nearly impossible to avoid bias in the data we provide.

“A major contributor to this bias is the data that OpenAI used to train GPT-3. The system’s training data comes from a wide variety of sources — including Wikipedia and Reddit — which contain inherent biases that find themselves baked into GPT-3’s generations.” (How AI Training Data Contributes To Its Bias)

Helpful Tools:

  • Python (Pandas Primarily) and R
  • MS Excel
  • KNIME
  • OpenRefine
  • Tabula
  • CSVKit

Bonus: (ETL, ELT, ETLT)

Processes such as ETL (Extract, Transform, and Load), ELT (Extract, Load, and Transform), and ETLT (Extract, Transform, Load, and Transform) are in essence a combination of the collection and cleaning processes. These methods are extremely beneficial when working with big data and help expedite the process, properly collecting, cleaning, and loading your data into your data warehouse and saving valuable time. The differences, advantages, and disadvantages of each would require another full article, but as I am sure you can tell, it is largely a question of when the Transform (Cleaning and Wrangling) step occurs. Each has use cases where it is more appropriate than the others.
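
For intuition only, here is a bare-bones ETL sketch in plain Python and Pandas that extracts from a CSV, transforms, and loads into a local SQLite "warehouse"; an ELT variant would load the raw extract first and transform inside the warehouse. The file and table names are hypothetical, and real pipelines would typically run on orchestration tools like those listed below.

```python
# Bare-bones ETL sketch: extract from CSV, transform with Pandas, load into SQLite.
# File and table names are hypothetical placeholders.
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read the raw source file."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: light cleaning before the data lands in the warehouse."""
    df = df.drop_duplicates()
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def load(df: pd.DataFrame, table: str, db_path: str = "warehouse.db") -> None:
    """Load: write the cleaned frame into a local SQLite 'warehouse'."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)

load(transform(extract("orders.csv")), table="orders")
```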

Helpful Tools:

  • Kafka
  • Xplenty
  • Spark
  • Airflow
  • Luigi

Step 3: Discovery

Feature Manipulation and Analysis

Now to the title track of the data analysis process: the analysis. After we have adequately wrangled, cleaned, and organized the data sources, it is time to bring all of the data together and boil it down to the insights we set out to find. Some of the methods used in this step include exploratory data analysis, statistical analysis, predictive modeling, and data mining.

Exploratory data analysis (EDA) is a technique for deciphering the messages contained within a dataset. This is perhaps the most commonly used approach, as it entails sorting and classifying the data, performing additional validation on the data, and creating rudimentary visualizations to identify trends that would be missed in a raw collection of numbers and letters. This process surfaces basic metrics like the mean, variance, standard deviation, and more. EDA typically leads into the more mathematically intensive aspects of analysis, mainly through algorithms from packages like NumPy, SciPy, etc. These algorithms have become ingrained in today’s data world and encompass the mathematical calculations used for data analysis. Measures such as correlation help determine the links between variables, bearing in mind that correlation alone does not establish causation.
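
As a small illustration, here is what a first pass of EDA might look like in Pandas: summary statistics, a missing-value check, a correlation matrix, and a quick group-by to spot trends. The dataset and column names (region, revenue) are hypothetical.

```python
# Minimal EDA sketch: summary stats, missing values, correlations, and an aggregation.
# Dataset and column names are hypothetical.
import pandas as pd

df = pd.read_csv("clean_data.csv")

print(df.describe())               # mean, std, quartiles for each numeric column
print(df.isna().sum())             # remaining missing values per column
print(df.corr(numeric_only=True))  # pairwise correlations between numeric features

# A quick aggregation to surface trends the raw numbers hide.
print(df.groupby("region")["revenue"].agg(["mean", "count"]))
```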

Next up comes the Machine Learning portion of the analysis. There are several subsets of Machine Learning, such as Supervised, Unsupervised, and Semi-Supervised learning. Supervised ML is definitely the most widely used today; however, Unsupervised and Semi-Supervised approaches, along with fields such as Natural Language Processing and Neural Networks, have been growing at a rapid pace.
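
As a tiny example of the unsupervised flavor (a supervised regression sketch follows the next paragraph), here is a K-Means clustering sketch with scikit-learn; the two "customer segments" are synthetic and purely illustrative.

```python
# Unsupervised learning sketch: K-Means recovers two synthetic "customer segments"
# without ever seeing labels. Data is synthetic and purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
segment_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
segment_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2))
X = np.vstack([segment_a, segment_b])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)  # should land near (0, 0) and (3, 3)
```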

Regression modeling is still the most commonly utilized form of predictive analysis. This form of modeling analyzes data by modeling the relationship between variables; for instance, establishing whether a shift in social media sentiment (an independent variable) has a direct effect on the stock price of a company (the dependent variable). These techniques are a subset of inferential statistics, the study of statistical data in order to derive inferences about the relationships between sets of data. A few popular examples are Linear Regression, Support Vector Machines, and Naive Bayes.
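
Following that example, here is a minimal supervised regression sketch with scikit-learn: does a sentiment score (independent variable) help explain a stock price (dependent variable)? The data is synthetic, so the fitted coefficient only illustrates the workflow.

```python
# Supervised learning sketch: fit a linear regression of price on sentiment.
# The data is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
sentiment = rng.uniform(-1, 1, size=(500, 1))                # synthetic sentiment scores
price = 100 + 12 * sentiment[:, 0] + rng.normal(0, 5, 500)   # synthetic stock prices

X_train, X_test, y_train, y_test = train_test_split(
    sentiment, price, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("coefficient:", model.coef_[0])                        # estimated effect of sentiment on price
print("R^2 on held-out data:", model.score(X_test, y_test))
```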

The Basic Categories of Analysis:

Descriptive:

  • Descriptive analysis establishes what has already occurred. Prior to delving deeper into a problem, the analyst will typically perform this step. A business analyst may use this type of analysis when hoping to record historical occurrences in order to understand which decisions were beneficial and which actions were detrimental to growth, profit, performance, and other factors. Although these insights may not result in concrete decisions, the process of aggregating and expressing the data will assist in determining how to proceed.

Diagnostic:

  • Like Descriptive analysis, Diagnostic analysis is performed on historical data. The primary difference between the two is that Diagnostic analysis focuses more on the why, whereas Descriptive analysis focuses on the what. Most of the time, an analyst will begin with Descriptive analysis to identify a problem or weak point, and then move on to Diagnostic analysis to determine the root cause of that outcome.

Predictive:

  • This type of analysis is exactly what the name implies. A data analyst will once again use historical data to predict what might happen if a particular course of action is taken. With the introduction of Machine Learning and artificial intelligence, this type of analysis has exploded in popularity, as it provides a more straightforward and affordable method of forecast modeling and trend prediction.

Prescriptive:

  • While predictive analytics has unquestionably changed the game, prescriptive analysis is the next step in the analytics process. When compared to predictive analysis, which provides the analyst with a hypothesis about what will happen in the future, prescriptive analysis employs the aforementioned forms in order to produce a recommendation on how best to proceed in order to achieve the set goal.

Despite the fact that there are other methods of analysis, I believe these are the four most important pillars.

Helpful Tools:

  • Python
  • Power BI
  • R
  • Excel
  • Spark
  • Jupyter Notebooks
  • SAS
  • KNIME
  • D3.js

Visualization

The time has come to combine quantitative and qualitative analysis with visual representations of the data and findings. Understanding what the data has to tell us has never been easier than it is now, thanks to the dissemination of statistical insights discovered through the medium (pun slightly intended) of impactful visualizations. This is at the heart of what people envision Data Analysis to be all about, and, personally, it is my favorite part.

The following little-known profound quote beautifully explains this point:

“A picture is worth a thousand words” — Source Disputed

I am delighted to be able to introduce you to such a wonderful quote, which I am 100 percent certain you have never heard before. I’m only here to lend a hand. Putting the tangential silliness aside, my only amendment to this quote is as follows: “A picture is worth a thousand words, but a picture with a thousand words is worth nothing.” Going overboard with our visuals in the hope of making them stylish and unique, while cramming too much information into them, is more than just counterproductive; it can destroy the insight we are attempting to convey.

In the opinion of David O’Neil, Steve Gerst, and Sharyl Prom of Delta Associates, effective visualizations or slides should:

  1. Effectively communicate the significance of the data.
  2. Make a clear and concise point that can be summarized in a single sentence and supported by facts that back up the insight.
  3. Use the strongest metric or metrics possible in order to avoid obscuring the story.
  4. Use the appropriate graphic styling to simplify the insight without detracting from what needs to be known.
  5. Refrain from over-styling the visuals by focusing on key Bleeders & Leaders® (i.e., features of importance) rather than simply overloading them with data.
  6. Maintain a clean and minimal aesthetic while concentrating on the insight rather than the glitz.

(Business Insights: How to Find and Effectively Communicate Golden Nuggets, 2014, p.15)

Even before I began my professional career in data, I was a strong believer in the maxim “More is Less and Less is More.” While passion and creativity are important characteristics of a Data Scientist or Analyst, a well-balanced approach is guaranteed to be an asset in the long run. If it takes your audience or stakeholders more than 3–5 seconds to comprehend the meaning of a visual, you have lost your way and all of your hard work may have been for naught.
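
To ground the "less is more" guidance, here is a minimal Matplotlib/Seaborn sketch of a single-insight chart: one series, one headline, minimal decoration. The figures and labels are hypothetical.

```python
# A clean, single-insight visual: one series, a headline title, no extra clutter.
# The data and labels are hypothetical.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

monthly = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [120, 135, 128, 160, 172, 190],  # hypothetical figures ($K)
})

sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(6, 3.5))
sns.lineplot(data=monthly, x="month", y="revenue", marker="o", ax=ax)

ax.set_title("Revenue has climbed steadily since January")  # the insight is the headline
ax.set_xlabel("")
ax.set_ylabel("Revenue ($K)")
sns.despine()
plt.tight_layout()
plt.show()
```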

Helpful Tools:

  • Plotly/ Dash
  • Matplotlib, Seaborn
  • Tableau
  • Cognos
  • PowerBI
  • D3.js, Chart.js
  • Google Charts
  • Looker

Step 4: Action

Presentation and Communication

(Image source: https://biteable.com/blog/how-to-make-good-presentation/)

It is now time to tell the story that the data has so generously provided to you. Everything leads up to this point. It does not matter how good one is at the earlier steps in the process, whether as an effective data wrangler, amazing aggregator, statistical wizard, or impactful visualization artist: if the insights cannot be communicated to the stakeholders, the project was a waste of time, money, and resources.

In any situation, communication is essential, and, returning to the visualization aspect, less is usually more, as long as the less is effective. The balance is difficult to strike: going into excessive detail may result in data overload and loss of interest, while keeping it too concise may result in the presentation failing to convey the insight that was found. Preparing for any questions that may be asked and practicing the presentation are both essential steps. It is highly recommended that you collaborate with your team members during this stage to bounce ideas off one another about how to explain the findings in the most effective and efficient manner while keeping the audience intrigued and engaged.

Helpful Tools:

  • Tableau
  • Powerpoint
  • Google Slides
  • Infogram
  • Zoho

It is only when all of these processes come together that the true power of data can be unleashed, resulting in actions with impacts of enormous positive magnitude. Every step deserves our full attention, and they are all equally important in terms of the role they play in the hunt for the truth. The information available to you is vast, and it is designed to assist you in making informed decisions that are data-driven and business- and customer-oriented. However, in light of the large amounts of data used in the modern day, we require more than knowledge of how to clean data with Python’s Pandas, how to find the proper K in K-Means Clustering, or the skill to properly query a database. What is needed in tandem is an endless appetite for knowledge, relentless curiosity and skepticism, exceptional communication skills, and the ability to recognize that the little things are truly the big things. One must also be assured that the data is being used in an ethical, unbiased, and applicable manner.

When all that comes together, all that is left is to take actions that can drive profound impact and growth for your stakeholders, the general public, and yourself.

