A primer on how to become a data scientist
How do I become a good data scientist? Should I learn R* or Python*? Or both? Do I need to get a PhD? Do I need to take tons of math classes? What soft skills do I need to become successful? What about project management experience? What skills are transferable? Where do I start?
Data science is a popular topic in the tech world today. It is the science that powers many of the trends in this world, from machine learning to artificial intelligence.
In this article, we share what we have learned about data science in a series of steps, so that any product manager or business manager interested in exploring this science will be able to take their first step toward becoming a data scientist, or at least develop a deeper understanding of the field.
We have all heard conversations that go something like this: "Look at the data and tell me what you find." This approach may work when the data is small, structured, and limited. But when we are dealing with gigabytes or terabytes of data, it can lead to an endless, daunting detective hunt, which provides no answers because there were no questions to begin with.
As powerful as science is, it's not magic. Inventions in any field of science solve a problem. Similarly, the first step in using data science is to define a problem statement, a hypothesis to be validated, or a question to be answered. It may also focus on a trend to be discovered, an estimate, a prediction to be made, and so on.
For example, take MyFitnessPal*, which is a mobile app for monitoring health and fitness. A few of my friends and I downloaded it about a year ago, and then used it almost daily for a while. But over the past 6 months, most of us have completely stopped using it. If I were a product manager for MyFitnessPal, a problem I might want to solve would be: how can we drive customer engagement and retention for the app?
Today's data scientists access data from several sources. This data may be structured or unstructured. The raw data that we get is often unstructured, dirty, or both, and it needs to be cleaned and structured before it can be used for analysis. Most of the common sources of data now offer connectors to import the raw data into R or Python.
Common data sources include the following:
In the data science world, common vocabulary includes:
| Observations or examples | are like the rows in a database. | For example: A customer record for Joe Allen. |
| Variables, signals, or characteristics | are like the columns. | For example: Joe's height. |
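To make the vocabulary concrete, here is a minimal sketch in Python. The field names and values (such as `height_cm` and the second customer) are made up for illustration; only the idea of rows as observations and columns as variables comes from the text.

```python
# Each dict is one observation (a row); each key is a variable (a column).
# All values here are hypothetical sample data.
customers = [
    {"name": "Joe Allen", "height_cm": 180, "is_active": True},
    {"name": "Ann Lee", "height_cm": 165, "is_active": False},
]

# One observation: the full record for a single customer
joe = customers[0]

# One variable: the same characteristic read across all observations
heights = [c["height_cm"] for c in customers]
```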
Several terms are used to refer to data cleaning, such as data munging, data preprocessing, data transformation, and data wrangling. These terms all refer to the process of preparing the raw data to be used for data analysis.
As much as 70–80 percent of the effort in a data science analysis goes into data cleaning.
A data scientist analyzes each variable in the data to evaluate whether it is worthy of being a feature in the model. If including the variable increases the model's predictive power, it is considered a predictor for the model. Such a variable is then considered a feature, and together all the features create a feature vector for the model. This analysis is called feature engineering.
Sometimes a variable may need to be cleaned or transformed to be used as a feature in the model. To do that we write scripts, which are also referred to as munging scripts. Scripts can perform a range of functions like:
Sometimes the data has numerical values that vary widely in magnitude, making it difficult to visualize the information. We can resolve this issue using feature scaling. For example, consider the square footage and number of bedrooms in a house. If we normalize the square footage so that it has a magnitude similar to the number of bedrooms, our analysis becomes easier.
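As a rough illustration of feature scaling, the sketch below uses min-max normalization. That is one common choice, not a method the discussion above prescribes, and the house values and the helper name `min_max_scale` are made up for the example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical houses: the two variables differ by orders of magnitude
square_feet = [850, 1200, 1800, 2400, 3100]
bedrooms = [1, 2, 3, 4, 5]

# After scaling, both variables live on the same 0-to-1 scale
scaled_sqft = min_max_scale(square_feet)
scaled_beds = min_max_scale(bedrooms)
```

Once both variables share a scale, they can be plotted on the same axes or compared directly without one dominating the other.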
A series of scripts is applied to the data iteratively until we get data that is clean enough for analysis. To get a continuous supply of data for analysis, the same series of data munging scripts needs to be rerun on each new batch of raw data. Data pipeline is the term given to this series of processing steps applied to raw data to make it analysis ready.
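One simple way to picture a data pipeline is as an ordered list of small cleaning functions that can be rerun on every new batch of raw data. The sketch below assumes a made-up record layout (`name`, `height_in`) and made-up step names; real pipelines are usually built with dedicated tools.

```python
def drop_incomplete(rows):
    """Remove rows that have a missing (None) value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def normalize_names(rows):
    """Trim whitespace and use title case for the 'name' column."""
    return [{**r, "name": r["name"].strip().title()} for r in rows]

def to_centimeters(rows):
    """Convert the text column 'height_in' (inches) to numeric 'height_cm'."""
    return [
        {"name": r["name"], "height_cm": round(float(r["height_in"]) * 2.54, 1)}
        for r in rows
    ]

# The pipeline is just the munging steps in the order they must run
PIPELINE = [drop_incomplete, normalize_names, to_centimeters]

def run_pipeline(raw_rows, steps=PIPELINE):
    """Apply every step in order and return analysis-ready rows."""
    rows = raw_rows
    for step in steps:
        rows = step(rows)
    return rows

# Hypothetical raw batch: one dirty but usable row, one incomplete row
raw = [
    {"name": "  joe allen ", "height_in": "70"},
    {"name": "ann lee", "height_in": None},
]
clean = run_pipeline(raw)
```

Because the steps are plain functions, the same `run_pipeline` call can be repeated whenever a new batch of raw data arrives.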
Now we have clean data and we are ready for analysis. Our next goal is to become familiar with the data using statistical modeling, visualizations, discovery-oriented data analysis, and so on.
For simple problems, we can use simple statistical measures such as the mean, median, mode, min, max, range, quartiles, and so on.
We could also use supervised learning with data sets that give us access to actual values of response variables (dependent variables) for a given set of feature variables (independent variables). For example, we could find trends based on the tenure, seniority, and title for employees who have left the company (resigned=true) from actual data, and then use those trends to predict whether other employees will resign too. Or we could use historic data to correlate a trend between the number of visitors (an independent variable or a predictor) and revenue generated (a dependent variable or response variable). This correlation could then be used to predict future revenue for the site based on the number of visitors.
The key requirement for supervised learning is the availability of actual values and a clear question that needs to be answered. For example: Will this employee leave? How much revenue can we expect? Data scientists often refer to this as "Response variable is labeled for existing data."
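Python's standard `statistics` module covers most of these simple measures. The sketch below uses a made-up list of daily step counts as sample data.

```python
import statistics

# Hypothetical sample: daily step counts for one week
daily_steps = [4200, 5100, 5100, 6300, 7000, 8800, 12500]

summary = {
    "mean": statistics.mean(daily_steps),
    "median": statistics.median(daily_steps),
    "mode": statistics.mode(daily_steps),
    "min": min(daily_steps),
    "max": max(daily_steps),
    "range": max(daily_steps) - min(daily_steps),
    # Three cut points that split the data into four equal-sized groups
    "quartiles": statistics.quantiles(daily_steps, n=4),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```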
Regression is a common tool used for supervised learning. A one-factor regression uses one variable; a multifactor regression uses many variables.
Linear regression assumes that the unknown relation between the factor and the response variable is linear: Y = a + bx, where a is the intercept and b is the coefficient of x.
A part of the existing data is used as training data to calculate the value of this coefficient. Data scientists often use 60 percent, 80 percent, or at times 90 percent of the data for training. Once the value of the coefficient is calculated for the trained model, the model is tested on the remaining data, also referred to as the test data, to predict the value of the response variable. The difference between the predicted response value and the actual value is the Holy Grail of metrics, referred to as the test error metric.
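The train-and-test procedure can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the visitor and revenue numbers are synthetic, the split is 80/20, the fit uses ordinary least squares, and mean squared error stands in for the test error metric.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mean_squared_error(a, b, xs, ys):
    """Average squared gap between predicted and actual responses."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: revenue is roughly 50 + 2 * visitors, plus random noise
rng = random.Random(0)
visitors = list(range(100, 1100, 50))
revenue = [50 + 2.0 * v + rng.gauss(0, 20) for v in visitors]

# Shuffle, then keep 80 percent for training and 20 percent for testing
pairs = list(zip(visitors, revenue))
rng.shuffle(pairs)
split = int(0.8 * len(pairs))
train, test = pairs[:split], pairs[split:]

# Fit on the training data only, then measure error on unseen test data
a, b = fit_line([x for x, _ in train], [y for _, y in train])
test_error = mean_squared_error(a, b, [x for x, _ in test], [y for _, y in test])
print(f"a={a:.2f}, b={b:.2f}, test MSE={test_error:.2f}")
```

Keeping the test data out of the fit is the point of the split: an error measured on the training data would flatter the model.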
Our quest in data science modeling is to minimize the test error metrics in order to increase the predictive power of the model by:
Unsupervised learning is applied when we are trying to learn the structure of the underlying data itself. There is NO RESPONSE VARIABLE. Data sets are unlabeled and pre-existing insights are unclear. We are not clear about anything ahead of time so we are not trying to predict anything!
This technique is effective for exploratory analysis and can be used to answer questions like
Analysis of variance (ANOVA) is a common technique used to compare the means of two or more groups. It is named ANOVA because estimates of variance are the main intermediate statistics calculated.
Clustering is the unsupervised technique that organizes observations into similar groups, called clusters. It compares observations using a distance metric, Euclidean distance being a popular one, and assigns each observation to a cluster based on its predictors.
Two common clustering applications are:
If a stable state is not achieved, we may need to refine the number of clusters (i.e., K) we assumed in the beginning or use a different distance metric.
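To show what a stable state means in practice, here is a minimal K-means sketch in plain Python using Euclidean distance. The sample points, the `k_means` function name, and the choice to start from randomly picked points are assumptions for illustration; libraries such as scikit-learn provide production-quality versions.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Plain K-means with Euclidean distance.

    Returns (centroids, labels, converged). 'converged' is True when the
    cluster assignments stopped changing, i.e. a stable state was reached.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            return centroids, labels, True
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels, False

# Hypothetical observations with two predictors each
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centroids, labels, converged = k_means(points, k=2)
```

If `converged` comes back False, that is the signal to try a different K or a different distance metric.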
The final clusters can be visualized for easy communication using tools like Tableau* or graphing libraries.
In my quest to understand data science, I met with practitioners working in companies, including Facebook, eBay, LinkedIn, Uber, and some consulting firms, that are effectively leveraging the power of data. Here are some powerful words of advice I received:
R is a favorite tool of many data scientists and holds a special place in the world of academia, where data science problems are worked on from a mathematician's and statistician's perspective. R is a rich, open source language, with about 9,000 additional packages available. A popular tool used to program in R is RStudio*. R has a steep learning curve, but its footprint in the enterprise world is steadily increasing, and it owes some of its popularity to the rich and powerful Regular Expression-based algorithms already available.
Python is slowly becoming the most extensively used language in the data science community. Like R, it is also an open source language and is used primarily by software engineers who view data science as a tool to solve real customer-facing business problems using data. Python is easier to learn than R, because the language emphasizes readability and productivity. It is also more flexible and simpler.
SQL is the basic language used to interact with databases and is required for all tools.
Below is a list of important soft skills to have, many of which you might already have in your portfolio.
Your goal is to give them direct recommendations based on your solid prediction algorithm and accurate results. We recommend that you create four or five slides where you clearly tell this story: storytelling backed by solid data and solid research.
Visualization. A good data scientist needs to communicate results and recommendations using visualization. You cannot hand someone a 200-page report to read. You need to present using pictures, images, charts, and graphs.
Now it's time to decide. What type of data scientist should I become?