Recently, I posted “Interview with a Data Scientist” on my company’s blog. In it, my friend and colleague Yan Li answers four questions about being a data scientist and what it takes to become one. In my view, Yan’s responses provide a bracing reminder that data science is something truly new, yet rests on universal principles of application development.
Why is Data Science New?
My first career job was with an econometrics firm that had applied statistical models since 1979 to predict future values of economic indicators. As early as the mid-1990s, Capital One’s founders at Signet Bank had conceived their “Information Based Strategy,” under which modelers, many with scientific backgrounds, used statistical modeling and analysis to implement the company’s “innovative approach to targeted marketing based on customer profitability analysis”. Data mining, which Yan mentions, also dates from the mid-90s or earlier.
What’s new is general recognition in the business community of the value of the scientific method in data analysis. I believe that those paying for data analysis have realized over time that ad hoc BI methods and canned reporting, while useful, are limited. Extracting value from data requires a more disciplined approach, and they found it in the scientific method’s cycle of observation, question, hypothesis, prediction, and testing.
Data Science Rests on Universal App Dev Principles
In spite of the complex tools and techniques needed for data science, Yan identifies communication with business people as the most important data science skill. “A lot of times, business domain experts may not even know what their problems are and may require some preliminary analytical efforts to present to the business user to help them articulate the business problem better.” In other words, help them define the business requirements. Isn’t that the key to any successful app dev project? Some things never change.
Why It’s Bracing: Hard Work with Lots of Tools
In her answers, Yan mentions a wide range of tools and techniques, and says data collection and prep take at least 70% of her time. She has to write and tune Oracle SQL, pull data using Splunk, Pig, and Hive, model using R, and visualize using Tableau. An offhand mention of dimensions implies data modeling on top of all that.
Maybe it is just early days and proper support structures will emerge over time, but this is a startling list of tool skills. Beyond that, anyone who’s worked in ETL development knows that source data quality can require many lines of code to correct, and data from one source can require substantial processing before it matches data from another source.
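To make that ETL point concrete, here is a minimal Python sketch of the kind of cleanup that source data typically demands before two sources can be matched. The record fields and normalization rules are hypothetical illustrations, not drawn from Yan’s actual pipeline.

```python
# Hypothetical ETL cleanup sketch: two sources hold the same customer,
# but differing whitespace, case, and blank fields block a direct join.

def normalize(record):
    """Clean one source record: trim whitespace, unify key case,
    and turn empty strings into None so blanks compare consistently."""
    return {
        "customer_id": record.get("customer_id", "").strip().upper(),
        "signup_date": record.get("signup_date", "").strip() or None,
    }

# Raw extracts from two hypothetical sources.
source_a = [{"customer_id": " c101 ", "signup_date": "2015-01-02"}]
source_b = [{"customer_id": "C101", "signup_date": ""}]

cleaned_a = {r["customer_id"]: r for r in map(normalize, source_a)}
cleaned_b = {r["customer_id"]: r for r in map(normalize, source_b)}

# Only after normalization do the sources agree on a join key.
matched_ids = cleaned_a.keys() & cleaned_b.keys()
```

Before normalization, `" c101 "` and `"C101"` would never match; real pipelines multiply this handful of rules across dozens of fields and sources, which is where the “many lines of code” go.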
Finally, Yan concludes by reminding us that some experiments fail, and that answers generate more questions in a never-ending iterative cycle.
IBM’s Big Data webpage warns us that one-third of 4.4 million data and analytics positions will go unfilled in 2015. Judging from Yan’s description of her job, the remaining 2.9 million people will be very busy.