Category Archives: Data Science

Two Design Principles for Tableau Data Sources

It’s not unusual for talented teams of business analysts to find themselves maintaining significant inventories of Tableau dashboards. In addition to sound development practices, following two key principles in data source design helps these teams spend less time on maintenance and more on building new visualizations: publishing Tableau data sources separately from workbooks, and waiting until the last opportunity to join dimension and fact data.


Imagine a business team — let’s call it Marketing Analytics — with read-only access to a Hadoop store or an enterprise data warehouse. They gain approval for Tableau licenses and Tableau Server publication rights for five tech-savvy data analysts. After a few initial successes with some impactful visualizations, the team gathers steam. After a while the team finds itself supporting scores of published workbooks serving a few hundred managers and executives. In spite of generally sound practices, Marketing Analytics struggles to maintain consistency from one Tableau workbook to another.

Continue reading

More on “Select Failed. [2646] No more spool space”

Also see the previous related post Escaping Teradata Purgatory (Select Failed. [2646] No more spool space)

Not too long ago I posted on how to avoid the dreaded “No more spool space” error in Teradata SQL. That post recounted approaches to restructuring SQL queries so that they would avoid being cancelled for using inordinate amounts of Teradata resources. Teradata is an immensely powerful, even if aging, database engine, but it does little to help anyone not steeped in knowledge of its structure use its resources efficiently.

But what if, as sometimes happens, your DB admin team further tightens the screws by reducing spool space, or by imposing new execution time or CPU usage limits? Then you’ll have to go further to make queries efficient, as happened on one team that I was a part of. Beyond the steps previously recommended, here’s what we did: Continue reading

Leadership Must Prioritize Data Quality

Data quality improvements follow specific, clear leadership from the top. Project leaders count data quality among project goals when senior management encourages them to do so with unequivocal incentives, a common business vocabulary, shared understanding of data quality principles, and general agreement on the objects of interest to the business and their key characteristics.

Poor data quality costs businesses about “$15 million per year in losses, according to Gartner.” As Tendü Yoğurtçu puts it, “artificial intelligence (AI) and machine learning algorithms are only as effective as the data they use.” Data scientists understand the difficulties well, as they spend over 70% of their time in data prep.

Recent studies report that data entry typos are the largest source of poor data quality (here and here). My experience says otherwise. From what I’ve seen, operational data is generally good, and data errors only appear when data changes context. In this post I’ll detail why data quality is management’s responsibility, and why data quality will remain poor until leadership makes it a priority. Continue reading

Leader’s Data Manifesto at #EDW19: Building a Foundation for Data Science

It’s been a truism that data is a resource, but to prove it you just have to follow the money. As the illustration shows, the vast majority of corporate market value draws from intangible assets. Just as money is an abstraction that represents wealth, data is an abstraction that represents these intangible assets.

It’s year three after the initial rollout of the Leader’s Data Manifesto (LDM). Since then, many widely publicized events have highlighted the value of data and metadata, and the importance of sound data management (here, here, and here). Recently at Enterprise Data World, John Ladley, Danette McGilvray, James Price, and Tom Redman presented this year’s LDM update. They reintroduced the Manifesto, recounted events of the past year, discussed strategy for the coming year, and issued a call to action for data professionals. Continue reading

Enterprise Data Prep for Analytics: Two Principles

Data scientists spend most of their time doing data integration rather than gathering insights. In my interview with data scientist Yan Li, she said that data collection and prep take at least 70% of her time. Obviously, there’s a lot of integration work to do on data that’s new to analytics efforts, but not every analysis uses brand new data. Organizations can improve analytics efficiency by staging commonly used data pre-integrated for data science.

For years, large organizations have supported data warehouses, but prevailing data warehousing practices often fail when faced with “big data” volume and velocity. Still, warehousing teams in large organizations can pre-prep frequently used internal data. Examples include reference and master data, production and sales records, and so on. Continue reading

Anonymize Data for Better Executive Analytics

Reading articles about data anonymization makes it clear that it is not an entirely effective security measure (here and here), but it is still part of a robust security capability, and it is required if your organization is affected by GDPR. (I use “anonymization” as a general term encompassing techniques that de-identify personal data within a given data set.)

But there’s a positive side of anonymized data that hasn’t received much press. Providing anonymous data to senior managers who don’t need access to personal data can encourage them to take a broader perspective, and thereby bring new energy to fact-based senior planning and analysis. Continue reading

Toward an Analytics Code of Ethics

In data management and analytics, we often focus on correcting apparent inability and unwillingness on the part of business leaders to effectively gather and capitalize on data resources. With that perspective, we often see ethics as a side issue difficult to prioritize given the scale and persistence of our other challenges.

At least that was my perspective, and my initial response when confronted recently by a family member on this topic. Her view from outside the field was that ethics should be a primary concern. As I’ve reflected on this conversation, I’ve come around to her point.

In recent years we’ve seen many examples of data misuse due to ethical lapses. Here’s a post that gives five examples, including police officers looking up data on individuals not related to any police business, an employee passing personal data including SSNs to a text sharing site, and Uber’s “god view”, available at the corporate level, which an employee used in 2014 to track a journalist’s location. Continue reading

Start Data Quality Improvements with a New Definition

What is Data Quality anyway? If you are a data professional, I’m sure someone from outside our field has asked you that question, and if you’re like me you’ve fallen into the trap of answering in data-speak.

To my listener, I’d guess that the experience was similar to having a customer service rep who has just turned down his simple request justify it by describing byzantine company policies.

There’s a ton of great writing available on data quality, and I in no way mean to disparage it or its value in the field. But in that writing I’ve yet to find a concise and compelling definition that’s useful to non-data professionals. I’ll review one or two prevailing definitions and then offer one that could help us unlock real data quality improvements. Continue reading

Sound Data Culture Enables Modern Data Architectures

Modern data architectures, by enabling data analytics insights, promise to drive order-of-magnitude value gains across many business sectors (here, here, and here). Not so long ago, big data presented a daunting challenge. Although tools were plentiful, we struggled to conceptualize the architecture and organization within which to capitalize on those tools. Now solid frameworks have emerged. This post reviews two promising models for modern data architecture, and discusses two key cultural values critical to their successful adoption: the drive to solve business challenges and the drive for universal data correctness. Continue reading

Fixing Tableau Desktop Blue Screen or Unresponsive

Tableau Desktop (10.2.2 on Windows 7 at work) was consistently locking up my computer or causing a BSOD when I tried to start it. After struggling for a while to solve the problem, I found that Tableau was using all available resources to open its log file, which had grown over time to 24 GB. Apparently my version of Tableau Desktop doesn’t periodically clean up the log files.

However, if the …/Logs folder isn’t there at Tableau startup, it just builds a new one and starts fresh, so whenever Tableau isn’t running you can just delete it. So, to make that happen automatically, I’ve added a batch file with these commands to my startup folder: Continue reading
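The batch commands themselves are not part of this excerpt. As a rough illustration of the same idea, here is a minimal Python sketch that deletes a log folder if it exists. The function name `clear_log_folder` is hypothetical, and the folder path is deliberately left as a parameter because the location of the Logs folder is not spelled out above.

```python
import shutil
from pathlib import Path


def clear_log_folder(logs_dir: Path) -> bool:
    """Delete logs_dir and everything in it.

    Returns True if a folder was removed, False if there was nothing to remove.
    logs_dir is an assumption: pass the path of your own Tableau Logs folder.
    """
    if not logs_dir.is_dir():
        # Nothing to clean up; the folder is rebuilt at the next startup anyway.
        return False
    # Remove the folder and all log files inside it.
    shutil.rmtree(logs_dir)
    return True
```

As with the manual approach, this should only run while Tableau isn’t running, for example from a script launched at login before Tableau is opened.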