In my work I have used Snowflake quite a lot. The so-called Data Cloud is very popular, and getting started is easy.
One of the issues I have with Snowflake: it lacks enforced primary keys! When you insert data, there's no way to guarantee that data isn't duplicated. Of course, Snowflake being SQL compatible, you can add primary key constraints, but they're not enforced; Snowflake just stores them as metadata.
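A minimal sketch of what that looks like in practice (the table and column names are made up): the constraint is accepted, and so is the duplicate.

```sql
-- Hypothetical table with a declared primary key
CREATE OR REPLACE TABLE orders (
    order_id INT PRIMARY KEY,  -- accepted, but not enforced
    amount   NUMBER(10, 2)
);

-- Both inserts succeed; Snowflake does not reject the duplicate key
INSERT INTO orders (order_id, amount) VALUES (1, 9.99);
INSERT INTO orders (order_id, amount) VALUES (1, 9.99);

-- Returns 2 rows, both with order_id = 1
SELECT * FROM orders;
```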
This should scare any data engineer, architect or analyst away, yet people just seem to accept it.
Solutions I have seen for this problem (and it always becomes a problem sooner or later):
- Copy data from one table to another, de-duplicating while copying (see the first sketch after this list). Often tools like DBT are used for that. I have my thoughts about DBT, but it's pretty insane that you have to use it for such mundane work.
- Using DISTINCT or similar everywhere, so you de-duplicate the data while retrieving it (see the second sketch after this list). The good thing is that Snowflake is very fast, so it's not much of a performance issue. The primary key constraint can also help a little: if you group by the primary key, Snowflake deduplicates out of the box. But it's quite a hassle.
- Just ignore the problem. Depending on how your pipeline is set up, the amount of duplicate data can be pretty low. Usually you only get duplicates if the process that's uploading your data to stages crashes, so you'd end up with a few percent of duplicate data. Until of course something breaks, there's a crash, and suddenly half of the records in your table are duplicated.
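The first option usually boils down to a query like this, whether you hand-write it or let DBT generate it. A sketch, assuming a raw table with a `loaded_at` timestamp (both names are made up):

```sql
-- Keep one row per order_id, preferring the most recently loaded one
CREATE OR REPLACE TABLE orders_clean AS
SELECT *
FROM orders_raw
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY loaded_at DESC
) = 1;
```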
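The second option pushes the same work into every read. A sketch against the hypothetical orders table from above, assuming the duplicates are exact copies:

```sql
-- De-duplicate while retrieving: exact duplicate rows collapse under DISTINCT
SELECT DISTINCT order_id, amount
FROM orders;

-- Or group by the (unenforced) primary key, which collapses
-- duplicate keys as a side effect
SELECT order_id, MAX(amount) AS amount
FROM orders
GROUP BY order_id;
```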
What amazes me is that everyone has the above problem. But rather than deciding that Snowflake might not be the best tool for the job, people start doubling down on Snowflake. Maybe because it costs so much per month. Or because of all the tooling you have to invest time in just to get it working reliably.