The joys of data cleansing
In this post, originally published on LinkedIn, our COO Sam Rhynas shares her thoughts about data cleansing and how we talk about it.
I’ve just read a post on LinkedIn which was basically a bit of a rant about poor practice in machine learning. While I agreed with many of the points it raised, the way they were discussed, and the solutions that were suggested, were, in my opinion, quite poor.
The thing that frustrated me the most was reading a description of data cleansing as being “grunt work done by the low paid”! I thought that was an appalling description and talking about it like that is one of the reasons that some people think it’s an awful, dreary part of working with data.
I love the data cleansing side of things. That doesn’t mean everyone else will, but I am absolutely confident that if I enjoy doing it, other people will enjoy doing it too!
I sometimes feel like I’m doing some kind of data forensics work when I’m doing data cleansing! Learning, understanding, making valid connections, digging into the mysterious values.
I know someone who’s just starting out on their data science journey, and they were recently telling me about some data cleansing they were doing. They had replaced all the missing values with 0s in their test data set. We had a very interesting conversation about why that wasn’t necessarily the right approach, and discussed the analysis and thought process they had gone through to establish that it was the right thing to do. They went away and rethought their approach, and I hope it was a really positive discussion and helpful for them.
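To show why filling with zeros deserves a second look, here is a minimal Python sketch. The ages in it are made up purely for illustration (they are not from the data set mentioned above), and the variable names are my own. It compares the average of the values we actually have with the average once every gap is treated as a real 0.

```python
from statistics import mean

# Hypothetical ages with two missing entries (None).
# These values are invented for illustration only.
ages = [34, None, 29, 41, None, 38]

# Keep only the values that were actually recorded.
observed = [a for a in ages if a is not None]

# Replace every missing entry with 0, as in the approach discussed above.
zero_filled = [0 if a is None else a for a in ages]

mean_observed = mean(observed)        # average over recorded values only
mean_zero_filled = mean(zero_filled)  # average once gaps are counted as 0

print(f"Missing values: {len(ages) - len(observed)} of {len(ages)}")
print(f"Mean of observed values: {mean_observed}")
print(f"Mean after filling with 0: {mean_zero_filled}")
```

The zero-filled average comes out noticeably lower because a 0 is treated as a genuine measurement rather than as "we don’t know". Whether that is acceptable depends on what the column means and why the values are missing, which is exactly the digging that makes data cleansing interesting.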
Data cleansing is so much more than deleting swathes of stuff or just replacing values. When you’re first learning, and don’t really have a client, it can feel like that, but once you’re doing it with real data for real people, it feels very different.