The Importance of Data Availability, Hygiene, and Accessibility for Machine Learning
At most companies, there is a stockpile of data waiting to be turned into breakdowns, insights, and forecasts.
Machine learning (ML) has made it possible to build models that turn that data into actionable information for your employees.
But the data has to first be accessible and clean enough to analyze.
We spoke about data hygiene best practices with Ethan Steinman from Infinia ML. Infinia ML solves business challenges using machine learning. Ethan has worked with clients to make sure their data is following best practices before deploying their machine learning algorithms.
The conversation below has been edited for length and content.
Why is it so important for companies to have good data?
I joined Infinia ML because, throughout my career, I watched the software industry spend a decade trying to find out how to store and retrieve data efficiently. To me, machine learning is how we actually start to use that data.
Because much of this data hasn’t been used before, the data that we’ve been storing for the last decade probably isn’t the right data.
We get caught in this wheel where you have to start doing ML with crappy data, and from there you inform your ongoing tactical decisions to get better data. Then the machine learning becomes even more useful and you get into a positive feedback loop.
There’s upfront cost and there’s a lot of pain to fix past decisions. But it gets better over time.
What are some best practices for data accessibility, data hygiene, and data availability?
We have a lot of companies that come to us and say “we think we know what problem we want to solve and we want to apply machine learning to it.” They might’ve tried other ML companies in the past, or a traditional data analytics approach, but didn’t get the results they wanted to see.
As we start to get engaged, our business development (BD) lead will bring in an engineer and we’ll start looking at sample data, identify or confirm their problem, and then tell them what data we’d need to solve that problem.
The main reason projects have failed in the past is that the customer doesn’t have the data, or doesn’t have it in a usable form.
One thing you need for machine learning to work is some sort of ground truth. Internally you might hear it called “labeled data”. You have a problem, and you have data where each example is marked as falling on the good side or the bad side.
You train your machine learning model off that labeled data. Without that data, it’s very difficult and your machine learning results won’t be as good.
The other side of data hygiene and data accessibility is a lot of companies have their data across a ton of disparate systems that don’t talk to each other.
We end up having to write up fairly complex data pipelines to get the data to the model so it can actually do the prediction.
In some cases, we’ll get sample data from a customer out of some of these data sources. They’ll tell us that Column B is a Boolean value, but when we look at the distribution of values we find “True”, “False”, “0”, “1”, and “6”.
We’ll go back and say “Okay, True and False work. 0 and 1 might work. What’s up with 6?”
A lot of the time they don’t know. They end up having to go back to their backend team, IT, or whoever owns the database to figure out what that value means.
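A check like the one described for Column B can be sketched in a few lines of Python. This is a minimal illustration, not Infinia ML's actual tooling: the `ACCEPTED` mapping and the `audit_boolean_column` helper are hypothetical names, and which spellings count as valid Booleans is an assumption to confirm with whoever owns the data.

```python
from collections import Counter

# Hypothetical mapping of accepted spellings to Booleans (an assumption for
# illustration; the data owner decides what is actually valid).
ACCEPTED = {"true": True, "false": False, "1": True, "0": False}


def audit_boolean_column(values):
    """Return (value counts, unexpected values) for a supposedly Boolean column."""
    # Normalize to lowercase strings so "True", True, and " true " count together.
    counts = Counter(str(v).strip().lower() for v in values)
    unexpected = {v: n for v, n in counts.items() if v not in ACCEPTED}
    return counts, unexpected


# Sample values mirroring the mix described in the interview.
sample = ["True", "False", 0, 1, 6, "True", 1]
counts, unexpected = audit_boolean_column(sample)
print(counts)      # distribution of normalized values
print(unexpected)  # values to take back to the data owner
```

Anything that lands in `unexpected` is a question for the backend or IT team rather than something to silently coerce.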
How has the implementation of machine learning led to better business solutions?
We work with the data the client gives us.
We build models depending on what they have and we build the data pipelines to connect all the pieces together. At the end, we deliver recommendations to the client about areas they should invest in to reduce long-term maintenance costs and when we need to start retraining the machine learning model.
We had one company that responded to government RFPs.
They had been writing a couple thousand proposals a year for over a decade and started their proposals with a blank Word document.
Their underlying request was “we want to increase our win rate; how can you help us do that?”
Our BD team went in and asked, “What goes into win rate?”
Writing good proposals is an obvious way to win more of them. We asked what their current process was and where we could apply machine learning successfully.
We created a mini search engine for them using machine learning. They plug in pieces of a new RFP they want to respond to and find relevant past proposals. It searches their proposal storage database and pulls text out of past proposals, so instead of starting from scratch, they can start from the proposal they wrote for this client about a similar problem.
They can filter it by dollar value or by whether the proposal was won to make sure the text was effective.
Someone writing a proposal can also use it to find who wrote a successful past proposal, then reach out and ask them to review the new one or tailor it to the people they’ll be talking to.
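The interview does not say which model powers the search engine. As a rough sketch of the workflow only, the example below swaps in classical TF-IDF with cosine similarity, written in plain Python, plus the two filters mentioned above. The `Proposal` fields, the `search` function, and the sample data are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Proposal:
    title: str
    text: str
    dollar_value: float
    won: bool


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, proposals, won_only=False, min_value=0.0):
    """Rank past proposals against RFP text, after filtering by outcome and value."""
    candidates = [
        p for p in proposals
        if (p.won or not won_only) and p.dollar_value >= min_value
    ]
    if not candidates:
        return []
    vectors, idf = tfidf_vectors([tokenize(p.text) for p in candidates])
    # Query terms never seen in the corpus carry no signal, so they are dropped.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, v), p) for v, p in zip(vectors, candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


proposals = [
    Proposal("Network upgrade", "upgrade secure network infrastructure for the agency", 500000, True),
    Proposal("Cafeteria services", "provide cafeteria catering and food services", 120000, True),
    Proposal("Network audit", "audit network security and firewall configuration", 80000, False),
]
results = search("secure network upgrade for a federal agency", proposals, won_only=True)
for score, proposal in results:
    print(round(score, 3), proposal.title)
```

A production system would likely use learned text embeddings rather than word overlap, but the shape is the same: filter, score against the new RFP, and return the closest past proposals first.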
As another example, many companies have time sheets and want to keep track of billable hours, who’s not billable, and why they’re not billable.
We’re doing a project for a fairly large company that does client and professional services. We have wide access into their timesheet tracking systems and are looking for anomalies, like an employee who’s been billable at 100% for the last six months and now isn’t. Or an employee who bills time against a client project but not under a billable time code for that project. Perhaps it’s an education thing where they need to bill or track their time differently.
We can bundle those anomalies into a UI for an executive to look at and make decisions from.
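The two anomalies mentioned above can be expressed as simple rules, which is a useful baseline even when the production system uses a learned model. The sketch below is an assumption-heavy illustration: the `TimesheetEntry` fields, the six-month history window, and the 20-point drop threshold are all hypothetical choices, not details from the project.

```python
from dataclasses import dataclass


@dataclass
class TimesheetEntry:
    employee: str
    month: str            # "YYYY-MM", so string sort order is chronological
    project: str
    client_project: bool  # time was logged against a client project
    billable_code: bool   # time was logged under a billable time code
    hours: float


def billable_ratio_by_month(entries, employee):
    """Return {month: billable hours / total hours} for one employee."""
    totals = {}
    for e in entries:
        if e.employee != employee:
            continue
        billable, total = totals.get(e.month, (0.0, 0.0))
        totals[e.month] = (billable + (e.hours if e.billable_code else 0.0), total + e.hours)
    return {m: b / t for m, (b, t) in sorted(totals.items()) if t > 0}


def find_anomalies(entries, history_months=6, drop_threshold=0.2):
    anomalies = []
    # Rule 1: client project time logged under a non-billable time code.
    for e in entries:
        if e.client_project and not e.billable_code:
            anomalies.append((e.employee, e.month, "client project time on non-billable code"))
    # Rule 2: the latest month's billable ratio falls well below the recent average.
    for employee in sorted({e.employee for e in entries}):
        ratios = list(billable_ratio_by_month(entries, employee).items())
        if len(ratios) <= history_months:
            continue  # not enough history to establish a baseline
        *history, (month, latest) = ratios[-(history_months + 1):]
        baseline = sum(r for _, r in history) / len(history)
        if baseline - latest >= drop_threshold:
            anomalies.append((employee, month, "billable ratio dropped"))
    return anomalies


entries = [TimesheetEntry("ana", f"2024-{m:02d}", "client-x", True, True, 160) for m in range(1, 7)]
entries.append(TimesheetEntry("ana", "2024-07", "internal", False, False, 160))
entries.append(TimesheetEntry("ben", "2024-07", "client-y", True, False, 8))
anomalies = find_anomalies(entries)
for row in anomalies:
    print(row)
```

The resulting list of tuples is the kind of output that could feed a review screen for an executive, with each row pointing at a person, a period, and a reason.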
How often do you have to retrain the model and what does the long-term engagement look like?
Most software projects have high initial costs and then they tail down to the costs of running the servers or software.
With machine learning, it doesn’t tail down like that because you have to retrain the model, which adds spikes of additional cost to keep it working.
Retraining models is not really time-based. It’s based on the data that’s coming into the system and, if possible, we try to gather feedback from a human to see if they’re still happy with the results.
We’re looking for either a change in the incoming data or a change in the distribution of predictions. For example, if we’re answering a yes or no question and we’ve been answering 98% yes for 6 months and all of a sudden we see that decline to 95%, 93%, or 87%, we know something is changing there.
We try to use data-driven approaches like that to determine if it’s time for model retraining. The user feedback gives us the subjective flavor. We might hear, “your model used to give us really good data, but now the things I’m getting back aren’t useful to me anymore.”
Sometimes the data might not indicate there’s a problem but the user feedback lets us know we need to retrain.
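The yes/no example above amounts to comparing the recent share of “yes” predictions with a long-run baseline. A minimal sketch of that comparison follows; it is not how the Auditor works (the interview does not describe its internals), and the `tolerance` value and function names are hypothetical.

```python
def yes_rate(predictions):
    """Share of predictions that are 'yes' (True)."""
    return sum(1 for p in predictions if p) / len(predictions)


def needs_retraining_review(baseline, recent, tolerance=0.03):
    """Flag when the recent 'yes' rate drifts from the baseline by more than tolerance.

    The tolerance is an illustrative assumption; a real threshold would depend
    on prediction volume and on how costly a missed drift is.
    """
    return abs(yes_rate(recent) - yes_rate(baseline)) > tolerance


baseline = [True] * 98 + [False] * 2        # 98% yes over the reference period
recent_stable = [True] * 97 + [False] * 3   # small wobble, within tolerance
recent_drifted = [True] * 87 + [False] * 13 # the kind of decline described above

print(needs_retraining_review(baseline, recent_stable))
print(needs_retraining_review(baseline, recent_drifted))
```

A flag from a check like this is a prompt to investigate, not an automatic retrain, and it complements rather than replaces the user feedback described above.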
We also have a system called the Auditor which is the tool we use to monitor our machine learning models. It has a lot of machine learning built into it as well.
Whenever we install models into a customer environment, we recommend installing the Auditor so we can monitor the model over the long term and the client can get many years of value out of the model instead of 6 months.
How do you educate people on the importance of data hygiene?
We do things like internal webinars at our clients. It is a great sales opportunity to cross-sell at a large account, but it also helps educate other people at the company about the data we’re using and the format we need the data in.
It gives us a chance to come in and partner with the IT team. It’s important to build a good connection with the IT team. You just have to recognize they are busy people, so we make sure they understand we are partners.
They’re usually happy to make changes as long as you’re willing to make some changes for them. They’ll tweak their data pipeline or clean a pile of old data so you can train on it.
You just need to show value in return, and the value for the IT department is different from the value for the business.
What is the tech stack you use?
We settled on Vue.js for pretty much all our UI work. It’s beginner-friendly but powerful enough for when we need to do more complex things.
On the backend, we’re pretty much all Python. We use Django and Flask depending on what the problem is. Postgres is our database of choice.
For machine learning, it’s pretty much all in PyTorch. We have some substantial investments in PDF processing technology. I haven’t seen many open-source technologies, or even proprietary solutions, that are close to what we have.
PyTorch is in Python and that’s why we keep the rest of our stack in Python. It allows for a pretty easy transition between what the Data Scientists and the Engineers are working on day-to-day. It avoids unnecessary context switching.
We run all these on Kubernetes because we have to deploy to client environments on a regular basis. We essentially use Kubernetes as our deployment engine. All our client environments are consistent if we want to do updates or a restart. It allows us to run the same commands everywhere without knowing the specifics of that client environment.