Data Science in Finance: Successful Machine Learning Projects in Fintech
Filip Stachura, CEO at Appsilon Data Science, spoke at Fintech Open Mic Night – Credit Scoring. Traditional credit scoring used financial history and basic information about the prospective borrower: job, address and so forth. The abundance of data, fueled by the emergence of data-collecting companies, has opened completely new sources of information for risk managers. An optimal credit scoring model today can use very large data sets from a variety of sources and look for relationships in previously unnoticed areas and factors. This can lead to better allocation of lending capital and, as a consequence, greater financial inclusion. Appsilon stands at the edge of discovery, exploring new applications of data science methodologies and tools in finance. We hope you will enjoy our video.
My name is Filip
I’m the CEO and co-founder of Appsilon Data Science. Before founding Appsilon, I worked for Microsoft in California. My background is mostly technical: I have a double degree in Applied Mathematics and Computer Science, but I mostly work with business people. I don’t code anymore; I only think about what is possible and what isn’t.
We work with corporate clients. Most of them are international, from Western Europe and the States. I can’t mention every company we work with, but some of them are Fortune 500 companies, and the topics we solve for them are pretty advanced. As you can see, these are not only finance companies; we are industry agnostic and mostly focus on getting the process right. Even though I’m going to focus on finance and insurance today, it is important to know that my opinions are shaped by the different industries we work with.
As we see it, there are four core ingredients to the machine learning and data science process, and we address all of them. First and foremost is the data. This is probably the most important ingredient, and one we sometimes forget because it seems obvious.
Across the different projects we’ve delivered, one fact keeps recurring: increasing data quality is the simplest way to deliver better accuracy. You can do that in various ways. A lot of companies think they’ve already solved their data problem because they’ve implemented a data warehouse, a data lake, or some other form of database, but the crucial thing is the data quality.
To get there, you can think of many different solutions, but one we’ve found works in practice is to apply the same processes we apply to source code. When programmers work on a program, they implement continuous integration: they write tests, and whenever the source code changes, the tests run automatically. We do the same with data. Whenever we get a new data set from a client, or if they gather data on a daily basis, we run tests daily and validate the quality. You get a report that says all checks pass, you have a green status, and you can feel safe that the new data is not going to destroy your quality or accuracy in any way.
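The idea of running automated tests on every incoming data batch can be sketched in a few lines. This is a minimal illustration, not Appsilon’s actual tooling; the record fields (`customer_id`, `age`, `income`) and the specific checks are hypothetical:

```python
def validate(rows):
    """Run simple data-quality checks on a batch of records.

    Returns a list of human-readable failure messages; an empty
    list means the batch passed all checks (a "green status").
    """
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("income") is None:
            failures.append(f"row {i}: income is missing")
        if not (18 <= row.get("age", -1) <= 120):
            failures.append(f"row {i}: age outside plausible range")
        if row["customer_id"] in seen_ids:
            failures.append(f"row {i}: duplicate customer_id")
        seen_ids.add(row["customer_id"])
    return failures

# A daily batch with two deliberate problems: a missing income
# (plus an implausible age) and a duplicated customer id.
batch = [
    {"customer_id": 1, "age": 34, "income": 52000.0},
    {"customer_id": 2, "age": 17, "income": None},
    {"customer_id": 2, "age": 55, "income": 61000.0},
]
for problem in validate(batch):
    print(problem)
```

In a real pipeline these checks would run on a schedule (or on every delivery), and a non-empty failure list would block the data from reaching the model, exactly like a failing build in continuous integration.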
The second part of the data topic is data acquisition. Nowadays you can think of tons of different data sources. You can get data from IoT; for example, in the insurance industry you can gather data from a driver’s mobile phone to give them a better insurance offer. The same goes for credit scoring: if you have a data source you can use, it’s probably worth considering when creating a hypothesis. This is the second point I’m going to come back to; the whole credit scoring exercise is very similar to research, as all of data science is. Beyond IoT there is, of course, social media. You can even generate new features on your own by, say, sending a small computation to the device your client is using; from how that computation performs you can tell whether your client is using a high-end or a low-end product. All of the classic sources are still valid, like credit history and transactional history and so on, across all time frames. You also have feature engineering, where you construct new features from the existing data. I have a few more ideas, but I can go over them later, during the panel.
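Feature engineering from transactional history often boils down to aggregating raw events into per-customer numbers. A minimal sketch, where the feature names and the 30-day window are illustrative choices, not a production feature set:

```python
from datetime import date

def engineer_features(transactions, as_of):
    """Build simple features from one customer's transaction history.

    `transactions` is a list of (day, amount) tuples; `as_of` is the
    scoring date. Windows and names here are made up for illustration.
    """
    recent = [amt for day, amt in transactions if (as_of - day).days <= 30]
    all_amounts = [amt for _, amt in transactions]
    return {
        "n_txn_30d": len(recent),                     # recent activity
        "spend_30d": sum(recent),                     # recent volume
        "avg_txn_all_time": (sum(all_amounts) / len(all_amounts)
                             if all_amounts else 0.0),
    }

history = [
    (date(2024, 1, 5), 120.0),
    (date(2024, 2, 20), 80.0),
    (date(2024, 3, 1), 40.0),
]
print(engineer_features(history, as_of=date(2024, 3, 10)))
```

Each derived number becomes a candidate input for the scoring model, and each new window or aggregation is effectively a hypothesis to validate, which is why the research framing fits.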
Models. We already got a good explanation from Vladimir on the pros and cons of going linear, using gradient boosted trees, or going all the way to deep learning. The most crucial thing is that deep learning is costly: you have to hire a very specialized team with a deep background in the topic and a good understanding of the consequences. Linear models are much cheaper and can be delivered faster, so if you can get a linear model rolling in a month, maybe that is better than working for several months and only seeing results in the future. The third part, which, like data, does not get enough focus nowadays, is reproducible research. What I mean by reproducible research is that when you have a model, say a gradient boosted tree or a linear model a data scientist implemented, then in one month, or even six months, you can get exactly the same results you get today. There are many ways to do that. This is very important for getting the process right: making sure you can reproduce results also lets you show the business people, ‘OK, we are making progress, because we were 1-2% worse last month and we are getting better on a weekly basis.’ This is very important.
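One low-tech ingredient of reproducible research is pinning every source of randomness so that a rerun gives bit-for-bit identical results. The sketch below uses a trivial stand-in for a real training run (the random "train/test split" merely simulates the nondeterminism); in practice you would also pin the data version, the code version, and the environment:

```python
import random

def train_and_evaluate(seed):
    """Stand-in for a full training run.

    With the seed pinned (and, in a real project, the data, code,
    and environment pinned too), rerunning this in six months
    yields exactly the same number as it does today.
    """
    rng = random.Random(seed)          # isolated, seeded RNG
    split = rng.sample(range(100), 20)  # pretend train/test split
    return sum(split) / len(split)      # pretend "accuracy"

run_today = train_and_evaluate(seed=42)
run_in_six_months = train_and_evaluate(seed=42)
assert run_today == run_in_six_months
print(run_today)
```

Using a local `random.Random(seed)` instance rather than the global module state keeps the run deterministic even if other code draws random numbers in between.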
Last but not least is the interface. This can be an API: if you consume forecasts or predictions on mobile, of course you use an API. But what is more interesting for Appsilon is dashboards. I don’t mean dashboards like Power BI or Tableau, but smart dashboards that also gather data from the people using them. So if you have an expert working with a dashboard, we can track their behavior, feed that behavior back into the model later on, and improve accuracy.
So, here is a list of some potential ideas in finance. Of course, this is pretty fuzzy, so you can mix and match. For example, if you are a credit company and you have a chatbot, you can do sentiment analysis on the conversations to create features for credit scoring. Nothing here is very strict; there are tons of ideas, and you can reuse ideas from other industries. You can bring image recognition into the finance industry as well: if it improves something, it is worth trying.
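The chatbot idea, turning conversation sentiment into credit-scoring features, can be sketched with a toy lexicon approach. The word lists and feature names below are invented for illustration; a real system would use a proper sentiment model:

```python
# Tiny illustrative lexicons; a production system would use a
# trained sentiment model, not hand-picked word lists.
POSITIVE = {"thanks", "great", "happy", "good"}
NEGATIVE = {"angry", "late", "problem", "cannot", "bad"}

def sentiment_features(messages):
    """Turn one customer's chatbot messages into numeric features
    that could feed a credit-scoring model alongside classic ones."""
    words = [w.strip(".,!?").lower() for m in messages for w in m.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {
        "n_messages": len(messages),
        "sentiment_balance": pos - neg,  # >0 leans positive
    }

chat = ["Thanks, that was great!", "I have a problem with a late payment."]
print(sentiment_features(chat))
```

The point is the mix-and-match: a technique from one domain (text analytics) produces ordinary numeric columns that slot into a scoring model like any other feature.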
Let’s define success for a credit scoring project, or another finance project, within your company.
In my opinion, the main point is to gain a business advantage. We see that technology is the key to success in the future, and to get there we need to be faster, automate, use fewer resources, or get better accuracy. What is also very important is keeping that advantage. As I said, this is not something that can be done once. The model you implement within your company needs to live as well, so you need to keep track of its accuracy and, if it is crucial, keep improving it: by gathering new data sources, by improving the model, and through experiments.
Constant improvement is also relevant to how you run the project itself. One thing that has proved to be unsuccessful, a complete disaster even, is fixed scope. Say you have a project that gives you only eight weeks to implement this stuff. That is doable if the only deliverable is some kind of model, but the crucial thing is the accuracy, and you cannot guarantee any level of accuracy before you validate the model itself. If the work is to implement a data science platform or a dashboard, that is easier to estimate. It is still software work, and we all know software projects are hard because they are complicated and there are unknown unknowns, but data science and machine learning projects are harder still: you get to one point and then you have an idea, ‘maybe we need to consider another feature,’ or ‘maybe we can get a new data source.’ It is very important to be able to redefine the scope of the project, and to do it as fast as possible. You can get there with Scrum. I believe there are other ways too, but for us Scrum has proved very successful.
Iterations are also important because they give you a rhythm of work and let you validate hypotheses. Validating hypotheses is a good way of thinking about this.
So it is good to keep iterating. Regarding accuracy, I have already mentioned data governance. Code review is another thing we took from software engineering: data scientists read each other’s code to make sure there are no logical mistakes in the model, so you improve quality with a double check. And reproducible research I have already described.
This is an example from Domino Data Lab, a startup from San Francisco. It is just one example of a data science platform where you get a kind of continuous integration for machine learning projects: you get statistical results for the model, and everything is kept in Docker containers so you can rerun the model after a week or a month. There are others as well.
One last thing I want to mention is dashboards. I think they are fairly important, as they help with regulations, as we have seen in a previous presentation. This is just one example of a dashboard we have built, for reducing churn. In the middle, you get predictions for what to offer your client. You can do that during a phone call, and through that you can reduce churn and upsell at the same time.
Three last points to keep in mind and to sum up. Fixed scope and fixed price: these are things that might stop your progress and are pretty dangerous. Getting off track: losing sight of your current accuracy or your current data quality. And data quality remains the most important lever for improving your models.