How to rank your data … fast

Introduction

At every stage of a project, it is useful to take advantage of methods that help simplify and speed up our work. Some of those methods are not so obvious.

In this couple of pages, I will show my thought process of solving a problem of ranking / ordering a data set using math to your advantage and not being scared of it. This came up to be a pretty fast solution. What numbers are we talking about? Let’s say that before implementing this solution a data set of 60 000 rows was ranking for about 1.5 hour, but after I implemented this solution it was reduced to just 1.5 minutes. Let’s dive into this.

The problem

Let’s imagine that there is a table inside a database called Car. This table contains information about specific cars and their parameters. Users want to see the rank of a car. Best case scenario would be if those ranks are points ranging from 0 to 100. The bigger the number, the better the car. Let’s assume that for calculating the rank for a car app takes a couple of parameters under consideration. I specified those parameters below.

Parameter	Description	Type
Fuel consumption (MPG)	Describes how many miles a car can travel burning one gallon of fuel.	Float
Engine type	Type of engine e.g.: petrol, diesel, hybrid or electric.	Integer
Driving performance	A float number representing a car’s ability to accelerate, it’s top speed etc.	Float

Keep in mind that I will only include three parameters for presentation purposes, but feel free to take as much as you want.

So how can we get those random numbers and calculate a ranking for a car that will be between 0 and 100? It can be tempting to calculate each car individually, but what when there are 100 000 rows or 1 million rows? It gets slower dramatically. Well, maybe I will use update rows in bulk instead of saving them individually to the database. Yes that’s an improvement, but what about memory usage? It’s still not perfect. The actual solution for this problem is to move all the calculations to the database itself. That’s it. One query and you are done. Don’t believe me? Just watch.

Math for the win

I have bad news I forgot to tell. After implementing this solution, you won’t be able to use this meme again.

Yeah, sorry :(. Anyway, back on the track. What is the root of the problem? I want to generate a number in some arbitrary range based on specified parameters. What can I use to achieve this? For me, this is a perfect job for a math function. There are many math functions that would achieve this. ReLU is one of them, but we will use a different one since it’s available in virtually every database engine. We’ll use arctangent (atan).

Let’s see how this function looks.

This is almost perfect. We can just paste parameters instead of X, and we will have a value. Let’s tweak this a bit to be a better fit for our solution.

Why is this perfect? This function is top limited, meaning that no matter the X value, Y value will never be bigger than the limit. I mean, it can be equal to the limit but if X would equal to infinity, so it’s not a problem. In the image above, you can see the overall idea of it. This function for every X will generate a unique Y with a value in the range 0 ≤ y < 100.

Let’s take a look at an equation.

Parameters:

A – Divider. The bigger the number, the slower Y values will raise.
T – Top limit. This is a maximum Y value.
B – Bottom limit. This is the lowest value of Y.
X – Sum of the parameters, e.g. Fuel consumption, engine type etc.

It may seem that from simple atan this function suddenly got complex. Don’t worry, I will explain.
The first part of the equation is responsible for limiting the function to a desired value. In this case, it’s 100.

Then there is atan. Square root helps to eliminate negative Y values that were present on the first image that showed raw atan function.
Third and last part is just adding a value to the Y. This can be used if for example I would like to generate numbers in the 100 to 200 range or other ranges. B would be 100 in such a case.

That’s it. Now let’s talk about implementation.

Example of implementation

I will be using Django to show the structure of a table. The query is available in Django ORM and also in raw SQL. Thanks to raw SQL it’s a more generic approach that can be translated to virtually any framework, so the solution here is not limited only to Django.

The model:

from django.db import models

class Car(models.Model):
ENGINGE_TYPES = (
(50, “Petrol”),
(100, “Diesel”),
(150, “Hybrid”),
(200, “Electric”),
)

fuel_consumption = models.FloatField()
engine_type = models.IntegerField(choices=ENGINGE_TYPES)
driving_performance = models.FloatField()
ranking = models.FloatField(null=True, blank=True, default=None)

Just for example, I specified random numbers that represent the number of points based on the engine type. The rest of the fields should be clear to understand.

The ORM query:

from math import pi

from django.db.models import F
from django.db.models.functions import ATan, Sqrt

from .models import Car

MAX_VALUE = 100
MIN_VALUE = 0
A = 20

TOP_LIMIT = (MAX_VALUE-MIN_VALUE) / (pi/2)
X = (
F(“fuel_consumption”) +
F(“engine_type”) +
F(“driving_performance”)
) / A

Car.objects.update(
ranking=(TOP_LIMIT * ATan(Sqrt(X)) + MIN_VALUE)
)

It’s pretty straight forward. I made a couple of helper variables to make code cleaner, but the update query is very simple and of course fast. Update is not required at this point. Everything depends on the implementation. This can be easily translated to create a new, helper column using annotation. It only shows that it can be done in one SQL query and minimal python processing.

The SQL query:

UPDATE [car]
SET [ranking] = (
(
((100 – 0)/(3.14/2)) * ATAN(
SQRT(
((
[test_car].[fuel_consumption] +
[test_car].[engine_type] +
[test_car].[driving_performance]
) / 20)
)
)
) + 0
)

In this example I used MS SQL Server. The raw SQL, the same as an ORM query, is using only built-in functions and is relatively short and easy to understand and as long as you understand the solution. PI of course can be specified more accurately.

That’s it. A couple of lines of code and a problem that can be easily badly optimized is solved.

Final thoughts

General group of such functions is called activation functions and holds many specific functions like ReLU and others. I already explained why I used atan instead of a function that would be better suited for this purpose. I understand that this function doesn’t grow linearly. That’s one disadvantage of using this specific function. There is always an option to use a different one, but I didn’t find, at the time that I came up with this, a function that would look pretty much the same and would be implemented in SQL.

The improvement in time is quite drastic. I mentioned going from 1.5 hours to 1.5 minutes. Previous solution for this problem was processing each row individually and doing many unnecessary comparisons. Also, it was executing one db update query each row, so each update had to initialize the connection to the DB and python processing in between those queries also took some time. Executing one query that does this all at once of course will use more resources on SQL server, but it will be much faster.

This solution can be easily linked with other math operations for data like linear transformation that would help to fit some arbitrary number range into 0 to 100 range and also would keep the ratio between numbers and many others. Math and other domains already solved a lot of problems that cause a struggle on a daily basis in this business, so why not use them? The only thing to do is to find and learn about them.

Data and AI

Data and AI

How to rank your data … fast

Introduction

The problem

Math for the win

Example of implementation

Final thoughts

USA, Durham

Germany, Berlin

Poland, Gdynia

Sweden, Stockholm