HOW DATABASE PERFORMANCE ISSUES ALMOST MADE ME LOSE MY SANITY
So the folks at HNG 11 are making me write about an interesting issue I faced as a backend developer, so I chose this one - the time Postgres almost took my sanity, lol. It eventually led to a lot of research and self-learning on database architectures, database modularity, and just generally dealing with databases that slow down under load.
Well, now I’m going to show you how I fixed it and got my peace of mind back, lol.
INTRO TO THE PROBLEM
I helped build the backend (with Flask and Postgres) of a web application, and after a week or so I was done with it. Or so I thought. The issue was: many users were trying to use the web app at the same time.
Boom, performance issues and timeouts out of nowhere.
“Yo, backend guy, you’re up” was the text in my DMs.
So, going through this carefully, I was faced with a couple of issues. I needed to find a way to handle this high traffic efficiently, and fast.
WHAT I DID
So what did I do? Well, since the issues were arising from the database, I had to find out which queries were being run regularly and which ones were essentially bottlenecks clogging up the entire process. That’s when I learnt (by force) about APM (Application Performance Monitoring) tools.
To pinpoint those troublesome queries, I ran:
SELECT pid, usename, application_name, client_addr, query FROM pg_stat_activity WHERE state = 'active';
The output of this helped me understand which users were running which queries. And to be perfectly honest, it all looked perfectly fine to me, so I decided to divide the users and route their queries to a distributed database system.
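Pro tip I picked up along the way: you can also catch slow queries from the application side with SQLAlchemy’s event hooks. Here’s a minimal sketch (the 0.5s threshold is just an assumption for illustration, tune it to your own app):

import time
from sqlalchemy import event
from sqlalchemy.engine import Engine

SLOW_QUERY_THRESHOLD = 0.5  # seconds; an arbitrary cutoff for this sketch

@event.listens_for(Engine, 'before_cursor_execute')
def start_timer(conn, cursor, statement, parameters, context, executemany):
    # stash the start time on the connection before the query runs
    conn.info.setdefault('query_start_time', []).append(time.time())

@event.listens_for(Engine, 'after_cursor_execute')
def log_slow_queries(conn, cursor, statement, parameters, context, executemany):
    # measure elapsed time and flag anything over the threshold
    elapsed = time.time() - conn.info['query_start_time'].pop(-1)
    if elapsed > SLOW_QUERY_THRESHOLD:
        print(f'SLOW ({elapsed:.2f}s): {statement}')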
Next, I had to use something called sharding to help with modularity, scalability and all those other big words that essentially mean division of labour + reusability. Think of sharding as a way of dividing the database into smaller pieces, i.e. the shards. I eventually settled on using the userID as the sharding key to “shard” horizontally and distribute the load across the shards.
In code, it looked like this:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# one engine per shard (connection strings shortened here)
shards = {
    'shard_1': create_engine('postgresql://mydblinkforshard1'),
    'shard_2': create_engine('postgresql://mydblinkforshard2'),
}

def get_shard(user_id):
    # even userIDs go to shard_1, odd ones to shard_2
    if user_id % 2 == 0:
        return 'shard_1'
    else:
        return 'shard_2'

def get_session(user_id):
    shard = get_shard(user_id)
    Session = sessionmaker(bind=shards[shard])
    return Session()
Yeah, you guessed it. I just used one shard for users with even userIDs and the other shard for users with odd IDs (insert evil genius laughter).
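For context, here’s a hypothetical Flask route using that helper. The endpoint path and the name column are assumptions for illustration, not the app’s actual code:

from flask import Flask, jsonify
from sqlalchemy import text

app = Flask(__name__)

@app.route('/users/<int:user_id>')
def get_user(user_id):
    # every query for this user hits the shard that owns their row
    session = get_session(user_id)
    try:
        row = session.execute(
            text('SELECT id, name FROM users WHERE id = :id'),
            {'id': user_id},
        ).first()
        if row is None:
            return 'Not found', 404
        return jsonify({'id': row.id, 'name': row.name})
    finally:
        session.close()

The nice part of routing inside one helper is that the rest of the app never has to know the shards exist.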
Next, what was left to do was to carefully migrate the data from the old database architecture (based on the userIDs) into the new shard-based architecture. Emphasis on carefully (data loss will mess you up, but no panic).
from sqlalchemy import MetaData, Table

def migrate_data():
    metadata = MetaData()
    # 'engine' here points at the old, single database
    users_table = Table('users', metadata, autoload_with=engine)
    with engine.connect() as connection:
        for row in connection.execute(users_table.select()):
            # pick the destination shard from the user's ID
            shard = get_shard(row.id)
            with shards[shard].connect() as shard_connection:
                shard_connection.execute(
                    users_table.insert().values(**row._mapping)
                )
                shard_connection.commit()

migrate_data()
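One thing I’d recommend after a migration like this is a quick sanity check that no rows went missing. A minimal sketch, reusing the engine, shards and users_table from above:

from sqlalchemy import func, select

def verify_migration():
    # row count in the old database
    with engine.connect() as conn:
        original = conn.execute(
            select(func.count()).select_from(users_table)
        ).scalar()
    # total row count across all the shards
    migrated = 0
    for shard_engine in shards.values():
        with shard_engine.connect() as conn:
            migrated += conn.execute(
                select(func.count()).select_from(users_table)
            ).scalar()
    assert original == migrated, f'row mismatch: {original} vs {migrated}'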
Now that the heavy lifting was over, what was left was to implement a load balancing layer with HAProxy to distribute incoming traffic and reroute failed queries. After installation and setup, the config looked something like this:
frontend ft_db
    bind *:5432
    mode tcp
    default_backend bk_db

backend bk_db
    mode tcp
    balance roundrobin
    server shard1 shard1_host:5432 check
    server shard2 shard2_host:5432 check
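The check keyword at the end of each server line is what tells HAProxy to health-check the shards, so a shard that stops responding gets pulled out of rotation automatically instead of eating requests.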
Lastly, I had to put something in place to make sure the issues don’t come up again. So after some research, I eventually found out about Prometheus and Grafana dashboards to keep track of the databases’ performance. The YAML:
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
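That Service just exposes Prometheus itself; to actually pull Postgres metrics, something like the scrape config below goes into prometheus.yml. This is a sketch assuming a postgres_exporter instance running alongside each shard (the host names are placeholders):

scrape_configs:
  - job_name: 'postgres_shards'
    static_configs:
      - targets:
          - 'shard1_host:9187'  # postgres_exporter's default port
          - 'shard2_host:9187'

From there, Grafana just reads off Prometheus and the dashboards do the worrying for you.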
CONCLUSION
Yeah. So, HNG made me write about this and, in hindsight, it wasn’t so bad and I should probably do this more often.
PS: you should probably consider joining.
Ciao!