2018-11-05
*post*
================================================================================

Asynchronous Python at Kumparan

================================================================================

0 CONTENTS

*post-contents*

*post-intro*

1 INTRO

This is a transcript of my lightning talk at PyCon 2018. The alternative, clickbait-y title: “How Kumparan Handle More Than 10 Million Tracking Events on Daily Basis?”.

*post-overview*

2 OVERVIEW

Kumparan is an Indonesia-based news platform. Several systems running on our data platform are built on top of Python asyncio, so in this 5-minute talk we would like to share our experience.

*post-async-in-python*

3 ASYNC IN PYTHON

Let’s review some basic concepts of asynchronous programming in Python. There are a bunch of libraries that you can use for this. In Python 3.5 or later, you can use the asyncio package, which is available in the Python standard library.

import asyncio

In order to do asynchronous programming with Python asyncio, you need to understand these 3 basic concepts: Coroutine, Task, and Event loop.

The first one is Coroutine. A coroutine is a function that has multiple entry points for suspending and resuming its execution. The second one is Task. Task is a class that we can use to schedule coroutines to run concurrently. The last one is Event loop. The event loop is responsible for scheduling (suspending or resuming) one or more coroutines that run concurrently.

*post-event-loop*

4 EVENT LOOP

Let us start with the event loop. The event loop is the core of every asyncio application. You can create a new event loop by calling this function:

# Create new event loop
loop = asyncio.new_event_loop()

or you can get the existing event loop by using this function:

# Get the current event loop
loop = asyncio.get_event_loop()

*post-coroutine*

5 COROUTINE

A coroutine is just a function. You can define a new coroutine using the async def keywords followed by the coroutine name, the inputs, and the output type.

async def process(input: str) -> str:
    ...

You may notice that we use type annotations. At Kumparan, we also use mypy as our static type checker.

Inside a coroutine, you should not call a function that blocks the main thread, because it will stall the event loop.

import time

async def process(input: str) -> str:
    # Don't do this: time.sleep() blocks the whole event loop
    time.sleep(5)
    return "Processed: {}".format(input)
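
When a blocking call cannot be avoided, one common workaround is to hand it to a thread pool with loop.run_in_executor, so the event loop stays free for other coroutines. The sketch below is illustrative only: blocking_io is a hypothetical stand-in for any blocking function, and the short sleep just keeps the example quick.

```python
import asyncio
import time


def blocking_io(seconds: float) -> str:
    # Hypothetical blocking function (stands in for a sync driver, file I/O, ...)
    time.sleep(seconds)
    return "done"


async def process(input: str) -> str:
    loop = asyncio.get_event_loop()
    # None selects the loop's default thread pool; the blocking call runs in
    # a worker thread while the event loop keeps serving other coroutines
    result = await loop.run_in_executor(None, blocking_io, 0.1)
    return "Processed: {} ({})".format(input, result)


loop = asyncio.new_event_loop()
try:
    output = loop.run_until_complete(process("test"))
finally:
    loop.close()

print(output)
```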

You can call another coroutine using the await keyword.

async def process(input: str) -> str:
    await asyncio.sleep(5)
    return "Processed: {}".format(input)

A line with the await keyword is an example of an entry point where the execution of the coroutine can be paused and resumed.
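
To make the pause-and-resume behaviour visible, here is a small sketch in which two coroutines record when they start and end. The names worker, main, and events are made up for this illustration. Each worker gives control back to the event loop at its await line, which lets the other worker run in between.

```python
import asyncio
from typing import List


async def worker(name: str, log: List[str]) -> None:
    log.append("{} start".format(name))
    # Suspension point: control returns to the event loop here,
    # so the other worker can run before this one resumes
    await asyncio.sleep(0)
    log.append("{} end".format(name))


async def main() -> List[str]:
    log = []  # type: List[str]
    # gather() wraps both coroutines in tasks and waits for them
    await asyncio.gather(worker("a", log), worker("b", log))
    return log


loop = asyncio.new_event_loop()
try:
    events = loop.run_until_complete(main())
finally:
    loop.close()

print(events)
```

Both workers start before either of them ends, so events comes out as ['a start', 'b start', 'a end', 'b end'].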

You cannot run a coroutine just by calling it directly. You need to create or get the event loop first, then run the coroutine inside the event loop.

loop.run_until_complete(process("test"))
loop.close()
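
As a small sketch of what "cannot be called directly" means in practice: calling process(...) only builds a coroutine object, and its body runs once an event loop drives it. The sleep is shortened here to keep the example quick.

```python
import asyncio


async def process(input: str) -> str:
    await asyncio.sleep(0.1)
    return "Processed: {}".format(input)


# Calling the coroutine function does not run its body;
# it only builds a coroutine object
coro = process("test")
is_coroutine = asyncio.iscoroutine(coro)

# The body runs only once an event loop drives the coroutine
loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(coro)
finally:
    loop.close()

print(is_coroutine, result)
```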

*post-task*

6 TASK

The last basic concept is Task. Task is a Python class that can be used to schedule one or more coroutines to run concurrently.

def callback(future: asyncio.Future) -> None:
    processed = future.result()
    print("{} is here".format(processed))

async def main() -> None:
    # asyncio.create_task requires Python 3.7 or later
    task1 = asyncio.create_task(process("input 1"))
    task2 = asyncio.create_task(process("input 2"))
    task1.add_done_callback(callback)
    task2.add_done_callback(callback)
    # Wait for both tasks so that their callbacks get a chance to run
    await asyncio.gather(task1, task2)

loop.run_until_complete(main())

The nice thing about a task is that you can attach a callback function. This callback will be executed when the coroutine is finished. This comes in handy when we need to handle errors.
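
Here is a minimal sketch of error handling inside a done callback. It is an illustration, not our production code: the empty-input failure and the errors list are made up for the example. The callback asks the finished task for its exception before touching the result.

```python
import asyncio
from typing import List

errors = []  # type: List[str]


async def process(input: str) -> str:
    # Hypothetical failure, used only to trigger the error path
    if not input:
        raise ValueError("empty input")
    return "Processed: {}".format(input)


def callback(future: asyncio.Future) -> None:
    # exception() returns None when the task finished successfully
    error = future.exception()
    if error is not None:
        errors.append("failed: {}".format(error))
    else:
        print("{} is here".format(future.result()))


async def main() -> None:
    # asyncio.create_task requires Python 3.7 or later
    ok = asyncio.create_task(process("input 1"))
    bad = asyncio.create_task(process(""))
    for task in (ok, bad):
        task.add_done_callback(callback)
    # return_exceptions=True keeps gather() from re-raising the ValueError
    await asyncio.gather(ok, bad, return_exceptions=True)


loop = asyncio.new_event_loop()
try:
    loop.run_until_complete(main())
finally:
    loop.close()

print(errors)
```

Because the callback retrieves the exception itself, the failure is recorded in one place and the coroutine that scheduled the task does not need its own try/except.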

*post-asyncio-at-kumparan*

7 ASYNCIO AT KUMPARAN

So how do we use Python asyncio at Kumparan? asyncio is a perfect fit for high-performance web servers, database connection libraries, distributed task queues, etc.

Here is a list of services that we built on top of Python asyncio:

  1. Tracker API
  2. Tracker Transporter
  3. A/B Test Splitter API
  4. Trending Stories API
  5. Personalized Feed API
  6. And more …

Most of them are API servers.

*post-use-case*

8 USE CASE

I will show you an example of how we build our services on top of Python asyncio. The use case is a receiver for tracking events.

Our goal is to collect as many tracking events as possible, so we implement a Fire-and-Forget approach on top of asyncio in order to reduce the response time.

The implementation is very simple and easy to reason about.

# NOTE: Simplified version
async def track(request: Request) -> Response:
    # ...
    try:
        # Fire
        task = tracker.collect(event)
        # ...
        task.add_done_callback(callback)
        # and Forget (Return the response immediately)
        return api.success()
    except ValueError as e:
        return api.error(status_code=400, error=str(e))
    except Exception as e:
        error = "/v1/track failed"
        return api.error(error=error, exception=e)

First, we define a coroutine called track. This coroutine is executed on every HTTP request to the tracker endpoint. Inside this coroutine, we schedule another coroutine, wrapped in an asyncio.Task, to run concurrently. Then we attach a callback function to handle errors, and that’s it. Easy, right?
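
The same pattern can be sketched without a web framework. Everything named here (collect, on_done, handle_request, stored, pending) is hypothetical and only stands in for the real tracker code, which is not shown in this post. The point is that the handler returns before the event has been stored.

```python
import asyncio
from typing import Dict, List, Set

stored = []  # type: List[Dict[str, str]]
pending = set()  # type: Set[asyncio.Future]


async def collect(event: Dict[str, str]) -> None:
    # Hypothetical slow step, e.g. writing the event to a queue or database
    await asyncio.sleep(0.05)
    stored.append(event)


def on_done(task: asyncio.Future) -> None:
    pending.discard(task)
    # Runs after the response was already sent, so errors can only be logged
    if not task.cancelled() and task.exception() is not None:
        print("collect failed: {}".format(task.exception()))


async def handle_request(event: Dict[str, str]) -> str:
    # Fire: schedule the slow work without awaiting it
    task = asyncio.ensure_future(collect(event))
    pending.add(task)  # keep a reference so the task is not garbage collected
    task.add_done_callback(on_done)
    # ... and forget: respond before the event has been stored
    return "ok"


async def main() -> int:
    response = await handle_request({"name": "page_view"})
    assert response == "ok"
    stored_at_response = len(stored)  # collect() has not finished yet
    # Only for this demo: wait for background work before the loop stops
    await asyncio.gather(*pending)
    return stored_at_response


loop = asyncio.new_event_loop()
try:
    stored_at_response = loop.run_until_complete(main())
finally:
    loop.close()

print(stored_at_response, stored)
```

The trade-off of this design is that the client gets a success response even if collect fails later, so the callback is the only place where such failures can be noticed.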

With this approach, we are able to achieve response times of less than 50 ms.

As you can see, most of them are less than 5 ms, and we are able to collect more than 10,000,000 tracking events on a daily basis.

As you may notice, on October 29 we were able to collect more than 72 million tracking events. This happened because there was breaking news about the Lion Air crash, and with our Fire-and-Forget approach we could handle this traffic spike without a problem.

*post-lessons-learned*

9 LESSONS LEARNED

  1. Async libraries from the community are not mature yet; sometimes you need to implement things yourself.
  2. It’s easy to make a mistake by calling a blocking function. There is no tool that helps developers spot this mistake.
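
On the second point, asyncio's debug mode does not prevent the mistake, but it can at least report it at runtime: it logs a warning whenever a single step holds the event loop longer than loop.slow_callback_duration (0.1 seconds by default). A minimal sketch, where handler is a hypothetical coroutine that blocks on purpose and ListHandler is a small helper written for this example:

```python
import asyncio
import logging
import time
from typing import List


class ListHandler(logging.Handler):
    """Collects log messages in memory so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []  # type: List[str]

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


async def handler() -> None:
    # Mistake on purpose: blocks the event loop for 0.2 seconds
    time.sleep(0.2)


log_handler = ListHandler()
logging.getLogger("asyncio").addHandler(log_handler)

loop = asyncio.new_event_loop()
# Debug mode can also be enabled with the PYTHONASYNCIODEBUG=1 environment variable
loop.set_debug(True)
try:
    loop.run_until_complete(handler())
finally:
    loop.close()

# Slow steps are reported as "Executing ... took ... seconds"
slow = [m for m in log_handler.messages if "took" in m]
print(slow)
```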

And thanks everyone!

================================================================================

TAGS

*post-tags*

================================================================================

LINKS

*post-links*