Monday 3 September 2018

Transferring data from mysql to DynamoDB

I want to transfer data(21M rows) from mysql database to DynamoDB. I am using boto Python API and django 1.3.1 to export data from mysql and transfer it to DynamoDB. Below is the code:

      conn = boto.connect_dynamodb()
      start_date = datetime.date(2012, 3, 1)
      end_date = datetime.date(2012, 3, 31)
      episode_report = TableName.objects.filter(viewdt__range=(start_date, end_date))
      #Paginate 21 million rows in chunks of 1000 each
      p = Paginator(episode_report, 1000)
      table = conn.get_table('ep_march')
      for page in range(1, p.num_pages + 1):
          for items in p.page(page).object_list:
              item_data = {
                        'id': int(items.id),
                        'user_id': format_user(items.user),     #Foreign Key to User table
                        'episode_id': int(items.episode.id),          #Foreign Key to Episode table
                        'series_id': int(items.series.id),      #Foreign Key to Series Table
                        'viewdt': str(items.viewdt),
                     }
              item = table.new_item(
                                    # Our hash key is 'id'
                                    hash_key= int(items.id),
                                    # Our range key is 'viewdt'
                                    range_key= str(items.viewdt),
                                    # This has the
                                    attrs=item_data
                                )
              item.put()

The issue is that the process has been running for more than 12 hours and has still transferred 3M rows. Any ideas to speed up the process?
I would create more threads and parellize the transfer and see if that helps.
Thanks.

First, what is the provisioned throughput of your DynamoDB table? That will ultimately control how many writes/second you can make. Adjust accordingly.
Second, get some sort of concurrency going. You could use threads (make sure each thread has it's own connection object because httplib.py is not threadsafe) or you could use gevent or multiprocess or whatever mechanism you like but concurrency is key.

0 comments:

Post a Comment