Originally from the User Slack
@Stewart: Is there a good way to do a live migration from dynamodb to scylladb?
@Felipe_Cardeneti_Mendes: Alternator or CQL? For the former, you should be able to use our migrator if your source has a fixed schema.
For the latter, you’ll probably want something along the lines of https://github.com/fee-mendes/scylladb-alternator-lambdastreams/ and replace the Alternator logic with CQL. We are in the process of publishing an article discussing these.
@Stewart: we are trying to migrate with CQL.
@Felipe_Cardeneti_Mendes doesn’t the Scylla Spark Migrator support DDB to CQL?
@Felipe_Cardeneti_Mendes: it doesn’t (yet), unfortunately
@Stewart: so sad
When do you think this feature will be released?
@Felipe_Cardeneti_Mendes: I don’t think I have a good estimate. The Migrator for DynamoDB today relies on a deprecated and no-longer-supported library, which we should likely replace first; only after that do we plan to extend its functionality.
@Stewart: If so, what is the best way to live migrate DDB -> ScyllaDB, as currently recommended by ScyllaDB?
@Felipe_Cardeneti_Mendes: Export a DDB S3 backup, load it into ScyllaDB, then capture and replay events from DDB Streams via Lambda, Kinesis, or whatever method you prefer.
It’s pretty much write-heavy CQL code, with at most a few deletes once you get to the streams replay part.
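In rough terms, the replay piece could look like this minimal sketch (keyspace, table, and column names are hypothetical, and it assumes the cassandra-driver package is bundled with the Lambda):
```python
# Minimal sketch of the DDB Streams -> ScyllaDB replay Lambda.
# The keyspace/table/columns ("migration.items(id, payload)") are hypothetical,
# and the cassandra-driver package is assumed to be bundled with the function.
import json
from boto3.dynamodb.types import TypeDeserializer
from cassandra.cluster import Cluster

deserializer = TypeDeserializer()
session = Cluster(["scylla-node-1"]).connect("migration")
insert_stmt = session.prepare("INSERT INTO items (id, payload) VALUES (?, ?)")
delete_stmt = session.prepare("DELETE FROM items WHERE id = ?")

def handler(event, context):
    for record in event["Records"]:
        # Stream records carry DynamoDB-JSON; deserialize to plain Python values.
        keys = {k: deserializer.deserialize(v)
                for k, v in record["dynamodb"]["Keys"].items()}
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = {k: deserializer.deserialize(v)
                     for k, v in record["dynamodb"]["NewImage"].items()}
            # For brevity the whole image lands in one text column.
            session.execute(insert_stmt, (keys["id"], json.dumps(image, default=str)))
        elif record["eventName"] == "REMOVE":
            session.execute(delete_stmt, (keys["id"],))
```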
@Stewart: Is there a blog post or article that explains the process well? I am currently working on moving all of our DDB tables to ScyllaDB and would like to do a POC.
@Felipe_Cardeneti_Mendes: the GH repo above contains most of it, but it starts with a DDB->Alternator migration. Changing to CQL would require updating s3Restore.py
and the lambda function to use CQL instead, but the idea is pretty much there.
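A CQL take on the restore step could look roughly like this sketch (bucket name and target schema are hypothetical):
```python
# Sketch of a CQL s3Restore: read a DYNAMODB_JSON export (gzipped JSON lines,
# each shaped like {"Item": {...}}) and write it with CQL. The bucket name and
# target schema ("migration.items") are hypothetical.
import gzip
import json
import boto3
from boto3.dynamodb.types import TypeDeserializer
from cassandra.cluster import Cluster

s3 = boto3.client("s3")
deserializer = TypeDeserializer()
session = Cluster(["scylla-node-1"]).connect("migration")
insert_stmt = session.prepare("INSERT INTO items (id, payload) VALUES (?, ?)")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-export-bucket", Prefix="AWSDynamoDB/"):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".json.gz"):
            continue  # skip manifests; data files live under .../data/
        body = s3.get_object(Bucket="my-export-bucket", Key=obj["Key"])["Body"]
        for line in gzip.decompress(body.read()).decode("utf-8").splitlines():
            item = {k: deserializer.deserialize(v)
                    for k, v in json.loads(line)["Item"].items()}
            session.execute(insert_stmt, (item["id"], json.dumps(item, default=str)))
```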
and yes, there’s an article in the works as I mentioned last week. Probably should go out next week or so
and this is how you export an S3 backup: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport_Requesting.html
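with boto3 it’s just a few lines, something like this (table ARN and bucket are placeholders; note PITR must be enabled on the table):
```python
# Request a full table export to S3. Point-in-time recovery must be enabled
# on the source table; the ARN and bucket below are placeholders.
import boto3

dynamodb = boto3.client("dynamodb")
resp = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
    S3Bucket="my-export-bucket",
    ExportFormat="DYNAMODB_JSON",
)
print(resp["ExportDescription"]["ExportArn"])  # poll this ARN until COMPLETED
```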
@Stewart: I haven’t worked with ddb -> scylladb migration yet, so I’m a little confused.
Can I check out the relevant documentation first and then ask some more questions?
@Felipe_Cardeneti_Mendes
If so, is it possible to use the Spark Migrator to live migrate the table using Alternator, and then read/write the created table with CQL?
@Felipe_Cardeneti_Mendes: Unfortunately not; the non-key fields within Alternator are stored in a blob map, which is unintelligible to CQL clients.
I will share the docs we are currently working on with you today - ping me if I somehow forget lol
@Stewart: ok thanks !
@Felipe_Cardeneti_Mendes any updates?
@Felipe_Cardeneti_Mendes: I sent you an email yesterday to your Slack-registered address
@Stewart: thanks!!
@Felipe_Cardeneti_Mendes
I found a very interesting way to do a dynamodb -> scylladb live migration. If it works, I’d be happy to share it with you. It’s currently in PoC.
@Felipe_Cardeneti_Mendes: certainly
@Stewart: I just successfully implemented the code and PoC for the DynamoDB live migration, and wanted to share it briefly.
First of all, the architecture is as follows:
- Use kafka-connect-dynamodb CDC to stream DynamoDB’s change events to Kafka. I also changed some of its code for some config options.
- Referring to the Debezium MongoDB extractor, implement an SMT directly so that the messages on the DynamoDB CDC topic can be used by the ScyllaDB sink. Reference documentation: https://debezium.io/documentation/reference/1.9/transformations/mongodb-event-flattening.html
- Provision the ScyllaDB sink connector and, using the SMT I just created, transform the messages and sink the data to ScyllaDB.
The flow is:
dynamodb
-> dynamodb cdc source connector
-> kafka
-> scylladb sink connector (with ddb cdc transform)
-> scylladb
With this flow, data is loaded from DDB to ScyllaDB in real time.
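Provisioning boils down to posting the two connector configs to the Kafka Connect REST API, roughly like this sketch (class and property names are from my memory of each project’s README, and the SMT class name is my own placeholder - verify everything before use):
```python
# Sketch: register the source and sink connectors against the Connect REST API.
# Connector classes and config keys are from memory of each project's README;
# the SMT class ("com.example.DynamoDbCdcTransform") is a hypothetical name.
import requests

CONNECT_URL = "http://localhost:8083/connectors"

source = {
    "name": "dynamodb-source",
    "config": {
        "connector.class": "com.trustpilot.connector.dynamodb.DynamoDBSourceConnector",
        "aws.region": "us-east-1",
        "dynamodb.table.whitelist": "my-table",  # placeholder table name
        "init.sync.skip": "false",               # keep the initial snapshot
    },
}

sink = {
    "name": "scylladb-sink",
    "config": {
        "connector.class": "io.connect.scylladb.ScyllaDbSinkConnector",
        "topics": "my-table",
        "scylladb.contact.points": "scylla-node-1",
        "scylladb.keyspace": "migration",
        "transforms": "ddbcdc",
        "transforms.ddbcdc.type": "com.example.DynamoDbCdcTransform",  # hypothetical
    },
}

for cfg in (source, sink):
    requests.post(CONNECT_URL, json=cfg).raise_for_status()
```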
If you are curious about my implementation, I can schedule a meeting with you.
@Felipe_Cardeneti_Mendes
GitHub: https://github.com/fetch-rewards/kafka-connect-dynamodb - A Kafka Connect Source Connector for DynamoDB
@Felipe_Cardeneti_Mendes: Oh, that’s a great idea, which effectively solves the DDB->CQL problem. How have you accomplished the bulk loading part? This is typically a requirement for a full migration (i.e., export from DDB to ScyllaDB first, and only after that carry out your steps for syncing changes).
@Stewart: if you look at that repo, they support an initial snapshot for the connector:
initial sync - automatically detects and if needed performs initial(existing) data replication before tracking changes from the DynamoDB table stream
Synced(Source) DynamoDB table unit capacity must be large enough to ensure INIT_SYNC to be finished in around 16 hours. Otherwise there is a risk INIT_SYNC being restarted just as soon as it’s finished because DynamoDB Streams store change events only for 24 hours.
INIT_SYNC can be skipped with init.sync.skip=true configuration
@Felipe_Cardeneti_Mendes: Indeed. Super cool
This is a great find actually! I should probably play with it, but are you interested in sharing your walkthrough and story with the community? We can definitely meet
@Stewart: Maybe I could write a blog post about it later. Currently I just finished testing with a sample table, and I need to do a real migration with it.
@Felipe_Cardeneti_Mendes: Sounds like a plan! Let us know how it goes
@Stewart: I’ll upload the sample code to github after testing the migration, but if you want to test it before then, I can send you the source code.
I ran a test yesterday based on a real table, and it worked fine.
However, I realized that there might be a problem with struct and list types in DynamoDB.
For nested data types like struct-in-struct and list-in-struct in DynamoDB, the ScyllaDB sink connector could not create the table properly.
In this case, I think I have to develop an application myself that consumes the DynamoDB CDC topic and writes to ScyllaDB, or transform the CDC topic with ksqlDB into a format that can be used in ScyllaDB. But I wonder how DynamoDB Alternator handles this case.
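Roughly, that fallback consumer would just serialize the nested parts into a JSON text column; a minimal sketch (topic, keyspace, and table names are hypothetical):
```python
# Sketch of the fallback: consume the CDC topic directly and store nested
# structs/lists as one JSON text column, since CQL collections can't hold
# mixed nesting. Topic, keyspace, and table names are hypothetical.
import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

session = Cluster(["scylla-node-1"]).connect("migration")
insert_stmt = session.prepare("INSERT INTO items (id, doc) VALUES (?, ?)")

consumer = KafkaConsumer(
    "my-table",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for msg in consumer:
    record = msg.value
    # Keep the key as a real column; dump everything else as a JSON document.
    session.execute(insert_stmt, (record["id"], json.dumps(record, default=str)))
```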
@Felipe_Cardeneti_Mendes: Well, it makes sense, as such a field could map to a UDT or a frozen/unfrozen collection; nested structs can be considered an edge case
(though unfortunately not that uncommon in DDB)
but for writing directly to DynamoDB Alternator it should be a no-brainer.
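e.g. you point the standard DynamoDB SDK at a ScyllaDB node and the same calls just work; a minimal sketch, assuming Alternator is enabled on its default port 8000 (node address and table name are placeholders):
```python
# Writing to Alternator is the standard DynamoDB API pointed at a ScyllaDB node.
# Alternator's default port is 8000; node address, table name, and credentials
# are placeholders (whether auth is enforced depends on the cluster config).
import boto3

alternator = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-node-1:8000",
    region_name="us-east-1",
    aws_access_key_id="none",
    aws_secret_access_key="none",
)
table = alternator.Table("my-table")
# Nested, mixed-type structures are accepted as-is; no CQL schema mapping needed.
table.put_item(Item={"id": "42", "nested": {"a": [1, "two", {"b": 3}]}})
```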
@Stewart: Surprisingly, there are quite a few cases where we use nested types.
In addition, I need to do a full scan of both DynamoDB and ScyllaDB to compare the data between them and see whether it was migrated properly, which is also not easy.
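What I have in mind is something like this naive pass (a sketch only - single-threaded, one direction, and the names are hypothetical):
```python
# Sketch: naive validation pass. Scan DynamoDB and verify each item exists on
# the ScyllaDB side. Single-threaded for clarity; a real run would use DynamoDB
# parallel scans. Table/keyspace/column names are hypothetical.
import boto3
from boto3.dynamodb.types import TypeDeserializer
from cassandra.cluster import Cluster

ddb = boto3.client("dynamodb")
deserializer = TypeDeserializer()
session = Cluster(["scylla-node-1"]).connect("migration")
select_stmt = session.prepare("SELECT doc FROM items WHERE id = ?")

missing = 0
for page in ddb.get_paginator("scan").paginate(TableName="my-table"):
    for raw in page["Items"]:
        item = {k: deserializer.deserialize(v) for k, v in raw.items()}
        if session.execute(select_stmt, (item["id"],)).one() is None:
            missing += 1  # present in DynamoDB, absent in ScyllaDB
print("missing rows:", missing)
```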
@Felipe_Cardeneti_Mendes: There’s also the even edgier case where a list/map may contain values of different types, whereas in CQL a list/map can only be of a specific type.
@Stewart: yes right
Actually, I’ve been running ScyllaDB for about 2-3 years, so I’m pretty familiar with it, but I’m not incredibly knowledgeable about DynamoDB.
I only found out yesterday that DynamoDB supports this kind of thing, because ScyllaDB obviously doesn’t support it. lol
@Felipe_Cardeneti_Mendes: Well, we DO support it (via Alternator). The way it is accomplished is that the payload gets serialized to a blob, and then we don’t need to care about its type, as long as the input is valid.
But CQL has this restriction, and switching protocols will inevitably require some app changes when using such types. And these are likely app-specific, which makes it almost impossible to automate.
@Stewart: So currently my Single Message Transform (SMT) supports boolean, integer, long, list, and map. I was going to support nested map and list types as well, but I stopped because I think this PR needs to be merged first to use nested types:
https://github.com/scylladb/kafka-connect-scylladb/pull/71
@Felipe_Cardeneti_Mendes: A cool project that would probably be worth spending some hours on for fun would be allowing a CQL driver to manipulate Alternator tables. This would allow shard awareness, prepared statements, and all the other goodies from the CQL protocol. But the penalty would be the serialization/deserialization being deferred to the client side. Whether it is worth it or not depends on the application.
@Stewart: In preparation for this migration, I also wanted to check what tables get created internally when using Alternator, but I found a cool way to migrate via CDC and didn’t end up checking it.
@Felipe_Cardeneti_Mendes: You may want to bump the PR with a comment; it has seen no progress for almost 2 years now.