Data… it’s such a small word, isn’t it?
And yet… it causes so many problems. Acquiring it, storing it, moving it, cleaning it… actually doing something with it!
With the right tool (hint… DoubleCloud), the right type of data, employed in the right way, can be invaluable.
So what’s the problem?
Well, actually, it’s not a problem… it’s problems, plural.
The one I’ll be focusing on today is (as you probably guessed from the title) transferring data.
You see, the first problem with data is how spread out it can get, meaning that to do anything with it, it has to be transferred somewhere.
Organizations collect data, their different departments or even individual employees collect data, it gets stored in different places, in different formats… it’s a mess!
So to ‘tidy’ it up it needs moving (or transferring).
How Does Transferring Data Work?
Sounds simple right?
You take some data from here, you put it there… Congratulations! You now have data!
Well let me tell you, if it was that simple I, and a lot of people like me, wouldn’t have jobs.
You see, the devil’s in the details…
First of all we need to define what data ‘is’.
What Is Data?
As per Wikipedia (as their definition will be a lot better than mine):
Data — is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted.
Ok, so maybe I can define it a little better. For me, data is something that stores information and has some kind of structure to it… like a spreadsheet in which every row and column contains a piece of information.
Moving it from one place to another requires two completely separate types of storage… one to initially hold it and one to move it to.
What data will we be moving though?
Well… there’s plenty to choose from!
It could be tables from databases, structured or semi-structured data pulled from an API, raw files, a streaming platform (like Apache Kafka®, for instance) or even completely uncategorized data that looks useful and has some kind of structure to it.
Whatever the type of data, however, before we start to move it we need to define the structure we want once it’s moved, work out how much data we’ll be moving, and start to read it so it can be written elsewhere.
Now for tables that process isn’t difficult, as almost any kind of table will have a schema attached to it. That means to transfer a table from one database to another we just need to be able to define its schema, read the data and then write it into our new, target database… simples!
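As a minimal sketch of that schema-read-write loop, here’s what it might look like with Python’s built-in sqlite3 module. The file paths and table name are hypothetical, and a real transfer tool would handle type mapping and batching, but the three steps are the same: grab the schema, read the rows, write them to the target.

```python
import sqlite3

def copy_table(source_path: str, target_path: str, table: str) -> int:
    """Copy one table (schema + rows) from a source SQLite database to a
    target one; returns the number of rows transferred."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(target_path)
    try:
        # Step 1: read the table's schema (its CREATE TABLE statement).
        row = src.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()
        if row is None:
            raise ValueError(f"table {table!r} not found in source")

        # Step 2: recreate the table in the target database.
        dst.execute(row[0])

        # Step 3: read every row from the source and write it to the target.
        # (The table name is interpolated directly, so it must be trusted.)
        rows = src.execute(f"SELECT * FROM {table}").fetchall()
        if rows:
            placeholders = ",".join("?" * len(rows[0]))
            dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        dst.commit()
        return len(rows)
    finally:
        src.close()
        dst.close()
```

For a transfer between two different database engines you’d also translate column types between dialects, but for a like-for-like copy this is really all there is to it.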
It gets a bit trickier with unstructured data (take my example of APIs earlier, or maybe S3 files or even streaming messages) as we’re going to have to create some schema for it ourselves.
Now if you’re lucky this can just be done once.
With sources like APIs or S3 buckets we’ll always know what type of structure will be required, so it’s fairly easy to accomplish, but for other sources there’s little choice but to make each decision on a case-by-case basis.
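To make that concrete, here’s one very simple way you might derive a schema from semi-structured records (say, JSON objects pulled from an API). This is a sketch under an assumption the original doesn’t state, namely that the records are flat dictionaries; nested structures would need flattening first.

```python
def infer_schema(records: list[dict]) -> dict[str, str]:
    """Infer a flat column -> type-name mapping from a list of
    JSON-style records (plain dicts)."""
    schema: dict[str, str] = {}
    for record in records:
        for key, value in record.items():
            type_name = type(value).__name__
            if key in schema and schema[key] != type_name:
                # A column with mixed types across records falls back to 'str'.
                schema[key] = "str"
            else:
                schema.setdefault(key, type_name)
    return schema
```

Run over a sample of records, something like this gives you the column names and types you need before you can write the data into a table, which is exactly the once-per-source decision described above.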
This creates what we call a snapshot of your data.
What’s A Snapshot?
Relying on Wikipedia again, a snapshot of your data can be defined as…
A snapshot is the state of a system at a particular point in time.
The term was coined as an analogy to snapshots in photography. It can refer to an actual copy of the state of a system or to a capability provided by certain systems.
It’s even easier with tables, as each database you create can be configured so that it can be operated via an API (I like to use SQL for this), meaning we can select all the data we need from the table to be written.
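In practice, a snapshot of a table is just that select-everything read performed in one go. A minimal sketch, again using the stdlib sqlite3 module with a hypothetical database path and table name:

```python
import sqlite3

def snapshot_table(db_path: str, table: str) -> list[tuple]:
    """Read every row from a table in a single query, giving a
    point-in-time snapshot of its contents."""
    conn = sqlite3.connect(db_path)
    try:
        # One SELECT within one connection gives a consistent view of the
        # table; whatever is committed at this moment is what we capture.
        # (Table name is interpolated directly, so it must be trusted.)
        return conn.execute(f"SELECT * FROM {table}").fetchall()
    finally:
        conn.close()
```

The list of rows this returns is the snapshot: write it to the target and you have a faithful copy of the table as it existed at that moment.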