In a few days Oracle will release Data Sync, a utility that facilitates loading data from various on-premise sources into BI Cloud Services (BICS). Data Sync joins several other means of loading data into BICS that are already available: the data load wizard for uploading files, the SQL Developer cart deployment feature, and the RESTful APIs. What makes Data Sync special is that it makes data loading more manageable in terms of capability and scalability (full vs. incremental loads of many tables) and adds scheduling, notification and automation on top. This brings data loading a lot closer to conventional ETL methods. If you have ever worked with DAC to run execution plans of Informatica mappings, you will find that Data Sync follows a similar methodology of defining connections, tasks and jobs. However, instead of referencing Informatica mappings, as is the case with DAC, Data Sync handles the mapping of source columns to target columns itself. And since it supports SQL queries, your opportunities for doing data transformations are endless. In this blog I present the key features of Data Sync.
The whole Data Sync software sits in one place; there are no server or client components to it. You basically extract the zipped file you download from Oracle onto a machine on your network that can connect to the different data sources you wish to source data from. Once you unzip the file, you configure the location of your Java home and that's it! You are ready to launch Data Sync and start moving data.
There are three main steps to follow when designing data loads in Data Sync, with three corresponding tabs to configure below the menu: Connections, Projects and Jobs.
Connections: The Connections tab is where you define connections to both on-premise databases and Oracle BI Cloud databases. By default there is a connection called TARGET, which you map to one of your BICS instances. (Note that when configuring the BICS connection, use the BI URL and not the cloud database URL.) Although the out-of-the-box connection is named TARGET, at this point we are not really specifying which connection serves as a source and which as a target. All we do here is set up the connections, test that they work, and have them ready to be used by projects and jobs.
Project: A project defines data mappings and transformations between sources and target tables. A source may be a data file, a database table or a SQL query that joins multiple tables. If you need to transform the data, you can use the "Data From SQL" option and embed all the transformation logic in a query. The "Target Table" here doesn't mean it's the target in any one particular database; it's just a placeholder structure describing where the data will end up and in what columns. (As you will see on the Jobs tab, the target table can be assigned to any connection that you would like to use as a target database.)
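To give a feel for what the "Data From SQL" option allows, here is a minimal sketch of the kind of source query you might embed. The table and column names (CUSTOMERS, ORDERS, and so on) are my own illustrative examples, not anything Data Sync provides:

```sql
-- Hypothetical "Data From SQL" source query: joins, cleansing and
-- aggregation all happen in the query before the rows reach the target.
SELECT c.cust_id,
       UPPER(c.cust_name)   AS cust_name,       -- simple cleansing
       c.country,
       SUM(o.order_amount)  AS total_orders,    -- aggregation
       MAX(o.order_date)    AS last_order_date
FROM   customers c
LEFT JOIN orders o
       ON o.cust_id = c.cust_id
GROUP BY c.cust_id, UPPER(c.cust_name), c.country
```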
One important feature that you configure in Projects is the Load Strategy. This is where you define whether the data load for a particular target table is always a full load or an incremental one. As you can see in the dialog, you get several options to define how the load happens.
If you choose "Update table", you get more things to set up once you click OK. First you select the column(s) that make up the user key of the target table. In my example, CUST_ID is the user key, which means the update strategy matches new data on CUST_ID to determine whether incoming records need to be inserted into or updated in the existing target table. On the Filters tab of the same dialog box, you specify which date column from the source table/SQL query you would like to use for detecting incremental records (records that have been added or updated in the source since the last time data was loaded from that source).
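Conceptually, an incremental "Update table" load behaves like the SQL sketch below. Data Sync generates and tracks all of this itself; the bind variable :LAST_REFRESH_DATE stands in for the refresh date it records per table, and LAST_UPD is an assumed date column on the source, so treat this purely as an illustration of the logic:

```sql
-- Illustration only: Data Sync handles this internally.
-- :LAST_REFRESH_DATE = refresh date Data Sync tracks for this source;
-- LAST_UPD           = assumed "last updated" date column on the source.
MERGE INTO target_customers t
USING (
        SELECT cust_id, cust_name, country
        FROM   customers
        WHERE  last_upd > :LAST_REFRESH_DATE   -- Filters tab: new/changed rows only
      ) s
ON (t.cust_id = s.cust_id)                     -- CUST_ID is the user key
WHEN MATCHED THEN
  UPDATE SET t.cust_name = s.cust_name,
             t.country   = s.country
WHEN NOT MATCHED THEN
  INSERT (cust_id, cust_name, country)
  VALUES (s.cust_id, s.cust_name, s.country);
```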
Jobs: A job is where you relate projects to source and target connections. By default you get one job that loads data into the default connection, TARGET. However, you can create a new job and override the default source and target connections with other connections that you have already defined under the Connections tab.
For example, in the New Job window below, I kept the data sources the same but overrode the TARGET connection (which I configured to be my BICS Test instance) with a connection to my BICS Production instance. So when the new job runs, data is still sourced from the same databases, but instead of being loaded into BICS Test, it gets loaded into BICS Prod. You may also choose to move data from on-premise databases to other on-premise databases using the same approach. One thing I tried that doesn't currently seem to work in Data Sync is sourcing data from a BICS database to load into an on-premise database. (If you are looking to copy data from a BICS Cloud Schema database to an on-premise database, I explain how to do that in an earlier blog: Oracle BI Cloud Service: How to migrate from Test to Production?)
Besides defining jobs, we can also schedule jobs and monitor their execution under the Jobs tab. As shown in the screenshot below, the latest job that ran contained one task that successfully loaded 2 rows into the BICS Prod target database.
One last thing: suppose you have been running data loads incrementally but then decide you need to refresh all the data in the cloud and get a fresh start. This requires a full load run. To do that, we don't need to redefine the Load Strategy, which is already set to work incrementally in the project mappings. All we need to do is go to Tools and select "Re-extract and load all data". This clears the Refresh Dates on all the tables; in other words, we are telling Data Sync to forget when data was last extracted/loaded for every table and to run everything as if it were the very first run.
Alternatively, if you choose to do a full load on selected tables only, rather than all tables, you can navigate to those tables in your source and target connections (see the Refresh Dates tab below) and clear the corresponding Refresh Date column. The next time the job runs, the tables whose Refresh Date has been cleared will run in full load mode, while the other tables continue to run incrementally as usual.
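Either way, the effect of the Refresh Date is easy to picture: when it is present, extraction is filtered on it; when it is cleared, nothing is filtered and the table is reloaded in full. A rough sketch, using the same assumed names as before:

```sql
-- Rough sketch of how the Refresh Date drives full vs. incremental extraction.
-- Data Sync does this internally; :LAST_REFRESH_DATE and LAST_UPD are assumed names.
SELECT *
FROM   customers
WHERE  :LAST_REFRESH_DATE IS NULL        -- Refresh Date cleared: full load
   OR  last_upd > :LAST_REFRESH_DATE;    -- otherwise: incremental load
```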