Smart Data Integration - Oracle Log Reader Adapter - how exactly does it work technically?


Hi Experts,

We want to use the SDI Oracle Log Reader feature to perform real-time data loads of some tables from a source Oracle database into BW/4HANA without adding any extra load on that source database, and we are working on a Proof of Concept for it.


We have in fact already installed the Data Provisioning Server on the HANA side and the Data Provisioning Agent (in our case on the same server where the source Oracle database is located, not on a separate host), and set up the Oracle Log Reader.

We have completed the setup and it seems to be working fine: as soon as we update a record in the source Oracle database we see the change at HANA virtual table level, and if we create an extraction request we see that the extraction retrieves that single data record.


But our doubt is about how exactly this Oracle Log Reader works technically: how it reads the information from the redo logs, and where it saves it until the BW extraction is launched. We cannot be sure whether the LogReader user created at Oracle level is in fact reading the database itself rather than the redo logs (because we had to grant the "select" permission on the tables we want to capture), and we cannot find anywhere how this process exactly works.


My personal theory is that the Oracle Log Reader is in fact checking the online redo logs of the Oracle database, and when it detects a change there that affects the tables we have selected in BW/4HANA it saves this change information in a different location, probably in the file structures the agent has created under /usr/sap/dataprovagent/ds-lite/sqla/tmp/dslite_repo_<process_id> or in the file called dslite_repo.db. But it might also do it at Oracle level, in the other set of tables that the Oracle Log Reader user has created as owner in the database.

Can someone please confirm how this works?


And then another doubt arises: how did the initial full load of the HANA virtual tables happen? Surely not all the existing records of the tables we selected were still part of the redo logs (all the more so since we have a script that compresses the offline redo log files and moves them to a different location every hour).
Was that initial load perhaps done using a direct "select" statement on the real table at Oracle level? Or how exactly?



Thank you very much, and best regards
Jose Sanz

Accepted Solutions (0)

Answers (2)

werner_daehn
Active Contributor

To expand on Jeff's extensive answer a bit...

You need the select privilege on the Oracle table primarily for the initial load, which is just a select-from-table like statement.
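
To make that concrete, here is a minimal sketch of the privilege and the kind of statement the initial load boils down to (schema, user, and table names are made-up examples):

-- Privilege granted to the SDI log reader user during setup (names are examples)
GRANT SELECT ON sales.orders TO sdi_lr_user;

-- The initial load is essentially a plain read of the source table
SELECT order_id, customer_id, order_date, amount
  FROM sales.orders;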

The changes are really read from the Oracle transaction log using Oracle's log reader API. The log reader API is invoked with a request along the lines of "give me all changes from system change number xyz onwards", and using the log information in the data dictionary it knows where this transaction is to be found. It could be in the online redo log or - if older - in an archive log. You can prove that easily by removing the archive logs from the Oracle database's reach and asking for an older system change number. The request will fail with "Cannot access the archive log file <..> containing the change data".
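
Just to illustrate the principle with Oracle's own LogMiner interface (the adapter drives this internally; file names and SCN values below are invented):

-- Register the redo/archive log files that cover the requested range (paths are examples)
BEGIN
  DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => '/oracle/ORCL/redo01.log', OPTIONS => DBMS_LOGMNR.NEW);
  DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => '/oracle/ORCL/arch/1_123_456.arc', OPTIONS => DBMS_LOGMNR.ADDFILE);
END;
/

-- "Give me all changes from system change number 1234567 onwards"
BEGIN
  DBMS_LOGMNR.START_LOGMNR(STARTSCN => 1234567,
                           OPTIONS  => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/

-- The mined changes are then exposed through this view
SELECT scn, operation, seg_owner, table_name, sql_redo
  FROM v$logmnr_contents;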

Another indication that the delta is read from the transaction log only is the set of statements executed against the Oracle database during the setup. Some supplemental logging is turned on to add more information to the redo log; for example, when a record is deleted, at least its primary key data should be saved in the redo log.
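
The supplemental logging statements themselves are standard Oracle DDL, roughly like this (the table name is just an example):

-- Minimal supplemental logging at database level
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Table-level supplemental logging so deletes and updates carry the primary key columns in the redo log
ALTER TABLE sales.orders ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;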

What I said above is not 100% accurate, but it is precise enough. The log reader also has to have select privileges on the table to check which columns it has at the moment. Imagine a column got added: an older redo log entry will not know about that column. The log miner has to reconcile the two pieces of information, e.g. present you with the wider table but return null for this column in such a case.
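
The current column list can simply be read from the Oracle dictionary, for example (owner and table name are again just examples):

-- Current column definition of the source table, used to reconcile older redo entries
SELECT column_name, data_type, column_id
  FROM all_tab_columns
 WHERE owner = 'SALES' AND table_name = 'ORDERS'
 ORDER BY column_id;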

jeffrey_kresse
Employee

Hello Jose, I don't think I can achieve the full level of detail you want, but perhaps a high- to mid-level explanation of log reader adapters will help. I am writing this from the perspective of what I would call "native" SDI, meaning I have experience with replications configured directly in HANA rather than in BW/4HANA, so some specific BW/4HANA features or workflows may escape me.

The select privilege is definitely needed, at least for the initial load. But real time does not just query the table constantly; that is not a feasible option for change data capture. Some customers replicate hundreds of tables at once, and those queries would be outrageously expensive.

The adapters instead use different methods (LogMiner for Oracle, sybfilter for MSSQL, and a native log reader for DB2, among others) to query the redo log, discarding information about untracked tables and passing 'marked' changes on to the agent for return to HANA. Tables are marked by being added to source-side tables, usually created by a setup script during the init process. While writing this I am referencing an MSSQL implementation; one of the tables there is ra_marked_object, which holds the object_id of all my subscribed tables. I think on Oracle this would instead be the ORACLEMARKEDOBJECT table. We use LogMiner to check each SCN of each LSN and reference it against the object ids in the marked object table.
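
Conceptually, the filtering step looks something like the sketch below (the marked-object table and column names are my guess for Oracle, based on the MSSQL example above):

-- Sketch: keep only redo entries that belong to subscribed ("marked") tables
SELECT c.scn, c.operation, c.seg_owner, c.table_name, c.sql_redo
  FROM v$logmnr_contents c
 WHERE c.data_obj# IN (SELECT m.object_id FROM oraclemarkedobject m);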

For more tracing that should prove SDI is actually reading the log, you can enable tracing of that process in .../dataprovagent/configuration/com.sap.hana.dp.adapterframework/OracleLogReaderAdapter/OracleLogReaderAdapter.ini. Add the following line, save, and restart the dpagent. During real-time replication the trace messages will then appear in the remote source instance log, .../dataprovagent/log/<remotesource>/<remotesource#.log>:

UI.logging.log_trace_settings=TRACE.LRTRACE=true

The initial load is quite simply done via JDBC, using either the same SQL that the subscription uses or the virtual table definition if the entire Oracle table is being replicated. HANA takes the SQL it generated from the design of your task, sends it to the agent, and the agent applies it on the remote source using that remote source's adapter. If the task has partitions, this select is done once for each partition. If you want to prove this to yourself, turn the dpagent log level up to ALL and run an initial load. You should see a line in framework.trc that contains "STREAMING_SET_STATEMENT" or "FEDERATION_SET_STATEMENT"; this will tell you the SQL used for the initial load, and it will be a valid SQL statement that you could execute in Oracle.
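
To give an idea of the shape of such statements, a partitioned initial load would send one select per partition range, roughly like this (table, columns, and ranges are hypothetical):

-- Hypothetical initial-load statements, one per task partition
SELECT order_id, customer_id, order_date, amount
  FROM sales.orders
 WHERE order_id >= 1 AND order_id < 1000000;

SELECT order_id, customer_id, order_date, amount
  FROM sales.orders
 WHERE order_id >= 1000000;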

As for where the data is stored after it is returned from Oracle: it sits in dpagent memory, not in any file on the agent side. The ds-lite and sqla components of the agent belong to the File Datastore adapters and have no bearing on Oracle replication. Instead, you can see the data moving through the dpagent by watching the memory pools/queues grow and shrink in framework.trc. You will see a line something like this:

DPFramework | MemoryBoundedQueue.adjustUsage [] - RCRQ: 10712, [Pool: max=682.67mb,cur=10.46kb,peak=15.88kb,tot=14.01mb; Queue: cur=1(10.46kb),peak=1(10.46kb),tot=793(7.76mb)]

The Pool settings describe the amount of shared memory that all response queues can use, up to 682 MB in my case. The Queue settings describe this particular replication: my single test table currently has 10.46 kb in memory on the dpagent side, waiting to be sent to HANA. There will be multiple, even many, queues created depending on how many tables are in replication. Here we have the receiver response queue "RCRQ 10712", the number being the identifier; you should be able to track this number through your framework.trc files back to the start of either the initial load or the first portion of the CDC rows being moved in a certain timeframe.

Hope this can clear up some of your questions.

Best regards,

Jeff

PS: If you want to avoid stressing the Oracle machine, I would recommend placing the dpagent on separate hardware. The dpagent has a non-negligible memory footprint, especially with high-volume replication.