| Last Modified On : | June 17, 2009 2:48 PM PDT |
Rate |
|
A common pattern when processing data electronically is parsing the data, applying certain rules, and then loading to destination systems. Hand coding a data processing solution in 3GL (C#, C++, VB, Java**and so forth) requires substantial coding effort to develop and maintain the system. In addition, handling large input files (XML or any other format) require careful considerations on efficiently processing the data.
Microsoft* Integration Services addresses this problem by providing a powerful environment for visually building the solution with minimal coding effort.
Another key factor that impacts scalability is the number of round trips to the database. One round trip per row to the database seriously limits the throughput of an application. Processing data in batches by sending several rows worth of data in a single database round trip allows an application to scale efficiently. However, bulk loading to Oracle* from Integration services framework today requires the use of third party components. In this white paper, we are going to look at a solution option for building high throughput application to parse large XML files and perform bulk load to Oracle database.
XML Processing
Typical 3GL approach to XML processing would involve understanding the schema and parsing the elements of interest from the XML instance. While small-medium documents can be loaded as in-memory trees, large documents would require streaming or chunking to efficiently process the input. Writing the code in 3GL would require understanding these important details and hand coding an appropriate solution.
Integration services software on the other hand, hides the complexity involved in parsing an XML and exposes XML as in-memory relational table. Large XMLs are mapped to in-memory table by transparently allocating memory for the rows in use and freeing up the memory once the batch of rows is consumed.
Oracle* Bulk Loading
When an application needs to insert (and update) several rows of data, it is optimal to process several rows of data in one round trip. This bulk processing or fast load capability allows an application to scale efficiently.
Performing bulk operation, also known as Array Binding, is relatively straight forward with ODP.NET drivers as described in this How to: Bind an Array to an ODP.NET Database Command article.
Performing bulk operations in Oracle from Integration Services requires few additional steps and coding. The rest of this paper discusses throughput numbers and a code sample on how to perform bulk operation through integration services.
Unit and Bulk Performance Characterization
The "Customers" table available Oracle SH schema was used as a test bed for validating bulk insert using Integration Services. This table has 24 columns of mixed numeric and varchar data types.
Input XML File
The data to be loaded to Customers table was in XML format with each XML file having several rows of data.
Performance
The table and chart summarizes the performance difference between row-by-row processing and bulk load processing in integration services with Customers Data set.
Please note that several factors influence the overall performance and depending on the number of columns and column length, appropriate batch size needs to be selected.
|
Rows/File |
File Size (MB) |
Row-By-Row Insert (Seconds) |
Row-By-Row Throughput (Rows/Second) |
Bulk Insert Time (Seconds) |
Bulk Load Throughput (Rows/Second) |
|
500 |
0.52 |
2.6 |
192.31 |
1.8 |
277.78 |
|
1,000 |
1.03 |
4.1 |
243.90 |
1.8 |
555.56 |
|
2,000 |
2.07 |
7.2 |
277.78 |
2.3 |
869.57 |
|
5,000 |
5.17 |
14.8 |
337.84 |
3.3 |
1,515.15 |
|
10,000 |
10.33 |
28.0 |
357.14 |
4.8 |
2,083.33 |
Code Approach
A custom destination Script Component available under Data Flow Transformations was used for implementing the bulk insert logic.
The destination script component, by default, provides three methods:
All these methods are automatically invoked by the integration services data flow pipeline during appropriate events:
PreExecute - Preparing the custom destination component to consume data
Input0_ProcessInputRow - Invoked for every row that arrives in the pipeline
PostExecute - After all rows are processed, this method is invoked to perform cleanup.
At this point, all that needs to be done is allocate buffers within this script component class and populate the buffers whenever a new row arrives. Once, the number of rows reaches the defined batch size, stored procedure is invoked to pass the buffered data using array binding technique available in ODP.NET.
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
private OracleConnection _conn;
private OracleTransaction _tx;
private int _rowCount;
private int _rowsPerRoundTrip;
private int _iterations;
// Define buffers for all the columns
private int[] countryId;
private string[] custCity;
private int[] custCityId;
….
public override void PreExecute()
{
base.PreExecute();
_rowCount = 0;
_rowsPerRoundTrip = <read from configuration variable>;
_iterations = 0;
string connectionString = <read from configuration variable>;
_conn = new OracleConnection (connectionString);
_conn.Open();
// Allocate space
countryId = new int[_rowsPerRoundTrip];
custCity = new string[_rowsPerRoundTrip];
custCityId = new int[_rowsPerRoundTrip];
…..
}
// Make sure that the last batch of rows are processed and loaded to database
public override void PostExecute()
{
base.PostExecute();
if (_rowCount > 0)
{
LoadData();
}
_conn.Close();
}
// Process the incoming rows and copy to buffer
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
countryId[_rowCount] = Row.COUNTRYID;
custCity[_rowCount] = Row.CUSTCITY;
custCityId[_rowCount] = Row.CUSTCITYID;
…..
_rowCount++;
if (_rowCount > _rowsPerRoundTrip - 1)
{
LoadData();
}
}
// Function to invoke database stored procedure
private void LoadData()
{
_tx = _conn.BeginTransaction();
OracleCommand cmd = new OracleCommand("Stored Proc Name", _conn);
cmd.CommandType = CommandType.StoredProcedure;
cmd.ArrayBindCount = _rowCount;
OracleParameter pcountryId = new OracleParameter("pi_country_id", OracleDbType.Int32);
pcountryId.Direction = ParameterDirection.Input;
pcountryId.Value = countryId;
cmd.Parameters.Add(pcountryId);
….
cmd.ExecuteNonQuery();
cmd.Dispose();
_rowCount = 0;
_tx.Commit();
}
// Oracle Stored Procedure definition
CREATE OR REPLACE PROCEDURE proc_insert_customer
(pi_country_id IN number,
pi_cust_city IN VARCHAR2,
...
) IS
BEGIN
insert into customers(cust_id, cust_first_name, cust_last_name,....)
values (pi_cust_id,pi_cust_fname,pi_cust_lname,...);
END;
Conclusion
In this paper, we saw how custom Oracle Bulk Processing capability can be incorporated in to Integration Services while leveraging integration services for parsing XML data and preparing the data for backend systems.
This mix provides an efficient mechanism to build scalable applications on Intel® Platforms while reducing the time-to-market.
About the Author
ChandraMohan Lingam (Chandra) is an Architect at Intel where he works for the Technology Development Group developing solutions to address Intel's manufacturing needs. Chandra has a MS degree from Arizona State University, Tempe and a Bachelors Degree in Engineering from Thiagarajar College of Engineering, Madurai, India.
Download article as a PDF file (130 KB).
*Other names and brands may be claimed as the property of others.
