I'm continuing my previous post to include a few more tips to help you...mainly concerned with execution performance. I know that with my projects, time is always the issue. What may seem like a small 5-10 min transfer operation becomes large when coupled with 45 other processes that must run in the morning as part of a data warehouse load.
With SQL Server 2008, I recently read that SSIS 2008 has further enhanced the internal dataflow pipeline engine to provide even better performance, you might have heard the news that SSIS 2008 has set an ETL World record of uploading 1TB of data in less than half an
hour.
Pulling High Volumes of Data
As part of a recent project, we had to pull data from a source table which had ~ 4 million
records to a new target table. Initially when the SSIS package started,
everything looked fine, data was being transferred as expected but gradually the
performance degraded and the data transfer rate went down dramatically. During
analysis we found that the target table had a primary clustered key and two
non-clustered keys. Because of the high volume of data inserts into the target
table these indexes got fragmented heavily up to 85%-90%. We used the online
index rebuilding feature to rebuild/defrag the indexes, but again the
fragmentation level was back to 90% after every 15-20 minutes during the load.
This whole process of data transfer and parallel online index rebuilds took
almost 12-13 hours which was much more than our expected time for data transfer.
Then we came with an approach to make the target table a heap by dropping all
the indexes on the target table in the beginning, transfer the data to the heap
and on data transfer completion, recreate indexes on the target table. With this
approach, the whole process (by dropping indexes, transferring data and
recreating indexes) took just 3-4 hours which was what we were expecting.
Rows Per Batch Settings
Rows per batch – The default value for this setting is -1 which
specifies all incoming rows will be treated as a single batch. You can change
this default behavior and break all incoming rows into multiple batches. The
allowed value is only positive integer which specifies the maximum number of
rows in a batch.
Maximum insert commit size – The default value for this setting is
'2147483647' (largest value for 4 byte integer type) which specifies all
incoming rows will be committed once on successful completion. You can specify a
positive value for this setting to indicate that commit will be done for those
number of records. You might be wondering, changing the default value for this
setting will put overhead on the dataflow engine to commit several times. Yes
that is true, but at the same time it will release the pressure on the
transaction log and tempdb to grow tremendously specifically during high volume
data transfers.
The above two settings are very important to understand to improve the
performance of tempdb and the transaction log. For example if you leave 'Max
insert commit size' to its default, the transaction log and tempdb will keep on
growing during the extraction process and if you are transferring a high volume
of data the tempdb will soon run out of memory as a result of this your
extraction will fail. So it is recommended to set these values to an optimum
value based on your environment.
Note: The above recommendations have been done on the basis of
experience gained working with DTS and SSIS for the last couple of years. But as
noted before there are other factors which impact the performance, one of the
them is infrastructure and network. So you should do thorough testing before
putting these changes into your production environment.
Avoid SELECT *
I know this recommendation has been around forever for most DBAs and developers, but in SSIS there is an additional important consideration to avoid using SELECT *. The Data Flow Task (DFT) of SSIS uses a buffer (a chunk of memory) oriented architecture for data transfer and transformation. When data travels from the source to the destination, the data first comes into the buffer, required
transformations are done in the buffer itself and then written to the
destination. The size of the buffer is dependant on several factors, one of them is the
estimated row size. The estimated row size is determined by summing the maximum
size of all the columns in the row. So the more columns in a row means less
number of rows in a buffer and with more buffer requirements the result is
performance degradation. Hence it is recommended to select only those columns
which are required at destination.
Even if you need all the columns from the source, you should use the column
name specifically in the SELECT statement otherwise it takes another round for
the source to gather meta-data about the columns when you are using SELECT *.
OLEDB Destination Settings
There are couple of settings with OLEDB destination which can impact the
performance of data transfer.
Data Access Mode – This setting provides the 'fast load' option which
internally uses a BULK INSERT statement for uploading data into the destination
table instead of a simple INSERT statement (for each single row) as in the case
for other options. So unless you have a reason for changing it, don't change
this default value of fast load. If you select the 'fast load' option, there are
also a couple of other settings which you can use as discussed
below.
Keep Identity – By default this setting is unchecked which
means the destination table (if it has an identity column) will create identity
values on its own. If you check this setting, the dataflow engine will ensure
that the source identity values are preserved and same value is inserted into
the destination table.
Keep Nulls – Again by default this setting is unchecked which means
default value will be inserted (if the default constraint is defined on the
target column) during insert into the destination table if NULL value is coming
from the source for that particular column. If you check this option then
default constraint on the destination table's column will be ignored and
preserved NULL of the source column will be inserted into the destination.
Table Lock – By default this setting is checked and the recommendation
is to let it be checked unless the same table is being used by some other
process at same time. It specifies a table lock will be acquired on the
destination table instead of acquiring multiple row level locks, which could
turn into lock escalation problems.
Check Constraints – Again by default this setting is checked and
recommendation is to un-check it if you are sure that the incoming data is not
going to violate constraints of the destination table. This setting specifies
that the dataflow pipeline engine will validate the incoming data against the
constraints of target table. If you un-check this option it will improve the
performance of the data load.