Issues Importing Data from Excel

Sometimes data needs to be imported into systems using an Excel file format.

This article is about understanding the problems encountered when using Excel for data import, and some ways to avoid them.

Let me say straight up that while I love Excel, there are issues with using Excel files for data transfer. In theory it should be easy, but in practice its fraught with problems. Typically, Excel is not the best format to use for data transfer. Far better to use a simple open format such as fixed width or comma or tab separated text files or XML.

 

Why is Excel used as a data transfer format?

This is usually because users need to use Excel to modify or view the file. Most users  know Excel. Excel gives them the opportunity to view or modify a file before its loaded into another system.  This is especially true when the user is a competent spread-sheet user and needs some ability to sort, compare, count or calculate with the data. Often Excel is a cost-effective alternative to having a custom application developed to manipulate data.

 

So, in some cases the benefits of using an Excel format for data transfer, outweigh the problems. But lets get better acquainted with these problems….

 

The Problems

Problem #1: Mixed Data Types in a Column Cause Missing Data.

Internally Excel stores numbers and text in a different object types. The Excel OLEDB driver is used by most programs (other than Excel itself) to read Excel files.  The OLEDB driver decides if a column is text or numeric or memo (>255 characters) by examining the contents of the first 8 rows. OLDB decides this column is the same type as the first non empty cell. If the first non empty cell is text then the driver treats the column as a text column and looks only at text cells in that column. Numeric and memo data in a text column is ignored and so is text data in a numeric or memo column, etc.. There is nothing the programmer can do about this apart from bypassing OLEDB and using another access method.

 

Even SqlServer Integration Services (SSIS) suffers from this problem because its  Excel data source uses the OLEDB driver under the hood.

 

The problem is documented in http://support.microsoft.com/kb/194124 (PRB: Excel Values Returned as NULL Using DAO OpenRecordset).

The programming method of using Excel ADO is documented in http://support.microsoft.com/kb/278973 (ExcelADO demonstrates how to use ADO to read and write data in Excel workbooks)

 

Problem #2: Unintended Changes in Structure

Excel is very flexible and its easy for users to add extra worksheets,  remove or re-arrange columns or to add an extra row at the top or change column headings.

Any of these things will cause a data upload to stuff up.

The problem can also occur when data is loaded into Excel from a .CSV or text file, edited and then saved.

 

Problem #3: Lost leading zeroes

Codes with leading zeroes are common in business systems (e.g. employee ID 0004578). These leading zeroes are silently truncated by Excel both when the string 0004578 is typed directly into a cell or the formula area. Excel will have the number 4578 instead of the text string 0004578.

Note that leading zeroes will be truncated when data is loaded into Excel from a .CSV or text file and then saved.

 

So problem #2 and problem #3 are both due to using Excel to edit data, rather than with the Excel format.

 

Solutions

My first rule is to avoid using the Excel format unless users need to use the Excel tool to edit the data.

 

Given that Excel must be used,  we will deal with these problems in reverse order.

 

Problem #2 (Unintended Changes in Structure)  &  Problem #3 (Lost Leading Zeroes)

These can only be dealt with by user education and good error handling in the data load.

Users can be taught to:

  • Ensure that leading zeroes are captured (by using a single quote)
  • Leave the worksheet name alone
  • Leave the column names alone
  • Don’t introduce extra columns or remove columns

 

Error handling can attempt validate that the expected columns are present and that codes match existing master tables where possible and that values fall within expected ranges.

The feedback of errors to the correct users can be tricky if the data load is occurring unattended or asynchronously.

 

Problem #1

In its MSDN article, Microsoft describes a work-around that involves changing the registry so that more than 8 rows are read to determine the data type. This is not a great solution because there can still be mixed types in a column.

I think you need to either force users to enter data of a single type consistently in every row or bypass the Microsoft Excel Jet/DAO/OLDEB driver.

 

 Given the problems with the Microsoft OLEDB driver, I feel its best to bypass it and find another BIFF reader.

 

Alternative 1: ExcelDataReader

I recently tried this library available on codeplex. It’s a lightweight and fast library written in C# for reading Microsoft Excel files (’97-2007).

It is open source and works on .Net 2.0 and later as well as mobile and mono.

Look for it at http://exceldatareader.codeplex.com/

 

I used both the convert to DataSet and the direct use of the reader. It worked fine for the 2 files I was working with – (a customer master of about 8000 rows and a product file of about 500 rows). It recognised data, even in mixed columns and was able to convert the data whenever C# had a valid cast. This snippet of code shows how I tested this from the file upload click event on an ASPX form.

 

using Excel; // from http://exceldatareader.codeplex.com/

// It seems necessary to save as a file before using FileUpload1.PostedFile.InputStream

String tempPath = System.IO.Path.GetTempPath() + “\\”+System.IO.Path.GetRandomFileName();

FileUpload1.SaveAs(tempPath);

var stream = FileUpload1.PostedFile.InputStream;

//Reading from a binary Excel file (’97-2003 format; *.xls)

IExcelDataReader excelReader = ExcelReaderFactory.CreateBinaryReader(stream);

try

{

// DataSet – Create column names from first row

excelReader.IsFirstRowAsColumnNames = true;

DataSet excelDataSet = excelReader.AsDataSet();

GridViewAll.DataSource = excelDataSet;

GridViewAll.DataBind();

}

catch (Exception ex)

{

errorMessage = FileUpload1.FileName + ” does not seem to be in Excel (*.xls) format. Error creating or binding dataset from excelReader : “ + ex.Message;

}

finally

{

excelReader.Close();

System.IO.File.Delete(tempPath);

}

 

Alternative #2: Using The Office Automation APIs

Using the Office API to read the spread sheet data will get around the problems of mixed types, and also provides a huge amount of functionality.

However the Office API is not typically chosen to provide an Excel data upload for these reasons:

  • The office API  error handling is not suitable for “unattended execution”. http://support.microsoft.com/kb/257757. “Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behaviour and/or deadlock when Office is run in this environment.
  • Office needs to be installed on the machine running the upload. If this is a server then it would not typically have office installed. Office requires licensing and service packs.
  • Each version of Office has a different API, so although the load will probably work between versions, there may be some problems with versioning.

 

Alternative #3: SpreadsheetGear

Another possibility for error-free reading of spread sheets is SpreadsheetGear.

This can be used for unattended execution. This is a great product that has worked well for me in a project that required programmatically pushing values into a spread-sheet to get calculated results.

SpreadsheetGear  does require a license, however,  and  is probably too expensive and feature rich for what should be just a lightweight utility.

Advertisements

About renatobuda

I am a software developer and sometime dba, business intelligence analyst and data wrangler.
This entry was posted in Data Focus and tagged , , , , . Bookmark the permalink.

2 Responses to Issues Importing Data from Excel

  1. justinhimt says:

    using c# you can solve this like the following…

    Excel.Range formatRange;
    formatRange = xlWorkSheet.get_Range(“a1”, “b1”);
    formatRange.NumberFormat = “@”;
    xlWorkSheet.Cells[1, 1] = “098”;

    Soure :
    http://csharp.net-informations.com/excel/csharp-format-excel.htm

    Justin

  2. renatobuda says:

    Hi Justin, this method uses the office automation API.
    That’s great when you have Office installed on your target machine, but its not always available. Also its not recommended by Microsoft for unattended execution (scheduled batch).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s