Debunking Dataset Myth

Posted: Tuesday, February 24, 2004 1:52 PM by Barry Gervin
Filed under: , , , ,

Many people think that datasets are stored internally as XML. What most people need to know is that Datasets are serialized as XML (even when done binary) but that doesn't mean they are stored as XML internally - although we have no easy way of knowing, it's easy to take a look at the memory footprint of datasets compared to XmlDocuments.

I know that if datasets were stored as XML, then in theory, datasets should be larger since BeginLoadData/EndLoadData implies there are internal indexes maintained along with the data.

It's not easy to get the size of an object in memory, but here is my attempt.

long bytecount = System.GC.GetTotalMemory(true);
DataSet1 ds =
new DataSet1();
ds.EnforceConstraints =
false;
ds.Order_Details.BeginLoadData();
ds.Orders.BeginLoadData();
ds.ReadXml("c:\\test.xml");
bytecount = System.GC.GetTotalMemory(
true) - bytecount;
MessageBox.Show("Loaded - Waiting. Total K = " + (bytecount/1024).ToString());

long bytecount = System.GC.GetTotalMemory(true);
System.Xml.XmlDocument xmlDoc =
new System.Xml.XmlDocument();
xmlDoc.Load("c:\\test.xml");
bytecount = System.GC.GetTotalMemory(
true) - bytecount;
MessageBox.Show("Loaded - Waiting. Total K = " + (bytecount/1024).ToString());

I tried these examples with two different xml files - both storing orders & orderdetails out of the northwind database. The first example was the entire result set of both tables. The dataset memory size was approximately 607K. The XmlDocument was 1894K, over 3 times larger. On a second test, I used only 1 record in both the order and order details tables. The dataset in this case took 24K and the XmlDocument took 26K, a small difference.  You will notice that in my dataset example I have turned off index maintenance on the dataset by using BeginLoadData. Taking this code out resulted in a dataset of 669K, an increase of approximately 10%. An interesting note is that if you put in a BeginLoadData and EndLoadData, the net size of the dataset is only 661K. This would imply that leaving index maintenance on during loads is inefficient in memory usage.

The speed of loading from XML is a different story.  Because the XmlDocument delays (I'm assuming) the parsing of the XmlDocument, the time to load of the full dataset from an XML file is 1/3rd of the time to load the DataSet from XML. I would be careful in being too concerned about this. Loading a dataset from a relational source like a DataAdapter that involves no Xml parsing and is much faster.

If you load up Anakrino and take a look at how the Dataset stores it's data, each DataTable has a collection of columns, and each column is in fact a strongly type storage array. Each type of storage array has an appropriate private member array of the underlying value type (integer, string, etc.). The storage array also maintains a bit array that is used to keep track of which rows for that array are null. The bit array is always checked first before going to the typed storage array and returns either null or the default value. That's pretty tight.

Comments

Barry Gervin

February 24, 2004 3:45 PM
This has been a long standing question since we released the DataSet as part of ADO.NET version 1.0. The DataSet is intimately aware of XML; it can and write XML, create relational schema (tables, columns, relationships) based on XSD schemas, and persist its relational schema as XSD. There are properties on the DataSet, DataTable, and DataColumn for specifying XML properties, such as Namespace, and even specify how columns are mapped to XML elements, attributes, or simple types. It seems natural, given this intimate knowledge of XML, for the DataSet to store data in memory as XML.

While this seems logical on the surface, Barry is right; the DataSet does not store data in memory as XML.

XML is a generic, self-describing format for encoding data. Its self-describing nature lends itself well to general tools for querying, transforming, validating, and annotating the data. Further, XML provides an inherently hierarchical model that allows you to model structured, semi-structured, or completely unstructured data. However, where your data is highly structured, such as when modeling results from a database query, the data can be more efficiently stored without the redundancy of describing the type and schema of each node.

Although the DataSet has built-in knowledge for reading and writing XML data and schemas, and properties for controlling how the XML is generated, the in-memory representation is not XML. Rather, the DataSet today stores strongly typed values in an optimized column-wise storage. If you create a column of type boolean, for example, we'll create an array of booleans to hold the values for that column. (It is worth noting that we have, at times, considered changing this to row-wise storage, which has a different set of trade-offs but is still optimized over XML storage for structured data).

So why doesn’t the size of the table always (approximately) equal the size of the data it holds? The answer is simple. Rather than reallocate each column array whenever an additional row is added to the table (which would get expensive when loading a table), we pre-allocate space according to the MinimumCapacity property of the encompassing DataTable, and have an algorithm for growing the array whenever the capacity is exceeded. This improves performance when loading a bunch of rows, but may mean that there is more memory allocated at certain times than is actually used. For example, you may see little increase in memory used as you add the first 25 records, and then see the amount of memory for the table nearly double when you add the 26th record.

If you know the size of the data you'll be using ahead of time, you can use the MinimumCapacity property to tune both memory usage and performance. For example, if you knew there would be less than 30 records, you could set the MinimumCapacity property to 30. The DataSet would preallocate 30 elements in each typed column array, and as long as you didn't exceed 30 records it wouldn't have to reallocate the arrays, and would never allocate more than enough memory for the 30 records. If you added a 31st record the DataSet would then reallocate the arrays to hold, say, 60 records.

Note that the number of storage records may not exactly equal the number of rows in the table; modified rows have pointers to two storage records (one for the original value and one for the current value). So say you set MinimumCapacity to 30 and then insert 30 rows (and accept changes, so they become unmodified). If you then modify one of the rows, the DataSet will again reallocate memory in order to hold the new current value of the modified row.

As you can see, the DataSet does not store data internally as XML, but uses an optimized storage mechanism that balances between performance and memory usage. Understanding how the DataSet allocates memory for storage can help you tune your applications performance and memory usage.

I hope this helps clear up some of the long standing confusion over how the DataSet optimizes storage internally while providing rich XML semantics.

Michael Pizzo
Architect
WebData Team
Microsoft Corporation.

TrackBack

February 24, 2004 9:12 PM
Take Outs: The Digital Doggy Bag of Blog Bits for 24 February 2004

TrackBack

February 28, 2004 3:36 PM

TrackBack

June 10, 2004 3:09 PM

TrackBack

June 10, 2004 3:12 PM

TrackBack

July 1, 2004 8:08 PM

Barry Gervin

May 17, 2005 11:30 AM
Mi e-mail is floverso@hotmail.com

The optimization only operate over typed datasets?

MinimumCapacity only operate with defined DataTables, and the method "ds.Order_Details.BeginLoadData" too.

1- If I load a XML without schema defined previously... or if I load the data form DB: this optimization apply?.. How I execute BeginLoadData or MinimumCapacity over a data dynamical?

2- If I execute:
ds.EnforceConstraints = false;
ds.table.BeginLoadData();
Then load the data (well as your example), I understand what not index creation or maintenance exists. Then execute
ds.EnforceConstraints = true;
ds.table.EndLoadData();

The index are recreated with EndLoadData?

Thanks

Barry Gervin

May 17, 2005 11:41 AM
These optimizations would also apply to untyped datasets. You would however have to new up a new datatable before you fill it (so you have something to call BeginLoadData on and set the MaxCapacity accordingly.

Barry Gervin

August 9, 2006 11:03 AM
Wow, this is great! I was searching for the memory usage of Dataset and XMLDoc. Now with your findings, I'm fully convinced to use Dataset rather than XMLDoc. Thanks a million!

New Comments to this post are disabled

Search

Syndication

Barry Gervin