I love the questions in MSDN forums! They lead me down new paths in SharePoint that I would have never tried.
Such a simple question: "if a document is uploaded to doc library, will the checksum of the document change". (http://social.msdn.microsoft.com/Forums/en-US/sharepointdevelopment/thread/4b184d0a-8fce-4ec7-a9d9-7badf205b85e)
Well, I know of a few cases where it can, but I'll come back to that. Here's the very weird thing I discovered... I uploaded the same file (a Word 2003 document) three different ways and got two different file sizes in the library, and when downloaded they were all different from the originally uploaded file. SharePoint changed the files!
Test using a Word 2003 (.doc) document
(File size from right-clicking the file in Windows Explorer and selecting Properties)
- Original file on disk (C:): 139,264 bytes
- File uploaded by clicking "Upload" in the library and then checking its size from Open with Windows Explorer:
- File uploaded with "Upload Multiple":
- File uploaded by dragging from C: (Windows Explorer) to Open with Windows Explorer:
139,264 bytes (but I think this is the original file size due to some caching – when I reopened the Open with Windows Explorer it changed to 139,776)
Now I downloaded the file using drag and drop from Open with Windows Explorer:
- File uploaded by clicking "Upload": 140,288 bytes
- File uploaded by clicking "Upload Multiple": 139,776 bytes
- File uploaded by dragging from C: (Windows Explorer) to Open with Windows Explorer: 139,776 bytes
Next I wrote some .NET code to access the documents via the API. The size reported by SPListItem.File.Length is the same as the downloaded numbers (140,288, 139,776, 139,776).
Remember... the original file on C: was 139,264 bytes.
Now I opened each of the "uploaded and then downloaded" files in a HEX viewer:
- The file uploaded by clicking "Upload Multiple" and the file uploaded by dragging from C: (Windows Explorer) to Open with Windows Explorer:
All bytes are identical to the original until the end of the file, where there are what look like random bytes (different in the two files) and an incomplete fragment of an XML structure (the same in both files). (Junk in the upload buffer???)
- File uploaded by clicking "Upload"
First byte changed from 00 to D0
Bytes/text added at the end of the file, containing metadata from the library columns!
So... I don't think you can rely on the checksum of the file!
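If you want to repeat this kind of check without a hex viewer, here is a minimal sketch in Python (not the .NET code I used). It assumes you have the original file and the copy downloaded from the library side by side on disk. The function names sha256_of, first_difference and compare are made up for this example, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(original, downloaded):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a strict prefix of the other, the offset returned is the
    length of the shorter file, which is where the appended bytes start.
    """
    a = Path(original).read_bytes()
    b = Path(downloaded).read_bytes()
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare(original, downloaded):
    """Summarize how a downloaded copy differs from the original file."""
    return {
        "same_checksum": sha256_of(original) == sha256_of(downloaded),
        "size_delta": Path(downloaded).stat().st_size - Path(original).stat().st_size,
        "first_difference": first_difference(original, downloaded),
    }
```

A positive size_delta together with a first_difference equal to the original file's length would suggest bytes were only appended. A first_difference of 0 would match the changed first byte I saw with the "Upload" button.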
Test using a TXT or PDF file
A simple text (.TXT) file was unchanged by SharePoint and always reported the same file length no matter how I uploaded or downloaded it. The same was true for a PDF.
Test using an Excel 2003 file
Size on drive C: 22,528 bytes
Size inside of SharePoint: 30,208 bytes (regardless of how it was uploaded or downloaded)
Test using an Excel 2007 file
Size on drive C: 16,501 bytes
Size inside of SharePoint: 19,076 bytes (regardless of how it was uploaded or downloaded)
This one gets even more interesting! Office 2007 files are actually ZIP files. It looks like SharePoint unzipped the file, added some folders and files, and then rezipped it. Here’s what changed between the before and after:
- New folder in the root of the zip file called “[trash]” that contains two tiny files “0001.dat” and “0002.dat”.
- The file sizes in “_rels” have changed
- The “docProps” folder has a new file named “custom.xml”
- The “docProps\core.xml” file has been changed to include the SharePoint content type
- and probably some other things I missed
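The before and after comparison above can be done by hand in any ZIP tool, but a small script makes it repeatable. This is a rough sketch in Python using the standard zipfile module. It assumes you have both the original .xlsx and the copy downloaded from SharePoint. The names zip_entries and diff_zip are invented for this example, and it treats an entry as "changed" when its uncompressed size or CRC-32 differs.

```python
import zipfile


def zip_entries(path):
    """Map each entry name in a ZIP archive to its (size, CRC-32) pair."""
    with zipfile.ZipFile(path) as archive:
        return {
            info.filename: (info.file_size, info.CRC)
            for info in archive.infolist()
        }


def diff_zip(original, downloaded):
    """List entries added, removed, or changed between two archives."""
    before = zip_entries(original)
    after = zip_entries(downloaded)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(
            name
            for name in before.keys() & after.keys()
            if before[name] != after[name]
        ),
    }
```

For the Excel 2007 file above I would expect entries such as the new custom.xml to show up under "added" and core.xml under "changed", but I have not run this sketch against those exact files.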
Now back to the ways I already knew files can be different after upload...
- If you are using Information Rights Management - the IRM wrapper is removed on upload (so files can be indexed for search) and reapplied on download.
- If you have fields bound between an Office document (Word, etc) and the columns in the library, a user who edits the library columns will also be changing the content of the file.
Summary (or at least some guesses)
- SharePoint modifies Office 2003 documents by appending library metadata to the end of files uploaded with the "Upload" button, but not for Upload Multiple or Explorer drag and drop.
- SharePoint modifies Office 2007 documents by changing the files inside the “DOCX/XLSX” ZIP package, regardless of how they are uploaded.
- SharePoint does not appear to modify non-Office files.
If you know anything about this, please post a comment and share!