11/20/2010

SharePoint Changes Files as they are Uploaded

 

I love the questions in MSDN forums!  They lead me down new paths in SharePoint that I would have never tried.

Such a simple question: "if a document is uploaded to doc library, will the checksum of the document change". (http://social.msdn.microsoft.com/Forums/en-US/sharepointdevelopment/thread/4b184d0a-8fce-4ec7-a9d9-7badf205b85e)

Well, I know of a few cases where it can, but I'll come back to that. Here's the very weird thing I discovered... I uploaded the same file (a 2003 Word document) three different ways and got two different file sizes in the library, and when downloaded they were all different from the originally uploaded file.  SharePoint changed the files!

 

Test using a Word 2003 (.doc) document

(File size from right-clicking the file in Windows Explorer and selecting Properties)

  • Original file on disk (C:):
       139,264 bytes
     
  • File uploaded by clicking "Upload" in the library and then checking its size from Open with Windows Explorer: 
       140,288 bytes
  • File uploaded with "Upload Multiple":
       139,776 bytes
  • File uploaded by dragging from C: (Windows Explorer) to Open with Windows Explorer:
       139,264 bytes  (but I think this is the original file size due to some caching – when I reopened the Open with Windows Explorer it changed to 139,776)

Now I downloaded the file using drag and drop from Open with Windows Explorer:

  • File uploaded by clicking "Upload"
      140,288
  • File uploaded by clicking "Upload Multiple"
      139,776 
  • File uploaded by dragging from C: (Windows Explorer) to Open with Windows Explorer:
      139,776 

Next I wrote some .NET code to access the documents via the API, and the size reported from SPListItem.File.Length is the same as the downloaded numbers (140,288, 139,776, 139,776).

Remember... the original file on C: was 139,264 bytes.
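The growth is easier to see as differences. Here is a minimal sketch in Python (not the .NET code mentioned above) that uses only the sizes reported in this post. The modulo-512 check is an added assumption and was not part of the original tests: 512 bytes is the sector size in version 3 of the compound file format used by Office 2003 documents, which may or may not explain the pattern.

```python
# Sizes in bytes, copied from the measurements in this post.
ORIGINAL = 139_264
observed = {
    "Upload": 140_288,
    "Upload Multiple": 139_776,
    "Explorer drag and drop": 139_776,
}


def size_deltas(original, sizes):
    """Return {method: (bytes_added, bytes_added % 512)} for each upload method."""
    return {
        name: (size - original, (size - original) % 512)
        for name, size in sizes.items()
    }


deltas = size_deltas(ORIGINAL, observed)
for name, (added, remainder) in deltas.items():
    # The 512 check is a guess at a pattern, not something SharePoint documents.
    print(f"{name}: +{added} bytes, remainder mod 512 = {remainder}")
```

With these numbers, "Upload" added 1,024 bytes and the other two methods added 512 bytes each.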

Now I opened each of the "uploaded and then downloaded" files in a HEX viewer:

  • The file uploaded by clicking "Upload Multiple" and the file uploaded by dragging from C: (Windows Explorer) to Open with Windows Explorer:
      All bytes identical until the end of the file, where there are what look like random bytes (different in both files) and an incomplete fragment of an XML structure (same in both files).  (junk in the upload buffer???)
     
  • File uploaded by clicking "Upload"
      First byte changed from 00 to D0
      bytes/text added at the end of the file with metadata from the library columns!
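The hex viewer comparison can also be scripted. The following is a minimal sketch in Python, assuming you have the original file and a downloaded copy on disk; the function names and the sample bytes are made up for illustration.

```python
from pathlib import Path


def compare_bytes(original: bytes, other: bytes):
    """Return (offset of the first differing byte, or None if the shared
    prefix is identical; number of extra bytes at the end of `other`)."""
    common = min(len(original), len(other))
    first_diff = next(
        (i for i in range(common) if original[i] != other[i]), None
    )
    return first_diff, len(other) - len(original)


def compare_files(original_path, downloaded_path):
    """Same comparison, reading both files fully into memory."""
    return compare_bytes(
        Path(original_path).read_bytes(),
        Path(downloaded_path).read_bytes(),
    )


# Tiny in-memory demonstration with made-up bytes (not real document data):
# the first byte differs and three bytes were appended.
first_diff, extra = compare_bytes(b"\x00ABC", b"\xd0ABCxyz")
print(first_diff, extra)  # 0 3
```

A result of (None, 512) would mean the downloaded copy matches the original byte for byte and then has 512 extra bytes at the end.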

So... I don't think you can rely on the checksum of the file!
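To check this on your own files, you can hash the original and the downloaded copy and compare the results. This is a minimal Python sketch; the choice of SHA-256 and the function names are assumptions, since the forum question does not say which checksum was meant.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path):
    """Hash a file on disk, for example the original or a downloaded copy."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)


# In-memory demonstration with made-up bytes (not real document data).
original = b"document body"
modified = original + b"<appended metadata>"
same = sha256_of_stream(io.BytesIO(original)) == sha256_of_stream(io.BytesIO(original))
changed = sha256_of_stream(io.BytesIO(original)) != sha256_of_stream(io.BytesIO(modified))
print(same, changed)  # True True
```

Any bytes appended to the file, as seen in the Word test, produce a different digest even though the document content looks the same when opened.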

 

Test using a TXT or PDF file

A simple text (.TXT) file was unchanged by SharePoint and always reported the same file length no matter how I uploaded or downloaded it. The same was true for a PDF.

 

Test using an Excel 2003 file

Size on drive C:  22,528 bytes

Size inside of SharePoint: 30,208 bytes   (regardless of how it was uploaded or downloaded)

 

Test using an Excel 2007 file

Size on drive C:  16,501 bytes

Size inside of SharePoint: 19,076 bytes   (regardless of how it was uploaded or downloaded)

This one gets even more interesting!  Office 2007 files are actually ZIP files. It looks like SharePoint unzipped the file, added some folders and some files and then rezipped the file. Here’s the before and after:

[Image: contents of the Excel 2007 ZIP file before and after upload]

 

What’s changed?

  • New folder in the root of the zip file called “[trash]” that contains two tiny files “0001.dat” and “0002.dat”.
  • The file sizes in “_rels” have changed
  • The “docProps” folder has a new file named “custom.xml”
  • The “docProps\core.xml” file has been changed to include the SharePoint content type
  • and probably some other things I missed

 

Now back to the ways I already knew files can be different after upload...

  • If you are using Information Rights Management - the IRM wrapper is removed on upload (so files can be indexed for search) and reapplied on download.
  • If you have fields bound between an Office document (Word, etc) and the columns in the library, a user who edits the library columns will also be changing the content of the file.

 

Summary (or at least some guesses)

  • SharePoint modifies Office 2003 documents uploaded with “Upload” by appending library metadata to the end of the file. In the Word test, Upload Multiple and Explorer drag and drop added only what looked like junk bytes.
     
  • SharePoint modifies Office 2007 documents by changing the files inside the DOCX/XLSX ZIP package, regardless of how they are uploaded.
     
  • SharePoint does not appear to modify non-Office files

 

If you know anything about this, please post a comment and share!


3 comments:

Bert said...

Also, .msg files (the Office e-mail save format) are changed, and both this and the Excel changes DO happen on Excel drag-and-drop (tested with WinMerge). The preamble and final bytes of Office files are changed. Tested with SharePoint Foundation 2010 and a PC with Office 2013.

Nam Lai said...

I have the same issue with uploading an image file. The original file size is: 102,278 bytes and the uploaded file size is: 102,716 bytes. However, this issue doesn't happen with all image files. Does anyone have any explanation for this issue? We have a requirement to restrict the image file size to 100KB.

Mike Smith said...

Nam,

I have not seen a file size change except for Office documents. What kind of image file were you using?

Mike
