Identical Files. But Not.

Postby Hammer » Mon Apr 09, 2012 2:09 pm UTC

OK, I'm ripping my hair out and you all know what happens to this forum if I go crazy, so you all best help me figure out what evil spirit I've offended and what must be sacrificed or I might do something rash like set everyone's title to "Belial Watches Fox News" or start banning people at random or make this post the only one anyone can see until it gets solved. You've been warned. This is me.

Seriously, I'm baffled.

I am generating a csv file with PHP with a .txt extension on a CentOS server.

If I FTP this file down using either ASCII or Binary mode, it works fine. If I right-click and Save As from my browser (happens in IE and Firefox), it blows up my merge document (Word 2007).

To make the bad file good, all I have to do is open it in GVim and save it. Saving it in Notepad does not make it good. I've tried every combination of line endings under the sun.

The files are exactly the same size.
The md5 checksums are identical in both text and binary mode.
Inspection with od says the files are identical.
Diff says the files are identical.
A C program that does a byte by byte inspection says the files are identical.

I'm at a loss as to what is even different about these two files, never mind why they are different. Has anybody ever seen anything like this?
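For anyone wanting to repeat the content checks listed above, here is a minimal Python sketch of the MD5 and byte-by-byte comparisons. The function names are made up for illustration. It reads only the bytes inside each file, so it cannot see anything the filesystem stores about the file outside its contents.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of the file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One file is a strict prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

If `first_difference` returns None and the two digests match, the contents really are the same, and the difference has to be somewhere else.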
"What's wrong with you mathematicians? Cake is never a problem."
Hammer
Because all of you look like nails.
 
Posts: 5486
Joined: Thu May 03, 2007 7:32 pm UTC

Re: Identical Files. But Not.

Postby Xanthir » Mon Apr 09, 2012 2:20 pm UTC

Based on your information, the only thing I can think of is that Windows is tagging the file as "from the internet" when you do a Save As from your browser, and somehow this breaks things. This metadata is held outside the actual file's data, so nothing else can detect the difference. FTP programs don't trigger the tagging, and saving it in another program whitewashes it.

I have no idea how this would break anything, but it's the best I've got.
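On NTFS the "from the internet" tag lives in an alternate data stream named Zone.Identifier, attached to the file but separate from its contents, which is why tools that read only the contents see nothing. Below is a minimal Python sketch of reading it, assuming Windows and an NTFS volume. The function names are illustrative, and on other filesystems `read_zone_id` simply returns None.

```python
import configparser


def zone_stream_path(path):
    """Name of the alternate data stream that holds the download mark."""
    return f"{path}:Zone.Identifier"


def parse_zone_id(stream_text):
    """Extract ZoneId from Zone.Identifier text; None if it is absent."""
    # Interpolation is disabled so that '%' in other values cannot cause errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(stream_text)
    if parser.has_option("ZoneTransfer", "ZoneId"):
        return parser.getint("ZoneTransfer", "ZoneId")
    return None


def read_zone_id(path):
    """Return the ZoneId recorded for path, or None if the file is unmarked."""
    try:
        with open(zone_stream_path(path), "r") as stream:
            return parse_zone_id(stream.read())
    except OSError:
        # No such stream, or a filesystem without alternate data streams.
        return None
```

A ZoneId of 3 corresponds to the Internet zone, which is the value a browser download typically gets.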
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))
Xanthir
My HERO!!!
 
Posts: 4002
Joined: Tue Feb 20, 2007 12:49 am UTC
Location: The Googleplex

Re: Identical Files. But Not.

Postby Hammer » Mon Apr 09, 2012 2:35 pm UTC

Xanthir wrote:Based on your information, the only thing I can think of is that Windows is tagging the file as "from the internet" when you do a Save As from your browser, and somehow this breaks things. This metadata is held outside the actual file's data, so nothing else can detect the difference. FTP programs don't trigger the tagging, and saving it in another program whitewashes it.

I have no idea how this would break anything, but it's the best I've got.


Gee Willikers THAT'S IT THAT'S IT THAT'S IT THAT'S IT THAT'S IT!!!!

Thank you so much, Xanthir! I've been staring at this for days. What was I thinking, looking for the problem with a file in the problem file instead of somewhere else entirely? I hate Windows. But I love you.

Re: Identical Files. But Not.

Postby Yakk » Mon Apr 09, 2012 2:48 pm UTC

I like Xanthir's theory, but cannot confirm it. If that is the case, we could poke at it and see what clears the 'from the internet' tag, and/or experiment with what would stop the explosion of the merge document.
Hammer wrote:To make the bad file good, all I have to do is open it in GVim and save it. Saving it in Notepad does not make it good. I've tried every combination of line endings under the sun.

These are somewhat tangential but... I'm assuming by 'every combination of line endings' you mean "CRLF" vs "CR"? (and not, say, "go to top left of the page", then a set of LF until you reach the right line number...)

Does it have an empty final line? (Some programs behave poorly if you don't have that empty final line in really quirky ways)
Hammer wrote:If I FTP this file down using either ASCII or Binary mode, it works fine. If I right-click and Save As from my browser (happens in IE and Firefox), it blows up my merge document (Word 2007).

A detail you are missing -- clearly the Save As didn't blow up the merge document. It was the operation afterwards -- when you used (Word 2007) to merge it into a previous version of the same document, I'm guessing?

What happens if you copy the file (to a new name) using windows explorer? A dos command line? Notepad, where you highlight the entire file, copy, then open a new notepad, paste, and save it out under a different name? (I could see gvim saving when you tell it to save, while notepad being all smart and saying "nothing to do here, I won't save it out" -- and I could see gvim deleting the file then recreating it, while notepad might reopen the file for writing without deleting it, or other minor differences.)

...

Here is information on the "from the internet" flag:
http://www.howtogeek.com/70012/what-cau ... remove-it/
Sysinternals streams can list and delete these streams:
http://technet.microsoft.com/en-us/sysi ... s/bb897440
(I already had it from downloading some Sysinternals bundle of file system utilities.)
Run it and see if the operation that cleans the file also removes the "from the internet" stream.

I'm wondering if the "from the internet" flag must cause the problem, or if there is a way to make it take the "from the internet" file and process it correctly by changing its contents.

Assuming you save your sysinternals utilities to c:\bin,
c:\bin\streams -d $filename$ will delete the streams associated with $filename$.

You can also see streams by typing dir /r (without the streams command). I haven't spotted how to delete an ADS (alternate data stream) without the streams command (or using the same API it uses)...
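One possible way to delete the stream without the streams utility, sketched in Python. It assumes Windows on an NTFS volume, where the stream can be addressed as filename:Zone.Identifier and removed with an ordinary file delete. The name `unblock` is made up here, and this is only a sketch of the idea, not a tested replacement for streams -d.

```python
import os


def unblock(path):
    """Delete the Zone.Identifier stream of path; True if one was removed."""
    try:
        # Only the named stream is deleted; the file's contents are untouched.
        os.remove(f"{path}:Zone.Identifier")
        return True
    except FileNotFoundError:
        # The file carries no download mark.
        return False
```

Calling `unblock` on a file that was never marked is harmless and just returns False.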

...

I wonder what it is that Word is reading from the ADS that makes the merge blow up. I'd be tempted to look at the file access involving the file in question, and maybe do a diff on the resulting logs. (procmon from Sysinternals can be told to report everything that reads a file called "foobar".) That would generate more information, but "delete the ADS" is probably the real answer to the practical problem.
Last edited by Yakk on Mon Apr 09, 2012 3:00 pm UTC, edited 1 time in total.
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.
Yakk
 
Posts: 10039
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Identical Files. But Not.

Postby Hammer » Mon Apr 09, 2012 2:59 pm UTC

Yakk wrote:I like Xanthir's theory, but cannot confirm it.

I can. If you right-click on the file and look at its properties, you see a message about how the file came from another computer and has been blocked for your protection. There's an Unblock button next to it. Click the button and all the issues go away.

This is especially fun since it only exists on XP and Vista, and because whether it happens to any particular file is based on a less-than-predictable set of invisible rules about the file type, the directory to where it's being saved, and what you have done previously in that directory. It's also silent. This is not one of Mickey's better features.

Re: Identical Files. But Not.

Postby Yakk » Mon Apr 09, 2012 3:03 pm UTC

Yes. That is pure awesome.

Almost as awesome as the fact that Word 2007 both reads that ADS and then decides to go batshit insane when it finds it.

Re: Identical Files. But Not.

Postby Hammer » Mon Apr 09, 2012 3:11 pm UTC

Yakk wrote:Yes. That is pure awesome.

Almost as awesome as the fact that Word 2007 both reads that ADS and then decides to go batshit insane when it finds it.


The really fun part is that the initial merge with that file works fine. No indication whatsoever that there is an issue. But when you save the docx, close it, and reopen it, it tries to reattach the data source, throws a whole bunch of misleading error messages, and then corrupts the document. And I couldn't even just fix the file and walk away because I have a bunch of non-computer-savvy old ladies using this process. I had to find out why this was happening. I am less than pleased with Microsoft at this particular moment. LESS THAN PLEASED!!!

Re: Identical Files. But Not.

Postby Xanthir » Mon Apr 09, 2012 4:06 pm UTC

Yay Hammer!

And yay me, I finally have a title! \o/

Re: Identical Files. But Not.

Postby Jplus » Tue Apr 10, 2012 11:34 am UTC

Completely orthogonal to the already solved issue, but I thought you might like to know about the Windows native Notepad++.
Hey, like coding? Perhaps you should check out the red spider project.
Feel free to call me Julian. J+ is just an abbreviation.
Jplus
 
Posts: 1091
Joined: Wed Apr 21, 2010 12:29 pm UTC

Re: Identical Files. But Not.

Postby Hammer » Tue Apr 10, 2012 7:49 pm UTC

Jplus wrote:Completely orthogonal to the already solved issue, but I thought you might like to know about the Windows native Notepad++.

I know. But I like vi ... :oops: