flat file creation question

June 24, 2019 at 12:52:29
Specs: Windows 7, 8GB
This is the objective I would like to achieve:
I would like to create a flat file containing all the file names of a given drive/partition, or directory.
It must be one line per file entry, using whatever separator suits (like a colon, or anything else).
The data that must be included for EACH file is:
- its name
- its path
- its last modification date & time, in separate fields for year, month, day, hour and minutes (hour in 24-hour notation)
- its size in bytes
- a file content verification of some kind, like an MD5 checksum
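
For illustration, one record could look like this (the "|" separator, the field order and all values are only made-up examples):

myphoto.jpg|D:\Pictures\2018\myphoto.jpg|2018|7|14|16|42|2483920|9E107D9D372BB6826BD81D3542A419D6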

My question here is: which method, which software or which programming language could best be used for such an objective?

Note that some drives contain millions of files, which means the resulting flat file would also contain exactly that many lines. The issue is not just that these files would be big, but that the time to create them should be limited. Stated differently: it should also be a method that is focused on performance. I don't expect such a file to be created in seconds or minutes, but beyond that: the less, the better.

In addition, I would also add the following: container files - like, but not limited to, Zip files - should be processed as well, in the sense that all the files within such a structure are treated in the same way as the "normal" files.

The creation of the flat file will be used as input for further processing of the data, but on purpose the question here only concerns the creation of the flat file in a uniform manner, so that the subsequent processing can be anything.




#1
June 24, 2019 at 22:20:31
That sounds exactly like my "all.bat". I mainly use it as a shortcut for finding files and information on my drives, and it contains everything you mentioned except the checksum. I had planned on adding that as well. I made it to EXclude any browser cache files, and would recommend, if you want those included, making a separate "flat file" for them, since there are so many and you usually have no interest in them. It takes about 5 minutes to run on a 300 GB drive with currently around 350,000 files. It does not deal with zipped files' content.
@echo off & setlocal
:: takes a "snapshot" of the disk for comparison purposes between operations.
:: the files' fullpath, date+time, and size are recorded for posterity, in
:: case any operation changes any of these properties of the file. An "fsum"
:: on each file would be REALLY nice, to see if content changed!
:: note that option "/a" with no specs will get ALL files {h,s incl}
set specs=
if "%1"=="" set out=%~n0 else (
set out=%1
set specs=*.%1
)
:: this is normally not used, but it's handy for other bats to use this
echo output going to %out%fils -- specs = %specs%
:: see notes on c:\all.bat. changed format to timestamp*size...
echo xxx browser cache files will be filtered out
(for /f "tokens=*" %%a in ('dir /a-d /b /s %specs%^| findstr /v /i "xxx\\.*\\.*"') do echo *%%~ta*%%~za*%%~nxa*%%~pa*)>%out%fils

This might work for fsum - untried. (Limited trials: the fsum option multiplied the time by at least 10 on a fairly small sample, from 4 secs to 40 secs, so a full-disk run with fsum will incur significant delays.)
(for /f "tokens=*" %%a in ('dir /a-d /b /s %specs%^| findstr /v /i "xxx\\.*\\.*"') do call :xx "*%%~ta*%%~za*%%~nxa*%%~pa*" "%%a")>%out%fils
goto :eof

:xx
for /f "tokens=1" %%b in ('fsum -jnc %2 2^>nul') do echo %~1%%b*

message edited by nbrane



#2
June 25, 2019 at 00:46:06
That is interesting, especially the speed of it.

The FSUM utility - is this the one you use: http://www.fastsum.com/support/user... ?



#3
June 25, 2019 at 01:12:53
For future processing, you probably want a separator that cannot appear in the output, which means we need to pick from the list of characters disallowed in file names, and exclude those that are used in path navigation (colon and backslash). That leaves us with the terminal control type characters, the input redirection characters, the wildcard characters, and null. I'll go with the pipe / bar, since .psv files do exist. Also, I see where your performance hit is coming from. You're generating hashes, which means any program or script can only move as fast as your storage I/O.

PowerShell makes the first part easy. It can also read .zip files. Opening other file container types would require more code than I'm willing to put in, or making use of a program like 7-zip.

[cmdletbinding()]param(
  [parameter(mandatory=$true)][String]$searchPath,
  [String]$outFilePath = 'out.psv'
)
dir $searchPath -Force -Recurse -File | 
 foreach { 
   try { $md5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5 -ErrorAction Stop).Hash }
   catch { $md5 = $null; Write-Error $_ }
   @($_.Name, $_.FullName, $_.LastWriteTime.ToString('yyyy|M|d|H|m'), $_.Length, $md5) -join '|'
 } | 
 Out-File $outFilePath
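
To run it, save the above as a .ps1 file (the name below is just an example) and pass the path to scan:

.\Get-FileInventory.ps1 -searchPath 'D:\' -outFilePath 'D:\inventory.psv'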

How To Ask Questions The Smart Way




#4
June 25, 2019 at 01:18:24
Razor, can you elaborate on this: "generating hashes"?

I don't know what that means, or where these hashes occur ...
And what is the solution to that problem?



#5
June 25, 2019 at 04:33:10
> It takes about 5 minutes to run on a 300-G drive
> with currently around 350,000 files.

This action takes around 1 hour and 20 minutes on my C: drive (the only big drive I currently have at my disposal). I'm including all files, so also the browser cache files, but I don't think that should make that much of a difference. This is about 338 thousand files with a total size of roughly 200 GB.



#6
June 25, 2019 at 10:41:40
Hashing is the process of taking a lot of data and running through all of it to get a (hopefully unique) value. Generating a CRC32, MD5, SHA, or whatever of every file requires you to open every file, read it all as if you were doing a file copy, and then close it. This'll take hours on any heavily used disk, and it's a huge bottleneck in your data collection.
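
For a rough sense of the cost, you can time a single hash (the path below is just an example); the elapsed time is roughly the file's size divided by the disk's sequential read speed:

# time the MD5 of one large file
Measure-Command { Get-FileHash -LiteralPath 'D:\backups\big-archive.zip' -Algorithm MD5 } |
 Select-Object TotalSeconds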

How To Ask Questions The Smart Way

message edited by Razor2.3



#7
June 25, 2019 at 12:00:16
OK I understand.

However, I need some kind of checksum in order to go and hunt for duplicate files, also across different disks.
What I didn't really mention is that I'll be running this primarily on backup and archive disks. Some of these disks may actually never change again, some may. My project is not really meant to find files on active disks; it is aimed more at finding files in backups and archives.
I'm now running the tool on my C: disk because it is easier to test that way.

But ... I need the checksums. :) If it takes a long time, that's OK.

message edited by Looge



#8
June 25, 2019 at 18:33:52
My bad! That was the C drive, not the 300 GB drive. The time for the 300 GB drive with 350,000 files was 25 minutes. I think the fsum-to-flat-file option is probably not a realistic option.

I just use the file's size. Some types of files have identical sizes, but most are unique. My method to date is to search the "flat file" for an exact size match (as well as extension, but not name or date). Any that are found can then be fsummed to check for an exact content duplicate. This saves the time expenditure of having to fsum all the files on each "snapshot". For example: I need to look for "A.exe", which is 34227 bytes. I run a for-loop find on the flat file targeting the size and extension and generate a temp file of everything that matches. Then, using that subset, I do the fsum on those and see if any match my target. (fc /b could also work; not sure which is faster, but probably the fsum utility - SlavaSoft, ca. 2007.)
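
In PowerShell terms (using the pipe-separated file from #3), a minimal sketch of that triage might look like this - the paths and file names are only examples:

# the file we want to find duplicates of
$target = Get-Item 'C:\somewhere\A.exe'
$targetHash = (Get-FileHash -LiteralPath $target.FullName -Algorithm MD5).Hash
# cheap pass: match on size and extension in the flat file, then hash only those few candidates
Import-Csv 'out.psv' -Delimiter '|' -Header Name,FullName,Year,Month,Day,Hour,Minute,Size,MD5 |
 Where-Object { [int64]$_.Size -eq $target.Length -and $_.Name -like ('*' + $target.Extension) } |
 Where-Object { (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash -eq $targetHash } |
 Select-Object -ExpandProperty FullName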

message edited by nbrane



#9
June 26, 2019 at 00:17:32
I also considered that, but I don't know how it will turn out in practice.
Say I have 10 big/huge external disks. I do a search on files in a central location, based on file size. This may (or will) find matching file names, possibly spread out over some, several, or all of the disks.

I then have to connect each of those specific disks, just to do a checksum on one or very few files.

From my experience, getting a checksum of some files really takes a long time; big Zip files are typically the case there.

It's a difficult call, since having all of these checksums is overkill, but not having them feels like I'm missing something.



#10
June 26, 2019 at 00:27:26
Just a question: on that C: drive, do you have Windows Indexing enabled?


#11
June 26, 2019 at 19:59:25
Yes, but I don't use it. My processor is slow (by current standards), and I prefer to use the methods I've been using for locating files.
Fsum is great for identifying/verifying files, but some form of triage needs to be used. Do you really want to FSUM all the crap in your browser cache? All the myriad files generated with every key-stroke or mouse move? The burden of TMI is that OS control of the user interface entails ever-increasing loads of overhead, redundancy, and garbage. The ideal would be a means to discriminate between "actual" actions and incidental actions undertaken without user knowledge or consent. But, of course, there will probably be some that DO need to be audited and many that do not. If you have enough drive space, mirroring might be the only "automatic" option that ensures complete recovery, and even that only captures the current file system, not past ones (for which you enter a never-ending cyclic-redundancy combinatorial disaster of epic proportions).
I guess it all depends on what you really need, how you discriminate, and how anal you are about keeping a record of every bit past and present. The only system that even comes close to that is Earth's biosphere and the DNA that supports it. That should tell you something...
I do understand your problem with multiple drives (and possibly OSes), and the difficulty of pinning one down for an exact duplicate, but I can't see past the fsum bottleneck.

message edited by nbrane



#12
June 27, 2019 at 00:40:38
I need to read that a couple of times to really understand it, but I do have 1. the time and 2. the disk space ... Indeed, I do not need the browser cache files, but this is the same challenge I have in my backup scripts (which exist independently of this project): I indeed don't need everything, but the time needed to exclude this or that just isn't worth it. It's especially an issue for the scripting itself, since it gets confusing and time consuming. For the backup I ended up with: on the C: drive, back up everything except one thing: C:\Windows and everything in it. So I do have all the crap in C:\Users, indeed. And these same files - once backed up - do indeed need to go through this project (the one from this thread).

The checksum is indeed slowing things down, a lot in fact. I found that big files really take a long time ... but the upside is that these files are ... big, and thus cover a lot of content.
Currently it takes a couple of hours for a medium to big disk, and up to a calculated full working day for huge disks (1,000,000+ files). This is without the checksum.
With it, that will roughly double.
Bear in mind, some of these disks are backup disks. I am going to keep them for years, some without ever changing.



#13
June 27, 2019 at 10:47:27
Unless this is a one-off thing, what needs to happen is for you to have a database of files; then you only hash the stuff you don't already have or that has changed. Or don't fret about execution times. Either/or.
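
A minimal sketch of that incremental idea, assuming a previous run's pipe-separated output exists (the file names and paths below are just examples): reuse the stored MD5 when path, size, and last-write time are unchanged, and only hash new or changed files.

# load the previous inventory into a lookup table keyed on path|size|timestamp
$old = @{}
Import-Csv 'previous.psv' -Delimiter '|' -Header Name,FullName,Year,Month,Day,Hour,Minute,Size,MD5 |
 ForEach-Object { $old["$($_.FullName)|$($_.Size)|$($_.Year)-$($_.Month)-$($_.Day) $($_.Hour):$($_.Minute)"] = $_.MD5 }

dir 'D:\data' -Force -Recurse -File | ForEach-Object {
  $key = '{0}|{1}|{2}' -f $_.FullName, $_.Length, $_.LastWriteTime.ToString('yyyy-M-d H:m')
  if ($old.ContainsKey($key)) { $md5 = $old[$key] }                              # unchanged: reuse the stored hash
  else { $md5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash }    # new or changed: hash it now
  @($_.Name, $_.FullName, $_.LastWriteTime.ToString('yyyy|M|d|H|m'), $_.Length, $md5) -join '|'
} | Out-File 'out.psv'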

Getting MD5s from zipped files ended up being easier than I anticipated. I figured I would need to decompress each file to a memory container and use a .NET object to get the hash. Turns out the zip library just gives you a stream, and Get-FileHash will just take a stream. I just need to worry about some minor cleanup! It does add a lot of lines to the script, though.

[cmdletbinding()]param(
  [parameter(mandatory=$true)][String]$searchPath,
  [String]$outFilePath = 'out.psv'
)
Add-Type -AssemblyName System.IO.Compression.FileSystem -ErrorAction SilentlyContinue

function Format-Entry {
param($entry, $md5, [String]$addToPath = [String]::Empty)
  if ($addToPath) { $fullName = ($addToPath, $entry.FullName -replace '/','\' -join ':') }
  else { $fullName = $entry.FullName }
  @($entry.Name, 
    $fullName, 
    $entry.LastWriteTime.ToString('yyyy|M|d|H|m'), 
    $entry.Length, $md5.Hash) -join '|'
}

function Inventory-Zip {
param([String]$zipPath)
  $zipFile = [IO.Compression.ZipFile]::OpenRead($zipPath)
  $ret = @()
  foreach ($e in $zipFile.Entries) {
    if ($e.Length -gt 0) {
      $eStream = $e.Open()
      $ret += Format-Entry $e (Get-FileHash -InputStream $eStream -Algorithm MD5) $zipPath
      $eStream.Dispose()
    }
  }
  $zipFile.Dispose()
  return $ret
}

dir $searchPath -Force -Recurse -File | 
 foreach { 
   if ($_.Extension -ne '.zip') {
     try { $md5 = Get-FileHash -LiteralPath $_.FullName -Algorithm MD5 -ErrorAction Stop }
     catch { $md5 = $null; Write-Error $_ }
     Format-Entry $_ $md5
   } else {
     Inventory-Zip $_.FullName
   }
 } | 
 Out-File $outFilePath

How To Ask Questions The Smart Way

message edited by Razor2.3



#14
June 27, 2019 at 20:10:20
@razor: I tend to steer away from pipes as delimiters, simply from my experience with batch (which, I know, is deprecated in favor of PowerShell, but...). This opens up those container files in a neat way.
@op: I think I understand the hardware and time/effort problems of connecting multiple volumes. The "across-the-board" fsum would eliminate the need to check files with the same size and/or date and/or name to identify which volume(s) a file resides on. It looks to me like a trade-off between index-file generation time and how often you actually need to do a multi-volume search. That, in turn, depends on how many of the volumes are usually, or conveniently, online. If you have the volumes mounted in USB "sleds", they can be swapped out fairly easily to any computer on your network, or simply left hooked up to any computers that are kept online on the network. Or invest in terabyte drives for the project to minimize the number of volumes.
PowerShell will most likely beat batch in all performance categories, but I am not familiar enough with it to suggest any code. Razor is your friend when it comes to that!


#15
June 27, 2019 at 20:17:38
Way back in #3, I explained why I picked pipes. Yeah, if your future processing is in batch, you can just always double-quote the file name and path, and get away with commas. If it's any other language, pipes will just make future steps easier.

How To Ask Questions The Smart Way


