Avatar of yo_bee


Follow up to this question: 29049283: How can I reduce the size of the data being inserted to the MSSQL DB

This is a follow-up to this question:
https://www.experts-exchange.com/questions/29049283/Powershell-Dir-Output-to-of-large-file-system-What-is-the-most-ideal-method-to-use-to-store-the-results.html?anchorAnswerId=42251236#a42251236

The accepted solution is doing what I want it to do, but I have thousands of duplicated values based on the directories and files being queried. As Qlemo suggested, I would be better off creating a second table (or more) to house the distinct values and just referencing them, rather than building one monster flat table.

So let's use this as an example:

\\Server1\Folder-A\subfolder-x\File0001 through \\Server1\Folder-A\subfolder-x\File1000: there is no need to keep writing \\Server1\Folder-A\subfolder-x on every one of those 1,000 rows.

Can someone (Qlemo, or anyone else who helped) point me in the right direction on how to accomplish this so my DB can be smaller?
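
A minimal sketch of the kind of two-table layout being asked about. The database name (FileInventory), table names (Directories, Files), and column sizes are assumptions for illustration, not taken from the actual scripts:

# Sketch only: hypothetical schema; adjust names and sizes to your environment.
$ddl = @"
CREATE TABLE dbo.Directories (
    DirectoryId INT IDENTITY(1,1) PRIMARY KEY,
    -- 450 characters keeps the unique key inside SQL Server's 900-byte index limit;
    -- longer paths would need a hashed key column instead.
    Path        NVARCHAR(450) NOT NULL,
    CONSTRAINT UQ_Directories_Path UNIQUE (Path)
);
CREATE TABLE dbo.Files (
    FileId         UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    DirectoryId    INT NOT NULL REFERENCES dbo.Directories(DirectoryId),
    Name           NVARCHAR(450) NOT NULL,
    CreationTime   DATETIME2 NULL,
    LastWriteTime  DATETIME2 NULL,
    LastAccessTime DATETIME2 NULL
);
"@
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'FileInventory' -Query $ddl

With a layout like this, \\Server1\Folder-A\subfolder-x is stored once in Directories, and each of the 1,000 file rows carries only the integer DirectoryId.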
Avatar of aikimark

Are you trying to populate the database with file information or directory information?
Avatar of yo_bee

ASKER

yes
I am using the NTFSSecurity Module to handle long file names and iterating through a very large volume on my NAS
I know you are iterating through your large NAS volume. After reading the prior related question and this question, it still isn't clear to me whether you are trying to store file-related information or directory-related information in your database.

Please clarify this.
Avatar of yo_bee

ASKER

I am storing a unique key (random GUID), Directory, Name, CreationTime, LastWriteTime, and LastAccessTime (6 fields in all).
Most of these directories are deeply nested folder structures, so this leads to a lot of repetitious data that, as suggested, could be stored in another table and referenced.
yo_bee

Are you storing one row per file or one row per directory? Neither of your last two comments answered this question.
It is one row per file.
Avatar of yo_bee

ASKER

If you run
Get-ChildItem -Path "Z:\traffic" | Select Name,DirectoryName,Length,LastAccessTime,LastWriteTime,CreationTime,Extension


This is the output being written to the SQL DB. Each file creates a new record, which in turn creates a lot of repetitive directory entries. So, as Qlemo suggested in my previous question, I want to write the directory to a separate table to reduce the total size of the DB.

Here is a sample of the output
test.csv
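
As a rough illustration of how much repetition is in play, something like the following (a sketch; -Recurse and -File are added to the command above, and the variable names are illustrative) counts files against distinct directories:

# Sketch: compare the number of file rows with the number of distinct directory names.
$files = Get-ChildItem -Path 'Z:\traffic' -Recurse -File |
    Select-Object Name, DirectoryName, Length, LastAccessTime, LastWriteTime, CreationTime, Extension
$distinctDirs = $files | Select-Object -ExpandProperty DirectoryName -Unique
'{0} files across {1} distinct directories' -f $files.Count, $distinctDirs.Count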
First thing I notice is that you've got a bunch of junk in there.
Example:
,DIALOG,4/28/2017 13:41,4/28/2017 13:41,4/28/2017 13:41,,
,RTF,4/28/2017 13:41,4/28/2017 13:41,4/28/2017 13:41,,
,SQLiteStudio,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,iconengines,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,imageformats,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,platforms,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,plugins,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,printsupport,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,



Second thing would be to create two tables, one for each unique non-empty directory and one for the file data.  You can put a unique index on the directory table to prevent duplicates.
Avatar of yo_bee

ASKER

You do realize that the CSV I posted was purely an example; it is not the real subset of data I am gathering. The purpose of this gathering is to present to the executives of the firm how much data is stale and how much is relevant.

As was suggested in my previous question, I should build a table that holds the distinct directory paths to reduce the size of the overall DB.

Second thing would be to create two tables, one for each unique non-empty directory and one for the file data.  You can put a unique index on the directory table to prevent duplicates.

  This is the reason for posting this question.
Avatar of yo_bee

ASKER

Also, I have consumed all the memory on my server (19 GB of RAM), and almost all of it is being used by SQL Server.
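
By default SQL Server takes as much memory as the OS will give it and releases it only under pressure. One common mitigation, sketched below, is to cap max server memory; the 8192 MB figure is an arbitrary example, not a sizing recommendation for this server:

# Sketch: cap SQL Server's memory so the bulk load doesn't starve the rest of the box.
$query = @"
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 8192;
RECONFIGURE;
"@
Invoke-Sqlcmd -ServerInstance 'localhost' -Query $query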
SOLUTION
Avatar of Qlemo

SOLUTION
Yes, an index on names might be helpful for searching, and an index on ID for joining, but both are best created only after the bulk inserts have been performed.
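
A sketch of what that could look like once the load has finished, reusing the hypothetical Directories/Files names from the schema sketch earlier in the thread:

# Sketch: add the search and join indexes only after the bulk inserts are done,
# so the load itself doesn't pay the index-maintenance cost.
$query = @"
CREATE INDEX IX_Files_Name        ON dbo.Files (Name);
CREATE INDEX IX_Files_DirectoryId ON dbo.Files (DirectoryId);
"@
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'FileInventory' -Query $query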
Avatar of yo_bee

ASKER

I ran into an issue with the process using up all available memory and it brought the job to a crawl.

I ended up killing the job. I plan to rerun the job.
ASKER CERTIFIED SOLUTION
Or you can put a unique index on the directory table and do this insert for every file.
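
A sketch of that per-file pattern, again using the hypothetical Directories/Files schema from above and plain System.Data.SqlClient; it is not the code from the accepted (member-only) answer, and it assumes $files holds the Get-ChildItem2 output:

# Sketch: insert the directory row only if it is missing, then insert the file row
# referencing it. The unique index on Directories.Path is the backstop against duplicates.
$connStr = 'Server=localhost;Database=FileInventory;Integrated Security=True'
$conn = New-Object System.Data.SqlClient.SqlConnection $connStr
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = @"
IF NOT EXISTS (SELECT 1 FROM dbo.Directories WHERE Path = @Dir)
    INSERT INTO dbo.Directories (Path) VALUES (@Dir);
INSERT INTO dbo.Files (DirectoryId, Name, CreationTime, LastWriteTime, LastAccessTime)
SELECT DirectoryId, @Name, @Created, @Written, @Accessed
FROM dbo.Directories WHERE Path = @Dir;
"@
foreach ($f in $files) {
    $cmd.Parameters.Clear()
    $null = $cmd.Parameters.AddWithValue('@Dir',      $f.DirectoryName)
    $null = $cmd.Parameters.AddWithValue('@Name',     $f.Name)
    $null = $cmd.Parameters.AddWithValue('@Created',  $f.CreationTime)
    $null = $cmd.Parameters.AddWithValue('@Written',  $f.LastWriteTime)
    $null = $cmd.Parameters.AddWithValue('@Accessed', $f.LastAccessTime)
    $null = $cmd.ExecuteNonQuery()
}
$conn.Close()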
Avatar of yo_bee

ASKER

The goal of this task is to scan my CIFS volumes and build a case for management that there is a $H!T load of stale data. So I feel the LastAccessTime and creation date are very important to building that case.

The reason I am using the Get-ChildItem2 cmdlet is to overcome the long file name issue that I know I will run into at some point.


Here is the error I am getting when I run what oBdA suggested.

Select-Object : Property cannot be processed because property "Directory" already exists.
At line:65 char:61
+ Get-ChildItem2 -Path $Path -Recurse -File -Verbose:$false | Select-Object -Prope ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (\\netapp01\Imag...001\A000271.TIF:PSObject) [Select-Object], PSArgumentException
    + FullyQualifiedErrorId : AlreadyExistingUserSpecifiedPropertyNoExpand,Microsoft.PowerShell.Commands.SelectObjectCommand
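
The command is truncated in the error text, but the message means the pipeline adds a calculated property named Directory while Get-ChildItem2 already returns a property of that name. A sketch of one way around it; the DirPath name and the expression are illustrative, not the original code:

# Sketch: give the calculated property a name that does not collide with what
# Get-ChildItem2 already emits (here DirPath instead of Directory).
Get-ChildItem2 -Path $Path -Recurse -File -Verbose:$false |
    Select-Object -Property Name, CreationTime, LastWriteTime, LastAccessTime,
        @{ Name = 'DirPath'; Expression = { Split-Path -Parent $_.FullName } }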
SOLUTION
Avatar of yo_bee

ASKER

Other than using Get-ChildItem2, do you know any other way to gather these attributes for files whose paths exceed the 260-character MAX_PATH limit?
Avatar of oBdA

AlphaFS for example, http://alphafs.alphaleonis.com/
Requires an external DLL, though.
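
For reference, a sketch of calling AlphaFS directly from PowerShell; the DLL path is hypothetical, and the assembly has to be downloaded separately from the link above:

# Sketch: load the AlphaFS assembly and enumerate files past the 260-character MAX_PATH limit.
Add-Type -Path 'C:\Tools\AlphaFS\AlphaFS.dll'   # hypothetical location of the DLL
[Alphaleonis.Win32.Filesystem.Directory]::EnumerateFiles(
        '\\Server1\Folder-A', '*', [System.IO.SearchOption]::AllDirectories) |
    ForEach-Object {
        [PSCustomObject]@{
            FullName       = $_
            LastWriteTime  = [Alphaleonis.Win32.Filesystem.File]::GetLastWriteTime($_)
            LastAccessTime = [Alphaleonis.Win32.Filesystem.File]::GetLastAccessTime($_)
        }
    }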
Avatar of yo_bee

ASKER

The cmdlet I am using, Get-ChildItem2 from the NTFSSecurity module, uses that DLL.
That would account for the differences in returned properties. The Get-ChildItem2.ps1 I found relied on the Windows APIs only.
That isn't really relevant to pumping the data into SQL, though.
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I have recommended this question be closed as follows:

Split:
-- oBdA (https:#a42258428)
-- Qlemo (https:#a42252907)
-- aikimark (https:#a42253017)
-- oBdA (https:#a42273066)


If you feel this question should be closed differently, post an objection; the moderators will review all objections and close it as they see fit. If no one objects, this question will be closed automatically as described above.

Pber
Experts-Exchange Cleanup Volunteer