Avatar of yo_bee


Follow up to this question: 29049283: How can I reduce the size of the data being inserted to the MSSQL DB

This is a follow-up to this question:
https://www.experts-exchange.com/questions/29049283/Powershell-Dir-Output-to-of-large-file-system-What-is-the-most-ideal-method-to-use-to-store-the-results.html?anchorAnswerId=42251236#a42251236

The accepted solution is doing what I want it to do, but I have thousands of duplicated values based on the directories and files being queried. As Qlemo suggested, I would be better off creating a second table (or more) to house the distinct values and just referencing them, rather than building one monster flat table.

So let's use this as an example:

\\Server1\Folder-A\subfolder-x\File0001 through \\Server1\Folder-A\subfolder-x\File1000: there is no need to keep writing \\Server1\Folder-A\subfolder-x on every one of those 1,000 rows.

Can someone (Qlemo, or anyone else who helped) point me in the right direction on how to accomplish this so my DB can be smaller?
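
A minimal sketch of the kind of two-table layout being asked about. The database name (FileInventory), table names (Directories, Files), and column sizes are assumptions for illustration, not taken from the actual scripts:

# Sketch only: hypothetical schema; adjust names and sizes to your environment.
$ddl = @"
CREATE TABLE dbo.Directories (
    DirectoryId INT IDENTITY(1,1) PRIMARY KEY,
    -- 450 characters keeps the unique key inside SQL Server's 900-byte index limit;
    -- longer paths would need a hashed key column instead.
    Path        NVARCHAR(450) NOT NULL,
    CONSTRAINT UQ_Directories_Path UNIQUE (Path)
);
CREATE TABLE dbo.Files (
    FileId         UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    DirectoryId    INT NOT NULL REFERENCES dbo.Directories(DirectoryId),
    Name           NVARCHAR(450) NOT NULL,
    CreationTime   DATETIME2 NULL,
    LastWriteTime  DATETIME2 NULL,
    LastAccessTime DATETIME2 NULL
);
"@
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'FileInventory' -Query $ddl

With a layout like this, \\Server1\Folder-A\subfolder-x is stored once in Directories, and each of the 1,000 file rows carries only the integer DirectoryId.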
Avatar of aikimark

Are you trying to populate the database with file information or directory information?
Avatar of yo_bee

ASKER

yes
I am using the NTFSSecurity Module to handle long file names and iterating through a very large volume on my NAS
I know you are iterating through your large NAS volume. After reading the prior related question and this question, it still isn't clear to me whether you are trying to store file-related information or directory-related information in your database.

Please clarify this.
Avatar of yo_bee

ASKER

I am storing a unique key (random GUID), Directory, Name, CreationTime, LastWriteTime, and LastAccessTime (6 fields in all).
Most of these directories are deeply nested folder structures, so this leads to a lot of repetitious data that, as suggested, could be stored in another table and referenced.
yo_bee

Are you storing one row per file or one row per directory? Neither of your last two comments answered this question.
It is one row per file.
Avatar of yo_bee

ASKER

If you run
Get-ChildItem -Path "Z:\traffic" | Select Name,DirectoryName,Length,LastAccessTime,LastWriteTime,CreationTime,Extension


This is the output being written to the SQL DB. Each file creates a new record, which in turn creates a lot of repetitive directory entries. So, as Qlemo suggested in my previous question, I want to write the directory to a separate table to reduce the total size of the DB.

Here is a sample of the output
test.csv
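
As a rough illustration of how much repetition is in play, something like the following (a sketch; -Recurse and -File are added to the command above, and the variable names are illustrative) counts files against distinct directories:

# Sketch: compare the number of file rows with the number of distinct directory names.
$files = Get-ChildItem -Path 'Z:\traffic' -Recurse -File |
    Select-Object Name, DirectoryName, Length, LastAccessTime, LastWriteTime, CreationTime, Extension
$distinctDirs = $files | Select-Object -ExpandProperty DirectoryName -Unique
'{0} files across {1} distinct directories' -f $files.Count, $distinctDirs.Count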
First thing I notice is that you've got a bunch of junk in there.
Example:
,DIALOG,4/28/2017 13:41,4/28/2017 13:41,4/28/2017 13:41,,
,RTF,4/28/2017 13:41,4/28/2017 13:41,4/28/2017 13:41,,
,SQLiteStudio,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,iconengines,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,imageformats,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,platforms,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,plugins,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,
,printsupport,11/2/2016 5:56,5/19/2017 10:22,5/19/2017 10:22,,



Second thing would be to create two tables, one for each unique non-empty directory and one for the file data.  You can put a unique index on the directory table to prevent duplicates.
Avatar of yo_bee

ASKER

You do realize that the CSV I posted was purely an example; it is not the real subset of data I am gathering. The purpose of this gathering is to present to the executives of the firm how much data is stale and how much is relevant.

As was suggested in my previous question, I should build a table that holds the distinct directory paths to reduce the size of the overall DB.

Second thing would be to create two tables, one for each unique non-empty directory and one for the file data.  You can put a unique index on the directory table to prevent duplicates.

  This is the reason for posting this question.
Avatar of yo_bee

ASKER

Also, I have consumed all the memory on my server (19 GB of RAM), and almost all of it is being used by SQL Server.
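
By default SQL Server takes as much memory as the OS will give it and releases it only under pressure. One common mitigation, sketched below, is to cap max server memory; the 8192 MB figure is an arbitrary example, not a sizing recommendation for this server:

# Sketch: cap SQL Server's memory so the bulk load doesn't starve the rest of the box.
$query = @"
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 8192;
RECONFIGURE;
"@
Invoke-Sqlcmd -ServerInstance 'localhost' -Query $query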
SOLUTION
Avatar of Qlemo

SOLUTION
Yes, an index on names might be helpful for searching, and an index on ID for joining, but both are best created only after the bulk inserts have been performed.
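
A sketch of what that could look like once the load has finished, reusing the hypothetical Directories/Files names from the schema sketch earlier in the thread:

# Sketch: add the search and join indexes only after the bulk inserts are done,
# so the load itself doesn't pay the index-maintenance cost.
$query = @"
CREATE INDEX IX_Files_Name        ON dbo.Files (Name);
CREATE INDEX IX_Files_DirectoryId ON dbo.Files (DirectoryId);
"@
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'FileInventory' -Query $query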
Avatar of yo_bee

ASKER

I ran into an issue with the process using up all available memory and it brought the job to a crawl.

I ended up killing the job. I plan to rerun the job.
ASKER CERTIFIED SOLUTION
Or you can put a unique index on the directory table and do this insert for every file.
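
A sketch of that per-file pattern, again using the hypothetical Directories/Files schema from above and plain System.Data.SqlClient; it is not the code from the accepted (member-only) answer, and it assumes $files holds the Get-ChildItem2 output:

# Sketch: insert the directory row only if it is missing, then insert the file row
# referencing it. The unique index on Directories.Path is the backstop against duplicates.
$connStr = 'Server=localhost;Database=FileInventory;Integrated Security=True'
$conn = New-Object System.Data.SqlClient.SqlConnection $connStr
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = @"
IF NOT EXISTS (SELECT 1 FROM dbo.Directories WHERE Path = @Dir)
    INSERT INTO dbo.Directories (Path) VALUES (@Dir);
INSERT INTO dbo.Files (DirectoryId, Name, CreationTime, LastWriteTime, LastAccessTime)
SELECT DirectoryId, @Name, @Created, @Written, @Accessed
FROM dbo.Directories WHERE Path = @Dir;
"@
foreach ($f in $files) {
    $cmd.Parameters.Clear()
    $null = $cmd.Parameters.AddWithValue('@Dir',      $f.DirectoryName)
    $null = $cmd.Parameters.AddWithValue('@Name',     $f.Name)
    $null = $cmd.Parameters.AddWithValue('@Created',  $f.CreationTime)
    $null = $cmd.Parameters.AddWithValue('@Written',  $f.LastWriteTime)
    $null = $cmd.Parameters.AddWithValue('@Accessed', $f.LastAccessTime)
    $null = $cmd.ExecuteNonQuery()
}
$conn.Close()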
Avatar of yo_bee

ASKER

The goal of this task is to scan my CIFS volumes and build a case for management that there is a $H!T load of stale data. So I feel the LastAccessTime and creation date are very important to building that case.

The reason I am using the Get-ChildItem2 cmdlet is to overcome the long file name issue that I know I will run into at some point.


Here is the error I am getting when I run what oBdA suggested.

Select-Object : Property cannot be processed because property "Directory" already exists.
At line:65 char:61
+ Get-ChildItem2 -Path $Path -Recurse -File -Verbose:$false | Select-Object -Prope ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (\\netapp01\Imag...001\A000271.TIF:PSObject) [Select-Object], PSArgumentException
    + FullyQualifiedErrorId : AlreadyExistingUserSpecifiedPropertyNoExpand,Microsoft.PowerShell.Commands.SelectObjectCommand
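
The command is truncated in the error text, but the message means the pipeline adds a calculated property named Directory while Get-ChildItem2 already returns a property of that name. A sketch of one way around it; the DirPath name and the expression are illustrative, not the original code:

# Sketch: give the calculated property a name that does not collide with what
# Get-ChildItem2 already emits (here DirPath instead of Directory).
Get-ChildItem2 -Path $Path -Recurse -File -Verbose:$false |
    Select-Object -Property Name, CreationTime, LastWriteTime, LastAccessTime,
        @{ Name = 'DirPath'; Expression = { Split-Path -Parent $_.FullName } }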
SOLUTION
Avatar of yo_bee

ASKER

Other than using Get-ChildItem2, do you know any other way to gather these attributes for files whose paths exceed the 260-character MAX_PATH limit?
Avatar of oBdA

AlphaFS for example, http://alphafs.alphaleonis.com/
Requires an external DLL, though.
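
For reference, a sketch of calling AlphaFS directly from PowerShell; the DLL path is hypothetical, and the assembly has to be downloaded separately from the link above:

# Sketch: load the AlphaFS assembly and enumerate files past the 260-character MAX_PATH limit.
Add-Type -Path 'C:\Tools\AlphaFS\AlphaFS.dll'   # hypothetical location of the DLL
[Alphaleonis.Win32.Filesystem.Directory]::EnumerateFiles(
        '\\Server1\Folder-A', '*', [System.IO.SearchOption]::AllDirectories) |
    ForEach-Object {
        [PSCustomObject]@{
            FullName       = $_
            LastWriteTime  = [Alphaleonis.Win32.Filesystem.File]::GetLastWriteTime($_)
            LastAccessTime = [Alphaleonis.Win32.Filesystem.File]::GetLastAccessTime($_)
        }
    }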
Avatar of yo_bee

ASKER

The cmdlet I am using, Get-ChildItem2 from the NTFSSecurity module, uses that DLL.
That would account for the differences in returned properties. The Get-ChildItem2.ps1 I found relied on the Windows APIs only.
That isn't really relevant to pumping the data into SQL, though.
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I have recommended this question be closed as follows:

Split:
-- oBdA (https:#a42258428)
-- Qlemo (https:#a42252907)
-- aikimark (https:#a42253017)
-- oBdA (https:#a42273066)


If you feel this question should be closed differently, post an objection; the moderators will review all objections and close it as they see fit. If no one objects, this question will be closed automatically as described above.

Pber
Experts-Exchange Cleanup Volunteer