[Last Call] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 488
  • Last Modified:

SSIS: Find Similar Files and move to another folder

I have a package that loops thru files that may look like ths:

2001SurveyCompanyA_Q.xlsx
2001SurveyCompanyA_R.xlsx
2001SurveyCompanyA_S.xlsx
2001SurveyCompanyA_C.xlsx

sometimes there may be more than one company survey:

2001SurveyCompanyB_Q.xlsx
2001SurveyCompanyB_R.xlsx
2001SurveyCompanyB_S.xlsx
2001SurveyCompanyB_C.xlsx

2001SurveyCompanyC_Q.xlsx
2001SurveyCompanyC_R.xlsx
2001SurveyCompanyC_S.xlsx
2001SurveyCompanyC_C.xlsx


Now, i need to process one file group at a time, t here will always be the four files together, the name of the company will be the company name to the underscore. but every group will have a q,r,s and c file.  I want to proces them one company at a time, then move to the next set.

right now, i use a for each loop, to get every file, and process them by a variable to the _r.xlsx,_c.xlsx, etc....so, i want to take similar files, process them, then move to the next set. possible in ssis ?
0
Auerelio Vasquez
Asked:
Auerelio Vasquez
  • 4
  • 4
1 Solution
 
8080_DiverCommented:
Absolutely!  In your For Each loop, it would seem that you are segregating the files based upon the last (rightmost) 7 characters.  However, I have a couple of questions:
  1. Why do you feel compelled to process the files "one company at a time"?  Is this because of how you think of the processing or is this some sort of requirement that is being imposed on you?
  2. Just to clarify, is my understanding that all _R.xlsx files are to be in one folder, all _Q.xlsx files in another, etc.?
0
 
Auerelio VasquezETL DeveloperAuthor Commented:
The reason that we have to procreas each type separare is that they go ínto different tables. They are not in different folders just to be clear. All files are delivered to the same folder. _q files go to a question table that is built on the fly with a select into statement. All _ tables are built on the fly. Make sense? All companies will have the four files. Once I process the four files, the loop has to find the next four files. Whats good here Is that everything to The left of The underscosre, Is The company name. So for each company, I nee to process it's four files, then move on to the next set.
0
 
8080_DiverCommented:
Okay, I am a litle confused.  
The reason that we have to procreas each type separare is that they go ínto different tables.
So, the data for a _?.xlsx is processed to a given table.  Is that by company or are all files of a given type processed to the same table?
Are the different tables identified by the type of file (e.g. Q, R, etc.) or are they also identified by the company name?
In other words, if you process the following files, how do they look in your database?
  • 2001SurveyCompanyA_Q.xlsx
  • 2001SurveyCompanyB_Q.xlsx
  • 2001SurveyCompanyC_Q.xlsx
Do they all wind up in a Qeustions table or do they wind up in a 3 separate tables, segregated (and I would assume named) by company?  If they are in the same table, do you carry the Company Name (e.g. 2001SurveyCompanyA) in each row so you can identify which rows came from each company?
Again, though, I am wondering whether the processing by company is more important than just making sure all files are processed.  In other words, I am looking at the possibility that processing all of the _Q.xlsx files and then all of the _R.xlsx files might not be easier to accomplish.

0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
Auerelio VasquezETL DeveloperAuthor Commented:
ok, so to answer your questions.

_?.xlsx

let's see. to explain easier.

Each company has data for each type _?.xlsx. Let's say company had the four files _q,_c,_q,_r the first time the loop goes thru, it check for the existence of the table. If it's q, it's called a question table. If hte table exists it's dropped. So as the loop processes, it builds the table based on the variable that holds _?.xlsx. IT builds the four tables. On the next run, it drops the tables and repeats. There is only ever 1 company stored in the four tables at any given time. As it is now, we only have one company every time it runs, so it's very easy, it check the tables, drops them, recreates based on the content in the spreadsheet then moves to the next file.

If i proces the four files together, it would be like this:

2001SurveyCompanyA_Q.xlsx -----> Question table
2001SurveyCompanyA_C.xlsx -----> Choice Table
2001SurveyCompanyA_R.xlsx -----> Response Table
2001SurveyCompanyA_S.xlsx ------> Survey Table

In each table there will only ever be one company's data, until the next run. Now, if there are going to be multiple companies in the folder, that is where the change comes, and the requirement for me to process only that company' four files, then move to the nexc company (and it's four files).

Each company is identified, by a variable that takes the left string all the way to the underscore and that value is stored in each table for later processing.

FYI: afte these tabels are loaded, the data is then moved to a data mart, but that;s a differnet set of rules. That is why we can drop the tables after every loop, this is just a staging area.
0
 
8080_DiverCommented:
May i suggest a slightly different approach.  If you carry the Company Name in the staging table and then break the data out into the data mart based upon extracts that use the company name, you can process all of the files in one fell swoop and then break the data out.
I can foresee an objection in that you would have to handle the break out a bit differently but i have to wonder how you know which company should be associated with the data in the staging.
However, if you set up a FOR EACH within a FOR EACH, you can probaably initially find all of the _Q.xlsx files, extract the company name portion and use that to set up your inner FOR EACH loop to find all of the files for the given Company, you can probably pull off doing things the way you are wanting to.
0
 
Auerelio VasquezETL DeveloperAuthor Commented:
Ok, that's where the problem lies, i'm not sure, how to pass the variable from the outer to the inner for each loop. Didn't even know you could do such a thing. Is there any guidance ? I can post the .xml of the package here, and maybe you can assist me with this? and setting up how i'm looking to do so?
0
 
8080_DiverCommented:
basile,
I don't work from XML. ;-)  I use Visual Studio/BIDS.
If you have it in Bids, then you can add the variable that you are using for the outer FOR EACH as a variable of the over all package.  I would then add the one for the inner loop as an over all package variable.
You will need 2 fileconnections, one for each loop, and you'll need to set those up so that they use expressions that use the appropriate variable for the "connection" (i.e. fully pathed file name).
 
0
 
Auerelio VasquezETL DeveloperAuthor Commented:
This answer didn't quite get me all the there, but it was a good starting point.
0

Featured Post

Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

  • 4
  • 4
Tackle projects and never again get stuck behind a technical roadblock.
Join Now