
count degrees in the file - python

I have a file with one job per line; each line is a comma-separated string. The element at index 5 of each line is the degree. I need to count how many jobs ask for each type of degree. There are 5 different degrees in the file.
I wrote the code below, but it is not adding up the totals across jobs, even though the same degree repeats across many jobs. Can you help?

with open("100 Jobs - MedDeviceManuf.txt", 'r') as f:
    for job in f:
        degree = job.rstrip().split(',')[5]
        types_degree = {}
        if degree in types_degree:
            types_degree[degree] += 1
        else:
            types_degree[degree] = 1

print str(types_degree)

pepr

Can you show a few lines of the text file?

Iryna253

ASKER

IT Business Analyst,Siemens,Hutchinson KS USA,NA,Full time,bachelors degree,2,SAP,Word,Excel,PowerPoint,Outlook,excellent oral & written communication skills,leadership,team player
Senior IT Business Analyst,Siemens,Tarrytown NY USA,NA,Full time,bachelors degree,5,excellent oral & written communication skills,presentation skills,Business process mapping,Business Requirements Analysis
Business Analyst,Fresenius Medical,Austin TX USA,NA,Full time,bachelors degree,3,analytical skills,organizational skills,excellent oral & written communication skills,SQL,access,business Intelligence software,goal oriented,independent

DrDamnit

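Put types_degree = {} above your "with" line. Where it is now, the dictionary is re-created on every pass through the loop, so the counts never accumulate.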
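
A minimal sketch of that fix, assuming the degree sits at index 5 of each comma-separated line, as in the sample lines above. The function name count_degrees is made up for illustration, and collections.Counter stands in for the manual if/else bookkeeping; it is not necessarily how the asker ended up writing it.

```python
from collections import Counter


def count_degrees(lines, index=5):
    """Count how often each degree appears across all job lines."""
    counts = Counter()  # created once, before the loop, so totals accumulate
    for job in lines:
        fields = job.rstrip().split(',')
        if len(fields) > index:  # skip blank or short lines
            counts[fields[index]] += 1
    return dict(counts)


# Sample lines copied from the thread (skill lists shortened).
sample = [
    "IT Business Analyst,Siemens,Hutchinson KS USA,NA,Full time,bachelors degree,2,SAP,Word",
    "Senior IT Business Analyst,Siemens,Tarrytown NY USA,NA,Full time,bachelors degree,5,presentation skills",
    "Business Analyst,Fresenius Medical,Austin TX USA,NA,Full time,bachelors degree,3,SQL,access",
]
print(count_degrees(sample))
```

With the real file, the open file object can be passed in place of the list, since the function only iterates over lines.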

Iryna253

ASKER

Thank you, I did it, and it counts correctly now. Next, I want to show these numbers on a pie chart. I did it the manual way; is there a way of feeding the output numbers into the pie chart so that I am not typing them into the code?

Output was {'Ph.D.': 10, 'bachelors degree': 73, 'masters degree': 11, 'associates degree': 1, 'NA': 5}

import matplotlib.pyplot as plt

types_degree = {}
with open("100 Jobs - MedDeviceManuf.txt", 'r') as f:
    for job in f:
        degree = job.rstrip().split(',')[5]
        if degree in types_degree:
            types_degree[degree] += 1
        else:
            types_degree[degree] = 1
print
print str(types_degree)

labels = 'Ph.D', 'Bachelors Degree', 'Masters Degree', 'Associates Degree', 'N/A'
sizes = [10, 73, 11, 1, 5]
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral', 'red']
explode = (0, 0.1, 0, 0, 0)

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=10)
plt.axis('equal')
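
One possible way to avoid retyping the numbers, sketched under the assumption that the counts are already in a dictionary like types_degree: derive the labels and sizes from the dictionary itself. The helper names pie_inputs and draw_pie are hypothetical, and the colors and explode settings from the code above are left out for brevity.

```python
def pie_inputs(counts):
    """Turn a {label: count} dict into parallel label and size lists."""
    labels = sorted(counts, key=counts.get, reverse=True)  # largest slice first
    sizes = [counts[label] for label in labels]
    return labels, sizes


def draw_pie(counts):
    # Imported here so pie_inputs() can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    labels, sizes = pie_inputs(counts)
    plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=10)
    plt.axis('equal')  # keep the pie circular
    plt.show()


# The counts reported in the thread.
types_degree = {'Ph.D.': 10, 'bachelors degree': 73, 'masters degree': 11,
                'associates degree': 1, 'NA': 5}
labels, sizes = pie_inputs(types_degree)
print(labels)
print(sizes)
```

Calling draw_pie(types_degree) would then plot whatever the counting loop produced, so the chart follows the file without any manual edits.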

DrDamnit

You should put that in another question to get more eyeballs on it. I generally do server stuff with Python and have never touched pie charts.

:-)

Iryna253

ASKER

Oh OK, I will do that. Can you please help with one more question about this .txt file? I grouped the skills I identified in the jobs into categories, stored in a dictionary called skillsets. I would like to count how many of the jobs in the .txt file mention programming skills and how many mention database skills.

number = {}
with open("100 Jobs - MedDeviceManuf.txt", 'r') as f:
    for job in f:
        skillsets = { 'programming' : ['scripting language', 'r', 'python', 'C'] , 'database' : ['SQL', 'relational database']}

        for category in skillsets:
            category = skillsets.keys()
            if category in job.rstrip().split(',')[7:]:
                number[category] += 1
            else:
                number[category] = 1
print
print str(number)

DrDamnit

I don't understand your last question, can you please restate it or give me sample output of what you're getting now and tell me what's wrong with that?

Iryna253

ASKER

Now I am getting an error:
number[category] = 1

TypeError: unhashable type: 'list'

My output should be something like this:

programming : 3 out of 100 jobs
database: 5 out of 100 jobs

The code should do the following: go to the file, take each job (each line), find field [7] of the line, identify words (like 'scripting language' or 'C') in that field, match them to their key/category (programming, database), and finally give me the number of jobs in which each key/category was found.
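
The TypeError above most likely comes from the line category = skillsets.keys(): it overwrites the loop variable with the whole collection of keys, which in Python 2 is a list, and a list cannot be used as a dictionary key. A small reproduction, as a sketch with shortened skill lists (list(...) is used so the behaviour is the same in Python 3; the exact wording of the message varies between Python versions):

```python
skillsets = {'programming': ['python', 'C'], 'database': ['SQL']}
number = {}

# What `category` becomes after the reassignment inside the loop.
all_keys = list(skillsets.keys())
try:
    number[all_keys] = 1  # a list is mutable, so it cannot be hashed
    message = None
except TypeError as err:
    message = str(err)
print(message)

# Leaving the loop variable alone keeps each key a plain string.
for category in skillsets:
    number[category] = number.get(category, 0) + 1
print(number)
```

Using number.get(category, 0) also avoids the KeyError that number[category] += 1 would raise the first time a category is seen.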

DrDamnit

First, if the data you presented above is a real sample, then "field 7" (the eighth spot) may or may not even be correct.

Assuming this is a CSV, you have comma separated values in that field.

Regardless, I just wouldn't try it this way. I would likely do this in two passes:

1. First pass: extract / index all the keywords that are in the file (Word, SAP, C, etc...)
2. Second pass: loop through each one to build the counts

This is really a job for a database. But if you insist on doing it this way, try it with the two passes I described above. The first pass is required because we don't know what all the unique terms are (or how they are categorized, really). You'll have to get all the uniques and then manually categorize them into "database" or "Office work" or "programming".

Also, to be lazy, I would categorize each of these terms in separate files on the disk so that when the script loads, I can just load the dictionary from that file.

Then, the second pass will simply read each line in the csv, compare that field to the pre-defined dictionaries that you have already created by analyzing the uniques, and increment an integer counter.
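
The two passes described above could be sketched roughly as follows. This assumes the skills start at index 7 of each comma-separated line, that a job counts once per category no matter how many matching terms it lists, and that matching ignores case; the function names unique_skills and count_categories are hypothetical.

```python
def unique_skills(lines, first_skill=7):
    """First pass: collect every distinct skill term in the file."""
    terms = set()
    for job in lines:
        for skill in job.rstrip().split(',')[first_skill:]:
            if skill.strip():
                terms.add(skill.strip())
    return terms


def count_categories(lines, skillsets, first_skill=7):
    """Second pass: count the jobs that mention at least one term per category."""
    counts = dict.fromkeys(skillsets, 0)
    for job in lines:
        skills = set(s.strip().lower() for s in job.rstrip().split(',')[first_skill:])
        for category, terms in skillsets.items():
            if skills.intersection(t.lower() for t in terms):
                counts[category] += 1
    return counts


# Sample lines from the thread (skill lists shortened).
jobs = [
    "IT Business Analyst,Siemens,Hutchinson KS USA,NA,Full time,bachelors degree,2,SAP,Word,Excel",
    "Senior IT Business Analyst,Siemens,Tarrytown NY USA,NA,Full time,bachelors degree,5,presentation skills",
    "Business Analyst,Fresenius Medical,Austin TX USA,NA,Full time,bachelors degree,3,analytical skills,SQL,access",
]
skillsets = {'programming': ['scripting language', 'r', 'python', 'C'],
             'database': ['SQL', 'relational database']}

print(sorted(unique_skills(jobs)))
counts = count_categories(jobs, skillsets)
for category in counts:
    print("%s: %d out of %d jobs" % (category, counts[category], len(jobs)))
```

With the real file, reading the lines into a list first (for example with f.readlines()) lets both passes run over the same data, since a file object is exhausted after one pass.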

Iryna253

ASKER

I have actually already sorted all the unique skills into the categories I called "programming" and "data base", and I added them to the dictionary called "skillsets". Now I need to match the skills in each job to those categories, which is where I am stuck. Can you recommend a link where I can read about it, please?

ASKER CERTIFIED SOLUTION

DrDamnit


Suhas .

No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I have recommended this question be closed as follows:

Accept: DrDamnit (https:#a41327206)

If you feel this question should be closed differently, post an objection and the moderators will review all objections and close it as they feel fit. If no one objects, this question will be closed automatically the way described above.

suhasbharadwaj
Experts-Exchange Cleanup Volunteer