6/1/17 Notes (Problems with merging and data frame size)
- I checked the shape of the attitudes survey after the duplicates were deleted, and every year has a different number of columns. Looking at the surveys, I think this is because different questions were asked in different years, so I will have to type different commands for each year.
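A minimal sketch of how the column differences between years could be listed instead of compared by eye. The frames, the column names (Q1, Q2, ...) and the dictionary `attitudes_by_year` are made-up stand-ins, not the real survey files.

```python
import pandas as pd

# Hypothetical stand-ins for two years of the attitudes survey
# (the real files have many more rows and columns).
attitudes_by_year = {
    "S14": pd.DataFrame({"First Name": ["Ana", "Ben"], "Q1": [3, 4], "Q2": [5, 2]}),
    "S15": pd.DataFrame({"First Name": ["Cy"], "Q1": [1], "Q3": [4], "Q4": [2]}),
}

# (rows, columns) for each year.
shapes = {year: df.shape for year, df in attitudes_by_year.items()}

# Columns that appear in every year.
column_sets = [set(df.columns) for df in attitudes_by_year.values()]
shared = set.intersection(*column_sets)

# Columns that are missing from at least one other year.
unique_by_year = {
    year: sorted(set(df.columns) - shared)
    for year, df in attitudes_by_year.items()
}

print(shapes)
print(sorted(shared))
print(unique_by_year)
```

Knowing which questions are shared may make it possible to reuse one set of commands for the common columns and handle only the year-specific ones separately.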
- Trying to append the Attitudes data to the Confidence data
- The command “ConfidenceS14.append(AttitudesS14)” did not work. I think it didn’t work because this function is for connecting lists and I was trying to connect two data frames.
- I am going to try this syntax that I found online. It should only add new columns so that there won’t be any repeating columns. This did not work.
- DataFrame.append(other, ignore_index=False, verify_integrity=False)
- I am going to try a different syntax that is specifically for CSV files
- import csv
  with open('data.txt', 'rb') as match_data:
      reader = csv.reader(match_data)
      match_data = {tuple(row[:2]): row for row in reader}
  with open('m_list.txt', 'rb') as match_list, open('done.txt', 'wb') as outfile:
      reader = csv.reader(match_list)
      writer = csv.writer(outfile)
      for row in reader:
          row = tuple(row)
          if row in match_data:
              writer.writerow(match_data[row])
- Here is what I typed:
import ConfidenceS14
with open('ConfidenceS14', 'rb') as match_data:
    reader = csv.reader(match_data)
    match_data = {tuple(row[:2]): row for row in reader}
with open('AttitudesS14', 'rb') as match_list, open('done.txt', 'wb') as outfile:
    reader = csv.reader(match_list)
    writer = csv.writer(outfile)
    for row in reader:
        row = tuple(row)
        if row in match_data:
            writer.writerow(match_data[row])
- This did not work
- I found a different method that makes it so you can compare two files line by line, here is the example:
- file1 = open('some_file_1.txt', 'r')
  file2 = open('some_file_2.txt', 'r')
  FO = open('some_output_file.txt', 'w')
  for line1 in file1:
      for line2 in file2:
          if line1 == line2:
              FO.write("%s\n" %(line1))
  FO.close()
  file1.close()
  file2.close()
- It keeps telling me that my files don’t exist, so I replaced their shorthand names with the full file names.
- Then I got an error saying that I needed an integer for FO. Thinking about it, this method matches up lines in the files, but not by name, so I am going to look for a different command.
- A command called “merge” combines data frames based on an identifier column that appears in both files.
- Originally I tried this format: merge(ConfidenceS14,AttitudesS14,by"First Name",all=True). This did not work.
- Then I tried this format: merged_inner = pd.merge(left=ConfidenceS14,right=AttitudesS14, left_on='First Name', right_on='First Name')
- This worked but only gave me part of each chart.
- This method can only merge two files at a time, but eventually I want to merge all the files together, so I will either have to find a different way to merge or continue merging each new file with the already merged file, like some kind of Python merge-inception.
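The two open questions in this section (rows that may be missing after the merge, and merging more than two files) can be sketched as follows. This is an illustration under assumptions, not the method used in these notes: the three small frames and their score columns are invented, and it assumes each first name appears only once per file, since repeated names would multiply rows in a merge. By default `pd.merge` does an inner join, which keeps only names found in both frames; `how="outer"` keeps every name and fills the gaps with NaN. `functools.reduce` then repeats the pairwise merge across a list of frames.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the three survey files.
confidence = pd.DataFrame({"First Name": ["Ana", "Ben", "Cy"], "Confidence": [4, 2, 5]})
attitudes = pd.DataFrame({"First Name": ["Ana", "Ben", "Dee"], "Attitude": [3, 3, 1]})
semantics = pd.DataFrame({"First Name": ["Ana", "Cy"], "Semantics": [2, 4]})

# Default inner join: only names present in both frames survive.
inner = pd.merge(confidence, attitudes, on="First Name")

# Outer join: every name from either frame is kept, gaps become NaN.
outer = pd.merge(confidence, attitudes, on="First Name", how="outer")

# Merge any number of frames by repeating the pairwise merge.
frames = [confidence, attitudes, semantics]
merged_all = reduce(
    lambda left, right: pd.merge(left, right, on="First Name", how="outer"),
    frames,
)

print(inner.shape, outer.shape, merged_all.shape)
```

Comparing `inner.shape` with `outer.shape` is one way to tell whether rows were actually dropped by the merge or are only hidden by the display.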
- Fixing issue with data frame size.
- Because the merge function only gives me part of each of the charts, Dr. McColgan suggested I set the data frame size to be large enough to fit all the data.
- Here is a command I found to create an empty data frame:
- > df <- data.frame(matrix(ncol = 300, nrow = 100))
  > dim(df)
  [1] 100 300
- I tried this but I’m getting an error saying that matrix is not defined.
- I keep getting errors, so I am going to try to just display the new file using loc. This displayed more columns but fewer rows.
- I found a command that stops pandas from having a maximum column number:
- pandas.set_option('display.max_columns', None)
- This worked!
- I was able to add the semantics data using this function:
ConAttSem14 = pd.merge(left=ConAtt14,right=SemanticsS14, left_on='First Name', right_on='First Name')
merged_inner
# what's the size of the output data?
pd.set_option('display.max_columns', None)
merged_inner.shape
ConAttSem14
- “ConAtt14” is what I named the file of the merged Confidence and Attitude files.
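A small sketch of why the display setting mattered, assuming a made-up one-row frame with 50 columns named Q0 to Q49. The `shape` attribute always reports the true size; only the printed view is shortened. `pd.option_context` is used here so the setting applies only inside the `with` block, whereas `pd.set_option` changes it for the rest of the session.

```python
import pandas as pd

# Hypothetical wide frame: 1 row, 50 columns named Q0..Q49.
wide = pd.DataFrame([list(range(50))], columns=[f"Q{i}" for i in range(50)])

# The real size is unaffected by any display option.
print(wide.shape)

# With a column limit, the printed view skips the middle columns.
with pd.option_context("display.max_columns", 10):
    truncated = repr(wide)

# With no limit (None), every column is printed.
with pd.option_context("display.max_columns", None):
    full = repr(wide)

print("Q25" in truncated, "Q25" in full)
```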