Explore the Data

It is important that you get comfortable with the dataset before we start making changes. To this end, read through the informational files to learn about the survey. Parts of the methodology file may cover technical matters you don’t understand or details that don’t seem important, and that is OK. Just read through it and get out of it what you can.

Here are some questions I always ask myself when I get a new dataset. Can you find the answers? You will not need to turn in the answers, but it is important that you are able to answer them. If you are hazy on any of them, you are likely to run into trouble as you work on the parts of the exercise that you do have to turn in.

How many observations are there in the dataset?

How many variables are there in the dataset?

What is the unit of analysis for the dataset? (That is, what does each row of data represent? Is it individual people? Households? Companies?)

What population was the sample in this dataset drawn from?
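If you would like to check the first two questions outside SPSS, the counting can be sketched in a few lines of Python. This is only an illustration: the four-row extract below is invented, the column respid is a hypothetical identifier, and the codes are not taken from the Pew file. Observations are rows and variables are columns.

```python
import csv
import io

# Hypothetical four-row extract. The values and the respid column are
# invented for illustration; they do not come from the Pew data file.
SAMPLE_CSV = """respid,race3m1,race3m2,Q16
1,1,,2
2,2,1,1
3,1,,3
4,3,,1
"""

def describe(csv_text):
    """Return (number of observations, number of variables) for CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n_obs = len(rows)                       # one row per observation
    n_vars = len(rows[0]) if rows else 0    # one column per variable
    return n_obs, n_vars

n_obs, n_vars = describe(SAMPLE_CSV)
print(f"{n_obs} observations, {n_vars} variables")
```

Neither count tells you the unit of analysis or the population. Those answers come from the methodology file, not from the data.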

Now let’s turn away from the dataset as a whole and think about individual variables. In this exercise, we will be using the variables race3m1, race3m2, race3m3, race3m4, and Q16. You can look them up in the topline report or you can run a frequency distribution for each one in SPSS. For each variable, ask yourself: What units is it measured in? What is the coding scheme? What is the level of measurement?
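A frequency distribution is just a count of how often each code appears, usually with a percentage alongside. The sketch below shows the idea in Python rather than SPSS. The list of codes is made up, and None is assumed here to stand for a blank (missing) response.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percent) rows sorted by value, blanks last."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sort real codes in ascending order and put None (blank) at the end.
    ordered = sorted(counts, key=lambda v: (v is None, v if v is not None else 0))
    return [(v, counts[v], round(100 * counts[v] / total, 1)) for v in ordered]

# Hypothetical coded responses, invented for illustration.
example_codes = [1, 2, 2, 3, 1, 2, None, 9]

for value, count, percent in frequency_table(example_codes):
    print(value, count, percent)
```

A table like this shows which codes are in use, but not what they mean. For the meaning of each code you still need the topline report or the variable view.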

Looking at race3m1, race3m2, race3m3, and race3m4, you should be able to see that Pew asked one question about the respondent’s race, but allowed people to identify up to four races. Responses were recorded in the order they were reported, and Pew put them into four different variables. The variable race3m1 contains the first response people gave, race3m2 the second, and so on from there. In other words, people who reported only one race have a response filled in for the first variable (race3m1) only, and the other three are left blank. People who reported two races have responses recorded for the first two variables, but blanks in the other two. Only people who reported four different races have responses recorded for all four variables.
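One way to check that you understand this layout is to count, for each respondent, how many of the four variables hold a response. Here is a minimal sketch in Python. The three respondents and their codes are invented, and None is assumed to represent a blank.

```python
RACE_VARS = ["race3m1", "race3m2", "race3m3", "race3m4"]

def races_reported(row):
    """Count how many of the four race variables hold a response."""
    return sum(1 for name in RACE_VARS if row.get(name) is not None)

# Hypothetical respondents; the codes are invented and None marks a blank.
respondents = [
    {"race3m1": 1, "race3m2": None, "race3m3": None, "race3m4": None},
    {"race3m1": 1, "race3m2": 2, "race3m3": None, "race3m4": None},
    {"race3m1": 2, "race3m2": 3, "race3m3": 1, "race3m4": 4},
]

print([races_reported(r) for r in respondents])
```

The first respondent reported one race, the second reported two, and the third reported four.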

Now look at the variable “hisp.” We’re not actually going to use this variable but you should notice that Pew asked a separate question about whether respondents are Hispanic or not. Why did Pew do this? Pew wrote an interesting piece about this. Read: http://www.pewresearch.org/fact-tank/2015/06/15/is-being-hispanic-a-matter-of-race-ethnicity-or-both/

At the end of the datafile, Pew has included some race variables they created (in technical language we say that Pew “recoded” its original race variables). I can see at least three recoded race variables: racethn, racecmb, and racethn2. Check them out by creating frequency distributions and by looking at the variable view within SPSS. What do you think of the way that Pew recoded the variables?

Here’s one additional thing to notice. The documentation Pew provides does not tell us how they created the recoded race variables. So, for example, when they created the variable racethn, we can’t tell what happened to a person who said they are both white and black. Failing to document this kind of information is simply not acceptable practice, and it is one of the reasons the reproducibility protocol was invented. Other people should always be able to tell how variables were created.
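To see what a documented recode might look like, here is a small sketch in Python. The rule is purely hypothetical and is not how Pew built racethn or any other variable. The point is only that the rule is written down, so anyone can tell what happens to a person who reports two races. The label "multiple" and the example codes are also invented.

```python
RACE_VARS = ["race3m1", "race3m2", "race3m3", "race3m4"]
MULTIPLE = "multiple"  # hypothetical label for two or more reported races

def recode_race(row):
    """Hypothetical, fully documented recode rule (not Pew's):

    - no race reported      -> None (missing)
    - exactly one race      -> that race's original code
    - two or more races     -> MULTIPLE
    """
    reported = [row[name] for name in RACE_VARS if row.get(name) is not None]
    if not reported:
        return None
    if len(reported) == 1:
        return reported[0]
    return MULTIPLE

# Hypothetical respondents; the codes are invented and None marks a blank.
one_race = {"race3m1": 1, "race3m2": None, "race3m3": None, "race3m4": None}
two_races = {"race3m1": 1, "race3m2": 2, "race3m3": None, "race3m4": None}

print(recode_race(one_race), recode_race(two_races))
```

With the rule spelled out like this, the question of what happened to someone who reported two races has a clear answer: under this made-up rule, they land in the "multiple" category.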