Decoding the Tapes

black cassette tape on top of red and yellow surface

I have done a lot of work quickly on this CDC Mortality data project. I have decided to slow down and discuss where the project stands at this point and a rehash of why I decided to tackle this data monstrosity.

As the Harmonika Remix of the Neelix song Born & Raised invades my brain waves, we will begin this journey.

The original idea that spawned this project came from my want to research mental health, specifically PTSD and Anxiety. I have mentioned this in a couple of my videos on youtube and blog posts here. The most general definition is that I want to look for the correlation between events in society and increasing occurrences of mental health problems. The events can be anything from war to chemicals introduced into our environment. The whole idea is only limited by the data I can feed it and the algorithms in my mind.

There are tools on the Internet to research specific aspects of what I am looking for, but I never found an easy way to put it together. Of course, I can build the profile and do the research one hundred percent manually, but I would prefer that the machines do the work. So I went looking for raw data sets.

I found that the CDC had the mortality data available for download. I thought this was great, no better place to start that kind of research. On top of that, they even have an FTP to download the data. I talk through this whole thing in the video below.

So I downloaded all of the data, everything on the FTP. There is a whole lot of other data there to sift through. I was just interested in Mortality data, which is at https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/.

After uncompressing it, there is 155GB of data to process. There is a lot of data there ranging from 1968 – 2018. The problem is it seems to be in like 4 different formats. Currently, we are working with Mort99us.zip since we also have a document for that here https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/mortality/Mort99doc.pdf. This document explains how to access tape locations and define the data.

To use this data for any programming, it had to be manually transcribed to a dictionary. In this case, a JSON file that is read by the PERL script that is processing this data. The dictionary has been checked for accuracy many times, though minor modifications and spelling fixes are still required.

To properly decode the condition codes, you also need the file here https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10/. The Perl script uses all of these to process the file. My folder is laid out like below.

-rw-rw-r-- 1 bvest bvest 1419176 Dec 16 13:50 allvalid2009-idc10-dictionary
-rw-rw-r-- 1 bvest bvest 700211 Dec 16 19:36 allvalid2011-idc10-dictionary
-rw-rw-r-- 1 bvest bvest 101113856 Dec 17 12:53 check.output
-rw-rw-r-- 1 bvest bvest 62748 Dec 17 13:06 tape-values.json
-rw-rw-r-- 1 bvest bvest 3970 Dec 17 14:35 decode_mortality_file.pl

The script reads in the data from Mort99us.dat, which presents a string like the one below.

0               11039994860303999999924860300099917510  10111063381808  1 2140064 990999      99999             199904999049990000 009     C229085000240271500211K729 21C229                                                                                                                                                                   02 C229 K729

The script splits this string into an array. This was the best way to do it after thinking about it for a while. The original idea used Perl substr a lot, but then I had a vision that splitting it into an array would look like blocks on a tape. Since the document frequently refers to tape locations, it just clicked right into place. For instance, if you see Tape Location 120-125, it is exactly the same as @array[120..125]. This made so much sense.

So we split the string into an array that looks like the one below.

$VAR1 = ' ';
$VAR2 = ' ';
$VAR3 = ' ';
$VAR4 = '0';
$VAR5 = ' ';
$VAR6 = ' ';
$VAR7 = ' ';
$VAR8 = ' ';
$VAR9 = ' ';
$VAR10 = ' ';
$VAR11 = ' ';
$VAR12 = ' ';
$VAR13 = ' ';
$VAR14 = ' ';
$VAR15 = ' ';
$VAR16 = ' ';
$VAR17 = ' ';
$VAR18 = ' ';
$VAR19 = ' ';
$VAR20 = '1';
$VAR21 = '1';
$VAR22 = '0';
$VAR23 = '1';
$VAR24 = '9';
$VAR25 = '9';
$VAR26 = '9';
$VAR27 = '3';
$VAR28 = '6';
$VAR29 = '3';
$VAR30 = '0';
$VAR31 = '1';
$VAR32 = '0';
$VAR33 = '1';
$VAR34 = '9';
$VAR35 = '9';
$VAR36 = '9';
$VAR37 = '9';
$VAR38 = '9';
$VAR39 = '9';
$VAR40 = '9';
$VAR41 = '1';
$VAR42 = '3';
$VAR43 = '6';
$VAR44 = '3';
$VAR45 = '0';
$VAR46 = '1';
$VAR47 = '0';
$VAR48 = '6';
$VAR49 = '3';
$VAR50 = '9';
$VAR51 = '9';
$VAR52 = '1';
$VAR53 = '0';
$VAR54 = '0';
$VAR55 = '1';
$VAR56 = '0';
$VAR57 = '1';
$VAR58 = ' ';
$VAR59 = ' ';
$VAR60 = '1';
$VAR61 = '0';
$VAR62 = '1';
$VAR63 = '1';
$VAR64 = '1';
$VAR65 = '0';
$VAR66 = '3';
$VAR67 = '8';
$VAR68 = '3';
$VAR69 = '3';
$VAR70 = '1';
$VAR71 = '3';
$VAR72 = '0';
$VAR73 = '6';
$VAR74 = ' ';
$VAR75 = ' ';
$VAR76 = '2';
$VAR77 = ' ';
$VAR78 = '4';
$VAR79 = '5';
$VAR80 = '7';
$VAR81 = '0';
$VAR82 = '1';
$VAR83 = '1';
$VAR84 = '6';
$VAR85 = ' ';
... snipped for length this goes to 441

Then we loop through the dictionary to get the tape locations. The dictionary also contains any values that were in the dictionary for that specific field. For instance, most fields that want a state name also provide a list of state codes. Amusingly, some state fields have different numbers for the same state. Wyoming can be 52 or 53.

Below is a sample from the tape-values.json. This is Tape Location 119-120 which is the State of Occurrence from FIPS data. The elasticsearch fieldname is FIPS_State_of_Occurrence. Field names are as close to the definition as possible.

"119-120":{
"es_field_name": "FIPS_State_of_Occurrence",
"field_size":2,
"definition": "State of Occurrence (FIPS)",
"values":{
"01":"Alabama",
"02":"Alaska",
"04":"Arizona",
"05":"Arkansas",
"06":"California",
"08":"Colorado",
"09":"Connecticut",
"10":"Delaware",
"11":"District of Columbia",
"12":"Florida",
"13":"Georgia",
"15":"Hawaii",
"16":"Idaho",
"17":"Illinois",
"18":"Indiana",
"19":"Iowa",
"20":"Kansas",
"21":"Kentucky",
"22":"Louisiana",
"23":"Maine",
"24":"Maryland",
"25":"Massachusetts",
"26":"Michigan",
"27":"Minnesota",
"28":"Mississippi",
"29":"Missouri",
"30":"Montana",
"31":"Nebraska",
"32":"Nevada",
"33":"New Hampshire",
"34":"New Jersey",
"35":"New Mexico",
"36":"New York",
"37":"North Carolina",
"38":"North Dakota",
"39":"Ohio",
"40":"Oklahoma",
"41":"Oregon",
"42":"Pennsylvania",
"44":"Rhode Island",
"45":"South Carolina",
"46":"South Dakota",
"47":"Tennessee",
"48":"Texas",
"49":"Utah",
"50":"Vermont",
"51":"Virginia",
"53":"Washington",
"54":"West Virginia",
"55":"Wisconsin",
"56":"Wyoming",
"72":"Puerto Rico",
"78":"Virgin Islands",
"66":"Guam",
"60":"American Samoa",
"69":"Northern Marianas"
}
},

The above dictionary produces the debugging output below. All tape locations print similar debugging output when using the test script at the end.

Tape Locations:119-120
Raw Value:01
Definition:State of Occurrence (FIPS)
ES Field:FIPS_State_of_Occurrence
Value:Alabama

All of these are added to a Perl hash that when complete is converted to JSON for ingestion in Elasticsearch. The Perl hash for one random document is below.

{
'NCHS_PMSA_MSA_of_Residence' => 'Nonmetropolitan counties',
'Entity_Axis_Contion_4' => 'Other and unspecified hydronephrosis',
'IDC10_Code_10th' => 'C64 ',
'Usual_Occupation' => 'Blank, Unknown, NA',
'Number_of_Endity_Axis_Conditions' => '04',
'Population_Size_of_County_of_Occurrence' => 'County of less than 100,000',
'Race_Recode_3' => 'White',
'Count_of_Residence' => '999',
'FIPS_city_of_Residence' => 'Balance of county; or city of less than 100,000 population',
'record_type' => 'RESIDENTS',
'Injury_at_Work' => 'Unnown',
'Age_Recode_52' => '65 - 69 years',
'FIPS_State_of_Residence' => 'Alabama',
'FIPS_PMSA_MSA_of_Residence' => '0000',
'Population_Size_of_City_of+Residence' => '…All other areas in the U.S.',
'Number_of_Record_Axis_Conditons' => '04',
'Place_of_Death_and_Decedent_Status' => 'Hospital, clinic or Medical Center - Inpatient',
'Race_Recode_2' => 'White',
'Day_of_Week_of_Death' => 'Sunday',
'Kind_of_Business_or_Industry' => 'Blank, Unknown, NA',
'FIPS_County_of_Occurrence' => 'County of less than 100,000 population',
'State_of_Birth' => 'Iowa',
'Maritial_Status' => 'Widowed',
'Detail_Age_2' => ' 01-59, 99 … Minutes',
'Cause_Recocde_113' => '034',
'Population_Size_of_County_of_Residence' => 'County of less than 100,000',
'Cause_Recode_39' => '12',
'Metropolitan_Nonmetropolitan_County_of_Residence' => 'Nonmetropolitan county',
'State_of_Occurrence' => 'Alabama',
'Entity_Axis_Contion_2' => 'Secondary malignant neoplasm of other specified sites',
'Current_Data_Year' => '1999',
'FIPS_State_of_Occurrence' => 'Alabama',
'Region_Division_State_Subcode_Residence' => '363',
'Hispanic_Origin' => 'Non - Hispanic',
'Record_Axis_Condition_1' => 'Malignant neoplasm of kidney, except renal pelvis',
'FIPS_CMSA_of_Residence' => 'Not a CMSA',
'Infant_Cause_Recode_130' => '027',
'PMSA_MSA_Population_Size' => 'Area of less than 100,000 or nonmetropolitan area',
'City_of_Residence' => '999',
'Age_Recode_27' => '65 - 69 years',
'Age_Recode_12' => '65 - 74 years',
'FIPS_County_of_Residence' => 'County of less than 100,000 population',
'Education_Recode' => '12 years',
'Detail_Age_3' => '8',
'Region_of_County' => '363',
'Education' => '4 years of high school',
'State_of_Occurence_Code_Expanded' => 'Alabama',
'Resident_Status' => 'RESIDENTS',
'State_of_Residence' => 'Alabama',
'Hispanic_Origin_Race_Recode' => 'Non - Hispanic white',
'Entity_Axis_Contion_3' => 'Renal and perinephric abscess',
'Detail_Race' => 'White',
'Sex' => 'Male',
'Entity_Axis_Contion_1' => 'Malignant neoplasm of kidney, except renal pelvis',
'County_of_Occurrence' => '999',
'Expanded_State_of_Residence_Code' => 'Alabama',
'Cause_Recode_358' => '117',
'Month_of_Death' => 'January'

And finally that is converted to the JSON below for ingestion into Elasticsearch.

{"NCHS_PMSA_MSA_of_Residence":"Nonmetropolitan counties","Entity_Axis_Contion_4":"Other and unspecified hydronephrosis","IDC10_Code_10th":"C64 ","Usual_Occupation":"Blank, Unknown, NA","Number_of_Endity_Axis_Conditions":"04","Population_Size_of_County_of_Occurrence":"County of less than 100,000","Race_Recode_3":"White","Count_of_Residence":"999","FIPS_city_of_Residence":"Balance of county; or city of less than 100,000 population","record_type":"RESIDENTS","Injury_at_Work":"Unnown","Age_Recode_52":"65 - 69 years","FIPS_State_of_Residence":"Alabama","FIPS_PMSA_MSA_of_Residence":"0000","Population_Size_of_City_of+Residence":"…All other areas in the U.S.","Number_of_Record_Axis_Conditons":"04","Place_of_Death_and_Decedent_Status":"Hospital, clinic or Medical Center - Inpatient","Race_Recode_2":"White","Day_of_Week_of_Death":"Sunday","Kind_of_Business_or_Industry":"Blank, Unknown, NA","FIPS_County_of_Occurrence":"County of less than 100,000 population","State_of_Birth":"Iowa","Maritial_Status":"Widowed","Detail_Age_2":" 01-59, 99 … Minutes","Cause_Recocde_113":"034","Population_Size_of_County_of_Residence":"County of less than 100,000","Cause_Recode_39":"12","Metropolitan_Nonmetropolitan_County_of_Residence":"Nonmetropolitan county","State_of_Occurrence":"Alabama","Entity_Axis_Contion_2":"Secondary malignant neoplasm of other specified sites","Current_Data_Year":"1999","FIPS_State_of_Occurrence":"Alabama","Region_Division_State_Subcode_Residence":"363","Hispanic_Origin":"Non - Hispanic","Record_Axis_Condition_1":"Malignant neoplasm of kidney, except renal pelvis","FIPS_CMSA_of_Residence":"Not a CMSA","Infant_Cause_Recode_130":"027","PMSA_MSA_Population_Size":"Area of less than 100,000 or nonmetropolitan area","City_of_Residence":"999","Age_Recode_27":"65 - 69 years","Age_Recode_12":"65 - 74 years","FIPS_County_of_Residence":"County of less than 100,000 population","Education_Recode":"12 years","Detail_Age_3":"8","Region_of_County":"363","Education":"4 years of high school","State_of_Occurence_Code_Expanded":"Alabama","Resident_Status":"RESIDENTS","State_of_Residence":"Alabama","Hispanic_Origin_Race_Recode":"Non - Hispanic white","Entity_Axis_Contion_3":"Renal and perinephric abscess","Detail_Race":"White","Sex":"Male","Entity_Axis_Contion_1":"Malignant neoplasm of kidney, except renal pelvis","County_of_Occurrence":"999","Expanded_State_of_Residence_Code":"Alabama","Cause_Recode_358":"117","Month_of_Death":"January"}

This is just the beginning of this project. There will be much more data from many sources available for research when this is finished. My path here is to make it easier for everyone to use this data and any other verifiable data to do detailed research and create their own view.

You can get the files needed to run this from here https://github.com/b-vest/cdctoes. This does not include the Mortality99us file. You will need to get that from here https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/. Documentation is crude at the moment but will come in time as I lock everything down.

Stay Tuned.

Leave a Reply

Powered by WordPress.com.

Up ↑

%d bloggers like this: