Category: data

  • Analysis of v-safe response data

     Amar Jesani shared in mfc group link to an article about v-safe data release.

    The actual data could be downloaded from this website called icandecide.

    The 5GB file can be extracted with p7zip to a 25GB CSV file.

    $ md5sum consolidated_health_checkin.zip
    53ff7a8153f44eaab4166f722b726fe1  consolidated_health_checkin.zip
    $ md5sum consolidated_health_checkin.csv
    345cf6ca148832141260aab8638bf0dc  consolidated_health_checkin.csv

    $ wc -l consolidated_health_checkin.csv
    144856044 consolidated_health_checkin.csv

    (That’s 144 million records in this CSV file)

    $ head -n 5 consolidated_health_checkin.csv
    SURVEY_STATIC_ID,REGISTRANT_CODE,RESPONSE_ID,STARTED_ON,STARTED_ON_TIME,DAYS_SINCE,ABDOMINAL_PAIN,CHILLS,DIARRHEA,FATIGUE,FEELING_TODAY,FEVER,HAD_SYMPTOMS,HEADACHE,HEALTH_IMPACT,HEALTH_NOW,HEALTH_NOW_COMPARISON,VACCINE_CAUSED_HEALTH_ISSUES,HEALTHCARE_VISITS,ITCHING,JOINT_PAINS,MUSCLE_OR_BODY_ACHES,NAUSEA,PAIN,PREGNANT,PREGNANCY_TEST,RASH_OUTSIDE_INJECTION_SITE,REDNESS,SITE_REACTION,SWELLING,SYSTEMIC_REACTION,TEMPERATURE_CELSIUS,TEMPERATURE_FAHRENHEIT,TEMPERATURE_READING,TESTED_POSITIVE,TESTED_POSITIVE_DATE,VOMITING,DURATION_MINS,PREFERRED_LANGUAGE
    vsafe-0-day-dose1,222-10271-84782,s244305050865137831057660547899056617007,12/31/2020,4:55:13 PM,0,,,,,Good,No,,Mild,N/A,,,,,,,,,Mild,,,,,Pain,,Headache,,,,,,,,English
    vsafe-0-day-dose1,222-10325-02776,s258811629454233188277362395339553379505,05/19/2021,3:16:15 PM,0,,,,,Good,No,,,N/A,,,,,,,,,,,,,,None,,None,,,,,,,0.85,English
    vsafe-0-day-dose1,222-10368-05218,s256518678527351061889187968276580937945,04/27/2021,4:11:31 PM,0,,,,,Good,No,,,N/A,,,,,,,,,,No,,,,None,,None,,,,,,,0.72,English
    vsafe-0-day-dose1,222-10453-23273,s245552707728162053684731534374544736656,01/12/2021,3:31:16 PM,0,,,,,Good,No,,,N/A,,,,,,,,,,No,,,,None,,None,,,,,,,,English

    As you can see there are many columns, which we will have to decode.

    Combing through the whole file again and again is taking a lot of time on my computer. So I decided to write a python script that’ll do all analysis in one pass of the file.

    But that was taking even more time.

    So I decided to put this data into postgreSQL to do the analysis.

    $ sudo -u postgres createuser health

    $ sudo -u postgres createdb vsafe -O health

    $ cat load.sql
    SET datestyle TO dmy;
    CREATE table if not exists checkin (
            SURVEY_STATIC_ID varchar, — eg: vsafe-0-day-dose1
            REGISTRANT_CODE varchar, — eg: 222-10271-84782
            RESPONSE_ID varchar,
            STARTED_ON DATE,
            STARTED_ON_TIME varchar,
            DAYS_SINCE int,
            ABDOMINAL_PAIN varchar,
            CHILLS varchar,
            DIARRHEA varchar,
            FATIGUE varchar,
            FEELING_TODAY varchar,
            FEVER varchar,
            HAD_SYMPTOMS varchar,
            HEADACHE varchar,
            HEALTH_IMPACT varchar,
            HEALTH_NOW varchar,
            HEALTH_NOW_COMPARISON varchar,
            VACCINE_CAUSED_HEALTH_ISSUES varchar,
            HEALTHCARE_VISITS varchar,
            ITCHING varchar,
            JOINT_PAINS varchar,
            MUSCLE_OR_BODY_ACHES varchar,
            NAUSEA varchar,
            PAIN varchar,
            PREGNANT varchar,
            PREGNANCY_TEST varchar,
            RASH_OUTSIDE_INJECTION_SITE varchar,
            REDNESS varchar,
            SITE_REACTION varchar,
            SWELLING varchar,
            SYSTEMIC_REACTION varchar,
            TEMPERATURE_CELSIUS varchar,
            TEMPERATURE_FAHRENHEIT varchar,
            TEMPERATURE_READING  varchar,
            TESTED_POSITIVE varchar,
            TESTED_POSITIVE_DATE varchar,
            VOMITING varchar,
            DURATION_MINS FLOAT,
            PREFERRED_LANGUAGE varchar
    );

    COPY checkin FROM ‘consolidated_health_checkin.csv’ DELIMITER ‘,’ CSV HEADER

    $ psql -U health vsafe -f load.sql

    COPY 144856043
     

    That took about 25G space as well.

    Beautiful. Now we can do all kinds of querying.

    Actually, not yet. There’s one more thing we have to do. Create some indexes for making queries easier.

    CREATE INDEX checkin_health_impact_idx ON public.checkin USING btree (health_impact);

    Now, there are some issues with this data. For example:

    ERROR: could not create unique index “checkin_pk”
      Detail: Key (response_id)=(s252082802016465320050574992159464366472) is duplicated.
      Where: parallel worker

    response_id is duplicated, although it looks like every response might be unique.

    But let’s ignore that now for an interesting query result:

     

    That’s the distribution of the Health Impact column. 81 million responses say N/A, 56 million responses include no value (null) for this column  and the tail kind of begins there.

    When I do select count(distinct(registrant_code)) from checkin; I get 9,552,127 which means only 9.5 million registrant_codes are included in the dataset. Since v-safe allows adults to respond on behalf of children, it is probably likely that there are more individuals in the dataset than the registrant_codes.

    Then I ran select count(distinct(registrant_code)) from checkin where health_impact like ‘%Get care from a doctor or other healthcare professional%’; and it returned 797,396. Which means at least 797K people checked this option (with or without other options)

    Now let us look at the variable of interest, healthcare_visits. The query I ran is select healthcare_visits, count(*) from checkin where health_impact like ‘%Get care from a doctor or other healthcare professional%’ group by healthcare_visits ;

    The result is

     

    Note that I haven’t deduplicated by registrant_code here. 

    So I tried a different query: select count(*) from checkin where healthcare_visits  like ‘%Hospitalization%’; the answer to which is 83,690.

    Let us try deduplicating by registrant_code on that:

    select count(distinct(registrant_code)) from checkin where healthcare_visits  like ‘%Hospitalization%’; returns: 71,911

    Which means, there’s some amount of duplication in the row data as to registrant_codes and reports. In other words, from the same registrant_code, you can have multiple reports of Hospitalization.

    This data is rather messy and I’m not exactly sure how icandecide is arriving at “individual” in their numbers because all I see are registrant_code.

    Now, on to some more interesting stuff. What is the distribution of systemic_reaction in registrant_codes who reported Hospitalization?

    select systemic_reaction, count(distinct(registrant_code))from checkin where healthcare_visits  like ‘%Hospitalization%’ group by systemic_reaction ;

    That turned out disappointing because the result was 68,170 NULL fields.

     

     But among the non-null fields, “None”, “Fatigue or tiredness”, “Headache”, etc are leading. (Do note that this is a multi-value column and there could theoretically be a symptom that appears in the tail of this column multiple times thus occurring more number of times than these ones.)

    I also looked at the other files available for download.

    It seems like the Consolidated_health_checkin_u3[1].zip must be under 3 children. The consolidated_registrants[1].zip file makes me think that each registrant_code actually uniquely identifies an individual. Because children are having separate registrant_code with guardian registrant_code mapped in this file. The other files are about race/ethnicity and vaccine that was administered.

     

    The under 3 file includes 116394 reports. Some of the discrepancies in number between my analysis and ICAN’s dashboard probably comes from them adding both these together. 

  • Want to Predict COVID? Ask the medical officers or lab managers

    From the beginning of this pandemic I’ve had very accurate predictions of COVID surge, lull, and fall from two kinds of people – PHC medical officers and lab managers.

    The PHC medical officers see anywhere upwards of 200 sick people per day and they get to see how many people are coming in with COVID like symptoms and notice patterns before they are even tested.

    The lab managers keep a track of test positivity rate (and test rate) and can sense that it is getting overwhelming vs underwhelming.

    The only kind of people who haven’t particularly been helpful are the people who draw graphs based on numbers from government sources.

  • Three Stories On Connecting Health Data

     Story 1

    There was a small research institute with 20 people. Someone among the staff did an online survey to ask everyone their demographic details (name, age, gender, email address) and their diet.
    After four months, another person ran another survey. By then there were 23 people in this institute. This surveyor asked everyone their name, email address, haemoglobin.
     
    Now the director of the institute wanted to connect diet to haemoglobin levels. So they took up the older survey and assumed that since email addresses will probably remain the same, they could use that field to “connect” these databases together. But, alas, there was one person who changed her email from @gmail.com to @institute.org
     
    But never mind. They knew who it was. So they just fixed this by copy pasting rows in a spreadsheet.
    Story 2
    A survey was done in 20 villages. There were two teams of 5 data collectors who divided the villages equally. They went to households and collected from the people demographic details, answers to a lot of questions about health, and also the GPS location of the households.
    Then, 5 months later, the PI got more funding to do a haemoglobin study for 600 people. The PI decided to divide this fund in such a way that 300 people who come to a nearby hospital would get tested and 300 people from the previous survey (6 villages) would get tested.
     
    One of the previous data collection teams was called in. Turns out 5 of those villages were surveyed by the same team in the past. Fresh from the previous survey, they went to these villages and quickly located the households they had previously surveyed. Once they were in, they used the names of the individuals to locate their past record from the surveying app and added haemoglobin values too. But in some households there were no people as they had gone for work.
    In the remaining one village, they used the GPS location to find out the households. It was slightly harder, but it was doable because the application with which they collected the data could directly point them to the household location. It worked when the GPS would work. When the GPS wouldn’t work, they would look at the names of the people in the households and ask people whether they knew where those houses were. Somehow they made it work.
     
    In the hospital, meanwhile, some of those missing people from these villages had come and they were getting haemoglobin tests. But this data was not being collected.
    Story 3
    The Government of Karnataka decided to do tribal health research. They collected data (demographics, height, weight, BP). Then they assigned to each individual a unique ID number. Something called Namma-ID. They told them that they should keep these Namma-ID numbers safe and that these would give them benefits in healthcare, etc. And someone in the government had the idea that the data they collected should be available for researchers. Any researcher who signs a confidentiality clause would be given all the data (name, father/husband name (still patriarchal in 2021), home address, village name, Namma-ID, etc included). There was one doctor in the area who got themselves access to this data. This doc narrowed down the data to their own village and the set of people with hypertension and saved that in a spreadsheet. 
     
    Every time someone came to their clinic, the doctor would ask if they had a Namma-ID and if so the doctor would look at their height, weight, and BP from the GoK data. If they didn’t bring their Namma-ID, the doctor would ask their name and try to search. Sometimes the doctor would have to try various spellings to get the right person’s record. But somehow the doctor would find the right record and add more details when that particular visit was over.
    Then there were times when people who were new to the village came to the doctor’s clinic. The doctor would spend minutes searching for this new person’s record. The poor villager would be sitting on the patient’s chair wondering why the doctor wasn’t asking any question about the health issue.