“The Company” Data Set – Diversity Data

What Is This Data Set?

This data set contains dummy self-identification data for the diversity fields a company might ask its employees to complete. The fields we created are Gender, Gender Identity, Race/Ethnicity, Veteran, Disability, Education, and Sexual Orientation. There are 4,968 rows of data and 34,776 total data points (4,968 rows × 7 fields), each tied to an “employee” in the core The Company demographic data set (found here).

Why Did We Build This Data Set?

When looking around for HR dummy data, the options are pretty few, which is why we built “The Company Data”. When we looked more specifically for diversity-related data, we could not find anything. There may be other data sets out there, but since our search turned up nothing, we decided to build our own.

With so much focus on DEI in the workplace now, we thought it was essential to create test data that people can use to try out new products or build their analytics. Our hope is that this data, along with our core Demographic company data, gives our People Systems and Analytics colleagues something they can use to test and share their ideas without the fear of sharing confidential data. It lets people test new vendor systems during a trial period, or build code using Python/R and share their work with others in the space to get feedback.

How We Built It

The data was completely randomized using the RANDBETWEEN function in Excel. For each data point below, we created a column that generates a random number from 1 to 100. Then we created a nested IF formula that maps that number to a field option for each row. For example, for Gender:

Column 1 =RANDBETWEEN(1,100)

Column 2 =IF(COLUMN1<=51,"male","female")

For data points with more options, we nested multiple IF statements, each comparing the RANDBETWEEN column against a different cutoff value.
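The same idea can be sketched in Python for anyone who wants to regenerate similar data outside Excel. This is only a rough illustration, not the workbook we used: the helper names `randbetween` and `pick_option` are made up for this example, and the Education cutoffs are hypothetical placeholders. Only the 51 cutoff for Gender comes from the formula above.

```python
import random


def randbetween(rng, low, high):
    """Mimic Excel's RANDBETWEEN: a random integer with both ends inclusive."""
    return rng.randint(low, high)


def pick_option(roll, cutoffs):
    """Mimic a nested IF: return the first option whose upper cutoff covers the roll."""
    for upper, option in cutoffs:
        if roll <= upper:
            return option
    raise ValueError(f"roll {roll} is above every cutoff")


# Gender uses the 51 cutoff from the Excel formula.
GENDER_CUTOFFS = [(51, "male"), (100, "female")]

# Hypothetical cutoffs, for illustration only.
EDUCATION_CUTOFFS = [
    (30, "High School"),
    (50, "Some College"),
    (80, "Undergraduate"),
    (95, "Graduate"),
    (100, "PhD"),
]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Each field gets its own roll, like a separate RANDBETWEEN column per field.
rows = [
    {
        "Gender": pick_option(randbetween(rng, 1, 100), GENDER_CUTOFFS),
        "Education": pick_option(randbetween(rng, 1, 100), EDUCATION_CUTOFFS),
    }
    for _ in range(5)
]

for row in rows:
    print(row)
```

Keeping the cutoffs in an ordered list plays the role of the nested IF: the first cutoff that the roll fits under decides the option, so changing a field's mix only means editing the numbers.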

Data Points

Gender

  • Field Type – Single Select
  • Options – male, female
  • Description – Binary options of male or female for gender. This field tends to be required by US benefits providers.


Gender Identity

  • Field Type – Single Select
  • Options – female, male, prefer not to say, Non-binary/third gender, Prefer to self-describe
  • Description – Field to signify the gender identity of the employee.

Race/Ethnicity

  • Field Type – Single Select
  • Options – White, Asian, Hispanic or Latino, Two or More Races, Black or African American, Native Hawaiian or Other Pacific Islander, American Indian or Alaska Native
  • Description – Field to signify race/ethnicity based on US EEO reporting requirements.

Veteran

  • Field Type – Boolean (Y/N)
  • Options – 1,0
  • Description – Field to flag if someone is a military veteran or not.

Disability

  • Field Type – Boolean (Y/N)
  • Options – 1,0
  • Description – Field to signify if someone has a disability or not. We may want to add more specifics to this field in the future.

Education

  • Field Type – Single Select
  • Options – Undergraduate, Some College, High School, PhD, Graduate
  • Description – Field to signify the level of education completed.

Sexual Orientation

  • Field Type – Single Select
  • Options – Heterosexual, Missing, Bisexual, Gay, Lesbian, Other LGBTQ+
  • Description – Field to signify someone’s sexual orientation. Missing is the placeholder field to show the employee did not complete a response.
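If you load the data into Python, the field lists above can double as a simple lookup for checking records. The sketch below is one possible way to do that, not part of the data set itself: the dictionary layout, the `validate_row` helper, and the assumption that column names match the field names exactly are all ours for illustration. It also assumes Veteran and Disability are read as the integers 1 and 0 rather than as text.

```python
# Options copied from the field lists above. The dictionary layout and the
# validate_row helper are illustrative assumptions, not part of the data set.
FIELD_OPTIONS = {
    "Gender": {"male", "female"},
    "Gender Identity": {
        "female",
        "male",
        "prefer not to say",
        "Non-binary/third gender",
        "Prefer to self-describe",
    },
    "Race/Ethnicity": {
        "White",
        "Asian",
        "Hispanic or Latino",
        "Two or More Races",
        "Black or African American",
        "Native Hawaiian or Other Pacific Islander",
        "American Indian or Alaska Native",
    },
    "Veteran": {1, 0},
    "Disability": {1, 0},
    "Education": {"Undergraduate", "Some College", "High School", "PhD", "Graduate"},
    "Sexual Orientation": {
        "Heterosexual",
        "Missing",
        "Bisexual",
        "Gay",
        "Lesbian",
        "Other LGBTQ+",
    },
}


def validate_row(row):
    """Return a list of problems found in one employee record (empty if none)."""
    problems = []
    for field, allowed in FIELD_OPTIONS.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif row[field] not in allowed:
            problems.append(f"unexpected value for {field}: {row[field]!r}")
    return problems


# A made-up record, used only to show the helper in action.
sample = {
    "Gender": "female",
    "Gender Identity": "female",
    "Race/Ethnicity": "Asian",
    "Veteran": 0,
    "Disability": 0,
    "Education": "Graduate",
    "Sexual Orientation": "Missing",
}

print(validate_row(sample))
print(validate_row({**sample, "Gender": "unknown"}))
```

A check like this is handy when testing a vendor system's import or export, since it flags any value that drifted from the option lists.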

Not Perfect But A Start

We were honestly a little nervous about making this data set. We were concerned about getting it right: not missing any diversity groups we should have been tracking, and not writing anything incorrect. We are sure something is still missing or that changes will be needed, but we hope to update and grow this data set over time.

Some Caveats

  • We are not DEI experts – We tried to provide accurate and correct field options for this data set. For example, we followed some of the guidelines provided by UMass Amherst in their work with IBM here for their LGBTQ+ data. There may be some gaps in this data because this is not our field of expertise. If you disagree with fields, options, terms, etc., please contact us here, and we would be happy to discuss.
  • The data is entirely random – This also means the data may not resemble real-world examples. Any correlations or patterns in the data are unintentional, because random number generation determined which employee received which value.
  • US-centric data – As stated above, we are not DEI experts. Our general knowledge of DEI comes from the data we have seen in our work, which is very US-centric. We know something about US DEI programs but little about programs outside the US, so we could not include the other categories, field options, and data interests that may matter to anyone who collects and uses this type of data elsewhere. Please let us know if you have other fields or field options that you think would be valuable.
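To make the "entirely random" caveat concrete, here is a small Python sketch of what independent draws imply. It is an illustration under stated assumptions, not a reproduction of our workbook: the row count and the 51 cutoff for Gender come from this post, while the veteran cutoff of 10 and the seed are hypothetical. Because each field gets its own random number, the share of veterans should come out roughly the same for each gender, and any gap between them is noise.

```python
import random
from collections import Counter

rng = random.Random(7)  # arbitrary seed so the sketch is repeatable
N_ROWS = 4968  # row count quoted for the data set

# Each field gets its own independent draw, like separate RANDBETWEEN columns.
# The 51 cutoff comes from the Gender formula; the veteran cutoff is hypothetical.
records = []
for _ in range(N_ROWS):
    gender = "male" if rng.randint(1, 100) <= 51 else "female"
    veteran = 1 if rng.randint(1, 100) <= 10 else 0
    records.append((gender, veteran))

counts = Counter(records)


def veteran_share(gender):
    """Fraction of records with the given gender that are flagged as veterans."""
    total = counts[(gender, 1)] + counts[(gender, 0)]
    return counts[(gender, 1)] / total


for gender in ("male", "female"):
    print(gender, round(veteran_share(gender), 3))
```

If you see a relationship between two fields in this data set, treat it the same way: it is an accident of the random numbers, not a finding.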
