“The Company” Data Set

What Is This Data Set?

This data set contains dummy data for the diversity self-identification fields that a company might ask its employees to complete. The fields we created are Gender, Gender Identity, Race/Ethnicity, Veteran, Disability, Education, and Sexual Orientation. There are 4,968 rows of data and 34,776 total data points (4,968 rows × 7 fields), each tied to an “employee” in the core “The Company” demographic data set (found here).

Why Did We Build This Data Set?

When you look around for HR dummy data, the options are few, which is why we built “The Company” data in the first place. When we looked more specifically for diversity-related data, we could not find anything. There may be other data sets out there, but since we could not find any, we decided to build our own.

With so much focus on DEI in the workplace now, we thought it was essential to start building test data that people can use to try out new products or build their analytics. Our hope is that this data, along with our core demographic company data, gives our People Systems and Analytics colleagues something they can use to test and share their ideas without the fear of exposing confidential data. It lets people test new vendor systems during a trial period, or build code in Python or R and share their work with others in the space to get feedback.

How We Built It

All of the data was randomized using the RANDBETWEEN function in Excel. For each data point below, we created a column that generates a random number from 1 to 100. Then we created an IF formula (nested where needed) that turns that number into the field selection for each row. So, for example, for Gender:

Column 1 =RANDBETWEEN(1,100)

Column 2 =IF(COLUMN1<=51,"male","female")

For data points with more than two options, we nested multiple IF statements, each comparing the RANDBETWEEN column against a different cut-off number.
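For anyone who would rather reproduce this approach in code than in Excel, here is a minimal Python sketch of the same idea. The Gender rule mirrors the formula above. The Education cut-offs and the helper names (randbetween, pick_gender, pick_from_cutoffs) are hypothetical and only there for illustration; they are not the numbers or names we actually used.

```python
import random


def randbetween(low, high, rng):
    """Mimic Excel's RANDBETWEEN: a random integer from low to high, inclusive."""
    return rng.randint(low, high)


def pick_gender(roll):
    # Mirrors =IF(COLUMN1<=51,"male","female") from the text.
    return "male" if roll <= 51 else "female"


# Hypothetical cut-offs for illustration only; the real ones are not listed here.
EDUCATION_CUTOFFS = [
    (30, "High School"),
    (50, "Some College"),
    (80, "Undergraduate"),
    (95, "Graduate"),
    (100, "PhD"),
]


def pick_from_cutoffs(roll, cutoffs):
    """Equivalent of a nested IF: return the first option whose upper bound covers the roll."""
    for upper, option in cutoffs:
        if roll <= upper:
            return option
    raise ValueError(f"roll {roll} is above the last cut-off")


rng = random.Random(42)  # seeded so the sketch is repeatable
rows = []
for _ in range(5):
    rows.append({
        "Gender": pick_gender(randbetween(1, 100, rng)),
        "Education": pick_from_cutoffs(randbetween(1, 100, rng), EDUCATION_CUTOFFS),
    })

for row in rows:
    print(row)
```

Each field gets its own random roll, just as each field had its own RANDBETWEEN column, so the fields are generated independently of one another.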

Data Points

Gender

Field Type – Single Select
Options – male, female
Description – Binary options of male or female for gender. This field tends to be required by US benefits providers.

Gender Identity

Field Type – Single Select
Options – female, male, prefer not to say, Non-binary/third gender, Prefer to self-describe
Description – Field to signify the gender identity of the employee.

Race/Ethnicity

Field Type – Single Select
Options – White, Asian, Hispanic or Latino, Two or More Races, Black or African American, Native Hawaiian or Other Pacific Islander, American Indian or Alaska Native
Description – Field to signify race/ethnicity based on US EEO reporting requirements.

Veteran

Field Type – Boolean (Y/N)
Options – 1 (yes), 0 (no)
Description – Field to flag if someone is a military veteran or not.

Disability

Field Type – Boolean (Y/N)
Options – 1 (yes), 0 (no)
Description – Field to signify if someone has a disability or not. We may want to add more specifics to this field in the future.

Education

Field Type – Single Select
Options – Undergraduate, Some College, High School, PhD, Graduate
Description – Field to signify the level of education completed.

Sexual Orientation

Field Type – Single Select
Options – Heterosexual, Missing, Bisexual, Gay, Lesbian, Other LGBTQ+
Description – Field to signify someone’s sexual orientation. “Missing” is the placeholder value used when the employee did not provide a response.
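If you load this data into your own code, the option lists above can double as a simple validation check. The Python sketch below is only a starting point under a few assumptions: that the column names match the field headings used here, and that Veteran and Disability are stored as the integers 1 and 0. The names FIELD_OPTIONS and find_invalid_values are hypothetical.

```python
# Allowed values per field, copied from the field descriptions above.
# Assumption: column names match the headings, and the Boolean fields hold integers.
FIELD_OPTIONS = {
    "Gender": {"male", "female"},
    "Gender Identity": {
        "female", "male", "prefer not to say",
        "Non-binary/third gender", "Prefer to self-describe",
    },
    "Race/Ethnicity": {
        "White", "Asian", "Hispanic or Latino", "Two or More Races",
        "Black or African American",
        "Native Hawaiian or Other Pacific Islander",
        "American Indian or Alaska Native",
    },
    "Veteran": {1, 0},
    "Disability": {1, 0},
    "Education": {"Undergraduate", "Some College", "High School", "PhD", "Graduate"},
    "Sexual Orientation": {
        "Heterosexual", "Missing", "Bisexual", "Gay", "Lesbian", "Other LGBTQ+",
    },
}


def find_invalid_values(row):
    """Return {field: value} for every field that is absent or holds an unlisted option."""
    problems = {}
    for field, allowed in FIELD_OPTIONS.items():
        value = row.get(field)  # None when the field is absent from the row
        if value not in allowed:
            problems[field] = value
    return problems


# A made-up example row, not taken from the data set.
example_row = {
    "Gender": "female",
    "Gender Identity": "female",
    "Race/Ethnicity": "Asian",
    "Veteran": 0,
    "Disability": 1,
    "Education": "Graduate",
    "Sexual Orientation": "Missing",
}
print(find_invalid_values(example_row))
```

If your import reads the Boolean columns as text (for example "1" rather than 1), convert them first or adjust the allowed sets accordingly.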

Not Perfect But A Start

We were honestly a little nervous about making this data set. We were concerned about doing it right: not missing any diversity groups that we should have been tracking, and not writing anything incorrect. We are sure something is still missing or that changes will be needed, but we hope to change and grow this data set over time.

Some Caveats

We are not DEI experts – We tried to provide accurate and appropriate field options for this data set. For example, we followed some of the guidelines provided by UMass Amherst in their work with IBM here for the LGBTQ+ data. Because this is not our field of expertise, there may be gaps in this data. If you disagree with any fields, options, terms, etc., please contact us here, and we would be happy to discuss.
The data is entirely random – Random number generation determined which employee received which value, so the data may not be close to real-world examples. Any correlations or patterns between fields are unintentional.
US-centric data – As stated above, we are not DEI experts. We have a general knowledge of DEI based on the data we have seen in our work, but that knowledge is very US-centric: we know something about US DEI programs and little about programs elsewhere. Anyone who collects and uses this type of data outside of the US may need other categories, field options, and data points that we could not include. Please let us know if you have other fields or field options that you think would be valuable.
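To make the randomness caveat above concrete, here is a small Python sketch, not our actual build process, that draws Gender and Veteran independently and compares the veteran share across genders. The 10% veteran cut-off and the function name veteran_share_by_gender are hypothetical. Because the two draws do not depend on each other, any gap between the groups is just noise.

```python
import random


def veteran_share_by_gender(n_rows, seed):
    """Draw Gender and Veteran independently and return the veteran share per gender."""
    rng = random.Random(seed)
    counts = {"male": [0, 0], "female": [0, 0]}  # [veterans, total rows]
    for _ in range(n_rows):
        gender = "male" if rng.randint(1, 100) <= 51 else "female"
        # Hypothetical 10% cut-off; the real cut-off is not stated in this post.
        veteran = 1 if rng.randint(1, 100) <= 10 else 0
        counts[gender][0] += veteran
        counts[gender][1] += 1
    return {g: vets / total for g, (vets, total) in counts.items()}


shares = veteran_share_by_gender(4968, seed=1)
print(shares)
```

The two shares should land close to each other and close to the cut-off. A real workforce would likely show genuine differences between groups, which is exactly what this kind of data cannot reproduce.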
