Data Standards

Version 1.0 Produced 2022-11-18

Introduction

This document provides a single source of guidance on the data standards that HESA applies to fields across the HESA data collections.

The following data standards are adopted across the XML-based HESA data collections:

Use of XML schemas that adhere to the W3C XML Schema Recommendation.
Use of the Unicode character set.
Use of UTF-8 for encoding Unicode characters.
Representation of "no data" - Where there is not a specific code for "no data", an empty string should be used.

This document covers the following topics:

Text submission.
Date submission.
Numerical submission.
Postcode submission.
Provider defined identifiers.
UKPRN.

Text submission

The general policy for ‘free text’ submitted to HESA is to support all Latin-based characters for names, addresses, and general text fields. This support does not extend to non-Latin characters.

The following Unicode code charts for Latin characters are supported*: Basic Latin (excluding the C0 control characters), Latin-1 (excluding the C1 control characters), Latin Extended A, Latin Extended B and Latin Extended Additional. This set corresponds to Unicode code points U+0020 to U+007E, U+00A0 to U+024F and U+1E00 to U+1EFF.
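The code point ranges above can be checked mechanically. The following is a minimal sketch in Python, not HESA's own validation: the function name disallowed_characters is hypothetical, and it assumes that the footnote on Basic Latin means the less-than and greater-than symbols are not accepted in free text.

```python
from typing import List

# Permitted code point ranges, taken from the paragraph above.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin, excluding the C0 control characters
    (0x00A0, 0x024F),  # Latin-1 (excluding C1 controls), Latin Extended A and B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
]

# Assumption: the Basic Latin footnote excludes U+003C (<) and U+003E (>).
EXCLUDED = {0x3C, 0x3E}


def disallowed_characters(text: str) -> List[str]:
    """Return the characters in `text` that fall outside the permitted set."""
    bad = []
    for ch in text:
        cp = ord(ch)
        in_range = any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)
        if not in_range or cp in EXCLUDED:
            bad.append(ch)
    return bad
```

An empty list means every character is within the permitted ranges. A name such as "Siân" passes, while a tab character or a Greek letter is reported.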

The character set chosen will support Welsh and Gaelic languages as well as all European and most other languages using a Latin-based character set.

The Unicode charts that list each of the characters in this range can be found on the Unicode web site. The specific sets that are defined here are shown in the following PDF documents:

Basic Latin* (20-7E).
Latin-1 (A0-FF).
Latin Extended A (100-17F).
Latin Extended B (180-24F).
Latin Extended Additional (1E00-1EFF).

Files must be encoded in UTF-8, and schema validation will be in place to ensure this. Providers must specify the encoding in the first line of each XML file (i.e. <?xml version="1.0" encoding="UTF-8" ?>) and must ensure that the file is actually saved with that encoding. If an XML file is edited in a text editor and the encoding is not specified, or does not match the actual file encoding, there may be problems when the file is submitted for validation.
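A provider could run a local check before submission to catch a mismatch between the declared and actual encoding. The sketch below is illustrative only: the function name check_utf8_xml is hypothetical, it assumes the declaration is written with double quotes as in the example above, and it does not handle a byte order mark.

```python
import re
from typing import List

# Matches a declaration shaped like the example in the text:
# <?xml version="1.0" encoding="UTF-8" ?>
XML_DECLARATION = re.compile(
    rb'^<\?xml\s+version="1\.0"\s+encoding="([^"]+)"\s*\?>'
)


def check_utf8_xml(data: bytes) -> List[str]:
    """Return a list of encoding problems found in the raw file bytes."""
    problems = []
    match = XML_DECLARATION.match(data)
    if match is None:
        problems.append("missing XML declaration with an encoding")
    elif match.group(1).upper() != b"UTF-8":
        problems.append("declared encoding is not UTF-8")
    # Independently confirm that the bytes really are valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("content is not valid UTF-8")
    return problems
```

The function reads raw bytes rather than decoded text so that a file declared as UTF-8 but saved in another encoding is still detected.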

* The Basic Latin character set is broken out to exclude 3C and 3E, which are the less-than symbol (<) and the greater-than symbol (>) respectively.

Date submission

All dates submitted to HESA must be completed in adherence to ISO 8601. Date and time values are ordered from the largest to smallest unit of time: year, month, day. This allows dates to be naturally sorted by, for example, file systems. Each date and time value has a fixed number of digits that must be padded with leading zeros.

Date representations accepted by HESA are in one of three forms:

YYYY-MM-DD.
YYYY-MM.
YYYY.

[YYYY] indicates a four-digit year, 0000 through 9999.

[MM] indicates a two-digit month of the year, 01 through 12.

[DD] indicates a two-digit day of that month, 01 through 31.

For example, “10 June 1990” will be represented as “1990-06-10”.
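The three accepted forms can be expressed as a single pattern. The following Python sketch is one possible format check, with a hypothetical function name is_valid_hesa_date. It enforces only the digit counts and ranges stated above, so it does not check calendar validity (for instance, it accepts "2023-02-30").

```python
import re

# YYYY, optionally followed by -MM, optionally followed by -DD.
DATE_PATTERN = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")


def is_valid_hesa_date(value: str) -> bool:
    """Check a date string against the YYYY-MM-DD, YYYY-MM and YYYY forms."""
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    _year, month, day = match.groups()
    # Any four-digit year, 0000 through 9999, is accepted.
    if month is not None and not 1 <= int(month) <= 12:
        return False
    if day is not None and not 1 <= int(day) <= 31:
        return False
    return True
```

Values without zero padding, such as "1990-6-10", are rejected because each component must have a fixed number of digits.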

Numerical submission

Numerical values are submitted to HESA as either whole numbers, decimals, or percentages.

Whole numbers

Where a whole number is required, a numeric value without a fractional component must be returned. Any fractional value must be rounded to the nearest whole number. These values may be returned with or without leading zeros, e.g. 001 or 1.
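Rounding to the nearest whole number can be sketched as follows. The guidance does not say how a value exactly halfway between two whole numbers is treated, so this example assumes that halves round away from zero; the function name to_whole_number is hypothetical. Python's built-in round() rounds halves to the nearest even number, which is why the decimal module is used instead.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_whole_number(value: str) -> int:
    """Round a numeric string to the nearest whole number.

    Assumption: ties such as 2.5 round away from zero (ROUND_HALF_UP).
    """
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Under this assumption "2.4" becomes 2 and "2.5" becomes 3, and a value with leading zeros such as "001" becomes 1.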

Decimals

Where a decimal is required, a numeric value that may or may not contain a fractional part must be returned.

Percentages

Where a percentage is required, a numerical value ranging between 0 and 100 must be returned. This value can be recorded to one decimal place.
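A percentage check along these lines might look like the sketch below. It assumes that the bounds 0 and 100 are inclusive and that no more than one decimal place is allowed; the function name is_valid_percentage is hypothetical.

```python
import re
from decimal import Decimal

# Digits, optionally followed by a single decimal place.
PERCENT_PATTERN = re.compile(r"[0-9]+(?:\.[0-9])?")


def is_valid_percentage(value: str) -> bool:
    """Check that `value` is between 0 and 100 with at most one decimal place."""
    if PERCENT_PATTERN.fullmatch(value) is None:
        return False
    # Assumption: both bounds are inclusive.
    return Decimal("0") <= Decimal(value) <= Decimal("100")
```

For example, "52.5" and "100" pass, while "52.55" and "100.1" do not.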

Postcode submission

Postal codes in the United Kingdom are defined by the Royal Mail. All postcodes submitted to HESA must pass schema validation of format. A postcode is alphanumeric and variable in length, ranging from six to eight characters (including a space). Each postcode is divided into two parts separated by a single space: the outward code followed by the inward code. The outward code includes the postcode area and the postcode district. The inward code includes the postcode sector and the postcode unit. If the full postcode is not known, the outward code alone can be returned, although it is expected that in most cases a full postcode will be provided. Examples of postcodes include “SW1A 0AA”, “GL50 1HZ”, and “BS1 6NB”.
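The structure described above can be approximated with a regular expression. The pattern below is a simplified illustration and is not the pattern used in the HESA schema: it assumes upper-case letters, an outward code of one or two letters followed by a digit and an optional further digit or letter, and an inward code of one digit followed by two letters. The function name split_postcode is hypothetical.

```python
import re
from typing import Optional, Tuple

# Simplified pattern: outward code, then optionally a space and the inward code.
POSTCODE_PATTERN = re.compile(r"([A-Z]{1,2}[0-9][0-9A-Z]?)(?: ([0-9][A-Z]{2}))?")


def split_postcode(value: str) -> Optional[Tuple[str, Optional[str]]]:
    """Split a postcode into (outward, inward), or return None if malformed.

    The inward part is None when only the outward code is supplied.
    """
    match = POSTCODE_PATTERN.fullmatch(value)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

With this sketch, "SW1A 0AA" splits into "SW1A" and "0AA", an outward code on its own such as "SW1A" is accepted, and a value without the separating space is rejected.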

Individual postcodes are validated by HESA against the latest available data from the ONS Postcode Directory (ONSPD), but a failed validation generates only a warning rather than an error. This is intended to assist providers that attach importance to the accuracy of their contact information but may not be in a position to validate postcodes themselves.

Provider defined identifiers

Provider defined identifiers must be unique within the context of the field in which they are submitted to HESA. These identifiers must adhere to the guidance in the ‘Text submission’ section above and are restricted to a maximum field length of 50 characters.

Providers should endeavour to keep identifiers consistent across HESA collections wherever possible.
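The uniqueness and length rules for provider defined identifiers could be checked before submission with something like the following sketch. The function name check_identifiers is hypothetical, and the example assumes that "characters" means Unicode code points, which is what Python's len() counts.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MAX_IDENTIFIER_LENGTH = 50  # maximum field length stated in the guidance


def check_identifiers(identifiers: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (duplicates, too_long) for the identifiers submitted in one field."""
    values = list(identifiers)
    counts = Counter(values)
    # Identifiers that occur more than once break the uniqueness rule.
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    # Identifiers longer than 50 characters break the length rule.
    too_long = sorted({i for i in values if len(i) > MAX_IDENTIFIER_LENGTH})
    return duplicates, too_long
```

Two empty lists indicate that every identifier in the field is unique and within the length limit.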

UK Register of Learning Providers

The UK Provider Reference Number (UKPRN) is the unique identifier allocated to providers by the UK Register of Learning Providers (UKRLP). All UKPRNs submitted to HESA must pass schema validation of format. A UKPRN is numeric and has a fixed length of eight digits. HESA validates submitted UKPRNs against the UKRLP data to ensure that only valid UKPRNs are accepted.

It is anticipated that this single register of learning providers will replace the plethora of provider identifiers used by different organisations in the education sector.
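The UKPRN format rule (numeric, fixed length of eight) is straightforward to check locally. The sketch below covers the format only and uses a hypothetical function name, has_ukprn_format; confirming that a UKPRN actually exists requires the UKRLP data, which this example does not consult.

```python
import re

# Exactly eight ASCII digits.
UKPRN_PATTERN = re.compile(r"[0-9]{8}")


def has_ukprn_format(value: str) -> bool:
    """Check that `value` is numeric with a fixed length of eight."""
    return UKPRN_PATTERN.fullmatch(value) is not None
```

The made-up value "12345678" passes this format check, while "1234567" and "1234567A" do not.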
