Tasrif
open-source wearable health data processing library
Tasrif is an open-source data processing package for health data, primarily data collected through wearables. It was built by QCRI alongside the SIHA project as a convenience package that makes it easier for health professionals of varying technical ability to process the data SIHA collects.
Health data is usually time-series data: each data point pairs a value such as weight, BMI, or heart rate with the time at which it was measured.
The following is an example of health data for Steps.
{
    "Date": ["12-07-2021", "13-07-2021", "14-07-2021", "15-07-2021"],
    "Steps": [2100, None, None, 5400]
}
There are different operations we might want to perform on this data. For instance, we might want to remove all null entries, which the pandas dropna function can do, and then take the mean of the remaining values.
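As a minimal sketch of those two steps in plain pandas (no Tasrif involved), the snippet below loads the Steps example above into a DataFrame, drops the null rows, and averages what is left. The variable names are illustrative only.

```python
import pandas as pd

# The Steps example from above; None becomes NaN in the numeric column.
steps = pd.DataFrame({
    "Date": ["12-07-2021", "13-07-2021", "14-07-2021", "15-07-2021"],
    "Steps": [2100, None, None, 5400],
})

# Drop every row that contains a null value.
cleaned = steps.dropna()

# Average the readings that remain.
mean_steps = cleaned["Steps"].mean()

print(len(cleaned), mean_steps)  # 2 3750.0
```

Only the two days with actual readings survive the dropna call, so the mean is taken over 2100 and 5400.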
Tasrif provides operators for each of these common operations on health data. Most are drawn from pandas, with others from Facebook's Kats and tsfresh. On top of this, there are a number of custom operators we created for operations we found ourselves performing frequently on health data. You can see an example of these operators in action in the docs.
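The operator idea can be illustrated without Tasrif itself. The sketch below is not Tasrif's code: `DropNA` and `Mean` are hypothetical stand-ins that wrap the corresponding pandas calls, and the convention that `process` accepts any number of frames and returns a list is an assumption made for this illustration.

```python
import pandas as pd


class DropNA:
    """Hypothetical operator wrapping DataFrame.dropna."""

    def process(self, *frames):
        return [frame.dropna() for frame in frames]


class Mean:
    """Hypothetical operator wrapping DataFrame.mean for numeric columns."""

    def process(self, *frames):
        return [frame.mean(numeric_only=True) for frame in frames]


steps = pd.DataFrame({
    "Date": ["12-07-2021", "13-07-2021", "14-07-2021", "15-07-2021"],
    "Steps": [2100, None, None, 5400],
})

# Each operator takes frames in and hands a list of results back,
# so the output of one can be fed straight into the next.
cleaned = DropNA().process(steps)
means = Mean().process(*cleaned)

print(means[0]["Steps"])  # 3750.0
```

Giving every operation the same `process` interface is what makes it possible to chain operations into a pipeline without caring which library each one came from.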
My tasks in this project mainly revolved around creating a YAML-based specification for Tasrif pipelines, so that preprocessing can be done with minimal coding.
For instance, the YAML configuration below produces the same result as the pipeline example from the docs linked above:
example.yaml
modules:
  - tasrif.processing_pipeline: [sequence]
  - tasrif.processing_pipeline.pandas: [drop_na, concat, mean]
pipeline:
  $sequence:
    - $drop_na
    - $concat
    - $mean
example.py
import pandas as pd
import yaml

import tasrif.yaml_parser as yaml_parser

# Two step-count frames, each with at least one missing reading.
df1 = pd.DataFrame({
    'Date': ['05-06-2021', '06-06-2021', '07-06-2021', '08-06-2021'],
    'Steps': [4500, None, 5690, 6780]
})
df2 = pd.DataFrame({
    'Date': ['12-07-2021', '13-07-2021', '14-07-2021', '15-07-2021'],
    'Steps': [2100, None, None, 5400]
})

# Build the pipeline from the YAML spec.
with open("example.yaml", "r") as stream:
    try:
        p = yaml_parser.from_yaml(stream)
    except yaml.YAMLError as exc:
        print(exc)

# Pass both frames through the pipeline: drop nulls, concatenate, take the mean.
df = p.process(df1, df2)
print(df[0])
# Steps    4894.0
# dtype: float64
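To give a feel for what a parser for such a spec has to do, here is a toy sketch of resolving `$`-prefixed names into callable steps. It is an illustration only and not how Tasrif's `yaml_parser` is implemented: `REGISTRY`, `build`, and the function-based steps are hypothetical, and the `config` dict stands in for what a YAML loader would roughly produce for the `pipeline:` section of example.yaml.

```python
import pandas as pd

# Hypothetical registry: maps the short names used in the config to plain
# functions that take a list of frames and return a list of results.
REGISTRY = {
    "drop_na": lambda frames: [f.dropna() for f in frames],
    "concat": lambda frames: [pd.concat(frames, ignore_index=True)],
    "mean": lambda frames: [f.mean(numeric_only=True) for f in frames],
}


def build(spec):
    """Turn a parsed spec node into a callable over a list of frames."""
    if isinstance(spec, str):
        # A leaf such as "$drop_na" is looked up by its name without the "$".
        return REGISTRY[spec.lstrip("$")]

    # Otherwise expect a one-key mapping such as {"$sequence": [...]}.
    name, steps = next(iter(spec.items()))
    if name != "$sequence":
        raise ValueError(f"unsupported node: {name}")
    stages = [build(step) for step in steps]

    def run(frames):
        # Run the stages in order, feeding each one the previous output.
        for stage in stages:
            frames = stage(frames)
        return frames

    return run


# Stand-in for the parsed `pipeline:` section of example.yaml.
config = {"$sequence": ["$drop_na", "$concat", "$mean"]}

df1 = pd.DataFrame({"Steps": [4500, None, 5690, 6780]})
df2 = pd.DataFrame({"Steps": [2100, None, None, 5400]})

pipeline = build(config)
result = pipeline([df1, df2])
print(result[0]["Steps"])  # 4894.0
```

The five non-null readings across both frames sum to 24470, which gives the same mean of 4894.0 shown above.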
This is a small pipeline, so the usefulness of the YAML spec may not be apparent here. Larger examples of YAML config files and their Python equivalents can be found in the examples folder of the Tasrif GitHub repo.