diff --git a/README.md b/README.md
index 6c4714438dbecfdcbacca78b62144332de09a4a5..dcfc456e261db42d65207e3179ab10f0f6db1825 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,5 @@
-1. Downloaded and unpacked the dataset.
-2. wc -l yelp_academic_dataset_review.json (how many lines does the dataset have) -> 8021122
-3. Split the dataset with "split -l 350000 yelp_academic_dataset_review.json" -> 23 files of ~250 MB each
-4. Determined the data structure ->
-{"review_id":"qCMDfOjWdoyNE-oU3h9DKg","user_id":"JtmLdyw4k1xV78jgjhKM_w","business_id":"vdR_vmmgfI56bwSZYYHtXg","stars":5.0,"useful":0,"funny":0,"cool":0,
-"text":"Tasty burgers and sandwiches..Staff were super friendly including the owner..I was just visiting the area and found this place yelp and thought would give
-this place a try..highly recommended!","date":"2019-08-01 06:51:51"}$
-
+# Term paper for the module "Machine Learning"
 
-
-5. Trial run of w2v -> file containing only the "review text", one per line
-6. grep -o -P '(?<=text":").*(?=",)' yelp_academic_dataset_review.json > review_text_only.txt
-7. convert to lowercase:
-    sed -e 's/\(.*\)/\L\1/' regextest.txt > output.txt
-8. remove special characters
-    remove newline character:
-    sed -i 's/\\n/ /g' small-samlple.txt
-    remove everything not a word or space:
\ No newline at end of file
+Documentation: https://gitlab.cvh-server.de/w2v/w2vp/-/tree/master/Dokumentation/w2v.pdf
+Slidecast: https://hs-bochum.sciebo.de/s/xOkCb5a8A65jify
+Models: https://gitlab.cvh-server.de/w2v/w2vp/-/tree/master/models
\ No newline at end of file